Inferencing tuned models
You can test your tuned model by inferencing it to see how it responds to input that follows tuned-for patterns.
Ways to work
You can experiment with the tuned model by inferencing it in the following ways:
- From the UI: You can inference the tuned model deployment from the following workspaces:
- Project: Useful when you want to test your model during the development and testing phases before moving it into production.
- Deployment space: Useful when you want to test your model programmatically. From the API Reference tab, you can find information about the available endpoints and code examples. You can also submit input as text and choose whether to return the complete output or to stream the output as it is generated. However, you cannot change the prompt parameters for the input text.
- Prompt Lab: Useful when you want to use a tool with an intuitive user interface for prompting foundation models. You can customize the prompt parameters for each input. You can also save the prompt as a notebook so you can interact with it programmatically.
- Programmatically: You can inference the tuned model deployment by using the REST API or the watsonx.ai Python client library, as described in the sections that follow.
Before you begin
- If your model is tuned by using low-rank adaptation (LoRA) fine tuning, make sure that you deploy your LoRA adapter model before you inference the tuned model.
Inferencing a tuned model deployment from a project
Testing the model deployment in a project or deployment space
To test your tuned model from a project or deployment space, complete the following steps:
- From your project's Deployments tab, click the name of the deployed model asset in your project, or the name of the deployment space where you deployed the tuned model.
- Click the Test tab.
- In the Input data field, add a prompt that follows the prompt pattern that your tuned model is trained to recognize, and then click Generate.
You can click View parameter settings to see the prompt parameters that are applied to the model by default. To change the prompt parameters, you must go to the Prompt Lab.
Testing the model deployment in Prompt Lab
To test your tuned model in Prompt Lab, complete the following steps:
- From your project's Deployments tab, click the name of the deployed model asset in your project, or the name of the deployment space where you deployed the tuned model.
- In the project, click Open in Prompt Lab. If you are working in a deployment space, you are prompted to choose the project where you want to work with the model.
Prompt Lab opens with the tuned model that you deployed selected in the Model field.
- In the Try section, add a prompt to the Input field that follows the prompt pattern that your tuned model is trained to recognize, and then click Generate.
For more information about how to use the prompt editor, see Prompt Lab.
Inferencing a tuned model deployment with the REST API
You can use the watsonx.ai REST API to inference your tuned model deployment by providing input text and generating a response in real time, either as complete output or as a stream.
Generating text response
To generate a text response from your deployed PEFT model, use the following code sample:
curl -X POST "https://{region}.ml.cloud.ibm.com/ml/v1/deployments/<deployment_id>/text/generation?version=2024-01-29" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  --data '{
    "input": "What is the boiling point of water?",
    "parameters": {
      "max_new_tokens": 200,
      "min_new_tokens": 20
    }
  }'
Make sure to replace the placeholders with your actual values and adjust the parameters according to your specific use case.
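If you prefer to call the same endpoint from a script, the following is a minimal Python sketch that mirrors the curl example. The region value, the use of the requests package, and the results[0].generated_text response field are assumptions; confirm the exact response format in the API Reference tab for your deployment.

import requests

# Placeholders: replace with your region, deployment ID, and bearer token.
region = "us-south"
deployment_id = "<deployment_id>"
token = "<token>"

url = (
    f"https://{region}.ml.cloud.ibm.com/ml/v1/deployments/"
    f"{deployment_id}/text/generation?version=2024-01-29"
)
payload = {
    "input": "What is the boiling point of water?",
    "parameters": {"max_new_tokens": 200, "min_new_tokens": 20},
}

response = requests.post(
    url,
    headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    json=payload,
)
response.raise_for_status()

# Assumption: the generated text is returned in results[0].generated_text.
print(response.json()["results"][0]["generated_text"])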
Generating stream response
To generate a stream response from your deployed PEFT model, use the following code sample:
curl -X POST "https://{region}.ml.cloud.ibm.com/ml/v1/deployments/<deployment_id>/text/generation_stream?version=2024-01-29" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  --data '{
    "input": "What is the boiling point of water?",
    "parameters": {
      "max_new_tokens": 200,
      "min_new_tokens": 20
    }
  }'
Make sure to replace the placeholders with your actual values and adjust the parameters according to your specific use case.
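The streaming endpoint returns the output incrementally as the model generates it. The following Python sketch shows one way to consume the stream; it assumes that the response is delivered as server-sent events whose data: lines carry JSON chunks with a results[0].generated_text field, which you should verify against the API Reference tab for your deployment.

import json
import requests

# Placeholders: replace with your region, deployment ID, and bearer token.
region = "us-south"
deployment_id = "<deployment_id>"
token = "<token>"

url = (
    f"https://{region}.ml.cloud.ibm.com/ml/v1/deployments/"
    f"{deployment_id}/text/generation_stream?version=2024-01-29"
)
payload = {
    "input": "What is the boiling point of water?",
    "parameters": {"max_new_tokens": 200, "min_new_tokens": 20},
}

with requests.post(
    url,
    headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    json=payload,
    stream=True,
) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        # Assumption: each event arrives as a "data: {...}" line with a JSON payload.
        if line and line.startswith(b"data:"):
            chunk = json.loads(line[len(b"data:"):])
            print(chunk["results"][0]["generated_text"], end="", flush=True)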
Inferencing deployed tuned models with the Python client library
You can use the watsonx.ai Python client library to inference your tuned model deployment and generate predictions in real-time.
Generating text response
To generate a text response from your deployed fine-tuned model, use the client.Deployments.generate_text
function from the watsonx.ai Python client library. For more information, see Generating text response with generate_text() in the Python client library documentation.
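The following is a minimal sketch of generating a text response with the Python client library. It assumes the ibm-watsonx-ai package, placeholder credentials, and a deployment space that contains the tuned model deployment; confirm the exact signature of generate_text() in the Python client library documentation.

from ibm_watsonx_ai import APIClient, Credentials

# Placeholders: replace with your watsonx.ai endpoint URL, API key, and IDs.
credentials = Credentials(
    url="https://us-south.ml.cloud.ibm.com",
    api_key="<your_api_key>",
)
client = APIClient(credentials, space_id="<deployment_space_id>")

# Generate a text response from the deployed tuned model.
generated_text = client.deployments.generate_text(
    deployment_id="<deployment_id>",
    prompt="What is the boiling point of water?",
    params={"max_new_tokens": 200, "min_new_tokens": 20},
)
print(generated_text)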
Generating stream response
To generate a stream response from your deployed fine-tuned model, use the client.Deployments.generate_text_stream
function from the watsonx.ai Python client library. For more information, see Generating stream response with generate_text_stream() in the Python client library documentation.
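The following sketch reuses the client from the previous example to stream the response as it is generated. It assumes that generate_text_stream() yields text chunks; confirm the exact return type and parameter names in the Python client library documentation.

# Stream the response and print each chunk as it arrives.
for chunk in client.deployments.generate_text_stream(
    deployment_id="<deployment_id>",
    prompt="What is the boiling point of water?",
    params={"max_new_tokens": 200, "min_new_tokens": 20},
):
    print(chunk, end="", flush=True)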
Parent topic: Deploying Parameter-Efficient Fine-Tuned (PEFT) models