The inference service feature deploys trained machine learning or deep learning models as online services that can be called over protocols such as HTTP API or gRPC, so that applications can use the model's prediction, classification, generation, and other capabilities in real time or in batches. It addresses how to deploy models to production environments efficiently, stably, and conveniently once training is complete, and how to provide scalable online services.
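For example, once a model is published, an application can typically call it over HTTP. The snippet below is a minimal sketch assuming a JSON prediction endpoint; the endpoint URL and payload schema are illustrative assumptions, so consult the published service's Swagger page for the actual request format.

```python
# Minimal sketch of calling a deployed inference service over its HTTP API.
# The endpoint URL and payload schema are illustrative assumptions, not the
# platform's exact API; check the service's Swagger page for the real format.
import requests

ENDPOINT = "http://<inference-service-host>/v1/models/my-model:predict"  # hypothetical URL

payload = {"inputs": ["What is machine learning?"]}  # assumed request schema

response = requests.post(ENDPOINT, json=payload, timeout=30)
response.raise_for_status()
print(response.json())  # e.g. {"predictions": [...]}
```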
Direct Model Deployment for Inference Services
Custom Image Deployment for Inference Services
Batch Operation of Inference Services
Inference Service Experience
Inference Runtime Support
Access Methods, Logs, Swagger, Monitoring, etc.
Performance Advantages:
Scalability:
Security:
Stability:
Choose Custom publishing
Custom publishing requires setting the parameters manually. You can also save a combination of input parameters as a "template" to publish inference services quickly later.
Provide inference service details for model publishing
Parameter | Description |
---|---|
Name | Required. The name of the inference API. |
Description | A detailed description of the inference API, explaining its functionality and purpose. |
Model | Required. The name of the model used for inference. |
Version | Required. The version of the model; options include Branch and Tag. |
Inference Runtimes | Required. The runtime engine used for inference. |
Requests CPU | Required. The amount of CPU resources requested by the inference service. |
Requests Memory | Required. The amount of memory resources requested by the inference service. |
Limits CPU | Required. The maximum amount of CPU resources that the inference service can use. |
Limits Memory | Required. The maximum amount of memory resources that the inference service can use. |
GPU Acceleration Type | The type of GPU acceleration. |
GPU Acceleration Value | The value of GPU acceleration. |
Temporary storage | Temporary storage space used by the inference service. |
Mount existing PVC | Mount an existing Kubernetes Persistent Volume Claim (PVC) as storage. |
Capacity | Required. The capacity of the temporary storage or PVC. |
Auto scaling | Enable or disable auto-scaling. |
Number of instances | Required. The number of instances running the inference service. |
Environment variables | Key-value pairs injected into the container runtime environment. |
Add parameters | Parameters passed to the container's entrypoint executable, as an array of strings (e.g. ["--port=8080", "--batch_size=4"]). |
Startup command | Overrides the default ENTRYPOINT instruction in the container image, specified as an executable plus arguments (e.g. ["python", "serve.py"]). See the sketch below this table for how it combines with Add parameters. |
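The following sketch shows how the Startup command and Add parameters fields compose into the process the container actually runs, assuming Kubernetes-style command/args semantics; the variable names are illustrative, not the platform's API.

```python
# Illustration of how "Startup command" and "Add parameters" compose into the
# process the container runs, assuming Kubernetes-style command/args semantics.
startup_command = ["python", "serve.py"]            # replaces the image ENTRYPOINT
add_parameters = ["--port=8080", "--batch_size=4"]  # appended as arguments

# The container effectively executes:
#   python serve.py --port=8080 --batch_size=4
print(" ".join(startup_command + add_parameters))
```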
Click the Publish button to create an inference service.
From the Inference API services list, click the name of any Running service to view its details.
Click Experience to reveal the right-side panel.
Ask a question
System Role
Defines the AI's purpose, tone, and operational boundaries (e.g., "You are a helpful assistant specialized in medical information").
Parameters
Choose parameters according to your task type. Refer to the parameter descriptions below for details.
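For reference, the Experience panel essentially sends the system role, your question, and the chosen parameters to the service. Below is a hedged sketch of an equivalent request; the endpoint URL and field names are assumptions for illustration only.

```python
# Hedged sketch of a request equivalent to the Experience panel for a
# conversational model. The endpoint and field names are assumptions.
import requests

ENDPOINT = "http://<inference-service-host>/v1/chat"  # hypothetical URL

payload = {
    "system_role": "You are a helpful assistant specialized in medical information",
    "question": "What are the common symptoms of seasonal flu?",
    "parameters": {                # task-specific parameters, described below
        "do_sample": True,
        "max_new_tokens": 256,
        "temperature": 0.7,
    },
}

response = requests.post(ENDPOINT, json=payload, timeout=60)
print(response.json())
```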
Parameter Descriptions for Different Task Types
Text Generation
Preset Parameters
Parameter | Data Type | Description |
---|---|---|
do_sample | bool | Whether to use sampling; if not, greedy decoding is used. |
max_new_tokens | int | The maximum number of tokens to generate, ignoring tokens in the prompt. |
repetition_penalty | float | Penalty for repeated content in the generated text; 1.0 means no penalty, and values greater than 1.0 discourage repetition.
temperature | float | Controls the randomness of the model when choosing the next token; higher values (e.g. 1.0) produce more random output, while values close to 0 make the output more deterministic.
top_k | int | When calculating the probability distribution of the next token, only consider the top k tokens with the highest probability. |
top_p | float | Controls the cumulative probability distribution considered by the model when selecting the next token. |
use_cache | bool | Whether to use the intermediate results calculated by the model during the generation process. |
Other Parameters
Parameter | Data Type | Description |
---|---|---|
max_length | int | The maximum total number of tokens, i.e. the number of tokens in the input prompt plus the generated tokens. Its effect is overridden by max_new_tokens if that is also set.
min_length | int | The minimum total number of tokens, i.e. the number of tokens in the input prompt plus the generated tokens. Its effect is overridden by min_new_tokens if that is also set.
min_new_tokens | int | The minimum number of generated tokens, ignoring tokens in the prompt. |
early_stop | bool | Controls the stopping condition for beam-based methods. True: generation stops when num_beams complete candidates appear. False: applies heuristics to stop generation when it is unlikely to find better candidates. |
num_beams | int | Number of beams used for beam search. 1 means no beam search. |
max_time | int | The maximum time allowed for calculation to run, in seconds. |
num_beam_groups | int | Divides num_beams into groups to ensure diversity among different beam groups. |
diversity_penalty | float | Effective when num_beam_groups is enabled. This parameter applies a diversity penalty between groups to ensure that the content generated by each group is as different as possible. |
penalty_alpha | float | Contrastive search is enabled when penalty_alpha is greater than 0 and top_k is greater than 1. The larger the penalty_alpha value, the stronger the contrastive penalty, and the more likely the generated text is to meet expectations. If the penalty_alpha value is set too large, it may cause the generated text to be too uniform. |
typical_p | float | Local typicality measures the similarity between the conditional probability of predicting the next target token and the expected conditional probability of predicting the next random token given the generated partial text. If set to a floating-point number less than 1, the smallest set of locally typical tokens whose probabilities add up to or exceed typical_p will be retained for generation. |
epsilon_cutoff | float | If set to a floating-point number strictly between 0 and 1, only tokens with conditional probabilities greater than epsilon_cutoff will be sampled. Suggested values range from 3e-4 to 9e-4, depending on the model size. |
eta_cutoff | float | Eta sampling is a hybrid of local typical sampling and epsilon sampling. If set to a floating-point number strictly between 0 and 1, a token will only be considered if it is greater than eta_cutoff or sqrt(eta_cutoff) * exp(-entropy(softmax(next_token_logits))). Suggested values range from 3e-4 to 2e-3, depending on the model size.
repetition_penalty | float | Parameter for repetition penalty. 1.0 means no penalty. |
For more parameters, please refer to Text Generation Parameter Configuration.
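To tie the tables together, here is a sketch of a text-generation request that combines the preset parameters above. The endpoint URL and payload envelope are assumptions; the parameter names mirror the tables.

```python
# Illustrative text-generation request using the preset parameters above.
# The endpoint and payload envelope are assumptions, not the exact API.
import requests

payload = {
    "inputs": "Write a short poem about autumn.",
    "parameters": {
        "do_sample": True,          # sample instead of greedy decoding
        "max_new_tokens": 128,      # cap on newly generated tokens
        "temperature": 0.8,         # higher -> more random output
        "top_k": 50,                # keep only the 50 most likely tokens
        "top_p": 0.95,              # nucleus sampling threshold
        "repetition_penalty": 1.2,  # >1.0 discourages repetition
        "use_cache": True,
    },
}

response = requests.post(
    "http://<inference-service-host>/v1/text-generation",  # hypothetical URL
    json=payload,
    timeout=60,
)
print(response.json())
```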
Text-to-Image
Preset Parameters
Parameter | Data Type | Description |
---|---|---|
num_inference_steps | int | The number of denoising steps. More denoising steps usually result in higher quality images but slower inference. |
use_cache | bool | Whether to use the intermediate results calculated by the model during the generation process. |
Other Parameters
Parameter | Data Type | Description |
---|---|---|
height | int | The height of the generated image, in pixels. |
width | int | The width of the generated image, in pixels. |
guidance_scale | float | Adjusts how closely the generated image follows the text prompt. Larger values make the image adhere more closely to the prompt at the expense of diversity and image quality; the suggested range is 7 to 8.5.
negative_prompt | str or List[str] | Used to guide what content should not be included in the image generation. |
For more parameters, please refer to Text-to-Image Parameter Configuration.
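As with text generation, the sketch below shows how the text-to-image parameters above might be sent in a request. The endpoint URL, payload envelope, and response handling are assumptions; the parameter names mirror the tables.

```python
# Illustrative text-to-image request using the parameters above.
# The endpoint and payload envelope are assumptions, not the exact API.
import requests

payload = {
    "prompt": "A watercolor painting of a lighthouse at sunset",
    "parameters": {
        "num_inference_steps": 30,   # more steps -> higher quality, slower inference
        "guidance_scale": 7.5,       # prompt adherence vs. diversity
        "height": 512,
        "width": 512,
        "negative_prompt": "blurry, low resolution",
    },
}

response = requests.post(
    "http://<inference-service-host>/v1/text-to-image",  # hypothetical URL
    json=payload,
    timeout=120,
)
# The response format depends on the runtime; it may return image bytes or a URL.
print(response.status_code)
```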
Text Classification
Preset Parameters
Parameter | Data Type | Description |
---|---|---|
top_k | int | The number of top-scoring labels to return. If None, or higher than the number of labels available in the model configuration, all labels are returned by default.
use_cache | bool | Whether to use the intermediate results calculated by the model during the generation process. |
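A short sketch of a text-classification request using the top_k parameter above; the endpoint URL, payload envelope, and response shape are assumptions for illustration.

```python
# Illustrative text-classification request using the preset parameters above.
# The endpoint, payload envelope, and response shape are assumptions.
import requests

payload = {
    "inputs": "This movie was an absolute delight from start to finish.",
    "parameters": {
        "top_k": 3,  # return the three highest-scoring labels
    },
}

response = requests.post(
    "http://<inference-service-host>/v1/text-classification",  # hypothetical URL
    json=payload,
    timeout=30,
)
print(response.json())  # e.g. [{"label": "positive", "score": 0.98}, ...]
```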
For more parameters, please refer to Text Classification Parameter Configuration.
Image Classification Parameter Configuration
Conversational Parameter Configuration
Summarization Parameter Configuration
Translation Parameter Configuration