Feature Introduction
The inference service feature deploys trained machine learning or deep learning models as online services callable over protocols such as HTTP API or gRPC, so that applications can use the model's prediction, classification, generation, and other capabilities in real time or in batches. It addresses how to deploy models to production efficiently, stably, and conveniently once training is complete, and how to provide scalable online services.
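As a simple illustration, a deployed service can be called over HTTP from any client. The sketch below is a minimal example; the service URL, request path, and payload schema are assumptions for illustration and will differ depending on the runtime and model you deploy.

```python
# Minimal sketch of calling a deployed inference service over HTTP.
# The URL, path, and payload schema are placeholders for illustration.
import requests

SERVICE_URL = "http://my-inference-service.example.com/v1/models/my-model:predict"  # hypothetical endpoint

payload = {"instances": [[5.1, 3.5, 1.4, 0.2]]}  # one example feature vector

resp = requests.post(SERVICE_URL, json=payload, timeout=10)
resp.raise_for_status()
print(resp.json())  # e.g. {"predictions": [...]}
```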
Advantages
- Simplifies the model deployment process, reducing deployment complexity.
- Provides high-availability, high-performance online and batch inference services.
- Supports dynamic model updates and version management.
- Automates operation, maintenance, and monitoring of model inference services.
Applicable Scenarios
- Real-time recommendation systems: recommend products or content in real time based on user behavior.
- Image recognition: classify, detect, or recognize uploaded images.
- Natural language processing: text classification, sentiment analysis, machine translation, and similar services.
- Financial risk control: assess user credit risk or transaction risk in real time.
- Large language model services: online question answering, text generation, and other services.
- Batch inference: run inference over large volumes of non-real-time data, such as historical data analysis and report generation.
Value Delivered
- Accelerates model deployment and shortens application development cycles.
- Improves model inference efficiency and reduces latency.
- Reduces operation and maintenance costs and improves system stability.
- Supports rapid business iteration and innovation.
Main Features
Direct Model Deployment for Inference Services
- Allows users to select a specific version of a model file from the model repository and specify the inference runtime image to quickly deploy an online inference service. The system automatically downloads, caches, and loads the model and starts the inference service. This simplifies the deployment process and lowers the barrier to deployment.
Custom Image Deployment for Inference Services
- Lets users write a Dockerfile to package a model and its dependencies into a custom image, then deploy the inference service through a standard Kubernetes Deployment, as sketched below. This approach offers greater flexibility, allowing users to customize the inference environment to their needs.
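A minimal sketch of creating such a Deployment with the official Kubernetes Python client is shown below. The image name, namespace, labels, and container port are hypothetical placeholders; the actual values depend on your custom image and cluster.

```python
# Minimal sketch: deploy a custom inference image as a Kubernetes Deployment.
# Image name, namespace, labels, and port below are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster

container = client.V1Container(
    name="inference",
    image="registry.example.com/my-team/my-model-server:1.0",  # hypothetical image
    ports=[client.V1ContainerPort(container_port=8080)],
)

deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="my-inference-service"),
    spec=client.V1DeploymentSpec(
        replicas=2,
        selector=client.V1LabelSelector(match_labels={"app": "my-inference-service"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "my-inference-service"}),
            spec=client.V1PodSpec(containers=[container]),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```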
Batch Operation of Inference Services
- Supports batch operations on multiple inference services, such as batch starting, stopping, updating, and deleting.
- Supports creating and monitoring batch inference tasks and exporting their results (see the sketch after this list).
- Provides batch resource management, allowing resources for multiple inference services to be allocated and adjusted in bulk.
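As a client-side illustration of batch inference, the sketch below sends a set of records to the same hypothetical predict endpoint used earlier and collects the results for export. The endpoint and payload schema are assumptions; the platform's built-in batch inference tasks handle scheduling and result export for you.

```python
# Minimal sketch of client-side batch inference against a deployed service.
# The endpoint and payload schema are placeholders for illustration.
import concurrent.futures
import json
import requests

SERVICE_URL = "http://my-inference-service.example.com/v1/models/my-model:predict"  # hypothetical
records = [[5.1, 3.5, 1.4, 0.2], [6.2, 2.9, 4.3, 1.3], [7.7, 3.0, 6.1, 2.3]]

def predict(record):
    resp = requests.post(SERVICE_URL, json={"instances": [record]}, timeout=30)
    resp.raise_for_status()
    return resp.json()

# Send requests in parallel and export the collected results.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(predict, records))

with open("batch_results.json", "w") as f:
    json.dump(results, f)
```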
Inference Service Experience
- Provides an interactive interface to facilitate user testing and experience of inference services.
- Supports multiple input and output formats to meet the needs of different application scenarios.
- Provides model performance evaluation tools to help users optimize model deployment.
Inference Runtime Support
- Integrates mainstream inference frameworks such as vLLM and Seldon MLServer, and supports user-defined inference runtimes.
- vLLM: optimized for large language models (LLMs) such as DeepSeek and Qwen, featuring high-concurrency processing and high throughput with superior resource efficiency (see the call sketch after this list).
- MLServer: Designed for traditional ML models (XGBoost/image classification), offering multi-framework compatibility and streamlined debugging.
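When a service is backed by the vLLM runtime, it typically exposes vLLM's OpenAI-compatible API. The sketch below assumes such an endpoint; the host and model name are placeholders.

```python
# Minimal sketch of calling a vLLM-backed service through its
# OpenAI-compatible /v1/chat/completions endpoint
# (host and model name are placeholders).
import requests

BASE_URL = "http://my-llm-service.example.com/v1"  # hypothetical

payload = {
    "model": "Qwen2.5-7B-Instruct",  # placeholder model name
    "messages": [{"role": "user", "content": "Summarize what an inference service does."}],
    "max_tokens": 128,
}

resp = requests.post(f"{BASE_URL}/chat/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```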
Access Methods, Logs, Swagger, Monitoring, etc.
- Provides multiple access methods, such as HTTP API and gRPC.
- Supports detailed log recording and analysis to facilitate user troubleshooting.
- Automatically generates Swagger documentation to help users integrate with and invoke inference services (see the sketch after this list).
- Provides real-time monitoring and alerting to keep services running stably.
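The auto-generated Swagger documentation can also be consumed programmatically. The sketch below assumes the spec is served at /openapi.json, which is a common but not guaranteed path; check the service details page for the actual documentation URL.

```python
# Minimal sketch of reading a service's auto-generated Swagger/OpenAPI spec.
# The /openapi.json path is an assumption; use the URL shown for your service.
import requests

resp = requests.get("http://my-inference-service.example.com/openapi.json", timeout=10)
resp.raise_for_status()
spec = resp.json()

# Print the paths and HTTP methods the service exposes.
for path, methods in spec.get("paths", {}).items():
    print(path, sorted(methods.keys()))
```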
Feature Advantages
Performance Advantages:
- Supports GPU acceleration to improve model inference speed.
- Supports batch inference to improve throughput.
- Optimizes the inference runtime to reduce latency.
Scalability:
- Built on Kubernetes, supporting elastic scaling.
- Supports horizontal scaling to handle high concurrency scenarios.
- Supports distributed inference of large models.
- Supports parallel processing of batch tasks.
Security:
- Provides identity authentication and authorization mechanisms to ensure service security.
- Supports network isolation to prevent data leakage.
- Supports secure deployment and updating of models.
Stability:
- Provides health checks and automatic restart mechanisms to improve service availability.
- Supports log monitoring and alerts to detect and resolve problems promptly.
Create an Inference Service
Step 1
Choose Custom publishing
Custom publishing requires you to set parameters manually. You can also save a combination of input parameters as a "template" for quickly publishing inference services later.
Step 2
Fill in the inference service details for the model you are publishing.
Step 3
Click the Publish button to create an inference service.
Experience
Step 1
From the Inference API services list, click the name of any Running service to view its details.
Step 2
Click Experience to reveal the right-side panel.
Step 3
Ask a question
- System Role: defines the AI's purpose, tone, and operational boundaries (e.g., "You are a helpful assistant specialized in medical information").
- Parameters: choose parameters according to your task type. Refer to the parameter descriptions below for details.
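Under the hood, an Experience request for a text-generation service combines the system role and the chosen parameters in a single call. The sketch below assumes an OpenAI-compatible chat endpoint as in the vLLM example; the host, model name, and parameter values are placeholders.

```python
# Minimal sketch of a request combining a system role with task parameters.
# Host, model name, and parameter values are placeholders.
import requests

payload = {
    "model": "Qwen2.5-7B-Instruct",  # placeholder model name
    "messages": [
        {"role": "system", "content": "You are a helpful assistant specialized in medical information."},
        {"role": "user", "content": "What are common symptoms of dehydration?"},
    ],
    "temperature": 0.7,  # example text-generation parameters
    "top_p": 0.9,
    "max_tokens": 256,
}

resp = requests.post("http://my-llm-service.example.com/v1/chat/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```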
Parameter Descriptions for Different Task Types
Text Generation
Preset Parameters
Other Parameters
For more parameters, please refer to Text Generation Parameter Configuration.
Text-to-Image
Preset Parameters
Other Parameters
For more parameters, please refer to Text-to-Image Parameter Configuration.
Text Classification
Preset Parameters
For more parameters, please refer to Text Classification Parameter Configuration.
Additional References
Image Classification Parameter Configuration
Conversational Parameter Configuration
Summarization Parameter Configuration
Translation Parameter Configuration