This document guides you step-by-step through adding new inference runtimes for serving Large Language Models (LLMs) or other model types such as image classification, object detection, and text classification.
Alauda AI ships with a built-in vLLM inference engine. With custom inference runtimes, you can introduce additional inference engines such as Seldon MLServer, Triton Inference Server, and others.
By introducing custom runtimes, you can expand the platform's support for a wider range of model types and GPU types, and optimize performance for specific scenarios to meet broader business needs.
In this section, we'll demonstrate extending the AI platform with a custom Xinference serving runtime to deploy LLMs and serve an OpenAI-compatible API.
Consider extending your AI Platform inference service runtimes if you encounter any of the following situations:
- The model type or hardware you need to serve is not supported by the built-in vLLM.

Before you start, please ensure you meet these conditions:

- You have prepared the Xinference runtime image, e.g. `xprobe/xinference:v1.2.2` (for GPU) or `xprobe/xinference:v1.2.2-cpu` (for CPU).

You'll need to create the corresponding inference runtime resources based on your target hardware environment (GPU/CPU/NPU).
Prepare the Runtime YAML Configuration:
Based on the type of runtime you want to add (e.g., Xinference) and your target hardware environment, prepare the appropriate YAML configuration file. Here are examples for the Xinference runtime across different hardware environments:
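The exact schema depends on your platform version, so treat the following as a minimal sketch rather than a definitive definition. It assumes a KServe-style `ClusterServingRuntime` resource; the resource name, startup command, port, and resource figures are illustrative placeholders. For a CPU environment you would swap in the `xprobe/xinference:v1.2.2-cpu` image and drop the GPU request:

```yaml
# Illustrative sketch of an Xinference GPU (CUDA) runtime definition.
# Assumes a KServe-style ClusterServingRuntime; name, command, port,
# and resource figures are placeholders to adapt to your environment.
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: xinference-cuda                 # hypothetical resource name
  annotations:
    cpaas.io/display-name: "Xinference GPU Runtime (CUDA)"
spec:
  supportedModelFormats:
    - name: llama                       # must cover the frameworks your models declare
      autoSelect: true
  containers:
    - name: kserve-container
      image: xprobe/xinference:v1.2.2   # replace with your prepared runtime image
      command: ["xinference-local", "--host", "0.0.0.0", "--port", "9997"]
      ports:
        - containerPort: 9997
          protocol: TCP
      resources:
        requests:
          cpu: "4"
          memory: 16Gi
          nvidia.com/gpu: "1"
        limits:
          cpu: "8"
          memory: 32Gi
          nvidia.com/gpu: "1"
```

The `supportedModelFormats` entries determine which model frameworks can select this runtime, which is why Step 2 below requires the model's framework metadata to match one of these values.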
Replace the `image` field value with the path to your actual prepared runtime image. You can also modify the `annotations.cpaas.io/display-name` field to customize the display name of the runtime in the AI Platform UI.

Apply the YAML File to Create the Resource:
From a terminal with cluster administrator privileges, execute the following command to apply your YAML file and create the inference runtime resource:
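For example, assuming the runtime was saved as `xinference-cuda-runtime.yaml` (a placeholder file name) and is cluster-scoped as in the sketch above:

```shell
# Apply the runtime definition
kubectl apply -f xinference-cuda-runtime.yaml

# Verify that the runtime resource was created
kubectl get clusterservingruntimes | grep xinference
```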
You can also adjust resource configurations such as `limits` and `requests` to ensure the runtime is compatible with your model and hardware environment and runs efficiently.

Once the Xinference inference runtime resource is successfully created, you can select and configure it when publishing your LLM inference service on the AI Platform.
Configure Inference Framework for the Model:
Ensure that on the model details page of the model repository you are about to publish, you have selected the appropriate framework through the File Management metadata editing function. The framework value chosen here must match one of the values listed in the `supportedModelFormats` field of the inference runtime you created in Step 1.
Navigate to the Inference Service Publishing Page:
Log in to the AI Platform and navigate to the "Inference Services" or "Model Deployment" modules, then click "Publish Inference Service."
Select the Xinference Runtime:
In the inference service creation wizard, find the "Runtime" or "Inference Framework" option. From the dropdown menu or list, select the Xinference runtime you created in Step 1 (e.g., "Xinference CPU Runtime" or "Xinference GPU Runtime (CUDA)").
Set Environment Variables: The Xinference runtime requires specific environment variables to function correctly. On the inference service configuration page, locate the "Environment Variables" or "More Settings" section and add the following environment variable:
Environment Variable Parameter Description
| Parameter Name | Description | 
|---|---|
| MODEL_FAMILY | Required. Specifies the family type of the LLM model you are deploying. Xinference uses this parameter to identify and load the correct inference logic for the model. For example, if you are deploying a Llama 3 model, set it to `llama`; if it's a ChatGLM model, set it to `chatglm`. Please set this based on your model's actual family. |
Example:
`MODEL_FAMILY` = `llama` (if you are using a Llama series model; check out the docs for more detail, or run `xinference registrations -t LLM` to list all supported model families.)
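If you manage the service declaratively rather than through the UI, the variable ends up as an ordinary container environment variable on the predictor. A minimal sketch, assuming a KServe-style `InferenceService` spec and the hypothetical runtime name from Step 1:

```yaml
# Sketch only: MODEL_FAMILY as a container environment variable
# in the predictor spec of the inference service.
spec:
  predictor:
    model:
      modelFormat:
        name: llama            # must match a supportedModelFormats entry
      runtime: xinference-cuda # runtime created in Step 1 (hypothetical name)
      env:
        - name: MODEL_FAMILY
          value: llama
```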