Introduction

The Inference Service is a core feature of the Alauda AI platform for efficiently deploying LLMs as online inference services that can be invoked over HTTP API or gRPC. With the Inference Service, users can rapidly build LLM applications and expose stable, high-performance LLM capabilities to external consumers.

WARNING

The built-in runtime runs inside its container with root privileges. Use it only in a trusted environment and in accordance with your security policies.

Core Advantages

  • Rapid Model Deployment:
    • Supports direct deployment of inference services from the Model Repository, simplifying deployment steps.
    • Supports user-defined Docker images for deploying complex custom inference services.
  • Multi-Framework Runtime Support:
    • Integrates mainstream inference runtimes such as Seldon MLServer and vLLM, supporting a variety of model frameworks to meet the deployment needs of different models.
  • Visual Inference Demonstration:
    • Provides visual "inference demonstration" features for common task types, allowing users to quickly verify inference results.
  • Flexible Invocation Methods:
    • Supports HTTP API and gRPC invocation, so LLM capabilities can be consumed across different application scenarios (see the sketch after this list).
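
As a concrete illustration of HTTP invocation, the minimal sketch below sends a chat completion request to a deployed inference service. It assumes the service runs on the vLLM runtime and exposes an OpenAI-compatible endpoint; the endpoint URL, model name, and token are placeholders, not platform defaults.

```python
import requests

# Placeholder values -- replace with your own inference service endpoint,
# model name, and credentials.
ENDPOINT = "https://inference.example.com/v1/chat/completions"
MODEL = "my-llm-model"
TOKEN = "<api-token>"


def chat(prompt: str) -> str:
    """Send a single chat completion request to the inference service."""
    response = requests.post(
        ENDPOINT,
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 256,
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(chat("Summarize what an inference service does."))
```

The same request body can be sent from any HTTP client or SDK; gRPC invocation follows the protocol exposed by the selected runtime.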

Application Scenarios

  • Online LLM Applications:
    • Deploy LLMs as online services that expose LLM capabilities to external consumers.
  • Real-Time Inference:
    • Serve real-time inference workloads for applications with strict response-time requirements.
  • Batch Inference:
    • Run inference over large datasets in batches (see the sketch after this list).
  • Application Integration:
    • Integrate LLM capabilities into existing applications through the service APIs.
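
For batch inference, one simple pattern is to fan requests out to the service's HTTP API with a bounded worker pool. The sketch below reuses the hypothetical OpenAI-compatible endpoint from the previous example; the dataset and concurrency level are illustrative assumptions, not platform defaults.

```python
from concurrent.futures import ThreadPoolExecutor

import requests

ENDPOINT = "https://inference.example.com/v1/chat/completions"  # placeholder
MODEL = "my-llm-model"                                          # placeholder
TOKEN = "<api-token>"                                           # placeholder


def infer(prompt: str) -> str:
    """Run one inference request against the service's HTTP API."""
    resp = requests.post(
        ENDPOINT,
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"model": MODEL, "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]


def batch_infer(prompts: list[str], workers: int = 8) -> list[str]:
    """Process a list of prompts with a bounded pool of concurrent requests."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(infer, prompts))


if __name__ == "__main__":
    dataset = ["Classify this ticket: ...", "Summarize this document: ..."]
    for prompt, answer in zip(dataset, batch_infer(dataset)):
        print(prompt, "->", answer)
```

Keeping the worker count bounded avoids overloading a single service replica; for larger datasets, scale the service replicas and the client concurrency together.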