
#Configure Scaling for Inference Services

#Introduction

This document provides a step-by-step guide to configuring scale-up and scale-down behavior for inference services. With these settings, you can optimize resource usage, keep the service available under high load, and release resources when load is low.

About the Autoscaler

Knative Serving supports two autoscalers: Knative Pod Autoscaler (KPA) and Kubernetes' Horizontal Pod Autoscaler (HPA). By default, our services use the Knative Pod Autoscaler (KPA).

KPA is designed for serverless workloads: it scales up quickly based on concurrency or requests per second (RPS) and can scale a service down to zero replicas to save costs. HPA is more general and typically scales based on metrics such as CPU or memory usage. This guide focuses on configuring services with the Knative Pod Autoscaler (KPA).
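
For reference, the autoscaler class is selected through the autoscaling.knative.dev/class annotation. The sketch below makes the default explicit; because KPA is already the default, you normally do not need to set it (the service name and namespace are placeholders):

# Illustrative only: explicitly selecting the Knative Pod Autoscaler
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: demo
  namespace: demo-space
spec:
  predictor:
    annotations:
      autoscaling.knative.dev/class: "kpa.autoscaling.knative.dev"
    ...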

#Steps

#Autoscaling Down Configuration

This section describes how to configure inference services to automatically scale down to zero replicas when there is no traffic, or to maintain a minimum number of replicas.

#Enable/Disable Scale to Zero

You can configure whether to allow the inference service to scale down to zero replicas when there is no traffic. By default, this value is true, which allows scaling to zero.

Using InferenceService Resource Parameters

In the spec.predictor field of the InferenceService, set the minReplicas parameter.

  • minReplicas: 0: Allows scaling down to zero replicas.

  • minReplicas: 1: Disables scaling down to zero replicas, keeping at least one replica.

    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: demo
      namespace: demo-space
    spec:
      predictor:
        minReplicas: 0
        ...
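
To verify the behavior, you can watch the predictor Pods after traffic stops (a sketch; demo-space is the namespace from the example above, and how long the last Pod lingers depends on the retention and grace settings described below):

# With minReplicas: 0, the predictor Pods should be removed once the
# service has been idle long enough
kubectl get pods -n demo-space -w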

#Platform-wide Disable of Scale to Zero

WARNING

Once the platform-wide feature is disabled, the minReplicas: 0 configuration for all services will be ignored.

You can modify the global ConfigMap to disable the platform's scale-to-zero feature. This configuration has the highest priority and will override the settings in all individual InferenceService resources.

In the config-autoscaler ConfigMap in the knative-serving namespace, modify the value of enable-scale-to-zero to "false".

apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    "helm.sh/resource-policy": keep
  name: config-autoscaler
  namespace: knative-serving
data:
  enable-scale-to-zero: "false"

Note: ensure the "helm.sh/resource-policy": keep annotation exists; otherwise your configuration will be reverted to the default value.
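
If you prefer to change only this key rather than applying the full manifest, a patch along these lines should work (a sketch; it assumes you have permission to edit resources in the knative-serving namespace):

# Update only the enable-scale-to-zero key in the existing ConfigMap
kubectl patch configmap config-autoscaler -n knative-serving \
  --type merge -p '{"data":{"enable-scale-to-zero":"false"}}'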

#Configure Pod Retention Period After Scaling to Zero

This setting determines the minimum time the last Pod remains active after the autoscaler decides to scale to zero. This helps the service respond quickly when it starts receiving traffic again. The default value is 0s.

You can choose to configure this for a single service or modify the global ConfigMap to make this setting effective for all services.

Method 1: Using InferenceService Annotations

In the spec.predictor.annotations of the InferenceService, add the autoscaling.knative.dev/scale-to-zero-pod-retention-period annotation.

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: qwen2
  namespace: fy-1
spec:
  predictor:
    annotations:
      autoscaling.knative.dev/scale-to-zero-pod-retention-period: "1m5s"
    ...
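
Alternatively, the annotation can be added with a patch instead of editing the full manifest (a sketch based on the example service above):

# Add the retention-period annotation to the predictor of the qwen2 service
kubectl patch inferenceservice qwen2 -n fy-1 --type merge \
  -p '{"spec":{"predictor":{"annotations":{"autoscaling.knative.dev/scale-to-zero-pod-retention-period":"1m5s"}}}}'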

Method 2: Using a Global ConfigMap

In the config-autoscaler ConfigMap in the knative-serving namespace, modify the value of scale-to-zero-pod-retention-period to a non-negative duration string, such as "1m5s".

apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    "helm.sh/resource-policy": keep
  name: config-autoscaler
  namespace: knative-serving
data:
  scale-to-zero-pod-retention-period: "1m5s"

Note: ensure the "helm.sh/resource-policy": keep annotation exists; otherwise your configuration will be reverted to the default value.

#Configure the Grace Period for Scaling to Zero

This setting adds a delay before removing the last replica after traffic stops, ensuring the activator/routing path is ready and preventing request loss during the transition to zero.

TIP

This value should only be adjusted if you encounter lost requests due to services scaling to zero. It does not affect the retention time of the last replica after there's no traffic, nor does it guarantee that the replica will be retained during this period.

Method: Using a Global ConfigMap

In the config-autoscaler ConfigMap in the knative-serving namespace, modify the value of scale-to-zero-grace-period to a duration string, such as "40s".

apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    "helm.sh/resource-policy": keep
  name: config-autoscaler
  namespace: knative-serving
data:
  scale-to-zero-grace-period: "40s"

Note: ensure the "helm.sh/resource-policy": keep annotation exists; otherwise your configuration will be reverted to the default value.

#Autoscaling Up Configuration

This section describes how to configure the inference service to automatically scale up in response to increased traffic.

#Configure Concurrency Thresholds

Concurrency determines the number of requests that each application replica can handle simultaneously. You can set concurrency with a soft limit or a hard limit.

  • Soft Limit: A target limit that can be temporarily exceeded during a traffic surge, but which will trigger autoscaling to maintain the target value. The default value is 100.
  • Hard Limit: A strict upper bound. When concurrency reaches this value, excess requests will be buffered and queued for processing. The default value is 0, which means unlimited.

WARNING

If both a soft and a hard limit are specified, the smaller of the two values will be used. This prevents the Autoscaler from having a target value that is not permitted by the hard limit value.

You can choose to configure this for a single service or modify the global ConfigMap to make this setting effective for all services.

Method 1: Using InferenceService Resource Parameters

  • Soft Limit: In spec.predictor, set scaleTarget and set scaleMetric to concurrency.

  • Hard Limit: In spec.predictor, set containerConcurrency.

    # Set soft and hard limits
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: demo
      namespace: demo-space
    spec:
      predictor:
        scaleTarget: 200
        scaleMetric: concurrency
        containerConcurrency: 50
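        # With scaleTarget: 200 and containerConcurrency: 50, the smaller value (50) takes effect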
        ...

Method 2: Using a Global ConfigMap

  • Soft Limit: In the config-autoscaler ConfigMap, set container-concurrency-target-default (see the sketch below).
  • Hard Limit: There is no global setting for the hard limit, as it affects request buffering and queuing.
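
A sketch of the corresponding ConfigMap change (the target value 150 is only an example):

apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    "helm.sh/resource-policy": keep
  name: config-autoscaler
  namespace: knative-serving
data:
  container-concurrency-target-default: "150"

As with the other ConfigMap examples, keep the "helm.sh/resource-policy": keep annotation in place.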

Target Utilization Percentage

This value specifies the percentage of the concurrency target that the autoscaler actually aims for, so that it can scale up proactively before the configured limit is reached. The default value is 70. It applies only when the scaling metric is concurrency and is not used with RPS.

Method 1: Using InferenceService Annotations

In the spec.predictor.annotations of the InferenceService, add the autoscaling.knative.dev/target-utilization-percentage annotation.

# Set target utilization by service
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: qwen2
  namespace: fy-1
spec:
  predictor:
    annotations:
      autoscaling.knative.dev/target-utilization-percentage: "80"

Method 2: Using a Global ConfigMap

In the config-autoscaler ConfigMap, set container-concurrency-target-percentage.

apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    "helm.sh/resource-policy": keep
  name: config-autoscaler
  namespace: knative-serving
data:
  container-concurrency-target-percentage: "80"

Note: ensure the "helm.sh/resource-policy": keep annotation exists; otherwise your configuration will be reverted to the default value.

#Configure Requests Per Second (RPS) Target

You can change the scaling metric from concurrency to requests per second (RPS). The default RPS target is 200. Note: in RPS mode, the concurrency target-percentage setting is not used.

Method 1: Using InferenceService Resource Parameters

In spec.predictor, set scaleTarget and set scaleMetric to rps.

# Set RPS target by service
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: qwen2
  namespace: fy-1
spec:
  predictor:
    scaleTarget: 150
    scaleMetric: rps
    ...
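
To confirm that the setting took effect, you can inspect the annotations on the Knative Service that KServe generates for the predictor (a sketch; the generated name is assumed here to be qwen2-predictor and may differ between KServe versions). The scaleMetric and scaleTarget values are typically surfaced as autoscaling.knative.dev/metric and autoscaling.knative.dev/target annotations on the revision template:

# Print the revision-template annotations of the generated Knative Service
kubectl get ksvc qwen2-predictor -n fy-1 \
  -o jsonpath='{.spec.template.metadata.annotations}'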

Method 2: Using a Global ConfigMap

In the config-autoscaler ConfigMap, set requests-per-second-target-default.

apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    "helm.sh/resource-policy": keep
  name: config-autoscaler
  namespace: knative-serving
data:
  requests-per-second-target-default: "200"

Note: ensure the "helm.sh/resource-policy": keep annotation exists; otherwise your configuration will be reverted to the default value.