This document provides a step-by-step guide to configuring automatic scale-up and scale-down for inference services. With these settings, you can optimize resource usage, keep the service available under high load, and release resources when load is low.
About the Autoscaler
Knative Serving supports two autoscalers: Knative Pod Autoscaler (KPA) and Kubernetes' Horizontal Pod Autoscaler (HPA). By default, our services use the Knative Pod Autoscaler (KPA).
KPA is designed for serverless workloads and can quickly scale up based on concurrent requests or RPS (requests per second), and can scale services to zero replicas to save costs. HPA is more general and typically scales based on metrics like CPU or memory usage. This guide primarily focuses on configuring services via the Knative Pod Autoscaler (KPA).
This section describes how to configure inference services to automatically scale down to zero replicas when there is no traffic, or to maintain a minimum number of replicas.
You can configure whether to allow the inference service to scale down to zero replicas when there is no traffic. By default, this is allowed (the platform-level enable-scale-to-zero setting is "true").
Using InferenceService Resource Parameters
In the spec.predictor field of the InferenceService, set the minReplicas parameter (a manifest sketch follows the options below).
minReplicas: 0: Allows scaling down to zero replicas.
minReplicas: 1: Disables scaling down to zero replicas, keeping at least one replica.
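A minimal manifest sketch is shown below; it assumes the KServe v1beta1 InferenceService API, and the service name, model format, and storage URI are placeholders.

```yaml
apiVersion: serving.kserve.io/v1beta1      # assumes the KServe v1beta1 API
kind: InferenceService
metadata:
  name: example-model                      # placeholder name
spec:
  predictor:
    minReplicas: 0                         # 0 = allow scale to zero; 1 = keep at least one replica
    model:
      modelFormat:
        name: sklearn                      # placeholder model format
      storageUri: gs://example-bucket/model   # placeholder model location
```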
Note: If the platform-wide scale-to-zero feature is disabled (see the global ConfigMap method below), the minReplicas: 0 configuration for all services will be ignored.
You can modify the global ConfigMap to disable the platform's scale-to-zero feature. This configuration has the highest priority and will override the settings in all individual InferenceService resources.
In the config-autoscaler ConfigMap in the knative-serving namespace, set the value of enable-scale-to-zero to "false".
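A sketch of the resulting ConfigMap entry follows; when editing the live ConfigMap, keep any other keys it already contains.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  enable-scale-to-zero: "false"   # disables scale-to-zero platform-wide
```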
The scale-to-zero pod retention period determines the minimum time the last Pod remains active after the autoscaler decides to scale to zero. This helps the service respond quickly when it starts receiving traffic again. The default value is 0s.
You can choose to configure this for a single service or modify the global ConfigMap to make this setting effective for all services.
Method 1: Using InferenceService Annotations
In the spec.predictor.annotations of the InferenceService, add the autoscaling.knative.dev/scale-to-zero-pod-retention-period annotation.
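For example, the fragment below keeps the last Pod for at least 1m5s after the scale-to-zero decision; the duration is illustrative, and only the relevant predictor fields are shown (they would sit inside a full InferenceService manifest like the one earlier).

```yaml
spec:
  predictor:
    annotations:
      autoscaling.knative.dev/scale-to-zero-pod-retention-period: "1m5s"   # illustrative duration
```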
Method 2: Using a Global ConfigMap
In the config-autoscaler ConfigMap in the knative-serving namespace, modify the value of scale-to-zero-pod-retention-period to a non-negative duration string, such as "1m5s".
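A sketch of the corresponding ConfigMap entry, with an illustrative value:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  scale-to-zero-pod-retention-period: "1m5s"   # minimum time to keep the last Pod alive
```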
The scale-to-zero grace period adds a delay before the last replica is removed after traffic stops, ensuring the activator/routing path is ready and preventing request loss during the transition to zero.
This value should only be adjusted if you encounter lost requests due to services scaling to zero. It does not affect the retention time of the last replica after there's no traffic, nor does it guarantee that the replica will be retained during this period.
Method: Using a Global ConfigMap
In the config-autoscaler ConfigMap in the knative-serving namespace, modify the value of scale-to-zero-grace-period to a duration string, such as "40s".
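For example, the sketch below gives the routing path up to 40 seconds before the last replica is removed; as before, preserve any other existing keys when editing the ConfigMap.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  scale-to-zero-grace-period: "40s"   # upper bound on the wait for the routing path
```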
This section describes how to configure the inference service to automatically scale up in response to increased traffic.
Concurrency determines the number of requests that each application replica can handle simultaneously. You can set concurrency with a soft limit or a hard limit.
The soft limit defaults to a target of 100.0; the hard limit (containerConcurrency) defaults to 0, which means unlimited. If both a soft and a hard limit are specified, the smaller of the two values will be used. This prevents the Autoscaler from having a target value that is not permitted by the hard limit.
You can choose to configure this for a single service or modify the global ConfigMap to make this setting effective for all services.
Method 1: Using InferenceService Resource Parameters
Soft Limit: In spec.predictor, set scaleTarget and set scaleMetric to concurrency (see the sketch below).
Hard Limit: In spec.predictor, set containerConcurrency.
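A sketch combining both limits is below; the numbers are illustrative, and only the relevant predictor fields of the InferenceService are shown.

```yaml
spec:
  predictor:
    scaleMetric: concurrency    # scale on concurrent requests
    scaleTarget: 10             # soft limit: aim for about 10 in-flight requests per replica
    containerConcurrency: 20    # hard limit: never more than 20 in-flight requests per replica
```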
Method 2: Using a Global ConfigMap
In the config-autoscaler ConfigMap, set container-concurrency-target-default.
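A sketch of the global default, which applies to services that do not set their own target:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  container-concurrency-target-default: "100"   # default soft concurrency target per replica
```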
Target Utilization Percentage
This value specifies the percentage of the concurrency target that the autoscaler actually aims for, which lets it scale up proactively before the hard limit is reached. The default is 70. This setting does not apply when scaling on RPS.
Method 1: Using InferenceService Annotations
In the spec.predictor.annotations of the InferenceService, add the autoscaling.knative.dev/target-utilization-percentage annotation.
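For example, to have the autoscaler aim for 80% of the concurrency target (the value is illustrative; only the relevant fields are shown):

```yaml
spec:
  predictor:
    annotations:
      autoscaling.knative.dev/target-utilization-percentage: "80"   # illustrative percentage
```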
Method 2: Using a Global ConfigMap
In the config-autoscaler ConfigMap, set container-concurrency-target-percentage.
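A sketch of the corresponding ConfigMap entry (shown here with the default value of 70):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  container-concurrency-target-percentage: "70"   # percentage of the concurrency target to aim for
```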
You can change the scaling metric from concurrency to requests per second (RPS). The default RPS target is 200.
Note: In RPS mode, the concurrency target-percentage setting is not used.
Method 1: Using InferenceService Resource Parameters
In spec.predictor, set scaleTarget and set scaleMetric to rps.
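A sketch of the relevant predictor fields; the target of 150 requests per second is illustrative.

```yaml
spec:
  predictor:
    scaleMetric: rps     # scale on requests per second instead of concurrency
    scaleTarget: 150     # illustrative target RPS per replica
```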
Method 2: Using a Global ConfigMap
In the config-autoscaler ConfigMap, set requests-per-second-target-default.
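A sketch of the corresponding ConfigMap entry (shown here with the default value of 200):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  requests-per-second-target-default: "200"   # default RPS target per replica
```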