Improving Kubernetes Stability for Large-Scale Clusters

This guide helps cluster operators and SREs reduce control-plane overload, improve reliability, and limit blast radius in large-scale Kubernetes clusters.

Notes

  • Network, storage, load‑balancer, logging and monitoring tuning are out of scope; see vendor docs for those components.
WARNING
  • Test configuration changes in a non-production environment before rolling them out to production.
  • Avoid a single huge cluster when risk is high; consider multi-cluster management to reduce blast radius.
| Issue | Description | Optimization |
| --- | --- | --- |
| etcd disk | etcd is highly sensitive to disk I/O in large clusters. | Use a dedicated NVMe SSD for the etcd data directory. |
| etcd DB size | Databases larger than ~8 GB significantly worsen latency, resource use, and recovery time. | Keep the etcd DB ≤ 8 GB; remove unused objects and keep frequently updated objects small (~≤ 100 KB). |
| etcd key churn | High read/write frequency on individual keys strains etcd. | Analyze etcd metrics to find and reduce hot keys. |
| Data size per resource type in etcd | Large totals per resource type make full-list operations expensive and can block controllers. | Keep each resource type's total data ≤ 800 MB; clean up unused Deployments and Services. [1] |
| API server LB bandwidth & connections | Load-balancer bandwidth or connection limits can cause nodes to become NotReady. | Monitor and provision the API-server load balancer in advance. |
| Services per namespace | Many Services cause large environment-variable injection into Pods and slow startups. | Keep Services per namespace below 5,000, or set `enableServiceLinks: false` (see the Pod example after this table). [2] |
| Total Services in cluster | Excess Services increase kube-proxy rules and harm performance. | Keep total Services below 10,000 and garbage-collect unused ones. [1] |
| CoreDNS | Large Pod counts can degrade CoreDNS performance. | Run NodeLocal DNSCache (nodelocaldns). |
| Pod update rate | High update rates push Endpoint/EndpointSlice changes to all nodes and can cause update storms. | Reduce Pod churn; for RollingUpdate, set conservative `maxUnavailable`/`maxSurge` values (see the Deployment example after this table). |
| ServiceAccount token mounts | Each token Secret can create a watch; many watches burden the control plane. | For Pods that don't need API access, set `automountServiceAccountToken: false` (see the Pod example after this table). [3] |
| Object count/size | Accumulated ConfigMaps, Secrets, PVCs, etc. increase control-plane load. | Limit ReplicaSet history (`revisionHistoryLimit`) and use `ttlSecondsAfterFinished` for Jobs/CronJobs (see the examples after this table). |
| Pod requests/limits | Large gaps between requests and limits can trigger cascading failures on node loss. | Prefer setting requests equal to limits where feasible. |
| Controller restarts & monitoring | Controller or API server restarts cause re-lists that can overload the API server. | Monitor controllers, set appropriate resources to avoid restarts, and reduce unnecessary control-plane operations. |
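
The Pod sketch below illustrates the two per-Pod settings from the table, `enableServiceLinks: false` and `automountServiceAccountToken: false`. The Pod name, container name, and image are illustrative placeholders, not values from this guide.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker            # illustrative name
spec:
  # Skip injecting one set of env vars per Service in the namespace,
  # which slows Pod startup when a namespace has thousands of Services.
  enableServiceLinks: false
  # This Pod never calls the API server, so don't mount a token
  # and don't add the associated load on the control plane.
  automountServiceAccountToken: false
  containers:
    - name: worker
      image: busybox:1.36       # illustrative image
      command: ["sleep", "3600"]
```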
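The Deployment sketch below combines three of the table's recommendations: conservative `maxUnavailable`/`maxSurge` values for RollingUpdate, a small `revisionHistoryLimit`, and requests set equal to limits. The workload name, image, replica count, and resource values are assumptions for illustration; tune them to your environment.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                     # illustrative name
spec:
  replicas: 50
  # Keep only a few old ReplicaSets (default is 10) to reduce stored objects in etcd.
  revisionHistoryLimit: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      # Conservative values: replace Pods a few at a time so Endpoint/EndpointSlice
      # updates fan out to nodes gradually instead of in a storm.
      maxUnavailable: 1
      maxSurge: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.25     # illustrative image
          resources:
            # requests == limits (Guaranteed QoS) avoids overcommit that can
            # cascade into failures when a node is lost.
            requests:
              cpu: "500m"
              memory: 512Mi
            limits:
              cpu: "500m"
              memory: 512Mi
```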
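For Jobs and CronJobs, `ttlSecondsAfterFinished` lets the TTL controller delete finished objects automatically so they do not accumulate in etcd. A minimal sketch, with an assumed name, image, and a one-hour TTL chosen for illustration:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report          # illustrative name
spec:
  # Delete the Job (and its Pods) one hour after it completes.
  ttlSecondsAfterFinished: 3600
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: report
          image: busybox:1.36   # illustrative image
          command: ["sh", "-c", "echo generating report"]
```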

Footnotes

  1. https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md

  2. https://kubernetes.io/docs/tutorials/services/connect-applications-service/#accessing-the-service

  3. https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#opt-out-of-api-credential-automounting