Improving Kubernetes Stability for Large-Scale Clusters

This guide helps cluster operators and SREs reduce control-plane overload, improve reliability, and limit blast radius in large-scale Kubernetes clusters.

Notes

  • Network, storage, load‑balancer, logging and monitoring tuning are out of scope; see vendor docs for those components.
WARNING
  • Test configuration changes in a non-production environment before rolling them out to production.
  • Avoid a single huge cluster when risk is high; consider multi-cluster management to reduce blast radius.
| Issue | Description | Optimization |
| --- | --- | --- |
| etcd disk | etcd is highly sensitive to disk I/O in large clusters. | Use a dedicated NVMe SSD for the etcd data directory. |
| etcd DB size | Databases larger than ~8 GB significantly worsen latency, resource use, and recovery time. | Keep the etcd DB ≤ 8 GB; remove unused objects and keep frequently updated objects small (~≤ 100 KB). |
| etcd key churn | High read/write frequency on individual keys strains etcd. | Analyze etcd metrics to find and reduce hot keys. |
| Data size per resource type in etcd | Large totals per resource type make full-list operations expensive and can block controllers. | Keep each resource type's total data ≤ 800 MB; clean up unused Deployments and Services. [1] |
| API server LB bandwidth & connections | Load-balancer bandwidth or connection limits can cause nodes to become NotReady. | Monitor and provision the API-server load balancer in advance. |
| Services per namespace | Many Services cause large environment-variable injection into Pods and slow startups. | Keep Services per namespace below 5,000, or set `enableServiceLinks: false` (see the Pod example after this table). [2] |
| Total Services in cluster | Excess Services increase kube-proxy rules and harm performance. | Keep total Services below 10,000 and garbage-collect unused ones. [1] |
| CoreDNS | Large Pod counts can degrade CoreDNS performance. | Run NodeLocal DNSCache (nodelocaldns). |
| Pod update rate | High update rates push Endpoint/EndpointSlice changes to all nodes and can cause update storms. | Reduce Pod churn; for RollingUpdate, set conservative `maxUnavailable`/`maxSurge` values (see the Deployment example after this table). |
| ServiceAccount token mounts | Each token Secret can create a watch; many watches burden the control plane. | For Pods that don't need API access, set `automountServiceAccountToken: false` (see the Pod example after this table). [3] |
| Object count/size | Accumulated ConfigMaps, Secrets, PVCs, etc. increase control-plane load. | Limit ReplicaSet history (`revisionHistoryLimit`) and use `ttlSecondsAfterFinished` for Jobs/CronJobs (see the examples after this table). |
| Pod requests/limits | Large gaps between requests and limits can trigger cascading failures on node loss. | Prefer setting requests equal to limits where feasible. |
| Controller restarts & monitoring | Controller or API server restarts cause re-lists that can overload the API server. | Monitor controllers, set appropriate resources to avoid restarts, and reduce unnecessary control-plane operations. |
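
The Pod sketch below illustrates the two per-Pod settings from the table, `enableServiceLinks: false` and `automountServiceAccountToken: false`. The Pod name, container name, and image are illustrative placeholders, not values from this guide.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker            # illustrative name
spec:
  # Skip injecting one set of env vars per Service in the namespace,
  # which slows Pod startup when a namespace has thousands of Services.
  enableServiceLinks: false
  # This Pod never calls the API server, so don't mount a token
  # and don't add the associated load on the control plane.
  automountServiceAccountToken: false
  containers:
    - name: worker
      image: busybox:1.36       # illustrative image
      command: ["sleep", "3600"]
```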
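The Deployment sketch below combines three of the table's recommendations: conservative `maxUnavailable`/`maxSurge` values for RollingUpdate, a small `revisionHistoryLimit`, and requests set equal to limits. The workload name, image, replica count, and resource values are assumptions for illustration; tune them to your environment.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                     # illustrative name
spec:
  replicas: 50
  # Keep only a few old ReplicaSets (default is 10) to reduce stored objects in etcd.
  revisionHistoryLimit: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      # Conservative values: replace Pods a few at a time so Endpoint/EndpointSlice
      # updates fan out to nodes gradually instead of in a storm.
      maxUnavailable: 1
      maxSurge: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.25     # illustrative image
          resources:
            # requests == limits (Guaranteed QoS) avoids overcommit that can
            # cascade into failures when a node is lost.
            requests:
              cpu: "500m"
              memory: 512Mi
            limits:
              cpu: "500m"
              memory: 512Mi
```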
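For Jobs and CronJobs, `ttlSecondsAfterFinished` lets the TTL controller delete finished objects automatically so they do not accumulate in etcd. A minimal sketch, with an assumed name, image, and a one-hour TTL chosen for illustration:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report          # illustrative name
spec:
  # Delete the Job (and its Pods) one hour after it completes.
  ttlSecondsAfterFinished: 3600
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: report
          image: busybox:1.36   # illustrative image
          command: ["sh", "-c", "echo generating report"]
```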

Footnotes

  1. https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md

  2. https://kubernetes.io/docs/tutorials/services/connect-applications-service/#accessing-the-service

  3. https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#opt-out-of-api-credential-automounting