This guide helps cluster operators and SREs reduce control-plane overload, improve reliability, and limit blast radius in large-scale Kubernetes clusters.
| Issue | Description | Optimization | 
|---|---|---|
| etcd disk | etcd is highly sensitive to disk I/O in large clusters. | Use a dedicated NVMe SSD for the etcd data directory. | 
| etcd DB size | DBs larger than ~8 GB significantly worsen latency, resource use, and recovery time. | Keep the etcd DB ≤ 8 GB; remove unused objects and keep frequently updated objects small (≤ ~100 KB each). |
| etcd key churn | High read/write frequency on keys strains etcd. | Analyze etcd metrics to find and reduce hot keys. | 
| Data size per resource type in etcd | Large totals per resource type make full-list operations expensive and can block controllers. | Keep each resource type's total data ≤ 800 MB; clean up unused Deployments/Services.[^1] |
| API server LB bandwidth & connections | Load‑balancer bandwidth or connection limits can cause nodes to become NotReady. | Monitor/provision the API-server load balancer in advance. | 
| Services per namespace | Many Services cause large env var injection into Pods and slow startups. | Keep Services per namespace < 5,000, or set `enableServiceLinks: false`.[^2] |
| Total Services in cluster | Excess Services increase kube-proxy rules and harm performance. | Keep total Services < 10,000 and garbage-collect unused ones.[^1] |
| CoreDNS | Large Pod counts can degrade CoreDNS performance. | Run NodeLocal DNSCache (nodelocaldns). | 
| Pod update rate | High update rates push Endpoint/EndpointSlice changes to all nodes and can cause storms. | Reduce Pod churn; for RollingUpdate set conservative `maxUnavailable`/`maxSurge`. |
| ServiceAccount token mounts | Each token Secret can create a watch; many watches burden the control plane. | For Pods that don't need API access, set `automountServiceAccountToken: false`.[^3] |
| Object count/size | Accumulated ConfigMaps, Secrets, PVCs, etc. increase control-plane load. | Limit ReplicaSet history (`revisionHistoryLimit`) and use `ttlSecondsAfterFinished` for Jobs/CronJobs. |
| Pod requests/limits | Large gaps between requests and limits can trigger cascading failures on node loss. | Prefer setting requests equal to limits where feasible. | 
| Controller restarts & monitoring | Controller or API server restarts cause re-lists that can overload the API server. | Monitor controllers, set appropriate resources to avoid restarts, and reduce unnecessary control-plane operations. | 
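The per-namespace Service and token-mount recommendations above can both be applied in the Pod spec. A minimal sketch, assuming a worker Pod that needs neither Service environment variables nor API access (names and image are illustrative):

```yaml
# Hypothetical Pod that opts out of Service env var injection and token mounting.
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker                 # illustrative name
spec:
  # Skip injecting env vars for every Service in the namespace:
  enableServiceLinks: false
  # Don't mount a ServiceAccount token this Pod never uses:
  automountServiceAccountToken: false
  containers:
    - name: worker
      image: registry.example.com/worker:1.0   # placeholder image
```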
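Several rows above (Pod update rate, object count, requests/limits) come together in the Deployment spec. A sketch under assumed names and sizes, not a prescription:

```yaml
# Hypothetical Deployment combining conservative rollout, capped revision
# history, and requests set equal to limits.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 50
  revisionHistoryLimit: 3            # cap retained old ReplicaSets (default: 10)
  selector:
    matchLabels: {app: web}
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1              # conservative churn: replace Pods slowly
      maxSurge: 1                    # to limit Endpoint/EndpointSlice fan-out
  template:
    metadata:
      labels: {app: web}
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.0   # placeholder image
          resources:
            requests: {cpu: "500m", memory: 512Mi}
            limits:   {cpu: "500m", memory: 512Mi}   # requests == limits
```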
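For finished Jobs, `ttlSecondsAfterFinished` lets the TTL controller delete the Job and its Pods automatically, keeping object counts down. A minimal sketch (name, image, and TTL value are illustrative):

```yaml
# Hypothetical Job garbage-collected one hour after it finishes.
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report
spec:
  ttlSecondsAfterFinished: 3600      # delete Job + Pods 1 h after completion
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: report
          image: registry.example.com/report:1.0   # placeholder image
```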
[^1]: https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md
[^2]: https://kubernetes.io/docs/tutorials/services/connect-applications-service/#accessing-the-service
[^3]: https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#opt-out-of-api-credential-automounting