Monitor Component Capacity Planning

The monitor component stores the metrics data collected from one or more clusters in the platform. Assess your monitoring scale in advance and plan the resources for the monitor component according to the guidelines in this document.

Assumptions and Methodology

  • Data in this document comes from controlled lab performance reports and is intended as a sizing baseline for production planning.
  • Retention for the disk examples is 7 days; scale the disk figures proportionally for other retention targets.
  • Storage baseline: SSD with ~6000 IOPS and ~250 MB/s read/write throughput, on an independent mount.
  • Test workloads exercised typical monitoring pages such as "acp ns overview page" and "platform region detail page".
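
The proportional retention adjustment mentioned above can be sketched as a small helper. This is a minimal illustration, not an official sizing tool: the function name `scale_disk_gb` and the `headroom` multiplier (default 1.5) are assumptions introduced here, since this document does not state a fixed ratio between write volume and provisioned disk.

```python
import math

BASELINE_RETENTION_DAYS = 7  # retention used for the disk examples in this document


def scale_disk_gb(write_gb_7d: float, retention_days: float, headroom: float = 1.5) -> int:
    """Estimate a disk size (in G) for a retention target other than 7 days.

    write_gb_7d: write volume observed over the 7-day baseline.
    headroom: hypothetical safety multiplier (an assumption, not a documented value).
    """
    if retention_days <= 0 or headroom < 1:
        raise ValueError("retention_days must be positive and headroom must be >= 1")
    # Linear scaling: write volume grows in proportion to the retention window.
    scaled_write = write_gb_7d * retention_days / BASELINE_RETENTION_DAYS
    return math.ceil(scaled_write * headroom)
```

For example, `scale_disk_gb(69, 14)` doubles a ~69G 7-day write volume to 138G and then applies the assumed 1.5 multiplier, which gives 207.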

Prometheus

Below are sizing recommendations by scale for Prometheus and related components (Thanos Query, Thanos Sidecar, etc.).

Small Scale — 10 worker nodes, 500 double-container Pods

  • Metric ingestion rate: ~2800 samples/second
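| Component | Container | Replicas | CPU Limit | Memory Limit | Disk (if applicable) | Notes |
|---|---|---|---|---|---|---|
| courier-api | courier | 2 | 2C | 4Gi | - | - |
| kube-prometheus-thanos-query | thanos-query | 1 | 1C | 1Gi | - | - |
| prometheus-kube-prometheus-0 | prometheus | 1 | 2C | 8Gi | 20G | ~10G write over 7 days |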

Medium Scale — 50 worker nodes, 2000 double-container Pods

  • Metric ingestion rate: ~7294 samples/second
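| Component | Container | Replicas | CPU Limit | Memory Limit | Disk (if applicable) | Notes |
|---|---|---|---|---|---|---|
| courier-api | courier | 2 | 4C | 4Gi | - | - |
| kube-prometheus-thanos-query | thanos-query | 1 | 2.5C | 8Gi | - | - |
| prometheus-kube-prometheus-0 | prometheus | 1 | 4C | 8Gi | 40G | ~30G write over 7 days |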

Large Scale — 500 worker nodes, 10000 double-container Pods

  • Metric ingestion rate: ~41575 samples/second
| Component | Container | Replicas | CPU Limit | Memory Limit | Disk (if applicable) | Notes |
|---|---|---|---|---|---|---|
| courier-api | courier | 2 | 6C | 4Gi | - | - |
| kube-prometheus-thanos-query | thanos-query | 1 | 2C | 6Gi | - | In-field deployments may use 2 replicas |
| prometheus-kube-prometheus-0 | prometheus | 1 | 8C | 20Gi | 100G | Peak mem ~15Gi; ~69G write over 7 days |
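
To compare your own ingestion rate against these lab figures, it can help to convert each run into an implied on-disk cost per sample. The sketch below does that arithmetic for the three Prometheus runs. It assumes 1G = 10^9 bytes and a constant ingestion rate across the 7 days; neither is stated in this document, and `bytes_per_sample` is a name introduced here for illustration.

```python
SECONDS_PER_DAY = 86_400


def bytes_per_sample(samples_per_second: float, write_gb: float, days: int = 7) -> float:
    """Implied on-disk bytes per sample over the test window.

    Assumes 1G = 10**9 bytes and a constant ingestion rate (both are assumptions).
    """
    total_samples = samples_per_second * days * SECONDS_PER_DAY
    return write_gb * 10**9 / total_samples


# (ingestion rate in samples/second, approximate write in G over 7 days),
# taken from the Prometheus tables in this document.
PROMETHEUS_RUNS = {
    "small": (2800, 10),
    "medium": (7294, 30),
    "large": (41575, 69),
}

for scale, (rate, write_gb) in PROMETHEUS_RUNS.items():
    print(f"{scale}: {bytes_per_sample(rate, write_gb):.2f} bytes/sample")
```

With these inputs the results fall in the low single digits of bytes per sample. Multiplying such a figure by your own ingestion rate and retention window gives a rough disk estimate to sanity-check against the tables.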

VictoriaMetrics

Below are sizing recommendations by scale for VictoriaMetrics components.

Small Scale — 10 worker nodes, 500 double-container Pods

  • Metric ingestion rate: ~3274 samples/second
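| Component | Container | Replicas | CPU Limit | Memory Limit | Disk (if applicable) | Notes |
|---|---|---|---|---|---|---|
| courier-api | courier | 1 | 2C | 4Gi | - | - |
| vmselect-cluster | proxy | 1 | 1C | 200Mi | - | - |
| vmselect | vmselect | 1 | 500m | 1Gi | - | - |
| vmstorage-cluster | vmstorage | 1 | 500m | 2Gi | 3G | ~1.5G write over 7 days |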

Medium Scale — 50 worker nodes, 2000 double-container Pods

  • Metric ingestion rate: ~6940 samples/second
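| Component | Container | Replicas | CPU Limit | Memory Limit | Disk (if applicable) | Notes |
|---|---|---|---|---|---|---|
| courier-api | courier | 2 | 4C | 4Gi | - | - |
| vmselect-cluster | proxy | 1 | 1C | 200Mi | - | - |
| vmselect | vmselect | 1 | 2C | 2Gi | - | - |
| vmstorage-cluster | vmstorage | 1 | 2C | 2Gi | 10G | ~2.6G write over 7 days |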

Large Scale — 500 worker nodes, 10000 double-container Pods

  • Metric ingestion rate: ~34300 samples/second
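| Component | Container | Replicas | CPU Limit | Memory Limit | Disk (if applicable) | Notes |
|---|---|---|---|---|---|---|
| courier-api | courier | 2 | 6C | 4Gi | - | - |
| vmselect-cluster | proxy | 1 | 2C | 200Mi | - | - |
| vmselect | vmselect | 1 | 5C | 3Gi | - | - |
| vmstorage-cluster | vmstorage | 1 | 2C | 6Gi | 30G | ~16.8G write over 7 days |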
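
One way to apply the scale tiers is to encode them as a lookup and pick the smallest tier that covers both your node count and your Pod count. The sketch below does this for the vmstorage figures only. Treating each tested scale as an upper bound is an assumption made for illustration, and `Tier`, `VMSTORAGE_TIERS`, and `pick_tier` are hypothetical names, not part of the platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    max_nodes: int   # worker nodes in the tested scale, treated here as an upper bound
    max_pods: int    # double-container Pods in the tested scale, also an upper bound
    cpu_limit: str
    memory_limit: str
    disk: str


# vmstorage figures from the VictoriaMetrics tables, ordered smallest to largest.
VMSTORAGE_TIERS = [
    Tier("small", 10, 500, "500m", "2Gi", "3G"),
    Tier("medium", 50, 2000, "2C", "2Gi", "10G"),
    Tier("large", 500, 10000, "2C", "6Gi", "30G"),
]


def pick_tier(worker_nodes: int, pods: int) -> Tier:
    """Return the smallest tier whose node and Pod bounds both cover the workload."""
    for tier in VMSTORAGE_TIERS:
        if worker_nodes <= tier.max_nodes and pods <= tier.max_pods:
            return tier
    raise ValueError("workload exceeds the largest tested scale; size it separately")
```

A cluster with 50 worker nodes but 5000 Pods would land in the large tier under this rule, because the Pod count exceeds the medium bound even though the node count does not.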