Introduction

Resource Monitoring is a core component of Alauda AI's Monitoring & Ops module, designed specifically for tracking and analyzing resource utilization metrics of inference services. As part of the full-stack MLOps platform, it provides real-time visibility into infrastructure resource consumption, enabling users to optimize model deployment, prevent resource bottlenecks, and ensure stable operation of AI workloads. Integrated with Alauda AI's unified monitoring ecosystem, Resource Monitoring eliminates the need for fragmented tooling by delivering actionable insights directly within your MLOps workflow.


Advantages

The core advantages of Resource Monitoring are as follows:

  • Real-Time Metrics Visualization
    Offers intuitive dashboards with granular CPU and memory usage data updated in near real time, supporting both cluster-level and pod-level monitoring for precise resource analysis.

  • MLOps-Centric Integration
    Seamlessly correlates resource metrics with other operational data (GPU utilization, request traffic, etc.) within the Alauda AI platform, enabling holistic performance troubleshooting.

  • Cost-Optimization Insights
    Identifies underutilized resources and overprovisioned containers through historical trend analysis (see the sketch after this list).
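
As a rough illustration of how this kind of trend analysis could be automated outside the built-in dashboards, the sketch below compares each pod's 7-day average CPU usage against its configured CPU request through a Prometheus-compatible HTTP API. This is a minimal sketch, not part of Alauda AI: the endpoint URL, namespace, metric and label names, and the 30% threshold are assumptions that will differ per environment.

```python
# Minimal sketch (assumptions, not Alauda AI's API): flag pods whose 7-day
# average CPU usage is far below their configured CPU requests, using a
# Prometheus-compatible HTTP API.
import requests

PROM_URL = "http://prometheus.example.internal:9090"  # hypothetical endpoint
NAMESPACE = "inference"                                # hypothetical namespace
THRESHOLD = 0.3                                        # flag pods using <30% of requested CPU


def instant_query(promql: str) -> dict:
    """Run an instant PromQL query and return {pod_name: value}."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return {r["metric"].get("pod", "<unknown>"): float(r["value"][1]) for r in results}


# Average CPU usage per pod over the last 7 days (cores).
usage = instant_query(
    f'avg_over_time(sum by (pod) ('
    f'rate(container_cpu_usage_seconds_total{{namespace="{NAMESPACE}"}}[5m]))[7d:5m])'
)

# Configured CPU requests per pod (cores), as exposed by kube-state-metrics.
requested_cpu = instant_query(
    f'sum by (pod) (kube_pod_container_resource_requests{{namespace="{NAMESPACE}", resource="cpu"}})'
)

for pod, requested in sorted(requested_cpu.items()):
    used = usage.get(pod, 0.0)
    if requested > 0 and used / requested < THRESHOLD:
        print(f"{pod}: avg usage {used:.2f} cores vs request {requested:.2f} cores "
              f"({used / requested:.0%}) - candidate for right-sizing")
```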

Application Scenarios

Key application scenarios for Resource Monitoring include:

  • Inference Service Health Management
    Continuously monitor CPU and memory consumption during model-serving peaks to detect spikes, verify SLA compliance, and confirm that auto-scaling responds as expected.

  • Resource Allocation Tuning
    Analyze historical usage patterns to right-size container resource requests/limits, improving cluster utilization efficiency.

  • Performance Anomaly Investigation
    Cross-reference resource metrics with application logs and request traffic data during incident diagnosis to identify causal relationships.

  • Capacity Planning
    Forecast infrastructure needs by tracking long-term usage trends and seasonal workload variations (see the sketch after this list).
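
As a rough sketch of the capacity-planning scenario, the example below fits a simple linear trend to historical namespace memory usage retrieved via a Prometheus-compatible range query and projects it forward. The endpoint, namespace, metric name, and lookback window are assumptions; a real forecast would also need to model seasonal variation rather than a straight line.

```python
# Minimal sketch (assumptions, not Alauda AI's API): fit a linear trend to
# historical memory usage and project it forward as a first-pass capacity estimate.
import time

import numpy as np
import requests

PROM_URL = "http://prometheus.example.internal:9090"  # hypothetical endpoint
NAMESPACE = "inference"                                # hypothetical namespace
LOOKBACK_DAYS = 7                                      # matches the default retention window
FORECAST_DAYS = 30

end = time.time()
start = end - LOOKBACK_DAYS * 86400

# Total working-set memory of the namespace, sampled hourly over the lookback window.
resp = requests.get(
    f"{PROM_URL}/api/v1/query_range",
    params={
        "query": f'sum(container_memory_working_set_bytes{{namespace="{NAMESPACE}"}})',
        "start": start,
        "end": end,
        "step": "1h",
    },
    timeout=30,
)
resp.raise_for_status()
result = resp.json()["data"]["result"]
if not result:
    raise SystemExit("no memory usage data returned for the assumed namespace")
series = result[0]["values"]  # [[timestamp, "value"], ...]

timestamps = np.array([float(t) for t, _ in series])
values = np.array([float(v) for _, v in series])

# Least-squares linear fit on seconds since the first sample (better conditioning).
x = timestamps - timestamps[0]
slope, intercept = np.polyfit(x, values, 1)
horizon = (end + FORECAST_DAYS * 86400) - timestamps[0]
projected = slope * horizon + intercept

print(f"Current usage: {values[-1] / 2**30:.1f} GiB")
print(f"Projected usage in {FORECAST_DAYS} days: {projected / 2**30:.1f} GiB "
      f"(linear trend only; does not model seasonal peaks)")
```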

Usage Limitations

When using Resource Monitoring, note the following constraints:

  • Data Collection Intervals

    • Minimum metric scraping interval: 60 seconds
    • Historical data retention: 7 days by default

  • Dependency Requirements (see the verification sketch after this list)

    • Requires a Prometheus/VictoriaMetrics monitoring stack deployed in the target clusters
    • Node exporter must be running on all worker nodes
    • DCGM exporter must be running on GPU nodes
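
One way to confirm that the required exporters are actually reporting is to query the monitoring stack's up metric per scrape job, as in the sketch below. This is a hedged example: the endpoint URL and the job label values are assumptions and depend on how the exporters were deployed in your clusters.

```python
# Minimal sketch (assumptions, not Alauda AI's API): verify that the exporters
# required by Resource Monitoring are reporting, by querying the monitoring
# stack's "up" metric per scrape job.
import requests

PROM_URL = "http://prometheus.example.internal:9090"  # hypothetical endpoint

# Hypothetical scrape job names; adjust to match your deployment.
REQUIRED_JOBS = {
    "node-exporter": "node exporter on worker nodes",
    "dcgm-exporter": "DCGM exporter on GPU nodes",
}

for job, description in REQUIRED_JOBS.items():
    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": f'sum(up{{job="{job}"}})'},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    healthy = int(float(result[0]["value"][1])) if result else 0
    status = "OK" if healthy > 0 else "MISSING"
    print(f"[{status}] {description}: {healthy} target(s) reporting")
```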