GPU Resource Monitoring
Feature Overview
The Resource Monitoring feature enables real-time and historical tracking of GPU utilization and memory usage across nodes and pods within the Container Platform. This functionality helps administrators and developers:
- Monitor GPU Performance: Identify bottlenecks in GPU resource allocation.
- Troubleshoot Issues: Analyze GPU usage trends for debugging resource-related problems.
- Optimize Workloads: Make data-driven decisions to improve workload distribution.
Applicable Scenarios:
- Real-time monitoring of GPU-intensive applications.
- Historical analysis of GPU utilization for capacity planning.
- Multi-node/pod GPU performance comparison.
Value Delivered:
- Enhanced visibility into GPU resource consumption.
- Improved cluster efficiency through actionable insights.
Core Features
- Node-Level Monitoring: Track GPU utilization and memory usage per node.
- Pod-Level Monitoring: Monitor GPU metrics for individual pods.
- Custom Time Ranges: Analyze data from 30 minutes up to 7 days.
Feature Advantages
- Real-Time Visualization: Interactive dashboards with auto-refresh capabilities.
- Multi-Dimensional Filtering: Narrow down metrics by GPU type, namespace, or pod.
Node Monitoring
Monitor GPU resources at the node level through these steps:
Access GPU Dashboards
- Navigate to the Platform Management view
- Go to Operations Center → Monitoring → Dashboards
- Switch to the GPU directory
Select Node Metrics
- Choose the Node Monitoring dashboard
- Select the target node from the dropdown
- Pick a time range:
  - Last 30 minutes
  - Last 1/6/12/24 hours
  - Last 2/7 days
  - Custom range
Interpret Metrics
| Metric | Description |
| --- | --- |
| GPU Utilization | Percentage of GPU computing capacity used (0-100%) |
| GPU Memory Usage | Total memory consumed vs. available memory (in GiB) |
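
For automation (alerting, capacity reports), the same node-level figures can usually be pulled from the monitoring backend that feeds these dashboards. The sketch below is a minimal example, assuming a Prometheus-compatible API and NVIDIA DCGM exporter metric names (`DCGM_FI_DEV_GPU_UTIL`, `DCGM_FI_DEV_FB_USED`); the endpoint URL, node name, metric names, and the `Hostname` label are assumptions and may differ on your platform.

```python
import requests

# Assumed Prometheus-compatible endpoint backing the GPU dashboards.
PROM_URL = "http://prometheus.example.internal:9090"

def instant_query(promql: str) -> list:
    """Run an instant PromQL query and return the result vector."""
    resp = requests.get(f"{PROM_URL}/api/v1/query",
                        params={"query": promql}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

node = "worker-gpu-01"  # hypothetical node name
# Assumed DCGM exporter metrics: utilization in %, framebuffer usage in MiB.
util = instant_query(f'avg(DCGM_FI_DEV_GPU_UTIL{{Hostname="{node}"}})')
mem = instant_query(f'sum(DCGM_FI_DEV_FB_USED{{Hostname="{node}"}}) / 1024')

print("Node GPU utilization (%):", util[0]["value"][1] if util else "n/a")
print("Node GPU memory used (GiB):", mem[0]["value"][1] if mem else "n/a")
```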
Pod Monitoring
Analyze GPU usage at the pod level with granular filtering:
Access Pod Metrics
- Navigate to the dashboards in the GPU directory
- Choose the Pod Monitoring dashboard
Configure Filters
- Select the GPU type:
  - pGPU
  - GPU-Manager (vGPU)
  - MPS
- Choose the namespace containing GPU pods
- Select the specific pod (a scripted way to list candidate GPU pods is sketched below)
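
The namespace and pod dropdowns are expected to list only pods that request GPU resources. To enumerate the same candidate pods outside the console, the sketch below uses the official Kubernetes Python client; the resource names shown (`nvidia.com/gpu` for pGPU, `tencent.com/vcuda-core` for GPU-Manager vGPU) are assumptions about common device plugins, so adjust them to whatever your cluster actually registers.

```python
from kubernetes import client, config

# Assumes a local kubeconfig; inside a pod, use config.load_incluster_config().
config.load_kube_config()
v1 = client.CoreV1Api()

# Assumed extended-resource names for the GPU types in the filter above.
GPU_RESOURCES = {"nvidia.com/gpu", "tencent.com/vcuda-core"}

namespace = "ai-workloads"  # hypothetical namespace
for pod in v1.list_namespaced_pod(namespace).items:
    for container in pod.spec.containers:
        limits = container.resources.limits or {}
        if GPU_RESOURCES & set(limits):
            # This pod requests a GPU and should appear in the pod dropdown.
            print(pod.metadata.name, dict(limits))
            break
```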
Key Metrics
| Metric | Description |
| --- | --- |
| Pod GPU Utilization | GPU compute usage by the selected pod |
| Pod GPU Memory | Memory allocation for the selected pod |
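
A pod-scoped variant of the node query shown earlier is sketched below, again assuming a Prometheus-compatible API and DCGM exporter metrics; the `namespace` and `pod` label names vary between exporter setups, so treat them as assumptions to verify against your own labels.

```python
import requests

PROM_URL = "http://prometheus.example.internal:9090"  # assumed endpoint

def instant_query(promql: str) -> list:
    """Run an instant PromQL query and return the result vector."""
    resp = requests.get(f"{PROM_URL}/api/v1/query",
                        params={"query": promql}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

ns, pod = "ai-workloads", "training-job-0"  # hypothetical namespace and pod
util = instant_query(f'avg(DCGM_FI_DEV_GPU_UTIL{{namespace="{ns}", pod="{pod}"}})')
mem = instant_query(f'sum(DCGM_FI_DEV_FB_USED{{namespace="{ns}", pod="{pod}"}}) / 1024')

print("Pod GPU utilization (%):", util[0]["value"][1] if util else "n/a")
print("Pod GPU memory (GiB):", mem[0]["value"][1] if mem else "n/a")
```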
Time Range Selection
Both dashboards support flexible time windows:
Available Presets:
- Last 30 minutes
- Last 1 hour
- Last 6 hours
- Last 12 hours
- Last 24 hours
- Last 2 days
- Last 7 days
- Custom range
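
When the same windows are needed programmatically, each preset maps onto a range query with a start time, end time, and sampling step. A minimal sketch, assuming a Prometheus-compatible `/api/v1/query_range` endpoint and the same assumed DCGM metric names as above:

```python
import time
import requests

PROM_URL = "http://prometheus.example.internal:9090"  # assumed endpoint

# Preset windows in seconds; "Custom range" supplies explicit start/end instead.
PRESETS = {
    "30m": 1800, "1h": 3600, "6h": 21600, "12h": 43200,
    "24h": 86400, "2d": 172800, "7d": 604800,
}

def range_query(promql: str, preset: str, step: str = "60s") -> list:
    """Fetch a time series covering the preset window ending now."""
    end = time.time()
    start = end - PRESETS[preset]
    resp = requests.get(
        f"{PROM_URL}/api/v1/query_range",
        params={"query": promql, "start": start, "end": end, "step": step},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Example: node GPU utilization sampled every 5 minutes over the last 6 hours.
series = range_query("avg(DCGM_FI_DEV_GPU_UTIL)", "6h", step="300s")
```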