GPU Resource Monitoring


Feature Overview

The Resource Monitoring feature enables real-time and historical tracking of GPU utilization and memory usage across nodes and pods within the Container Platform. This functionality helps administrators and developers:

  • Monitor GPU Performance: Identify bottlenecks in GPU resource allocation.
  • Troubleshoot Issues: Analyze GPU usage trends for debugging resource-related problems.
  • Optimize Workloads: Make data-driven decisions to improve workload distribution.

Applicable Scenarios:

  • Real-time monitoring of GPU-intensive applications.
  • Historical analysis of GPU utilization for capacity planning.
  • Multi-node/pod GPU performance comparison.

Value Delivered:

  • Enhanced visibility into GPU resource consumption.
  • Improved cluster efficiency through actionable insights.

Core Features

  • Node-Level Monitoring: Track GPU utilization and memory usage per node.
  • Pod-Level Monitoring: Monitor GPU metrics for individual pods.
  • Custom Time Ranges: Analyze data from 30 minutes up to 7 days.

Feature Advantages

  • Real-Time Visualization: Interactive dashboards with auto-refresh capabilities.
  • Multi-Dimensional Filtering: Narrow down metrics by GPU type, namespace, or pod.

Node Monitoring

Monitor GPU resources at the node level through these steps:

Access GPU Dashboards

  1. Navigate to Platform Management view
  2. Go to Operations Center → Monitoring → Dashboards
  3. Switch to the GPU directory

Select Node Metrics

  1. Choose Node Monitoring dashboard
  2. Select target node from dropdown
  3. Pick time range:
    • Last 30 minutes
    • Last 1/6/12/24 hours
    • Last 2/7 days
    • Custom range

Interpret Metrics

| Metric | Description |
| --- | --- |
| GPU Utilization | Percentage of GPU computing capacity in use (0-100%) |
| GPU Memory Usage | Total memory consumed vs. available memory (in GiB) |
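
The dashboard panels render these figures from the platform's monitoring datasource. If that datasource exposes a Prometheus-compatible HTTP API, the same node-level metrics can be pulled programmatically. The following is a minimal sketch under that assumption; the endpoint URL, node name, DCGM-style metric names (DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_FB_USED), and the Hostname label are placeholders that may differ in your environment.

```python
# Minimal sketch: pull node-level GPU metrics from a Prometheus-compatible API.
# PROM_URL, the metric names, and the "Hostname" label are assumptions -- adjust
# them to match the exporter and labels actually used in your cluster.
import requests

PROM_URL = "http://prometheus.example.com"   # hypothetical monitoring endpoint
NODE = "gpu-node-01"                         # hypothetical node name

def instant_query(expr: str) -> list:
    """Run an instant query and return the result vector."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Average GPU utilization (0-100%) across all GPUs on the node.
util = instant_query(f'avg(DCGM_FI_DEV_GPU_UTIL{{Hostname="{NODE}"}})')

# GPU memory in use on the node, converted to GiB (assuming the exporter reports MiB).
mem = instant_query(f'sum(DCGM_FI_DEV_FB_USED{{Hostname="{NODE}"}}) / 1024')

for name, result in [("GPU utilization (%)", util), ("GPU memory used (GiB)", mem)]:
    if result:
        print(f"{name}: {float(result[0]['value'][1]):.1f}")
```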

Pod Monitoring

Analyze GPU usage at the pod level with granular filtering:

Access Pod Metrics

  1. Navigate to the GPU directory under Operations Center → Monitoring → Dashboards
  2. Choose Pod Monitoring

Configure Filters

  1. Select GPU type:
    • pGPU
    • GPU-Manager (vGPU)
    • MPS
  2. Choose namespace containing GPU pods
  3. Select specific pod
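
Outside the console, the same GPU-type filtering can be approximated by inspecting the extended resources a pod requests, since each GPU mode is exposed to Kubernetes under its own resource name. The sketch below uses the official Kubernetes Python client; the resource names and namespace shown are assumptions about the installed device plugins and may not match your cluster (MPS deployments in particular vary in how they expose resources).

```python
# Sketch: list pods in a namespace that request GPU resources, grouped by GPU type.
# The resource names below are assumptions about the installed device plugins;
# replace them with the names your cluster actually uses.
from kubernetes import client, config

GPU_RESOURCES = {
    "nvidia.com/gpu": "pGPU",             # whole physical GPUs (assumed name)
    "tencent.com/vcuda-core": "vGPU",     # GPU-Manager virtual GPU cores (assumed name)
}

config.load_kube_config()                 # or load_incluster_config() inside a pod
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("gpu-apps").items:   # hypothetical namespace
    for c in pod.spec.containers:
        requests_ = (c.resources.requests or {}) if c.resources else {}
        for res, gpu_type in GPU_RESOURCES.items():
            if res in requests_:
                print(f"{pod.metadata.name}/{c.name}: {gpu_type} x {requests_[res]}")
```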

Key Metrics

| Metric | Description |
| --- | --- |
| Pod GPU Utilization | GPU compute usage by the selected pod |
| Pod GPU Memory | GPU memory allocated to the selected pod |
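
Like the node panels, the pod panels are backed by queries against the monitoring datasource, scoped by namespace and pod. A rough sketch, again assuming a Prometheus-compatible API and DCGM-style metrics carrying namespace and pod labels (both the metric and label names vary between exporters):

```python
# Sketch: per-pod GPU utilization, assuming DCGM-style metrics exported with
# "namespace" and "pod" labels. The endpoint, metric name, label names,
# namespace, and pod name are all assumptions.
import requests

PROM_URL = "http://prometheus.example.com"    # hypothetical monitoring endpoint
query = 'avg(DCGM_FI_DEV_GPU_UTIL{namespace="gpu-apps", pod="trainer-0"})'

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    ts, value = series["value"]
    print(f"Pod GPU utilization: {float(value):.1f}%")
```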

Time Range Selection

Both dashboards support flexible time windows:

Available Presets:
- Last 30 minutes
- Last 1 hour
- Last 6 hours
- Last 12 hours
- Last 24 hours
- Last 2 days
- Last 7 days
- Custom range
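
If you pull the same data programmatically instead of through the dashboard, each preset maps to a start/end window and a sampling step for a range query. A small sketch, assuming a Prometheus-style query_range API:

```python
# Sketch: translate the dashboard's time-range presets into range-query parameters.
from datetime import datetime, timedelta, timezone

PRESETS = {
    "Last 30 minutes": timedelta(minutes=30),
    "Last 1 hour": timedelta(hours=1),
    "Last 6 hours": timedelta(hours=6),
    "Last 12 hours": timedelta(hours=12),
    "Last 24 hours": timedelta(hours=24),
    "Last 2 days": timedelta(days=2),
    "Last 7 days": timedelta(days=7),
}

def range_params(preset: str, points: int = 300) -> dict:
    """Return start/end/step suitable for a Prometheus-style query_range call."""
    end = datetime.now(timezone.utc)
    start = end - PRESETS[preset]
    # Spread roughly `points` samples across the window, with at least 15 s resolution.
    step = max(int(PRESETS[preset].total_seconds() // points), 15)
    return {"start": start.timestamp(), "end": end.timestamp(), "step": step}

print(range_params("Last 6 hours"))
```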