GPU Resource Monitoring


Feature Overview

The Resource Monitoring feature enables real-time and historical tracking of GPU utilization and memory usage across nodes and pods within the Container Platform. This functionality helps administrators and developers:

  • Monitor GPU Performance: Identify bottlenecks in GPU resource allocation.
  • Troubleshoot Issues: Analyze GPU usage trends for debugging resource-related problems.
  • Optimize Workloads: Make data-driven decisions to improve workload distribution.

Applicable Scenarios:

  • Real-time monitoring of GPU-intensive applications.
  • Historical analysis of GPU utilization for capacity planning.
  • Multi-node/pod GPU performance comparison.

Value Delivered:

  • Enhanced visibility into GPU resource consumption.
  • Improved cluster efficiency through actionable insights.

Core Features

  • Node-Level Monitoring: Track GPU utilization and memory usage per node.
  • Pod-Level Monitoring: Monitor GPU metrics for individual pods.
  • Custom Time Ranges: Analyze data from 30 minutes up to 7 days.

Feature Advantages

  • Real-Time Visualization: Interactive dashboards with auto-refresh capabilities.
  • Multi-Dimensional Filtering: Narrow down metrics by GPU type, namespace, or pod.

Node Monitoring

Monitor GPU resources at the node level through these steps:

Access GPU Dashboards

  1. Navigate to Platform Management view
  2. Go to Operations Center → Monitoring → Dashboards
  3. Switch to the GPU directory

Select Node Metrics

  1. Choose Node Monitoring dashboard
  2. Select target node from dropdown
  3. Pick time range:
    • Last 30 minutes
    • Last 1/6/12/24 hours
    • Last 2/7 days
    • Custom range

Interpret Metrics

| Metric | Description |
| --- | --- |
| GPU Utilization | Percentage of GPU computing capacity in use (0-100%) |
| GPU Memory Usage | Total memory consumed vs. available memory (in GiB) |
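
The dashboard panels render these figures from the platform's monitoring datasource. If that datasource exposes a Prometheus-compatible HTTP API, the same node-level metrics can be pulled programmatically. The following is a minimal sketch under that assumption; the endpoint URL, node name, DCGM-style metric names (DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_FB_USED), and the Hostname label are placeholders that may differ in your environment.

```python
# Minimal sketch: pull node-level GPU metrics from a Prometheus-compatible API.
# PROM_URL, the metric names, and the "Hostname" label are assumptions -- adjust
# them to match the exporter and labels actually used in your cluster.
import requests

PROM_URL = "http://prometheus.example.com"   # hypothetical monitoring endpoint
NODE = "gpu-node-01"                         # hypothetical node name

def instant_query(expr: str) -> list:
    """Run an instant query and return the result vector."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Average GPU utilization (0-100%) across all GPUs on the node.
util = instant_query(f'avg(DCGM_FI_DEV_GPU_UTIL{{Hostname="{NODE}"}})')

# GPU memory in use on the node, converted to GiB (assuming the exporter reports MiB).
mem = instant_query(f'sum(DCGM_FI_DEV_FB_USED{{Hostname="{NODE}"}}) / 1024')

for name, result in [("GPU utilization (%)", util), ("GPU memory used (GiB)", mem)]:
    if result:
        print(f"{name}: {float(result[0]['value'][1]):.1f}")
```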

Pod Monitoring

Analyze GPU usage at the pod level with granular filtering:

Access Pod Metrics

  1. Navigate to the GPU directory under Operations Center → Monitoring → Dashboards
  2. Choose Pod Monitoring

Configure Filters

  1. Select GPU type:
    • pGPU
    • GPU-Manager (vGPU)
    • MPS
  2. Choose namespace containing GPU pods
  3. Select specific pod
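
Outside the console, the same GPU-type filtering can be approximated by inspecting the extended resources a pod requests, since each GPU mode is exposed to Kubernetes under its own resource name. The sketch below uses the official Kubernetes Python client; the resource names and namespace shown are assumptions about the installed device plugins and may not match your cluster (MPS deployments in particular vary in how they expose resources).

```python
# Sketch: list pods in a namespace that request GPU resources, grouped by GPU type.
# The resource names below are assumptions about the installed device plugins;
# replace them with the names your cluster actually uses.
from kubernetes import client, config

GPU_RESOURCES = {
    "nvidia.com/gpu": "pGPU",             # whole physical GPUs (assumed name)
    "tencent.com/vcuda-core": "vGPU",     # GPU-Manager virtual GPU cores (assumed name)
}

config.load_kube_config()                 # or load_incluster_config() inside a pod
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("gpu-apps").items:   # hypothetical namespace
    for c in pod.spec.containers:
        requests_ = (c.resources.requests or {}) if c.resources else {}
        for res, gpu_type in GPU_RESOURCES.items():
            if res in requests_:
                print(f"{pod.metadata.name}/{c.name}: {gpu_type} x {requests_[res]}")
```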

Key Metrics

| Metric | Description |
| --- | --- |
| Pod GPU Utilization | GPU compute usage by the selected pod |
| Pod GPU Memory | GPU memory allocated to the selected pod |
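
Like the node panels, the pod panels are backed by queries against the monitoring datasource, scoped by namespace and pod. A rough sketch, again assuming a Prometheus-compatible API and DCGM-style metrics carrying namespace and pod labels (both the metric and label names vary between exporters):

```python
# Sketch: per-pod GPU utilization, assuming DCGM-style metrics exported with
# "namespace" and "pod" labels. The endpoint, metric name, label names,
# namespace, and pod name are all assumptions.
import requests

PROM_URL = "http://prometheus.example.com"    # hypothetical monitoring endpoint
query = 'avg(DCGM_FI_DEV_GPU_UTIL{namespace="gpu-apps", pod="trainer-0"})'

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    ts, value = series["value"]
    print(f"Pod GPU utilization: {float(value):.1f}%")
```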

Time Range Selection

Both dashboards support flexible time windows:

Available Presets:
- Last 30 minutes
- Last 1 hour
- Last 6 hours
- Last 12 hours
- Last 24 hours
- Last 2 days
- Last 7 days
- Custom range
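
If you pull the same data programmatically instead of through the dashboard, each preset maps to a start/end window and a sampling step for a range query. A small sketch, assuming a Prometheus-style query_range API:

```python
# Sketch: translate the dashboard's time-range presets into range-query parameters.
from datetime import datetime, timedelta, timezone

PRESETS = {
    "Last 30 minutes": timedelta(minutes=30),
    "Last 1 hour": timedelta(hours=1),
    "Last 6 hours": timedelta(hours=6),
    "Last 12 hours": timedelta(hours=12),
    "Last 24 hours": timedelta(hours=24),
    "Last 2 days": timedelta(days=2),
    "Last 7 days": timedelta(days=7),
}

def range_params(preset: str, points: int = 300) -> dict:
    """Return start/end/step suitable for a Prometheus-style query_range call."""
    end = datetime.now(timezone.utc)
    start = end - PRESETS[preset]
    # Spread roughly `points` samples across the window, with at least 15 s resolution.
    step = max(int(PRESETS[preset].total_seconds() // points), 15)
    return {"start": start.timestamp(), "end": end.timestamp(), "step": step}

print(range_params("Last 6 hours"))
```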