Introduction

Resource Monitoring is a critical component of the Kubernetes Hardware Accelerator Suite, designed to provide comprehensive visibility into GPU resource utilization across your containerized workloads. This module delivers real-time metrics and historical data analysis for both compute utilization and GPU memory consumption at two fundamental levels:

  • Node-Level Monitoring: Track aggregate GPU resource usage across entire Kubernetes nodes
  • Pod-Level Monitoring: Analyze per-workload GPU consumption with pod granularity

Integrated with the platform's core accelerator modules (pGPU/vGPU (GPU-Manager)/MPS), this monitoring solution enables users to optimize GPU allocation, enforce resource quotas, and troubleshoot performance bottlenecks in workloads such as AI/ML training and real-time inference services.
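
To make the two levels concrete, here is a minimal Python sketch (using the official kubernetes client) that contrasts node-level and pod-level views by aggregating declared GPU requests. It assumes the NVIDIA device plugin's "nvidia.com/gpu" resource name, which may differ per accelerator module, and it stands in for the module's own metrics pipeline, which reports measured utilization rather than declared requests.

    # Minimal sketch: aggregate declared GPU requests per node and per pod.
    # Assumes the "nvidia.com/gpu" resource name; the monitoring module
    # itself reports measured utilization, not requests.
    from collections import defaultdict
    from kubernetes import client, config

    config.load_kube_config()  # use config.load_incluster_config() in-cluster
    v1 = client.CoreV1Api()

    node_gpus = defaultdict(int)  # node-level aggregate
    pod_gpus = {}                 # pod-level granularity

    for pod in v1.list_pod_for_all_namespaces().items:
        requested = sum(
            int(c.resources.requests.get("nvidia.com/gpu", 0))
            for c in pod.spec.containers
            if c.resources and c.resources.requests
        )
        if requested and pod.spec.node_name:
            pod_gpus[f"{pod.metadata.namespace}/{pod.metadata.name}"] = requested
            node_gpus[pod.spec.node_name] += requested

    print("Per-node GPU requests:", dict(node_gpus))
    print("Per-pod GPU requests:", pod_gpus)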

Advantages

The core advantages of Resource Monitoring are as follows:

  • Multi-Dimensional Observability

    Simultaneously monitor both compute units (CUDA cores) and memory utilization across physical/virtual GPUs, providing holistic insights into accelerator usage patterns.

  • Hierarchical Metrics Collection

    Capture data at both node and pod granularity, enabling correlation between cluster-wide resource trends and individual workload demands.

  • Native Integration

    Works seamlessly with all accelerator modules (pGPU/vGPU/MPS) without requiring additional agents, leveraging Kubernetes-native metrics pipelines.

  • Historical Analysis

    Store GPU metrics with configurable retention periods (default 7 days) for capacity planning and usage pattern analysis through integrated visualization tools.
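
As an illustration of what the retained history enables, the following self-contained Python sketch computes average and peak utilization over a retention window. The sample format, a list of (timestamp, utilization-percent) pairs, and the function name are assumptions for this example, not the module's actual export format.

    # Illustrative only: summarize retained (timestamp, utilization %) samples.
    from datetime import datetime, timedelta

    def window_stats(samples, days=7):
        """Average and peak utilization over the last `days` days."""
        cutoff = datetime.utcnow() - timedelta(days=days)
        values = [v for ts, v in samples if ts >= cutoff]
        if not values:
            return None
        return {"avg": sum(values) / len(values), "peak": max(values)}

    # Example: three hourly samples for one GPU
    now = datetime.utcnow()
    samples = [(now - timedelta(hours=h), u) for h, u in [(2, 35.0), (1, 80.0), (0, 55.0)]]
    print(window_stats(samples))  # {'avg': 56.66..., 'peak': 80.0}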

Application Scenarios

The main application scenarios for Resource Monitoring are as follows:

  • Performance Optimization

    Identify underutilized GPUs in training clusters and right-size resource requests for deep learning workloads. For example, detect pods consistently using <30% of allocated GPU memory to optimize memory allocations (see the first sketch after this list).

  • Multi-Tenant Governance

    Enforce GPU quota compliance in shared environments by monitoring vGPU consumption across teams. Track cumulative usage against allocated quotas in AI platform deployments (see the second sketch after this list).

  • Cost Attribution

    Generate per-namespace GPU utilization reports for chargeback/showback models in enterprise Kubernetes environments, correlating pod-level metrics with organizational units.

  • Fault Diagnosis

    Investigate OOM (Out-of-Memory) incidents in GPU-accelerated workloads by analyzing memory-usage trends preceding container crashes. Cross-reference with Kubernetes events for root-cause analysis (see the third sketch after this list).

  • Capacity Planning

    Analyze historical GPU utilization patterns (e.g., peak compute demand periods) to inform infrastructure scaling decisions and budget allocations for AI infrastructure.
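
First, for the performance-optimization scenario, a simple threshold filter over pod-level memory metrics might look like the sketch below. The pod names and numbers are placeholders; real values would come from this module's pod-level metrics.

    # Placeholder data: flag pods using less than 30% of allocated GPU memory.
    def underutilized(pods, threshold=0.30):
        """pods: mapping of "namespace/pod" -> (used_mib, allocated_mib)."""
        return [
            name for name, (used, allocated) in pods.items()
            if allocated and used / allocated < threshold
        ]

    print(underutilized({
        "team-a/train-0": (2048, 16384),  # 12.5% -> flagged
        "team-b/infer-1": (9000, 16384),  # ~55%  -> not flagged
    }))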
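
Second, for multi-tenant governance, the quotas being monitored are ordinary Kubernetes ResourceQuota objects. The sketch below creates one with the official Python client; "requests.nvidia.com/gpu" is the standard quota key for the NVIDIA extended resource, and the namespace, quota name, and limit are illustrative.

    # Hedged example: cap a team namespace at 4 requested GPUs.
    from kubernetes import client, config

    config.load_kube_config()
    quota = client.V1ResourceQuota(
        metadata=client.V1ObjectMeta(name="team-a-gpu-quota"),
        spec=client.V1ResourceQuotaSpec(hard={"requests.nvidia.com/gpu": "4"}),
    )
    client.CoreV1Api().create_namespaced_resource_quota(namespace="team-a", body=quota)

Monitoring can then report cumulative GPU consumption against this hard limit.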
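
Third, for fault diagnosis, memory-usage trends from this module can be cross-referenced with Kubernetes events, as in the sketch below (again using the official Python client). The namespace and pod name are placeholders.

    # Placeholder names: pull recent events for a suspect pod.
    from kubernetes import client, config

    config.load_kube_config()
    events = client.CoreV1Api().list_namespaced_event(
        namespace="team-a",
        field_selector="involvedObject.name=train-0",
    )
    for e in events.items:
        print(e.last_timestamp, e.reason, e.message)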

Usage Limitations

When using Resource Monitoring, please note the following constraints:

  • Module Dependencies
    • Requires at least one accelerator module (pGPU/vGPU/MPS) to be deployed in the cluster