
#Resource Monitoring

#Function Overview

Resource Monitoring in Alauda AI's Monitoring & Ops module provides real-time insight into your inference services' CPU, memory, and GPU consumption, token usage, and request metrics. This feature helps you identify performance bottlenecks, optimize resource allocation, and keep services running stably. It is particularly useful for scenarios such as:

  • Performance tuning: Diagnose high resource usage and adjust resource limits.
  • Anomaly detection: Monitor sudden spikes in resource consumption and request patterns.
  • Capacity planning: Analyze historical trends to scale resources efficiently.
  • Cost optimization: Track token usage and GPU utilization for budget management.

#Main Features

  • Resource Monitoring:
    • CPU Usage: Displays absolute CPU usage (e.g., cores used).
    • CPU Utilization: Shows CPU usage as a percentage of allocated resources.
    • Memory Usage: Tracks actual memory consumption (e.g., in GB).
    • Memory Utilization: Displays memory usage as a percentage of allocated resources.
  • Computing Monitoring:
    • GPU Usage: Tracks GPU compute resource consumption.
    • GPU Utilization: Shows GPU usage as a percentage of allocated resources.
    • GPU Memory Usage: Monitors GPU memory consumption.
    • GPU Memory Utilization: Displays GPU memory usage as a percentage of allocated resources.
    • Note: MPS deployment mode does not support GPU compute and memory monitoring.
  • Other Monitoring:
    • Token Metrics: Tracks prompt and generation tokens (available for 'vllm' runtime services).
    • Request Metrics: Monitors response time (avg/p50/p90/p95), QPS (success/fail), and traffic (in/out).
  • Time Range Selection: Analyze metrics over customizable periods (from 30 minutes to 7 days).

#Accessing Resource Monitoring

#Step 1: Navigate to Inference Service Details

  1. Go to Inference Services in the left navigation pane.
  2. Click the target Inference Service name to open its details page.
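
If you prefer to confirm which inference services exist before opening the UI, here is a minimal sketch that lists InferenceService objects (serving.kserve.io/v1beta1) with the Kubernetes Python client. It assumes the client library is installed and your kubeconfig points at the cluster; the namespace name is a placeholder.

```python
# Minimal sketch: list InferenceService objects in a namespace so you can
# pick the service name to open in the details page. Namespace is a placeholder.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside a pod
api = client.CustomObjectsApi()

services = api.list_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="my-namespace",  # placeholder: your project namespace
    plural="inferenceservices",
)
for item in services.get("items", []):
    name = item["metadata"]["name"]
    url = item.get("status", {}).get("url", "<not ready>")
    print(name, url)
```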

#Step 2: Open Monitoring Dashboard

  1. Select the Monitor tab.
  2. Ensure the Resource Monitor section is expanded (default view).

#Step 3: Select Time Range

Use the time picker in the top-right corner to choose a predefined or custom range:

  • Preset options: Last 30 minutes, Last 1 hour, Last 6 hours, Last 24 hours, Last 2 days, Last 7 days
  • Custom range: specify a Start/End datetime
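
If you need the same window for an external query or export, each preset simply translates into an absolute start/end pair ending at the current time. A small illustrative sketch of that arithmetic (not an Alauda AI API):

```python
# Illustrative sketch: map the time-picker presets onto absolute UTC ranges.
from datetime import datetime, timedelta, timezone

PRESETS = {
    "Last 30 minutes": timedelta(minutes=30),
    "Last 1 hour": timedelta(hours=1),
    "Last 6 hours": timedelta(hours=6),
    "Last 24 hours": timedelta(hours=24),
    "Last 2 days": timedelta(days=2),
    "Last 7 days": timedelta(days=7),
}

def preset_to_range(preset: str) -> tuple[datetime, datetime]:
    """Translate a preset label into a (start, end) pair ending now (UTC)."""
    end = datetime.now(timezone.utc)
    return end - PRESETS[preset], end

start, end = preset_to_range("Last 6 hours")
print(start.isoformat(), "->", end.isoformat())
```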

#Monitoring Metrics

#CPU Usage

  • Description: Shows actual CPU cores consumed by the service.
  • Data Format: cores (float)

#CPU Utilization

  • Description: Percentage of allocated CPU resources being used.
  • Calculation: (Used Cores / Allocated Cores) × 100%
  • Interpretation (see the sketch after this list):
    • Sustained >90%: Consider scaling CPU allocation
    • <20%: Potential over-provisioning
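
The utilization figures on this page all follow the same used/allocated formula; the memory and GPU utilization metrics below are computed the same way. A minimal sketch with illustrative numbers (not taken from a real dashboard):

```python
# Minimal sketch of the utilization formula used throughout this page:
# utilization = used / allocated * 100. Values below are illustrative only.

def utilization_pct(used: float, allocated: float) -> float:
    """Return resource utilization as a percentage of the allocation."""
    if allocated <= 0:
        raise ValueError("allocated must be positive")
    return used / allocated * 100.0

def interpret_cpu(util_pct: float) -> str:
    """Map the interpretation thresholds above onto a suggestion."""
    if util_pct > 90:
        return "sustained high load: consider scaling CPU allocation"
    if util_pct < 20:
        return "potential over-provisioning"
    return "within normal range"

# Example: 1.8 cores used out of a 2-core allocation -> 90.0%
pct = utilization_pct(used=1.8, allocated=2.0)
print(f"{pct:.1f}% -> {interpret_cpu(pct)}")
```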

#Memory Usage

  • Description: Physical memory consumed by the service.
  • Data Format: GiB or MiB
  • Critical Note: Kubernetes OOM kills occur when usage exceeds the container's memory limit.

#Memory Utilization

  • Description: Percentage of allocated memory resources being used.
  • Calculation: (Used Memory / Allocated Memory) × 100%

#GPU Usage

  • Description: GPU compute resources consumed by the service.
  • Data Format: Compute units
  • Note: Not available for MPS deployment mode.

#GPU Utilization

  • Description: Percentage of allocated GPU compute resources being used.
  • Calculation: (Used GPU / Allocated GPU) × 100%
  • Note: Not available for MPS deployment mode.

#GPU Memory Usage

  • Description: GPU memory consumed by the service.
  • Data Format: GiB or MiB
  • Note: Not available for MPS deployment mode.

#GPU Memory Utilization

  • Description: Percentage of allocated GPU memory being used.
  • Calculation: (Used GPU Memory / Allocated GPU Memory) × 100%
  • Note: Not available for MPS deployment mode.

#Token Metrics

  • Token Prompt: Tracks the number of prompt tokens processed.
  • Token Generation: Monitors the number of tokens generated by the model.
  • Availability: Only available for inference services using the 'vllm' runtime (a scraping sketch follows this list).
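
The dashboard charts these counters for you, but they can also be read directly from the runtime's Prometheus metrics endpoint. A hedged sketch: the URL is a placeholder, and the metric names (vllm:prompt_tokens_total, vllm:generation_tokens_total) should be verified against the vLLM version your runtime ships.

```python
# Hedged sketch: sum vLLM's prompt/generation token counters from the
# Prometheus text exposition at the runtime's /metrics endpoint.
# METRICS_URL and the metric names are assumptions; verify them for your setup.
import re
import urllib.request

METRICS_URL = "http://my-inference-service.example.svc:8000/metrics"  # placeholder

def read_counter(text: str, name: str) -> float:
    """Sum every sample of a counter across its label sets."""
    pattern = re.compile(rf"^{re.escape(name)}(?:\{{[^}}]*\}})?\s+(\S+)", re.MULTILINE)
    return sum(float(value) for value in pattern.findall(text))

body = urllib.request.urlopen(METRICS_URL, timeout=5).read().decode()
print("prompt tokens:    ", read_counter(body, "vllm:prompt_tokens_total"))
print("generation tokens:", read_counter(body, "vllm:generation_tokens_total"))
```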

#Request Metrics

  • Response Time: Measures service response latency (avg/p50/p90/p95); a short derivation sketch follows this list.
  • QPS (Queries Per Second): Tracks successful and failed requests per second.
  • Traffic: Monitors inbound and outbound data transfer.
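
For reference, these figures can be reproduced from raw request samples with straightforward arithmetic. The sketch below uses illustrative numbers and simple nearest-rank percentiles, which may differ slightly from the interpolation the dashboard applies:

```python
# Illustrative sketch: derive avg/p50/p90/p95 latency and QPS from raw samples.
# The sample data is made up; it is not taken from a real service.

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile (p in 0..100) of a list of latencies."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(round(p / 100 * (len(ordered) - 1))))
    return ordered[idx]

latencies_ms = [42, 45, 47, 51, 55, 60, 72, 85, 120, 300]  # per-request latency
window_s = 60                                               # observation window
successes, failures = 9, 1

print("avg:", sum(latencies_ms) / len(latencies_ms), "ms")
for p in (50, 90, 95):
    print(f"p{p}:", percentile(latencies_ms, p), "ms")
print("QPS (success):", successes / window_s)
print("QPS (fail):   ", failures / window_s)
```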