
#Resource Monitoring

#Function Overview

Resource Monitoring in Alauda AI's Monitoring & Ops module provides real-time insight into your inference services' CPU, memory, and GPU consumption, token usage, and request metrics. This feature helps you identify performance bottlenecks, optimize resource allocation, and keep services running stably. It is particularly useful for scenarios such as:

  • Performance tuning: Diagnose high resource usage and adjust resource limits.
  • Anomaly detection: Monitor sudden spikes in resource consumption and request patterns.
  • Capacity planning: Analyze historical trends to scale resources efficiently.
  • Cost optimization: Track token usage and GPU utilization for budget management.

#Main Features

  • Resource Monitoring:
    • CPU Usage: Displays absolute CPU usage (e.g., cores used).
    • CPU Utilization: Shows CPU usage as a percentage of allocated resources.
    • Memory Usage: Tracks actual memory consumption (e.g., in GB).
    • Memory Utilization: Displays memory usage as a percentage of allocated resources.
  • Computing Monitoring:
    • GPU Usage: Tracks GPU compute resource consumption.
    • GPU Utilization: Shows GPU usage as a percentage of allocated resources.
    • GPU Memory Usage: Monitors GPU memory consumption.
    • GPU Memory Utilization: Displays GPU memory usage as a percentage of allocated resources.
    • Note: MPS deployment mode does not support GPU compute and memory monitoring.
  • Other Monitoring:
    • Token Metrics: Tracks prompt and generation tokens (available for 'vllm' runtime services).
    • Request Metrics: Monitors response time (avg/p50/p90/p95), QPS (success/fail), and traffic (in/out).
  • Time Range Selection: Analyze metrics over customizable periods (from 30 minutes to 7 days).

#Accessing Resource Monitoring

#Step 1: Navigate to Inference Service Details

  1. Go to Inference Services in the left navigation pane.
  2. Click the target Inference Service name to open its details page.
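
If you prefer to confirm which inference services exist before opening the UI, here is a minimal sketch that lists InferenceService objects (serving.kserve.io/v1beta1) with the Kubernetes Python client. It assumes the client library is installed and your kubeconfig points at the cluster; the namespace name is a placeholder.

```python
# Minimal sketch: list InferenceService objects in a namespace so you can
# pick the service name to open in the details page. Namespace is a placeholder.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside a pod
api = client.CustomObjectsApi()

services = api.list_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="my-namespace",  # placeholder: your project namespace
    plural="inferenceservices",
)
for item in services.get("items", []):
    name = item["metadata"]["name"]
    url = item.get("status", {}).get("url", "<not ready>")
    print(name, url)
```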

#Step 2: Open Monitoring Dashboard

  1. Select the Monitor tab.
  2. Ensure the Resource Monitor section is expanded (default view).

#Step 3: Select Time Range

Use the time picker in the top-right corner to choose a predefined or custom range:

  • Preset options: Last 30 minutes, Last 1 hour, Last 6 hours, Last 24 hours, Last 2 days, Last 7 days
  • Custom range: specify a Start/End datetime
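
If you need the same window for an external query or export, each preset simply translates into an absolute start/end pair ending at the current time. A small illustrative sketch of that arithmetic (not an Alauda AI API):

```python
# Illustrative sketch: map the time-picker presets onto absolute UTC ranges.
from datetime import datetime, timedelta, timezone

PRESETS = {
    "Last 30 minutes": timedelta(minutes=30),
    "Last 1 hour": timedelta(hours=1),
    "Last 6 hours": timedelta(hours=6),
    "Last 24 hours": timedelta(hours=24),
    "Last 2 days": timedelta(days=2),
    "Last 7 days": timedelta(days=7),
}

def preset_to_range(preset: str) -> tuple[datetime, datetime]:
    """Translate a preset label into a (start, end) pair ending now (UTC)."""
    end = datetime.now(timezone.utc)
    return end - PRESETS[preset], end

start, end = preset_to_range("Last 6 hours")
print(start.isoformat(), "->", end.isoformat())
```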

#Monitoring Metrics

#CPU Usage

  • Description: Shows actual CPU cores consumed by the service.
  • Data Format: cores (float)

#CPU Utilization

  • Description: Percentage of allocated CPU resources being used.
  • Calculation: (Used Cores / Allocated Cores) × 100%
  • Interpretation (see the sketch after this list):
    • Sustained >90%: Consider scaling CPU allocation
    • <20%: Potential over-provisioning
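
The utilization figures on this page all follow the same used/allocated formula; the memory and GPU utilization metrics below are computed the same way. A minimal sketch with illustrative numbers (not taken from a real dashboard):

```python
# Minimal sketch of the utilization formula used throughout this page:
# utilization = used / allocated * 100. Values below are illustrative only.

def utilization_pct(used: float, allocated: float) -> float:
    """Return resource utilization as a percentage of the allocation."""
    if allocated <= 0:
        raise ValueError("allocated must be positive")
    return used / allocated * 100.0

def interpret_cpu(util_pct: float) -> str:
    """Map the interpretation thresholds above onto a suggestion."""
    if util_pct > 90:
        return "sustained high load: consider scaling CPU allocation"
    if util_pct < 20:
        return "potential over-provisioning"
    return "within normal range"

# Example: 1.8 cores used out of a 2-core allocation -> 90.0%
pct = utilization_pct(used=1.8, allocated=2.0)
print(f"{pct:.1f}% -> {interpret_cpu(pct)}")
```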

#Memory Usage

  • Description: Physical memory consumed by the service.
  • Data Format: GiB or MiB
  • Critical Note: Kubernetes OOM kills occur when usage exceeds the container's memory limit.

#Memory Utilization

  • Description: Percentage of allocated memory resources being used.
  • Calculation: (Used Memory / Allocated Memory) × 100%

#GPU Usage

  • Description: GPU compute resources consumed by the service.
  • Data Format: Compute units
  • Note: Not available for MPS deployment mode.

#GPU Utilization

  • Description: Percentage of allocated GPU compute resources being used.
  • Calculation: (Used GPU / Allocated GPU) × 100%
  • Note: Not available for MPS deployment mode.

#GPU Memory Usage

  • Description: GPU memory consumed by the service.
  • Data Format: GiB or MiB
  • Note: Not available for MPS deployment mode.

#GPU Memory Utilization

  • Description: Percentage of allocated GPU memory being used.
  • Calculation: (Used GPU Memory / Allocated GPU Memory) × 100%
  • Note: Not available for MPS deployment mode.

#Token Metrics

  • Token Prompt: Tracks the number of prompt tokens processed.
  • Token Generation: Monitors the number of tokens generated by the model.
  • Availability: Only available for inference services using the 'vllm' runtime (a scraping sketch follows this list).
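
The dashboard charts these counters for you, but they can also be read directly from the runtime's Prometheus metrics endpoint. A hedged sketch: the URL is a placeholder, and the metric names (vllm:prompt_tokens_total, vllm:generation_tokens_total) should be verified against the vLLM version your runtime ships.

```python
# Hedged sketch: sum vLLM's prompt/generation token counters from the
# Prometheus text exposition at the runtime's /metrics endpoint.
# METRICS_URL and the metric names are assumptions; verify them for your setup.
import re
import urllib.request

METRICS_URL = "http://my-inference-service.example.svc:8000/metrics"  # placeholder

def read_counter(text: str, name: str) -> float:
    """Sum every sample of a counter across its label sets."""
    pattern = re.compile(rf"^{re.escape(name)}(?:\{{[^}}]*\}})?\s+(\S+)", re.MULTILINE)
    return sum(float(value) for value in pattern.findall(text))

body = urllib.request.urlopen(METRICS_URL, timeout=5).read().decode()
print("prompt tokens:    ", read_counter(body, "vllm:prompt_tokens_total"))
print("generation tokens:", read_counter(body, "vllm:generation_tokens_total"))
```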

#Request Metrics

  • Response Time: Measures service response latency (avg/p50/p90/p95); a short derivation sketch follows this list.
  • QPS (Queries Per Second): Tracks successful and failed requests per second.
  • Traffic: Monitors inbound and outbound data transfer.
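
For reference, these figures can be reproduced from raw request samples with straightforward arithmetic. The sketch below uses illustrative numbers and simple nearest-rank percentiles, which may differ slightly from the interpolation the dashboard applies:

```python
# Illustrative sketch: derive avg/p50/p90/p95 latency and QPS from raw samples.
# The sample data is made up; it is not taken from a real service.

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile (p in 0..100) of a list of latencies."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(round(p / 100 * (len(ordered) - 1))))
    return ordered[idx]

latencies_ms = [42, 45, 47, 51, 55, 60, 72, 85, 120, 300]  # per-request latency
window_s = 60                                               # observation window
successes, failures = 9, 1

print("avg:", sum(latencies_ms) / len(latencies_ms), "ms")
for p in (50, 90, 95):
    print(f"p{p}:", percentile(latencies_ms, p), "ms")
print("QPS (success):", successes / window_s)
print("QPS (fail):   ", failures / window_s)
```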