Resource Monitoring


Function Overview

Resource Monitoring in Alauda AI's Monitoring & Ops module provides real-time insights into the CPU, memory, GPU, token usage, and request metrics of your inference services. This feature helps you identify performance bottlenecks, optimize resource allocation, and ensure stable service operations. It is particularly useful for scenarios such as:

  • Performance tuning: Diagnose high resource usage and adjust resource limits.
  • Anomaly detection: Monitor sudden spikes in resource consumption and request patterns.
  • Capacity planning: Analyze historical trends to scale resources efficiently.
  • Cost optimization: Track token usage and GPU utilization for budget management.

Main Features

  • Resource Monitoring:
    • CPU Usage: Displays absolute CPU usage, measured in cores.
    • CPU Utilization: Shows CPU usage as a percentage of allocated resources.
    • Memory Usage: Tracks actual memory consumption (e.g., in GB).
    • Memory Utilization: Displays memory usage as a percentage of allocated resources.
  • Computing Monitoring:
    • GPU Usage: Tracks GPU compute resource consumption.
    • GPU Utilization: Shows GPU usage as a percentage of allocated resources.
    • GPU Memory Usage: Monitors GPU memory consumption.
    • GPU Memory Utilization: Displays GPU memory usage as a percentage of allocated resources.
    • Note: The MPS (NVIDIA Multi-Process Service) deployment mode does not support GPU compute and memory monitoring.
  • Other Monitoring:
    • Token Metrics: Tracks prompt and generation tokens (available for 'vllm' runtime services).
    • Request Metrics: Monitors response time (avg/tp50/p90/p95), QPS (success/fail), and traffic (in/out).
  • Time Range Selection: Analyze metrics over customizable periods (from 30 minutes to 7 days).

Accessing Resource Monitoring

Step 1: Navigate to Inference Service Details

  1. Go to Inference Services in the left navigation pane.
  2. Click the target Inference Service name to open its details page.

Step 2: Open Monitoring Dashboard

  1. Select the Monitor tab.
  2. Ensure the Resource Monitor section is expanded (default view).

Step 3: Select Time Range

Use the time picker in the top-right corner to choose a predefined or custom range:

  • Preset Options: Last 30 minutes, Last 1 hour, Last 6 hours, Last 24 hours, Last 2 days, Last 7 days
  • Custom Range: Start/End datetime
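
As an illustration of how the preset options translate into a start/end datetime pair equivalent to a custom range, here is a minimal Python sketch. The `PRESETS` mapping and the `resolve_range` helper are hypothetical names introduced for this example; they are not part of Alauda AI.

```python
from datetime import datetime, timedelta, timezone

# Preset labels as listed in the time picker; the mapping itself is illustrative.
PRESETS = {
    "Last 30 minutes": timedelta(minutes=30),
    "Last 1 hour": timedelta(hours=1),
    "Last 6 hours": timedelta(hours=6),
    "Last 24 hours": timedelta(hours=24),
    "Last 2 days": timedelta(days=2),
    "Last 7 days": timedelta(days=7),
}


def resolve_range(preset, now=None):
    """Return the (start, end) datetimes covered by a preset label."""
    end = now or datetime.now(timezone.utc)
    return end - PRESETS[preset], end


start, end = resolve_range("Last 6 hours")
print(f"{start.isoformat()} to {end.isoformat()}")
```

Passing an explicit `now` makes the result reproducible, which is convenient when comparing the same window across several services.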

Monitoring Metrics

CPU Usage

  • Description: Shows actual CPU cores consumed by the service.
  • Data Format: cores (float)

CPU Utilization

  • Description: Percentage of allocated CPU resources being used.
  • Calculation: (Used Cores / Allocated Cores) × 100%
  • Interpretation:
    • Sustained above 90%: Consider increasing the CPU allocation.
    • Below 20%: Potential over-provisioning.
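
The calculation and the two interpretation thresholds above can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and "sustained" is assumed here to mean that every sample in the selected window crosses the threshold.

```python
def utilization_percent(used, allocated):
    """Apply the formula (Used / Allocated) x 100%."""
    if allocated <= 0:
        raise ValueError("allocated must be positive")
    return used / allocated * 100.0


def interpret_cpu(samples, high=90.0, low=20.0):
    """Classify a window of CPU utilization percentages.

    Assumption: a condition is 'sustained' when all samples satisfy it.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if all(s > high for s in samples):
        return "consider increasing CPU allocation"
    if all(s < low for s in samples):
        return "potential over-provisioning"
    return "within expected range"


# Example: 1.5 cores used out of 2 allocated cores.
print(utilization_percent(1.5, 2.0))
print(interpret_cpu([95.0, 92.5, 91.0]))
```

Requiring every sample to cross the threshold avoids reacting to a single short spike; a looser rule (for example, a fraction of samples) would be an equally valid choice.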

Memory Usage

  • Description: Physical memory consumed by the service.
  • Data Format: GiB or MiB
  • Critical Note: Kubernetes terminates a container with an out-of-memory (OOM) kill when its memory usage exceeds its allocated memory limit.

Memory Utilization

  • Description: Percentage of allocated memory resources being used.
  • Calculation: (Used Memory / Allocated Memory) × 100%
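
To make the GiB/MiB units and the utilization formula concrete, the following Python sketch converts raw byte counts and computes the percentage. The helper names are hypothetical, and the rule for choosing between GiB and MiB (switching at 1 GiB) is an assumption made for this example, not documented dashboard behavior.

```python
GIB = 1024 ** 3  # bytes in one gibibyte
MIB = 1024 ** 2  # bytes in one mebibyte


def format_memory(num_bytes):
    """Render a byte count in GiB or MiB (assumed cutoff: 1 GiB)."""
    if num_bytes >= GIB:
        return f"{num_bytes / GIB:.2f} GiB"
    return f"{num_bytes / MIB:.2f} MiB"


def memory_utilization(used_bytes, allocated_bytes):
    """Apply the formula (Used Memory / Allocated Memory) x 100%."""
    if allocated_bytes <= 0:
        raise ValueError("allocated_bytes must be positive")
    return used_bytes / allocated_bytes * 100.0


used, allocated = 3 * GIB, 4 * GIB
print(format_memory(used), "of", format_memory(allocated))
print(f"{memory_utilization(used, allocated):.1f}%")
```

A utilization value approaching 100% means the service is close to its memory allocation, which is the condition under which an OOM kill becomes likely.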

GPU Usage

  • Description: GPU compute resources consumed by the service.
  • Data Format: Compute units
  • Note: Not available for MPS deployment mode.

GPU Utilization

  • Description: Percentage of allocated GPU compute resources being used.
  • Calculation: (Used GPU / Allocated GPU) × 100%
  • Note: Not available for MPS deployment mode.

GPU Memory Usage

  • Description: GPU memory consumed by the service.
  • Data Format: GiB or MiB
  • Note: Not available for MPS deployment mode.

GPU Memory Utilization

  • Description: Percentage of allocated GPU memory being used.
  • Calculation: (Used GPU Memory / Allocated GPU Memory) × 100%
  • Note: Not available for MPS deployment mode.
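
Because GPU metrics are not available in MPS deployment mode, code that consumes these values should treat them as optional. A minimal Python sketch, assuming a hypothetical `deployment_mode` string and a hypothetical helper name (neither is an Alauda AI API):

```python
from typing import Optional


def gpu_memory_utilization(
    used_mib: float, allocated_mib: float, deployment_mode: str
) -> Optional[float]:
    """Return GPU memory utilization in percent, or None when unavailable.

    Assumption: the deployment mode is passed as a plain string and the
    value "MPS" marks the mode in which GPU metrics are not reported.
    """
    if deployment_mode.upper() == "MPS":
        return None
    if allocated_mib <= 0:
        raise ValueError("allocated_mib must be positive")
    return used_mib / allocated_mib * 100.0


value = gpu_memory_utilization(6144, 8192, "MPS")
print("n/a" if value is None else f"{value:.1f}%")
```

Returning `None` rather than `0.0` keeps "no data" distinguishable from "idle GPU" in any downstream chart or alert.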

Token Metrics

  • Token Prompt: Tracks the number of prompt tokens processed.
  • Token Generation: Monitors the number of tokens generated by the model.
  • Availability: Only available for inference services using the 'vllm' runtime.

Request Metrics

  • Response Time: Measures service response latency as the average and the 50th, 90th, and 95th percentiles (avg/tp50/p90/p95).
  • QPS (Queries Per Second): Tracks successful and failed requests per second.
  • Traffic: Monitors inbound and outbound data transfer.
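
The following Python sketch shows one way the response time and QPS figures above could be derived from raw request records. It is an illustration under stated assumptions: the input format (latency in milliseconds plus a success flag) and the function names are hypothetical, and the nearest-rank percentile method is one common convention that may differ from what the dashboard uses.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(requests, window_seconds):
    """Summarize (latency_ms, succeeded) tuples observed in one window."""
    latencies = [latency for latency, _ in requests]
    succeeded = sum(1 for _, ok in requests if ok)
    return {
        "avg": mean(latencies),
        "p50": percentile(latencies, 50),
        "p90": percentile(latencies, 90),
        "p95": percentile(latencies, 95),
        "qps_success": succeeded / window_seconds,
        "qps_fail": (len(requests) - succeeded) / window_seconds,
    }


# Hypothetical sample: 20 requests over a 10-second window, 2 of them failed.
sample = [(10 * i, i % 10 != 0) for i in range(1, 21)]
print(summarize(sample, window_seconds=10))
```

Percentiles are reported alongside the average because a small number of slow requests can leave the average nearly unchanged while noticeably raising p90 and p95.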