TIP

Explore the key features of the Monitoring & Ops module for Inference Services. This overview introduces the core capabilities that help you monitor, analyze, and optimize AI service operations efficiently.

Features Overview

TOC

Logging

  • Realtime Pod Logs
    Stream logs from Replica pods associated with inference services in real time. Debug issues instantly and track service behavior across deployments.
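
Replica logs are ordinary Kubernetes pod logs, so the same stream can also be tailed programmatically. Below is a minimal sketch using the official Kubernetes Python client; the pod name and namespace are hypothetical placeholders for your own deployment:

```python
from kubernetes import client, config, watch

# Load credentials from ~/.kube/config (use load_incluster_config() inside a pod).
config.load_kube_config()
v1 = client.CoreV1Api()

# Follow the pod's log stream line by line, like `kubectl logs -f`.
w = watch.Watch()
for line in w.stream(
    v1.read_namespaced_pod_log,
    name="my-inference-replica-0",  # hypothetical replica pod name
    namespace="inference",          # hypothetical namespace
    follow=True,
    tail_lines=100,                 # start from the last 100 lines
):
    print(line)
```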

Monitoring

Resource Monitor

  • CPU/Memory Utilization
    Track CPU and memory usage metrics for inference services to optimize resource allocation and prevent bottlenecks.
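
If these metrics are scraped into Prometheus, the standard cAdvisor series expose the same numbers, and you can pull them over the Prometheus HTTP API. A minimal sketch; the endpoint URL and the `my-inference-*` pod-name pattern are assumptions, not part of the module:

```python
import requests

PROM_URL = "http://prometheus.example.internal:9090"  # hypothetical endpoint

def instant_query(promql: str):
    """Run an instant PromQL query; return (labels, value) pairs."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    return [(r["metric"], float(r["value"][1])) for r in resp.json()["data"]["result"]]

# CPU cores consumed per replica pod, averaged over 5 minutes (cAdvisor metric).
for labels, cores in instant_query(
    'sum by (pod) (rate(container_cpu_usage_seconds_total{pod=~"my-inference-.*"}[5m]))'
):
    print(f'{labels["pod"]}: {cores:.2f} cores')

# Working-set memory per replica pod, in MiB (cAdvisor metric).
for labels, used in instant_query(
    'sum by (pod) (container_memory_working_set_bytes{pod=~"my-inference-.*"})'
):
    print(f'{labels["pod"]}: {used / 2**20:.0f} MiB')
```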

Computing Monitor

  • GPU Metrics & VRAM
    Monitor GPU compute utilization and video memory (VRAM) consumption to ensure efficient hardware usage for accelerated workloads.
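
On NVIDIA hardware these signals come from the NVIDIA Management Library (NVML), and you can sample them yourself with the `pynvml` bindings. A minimal sketch, assuming NVIDIA GPUs and an installed `pynvml` package:

```python
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):  # older pynvml versions return bytes
            name = name.decode()
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percent, driver-sampled
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # bytes
        print(
            f"GPU {i} ({name}): compute {util.gpu}%, "
            f"VRAM {mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB"
        )
finally:
    pynvml.nvmlShutdown()
```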

Other Monitor

  • Token Throughput
    Measure token processing rates (tokens per second) to evaluate model performance and scalability; see the first sketch after this list.
  • Request Traffic Analytics
    Analyze request volume and latency, and track successful and failed queries per second (QPS) to maintain service reliability and meet SLAs; see the second sketch after this list.
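
Token throughput is tokens processed divided by elapsed time, typically computed over a sliding window so the figure reflects current load. A minimal, illustrative meter; the class name and default window are assumptions, not the module's API:

```python
import time
from collections import deque

class TokenThroughputMeter:
    """Sliding-window tokens/second meter."""

    def __init__(self, window: float = 10.0):
        self.window = window   # window length in seconds
        self.events = deque()  # (timestamp, token_count) pairs

    def record(self, token_count: int) -> None:
        """Call after each decoding step with the tokens it produced."""
        now = time.monotonic()
        self.events.append((now, token_count))
        self._evict(now)

    def tokens_per_second(self) -> float:
        """Average rate over the window (underestimates until it fills)."""
        self._evict(time.monotonic())
        return sum(n for _, n in self.events) / self.window

    def _evict(self, now: float) -> None:
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()

# Usage: record after each step, read the rate when reporting.
meter = TokenThroughputMeter(window=10.0)
meter.record(32)  # e.g. one decode step produced 32 tokens
print(f"{meter.tokens_per_second():.1f} tokens/s")
```

Success/failure QPS and latency percentiles fall out of the same sliding-window idea applied to per-request samples. Another illustrative sketch with hypothetical names:

```python
import time
from collections import deque

class TrafficAnalytics:
    """Sliding-window success/failure QPS and p95 latency."""

    def __init__(self, window: float = 60.0):
        self.window = window    # window length in seconds
        self.samples = deque()  # (timestamp, latency_seconds, ok) triples

    def record(self, latency_s: float, ok: bool) -> None:
        """Call once per completed request."""
        now = time.monotonic()
        self.samples.append((now, latency_s, ok))
        while self.samples and now - self.samples[0][0] > self.window:
            self.samples.popleft()

    def snapshot(self) -> dict:
        successes = sum(1 for s in self.samples if s[2])
        failures = len(self.samples) - successes
        latencies = sorted(s[1] for s in self.samples)
        # Nearest-rank p95; 0.0 when the window is empty.
        p95 = latencies[int(0.95 * (len(latencies) - 1))] if latencies else 0.0
        return {
            "success_qps": successes / self.window,
            "failure_qps": failures / self.window,
            "p95_latency_s": p95,
        }

# Usage: analytics.record(0.042, ok=True) per request; snapshot() when scraping.
```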
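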