Alauda AI

#Introduction

Resource Monitoring is a core component of the Monitoring & Ops module in Alauda AI. It tracks and analyzes the resource utilization metrics of inference services. As part of the full-stack MLOps platform, it provides real-time visibility into infrastructure resource consumption, which helps users optimize model deployments, prevent resource bottlenecks, and keep AI workloads running stably. Because it is integrated with Alauda AI's unified monitoring ecosystem, these insights are available directly within the MLOps workflow, with no need for separate monitoring tools.

#Usage Limitations

When using Resource Monitoring, note the following constraints:

  • Data Collection Intervals
    • Minimum metric scraping interval: 60 seconds
    • Historical data retention: 7 days by default
  • Dependency Requirements
    • A Prometheus or VictoriaMetrics monitoring stack must be deployed in each target cluster
    • Node Exporter must be running on all worker nodes
    • DCGM Exporter must be running on all GPU nodes
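
To make these limits concrete, here is a minimal Python sketch of what they imply in practice. It is an illustration only and not part of Alauda AI: the function names and the node inventory lists are hypothetical, and the only values taken from this page are the 60-second minimum scrape interval and the 7-day default retention.

```python
from datetime import timedelta

# Limits stated in this section.
MIN_SCRAPE_INTERVAL = timedelta(seconds=60)
DEFAULT_RETENTION = timedelta(days=7)


def max_samples_per_series(retention=DEFAULT_RETENTION, interval=MIN_SCRAPE_INTERVAL):
    """Upper bound on the samples retained for one metric series."""
    # Dividing two timedeltas with // yields an integer count.
    return retention // interval


def window_is_resolvable(window, interval=MIN_SCRAPE_INTERVAL, min_samples=2):
    """True if `window` spans enough scrapes to compute a rate or trend."""
    return window // interval >= min_samples


def missing_exporters(worker_nodes, gpu_nodes, node_exporter_nodes, dcgm_exporter_nodes):
    """Hypothetical inventory check: list nodes that lack a required exporter."""
    return {
        "node_exporter": sorted(set(worker_nodes) - set(node_exporter_nodes)),
        "dcgm_exporter": sorted(set(gpu_nodes) - set(dcgm_exporter_nodes)),
    }
```

With the defaults, `max_samples_per_series()` returns 10080, that is, at most one sample per minute over seven days. A query window shorter than two scrape intervals, such as 90 seconds, cannot contain the two samples needed for a rate, so `window_is_resolvable(timedelta(seconds=90))` is false. The node lists passed to `missing_exporters` would have to come from your own cluster inventory.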