Introduction

Module Overview

The Monitoring Module provides operational capabilities such as metrics, dashboards, alerts, and notifications for platform administrators and operations personnel.

The platform integrates open-source components such as Prometheus / VictoriaMetrics and monitoring dashboards, enabling real-time monitoring of the clusters, nodes, components, custom applications, Pods, containers, and other resources that it manages.

It supports quick setup of metric alerts at the cluster, node, and computing component levels, log alerts (for computing components only), and event alerts. You can also define custom metric algorithms to suit actual requirements, adding the alert metrics and rules you need. Notification strategies can be configured to send alert information promptly to operations personnel, helping them prevent system failures or resolve issues quickly, which reduces operating costs and helps keep the system stable.
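
As a rough illustration of what a custom alert rule carries, the sketch below builds one as a plain Python dictionary using the field names of a Prometheus alerting rule (`alert`, `expr`, `for`, `labels`, `annotations`). The helper name, the rule name, the default threshold and duration, and the assumption that node CPU metrics come from node_exporter's `node_cpu_seconds_total` are all illustrative; the platform's own rule format may differ.

```python
from typing import Any, Dict


def build_node_cpu_alert(threshold_percent: float = 80.0,
                         duration: str = "5m") -> Dict[str, Any]:
    """Return a dict shaped like a Prometheus alerting rule (hypothetical helper)."""
    if not 0 < threshold_percent <= 100:
        raise ValueError("threshold_percent must be in (0, 100]")
    # Assumption: node_exporter exposes node_cpu_seconds_total per instance.
    # The expression computes busy CPU percentage as 100 minus the idle share.
    expr = (
        '100 - avg by (instance) '
        '(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 '
        f'> {threshold_percent:g}'
    )
    return {
        "alert": "NodeCpuUsageHigh",          # illustrative rule name
        "expr": expr,                         # PromQL condition that triggers the alert
        "for": duration,                      # how long the condition must hold
        "labels": {"severity": "warning"},
        "annotations": {
            "summary": f"Node CPU usage above {threshold_percent:g}% for {duration}",
        },
    }


rule = build_node_cpu_alert(threshold_percent=85)
print(rule["expr"])
```

The `for` field is what separates a brief spike from a sustained condition: the alert fires only after the expression has stayed true for the whole duration.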

Module Advantages

The Monitoring Module has the following core advantages:

Comprehensive Monitoring Coverage

Supports monitoring at multiple levels, including clusters, nodes, components, and containers, forming an end-to-end monitoring chain from infrastructure to applications.

Flexible Alert Configuration

Offers a rich set of preset alert rules and also supports custom alert rules and algorithms to suit different monitoring scenarios.

Diverse Visualization Displays

Integrates monitoring dashboards that support multiple data visualization methods, giving an intuitive view of system operational status.

Efficient Alert Notifications

Supports multi-channel alert notifications, such as email, SMS, and webhook, to help deliver alert information promptly.

Scalable Monitoring Architecture

Built on the widely used Prometheus / VictoriaMetrics technology stack, it offers good scalability and compatibility.

Application Scenarios

The Monitoring Module is applicable in the following scenarios:

Cluster Health Monitoring

Monitors resource usage, node status, and component operation within the cluster in real time so that potential issues are detected promptly.

Application Performance Analysis

Monitors application runtime metrics and container resource usage to help optimize application performance.

Fault Early Warning and Diagnosis

With reasonable alert rules in place, system anomalies can be detected early, which speeds up problem identification and resolution.

Capacity Planning

Analyzes trends in historical monitoring data to provide a basis for resource expansion and optimization.
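
The capacity planning scenario can be sketched as a simple linear trend over historical usage. The example below is a minimal sketch, not the platform's method: it assumes one usage sample per day (for example, disk usage in GiB exported from a dashboard), fits a least-squares line, and estimates how many days remain until a given capacity is reached. The function names and the sample data are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def linear_trend(samples: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit y = slope * x + intercept, with x = 0, 1, 2, ... (days)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov / var
    return slope, mean_y - slope * mean_x


def days_until_full(samples: Sequence[float], capacity: float) -> Optional[float]:
    """Days after the last sample until the fitted line reaches capacity.

    Returns None when usage is flat or shrinking, since no crossing is projected.
    """
    slope, intercept = linear_trend(samples)
    if slope <= 0:
        return None
    last_day = len(samples) - 1
    return max(0.0, (capacity - intercept) / slope - last_day)


# Hypothetical daily disk usage in GiB on a 100 GiB volume.
usage = [50.0, 52.0, 54.0, 56.0]
print(days_until_full(usage, capacity=100.0))
```

A straight line is the simplest possible model. Workloads with weekly cycles or step changes would need a longer window or a different fit.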

Usage Limitations

When using the Monitoring Module, please note the following limitations:

The storage duration of monitoring data depends on the storage capacity configuration, with a default retention period of 7 days.
Prometheus and VictoriaMetrics cannot be installed in the same cluster at the same time. Decide in advance which one to use and install only that one.
The minimum supported collection interval for custom monitoring metrics is 60 seconds.
Alert notification channels need to have the corresponding services pre-configured (such as email servers, SMS gateways, WeChat/DingTalk bots, etc.).
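
To make the retention and interval limits concrete, the sketch below works through the arithmetic: with the default 7-day retention and the minimum 60-second interval, a single custom metric series holds at most 7 × 86,400 / 60 = 10,080 samples. The constant and function names are hypothetical. Actual storage use also depends on the number of series and on how the backend compresses data, which this sketch does not estimate.

```python
SECONDS_PER_DAY = 86_400
DEFAULT_RETENTION_DAYS = 7   # default retention period stated above
MIN_INTERVAL_SECONDS = 60    # minimum custom-metric collection interval stated above


def validate_interval(interval_seconds: int) -> int:
    """Reject collection intervals below the documented 60-second minimum."""
    if interval_seconds < MIN_INTERVAL_SECONDS:
        raise ValueError(
            f"collection interval must be at least {MIN_INTERVAL_SECONDS} seconds"
        )
    return interval_seconds


def samples_per_series(interval_seconds: int = MIN_INTERVAL_SECONDS,
                       retention_days: int = DEFAULT_RETENTION_DAYS) -> int:
    """Upper bound on retained samples for one series over the retention window."""
    validate_interval(interval_seconds)
    return retention_days * SECONDS_PER_DAY // interval_seconds


print(samples_per_series())                       # 7 days at 60 s
print(samples_per_series(interval_seconds=300))   # 7 days at 5 min
```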
