The Monitoring Module provides operational capabilities such as metrics, dashboards, alerts, and notifications for platform administrators and operations personnel.
The platform integrates open-source components such as Prometheus or VictoriaMetrics together with monitoring dashboards, enabling real-time monitoring of the clusters, nodes, components, custom applications, Pods, containers, and other resources that it manages.
It supports quick setup of metric alerts at the cluster, node, and computing component levels, log alerts (for computing components only), and event alerts. You can also define custom metric algorithms to add the alert metrics and rules that your requirements call for. Notification strategies can be configured to deliver alert information promptly to operations personnel, helping them prevent system failures or resolve issues quickly, which reduces operating costs and supports system stability.
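As a rough illustration of what a custom alert rule might evaluate, the sketch below models a threshold rule that fires only when a metric stays above its limit for several consecutive samples. The `ThresholdRule` class, the `is_firing` function, and the rule name and values are hypothetical; they are not part of the platform's API and only show the general idea.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ThresholdRule:
    """Hypothetical alert rule: fire when the metric stays above `threshold`
    for at least `for_samples` consecutive samples."""
    name: str
    threshold: float
    for_samples: int


def is_firing(rule: ThresholdRule, samples: Sequence[float]) -> bool:
    """Return True if the most recent `for_samples` samples all exceed the threshold."""
    if rule.for_samples <= 0 or len(samples) < rule.for_samples:
        return False
    recent = samples[-rule.for_samples:]
    return all(value > rule.threshold for value in recent)


# Illustrative values only: CPU utilization as a fraction between 0 and 1.
cpu_rule = ThresholdRule(name="node-cpu-high", threshold=0.9, for_samples=3)
sustained = is_firing(cpu_rule, [0.5, 0.95, 0.92, 0.97])   # last three samples exceed 0.9
interrupted = is_firing(cpu_rule, [0.95, 0.92, 0.5, 0.97])  # a dip breaks the streak
print(sustained, interrupted)
```

Requiring several consecutive samples, rather than a single one, is a common way to keep short spikes from triggering notifications.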
The Monitoring Module has the following core advantages:
Comprehensive Monitoring Coverage
Supports extensive monitoring across multiple levels such as clusters, nodes, components, and containers, achieving an end-to-end monitoring chain from infrastructure to applications.
Flexible Alert Configuration
Offers a rich set of preset alert rules, while also supporting custom alert rules and algorithms to meet different monitoring scenarios.
Diverse Visualization Displays
Integrates professional monitoring dashboards that support multiple data visualization methods, providing an intuitive representation of system operational status.
Efficient Alert Notifications
Supports multi-channel alert notifications, including email, SMS, webhook, etc., ensuring timely delivery of alert information.
Scalable Monitoring Architecture
Based on the industry-leading Prometheus / VictoriaMetrics technology stack, it possesses excellent scalability and compatibility.
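The multi-channel notification idea listed above can be pictured as a simple fan-out: one alert is handed to every configured channel, and a failure in one channel does not block the others. This is only a sketch under stated assumptions. The `notify_all` function and the `fake_*` senders are hypothetical stand-ins, not the platform's notification interface.

```python
from typing import Callable, Dict, List

Alert = Dict[str, str]
Channel = Callable[[Alert], None]


def notify_all(alert: Alert, channels: Dict[str, Channel]) -> List[str]:
    """Send `alert` to every channel and return the names of channels that failed."""
    failed: List[str] = []
    for name, send in channels.items():
        try:
            send(alert)
        except Exception:
            # One broken channel must not stop delivery to the others.
            failed.append(name)
    return failed


delivered: List[str] = []


def fake_email(alert: Alert) -> None:
    delivered.append(f"email: {alert['summary']}")


def fake_sms(alert: Alert) -> None:
    # Simulates a channel whose backing service was never configured.
    raise ConnectionError("SMS gateway not configured")


def fake_webhook(alert: Alert) -> None:
    delivered.append(f"webhook: {alert['summary']}")


alert = {"severity": "critical", "summary": "node-1 CPU above 90%"}
failed = notify_all(alert, {"email": fake_email, "sms": fake_sms, "webhook": fake_webhook})
print(delivered)
print(failed)
```

Returning the failed channel names lets the caller log or retry them without losing the deliveries that succeeded.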
The Monitoring Module is applicable in the following scenarios:
Cluster Health Monitoring
Real-time monitoring of resource usage, node status, and component operation conditions within the cluster to promptly detect potential issues.
Application Performance Analysis
Monitoring running metrics of applications and resource usage of containers to optimize application performance.
Fault Early Warning and Diagnosis
Setting appropriate alert rules allows system anomalies to be detected early, facilitating rapid problem identification and resolution.
Capacity Planning
Conducting trend analysis based on historical monitoring data to provide a basis for resource expansion and optimization.
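To make the capacity planning scenario concrete, the sketch below fits a straight line to daily usage figures and estimates how many days remain before a capacity limit is reached. It assumes the usage values have already been exported from the monitoring data, one value per day. The function names and the numbers are hypothetical, and a linear trend is only one simple choice among many forecasting methods.

```python
from typing import Optional, Sequence, Tuple


def linear_fit(ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of y = slope * x + intercept for x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two data points")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_full(daily_usage: Sequence[float], capacity: float) -> Optional[float]:
    """Estimate days from the last sample until the fitted trend reaches `capacity`.

    Returns None when usage is flat or shrinking, because the trend never
    reaches the limit.
    """
    slope, intercept = linear_fit(daily_usage)
    if slope <= 0:
        return None
    last_day = len(daily_usage) - 1
    day_full = (capacity - intercept) / slope
    return max(0.0, day_full - last_day)


# Illustrative data: disk usage in GiB over four days, with a 200 GiB limit.
remaining = days_until_full([100, 110, 120, 130], capacity=200)
print(remaining)
```

Because the default retention period is limited, a trend of this kind can only look back as far as the data that is still stored.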
When using the Monitoring Module, please note the following limitations:
The storage duration of monitoring data depends on the storage capacity configuration, with a default retention period of 7 days.
Prometheus and VictoriaMetrics cannot be installed simultaneously in the same cluster; choose one of them when planning the installation.
The minimum supported collection interval for custom monitoring metrics is 60 seconds.
Alert notification channels need to have the corresponding services pre-configured (such as email servers, SMS gateways, WeChat/DingTalk bots, etc.).
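The limitations above can be turned into a small pre-flight check when planning a deployment. The sketch below is a hypothetical helper, not a platform tool: the `MonitoringPlan` class and its field names are assumptions made for illustration, while the 60-second minimum, the 7-day default, and the one-backend-per-cluster rule come from the list above.

```python
from dataclasses import dataclass
from typing import List

MIN_INTERVAL_SECONDS = 60      # minimum custom-metric collection interval
DEFAULT_RETENTION_DAYS = 7     # default retention period
BACKENDS = {"prometheus", "victoriametrics"}


@dataclass
class MonitoringPlan:
    """Hypothetical description of a planned monitoring setup for one cluster."""
    backends: List[str]
    custom_interval_seconds: int = MIN_INTERVAL_SECONDS
    retention_days: int = DEFAULT_RETENTION_DAYS


def validate(plan: MonitoringPlan) -> List[str]:
    """Return a list of human-readable problems; an empty list means the plan passes."""
    problems: List[str] = []
    chosen = {name.lower() for name in plan.backends} & BACKENDS
    if len(chosen) != 1:
        problems.append("choose exactly one of Prometheus or VictoriaMetrics per cluster")
    if plan.custom_interval_seconds < MIN_INTERVAL_SECONDS:
        problems.append(f"custom metric interval must be at least {MIN_INTERVAL_SECONDS} seconds")
    return problems


def samples_per_series(plan: MonitoringPlan) -> int:
    """Number of samples one custom metric series accumulates over the retention period."""
    if plan.custom_interval_seconds <= 0:
        raise ValueError("interval must be positive")
    return plan.retention_days * 86400 // plan.custom_interval_seconds


ok_plan = MonitoringPlan(backends=["victoriametrics"])
bad_plan = MonitoringPlan(backends=["prometheus", "victoriametrics"], custom_interval_seconds=30)
print(validate(ok_plan))
print(validate(bad_plan))
print(samples_per_series(ok_plan))
```

With the defaults, each custom series holds 10080 samples (7 days at one sample per 60 seconds), which gives a starting point for estimating the storage capacity that the retention period depends on.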