Probes
The platform provides Probe capabilities (black-box monitoring) based on the Blackbox Exporter, enabling network service checks through protocols such as ICMP, TCP, and HTTP. Unlike white-box monitoring, which relies on internal system metrics, Probe evaluates services externally from the user’s perspective, rapidly identifying failures that impact user experience.
For example, if a business interface returns errors (e.g. HTTP 5xx) or a critical service becomes unavailable, Probe detects the anomaly on its next check and generates an alert, giving operations teams a clear starting point for troubleshooting.
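The idea of checking a service from the outside can be illustrated with a minimal Python sketch. This is not the Blackbox Exporter's implementation; the names `ProbeResult`, `tcp_probe`, and `http_status_ok` are hypothetical and exist only for this example. It assumes that a TCP probe succeeds when a connection can be opened within a timeout, and that an HTTP probe treats any 2xx status as healthy.

```python
import socket
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Outcome of one black-box check (hypothetical structure)."""
    target: str
    success: bool
    detail: str


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> ProbeResult:
    """Succeed if a TCP connection to host:port can be opened within timeout."""
    target = f"{host}:{port}"
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return ProbeResult(target, True, "connection established")
    except OSError as exc:
        # Covers refused connections, timeouts, unreachable hosts, DNS failures.
        return ProbeResult(target, False, f"connect failed: {exc}")


def http_status_ok(status_code: int) -> bool:
    """Assumed policy: 2xx is healthy, anything else (such as 5xx) fails the probe."""
    return 200 <= status_code < 300
```

A caller would run such checks on a fixed interval and raise an alert when `success` is false for a target, which is the behavior the paragraph above describes at the platform level.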
Monitoring Dashboard
The platform includes a dashboard management feature whose visual configuration is more user-friendly than traditional Grafana. It aggregates monitoring metrics into a unified view, helping users quickly build the dashboards they need.
Alert Strategies
The platform provides comprehensive alerting capabilities, supporting the configuration of alert rules based on metrics, logs, and events. With a rich set of built-in monitoring metrics and alert templates, users can rapidly configure alert strategies that align with business needs, enabling timely detection and resolution of issues.
Alert Templates
Alert templates standardize and encapsulate alert rules and notification strategies, supporting rapid reuse across multiple monitoring targets. Template-based configuration significantly reduces the management costs of alert strategies and enhances operational efficiency.
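To make the reuse idea concrete, here is a small Python sketch of how one template might be expanded into per-target rules. The classes `AlertTemplate` and `AlertRule`, the function `apply_template`, and all field names are hypothetical illustrations, not the platform's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertTemplate:
    """Reusable bundle of rule settings and notification targets (hypothetical)."""
    name: str
    metric: str
    threshold: float
    level: str
    notify: tuple[str, ...]


@dataclass(frozen=True)
class AlertRule:
    """A template bound to one concrete monitoring target (hypothetical)."""
    target: str
    metric: str
    threshold: float
    level: str
    notify: tuple[str, ...]


def apply_template(template: AlertTemplate, targets: list[str]) -> list[AlertRule]:
    """Create one rule per target, copying every setting from the template."""
    return [
        AlertRule(t, template.metric, template.threshold, template.level, template.notify)
        for t in targets
    ]
```

Because every rule copies its settings from the template, changing a threshold means editing one template and reapplying it, rather than editing each rule by hand.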
Alert History
The system fully records the lifecycle of alerts, including trigger time, recovery time, alert status, alert level, and alert content. Users can trace and analyze issues through alert history, continuously optimizing alert configurations.
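As an illustration of how such records could feed back into tuning, the Python sketch below models one history entry and flags alerts that recovered very quickly, which are often candidates for a longer evaluation window. The `AlertRecord` class, its field names, and the `short_lived` helper are assumptions made for this example, not the platform's schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class AlertRecord:
    """One alert history entry (hypothetical field names)."""
    content: str
    level: str
    triggered_at: datetime
    recovered_at: Optional[datetime] = None

    @property
    def status(self) -> str:
        # An alert without a recovery time is still firing.
        return "recovered" if self.recovered_at is not None else "firing"

    def duration(self, now: datetime) -> timedelta:
        """Time from trigger to recovery, or to `now` if still firing."""
        end = self.recovered_at if self.recovered_at is not None else now
        return end - self.triggered_at


def short_lived(history: list[AlertRecord], limit: timedelta) -> list[AlertRecord]:
    """Recovered alerts that lasted less than `limit` (possible noise)."""
    return [
        r for r in history
        if r.recovered_at is not None and r.recovered_at - r.triggered_at < limit
    ]
```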
Notifications
The platform supports multiple alert notification channels, including email, DingTalk, WeChat Work, Feishu, and Webhook, ensuring that alert information reaches the relevant personnel promptly. Users can flexibly configure notification strategies based on actual needs.
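One way to picture a notification strategy is as a mapping from alert level to channels, plus a request builder for the Webhook channel. The Python sketch below is illustrative only: the `ROUTES` table, the channel identifiers, and the JSON payload shape are assumptions, not the platform's configuration format. The request is built but not sent, so the caller decides when to deliver it.

```python
import json
import urllib.request

# Hypothetical routing policy: which channels receive alerts of each level.
ROUTES = {
    "critical": ["email", "dingtalk", "webhook"],
    "warning": ["email"],
}
DEFAULT_CHANNELS = ["email"]


def channels_for(level: str) -> list[str]:
    """Return the channels for an alert level, falling back to the default."""
    return ROUTES.get(level, DEFAULT_CHANNELS)


def build_webhook_request(url: str, alert: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request carrying the alert."""
    body = json.dumps(alert).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` would deliver it; keeping construction separate from sending makes the payload easy to inspect and test.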
Tracing
The distributed tracing feature provides full-link tracing for microservice architectures. By collecting metadata about inter-service calls, it helps users quickly locate issues in cross-service calls.
Logs
The platform automatically collects and centrally manages standard output and file logs from clusters, nodes, and containers. It provides powerful log storage, retrieval, and analysis capabilities, supporting multi-dimensional log queries and visual displays, helping users quickly pinpoint issues.
Events
The platform collects critical event information from Kubernetes clusters in real time, recording the complete process of resource state changes. When exceptions occur in clusters, nodes, Pods, or other resources, users can trace the related events to pinpoint root causes and resolve issues faster.
Inspection
Drawing on extensive enterprise-level operational experience, the platform offers automated inspection capabilities. Through multi-dimensional health checks, it helps users monitor the operational status of resources in real time, detect potential risks early, and reduce manual inspection costs.
Platform Health Status
The platform provides an intuitive overview of its own functional health, including deployment status and the operational status of each component. Users with platform management permissions can drill into detailed health check data to quickly locate and resolve platform-level issues.