Distributed storage provides out-of-the-box monitoring metrics collection and alert notification capabilities. Once the monitoring and alerting features are enabled, you can monitor and alert on aspects such as the storage cluster, storage performance, and storage components, with support for configuring notification strategies.
The monitoring data is presented visually and can support decisions during operation and maintenance inspections or performance tuning. The alert and notification mechanism helps keep the storage system running stably.
Tip: If the monitoring and alerting features were not enabled when creating the distributed storage, you will need to find alternative solutions for storage monitoring and alerting. For example, manually configure monitoring dashboards and alert strategies in the operation and maintenance center.
The platform automatically collects common monitoring metrics for distributed storage, such as read and write performance and CPU and memory usage. You can view real-time data for these metrics on the Monitoring tab under Storage Management > Distributed Storage.
Monitor the health status of the storage, physical capacity usage, and the number of active OSD/MON components. In the event of abnormal storage status, you can check the reason for the alert.
Monitor read and write bandwidth and read and write IOPS from three dimensions: cluster, storage pool, and OSD. Additionally, you can monitor read and write latency specifically for OSD.
Monitor CPU usage and memory usage of components such as MON and OSD.
The platform has a set of default alert strategies enabled. Once a resource becomes abnormal or monitoring data reaches the warning state, alerts will be automatically triggered. The preset strategies are sufficient for common operational needs such as component and cluster status alerts, device capacity alerts, and user data alerts.
To receive alerts in a timely manner, it is recommended that you set up notification strategies in the operation and maintenance center. These send alert information via email, SMS, and other means to the relevant personnel, reminding them to take the necessary measures to resolve issues or prevent failures. Click Alert Configuration to switch to the operation and maintenance center and complete the setup. For details, refer to Create Alert Strategies.
If the storage cluster is in the Warning state, an alert has been triggered and the related anomaly may lead to a failure. Promptly check the details in Real-time Alerts, then identify and troubleshoot the fault based on the cause.
If the storage cluster is in the Failure state, the storage cluster is unable to operate normally. Locate the issue immediately and carry out troubleshooting.
The table below indicates the meanings of the alert levels used by the preset strategies, which can serve as a reference for you when establishing alert handling principles.
Alert Level | Meaning |
---|---|
Disaster | The resource corresponding to the alert rule has failed, causing platform service interruption, data loss, and significant impact. |
Severe | The resource corresponding to the alert rule has known issues, which may lead to platform function failures and affect the normal operation of services. |
Warning | The resource corresponding to the alert rule faces operational risks, which could affect the normal operation of services if not dealt with promptly. |
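One way to turn these levels into handling principles is to rank them and attach an agreed action to each. The sketch below is illustrative only: the level names come from the table above, while the `HANDLING` actions, the `triage` helper, and the sample alert names are hypothetical and not part of the platform.

```python
from enum import IntEnum


class AlertLevel(IntEnum):
    """Alert levels from the table above; a higher value means more urgent."""
    WARNING = 1
    SEVERE = 2
    DISASTER = 3


# Hypothetical handling principles; replace with your team's own policy.
HANDLING = {
    AlertLevel.DISASTER: "page on-call immediately",
    AlertLevel.SEVERE: "handle within the current shift",
    AlertLevel.WARNING: "schedule a check before it escalates",
}


def triage(alerts):
    """Return (name, action) pairs ordered from most to least urgent."""
    ordered = sorted(alerts, key=lambda alert: alert[1], reverse=True)
    return [(name, HANDLING[level]) for name, level in ordered]


# Sample alerts with made-up names, given as (name, level) pairs.
sample = [
    ("pool-capacity", AlertLevel.WARNING),
    ("cluster-health", AlertLevel.DISASTER),
    ("osd-down", AlertLevel.SEVERE),
]

for name, action in triage(sample):
    print(name, "->", action)
```

Using an ordered enumeration keeps the ranking in one place, so adding or renaming a level does not require changes to the sorting logic.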
The Alert History records all alerts that have been triggered and no longer require action. When you use the alert history for a fault review, answering the following questions helps you draw useful lessons from the incident:
What were the specific abnormal conditions at the time of the incident?
Is there a pattern to an alert that appears repeatedly in the alert list? Can it be prevented before it occurs next time?
Does the timeline show a surge in alerts during a certain period? Was it caused by force majeure or an operational accident? Is there a need to adjust the operational plan?
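The second and third questions can be approached with a simple count over the alert records. The sketch below assumes the history is available as a list of `(timestamp, alert name)` pairs. That record format, the alert names, the thresholds, and both helper functions are assumptions made for illustration; this document does not describe how the platform exports alert history.

```python
from collections import Counter
from datetime import datetime


def repeated_alerts(history, min_count=2):
    """Return (name, count) for alerts seen at least min_count times, most frequent first."""
    counts = Counter(name for _, name in history)
    return [(name, n) for name, n in counts.most_common() if n >= min_count]


def surge_hours(history, threshold=3):
    """Return the hours (timestamps truncated to the hour) with at least `threshold` alerts."""
    per_hour = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts, _ in history
    )
    return sorted(hour for hour, n in per_hour.items() if n >= threshold)


# Made-up sample records standing in for an exported alert history.
history = [
    (datetime(2024, 1, 5, 9, 10), "osd-latency"),
    (datetime(2024, 1, 5, 14, 2), "osd-latency"),
    (datetime(2024, 1, 5, 14, 5), "pool-capacity"),
    (datetime(2024, 1, 5, 14, 40), "osd-latency"),
    (datetime(2024, 1, 6, 3, 15), "mon-down"),
]

print(repeated_alerts(history))
print(surge_hours(history))
```

The counts only show where to look. Whether a surge came from an external event or an operational mistake still has to be judged against the change and maintenance records for that period.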