Monitoring and Alerting

Local storage collects monitoring metrics and raises alerts out of the box. Once the platform monitoring component is enabled, monitoring and alerts can be configured for storage clusters, storage performance, and storage capacity, and notification policies can be configured for those alerts.

Clearly presented monitoring data supports decision making during operational inspections and performance tuning, and the alerting mechanism helps keep the storage system running stably.

Monitoring

Performance Monitoring

By default, the platform collects commonly used performance monitoring metrics such as read and write bandwidth, IOPS, and latency for local storage. Real-time monitoring data for these metrics can be viewed on the Monitoring tab of the Local Storage page under Storage Management. The platform displays these metrics visually through graphs and charts, allowing administrators to clearly observe current storage performance and quickly identify potential issues.

Capacity Monitoring

Local storage can only use storage resources that are locally available on nodes. Before declaring local storage, users must therefore confirm that the nodes have sufficient available capacity, to avoid issues caused by over-declaring.

To assist with this, the platform provides detailed capacity monitoring in the Details section of local storage, broken down by device type. Available storage space is displayed in both numerical and graphical form. If any device type shows insufficient available capacity, free up space or add disk devices before using local storage.
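The capacity check described above can be sketched in a few lines. This is a minimal illustration, not a platform API: the `DeviceTypeCapacity` class, the `can_declare` helper, the `"ssd"` label, and the 10% headroom are all assumptions made for the example, and the figures are entered by hand as they might be read from the Details section.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class DeviceTypeCapacity:
    """Capacity figures for one device type (hypothetical structure)."""
    device_type: str   # illustrative label such as "ssd" or "hdd"
    total_bytes: int
    used_bytes: int

    @property
    def available_bytes(self) -> int:
        # Never report a negative amount of free space.
        return max(self.total_bytes - self.used_bytes, 0)


def can_declare(capacity: DeviceTypeCapacity, requested_bytes: int,
                headroom_percent: int = 10) -> bool:
    """Return True if the request fits while keeping some capacity in reserve.

    headroom_percent is an illustrative safety margin, not a platform setting.
    """
    reserve = capacity.total_bytes * headroom_percent // 100
    return requested_bytes <= capacity.available_bytes - reserve


# Example figures: 500 GiB total, 420 GiB used, so 80 GiB available.
ssd = DeviceTypeCapacity("ssd", total_bytes=500 * GIB, used_bytes=420 * GIB)

print(can_declare(ssd, 20 * GIB))  # fits: 20 GiB <= 80 GiB - 50 GiB reserve
print(can_declare(ssd, 40 * GIB))  # rejected: would eat into the reserve
```

Keeping a reserve is a design choice for the sketch; setting `headroom_percent=0` reduces the check to a plain comparison against available capacity.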

Alerts

The platform includes a set of default alerting policies. If resources become abnormal or monitoring data reaches a warning threshold, alerts are triggered automatically. The preconfigured policies cover common operational needs, including alerts for cluster health status and device type capacity.

Configuring Notifications

To ensure alerts are received in a timely manner, configure notification policies in the operations center, where the notification policy settings are directly accessible. Notifications can be sent to the relevant personnel by email, SMS, or other methods, prompting them to resolve issues or prevent failures. Detailed instructions on configuring alerts can be found in the [Creating Alert Policies] documentation.

Handling Alerts

  • If the health status of the storage cluster changes to Alert, administrators should investigate immediately. The Details section provides information for troubleshooting and resolving these issues. Common causes include abnormal node services or problems with specific device types.

    Inspection Item    | Corresponding Status | Cause
    Health Status      | Alert                | Caused by abnormal node services or device type issues.
    Service Status     | Unknown              | Node is in a NotReady state, possibly due to network failures or power outages.
    Device Type Status | Unavailable          | The disk in use may not be a raw disk, or it might be missing.
  • Real-time alerts shown on the Alert tab require prompt attention, even if the storage cluster status currently appears Healthy. Responding quickly helps prevent escalation into more serious issues. The following table outlines alert levels and their implications:

    Alert Level | Meaning
    Critical    | Indicates significant issues causing platform service interruptions or data loss, with severe impacts.
    Major       | Known issues potentially affecting platform functionality and normal business operations.
    Warning     | Risk of operational issues exists; timely intervention needed to avoid impact on normal business operations.
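When several alerts are active at once, the alert levels and inspection items above suggest a simple triage order. The following is a minimal sketch under stated assumptions: the alert records, their field names (`name`, `level`, `triggered_at`), and the alert names are hypothetical and do not reflect any export format of the platform. Only the three levels and the status-to-cause mapping come from the tables above.

```python
from enum import IntEnum


class AlertLevel(IntEnum):
    """Alert levels from the table above; a lower value is more urgent."""
    CRITICAL = 0
    MAJOR = 1
    WARNING = 2


# Likely causes keyed by (inspection item, status), taken from the table above.
LIKELY_CAUSE = {
    ("Health Status", "Alert"): "abnormal node services or device type issues",
    ("Service Status", "Unknown"): "node is NotReady (network failure or power outage)",
    ("Device Type Status", "Unavailable"): "disk is not a raw disk, or is missing",
}


def triage(alerts):
    """Return alerts ordered by level, then by trigger time (oldest first)."""
    return sorted(
        alerts,
        key=lambda a: (AlertLevel[a["level"].upper()], a["triggered_at"]),
    )


# Hypothetical alert records; names and fields are assumptions for illustration.
alerts = [
    {"name": "capacity-low", "level": "Warning", "triggered_at": "2024-05-01T09:00:00"},
    {"name": "cluster-unhealthy", "level": "Critical", "triggered_at": "2024-05-01T09:05:00"},
    {"name": "latency-high", "level": "Major", "triggered_at": "2024-05-01T08:55:00"},
]

for alert in triage(alerts):
    print(alert["level"], alert["name"])

print(LIKELY_CAUSE[("Service Status", "Unknown")])
```

Sorting on the ISO 8601 timestamp string works here only because every record uses the same format; parsed timestamps would be the safer choice for mixed input.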

Post-Mortem Analysis

The Alert History records previously triggered alerts that no longer require immediate action. During post-mortem analysis, consider the following:

  • What specific abnormalities were observed at the time of the incident?
  • Do specific alerts recur in a pattern? How can they be proactively prevented in the future?
  • Was there a surge in alerts during specific periods linked to external factors or operational incidents? Should operational strategies be adjusted accordingly?
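The last two questions lend themselves to a quick script over alert history data. The sketch below assumes the history is available as a list of records with `name` and `triggered_at` fields; that shape, the alert names, and the thresholds are hypothetical choices for illustration, not a documented export format.

```python
from collections import Counter
from datetime import datetime


def recurring_alerts(history, min_count=3):
    """Return {alert name: count} for alerts seen at least min_count times."""
    counts = Counter(entry["name"] for entry in history)
    return {name: n for name, n in counts.items() if n >= min_count}


def surge_hours(history, threshold=3):
    """Return the hours (as datetimes) in which at least threshold alerts fired."""
    per_hour = Counter(
        datetime.fromisoformat(entry["triggered_at"]).replace(
            minute=0, second=0, microsecond=0
        )
        for entry in history
    )
    return sorted(hour for hour, n in per_hour.items() if n >= threshold)


# Hypothetical alert history; names and timestamps are made up for the example.
history = [
    {"name": "capacity-low", "triggered_at": "2024-05-01T09:05:00"},
    {"name": "capacity-low", "triggered_at": "2024-05-01T09:20:00"},
    {"name": "latency-high", "triggered_at": "2024-05-01T09:30:00"},
    {"name": "capacity-low", "triggered_at": "2024-05-01T09:40:00"},
    {"name": "capacity-low", "triggered_at": "2024-05-02T14:10:00"},
]

print(recurring_alerts(history))  # alerts that repeat often enough to investigate
print(surge_hours(history))       # hours with an unusual number of alerts
```

Hours flagged by `surge_hours` can then be compared against maintenance windows or known incidents to decide whether operational strategies need adjusting.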