Monitoring & Alerts

The object storage system includes built-in monitoring and alerting for storage clusters, service health, and resource utilization. It also supports configurable notification policies to keep your operations team informed. Real-time monitoring data supports performance tuning and operational decisions, while automated alerts help you catch problems early and keep the storage system stable and reliable.

Monitoring

By default, the platform collects key metrics on storage clusters and service status. You can access real-time monitoring data under Storage Management > Object Storage > Monitoring.

Storage Overview

This section provides a high-level view of storage system health, service status, and raw capacity utilization. If the storage status is abnormal, alert details will indicate the root cause, helping you diagnose and resolve issues efficiently.

Cluster Monitoring

Track raw capacity usage and I/O performance trends across your storage cluster. This helps identify storage bottlenecks, optimize resource allocation, and ensure smooth data operations.

Object Monitoring

Monitor access patterns, including total request counts and failed requests. These insights help analyze storage workload and detect anomalies that may indicate service disruptions or security risks.
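
As a rough illustration of how the two counters above can be combined, the following Python sketch computes a failed-request ratio for one time window and flags windows that exceed a limit. The class, the function names, and the 5% limit are assumptions made for this example; they are not platform APIs or default settings.

```python
from dataclasses import dataclass


@dataclass
class RequestSample:
    """Hypothetical snapshot of the two request counters for one time window."""
    total_requests: int
    failed_requests: int


def failure_ratio(sample: RequestSample) -> float:
    """Return failed/total as a fraction; 0.0 when there was no traffic."""
    if sample.total_requests <= 0:
        return 0.0
    return sample.failed_requests / sample.total_requests


def is_anomalous(sample: RequestSample, limit: float = 0.05) -> bool:
    """Flag a window whose failure ratio is strictly above the assumed limit."""
    return failure_ratio(sample) > limit
```

A ratio is usually more informative than the raw failure count, because the same number of failed requests means something different at low and high traffic.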

Alerts

The platform comes with pre-configured alerting policies to detect anomalies and trigger notifications when predefined thresholds are reached. These built-in rules cover essential areas such as component health, capacity usage, and user data integrity.
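
To make the idea of a predefined threshold concrete, here is a minimal Python sketch of a capacity rule. The threshold percentages and the function name are hypothetical and chosen only for illustration; the platform's built-in rules and their actual values may differ. The severity names match the levels described under Handling Alerts.

```python
from typing import Optional

# Hypothetical thresholds (percent of raw capacity used), highest first.
CAPACITY_THRESHOLDS = [
    (95.0, "Critical"),
    (85.0, "Major"),
    (75.0, "Warning"),
]


def capacity_severity(used_bytes: int, total_bytes: int) -> Optional[str]:
    """Return the severity of the highest threshold reached, or None if below all."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    used_pct = 100.0 * used_bytes / total_bytes
    for threshold, severity in CAPACITY_THRESHOLDS:
        if used_pct >= threshold:
            return severity
    return None
```

Checking thresholds from highest to lowest means a single reading maps to exactly one severity, so a cluster at 96% usage raises one Critical alert rather than three overlapping ones.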

Configuring Notifications

To ensure timely responses, configure notification policies in the Operations Center. Alerts can be sent via email, SMS, or other channels to notify the right personnel. Fine-tune your settings to match your organization’s incident response workflow.

Handling Alerts

  • Cluster in "Alert" state: A warning has been triggered, and system stability may be at risk. Check the Live Alerts section for details, identify the root cause, and take corrective actions.
  • Cluster in "Failure" state: The storage cluster is no longer operational. Immediate intervention is required to restore service availability.

The platform classifies alerts into three severity levels to help teams prioritize incident response:

Severity    Description
Critical    A system failure impacting business operations or causing data loss. Immediate action required.
Major       A known issue that may lead to functionality breakdowns, potentially disrupting business processes.
Warning     A potential risk that, if unaddressed, could impact performance or availability.
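
One way to apply these levels during triage is to sort open alerts so the most severe come first. The sketch below assumes each alert is a plain dictionary with a "severity" key; that record shape and the function name are hypothetical, not the platform's alert format.

```python
# Lower number means higher priority; the order follows the severity levels above.
SEVERITY_RANK = {"Critical": 0, "Major": 1, "Warning": 2}


def triage(alerts: list[dict]) -> list[dict]:
    """Return alerts ordered by severity, keeping input order within a level."""
    unknown_rank = len(SEVERITY_RANK)  # unrecognized severities sort last
    return sorted(
        alerts,
        key=lambda alert: SEVERITY_RANK.get(alert["severity"], unknown_rank),
    )
```

Because Python's sort is stable, alerts of the same severity keep their original (for example, chronological) order.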

Post-Incident Analysis

The Alert History logs all past incidents, providing valuable data for post-mortem analysis and system improvements. When reviewing past alerts, consider the following:

  1. What were the exact symptoms when the incident occurred?
  2. Are certain alerts repeating over time? Can proactive measures be taken to prevent recurrence?
  3. Did a specific time window show a spike in alerts? Was it caused by an operational issue or an external factor? Should the response strategy be adjusted?
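
Questions 2 and 3 above lend themselves to a small script. The following Python sketch assumes the alert history has been exported as a list of (timestamp, alert name) pairs; that export format and both function names are assumptions for illustration, not a documented platform feature.

```python
from collections import Counter
from datetime import datetime


def recurring_alerts(history, min_count=2):
    """Return {alert name: occurrences} for alerts seen at least min_count times."""
    counts = Counter(name for _, name in history)
    return {name: n for name, n in counts.items() if n >= min_count}


def busiest_hour(history):
    """Return (hour start, alert count) for the hour with the most alerts, or None."""
    if not history:
        return None
    per_hour = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts, _ in history
    )
    return per_hour.most_common(1)[0]
```

Here `history` is a list of `(datetime, str)` tuples. Alerts that recur point to candidates for preventive work, and the busiest hour is a starting point for checking what changed in that window.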

By regularly analyzing alert patterns and refining monitoring strategies, teams can improve system resilience, reduce downtime, and keep storage operations running smoothly.