Click on Operation Center > Inspection > Basic Inspection in the left navigation bar.
Tip: The inspection page displays data from the most recent inspection. While an inspection is in progress, you can view the resource data for the parts of the inspection that have already completed in real time.
On the Basic Inspection page, the following actions are supported:
Execute Inspection: Click the Inspection button in the upper right corner of the page to perform an inspection on the platform.
Download Inspection Report: Click the Download Report button in the upper right corner of the page, select the report format (PDF, Excel, or both) in the pop-up dialog, and click to download. The report in the selected format is downloaded to your local machine.
The PDF inspection report does not include the data from the resource risk details pages.
The Excel inspection report contains all data from the inspection.
Both report formats can be downloaded at the same time.
Inspection Configuration | Description |
---|---|
Scheduled Inspection | The schedule on which the inspection runs automatically, entered as a Crontab expression. Tip: Click the input box to expand the platform's preset Trigger Rule Templates, select a suitable template, and adjust it to set the trigger rule quickly. |
Inspection Record Retention | The number of inspection records to retain. |
Email Notification | Select email notification contacts. Note: Notification contacts must have email configured. |
Inspection Report Name | The name that will be used by the platform's built-in inspection notification template to notify contacts. |
Inspection Configuration Items | Modify the warning thresholds of the platform's default inspection items for certificates, cluster hosts, and pods, or disable individual inspection items. |
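As an illustration of the Scheduled Inspection setting, the following minimal sketch splits a Crontab expression into named fields. It assumes the standard five-field layout (minute, hour, day of month, month, day of week); the platform's exact accepted syntax is not stated here, and `split_cron` and the sample schedule are hypothetical.

```python
# Field order of a standard five-field crontab expression (assumed format).
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def split_cron(expression: str) -> dict[str, str]:
    """Map each field of a five-field crontab expression to its name."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(
            f"expected {len(CRON_FIELDS)} fields, got {len(parts)}"
        )
    return dict(zip(CRON_FIELDS, parts))


# Hypothetical schedule: run the inspection every day at 02:00.
daily = split_cron("0 2 * * *")
print(daily)
```

A helper like this only checks the number of fields; it does not validate value ranges or step syntax.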
In the Most Recent Inspection information area, you can view relevant information from the most recent inspection:
Inspection Time: The start and end time of the most recent inspection.
Total Number of Inspection Resources: The total number of resources (clusters, nodes, pods, certificates) inspected in the most recent inspection.
Risks: The number of resources at risk, including those classified as Fault and Warning.
On the Resource Risk Inspection page, you can view an overview of risk information for global clusters, self-built clusters, accessed clusters, and all nodes, pods, and certificates under these clusters.
Click the Risk Details button on the card of the corresponding resource type (Cluster, Node, Pod, Certificate) to enter the risk details page for that resource type. On the details page, you can view the most recent inspection information for the resource type, as well as the list of resources with faults and warnings.
Click on the resource name to jump to the resource details page.
Click the expand button on the right side of the Name field in the list to view the judgment conditions and reasons for faults and warnings.
For explanations of the risk status judgment criteria (Fault, Warning) for each resource, refer to the table below.
Note: Each resource type has multiple fault and warning judgment conditions. When the inspection data of a resource matches any one of these conditions, the resource is counted as risk data.
Resource Type | Inspection Scope | Fault Judgment Conditions | Warning Judgment Conditions |
---|---|---|---|
Cluster | - global cluster - Self-built cluster - Accessed cluster | - Cluster status is Abnormal; - apiserver connection is abnormal. | - After the cluster scale (number of nodes/pods/metrics) increases, the monitoring component resource configuration has not been updated; - After the log data volume and log collection frequency increase, the log component resource configuration has not been updated; - Cluster CPU usage exceeds 60%; - Cluster memory usage exceeds 60%; - Any pod in the ETCD component of the cluster is in a non-Running state; - Any host in the cluster is in a non-Ready state; - The time difference between any two nodes in the cluster exceeds 40 seconds; - The CPU request rate of the cluster (actual request value / total) exceeds 60%; - The memory request rate of the cluster (actual request value / total) exceeds 80%; - Monitoring components are not installed in the cluster; - Monitoring components of the cluster are abnormal; - Any pod in the kube-controller-manager component of the cluster is in a non-Running state; - Any pod in the kube-scheduler component of the cluster is in a non-Running state; - Any pod in the kube-apiserver component of the cluster is in a non-Running state. |
Node | - All control nodes - All compute nodes | - Node status is Abnormal; - The node-exporter component's pod on the node is in a non-Running state; - The kubelet component's pod on the node is in a non-Running state. | - The node has fewer than 1000 free inodes; - Node CPU usage exceeds 60%; - Node memory usage exceeds 60%; - Disk space usage of the node directory exceeds 60%; - Node system load exceeds 200% and runtime exceeds 15 minutes; - At least one NodeDeadlock (node deadlock) event occurred in the past day; - At least one NodeOOM (out of memory) event occurred in the past day; - At least one NodeTaskHung (task hung) event occurred in the past day; - At least one NodeCorruptDockerImage (corrupted Docker image) event occurred in the past day. |
Pod | All pods | - Pod status is Error; - The Pod has been in the starting state for more than 5 minutes. | - Pod CPU usage exceeds 80%; - Pod memory usage exceeds 80%; - The number of restarts of the Pod in the past 5 minutes is greater than or equal to 1. |
Certificate | - Certmanager certificates - Kubernetes certificates | Certificate status is Expired. | The certificate's remaining validity period is less than 29 days. |
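The "match any one condition" rule can be sketched with the Pod row of the table. This is an illustrative sketch only, not the platform's implementation: the `PodInspection` class, its field names, and the `evaluate_pod` and `is_risk` helpers are hypothetical, while the thresholds are the default values listed in the table.

```python
from dataclasses import dataclass


@dataclass
class PodInspection:
    """Hypothetical inspection data for one Pod (field names are assumptions)."""
    status: str                # e.g. "Running" or "Error"
    minutes_starting: float    # time spent in the starting state; 0 once started
    cpu_usage_pct: float
    memory_usage_pct: float
    restarts_last_5_min: int


def evaluate_pod(pod: PodInspection) -> dict[str, list[str]]:
    """Return the fault and warning conditions from the table that the Pod matches."""
    fault_checks = [
        (pod.status == "Error", "Pod status is Error"),
        (pod.minutes_starting > 5, "Pod has been starting for more than 5 minutes"),
    ]
    warning_checks = [
        (pod.cpu_usage_pct > 80, "Pod CPU usage exceeds 80%"),
        (pod.memory_usage_pct > 80, "Pod memory usage exceeds 80%"),
        (pod.restarts_last_5_min >= 1, "Pod restarted in the past 5 minutes"),
    ]
    return {
        "fault": [reason for matched, reason in fault_checks if matched],
        "warning": [reason for matched, reason in warning_checks if matched],
    }


def is_risk(pod: PodInspection) -> bool:
    """A Pod counts as risk data when it matches any single condition."""
    result = evaluate_pod(pod)
    return bool(result["fault"] or result["warning"])


# Hypothetical sample: a running Pod whose CPU usage is above the threshold.
busy = PodInspection("Running", 0, 85.0, 40.0, 0)
print(evaluate_pod(busy), is_risk(busy))
```

Keeping the matched reasons alongside the result mirrors what the expanded row on the risk details page shows: the conditions and reasons behind each fault or warning.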
Click on the Resource Utilization Inspection tab to enter the Resource Utilization Inspection page.
On the Resource Utilization Inspection page, you can view the total amount, usage, and usage rate of CPU, memory, and disk of global clusters, accessed clusters, and self-built clusters, as well as the number of resources such as clusters, nodes, pods, and projects on the platform.
Resource Usage Statistics: You can view the total amount and total usage rate of CPU, memory, and disk of global, accessed, and self-built clusters.
Platform Resource Quantity: You can view the number of resources running on the platform.
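As a worked example of how a usage rate relates to total amount and usage, the sketch below computes a per-cluster and an overall CPU usage rate. The cluster figures are invented for illustration, and summing usage and capacity across clusters before dividing is an assumption about how a total usage rate could be derived, not a description of the platform's calculation.

```python
def usage_rate(used: float, total: float) -> float:
    """Usage as a percentage of the total; 0.0 when the total is not positive."""
    if total <= 0:
        return 0.0
    return used / total * 100


# Hypothetical CPU figures (in cores) for three cluster types.
clusters = [
    {"name": "global", "cpu_used": 12.0, "cpu_total": 32.0},
    {"name": "accessed", "cpu_used": 20.0, "cpu_total": 64.0},
    {"name": "self-built", "cpu_used": 8.0, "cpu_total": 32.0},
]

per_cluster = {c["name"]: usage_rate(c["cpu_used"], c["cpu_total"]) for c in clusters}

# Assumed aggregation: total usage divided by total capacity.
overall = usage_rate(
    sum(c["cpu_used"] for c in clusters),
    sum(c["cpu_total"] for c in clusters),
)
print(per_cluster, overall)
```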