Health Checks

Understanding Health Checks

For full reference material, refer to the official Kubernetes documentation.

In Kubernetes, health checks, also known as probes, are a critical mechanism to ensure the high availability and resilience of your applications. Kubernetes uses these probes to determine the health and readiness of your Pods, allowing the system to take appropriate actions, such as restarting containers or routing traffic. Without proper health checks, Kubernetes cannot reliably manage your application's lifecycle, potentially leading to service degradation or outages.

Kubernetes offers three types of probes:

  • livenessProbe: Detects whether the container is still running. If a liveness probe fails, Kubernetes kills the container and restarts it according to the Pod's restartPolicy.
  • readinessProbe: Detects if the container is ready to serve traffic. If a readiness probe fails, the Endpoint Controller removes the Pod from the Service's Endpoint list until the probe succeeds.
  • startupProbe: Specifically checks if the application has successfully started. Liveness and readiness probes will not execute until the startup probe succeeds. This is very useful for applications with long startup times.

Properly configuring these probes is essential for building robust and self-healing applications on Kubernetes.

Probe Types

Kubernetes supports three mechanisms for implementing probes:

httpGet Action

Executes an HTTP GET request against the Pod's IP address on a specified port and path. The probe is considered successful if the response code is between 200 and 399.

  • Use Cases: Web servers, REST APIs, or any application exposing an HTTP endpoint.
  • Example:
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
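
If the endpoint is served over TLS or expects a specific header, httpGet also accepts a scheme and custom httpHeaders. A minimal variant of the probe above (the port, header name, and value here are illustrative):

livenessProbe:
  httpGet:
    path: /healthz
    port: 8443
    scheme: HTTPS        # The kubelet skips certificate verification for HTTPS probes
    httpHeaders:
    - name: X-Health-Check   # Illustrative custom header
      value: probe
  initialDelaySeconds: 15
  periodSeconds: 20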

exec Action

Executes a specified command inside the container. The probe is successful if the command exits with status code 0.

  • Use Cases: Applications without HTTP endpoints, checking internal application state, or performing complex health checks that require specific tools.

  • Example:

readinessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy
  initialDelaySeconds: 5
  periodSeconds: 5
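
With this pattern, the application (or another process in the container) is expected to create /tmp/healthy once it can serve requests and remove the file when it cannot; the probe then succeeds or fails based on the exit code of cat.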

tcpSocket Action

Attempts to open a TCP socket on the container's IP address and a specified port. The probe is successful if the TCP connection can be established.

  • Use Cases: Databases, message queues, or any application that communicates over a TCP port but might not have an HTTP endpoint.

  • Example:

startupProbe:
  tcpSocket:
    port: 3306
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 30
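
With periodSeconds: 10 and failureThreshold: 30, the container gets roughly 30 * 10 = 300 seconds after the initial delay to start listening on port 3306; the liveness and readiness probes only begin running once this startup probe has succeeded.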

Best Practices

  • Liveness vs. Readiness:
    • Liveness: Use a Liveness Probe when the only sensible way to recover an unresponsive application is to restart it; if the probe fails, Kubernetes restarts the container.
    • Readiness: Use a Readiness Probe if your application can become temporarily unable to serve traffic (e.g., while waiting for a database connection) but is expected to recover without a restart. This prevents traffic from being routed to an instance that is not ready.
  • Startup Probes for Slow Applications: Use Startup Probes for applications that take a significant amount of time to initialize. This prevents premature restarts due to Liveness Probe failures or traffic routing issues due to Readiness Probe failures during startup.
  • Lightweight Probes: Ensure your probe endpoints are lightweight and perform quickly. They should not involve heavy computation or external dependencies (like database calls) that could make the probe itself unreliable.
  • Meaningful Checks: Probe checks should genuinely reflect the health and readiness of your application, not just whether the process is running. For example, for a web server, check if it can serve a basic page, not just if the port is open.
  • Adjust initialDelaySeconds: Set initialDelaySeconds appropriately to give your application enough time to start before the first probe.
  • Tune periodSeconds and failureThreshold: Balance the need for quick detection of failures with avoiding false positives. Too frequent probes or too low a failureThreshold can lead to unnecessary restarts or unready states.
  • Logs for Debugging: Ensure your application logs clear messages related to health check endpoint calls and internal state to aid in debugging probe failures.
  • Combine Probes: Often, all three probes (Liveness, Readiness, Startup) are used together to manage application lifecycle effectively.

YAML file example

spec:
  template:
    spec:
      containers:
      - name: my-app
        image: my-app:1.0.0 # Placeholder; use an application image that serves the probe endpoints below on port 8080
        ports:
        - containerPort: 8080 # Container exposed port (matches the probe port)
        startupProbe:
          httpGet:
            path: /startup-check
            port: 8080
          initialDelaySeconds: 0 # Usually 0 for startup probes, or very small
          periodSeconds: 5
          failureThreshold: 60 # Allows 60 * 5 = 300 seconds (5 minutes) for startup
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5  # Delay 5 seconds after the container starts before checking
          periodSeconds: 10       # Check every 10 seconds
          timeoutSeconds: 5       # Time out after 5 seconds
          failureThreshold: 3     # Consider unhealthy after 3 consecutive failures
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3

Health check configuration parameters in the web console

Common parameters

  • Initial Delay (initialDelaySeconds): Grace period, in seconds, before the first probe runs. Default: 300.
  • Period (periodSeconds): Probe interval (1-120 seconds). Default: 60.
  • Timeout (timeoutSeconds): Probe timeout duration (1-300 seconds). Default: 30.
  • Success Threshold (successThreshold): Minimum consecutive successes required to mark the container healthy. Default: 0.
  • Failure Threshold (failureThreshold): Maximum consecutive failures that trigger an action:
    • 0: Disables failure-based actions.
    • Default: 5, after which the container is restarted.

Protocol specific parameters

  • Protocol (HTTP/HTTPS): Health check protocol.
  • Port (HTTP/HTTPS/TCP): Target container port for probing.
  • Path (HTTP/HTTPS): Endpoint path (e.g., /healthz).
  • HTTP Headers (HTTP/HTTPS): Custom headers (add key-value pairs).
  • Command (EXEC): Container-executable check command (e.g., sh -c "curl -I localhost:8080 | grep OK").
    Note: Escape special characters and verify that the command works before applying it.
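
These console fields map onto the standard probe fields in the Pod spec. As a rough sketch, assuming the defaults listed above and the EXEC command from the table, the generated probe (shown here as a liveness probe) would look similar to:

livenessProbe:
  exec:
    command:
    - sh
    - -c
    - curl -I localhost:8080 | grep OK   # Succeeds only if the response status line contains "OK"
  initialDelaySeconds: 300   # Initial Delay
  periodSeconds: 60          # Period
  timeoutSeconds: 30         # Timeout
  failureThreshold: 5        # Failure Threshold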

Troubleshooting probe failures

When a Pod's status indicates issues related to probes, here's how to troubleshoot:

Check pod events

kubectl describe pod <pod-name>

Look for events such as Liveness probe failed, Readiness probe failed, or Startup probe failed. These events usually include a specific error message (e.g., connection refused, an HTTP 500 response, or the command's exit code).
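
To narrow the output down to events for a single Pod, the events API can also be queried and sorted directly, for example:

kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name> --sort-by=.lastTimestamp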

View container logs

kubectl logs <pod-name> -c <container-name>

Examine application logs to see if there are errors or warnings around the time the probe failed. Your application might be logging why its health endpoint isn't responding correctly.
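
If a failed liveness probe has already caused a restart, the logs of the previous container instance often hold the root cause; they can be retrieved with the --previous flag:

kubectl logs <pod-name> -c <container-name> --previous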

Test probe endpoint manually

  • HTTP: If possible, run kubectl exec -it <pod-name> -- curl -v http://localhost:<probe-port><probe-path> (or the wget equivalent) from within the container to see the actual response.
  • Exec: Run the probe command manually: kubectl exec -it <pod-name> -- <command-from-probe> and check its exit code and output.
  • TCP: Use nc (netcat) or telnet from another Pod in the same network or from the host if allowed, to test TCP connectivity: kubectl exec -it <another-pod> -- nc -vz <pod-ip> <probe-port>.
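
Putting these together as concrete one-liners (the ports, paths, and files below are the ones from the earlier examples; adjust them to your probe configuration, and note that tools such as curl or nc must exist in the image):

# HTTP probe: query the endpoint from inside the container
kubectl exec -it <pod-name> -c <container-name> -- curl -v http://localhost:8080/healthz

# Exec probe: run the probe command; kubectl exec propagates its exit code
kubectl exec -it <pod-name> -c <container-name> -- cat /tmp/healthy; echo "exit code: $?"

# TCP probe: test connectivity from another Pod on the cluster network
kubectl exec -it <another-pod> -- nc -vz <pod-ip> 3306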

Review probe configuration

  • Double-check the probe parameters (path, port, command, delays, thresholds) in your Deployment/Pod YAML. A common mistake is an incorrect port or path.
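
A quick way to review the probe configuration the API server actually applied (with any defaults filled in) is to dump the live Pod spec and look at the probe sections:

kubectl get pod <pod-name> -o yaml | grep -A 8 -E "livenessProbe|readinessProbe|startupProbe"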

Check application code

  • Ensure your application's health check endpoint is correctly implemented and truly reflects the application's readiness/liveness. Sometimes, the endpoint might return success even when the application itself is broken.

Resource constraints

  • Insufficient CPU or memory could cause your application to become unresponsive, leading to probe failures. Check Pod resource usage (kubectl top pod <pod-name>) and consider adjusting resource requests and limits.
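
For example, compare live usage against the configured values:

# Live CPU/memory usage (requires the metrics-server add-on)
kubectl top pod <pod-name>

# Requests and limits configured for each container
kubectl describe pod <pod-name> | grep -A 6 -E "Limits|Requests"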

Network issues

  • In rare cases, network policies or CNI issues might prevent probes from reaching the container. Verify network connectivity within the cluster.
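
As a first check, list the NetworkPolicies in the Pod's namespace to see whether ingress to the probe port could be restricted:

kubectl get networkpolicy -n <namespace>
kubectl describe networkpolicy <policy-name> -n <namespace>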