Health Checks

Understanding Health Checks

For full reference material, refer to the official Kubernetes documentation.

In Kubernetes, health checks, also known as probes, are a critical mechanism to ensure the high availability and resilience of your applications. Kubernetes uses these probes to determine the health and readiness of your Pods, allowing the system to take appropriate actions, such as restarting containers or routing traffic. Without proper health checks, Kubernetes cannot reliably manage your application's lifecycle, potentially leading to service degradation or outages.

Kubernetes offers three types of probes:

  • livenessProbe: Detects whether the container is still running. If a liveness probe fails, Kubernetes kills the container and restarts it according to the Pod's restartPolicy.
  • readinessProbe: Detects if the container is ready to serve traffic. If a readiness probe fails, the Endpoint Controller removes the Pod from the Service's Endpoint list until the probe succeeds.
  • startupProbe: Specifically checks if the application has successfully started. Liveness and readiness probes will not execute until the startup probe succeeds. This is very useful for applications with long startup times.

Properly configuring these probes is essential for building robust and self-healing applications on Kubernetes.

Probe Types

Kubernetes supports three mechanisms for implementing probes:

httpGet Action

Executes an HTTP GET request against the Pod's IP address on a specified port and path. The probe is considered successful if the response code is between 200 and 399.

  • Use Cases: Web servers, REST APIs, or any application exposing an HTTP endpoint.
  • Example:
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
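
If the endpoint is served over TLS or expects a specific header, httpGet also accepts a scheme and custom httpHeaders. A minimal variant of the probe above (the port, header name, and value here are illustrative):

livenessProbe:
  httpGet:
    path: /healthz
    port: 8443
    scheme: HTTPS        # The kubelet skips certificate verification for HTTPS probes
    httpHeaders:
    - name: X-Health-Check   # Illustrative custom header
      value: probe
  initialDelaySeconds: 15
  periodSeconds: 20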

exec Action

Executes a specified command inside the container. The probe is successful if the command exits with status code 0.

  • Use Cases: Applications without HTTP endpoints, checking internal application state, or performing complex health checks that require specific tools.

  • Example:

readinessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy
  initialDelaySeconds: 5
  periodSeconds: 5
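
With this pattern, the application (or another process in the container) is expected to create /tmp/healthy once it can serve requests and remove the file when it cannot; the probe then succeeds or fails based on the exit code of cat.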

tcpSocket Action

Attempts to open a TCP socket on the container's IP address and a specified port. The probe is successful if the TCP connection can be established.

  • Use Cases: Databases, message queues, or any application that communicates over a TCP port but might not have an HTTP endpoint.

  • Example:

startupProbe:
  tcpSocket:
    port: 3306
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 30
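
With periodSeconds: 10 and failureThreshold: 30, the container gets roughly 30 * 10 = 300 seconds after the initial delay to start listening on port 3306; the liveness and readiness probes only begin running once this startup probe has succeeded.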

Best Practices

  • Liveness vs. Readiness:
    • Liveness: Use a Liveness Probe when the only sensible way to recover an unresponsive application is to restart it; if the probe fails, Kubernetes restarts the container.
    • Readiness: Use a Readiness Probe if your application can become temporarily unable to serve traffic (e.g., while waiting for a database connection) but is expected to recover without a restart. This prevents traffic from being routed to an instance that is not ready.
  • Startup Probes for Slow Applications: Use Startup Probes for applications that take a significant amount of time to initialize. This prevents premature restarts due to Liveness Probe failures or traffic routing issues due to Readiness Probe failures during startup.
  • Lightweight Probes: Ensure your probe endpoints are lightweight and perform quickly. They should not involve heavy computation or external dependencies (like database calls) that could make the probe itself unreliable.
  • Meaningful Checks: Probe checks should genuinely reflect the health and readiness of your application, not just whether the process is running. For example, for a web server, check if it can serve a basic page, not just if the port is open.
  • Adjust initialDelaySeconds: Set initialDelaySeconds appropriately to give your application enough time to start before the first probe.
  • Tune periodSeconds and failureThreshold: Balance the need for quick detection of failures with avoiding false positives. Too frequent probes or too low a failureThreshold can lead to unnecessary restarts or unready states.
  • Logs for Debugging: Ensure your application logs clear messages related to health check endpoint calls and internal state to aid in debugging probe failures.
  • Combine Probes: Often, all three probes (Liveness, Readiness, Startup) are used together to manage application lifecycle effectively.

YAML file example

spec:
  template:
    spec:
      containers:
      - name: my-app
        image: my-app:1.0.0 # Placeholder; use an application image that serves the probe endpoints below on port 8080
        ports:
        - containerPort: 8080 # Container exposed port (matches the probe port)
        startupProbe:
          httpGet:
            path: /startup-check
            port: 8080
          initialDelaySeconds: 0 # Usually 0 for startup probes, or very small
          periodSeconds: 5
          failureThreshold: 60 # Allows 60 * 5 = 300 seconds (5 minutes) for startup
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5  # Delay 5 seconds after the container starts before checking
          periodSeconds: 10       # Check every 10 seconds
          timeoutSeconds: 5       # Time out after 5 seconds
          failureThreshold: 3     # Consider unhealthy after 3 consecutive failures
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3

Health check configuration parameters in the web console

Common parameters

  • Initial Delay (initialDelaySeconds): Grace period, in seconds, before the first probe runs. Default: 300.
  • Period (periodSeconds): Probe interval (1-120 seconds). Default: 60.
  • Timeout (timeoutSeconds): Probe timeout duration (1-300 seconds). Default: 30.
  • Success Threshold (successThreshold): Minimum consecutive successes required to mark the container healthy. Default: 0.
  • Failure Threshold (failureThreshold): Maximum consecutive failures that trigger an action:
    • 0: Disables failure-based actions.
    • Default: 5, after which the container is restarted.

Protocol specific parameters

  • Protocol (HTTP/HTTPS): Health check protocol.
  • Port (HTTP/HTTPS/TCP): Target container port for probing.
  • Path (HTTP/HTTPS): Endpoint path (e.g., /healthz).
  • HTTP Headers (HTTP/HTTPS): Custom headers (add key-value pairs).
  • Command (EXEC): Container-executable check command (e.g., sh -c "curl -I localhost:8080 | grep OK").
    Note: Escape special characters and verify that the command works before applying it.
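
These console fields map onto the standard probe fields in the Pod spec. As a rough sketch, assuming the defaults listed above and the EXEC command from the table, the generated probe (shown here as a liveness probe) would look similar to:

livenessProbe:
  exec:
    command:
    - sh
    - -c
    - curl -I localhost:8080 | grep OK   # Succeeds only if the response status line contains "OK"
  initialDelaySeconds: 300   # Initial Delay
  periodSeconds: 60          # Period
  timeoutSeconds: 30         # Timeout
  failureThreshold: 5        # Failure Threshold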

Troubleshooting probe failures

When a Pod's status indicates issues related to probes, here's how to troubleshoot:

Check pod events

kubectl describe pod <pod-name>

Look for events such as Liveness probe failed, Readiness probe failed, or Startup probe failed. These events usually include a specific error message (e.g., connection refused, an HTTP 500 response, or the command's exit code).
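
To narrow the output down to events for a single Pod, the events API can also be queried and sorted directly, for example:

kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name> --sort-by=.lastTimestamp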

View container logs

kubectl logs <pod-name> -c <container-name>

Examine application logs to see if there are errors or warnings around the time the probe failed. Your application might be logging why its health endpoint isn't responding correctly.
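
If a failed liveness probe has already caused a restart, the logs of the previous container instance often hold the root cause; they can be retrieved with the --previous flag:

kubectl logs <pod-name> -c <container-name> --previous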

Test probe endpoint manually

  • HTTP: If possible, run kubectl exec -it <pod-name> -- curl -v http://localhost:<probe-port><probe-path> (or the wget equivalent) from within the container to see the actual response.
  • Exec: Run the probe command manually: kubectl exec -it <pod-name> -- <command-from-probe> and check its exit code and output.
  • TCP: Use nc (netcat) or telnet from another Pod in the same network or from the host if allowed, to test TCP connectivity: kubectl exec -it <another-pod> -- nc -vz <pod-ip> <probe-port>.
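
Putting these together as concrete one-liners (the ports, paths, and files below are the ones from the earlier examples; adjust them to your probe configuration, and note that tools such as curl or nc must exist in the image):

# HTTP probe: query the endpoint from inside the container
kubectl exec -it <pod-name> -c <container-name> -- curl -v http://localhost:8080/healthz

# Exec probe: run the probe command; kubectl exec propagates its exit code
kubectl exec -it <pod-name> -c <container-name> -- cat /tmp/healthy; echo "exit code: $?"

# TCP probe: test connectivity from another Pod on the cluster network
kubectl exec -it <another-pod> -- nc -vz <pod-ip> 3306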

Review probe configuration

  • Double-check the probe parameters (path, port, command, delays, thresholds) in your Deployment/Pod YAML. A common mistake is an incorrect port or path.
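
A quick way to review the probe configuration the API server actually applied (with any defaults filled in) is to dump the live Pod spec and look at the probe sections:

kubectl get pod <pod-name> -o yaml | grep -A 8 -E "livenessProbe|readinessProbe|startupProbe"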

Check application code

  • Ensure your application's health check endpoint is correctly implemented and truly reflects the application's readiness/liveness. Sometimes, the endpoint might return success even when the application itself is broken.

Resource constraints

  • Insufficient CPU or memory could cause your application to become unresponsive, leading to probe failures. Check Pod resource usage (kubectl top pod <pod-name>) and consider adjusting resource requests and limits.
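
For example, compare live usage against the configured values:

# Live CPU/memory usage (requires the metrics-server add-on)
kubectl top pod <pod-name>

# Requests and limits configured for each container
kubectl describe pod <pod-name> | grep -A 6 -E "Limits|Requests"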

Network issues

  • In rare cases, network policies or CNI issues might prevent probes from reaching the container. Verify network connectivity within the cluster.
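
As a first check, list the NetworkPolicies in the Pod's namespace to see whether ingress to the probe port could be restricted:

kubectl get networkpolicy -n <namespace>
kubectl describe networkpolicy <policy-name> -n <namespace>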