Health Checks
Understanding Health Checks
For the authoritative reference, see the official Kubernetes documentation.
In Kubernetes, health checks, also known as probes, are a critical mechanism for ensuring the high availability and resilience of your applications. Kubernetes uses these probes to determine the health and readiness of your Pods, allowing the system to take appropriate action, such as restarting a container or withholding traffic from a Pod that is not ready. Without proper health checks, Kubernetes cannot reliably manage your application's lifecycle, potentially leading to service degradation or outages.
Kubernetes offers three types of probes:
- `livenessProbe`: Detects whether the container is still running. If a liveness probe fails, the kubelet kills the container and restarts it according to the Pod's restart policy.
- `readinessProbe`: Detects whether the container is ready to serve traffic. If a readiness probe fails, the endpoints controller removes the Pod from the Service's endpoints until the probe succeeds again.
- `startupProbe`: Checks whether the application inside the container has finished starting. Liveness and readiness probes do not run until the startup probe succeeds, which makes it especially useful for applications with long startup times.
Properly configuring these probes is essential for building robust and self-healing applications on Kubernetes.
Probe Types
Kubernetes supports three mechanisms for implementing probes:
HTTP GET
- Action: Executes an HTTP `GET` request against the Pod's IP address on a specified port and path. The probe is considered successful if the response code is between 200 and 399.
- Use Cases: Web servers, REST APIs, or any application exposing an HTTP endpoint.
- Example:

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
```
exec
- Action: Executes a specified command inside the container. The probe is successful if the command exits with status code 0.
- Use Cases: Applications without a network endpoint, or checks that are easiest to express as a command (for example, testing for a marker file or running a CLI client).
- Example:

```yaml
readinessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy
  initialDelaySeconds: 5
  periodSeconds: 5
```
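The `cat /tmp/healthy` pattern above comes from the upstream Kubernetes documentation; a self-contained Pod you can run to watch it in action might look like this (the Pod name is arbitrary). The container creates the marker file, deletes it after 30 seconds, and the probe then starts failing:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: exec-probe-demo   # arbitrary name for experimentation
spec:
  containers:
  - name: demo
    image: busybox:1.36
    # Create the marker file, remove it after 30s, then keep the container alive.
    args: ["/bin/sh", "-c", "touch /tmp/healthy; sleep 30; rm -f /tmp/healthy; sleep 600"]
    readinessProbe:
      exec:
        command: ["cat", "/tmp/healthy"]
      initialDelaySeconds: 5
      periodSeconds: 5
```

After roughly 30 seconds, `kubectl get pods` should show the Pod dropping to `0/1` READY as the probe begins to fail.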
TCP Socket
- Action: Attempts to open a TCP socket on the container's IP address and a specified port. The probe is successful if the TCP connection can be established.
- Use Cases: Databases, message queues, or any application that communicates over a TCP port but might not expose an HTTP endpoint.
- Example:

```yaml
startupProbe:
  tcpSocket:
    port: 3306
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 30
```

With port 3306 (MySQL's default), this configuration gives the database roughly 30 × 10 = 300 seconds to start accepting connections before the container is restarted.
Best Practices
- Liveness vs. Readiness:
  - Liveness: Use a liveness probe when restarting is the right remedy for an unresponsive application; if the probe fails, Kubernetes restarts the container.
  - Readiness: If your application is temporarily unable to serve traffic (e.g., while waiting for a database connection) but may recover without a restart, use a readiness probe. This prevents traffic from being routed to an unhealthy instance.
- Startup Probes for Slow Applications: Use startup probes for applications that take a significant amount of time to initialize. This prevents premature restarts from liveness probe failures, and traffic-routing problems from readiness probe failures, during startup.
- Lightweight Probes: Keep probe endpoints lightweight and fast. They should not involve heavy computation or external dependencies (like database calls) that could make the probe itself unreliable.
- Meaningful Checks: Probe checks should genuinely reflect the health and readiness of your application, not just whether the process is running. For a web server, for example, check that it can serve a basic page, not just that the port is open (see the sketch after this list).
- Adjust initialDelaySeconds: Set `initialDelaySeconds` so your application has enough time to start before the first probe runs.
- Tune periodSeconds and failureThreshold: Balance quick failure detection against false positives; the worst-case detection time is roughly `periodSeconds × failureThreshold`. Probing too frequently or setting `failureThreshold` too low can cause unnecessary restarts or unready states.
- Logs for Debugging: Have your application log clear messages about health-check endpoint calls and internal state to aid in debugging probe failures.
- Combine Probes: Often, all three probes (liveness, readiness, startup) are used together to manage the application lifecycle effectively.
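As a minimal sketch of the liveness/readiness distinction (assuming a Go HTTP server; the `/healthz` and `/ready` paths echo the examples on this page, and the `dbReady` flag is purely illustrative): liveness confirms only that the process can respond, while readiness also consults a cached view of a critical dependency, so the probe endpoint itself stays cheap.

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
)

// dbReady is an illustrative flag; a real application would set it after
// successfully connecting to (and periodically pinging) its database,
// so the probe handler never makes an expensive external call itself.
var dbReady atomic.Bool

func main() {
	// Liveness: the process can serve requests at all.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: the process can do useful work (dependency is up).
	http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		if dbReady.Load() {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable) // 503 -> probe fails
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```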
YAML file example
```yaml
spec:
  template:
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2      # Container image
        ports:
        - containerPort: 80      # Container exposed port
        startupProbe:
          httpGet:
            path: /startup-check # Assumes the application serves this endpoint
            port: 80
          initialDelaySeconds: 0 # Usually 0, or very small, for startup probes
          periodSeconds: 5
          failureThreshold: 60   # Allows 60 * 5 = 300 seconds (5 minutes) for startup
        livenessProbe:
          httpGet:
            path: /healthz
            port: 80
          initialDelaySeconds: 5 # Wait 5 seconds after container start before checking
          periodSeconds: 10      # Check every 10 seconds
          timeoutSeconds: 5      # Time out after 5 seconds
          failureThreshold: 3    # Consider unhealthy after 3 consecutive failures
        readinessProbe:
          httpGet:
            path: /ready
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
```

Note that the probe port must match a port the container actually listens on (here 80, matching `containerPort`), and stock nginx does not serve these example paths; substitute endpoints your application implements.
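After applying a manifest like this (saved here as a hypothetical `deployment.yaml`), you can confirm how the probes were interpreted and watch them in action:

```bash
kubectl apply -f deployment.yaml

# describe summarizes each probe's path, port, delay, period, and thresholds.
kubectl describe pod <pod-name> | grep -iE 'liveness|readiness|startup'

# Watch the READY column change as readiness probes pass or fail.
kubectl get pods -w
```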
Health check configuration parameters in the web console
Common parameters
| Parameter | Description |
| --- | --- |
| Initial Delay | `initialDelaySeconds`: grace period (in seconds) before probes start. Default: `300`. |
| Period | `periodSeconds`: probe interval (1-120 s). Default: `60`. |
| Timeout | `timeoutSeconds`: probe timeout duration (1-300 s). Default: `30`. |
| Success Threshold | `successThreshold`: minimum consecutive successes to mark the container healthy. Default: `0`. |
| Failure Threshold | `failureThreshold`: maximum consecutive failures before action is taken. `0` disables failure-based actions; the default of `5` failures triggers a container restart. |

With the defaults above, an unhealthy container is restarted after roughly 60 × 5 = 300 seconds of consecutive probe failures.
Protocol-specific parameters

| Parameter | Applicable Protocols | Description |
| --- | --- | --- |
| Protocol | HTTP/HTTPS | Health check protocol. |
| Port | HTTP/HTTPS/TCP | Target container port for probing. |
| Path | HTTP/HTTPS | Endpoint path (e.g., `/healthz`). |
| HTTP Headers | HTTP/HTTPS | Custom headers (add key-value pairs); see the manifest sketch below. |
| Command | EXEC | Command executed inside the container (e.g., `sh -c "curl -I localhost:8080 \| grep OK"`). Note: escape special characters and test that the command works in the container. |
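Assuming these console fields map onto the standard Kubernetes probe fields (a reasonable assumption, though any given console may differ), an HTTPS check with a custom header would correspond to a manifest like this:

```yaml
livenessProbe:
  httpGet:
    path: /healthz          # Path
    port: 8443              # Port (illustrative HTTPS port)
    scheme: HTTPS           # Protocol
    httpHeaders:            # HTTP Headers (key-value pairs)
    - name: X-Health-Check  # illustrative header name
      value: console
  initialDelaySeconds: 300  # matches the console defaults listed above
  periodSeconds: 60
  timeoutSeconds: 30
  failureThreshold: 5
```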
Troubleshooting probe failures
When a Pod's status indicates issues related to probes, here's how to troubleshoot:
Check pod events
```bash
kubectl describe pod <pod-name>
```
Look for events related to LivenessProbe failed, ReadinessProbe failed, or StartupProbe failed. These events often provide specific error messages (e.g., connection refused, HTTP 500 error, command exit code).
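If the Pod has already been replaced and its `describe` output is gone, you can still query events directly:

```bash
# List only this Pod's events, most recent last.
kubectl get events --field-selector involvedObject.name=<pod-name> --sort-by=.lastTimestamp
```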
View container logs
```bash
kubectl logs <pod-name> -c <container-name>
```
Examine application logs to see if there are errors or warnings around the time the probe failed. Your application might be logging why its health endpoint isn't responding correctly.
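Because a failed liveness probe restarts the container, the relevant logs are often in the previous instance; the `--previous` flag retrieves them:

```bash
# Logs from the container instance that ran before the last restart.
kubectl logs <pod-name> -c <container-name> --previous
```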
Test probe endpoint manually
- HTTP: If possible, run `kubectl exec -it <pod-name> -- curl -v http://localhost:<probe-port><probe-path>` (or use `wget`) from within the container to see the actual response.
- Exec: Run the probe command manually, `kubectl exec -it <pod-name> -- <command-from-probe>`, and check its exit code and output.
- TCP: Use `nc` (netcat) or `telnet` from another Pod in the same network, or from the host if allowed, to test TCP connectivity: `kubectl exec -it <another-pod> -- nc -vz <pod-ip> <probe-port>`.
Review probe configuration
- Double-check the probe parameters (path, port, command, delays, thresholds) in your Deployment/Pod YAML. A common mistake is an incorrect port or path.
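It can also help to inspect the configuration the API server actually stored, since defaulted fields (such as `successThreshold`) appear there even if your manifest omitted them:

```bash
# Show the probes as stored on the live object, including defaulted fields.
kubectl get pod <pod-name> -o yaml | grep -A8 -iE 'livenessProbe|readinessProbe|startupProbe'
```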
Check application code
- Ensure your application's health check endpoint is correctly implemented and truly reflects the application's readiness/liveness. Sometimes, the endpoint might return success even when the application itself is broken.
Resource constraints
- Insufficient CPU or memory can make your application unresponsive, leading to probe failures. Check Pod resource usage (`kubectl top pod <pod-name>`) and consider adjusting resource requests and limits.
Network issues
- In rare cases, network policies or CNI issues might prevent probes from reaching the container. Verify network connectivity within the cluster.
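One way to rule this out is a throwaway debug Pod (the Pod name and image here are arbitrary choices):

```bash
# Try the probe target from another Pod in the cluster; the Pod is removed afterwards.
kubectl run probe-debug --rm -it --restart=Never --image=busybox:1.36 -- \
  wget -qO- -T 5 http://<pod-ip>:<probe-port><probe-path>
```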