Kubernetes Best Practices — Part 2
by Krunal Chaudhari, Team Lead, iauro Systems Pvt. Ltd
In the first part, we talked about containers, the need for orchestration and k8s, and a few best practices. In this part, we will talk about health checks.
The software industry today is migrating towards microservices, an architectural style that enables building more robust systems. Orchestration engines complement a microservices architecture by keeping its output stable and reliable. Microservices and containers are distributed by nature, and managing the components of a distributed system is always a huge deal! If any small piece of the architecture breaks, the system has to detect the failure and recover from it, and all of this needs to be automated. This is where health checks come in. Health checks are a great foundation for building fault tolerance and adding stability to an existing system.
Health checks are the best way to check whether the whole system is up and running. If a container is down, other microservices/containers should not send requests to it; requests should instead be routed to other containers that do the same job as the failed one. At the same time, the container that is not working should restart automatically so that the replica count stays the same. Kubernetes does not send requests directly to containers; this communication happens through a Pod.
There are two types of health checks:
- Readiness probe
- Liveness probe
Both of these checks take place throughout the lifecycle of the pod, not only at startup.
If the readiness probe fails, k8s will leave the pod in a running state but will not send any traffic to it. Once the readiness probe succeeds again, the pod rejoins the pool of traffic receivers.
Let us say that your pod depends on a database. The pod will be useless if that database pod/service is down. There are two kinds of probe you can write for such a scenario: a smart probe or a dumb probe. A smart probe checks the database and other related dependencies. A dumb probe, on the other hand, doesn’t have any logic to it. This means that as soon as the pod is up, the probe returns an ‘ok’ or 200 status code.
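To make the difference concrete, here is a minimal Python sketch of both styles. The endpoint paths `/healthz` and `/ready`, the port, and the `check_database` helper are assumptions made for illustration; they are not taken from any particular application.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable


def dumb_probe() -> int:
    """Dumb probe: healthy as soon as the process can answer at all."""
    return 200


def smart_probe(database_is_up: Callable[[], bool]) -> int:
    """Smart probe: healthy only when the dependency check passes."""
    try:
        return 200 if database_is_up() else 503
    except Exception:
        # A dependency check that raises is treated as a failed check.
        return 503


def check_database() -> bool:
    """Hypothetical placeholder; a real service would test its DB connection."""
    return True


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path == "/healthz":    # assumed path for the dumb probe
            status = dumb_probe()
        elif self.path == "/ready":    # assumed path for the smart probe
            status = smart_probe(check_database)
        else:
            status = 404
        self.send_response(status)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Run the probe endpoints until the process is stopped."""
    HTTPServer(("0.0.0.0", port), ProbeHandler).serve_forever()
```

Calling `serve()` starts the endpoints. A probe pointed at `/healthz` passes whenever the process is up, while one pointed at `/ready` also fails whenever the database check fails.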
The following fields are commonly set on an HTTP readiness probe:
initialDelaySeconds: Number of seconds after the container has started before liveness or readiness probes are initiated.
timeoutSeconds: Number of seconds after which the probe times out. Defaults to 1 second. Minimum value is 1.
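failureThreshold: When a probe fails, k8s will retry up to failureThreshold times before giving up. For a readiness probe, giving up means the pod is marked as not ready; for a liveness probe, it means the container is restarted.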
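As an illustration of how these fields fit together, the sketch below builds a readiness probe definition as a Python dictionary and prints it as JSON, which Kubernetes accepts for manifests alongside YAML. The container name, image, path, port, and all numeric values are assumptions chosen for the example. `periodSeconds` (how often the probe runs) is a standard probe field that is not discussed above.

```python
import json

# Readiness probe using the fields described above; values are illustrative.
readiness_probe = {
    "httpGet": {"path": "/ready", "port": 8080},  # assumed endpoint and port
    "initialDelaySeconds": 10,  # wait 10 s after container start
    "periodSeconds": 5,         # probe every 5 s (field not covered above)
    "timeoutSeconds": 1,        # each attempt times out after 1 s
    "failureThreshold": 3,      # give up after 3 consecutive failures
}

# Hypothetical container entry that would sit under spec.containers in a Pod.
container = {
    "name": "web",
    "image": "example/web:1.0",
    "readinessProbe": readiness_probe,
}

print(json.dumps(container, indent=2))
```

With these values, a pod that stops responding is taken out of the traffic pool after roughly `failureThreshold * periodSeconds` seconds of consecutive failures.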
The liveness probe tells k8s whether your pod is still alive. If it fails, k8s will kill the failing container and restart it (subject to the pod’s restart policy), so the replica count stays the same. This helps keep the system working properly. Use this probe only if you are confident that a restart will solve the problem. Otherwise, a situation might arise where your pod never restarts properly and stays stuck in a restart loop.
One example where the liveness probe can be used is a memory leak. Let’s say you have an application that hits its memory threshold after continued usage. A liveness probe can detect that state and trigger a restart, which clears the memory.
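The restart decision can be pictured as counting consecutive failures. The sketch below is a simplified model of that idea, not the kubelet’s actual implementation. The function name `restart_points` is hypothetical, and resetting the counter after a restart is an assumption of the model.

```python
from typing import Iterable, List


def restart_points(results: Iterable[bool], failure_threshold: int = 3) -> List[int]:
    """Return the indices of the probe runs that would trigger a restart.

    `results` is a sequence of probe outcomes (True = success). A restart is
    triggered once `failure_threshold` failures occur in a row.
    """
    restarts: List[int] = []
    consecutive_failures = 0
    for index, ok in enumerate(results):
        if ok:
            consecutive_failures = 0  # any success resets the count
            continue
        consecutive_failures += 1
        if consecutive_failures >= failure_threshold:
            restarts.append(index)
            consecutive_failures = 0  # model: restarted container starts fresh
    return restarts
```

A single failed check between successes never causes a restart in this model; only an unbroken run of failures does, which is why a sensible `failureThreshold` protects against brief hiccups.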
There are three types of probes as follows:
- HTTP: This is one of the simplest ways to check a system’s health. You expose an HTTP endpoint for the check, and if it responds with a status code of at least 200 and below 400, k8s marks the pod as healthy.
- TCP: In this type of probe, k8s tries to open a TCP connection to the specified port. If the connection is established, the pod is marked as healthy.
- Command: In this type of probe, k8s runs the command specified in the deployment inside the container. If the command returns exit code 0, the pod is marked as healthy.
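The three mechanisms can be approximated in a few lines of Python. This is only a sketch of the success criteria described above, written with the standard library. The function names and default timeouts are assumptions, and the real checks are performed by Kubernetes itself, not by application code.

```python
import http.client
import socket
import subprocess
from typing import Sequence


def http_probe(host: str, port: int, path: str = "/", timeout: float = 1.0) -> bool:
    """Healthy when a GET returns a status of at least 200 and below 400."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        status = conn.getresponse().status
        return 200 <= status < 400
    except (OSError, http.client.HTTPException):
        # Refused connections, timeouts, and malformed replies count as failures.
        return False
    finally:
        conn.close()


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Healthy when a TCP connection to the port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def command_probe(argv: Sequence[str], timeout: float = 1.0) -> bool:
    """Healthy when the command exits with code 0 within the timeout."""
    try:
        return subprocess.run(argv, timeout=timeout).returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        return False
```

For example, `tcp_probe("127.0.0.1", 5432)` reports whether anything is accepting connections on that port, which is a much weaker guarantee than an HTTP check that exercises the application.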
General rules of thumb:
- Make both readiness and liveness probes as dumb as possible. This means that they should only check that an endpoint returns a 200 success response. If you need to handle internal dependencies, handle them in the failure path of your code with the help of circuit breakers.
- Set ‘initialDelaySeconds’ carefully. You need to make sure that the liveness probe does not start until the app is ready. Otherwise, the app will constantly restart and never be ready.
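For readers unfamiliar with circuit breakers, the sketch below shows the basic idea in Python: after a number of consecutive failures, calls to the dependency are rejected immediately for a cool-down period instead of being attempted. The class name, thresholds, and injectable clock are assumptions made for illustration; production code would more likely use an established library.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency temporarily disabled")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open the circuit
            raise
        self.failures = 0  # a success closes the circuit fully
        return result
```

Wrapping database calls as `breaker.call(query)` lets the application degrade gracefully while the dependency is down, so the probes themselves can stay simple.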