Create a livenessProbe that restarts database pods in Kubernetes
The GlusterFS block storage used by the database in Root is susceptible to timeouts and stale file handles; this appears to be a limitation of Gluster itself. While the problem can be mitigated by running periodic queries or reads/writes of data, the only real fix is to kill/remove the pod and allow the Kubernetes StatefulSet to reschedule and respawn it.
To this end, we need a livenessProbe for the database resources that can detect the failure and automate this process. The Bitnami Helm chart we use for deployment does not include one by default.
- Kubernetes container probes: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes
- Bitnami Helm Chart: https://github.com/bitnami/charts/tree/master/bitnami/postgresql
Needed actions:
- Assess the Bitnami Helm chart to determine how to inject a livenessProbe. (Staying within the existing environment variables and adding the probe via a value in the configuration YAML is preferable; forking the repository and building it in is acceptable if that is the only option.)
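A values-file override is likely possible without forking: many Bitnami charts expose a `customLivenessProbe` value that is rendered verbatim into the pod spec. A minimal sketch, assuming the chart version in use supports this key (verify against the chart's `values.yaml` before relying on it):

```yaml
# values.yaml fragment -- assumes the chart supports a customLivenessProbe
# override (a common Bitnami convention; confirm for the deployed chart version).
customLivenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - exec pg_isready -h 127.0.0.1 -p 5432
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 6
```

This shows the injection mechanism only; `pg_isready` confirms the server is accepting connections but does not exercise the Gluster-backed data files, so the probe command itself would need to be replaced per the next item.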
- Determine the needed livenessProbe (probably a `psql` query against the available database tables would work), and options to kill/reschedule/restart the container if it fails. Important: just restarting the container is insufficient. The GlusterFS mount point is only refreshed when the container is respawned, which first requires its removal.
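A probe along those lines might look like the sketch below. The query and environment variable names are placeholders to be adapted to the actual deployment; the intent is that a real read through the Gluster-backed data files surfaces a stale mount as a probe failure:

```yaml
# Hypothetical exec-based probe: run a psql query so that a stale or
# timed-out GlusterFS mount causes the probe (and thus the container) to fail.
# Credentials/query are illustrative placeholders.
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -t -c "SELECT count(*) FROM pg_catalog.pg_tables;"
  periodSeconds: 30
  timeoutSeconds: 10
  failureThreshold: 3
```

Note that a failing livenessProbe only restarts the container in place, which per the point above will not refresh the mount. The probe would likely need to be paired with an external mechanism (e.g. a script or controller that deletes the failing pod so the StatefulSet reschedules it) rather than relying on the kubelet restart alone.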