Prometheus Monitoring—Fix It before It Breaks
You have just made yourself a coffee and are in a good mood when the smartphone suddenly rings. On the other end of the line, the customer frantically explains that his users can no longer log in to his website. From one minute to the next, the calm is over; after all, we are losing money every second.
But why? No idea. So we go looking and check the error message ourselves. The front end is up and, as usual, politely asks us for our credentials. After entering them, we too are suddenly faced with a rather terse error message.
Unexpected Journey—I’m Going on an Adventure!
Weird. That’s not the way it’s supposed to be, but at least the front end is still responding. As the error source is not obvious at first glance, we will have to check one service after another.
So, we take a look at the back end. This service also looks ready for use from the outside, but a test API call shows us relatively quickly that the call to the authentication service, which should validate our token, is only answered with the not very informative HTTP status code 500.
Since we have ruled out the back end as the culprit, we continue down the troubleshooting path, knowing full well that the downtime has real consequences and that every minute counts—a truly helpful thought when you’re trying to focus on identifying the source of error.
The authentication service logs tell us that during incoming calls, it is trying to reach a database where our rights management is persisted, but without success. Looks like we’re getting closer to the root of the problem. As we know our environment well, we are aware that the database in question is a mirror of a core database and is regularly synchronized with it via a CronJob. However, the logs of the database now indicate that this synchronization was not executed as planned. This means that access to the data is denied because it may not be up to date, and thus our rights management is unusable.
So the real villain seems to be our synchronization job, but why? A single look at the right place is enough for us to see that the node on which the job was supposed to run has a massive resource problem. It no longer has enough memory to start the container for synchronization.
After a one-liner changing the configuration and starting the container on a different node, the original error is fixed and, as if by magic, our users can log into the front end without any problems.
That was Stressful—and Really Cost-Intensive
In fact, the solution was quite easy, wasn’t it? Maybe, but the analysis still cost us two hours and probably cost the customer a lot of money. In the worst case, the users’ trust is irretrievably lost, and some of them may no longer use the service.
Wouldn’t it have been nice if we could have identified this resource problem right away? And wouldn’t it have been even nicer if we could have prevented the problem from occurring in the first place?
Monitoring can become a challenge, especially in distributed and highly dynamic container environments, such as Kubernetes or OpenShift. Fortunately, Prometheus already provides a solution for this.
Chronologically speaking, Prometheus Monitoring is the second project after Kubernetes hosted by the Cloud Native Computing Foundation. It offers a monitoring and alerting toolkit that is ideal for the container environments mentioned.
The main features of Prometheus include a multidimensional data model: every time series is identified by a metric name and a set of key/value pairs, called labels. Prometheus also has a powerful query language, PromQL, which allows you to slice and combine the collected data along these dimensions. Data is retrieved via a pull mechanism over HTTP, and the targets for this retrieval can either be discovered dynamically via service discovery or configured statically. For short-lived jobs that may start and finish between two pull intervals, there’s the Pushgateway, to which these jobs can send their metrics before disappearing again.
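To make the data model more concrete, here is a minimal Python sketch of what a scraped target returns and how a label-based selection works. It parses lines in the Prometheus text exposition format into a metric name, a label set, and a value, and then filters them the way a PromQL selector with equality matchers would. The metric http_requests_total, its labels, and the helper names parse_sample and select are assumptions made up for this illustration; the parser is deliberately simplified and is not how Prometheus itself is implemented.

```python
import re
from typing import Dict, List, Tuple

Sample = Tuple[str, Dict[str, str], float]

# Simplified pattern for one sample line of the text exposition format:
#   metric_name{label="value",...} 123.0 [optional timestamp]
SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line: str) -> Sample:
    """Split one exposition line into metric name, labels, and value."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    name, raw_labels, raw_value = match.groups()
    labels = dict(LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(raw_value)


def select(samples: List[Sample], name: str, **matchers: str):
    """Mimic a PromQL selector such as name{key="value"} (equality only)."""
    return [
        (labels, value)
        for sample_name, labels, value in samples
        if sample_name == name
        and all(labels.get(key) == wanted for key, wanted in matchers.items())
    ]


# Hypothetical scrape result of a /metrics endpoint (made-up numbers).
SCRAPE = """\
# HELP http_requests_total Total HTTP requests (hypothetical example metric).
# TYPE http_requests_total counter
http_requests_total{service="auth",code="200"} 1027
http_requests_total{service="auth",code="500"} 3
http_requests_total{service="backend",code="500"} 12
"""

samples = [
    parse_sample(line)
    for line in SCRAPE.splitlines()
    if line and not line.startswith("#")
]

# Roughly what sum(http_requests_total{code="500"}) would return in PromQL.
errors = select(samples, "http_requests_total", code="500")
print(sum(value for _, value in errors))  # prints 15.0
```

The point of the sketch is the shape of the data: the same metric name appears several times, and the labels are what distinguish one time series from another.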
Especially in the Kubernetes context, the community-maintained Helm chart kube-prometheus-stack is recommended. With a single command, you also get Grafana with preconfigured dashboards that provide direct insight into the resources of the nodes.
Not Only Fit for Kubernetes
If you think that Prometheus Monitoring is mainly suited for Kubernetes, you are not completely wrong. Nevertheless, Prometheus also offers excellent monitoring on individual VMs or bare-metal servers if it is started as a Docker container. Now, if you want to get the metrics from a specific machine, all that is needed is the Node Exporter, which is available as a GitHub project and just needs to be run. That’s all you need to get extremely deep metrics. Once it is running, the Prometheus container can scrape the metrics provided by the Node Exporter, by default at localhost:9100/metrics.
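As a rough illustration of what these metrics look like, the following Python sketch reads the output of a Node Exporter and computes the share of memory that is still available. It assumes a Linux host, where the Node Exporter exposes node_memory_MemAvailable_bytes and node_memory_MemTotal_bytes, and it assumes the exporter listens on its default port 9100. The function names are hypothetical, and in a real setup Prometheus does the scraping for you; the sample text below uses made-up numbers so that the example runs without an exporter.

```python
import urllib.request
from typing import Dict


def fetch_metrics(url: str = "http://localhost:9100/metrics",
                  timeout: float = 5.0) -> str:
    """Fetch the raw metrics text (URL is an assumption; adjust to your host)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def read_gauges(metrics_text: str) -> Dict[str, float]:
    """Collect label-free samples as {metric_name: value}; skip everything else."""
    gauges: Dict[str, float] = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines, HELP/TYPE comments, and labelled series.
        if not line or line.startswith("#") or "{" in line:
            continue
        parts = line.split()
        if len(parts) >= 2:
            try:
                gauges[parts[0]] = float(parts[1])
            except ValueError:
                continue
    return gauges


def available_memory_ratio(metrics_text: str) -> float:
    """Return available memory as a fraction of total memory (0.0 to 1.0)."""
    gauges = read_gauges(metrics_text)
    return (gauges["node_memory_MemAvailable_bytes"]
            / gauges["node_memory_MemTotal_bytes"])


# Shortened, made-up excerpt of a Node Exporter response (4 GiB of 16 GiB free).
SAMPLE = """\
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 4.294967296e+09
# TYPE node_memory_MemTotal_bytes gauge
node_memory_MemTotal_bytes 1.7179869184e+10
"""

print(available_memory_ratio(SAMPLE))  # prints 0.25
```

Against a live exporter, the same calculation would be available_memory_ratio(fetch_metrics()). In Prometheus itself, the equivalent is a one-line PromQL expression dividing the two metrics.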
No Control without Visibility
With Prometheus, distributed monitoring is no longer an impossible task and should be one of the three standard stacks rolled out in any environment, especially in a DevOps context, along with tracing and logging. In the anecdote described at the beginning, a quick look at the status of the nodes would have been sufficient to detect the resource problem. This would have saved us the time-consuming backtracking. At best, we would have had an alerting system based on our metrics in place to notify us of the impending resource shortage on a node. Then we would have eliminated this risk in a matter of minutes.
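The logic behind such an alert can be sketched in a few lines of Python. In Prometheus itself, this would be an alerting rule consisting of a PromQL expression and a "for" duration, with the notification handled by Alertmanager; the code below only illustrates the idea that a condition must hold for a while before anyone is paged. The class name MemoryAlert, the 10 % threshold, and the five-minute hold time are assumptions chosen for this example, not recommended values.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class MemoryAlert:
    threshold: float = 0.10      # alert below 10 % available memory (assumed)
    hold_seconds: float = 300.0  # condition must hold this long (assumed)

    def evaluate(self, samples: Iterable[Tuple[float, float]]) -> List[float]:
        """Return the timestamps at which the alert is firing.

        samples: (unix_timestamp, available_memory_ratio) pairs in time order.
        """
        firing: List[float] = []
        breach_started: Optional[float] = None
        for timestamp, ratio in samples:
            if ratio < self.threshold:
                if breach_started is None:
                    breach_started = timestamp  # condition is now "pending"
                if timestamp - breach_started >= self.hold_seconds:
                    firing.append(timestamp)    # pending long enough: fire
            else:
                breach_started = None           # recovered: reset the timer
        return firing


# Made-up measurements, one per minute: a short dip that recovers,
# followed by a sustained shortage.
history = [
    (0, 0.30), (60, 0.08), (120, 0.07), (180, 0.12),
    (240, 0.09), (300, 0.08), (360, 0.06), (420, 0.05),
    (480, 0.05), (540, 0.04), (600, 0.03),
]

print(MemoryAlert().evaluate(history))  # prints [540, 600]
```

The short dip at 60 and 120 seconds never fires because memory recovers before five minutes have passed, which is exactly the kind of noise a hold time is meant to suppress.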
In a DevOps context, good monitoring saves you a lot of stress, and you can enjoy your coffee in peace. The premise under which we conduct monitoring for our customers is always “fix it before it breaks.”