Are You Paying Attention?

November 24, 2015

So, we have all these logs now, which is awesome. Centralised and easily searchable, containing lots of information relating to application health, feature usage, infrastructure utilization along with many other things I’m not going to bother to list here.

But who is actually looking at them?

The answer to that question is something we’ve been struggling with for a little while. Sure, we go into Kibana and do arbitrary searches, we use dashboards, we do our best to keep an eye on things like CPU usage, free memory, number of errors and so on, but we often have other things to do. Nobody has a full-time job of just watching this mountain of data for interesting patterns and problems.

We’ve missed things:

We had an outage recently that was the result of a disk filling up with log files because an old version of the log management script had a bug in it. The free disk space was clearly going down when you looked at it in the Kibana dashboard, but it was happening so gradually that it was never really brought up as a top priority.
We had a different outage recently where we had a gradual memory leak in the non-paged memory pool on some of our API instances. Similar to above, we were recording free memory and it was clearly dropping over time, but no one noticed.

There have been other instances (like an increase in the total number of 500s being returned from an API, indicating a bug), but I won’t go into too much more detail about the fact that we miss things. We’re human, we have other things to do, it happens.

Instead, let’s attack the root of the issue. The human element.

We can’t reasonably expect anyone to keep an eye on all of the data hurtling towards us. It’s too much. On the other hand, all of the above problems could easily have been detected by a computer. All we need is something that can do the analysis for us and then let us know when there is something to action. It doesn’t have to be incredibly fancy (no learning algorithms….yet); all it has to do is compare a few points in time and alert on a trend in the wrong direction.
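To make "compare a few points in time" a bit more concrete, here is a minimal sketch in Python. It is not something we actually run. The function names, the sample format (hours, free megabytes) and the alert horizon are all assumptions made up for illustration. It fits a straight line through the samples and estimates how long until the metric hits the floor.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical sample format: (time in hours, metric value such as free MB).
Sample = Tuple[float, float]


def slope(samples: Sequence[Sample]) -> float:
    """Least-squares slope of the metric over time (value units per hour)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate a trend")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        raise ValueError("samples must cover more than one point in time")
    numer = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    return numer / denom


def hours_until_exhausted(samples: Sequence[Sample],
                          floor: float = 0.0) -> Optional[float]:
    """Estimated hours until the metric reaches `floor`.

    Assumes samples are ordered oldest to newest. Returns None when the
    metric is flat or rising, because there is nothing to alert on.
    """
    rate = slope(samples)
    if rate >= 0:
        return None
    latest_value = samples[-1][1]
    return max(0.0, (latest_value - floor) / -rate)


def should_alert(samples: Sequence[Sample], horizon_hours: float,
                 floor: float = 0.0) -> bool:
    """True when the metric is on course to hit `floor` within the horizon."""
    remaining = hours_until_exhausted(samples, floor)
    return remaining is not None and remaining <= horizon_hours
```

Something like `should_alert(free_disk_samples, horizon_hours=48)` would have flagged the slow disk problem long before the disk was actually full, even though no single reading looked alarming on its own.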

One of my colleagues was investigating solutions to this problem, and they settled on Sensu.

Latin: In The Sense Of

I won’t go into too much detail about Sensu here, because I think the documentation will do a much better job than I will.

My understanding of it, however, is that it is a very generic, messaging-based check/handler system, where a check can be almost anything (run an Elasticsearch query, go get some current system stats, respond to some incoming event) and a handler is an arbitrary reaction (send an email, restart a server, launch the missiles).

Sensu has a number of components, including servers (wiring logic, check –> handler), clients (things that get checks executed on them) and an API (separate from the server). All communication happens through RabbitMQ and there is some usage of Redis for storage (which I’m not fully across yet).
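The check/handler split is easier to see in code. The following is a toy model in Python of the idea only, and is definitely not Sensu's API. The `CheckResult` and `Dispatcher` names are hypothetical, and the real thing routes results over RabbitMQ rather than through in-process function calls. The status numbers follow the usual 0 = OK, 1 = warning, 2 = critical convention.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Conventional status codes: 0 = OK, 1 = warning, 2 = critical.
OK, WARNING, CRITICAL = 0, 1, 2


@dataclass
class CheckResult:
    """Outcome of one check run (hypothetical shape, for illustration)."""
    check: str    # name of the check, e.g. "free_memory"
    status: int   # one of OK, WARNING, CRITICAL
    output: str   # human readable message


Handler = Callable[[CheckResult], None]


@dataclass
class Dispatcher:
    """Wires check names to the handlers that should react to them."""
    handlers: Dict[str, List[Handler]] = field(default_factory=dict)

    def register(self, check: str, handler: Handler) -> None:
        self.handlers.setdefault(check, []).append(handler)

    def dispatch(self, result: CheckResult) -> int:
        """Run every handler for a non-OK result; return how many ran."""
        if result.status == OK:
            return 0
        targets = self.handlers.get(result.check, [])
        for handler in targets:
            handler(result)
        return len(targets)
```

The point of the split is that the check knows nothing about who gets told, and the handler knows nothing about how the problem was detected, so either side can be swapped out independently.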

I am by no means any sort of expert in Sensu, as I did not implement our current proof of concept. I am, however, hopefully going to use it to deal with some of the alerting problems that I outlined above.

The first check/handler to implement?

Alert us via email/SMS when the available memory on an API instance is below a certain threshold.

Alas, I have not actually done this yet. This post is more going to outline the conceptual approach, and I will return later with more information about how it actually worked (or didn’t work).

Throw Out Broken Things

One of the things that you need to come to terms with early when using AWS is that everything will break. It might be your fault, it might not be, but you should accept right from the beginning that at some point, your things will break. This is good in a way, because it forces you to not have any single points of failure (unless you are willing to accept the risk that they might go down and you will have an outage, which is a business decision).

I mention this because the problem with the memory in our API instances that I mentioned above is pretty mysterious. It’s not being taken by any active process (regardless of user), so it looks like a driver problem. It could be one of those weird AWS things (there are a lot), and it goes away if you reboot, so the easiest solution is to just recycle the bad API instance and move on. It’s already in an auto-scaling group for redundancy, and there is always more than 1, so it’s better to just murder it, relax, and let the ASG do its work.

Until we’re comfortable automating that sort of recycling, we’ll settle for an alert that someone can use to make a decision and execute the recycle themselves.

By installing the Sensu client on the machines in question (incorporating it into the environment setup itself), we can create a check that allows us to remotely query the available free memory and compare it against some configured value that we deem too low (let’s say 100MB). We can then configure 2 handlers for the check result, one that emails a set of known addresses and another that does the same for SMS.
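As a rough idea of what the check itself might look like, here is a sketch in Python. Sensu checks are just commands that print a message and signal their result through the exit status (0 for OK, 1 for warning, 2 for critical, anything else for unknown), so the language is not important. Everything else here is an assumption: the function names are made up, the 100MB default is just the number from above, and reading memory via the third-party `psutil` package is one option among many that I have not tried on our instances.

```python
# Exit statuses that a check command reports back to the monitoring client.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_free_memory(free_mb: float, critical_mb: float = 100.0):
    """Compare free memory to the threshold; return (exit status, message)."""
    if free_mb < critical_mb:
        return CRITICAL, (
            f"CRITICAL: {free_mb:.0f}MB free, below {critical_mb:.0f}MB"
        )
    return OK, f"OK: {free_mb:.0f}MB free"


def read_free_memory_mb() -> float:
    """Read available memory in MB.

    Assumption: psutil is installed on the instance. It is imported here
    rather than at the top so the comparison logic has no dependencies.
    """
    import psutil
    return psutil.virtual_memory().available / (1024 * 1024)


def main() -> int:
    """Run the check once: print a one-line message, return the status."""
    try:
        free_mb = read_free_memory_mb()
    except Exception as exc:  # report the failure instead of crashing
        print(f"UNKNOWN: could not read free memory: {exc}")
        return UNKNOWN
    status, message = evaluate_free_memory(free_mb)
    print(message)
    return status
```

To use it as an actual check command, the script would end with `sys.exit(main())` under the usual `__main__` guard, so the status becomes the process exit code. The email and SMS handlers then only need to react to a non-zero status, and they never need to know anything about memory.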

Seems simple enough. I wonder if it will actually be that simple in practice.

Summary

Alerting on your aggregate of information (logs, stats, etc) is a pretty fundamental ability that you need to have.

AWS does provide some alerting in the form of CloudWatch alarms, but we decided to investigate a different (more generic) route instead, mostly because of the wealth of information that we already had available inside our ELK stack (and our burning desire to use it for something other than pretty graphs).

As I said earlier, this post is more of an outline of how we plan to attack the situation using Sensu, so it’s a bit light on details I’m afraid.

I’m sure the followup will be amazing though.

Right?