Health Is Just A State Of Mind

January 16. 2018 0 Comments

Another week, another post.

Stepping away the data synchronization algorithm for a bit, its time to explore some AWS weirdness..

Well, strictly speaking, AWS is not really all that weird. Its just that sometimes things happen, and they don’t immediately make any sense, and it takes a few hours and some new information before you realise “oh, that makes perfect sense”.

That intervening period can be pretty confusing though.

Todays post is about a disagreement between a Load Balancer and a CloudWatch alarm about just how healthy some EC2 instances were.

Schrödinger's Symptoms

At the start of this whole adventure, we got an alert email from CloudWatch saying that one of our Load Balancers contained some unhealthy instances.

This happens from time to time, but its one of those things that you need to get to the bottom of quickly, just in case its the first sign of something more serious.

In this particular case, the alert was for one of the very small number of services whose infrastructure we manually manage. That is, the service is hosted on hand crafted EC2 instances, manually placed into a load balancer. That means no auto scaling group, so no capability to self heal or scale if necessary, at least not without additional effort. Not the greatest situation to be in, but sometimes compromises must be made.

Our first step for this sort of thing is typically to log into the AWS dashboard and have a look around, starting with the entities involved in the alert.

After logging in and checking the Load Balancer though, everything seemed fine. No unhealthy instances were being reported. Fair enough, maybe the alarm had already reset and we just hadn’t gotten the followup email (its easy enough to forgot to configure an alert to send emails when the alarm returns back to the OK state).

But no, checking on the CloudWatch alarm showed that it was still triggered.

Two views on the same information, one says “BAD THING HAPPENING” the only says “nah, its cool, I’m fine”.

But which one was right?

Diagnosis: Aggregations

When you’re working with instance health, one of the most painful things is that AWS does not offer detailed logs showing the results of Load Balancer health checks. Sure, you can see whether or not there were any healthy/unhealthy instances, but if you want to know how the machines are actually responding, you pretty much just have to execute the health check yourself and see what happens.

In our case, that meant hitting the EC2 instances directly over HTTP with a request for /index.html.

Unfortunately (fortunately?) everything appeared to be working just fine, and each request returned 200 OK (which is what the health check expects).

Our suspicion then fell to the CloudWatch alarm itself. Perhaps we had misconfigured it and it wasn’t doing what we expected? Maybe it was detecting missing data as an error and then alarming on that. It might still indicate a problem of some sort, but would at least confirm that the instances were functional, which is what everything else appeared to be saying.

The alarm was correctly configured though, saying the equivalent of “alert when there is more than 0 unhealthy instances in the last 5 minutes”.

We poked around a bit into the other metrics available on the Load Balancer (request count, latency, errors, etc) and discovered that latency had spiked a bit, request count had dropped a bit and that there was a tonne of backend errors, so something was definitely wrong.

Returning back to the alarm we noticed that the aggregation on the data point was “average”, which meant that it was actually saying “alert when there is more than 0 unhealthy instances on average over the last 5 minutes”. Its not obvious what the average value of a health check is over time, but changing the aggregation to minimum showed that there were zero unhealthy instances over the same time period, and changing it to maximum showed that all four of the instances were unhealthy over the same time period.

Of course, this meant that the instances were flapping between up and down, which meant the health checks were sometimes failing and sometimes succeeding.

It was pure chance that when we looked at the unhealthy instances directly in the Load Balancer that it had never shown any, and similarly when we had manually hit the health endpoint it had always responded appropriately. The CloudWatch alarm remembered though, and the average of [healthy, healthy, unhealthy] was unhealthy as far as it was concerned, so it was alerting correctly.

Long story short, both views of the data were strictly correct, and were just showing slightly different interpretations.

Cancerous Growth

The root cause of the flapping was exceedingly high CPU usage on the EC2 instances in question, which was leading to timeouts.

We’d done a deployment earlier that day, and it had increased the overall CPU usage of two of the services hosted on those instances by enough that the machines were starting to strain with the load.

More concerning though was the fact that the deployment had only really spiked the CPU from 85% to 95-100%. We’d actually been running those particularly instances hot for quite a while.

In fact, looking back at the CPU usage over time, there was a clear series of step changes leading up to the latest increase, and we just hadn’t been paying attention.

Conclusion

It can be quite surprising when two different views on the same data suddenly start disagreeing, especially when you’ve never seen that sort of thing happen before.

Luckily for us, both views were actually correct, and it just took a little digging to understand what was going on. There was a moment there where we started thinking that maybe we’d found a bug in CloudWatch (or the Load Balancer), but realistically, at this point in the lifecycle of AWS, that sort of thing is pretty unlikely, especially considering that neither of those systems are exactly new or experimental.

More importantly, now we’re aware of the weirdly high CPU usage on the underlying EC2 instances, so we’re going to have to deal with that.

Its not like we can auto scale them to deal with the traffic.