
Another week, another post.

Stepping away from the data synchronization algorithm for a bit, it’s time to explore some AWS weirdness.

Well, strictly speaking, AWS is not really all that weird. It’s just that sometimes things happen, and they don’t immediately make any sense, and it takes a few hours and some new information before you realise “oh, that makes perfect sense”.

That intervening period can be pretty confusing though.

Today’s post is about a disagreement between a Load Balancer and a CloudWatch alarm about just how healthy some EC2 instances were.

Schrödinger's Symptoms

At the start of this whole adventure, we got an alert email from CloudWatch saying that one of our Load Balancers contained some unhealthy instances.

This happens from time to time, but it’s one of those things that you need to get to the bottom of quickly, just in case it’s the first sign of something more serious.

In this particular case, the alert was for one of the very small number of services whose infrastructure we manually manage. That is, the service is hosted on hand-crafted EC2 instances, manually placed into a load balancer. That means no auto scaling group, so no capability to self-heal or scale if necessary, at least not without additional effort. Not the greatest situation to be in, but sometimes compromises must be made.

Our first step for this sort of thing is typically to log into the AWS dashboard and have a look around, starting with the entities involved in the alert.

After logging in and checking the Load Balancer though, everything seemed fine. No unhealthy instances were being reported. Fair enough, maybe the alarm had already reset and we just hadn’t gotten the follow-up email (it’s easy enough to forget to configure an alert to send emails when the alarm returns to the OK state).

But no, checking on the CloudWatch alarm showed that it was still triggered.

Two views on the same information: one says “BAD THING HAPPENING”, the other says “nah, it’s cool, I’m fine”.

But which one was right?

Diagnosis: Aggregations

When you’re working with instance health, one of the most painful things is that AWS does not offer detailed logs showing the results of Load Balancer health checks. Sure, you can see whether or not there were any healthy/unhealthy instances, but if you want to know how the machines are actually responding, you pretty much just have to execute the health check yourself and see what happens.

In our case, that meant hitting the EC2 instances directly over HTTP with a request for /index.html.

Unfortunately (fortunately?) everything appeared to be working just fine, and each request returned 200 OK (which is what the health check expects).
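For context, a classic ELB health check is not much more than a target URL plus some timing and threshold settings. In CloudFormation it looks something like the following sketch (the port, intervals and thresholds here are illustrative, not necessarily what this particular load balancer was configured with):

"HealthCheck" : {
  "Target" : "HTTP:80/index.html",
  "Interval" : "30",
  "Timeout" : "5",
  "HealthyThreshold" : "2",
  "UnhealthyThreshold" : "3"
}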

Our suspicion then fell on the CloudWatch alarm itself. Perhaps we had misconfigured it and it wasn’t doing what we expected? Maybe it was treating missing data as an error and alarming on that. That would still indicate a problem of some sort, but it would at least confirm that the instances were functional, which is what everything else appeared to be saying.

The alarm was correctly configured though, saying the equivalent of “alert when there are more than 0 unhealthy instances in the last 5 minutes”.
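For reference, an alarm like that looks something like the following CloudFormation sketch. The resource names (ServiceLoadBalancer, the alarm itself) are placeholders rather than our real ones, and the statistic turns out to be important shortly:

"ServiceLoadBalancerUnhealthyInstancesAlarm" : {
  "Type" : "AWS::CloudWatch::Alarm",
  "Properties" : {
    "AlarmDescription" : "Alarm for when the load balancer reports one or more unhealthy instances",
    "AlarmActions" : [ { "Ref" : "AlarmsTopicARN" } ],
    "MetricName" : "UnHealthyHostCount",
    "Namespace" : "AWS/ELB",
    "Statistic" : "Average",
    "Period" : "300",
    "EvaluationPeriods" : "1",
    "Threshold" : "0",
    "ComparisonOperator" : "GreaterThanThreshold",
    "Dimensions" : [ {
      "Name" : "LoadBalancerName",
      "Value" : { "Ref" : "ServiceLoadBalancer" }
    } ]
  }
}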

We poked around a bit into the other metrics available on the Load Balancer (request count, latency, errors, etc) and discovered that latency had spiked a bit, request count had dropped a bit and there were a tonne of backend errors, so something was definitely wrong.

Returning to the alarm, we noticed that the aggregation on the data point was “average”, which meant that it was actually saying “alert when there are more than 0 unhealthy instances on average over the last 5 minutes”. It’s not obvious what the average value of a health check is over time, but changing the aggregation to minimum showed that there were zero unhealthy instances over the same time period, and changing it to maximum showed that all four of the instances were unhealthy over the same time period.
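To make that concrete with some illustrative numbers: if the one minute UnHealthyHostCount samples over a 5 minute window were 0, 4, 0, 4 and 0, then the minimum is 0 (looks perfectly healthy), the maximum is 4 (all four instances look broken) and the average is 1.6, which is what was actually being compared against the threshold of 0.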

Of course, this meant that the instances were flapping between up and down, which meant the health checks were sometimes failing and sometimes succeeding.

It was pure chance that the Load Balancer had never shown any unhealthy instances whenever we looked at it directly, and similarly that the health endpoint had always responded appropriately whenever we hit it manually. The CloudWatch alarm remembered though, and the average of [healthy, healthy, unhealthy] was unhealthy as far as it was concerned, so it was alerting correctly.

Long story short, both views of the data were strictly correct, and were just showing slightly different interpretations.

Cancerous Growth

The root cause of the flapping was exceedingly high CPU usage on the EC2 instances in question, which was leading to timeouts.

We’d done a deployment earlier that day, and it had increased the overall CPU usage of two of the services hosted on those instances by enough that the machines were starting to strain with the load.

More concerning though was the fact that the deployment had only really spiked the CPU from 85% to 95-100%. We’d actually been running those particular instances hot for quite a while.

In fact, looking back at the CPU usage over time, there was a clear series of step changes leading up to the latest increase, and we just hadn’t been paying attention.

Conclusion

It can be quite surprising when two different views on the same data suddenly start disagreeing, especially when you’ve never seen that sort of thing happen before.

Luckily for us, both views were actually correct, and it just took a little digging to understand what was going on. There was a moment there where we started thinking that maybe we’d found a bug in CloudWatch (or the Load Balancer), but realistically, at this point in the lifecycle of AWS, that sort of thing is pretty unlikely, especially considering that neither of those systems are exactly new or experimental.

More importantly, now we’re aware of the weirdly high CPU usage on the underlying EC2 instances, so we’re going to have to deal with that.

It’s not like we can auto scale them to deal with the traffic.


Alerting on potential problems before they become real problems is something that we have been historically bad at. We’ve got a wealth of information available both through AWS and in our ELK Stack, but we’ve never really put together anything to use that information to notify us when something interesting happens.

In an effort to alleviate that, we’ve recently started using CloudWatch alarms to do some basic alerting. I’m not exactly sure why I shied away from them to begin with to be honest, as they are easy to set up and maintain. It might have been that the old way that we managed our environments didn’t lend itself to easy, non-destructive updates, making tweaking and creating alarms a relatively onerous process.

Regardless of my initial hesitation, when I was polishing the ELK stack environment recently I made sure to include some CloudWatch alarms to let us know if certain pieces fell over.

The alerting setup isn’t anything special:

  • Dedicated SNS topics for each environment (i.e. CI, Staging, Production, Scratch)
  • Each environment has two topics, one for alarms and one for notifications
  • The entire team is signed up to emails from both Staging and Production, but CI/Scratch are optional

Email topic subscriptions are enough for us for now, but we have plans to use the topics to trigger messages in HipChat as well.
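There is not much to one of these topics either. In CloudFormation it is little more than a name and a list of subscriptions, something like the following sketch (the topic name and email address are placeholders, not our real ones):

"ProductionAlarmsTopic" : {
  "Type" : "AWS::SNS::Topic",
  "Properties" : {
    "TopicName" : "alarms-production",
    "Subscription" : [
      { "Protocol" : "email", "Endpoint" : "team@example.com" }
    ]
  }
}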

I started with some basic alarms, like unhealthy instances > 0 for Elasticsearch/Logstash. Both of those components expose HTTP endpoints that can easily be pinged, so if they stop responding to traffic for any decent amount of time, we want to be notified. Other than some tweaks to the Logstash health check (I tuned it too tightly initially, so it was going off whenever an HTTP request took longer than 5 seconds), these alarms have served us well in picking up things like Logstash/Elasticsearch crashing or not starting properly.

As far as Elasticsearch is concerned, this is pretty much enough.

With Logstash being Logstash though, more work needed to be done.

Transport Failure

As a result of the issues I had with HTTP outputs in Logstash, we’re still using the Logstash TCP input/output combo.

The upside of this is that it works.

The downside is that sometimes the TCP input on the Broker side seems to crash and get into an unrecoverable state.

That wouldn’t be so terrible if it took Logstash with it, but it doesn’t. Instead, Logstash continues to accept HTTP requests, and just fails all of the TCP stuff. I’m not actually sure if the log events received via HTTP during this time are still processed through the pipeline correctly, but all of the incoming TCP traffic is definitely black holed.

As a result of the HTTP endpoint continuing to return 200 OK, the alarms I set up for unhealthy instances completely fail to pick up this particular issue.

In fact, because of the nature of TCP traffic through an ELB, and the relatively poor quality of TCP metrics, it can be very difficult to tell whether or not it’s working at a glance. Can’t use requests or latency, because they have no bearing on TCP traffic, and certainly can’t use status codes (obviously). Maybe network traffic, but that doesn’t seem like the best idea due to natural traffic fluctuations.

The only metric that I could find was “Backend Connection Errors”. This metric appears to be a measure of how many low level connection errors occurred between the ELB and the underlying EC2 instances, and seems like a good fit. Even better, when the Logstash TCP input falls over, it is this metric that actually changes, as all of the TCP traffic being forwarded through the ELB fails miserably.

One simple alarm initialization through CloudFormation later, and we were safe and secure in the knowledge that the next time the TCP input fell over, we wouldn’t find out about it 2 days later.

"BrokerLoadBalancerBackendConnectionErrorsAlarm": {
  "Type" : "AWS::CloudWatch::Alarm",
  "Properties" : {
    "AlarmName" : { "Fn::Join": [ "", [ "elk-broker-", { "Ref": "OctopusEnvironment" }, "-backend-errors" ] ] },
    "AlarmDescription" : "Alarm for when there is an increase in the backend connection errors on the Broker load balancer, typically indicating a problem with the Broker EC2 instances. Suggest restarting them",
    "AlarmActions" : [ { "Ref" : "AlarmsTopicARN" } ],
    "OKActions": [ { "Ref" : "AlarmsTopicARN" } ],
    "TreatMissingData": "notBreaching",
    "MetricName" : "BackendConnectionErrors",
    "Namespace" : "AWS/ELB",
    "Statistic" : "Maximum",
    "Period" : "60",
    "EvaluationPeriods" : "5",
    "Threshold" : "100",
    "ComparisonOperator" : "GreaterThanThreshold",
    "Dimensions" : [ {
      "Name" : "LoadBalancerName",
      "Value" : { "Ref" : "BrokerLoadBalancer" }
    } ]
  }
}

Mistakes Were Made

Of course, a few weeks later the TCP input crashed again and we didn’t find out for 2 days.

But why?

It turns out that the only statistic worth a damn for Backend Connection Errors when alerting is SUM, and I created the alarm on the MAXIMUM, assuming that it would act like requests and other metrics (which give you the maximum number of X that occurred during the time period). Graphing the maximum backend connection errors during a time period where the TCP input was broken gives a flat line at y = 1, which is definitely not greater than the threshold of 100 that I entered.

I switched the alarm to SUM and as far as I can see the next time the TCP input goes down, we should get a notification.

But I’ve been fooled before.
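If you’re wondering what the change actually looks like, the relevant properties in the alarm above end up something like this (I’ve assumed the rest of the configuration, including the threshold of 100, stays the same, with the threshold now being compared against the total number of connection errors in each one minute period):

"Statistic" : "Sum",
"Period" : "60",
"EvaluationPeriods" : "5",
"Threshold" : "100",
"ComparisonOperator" : "GreaterThanThreshold",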

Conclusion

I’ll be honest, even though I did make a mistake with this particular CloudWatch alarm, I’m glad that we started using them.

Easy to implement (via our CloudFormation templates) and relatively easy to use, they provide an element of alerting on our infrastructure that was sorely missing. I doubt we will go about making thousands of alarms for all of the various things we want to be alerted on (like increases in 500s and latency problems), but we’ll definitely include a few alarms in each stack we create to yell at us when something simple goes wrong (like unhealthy instances).

To bring it back to the start, I think one of the reasons we hesitated to use CloudWatch alarms was that we were waiting for the all-singing, all-dancing alerting solution built on the wealth of information in our log stack, but I don’t think that is going to happen in a hurry.

It’s been years already.