
Full disclosure: most of the Elastalert related work was actually done by a colleague of mine; I’m just writing about it because I thought it was interesting.

Last week I did a bit of an introduction to Elastalert, as it is the new mechanism that we use to alert on the data in our ELK stack.

We take our infrastructure pretty seriously though, so I didn’t want to just manually create an Elastalert instance and set it up to do things. It all needs to be codified and controlled, with a deployment pipeline for distributing changes (like new or changed rules), and everything needs to be versioned as appropriate.

After doing some very high level playing around (just to make sure it all worked relatively as advertised), it was time to do it properly and set up an auto-scaling, auto-healing Elastalert environment, just like all of the other ones.

Packing It Away

Installing Elastalert is pretty straightforward.

It’s all Python based, so it’s a fairly simple matter to use pip to install the package:

pip install elastalert

This doesn’t quite work out of the box on an Amazon Linux EC2 instance though, as you have to also install some dependencies that are not immediately obvious.

sudo yum update -y;
# build tools, needed to compile the native extensions in some of Elastalert's Python dependencies
sudo yum install gcc gcc-c++ -y;
# headers required by the compiled crypto/cffi related packages further down the dependency chain
sudo yum install libffi-devel -y;
sudo yum install openssl-devel -y;
sudo pip install elastalert;

With that out of the way, the machine is basically ready to run Elastalert, assuming you configure it correctly (as per the documentation).
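For reference, a minimal config.yaml along the lines of the one in the Elastalert documentation looks something like this (the host, timings and index name here are illustrative, not our actual values):

# Illustrative values only - see the Elastalert documentation for the full set of options
rules_folder: rules
run_every:
  seconds: 30
buffer_time:
  minutes: 15
es_host: elasticsearch.example.com
es_port: 9200
writeback_index: elastalert_status
alert_time_limit:
  days: 2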

With a relatively self-contained installation script out of the way, it was time to use Packer to create an AMI containing Elastalert, to be used inside the impending environment.

The Packer configuration for an AMI with Elastalert installed on it is pretty straightforward, and just follows the normal pattern, which I described in this post and which you can see directly in this Github repository. The only meaningful difference is the script that installs Elastalert itself, which you can see above.
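For completeness, the overall shape of that Packer template is roughly as follows (the region, source AMI and script path are placeholders; the real thing follows the pattern in the repository mentioned above):

{
  "builders": [
    {
      "type": "amazon-ebs",
      "region": "ap-southeast-2",
      "source_ami": "ami-xxxxxxxx",
      "instance_type": "t2.micro",
      "ssh_username": "ec2-user",
      "ami_name": "elastalert-{{timestamp}}"
    }
  ],
  "provisioners": [
    {
      "type": "shell",
      "script": "./scripts/install-elastalert.sh"
    }
  ]
}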

Cumulonimbus Clouds Are My Favourite

With an AMI created and ready to go, all that’s left is to create a simple environment to run it in.

Nothing fancy, just a CloudFormation template with a single Auto Scaling Group in it, such that accidental or unexpected terminations self-heal. No need for a load balancer, DNS entries or anything like that; it’s a purely background process that sits quietly and yells at us as appropriate.

Again, this is a problem that we’ve solved before, and we have a decent pattern in place for putting this sort of thing together.

  • A dedicated repository for the environment, containing the CloudFormation template, configuration and deployment logic
  • A TeamCity Build Configuration, which uses the contents of this repository and builds and tests a versioned package
  • An Octopus project, which contains all of the logic necessary to target the deployment, along with any environment level variables (like target ES cluster)
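The piece of the template that matters for the rest of this story is the Auto Scaling Group and its creation policy, which is where the “success signal” mentioned below comes from. A trimmed sketch (resource names, sizes and timeout are illustrative, not our actual template):

"ElastalertAutoScalingGroup": {
  "Type" : "AWS::AutoScaling::AutoScalingGroup",
  "Properties" : {
    "LaunchConfigurationName" : { "Ref" : "ElastalertLaunchConfiguration" },
    "MinSize" : "1",
    "MaxSize" : "1",
    "VPCZoneIdentifier" : { "Ref" : "PrivateSubnets" }
  },
  "CreationPolicy" : {
    "ResourceSignal" : {
      "Count" : "1",
      "Timeout" : "PT15M"
    }
  }
}

The instance is expected to signal CloudFormation (via cfn-signal) once its setup finishes; if that signal never arrives before the timeout, the stack creation fails.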

The good news was that the standard environment stuff worked perfectly. It built, a package was created and that package was deployed.

The bad news was that the deployment never actually completed successfully, because the Elastalert AMI failed to produce a working EC2 instance; the Auto Scaling Group never received a success signal, so the environment failed miserably.

But why?

Snakes Are Tricky

It actually took us a while to get to the bottom of the problem, because Elastalert appeared to be fully functional at the end of the Packer process, but the AMI created from that EC2 instance seemed to be fundamentally broken.

Any EC2 instance created from that AMI just didn’t work, regardless of how we used it (i.e. CloudFormation vs manual instance creation, nothing mattered).

The instance would be created and it would “go green” (i.e. the AWS status checks and whatnot would complete successfully) but we couldn’t connect to it using any of the normal mechanisms (SSH using the specified key being the most obvious). It was like none of the normal EC2 setup was being executed, which was weird, because we’ve created many different AMIs through Packer and we hadn’t done anything differently this time.

Looking at the system log for the broken EC2 instances (via the AWS Dashboard) we could see that the core setup procedure of the EC2 instance (where it uses the supplied key file to set up access, among other things) was failing due to problems with Python.

What else uses Python?

That’s right, Elastalert.

It turned out that our Elastalert installation script was updating some Python dependencies that the EC2 initialization relied on, and those updates had completely broken the normal setup procedure.

The AMI was functionally useless.

Dock Worker

We went through a few different approaches to try and fix the underlying dependency conflicts, but in the end we settled on using Docker.

At a very high level, Docker is a kind of virtualization platform, except that it doesn’t virtualize the entire OS; it sits a little bit above that, isolating applications from one another while leveraging the host OS rather than simulating the whole thing. Each Docker container generally hosts a single application in a completely isolated environment, which makes it the perfect solution when you have system software conflicts like we did.

Of course, we had to change our strategy somewhat in order for this to work.

Instead of using Packer to create an AMI with Elastalert installed, we now have to create an AMI with Docker (and Octopus) installed and available.

Same pattern as before, just different software being installed.
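The Docker half of that installation is nothing special on Amazon Linux, something along these lines (the Octopus side depends entirely on how you register Linux deployment targets, so I’ll leave it out):

sudo yum update -y;
sudo yum install docker -y;
sudo service docker start;
sudo chkconfig docker on;
sudo usermod -a -G docker ec2-user;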

Nothing much changed in the environment though, as it’s still just an Auto Scaling Group spinning up an EC2 instance using the specified AMI.

The big changes were in the Elastalert configuration deployment, which now had to be responsible for both deploying the actual configuration and making sure the Elastalert Docker image was correctly configured and running.
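As a rough preview of what “correctly configured and running” means, the deployment essentially has to end up with something equivalent to the following (the image name and paths are placeholders; the actual mechanics are the subject of the next post):

docker run -d \
  --name elastalert \
  --restart unless-stopped \
  -v /opt/elastalert/config.yaml:/opt/elastalert/config.yaml \
  -v /opt/elastalert/rules:/opt/elastalert/rules \
  our-registry/elastalert:latest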

To Be Continued

And that is as good a place as any to stop for now.

Next week I’ll explain what our original plan was for the Elastalert configuration deployment and how that changed when we switched to using Docker to host an Elastalert image.


Well, it’s been almost 2 years now since I made a post about Sensu as a generic alerting/alarming mechanism. It ended on a hopeful note, explaining that the content of the post was relatively theoretical and that we hoped to put some of it in place in the coming weeks/months.

Yeah, that never happened.

It’s not like we didn’t have any alerts or alarms during that time; we just never continued on with the whole theme of “let’s put something together to yell at us whenever weird stuff happens in our ELK stack”. We’ve been using Pingdom ever since our first service went live (to monitor HTTP endpoints and websites) and we’ve been slowly increasing our usage of CloudWatch alarms, but all of that juicy intelligence in the ELK stack is still languishing in alerting limbo.

Until now.

Attention Deficit Disorder

As I’ve previously outlined, we have a wealth of information available in our ELK stack, including things like IIS logs, application logs, system statistics for infrastructure (i.e. memory, CPU, disk space, etc), ELB logs and various intelligence events (like “user used feature X”).

This information has proven to be incredibly valuable for general analysis (bug identification and resolution is a pretty common case), but historically the motivation to start using the logs has come through some other channel, like a customer complaining via our support team or someone just noticing that “hey, this thing doesn’t look right”.

It’s all very reactive, and in the past we’ve missed early warning signs and let issues affect real people, which is sloppy at best.

We can do better.

Ideally what we need to do is identify symptoms or leading indicators that things are starting to go wrong or degrade, and then dynamically alert the appropriate people when these things are detected, so we can action them ASAP. In a perfect world, these sorts of triggers would be identified and put in place as an integral part of feature delivery, but for now it would be enough that they just exist at some point in time.

And that’s where Elastalert comes in.

It’s Not That We Can’t Pay Attention

Elastalert is a relatively straightforward piece of software that you install and run yourself, which allows you to do things when the data in an Elasticsearch cluster meets certain criteria.

It was created at Yelp to work in conjunction with their ELK stack for exactly the purpose that we’re chasing, so it’s basically a perfect fit.

Also, it’s free.

Elastic.co offers an alerting solution themselves, in the form of X-Pack Alerting (formerly Watcher). As far as I know it’s pretty amazing, and integrates smoothly with Kibana. However, it costs money, and it’s one of those things where you actually have to request a quote, rather than just being a price on a website, so you know it’s expensive. I think we looked into it briefly, but I can’t remember what the actual price would have been for us. I remember it being crazy though.

The Elastalert documentation is pretty awesome, but at a high level the tool offers a number of different ways to trigger alerts and a number of notification channels (like Hipchat, Slack, Email, etc) to execute when an alert is triggered.

All of the configuration is YAML based, which is a pretty common format these days, and all of the rules are just files, so it’s easy to manage.

Here’s an example rule that we use for detecting spikes in the amount of 50X response codes occurring for any of our services:

name: Spike in 5xxs
type: spike
index: logstash-*

timeframe:
  seconds: @@ELASTALERT_CHECK_FREQUENCY_SECONDS@@

spike_height: 2
spike_type: up
threshold_cur: @@general-spike-5xxs.yaml.threshold_cur@@

filter:
- query:
    query_string:
      query: "Status: [500 TO 599]"
alert: "hipchat"
alert_text_type: alert_text_only
alert_text: |
  <b>{0}</b>
  <a href="@@KIBANA_URL@@">5xxs spiked {1}x. Was {2} in the last {3}, compared to {4} the previous {3}</a>
hipchat_message_format: html
hipchat_from: Elastalert
hipchat_room_id: "@@HIPCHAT_ROOM@@"
hipchat_auth_token: "@@HIPCHAT_TOKEN@@"
alert_text_args:
- name
- spike_height
- spike_count
- reference_count

The only thing in the rule above not covered extensively in the documentation is the @@SOMETHING@@ notation that we use to do some substitutions during deployment. I’ll talk about that a little bit later, but essentially it’s just a way to customise the rules on a per environment basis without having to rewrite the entire rule (so CI rules can execute every 30 seconds over the last 4 hours, but production might check every few minutes over the last hour and so on).
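Conceptually the substitution is nothing more complicated than a find and replace over the rule files at deployment time, something like the following (illustrative only; the real thing happens as part of the deployment, using environment specific variables):

# Illustrative only - each @@TOKEN@@ is replaced with an environment specific value
sed -i "s|@@KIBANA_URL@@|${KIBANA_URL}|g" rules/*.yaml;
sed -i "s|@@HIPCHAT_ROOM@@|${HIPCHAT_ROOM}|g" rules/*.yaml;
sed -i "s|@@HIPCHAT_TOKEN@@|${HIPCHAT_TOKEN}|g" rules/*.yaml;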

There’s Just More Important Thi….Oh A Butterfly!

With the general introduction to Elastalert out of the way, the plan for this series of posts is eerily similar to what I did for the ELK stack refresh.

Hopefully I can put together a publicly accessible repository in Github with all of the Elastalert work in it before the end of this series of posts, but I can’t make any promises. It’s pretty time consuming to take one of our internal repositories and sanitise it for consumption by the greater internet, even if it is pretty useful.

To Be Continued

Before I finish up, I should make it clear that we’ve already implemented the Elastalert stuff, so it’s not in the same boat as our plans for Sensu. We’re literally using Elastalert right now to yell at us whenever interesting things happen in our ELK stack, and it’s already proven to be quite useful in that respect.

Next week, I’ll go through the Elastalert environment we set up, and why the Elastalert application and Amazon Linux EC2 instances don’t get along very well.


Alerting on potential problems before they become real problems is something that we have been historically bad at. We’ve got a wealth of information available both through AWS and in our ELK Stack, but we’ve never really put together anything to use that information to notify us when something interesting happens.

In an effort to alleviate that, we’ve recently started using CloudWatch alarms to do some basic alerting. I’m not exactly sure why I shied away from them to begin with to be honest, as they are easy to set up and maintain. It might have been that the old way that we managed our environments didn’t lend itself to easy, non-destructive updates, making tweaking and creating alarms a relatively onerous process.

Regardless of my initial hesitation, when I was polishing the ELK stack environment recently I made sure to include some CloudWatch alarms to let us know if certain pieces fell over.

The alerting setup isn’t anything special:

  • Dedicated SNS topics for each environment (i.e. CI, Staging, Production, Scratch)
  • Each environment has two topics, one for alarms and one for notifications
  • The entire team is signed up to emails from both Staging and Production, but CI/Scratch are optional

Email topic subscriptions are enough for us for now, but we have plans to use the topics to trigger messages in HipChat as well.
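If you wanted to define one of those topics in CloudFormation, it only takes a few lines (the name and address here are placeholders, and I’m not claiming this is exactly how ours are provisioned):

"ProductionAlarmsTopic" : {
  "Type" : "AWS::SNS::Topic",
  "Properties" : {
    "TopicName" : "elk-production-alarms",
    "Subscription" : [
      { "Endpoint" : "team@example.com", "Protocol" : "email" }
    ]
  }
}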

I started with some basic alarms, like unhealthy instances > 0 for Elasticsearch/Logstash. Both of those components expose HTTP endpoints that can easily be pinged, so if they stop responding to traffic for any decent amount of time, we want to be notified. Other than some tweaks to the Logstash health check (I tuned it too tightly initially, so it was going off whenever a HTTP request took longer than 5 seconds), these alarms have served us well in picking up things like Logstash/Elasticsearch crashing or not starting properly.

As far as Elasticsearch is concerned, this is pretty much enough.

With Logstash being Logstash though, more work needed to be done.

Transport Failure

As a result of the issues I had with HTTP outputs in Logstash, we’re still using the Logstash TCP input/output combo.

The upside of this is that it works.

The downside is that sometimes the TCP input on the Broker side seems to crash and get into an unrecoverable state.

That wouldn’t be so terrible if it took Logstash with it, but it doesn’t. Instead, Logstash continues to accept HTTP requests, and just fails all of the TCP stuff. I’m not actually sure if the log events received via HTTP during this time are still processed through the pipeline correctly, but all of the incoming TCP traffic is definitely black holed.

As a result of the HTTP endpoint continuing to return 200 OK, the alarms I set up for unhealthy instances completely fail to pick up this particular issue.

In fact, because of the nature of TCP traffic through an ELB, and the relatively poor quality of TCP metrics, it can be very difficult to tell at a glance whether or not it’s working. We can’t use requests or latency, because they have no bearing on TCP traffic, and we certainly can’t use status codes (obviously). Maybe network traffic, but that doesn’t seem like the best idea due to natural traffic fluctuations.

The only metric that I could find was “Backend Connection Errors”. This metric appears to be a measure of how many low level connection errors occurred between the ELB and the underlying EC2 instances, and seems like a good fit. Even better, when the Logstash TCP input falls over, it is this metric that actually changes, as all of the TCP traffic being forwarded through the ELB fails miserably.

One simple alarm initialization through CloudFormation later, and we were safe and secure in the knowledge that the next time the TCP input fell over, we wouldn’t find out about it 2 days later.

"BrokerLoadBalancerBackendConnectionErrorsAlarm": {
  "Type" : "AWS::CloudWatch::Alarm",
  "Properties" : {
    "AlarmName" : { "Fn::Join": [ "", [ "elk-broker-", { "Ref": "OctopusEnvironment" }, "-backend-errors" ] ] },
    "AlarmDescription" : "Alarm for when there is an increase in the backend connection errors on the Broker load balancer, typically indicating a problem with the Broker EC2 instances. Suggest restarting them",
    "AlarmActions" : [ { "Ref" : "AlarmsTopicARN" } ],
    "OKActions": [ { "Ref" : "AlarmsTopicARN" } ],
    "TreatMissingData": "notBreaching",
    "MetricName" : "BackendConnectionErrors",
    "Namespace" : "AWS/ELB",
    "Statistic" : "Maximum",
    "Period" : "60",
    "EvaluationPeriods" : "5",
    "Threshold" : "100",
    "ComparisonOperator" : "GreaterThanThreshold",
    "Dimensions" : [ {
      "Name" : "LoadBalancerName",
      "Value" : { "Ref" : "BrokerLoadBalancer" }
    } ]
  }
}

Mistakes Were Made

Of course, a few weeks later the TCP input crashed again and we didn’t find out for 2 days.

But why?

It turns out that the only statistic worth a damn for Backend Connection Errors when alerting is SUM, and I created the alarm on the MAXIMUM, assuming that it would act like requests and other metrics (which give you the maximum number of X that occurred during the time period). Graphing the maximum backend connection errors during a time period where the TCP input was broken gives a flat line at y = 1, which is definitely not greater than the threshold of 100 that I entered.

I switched the alarm to SUM and as far as I can see the next time the TCP input goes down, we should get a notification.

But I’ve been fooled before.

Conclusion

I’ll be honest, even though I did make a mistake with this particular CloudWatch alarm, I’m glad that we started using them.

Easy to implement (via our CloudFormation templates) and relatively easy to use, they provide an element of alerting on our infrastructure that was sorely missing. I doubt we will go about making thousands of alarms for all of the various things we want to be alerted on (like increases in 500s and latency problems), but we’ll definitely include a few alarms in each stack we create to yell at us when something simple goes wrong (like unhealthy instances).

To bring it back to the start, I think one of the reasons we hesitated to use CloudWatch alarms was that we were waiting for the all-singing, all-dancing alerting solution based off the wealth of information in our log stack, but I don’t think that is going to happen in a hurry.

It’s been years already.


So, we have all these logs now, which is awesome. Centralised and easily searchable, containing lots of information relating to application health, feature usage, infrastructure utilization along with many other things I’m not going to bother to list here.

But who is actually looking at them?

The answer to that question is something we’ve been struggling with for a little while. Sure, we go into Kibana and do arbitrary searches, we use dashboards, we do our best to keep an eye on things like CPU usage, free memory, number of errors and so on, but we often have other things to do. Nobody has a full time job of just watching this mountain of data for interesting patterns and problems.

We’ve missed things:

  • We had an outage recently that was the result of a disk filling up with log files because an old version of the log management script had a bug in it. The free disk space was clearly trending down when you looked at the Kibana dashboard, but it was happening so gradually that it was never really brought up as a top priority.
  • We had a different outage recently where we had a gradual memory leak in the non-paged memory pool on some of our API instances. Similar to above, we were recording free memory and it was clearly dropping over time, but no-one noticed.

There have been other instances (like an increase in the total number of 500s being returned from an API, indicating a bug), but I won’t go into too much more detail about the fact that we miss things. We’re human, we have other things to do, it happens.

Instead, let’s attack the root of the issue: the human element.

We can’t reasonably expect anyone to keep an eye on all of the data hurtling towards us. It’s too much. On the other hand, all of the above problems could have easily been detected by a computer; all we need is something that can do the analysis for us, and then let us know when there is something to action. It doesn’t have to be incredibly fancy (no learning algorithms….yet), all it has to do is be able to compare a few points in time and alert off a trend in the wrong direction.

One of my colleagues was investigating solutions to this problem, and they settled on Sensu.

Latin: In The Sense Of

I won’t go into too much detail about Sensu here, because I think the documentation will do a much better job than I will.

My understanding of it, however, is that it is a very generic, messaging based check/handle system, where a check can be almost anything (run an Elasticsearch query, go get some current system stats, respond to some incoming event) and a handler is an arbitrary reaction (send an email, restart a server, launch the missiles).

Sensu has a number of components, including servers (wiring logic, check -> handler), clients (things that get checks executed on them) and an API (separate from the server). All communication happens through RabbitMQ and there is some usage of Redis for storage (which I’m not fully across yet).

I am by no means any sort of expert in Sensu, as I did not implement our current proof of concept. I am, however, hopefully going to use it to deal with some of the alerting problems that I outlined above.

The first check/handler to implement?

Alert us via email/SMS when the available memory on an API instance is below a certain threshold.

Alas, I have not actually done this yet. This post is more of an outline of the conceptual approach, and I will return later with more information about how it actually worked (or didn’t work).

Throw Out Broken Things

One of the things that you need to come to terms with early when using AWS is that everything will break. It might be your fault, it might not be, but you should accept right from the beginning that at some point, your things will break. This is good in a way, because it forces you to not have any single points of failure (unless you are willing to accept the risk that they might go down and you will have an outage, which is a business decision).

I mention this because the problem with the memory in our API instances that I mentioned above is pretty mysterious. It’s not being taken by any active process (regardless of user), so it looks like a driver problem. It could be one of those weird AWS things (there are a lot), and it goes away if you reboot, so the easiest solution is to just recycle the bad API instance and move on. It’s already in an auto-scaling group for redundancy, and there is always more than one, so it’s better to just murder it, relax, and let the ASG do its work.
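The manual version of that recycle is a one liner with the AWS CLI (the instance id is obviously a placeholder), with the Auto Scaling Group taking care of spinning up the replacement:

aws autoscaling terminate-instance-in-auto-scaling-group --instance-id i-0123456789abcdef0 --no-should-decrement-desired-capacity;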

Until we’re comfortable automating that sort of recycling, we’ll settle for an alert that someone can use to make a decision and execute the recycle themselves.

By installing the Sensu client on the machines in question (incorporating it into the environment setup itself), we can create a check that allows us to remotely query the available free memory and compare it against some configured value that we deem too low (let’s say 100MB). We can then configure 2 handlers for the check result, one that emails a set of known addresses and another that does the same for SMS.
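In Sensu terms, that translates into a check definition that the server schedules against subscribed clients, something along these lines (the plugin name, thresholds and handler names are illustrative, since we haven’t actually built it yet):

{
  "checks": {
    "check-free-memory": {
      "command": "check-memory.rb -w 250 -c 100",
      "subscribers": [ "api" ],
      "interval": 60,
      "handlers": [ "email", "sms" ]
    }
  }
}

The email and sms handlers would then be defined separately on the server, each pointing at whatever script actually sends the notification.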

Seems simple enough. I wonder if it will actually be that simple in practice.

Summary

Alerting on your aggregate of information (logs, stats, etc) is a pretty fundamental ability that you need to have.

AWS does provide some alerting in the form of CloudWatch alarms, but we decided to investigate a different (more generic) route instead, mostly because of the wealth of information that we already had available inside our ELK stack (and our burning desire to use it for something other than pretty graphs).

As I said earlier, this post is more of an outline of how we plan to attack the situation using Sensu, so it’s a bit light on details I’m afraid.

I’m sure the followup will be amazing though.

Right?