
Well, it’s been almost 2 years now since I made a post about Sensu as a generic alerting/alarming mechanism. It ended on a hopeful note, explaining that the content of the post was mostly theoretical and that we hoped to put some of it in place in the coming weeks or months.

Yeah, that never happened.

It’s not like we didn’t have any alerts or alarms during that time; we just never followed through on the whole theme of “let’s put something together to yell at us whenever weird stuff happens in our ELK stack”. We’ve been using Pingdom to monitor HTTP endpoints and websites ever since our first service went live, and we’ve been slowly increasing our usage of CloudWatch alarms, but all of that juicy intelligence in the ELK stack is still languishing in alerting limbo.

Until now.

Attention Deficit Disorder

As I’ve previously outlined, we have a wealth of information available in our ELK stack, including IIS logs, application logs, system statistics for infrastructure (e.g. memory, CPU, disk space), ELB logs and various intelligence events (like “user used feature X”).

This information has proven to be incredibly valuable for general analysis (bug identification and resolution is a pretty common case), but historically the motivation to dig into the logs has come through some other channel, like a customer complaining via our support team, or someone just noticing that “hey, this thing doesn’t look right”.

It’s all very reactive, and in the past we’ve missed early warning signs for long enough that an issue ended up affecting real people, which is sloppy at best.

We can do better.

Ideally, what we need to do is identify symptoms or leading indicators that things are starting to go wrong or degrade, and then dynamically alert the appropriate people when those things are detected, so we can action them ASAP. In a perfect world, these sorts of triggers would be identified and put in place as an integral part of feature delivery, but for now it would be enough that they exist at some point.

And that’s where Elastalert comes in.

It’s Not That We Can’t Pay Attention

Elastalert is a relatively straightforward piece of software that lets you take action when the data in an Elasticsearch cluster meets certain criteria.

It was created at Yelp to work in conjunction with their ELK stack for exactly the purpose we’re chasing, so it’s basically a perfect fit.

Also, it’s free.

Elastic.co offers an alerting solution of its own, in the form of X-Pack Alerting (formerly Watcher). As far as I know it’s pretty amazing, and it integrates smoothly with Kibana. However, it costs money, and it’s one of those things where you have to request a quote rather than just read a price off a website, so you know it’s expensive. I think we looked into it briefly, but I can’t remember what the actual price would have been for us. I remember it being crazy though.

The Elastalert documentation is pretty awesome, but at a high level the tool offers a number of different rule types for triggering alerts (frequency, spike, flatline and so on) and a number of notification channels (like Hipchat, Slack and email) to use when an alert fires.

All of the configuration is YAML based, which is a pretty common format these days, and each rule is just a file, so it’s all easy to manage.
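
For context, the rules all sit alongside a single global configuration file that points Elastalert at the cluster and controls how often it runs its queries. The following is a hypothetical, minimal sketch of that config.yaml (the hostname and values are placeholders, not our actual settings):

# Minimal illustrative config.yaml; all values are placeholders
rules_folder: rules                   # directory containing rule files like the one below
run_every:                            # how often Elastalert queries Elasticsearch
  seconds: 30
buffer_time:                          # how far back each query reaches, to catch late-arriving documents
  minutes: 15
es_host: elasticsearch.example.com    # the Elasticsearch cluster to query
es_port: 9200
writeback_index: elastalert_status    # index Elastalert uses to record its own state and alert history
alert_time_limit:                     # how long to keep retrying a failed alert
  days: 2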

Here’s an example rule that we use for detecting spikes in the number of 5xx response codes occurring across our services:

name: Spike in 5xxs
type: spike
index: logstash-*

timeframe:
  seconds: @@ELASTALERT_CHECK_FREQUENCY_SECONDS@@

spike_height: 2                     # trigger when the current count is at least double the reference count
spike_type: up                      # only care about upward spikes
threshold_cur: @@general-spike-5xxs.yaml.threshold_cur@@    # minimum current count before a spike can trigger

filter:
- query:
    query_string:
      query: "Status: [500 TO 599]"
alert: "hipchat"
alert_text_type: alert_text_only
alert_text: |
  <b>{0}</b>
  <a href="@@KIBANA_URL@@">5xxs spiked {1}x. Was {2} in the last timeframe, compared to {3} in the previous one</a>
hipchat_message_format: html
hipchat_from: Elastalert
hipchat_room_id: "@@HIPCHAT_ROOM@@"
hipchat_auth_token: "@@HIPCHAT_TOKEN@@"
alert_text_args:
- name
- spike_height
- spike_count
- reference_count

The only thing in the rule above not covered extensively in the documentation is the @@SOMETHING@@ notation that we use to do some substitutions during deployment. I’ll talk about that in more detail later, but essentially it’s just a way to customise the rules on a per-environment basis without having to rewrite the entire rule (so CI rules can execute every 30 seconds over the last 4 hours, while production might check every few minutes over the last hour, and so on).
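
As a purely illustrative example (these are not our real values), the timeframe fragment above would be committed to source control as:

timeframe:
  seconds: @@ELASTALERT_CHECK_FREQUENCY_SECONDS@@

and might come out the other side of a deployment to CI looking like:

timeframe:
  seconds: 30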

There’s Just More Important Thi….Oh A Butterfly!

With the general introduction to Elastalert out of the way, the plan for this series of posts is eerily similar to what I did for the ELK stack refresh.

Hopefully I can put together a publicly accessible repository on Github with all of the Elastalert work in it before the end of this series of posts, but I can’t make any promises. It’s pretty time consuming to take one of our internal repositories and sanitise it for consumption by the greater internet, even if the end result is pretty useful.

To Be Continued

Before I finish up, I should make it clear that we’ve already implemented the Elastalert stuff, so it’s not in the same boat as our plans for Sensu. We’re literally using Elastalert right now to yell at us whenever interesting things happen in our ELK stack, and it’s already proven to be quite useful in that respect.

Next week, I’ll go through the Elastalert environment we set up, and why the Elastalert application and Amazon Linux EC2 instances don’t get along very well.