Where Have All The Good Shards Gone

March 20. 2018 0 Comments

A wild technical post appears!

This weeks post returns to a topic very close to my heart, the Elasticsearch, Logstash and Kibana (ELK) Stack that we use for log aggregation. As you might be able to tell from my post history, logging, metrics and business intelligence rank pretty high on my list of priorities, regardless of any other focuses I might have. To me, if you don’t have good intelligence, you might as well be fighting in the dark, flailing about in the hopes that you hit something important.

This post specifically is about the process by which we deploy new versions of Elasticsearch, and an issue that can occur when you do rolling deployments and the Elasticsearch cluster is hosted in AWS.

Version Control

Way back in August 2017 I wrote about automating the deployment of new Elasticsearch versions to our ELK stack.

Long story short, the part of that post that it relevant to this one is the bit about unassigned shards in Elasticsearch when rebalancing after a version upgrade. Specifically, if you have nodes that are at a later version of Elasticsearch than others (which is normal when doing a rolling deployment), and the later version node is elected to hold the primary shard, replicas cannot be assigned to any of the nodes with the lower version.

This is troublesome if you’re waiting for a cluster to go green before progressing to the next node replacement, because unassigned shards equal a yellow cluster. You’ll be waiting forever (or you’ll hit your timeout because you were smart enough to put a timeout in, right?).

Without some additional change, the system will never reach a state of equilibrium.

La La La I Can’t Hear You

To extrapolate on the content of the initial post, the solution was to check that all remaining unassigned shards were version conflicts whenever an appropriate end state is reached. An end state would be something like a timeout waiting for the cluster to go green, or maybe something fancier like “number of unassigned shards has not changed over a period of time.

If the only unassigned shards left are version conflicts, its relatively safe to just continue on with the process and let Elasticsearch sort it out (which it will once all of the nodes are replaced). There is minimal risk of data loss (the primary shards are all guaranteed to exist in order for this problem to happen anyway), and each time a new node comes online, the cluster will rebalance into a better state anyway.

The script for checking for version conflicts is below:

function Get-UnassignedShards
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$elasticsearchUrl
    )

    $shards = Invoke-RestMethod -Method GET -Uri "$elasticsearchUrl/_cat/shards" -Headers @{"accept"="application/json"} -Verbose:$false;
    $unassigned = $shards | Where-Object { $_.state -eq "UNASSIGNED" };

    return $unassigned;
}

function Test-AllUnassignedShardsAreVersionConflicts
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$elasticsearchUrl
    )

    Write-Verbose "Getting all UNASSIGNED shards, to see if all of them are UNASSIGNED because of version conflicts";

    $unassigned = Get-UnassignedShards -elasticsearchUrl $elasticsearchUrl;

    foreach ($unassignedShard in $unassigned)
    {
        $primary = "true";
        if ($unassignedShard.prirep -eq "r")
        {
            $primary = "false";
        }
        $explainBody = "{ `"index`": `"$($unassignedShard.index)`", `"shard`": $($unassignedShard.shard), `"primary`": $primary }";
        Write-Verbose "Getting allocation explanation using query [$explainBody]";
        $explain = Invoke-RestMethod -Method POST -Uri "$elasticsearchUrl/_cluster/allocation/explain" -Headers @{"accept"="application/json"} -Body $explainBody -Verbose:$false;

        $versionConflictRegex = "target node version.*is older than the source node version.*";
        $sameNodeConflictRegex = "the shard cannot be allocated to the same node on which a copy of the shard already exists";
        $explanations = @();
        foreach ($node in $explain.node_allocation_decisions)
        {
            foreach ($decider in $node.deciders)
            {
                $explanations += @{Node=$node.node_name;Explanation=$decider.explanation};
            }
        }

        foreach ($explanation in $explanations)
        {
            if ($explanation.explanation -notmatch $versionConflictRegex -and $explanation.explanation -notmatch $sameNodeConflictRegex)
            {
                Write-Verbose "The node [$($explanation.Node)] in the explanation for shard [$($unassignedShard.index):$($unassignedShard.shard)] contained an allocation decider explanation that was unacceptable (i.e. not version conflict and not same node conflict). It was [$($explanation.Explanation)]";
                return $false;
            }
        }
    }

    return $true;
}

In The Zone…Or Out Of It?

This solution works really well for the specific issue that it was meant to detect, but to absolutely no-ones surprise, it doesn’t work so well for other problems.

Case in point, if your Elasticsearch cluster is AWS Availability Zone aware, then you can encounter a very similar problem to what I’ve just described, except with availability zone conflicts instead of version conflicts.

An availability zone aware Elasticsearch cluster will avoid putting shard replicas in the same availability zone as the primary (within reason), which is just another way to protect itself against losing data in the event of a catastrophic failure. I’m sure you can disable the functionality, but that seems like a relatively sane safety measure, so I’m not sure why you would.

Unfortunately, when combined with version conflicts also preventing shard allocation, you can be left in a situation where there is no appropriate place to dump a shard, so our deployment process can’t move on because the cluster never goes green.

Interestingly enough, there are two possible solutions for this:

The first is to be more careful about the order that you annihilate nodes in. Alternating availability zones is the way to go here, but this approach can get complicated if you’re also dealing with version conflicts at the same time. Also, it doesn’t really work all that well if you don’t have a full complement of nodes (with redundancy) spread about both availability zones.
The second is to just replicate the version conflict solution above, except for unassigned shards as a result of availability zone conflicts. This is by far the easier and less fiddly approach, assuming that the entire deployment finishes (so the cluster can rebalance as appropriate)

I haven’t actually updated our deployment since I discovered the issue, but my current plan is to go with the second option and see how far I get.

Conclusion

This is one of those cases where I knew that the initial solution that I put into place would probably not be enough over the long term, but there was just no value in trying to engineer for every single eventuality right at the start.

Also, truth be told, I didn’t know that Elasticsearch took AWS Availability Zones into account when allocating shards, so that was a bit of a surprise anyway.

Thinking about the actual deployment process some more, it might be easier to scale up, wait for a rebalance and then scale down again, terminating the oldest (and thus earlier version) nodes after all the new ones have already come online. The downside to this approach is mostly just time (because you have to wait for 2N rebalances, instead of just N rebalances (where N is the number of nodes), but it feels like it might be more robust in the face of unexpected weirdness.

Which, from my experience, I should probably just start expecting from now on, as it (ironically) seems like the one constant in software.

We’re Finally Paying Attention, Conclusion

November 21. 2017 0 Comments

Full disclosure, most of the Elastalert related work was actually done by a colleague of mine, I’m just writing about it because I thought it was interesting.

Unfortunately, this post brings me to the end of all the Elastalert goodness, at least for now.

Like I said right at the start (and embedded in the post titles), we’re finally paying attention to the wealth of information inside our ELK stack. Well, we aren’t really paying attention to everything right now, but when we notice something or even realize ahead of time that “it would be good if we got told when this happens” we actually have somewhere to put that logic.

I’ll call that a victory.

Anyway, to bring it all full circle:

The first post was a general and gentle introduction to Elastalert, rules and why we even bothered with this whole process
The second post explained how we put together the underlying infrastructure in AWS to support our Elastalert environments
The third post outlined how we actually manage the configuration for Elastalert (i.e. the rules and notifications) and the deployment thereof
Finally, this post is basically just a capstone, drawing everything together and finishing it all off

To be honest, when you look at what we’ve done for Elastalert from a distance, it looks suspiciously similar to the ELK stack (specifically the Elasticsearch segment).

I don’t necessarily think that’s a bad thing though. Honestly, I think we’ve just found a pattern that works for us, so rather than reinventing the wheel each time, we just roll with it.

Consistency is a quality all on its own.

Rule The World

Its actually been almost a couple of months now since we put this all together, and people are slowly starting to incorporate different rules to notify us when interesting things happen.

A good example of this sort of thing is with one of our new features.

As a general rule of thumb, we try our best to include dedicated business intelligence events into the software for whatever features we develop, including major checkpoints like starting, finishing and failure. One of our recent features also raised a “configured” event, which indicated when a customer had put in the specific configuration necessary for the feature to be enabled (it was a third party integration, so required an externally provided API key to function).

We added a rule to detect when this relatively rare event occurred, and now we get a notification whenever someone configures the new feature. This sort of thing is useful when you still have a relatively small number of people coming online (so you can keep tabs on them and follow through to see if they are experiencing any issues), but we’ll probably turn it off one usage picks up so we’re not constantly being spammed.

Recently a customer came online with the new feature, but never followed up with actual usage beyond the initial configuration, so we were able to flag this with the relevant parties (like their account manager) and investigate why that was happening and how we could help.

Without Elastalert, we never would have known, even though the information was actually available for all to see.

Breaking All The Rules

Of course, no series of blog posts would be complete without noting down some potential ways in which we could improve the thing we literally just finished putting together.

I mean, we could barely call ourselves engineers if we weren’t already engineering a better version in our heads before the paint had even dried on the first one.

There are two areas that I think could use improvement, but neither of them are particularly simple:

The architecture that we put together is high availability, even though it is self healing. There is only one Elastalert instance and we don’t really have particularly good protection against that instance being “alive” according to AWS but not actually evaluating rules. We should probably put some more effort into detecting issues with Elastalert so that the AWS Auto Scaling Group self healing can kick in at the appropriate times. I don’t think we can really do anything about side-by-side redundancy though, as Elastalert isn’t really designed to be a distributed alerting system. Two copies would probably just raise two alerts which would get annoying quickly.
There is no real concept of an alert getting worse over time, like there is with some other alerting platforms. Pingdom is a good example of this, though its alerts are a lot simpler (pretty much just up/down). If a website is down, different actions get triggered based on the length of the downtime. We use this sort of approach to first send a note to Hipchat, then to email, then to SMS some relevant parties in a natural progression. Elastalert really only seems to have on/off, as opposed to a schedule of notifications. You could probably accomplish the same thing by having multiple similar rules with different criteria, but that sounds like a massive pain to manage moving forward. This is something that will probably have to be done at the Elastalert level, and I doubt it would be a trivial change, so I’m not going to hold my breath.

Having said that, the value that Elastalert provides in its current state is still astronomically higher that having nothing, so who am I to complain?

Conclusion

When all is said and done, I’m pretty happy that we finally have the capability to alert of our ELK stack.

I mean, its not like the data was going to waste before we had that capability, it just feels better knowing that we don’t always have to be watching in order to find out when interesting things happen.

I know I don’t have time to watch the ELK stack all day, and I doubt anyone else does.

Thought it is awfully pretty to look at.

We’re Finally Paying Attention, Part 3

November 14. 2017 0 Comments

Full disclosure, most of the Elastalert related work was actually done by a colleague of mine, I’m just writing about it because I thought it was interesting.

Continuing with the Elastalert theme, its time to talk configuration and the deployment thereof.

Last week I covered off exactly how we put together the infrastructure for the Elastalert stack. It wasn’t anything fancy (AMI through Packer, CloudFormation template deployed via Octopus), but there were some tricksy bits relating to Python conflicts between Elastalert and the built-in AWS EC2 initialization scripts.

With that out of the way, we get into the meatiest part of the process; how we manage the configuration of Elastalert, i.e. the alerts themselves.

The Best Laid Plans

When it comes to configuring Elastalert, there are basically only two things to worry about; the overall configuration and the rules and actions that make up the alerts.

The overall configuration covers things like where to find Elasticsearch, which Elasticsearch index to write results into, high level execution timings and so on. All that stuff is covered clearly in the documentation, and there aren’t really any surprises.

The rules are where it gets interesting. There are a wide variety of ways to trigger actions off the connected Elasticsearch cluster, and I provided an example in the initial blog post of this series. I’m not going to go into too much detail about the rules and their structure or capabilities because the documentation goes into that sort of thing at length. For the purposes of this post, the main thing to be aware of is that each rule is fully encapsulated within a file.

The nice thing about everything being inside files is that it makes deployment incredibly easy.

All you have to do is identify the locations where the files are expected to be and throw the new ones in, overwriting as appropriate. If you’re dealing with a set of files its usually smart to clean out the destination first (so deletions are handled correctly), but its still pretty straightforward.

When we started on the whole Elastalert journey, the original plan was for a simple file copy + service restart.

Then Docker came along.

No Plan Survives Contact With The Enemy

To be fair, even with Docker, the original plan was still valid.

All of the configuration was still file based, so deployment was still as simple as copying some files around.

Mostly.

Docker did complicate a few things though. Instead of Elastalert being installed, we had to run an Elastalert image inside a Docker container.

Supplying the configuration files to the Elastalert container isn’t hard. When starting the container you just map certain local directories to directories in the container and it all works pretty much as expected. As long as the files exist in a known place after deployment, you’re fine.

However, in order to “restart” Elastalert, you have to find and murder the container you started last time, and then start up a new one so it will capture the new configuration files and environment variables correctly.

This is all well and good, but even after doing that you only really know whether or not the container itself is running, not necessarily the Elastalert process inside the container. If your config is bad in some way, the Elastalert process won’t start, even though the container will quite happily keep chugging along. So you need something to detect if Elastalert itself is up inside the container.

Putting all of the above together, you get something like this:

echo -e "STEP: Stop and remove existing docker containers..."
echo "Checking for any existing docker containers"
RUNNING_CONTAINERS=$(docker ps -a -q)
if [ -n "$RUNNING_CONTAINERS" ]; then
    echo "Found existing docker containers."
    echo "Stopping the following containers:"
    docker stop $(docker ps -a -q)
    echo "Removing the following containers:"
    docker rm $(docker ps -a -q)
    echo "All containers removed"
else
    echo "No existing containers found"
fi
echo -e "...SUCCESS\n"

echo -e "STEP: Run docker container..."
ELASTALERT_CONFIG_FILE="/opt/config/elastalert.yaml"
SUPERVISORD_CONFIG_FILE="/opt/config/supervisord.conf"
echo "Elastalert config file: $ELASTALERT_CONFIG_FILE"
echo "Supervisord config file: $SUPERVISORD_CONFIG_FILE"
echo "ES HOST: $ES_HOST"
echo "ES PORT: $ES_PORT"
docker run -d \
    -v $RUN_DIR/config:/opt/config \
    -v $RUN_DIR/rules:/opt/rules \
    -v $RUN_DIR/logs:/opt/logs \
    -e "ELASTALERT_CONFIG=$ELASTALERT_CONFIG_FILE" \
    -e "ELASTALERT_SUPERVISOR_CONF=$SUPERVISORD_CONFIG_FILE" \
    -e "ELASTICSEARCH_HOST=$ES_HOST" \
    -e "ELASTICSEARCH_PORT=$ES_PORT" \
    -e "SET_CONTAINER_TIMEZONE=true" \
    -e "CONTAINER_TIMEZONE=$TIMEZONE" \
    --cap-add SYS_TIME \
    --cap-add SYS_NICE $IMAGE_ID
if [ $? != 0 ]; then
    echo "docker run command returned a non-zero exit code."
    echo -e "...FAILED\n"
    exit -1
fi
CID=$(docker ps --latest --quiet)
echo "Elastalert container with ID $CID is now running"
echo -e "...SUCCESS\n"

echo -e "STEP: Checking for Elastalert process inside container..."
echo "Waiting 10 seconds for elastalert process"
sleep 10
if docker top $CID | grep -q elastalert; then
    echo "Found running Elastalert process. Nice."
else
    echo "Did not find elastalert running"
    echo "You can view logs for the container with: docker logs -f $CID"
    echo "You can shell into the container with: docker exec -it $CID sh"
    echo -e "...FAILURE\n"
    exit -1
fi
echo -e "...SUCCESS\n"

But wait, there’s more!

Environmental Challenges

Our modus operandi is to have multiple copies of our environments (CI, Staging, Production) which form something of a pipeline for deployment purposes. I’ve gone through this sort of thing in the past, the most recent occurrence of which was when I wrote about rebuilding the ELK stack. Its a fairly common pattern, but it does raise some interesting challenges, especially around configuration.

For Elastalert specifically, each environment should have the same baseline behaviour (rules, timings, etc), but also different settings for things like where the Elasticsearch cluster is located, or which Hipchat room notifications go to.

When using Octopus Deploy, the normal way to accomplish this is to have variables defined in your Octopus Deploy project that are scoped to the environments being deployed to, and then leverage some of the built in substitution functionality to do replacements in whatever files need to be changed.

This works great at first, but has a few limitations:

You now have two places to look when trying to track changes, which can become a bit of a pain. Its much nicer to be able to view all of the changes (barring sensitive credentials of course) in your source control tool of choice.
You can’t easily develop and test the environment outside of Octopus, especially if your deployment is only valid after passing through a successful round of substitutions in Octopus Deploy.

Keeping those two things in mind, we now lean towards having all of our environment specific parameters and settings in configuration files in source control (barring sensitive variables, which require some additional malarkey), and then loading the appropriate file based on some high level flags that are set either by Octopus or in the local development environment.

For Elastalert specifically we settled into having a default configuration file (which is always loaded) and then environment specific overrides. Which environment the deployment is executing in is decided by the following snippet of code:

echo -e "STEP: Determining Environmnet..."
if [ "$(type -t get_octopusvariable)" = function ]; then
    echo "get_octopusvariable function is defined => assuming we are running on Octopus"
    ENVIRONMENT=$(get_octopusvariable "Octopus.Environment.Name")
elif [ -n "$ENVIRONMENT" ]; then
    echo "--environment command line option was used"
else
    echo "Not running on Octopous and no --environment command line option used. Using 'Default'"
    ENVIRONMENT="Default"
fi
echo "ENVIRONMENT=$ENVIRONMENT"
echo -e "...SUCCESS\n"

Once the selection of the environment is out of the way, the deployed files are mutated by executing a substitution routine written in Python which does most of the heavy lifting (replacing any tokens of the format @@KEY@@ in the appropriate files).

To Be Continued

I’ve covered the two biggest challenges in the deployment of our Elastalert configuration, but I’ve glossed over quite a few pieces of the process because covering the entire thing in this blog post would make it way too big.

The best way to really understand how it works is to have a look at the actual repository.

With both the environment and configuration explained, all that is really left to do is bring it all together, and explain some areas that I think could use improvement.

That’s a job for next week though.

We’re Finally Paying Attention, Part 2

November 7. 2017 0 Comments

Full disclosure, most of the Elastalert related work was actually done by a colleague of mine, I’m just writing about it because I thought it was interesting.

Last week I did a bit of an introduction to Elastalert, as it is the new mechanism that we use to alert on the data in our ELK stack.

We take our infrastructure pretty seriously though, so I didn’t want to just manually create an Elastalert instance and set up it up to do things. It all needs to be codified and controlled, with a deployment pipeline for distributing changes (like new rules or changed rules) and everything needs to be versioned as appropriate.

After doing some very high level playing around (just to make sure it all worked relatively as advertised), it was time to do it properly and set up an auto-scaling, auto-healing Elastalert environment, just like all of the other ones.

Packing It Away

Installing Elastalert is pretty straightforward.

Its all Python based, so its a fairly simple matter to use pip to install the package:

pip install elastalert

This doesn’t quite work out of the box on an Amazon Linux EC2 instance though, as you have to also install some dependencies that are not immediately obvious.

sudo yum update -y;
sudo yum install gcc gcc-c++ -y;
sudo yum install libffi-devel -y;
sudo yum install openssl-devel -y;
sudo pip install elastalert;

With that out of the way, the machine is basically ready to run Elastalert, assuming you configure it correctly (as per the documentation).

With a relatively self contained installation script out of the way, it was time to create an AMI containing using Packer, to be used inside the impending environment.

The Packer configuration for an AMI with Elastalert installed on it is pretty straightforward, and just follows the normal pattern, which I described in this post and which you can see directly in this Github repository. The only meaningful difference is the script that installs Elastalert itself, which you can see above.

Cumulonimbus Clouds Are My Favourite

With an AMI created and ready to go, all that’s left is to create a simple environment to run it in.

Nothing fancy, just a CloudFormation template with a single auto scaling group in it, such that accidental or unexpected terminations self-heal. No need for a load balancer, DNS entries or anything like that, its a purely background process that sits quietly and yells at us as appropriate.

Again, this is a problem that we’ve solved before, and we have a decent pattern in place for putting this sort of thing together.

A dedicated repository for the environment, containing the CloudFormation template, configuration and deployment logic
A TeamCity Build Configuration, which uses the contents of this repository and builds and tests a versioned package
An Octopus project, which contains all of the logic necessary to target the deployment, along with any environment level variables (like target ES cluster)

The good news was that the standard environment stuff worked perfectly. It built, a package was created and that package was deployed.

The bad news was that the deployment never actually completed successfully because the Elastalert AMI failed to result in a working EC2 instance, which meant that the environment failed miserably as the Auto Scaling Group never received a success signal.

But why?

Snakes Are Tricky

It actually took us a while to get to the bottom of the problem, because Elastalert appeared to be fully functional at the end of the Packer process, but the AMI created from that EC2 instance seemed to be fundamentally broken.

Any EC2 instance created from that AMI just didn’t work, regardless of how we used it (i.e. CloudFormation vs manual instance creation, nothing mattered).

The instance would be created and it would “go green” (i.e. the AWS status checks and whatnot would complete successfully) but we couldn’t connect to it using any of the normal mechanisms (SSH using the specified key being the most obvious). It was like none of the normal EC2 setup was being executed, which was weird, because we’ve created many different AMIs through Packer and we hadn’t done anything differently this time.

Looking at the system log for the broken EC2 instances (via the AWS Dashboard) we could see that the core setup procedure of the EC2 instance (where it uses the supplied key file to setup access among other things) was failing due to problems with Python.

What else uses Python?

That’s right, Elastalert.

It turned out that by our Elastalert installation script was updating some dependencies that the EC2 initialization was relied on, and those updates had completely broken the normal setup procedure.

The AMI was functionally useless.

Dock Worker

We went through a few different approaches to try and fix the underlying dependency conflicts, but in the end we settled on using Docker.

At a very high level, Docker is a kind of a virtualization platform, except it doesn’t virtualize the entire OS and instead sits a little bit above that, virtualizing a set of applications instead, leveraging the OS rather than simulating the entire thing. Each Docker image generally hosts a single application in a completely isolated environment, which makes it the perfect solution when you have system software conflicts like we did.

Of course, we had to change our strategy somewhat in order for this to work.

Instead of using Packer to create an AMI with Elastalert installed, we now have to create an AMI with Docker (and Octopus) installed and available.

Same pattern as before, just different software being installed.

Nothing much changed in the environment though, as its still just an Auto Scaling Group spinning up an EC2 instance using the specified AMI.

The big changes were in the Elastalert configuration deployment, which now had to be responsible for both deploying the actual configuration and making sure the Elastalert docker images was correctly configured and running.

To Be Continued

And that is as good a place as any to stop for now.

Next week I’ll explain what our original plan was for the Elastalert configuration deployment and how that changed when we switched to using Docker to host an Elastalert image.

We’re Finally Paying Attention, Part 1

October 31. 2017 0 Comments

Well, its been almost 2 years now since I made a post about Sensu as a generic alerting/alarming mechanism. It ended on a hopeful note, explaining that the content of the post was relatively theoretical and that we hoped to put some of it in place in the coming weeks/months.

Yeah, that never happened.

Its not like we didn’t have any alerts or alarms during that time, we just never continued on with the whole theme of “lets put something together to yell at us whenever weird stuff happens in our ELK stack”. We’ve been using Pingdom ever since our first service went live (to monitor HTTP endpoints and websites) and we’ve been slowly increasing our usage of CloudWatch alarms, but all of that juicy intelligence in the ELK stack is still languishing in alerting limbo.

Until now.

Attention Deficit Disorder

As I’ve previously outlined, we have a wealth of information available in our ELK stack, including things like IIS logs, application logs, system statistics for infrastructure (i.e. memory, CPU, disk space, etc), ELB logs and various intelligence events (like “user used feature X”).

This information has proven to be incredibly valuable for general analysis (bug identification and resolution is a pretty common case), but historically the motivation to start using the logs occurs through some other channel, like a customer complaining via our support team someone just noticing that “hey, this thing doesn’t look right”.

Its all very reactive, and we’ve missed early warning signs in the past such that an issue affected real people, which is sloppy at best.

We can do better.

Ideally what we need to do is identify symptoms or leading indicators that things are starting to go wrong or degrade, and then dynamically alerted the appropriate people when these things are detected, so we can action them ASAP. In a perfect world, these sorts of triggers would be identified and put in place as an integral part of the feature delivery, but for now it would be enough that they just exist at some point in time.

And that’s where Elastalert comes in.

Its Not That We Can’t Pay Attention

Elastalert is a relatively straightforward piece of installed software that allows you to do things when the data in an Elasticsearch cluster meets certain criteria.

It was created at Yelp to work in conjunction with their ELK stack for exactly the purpose that we’re chasing, so its basically a perfect fit.

Also its free.

Elastic.co offers an alerting solution themselves, in the form of X-Pack Alerting (formerly Watcher). As far as I know its pretty amazing, and integrates smoothly with Kibana. However, it costs money, and its one of those things where you actually have to request a quote, rather than just being a price on a website, so you know its expensive. I think we looked into it briefly, but I can’t remember what the actual price would have been for us. I remember it being crazy though.

The Elastalert documentation is pretty awesome, but at a high level the tool offers a number of different ways to trigger alerts and a number of notification channels (like Hipchat, Slack, Email, etc) to execute when an alert is triggered.

All of the configuration is YAML based, which is a pretty common format these days, and all of the rules are just files, so its easy to manage.

Here’s an example rule that we use for detecting spikes in the amount of 50X response codes occurring for any of our services:

name: Spike in 5xxs
type: spike
index: logstash-*

timeframe:
  seconds: @@ELASTALERT_CHECK_FREQUENCY_SECONDS@@

spike_height: 2
spike_type: up
threshold_cur: @@general-spike-5xxs.yaml.threshold_cur@@

filter:
- query:
    query_string:
      query: "Status: [500 TO 599]"
alert: "hipchat"
alert_text_type: alert_text_only
alert_text: |
  <b>{0}</b>
  <a href="@@KIBANA_URL@@">5xxs spiked {1}x. Was {2} in the last {3}, compared to {4} the previous {3}</a>
hipchat_message_format: html
hipchat_from: Elastalert
hipchat_room_id: "@@HIPCHAT_ROOM@@"
hipchat_auth_token: "@@HIPCHAT_TOKEN@@"
alert_text_args:
- name
- spike_height
- spike_count
- reference_count

The only thing in the rule above not covered extensively in the documentation is the @@SOMETHING@@ notation that we use to do some substitutions during deployment. I’ll talk about that a little bit later, but essentially its just a way to customise the rules on a per environment basis without having to rewrite the entire rule (so CI rules can execute every 30 seconds over the last 4 hours, but production might check every few minutes over the last hour and so on).

There’s Just More Important Thi….Oh A Butterfly!

With the general introduction to Elastalert out of the way, the plan for this series of posts is eerily similar to what I did for the ELK stack refresh.

Hopefully I can put together a publicly accessible repository in Github with all of the Elastalert work in it before the end of this series of posts, but I can’t make any promises. Its pretty time consuming to take one of our internal repositories and sanitized it for consumption by the greater internet, even if it is pretty useful.

To Be Continued

Before I finished up, I should make it clear that we’ve already implemented the Elastalert stuff, so its not in the same boat as our plans for Sensu. We’re literally using Elastalert right now to yell at us whenever interesting things happen in our ELK stack and its already proven to be quite useful in that respect.

Next week, I’ll go through the Elastalert environment we set up, and why the Elastalert application and Amazon Linux EC2 instances don’t get along very well.