Full disclosure: most of the Elastalert-related work was actually done by a colleague of mine; I’m just writing about it because I thought it was interesting.

Last week I did a bit of an introduction to Elastalert, as it is the new mechanism that we use to alert on the data in our ELK stack.

We take our infrastructure pretty seriously though, so I didn’t want to just manually create an Elastalert instance and set it up by hand. It all needs to be codified and controlled, with a deployment pipeline for distributing changes (like new or modified rules) and everything versioned as appropriate.

After doing some very high level playing around (just to make sure it all worked relatively as advertised), it was time to do it properly and set up an auto-scaling, auto-healing Elastalert environment, just like all of the other ones.

Packing It Away

Installing Elastalert is pretty straightforward.

It’s all Python based, so it’s a fairly simple matter to use pip to install the package:

pip install elastalert

This doesn’t quite work out of the box on an Amazon Linux EC2 instance though, as you also have to install some dependencies that are not immediately obvious.

# Bring the base packages up to date.
sudo yum update -y;
# Install the compiler toolchain and headers needed to build
# Elastalert's native Python dependencies.
sudo yum install gcc gcc-c++ -y;
sudo yum install libffi-devel -y;
sudo yum install openssl-devel -y;
# Install Elastalert itself.
sudo pip install elastalert;

With that out of the way, the machine is basically ready to run Elastalert, assuming you configure it correctly (as per the documentation).
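As a very rough sketch, a minimal config.yaml looks something like this (the host, interval and index values here are illustrative assumptions, not our actual settings):

# config.yaml: minimal Elastalert configuration (illustrative values)
rules_folder: rules                 # directory scanned for rule files
run_every:
  minutes: 5                        # how often to query Elasticsearch
buffer_time:
  minutes: 15                       # window of data considered each run
es_host: elasticsearch.example.com  # the ES cluster to query
es_port: 9200
writeback_index: elastalert_status  # where Elastalert keeps its own state
alert_time_limit:
  days: 2                           # how long to retry failed alerts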

With a relatively self contained installation script out of the way, it was time to use Packer to create an AMI containing Elastalert, to be used inside the impending environment.

The Packer configuration for an AMI with Elastalert installed on it is pretty straightforward, and just follows the normal pattern, which I described in this post and which you can see directly in this Github repository. The only meaningful difference is the script that installs Elastalert itself, which you can see above.
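For illustration, a cut-down version of that Packer template might look something like this (the region, source AMI and script path are placeholders, not the real values from the repository):

{
  "builders": [
    {
      "type": "amazon-ebs",
      "region": "ap-southeast-2",
      "source_ami": "ami-xxxxxxxx",
      "instance_type": "t2.micro",
      "ssh_username": "ec2-user",
      "ami_name": "elastalert-{{timestamp}}"
    }
  ],
  "provisioners": [
    {
      "type": "shell",
      "script": "scripts/install-elastalert.sh"
    }
  ]
}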

Cumulonimbus Clouds Are My Favourite

With an AMI created and ready to go, all that’s left is to create a simple environment to run it in.

Nothing fancy, just a CloudFormation template with a single Auto Scaling Group in it, such that accidental or unexpected terminations self-heal. No need for a load balancer, DNS entries or anything like that; it’s a purely background process that sits quietly and yells at us as appropriate.
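The interesting part of the template is really just the Auto Scaling Group itself; a minimal sketch (assuming a separately defined launch configuration parameterised with the Elastalert AMI, with illustrative names and timeout) looks like this:

Resources:
  ElastalertAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: "1"
      MaxSize: "1"
      LaunchConfigurationName: !Ref ElastalertLaunchConfiguration
      VPCZoneIdentifier: !Ref PrivateSubnetIds
    # The deployment only completes when the instance signals success,
    # which becomes important shortly.
    CreationPolicy:
      ResourceSignal:
        Count: 1
        Timeout: PT15M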

Again, this is a problem that we’ve solved before, and we have a decent pattern in place for putting this sort of thing together.

  • A dedicated repository for the environment, containing the CloudFormation template, configuration and deployment logic
  • A TeamCity Build Configuration, which uses the contents of this repository and builds and tests a versioned package
  • An Octopus project, which contains all of the logic necessary to target the deployment, along with any environment-level variables (like the target ES cluster)

The good news was that the standard environment stuff worked perfectly. It built, a package was created and that package was deployed.

The bad news was that the deployment never actually completed successfully, because the Elastalert AMI failed to produce a working EC2 instance, which meant that the Auto Scaling Group never received a success signal and the environment failed miserably.

But why?

Snakes Are Tricky

It actually took us a while to get to the bottom of the problem, because Elastalert appeared to be fully functional at the end of the Packer process, but the AMI created from that EC2 instance seemed to be fundamentally broken.

Any EC2 instance created from that AMI just didn’t work, regardless of how we used it (i.e. CloudFormation vs manual instance creation, nothing mattered).

The instance would be created and it would “go green” (i.e. the AWS status checks and whatnot would complete successfully) but we couldn’t connect to it using any of the normal mechanisms (SSH using the specified key being the most obvious). It was like none of the normal EC2 setup was being executed, which was weird, because we’ve created many different AMIs through Packer and we hadn’t done anything differently this time.

Looking at the system log for the broken EC2 instances (via the AWS Dashboard) we could see that the core setup procedure of the EC2 instance (where it uses the supplied key file to setup access among other things) was failing due to problems with Python.
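The same log is available from the command line as well, if you don’t feel like clicking through the dashboard; something like the following (with a hypothetical instance ID) will get you the raw output:

# Fetch the system log for a broken instance (the ID is illustrative).
aws ec2 get-console-output --instance-id i-0123456789abcdef0 --output text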

What else uses Python?

That’s right, Elastalert.

It turned out that our Elastalert installation script was updating some Python dependencies that the EC2 initialization process (cloud-init, which is also written in Python) relied on, and those updates had completely broken the normal setup procedure.

The AMI was functionally useless.

Dock Worker

We went through a few different approaches to try and fix the underlying dependency conflicts, but in the end we settled on using Docker.

At a very high level, Docker is a kind of virtualization platform, except it doesn’t virtualize the entire OS; it sits a little bit above that, virtualizing a set of applications instead, leveraging the underlying OS rather than simulating the entire thing. Each Docker container generally hosts a single application in a completely isolated environment, which makes it the perfect solution when you have system software conflicts like we did.

Of course, we had to change our strategy somewhat in order for this to work.

Instead of using Packer to create an AMI with Elastalert installed, we now have to create an AMI with Docker (and Octopus) installed and available.
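The Docker half of that installation script is mercifully simple on Amazon Linux; a sketch of the relevant commands (the Octopus Tentacle installation is omitted here):

# Install and start Docker, then let the default user run docker
# commands without sudo.
sudo yum install docker -y;
sudo service docker start;
sudo usermod -a -G docker ec2-user;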

Same pattern as before, just different software being installed.

Nothing much changed in the environment though, as it’s still just an Auto Scaling Group spinning up an EC2 instance using the specified AMI.

The big changes were in the Elastalert configuration deployment, which now had to be responsible for both deploying the actual configuration and making sure the Elastalert Docker image was correctly configured and running.
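Conceptually that boils down to something like the following (the image name and mount paths are purely illustrative assumptions; the actual mechanics are the subject of the next post):

# Run Elastalert in a container, mounting the deployed configuration
# and rules in from the host.
docker run -d \
    --name elastalert \
    --restart unless-stopped \
    -v /opt/elastalert/config.yaml:/opt/elastalert/config.yaml \
    -v /opt/elastalert/rules:/opt/elastalert/rules \
    example/elastalert:latest;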

To Be Continued

And that is as good a place as any to stop for now.

Next week I’ll explain what our original plan was for the Elastalert configuration deployment and how that changed when we switched to using Docker to host an Elastalert image.