Are you monitoring a production environment with a system that raises alerts when something strange happens?

Do you also have some sort of pre-production or staging environment where you smoke test changes before pushing them live?

Does the pre-production environment have exactly the same monitoring as the production environment?

It really should.

Silence Definitely Doesn’t Equal Consent

A few months back we were revisiting one of our older APIs.

We’d just made a release into the pre-production environment and verified that the API was doing what it needed to do with a quick smoke test. Everything seemed fine.

We promoted to production, and a few moments later a whole bunch of alarms went off as they detected a swathe of errors occurring in the backend. I honestly can’t remember what the nature of the errors was, but they were real problems that were causing subtle failures in the migration process.

Of course, we were surprised, because we didn’t receive any such indication from the pre-production environment.

When we dug into it a bit deeper though, exactly the same things were happening in pre-prod; the environment was just missing the equivalent alarms.


I’m honestly not sure how we got ourselves into this particular situation, where there was a clear difference in behaviour between two environments that should be basically identical. Perhaps the production environment was manually tweaked to include different alarms? I’m not as familiar with the process used for this API as I am for others (Jenkins and Ansible vs TeamCity and Octopus Deploy), but regardless of the technology involved, it’s easy to accidentally fall into the trap of “I’ll just manually create this alarm here” when you’re in the belly of the beast during a production incident.

Thar be dragons that way though.

Ideally you should be treating your infrastructure as code and deploying it similarly to how you deploy your applications. Of course, this assumes you have an equivalent, painless deployment pipeline for your infrastructure, which can be incredibly difficult to put together.

We’ve had some good wins in the past with this approach (like our log stack rebuild), where the entire environment is encapsulated within a single Nuget package (using AWS CloudFormation), and then deployed using Octopus Deploy.

Following such a process strictly can definitely slow you down when you need to get something done fast (because you have to add it, test it, review it and then deploy it through the chain of CI, Staging, Production), but it does prevent situations like this from arising.

Our Differences Make Us Special

As always, there is at least one caveat.

Sometimes you don’t WANT your production and pre-production systems to have the same alarms.

For example, imagine if you had an alarm that fired when the traffic dropped below a certain threshold. Production is always getting enough traffic, such that a drop indicates a serious problem.

Your pre-production environment might not be getting as much traffic, and the alarm might always be firing.

An alarm that is always firing is a great way to get people to ignore all of the alarms, even when they matter.

By that logic, there might be good reasons to have some differences between the two environments, but they are almost certainly going to be exceptions rather than the rule.
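One way to keep those exceptions honest is to write them down and check everything else automatically. The following is a minimal sketch in bash, assuming alarm names are comparable across environments (for example with any environment prefix already stripped) and are available as plain text files with one name per line. The `missing_alarms` function, the file names and the alarm names are all made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: the alarm lists could come from something like
# `aws cloudwatch describe-alarms`, but here they are plain text files
# with one (hypothetical) alarm name per line.

# missing_alarms <production_list> <preprod_list> <allowed_differences_list>
# Prints alarms that exist in production, are absent from pre-production and
# have not been explicitly allowed to differ.
missing_alarms() {
    comm -23 <(sort -u "$1") <(sort -u "$2") | comm -23 - <(sort -u "$3")
}

workdir=$(mktemp -d)
printf '%s\n' backend-errors low-traffic unhealthy-instances > "$workdir/production.txt"
printf '%s\n' unhealthy-instances > "$workdir/preprod.txt"
printf '%s\n' low-traffic > "$workdir/allowed.txt"

# low-traffic is an intentional difference, so only backend-errors is reported
missing_alarms "$workdir/production.txt" "$workdir/preprod.txt" "$workdir/allowed.txt"
rm -rf "$workdir"
```

Anything the check prints is a difference nobody has signed off on, which is exactly the situation we stumbled into.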


To be honest, not having the alarms go off in the pre-production environment wasn’t exactly the worst thing that’s ever happened to us, but it was annoying.

Short term it was easy enough to just remember to check the logs before pushing to production, but the manual and unreliable nature of that check is exactly why we have alarms in the first place.

Machines are really good at doing repetitive things after all.


Another week, another post.

Stepping away from the data synchronization algorithm for a bit, it’s time to explore some AWS weirdness.

Well, strictly speaking, AWS is not really all that weird. It’s just that sometimes things happen, and they don’t immediately make any sense, and it takes a few hours and some new information before you realise “oh, that makes perfect sense”.

That intervening period can be pretty confusing though.

Today’s post is about a disagreement between a Load Balancer and a CloudWatch alarm about just how healthy some EC2 instances were.

Schrödinger's Symptoms

At the start of this whole adventure, we got an alert email from CloudWatch saying that one of our Load Balancers contained some unhealthy instances.

This happens from time to time, but it’s one of those things that you need to get to the bottom of quickly, just in case it’s the first sign of something more serious.

In this particular case, the alert was for one of the very small number of services whose infrastructure we manually manage. That is, the service is hosted on hand crafted EC2 instances, manually placed into a load balancer. That means no auto scaling group, so no capability to self heal or scale if necessary, at least not without additional effort. Not the greatest situation to be in, but sometimes compromises must be made.

Our first step for this sort of thing is typically to log into the AWS dashboard and have a look around, starting with the entities involved in the alert.

After logging in and checking the Load Balancer though, everything seemed fine. No unhealthy instances were being reported. Fair enough, maybe the alarm had already reset and we just hadn’t gotten the followup email (it’s easy enough to forget to configure an alert to send emails when the alarm returns to the OK state).

But no, checking on the CloudWatch alarm showed that it was still triggered.

Two views on the same information: one says “BAD THING HAPPENING”, the other says “nah, it’s cool, I’m fine”.

But which one was right?

Diagnosis: Aggregations

When you’re working with instance health, one of the most painful things is that AWS does not offer detailed logs showing the results of Load Balancer health checks. Sure, you can see whether or not there were any healthy/unhealthy instances, but if you want to know how the machines are actually responding, you pretty much just have to execute the health check yourself and see what happens.

In our case, that meant hitting the EC2 instances directly over HTTP with a request for /index.html.

Unfortunately (fortunately?) everything appeared to be working just fine, and each request returned 200 OK (which is what the health check expects).

Our suspicion then fell to the CloudWatch alarm itself. Perhaps we had misconfigured it and it wasn’t doing what we expected? Maybe it was detecting missing data as an error and then alarming on that. It might still indicate a problem of some sort, but would at least confirm that the instances were functional, which is what everything else appeared to be saying.

The alarm was correctly configured though, saying the equivalent of “alert when there is more than 0 unhealthy instances in the last 5 minutes”.

We poked around a bit into the other metrics available on the Load Balancer (request count, latency, errors, etc) and discovered that latency had spiked a bit, request count had dropped a bit and that there was a tonne of backend errors, so something was definitely wrong.

Returning to the alarm, we noticed that the aggregation on the data point was “average”, which meant that it was actually saying “alert when there is more than 0 unhealthy instances on average over the last 5 minutes”. It’s not obvious what the average value of a health check is over time, but changing the aggregation to minimum showed that there were zero unhealthy instances over the same time period, and changing it to maximum showed that all four of the instances were unhealthy over the same time period.

Of course, this meant that the instances were flapping between up and down, which meant the health checks were sometimes failing and sometimes succeeding.

It was pure chance that the Load Balancer never showed any unhealthy instances when we looked at it directly, and that the health endpoint always responded appropriately when we hit it manually. The CloudWatch alarm remembered though, and the average of [healthy, healthy, unhealthy] was unhealthy as far as it was concerned, so it was alerting correctly.
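To make the effect of the aggregation concrete, here is a minimal sketch in bash. The sample values are invented (one data point per minute for a four instance load balancer that is flapping), and the `aggregate` and `breaches` helpers are hypothetical names for illustration, not anything CloudWatch provides.

```shell
#!/usr/bin/env bash
# Hypothetical unhealthy instance counts, one sample per minute over 5 minutes.
# The most recent sample is last.
samples=(0 4 0 0 4)

# aggregate <statistic> <values...>: prints the min, max or average of the values
aggregate() {
    local statistic=$1
    shift
    printf '%s\n' "$@" | awk -v stat="$statistic" '
        { v = $1 + 0 }
        NR == 1 { min = v; max = v }
        { sum += v; if (v < min) min = v; if (v > max) max = v }
        END {
            if (stat == "min") { print min }
            else if (stat == "max") { print max }
            else { printf "%.1f\n", sum / NR }
        }'
}

# breaches <value>: succeeds when the value is greater than the threshold of 0
breaches() {
    awk -v v="$1" 'BEGIN { exit !(v + 0 > 0) }'
}

for statistic in min max average; do
    value=$(aggregate "$statistic" "${samples[@]}")
    if breaches "$value"; then state="ALARM"; else state="OK"; fi
    echo "$statistic = $value -> $state"
done
```

With these samples the minimum is 0 (OK), the maximum is 4 (ALARM) and the average is 1.6 (ALARM). A single bad data point anywhere in the window is enough to push the average above zero, even if every instance looks healthy at the moment you happen to check.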

Long story short, both views of the data were strictly correct, and were just showing slightly different interpretations.

Cancerous Growth

The root cause of the flapping was exceedingly high CPU usage on the EC2 instances in question, which was leading to timeouts.

We’d done a deployment earlier that day, and it had increased the overall CPU usage of two of the services hosted on those instances by enough that the machines were starting to strain with the load.

More concerning though was the fact that the deployment had only really spiked the CPU from 85% to 95-100%. We’d actually been running those particular instances hot for quite a while.

In fact, looking back at the CPU usage over time, there was a clear series of step changes leading up to the latest increase, and we just hadn’t been paying attention.


It can be quite surprising when two different views on the same data suddenly start disagreeing, especially when you’ve never seen that sort of thing happen before.

Luckily for us, both views were actually correct, and it just took a little digging to understand what was going on. There was a moment there where we started thinking that maybe we’d found a bug in CloudWatch (or the Load Balancer), but realistically, at this point in the lifecycle of AWS, that sort of thing is pretty unlikely, especially considering that neither of those systems is exactly new or experimental.

More importantly, now we’re aware of the weirdly high CPU usage on the underlying EC2 instances, so we’re going to have to deal with that.

It’s not like we can auto scale them to deal with the traffic.


I learned a new thing about AWS, instance retirement and auto scaling groups a few weeks ago.

I mean, let’s be honest, the number of things I don’t know dwarfs the number of things I do know, but this one in particular was surprising. At the time the entire event was incredibly confusing, and no-one at my work really knew what was going on, but later on, with some new knowledge, it all made perfect sense.

I’m Getting Too Old For This

Let’s go back a step though.

Sometimes AWS needs to murder one of your EC2 instances. As far as I know, this tends to happen when AWS detects failures in the underlying hardware, and they schedule the instance to be “retired”, notifying the account holder as appropriate. Of course, this assumes that AWS notices the failure before it becomes an issue, so if you’re really unlucky, sometimes you don’t get a warning and an EC2 instance just disappears into the ether.

The takeaway here is that you should never rely on just one EC2 instance for any critical functions. This is one of the reasons why auto scaling groups are so useful, because you specify a template for the instances instead of just making one by itself. Of course, if your instances are accumulating important state, then you’ve still got a problem if one goes poof.

Interestingly enough, when you stop (not terminate, just stop) an EC2 instance that is owned by an auto scaling group, the auto scaling group tends to murder it and spin up another one, because it thinks the instance has gone bad and needs to be replaced.

Anyway, I was pretty surprised when AWS scheduled two of the Elasticsearch data nodes in our ELK stack for retirement and:

  1. The nodes hung around in a stopped state, i.e. they didn’t get replaced, even though they were owned by an auto scaling group
  2. AWS didn’t trigger the CloudWatch alarm on the Elasticsearch load balancer that is supposed to detect unhealthy instances, even though the stopped instances were clearly marked unhealthy

After doing some soul searching, I can explain the first point somewhat.

The second point still confuses me though.

You’re Suspended! Hand In Your Launch Configuration And Get Out Of Here

It turns out that when AWS schedules an instance for retirement, it doesn’t necessarily mean the instance is actually going to disappear forever. Well, it won’t if you’re using an EBS volume at least. If you’re just using an instance store volume you’re pretty boned, but those are ephemeral anyway, so you should really know better.

Once the instance is “retired” (i.e. stopped), you can just start it up again. It will migrate to some new (healthy) hardware, and off it goes, doing whatever the hell it was doing in the first place.

However, as I mentioned earlier, if you stop an EC2 instance owned by an auto scaling group, the auto scaling group will detect it as a failure, terminate the instance and spin up a brand new replacement.

Now, this sort of reaction can be pretty dangerous, especially when AWS is the one doing the shutdown (as opposed to the account holder), so AWS does the nice thing and suspends the terminate and launch processes of the auto scaling group, just to be safe.

Of course, the assumption here is that the account holder knows that the processes have been suspended and that some instances are being retired, and that they will go and restart the stopped instances, resume the auto scaling processes and continue on with their life, singing merrily to themselves.
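For the informed account holder, that recovery might look something like the following sketch. It is written against the AWS CLI, but the group name and instance IDs are placeholders, the `run` and `recover_retired_instances` helpers are hypothetical, and the whole thing is set to print the commands rather than execute them, so treat it as an outline and not as a tested runbook.

```shell
#!/usr/bin/env bash
# Sketch only: with DRY_RUN=1 the commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# recover_retired_instances <auto_scaling_group> <instance_id>...
recover_retired_instances() {
    local group=$1
    shift
    # Start the stopped instances first, so they are running again before the
    # auto scaling group gets the chance to judge them.
    run aws ec2 start-instances --instance-ids "$@"
    # Then resume the two processes that were suspended.
    run aws autoscaling resume-processes \
        --auto-scaling-group-name "$group" \
        --scaling-processes Launch Terminate
}

DRY_RUN=1
recover_retired_instances "elasticsearch-data-nodes" "i-0123456789abcdef0" "i-0fedcba9876543210"
```

In practice you would probably want to wait for the restarted instances to report healthy before resuming the processes, otherwise the auto scaling group may still decide to replace them.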

Until this happened to us, I did not even know that suspending auto scaling group processes was an option, let alone that AWS would do it for me. When we happened to notice that two of our Elasticsearch data nodes had become unavailable through Octopus Deploy, I definitely was not the “informed account holder” in the equation, and instead went on an adventure trying to figure out what the hell was going on.

I tried terminating the stopped nodes, in the hopes that they would be replaced, but because the processes were suspended, I got nothing. I tried raising the number of desired instances, but again, the processes were suspended, so nothing happened.

In the end, I created a secondary auto scaling group using the same Launch Configuration and got it to spin up a few instances, which then joined the cluster and helped to settle everything down.

It wasn’t until the next morning, when cooler heads prevailed and I got a handle on what was actually happening, that we cleaned everything up properly.


This was one of those cases where AWS was definitely doing the right thing (helping people to avoid data loss because of an underlying failure out of their control), but a simple lack of knowledge on our part caused a bit of a kerfuffle.

The ironic thing is that if AWS had simply terminated the EC2 instances (which I’ve seen happen before) the cluster would have self-healed and rebalanced perfectly well (as long as only a few nodes were terminated of course).

Like I said earlier, I still don’t know why we didn’t get a CloudWatch alarm when the instances were stopped, as they were definitely marked as “unhealthy” in the Load Balancer. We didn’t even realise that something had gone wrong until someone noticed that the data nodes were reporting as unavailable in Octopus Deploy, and that happened purely by chance.

Granted, we still had four out of six data nodes in service, and we run our shards with one primary and two replicas, so we weren’t exactly in the danger zone, but we were definitely approaching it.

Maybe it’s time to try and configure an alarm on the cluster health.

That’s always nice and colourful.
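As a rough idea of what such a check could look like, here is a minimal sketch in bash that reads the `status` field out of an Elasticsearch `_cluster/health` response. The sample response is made up, the `cluster_status` and `should_alarm` helpers are hypothetical, and a proper JSON parser (such as `jq`) would be more robust than the `sed` expression used here.

```shell
#!/usr/bin/env bash
# cluster_status: reads a _cluster/health JSON response on stdin and prints
# the value of its "status" field (green, yellow or red).
cluster_status() {
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1
}

# should_alarm <status>: succeeds for anything other than green, including an
# empty status (for example when the response could not be parsed).
should_alarm() {
    [ "$1" != "green" ]
}

# In real use the response would come from something like:
#   curl -s "http://$ES_HOST:$ES_PORT/_cluster/health"
# Here it is an invented sample so the sketch runs without a cluster.
response='{"cluster_name":"elk","status":"yellow","number_of_data_nodes":4,"unassigned_shards":12}'

status=$(printf '%s\n' "$response" | cluster_status)
if should_alarm "$status"; then
    echo "ALARM: cluster health is '$status'"
else
    echo "OK: cluster health is green"
fi
```

Losing two data nodes with replicas configured would typically turn the cluster yellow, so a check along these lines would have told us something was wrong well before anyone wandered into Octopus Deploy.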


Full disclosure, most of the Elastalert related work was actually done by a colleague of mine, I’m just writing about it because I thought it was interesting.

Unfortunately, this post brings me to the end of all the Elastalert goodness, at least for now.

Like I said right at the start (and embedded in the post titles), we’re finally paying attention to the wealth of information inside our ELK stack. Well, we aren’t really paying attention to everything right now, but when we notice something or even realize ahead of time that “it would be good if we got told when this happens” we actually have somewhere to put that logic.

I’ll call that a victory.

Anyway, to bring it all full circle:

To be honest, when you look at what we’ve done for Elastalert from a distance, it looks suspiciously similar to the ELK stack (specifically the Elasticsearch segment).

I don’t necessarily think that’s a bad thing though. Honestly, I think we’ve just found a pattern that works for us, so rather than reinventing the wheel each time, we just roll with it.

Consistency is a quality all on its own.

Rule The World

It’s actually been almost a couple of months now since we put this all together, and people are slowly starting to incorporate different rules to notify us when interesting things happen.

A good example of this sort of thing is with one of our new features.

As a general rule of thumb, we try our best to include dedicated business intelligence events into the software for whatever features we develop, including major checkpoints like starting, finishing and failure. One of our recent features also raised a “configured” event, which indicated when a customer had put in the specific configuration necessary for the feature to be enabled (it was a third party integration, so required an externally provided API key to function).

We added a rule to detect when this relatively rare event occurred, and now we get a notification whenever someone configures the new feature. This sort of thing is useful when you still have a relatively small number of people coming online (so you can keep tabs on them and follow through to see if they are experiencing any issues), but we’ll probably turn it off once usage picks up so we’re not constantly being spammed.

Recently a customer came online with the new feature, but never followed up with actual usage beyond the initial configuration, so we were able to flag this with the relevant parties (like their account manager) and investigate why that was happening and how we could help.

Without Elastalert, we never would have known, even though the information was actually available for all to see.

Breaking All The Rules

Of course, no series of blog posts would be complete without noting down some potential ways in which we could improve the thing we literally just finished putting together.

I mean, we could barely call ourselves engineers if we weren’t already engineering a better version in our heads before the paint had even dried on the first one.

There are two areas that I think could use improvement, but neither of them are particularly simple:

  1. The architecture that we put together is not highly available, even though it is self healing. There is only one Elastalert instance and we don’t really have particularly good protection against that instance being “alive” according to AWS but not actually evaluating rules. We should probably put some more effort into detecting issues with Elastalert so that the AWS Auto Scaling Group self healing can kick in at the appropriate times. I don’t think we can really do anything about side-by-side redundancy though, as Elastalert isn’t really designed to be a distributed alerting system. Two copies would probably just raise two alerts which would get annoying quickly.
  2. There is no real concept of an alert getting worse over time, like there is with some other alerting platforms. Pingdom is a good example of this, though its alerts are a lot simpler (pretty much just up/down). If a website is down, different actions get triggered based on the length of the downtime. We use this sort of approach to first send a note to Hipchat, then to email, then to SMS some relevant parties in a natural progression. Elastalert really only seems to have on/off, as opposed to a schedule of notifications. You could probably accomplish the same thing by having multiple similar rules with different criteria, but that sounds like a massive pain to manage moving forward. This is something that will probably have to be done at the Elastalert level, and I doubt it would be a trivial change, so I’m not going to hold my breath.
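
To illustrate what I mean by a schedule of notifications in the second point, here is a tiny sketch in bash. The thresholds (5 and 15 minutes) and the `channels_for` function are invented for the example; this is not how Pingdom or Elastalert are actually configured.

```shell
#!/usr/bin/env bash
# channels_for <minutes_down>: prints the channels that should have been
# notified by the time a problem has lasted that many minutes.
channels_for() {
    local minutes_down=$1
    local channels="hipchat"                  # always start with a chat note
    if [ "$minutes_down" -ge 5 ]; then        # hypothetical threshold
        channels="$channels email"
    fi
    if [ "$minutes_down" -ge 15 ]; then       # hypothetical threshold
        channels="$channels sms"
    fi
    echo "$channels"
}

for minutes in 1 5 20; do
    echo "$minutes minutes down -> $(channels_for "$minutes")"
done
```

The point is that the escalation lives in one place, keyed on how long the problem has been going on, rather than being spread across several near-identical rules.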

Having said that, the value that Elastalert provides in its current state is still astronomically higher than having nothing, so who am I to complain?


When all is said and done, I’m pretty happy that we finally have the capability to alert off our ELK stack.

I mean, it’s not like the data was going to waste before we had that capability, it just feels better knowing that we don’t always have to be watching in order to find out when interesting things happen.

I know I don’t have time to watch the ELK stack all day, and I doubt anyone else does.

Though it is awfully pretty to look at.


Continuing with the Elastalert theme, it’s time to talk configuration and the deployment thereof.

Last week I covered off exactly how we put together the infrastructure for the Elastalert stack. It wasn’t anything fancy (AMI through Packer, CloudFormation template deployed via Octopus), but there were some tricksy bits relating to Python conflicts between Elastalert and the built-in AWS EC2 initialization scripts.

With that out of the way, we get into the meatiest part of the process: how we manage the configuration of Elastalert, i.e. the alerts themselves.

The Best Laid Plans

When it comes to configuring Elastalert, there are basically only two things to worry about: the overall configuration, and the rules and actions that make up the alerts.

The overall configuration covers things like where to find Elasticsearch, which Elasticsearch index to write results into, high level execution timings and so on. All that stuff is covered clearly in the documentation, and there aren’t really any surprises.

The rules are where it gets interesting. There are a wide variety of ways to trigger actions off the connected Elasticsearch cluster, and I provided an example in the initial blog post of this series. I’m not going to go into too much detail about the rules and their structure or capabilities because the documentation goes into that sort of thing at length. For the purposes of this post, the main thing to be aware of is that each rule is fully encapsulated within a file.

The nice thing about everything being inside files is that it makes deployment incredibly easy.

All you have to do is identify the locations where the files are expected to be and throw the new ones in, overwriting as appropriate. If you’re dealing with a set of files it’s usually smart to clean out the destination first (so deletions are handled correctly), but it’s still pretty straightforward.

When we started on the whole Elastalert journey, the original plan was for a simple file copy + service restart.

Then Docker came along.

No Plan Survives Contact With The Enemy

To be fair, even with Docker, the original plan was still valid.

All of the configuration was still file based, so deployment was still as simple as copying some files around.


Docker did complicate a few things though. Instead of Elastalert being installed, we had to run an Elastalert image inside a Docker container.

Supplying the configuration files to the Elastalert container isn’t hard. When starting the container you just map certain local directories to directories in the container and it all works pretty much as expected. As long as the files exist in a known place after deployment, you’re fine.

However, in order to “restart” Elastalert, you have to find and murder the container you started last time, and then start up a new one so it will capture the new configuration files and environment variables correctly.

This is all well and good, but even after doing that you only really know whether or not the container itself is running, not necessarily the Elastalert process inside the container. If your config is bad in some way, the Elastalert process won’t start, even though the container will quite happily keep chugging along. So you need something to detect if Elastalert itself is up inside the container.

Putting all of the above together, you get something like this:

echo -e "STEP: Stop and remove existing docker containers..."
echo "Checking for any existing docker containers"
# Lists every container (running or stopped) so stale ones get cleaned up too
RUNNING_CONTAINERS=$(docker ps -a -q)
if [ -n "$RUNNING_CONTAINERS" ]; then
    echo "Found existing docker containers."
    echo "Stopping the following containers:"
    # Deliberately unquoted so each container ID is passed as its own argument
    docker stop $RUNNING_CONTAINERS
    echo "Removing the following containers:"
    docker rm $RUNNING_CONTAINERS
    echo "All containers removed"
else
    echo "No existing containers found"
fi
echo -e "...SUCCESS\n"

echo -e "STEP: Run docker container..."
echo "Elastalert config file: $ELASTALERT_CONFIG_FILE"
echo "Supervisord config file: $SUPERVISORD_CONFIG_FILE"
echo "ES HOST: $ES_HOST"
echo "ES PORT: $ES_PORT"
# Map the deployed config, rules and logs directories into the container
docker run -d \
    -v "$RUN_DIR/config:/opt/config" \
    -v "$RUN_DIR/rules:/opt/rules" \
    -v "$RUN_DIR/logs:/opt/logs" \
    --cap-add SYS_TIME \
    --cap-add SYS_NICE "$IMAGE_ID"
if [ $? -ne 0 ]; then
    echo "docker run command returned a non-zero exit code."
    echo -e "...FAILED\n"
    exit 1
fi
CID=$(docker ps --latest --quiet)
echo "Elastalert container with ID $CID is now running"
echo -e "...SUCCESS\n"

echo -e "STEP: Checking for Elastalert process inside container..."
echo "Waiting 10 seconds for elastalert process"
sleep 10
# The container can be up while Elastalert itself failed to start (e.g. bad config)
if docker top "$CID" | grep -q elastalert; then
    echo "Found running Elastalert process. Nice."
else
    echo "Did not find elastalert running"
    echo "You can view logs for the container with: docker logs -f $CID"
    echo "You can shell into the container with: docker exec -it $CID sh"
    echo -e "...FAILURE\n"
    exit 1
fi
echo -e "...SUCCESS\n"

But wait, there’s more!

Environmental Challenges

Our modus operandi is to have multiple copies of our environments (CI, Staging, Production) which form something of a pipeline for deployment purposes. I’ve gone through this sort of thing in the past, the most recent occurrence of which was when I wrote about rebuilding the ELK stack. It’s a fairly common pattern, but it does raise some interesting challenges, especially around configuration.

For Elastalert specifically, each environment should have the same baseline behaviour (rules, timings, etc), but also different settings for things like where the Elasticsearch cluster is located, or which Hipchat room notifications go to.

When using Octopus Deploy, the normal way to accomplish this is to have variables defined in your Octopus Deploy project that are scoped to the environments being deployed to, and then leverage some of the built in substitution functionality to do replacements in whatever files need to be changed.

This works great at first, but has a few limitations:

  • You now have two places to look when trying to track changes, which can become a bit of a pain. It’s much nicer to be able to view all of the changes (barring sensitive credentials of course) in your source control tool of choice.
  • You can’t easily develop and test the environment outside of Octopus, especially if your deployment is only valid after passing through a successful round of substitutions in Octopus Deploy.

Keeping those two things in mind, we now lean towards having all of our environment specific parameters and settings in configuration files in source control (barring sensitive variables, which require some additional malarkey), and then loading the appropriate file based on some high level flags that are set either by Octopus or in the local development environment.

For Elastalert specifically we settled into having a default configuration file (which is always loaded) and then environment specific overrides. Which environment the deployment is executing in is decided by the following snippet of code:

echo -e "STEP: Determining Environment..."
if [ "$(type -t get_octopusvariable)" = function ]; then
    echo "get_octopusvariable function is defined => assuming we are running on Octopus"
    ENVIRONMENT=$(get_octopusvariable "Octopus.Environment.Name")
elif [ -n "$ENVIRONMENT" ]; then
    echo "--environment command line option was used"
else
    echo "Not running on Octopus and no --environment command line option used. Using 'Default'"
    # Fall back to the environment named in the message above
    ENVIRONMENT="Default"
fi
echo -e "...SUCCESS\n"

Once the selection of the environment is out of the way, the deployed files are mutated by executing a substitution routine written in Python which does most of the heavy lifting (replacing any tokens of the format @@KEY@@ in the appropriate files).
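
To give a feel for how the default file, the environment specific overrides and the token replacement fit together, here is a minimal sketch. Note that it is a bash and sed stand-in and not our actual Python routine, and that the file names (`default.properties`, `<environment>.properties`), the KEY=VALUE format and both function names are assumptions made purely for illustration.

```shell
#!/usr/bin/env bash
# resolve_settings <settings_dir> <environment>
# Prints one KEY=VALUE line per key. Defaults are read first, and environment
# specific values replace them when the same key appears in both files.
resolve_settings() {
    local settings_dir=$1 environment=$2
    local files=("$settings_dir/default.properties")
    if [ -f "$settings_dir/$environment.properties" ]; then
        files+=("$settings_dir/$environment.properties")
    fi
    awk -F '=' '
        /^[[:space:]]*(#|$)/ { next }    # skip comments and blank lines
        {
            key = $1
            if (!(key in seen)) { order[++count] = key; seen[key] = 1 }
            value[key] = substr($0, length(key) + 2)
        }
        END { for (i = 1; i <= count; i++) print order[i] "=" value[order[i]] }
    ' "${files[@]}"
}

# substitute_tokens <file>
# Reads KEY=VALUE lines on stdin and replaces every @@KEY@@ token in the file.
substitute_tokens() {
    local file=$1 line key value
    while IFS= read -r line; do
        key=${line%%=*}
        value=${line#*=}
        # Escape characters that are special in a sed replacement
        value=$(printf '%s' "$value" | sed -e 's/[\\&|]/\\&/g')
        sed -i.bak "s|@@${key}@@|${value}|g" "$file"
        rm -f "$file.bak"
    done
}

workdir=$(mktemp -d)
printf 'ES_HOST=localhost\nES_PORT=9200\n' > "$workdir/default.properties"
printf 'ES_HOST=elasticsearch.staging.example.internal\n' > "$workdir/Staging.properties"
printf 'es_host: @@ES_HOST@@\nes_port: @@ES_PORT@@\n' > "$workdir/config.yaml"

resolve_settings "$workdir" "Staging" | substitute_tokens "$workdir/config.yaml"
cat "$workdir/config.yaml"
rm -rf "$workdir"
```

For the made-up Staging environment this leaves `es_host` pointing at the override and `es_port` at the default of 9200. An environment with no override file simply gets the defaults, which lines up with the 'Default' fallback in the snippet above.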

To Be Continued

I’ve covered the two biggest challenges in the deployment of our Elastalert configuration, but I’ve glossed over quite a few pieces of the process because covering the entire thing in this blog post would make it way too big.

The best way to really understand how it works is to have a look at the actual repository.

With both the environment and configuration explained, all that is really left to do is bring it all together, and explain some areas that I think could use improvement.

That’s a job for next week though.