
Full disclosure: most of the Elastalert-related work was actually done by a colleague of mine; I’m just writing about it because I thought it was interesting.

Last week I did a bit of an introduction to Elastalert, as it is the new mechanism that we use to alert on the data in our ELK stack.

We take our infrastructure pretty seriously though, so I didn’t want to just manually create an Elastalert instance and set it up to do things. It all needs to be codified and controlled, with a deployment pipeline for distributing changes (like new rules or changed rules) and everything needs to be versioned as appropriate.

After doing some very high level playing around (just to make sure it all worked relatively as advertised), it was time to do it properly and set up an auto-scaling, auto-healing Elastalert environment, just like all of the other ones.

Packing It Away

Installing Elastalert is pretty straightforward.

It’s all Python based, so it’s a fairly simple matter to use pip to install the package:

pip install elastalert

This doesn’t quite work out of the box on an Amazon Linux EC2 instance though, as you have to also install some dependencies that are not immediately obvious.

sudo yum update -y;
sudo yum install gcc gcc-c++ -y;
sudo yum install libffi-devel -y;
sudo yum install openssl-devel -y;
sudo pip install elastalert;

With that out of the way, the machine is basically ready to run Elastalert, assuming you configure it correctly (as per the documentation).
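For reference, the global configuration is just a small YAML file. What follows isn’t our exact config (the host and paths are placeholders), but it’s roughly the shape the documentation walks you through:

rules_folder: /opt/elastalert/rules
run_every:
  minutes: 1
buffer_time:
  minutes: 15
es_host: elasticsearch.example.internal
es_port: 9200
writeback_index: elastalert_status
alert_time_limit:
  days: 2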

With a relatively self-contained installation script out of the way, it was time to create an AMI containing it using Packer, to be used inside the impending environment.

The Packer configuration for an AMI with Elastalert installed on it is pretty straightforward, and just follows the normal pattern, which I described in this post and which you can see directly in this Github repository. The only meaningful difference is the script that installs Elastalert itself, which you can see above.
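If you haven’t seen one of those templates before, a stripped-down sketch looks something like the below (the region, source AMI and script path are placeholders rather than our real values):

{
  "builders": [
    {
      "type": "amazon-ebs",
      "region": "ap-southeast-2",
      "source_ami": "ami-xxxxxxxx",
      "instance_type": "t2.micro",
      "ssh_username": "ec2-user",
      "ami_name": "elastalert-{{timestamp}}"
    }
  ],
  "provisioners": [
    {
      "type": "shell",
      "script": "./scripts/install-elastalert.sh"
    }
  ]
}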

Cumulonimbus Clouds Are My Favourite

With an AMI created and ready to go, all that’s left is to create a simple environment to run it in.

Nothing fancy, just a CloudFormation template with a single auto scaling group in it, such that accidental or unexpected terminations self-heal. No need for a load balancer, DNS entries or anything like that; it’s purely a background process that sits quietly and yells at us as appropriate.

Again, this is a problem that we’ve solved before, and we have a decent pattern in place for putting this sort of thing together.

  • A dedicated repository for the environment, containing the CloudFormation template, configuration and deployment logic
  • A TeamCity Build Configuration, which uses the contents of this repository and builds and tests a versioned package
  • An Octopus project, which contains all of the logic necessary to target the deployment, along with any environment level variables (like target ES cluster)

The good news was that the standard environment stuff worked perfectly. It built, a package was created and that package was deployed.

The bad news was that the deployment never actually completed successfully because the Elastalert AMI failed to result in a working EC2 instance, which meant that the environment failed miserably as the Auto Scaling Group never received a success signal.

But why?

Snakes Are Tricky

It actually took us a while to get to the bottom of the problem, because Elastalert appeared to be fully functional at the end of the Packer process, but the AMI created from that EC2 instance seemed to be fundamentally broken.

Any EC2 instance created from that AMI just didn’t work, regardless of how we used it (i.e. CloudFormation vs manual instance creation, nothing mattered).

The instance would be created and it would “go green” (i.e. the AWS status checks and whatnot would complete successfully) but we couldn’t connect to it using any of the normal mechanisms (SSH using the specified key being the most obvious). It was like none of the normal EC2 setup was being executed, which was weird, because we’ve created many different AMIs through Packer and we hadn’t done anything differently this time.

Looking at the system log for the broken EC2 instances (via the AWS Dashboard) we could see that the core setup procedure of the EC2 instance (where it uses the supplied key file to set up access among other things) was failing due to problems with Python.
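As an aside, you don’t have to click through the dashboard to read that log; the AWS CLI will hand over the same output (the instance id here is obviously a placeholder):

aws ec2 get-console-output --instance-id i-xxxxxxxxxxxxxxxxx --output text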

What else uses Python?

That’s right, Elastalert.

It turned out that our Elastalert installation script was updating some dependencies that the EC2 initialization relied on, and those updates had completely broken the normal setup procedure.

The AMI was functionally useless.

Dock Worker

We went through a few different approaches to try and fix the underlying dependency conflicts, but in the end we settled on using Docker.

At a very high level, Docker is a kind of virtualization platform, except it doesn’t virtualize the entire OS and instead sits a little bit above that, virtualizing a set of applications instead, leveraging the OS rather than simulating the entire thing. Each Docker image generally hosts a single application in a completely isolated environment, which makes it the perfect solution when you have system software conflicts like we did.

Of course, we had to change our strategy somewhat in order for this to work.

Instead of using Packer to create an AMI with Elastalert installed, we now have to create an AMI with Docker (and Octopus) installed and available.

Same pattern as before, just different software being installed.

Nothing much changed in the environment though, as it’s still just an Auto Scaling Group spinning up an EC2 instance using the specified AMI.

The big changes were in the Elastalert configuration deployment, which now had to be responsible for both deploying the actual configuration and making sure the Elastalert Docker image was correctly configured and running.
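I’ll get into the deployment details next week, but to give a sense of what hosting Elastalert in Docker involves, a rough sketch of an image definition (not our actual one, and the paths are illustrative) is just a pip install plus an entry point that reads configuration from a mounted directory:

FROM python:2.7

RUN pip install elastalert

# configuration and rules are mounted in at runtime rather than baked into the image
ENTRYPOINT ["elastalert", "--config", "/opt/config/config.yaml", "--verbose"]

Running it is then a matter of something like docker run -d -v /opt/elastalert/config:/opt/config {image}, plus a one-off elastalert-create-index to set up the writeback index.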

To Be Continued

And that is as good a place as any to stop for now.

Next week I’ll explain what our original plan was for the Elastalert configuration deployment and how that changed when we switched to using Docker to host an Elastalert image.


Well, it’s been almost 2 years now since I made a post about Sensu as a generic alerting/alarming mechanism. It ended on a hopeful note, explaining that the content of the post was relatively theoretical and that we hoped to put some of it in place in the coming weeks/months.

Yeah, that never happened.

It’s not like we didn’t have any alerts or alarms during that time, we just never continued on with the whole theme of “let’s put something together to yell at us whenever weird stuff happens in our ELK stack”. We’ve been using Pingdom ever since our first service went live (to monitor HTTP endpoints and websites) and we’ve been slowly increasing our usage of CloudWatch alarms, but all of that juicy intelligence in the ELK stack is still languishing in alerting limbo.

Until now.

Attention Deficit Disorder

As I’ve previously outlined, we have a wealth of information available in our ELK stack, including things like IIS logs, application logs, system statistics for infrastructure (i.e. memory, CPU, disk space, etc), ELB logs and various intelligence events (like “user used feature X”).

This information has proven to be incredibly valuable for general analysis (bug identification and resolution is a pretty common case), but historically the motivation to start using the logs occurs through some other channel, like a customer complaining via our support team or someone just noticing that “hey, this thing doesn’t look right”.

It’s all very reactive, and we’ve missed early warning signs in the past such that an issue affected real people, which is sloppy at best.

We can do better.

Ideally what we need to do is identify symptoms or leading indicators that things are starting to go wrong or degrade, and then dynamically alert the appropriate people when these things are detected, so we can action them ASAP. In a perfect world, these sorts of triggers would be identified and put in place as an integral part of the feature delivery, but for now it would be enough that they just exist at some point in time.

And that’s where Elastalert comes in.

It’s Not That We Can’t Pay Attention

Elastalert is a relatively straightforward piece of installed software that allows you to do things when the data in an Elasticsearch cluster meets certain criteria.

It was created at Yelp to work in conjunction with their ELK stack for exactly the purpose that we’re chasing, so it’s basically a perfect fit.

Also, it’s free.

Elastic.co offers an alerting solution themselves, in the form of X-Pack Alerting (formerly Watcher). As far as I know it’s pretty amazing, and integrates smoothly with Kibana. However, it costs money, and it’s one of those things where you actually have to request a quote, rather than just being a price on a website, so you know it’s expensive. I think we looked into it briefly, but I can’t remember what the actual price would have been for us. I remember it being crazy though.

The Elastalert documentation is pretty awesome, but at a high level the tool offers a number of different ways to trigger alerts and a number of notification channels (like Hipchat, Slack, Email, etc) to execute when an alert is triggered.

All of the configuration is YAML based, which is a pretty common format these days, and all of the rules are just files, so they’re easy to manage.

Here’s an example rule that we use for detecting spikes in the amount of 50X response codes occurring for any of our services:

name: Spike in 5xxs
type: spike
index: logstash-*

timeframe:
  seconds: @@ELASTALERT_CHECK_FREQUENCY_SECONDS@@

spike_height: 2
spike_type: up
threshold_cur: @@general-spike-5xxs.yaml.threshold_cur@@

filter:
- query:
    query_string:
      query: "Status: [500 TO 599]"
alert: "hipchat"
alert_text_type: alert_text_only
alert_text: |
  <b>{0}</b>
  <a href="@@KIBANA_URL@@">5xxs spiked {1}x. Was {2} in the last {3}, compared to {4} the previous {3}</a>
hipchat_message_format: html
hipchat_from: Elastalert
hipchat_room_id: "@@HIPCHAT_ROOM@@"
hipchat_auth_token: "@@HIPCHAT_TOKEN@@"
alert_text_args:
- name
- spike_height
- spike_count
- reference_count

The only thing in the rule above not covered extensively in the documentation is the @@SOMETHING@@ notation that we use to do some substitutions during deployment. I’ll talk about that a little bit later, but essentially it’s just a way to customise the rules on a per-environment basis without having to rewrite the entire rule (so CI rules can execute every 30 seconds over the last 4 hours, but production might check every few minutes over the last hour and so on).
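The substitution mechanism itself is nothing special. The real thing happens in our deployment tooling, but conceptually it’s just token replacement over the rule files, something like this (the values here are illustrative):

sed -e "s|@@ELASTALERT_CHECK_FREQUENCY_SECONDS@@|30|g" \
    -e "s|@@HIPCHAT_ROOM@@|ci-alerts|g" \
    general-spike-5xxs.yaml > rules/general-spike-5xxs.yaml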

There’s Just More Important Thi….Oh A Butterfly!

With the general introduction to Elastalert out of the way, the plan for this series of posts is eerily similar to what I did for the ELK stack refresh.

Hopefully I can put together a publicly accessible repository in Github with all of the Elastalert work in it before the end of this series of posts, but I can’t make any promises. It’s pretty time consuming to take one of our internal repositories and sanitize it for consumption by the greater internet, even if it is pretty useful.

To Be Continued

Before I finish up, I should make it clear that we’ve already implemented the Elastalert stuff, so it’s not in the same boat as our plans for Sensu. We’re literally using Elastalert right now to yell at us whenever interesting things happen in our ELK stack and it’s already proven to be quite useful in that respect.

Next week, I’ll go through the Elastalert environment we set up, and why the Elastalert application and Amazon Linux EC2 instances don’t get along very well.


Choo choo goes the Elasticsearch train.

After the last few blog posts about rolling updates to our Elasticsearch environment, I thought I might as well continue with the Elasticsearch theme and do a quick post about reindexing.

Cartography

An index in Elasticsearch is kind of similar to a table in a relational database, but not really. In the same vein, index templates are kind of like schemas, and field mappings are kind of like columns.

But not really.

If you were using Elasticsearch purely for searching through some set of data, you might create an index and then add some mappings to it manually. For example, if you wanted to make all of the addresses in your system searchable, you might create fields for street, number, state, postcode and other common address elements, and maybe another field for the full address combined (like 111 None St, Brisbane, QLD, 4000 or something), to give you good coverage over the various sort of searches that might be requested.

Then you jam a bunch of documents into that index, each one representing a different address that needs to be searchable.

Over time, you might discover that you could really use a field to represent the unit or apartment number, to help narrow down those annoying queries that involve a unit complex or something.

Well, with Elasticsearch you can add a new field to the index, in a similar way to how you add a new column to a table in a relational database.

Except again, not really.

You can definitely add a new field mapping, but it will only work for documents added to the index after you’ve applied the change. You can’t make that new mapping retroactive. That is to say, you can’t magically make it apply to every document that was already in the index when you created the new mapping.

When it comes to your stock standard ELK stack, your data indexes are generally time based and generated from an index template, which adds another layer of complexity. If you want to change the mappings, you typically just change the template and then wait for the next time period to roll over.

This leaves you in an unfortunate place for historical data, especially if you’ve been conservative with your field mappings.

Or does it?

Dexterous Storage

In both of the cases above (the manually created and maintained index, the swarm of indexes created automatically via a template) it’s easy enough to add new field mappings and have them take effect moving forward.

The hard part is always the data that already exists.

That’s where reindexing comes in.

Conceptually, reindexing is taking all of the documents that are already in an index and moving them to another index, where the new index has all the field mappings you want in it. In moving the raw documents like that, Elasticsearch will redo everything that it needs to do in order to analyse and break down the data into the appropriate fields, exactly like the first time the document was seen.

For older versions of Elasticsearch, the actual document migration had to be done with an external tool or script, but the latest versions (we use 5.5.1) have a reindex endpoint on the API, which is a lot simpler to use.

curl -XPUT "{elasticsearch_url}/{new_index}?pretty" -H "Accept: application/json"
curl -XPOST "{elasticsearch_url}/_reindex?pretty" -H "Content-Type: application/json" -H "Accept: application/json" -d '{ "source": { "index": "{old_index}" }, "dest": { "index": "{new_index}", "version_type": "external" } }'

It doesn’t have to be a brand new index (there are options for how to handle documents that conflict if you’re reindexing into an index that already has data in it), but I imagine that a new index is the most common usage.

The useful side effect of this is that in requiring a different index, the old one is left intact and unchanged. It’s then completely up to you how to use both the new and old indexes, the most common operation being to delete the old one when you’re happy with how and where the new one is being used.

Seamless Replacement

We’ve changed our field mappings in our ELK stack over time, so while the most recent indexes do what we want them to, the old indexes have valuable historical data sitting around that we can’t really query or aggregate on.

The naive implementation is just to iterate through all the indexes we want to reindex (maybe using a regex or something to identify them), create a brand new index with a suffix (like logstash-2017.08.21-r) and then run the reindex operation via the Elasticsearch API, similar to the example above.

That leaves us with two indexes with the same data in them, which is less than ideal, especially considering that Kibana will quite happily query both indexes when you ask for data for a time period, so we can’t really leave the old one around or we’ll run into issues with duplicate data.

So we probably want to delete the old index once we’re finished reindexing into the new one.

But how do we know that we’re finished?

The default mode for the reindex operation is to wait for completion before returning a response from the API, which is handy, because that is exactly what we want.

The only other thing we needed to consider is that after a reindex, all of the indexes will have a suffix of -r, and our Curator configuration wouldn’t pick them up without some changes. In the interest of minimising the amount of things we had to touch just to reindex, we decided to do the reindex again from the temporary index back into an index named the same as the one we started with, deleting the temporary index once that second operation was done.

When you do things right, people won’t be sure you’ve done anything at all.

Danger Will Robinson

Of course, the first time I ran the script (iterate through indexes, reindex to temporary index, delete source, reindex back, delete temp) on a real Elasticsearch cluster I lost a bunch of documents.

Good thing we have a staging environment specifically for this sort of thing.

I’m still not entirely sure what happened, but I think it had something to do with the eventually consistent nature of Elasticsearch, the fact we connect to the data nodes via an AWS ELB and the reindex being “complete” according to the API but not necessarily synced across all nodes, so the deletion of the source index threw a massive spanner in the works.

Long story short, I switched the script to start the reindex asynchronously and then poll the destination index until it returned the same number of documents as the source. As a bonus, this fixed another problem I had with the HTTP request for the reindex timing out on large indexes, which was nice.

The only downside of this is that we can’t reindex an index that is currently being written to (because the document counts will definitely change over the period of time the reindex occurs), but I didn’t want to do that anyway.
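The full script handles more than this, but the core of the asynchronous approach boils down to something like the rough sketch below (jq and the placeholder URLs are my own assumptions, not lifted from our implementation):

# kick off the reindex without holding the HTTP request open
curl -XPOST "{elasticsearch_url}/_reindex?wait_for_completion=false" -H "Content-Type: application/json" -d '{ "source": { "index": "{old_index}" }, "dest": { "index": "{new_index}" } }'

# then poll until the destination contains as many documents as the source
expected=$(curl -s "{elasticsearch_url}/{old_index}/_count" | jq .count)
until [ "$(curl -s "{elasticsearch_url}/{new_index}/_count" | jq .count)" -ge "$expected" ]; do
  sleep 30
done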

Conclusion

I’ve uploaded the full script to Github. Looking at it now, it’s a bit more complicated than you would expect, even taking into account the content of this post, but as far as I can tell, it’s pretty robust.

All told, I probably spent a bit longer on this than I should have, especially taking into account that it’s not something we do every day.

The flip side of that is that it’s extremely useful to know that old data is not just useless when we update our field mappings, which is nice.


In last week’s post I explained how we would have liked to do updates to our Elasticsearch environment using CloudFormation. Reality disagreed with that approach and we encountered timing problems as a result of the ES cluster and CloudFormation not talking with one another during the update.

Of course, that means that we need to come up with something ourselves to accomplish the same result.

Move In, Now Move Out

Obviously the first thing we have to do is turn off the Update Policy for the Auto Scaling Groups containing the master and data nodes. With that out of the way, we can safely rely on CloudFormation to update the rest of the environment (including the Launch Configuration describing the EC2 instances that make up the cluster), safe in the knowledge that CloudFormation is ready to create new nodes, but will not until we take some sort of action.

At that point its just a matter of controlled node replacement using the auto healing capabilities of the cluster.

If you terminate one of the nodes directly, the AWS Auto Scaling Group will react by creating a replacement EC2 instance, and it will use the latest Launch Configuration for this purpose. When that instance starts up it will get some configuration deployed to it by Octopus Deploy, and shortly afterwards will join the cluster. With a new node in play, the cluster will react accordingly and rebalance, moving shards and replicas to the new node as necessary until everything is balanced and green.

This sort of approach can be written in just about any scripting language; our poison of choice is Powershell, which was then embedded inside the environment Nuget package to be executed whenever an update occurs.

I’d copy the script here, but it’s pretty long and verbose, so here is the high level algorithm instead:

  • Iterate through the master nodes in the cluster
    • Check the version tag of the EC2 instance behind the node
    • If equal to the current version, move on to the next node
    • If not equal to the current version
      • Get the list of current nodes in the cluster
      • Terminate the current master node
      • Wait for the cluster to report that the old node is gone
      • Wait for the cluster to report that the new node exists
  • Iterate through the data nodes in the cluster
    • Check the version tag of the EC2 instance behind the node
    • If equal to the current version, move on to the next node
    • If not equal to the current version
      • Get the list of current nodes in the cluster
      • Terminate the current data node
      • Wait for the cluster to report that the old node is gone
      • Wait for the cluster to report that the new node exists
      • Wait for the cluster to go yellow (indicating rebalancing is occurring)
      • Wait for the cluster to go green (indicating rebalancing is complete). This can take a while, depending on the amount of data in the cluster

As you can see, there isn’t really all that much to the algorithm, and the hardest part of the whole thing is knowing that you should wait for the node to leave/join the cluster and for the cluster to rebalance before moving on to the next replacement.

If you don’t do that, you risk destroying the cluster by taking away too many of its parts before it’s ready (which was exactly the problem with leaving the update to CloudFormation).
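To make that a little more concrete, here is a rough, bash-flavoured sketch of replacing a single data node. Our real script is Powershell and does a lot more checking, so treat the names and URLs here as placeholders:

es_url="http://elasticsearch.example.internal:9200"
instance_id="i-xxxxxxxxxxxxxxxxx"   # a node whose version tag does not match the current version

# remember how many nodes the cluster currently has
node_count=$(curl -s "$es_url/_cat/nodes" | wc -l)

# terminate the old node; the Auto Scaling Group builds a replacement from the latest Launch Configuration
aws ec2 terminate-instances --instance-ids "$instance_id"

# wait for the cluster to notice the old node is gone
while [ "$(curl -s "$es_url/_cat/nodes" | wc -l)" -ge "$node_count" ]; do sleep 10; done

# then wait for the replacement to register with the cluster
until [ "$(curl -s "$es_url/_cat/nodes" | wc -l)" -ge "$node_count" ]; do sleep 30; done

# wait for rebalancing to finish before touching the next node
until curl -s "$es_url/_cluster/health" | grep -q '"status":"green"'; do sleep 60; done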

Hands Up, Now Hands Down

For us, the most common reason to run an update on the ELK environment is when there is a new version of Elasticsearch available. Sure we run updates to fix bugs and tweak things, but those are generally pretty rare (and will get rarer as time goes on and the stack gets more stable).

As a general rule of thumb, assuming you don’t try to jump too far all at once, new versions of Elasticsearch are pretty easily integrated.

In fact, you can usually have nodes in your cluster at the new version while there are still active nodes on the old version, which is nice.

There are at least two caveats that I’m aware of though:

  • The latest version of Kibana generally doesn’t work when you point it towards a mixed cluster. It requires that all nodes are running the same version.
  • If new indexes are created in a mixed cluster, and the primary shards for that index live on a node with the latest version, nodes with the old version cannot be assigned replicas

The first one isn’t too problematic. As long as we do the upgrade overnight (unattended), no-one will notice that Kibana is down for a little while.

The second one is a problem though, especially for our production cluster.

We use hourly indexes for Logstash, so a new index is generally created every hour or so. Unfortunately it takes longer than an hour for the cluster to rebalance after a node is replaced.

This means that the cluster is almost guaranteed to be stuck in the yellow status (indicating unassigned shards, in this case the replicas from the new index that cannot be assigned to the old node), which means that our whole process of “wait for green before continuing” is not going to work properly when we do a version upgrade on the environment that actually matters, production.

Lucky for us, the API for Elasticsearch is pretty amazing, and allows you to get all of the unassigned shards, along with the reason why they were unassigned.

What this means is that we can keep our process the same, and when the “wait for green” part of the algorithm times out, we can check to see whether or not the remaining unassigned shards are just version conflicts, and if they are, just move on.
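In practice that check is just a filtered view over the shards that Elasticsearch reports as unassigned, along these lines (exactly which columns you care about is up to you):

curl -s "{elasticsearch_url}/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED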

Works like a charm.

Tell Me What You’re Gonna Do Now

The last thing that we need to take into account during an upgrade is related to Octopus Tentacles.

Each Elasticsearch node that is created by the Auto Scaling Group registers itself as a Tentacle so that it can have the Elasticsearch configuration deployed to it after coming online.

With us terminating nodes constantly during the upgrade, we generate a decent number of dead Tentacles in Octopus Deploy, which is not a situation you want to be in.

The latest versions (3+ I think) of Octopus Deploy allow you to automatically remove dead tentacles whenever a deployment occurs, but I’m still not sure how comfortable I am with that piece of functionality. It seems like if your Tentacle is dead for a bad reason (i.e. it’s still there, but broken) then you probably don’t want to just clean it up and keep on chugging along.

At this point I would rather clean up the Tentacles that I know to be dead because of my actions.

As a result of this, one of the outputs from the upgrade process is a list of the EC2 instances that were terminated. We can easily use the instance name to look up the Tentacle in Octopus Deploy, and remove it.
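Removing the Tentacle is then just a couple of calls to the Octopus REST API. A hedged sketch of the idea (the server URL, API key and use of jq are assumptions, not our actual implementation):

octopus_url="https://octopus.example.internal"
api_key="API-XXXXXXXXXXXXXXXX"
instance_name="i-xxxxxxxxxxxxxxxxx"   # from the list of terminated instances produced by the upgrade

# find the machine whose name matches the terminated EC2 instance, then delete it
machine_id=$(curl -s -H "X-Octopus-ApiKey: $api_key" "$octopus_url/api/machines/all" | jq -r --arg name "$instance_name" '.[] | select(.Name == $name) | .Id')
curl -s -X DELETE -H "X-Octopus-ApiKey: $api_key" "$octopus_url/api/machines/$machine_id"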

Conclusion

What we’re left with at the end of this whole adventure is a fully automated process that allows us to easily deploy changes to our ELK environment and be confident that not only have all of the AWS components updated as we expect them to, but that Elasticsearch has been upgraded as well.

Essentially exactly what we would have had if the CloudFormation update policy had worked the way that I initially expected it to.

Speaking of which, it would be nice if AWS gave a little bit more control over that update policy (like timing, or waiting for a signal from a system component before moving on), but you can’t win them all.

Honestly, I wouldn’t be surprised if there was a way to override the whole thing with a custom behaviour, or maybe a custom CloudFormation resource or something, but I wouldn’t even know where to start with that.

I’ve probably run the update process around 10 times at this point, and while I usually discover something each time I run it, each tweak makes it more and more stable.

The real test will be what happens when Elastic.co releases version 6 of Elasticsearch and I try to upgrade.

I foresee explosions.


It’s been a little while since I made a post on Elasticsearch. Time to remedy that.

Our log stack has been humming along relatively well ever since we took control of it. It’s not perfect, but it’s much better than it was.

One of the nicest side effects of the restructure has been the capability to test our changes in the CI/Staging environments before pushing them into Production. It’s saved us from a few boneheaded mistakes already (mostly just ES configuration blunders), which has been great to see. It does make pushing things into the environments we actually care about a little bit slower than it otherwise would be, but I’m willing to make that tradeoff for a bit more safety.

When I was putting together the process for deploying our log stack (via Nuget, Powershell and Octopus Deploy), I tried to keep in mind what it would be like when I needed to deploy an Elasticsearch version upgrade. To be honest, I thought I had a pretty good handle on it:

  • Make an AMI with the new version of Elasticsearch on it
  • Change the environment definition to reference this new AMI instead of the old one
  • Deploy the updated package, leveraging the Auto Scaling Group instance replacement functionality
  • Dance like no-one is watching

The dancing part worked perfectly. I am a graceful swan.

The rest? Not so much.

Rollin’, Rollin’

I think the core issue was that I had a little bit too much faith in Elasticsearch to react quickly and robustly in the face of random nodes dying and being replaced.

Don’t get me wrong, it’s pretty amazing at what it does, but there are definitely situations where it is understandably incapable of adjusting and balancing itself.

Case in point, the process that occurs when an AWS Auto Scaling Group starts doing a rolling update because the definition of its EC2 instance launch configuration has changed.

When you use CloudFormation to initialize an Auto Scaling Group, you define the instances inside that group through a configuration structure called a Launch Configuration. This structure contains the definition of your EC2 instances, including the base AMI, security groups, tags and other meta information, along with any initialization that needs to be performed on startup (user data, CFN init, etc).

Inside the Auto Scaling Group definition in the template, you decide what should be the appropriate reaction upon detecting changes to the launch configuration, which mostly amounts to a choice between “do nothing” or “start replacing the instances in a sane way”. That second option is referred to as a “rolling update”, and you can specify a policy in the template for how you would like it to occur.
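For reference, such a policy looks roughly like this in the template (the values here are illustrative rather than what we actually run with):

"UpdatePolicy": {
  "AutoScalingRollingUpdate": {
    "MinInstancesInService": "2",
    "MaxBatchSize": "1",
    "PauseTime": "PT15M",
    "WaitOnResourceSignals": "true"
  }
}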

For our environment, a new ES version means a new AMI, so theoretically, it should be a simple matter to update the Launch Configuration with the new AMI and push out an update, relying on the Auto Scaling Group to replace the old nodes with the new ones, and relying on Elasticsearch to rebalance and adjust as appropriate.

Not that simple unfortunately, as I learned when I tried to apply it to the ES Master and Data ASGs in our ELK template.

Whenever changes were detected, CloudFormation would spin up a new node, wait for it to complete its initialization (which was just machine up + octopus tentacle registration), then it would terminate an old node and rinse and repeat until everything was replaced. This happened for both the master nodes and data nodes at the same time (two different Auto Scaling Groups).

Elasticsearch didn’t stand a chance.

With no feedback loop between ES and CloudFormation, there was no way for ES to tell CloudFormation to wait until it had rebalanced the cluster, replicated the shards and generally recovered from the traumatic event of having a piece of itself ripped out and thrown away.

The end result? Pretty much every scrap of data in the environment disappeared.

Good thing it was a scratch environment.

Rollin’, Rollin’

Sticking with the whole “we should probably leverage CloudFormation” approach, I implemented a script to wait for the node to join the cluster and for the cluster to be green (bash scripting is fun!). The intent was that this script would be present in the baseline ES AMI, would be executed as part of the user data during EC2 instance initialization, and would essentially force the auto scaling process to wait for Elasticsearch to actually be functional before moving on.
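I won’t reproduce the real script, but the shape of the idea is roughly this (assuming Elasticsearch is listening locally on the default port):

#!/bin/bash
es_url="http://localhost:9200"

# wait for the local Elasticsearch node to start responding at all
until curl -s "$es_url" > /dev/null; do sleep 10; done

# then block until the cluster reports green, retrying if the health request times out first
until curl -s "$es_url/_cluster/health?wait_for_status=green&timeout=60s" | grep -q '"status":"green"'; do
  sleep 10
done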

This wrought havoc with the initial environment creation though, as the cluster isn’t even valid until enough master nodes exist to elect a master (which is 3), so while it kind of worked for the updates, initial creation was broken.

Not only that, but in a cluster with a decent amount of data, the whole “wait for green” thing takes longer than the maximum time allowed for CloudFormation Auto Scaling Group EC2 instance replacements, which would cause the auto scaling to time out and the entire stack to fail.

So we couldn’t use CloudFormation directly.

The downside of that is that CloudFormation is really good at detecting changes and determining if it actually has to do anything, so not only did we need to find another way to update our nodes, we needed to find a mechanism that would safely know when that node update should be applied.

To Be Continued

That’s enough Elasticsearch for now I think, so next time I’ll continue with the approach we actually settled on.