It’s been a little while since I made a post on Elasticsearch. Time to remedy that.
Our log stack has been humming along relatively well ever since we took control of it. It’s not perfect, but it’s much better than it was.
One of the nicest side effects of the restructure has been the capability to test our changes in the CI/Staging environments before pushing them into Production. It’s saved us from a few boneheaded mistakes already (mostly just ES configuration blunders), which has been great to see. It does make pushing things into the environment we actually care about a little bit slower than it otherwise would be, but I’m willing to make that tradeoff for a bit more safety.
When I was putting together the process for deploying our log stack (via NuGet, PowerShell and Octopus Deploy), I tried to keep in mind what it would be like when I needed to deploy an Elasticsearch version upgrade. To be honest, I thought I had a pretty good handle on it:
- Make an AMI with the new version of Elasticsearch on it
- Change the environment definition to reference this new AMI instead of the old one
- Deploy the updated package, leveraging the Auto Scaling Group instance replacement functionality
- Dance like no-one is watching
The dancing part worked perfectly. I am a graceful swan.
The rest? Not so much.
I think the core issue was that I had a little bit too much faith in Elasticsearch to react quickly and robustly in the face of random nodes dying and being replaced.
Don’t get me wrong, it’s pretty amazing at what it does, but there are definitely situations where it is understandably incapable of adjusting and balancing itself.
Case in point, the process that occurs when an AWS Auto Scaling Group starts doing a rolling update because the definition of its EC2 instance launch configuration has changed.
When you use CloudFormation to initialize an Auto Scaling Group, you define the instances inside that group through a configuration structure called a Launch Configuration. This structure contains the definition of your EC2 instances, including the base AMI, security groups, tags and other meta information, along with any initialization that needs to be performed on startup (user data, CFN init, etc).
Inside the Auto Scaling Group definition in the template, you decide what should be the appropriate reaction upon detecting changes to the launch configuration, which mostly amounts to a choice between “do nothing” or “start replacing the instances in a sane way”. That second option is referred to as a “rolling update”, and you can specify a policy in the template for how you would like it to occur.
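To make the rolling update policy a little more concrete, here is a rough sketch of what that part of a template can look like, built as a Python dictionary and dumped to JSON. The `UpdatePolicy` attribute and its `AutoScalingRollingUpdate` keys are real CloudFormation names, but the resource name, sizes and pause time are made up for illustration and are not the values from our actual template.

```python
import json


def rolling_update_policy(min_in_service, batch_size, pause_time):
    """Build an UpdatePolicy attribute that asks for a rolling update."""
    return {
        "AutoScalingRollingUpdate": {
            # How many instances must stay in service while others are replaced.
            "MinInstancesInService": min_in_service,
            # How many instances may be replaced at once.
            "MaxBatchSize": batch_size,
            # ISO 8601 duration to wait after each batch (or for signals).
            "PauseTime": pause_time,
            # Wait for a success signal from each new instance before moving on.
            "WaitOnResourceSignals": True,
        }
    }


def auto_scaling_group(launch_config_name, size, update_policy):
    """Build a minimal Auto Scaling Group resource (hypothetical values)."""
    return {
        "Type": "AWS::AutoScaling::AutoScalingGroup",
        "Properties": {
            "LaunchConfigurationName": {"Ref": launch_config_name},
            "MinSize": str(size),
            # Assumption: one spare slot so a new node can start before an
            # old one is terminated.
            "MaxSize": str(size + 1),
            "AvailabilityZones": {"Fn::GetAZs": ""},
        },
        "UpdatePolicy": update_policy,
    }


# "DataLaunchConfig" is a hypothetical logical name for the launch configuration.
asg = auto_scaling_group(
    "DataLaunchConfig", 3, rolling_update_policy(3, 1, "PT10M")
)
print(json.dumps({"Resources": {"DataAutoScalingGroup": asg}}, indent=2))
```

Leaving the `UpdatePolicy` attribute off entirely is the “do nothing” option: the launch configuration changes, but the existing instances are left alone.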
For our environment, a new ES version means a new AMI, so theoretically, it should be a simple matter to update the Launch Configuration with the new AMI and push out an update, relying on the Auto Scaling Group to replace the old nodes with the new ones, and relying on Elasticsearch to rebalance and adjust as appropriate.
Not that simple unfortunately, as I learned when I tried to apply it to the ES Master and Data ASGs in our ELK template.
Whenever changes were detected, CloudFormation would spin up a new node, wait for it to complete its initialization (which was just machine up + Octopus Tentacle registration), then terminate an old node, rinsing and repeating until everything was replaced. This happened for both the master nodes and data nodes at the same time (two different Auto Scaling Groups).
Elasticsearch didn’t stand a chance.
With no feedback loop between ES and CloudFormation, there was no way for ES to tell CloudFormation to wait until it had rebalanced the cluster, replicated the shards and generally recovered from the traumatic event of having a piece of itself ripped out and thrown away.
The end result? Pretty much every scrap of data in the environment disappeared.
Good thing it was a scratch environment.
Sticking with the whole “we should probably leverage CloudFormation” approach, I implemented a script to wait for the node to join the cluster and for the cluster to be green (bash scripting is fun!). The intent was that this script would be present in the baseline ES AMI, would be executed as part of the user data during EC2 instance initialization, and would essentially force the auto scaling process to wait for Elasticsearch to actually be functional before moving on.
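The real script was bash, but the idea is easy enough to sketch in Python. This is not the script we used, just a minimal illustration of the “wait for green” loop. It polls the Elasticsearch `_cluster/health` endpoint and checks the `status` and `number_of_nodes` fields in the response. The function names, the expected node count and the deadline are all assumptions for the example.

```python
import json
import time
import urllib.request


def fetch_health(base_url, timeout_s=10):
    """GET <base_url>/_cluster/health and return the parsed JSON body."""
    url = f"{base_url}/_cluster/health"
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_green(fetch, expected_nodes, deadline_s, poll_s=5.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until the cluster is green with enough nodes, or give up.

    `fetch` is any zero-argument callable returning the health document.
    Returns True once green, False if the deadline passes first.
    """
    give_up_at = clock() + deadline_s
    while True:
        try:
            health = fetch()
        except (OSError, ValueError):
            # Node not listening yet, or the response was not valid JSON.
            health = None
        if (health
                and health.get("status") == "green"
                and health.get("number_of_nodes", 0) >= expected_nodes):
            return True
        if clock() >= give_up_at:
            return False
        sleep(poll_s)


# Hypothetical usage from instance start-up (URL and numbers are assumptions):
# ok = wait_for_green(lambda: fetch_health("http://localhost:9200"),
#                     expected_nodes=3, deadline_s=600)
```

The important design point is the deadline: without one, a node that can never reach green will block its own initialization forever.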
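This wrought havoc with the initial environment creation though, as the cluster isn’t even valid until enough master nodes exist to elect a primary (three, in our case). A brand new node waiting for green would therefore wait forever, because the nodes it needed had not been created yet. So while it kind of worked for updates, initial creation was broken.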
Not only that, but in a cluster with a decent amount of data, the whole “wait for green” thing takes longer than the maximum time allowed for CloudFormation Auto Scaling Group EC2 instance replacements, which would cause the auto scaling to time out and the entire stack to fail.
So we couldn’t use CloudFormation directly.
The downside of that is that CloudFormation is really good at detecting changes and determining if it actually has to do anything, so not only did we need to find another way to update our nodes, we needed to find a mechanism that would safely know when that node update should be applied.
To Be Continued
That’s enough Elasticsearch for now I think, so next time I’ll continue with the approach we actually settled on.