Keep On Rolling Baby, Part 2

August 15. 2017

In last week’s post I explained how we would have liked to do updates to our Elasticsearch environment using CloudFormation. Reality disagreed with that approach, and we encountered timing problems as a result of the ES cluster and CloudFormation not talking with one another during the update.

Of course, that means that we need to come up with something ourselves to accomplish the same result.

Move In, Now Move Out

Obviously the first thing we have to do is turn off the Update Policy for the Auto Scaling Groups containing the master and data nodes. With that out of the way, we can safely rely on CloudFormation to update the rest of the environment (including the Launch Configuration describing the EC2 instances that make up the cluster), safe in the knowledge that CloudFormation is ready to create new nodes, but will not until we take some sort of action.

At that point it’s just a matter of controlled node replacement using the auto healing capabilities of the cluster.

If you terminate one of the nodes directly, the AWS Auto Scaling Group will react by creating a replacement EC2 instance, and it will use the latest Launch Configuration for this purpose. When that instance starts up it will get some configuration deployed to it by Octopus Deploy, and shortly afterwards will join the cluster. With a new node in play, the cluster will react accordingly and rebalance, moving shards and replicas to the new node as necessary until everything is balanced and green.

This sort of approach can be written in just about any scripting language. Our poison of choice is Powershell, and the resulting script was embedded inside the environment nuget package to be executed whenever an update occurs.

I’d copy the script here, but it’s pretty long and verbose, so here is the high level algorithm instead:

Iterate through the master nodes in the cluster
- Check the version tag of the EC2 instance behind the node
- If equal to the current version, move on to the next node
- If not equal to the current version
  - Get the list of current nodes in the cluster
  - Terminate the current master node
  - Wait for the cluster to report that the old node is gone
  - Wait for the cluster to report that the new node exists
Iterate through the data nodes in the cluster
- Check the version tag of the EC2 instance behind the node
- If equal to the current version, move on to the next node
- If not equal to the current version
  - Get the list of current nodes in the cluster
  - Terminate the current data node
  - Wait for the cluster to report that the old node is gone
  - Wait for the cluster to report that the new node exists
  - Wait for the cluster to go yellow (indicating rebalancing is occurring)
  - Wait for the cluster to go green (indicating rebalancing is complete). This can take a while, depending on the amount of data in the cluster
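
The steps above can be sketched in a few lines. This is a rough Python illustration of the loop, not the actual Powershell script. The names `roll_nodes`, `version_of`, `replace` and `settle` are all made up for the example, and the callables stand in for the real AWS and Elasticsearch calls: `replace` is assumed to terminate the instance and block until the old node is gone and a new one has joined, and `settle` is assumed to block until the cluster has gone yellow and then green.

```python
from typing import Callable, Iterable, List


def roll_nodes(
    nodes: Iterable[str],
    current_version: str,
    version_of: Callable[[str], str],
    replace: Callable[[str], None],
    settle: Callable[[], None] = lambda: None,
) -> List[str]:
    """Replace every out-of-date node, one at a time, and return the replaced nodes."""
    replaced = []
    for node in nodes:
        if version_of(node) == current_version:
            continue  # already up to date, move on to the next node
        replace(node)  # hypothetical: terminate, wait for old node gone and new node joined
        settle()       # hypothetical: wait for yellow then green (a no-op by default)
        replaced.append(node)
    return replaced
```

Under these assumptions the master nodes would be rolled with the default no-op `settle`, and the data nodes with a `settle` that waits for the rebalance. The returned list doubles as a record of which instances were terminated.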

As you can see, there isn’t really all that much to the algorithm, and the hardest part of the whole thing is knowing that you should wait for the node to leave/join the cluster and for the cluster to rebalance before moving on to the next replacement.

If you don’t do that, you risk destroying the cluster by taking away too many of its parts before it’s ready (which was exactly the problem with leaving the update to CloudFormation).
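
To make the waiting concrete, here is a minimal Python sketch of a polling helper plus a membership check. Both `wait_for` and `node_replaced` are hypothetical names, and the sets of node names are assumed to come from whatever call you use to list the nodes in the cluster. Note that it folds "old node gone" and "new node exists" into a single check, whereas the algorithm above waits for them separately.

```python
import time
from typing import Callable, Set


def wait_for(
    condition: Callable[[], bool],
    timeout: float,
    interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll condition until it returns True or timeout seconds have passed."""
    deadline = clock() + timeout
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False  # timed out; the caller decides what to do next
        sleep(interval)


def node_replaced(before: Set[str], after: Set[str], old_node: str) -> bool:
    """True once old_node has left and a node that was not there before has joined."""
    return old_node not in after and bool(after - before)
```

The `clock` and `sleep` parameters exist only so the helper can be exercised without actually waiting. Returning `False` on timeout, rather than raising, leaves room for the caller to inspect the cluster before giving up.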

Hands Up, Now Hands Down

For us, the most common reason to run an update on the ELK environment is when there is a new version of Elasticsearch available. Sure, we run updates to fix bugs and tweak things, but those are generally pretty rare (and will get rarer as time goes on and the stack gets more stable).

As a general rule of thumb, assuming you don’t try to jump too far all at once, new versions of Elasticsearch are pretty easily integrated.

In fact, you can usually have nodes in your cluster at the new version while there are still active nodes on the old version, which is nice.

There are at least two caveats that I’m aware of though:

- The latest version of Kibana generally doesn’t work when you point it towards a mixed cluster. It requires that all nodes are running the same version.
- If new indexes are created in a mixed cluster, and the primary shards for that index live on a node with the latest version, nodes with the old version cannot be assigned replicas.

The first one isn’t too problematic. As long as we do the upgrade overnight (unattended), no-one will notice that Kibana is down for a little while.

The second one is a problem though, especially for our production cluster.

We use hourly indexes for Logstash, so a new index is generally created every hour or so. Unfortunately it takes longer than an hour for the cluster to rebalance after a node is replaced.

This means that the cluster is almost guaranteed to be stuck in the yellow status (indicating unassigned shards, in this case the replicas from the new index that cannot be assigned to the old node). As a result, our whole process of “wait for green before continuing” is not going to work properly when we do a version upgrade on the environment that actually matters: production.

Lucky for us, the API for Elasticsearch is pretty amazing, and allows you to get all of the unassigned shards, along with the reason why they were unassigned.

What this means is that we can keep our process the same, and when the “wait for green” part of the algorithm times out, we can check to see whether or not the remaining unassigned shards are just version conflicts, and if they are, just move on.
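
A minimal Python sketch of that check might look like the following. The input shape is an assumption made for the example: each unassigned shard is a dict with a `primary` flag and a `reason` label that you have already derived from the Elasticsearch response, and `"version_conflict"` is a placeholder label, not a value the API itself returns.

```python
from typing import Any, Dict, Iterable

# Placeholder label for "could not be assigned because of a node version mismatch".
VERSION_CONFLICT = "version_conflict"


def only_version_conflicts(unassigned: Iterable[Dict[str, Any]]) -> bool:
    """True if every remaining unassigned shard is a replica held back by a version conflict."""
    for shard in unassigned:
        if shard["primary"]:
            return False  # an unassigned primary is never safe to ignore
        if shard["reason"] != VERSION_CONFLICT:
            return False  # some other problem, so do not move on
    return True
```

The function is deliberately strict: a single shard with any other reason means the upgrade should stop rather than continue to the next node.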

Works like a charm.

Tell Me What You’re Gonna Do Now

The last thing that we need to take into account during an upgrade is related to Octopus Tentacles.

Each Elasticsearch node that is created by the Auto Scaling Group registers itself as a Tentacle so that it can have the Elasticsearch configuration deployed to it after coming online.

With us terminating nodes constantly during the upgrade, we generate a decent number of dead Tentacles in Octopus Deploy, which is not a situation you want to be in.

The latest versions (3+ I think) of Octopus Deploy allow you to automatically remove dead Tentacles whenever a deployment occurs, but I’m still not sure how comfortable I am with that piece of functionality. It seems like if your Tentacle is dead for a bad reason (i.e. it’s still there, but broken) then you probably don’t want to just clean it up and keep on chugging along.

At this point I would rather clean up the Tentacles that I know to be dead because of my actions.

As a result of this, one of the outputs from the upgrade process is a list of the EC2 instances that were terminated. We can easily use the instance name to look up the Tentacle in Octopus Deploy, and remove it.
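
The lookup itself is simple enough to sketch in Python. This assumes the machines registered in Octopus Deploy have already been fetched as a list of dicts with `Id` and `Name` keys, and that each Tentacle was registered under its EC2 instance name. The function name `tentacles_to_remove` is made up for the example, and the calls that fetch and delete machines are left out.

```python
from typing import Dict, Iterable, List


def tentacles_to_remove(
    machines: Iterable[Dict[str, str]],
    terminated: Iterable[str],
) -> List[str]:
    """Return the Ids of machines whose Name matches a terminated instance name."""
    dead = set(terminated)
    return [machine["Id"] for machine in machines if machine["Name"] in dead]
```

Only machines named in the terminated list are returned, so a Tentacle that is dead for some other reason is left alone for a human to look at.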

Conclusion

What we’re left with at the end of this whole adventure is a fully automated process that allows us to easily deploy changes to our ELK environment and be confident that not only have all of the AWS components updated as we expect them to, but that Elasticsearch has been upgraded as well.

Essentially exactly what we would have had if the CloudFormation update policy had worked the way that I initially expected it to.

Speaking of which, it would be nice if AWS gave a little bit more control over that update policy (like timing, or waiting for a signal from a system component before moving on), but you can’t win them all.

Honestly, I wouldn’t be surprised if there was a way to override the whole thing with a custom behaviour, or maybe a custom CloudFormation resource or something, but I wouldn’t even know where to start with that.

I’ve probably run the update process around 10 times at this point, and while I usually discover something each time I run it, each tweak makes it more and more stable.

The real test will be what happens when Elastic.co releases version 6 of Elasticsearch and I try to upgrade.

I foresee explosions.