
Continuous delivery of your software is something you should aspire to, even if you may never quite reach that lofty goal of “code gets committed, changes available in production”.

There is just so much business value to be had by making sure that your improvements, features and bug fixes are always in the hands of your users, and never spend dead time waiting for the stars to align and that mysterious release window to open up.

Of course, it can all take an incredible amount of engineering effort, so as with everything in software, you probably need to think about what exactly you are trying to accomplish and how much you’re willing to pay for it.

Along the way, you’ll run across a wide variety of situations that make constant deployments challenging, and that’s where this post gets relevant. One of my teams has recently been made responsible for an API whose core function is to execute tasks that might take up to an hour to run (spoiler alert, it’s data migration), and that means being able to arbitrarily deploy our changes whenever we want is just not a capability we have right now.

In fact, the entire deployment process for this particular API is a bit of a special flower, differing in a number of ways from the rest of the organization.

And special flowers are not to be tolerated

Its A Bit Of A Marathon

I’ve written about this a few times already, but our organization (like most) has an old, highly profitable piece of legacy desktop software. Its future is…limited, to say the least, so we’re engaged on a long term project to build a replacement that offers similar (but better) features using a SaaS (Software as a Service) model.

Ideally, we want to make the transition from old and busted to new hotness as easy as possible for every single one of our users, so there is a huge amount of value to be gained by investing in a reliable and painless migration process.

We’re definitely not there yet, but it’s getting closer with every deployment that we make.

Architecturally, we have a specialist migration tool, backed by an API, all of which is completely separate from the main user interface to the system. At this point in time, migrations are executed by an internal team, but the dream is that the user will be able to do it themselves.

The API is basically a fancy ETL, in that it gets data from some source (extract), transforms it into a format that works for the cloud product (transform) and then injects everything as appropriate via the cloud APIs (load). It’s written in Kotlin on the JVM and leverages Spring for its API-ness, and Spring Batch for job scheduling and management. Deployment wise, the API is encapsulated in a Docker image (output from our build process) and when it’s time to ship a new version, the existing Docker containers are simply replaced with new ones in a controlled fashion.

More importantly to the blog post at hand, each migration is a relatively long running task that executes a series of steps in sequence in order to get customer data from legacy system A into new shiny cloud system B.

Combine “long running uninterruptable task” with “container replacement” and you get in-flight migrations being terminated every time a deployment occurs, which in turn leads to the fear, and we all know where fear leads.

A manual deployment process, gated by another manual “hey, are you guys running any migrations” process.

Waiting At The Finish Line

To allow for arbitrary deployments, one of the simplest solutions is to have the deployment process simply wait for any in-flight migrations to complete before it does anything destructive.

Automation wise, the approach is pretty straightforward:

  • Implement a new endpoint on the API that returns whether the API instance can be safely shut down, using in-memory information about migrations in flight
  • Change the deployment process to use this endpoint to decide whether or not the container is ready to be replaced

With the way we’re using Spring Batch, each “migration” job runs from start to finish on a single API instance, so it’s simple enough to increment an in-memory count whenever a job starts, and decrement it when the job finishes (or fails).

The deployment process then just waits for each container to state whether or not it’s allowed to shut down before tearing anything down. Specifically, each individual container needs to be asked, not the API through the load balancer, because each container only knows about its own state.
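
To make that a bit more concrete, here is a rough sketch of what the deployment-side check might look like in Powershell. The endpoint path and the shape of its response are illustrative assumptions, not the actual API:

function Wait-ForContainersReadyToShutdown
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string[]]$containerUrls,

        [int]$timeoutSeconds = 7200,
        [int]$pollIntervalSeconds = 60
    )

    $timeout = (Get-Date).AddSeconds($timeoutSeconds);

    while ((Get-Date) -lt $timeout)
    {
        # Ask each container directly (not via the load balancer), because each one only
        # knows about the migrations running in its own memory.
        $notReady = @($containerUrls | Where-Object {
            $status = Invoke-RestMethod -Method GET -Uri "$_/admin/can-shutdown" -Verbose:$false;
            -not $status.canShutdown;
        });

        if ($notReady.Count -eq 0)
        {
            Write-Verbose "All containers report no in-flight migrations; safe to replace them";
            return;
        }

        Write-Verbose "Still waiting on [$($notReady -join ', ')]";
        Start-Sleep -Seconds $pollIntervalSeconds;
    }

    throw "Timed out waiting for in-flight migrations to complete";
}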

This approach has the unfortunate side effect of making it hard to reason about how long a deployment might actually take, as a migration in flight could have anywhere from a few seconds to a few hours of runtime left, and the deployment cannot continue until all migrations have finished. Additionally, if another migration is started while the process is waiting for the in-flight migrations to complete, it might never get the chance to continue, which is troublesome.

That is, unless you put the API into “maintenance mode” or something, where it’s not allowed to start new migrations. That’s downtime though, which isn’t continuous delivery.

Running More Than One Race

A slight tweak to the first solution is to allow for some parallel execution between the old and new containers:

  • Spin up as many new version API containers as necessary
  • Take all of the old containers out of the load balancer (or equivalent) so no new migrations can be started on them
  • Leave the old ones around, but only until they finish up their migrations, and then terminate

This allows for continuous operation of the service (which is in line with the original goal, of the user not knowing that anything is going on behind the scenes), but can lead to complications.

The main one is: what happens if the new API version contains any database updates? Those might make the database incompatible with the old version, which would cause everything to explode. Obviously, there is value in making sure that changes are at least one version backwards compatible, but that can be hard to enforce automatically, and it can be dangerous to just leave it up to people to remember.

The other complication is that this approach assumes that the new containers can answer requests about the jobs running on the old containers (i.e. status), which is probably true if everything is behind a load balancer anyway, but it’s still something to be aware of.

So again, not an ideal solution, but at least it maintains availability while doing its thing.

Or We Could Just Do Sprints

If you really want to offer continuous delivery with something that does long running background tasks, the answer is to not have long running background tasks.

Well, to be fair, the entire operation can be long running, but it needs to be broken down into smaller atomic elements.

With smaller constituent elements, you can use a very similar process to the solutions above:

  • Spin up a bunch of new containers, have them automatically pick up new tasks as they become available
  • Stop traffic going directly to the old containers
  • Mark old containers as “to be shutdown” so they don’t grab new tasks
  • Wait for each old container to be “finished” and murder it

You get a much tighter deployment cycle, and you also get the nice side effect of potentially allowing for parallelisation of tasks across multiple containers (if there are tasks that will allow it obviously).

Conclusion

For our API, we went with the first option (wait for long running tasks to finish, then shut down), mostly because it was the simplest, and with the assumption that we’ll probably revisit it in time. It’s still a vast improvement over the manual “only do deployments when the system is not in use” approach, so I consider it a win.

More generally, and to echo my opening statement, the idea of “continuous delivery” is something that should be aimed for at the very least, even if you might not make it all the way. When you’re free to deploy any time that you want, you gain a lot of flexibility in the way that you are able to react to things, especially bug fixes and the like.

Also, each deployment is likely to be smaller, as you don’t have to wait for an “acceptable window” and bundle up a bunch of stuff together when that window arrives. This means that if you do get a failure (and you probably will) it’s much easier to reason about what might have gone wrong.

Mostly I’m just really tired of only being able to deploy on Sundays though, so anything that stops that practice is amazing from my point of view.


A wild technical post appears!

This week’s post returns to a topic very close to my heart, the Elasticsearch, Logstash and Kibana (ELK) Stack that we use for log aggregation. As you might be able to tell from my post history, logging, metrics and business intelligence rank pretty high on my list of priorities, regardless of any other focuses I might have. To me, if you don’t have good intelligence, you might as well be fighting in the dark, flailing about in the hopes that you hit something important.

This post specifically is about the process by which we deploy new versions of Elasticsearch, and an issue that can occur when you do rolling deployments and the Elasticsearch cluster is hosted in AWS.

Version Control

Way back in August 2017 I wrote about automating the deployment of new Elasticsearch versions to our ELK stack.

Long story short, the part of that post that is relevant to this one is the bit about unassigned shards in Elasticsearch when rebalancing after a version upgrade. Specifically, if you have nodes that are at a later version of Elasticsearch than others (which is normal when doing a rolling deployment), and a later version node is elected to hold the primary shard, replicas cannot be assigned to any of the nodes with the lower version.

This is troublesome if you’re waiting for a cluster to go green before progressing to the next node replacement,  because unassigned shards equal a yellow cluster. You’ll be waiting forever (or you’ll hit your timeout because you were smart enough to put a timeout in, right?).

Without some additional change, the system will never reach a state of equilibrium.

La La La I Can’t Hear You

To expand on the content of the initial post, the solution was to check that all remaining unassigned shards were version conflicts whenever an appropriate end state was reached. An end state would be something like a timeout waiting for the cluster to go green, or maybe something fancier like “the number of unassigned shards has not changed over a period of time”.

If the only unassigned shards left are version conflicts, it’s relatively safe to just continue on with the process and let Elasticsearch sort it out (which it will, once all of the nodes are replaced). There is minimal risk of data loss (the primary shards have to exist for this problem to occur in the first place), and each time a new node comes online, the cluster will rebalance into a better state anyway.

The script for checking for version conflicts is below:

function Get-UnassignedShards
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$elasticsearchUrl
    )

    $shards = Invoke-RestMethod -Method GET -Uri "$elasticsearchUrl/_cat/shards" -Headers @{"accept"="application/json"} -Verbose:$false;
    $unassigned = $shards | Where-Object { $_.state -eq "UNASSIGNED" };

    return $unassigned;
}

function Test-AllUnassignedShardsAreVersionConflicts
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$elasticsearchUrl
    )

    Write-Verbose "Getting all UNASSIGNED shards, to see if all of them are UNASSIGNED because of version conflicts";

    $unassigned = Get-UnassignedShards -elasticsearchUrl $elasticsearchUrl;

    foreach ($unassignedShard in $unassigned)
    {
        $primary = "true";
        if ($unassignedShard.prirep -eq "r")
        {
            $primary = "false";
        }
        $explainBody = "{ `"index`": `"$($unassignedShard.index)`", `"shard`": $($unassignedShard.shard), `"primary`": $primary }";
        Write-Verbose "Getting allocation explanation using query [$explainBody]";
        $explain = Invoke-RestMethod -Method POST -Uri "$elasticsearchUrl/_cluster/allocation/explain" -Headers @{"accept"="application/json"} -Body $explainBody -Verbose:$false;

        # A shard is only acceptably unassigned if every allocation decider explanation matches
        # one of these patterns: a version mismatch between the source and target node, or the
        # shard already having a copy on that node.
        $versionConflictRegex = "target node version.*is older than the source node version.*";
        $sameNodeConflictRegex = "the shard cannot be allocated to the same node on which a copy of the shard already exists";
        $explanations = @();
        foreach ($node in $explain.node_allocation_decisions)
        {
            foreach ($decider in $node.deciders)
            {
                $explanations += @{Node=$node.node_name;Explanation=$decider.explanation};
            }
        }

        foreach ($explanation in $explanations)
        {
            if ($explanation.explanation -notmatch $versionConflictRegex -and $explanation.explanation -notmatch $sameNodeConflictRegex)
            {
                Write-Verbose "The node [$($explanation.Node)] in the explanation for shard [$($unassignedShard.index):$($unassignedShard.shard)] contained an allocation decider explanation that was unacceptable (i.e. not version conflict and not same node conflict). It was [$($explanation.Explanation)]";
                return $false;
            }
        }
    }

    return $true;
}
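
For context, here is roughly how that check slots into the “wait for green” part of the deployment. The helper below is a simplified sketch of the real process (the timeout value is an assumption), not a copy of it:

function Wait-ForGreenOrAcceptableYellow
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$elasticsearchUrl,

        [int]$timeoutSeconds = 1800
    )

    # The cluster health API can wait server side for a desired status before responding.
    $health = Invoke-RestMethod -Method GET -Uri "$elasticsearchUrl/_cluster/health?wait_for_status=green&timeout=$($timeoutSeconds)s" -Headers @{"accept"="application/json"} -Verbose:$false;

    if ($health.status -eq "green")
    {
        return $true;
    }

    Write-Verbose "Cluster is still [$($health.status)] after [$timeoutSeconds] seconds; checking why the remaining shards are unassigned";

    # If everything left over is just a version conflict, the deployment is safe to continue.
    return Test-AllUnassignedShardsAreVersionConflicts -elasticsearchUrl $elasticsearchUrl;
}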

In The Zone…Or Out Of It?

This solution works really well for the specific issue that it was meant to detect, but to absolutely no-one’s surprise, it doesn’t work so well for other problems.

Case in point, if your Elasticsearch cluster is AWS Availability Zone aware, then you can encounter a very similar problem to what I’ve just described, except with availability zone conflicts instead of version conflicts.

An availability zone aware Elasticsearch cluster will avoid putting shard replicas in the same availability zone as the primary (within reason), which is just another way to protect itself against losing data in the event of a catastrophic failure. I’m sure you can disable the functionality, but that seems like a relatively sane safety measure, so I’m not sure why you would.

Unfortunately, when combined with version conflicts also preventing shard allocation, you can be left in a situation where there is no appropriate place to dump a shard, so our deployment process can’t move on because the cluster never goes green.

Interestingly enough, there are two possible solutions for this:

  • The first is to be more careful about the order that you annihilate nodes in. Alternating availability zones is the way to go here, but this approach can get complicated if you’re also dealing with version conflicts at the same time. Also, it doesn’t really work all that well if you don’t have a full complement of nodes (with redundancy) spread about both availability zones.
  • The second is to just replicate the version conflict solution above, except for unassigned shards as a result of availability zone conflicts. This is by far the easier and less fiddly approach, assuming that the entire deployment finishes (so the cluster can rebalance as appropriate)

I haven’t actually updated our deployment since I discovered the issue, but my current plan is to go with the second option and see how far I get.
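
If that pans out, the change is probably just another regex in the version conflict check from earlier, something like the fragment below. The exact wording of the allocation awareness explanation is an assumption from memory, so it would need to be confirmed against a real cluster before relying on it:

# Hypothetical regex; verify the actual decider explanation text first.
$availabilityZoneConflictRegex = "too many copies of the shard allocated to nodes with attribute \[aws_availability_zone\]";

foreach ($explanation in $explanations)
{
    if ($explanation.explanation -notmatch $versionConflictRegex -and
        $explanation.explanation -notmatch $sameNodeConflictRegex -and
        $explanation.explanation -notmatch $availabilityZoneConflictRegex)
    {
        return $false;
    }
}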

Conclusion

This is one of those cases where I knew that the initial solution that I put into place would probably not be enough over the long term, but there was just no value in trying to engineer for every single eventuality right at the start.

Also, truth be told, I didn’t know that Elasticsearch took AWS Availability Zones into account when allocating shards, so that was a bit of a surprise anyway.

Thinking about the actual deployment process some more, it might be easier to scale up, wait for a rebalance and then scale down again, terminating the oldest (and thus earlier version) nodes after all the new ones have already come online. The downside to this approach is mostly just time (because you have to wait for 2N rebalances instead of just N, where N is the number of nodes), but it feels like it might be more robust in the face of unexpected weirdness.

Which, from my experience, I should probably just start expecting from now on, as it (ironically) seems like the one constant in software.


We’ve had full control over our Elasticsearch field mappings for a while now, but to be honest, once we got the initial round of mappings out of the way, we haven’t really had to deal with too many deployments. Sure, they happen every now and then, but it’s not something we do every single day.

In the intervening time period, we’ve increased the total amount of data that we store in the ELK stack, which has had an unfortunate side effect when it comes to the deployment of more mappings.

It tends to take the Elasticsearch cluster down.

Is The Highway To Hell Paved With Good Intentions?

When we originally upgraded the ELK stack and introduced the deployment of the Elasticsearch configuration it was pretty straightforward. The deployment consisted of only those settings related to the cluster/node, and said settings wouldn’t be applied until the node was restarted anyway, so it was an easy decision to just force a node restart whenever the configuration was deployed.

When deploying to multiple nodes, the first attempt at orchestration just deployed sequentially, one node at a time, restarting each one and then waiting for the node to rejoin the cluster before moving on.

This approach….kind of worked.

It was good for when a fresh node was joining the cluster (i.e. after an auto scale) or when applying simple configuration changes (like log settings), but tended to fall apart when applying major configuration changes or when the cluster did not already exist (i.e. initial creation).

The easy solution was to just deploy to all nodes at the same time, which pretty much guaranteed a small downtime while the cluster reassembled itself.

Considering that we weren’t planning on deploying core configuration changes all that often, this seemed like a decent enough compromise.

Then we went and included index templates and field mappings into the configuration deployment.

Each time we deployed a new field mapping the cluster would go down for a few moments, but would usually come good again shortly after. Of course, this was when we still only had a week’s worth of data in the cluster, so it was pretty easy for it to crunch through all of the indexes and shards and sort them out when it came back online.

Now we have a little over a month’s worth of data, and every time the cluster goes down it takes a fair while to come back.

That’s real downtime for no real benefit, because most of the time we’re just deploying field mappings, which can actually just be updated using the HTTP API, no restart required.
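
As a rough illustration of what that looks like (the template name, type and field below are made up, and the cluster URL variable is an assumption; the real templates live in our configuration package), pushing an updated index template is a single call:

# Hypothetical example; substitute the real template name and mappings.
$template = @{
    template = "logstash-*";
    mappings = @{
        logs = @{
            properties = @{
                Environment = @{ type = "keyword" }
            }
        }
    }
} | ConvertTo-Json -Depth 10;

Invoke-RestMethod -Method PUT -Uri "$elasticsearchUrl/_template/logstash" -Body $template -ContentType "application/json";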

Dirty Deployments, Done Dirt Cheap

This situation could have easily been an opportunity to shift the field mappings into a deployment of their own, but I still had the same problem as I did the first time I had to make this decision – what’s the hook for the deployment when spinning up a new environment?

In retrospect the answer is probably “the environment is up and has passed its smoke test”, but that didn’t occur to me until later, so we went in a different direction.

What if we didn’t always have to restart the node on a configuration deployment?

We really only deploy three files that could potentially require a node restart:

  • The core Elasticsearch configuration file (/etc/elasticsearch/elasticsearch.yml)
  • The JVM options file (/etc/elasticsearch/jvm.options)
  • The log4j2 configuration file (/etc/elasticsearch/log4j2.properties)

If none of those files have changed, then we really don’t need to do a node restart, which means we can just move ahead with the deployment of the field mappings.

No fuss, no muss.

Linux is pretty sweet in this regard (well, at least the Amazon Linux baseline is) in that it provides a diff command that can be used to easily compare two files.

It was a relatively simple matter to augment the deployment script with some additional logic, like below:

… more script up here

 

temporary_jvm_options="/tmp/elk-elasticsearch/jvm.options"
destination_jvm_options="/etc/elasticsearch/jvm.options"

echo "Mutating temporary jvm.options file [$temporary_jvm_options] to contain appropriate memory allocation and to fix line endings"
es_memory=$(free -m | awk '/^Mem:/ { print int($2/2) }') || exit 1
sed -i "s/@@ES_MEMORY@@/$es_memory/" $temporary_jvm_options || exit 1
sed -i 's/\r//' $temporary_jvm_options || exit 1
sed -i '1 s/^\xef\xbb\xbf//' $temporary_jvm_options || exit 1

echo "Diffing $temporary_jvm_options and $destination_jvm_options"
diff --brief $temporary_jvm_options $destination_jvm_options
jvm_options_changed=$?

 

… more script down here

 

if [[ $jvm_options_changed == 1 || $configuration_file_changed == 1 || $log_config_file_changed == 1 ]]; then
    echo "Configuration change detected, will now restart elasticsearch service."
    sudo service elasticsearch restart || exit 1
else
    echo "No configuration change detected, elasticsearch service will not be restarted."
fi

No more unnecessary node restarts, no more unnecessary downtime.

Conclusion

This is another one of those cases that seems incredibly obvious in retrospect, but I suppose everything does. It’s still almost always better to go with the naive solution at first, and then improve, rather than try to deal with everything up front. It’s better to focus on making something easy to adapt in the face of unexpected issues than to try and engineer some perfect unicorn.

Regardless, with no more risk that the Elasticsearch cluster will go down for an unspecified amount of time whenever we just deploy field mapping updates, we can add new fields with impunity (it helps that we figured out how to reindex).

Of course, as with anything, there are still issues with the deployment logic:

  • Linking the index template/field mapping deployment to the core Elasticsearch configuration was almost certainly a terrible mistake, so we’ll probably have to deal with that eventually.
  • The fact that a configuration deployment can still result in the cluster going down is not great, but to be honest, I can’t really think of a better way either. You could deploy the configuration to the master nodes first, but that leaves you in a tricky spot if it fails (or if the configuration is a deep enough change to completely rename or otherwise move the cluster). You might be able to improve the logic to differentiate between “first time” and “additional node”, but you still have the problem of dealing with major configuration changes. It’s all very complicated, and honestly we don’t really do configuration deployments enough to spend time solving that particular problem.
  • The index template/field mapping deployment technically occurs once on every node, simultaneously. For something that can be accomplished by a single HTTP call, this is pretty wasteful (though it doesn’t have any obvious negative side effects).

There’s always room for improvement.


In last week’s post I explained how we would have liked to do updates to our Elasticsearch environment using CloudFormation. Reality disagreed with that approach, and we encountered timing problems as a result of the ES cluster and CloudFormation not talking with one another during the update.

Of course, that means that we need to come up with something ourselves to accomplish the same result.

Move In, Now Move Out

Obviously the first thing we have to do is turn off the Update Policy for the Auto Scaling Groups containing the master and data nodes. With that out of the way, we can safely rely on CloudFormation to update the rest of the environment (including the Launch Configuration describing the EC2 instances that make up the cluster), safe in the knowledge that CloudFormation is ready to create new nodes, but will not until we take some sort of action.

At that point it’s just a matter of controlled node replacement using the auto healing capabilities of the cluster.

If you terminate one of the nodes directly, the AWS Auto Scaling Group will react by creating a replacement EC2 instance, and it will use the latest Launch Configuration for this purpose. When that instance starts up it will get some configuration deployed to it by Octopus Deploy, and shortly afterwards will join the cluster. With a new node in play, the cluster will react accordingly and rebalance, moving shards and replicas to the new node as necessary until everything is balanced and green.

This sort of approach can be written in just about any scripting language; our poison of choice is Powershell, which was then embedded inside the environment nuget package to be executed whenever an update occurs.

I’d copy the script here, but it’s pretty long and verbose, so here is the high level algorithm instead (with a rough sketch of the replacement loop after the list):

  • Iterate through the master nodes in the cluster
    • Check the version tag of the EC2 instance behind the node
    • If equal to the current version, move on to the next node
    • If not equal to the current version
      • Get the list of current nodes in the cluster
      • Terminate the current master node
      • Wait for the cluster to report that the old node is gone
      • Wait for the cluster to report that the new node exists
  • Iterate through the data nodes in the cluster
    • Check the version tag of the EC2 instance behind the node
    • If equal to the current version, move on to the next node
    • If not equal to the current version
      • Get the list of current nodes in the cluster
      • Terminate the current data node
      • Wait for the cluster to report that the old node is gone
      • Wait for the cluster to report that the new node exists
      • Wait for the cluster to go yellow (indicating rebalancing is occurring)
      • Wait for the cluster to go green (indicating rebalancing is complete). This can take a while, depending on the amount of data in the cluster
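
To give a feel for it, here is a heavily condensed sketch of the data node part of that loop. The Get-ClusterDataNodes, Get-ClusterNodeNames and Wait-For* helpers, the Version tag name and the surrounding variables are stand-ins for the real script, not the script itself:

foreach ($node in Get-ClusterDataNodes -ElasticsearchUrl $elasticsearchUrl)
{
    # The version tag on the EC2 instance tells us whether it was built from the latest Launch Configuration.
    $instance = (Get-EC2Instance -InstanceId $node.InstanceId).Instances[0];
    $version = ($instance.Tags | Where-Object { $_.Key -eq "Version" }).Value;

    if ($version -eq $currentVersion) { continue; }

    $existingNodes = Get-ClusterNodeNames -ElasticsearchUrl $elasticsearchUrl;

    # Terminate the old node; the Auto Scaling Group reacts by creating a replacement
    # from the latest Launch Configuration.
    Remove-EC2Instance -InstanceId $node.InstanceId -Force | Out-Null;

    Wait-ForNodeToLeaveCluster -ElasticsearchUrl $elasticsearchUrl -NodeName $node.Name;
    Wait-ForNewNodeToJoinCluster -ElasticsearchUrl $elasticsearchUrl -ExistingNodes $existingNodes;
    Wait-ForClusterStatus -ElasticsearchUrl $elasticsearchUrl -Status "green";
}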

As you can see, there isn’t really all that much to the algorithm, and the hardest part of the whole thing is knowing that you should wait for the node to leave/join the cluster and for the cluster to rebalance before moving on to the next replacement.

If you don’t do that, you risk destroying the cluster by taking away too many of its parts before its ready (which was exactly the problem with leaving the update to CloudFormation).

Hands Up, Now Hands Down

For us, the most common reason to run an update on the ELK environment is when there is a new version of Elasticsearch available. Sure we run updates to fix bugs and tweak things, but those are generally pretty rare (and will get rarer as time goes on and the stack gets more stable).

As a general rule of thumb, assuming you don’t try to jump too far all at once, new versions of Elasticsearch are pretty easily integrated.

In fact, you can usually have nodes in your cluster at the new version while there are still active nodes on the old version, which is nice.

There are at least two caveats that I’m aware of though:

  • The latest version of Kibana generally doesn’t work when you point it towards a mixed cluster. It requires that all nodes are running the same version.
  • If new indexes are created in a mixed cluster, and the primary shards for that index live on a node with the latest version, nodes with the old version cannot be assigned replicas

The first one isn’t too problematic. As long as we do the upgrade overnight (unattended), no-one will notice that Kibana is down for a little while.

The second one is a problem though, especially for our production cluster.

We use hourly indexes for Logstash, so a new index is generally created every hour or so. Unfortunately it takes longer than an hour for the cluster to rebalance after a node is replaced.

This means that the cluster is almost guaranteed to be stuck in the yellow status (indicating unassigned shards, in this case the replicas from the new index that cannot be assigned to the old node), which means that our whole process of “wait for green before continuing” is not going to work properly when we do a version upgrade on the environment that actually matters, production.

Lucky for us, the API for Elasticsearch is pretty amazing, and allows you to get all of the unassigned shards, along with the reason why they were unassigned.

What this means is that we can keep our process the same, and when the “wait for green” part of the algorithm times out, we can check to see whether or not the remaining unassigned shards are just version conflicts, and if they are, just move on.

Works like a charm.

Tell Me What You’re Gonna Do Now

The last thing that we need to take into account during an upgrade is related to Octopus Tentacles.

Each Elasticsearch node that is created by the Auto Scaling Group registers itself as a Tentacle so that it can have the Elasticsearch  configuration deployed to it after coming online.

With us terminating nodes constantly during the upgrade, we generate a decent number of dead Tentacles in Octopus Deploy, which is not a situation you want to be in.

The latest versions (3+ I think) of Octopus Deploy allow you to automatically remove dead tentacles whenever a deployment occurs, but I’m still not sure how comfortable I am with that piece of functionality. It seems like if your Tentacle is dead for a bad reason (i.e. it’s still there, but broken) then you probably don’t want to just clean it up and keep on chugging along.

At this point I would rather clean up the Tentacles that I know to be dead because of my actions.

As a result of this, one of the outputs from the upgrade process is a list of the EC2 instances that were terminated. We can easily use the instance name to lookup the Tentacle in Octopus Deploy, and remove it.
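
The cleanup itself is simple enough via the Octopus HTTP API; a rough sketch follows, where the Octopus URL, API key and list of terminated instance names are stand-ins for outputs from the actual upgrade process:

$headers = @{ "X-Octopus-ApiKey" = $octopusApiKey };
$machines = Invoke-RestMethod -Method GET -Uri "$octopusUrl/api/machines/all" -Headers $headers;

foreach ($name in $terminatedInstanceNames)
{
    $machine = $machines | Where-Object { $_.Name -eq $name };
    if ($machine -ne $null)
    {
        Write-Verbose "Removing dead Tentacle [$($machine.Name)] ($($machine.Id)) from Octopus Deploy";
        Invoke-RestMethod -Method DELETE -Uri "$octopusUrl/api/machines/$($machine.Id)" -Headers $headers | Out-Null;
    }
}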

Conclusion

What we’re left with at the end of this whole adventure is a fully automated process that allows us to easily deploy changes to our ELK environment and be confident that not only have all of the AWS components updated as we expect them to, but that Elasticsearch has been upgraded as well.

Essentially exactly what we would have had if the CloudFormation update policy had worked the way that I initially expected it to.

Speaking of which, it would be nice if AWS gave a little bit more control over that update policy (like timing, or waiting for a signal from a system component before moving on), but you can’t win them all.

Honestly, I wouldn’t be surprised if there was a way to override the whole thing with a custom behaviour, or maybe a custom CloudFormation resource or something, but I wouldn’t even know where to start with that.

I’ve probably run the update process around 10 times at this point, and while I usually discover something each time I run it, each tweak makes it more and more stable.

The real test will be what happens when Elastic.co releases version 6 of Elasticsearch and I try to upgrade.

I foresee explosions.


It’s been a little while since I made a post on Elasticsearch. Time to remedy that.

Our log stack has been humming along relatively well ever since we took control of it. It’s not perfect, but it’s much better than it was.

One of the nicest side effects of the restructure has been the capability to test our changes in the CI/Staging environments before pushing them into Production. It’s saved us from a few boneheaded mistakes already (mostly just ES configuration blunders), which has been great to see. It does make pushing things into the environments we actually care about a little bit slower than it otherwise would be, but I’m willing to make that tradeoff for a bit more safety.

When I was putting together the process for deploying our log stack (via Nuget, Powershell and Octopus Deploy), I tried to keep in mind what it would be like when I needed to deploy an Elasticsearch version upgrade. To be honest, I thought I had a pretty good handle on it:

  • Make an AMI with the new version of Elasticsearch on it
  • Change the environment definition to reference this new AMI instead of the old one
  • Deploy the updated package, leveraging the Auto Scaling Group instance replacement functionality
  • Dance like no-one is watching

The dancing part worked perfectly. I am a graceful swan.

The rest? Not so much.

Rollin’, Rollin’

I think the core issue was that I had a little bit too much faith in Elasticsearch to react quickly and robustly in the face of random nodes dying and being replaced.

Don’t get me wrong, it’s pretty amazing at what it does, but there are definitely situations where it is understandably incapable of adjusting and balancing itself.

Case in point, the process that occurs when an AWS Auto Scaling Group starts doing a rolling update because the definition of its EC2 instance launch configuration has changed.

When you use CloudFormation to initialize an Auto Scaling Group, you define the instances inside that group through a configuration structure called a Launch Configuration. This structure contains the definition of your EC2 instances, including the base AMI, security groups, tags and other meta information, along with any initialization that needs to be performed on startup (user data, CFN init, etc).

Inside the Auto Scaling Group definition in the template, you decide what should be the appropriate reaction upon detecting changes to the launch configuration, which mostly amounts to a choice between “do nothing” or “start replacing the instances in a sane way”. That second option is referred to as a “rolling update”, and you can specify a policy in the template for how you would like it to occur.

For our environment, a new ES version means a new AMI, so theoretically, it should be a simple matter to update the Launch Configuration with the new AMI and push out an update, relying on the Auto Scaling Group to replace the old nodes with the new ones, and relying on Elasticsearch to rebalance and adjust as appropriate.

Not that simple unfortunately, as I learned when I tried to apply it to the ES Master and Data ASGs in our ELK template.

Whenever changes were detected, CloudFormation would spin up a new node, wait for it to complete its initialization (which was just machine up + octopus tentacle registration), then it would terminate an old node and rinse and repeat until everything was replaced. This happened for both the master nodes and data nodes at the same time (two different Auto Scaling Groups).

Elasticsearch didn’t stand a chance.

With no feedback loop between ES and CloudFormation, there was no way for ES to tell CloudFormation to wait until it had rebalanced the cluster, replicated the shards and generally recovered from the traumatic event of having a piece of itself ripped out and thrown away.

The end result? Pretty much every scrap of data in the environment disappeared.

Good thing it was a scratch environment.

Rollin’, Rollin’

Sticking with the whole “we should probably leverage CloudFormation” approach, I implemented a script to wait for the node to join the cluster and for the cluster to be green (bash scripting is fun!). The intent was that this script would be present in the baseline ES AMI, would be executed as part of the user data during EC2 instance initialization, and would essentially force the auto scaling process to wait for Elasticsearch to actually be functional before moving on.

This wreaked havoc with the initial environment creation though, as the cluster isn’t even valid until enough master nodes exist to elect a master (which is 3 in our case), so while it kind of worked for the updates, initial creation was broken.

Not only that, but in a cluster with a decent amount of data, the whole “wait for green” thing takes longer than the maximum time allowed for CloudFormation Auto Scaling Group EC2 instance replacements, which would cause the auto scaling to time out and the entire stack to fail.

So we couldn’t use CloudFormation directly.

The downside of that is that CloudFormation is really good at detecting changes and determining if it actually has to do anything, so not only did we need to find another way to update our nodes, we needed to find a mechanism that would safely know when that node update should be applied.

To Be Continued

That’s enough Elasticsearch for now I think, so next time I’ll continue with the approach we actually settled on.