Where Have All The Good Shards Gone

March 20. 2018 0 Comments

A wild technical post appears!

This weeks post returns to a topic very close to my heart, the Elasticsearch, Logstash and Kibana (ELK) Stack that we use for log aggregation. As you might be able to tell from my post history, logging, metrics and business intelligence rank pretty high on my list of priorities, regardless of any other focuses I might have. To me, if you don’t have good intelligence, you might as well be fighting in the dark, flailing about in the hopes that you hit something important.

This post specifically is about the process by which we deploy new versions of Elasticsearch, and an issue that can occur when you do rolling deployments and the Elasticsearch cluster is hosted in AWS.

Version Control

Way back in August 2017 I wrote about automating the deployment of new Elasticsearch versions to our ELK stack.

Long story short, the part of that post that it relevant to this one is the bit about unassigned shards in Elasticsearch when rebalancing after a version upgrade. Specifically, if you have nodes that are at a later version of Elasticsearch than others (which is normal when doing a rolling deployment), and the later version node is elected to hold the primary shard, replicas cannot be assigned to any of the nodes with the lower version.

This is troublesome if you’re waiting for a cluster to go green before progressing to the next node replacement, because unassigned shards equal a yellow cluster. You’ll be waiting forever (or you’ll hit your timeout because you were smart enough to put a timeout in, right?).

Without some additional change, the system will never reach a state of equilibrium.

La La La I Can’t Hear You

To extrapolate on the content of the initial post, the solution was to check that all remaining unassigned shards were version conflicts whenever an appropriate end state is reached. An end state would be something like a timeout waiting for the cluster to go green, or maybe something fancier like “number of unassigned shards has not changed over a period of time.

If the only unassigned shards left are version conflicts, its relatively safe to just continue on with the process and let Elasticsearch sort it out (which it will once all of the nodes are replaced). There is minimal risk of data loss (the primary shards are all guaranteed to exist in order for this problem to happen anyway), and each time a new node comes online, the cluster will rebalance into a better state anyway.

The script for checking for version conflicts is below:

function Get-UnassignedShards
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$elasticsearchUrl
    )

    $shards = Invoke-RestMethod -Method GET -Uri "$elasticsearchUrl/_cat/shards" -Headers @{"accept"="application/json"} -Verbose:$false;
    $unassigned = $shards | Where-Object { $_.state -eq "UNASSIGNED" };

    return $unassigned;
}

function Test-AllUnassignedShardsAreVersionConflicts
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$elasticsearchUrl
    )

    Write-Verbose "Getting all UNASSIGNED shards, to see if all of them are UNASSIGNED because of version conflicts";

    $unassigned = Get-UnassignedShards -elasticsearchUrl $elasticsearchUrl;

    foreach ($unassignedShard in $unassigned)
    {
        $primary = "true";
        if ($unassignedShard.prirep -eq "r")
        {
            $primary = "false";
        }
        $explainBody = "{ `"index`": `"$($unassignedShard.index)`", `"shard`": $($unassignedShard.shard), `"primary`": $primary }";
        Write-Verbose "Getting allocation explanation using query [$explainBody]";
        $explain = Invoke-RestMethod -Method POST -Uri "$elasticsearchUrl/_cluster/allocation/explain" -Headers @{"accept"="application/json"} -Body $explainBody -Verbose:$false;

        $versionConflictRegex = "target node version.*is older than the source node version.*";
        $sameNodeConflictRegex = "the shard cannot be allocated to the same node on which a copy of the shard already exists";
        $explanations = @();
        foreach ($node in $explain.node_allocation_decisions)
        {
            foreach ($decider in $node.deciders)
            {
                $explanations += @{Node=$node.node_name;Explanation=$decider.explanation};
            }
        }

        foreach ($explanation in $explanations)
        {
            if ($explanation.explanation -notmatch $versionConflictRegex -and $explanation.explanation -notmatch $sameNodeConflictRegex)
            {
                Write-Verbose "The node [$($explanation.Node)] in the explanation for shard [$($unassignedShard.index):$($unassignedShard.shard)] contained an allocation decider explanation that was unacceptable (i.e. not version conflict and not same node conflict). It was [$($explanation.Explanation)]";
                return $false;
            }
        }
    }

    return $true;
}

In The Zone…Or Out Of It?

This solution works really well for the specific issue that it was meant to detect, but to absolutely no-ones surprise, it doesn’t work so well for other problems.

Case in point, if your Elasticsearch cluster is AWS Availability Zone aware, then you can encounter a very similar problem to what I’ve just described, except with availability zone conflicts instead of version conflicts.

An availability zone aware Elasticsearch cluster will avoid putting shard replicas in the same availability zone as the primary (within reason), which is just another way to protect itself against losing data in the event of a catastrophic failure. I’m sure you can disable the functionality, but that seems like a relatively sane safety measure, so I’m not sure why you would.

Unfortunately, when combined with version conflicts also preventing shard allocation, you can be left in a situation where there is no appropriate place to dump a shard, so our deployment process can’t move on because the cluster never goes green.

Interestingly enough, there are two possible solutions for this:

The first is to be more careful about the order that you annihilate nodes in. Alternating availability zones is the way to go here, but this approach can get complicated if you’re also dealing with version conflicts at the same time. Also, it doesn’t really work all that well if you don’t have a full complement of nodes (with redundancy) spread about both availability zones.
The second is to just replicate the version conflict solution above, except for unassigned shards as a result of availability zone conflicts. This is by far the easier and less fiddly approach, assuming that the entire deployment finishes (so the cluster can rebalance as appropriate)

I haven’t actually updated our deployment since I discovered the issue, but my current plan is to go with the second option and see how far I get.

Conclusion

This is one of those cases where I knew that the initial solution that I put into place would probably not be enough over the long term, but there was just no value in trying to engineer for every single eventuality right at the start.

Also, truth be told, I didn’t know that Elasticsearch took AWS Availability Zones into account when allocating shards, so that was a bit of a surprise anyway.

Thinking about the actual deployment process some more, it might be easier to scale up, wait for a rebalance and then scale down again, terminating the oldest (and thus earlier version) nodes after all the new ones have already come online. The downside to this approach is mostly just time (because you have to wait for 2N rebalances, instead of just N rebalances (where N is the number of nodes), but it feels like it might be more robust in the face of unexpected weirdness.

Which, from my experience, I should probably just start expecting from now on, as it (ironically) seems like the one constant in software.