0 Comments

Choo choo goes the Elasticsearch train.

After the last few blog posts about rolling updates to our Elasticsearch environment, I thought I might as well continue with the Elasticsearch theme and do a quick post about reindexing.

Cartography

An index in Elasticsearch is kind of similar to a table in a relational database, but not really. In the same vein, index templates are kind of like schemas, and field mappings are kind of like columns.

But not really.

If you were using Elasticsearch purely for searching through some set of data, you might create an index and then add some mappings to it manually. For example, if you wanted to make all of the addresses in your system searchable, you might create fields for street, number, state, postcode and other common address elements, and maybe another field for the full address combined (like 111 None St, Brisbane, QLD, 4000 or something), to give you good coverage over the various sort of searches that might be requested.

Then you jam a bunch of documents into that index, each one representing a different address that needs to be searchable.

Over time, you might discover that you could really use a field to represent the unit or apartment number, to help narrow down those annoying queries that involve a unit complex or something.

Well, with Elasticsearch you can add a new field to the index, in a similar way to how you add a new column to a table in a relational database.

Except again, not really.

You can definitely add a new field mapping, but it will only work for documents added to the index after you’ve applied the change. You can’t make that new mapping retroactive. That is to say, you can’t magically make it apply to every document that was already in the index when you created the new mapping.

When it comes to your stock standard ELK stack, your data indexes are generally time based and generated from an index template, which adds another layer of complexity. If you want to change the mappings, you typically just change the template and then wait for some time period to rollover.

This leaves you in an unfortunate place for historical data, especially if you’ve been conservative with your field mappings.

Or does it?

Dexterous Storage

In both of the cases above (the manually created and maintained index, the swarm of indexes created automatically via a template) its easy enough to add new field mappings and have them take effect moving forward.

The hard part is always the data that already exists.

That’s where reindexing comes in.

Conceptually, reindexing is taking all of the documents that are already in an index and moving them to another index, where the new index has all the field mappings you want in it. In moving the raw documents like that, Elasticsearch will redo everything that it needs to do in order to analyse and breakdown the data into the appropriate fields, exactly like the first time the document was seen.

For older versions of Elasticsearch, the actual document migration had to be done with an external tool or script, but the latest versions (we use 5.5.1) have a reindex endpoint on the API, which is a lot simpler to use.

curl -XPUT "{elasticsearch_url}/{new_index}?pretty" -H "accept:application/json"
curl -XPOST "{elasticsearch_url}/_reindex?pretty" H "content-type:application/json" -H "accept:application/json" -d "{ "source": { "index": "{old_index}" }, "dest": { "index": "{new_index}", "version_type": "external" } }"

It doesn’t have to be a brand new index (there are options for how to handle documents that conflict if you’re reindexing into an index that already has data in it), but I imagine that a new index is the most common usage.

The useful side effect of this, is that in requiring a different index, the old one is left intact and unchanged. Its then completely up to you how to use both the new and old indexes, the most common operation being to delete the old one when you’re happy with how and where the new one is being used.

Seamless Replacement

We’ve changed our field mappings in our ELK stack over time, so while the most recent indexes do what we want them to, the old indexes have valuable historical data sitting around that we can’t really query or aggregate on.

The naive implementation is just to iterate through all the indexes we want to reindex (maybe using a regex or something to identify them), create a brand new index with a suffix (like logstash-2017.08.21-r) and then run the reindex operation via the Elasticsearch API, similar to the example above.

That leaves us with two indexes with the same data in them, which is less than ideal, especially considering that Kibana will quite happily query both indexes when you ask for data for a time period, so we can’t really leave the old one around or we’ll run into issues with duplicate data.

So we probably want to delete the old index once we’re finished reindexing into the new one.

But how do we know that we’re finished?

The default mode for the reindex operation is to wait for completion before returning a response from the API, which is handy, because that is exactly what we want.

The only other thing we needed to consider is that after a reindex, all of the indexes will have a suffix of –r, and our Curator configuration wouldn’t pick them up without some changes. In the interest of minimising the amount of things we had to touch just to reindex, we decided to do the reindex again from the temporary index back into an index named the same as the one we started with, deleting the temporary index once that second operation was done.

When you do things right, people wont be sure you’ve done anything at all.

Danger Will Robinson

Of course, the first time I ran the script (iterate through indexes, reindex to temporary index, delete source, reindex back, delete temp) on a real Elasticsearch cluster I lost a bunch of documents.

Good thing we have a staging environment specifically for this sort of thing.

I’m still not entirely sure what happened, but I think it had something to do with the eventually consistent nature of Elasticsearch, the fact we connect to the data nodes via an AWS ELB and the reindex being “complete” according to the API but not necessarily synced across all nodes, so the deletion of the source index threw a massive spanner in the works.

Long story short, I switched the script to start the reindex asynchronously and then poll the destination index until it returned the same number of documents as the source. As a bonus, this fixed another problem I had with the HTTP request for the reindex timing out on large indexes, which was nice.

The only downside of this is that we can’t reindex an index that is currently being written to (because the document counts will definitely change over the period of time the reindex occurs), but I didn’t want to do that anyway.

Conclusion

I’ve uploaded the full script to Github. Looking at it now, its a bit more complicated than you would expect, even taking into account the content of this post, but as far as I can tell, its pretty robust.

All told, I probably spent a bit longer on this than I should have, especially taking into account that its not something we do every day.

The flip side of that is that its extremely useful to know that old data is not just useless when we update our field mappings, which is nice.