The log stack is mostly under control now:

  • We have appropriate build and deployment pipelines in place
  • Everything is hosted in our AWS accounts
  • We can freely deploy both environments and configuration for each layer in the stack (Broker, Cache, Indexer, Storage)
  • We’ve got Cerebro in place to help us visualize Elasticsearch

We’re in a pretty good place to be honest.

Still, there are a few areas I would like to improve before I run away screaming, especially while we still have the old stack up and running in parallel (while we smooth out any kinks in the new stack).

One area in particular that needs some love is the way in which we store data in Elasticsearch. Up until now, we’ve mostly just left Elasticsearch to its own devices. Sure, it had a little help from Logstash, but it was mostly up to Elasticsearch what it did with incoming data. We did have a simple index template in place, but it was pretty vanilla. Basically we just left dynamic field mapping on and didn’t really think about it a whole lot.

The problem with this approach is that Elasticsearch is technically doing a whole bunch of work (and storing a whole bunch of data) for no real benefit to us, as we have thousands of fields, but only really search/aggregate on a few hundred at best. Our daily logstash indexes contain quite a lot of documents (around 40 million) and generally tally up at around 35GB each. Depending on how much of that 35GB of storage belongs to indexed fields that have no value, there might be considerable savings in reining the whole process in.
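If you want to see where numbers like that come from, the _cat APIs give a quick read on document counts and on-disk size per index. A minimal sketch, assuming you can reach a node directly on localhost:9200:

# List the daily logstash indexes along with their document counts and on-disk size
curl 'http://localhost:9200/_cat/indices/logstash-*?v&h=index,docs.count,store.size'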

That doesn’t even take into account the cognitive load in having to deal with a large number of fields whenever you’re doing analysis, or the problems we’ve had with Kibana and refreshing our index mappings when there are many fields.

It was time to shackle the beast.

Anatomy Of A Template

Index templates are relatively simple constructs, assuming you understand some of the basic concepts behind indexes, types and field mappings in Elasticsearch. You could almost consider them to be schemas, but that is not strictly accurate, because you can change a schema, but you can’t really change an index once it’s been created. They really are templates in that sense, because they only apply when a new index is created.

Basically, a template is a combination of index settings (like replicas, shards, field limits, etc), types (which are collections of fields), and field mappings (i.e. Event.Name should be treated as text, and analysed up to the first 256 characters). They are applied to new indexes based on a pattern that matches against the new index’s name. For example, if I had a template that I wanted to apply to all logstash indexes (which are named logstash-YYYY.MM.DD), I would give it a pattern of logstash-*.

For a more concrete example, here is an excerpt from our current logstash index template:

{
  "order": 0,
  "template": "logstash-*",
  "settings": {
    "index": {
      "refresh_interval": "5s",
      "number_of_shards": "3",
      "number_of_replicas": "2",
      "mapper.dynamic": false
    }
  },
  "mappings": {
    "logs": {
      "dynamic" : false,
      "_all": {
        "omit_norms": true,
        "enabled": false
      },
      "properties": {
        "@timestamp": {
          "type": "date",
          "doc_values": true,
          "index": true
        },
        "@version": {
          "type": "keyword",
          "index": false,
          "doc_values": true
        },
        "message" : {
          "type" : "text",
          "index": false,
          "fielddata": false
        },
        "Severity" : {
          "type": "keyword",
          "index": true,
          "doc_values": true
        },
        "TimeTaken" : {
          "type": "integer",
          "index": true,
          "doc_values": true
        }
      }
    }
  },
  "aliases": {}
}

Templates can be managed directly from the Elasticsearch HTTP API via the /_template/{template-name} endpoint.
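For example, assuming a node listening on localhost:9200 (like the one the deployment script later in this post talks to) and the template JSON saved locally as logstash.json, the template can be pushed, inspected and removed with plain curl:

# Upload (or overwrite) the template
curl -XPUT http://localhost:9200/_template/logstash --data "@logstash.json"

# Check what the cluster actually has
curl http://localhost:9200/_template/logstash?pretty

# Remove it entirely
curl -XDELETE http://localhost:9200/_template/logstash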

By default, the mappings.{type}.dynamic field is set to true when creating an index. This means that based on the raw data encountered, Elasticsearch will attempt to infer an appropriate type for the field (i.e. if it sees numbers, it’s probably going to make it a long or something). To be honest, Elasticsearch is pretty good at this, assuming your raw data doesn’t contain fields that sometimes look like numbers and sometimes look like text.

Unfortunately, ours does, so we can sometimes get stuck in a bad situation where Elasticsearch will infer a field as a number, and every subsequent document with text in that field will fail to index. This is a mapping conflict, and is a massive pain, because you can’t change a field mapping. You have to delete the index, or make a new index and migrate the data across. In the case of logstash, because you have time based indexes, you can also just wait it out.
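If you do end up going down the “new index and migrate” path, the _reindex API can handle the copy once you’ve created a replacement index with the corrected mapping. A rough sketch, with made-up index names:

# Hypothetical example: copy everything from a conflicted daily index into a
# replacement index that was created up front with the corrected field mapping
curl -XPOST http://localhost:9200/_reindex --data '{
  "source": { "index": "logstash-2017.01.01" },
  "dest": { "index": "logstash-2017.01.01-fixed" }
}'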

This sort of thing can be solved by leaving dynamic mapping on, but specifying the type of the troublesome fields in the template.
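In template terms, that just means leaving dynamic set to true but declaring the troublesome fields explicitly. A cut-down sketch (ResponseCode is a made-up example field):

{
  "template": "logstash-*",
  "mappings": {
    "logs": {
      "dynamic": true,
      "properties": {
        "ResponseCode": {
          "type": "keyword",
          "index": true,
          "doc_values": true
        }
      }
    }
  }
}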

The other downside of dynamic mapping is that it indexes fields that you really don’t need indexed, which takes up space for no benefit. Turning that off is actually pretty tricky though, because if you don’t index a field in some way, it’s still stored, but you can’t search or aggregate on it without creating a new index and adding an appropriate field mapping. I don’t know about you, but I don’t always know exactly what I want to search/aggregate on before the situation arises, so it’s a dangerous optimization to make.

This is especially true for log events, which are basically invisible up to the point where you have to debug some arcane thing that happened to some poor bastard.

I’m currently experimenting with leaving dynamic mapping off until I get a handle on some of the data coming into our stack, but I imagine that it will probably be turned back on before I’m done, sitting alongside a bunch of pre-defined field mappings for consistency.

Template Unleashed

With a template defined (like the example above), all that was left was to create a deployment pipeline.

There were two paths I could have gone down.

The first was to have a package specifically for the index template, with its own Octopus project and a small amount of logic that used the Elasticsearch HTTP API to push the template into the stack.

The second was to incorporate templates into the Logging.ELK.Elasticsearch.Config package/deployment, which was the package that dealt with the Elasticsearch configuration (i.e. master vs data nodes, EC2 discovery, ES logging, etc).

In the end I went with the second option, because I could not find an appropriate trigger to bind the first deployment to. Do you deploy when a node comes online? The URL might not be valid then, so you’d probably have to use the raw IP. That would mean exposing those instances outside of their ELB, which wasn’t something I wanted to do.

It was just easier to add some logic to the existing configuration deployment to deploy templates after the basic configuration completes.

# Wait for a few moments for Elasticsearch to become available
attempts=0
maxAttempts=20
waitSeconds=15
until curl --output /dev/null --silent --head --fail http://localhost:9200; do
    if [[ $attempts -ge $maxAttempts ]]; then
        echo "Elasticsearch was not available after ($attempts) attempts, waiting ($waitSeconds) seconds between each one"
        exit 1
    fi
    attempts=$(($attempts + 1))
    echo "Waiting ($waitSeconds) seconds to see if Elasticsearch will become available"
    sleep $waitSeconds
done

# Push the index template, capturing the HTTP status code so we can detect failure
template_upload_status=$(curl -XPUT --data "@/tmp/elk-elasticsearch/templates/logstash.json" -o /tmp/elk-elasticsearch/logstash-template-upload-result.json -w '%{http_code}' http://localhost:9200/_template/logstash)
if [[ $template_upload_status -ne 200 ]]; then
    echo "Template upload failed"
    cat /tmp/elk-elasticsearch/logstash-template-upload-result.json
    exit 1
fi

A little bit more complicated than I would have liked, but it needs to wait for Elasticsearch to come online (and for the cluster to go green) before it can do anything, and the previous steps in this script actually restart the node (to apply configuration changes), so it’s necessary.
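The script above only checks that the node is answering HTTP. If you also wanted to hold off until the cluster actually reports green, the cluster health API can do the waiting for you. A minimal sketch, assuming the same localhost:9200 node:

# Sketch only: block until the cluster reports green, or give up after 60 seconds
cluster_health_status=$(curl --output /dev/null --silent -w '%{http_code}' "http://localhost:9200/_cluster/health?wait_for_status=green&timeout=60s")
if [[ $cluster_health_status -ne 200 ]]; then
    echo "Cluster did not reach green within the timeout"
    exit 1
fi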

Conclusion

I’m hopeful that a little bit of template/field management will give us some big benefits in terms of the number of fields we need to deal with and how much storage our indexes consume. Sure, we could always manage the template manually (usually via Kopf/Cerebro), but it feels a lot better to have it controlled and tracked in source control and embedded into our pipeline.

As I mentioned earlier, I still haven’t quite decided how to handle things in the long run, i.e. the decision between all manual mappings or some manual and the rest dynamic. It gets a bit complicated because index templates only apply once for each index (at creation), so if you want to put some data in you need to either anticipate what it looks like ahead of time, or wait until the next index rolls around. I’ve got our logstash indexes running hourly (instead of daily), which helps, but I think it causes performance problems of its own, so it’s a challenging situation.

The other thing to consider is that managing thousands of fields in that template file sounds like it’s going to be a maintenance nightmare. Even a few hundred would be pretty brutal, so I’m wary of trying to control absolutely all of the things.

Taking a step back, it might actually be more useful to just remove those fields from the log events inside the Indexer layer, so Elasticsearch never even knows they exist.

Of course, you have to know what they are before you can apply this sort of approach anyway, so we’re back to where we started.