
That thing that I complained about with Logstash and HTTP outputs a few weeks ago? Turns out it was actually a bug after all!

Not only that, but it was totally a race condition relating to the way the success/failure handlers of the Manticore HTTP library were being used, which was what I guessed, so go me. I guess even a broken clock is right twice a day.

Long story short, don’t fire off an asynchronous request until you’ve got your asynchronous handlers in place to handle the result.
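To illustrate the principle with something closer to home, here's a rough C# analogy using the old event-based WebClient API (this is obviously not the actual plugin code, which is Ruby, and the URL is made up):

// Attach the completion handler first...
var client = new WebClient();

client.DownloadStringCompleted += (sender, args) =>
{
    // Safe to rely on this running, because it was registered before the call started.
    Console.WriteLine(args.Result);
};

// ...then kick off the asynchronous operation. Do it the other way around and a
// fast response can complete before anyone is listening, and the result is lost.
client.DownloadStringAsync(new Uri("http://localhost:9200/"));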

In their typical awesome style, the maintainers of the HTTP output plugin have already fixed the issue and released a new version (4.3.2) for the masses to consume.

Unfortunately, I don’t think this new version has been rolled into an official Logstash release yet, so if I want to use it I’ll have to make a custom redistributable package to use in Octopus Deploy. Not a huge issue, just a small pain. I’m sure it will be rolled into the official package soon.

Other than that piece of news, I don’t really have anything else, so I’m going to take this opportunity to muse on some of my experiences with asynchronous code.

Hellish Code

Depending on how it’s written, asynchronous code can be hard to reason about, and even harder to test.

Vanilla Node is a good example of bad asynchronous code. It’s incredibly easy to get into callback hell, where this function calls into that function, which has a handler that calls yet another function, which does some other thing and so on. It’s even worse when you realise that the easiest way to accumulate state between so many asynchronous functions is to close over variables that are in scope when each function is created, which leaves you in a position where it’s hard to break things apart into nice, self-contained, easily testable components.

*shudder*

Encapsulating asynchronous operations into some sort of structure is a good counter to callback hell, and once we started using Promises in the ELB logs processor Javascript, I found the quality of the code increased rapidly. It was still Javascript, but at least you could reason about it.

Continuing on with a language that I actually like, C#, you’ve got a similar concept at play with tasks and the TPL (Task Parallel Library).

Before the TPL, you would probably just use raw threads to offload work (or maybe dedicated asynchronous functions or classes like BackgroundWorker), and you could easily get into the same sort of problems that you get into in vanilla Node. I mean, you had the benefit of working with a nice, structured language, but it was still difficult and prone to error.

The TPL introduced encapsulations around asynchronous operations called Tasks, which helped write asynchronous code that actually made sense.

It did something else that was incredibly helpful as well.

It exposed the Task schedulers.

All The Control

Encapsulating operations as Tasks is all well and good, and in most cases you’re probably looking to run those operations asynchronously on another thread (or a pool of threads perhaps). I mean, that’s literally exactly what the default TaskScheduler does in C# (the ThreadPoolTaskScheduler).

Of course, that makes testing hard, because now your main thread (which the test is running on) has to have some logic in it to wait for outstanding operations to complete before you can assert anything. This gets even more difficult when working with ViewModels, which require that certain updates be performed on the main/UI thread in order to work properly.

This is particularly important for WPF and PropertyChanged events, which only work properly if you raise them on the UI thread.
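As a rough sketch of what that looks like in practice (the view model and the work being done are made up), the usual trick is to grab a scheduler bound to the UI thread’s synchronization context and hand it to the continuation:

// Must be called on the UI thread, so the scheduler captures the UI synchronization context.
var uiScheduler = TaskScheduler.FromCurrentSynchronizationContext();

Task.Factory
    .StartNew(() => LoadOrders())                       // slow work happens on a thread pool thread
    .ContinueWith(t => viewModel.Orders = t.Result,     // property set (and PropertyChanged) happens on the UI thread
        uiScheduler);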

The good news is that you can write your own TaskScheduler and then inject it for testing purposes, assuming you’ve written your code in such a way that your Tasks are created with a specified scheduler.

I’ve written a TaskScheduler that is not threaded at all (i.e. it just executes the task inline), one that accumulates operations to be executed at will (good for testing transitions) and another that allows you to run tasks on the WPF Dispatcher construct (i.e. the main/UI thread in WPF) and I use all of them frequently while developing applications that rely on Tasks to do things.

They aren’t even all that hard to write really, as the following NonThreadedTaskScheduler shows:

public sealed class NonThreadedTaskScheduler : TaskScheduler
{
    // Execute queued tasks immediately on the calling thread instead of handing them off.
    protected override void QueueTask(Task task)
    {
        TryExecuteTask(task);
    }

    // Inline execution is always allowed, because everything runs on the calling thread anyway.
    protected override bool TryExecuteTaskInline(Task task, bool taskWasPreviouslyQueued)
    {
        return TryExecuteTask(task);
    }

    // Nothing is ever actually held in a queue, so there is never anything to report.
    protected override IEnumerable<Task> GetScheduledTasks()
    {
        return Enumerable.Empty<Task>();
    }

    // Only one task ever runs at a time.
    public override int MaximumConcurrencyLevel
    {
        get { return 1; }
    }
}
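And here’s a rough sketch of the injection side (the class is made up, but the shape is the important bit): as long as the production code creates its Tasks against a supplied scheduler, a test can swap in the NonThreadedTaskScheduler and assert immediately, with no waiting or polling.

public class Loader
{
    private readonly TaskScheduler _scheduler;

    public Loader(TaskScheduler scheduler)
    {
        _scheduler = scheduler;
    }

    public Task<string> Load()
    {
        // All Tasks are created against the injected scheduler, never the ambient default one.
        return Task.Factory.StartNew(() => "loaded", CancellationToken.None, TaskCreationOptions.None, _scheduler);
    }
}

// In a test, everything runs inline on the test thread.
var loader = new Loader(new NonThreadedTaskScheduler());
var task = loader.Load();
// The task is already complete here, so assertions can happen straight away.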

Without the ability to change how the tasks are executed, none of that would have been possible, and while they would still be useful, they would be much harder to actually use and test.

Coincidentally, this is one of the reasons why I’m a little bit leery of the async/await keywords in C#, because I don’t think they expose any of the scheduling functionality, instead assuming that you are happy with the magical way that they work. Honestly though, my knowledge of async/await is fairly limited, so that might not be entirely correct.

Conclusion

To conclude, asynchronous code is hard and its bugs are subtle and hard to detect.

It’s also one of the easiest and highest impact ways to wring more performance out of a piece of software. If you’re doing a lot of things that naturally involve waiting (disk writes, network calls, etc), you stand to gain an incredible amount of time by just doing other things during those periods when you’re waiting on something external and out of your control.

Double edged sword really.

Like most software.


Almost a month ago, I made a bold claim that switching our Logstash instances to HTTP outputs for all log event traffic wouldn’t be all that difficult.

Oh god, how wrong I was.

In some weird display of prescience however, these were the exact words I used:

Conceptually, changing our Logstash instances to use a HTTP output instead of TCP should also be easy. Just need to change some configuration and deploy through Octopus. Keep in mind I haven’t actually done it yet, but it feels simple enough.

What Is Logstash?

Theoretically, I was entirely correct. It was as simple as changing the TCP output to HTTP. Same host, no fancy settings, no real complexity in the configuration.

For good measure, I upgraded the version of Logstash to 5.4.1 as well (from 5.2.0), which was a whole thing that I’m not going to get into right now. Suffice to say, I had a very recent version of Logstash installed when it all went to tatters.

We’ve got a pretty good deployment pipeline for distributing Logstash to our various API instances and whatnot, so after I made the changes, I checked in, builds kicked off, tests were run, deployments were made, and it was time to check whether or not CI was still working.

It was fine. For the first time, I could see the traffic caused by Logstash on the Broker using CloudWatch, which was nice.

Promote to Staging, same thing. Worked fine, traffic visible, all seemed well.

Promote to Prod, same thing. Worked fine, traffic visible, all is well.

And that’s the end of the story.

Baby Don’t Hurt Me

Ha ha ha ha ha ha no.

What I really should have done is let Staging run for a few days before promoting anything. I guess I was just a little too eager.

It wasn’t long after promoting to production that I noticed data disappearing from our dashboards and visualizations. At first glance, it all seemed to be isolated to one of our services in particular, so I started digging.

  • Log files were being appended to locally, so not the issue we encountered recently with our Nancy APIs. This problem had different symptoms anyway (no logs at all, instead of just no application logs).
  • Restarting the local instance of Logstash seemed to get a few events through into the ELK stack, so the communication channels were intact (i.e. traffic wasn’t being blocked at the network level), but only a few hundred.
  • Broker seemed to be healthy and handling traffic well, so it didn’t seem like a resource issue on the ELK stack side.

Based on the evidence, it looked like the problem was entirely the fault of the local Logstash instances, but if that was the case, why not more failures? At this point in the investigation, other Logstash instances appeared to be pushing their log events into the ELK stack without issue.

There didn’t even seem to be anything special about the logs going through the failing Logstash instances either. It was all pretty vanilla IIS, Performance Counter and API application logs. Nothing fancy.

Time to dig into Logstash itself.

The good news was that Logstash 5.4.1 features a really nice internal API, granting access to a bunch of information about the internal workings of the application.

Using a little bit of Powershell you can get statistics about the Logstash pipeline.

(Invoke-WebRequest http://localhost:9600/_node/stats/pipeline).Content

On one of the broken machines, our output looked something like this:

{ "host" : "a-hostname", "version" : "5.4.1", "http_address" : "127.0.0.1:9600", "id" : "a82b92de-d402-4b99-a569-a859c8e00d2b", "name" : "a-name", "pipeline" : { "events" : { "duration_in_millis" : 11293, "in" : 377, "filtered" : 375, "out" : 250, "queue_push_duration_in_millis" : 28027 }, "plugins" : { "inputs" : [ { "id" : "get-events-from-api-iis-logs", "events" : { "out" : 129, "queue_push_duration_in_millis" : 14075 }, "name" : "file" }, { "id" : "get-events-from-system-stats", "events" : { "out" : 248, "queue_push_duration_in_millis" : 13952 }, "name" : "file" }, { "id" : "get-events-from-api-application-logs", "events" : { "out" : 0, "queue_push_duration_in_millis" : 0 }, "name" : "file" } ], "filters" : [ { "id" : "break-iis-fields-down-via-csv", "events" : { "duration_in_millis" : 1135, "in" : 128, "out" : 128 }, "name" : "csv" }

# More filters here, cut for brevity

], "outputs" : [ { "id" : "send-events-to-elk-broker-over-http", "events" : { "duration_in_millis" : 5441, "in" : 375, "out" : 250 }, "name" : "http" } ] }, "reloads" : { "last_error" : null, "successes" : 0, "last_success_timestamp" : null, "last_failure_timestamp" : null, "failures" : 0 }, "queue" : { "type" : "memory" }, "id" : "main" } }

Super dense, but the interesting part is the HTTP output at the bottom.

See how it has 375 events in, but only 250 out?

It hangs like that forever.

Restarting Logstash causes it to start again and send a similar number of events, but eventually hang in exactly the same manner.

Don’t Hurt Me, No More

I’ll be honest, I tried a lot of things to get this to work, and I just couldn’t.

In fact, the problem actually got worse, because other Logstash instances started failing in the same way, and more and more data dropped off our dashboards and visualizations.

In the end, I had no choice but to roll back to the TCP output (which was flaky, but in a known way) and to log an issue on the Logstash Github describing my problem in the hope that someone would be able to look into it and help out.

Why not look myself?

Well I started, but Logstash is Ruby code running through Java (which is then running on Windows via Powershell) and getting into that stack from a development point of view looked pretty complicated and time-consuming.

I had a bit of a look at the Ruby code of the HTTP output and the only thing I could think of was a potential race condition in the construction of the HTTP requests.

# Create an async request
request = client.background.send(@http_method, url, :body => body, :headers => headers)
request.call # Actually invoke the request in the background

request.on_success do |response|
  begin
    # bunch of code dealing with success
  end
end

request.on_failure do |exception|
  begin 
    # bunch of code dealing with failures
  end
end

I don’t know how the HTTP library works (I think it’s Manticore?), but if it doesn’t support the registering (and correct execution) of event handlers AFTER a call has completed, then it’s possible for the request.call to finish and the success and failure handlers to never run.

Part of the success handler’s responsibilities is to inform the rest of the Logstash pipeline that the HTTP output is done with its events, so that sort of thing could easily lead to a deadlock.

Again, I don’t know Ruby at all, so that is complete conjecture on my part.

Conclusion

This was another one of those issues that just came out of nowhere and stole more of my time than I would like to admit to. I would still like to switch all of our log event traffic to HTTP (for the same reasons I went through last time), but I have other things to do for now. At least it’s still working, even though the TCP input in Logstash sometimes just stops working. It doesn’t happen very often, and I have some monitoring around it, so it’s not the end of the world.

To be honest, a smarter man would have seen that it was not working with no clear solution, rolled back to the TCP output and moved on. Probably within a few hours.

I am not a smart man.

I am, however, ridiculously stubborn, so I wailed away at this problem until it stole my will to live, then I gave up.


If anybody actually reads this blog, it would be easy to see that my world has been all about log processing and aggregation for a while now. I mean, I do other things as well, but the questions and problems that end up here on this blog have been pretty focused around logging for as far back as I care to remember. The main reason is that I just like logging, and appreciate the powerful insights that it can give about your software. Another reason is that we don’t actually have anyone who sits in the traditional “operations” role (i.e. infrastructure/environment improvement and maintenance), so in order to help keep the rest of the team focused on our higher level deliverables, I end up doing most of that stuff.

Anyway, I don’t see that pattern changing any time soon, so on with the show.

While I was going about regaining control over our log stack, I noticed that it was extremely difficult to reason about TCP traffic when using an AWS ELB.

Why does this matter?

Well, the Broker layer in our ELK stack (i.e. the primary ingress point) uses a Logstash configuration with a TCP input, and as a result, all of the things that write to the Broker (our externally accessible Ingress API, other instances of Logstash, the ELB Logs Processor) use TCP. That’s a significant amount of traffic, and something that I’d really like to be able to monitor and understand.

Stealthy Traffic

As I currently understand it, when you make a TCP connection through an ELB, the ELB records the initial creation of the connection as a “request” (one of the metrics you can track with CloudWatch) and then pretty much nothing else after that. I mean, this makes sense, as it’s the ELB’s job to essentially pick an underlying machine to route traffic to, and most TCP connections created and used specifically as TCP connections tend to be long-lived (as opposed to TCP connections created as part of HTTP requests and responses).

As far as our three primary contributors are concerned:

  • The Logging Ingress API is pretty oblivious. It just makes a new TCP connection for each incoming log event, so unless the .NET Framework is silently caching TCP connections for optimization purposes, it’s going to cause one ELB request per log event.
  • The ELB Logs Processor definitely caches TCP connections. We went through a whole ordeal with connection pooling and reuse before it would function in production, so it’s definitely pushing multiple log events through a single socket.
  • The Logstash instances that we have distributed across our various EC2 instances (local machine log aggregation, like IIS and application logs) are using the Logstash TCP output. I assume it uses one (or many) long-lived connections to do its business, but I don’t really know. Logstash is very mysterious.

This sort of usage makes it very hard to tell just how many log events are coming through the system via CloudWatch, which is a useful metric, especially when things start to go wrong and you need to debug which part of the stack is actually causing the failure.

Unfortunately, the monitoring issue isn’t the only problem with using the Logstash TCP input/output. Both input and output have, at separate times, been…flaky. I’ve experienced both sides of the pipeline going down for no obvious reason, or simply not working after running uninterrupted for a long time.

The final nail in the coffin for TCP came recently, when Elastic.co released the Logstash Persistent Queue feature for Logstash 5.4.0, which does not work with TCP at all (it only supports inputs that use the request-response model). I want to use persistent queues to remove both the Cache and Indexer layers from our log stack, so it was time for TCP to die.

Socket Surgery

Adding a HTTP input to our Broker layer was easy enough. In fact, such an input was already present because the ELB uses a HTTP request to check whether or not the Broker EC2 instances are healthy.

Conceptually, changing our Logstash instances to use a HTTP output instead of TCP should also be easy. Just need to change some configuration and deploy through Octopus. Keep in mind I haven’t actually done it yet, but it feels simple enough.

In a similar vein, changing our Logging Ingress API to output through HTTP instead of TCP should also be easy. A small code change to use HttpClient or RestSharp or something, a new deployment and everything is puppies and roses. Again, I haven’t actually done it yet, so who knows what dragons lurk there.
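For what it’s worth, I imagine that change will end up looking something like the following sketch (the broker address, the shape of the log event and the use of Json.NET for serialization are all assumptions at this point):

// Inside some async method in the Logging Ingress API (hypothetical).
// In real code the HttpClient would probably be created once and reused, rather than per call.
using (var client = new HttpClient { BaseAddress = new Uri("http://broker.example.com") })
{
    var body = JsonConvert.SerializeObject(logEvent);
    var content = new StringContent(body, Encoding.UTF8, "application/json");

    // One POST per log event, which the Broker's HTTP input will see as a normal request/response pair.
    var response = await client.PostAsync("/", content);
    response.EnsureSuccessStatusCode();
}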

Then we have the ELB Logs Processor, which is a whole different kettle of fish.

It took a significant amount of effort to get it working with TCP in the first place (connection pooling was the biggest problem), and due to the poor quality of the Javascript (entirely my fault), it’s pretty tightly coupled to that particular mechanism.

Regardless of the difficulty, TCP has to go, for the good of the log stack.

The first issue I ran into was “how do you even do HTTP requests in Node 4.3.2 anyway?”. There are many answers to this question, but the most obvious one is to use the HTTP API that comes with Node. Poking around this for a while showed that it wasn’t too bad, as long as I didn’t want to deal with a response payload, which I didn’t.

The biggest issue with the native Node HTTP API was that it was all callbacks, all the time. In my misadventures with the ELB Logs Processor I’d become very attached to promises and the effect they have on the readability of the resulting code, and didn’t really want to give that up so easily. I dutifully implemented a simple promise wrapper around our specific usage of the native Node HTTP API (which was just a POST of a JSON payload), and incorporated it into the Lambda function.

Unfortunately, this is where my memory gets a little bit fuzzy (it was a few weeks ago), and I don’t really remember how well it went. I don’t think it went well, because I decided to switch to a package called Axios which offered promise based HTTP requests out of the box.

Axios of Evil

Axios was pretty amazing. Well, I mean, Axios IS pretty amazing, but I suppose that sentence gave it away that the relationship didn’t end well.

The library did exactly what it said it did and with its native support for promises, was relatively easy to incorporate into the existing code, as you can see from the following excerpt:

// snip, whole bunch of setup, including summary object initialization

let axios = require('axios').create({
    baseURL: 'http://' + config.logstashHost,
    headers: {'Content-Type': 'application/json'}
});

// snip, more setup, other functions

function handleLine(line) {
    summary.lines.encountered += 1;

    var entry = {
        Component: config.component,
        SourceModuleName: config.sourceModuleName,
        Environment: config.environment,
        Application: config.application,
        message: line,
        type: config.type,
        Source: {
            S3: s3FileDetails
        }
    };
        
    var promise = axios
        .post("/", entry)
        .then((response) => {
            summary.lines.sent += 1;
        })
        .catch((error) => { 
            summary.failures.sending.total += 1;
            if (summary.failures.sending.lastFew.length >= 5) {
                summary.failures.sending.lastFew.shift();
            }
            summary.failures.sending.lastFew.push(error);
        });

    promises.push(promise);
}

// snip, main body of function (read S3, stream to line processor, wait for promises to finish)

Even though it took a lot of effort to write, it was nice to remove all of the code relating to TCP sockets and connection pooling, as it simplified the whole thing.

The (single, manual) test proved that it still did its core job (contents of file written into the ELK stack), it worked in CI and it worked in Staging, so I was pretty happy.

For about 15 minutes that is, until I deployed it into Production.

Crushing Disappointment

Just like last time, the implementation simply could not deal with the amount of traffic being thrown at it. Even worse, it wasn’t actually logging any errors or giving me any indication as to why it was failing. After a brief and frustrating investigation, it looked like it was running out of memory (the Lambda function was only configured to use 192 MB, which had been enough for the TCP approach) and falling over once it hit that limit. That was only a hypothesis, and I was never able to conclusively prove it, but the function was definitely using all of the memory available to it each time it ran.

I could have just increased the available memory, but I wanted to understand where all the memory was going first.

Then I realised I would have to learn how to do memory analysis in Javascript, and I just gave up.

On Javascript that is.

Instead, I decided to rewrite the ELB Logs Processor in .NET Core, using a language that I actually like (C#).

Conclusion

This is one of those cases where looking back, with the benefits of hindsight, I probably should have just increased the memory until it worked and then walked away.

But I was just so tired of struggling with Javascript and Node that it was incredibly cathartic to just abandon it all in favour of something that actually made sense to me.

Of course, implementing the thing in C# via .NET Core wasn’t exactly painless, but that’s a topic for another time.

Probably next week.


A very quick post this week, because I’ve been busy rebuilding our ELB Logs Processor in .NET Core. I had some issues converting it to use HTTP instead of TCP for connecting to Logstash and I just got tired of dealing with Javascript.

I’m sure Javascript is a perfectly fine language capable of accomplishing many wondrous things. It’s not a language I will voluntarily choose to work with though, not when I can pick C# to accomplish the same thing.

On to the meat of this post though, which is a quick heads up for people who want to upgrade from Logstash 5.2.X to 5.4.0 (something I did recently for the Broker/Indexer layer inside our Log Aggregation Stack).

Make sure you configure a data directory, and that the directory both exists and has appropriate permissions.

Queueing Is Very British

Logstash 5.4.0 marked the official release of the Persistent Queues feature (which had been in beta for a few versions). This is a pretty neat feature that allows you to skip the traditional queue/cache layer in your log aggregation stack. Basically, when enabled, it inserts a disk queue into your Logstash instance in between inputs and filters/outputs. It only works for inputs that have request/response models (so HTTP good, TCP bad), but it’s a pretty cool feature all round.
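For reference, enabling it boils down to a couple of settings in logstash.yml (the values below are illustrative, not something we actually run):

# logstash.yml
queue.type: persisted           # defaults to memory
path.queue: /opt/logstash/queue # where the on-disk queue lives
queue.max_bytes: 1gb            # cap on disk usage before inputs are pushed back on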

I have plans to eventually use it to completely replace our Cache and Indexer layers in the log aggregation stack (a win for general complexity and number of moving parts), but when I upgraded to 5.4.0 I left it disabled because we already have Elasticache:Redis for that.

That didn’t stop it from causing problems though.

I Guess They Just Like Taking Orderly Turns

Upgrading the version of Logstash we use is relatively straightforward. We bake a known version of Logstash into a new AMI via Packer, update an environment parameter for the stack, kick off a build and let TeamCity/Octopus take care of the rest.

To actually bake the AMI, we just update the Packer template with the new information (in this case, the Logstash version that should be installed via yum) and then run it through TeamCity.

On the other side, in the environment itself, when we update the AMI in use, CloudFormation will slowly replace all of the EC2 instances inside the Auto Scaling Group with new ones, waiting for each one to come online before continuing. We use Octopus Deploy Triggers to automate the deployment of software to those machines when they come online.

This is where things started to fall down with Logstash 5.4.0.

The Octopus deployment of the Logstash Configuration was failing. Specifically, Logstash would simply never come online with the AMI that used 5.4.0 and the configuration that we were using successfully for 5.2.0.

The Logstash log files were full of errors like this:

[2017-05-24T04:42:02,021][FATAL][logstash.runner          ] An unexpected error occurred! 
{
    :error => #<ArgumentError: Path "/usr/share/logstash/data/queue" must be a writable directory. It is not writable.>,
    :backtrace => [
        "/usr/share/logstash/logstash-core/lib/logstash/settings.rb:433:in `validate'", 
        "/usr/share/logstash/logstash-core/lib/logstash/settings.rb:216:in `validate_value'", 
        "/usr/share/logstash/logstash-core/lib/logstash/settings.rb:132:in `validate_all'", 
        "org/jruby/RubyHash.java:1342:in `each'", 
        "/usr/share/logstash/logstash-core/lib/logstash/settings.rb:131:in `validate_all'", 
        "/usr/share/logstash/logstash-core/lib/logstash/runner.rb:217:in `execute'", 
        "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/clamp-0.6.5/lib/clamp/command.rb:67:in `run'", 
        "/usr/share/logstash/logstash-core/lib/logstash/runner.rb:185:in `run'", 
        "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/clamp-0.6.5/lib/clamp/command.rb:132:in `run'", 
        "/usr/share/logstash/lib/bootstrap/environment.rb:71:in `(root)'"
    ]
}

A bit weird considering I hadn’t changed anything in our config, but it makes sense that maybe Logstash itself can’t write to the directory it was installed into by yum, and the new version now needs to do just that.

Moving the data directory was simple enough. Add path.data to the logstash.yml inside our configuration package, making sure that the data directory exists and that the Logstash user/group has ownership and full control.
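In logstash.yml, that’s a single setting (the path below is illustrative):

# logstash.yml
path.data: /opt/logstash/data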

I still got the same error though, except the directory was different (it was the one I specified).

I Mean Who Doesn’t

I fought with this problem for a few hours to be honest. Trying various permutations of permissions, ACLs, ownership, groups, users, etc.

In the end, I just created the queue directory ahead of time (as part of the config deployment) and set the ownership of the data directory recursively to the Logstash user/group.

This was enough to make Logstash stop complaining about the feature I didn’t want to use and get on with its life.

I still don’t understand what happened though, so I logged an issue in the Logstash repo in Github. Maybe someone will explain it one day. Weirdly it looks like Logstash created a directory that it was not allowed to access (the /queue directory under the specified data directory), which leads me towards something being wrong with my configuration (like ownership or something like that), but I couldn’t find anything that would point to that.

Conclusion

This one really came out of left field. I didn’t expect the upgrade to 5.4.0 to be completely painless (rarely is a software upgrade painless), but I didn’t expect to struggle with an issue caused by a feature that I didn’t even want to use.

What’s even weirder about the whole thing is that Persistent Queues were available in the version of Logstash that I was upgrading from (5.2.0), at least in beta form, and I had no issues whatsoever.

Don’t get me wrong, Logstash is an amazing product, but it can also be incredibly frustrating.


The log stack is mostly under control now:

  • We have appropriate build and deployment pipelines in place
  • Everything is hosted in our AWS accounts
  • We can freely deploy both environments and configuration for each layer in the stack (Broker, Cache, Indexer, Storage)
  • We’ve got Cerebro in place to help us visualize Elasticsearch

We’re in a pretty good place to be honest.

Still, there are a few areas I would like to improve before I run away screaming, especially while we still have the old stack up and running in parallel (while we smooth out any kinks in the new stack).

One area in particular that needs some love is the way in which we store data in Elasticsearch. Up until now, we’ve mostly just left Elasticsearch to its own devices. Sure, it had a little help from Logstash, but it was mostly up to Elasticsearch what it did with incoming data. We did have a simple index template in place, but it was pretty vanilla. Basically we just left dynamic field mapping on and didn’t really think about it a whole lot.

The problem with this approach is that Elasticsearch is technically doing a whole bunch of work (and storing a whole bunch of data) for no real benefit to us, as we have thousands of fields, but only really search/aggregate on a few hundred at best. Our daily logstash indexes contain quite a lot of documents (around 40 million) and generally tally up at around 35GB each. Depending on how much of that 35GB of storage belongs to indexed fields that have no value, there might be considerable savings in reining the whole process in.

That doesn’t even take into account the cognitive load in having to deal with a large number of fields whenever you’re doing analysis, or the problems we’ve had with Kibana and refreshing our index mappings when there are many fields.

It was time to shackle the beast.

Anatomy Of A Template

Index templates are relatively simple constructs, assuming you understand some of the basic concepts behind indexes, types and field mappings in Elasticsearch. You could almost consider them to be schemas, but that is not strictly accurate, because you can change a schema, but you can’t really change an index once it has been created. They really are templates in that sense, because they only apply when a new index is created.

Basically, a template is a combination of index settings (like replicas, shards, field limits, etc), types (which are collections of fields), and field mappings (i.e. Event.Name should be treated as text, and analysed up to the first 256 characters). They are applied to new indexes based on a pattern that matches against the new index’s name. For example, if I had a template that I wanted to apply to all logstash indexes (which are named logstash-YY.MM.DD), I would give it a pattern of logstash-*.

For a more concrete example, here is an excerpt from our current logstash index template:

{
  "order": 0,
  "template": "logstash-*",
  "settings": {
    "index": {
      "refresh_interval": "5s",
      "number_of_shards": "3",
      "number_of_replicas": "2",
      "mapper.dynamic": false
    }
  },
  "mappings": {
    "logs": {
      "dynamic" : false,
      "_all": {
        "omit_norms": true,
        "enabled": false
      },
      "properties": {
        "@timestamp": {
          "type": "date",
          "doc_values": true,
          "index": true
        },
        "@version": {
          "type": "keyword",
          "index": false,
          "doc_values": true
        },
        "message" : {
          "type" : "text",
          "index": false,
          "fielddata": false
        },
        "Severity" : {
          "type": "keyword",
          "index": true,
          "doc_values": true
        },
        "TimeTaken" : {
          "type": "integer",
          "index": true,
          "doc_values": true
        }
      }
    }
  },
  "aliases": {}
}

Templates can be managed directly from the Elasticsearch HTTP API via the /_template/{template-name} endpoint.

By default, the mappings.{type}.dynamic field is set to true when creating an index. This means that based on the raw data encountered, Elasticsearch will attempt to infer an appropriate type for the field (i.e. if it sees numbers, it’s probably going to make it a long or something). To be honest, Elasticsearch is pretty good at this, assuming your raw data doesn’t contain fields that sometimes look like numbers and sometimes look like text.

Unfortunately, ours does, so we can sometimes get stuck in a bad situation where Elasticsearch will infer a field as a number, and every subsequent document with text in that field will fail to index. This is a mapping conflict, and is a massive pain, because you can’t change a field mapping. You have to delete the index, or make a new index and migrate the data across. In the case of logstash, because you have time-based indexes, you can also just wait it out.

This sort of thing can be solved by leaving dynamic mapping on, but specifying the type of the troublesome fields in the template.
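For example, if a field sometimes looks numeric and sometimes doesn’t (the field name below is made up), pinning it in the template sidesteps the inference entirely. An excerpt in the same shape as the template above:

"properties": {
  "ResponseCode": {
    "type": "keyword",
    "index": true,
    "doc_values": true
  }
}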

The other downside of dynamic mapping is the indexing of fields that you really don’t need to be indexed, which takes up space for no benefit. This is actually pretty tricky though, because if you don’t index a field in some way, it’s still stored, but you can’t search or aggregate on it without creating a new index and adding an appropriate field mapping. I don’t know about you, but I don’t always know exactly what I want to search/aggregate on before the situation arises, so it’s a dangerous optimization to make.

This is especially true for log events, which are basically invisible up to the point where you have to debug some arcane thing that happened to some poor bastard.

I’m currently experimenting with leaving dynamic mapping off until I get a handle on some of the data coming into our stack, but I imagine that it will probably be turned back on before I’m done, sitting alongside a bunch of pre-defined field mappings for consistency.

Template Unleashed

With a template defined (like the example above), all that was left was to create a deployment pipeline.

There were two paths I could have gone down.

The first was to have a package specifically for the index template, with its own Octopus project and a small amount of logic that used the Elasticsearch HTTP API to push the template into the stack.

The second was to incorporate templates into the Logging.ELK.Elasticsearch.Config package/deployment, which was the package that dealt with the Elasticsearch configuration (i.e. master vs data nodes, EC2 discovery, ES logging, etc).

In the end I went with the second option, because I could not find an appropriate trigger to bind the first deployment to. Do you deploy when a node comes online? The URL might not be valid then, so you’d probably have to use the raw IP. That would mean exposing those instances outside of their ELB, which wasn’t something I wanted to do.

It was just easier to add some logic to the existing configuration deployment to deploy templates after the basic configuration completes.

# Wait for a few moments for Elasticsearch to become available
attempts=0
maxAttempts=20
waitSeconds=15
until $(curl --output /dev/null --silent --head --fail http://localhost:9200); do
    if [[ $attempts -ge $maxAttempts ]]; then 
        echo "Elasticsearch was not available after waiting ($attempts) times, sleeping for ($waitSeconds) seconds between each connection attempt"
        exit 1 
    fi
    attempts=$(($attempts + 1))
    echo "Waiting ($waitSeconds) to see if Elasticsearch will become available"
    sleep $waitSeconds
done

# Push the index template
template_upload_status=$(curl -XPUT --data "@/tmp/elk-elasticsearch/templates/logstash.json" -o /tmp/elk-elasticsearch/logstash-template-upload-result.json -w '%{http_code}' http://localhost:9200/_template/logstash;)
if [[ $template_upload_status -ne 200 ]]; then
    echo "Template upload failed"
    cat /tmp/elk-elasticsearch/logstash-template-upload-result.json
    exit 1
fi

A little bit more complicated than I would have liked, but it needs to wait for Elasticsearch to come online (and for the cluster to go green) before it can do anything, and the previous steps in this script actually restart the node (to apply configuration changes), so it’s necessary.

Conclusion

I’m hopeful that a little bit of template/field management will give us some big benefits in terms of the amount of fields we need to deal with and how much storage our indexes consume. Sure, we could always manage the template manually (usually via Kopf/Cerebro), but it feels a lot better to have it controlled and tracked in source control and embedded into our pipeline.

As I mentioned earlier, I still haven’t quite decided how to handle things in the long run, i.e. the decision between all manual mappings or some manual and the rest dynamic. It gets a bit complicated with index templates only applying once for each index (at creation), so if you want to put some data in you need to either anticipate what it looks like ahead of time, or you need to wait until the next index rolls around. I’ve got our logstash indexes running hourly (instead of daily), which helps, but I think it causes performance problems of its own, so it’s a challenging situation.

The other thing to consider is that managing thousands of fields in that template file sounds like it’s going to be a maintenance nightmare. Even a few hundred would be pretty brutal, so I’m wary of trying to control absolutely all of the things.

Taking a step back, it might actually be more useful to just remove those fields from the log events inside the Indexer layer, so Elasticsearch never even knows they exist.

Of course, you have to know what they are before you can apply this sort of approach anyway, so we’re back to where we started.