The log stack is mostly under control now:

  • We have appropriate build and deployment pipelines in place
  • Everything is hosted in our AWS accounts
  • We can freely deploy both environments and configuration for each layer in the stack (Broker, Cache, Indexer, Storage)
  • We’ve got Cerebro in place to help us visualize Elasticsearch

We’re in a pretty good place to be honest.

Still, there are a few areas I would like to improve before I run away screaming, especially while we still have the old stack up and running in parallel (while we smooth out any kinks in the new stack).

One area in particular that needs some love is the way in which we store data in Elasticsearch. Up until now, we’ve mostly just left Elasticsearch to its own devices. Sure, it had a little help from Logstash, but it was mostly up to Elasticsearch what it did with incoming data. We did have a simple index template in place, but it was pretty vanilla. Basically we just left dynamic field mapping on and didn’t really think about it a whole lot.

The problem with this approach is that Elasticsearch is technically doing a whole bunch of work (and storing a whole bunch of data) for no real benefit to us, as we have thousands of fields, but only really search/aggregate on a few hundred at best. Our daily logstash indexes contain quite a lot of documents (around 40 million) and generally tally up at around 35GB each. Depending on how much of that 35GB of storage belongs to indexed fields that have no value, there might be considerable savings in reining the whole process in.

That doesn’t even take into account the cognitive load in having to deal with a large number of fields whenever you’re doing analysis, or the problems we’ve had with Kibana and refreshing our index mappings when there are many fields.

It was time to shackle the beast.

Anatomy Of A Template

Index templates are relatively simple constructs, assuming you understand some of the basic concepts behind indexes, types and field mappings in Elasticsearch. You could almost consider them to be schemas, but that is not strictly accurate, because you can change a schema, but you can’t really change an index once it’s been created. They really are templates in that sense, because they only apply when a new index is created.

Basically, a template is a combination of index settings (like replicas, shards, field limits, etc), types (which are collections of fields), and field mappings (i.e. Event.Name should be treated as text, and analysed up to the first 256 characters). They are applied to new indexes based on a pattern that matches against the new index’s name. For example, if I had a template that I wanted to apply to all logstash indexes (which are named logstash-YY.MM.DD), I would give it a pattern of logstash-*.

For a more concrete example, here is an excerpt from our current logstash index template:

{
  "order": 0,
  "template": "logstash-*",
  "settings": {
    "index": {
      "refresh_interval": "5s",
      "number_of_shards": "3",
      "number_of_replicas": "2",
      "mapper.dynamic": false
    }
  },
  "mappings": {
    "logs": {
      "dynamic" : false,
      "_all": {
        "omit_norms": true,
        "enabled": false
      },
      "properties": {
        "@timestamp": {
          "type": "date",
          "doc_values": true,
          "index": true
        },
        "@version": {
          "type": "keyword",
          "index": false,
          "doc_values": true
        },
        "message" : {
          "type" : "text",
          "index": false,
          "fielddata": false
        },
        "Severity" : {
          "type": "keyword",
          "index": true,
          "doc_values": true
        },
        "TimeTaken" : {
          "type": "integer",
          "index": true,
          "doc_values": true
        }
      }
    }
  },
  "aliases": {}
}

Templates can be managed directly from the Elasticsearch HTTP API via the /_template/{template-name} endpoint.
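
In practice, a handful of curl commands is all you need to manage a template by hand; a rough sketch, assuming Elasticsearch is listening locally and the template JSON from above is sitting in a file (the file name here is just for illustration):

# Create or update the template named "logstash"
curl -XPUT http://localhost:9200/_template/logstash --data "@logstash-template.json"

# Inspect what Elasticsearch currently has stored for that template
curl -XGET http://localhost:9200/_template/logstash

# Delete it entirely
curl -XDELETE http://localhost:9200/_template/logstash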

By default, the mappings.{type}.dynamic field is set to true when creating an index. This means that based on the raw data encountered, Elasticsearch will attempt to infer an appropriate type for the field (i.e. if it sees numbers, it’s probably going to make it a long or something). To be honest, Elasticsearch is pretty good at this, assuming your raw data doesn’t contain fields that sometimes look like numbers and sometimes look like text.

Unfortunately, ours does, so we can sometimes get stuck in a bad situation where Elasticsearch will infer a field as a number, and any subsequent document with text in that field will fail to index. This is a mapping conflict, and is a massive pain, because you can’t change a field mapping. You have to delete the index, or make a new index and migrate the data across. In the case of logstash, because you have time based indexes, you can also just wait it out.
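
For completeness, the “make a new index and migrate the data across” option doesn’t have to be done by hand; Elasticsearch 5 has a reindex API that will copy documents from one index into another. A minimal sketch (the index names are made up):

curl -XPOST http://localhost:9200/_reindex --data '{ "source": { "index": "logstash-17.03.21" }, "dest": { "index": "logstash-17.03.21-v2" } }'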

This sort of thing can be solved by leaving dynamic mapping on, but specifying the type of the troublesome fields in the template.

The other downside of dynamic mapping is the indexing of fields that you really don’t need to be indexed, which takes up space for no benefit. This is actually pretty tricky though, because if you don’t index a field in some way, it’s still stored, but you can’t search or aggregate on it without creating a new index and adding an appropriate field mapping. I don’t know about you, but I don’t always know exactly what I want to search/aggregate on before the situation arises, so it’s a dangerous optimization to make.

This is especially true for log events, which are basically invisible up to the point where you have to debug some arcane thing that happened to some poor bastard.

I’m currently experimenting with leaving dynamic mapping off until I get a handle on some of the data coming into our stack, but I imagine that it will probably be turned back on before I’m done, sitting alongside a bunch of pre-defined field mappings for consistency.

Template Unleashed

With a template defined (like the example above), all that was left was to create a deployment pipeline.

There were two paths I could have gone down.

The first was to have a package specifically for the index template, with its own Octopus project and a small amount of logic that used the Elasticsearch HTTP API to push the template into the stack.

The second was to incorporate templates into the Logging.ELK.Elasticsearch.Config package/deployment, which was the package that dealt with the Elasticsearch configuration (i.e. master vs data nodes, EC2 discovery, ES logging, etc).

In the end I went with the second option, because I could not find an appropriate trigger to bind the first deployment to. Do you deploy when a node comes online? The URL might not be valid then, so you’d probably have to use the raw IP. That would mean exposing those instances outside of their ELB, which wasn’t something I wanted to do.

It was just easier to add some logic to the existing configuration deployment to deploy templates after the basic configuration completes.

# Wait for a few moments for Elasticsearch to become available
attempts=0
maxAttempts=20
waitSeconds=15
until curl --output /dev/null --silent --head --fail http://localhost:9200; do
    if [[ $attempts -ge $maxAttempts ]]; then
        echo "Elasticsearch was not available after ($attempts) attempts, waiting ($waitSeconds) seconds between each attempt"
        exit 1
    fi
    attempts=$(($attempts + 1))
    echo "Waiting ($waitSeconds) seconds to see if Elasticsearch will become available"
    sleep $waitSeconds
done

# Push the index template
template_upload_status=$(curl -XPUT --data "@/tmp/elk-elasticsearch/templates/logstash.json" -o /tmp/elk-elasticsearch/logstash-template-upload-result.json -w '%{http_code}' http://localhost:9200/_template/logstash)
if [[ $template_upload_status -ne 200 ]]; then
    echo "Template upload failed"
    cat /tmp/elk-elasticsearch/logstash-template-upload-result.json
    exit 1
fi

A little bit more complicated than I would have liked, but it needs to wait for Elasticsearch to come online (and for the cluster to go green) before it can do anything, and the previous steps in this script actually restart the node (to apply configuration changes), so it’s necessary.

Conclusion

I’m hopeful that a little bit of template/field management will give us some big benefits in terms of the amount of fields we need to deal with and how much storage our indexes consume. Sure, we could always manage the template manually (usually via Kopf/Cerebro), but it feels a lot better to have it controlled and tracked in source control and embedded into our pipeline.

As I mentioned earlier, I still haven’t quite decided how to handle things in the long run, i.e. the decision between all manual mappings or some manual and the rest dynamic. It gets a bit complicated because index templates only apply once for each index (at creation), so if you want to put some data in you either need to anticipate what it looks like ahead of time, or you need to wait until the next index rolls around. I’ve got our logstash indexes running hourly (instead of daily), which helps, but I think it causes performance problems of its own, so it’s a challenging situation.

The other thing to consider is that managing thousands of fields in that template file sounds like it’s going to be a maintenance nightmare. Even a few hundred would be pretty brutal, so I’m wary of trying to control absolutely all of the things.

Taking a step back, it might actually be more useful to just remove those fields from the log events inside the Indexer layer, so Elasticsearch never even knows they exist.

Of course, you have to know what they are before you can apply this sort of approach anyway, so we’re back to where we started.

This post is not as technical as some of my others. I really just want to bring attention to a tool for Elasticsearch that I honestly don’t think I could do without.

Cerebro.

From my experience, one of the hardest things to wrap my head around when working with Elasticsearch was visualizing how everything fit together. My background is primarily C# and .NET in a very Microsoft world, so I’m used to things like SQL Server, which comes with an excellent exploration and interrogation tool in the form of SQL Server Management Studio. When it comes to Elasticsearch though, there seems to be no equivalent, so I felt particularly blind.

Since starting to use Elasticsearch, I’ve become more and more fond of using the command line, so I’ve started to appreciate its amazing HTTP API more and more, but that initial learning curve was pretty vicious.

Anyway, to bring it back around, my first port of call when I started using Elasticsearch was to find a tool conceptually similar to SQL Server Management Studio. Something I could use to both visualize the storage system (however it worked) and possibly even query it as necessary.

I found Kopf.

Kopf did exactly what I wanted it to do. It provided a nice interface on top of Elasticsearch that helped me visualize how everything was structured and what sort of things I could do. To this day, if I attempt to visualize an Elasticsearch cluster in my head, the pictures that come to mind are of the Kopf interface. I can thank it for my understanding of the cluster, the nodes that make it up and the indexes stored therein, along with the excellent Elasticsearch documentation of course.

Later on I learnt that Kopf didn’t have to be used from the creator’s demonstration website (which is how I had been using it, connecting from my local machine to our ES ELK cluster), but could in fact be installed as a plugin inside Elasticsearch itself, which was even better, because you could access it from {es-url}/_plugin/kopf, which was a hell of a lot easier.

Unfortunately, everything changed when the fire nation attacked…

No wait, that’s not right.

Everything changed when Elasticsearch 5 was released.

I’m The Juggernaut

Elasticsearch 5 deprecated site plugins. No more site plugins meant no more Kopf, or at least no more Kopf hosted within Elasticsearch. This made me sad, obviously, but I could still use the standalone site, so it wasn’t the end of the world.

My memory of the next bit is a little bit fuzzy, but I think even the standalone site stopped working properly when connecting to Elasticsearch 5. The creator of Kopf was no longer maintaining the project either, so it was unlikely that the problems would be solved.

I was basically blind.

Enter Cerebro.

No bones about it, Cerebro IS Kopf. It’s made by the same guy and is still being actively developed. It’s pretty much a standalone Kopf (i.e. built in web server), and any differences between the two (other than some cosmetic stuff and the capability to easily save multiple Elasticsearch addresses) are lost on me.

As of this post, it’s up to 0.6.5, but as far as I can tell, it’s fully functional.

For my usage, I’ve incorporated Cerebro into our ELK stack, with a simple setup (ELB + single instance ASG), pre-configured with the appropriate Elasticsearch address in each environment that we spin up. As is the normal pattern, I’ve set it up on an AMI via Packer, and I deploy its configuration via Octopus deploy, but there is nothing particularly complicated there.

Kitty, It’s Just A Phase

This post is pretty boring so far, so let’s talk about Cerebro a little with the help of a screenshot.

This is the main screen of Cerebro, and it contains a wealth of information to help you get a handle on your Elasticsearch cluster.

It shows an overview of the cluster status, data nodes, indexes and their shards and replicas.

  • The cluster status is shown at the top of the screen, mostly via colour. Green good, yellow troublesome, red bad. Helpfully enough, the icon in the browser also changes colour according to the cluster status.
  • Data nodes are shown on the left, and display information like memory, cpu and disk, as well as IP address and name.
  • Indexes pretty much fill the rest of the screen, displaying important statistics like the number of documents and size, while allowing you to access things like mappings and index operations (like delete).
  • The intersection of index and data node gives information about shard/replica allocation. In the example above, we have 3 shards, 2 replicas and 3 nodes, so each node has a full copy of the data. Solid squares indicate the primary shard.
  • If you have unassigned or relocating shards, this information appears directly above the nodes, and shards currently being moved are shown in the same place as normal shards, except blue.

Honestly, I don’t really use the other screens in Cerebro very much, or at least nowhere near as much as I use the overview screen. The dedicated nodes screen can be useful to view your master nodes (which aren’t shown on the overview), and to get a more performance focused display. I’ve also used the index templates screen for managing/viewing our logstash index template, but that’s mostly done through an Octopus deployment now.

There are others (including an ad-hoc query screen), but again, I haven’t really dug into them in depth. At least not enough to talk about them anyway.

That first screen though, the overview, is worth its weight in gold as far as I’m concerned.

Conclusion

I doubt I would understand Elasticsearch anywhere near as much as I do without Kopf/Cerebro. Realistically, I don’t really understand it much at all, but that little understanding I do have would be non-existent without these awesome tools.

It’s not just a one-horse town though. Elastic.co provides some equivalent tools as well (like Monitoring (formerly Marvel)) which offer similar capabilities, but they are mostly paid services as far as I can tell, so I’ve been hesitant to explore them in more depth.

I’m already spending way too much on the hardware for our log stack, so adding software costs on top of that is a challenging battle that I’m not quite ready to fight.

It doesn’t help that the last time I tried to price it, their answer for “How much for the things?” was basically “How much you got?”.

We’ve been doing some work recently to regain control over our Log Stack. Nothing groundbreaking, just cleaning up, creating deployment pipelines and moving infrastructure around.

To go into it in a little more detail, the nature of the work could be broadly categorised as follows:

  • Amazon Machine Images (AMIs) created through Packer (via TeamCity + Powershell)
  • CloudFormation scripts executed through Octopus Deploy
  • Application specific configuration files deployed to Linux machines through Octopus Deploy

The third topic is relevant to this post, as it concerns an annoying problem I encountered while working on the configuration for the Elasticsearch layer.

Operating Tolerances

I do all of my development on my Surface Book running whatever flavour of Windows 10 is the most current. When doing C# application/service development, I mostly use Visual Studio, and for just about everything else (Powershell, CloudFormation, Bash Scripts, etc) I use Visual Studio Code (its plugins are really good). This holds true even when I’m building things to deploy to Linux operating systems.

When I was creating the repository/Nuget package/deployment logic for the Elasticsearch configuration, it was all Visual Studio Code, all the time.

The first pass at the configuration was pretty straightforward. An elasticsearch.yml file with some variable substitution via Octopus Deploy dealing with things like master vs data nodes, cluster names and cluster discovery. Once completed and deployed, this worked pretty much as expected, with the Elasticsearch cluster coming online in the arrangement I specified.
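
To give a sense of what that looked like, here’s a heavily trimmed sketch (not our exact file; the #{...} tokens are Octopus variable substitutions, the variable names are made up for illustration, and the EC2 discovery settings are omitted):

cluster.name: "#{ClusterName}"
node.master: #{IsMasterNode}
node.data: #{IsDataNode}
network.host: 0.0.0.0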

Further tweaking was required though, because the default memory allocation settings for Elasticsearch do not take available memory into account.

You might be wondering when you would need to take control over something as low level as the amount of memory allocated to the process. Elasticsearch is a pretty memory intensive process, so it does a lot of upfront memory allocation in order to make sure that performance doesn’t suffer later if the JVM decides it needs to swap memory or something similarly painful. The default allocation for ES is only 1GB though, so if you want to make use of your beefy EC2 instance with 32GB of RAM, you’ll have to do some tweaking.

Back when I first built an ELK Stack (waaaaaay back in March 2015) you would usually set an environment variable called ES_HEAP_SIZE in order to change the default memory allocation. You were free to set the Xms and Xmx JVM options directly when running the Elasticsearch process as well, but the recommended way was to use the environment variable.

Elasticsearch has evolved since then, and the current recommendation is to set memory allocation settings via a jvm.options file living in an appropriate place (typically /etc/elasticsearch). It is in this file that you can set the Xms (minimum) and Xmx (maximum) values for heap allocation, along with any other settings on the JVM that you might want to change.

The only complication for me was that we were deploying Elasticsearch to a variety of different EC2 instance types, with highly variable total amounts of memory, so the memory settings couldn’t be set in source, but had to be determined dynamically. Easy enough to accomplish with a small Bash script that runs during deployment:

# Echo commands after expansion
set -x

# Mutate and copy the JVM options file
es_memory=$(free -m | awk '/^Mem:/ { print int($2/2) }') || exit 1
sed -i "s/@@ES_MEMORY@@/$es_memory/" /tmp/elk-elasticsearch/jvm.options || exit 1

With the new jvm.options file in place, a new version of the configuration package was deployed to CI, and Elasticsearch broke.

You’re Lying To Me

The deployment logic is simple. Get Octopus to deploy some files to a temporary location, mutate those files if necessary (this is where the Bash script above comes in), copy the files to the appropriate locations and then restart Elasticsearch.

Simply restarting the Elasticsearch node is not necessarily the best way to go about things, given that Elasticsearch will try to shuffle shards/replicas around if a node disappears from the cluster. I’m going to ignore this particular fact for now, and hope that it doesn’t come back to bite us in the ass at a later date.

When the Elasticsearch service was restarted after the new deployment, it failed with the following error:

Invalid initial heap size: -Xms468M
Could not create the Java virtual machine.

A cursory internet search finds a decent number of these errors, but they mostly seem to be related to trying to allocate more memory than is available, i.e. a valid problem with the configuration. The problem was, Elasticsearch was running on a t2.micro EC2 instance, which has 1GB of memory, so 468MB is valid. The instance in question only runs Elasticsearch too, so it’s not like anything else had already stolen the memory.

My first instinct when I get error messages that seem to make no sense, especially with Java components, is to look at the encoding. In this particular case though, the jvm.options file was UTF-8, so that was a dead end.

The next step was to grab the jvm.options file that ships with Elasticsearch and see if I had made some sort of critical error when I was editing it. Eyeballing the two files showed no major differences though, aside from the change to the memory allocation settings.

Next idea?

Line breaks.

This was where I made a critical error. I looked at the file in Linux using vi and its :set list command. Both my file and the default jvm.options file that ships with Elasticsearch looked like they treated line endings the same.

Long story short, I tried a few more things, tore my (non-existent) hair out of a bit and then finally copied the two files to my computer and compared them in Notepad++.

My file had CRLF line endings, the default had LF. The extra carriage return was causing errors when the options file was interpreted, which meant it was trying to use a setting of -Xms468M[CR], which is strictly invalid.
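
In hindsight, a couple of stock Linux tools would have made the problem obvious much earlier (assuming the deployed file lives in /etc/elasticsearch, which is where ours ends up):

# cat -A prints carriage returns as ^M before the end-of-line $, so CRLF lines stand out
cat -A /etc/elasticsearch/jvm.options | head -n 5

# file also explicitly calls out "with CRLF line terminators"
file /etc/elasticsearch/jvm.options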

A quick edit to the Bash script used during deployment to clean this up and to stop it from happening again, and everything was fine.

# Echo commands after expansion
set -x

# Mutate and copy the JVM options file
es_memory=$(free -m | awk '/^Mem:/ { print int($2/2) }') || exit 1
sed -i "s/@@ES_MEMORY@@/$es_memory/" /tmp/elk-elasticsearch/jvm.options || exit 1
sed -i 's/\r//' /tmp/elk-elasticsearch/jvm.options || exit 1

Conclusion

This was one of those weird cases where I checked all of the usual suspects and yet still failed miserably to resolve the problem in a reasonable timeframe. A complete lack of experience using Linux native tools to debug and diagnose common issues was the main contributing factor, but I really wish software would give better errors in these cases.

I mean, anything would be better than “completely valid looking setting is invalid, please try again”.

I use Packer on and off. Mostly I use it to make Amazon Machine Images (AMIs) for our environment management packages, specifically by creating Packer templates that operate on top of the Amazon supplied Windows Server images.

You should never use an Amazon supplied Windows Server AMI in your Auto Scaling Group Launch Configurations. These images are regularly retired, so if you’ve taken a dependency on one, there is a good chance it will disappear just when you need it most. Like when you need to auto-scale your API cluster because you’ve unknowingly burnt through all of the CPU credits you had on the machines slowly over the course of the last few months. What you should do is create an AMI of your own from the ones supplied by AWS so you can control its lifetime. Packer is a great tool for this.

A Packer template is basically a set of steps to execute on a virtual machine of some sort, where the core goal is to take some sort of baseline thing, apply a set of steps to it programmatically and end up with some sort of reusable thing out the other end. Like I mentioned earlier, we mostly deal in AWS AMIs, but it can do a bunch of other things as well (VMware, Docker, etc).

The benefit of using a Packer template for this sort of thing (instead of just doing it all manually) is reproducibility. Specifically, if you built your custom image using the AWS AMI for Windows Server 2012 6 months ago, you can go and grab the latest one from yesterday (with all of the patches and security upgrades), execute your template on it and you’ll be in a great position to upgrade all of the existing usages of your old custom AMI with minimal effort.
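
As an illustration of what that dependency looks like, here is a rough sketch of an amazon-ebs builder that always starts from the most recent Amazon supplied image matching a name pattern (the region, name pattern and AMI name are made up, and the WinRM bootstrapping via user data is omitted):

{
    "type": "amazon-ebs",
    "region": "ap-southeast-2",
    "instance_type": "t2.large",
    "source_ami_filter": {
        "filters": {
            "name": "Windows_Server-2012-R2_RTM-English-64Bit-Base-*"
        },
        "owners": ["amazon"],
        "most_recent": true
    },
    "communicator": "winrm",
    "winrm_username": "Administrator",
    "ami_name": "custom-windows-2012-r2-{{timestamp}}"
}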

When using Packer templates though, you need to be cognizant of how errors are dealt with. Specifically:

Step failures appear to be indicated entirely by the exit code of the tool used in the step.

I’ve been bitten by this on two separate occasions.

A Powerful Cry For Help

Packer has much better support for Windows than it once did, but even taking that into account, Powershell steps can still be a troublesome beast.

The main issue with the Powershell executable is that if an error or exception occurs and terminates the process (i.e. it’s a terminating error or you have ErrorActionPreference set to Stop) the Powershell process itself still exits with zero.

In a sane world, an exit code of zero indicates success, which is what Packer expects (and most other automation tools like TeamCity/Octopus Deploy).

If you don’t take this into account, your Powershell steps may fail but the Packer execution will still succeed, giving you an artefact that hasn’t been configured the way it should have been.

Packer is pretty configurable though, and is very clear about the command that it uses to execute your Powershell steps. The great thing is, it also enables you to override that command, so you can customise your Powershell steps to exit with a non-zero code if an error occurs without actually having to change every line in your step to take that sort of thing into account.

Take this template excerpt below, which uses Powershell to set the timezone of the machine and turn off negative DNS result caching.

{
    "type": "powershell",
    "inline": [
        "tzutil.exe /s \"AUS Eastern Standard Time_dstoff\"",
        "[Microsoft.Win32.Registry]::SetValue('HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Services\\Dnscache\\Parameters','NegativeCacheTime',0,[Microsoft.Win32.RegistryValueKind]::DWord)"
    ],
    "execute_command": "powershell -Command \"$ErrorActionPreference = 'Stop'; try { & '{{.Path}}' } catch { Write-Warning $_; exit 1; } \""
}

The “execute_command” is the customisation, providing error handling for exceptions that occur during the execution of the Powershell snippet. Packer will take each line in that inline array, copy it to a file on the machine being set up (using WinRM) and then execute it using the command you specify. The {{.Path}} syntax is the Packer variable substitution and specifically refers to the path on the virtual machine that Packer has copied the current command to. With this custom command in place, you have a much better chance of catching errors in your Powershell commands before they come back to bite you later on.

So Tasty

In a similar vein to the failures with Powershell above, be careful when doing package installs via yum on Linux.

The standard “yum install” command will not necessarily exit with a non-zero code when a package fails to install. Sometimes it will, but if a package couldn’t be found (maybe you misconfigured the repository or something) it still exits with a zero.

This can throw a pretty big spanner in the works when you’re expecting your AMI to have Elasticsearch on it (for example) and it just doesn’t because the package installation failed but Packer thought everything was fine.

Unfortunately, there is no easy way to get around this like there is for the Powershell example above, but you can mitigate it by just adding an extra step after your package install that validates the package was actually installed.

{
    "type" : "shell",
    "inline" : [
        "sudo yum remove java-1.7.0-openjdk -y",
        "sudo yum install java-1.8.0 -y",
        "sudo yum update -y",
        "sudo sh -c 'echo \"[logstash-5.x]\" >> /etc/yum.repos.d/logstash.repo'",
        "sudo sh -c 'echo \"name=Elastic repsitory for 5.x packages\" >> /etc/yum.repos.d/logstash.repo'",
        "sudo sh -c 'echo \"baseurl=https://artifacts.elastic.co/packages/5.x/yum\" >> /etc/yum.repos.d/logstash.repo'",
        "sudo sh -c 'echo \"gpgcheck=1\" >> /etc/yum.repos.d/logstash.repo'",
        "sudo sh -c 'echo \"gpgkey=http://packages.elastic.co/GPG-KEY-elasticsearch\" >> /etc/yum.repos.d/logstash.repo'",
        "sudo sh -c 'echo \"enabled=1\" >> /etc/yum.repos.d/logstash.repo'",
        "sudo rpm --import https://packages.elastic.co/GPG-KEY-elasticsearch",
        "sudo yum install logstash-5.2.2 -y",
        "sudo rpm --query logstash-5.2.2"
    ]
}

In the example above, the validation is the rpm --query command after the yum install. It will return a non-zero exit code (and thus fail the Packer execution) if the package with that version is not installed.

Conclusion

Packer is an incredibly powerful automation tool for dealing with a variety of virtual machine platforms and I highly recommend using it.

If you’re going to use it though, you need to understand what failure means in your specific case, and you need to take that into account when you decide how to signal to the Packer engine that something isn’t right.

For me, I prefer to treat every error as critical and deal with it at the time the AMI is created, rather than 6 months later when I try to use the AMI and can’t figure out why the Windows Firewall on an internal API instance is blocking requests from its ELB. Not that that has ever happened of course.

In order to accomplish this lofty goal of dealing with errors ASAP you need to understand how each one of your steps (and the applications and tools they use) communicate failure, and then make sure they all communicate that appropriately in a way Packer can understand.

Understanding how to deal with failure is useful outside Packer too.