We’ve been doing some work recently to regain control over our Log Stack. Nothing groundbreaking, just cleaning up, creating deployment pipelines and moving infrastructure around.

To go into it in a little more detail, the nature of the work could be broadly categorised as follows:

  • Amazon Machine Images (AMIs) created through Packer (via TeamCity + Powershell)
  • CloudFormation scripts executed through Octopus Deploy
  • Application specific configuration files deployed to Linux machines through Octopus Deploy

The third topic is relevant to this post, as it concerns an annoying problem I encountered while working on the configuration for the Elasticsearch layer.

Operating Tolerances

I do all of my development on my Surface Book running whatever flavour of Windows 10 is the most current. When doing C# application/service development, I mostly use Visual Studio, and for just about everything else (Powershell, CloudFormation, Bash scripts, etc) I use Visual Studio Code (its plugins are really good). This holds true even when I’m building things to deploy to Linux operating systems.

When I was creating the repository/Nuget package/deployment logic for the Elasticsearch configuration, it was all Visual Studio Code, all the time.

The first pass at the configuration was pretty straightforward. An elasticsearch.yml file with some variable substitution via Octopus Deploy dealing with things like master vs data nodes, cluster names and cluster discovery. Once completed and deployed, this worked pretty much as expected, with the Elasticsearch cluster coming online in the arrangement I specified.
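
To give you an idea of the shape of it, the templated elasticsearch.yml looked something along these lines (the Octopus variable names here are invented purely for illustration):

cluster.name: "#{Elasticsearch.ClusterName}"
node.master: #{Elasticsearch.IsMasterNode}
node.data: #{Elasticsearch.IsDataNode}
discovery.zen.ping.unicast.hosts: ["#{Elasticsearch.DiscoveryHosts}"]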

Further tweaking was required though, because the default memory allocation settings for Elasticsearch do not take available memory into account.

You might be wondering why you would ever need to take control over something as low level as the amount of memory allocated to the process. Elasticsearch is a pretty memory intensive application, so it does a lot of upfront memory allocation in order to make sure that performance doesn’t suffer later if the operating system decides it needs to start swapping memory out or something similarly painful. The default allocation for ES is only 1GB though, so if you want to make use of your beefy EC2 instance with 32GB of RAM, you’ll have to do some tweaking.

Back when I first built an ELK Stack (waaaaaay back in March 2015) you would usually set an environment variable called ES_HEAP_SIZE in order to change the default memory allocation. You were free to set the Xms and Xmx JVM options directly when running the Elasticsearch process as well, but the recommended way was to use the environment variable.
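
From memory, that looked something like this, typically dropped into /etc/default/elasticsearch or wherever your init scripts sourced their environment variables from:

# Old style (pre 5.x): a single variable that Elasticsearch used to derive both -Xms and -Xmx
export ES_HEAP_SIZE=16g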

Elasticsearch has evolved since then, and the current recommendation is to set memory allocation settings via a jvm.options file living in an appropriate place (typically /etc/elasticsearch). It is in this file that you can set the Xms (minimum) and Xmx (maximum) values for heap allocation, along with any other settings on the JVM that you might want to change.
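
If you want to check what a node is currently configured with, the relevant lines are easy enough to pull out of the file (path assumes a standard package install):

# Heap settings are just plain JVM flags, e.g. -Xms1g / -Xmx1g
grep -E '^-Xm[sx]' /etc/elasticsearch/jvm.options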

The only complication for me was that we were deploying Elasticsearch to a variety of different EC2 instance types, with highly variable total amounts of memory, so the memory settings couldn’t be set in source, but had to be determined dynamically. Easy enough to accomplish with a small Bash script that runs during deployment:

# Echo commands after expansion
set -x

# Mutate the JVM options file, allocating half of the machine's total memory to the heap
es_memory=$(free -m | awk '/^Mem:/ { print int($2/2) }') || exit 1
sed -i "s/@@ES_MEMORY@@/$es_memory/" /tmp/elk-elasticsearch/jvm.options || exit 1

With the new jvm.options file in place, a new version of the configuration package was deployed to CI, and Elasticsearch broke.

You’re Lying To Me

The deployment logic is simple. Get Octopus to deploy some files to a temporary location, mutate those files if necessary (this is where the Bash script above comes in), copy the files to the appropriate locations and then restart Elasticsearch.
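
In shell terms, the interesting part boils down to something like the following (the /etc/elasticsearch paths and the systemd service name are assumptions based on a standard package install):

# Copy the mutated configuration files into place and bounce the service
cp /tmp/elk-elasticsearch/elasticsearch.yml /etc/elasticsearch/elasticsearch.yml || exit 1
cp /tmp/elk-elasticsearch/jvm.options /etc/elasticsearch/jvm.options || exit 1
sudo systemctl restart elasticsearch || exit 1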

Simply restarting the Elasticsearch node is not necessarily the best way to go about things, given that Elasticsearch will try to shuffle shards/replicas around if a node disappears from the cluster. I’m going to ignore this particular fact for now, and hope that it doesn’t come back to bite us in the ass at a later date.
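
For future reference, the usual mitigation is to turn shard allocation off before the restart and flick it back on once the node has rejoined, something along these lines (localhost:9200 assumed):

# Stop the cluster from frantically rebalancing while the node is down
curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'
# (set it back to "all" after the restart to resume normal allocation)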

When the Elasticsearch service was restarted after the new deployment, it failed with the following error:

Invalid initial heap size: -Xms468M
Could not create the Java virtual machine.

A cursory internet search turns up a decent number of these errors, but they mostly seem to be related to trying to allocate more memory than is available, i.e. a genuinely invalid configuration. The problem was, Elasticsearch was running on a t2.micro EC2 instance, which has 1GB of memory, so a 468MB heap is perfectly valid. The instance in question only runs Elasticsearch too, so it’s not like anything else had already stolen the memory.

My first instinct when I get error messages that seem to make no sense, especially with Java components, is to look at the encoding. In this particular case though, the jvm.options file was UTF-8, so that was a dead end.
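
For the record, checking that from the shell is a one-liner:

# Report the MIME type and character set of the options file
file -i /tmp/elk-elasticsearch/jvm.options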

The next step was to grab the jvm.options file that ships with Elasticsearch and see if I had made some sort of critical error when I was editing it. Eyeballing the two files showed no major differences though, aside from the change to the memory allocation settings.

Next idea?

Line breaks.

This was where I made a critical error. I looked at the file in Linux using vi and its :set list command. Both my file and the default jvm.options file that ships with Elasticsearch looked like they treated line endings the same.

Long story short, I tried a few more things, tore my (non-existent) hair out a bit and then finally copied the two files to my computer and compared them in Notepad++.

My file had CRLF line endings; the default had LF. The extra carriage return at the end of each line was being picked up when the options file was parsed, which meant the JVM was being handed a setting of -Xms468M[CR], which is, strictly speaking, invalid. It also explains why vi lied to me: I assume it was detecting the DOS-style line endings and quietly hiding the carriage returns, so :set list showed nothing out of the ordinary.
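
In hindsight, the Linux native tools would have given it away immediately; I just didn’t know which ones to reach for:

# 'file' explicitly calls out CRLF line terminators...
file /tmp/elk-elasticsearch/jvm.options
# ...and 'cat -A' renders each carriage return as a visible ^M at the end of its line
cat -A /tmp/elk-elasticsearch/jvm.options | head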

A quick edit to the Bash script used during deployment to clean this up and to stop it from happening again, and everything was fine.

# Echo commands after expansion
set -x

# Mutate the JVM options file, allocating half of the machine's total memory to the heap
es_memory=$(free -m | awk '/^Mem:/ { print int($2/2) }') || exit 1
sed -i "s/@@ES_MEMORY@@/$es_memory/" /tmp/elk-elasticsearch/jvm.options || exit 1
sed -i 's/\r//' /tmp/elk-elasticsearch/jvm.options || exit 1

Conclusion

This was one of those weird cases where I checked all of the usual suspects and yet still failed miserably to resolve the problem in a reasonable timeframe. A complete lack of experience using Linux native tools to debug and diagnose common issues was the main contributing factor, but I really wish software would give better errors in these cases.

I mean, anything would be better than “completely valid looking setting is invalid, please try again”.