
Logstash is a harsh mistress. Master. Whatever gender you want to apply to it, it’s the sort of application that is so incredibly useful that you’re usually willing to overlook its various flaws and problems, as long as it just keeps shipping those sweet, sweet log events into your aggregator. It doesn’t help that the only other alternative on Windows that I know of is Nxlog.

Don’t get me wrong, Nxlog is a fine, stable, established product. It does what it needs to do and it does it reliably. It’s just that it’s a complete nightmare to configure. Logstash is just so easy to configure, with its nice, readable configuration files and its plugins. I can do just about anything in Logstash, but I struggled to do anything more than the basics with Nxlog.

I’ve blogged a couple of times about Logstash (the most recent being a post about a mysterious memory leak occurring on machines with Logstash installed and running), but the problem I’m about to describe is only really partially the fault of Logstash.

More specifically, it’s Java’s fault.

I’ve Got 99 Issues

Recently, we’ve noticed an annoying trend on the machines acting as log shippers (using Logstash, aggregating mostly ELB logs from S3 buckets). After a certain amount of time, they would just stop shipping logs. When we investigated, we discovered that the disks were full, which pretty quickly puts a stop to any sort of useful activities on Windows.

Our initial suspicions were that the log files themselves (of which there can be quite a lot) were simply not being cleaned up. It wouldn’t be the first time that we’d had a bug in our log file cleanup script and honestly, it probably won’t be the last. Running the following simple Powershell script was usually enough to get them up and running again, so we noted it as an issue and moved on to more important things.

Get-ChildItem -Path C:\logs -Recurse -Filter *.log | Where { $_.LastWriteTime -lt (Get-Date).AddDays(-7) } | Remove-Item -Force

But it kept happening.

Eventually we realised that something was pretty seriously wrong (and I got tired of manually running Powershell to restore functionality), so we put some more effort into finding the root cause.

The system disk was definitely filling up, but with what? Because we use headless Windows instances (who needs a UI?), it was a little more difficult than normal to find exactly what was taking up all of the disk space (a combination of the Sysinternals DU tool and Notepad’s file open dialog, which doubles as a handy little Windows Explorer equivalent), but in the end we found a lot of very large (500MB+) memory dump files in the root directory of our Logstash installation.
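For the record, the hunt itself boiled down to something like the following. This is just a sketch; the paths are illustrative and it assumes the Sysinternals du.exe is somewhere on the PATH.

# Summarise directory sizes a couple of levels deep using the Sysinternals DU tool
du.exe -l 2 -q C:\

# Or, in plain Powershell, list the 20 largest files under a suspect folder
Get-ChildItem -Path C:\ -Recurse -File -ErrorAction SilentlyContinue | Sort-Object Length -Descending | Select-Object -First 20 FullName, @{ Name = "SizeMB"; Expression = { [math]::Round($_.Length / 1MB, 1) } }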

It looked like Logstash was crashing (which it does) and on crash it was dumping all of its memory to a file. Now, we wrap our Logstash installations inside NSSM to turn them into Windows services (for exactly this reason), so NSSM would detect that the process had crashed and restart it. It would keep running for a while, and then crash again, repeating the cycle (a “while” in this context was usually a few days).
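If you haven’t come across NSSM, wrapping something like Logstash into a service only takes a few commands. The service name and paths below are illustrative, not our actual configuration.

# Install Logstash as a Windows service via NSSM and restart it whenever it exits
nssm install logstash "C:\logstash\bin\logstash.bat"
nssm set logstash AppDirectory "C:\logstash"
nssm set logstash AppExit Default Restart
nssm start logstash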

Over time, this crash/restart cycle meant that the memory dump files were just accumulating on the disk, eventually causing the machine to choke and die.

Fixed! Ship It?

Logstash is Java based, so I did a little digging and found that the most likely cause of the .mdmp files was simply the JVM doing its thing when it crashed. In particular, there seemed to be one option that controlled whether or not a memory dump was made whenever the process crashed. Using the options supplied to the JVM, it seemed like it would be relatively easy to turn this behaviour off (assuming that it was on by default). All you need to do is add -XX:-CreateMinidumpOnCrash to the process and everything should be fine.

The process in question was Logstash, and I had already added JVM options before (for proxy settings and memory limits), so I added the new option to the JAVA_OPTS environment variable, which was where I had been putting the other ones. Doing a little reading, I discovered that the documentation suggested using the LS_JAVA_OPTS environment variable instead (because it was additive with the default options), so I switched to that and ran a test Logstash instance locally to check that everything seemed okay.
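The test itself was nothing fancy; something along these lines, where the paths are purely illustrative and the exact command line arguments will vary between Logstash versions.

# Supposedly additive with the default JVM options (or so the documentation said)
$env:LS_JAVA_OPTS = "-XX:-CreateMinidumpOnCrash"
& "C:\logstash\bin\logstash.bat" -f "C:\logstash\conf\test.conf"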

Using Sysinternals Process Explorer (procexp) I viewed the process tree of Logstash, which showed that Powershell started a batch file, which started JRuby, which started Java, and so on.
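If you don’t have Process Explorer handy, plain Powershell can show the same thing, assuming the JVM shows up as java.exe.

# Show the full command line of any running Java processes
Get-CimInstance Win32_Process -Filter "Name = 'java.exe'" | Select-Object ProcessId, CommandLine | Format-List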

None of my options had been applied to the Java process though…

Dear God Why

On Windows, Logstash runs via a batch file. This batch file in turn calls into another batch file, called setup. Setup is responsible for setting up various things (obviously), one of which is the JVM. Over time, I imagine that many people have experienced various issues with Logstash in a Windows environment, and since it’s open source, they have added intelligent default settings to the JVM so that other people don’t have to feel the same pain.

As I mentioned above, you are supposed to be able to append additional JVM settings to Logstash using the LS_JAVA_OPTS environment variable and completely override the options that Logstash uses by setting the JAVA_OPTS environment variable. Originally I didn’t realise that the JAVA_OPTS variable was supposed to be a complete override (which I didn’t want) so it was good that I had chosen to switch to LS_JAVA_OPTS.

The problem was that this handling of JAVA_OPTS and LS_JAVA_OPTS is not how the Windows version of Logstash works at all.

For starters, the Windows version does nothing with the LS_JAVA_OPTS environment variable. Nothing at all.

Secondly, regardless of what you set JAVA_OPTS to, it will add a bunch of its own options (including duplicates of yours, which I assume means the JVM only honours the last occurrence of each) before it runs Logstash.

By switching to the LS_JAVA_OPTS environment variable, I had completely broken my custom JVM options.

I forked the elastic/logstash repository on GitHub, fixed the setup.bat file (if JAVA_OPTS is set, use it as-is and don’t add the defaults; if LS_JAVA_OPTS is set, append it to the defaults) and created a pull request. Unfortunately, although Onthehouse already had a contributor agreement, the system didn’t seem to recognise me, so that particular pull request is sitting in some weird purgatory and I need to do something about it.
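For reference, the behaviour I was after looks roughly like this, sketched in Powershell rather than batch for readability. The default option shown is a placeholder, not the real set that setup.bat builds up.

# JAVA_OPTS is a complete override; LS_JAVA_OPTS is appended to the defaults
$defaultOptions = "-Xmx1g"    # stand-in for the defaults that setup.bat supplies
if ($env:JAVA_OPTS)
{
    $javaOptions = $env:JAVA_OPTS
}
else
{
    $javaOptions = $defaultOptions
    if ($env:LS_JAVA_OPTS)
    {
        $javaOptions = "$javaOptions $($env:LS_JAVA_OPTS)"
    }
}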

For my purposes though, because we wrap Logstash to make a package deployable through Octopus, I just edited the setup.bat that we use and pushed.

Once it was deployed, I checked the settings applied to the JVM and it was correctly showing my setting to not dump memory on a crash. Yay!

Of course, only time will tell whether it actually stops dumping memory to disk when it crashes.

Conclusion

The problem that I outlined above is one of those cases where you really do need to understand how something works before you use it in anger.

I had assumed that the behaviour of Logstash between Windows and Linux would be consistent, when in reality that was definitely not the case. It wasn’t until I really started digging into how Logstash starts, specifically on my OS (rather than just using it) that I realised the truth.

It wasn’t even difficult to fix; the functionality was simply not there, probably as a result of someone changing the Linux implementation at some distant point in the past and equivalent changes not being made to the Windows scripts. I’m surprised no-one noticed until now (or they might have noticed and just not fixed it in the elastic/logstash repository, I suppose), but maybe that just means I’m the only person insane enough to run Logstash on Windows.

I hope not.

There is also the matter of Logstash constantly crashing, which is somewhat concerning, but that’s a task for another day.