
We’ve come a long way in our log aggregation journey. Don’t get me wrong, we still have a long way to go, but bit by bit, we’re getting better at it.

A good example of getting better is the way in which we process our Elastic Load Balancer (ELB) logs. Over a year ago I put together a system for processing these logs into our ELK stack with way too many moving parts. It used Logstash (for processing) and Powershell (for downloading files from S3), hosted on an EC2 instance, to aggregate ELB logs from S3 into our ELK stack. Somewhat complicated in practice, but it worked, even if I was never particularly happy with it.

As is the way with these things though, because it did work, we’ve had no reason to revisit it, and we’ve successfully applied the same approach to at least 3 other environments we’ve set up since.

It wasn’t without its share of problems though:

  • The EC2 instances hosting the solution had a tendency to cap themselves at 100% CPU for long periods. They were initially t2.mediums, but they kept expending all of their CPU credits, so we had to upgrade them to m3.mediums, which was a 50% increase in cost ($US 115/month). Never did figure out exactly what needed all that CPU, but the leading theory was Logstash.
  • For a while, the logs simply stopped processing after a period of time (days/weeks). This turned out to be an issue with accumulating memory dumps from Java as a result of Logstash crashing and NSSM automatically restarting it.
  • These were the machines most vulnerable to the memory leak that causes Logstash’s TCP output to accumulate non-paged memory on Windows AWS instances, most likely due to an underlying driver problem.

Good old Logstash.

To turn the discussion back to getting better, we had the opportunity to revisit this process when building some new environments, using all of the knowledge and experience that we’d gained in the intervening period. I think we came up with a much more efficient and understandable solution, but it wasn’t without its share of difficulties, which makes for a good post.

Anomalous Materials

One of the primary weaknesses in the previous approach for processing ELB logs was that it required an entire EC2 instance all to itself, for each environment that we spun up. We did this in order to keep each log processor isolated from the others and to allow us to spin up an entirely self-contained environment, without having to worry about some common machine that processed all of the logs in a bunch of different buckets.

Another weakness in the process that bothered me was that it had way too many moving parts. Sometimes you have to have a lot of moving parts working together in order to accomplish a goal, but you should always strive for simplicity, both from an operational point of view and from a maintenance point of view. Less is almost always better in software development.

AWS has come a long way since we jammed the initial solution together, so we decided to use this opportunity to simplify the process and experiment with some AWS tools that we don’t frequently use.

After some discussion, we formed an idea of what we would like the new log processor to look like. We wanted to use Lambda to process the ELB logs as they were created, pushing them to the same Logstash ingress endpoint that we’ve been using consistently for the last year or so. The benefits we were expecting were a reduction in complexity (no need to have 4 different things working together), a reduction in running cost (mostly due to the removal of the EC2 instance) and a reduction in latency (the Lambda function would trigger whenever a file was written to the S3 bucket by the ELB, which meant no more polling for changes).
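As a rough sketch of the trigger wiring (the bucket name, function ARN and notification file below are hypothetical placeholders, not our actual resources), hooking a Lambda function up to the ELB log bucket is just an S3 event notification that fires whenever a new log object is created:

# notification.json (hypothetical example):
# {
#   "LambdaFunctionConfigurations": [
#     {
#       "LambdaFunctionArn": "arn:aws:lambda:ap-southeast-2:123456789012:function:process-elb-logs",
#       "Events": [ "s3:ObjectCreated:*" ]
#     }
#   ]
# }
aws s3api put-bucket-notification-configuration --bucket "my-elb-logs-bucket" --notification-configuration file://notification.json

S3 also needs permission to invoke the function (aws lambda add-permission, or the equivalent CloudFormation resource), which is the sort of thing that belongs in the environment setup.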

For those of you unfamiliar with Lambda, it’s a service offered by AWS that lets you configure code to run whenever any of a variety of events occurs. I’ve used it before to create a quicker S3 bucket clone, so if you want some more information, feel free to read up on that adventure.

In order to accomplish our goal, we would need to deal with 3 things:

Nothing particularly insane there, but definitely a few things that we’d never done before.

To Be Continued

In order to avoid creating a single monstrous post with more words than a small novel, I’m going to break it here.

Next week I’ll continue, explaining the Javascript code that we put together to process the log files (it’s not particularly complicated) and how we configured the Lambda function by incorporating its setup into our environment setup.

Until then, may all your Lambda functions execute quickly and your S3 buckets not turn into ghosts.


Logstash is a harsh mistress. Master. Whatever gender you want to apply to it, it’s the sort of application that is so incredibly useful that you’re usually willing to overlook its various flaws and problems, as long as it just keeps shipping those sweet, sweet log events into your aggregator. It doesn’t help that the only other alternative on Windows that I know of is Nxlog.

Don’t get me wrong, Nxlog is a fine, stable, established product. It does what it needs to do and it does it reliably. It’s just that it’s a complete nightmare to configure. Logstash is just so easy to configure, with its nice, readable configuration files and its plugins. I can do just about anything in Logstash, but I struggled to do anything more than the basics with Nxlog.

I’ve blogged a couple of times about Logstash (the most recent being a post about a mysterious memory leak occurring on machines with Logstash installed and running), but the problem I’m about to describe is only really partially the fault of Logstash.

More specifically, it’s Java’s fault.

I’ve Got 99 Issues

Recently, we’ve noticed an annoying trend on the machines acting as log shippers (using Logstash, aggregating mostly ELB logs from S3 buckets). After a certain amount of time, they would just stop shipping logs. When we investigated, we discovered that the disks were full, which pretty quickly puts a stop to any sort of useful activities on Windows.

Our initial suspicions were that the log files themselves (of which there can be quite a lot) were simply not being cleaned up. It wouldn’t be the first time that we’d had a bug in our log file cleanup script and honestly, it probably won’t be the last. Running the following simple Powershell script was usually enough to get them up and running again, so we noted it as an issue and moved on to more important things.

Get-ChildItem -Path C:\logs -Recurse -Filter *.log | Where-Object { $_.LastWriteTime -lt (Get-Date).AddDays(-7) } | Remove-Item -Force

But it kept happening.

Eventually we realised that something was pretty seriously wrong (and I got tired of manually running Powershell to restore functionality), so we put some more effort into finding the root cause.

The system disk was definitely filling up, but with what? Because we use headless Windows instances (who needs a UI?), it was a little more difficult than normal to find exactly what was taking up all of the disk (a combination of using the Sysinternals DU tool and using notepad to browse folders with its handy little Windows explorer equivalent), but in the end we found a lot of very large (500MB+) memory dump files, in the root directory of our Logstash installation.
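If you find yourself in the same situation, a quick way to do a similar hunt with nothing but Powershell (a sketch, not the exact commands we ran) is to sort everything by size:

# List the 20 largest files under a directory; handy on a headless box with no UI tools.
Get-ChildItem -Path C:\ -Recurse -File -ErrorAction SilentlyContinue |
    Sort-Object -Property Length -Descending |
    Select-Object -First 20 -Property FullName, @{ Name = "SizeMB"; Expression = { [math]::Round($_.Length / 1MB, 2) } }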

It looked like Logstash was crashing (which it does) and on crash it was dumping all of its memory to a file. Now, we wrap our Logstash installations inside NSSM to turn them into Windows services (for exactly this reason), so NSSM would detect that the process had crashed and restart it. It would keep running for a while, and then crash again, repeating the cycle (a “while” in this context was usually a few days).

Over time, this meant that the memory dump files were just accumulating on the disk, eventually causing the machine to choke and die.

Fixed! Ship It?

Logstash is Java based, so I did a little digging and found that the most likely cause of .mdmp files was simply the JVM doing its thing when it crashed. In particular, there seemed to be one option which controlled whether or not a memory dump was made whenever the process crashed. Using the options supplied to the JVM, it seemed like it would be relatively easy to turn this option off (assuming that it was on by default). All you need to do is add -XX:-CreateMinidumpOnCrash to the process and everything should be fine.

The process in question was Logstash, and I had already added JVM options before (for proxy settings and memory limits), so I added the new option to the JAVA_OPTS environment variable, which was where I had been putting the other ones. Doing a little reading, I discovered that the documentation suggested using the LS_JAVA_OPTS environment variable instead (because it was additive with the default options), so I switched to that and ran a test Logstash instance locally to check that everything seemed okay.
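The local test amounted to something like the following (the paths are illustrative, but -XX:-CreateMinidumpOnCrash is a standard HotSpot option on Windows):

# Set the additive environment variable for the current session and run Logstash,
# expecting the extra option to flow through to the JVM.
$env:LS_JAVA_OPTS = "-XX:-CreateMinidumpOnCrash"
& .\bin\logstash.bat agent -f .\test.conf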

Using Sysinternals Process Explorer (procexp) I viewed the process tree of Logstash, showing that Powershell started a batch file, which started JRuby which started Java and so on.

None of my options had been applied to the Java process though…

Dear God Why

On Windows, Logstash runs via a batch file. This batch file in turn calls into another batch file, called setup. Setup is responsible for setting up various things (obviously), one of which is the JVM. Over time, I imagine that many people have experienced various issues with Logstash in a Windows environment, so, Logstash being open source, they’ve added intelligent default settings for the JVM so that other people don’t have to feel the same pain.

As I mentioned above, you are supposed to be able to append additional JVM settings to Logstash using the LS_JAVA_OPTS environment variable and completely override the options that Logstash uses by setting the JAVA_OPTS environment variable. Originally I didn’t realise that the JAVA_OPTS variable was supposed to be a complete override (which I didn’t want) so it was good that I had chosen to switch to LS_JAVA_OPTS.

The problem was, that’s not how the Windows version of Logstash handles JAVA_OPTS and LS_JAVA_OPTS at all.

For starters, the Windows version does nothing with the LS_JAVA_OPTS environment variable. Nothing at all.

Secondly, regardless of what you set JAVA_OPTS to, it will add a bunch of its own options to JAVA_OPTS (including duplicates of yours, which I assume results in only the last one being used by the JVM) before it runs Logstash.

By switching to the LS_JAVA_OPTS environment variable, I had completely broken my custom JVM options.

I forked the elastic/logstash repository on Github, fixed the setup.bat file (if JAVA_OPTS is set, don’t add any defaults; if LS_JAVA_OPTS is set, append those to the defaults) and created a pull request. Unfortunately, though Onthehouse already had a contributor agreement, it didn’t seem to recognise me, so that particular pull request is sitting in some weird purgatory and I need to do something about it.

For my purposes though, because we wrap Logstash to make a package deployable through Octopus, I just edited the setup.bat that we use and pushed.

Once it was deployed, I checked the settings applied to the JVM and it was correctly showing my setting to not dump memory on a crash. Yay!

Of course, only time will tell whether or not that setting actually stops the memory dumps on crash.

Conclusion

The problem that I outlined above is one of those cases where you really do need to understand how something works before you use it in anger.

I had assumed that the behaviour of Logstash between Windows and Linux would be consistent, when in reality that was definitely not the case. It wasn’t until I really started digging into how Logstash starts, specifically on my OS (rather than just using it) that I realised the truth.

It wasn’t even difficult to fix; the functionality was simply not there, probably as a result of someone changing the Linux implementation at some distant point in the past and equivalent changes not being made to the Windows scripts. I’m surprised no-one noticed until now (or they might have noticed and just not fixed it in the elastic/logstash repository, I suppose), but maybe that just means I’m the only person insane enough to run Logstash on Windows.

I hope not.

There is also the matter of Logstash constantly crashing, which is somewhat concerning, but that’s a task for another day.


Something is eating all of the memory on some of our production API instances. I say something because it’s non-trivial to diagnose exactly what is eating the memory.

How is that even possible? Well, it’s eating the memory in such a way that the monitoring tools available (i.e. task manager, performance counters, etc.) are completely unable to say which process is the culprit. The processes don’t have the memory, at least not in any of their working sets, and the only way to notice that it is missing is that the amount of total available memory is dropping over time. It seems that the memory is accumulating in the non-paged memory pool.

Ugh, non-paged pool memory leak. Not fun. Probably a device driver or something else equally low level.
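If you want to watch it happen, the relevant performance counter is easy enough to poll from Powershell (a diagnostic sketch, not our actual monitoring):

# Sample the non-paged pool every minute for ten minutes; on the affected machines
# this number just keeps climbing until the box falls over.
Get-Counter -Counter "\Memory\Pool Nonpaged Bytes" -SampleInterval 60 -MaxSamples 10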

As is usually the case with this sort of thing, I blame Logstash, hence the tag on this post, but I can’t really back that up.

Unfortunately, we have not yet identified the root cause. Instead, this post will talk about some things we did to run away screaming from the problem until we have time to investigate in depth. Sometimes you just have to make it work so that everyone can stop panicking long enough to form coherent thoughts.

Immediate Mitigation

First step, scheduled reboot for the affected boxes before they die. That maintains the level of service while we attempt to get to the bottom of the issue.

Easiest way to accomplish this? Calendar reminder for a few specific people in the organisation. Odds are at least one of those people will action the item and that everything will continue to work as expected from an external point of view.

The risks here are many and varied. What if everyone on the list expects that someone else on the list will do the thing? What if everyone is on holidays (Christmas is a particularly bad time for this) or sick? If the scheduled task lasts long enough, you also have to consider what will happen as people leave the organisation.

It’s a pretty bad sign if your immediate, manual mitigation step lasts long enough for the people involved to leave the organisation. Either you are bad at prioritising or you have some serious churn problems.

Engineers and Manual Tasks

The easiest way to get something automated is to assign a regular, manual task to an engineer, or group of engineers. There is nothing an engineer hates more than repeatedly doing the same thing on some schedule. The response? Automation.

In our case, we originally thought that the best way to automate this particular restart was using a tag based system, like we do for managing start and stop times for EC2 instances. The problem was, we didn’t want to restart all of the API instances inside the auto scaling group, just the oldest one (because it was the most likely to be closest to experiencing the problem). We didn’t want to get into a situation where we brought down the service because everything restarted at once.

Our next thought was to target the auto scaling group instead of the API instances themselves. On some regular interval, we could scale up to N + 1, then, after everything was good, scale down to N again. This would automatically terminate the oldest instance (because our termination policy was set to oldest first). Seems simple enough.

Luckily, before we got too far down the “let’s write our own script” path, one of our operations team remembered that this functionality (scheduled scaling policies) was actually already a feature in AWS. Alas, it’s not exposed via the AWS management console (i.e. the website), but you can definitely create and manage the policies from the command line using the AWS CLI.

I’m not sure if you can use the equivalent AWS client libraries (like the Powershell cmdlets), but it’s definitely available in the CLI.

We created two policies: scale up to N + 1 at midnight, and then scale down to N at 0100. This acts as a recycle for the instances we are having problems with, and requires no custom code or scripting. It’s just AWS functionality.

To create a schedule, assuming you have already configured the CLI, you can use the following snippet:

aws autoscaling put-scheduled-update-group-action --scheduled-action-name ScaleUp --auto-scaling-group-name <ASG Name> --recurrence "0 0 * * 1" --desired-capacity 3

This will create a scheduled action to set the desired capacity to 3 for the specified Auto Scaling Group at midnight UTC on every Monday of the year (standard Cron format, the only thing to remember is that it will execute based on UTC time).
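The matching scale down action is almost identical (assuming a normal desired capacity of 2 and the 0100 UTC recurrence mentioned above; adjust the numbers to suit your own situation):

aws autoscaling put-scheduled-update-group-action --scheduled-action-name ScaleDown --auto-scaling-group-name <ASG Name> --recurrence "0 1 * * 1" --desired-capacity 2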

Summary

I’m pretty disappointed that we still haven’t actually had a chance to really dig into what the root cause of the issue is. In all seriousness, I do actually blame Logstash, specifically its TCP output that we use to write to another Logstash endpoint as part of our log aggregation. We’ve had some issues with that plugin before, and it wouldn’t surprise me if there was some issue where it was not properly disposing of sockets or some other sort of low level object as part of its normal operation.

I worry that the automated solution that we put into place (to work around the issue by recycling) will probably remain in place for far longer than anyone wants it to. From a business point of view, what is the motivation to identify and solve the root cause when everything is working well enough, at least from an outside perspective?

Still, it’s better than having to manually recycle the instances ourselves.


The L in ELK stands for Logstash. This is not news.

When we put together the system we use for log aggregation, we needed to find a way to deal with Logstash such that it would fit into our relatively new (and still maturing) model for automated build and deployment. We put together a system that stores a deployable version of Logstash in Bitbucket, a build/package script written in Powershell that puts everything together into a Nuget package, TeamCity to listen for changes and then Octopus Deploy for managing the deployment.

The difficult part in this was that each place that required Logstash required a different configuration (because it was processing different files or transforming the data in different ways). There was a lot of commonality, but it was mostly in the actual installation and management of Logstash, rather than the configuration Logstash happened to be running at any particular point in time.

We have one git repository that contains everything necessary to run a copy of Logstash on a machine. This includes Logstash itself, the Java Runtime Environment (JRE) and a set of Powershell scripts that allow us to install Logstash as a Windows Service and choose which configuration it should be running. This meant that all of the configuration files for each of our Logstash deployments all lived in the same repository.

This worked okay for a little while, until we started adding more configuration files.

One Change Makes Many Work?

Suddenly, we had a bunch of Build Configurations in TeamCity triggering off changes to the one repository. They only triggered off changes to their own Logstash configurations at least, but they all triggered whenever we made changes to Logstash itself or to the scripts surrounding it. Technically, they were all building (and deploying) the same package (X.Logging.Logstash.nupkg), but each would generate a different version, and deploy to a different Octopus Deploy project. Luckily our versions are based off the current time, so it wasn’t like the version kept going up and down (because of the difference in the number of builds), but there was the occasional conflict when two TeamCity tasks just happened to build on two different Build Agents within a few seconds of each other (which would generate identical packages).

The bigger issue was that each package was over 100MB! Logstash is a good 80MB by itself and the JRE is another 40MB, so once you add in a MB or two for everything else, your package is huge.

Yes, technically we could deal with this issue by making sure Java and Logstash are installed on the target machines ahead of time, but I don’t like this from an isolation/self-reliance point of view. I want to be able to push a package with the minimal dependencies already in existence on the target machine, ideally just an Octopus tentacle and Powershell (for Windows boxes anyway). Anything else that is required should be contained within the package itself (or, in extreme cases, bootstrapped from somewhere after deployment (but that just moves the problem around slightly)).

Suddenly a checkin to the repository would gum up our CI build agents with a set of relatively unimportant tasks, stopping other builds from progressing and interrupting peoples work.

The easiest (but somewhat fiddly) solution was to split the concept of an installed version of Logstash from the configuration it happened to be running. With this approach we could deploy Logstash to a target machine once and then not have to pay the price of shifting that much data over the network every single time we wanted to alter the config. When we did want to upgrade Logstash, we could simply build a new version and have it deployed in the same way.

The plan was relatively simple. Create one repository for a deployable version of Logstash by itself (making sure to generalise it enough such that you could easily point it at any configuration you wanted) and then split out each configuration into a repository of its own. Whenever Logstash changes, it would be built and published to Octopus, but nothing would be deployed. Each configuration repository would be able to choose to upgrade to the new version (by changing the dependency in source) and then TeamCity would pick up the changes and run the normal build/package/deploy cycle for that configuration.

Executions Are Tiresome

As is almost always the case, coming up with the idea and its execution plan was a lot more fun than actually doing it.

The Logstash repository that we had already was pretty tightly coupled to the way it handled configuration. It actually used the current Octopus Project name during the installation to determine the configuration that it should be running, and each configuration really only consisted of a single Logstash conf file.

The first task was to generalise the Logstash installation, so that we could deploy it separately and then have the configuration projects use it via a known interface. Nothing particularly interesting here from a design standpoint, just a few Powershell functions: Execute-Logstash, Install-LogstashService (and its counterpart, Remove-LogstashService) and some helpers for generating configuration files based on templates (because sometimes during deployment you need to be able to substitute deployment specific values, like the AWS Instance Id, into your configuration).
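The template helper is the only piece worth sketching here. It looks something like the following (the function name and the token format are hypothetical; our real scripts differ in the details, but the shape is the same):

# Hypothetical sketch of a configuration template helper.
function New-LogstashConfigurationFromTemplate
{
    param
    (
        [Parameter(Mandatory=$true)][string]$templatePath,
        [Parameter(Mandatory=$true)][string]$outputPath,
        [Parameter(Mandatory=$true)][hashtable]$substitutions
    )

    $content = Get-Content -Path $templatePath -Raw
    foreach ($key in $substitutions.Keys)
    {
        # Tokens in the template look like @@InstanceId@@ (an assumption for this example).
        $content = $content -replace "@@$key@@", $substitutions[$key]
    }
    Set-Content -Path $outputPath -Value $content
}

# Usage during deployment, substituting the AWS Instance Id into the config.
# New-LogstashConfigurationFromTemplate -templatePath ".\logstash.conf.template" -outputPath ".\logstash.conf" -substitutions @{ "InstanceId" = $instanceId }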

The next task was taking one of the current configuration files and converting it into the new pattern, a repository of its own. This repository would need to contain everything necessary for the configuration of a Logstash instance, plus some tests to verify that the config file works as expected when given a set of known inputs.

It’s not overly easy to test a Logstash configuration, especially when it has never been tested before. Like all code, you need to be able to substitute certain values (like the location of the log files to read) and then figure out a way to measure the outcome, without changing the actual configuration too much. The approach I settled on was to parameterise the log locations like I mentioned above and to add an additional output during testing that wrote everything to a file. That way I could read the file in and check to make sure that it was outputting as many lines as I expected.
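The verification itself ends up being pretty simple: run Logstash over a known input with the extra file output enabled, then read that file back and check the number of events against expectations (again, a sketch; the path and count are illustrative):

# Each event written by the file output is a single line, so counting lines is enough
# to prove that the configuration processed the sample input.
$expectedEvents = 100
$actualEvents = (Get-Content -Path ".\test-output\events.log" | Measure-Object -Line).Lines
if ($actualEvents -ne $expectedEvents)
{
    throw "Expected $expectedEvents events in the test output, but found $actualEvents."
}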

The last task was to rebuild the Octopus Deploy project for the configuration to deploy both the Logstash deployable component and the configuration and verify that it installed and ran correctly on deployment. The most difficult part here was that different versions of each component were required, so we had to extend our Octopus scripts to handle this properly (i.e. step 1 which deploys Logstash needs to know that it should deploy version 1.5.2, but step 2 needs to deploy version 1.0.15235.12789 of the actual configuration).

Conclusion

I really should have structured our Logstash deployments in this way from the start. It’s almost always better to separate configuration from application code, especially when you don’t control the application. Often you will find that configuration changes a lot more than the code does, and when the code itself is quite large (as is the case with Logstash and its dependencies) it can get quite painful shifting all of those bytes around for no real reason.

But, alas, you can never see into the future with any sort of clarity, and you need to be willing to change things that you’ve done in the past when problems are identified or a better solution comes along.

Can’t get sentimental about these things.