
Something is eating all of the memory on some of our production API instances. I say something because it’s non-trivial to diagnose exactly what is eating the memory.

How is that even possible? Well, it’s eating the memory in such a way that the monitoring tools available (i.e. task manager, performance counters, etc) are completely unable to say which process is the culprit. The processes don’t have the memory, at least not in any of their working sets, and the only way to notice that it is missing is that the amount of total available memory is dropping over time. It seems that the memory is accumulating in the non-paged memory pool.
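You can at least watch the pool grow for yourself via the standard Memory performance counters. A minimal sketch in PowerShell (the sample interval is arbitrary):

Get-Counter -Counter '\Memory\Pool Nonpaged Bytes' -SampleInterval 60 -Continuous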

Ugh, non-paged pool memory leak. Not fun. Probably a device driver or something else equally low level.

As is usually the case with this sort of thing, I blame Logstash, hence the tag on this post, but I can’t really back that up.

Unfortunately, we have not yet identified the root cause. Instead, this post will talk about some things we did to run away screaming from the problem until we have time to investigate in depth. Sometimes you just have to make it work so that everyone can stop panicking long enough to form coherent thoughts.

Immediate Mitigation

First step: scheduled reboots for the affected boxes before they die. That maintains the level of service while we attempt to get to the bottom of the issue.

Easiest way to accomplish this? Calendar reminder for a few specific people in the organisation. Odds are at least one of those people will action the item, and everything will continue to work as expected from an external point of view.

The risks here are many and varied. What if everyone on the list expects that someone else on the list will do the thing? What if everyone is on holidays (Christmas is a particularly bad time for this), or sick? If the scheduled task lasts long enough, you have to consider what will happen as people leave the organisation.

It’s a pretty bad sign if your immediate, manual mitigation step lasts long enough for the people involved to leave the organisation. Either you are bad at prioritising or you have some serious churn problems.

Engineers and Manual Tasks

The easiest way to get something automated is to assign a regular, manual task to an engineer, or group of engineers. There is nothing an engineer hates more than repeatedly doing the same thing on some schedule. The response? Automation.

In our case, we originally thought that the best way to automate this particular restart was using a tag based system like we do for managing start and stop times for EC2 instances. The problem was, we didn’t want to restart all of the API instances inside the auto scaling group, just the oldest one (because it was the most likely to be closest to experiencing the problem). We didn’t want to get into a situation where we brought down the service because everything restarted at once.

Our next thought was to target the auto scaling group instead of the API instances themselves. On some regular interval, we could scale up to N + 1, then, after everything was good, scale down to N again. This would automatically terminate the oldest instance (because our termination policy was set to oldest first). Seems simple enough.
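As an aside, the oldest-first behaviour is just the OldestInstance termination policy on the group. If it isn’t already set, something like the following should do it via the AWS CLI (the group name is a placeholder):

aws autoscaling update-auto-scaling-group --auto-scaling-group-name <ASG Name> --termination-policies "OldestInstance"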

Luckily, before we went too far down the “let’s write our own script” path, one of our operations team remembered that this functionality (scheduled scaling policies) was actually already a feature in AWS. Alas, it’s not exposed via the AWS management console (i.e. the website), but you can definitely create and manage the policies from the command line using the AWS CLI.

I’m not sure if you can use the equivalent AWS client libraries (like the PowerShell cmdlets), but it’s definitely available in the CLI.

We created two policies. Scale up to N + 1 at midnight, and then scale down to N at 0100. This acts as a recycle for the instances we are having problems with, and leverages no custom code or scripting. It’s just AWS functionality.

To create a schedule, assuming you have already configured the CLI, you can use the following snippet:

aws autoscaling put-scheduled-update-group-action --scheduled-action-name ScaleUp --auto-scaling-group-name <ASG Name> --recurrence "0 0 * * 1" --desired-capacity 3

This will create a scheduled action to set the desired capacity to 3 for the specified Auto Scaling Group at midnight UTC on every Monday of the year (standard Cron format; the only thing to remember is that it will execute based on UTC time).
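The matching scale-down action is the same command with a different name, time and capacity. A sketch, assuming the normal capacity of the group is 2 (making the snippet above the N + 1):

aws autoscaling put-scheduled-update-group-action --scheduled-action-name ScaleDown --auto-scaling-group-name <ASG Name> --recurrence "0 1 * * 1" --desired-capacity 2

You can check what ended up being configured with:

aws autoscaling describe-scheduled-actions --auto-scaling-group-name <ASG Name>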

Summary

I’m pretty disappointed that we still haven’t actually had a chance to really dig into what the root cause of the issue is. In all seriousness, I do actually blame Logstash, specifically its TCP output that we use to write to another Logstash endpoint as part of our log aggregation. We’ve had some issues with that plugin before, and it wouldn’t surprise me if there was some issue where it was not properly disposing of sockets or some other sort of low level object as part of its normal operation.

I worry that the automated solution that we put into place (to work around the issue by recycling) will probably remain in place for far longer than anyone wants it to. From a business point of view, what is the motivation to identify and solve the root cause when everything is working well enough, at least from an outside perspective?

Still, it’s better than having to manually recycle the instances ourselves.


Generally, I’ve been pretty happy with the way that we do log aggregation, using Elasticsearch, Logstash and Kibana (ELK). That’s not to say it couldn’t use improvements in almost every area, but for now, it does what it needs to do. It’s definitely better than what we had before, and people are actually using it to answer questions, which I would consider the mark of a successful implementation.

Of all the components in the triumvirate though, Logstash has easily been the worst. Kibana has been basically problem-free (somewhat to be expected for a visualisation framework that delegates the majority of its work to the data store) and Elasticsearch has been complicated and somewhat temperamental (mostly because we haven’t had the time or the experience to tune it properly), but Logstash has been a pain.

Version 1.4 flat out did not work properly on Windows with file based inputs, as it didn’t properly remember its position in the files it was reading. Version 1.5.0 fixed that problem, but we had to suffer through a number of beta issues because we were forced to use the beta version. Even after the 1.5 release we’ve had other issues relating to the TCP output consuming massive amounts of CPU (kind of annoying on CPU Credit limited instances), using the S3 input from behind a proxy, and recovering reliably after its TCP output endpoint was down temporarily.

It’s become a bit of a meme between myself and one of our operations people that whenever something weird happens with logs (even if it’s unrelated to the actual process of log aggregation), it’s probably Logstash’s fault.

When one of our AWS instances responsible for filling API requests became unresponsive a few weeks ago, we joked that it was probably Logstash’s fault.

It turned out that it actually was.

What’s That Thing You Use to Dig Food?

Our APIs all have redundancy built into them via the magic of auto scaling groups and multiple instances. It’s the kind of thing you have to have when you’re running anything in AWS, as there is no guarantee that your instances will be long lived. It’s likely that they will be fine and everything will go swimmingly, but there is no guarantee. Honestly, that’s no different from normal hosting, the information is just available upfront, instead of after a few months when something explodes. Plan for failure, because it will happen.

Anyway, a few weeks ago one of our API instances stopped doing the thing it was supposed to be doing, i.e. filling API requests, so it was quietly removed from the load balancer. The alarms we have set up for the number of healthy instances kicked in, and we put the failing instance into standby mode (which spun up a new one to take its place) and started investigating.

Looking at the logs in Kibana, the instance started returning 500s sporadically early in the morning and then got progressively worse until it was removed from the load balancer a little after 0800. The root cause of the 500s? OutOfMemoryExceptions; the API process literally could not allocate a piece of memory large enough to do the thing it wanted to do.

My first response was to make that deep guttural noise that only comes from software developers when they know what’s about to happen. A mixture between a groan and a resigned sigh, tinged with frustration.

I’ve dealt with memory leaks before.

They are terrible, and I hate them.

I assumed that we had a memory leak in our deployed .NET application. Although .NET is a managed language, it is surprisingly easy to leak memory, with various ways to root vast amounts of information, stopping the garbage collector from recovering it.

The best tool that I’ve found for visualizing memory leaks in .NET is the SciTech .NET Memory Profiler. Honestly though, other than Visual Studio itself, it’s basically the only one I’ve really used, so I can’t strictly say that it’s better than anything else. I can say that it is good at what it does, because it’s helped me locate and fix quite a few memory leaks.

Given that the API had been running uninterrupted for a few weeks, the memory leak would be slow. It might even be related to specific requests that only happen rarely, or even worse, it might be time based, rather than event based. I knew I would need to either reproduce the problem locally (which might be quite hard) or find some way to get the memory profiler tool into the environment where the problem occurred and then wait until it happened again.

Like I said, I hate memory leaks.

Surprise! It Wasn’t Us

Remoting on to the AWS instance with the memory issue showed an entirely unexpected situation.

Our .NET process wasn’t using much memory at all.

The Java process, however, was using 1.5GB of physical memory and a huge chunk of virtual memory. The only piece of Java code running? Logstash.
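Working set and virtual size are easy enough to eyeball in PowerShell if Task Manager isn’t handy. A quick sketch, assuming Logstash is the only Java process on the box:

Get-Process java |
    Select-Object Name, Id,
        @{ Name = 'WorkingSetMB'; Expression = { [math]::Round($_.WorkingSet64 / 1MB) } },
        @{ Name = 'VirtualMB'; Expression = { [math]::Round($_.VirtualMemorySize64 / 1MB) } }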

Now I had a different problem. At least with a memory leak within our software I could trace it, find it and fix it. I know .NET, I’ve done memory leak analysis before, and while it’s terrible, it’s a known problem. It might take me a while to get to the bottom of it, but I would get there.

Logstash runs Ruby via JRuby, which runs inside the Java Virtual Machine. It’s open source, but I wouldn’t even know where to begin with that mess. How could I find out why it was using so much memory for the apparently simple task of processing some text files? None of our Logstash configs are fancy, they just process text files and shunt the results to another instance of Logstash which puts them into Elasticsearch.

Clearly, The Problem Is Solved

At this particular point in time, I do not have the luxury of getting to the root cause of problems with supplementary components. This makes me sad inside, but that’s how reality rolls sometimes.

The easiest solution right now? Restart the Logstash service every day.

Easy enough to implement with a Scheduled Task that executes PowerShell.

I did have one incredibly weird issue with the task. The task name that I generated for the service restart had square brackets in it. As a result, so did the folder where the PowerShell task puts its transcribed output. The Start-Transcript command does not like square brackets in its output path for some reason, and it fails with “Start-Transcript: The host is not currently transcribing”, which is incredibly helpful. This particular quirk took far too long to get to the bottom of.
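For what it’s worth, the whole thing is only a few lines of PowerShell. A rough sketch (the service name, paths and schedule are assumptions, depending on how Logstash was installed as a service and on having the ScheduledTasks module from Server 2012 onwards), with a task name and transcript path that deliberately avoid square brackets:

# restart-logstash.ps1 - assumes the Windows service is registered as "logstash"
Start-Transcript -Path "C:\Logs\Tasks\restart-logstash.log" -Append
Restart-Service -Name "logstash" -Force
Stop-Transcript

# Register a daily task that runs the script above.
$action = New-ScheduledTaskAction -Execute "powershell.exe" -Argument "-ExecutionPolicy Bypass -File C:\Scripts\restart-logstash.ps1"
$trigger = New-ScheduledTaskTrigger -Daily -At 3am
Register-ScheduledTask -TaskName "Restart Logstash Daily" -Action $action -Trigger $trigger -User "SYSTEM"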

Long term we need to look into the issue in depth. Maybe it’s a problem in the particular version of Logstash we are using (1.5.2), maybe it’s our configuration, or maybe it’s the environment. I don’t know, and that particular investigation could be a gigantic time sink, so we’ll have to dedicate some time to it later on.

Another solution would be to not use Logstash directly on the machines generating the logs, and to instead use the Logstash Forwarder, which is a lightweight replacement for Logstash that only forwards events to a remote location. The remote location is typically an actual Logstash instance which has all the logic to parse the event into a meaningful format. To me this just moves the issue around, but it still might be an option.

Conclusion

This is actually the first case that I can remember where a memory leak wasn’t the fault of the software I was actively working on. I call it a memory leak because I can’t conceive how an apparently simple process like Logstash could use so much memory over time, but it might just be what it does. I mean, I wouldn’t consider SQL Server using massive amounts of memory a leak per se (because I understand what it does is hard, and it consumes more memory based on the size of the data), it’s just really weird that a process which should be entirely transient (read, transform, write, discard) could keep using more and more memory over time without there being some sort of bug.

To be honest, I’m extremely unhappy with installing a scheduled task to restart the Logstash service. It’s one of those bandaid fixes that you just know will end up biting you hard at some point in the future. It will probably cause an entirely worse issue, like massive, unexpected data loss or something.

For now, at least I know that the Logstash process will not stop our APIs from working, which I guess is a victory.

A pyrrhic victory.