Something is eating all of the memory on some of our production API instances. I say something because it’s non-trivial to diagnose exactly what is eating the memory.
How is that even possible? Well, it’s eating the memory in such a way that the monitoring tools available (i.e. task manager, performance counters, etc.) are completely unable to say which process is the culprit. The processes don’t have the memory, at least not in any of their working sets, and the only sign that it is missing is that the total available memory drops over time. It seems that the memory is accumulating in the non-paged memory pool.
Ugh, non-paged pool memory leak. Not fun. Probably a device driver or something else equally low level.
As is usually the case with this sort of thing, I blame Logstash, hence the tag on this post, but I can’t really back that up.
Unfortunately, we have not yet identified the root cause. Instead, this post will talk about some things we did to run away screaming from the problem until we have time to investigate in depth. Sometimes you just have to make it work so that everyone can stop panicking long enough to form coherent thoughts.
First step, scheduled reboot for the affected boxes before they die. That maintains the level of service while we attempt to get to the bottom of the issue.
Easiest way to accomplish this? Calendar reminder for a few specific people in the organisation. Odds are at least one of those people will action the item and that everything will continue to work as expected from an external point of view.
The risks here are many and varied. What if everyone on the list expects that someone else on the list will do the thing? What if everyone is on holidays (Christmas is a particularly bad time for this), or sick? If the scheduled task lasts long enough, you have to consider what will happen as people leave the organisation.
It’s a pretty bad sign if your immediate, manual mitigation step lasts long enough for the people involved to leave the organisation. Either you are bad at prioritising or you have some serious churn problems.
Engineers and Manual Tasks
The easiest way to get something automated is to assign a regular, manual task to an engineer, or group of engineers. There is nothing an engineer hates more than repeatedly doing the same thing on some schedule. The response? Automation.
In our case, we originally thought that the best way to automate this particular restart was to use a tag-based system like the one we use for managing start and stop times for EC2 instances. The problem was, we didn’t want to restart all of the API instances inside the auto scaling group, just the oldest one (because it was the most likely to be closest to experiencing the problem). We didn’t want to get into a situation where we brought down the service because everything restarted at once.
Our next thought was to target the auto scaling group instead of the API instances themselves. On some regular interval, we could scale up to N + 1, then after everything was good, scale down to N again. This would automatically terminate the oldest instance (because our termination policy was set to oldest first). Seems simple enough.
Luckily, before we went too far down the “let’s write our own script” path, one of our operations team remembered that this functionality (scheduled scaling policies) was actually already a feature in AWS. Alas, at the time of writing it’s not exposed via the AWS management console (i.e. the website), but you can definitely create and manage the policies from the command line using the AWS CLI.
I’m not sure if you can use the equivalent AWS client libraries (like the PowerShell cmdlets), but it’s definitely available in the CLI.
We created two policies: scale up to N + 1 at midnight, and then scale down to N at 0100. This acts as a recycle for the instances we are having problems with, and leverages no custom code or scripting. It’s just AWS functionality.
To create a schedule, assuming you have already configured the CLI, you can use the following snippet:
aws autoscaling put-scheduled-update-group-action --scheduled-action-name ScaleUp --auto-scaling-group-name <ASG Name> --recurrence "0 0 * * 1" --desired-capacity 3
This will create a scheduled action that sets the desired capacity of the specified Auto Scaling Group to 3 at midnight every Monday. The recurrence is in standard cron format; the only thing to remember is that it is evaluated in UTC, not local time.
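For completeness, here is a rough sketch of how the scale up and scale down actions might look side by side. Treat it as an illustration rather than a copy of what we ran: the group name my-api-asg, the baseline of N = 2 (chosen to match the capacity of 3 above), the ScaleDown action name and the DRY_RUN switch are all placeholders made up for this example. It also assumes the maximum size of the group allows N + 1 instances. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical values: substitute your own group name and baseline capacity.
ASG_NAME="${ASG_NAME:-my-api-asg}"
N="${N:-2}"

# Print the command by default; set DRY_RUN=0 to actually call AWS.
# (The printed form drops shell quoting, so it is for reading, not pasting.)
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

schedule_recycle() {
  # Midnight UTC on Mondays: grow to N + 1 so a fresh instance comes up.
  run aws autoscaling put-scheduled-update-group-action \
    --scheduled-action-name ScaleUp \
    --auto-scaling-group-name "$ASG_NAME" \
    --recurrence "0 0 * * 1" \
    --desired-capacity "$((N + 1))"

  # 01:00 UTC on Mondays: shrink back to N. With an oldest-first
  # termination policy, the instance removed should be the oldest one.
  run aws autoscaling put-scheduled-update-group-action \
    --scheduled-action-name ScaleDown \
    --auto-scaling-group-name "$ASG_NAME" \
    --recurrence "0 1 * * 1" \
    --desired-capacity "$N"

  # List the scheduled actions on the group to confirm both exist.
  run aws autoscaling describe-scheduled-actions \
    --auto-scaling-group-name "$ASG_NAME"
}

schedule_recycle
```

The hour between the two actions is there to give the new instance time to come up and pass its health checks before the old one is removed. Whether an hour is enough depends on how long your instances take to become healthy.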
I’m pretty disappointed that we still haven’t had a chance to really dig into what the root cause of the issue is. In all seriousness, I do actually blame Logstash, specifically the TCP output that we use to write to another Logstash endpoint as part of our log aggregation. We’ve had some issues with that plugin before, and it wouldn’t surprise me if it was not properly disposing of sockets or some other sort of low-level object as part of its normal operation.
I worry that the automated solution that we put into place (to work around the issue by recycling) will probably remain in place for far longer than anyone wants it to. From a business point of view, what is the motivation to identify and solve the root cause when everything is working well enough, at least from an outside perspective?
Still, it’s better than having to manually recycle the instances ourselves.