
You may have heard of the Netflix simian army, an interesting concept for ensuring that your infrastructure is always prepared to lose chunks of itself at any moment. The simian army is responsible for randomly killing various components in their live environments, ranging from a single machine/component (chaos monkey), to an entire availability zone (chaos gorilla), all the way through to an entire region (chaos kong).

The simian army is all about making sure that you are prepared to weather any sort of storm and is a very important part of the reliability engineering that goes on at Netflix.

On Sunday, 5 June, we found a chaos gorilla in the wild.

Monkey See, Monkey Do

AWS is a weird service when it comes to reliability. Their services are highly available (i.e. you can almost always access the EC2 service itself to spin up virtual machines), but the constructs created by those services seem to be far less permanent. For EC2 in particular, I don’t think there is any guarantee that a particular instance will remain alive, although they usually do. I know I’ve seen instances die of their own accord, or simply become unresponsive for long periods of time (for no apparent reason).

The good thing is that AWS is very transparent in this regard. They mostly just provide you with the tools, and then it's up to you to put together whatever high-availability/high-reliability setup you need.

When it comes to ensuring availability of the service, there are many AWS regions across the world (a few in North America, some in Europe, some in Asia Pacific), each with at least two availability zones (sometimes more), which are generally geographically separated. It's completely up to you where and how you create your resources, and whether or not you are willing to accept the risk that a single availability zone or even a region might disappear for whatever reason.

For our purposes, we tend towards redundancy (multiple copies of a resource), with those copies spread across multiple availability zones, but only within a single region (Asia Pacific Sydney). Everything is typically hidden behind a load balancer (an AWS-supplied component that distributes traffic evenly to all registered resources) and it's rare that we only have a single one of anything.
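
To give a sense of the shape of that setup, here's a rough sketch using the AWS Tools for PowerShell: a launch configuration, plus an Auto Scaling Group spread across two Sydney availability zones and attached to a classic load balancer. All of the names, the AMI ID and the sizes below are placeholders, not our real configuration.

```powershell
# Sketch only: a service with instances spread across two availability zones,
# registered with an existing classic Elastic Load Balancer.
Import-Module AWSPowerShell

New-ASLaunchConfiguration -LaunchConfigurationName "my-service-lc" `
                          -ImageId "ami-xxxxxxxx" `
                          -InstanceType "t2.medium"

New-ASAutoScalingGroup -AutoScalingGroupName "my-service-asg" `
                       -LaunchConfigurationName "my-service-lc" `
                       -MinSize 2 -MaxSize 6 -DesiredCapacity 2 `
                       -AvailabilityZone "ap-southeast-2a","ap-southeast-2b" `
                       -LoadBalancerName "my-service-elb"
```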

Yesterday (Sunday, 05 June 2016), AWS EC2 (the Elastic Compute Cloud, the heart of the virtualization services offered by AWS) experienced some major issues in ap-southeast-2.

This was a real-life chaos gorilla at work.

Extinction Is A Real Problem

The first I knew of this outage was when one of our external monitoring services (Pingdom), reported that one of our production services had dropped offline.

Unfortunately, the persistence framework used by this service has a history of being flaky, so this sort of behaviour is not entirely unusual, although the service had been behaving itself for the last couple of weeks. When I get a message like that, I usually access our log aggregation stack (ELK) and use the information therein to identify what's going on (it contains information on traffic, resource utilization, instance statistics and so on).

This time though, I couldn’t access the log stack at all.

This was unusual, because we engineered that stack to be highly available. It features multiple instances across all availability zones at every level of the architecture, from the Logstash brokers dealing with incoming log events, to the RabbitMQ event queue that the brokers write to, through to the indexers that pull events off the queue, all the way down to the Elasticsearch instances that back the whole thing.

The next step was to log into the AWS console/dashboard directly and see what it said. This is the second place where we can view information about our resources. While not as easily searchable/visualizable as the data inside the ELK stack, the CloudWatch statistics inside AWS still give us some stats on instances and traffic.

Once I logged in, the EC2 dashboard was almost entirely unavailable. The only thing that would load inside the EC2 section was the load balancers, and they were all reporting various concerning things. The service that I had received the notification for was showing 0/6 instances available, but others were showing some availability, so not everything was dead.

I was blind though.

All By Myself

Luckily, the service that was completely down tends to not get used very much on the weekend, and this outage occurred on Sunday night AEST, so chances of it causing a poor user experience were low.

The question burning in my mind though? Was it just us? Or was the entire internet on fire? Maybe it was just a blip and things would come back online in the next few minutes.

A visit to the AWS status page, a few quick Google searches and a visit to the AWS subreddit showed that there was very little traffic relating to this problem, so I was worried that it was somehow just us. My thoughts started to turn towards some sort of account compromise, or that maybe I'd been locked out of the system somehow, and this was just the way EC2 presented the fact that I was not allowed to view the information.

Eventually information started to trickle through about this being an issue specifically with the ap-southeast-2 region, and AWS themselves even updated their status page and put up a warning on the support request page saying it was a known issue.

Now I just had to play the waiting game, because there was literally nothing I could do.

Guess Who’s Back

Some amount of time later (I forget exactly how much), the EC2 dashboard came back. The entirety of ap-southeast-2a appeared to have had some sort of massive outage/failure, meaning that most of the machines we were hosting in that availability zone were unavailable.

Even in the face of a massive interruption to service from ap-southeast-2a, we kind of lucked out. While most of the instances in that availability zone appeared to have died a horrible death, one of our most important instances, which happened to not feature any redundancy across availability zones, was just fine. It's a pretty massive instance now (r3.4xlarge), a result of performance problems we were having with RavenDB, so I wonder if it was given a higher priority or something? All of our small instances (t2/m3 mostly) were definitely dead.

Now that the zone seemed to be back online, restoring everything should have been as simple as killing the broken instances and then waiting for the Auto Scaling Groups to recreate them.
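
In the normal case that's not much more than the following sketch (again using the AWS Tools for PowerShell, with a placeholder group name): terminate anything unhealthy, then let the Auto Scaling Group notice the missing capacity and launch replacements.

```powershell
# Sketch only: terminate the broken instances in the group and let auto scaling
# launch replacements to bring it back up to the desired capacity.
Get-ASAutoScalingGroup -AutoScalingGroupName "my-service-asg" |
    Select-Object -ExpandProperty Instances |
    Where-Object { $_.HealthStatus -ne "Healthy" } |
    ForEach-Object { Remove-EC2Instance -InstanceId $_.InstanceId -Force }
```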

The problem was, this wasn’t working at all. The instances would start up just fine, but they would never be added into the Load Balancers. The reason? Their status checks never succeeded.

This happened consistently across multiple services, even across multiple AWS accounts.

Further investigation showed that our Octopus Deploy server had unfortunately fallen victim to the same problem as the others. We attempted to restart it, but it got stuck shutting down.

No Octopus meant no new instances.

No new instances meant we only had the ones that were already active, and they were showing some strain in keeping up with the traffic, particularly when it came to CPU credits.

Hack Away

Emergency solution time.

We went through a number of different ideas, but we settled on the following:

  1. Take a functioning instance
  2. Snapshot its primary drive
  3. Create an AMI from that snapshot
  4. Manually create some instances to fill in the gaps
  5. Manually add those instances to the load balancer
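
In AWS terms, those steps boil down to something like the sketch below (AWS Tools for PowerShell again; every ID and name is a placeholder, and in reality you have to wait for the snapshot and image to become available between steps).

```powershell
# Sketch only: clone a healthy instance and push the copy into the load balancer.

# 1-3. Snapshot the primary drive of a functioning instance, then register an AMI from it.
$snapshot = New-EC2Snapshot -VolumeId "vol-xxxxxxxx" -Description "Emergency clone of healthy instance"
# (wait for the snapshot to reach the 'completed' state before registering the image)

$mapping = New-Object Amazon.EC2.Model.BlockDeviceMapping
$mapping.DeviceName = "/dev/sda1"
$mapping.Ebs = New-Object Amazon.EC2.Model.EbsBlockDevice
$mapping.Ebs.SnapshotId = $snapshot.SnapshotId

$imageId = Register-EC2Image -Name "emergency-service-image" `
                             -RootDeviceName "/dev/sda1" `
                             -BlockDeviceMapping $mapping `
                             -Architecture x86_64 `
                             -VirtualizationType hvm

# 4. Manually create an instance from the emergency image.
$reservation = New-EC2Instance -ImageId $imageId -InstanceType "t2.medium" -MinCount 1 -MaxCount 1

# 5. Manually register the new instance with the load balancer.
$elbInstance = New-Object Amazon.ElasticLoadBalancing.Model.Instance($reservation.Instances[0].InstanceId)
Register-ELBInstanceWithLoadBalancer -LoadBalancerName "my-service-elb" -Instance $elbInstance
```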

Not the greatest of solutions, but we really only needed to buy some time while we fixed the problem of not having Octopus. Once that was back up and running, we could clean everything up and rely on our standard self healing strategies to balance everything out.

A few hours later, all of the emergency instances were in place (3 services all told, plus some supporting services like proxies). Everything was functioning within normal boundaries, and we could get some sleep while the zombie Octopus instance hopefully sorted itself out (because it was around 0100 on Monday at this point).

The following morning the Octopus Server had successfully shut down (finally) and all we had to do was restart it.

We cleaned up the emergency instances, scaled back to where we were supposed to be and cleaned up anything that wasn’t operating like we expected it to (like the instances that had been created through auto scaling while Octopus was unavailable).

Crisis managed.

Conclusion

In the end, the application of a real life chaos gorilla showed us a number of different areas where we lacked redundancy (and highlighted some areas where we handled failure well).

The low incidence of AWS issues like this, combined with the availability expected by our users, probably means that we don't need to make any major changes to most of our infrastructure. We definitely should add redundancy to the single point of failure database behind the service that went down during this outage though (which is easier said than done).

Our reliance on a single, non-redundant Octopus Server, however, especially when it's required to react to some sort of event (whether it be a spike in usage or some sort of critical failure), is a different problem altogether. We're definitely going to have to do something about that.

Everything else went about as well as I expected though, with the majority of our services continuing to function even when we lost an entire availability zone.

Just as planned.


Generally, I’ve been pretty happy with the way that we do log aggregation, using Elasticsearch, Logstash and Kibana (ELK). That’s not to say it couldn’t use improvements in almost every area, but for now, it does what it needs to do. It’s definitely better than what we had before, and people are actually using it to answer questions, which I would consider the mark of a successful implementation.

Of all the components in the triumvirate though, Logstash has easily been the worst. Kibana has been basically problem free (somewhat to be expected for a visualisation framework that delegates the majority of its work to the data store) and Elasticsearch has been complicated and somewhat temperamental (mostly because we haven’t had the time nor experience to tune it properly), but Logstash has been a pain.

Version 1.4 flat out did not work properly on Windows with file-based inputs, as it didn’t properly remember its position in the files it was reading. Version 1.5.0 fixed that problem, but we had to suffer through a number of beta issues because we were forced to use the beta version. Even after the 1.5 release we’ve had other issues relating to TCP outputs consuming massive amounts of CPU (kind of annoying on CPU-credit-limited instances), using the S3 input from behind a proxy, and recovering reliably after its TCP output endpoint was down temporarily.

It’s become a bit of a meme between me and one of our operations people that whenever something weird happens with logs (even if it’s unrelated to the actual process of log aggregation), it’s probably Logstash’s fault.

When one of our AWS instances responsible for filling API requests became unresponsive a few weeks ago, we joked that it was probably Logstash’s fault.

It turned out that it actually was.

What’s That Thing You Use to Dig Food?

Our APIs all have redundancy built into them via the magic of auto scaling groups and multiple instances. It’s the kind of thing you have to have when you’re running anything in AWS, as there is no guarantee that your instances will be long lived. It’s likely that they will be fine and everything will go swimmingly, but there is no guarantee. Honestly, that’s no different from normal hosting; the information is just available upfront, instead of after a few months when something explodes. Plan for failure, because it will happen.

Anyway, a few weeks ago one of our API instances stopped doing the thing it was supposed to be doing, i.e. filling API requests, so it was quietly removed from the load balancer. The alarms we have set up for the number of healthy instances kicked in, so we put the failing instance into standby mode (which spun up a new one to take its place) and started investigating.
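
Putting an instance into standby is a one-liner with the AWS Tools for PowerShell; something like the sketch below, with placeholder group and instance IDs. Because the desired capacity is not decremented, the group launches a replacement while the suspect instance hangs around for investigation.

```powershell
# Sketch only: pull the suspect instance out of service without terminating it.
# Leaving the desired capacity alone forces the group to launch a replacement.
Enter-ASStandby -AutoScalingGroupName "api-asg" `
                -InstanceId "i-0123456789abcdef0" `
                -ShouldDecrementDesiredCapacity $false
```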

Looking at the logs in Kibana, the instance started returning 500s sporadically early in the morning and then got progressively worse until it was removed from the load balancer a little after 0800. The root cause of the 500s? OutOfMemoryExceptions: the API process literally could not allocate a piece of memory large enough to do the thing it wanted to do.

My first response was to make that deep guttural noise that only comes from software developers when they know what’s about to happen. A mixture between a groan and a resigned sigh, tinged with frustration.

I’ve dealt with memory leaks before.

They are terrible, and I hate them.

I assumed that we had a memory leak in our deployed .NET application. Although .NET is a managed language, it is surprisingly easy to leak memory, with various ways to root vast amounts of information, stopping the garbage collector from recovering it.

The best tool that I’ve found for visualizing memory leaks in .NET is the SciTech .NET Memory Profiler. Honestly though, other than Visual Studio itself, it’s basically the only one I’ve really used, so I can’t strictly say that it’s better than anything else. I can say that it is good at what it does, because it’s helped me locate and fix quite a few memory leaks.

Given that the API had been running uninterrupted for a few weeks, the memory leak would be slow. It might even be related to specific requests that only happen rarely, or even worse, it might be time based rather than event based. I knew I would need to either reproduce the problem locally (which might be quite hard), or find some way to get the memory profiler tool into the environment where the problem occurred and then wait until it happened again.

Like I said, I hate memory leaks.

Surprise! It Wasn’t Us

Remoting onto the AWS instance with the memory issue showed an entirely unexpected situation.

Our .NET process wasn’t using much memory at all.

The Java process, however, was using 1.5GB of physical memory and a huge chunk of virtual memory. The only piece of Java code running? Logstash.
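
Nothing fancy was needed to see this; sorting the process list by memory from the remote session makes the culprit obvious. A rough equivalent of what I was looking at:

```powershell
# Show the top memory consumers on the box, in both physical and virtual terms.
Get-Process |
    Sort-Object WorkingSet64 -Descending |
    Select-Object -First 5 Name, Id,
        @{ Name = "WorkingSetMB"; Expression = { [math]::Round($_.WorkingSet64 / 1MB) } },
        @{ Name = "VirtualMB";    Expression = { [math]::Round($_.VirtualMemorySize64 / 1MB) } }
```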

Now I had a different problem. At least with a memory leak within our software I could trace it, find it and fix it. I know .NET, I’ve done memory leak analysis before, and while it’s terrible, it’s a known problem. It might take me a while to get to the bottom of it, but I would get there.

Logstash runs Ruby via JRuby, which runs on the JVM. It’s open source, but I wouldn’t even know where to begin with that mess. How could I find out why it was using so much memory for the apparently simple task of processing some text files? None of our Logstash configs are fancy; they just process text files and shunt the results to another instance of Logstash, which puts them into Elasticsearch.

Clearly, The Problem Is Solved

At this particular point in time, I do not have the luxury of getting to the root cause of problems with supplementary components. This makes me sad inside, but that’s how reality rolls sometimes.

The easiest solution right now? Restart the Logstash service every day.

Easy enough to implement with a Scheduled Task that executes PowerShell.
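
The whole thing is only a few lines; a minimal sketch is below, assuming the Logstash Windows service is actually named "logstash" and that the script and log paths exist.

```powershell
# Restart-Logstash.ps1 - the script the scheduled task runs each day.
Start-Transcript -Path "C:\Logs\logstash-restart.log" -Append
Restart-Service -Name "logstash" -Force
Stop-Transcript

# One-off registration of the daily task (run from an elevated session).
$action  = New-ScheduledTaskAction -Execute "powershell.exe" `
               -Argument "-NoProfile -ExecutionPolicy Bypass -File C:\Scripts\Restart-Logstash.ps1"
$trigger = New-ScheduledTaskTrigger -Daily -At 3am
Register-ScheduledTask -TaskName "Restart Logstash" -Action $action -Trigger $trigger -User "SYSTEM"
```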

I did have one incredibly weird issue with the task. The task name that I generated for the service restart had square brackets in it and, as a result, so did the folder where the PowerShell task puts its transcribed output. The Start-Transcript command does not like square brackets in its output path for some reason, and it fails with "Start-Transcript: The host is not currently transcribing", which is incredibly helpful. This particular quirk took far too long to get to the bottom of.

Long term we need to look into the issue in depth. Maybe it’s a problem in the particular version of Logstash we are using (1.5.2), maybe it’s our configuration, or maybe it’s the environment. I don’t know, and that particular investigation could be a gigantic time sink, so we’ll have to dedicate some time to it later on.

Another solution would be to not use Logstash directly on the machines generating the logs, and to instead use the Logstash Forwarder, which is a lightweight replacement for Logstash that only forwards events to a remote location. The remote location is typically an actual Logstash instance which has all the logic to parse the event into a meaningful format. To me this just moves the issue around, but it still might be an option.

Conclusion

This is actually the first case that I can remember where a memory leak wasn’t the fault of the software I was actively working on. I call it a memory leak because I can’t conceive how an apparently simple process like Logstash could use so much memory over time, but it might just be what it does. I mean, I wouldn’t consider SQL Server using massive amounts of memory a leak per se (because I understand what it does is hard, and it consumes more memory based on the size of the data); it’s just really weird that a process which should be entirely transient (read, transform, write, discard) could keep using more and more memory over time without there being some sort of bug.

To be honest, I’m extremely unhappy with installing a scheduled task to restart the Logstash service. It’s one of those band-aid fixes that you just know will end up biting you hard at some point in the future. It will probably cause an entirely worse issue, like massive, unexpected data loss or something.

For now, at least I know that the Logstash process will not stop our APIs from working, which I guess is a victory.

A pyrrhic victory.