
You may have heard of the Netflix simian army, which is an interesting concept that ensures that your infrastructure is always prepared to lose chunks of itself at any moment. The simian army is responsible for randomly killing various components in their live environments, ranging from a single machine/component (chaos monkey), to an entire availability zone (chaos gorilla) all the way through to an entire region (chaos kong).

The simian army is all about making sure that you are prepared to weather any sort of storm and is a very important part of the reliability engineering that goes on at Netflix.

On Sunday, 5 June, we found a chaos gorilla in the wild.

Monkey See, Monkey Do

AWS is a weird service when it comes to reliability. Their services are highly available (i.e. you can almost always access the EC2 service itself to spin up virtual machines), but the constructs created by those services seem to be far less permanent. For EC2 in particular, I don’t think there is any guarantee that a particular instance will remain alive, although they usually do. I know I’ve seen instances die of their own accord, or simply become unresponsive for long periods of time (for no apparent reason).

The good thing is that AWS is very transparent in this regard. They mostly just provide you with the tools, and then it’s up to you to assemble whatever high-availability/high-reliability setup you need.

When it comes to ensuring availability of the service, there are many AWS regions across the world (a few in North America, some in Europe, some in Asia Pacific), each with at least two availability zones in them (sometimes more), which are generally geographically separated. It’s completely up to you where and how you create your resources, and whether or not you are willing to accept the risk that a single availability zone or even a region might disappear for whatever reason.

For our purposes, we tend towards redundancy (multiple copies of a resource), with those copies spread across multiple availability zones, but only within a single region (Asia Pacific Sydney). Everything is typically hidden behind a load balancer (an AWS supplied component that ensures traffic is distributed evenly to all registered resources) and it’s rare that we only have a single one of anything.
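
To give a more concrete picture, a typical setup for one of our services looks something like the sketch below. It’s purely illustrative (the AMI, launch configuration and load balancer names are placeholders, and the AWS Tools for PowerShell usage is from memory rather than lifted from our actual provisioning scripts), but it shows the multiple availability zone/load balancer arrangement.

# Sketch only: placeholder names/IDs throughout, cmdlets from the AWS Tools for PowerShell.
Import-Module AWSPowerShell
Set-DefaultAWSRegion -Region ap-southeast-2

# Launch configuration describing what each instance looks like.
New-ASLaunchConfiguration -LaunchConfigurationName "api-launch-config" `
    -ImageId "ami-12345678" `
    -InstanceType "t2.medium"

# Auto Scaling Group spread across multiple availability zones within the one region,
# registered against a (classic) Elastic Load Balancer so traffic is distributed evenly.
New-ASAutoScalingGroup -AutoScalingGroupName "api-asg" `
    -LaunchConfigurationName "api-launch-config" `
    -MinSize 2 `
    -MaxSize 4 `
    -AvailabilityZone @("ap-southeast-2a", "ap-southeast-2b") `
    -LoadBalancerName "api-load-balancer" `
    -HealthCheckType "ELB" `
    -HealthCheckGracePeriod 300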

Yesterday (Sunday, 05 June 2016), AWS EC2 (the Elastic Compute Cloud, the heart of the virtualization services offered by AWS) experienced some major issues in ap-southeast-2.

This was a real-life chaos gorilla at work.

Extinction Is A Real Problem

The first I knew of this outage was when one of our external monitoring services (Pingdom), reported that one of our production services had dropped offline.

Unfortunately, the persistence framework used by this service has a history of being flakey, so this sort of behaviour is not entirely unusual, although the service had been behaving itself for the last couple of weeks. When I get a message like that, I usually access our log aggregation stack (ELK), and then use the information therein to identify what’s going on (it contains information on traffic, resource utilization, instance statistics and so on).

This time though, I couldn’t access the log stack at all.

This was unusual, because we engineered that stack to be highly available. It features multiple instances across all availability zones at every level of the architecture, from the Logstash brokers dealing with incoming log events, to the RabbitMQ event queue that the brokers write to, through to the indexers that pull off the queue, all the way down to the Elasticsearch instances that back the whole thing.

The next step was to log into the AWS console/dashboard directly and see what it said. This is the second place where we can view information about our resources. While not as easily searchable/visualizable as the data inside the ELK stack, the CloudWatch statistics inside AWS can still give us some stats on instances and traffic.

Once I logged in, our entire EC2 dashboard was unavailable. The only thing that would load inside the dashboard in the EC2 section was the load balancers, and they were all reporting various concerning things. The service that I had received the notification for was showing 0/6 instances available, but others were showing some availability, so not everything was dead.

I was blind though.

All By Myself

Luckily, the service that was completely down tends to not get used very much on the weekend, and this outage occurred on Sunday night AEST, so chances of it causing a poor user experience were low.

The question burning in my mind though: was it just us, or was the entire internet on fire? Maybe it was just a blip and things would come back online in the next few minutes.

A visit to the AWS status page, a few quick Google searches and a visit to the AWS Reddit showed that there was very little traffic relating to this problem, so I was worried that it was somehow just us. My thoughts started to turn towards some sort of account compromise, or that maybe I’d been locked out of the system somehow, and this was just the way EC2 presented that I was not allowed to view the information.

Eventually information started to trickle through about this being an issue specifically with the ap-southeast-2 region, and AWS themselves even updated their status page and put up a warning on the support request page saying it was a known issue.

Now I just had to play the waiting game, because there was literally nothing I could do.

Guess Who’s Back

Some amount of time later (I forget exactly how much), the EC2 dashboard came back. The entirety of ap-southeast-2a appeared to have had some sort of massive outage/failure, meaning that most of the machines we were hosting in that availability zone were unavailable.

Even in the face of a massive interruption to service from ap-southeast-2a, we kind of lucked out. While most of the instances in that availability zone appeared to have died a horrible death, one of our most important instances, which happened to not feature any redundancy across availability zones, was just fine. It’s a pretty massive instance now (r3.4xlarge), which is a result of performance problems we were having with RavenDB, so I wonder if it was given a higher priority or something? All of our small instances (t2/m3 mostly) were definitely dead.

Now that the zone seemed to be back online, restoring everything should have been as simple as killing the broken instances and then waiting for the Auto Scaling Groups to recreate them.

The problem was, this wasn’t working at all. The instances would start up just fine, but they would never be added into the Load Balancers. The reason? Their status checks never succeeded.

This happened consistently across multiple services, even across multiple AWS accounts.

Further investigation showed that our Octopus Deploy server had unfortunately fallen victim to the same problem as the others. We attempted to restart it, but it got stuck shutting down.

No Octopus meant no new instances.

No new instances meant we only had the ones that were already active, and they were showing some strain in keeping up with the traffic, particularly when it came to CPU credits.

Hack Away

Emergency solution time.

We went through a number of different ideas, but we settled on the following:

  1. Take a functioning instance
  2. Snapshot its primary drive
  3. Create an AMI from that snapshot
  4. Manually create some instances to fill in the gaps
  5. Manually add those instances to the load balancer

Not the greatest of solutions, but we really only needed to buy some time while we fixed the problem of not having Octopus. Once that was back up and running, we could clean everything up and rely on our standard self healing strategies to balance everything out.
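
In practice, the manual steps looked something like the sketch below. It’s entirely illustrative (placeholder identifiers everywhere, cmdlet usage from memory, and steps 2 and 3 collapsed into a single New-EC2Image call, which snapshots and registers the image in one go), but it captures the gist.

# Sketch only: placeholder IDs/names, and the AWSPowerShell module is assumed to be loaded.
# 1-3. Create an AMI from a still-functioning instance.
$imageId = New-EC2Image -InstanceId "i-0f0f0f0f" -Name "emergency-api-image"

# (In reality you have to wait for the image to become available before launching from it.)

# 4. Manually spin up replacement instances from that image in a healthy availability zone.
$reservation = New-EC2Instance -ImageId $imageId `
    -InstanceType "t2.medium" `
    -MinCount 2 `
    -MaxCount 2 `
    -SubnetId "subnet-aaaaaaaa" `
    -SecurityGroupId "sg-bbbbbbbb"

# 5. Manually register the new instances with the (classic) load balancer.
$instances = $reservation.Instances | ForEach-Object {
    New-Object Amazon.ElasticLoadBalancing.Model.Instance -Property @{ InstanceId = $_.InstanceId }
}
Register-ELBInstanceWithLoadBalancer -LoadBalancerName "api-load-balancer" -Instance $instances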

A few hours later and all of the emergency instances were in place (3 services all told + some supporting services like proxies). Everything was functioning within normal boundaries, and we could get some sleep while the zombie Octopus instance hopefully sorted itself out (because it was around 0100 on Monday at this point).

The following morning the Octopus Server had successfully shut down (finally) and all we had to do was restart it.

We cleaned up the emergency instances, scaled back to where we were supposed to be and cleaned up anything that wasn’t operating like we expected it to (like the instances that had been created through auto scaling while Octopus was unavailable).

Crisis managed.

Conclusion

In the end, the application of a real life chaos gorilla showed us a number of different areas where we lacked redundancy (and highlighted some areas where we handled failure well).

The low incidence of AWS issues like this, combined with the availability expected by our users, probably means that we don’t need to make any major changes to most of our infrastructure. We should definitely add redundancy to the single-point-of-failure database behind the service that went down during this outage though (which is easier said than done).

Our reliance on a single, non-redundant Octopus Server, however, especially when it’s required to react to some sort of event (whether it be a spike in usage or some sort of critical failure), is a different problem altogether. We’re definitely going to have to do something about that.

Everything else went about as well as I expected though, with the majority of our services continuing to function even when we lost an entire availability zone.

Just as planned.


A while back (god, almost a full year ago), I posted about the way in which we handle environment migrations, and to be honest, it hasn’t changed all that much. We have made some improvements to the way we handle our environments (for example, we’ve improved our newest environments to be built into tested, versioned packages, rather than running directly from source), which is good, but the general migration process of clone temp, tear down old, clone back to active, tear down temp hasn’t really changed all that much.

Over time, we’ve come to realise that there are a number of weaknesses in that strategy though. It’s slow (double clone!), it’s not overly clean and, on rare occasions, it can lead to all of the data for the environment under migration being destroyed.

Yes, destroyed, i.e. lost forever.

This post is about that last weakness (the others will have to continue existing…for now).

Explosions!

In the original cloning scripts, there was an ominous comment that simply said “# compare environment data here?”, which was a pretty big red flag in retrospect. You can’t always do everything though, and the various pressures applied to the development team meant that that step became somewhat manual.

That was a mistake.

After running a number of migrations across a few different environments (using basically the same concepts), we finally triggered that particular tripwire.

An otherwise uninteresting environment upgrade for one of our production services completely annihilated the underlying database (an EC2 instance running RavenDB), but the script gave no indication that anything went wrong.

Luckily, this particular service is more of a temporary waystation, acting as a holding area facilitating the connection of two applications through a common web interface. This meant that while the loss of the data was bad (very bad), it wasn’t a problem for all of our customers. Only those people who had items sitting in the holding area waiting to be picked up were affected.

Obviously, the affected customers were quite unhappy, and rightfully so.

To this day I actually have no idea what went wrong with the actual migration. I had literally run the exact same scripts on a staging environment earlier that day, and verified that the same data was present before and after. After extensive investigation, we agreed that we would probably not get to the root of the issue in a timely fashion and that it might have just been an AWS thing (for a platform based on computers, sometimes AWS is amazingly non-deterministic). Instead, we agreed to attack the code that made it possible for the data loss to occur at all.

The migration scripts themselves.

Give Me More Statistics…Stat!

Returning to that ominous comment in the migration scripts, we realised that we needed an easy way to compare the data in two environments, at least at a high level. Using a basic comparison like that would enable us to make a decision about whether to proceed with the migration (specifically the part that destroys the old environment).

The solution is to implement a statistics endpoint.

The idea is pretty simple. We provide a set of information from the endpoint that summarises the content of the service (at least as best we can summarise it). Things like how many of a certain type of entity are present are basically all we have to deal with for now (simple services), but the concept could easily be extended to include information about any piece of data in the environment.

Something as simple as the example below fills our needs:

{
    "data": {
        "customers": {
            "count": 57
        },
        "databases": {
            "count": 129
        }
    }
}

A side effect of having an endpoint like this is that we can easily (at least using the http_poller input in Logstash) extract this information on a regular basis and put it into our log aggregation so that we can chart its change over time.

Making It Work

With the statistics endpoint written and deployed (after all it must be present in the environment being migrated before we can use it), all that’s left to do is incorporate it into the migration script.

I won’t rewrite the entirety of the migration script here, but I’ve included a skeleton below to provide an idea of how we use the comparison to make sure we haven’t lost anything important on the way through.

function Migrate
{
    param
    (
        # bunch of params here, mostly relating to credentials
    )

    try
    {
        # make current environment unavailable to normal traffic

        # clone current to temporary

        if (-not (Compare-Environments $current $temp))
        {
            # delete the temporary environment and exit with an error
        }

        # delete current environment
        # clone temporary environment into the place where the current environment used to be

        if (-not (Compare-Environments $current $temp))
        {
            # delete the new environment
            # keep the temporary environment because it's the only one with the data
        }
    }
    catch
    {
        # if the current environment still exists, delete the temporary environment
        # if the current environment still exists, restore its availability
    }
}

function Compare-Environments
{
    param
    (
        $a,
        $b
    )

    $aEndpoint = "some logic for URL creation based off environment"
    $bEndpoint = "some logic for URL creation based off environment"

    $aStatistics = Invoke-RestMethod $aEndpoint # credentials, accept header, methods etc
    $bStatistics = Invoke-RestMethod $bEndpoint # credentials, accept header, methods etc

    if ((ConvertTo-Json $aStatistics.data) -eq (ConvertTo-Json $bStatistics.data))
    {
        return $true
    }

    return $false
}
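
Usage then ends up being a single call per environment. The invocation below is purely hypothetical (the parameter names are illustrative only; the real script takes a pile of credentials and environment identifiers), but it shows the shape of it.

# Hypothetical invocation only; the actual parameters are mostly credentials and environment identifiers.
Migrate -EnvironmentName "production-service-a" -AwsKey $awsKey -AwsSecret $awsSecret -OctopusApiKey $octopusApiKey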

Summary

The unfortunate truth of this whole saga is that the person who originally implemented the migration scripts (I’m pretty sure it was me, so I take responsibility) was aware of the fact that the migration could potentially lead to loss of data. At the time, the protection against that was to ensure that we never deleted the old environment until we were absolutely sure that the new environment had been successfully created, making the assumption that the data had come over okay.

In the end, that assumption proved to be our undoing, because while everything appeared peachy, it actually failed spectacularly.

The introduction of a statistics endpoint (almost an environment data hash) is an elegant solution to the problem of potential data loss, which also has some nice side effects for tracking metrics that might not have been easily accessible outside of direct database access.

A double victory is a rare occurrence, so I think I’ll try to savour this one for a little while, even if I was the root cause of the problem.


When building a series of services to allow clients to access their own (previously office locked) data over the greater internet, there are a number of considerations to be made.

The old way was simple. There is a database. Stuff is in the database. When you want stuff, access the database. As long as the database in one office was powerful enough for the users in that office, you would be fine.

Moving all of that information into the cloud though…

Now everyone needs to access all their stuff at the same time. Now efficiency and isolation matters.

Well technically it mattered before as well, just not as much to the people who came before me.

I’m going to be talking about two things briefly in this post.

The first is isolating our upload and synchronization process from the actual service that needs to be queried.

The second is isolating binary data from all other requests.

Data Coming Right Up

In order to grant remote access to previously on-premises locked data we need to get that data out somehow. Unfortunately, for this system, the source of truth needs to stay on-premises for a number of different reasons that I won’t go into in too much detail. What we’re focusing on is allowing authenticated read-only access to the data from external systems.

The simplest solution to this is to have a replica of the data available in the cloud, and use that data for all incoming remote requests. Obviously this isn’t perfect (it’s an eventually consistent model), but because it’s read-only and we have some allowance for data latency (i.e. it’s okay if a mobile application doesn’t see exactly what is in the on-premises data the moment it’s changed), it fits our needs well enough.

Of course, all of this data constantly being uploaded can cause a considerable amount of strain on the system as a whole, so we need to make sure that if there is a surge in the quantity of synchronization requests that the service responding to queries (get me the last 100 X entities) is not negatively impacted.

Easiest solution? Simply separate the two services, and share the data via a common store of some sort (our initial implementation will have a database).

With this model we gain some protection from load on one side impacting the other.

It’s not perfect mind you, but the early separation gives us a lot of power moving forward if we need to change. For example, we could queue all synchronization requests to the sync service fairly easily, or split the shared database into a master and a number of read replicas. We don’t know if we’ll have a problem or what the solution to that problem will be; the important part is that we’ve isolated the potential danger, allowing for future change without as much effort.

10 Types of People

The system that we are constructing involves a moderate amount of binary data. I say moderate, but in reality, for most people who have large databases on premises, a good percentage of that data is binary data in various forms. Mostly images, but there are a lot of documents of various types as well (ranging from small and efficient PDF files to monstrous Word document abominations with embedded Excel spreadsheets).

Binary data is relatively problematic for a web service.

If you grant access to the binary data from a service, every request ties up one of your possible request handlers (whether it be threads, pseudo-threads or various other mechanisms of parallelism). This leaves less resources available for your other requests (data queries), which can make things difficult in the long run as the total number of binary data requests in flight at any particular moment slowly rises.

If you host the data outside the main service, you have to deal with the complexity of owning something else and making sure that it is secure (raw S3 would be ideal here, but then securing it is a pain).

In our case, our plan is to go with another service, purely for binary data. This allows us to leverage our existing authentication framework (so at least everything is secure), and allows us to use our existing logging tools to track access.

The benefit of isolating access to binary data like this is that if there is a sudden influx of requests for images or documents or something, only that part of the system will be impacted (assuming there are no other shared components). Queries for normal data will still complete in a timely fashion, and assuming we have written our integration well, some retry strategy will ensure that the binary data is delivered appropriately once the service resumes normal operation.
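
That retry strategy doesn’t need to be anything fancy either. The sketch below is purely illustrative (the endpoint URL, token and helper function are all made up for the example), but something this simple on the integration side usually does the job.

# Illustrative only: hypothetical endpoint, token and helper function.
function Invoke-WithRetry
{
    param
    (
        [scriptblock]$Operation,
        [int]$MaxAttempts = 5,
        [int]$DelaySeconds = 2
    )

    for ($attempt = 1; $attempt -le $MaxAttempts; $attempt++)
    {
        try
        {
            return & $Operation
        }
        catch
        {
            if ($attempt -eq $MaxAttempts) { throw }
            # Simple exponential backoff before the next attempt.
            Start-Sleep -Seconds ($DelaySeconds * [math]::Pow(2, $attempt - 1))
        }
    }
}

# Example: downloading a document from the (hypothetical) binary data service.
Invoke-WithRetry -Operation {
    Invoke-RestMethod -Uri "https://binaries.example.com/documents/12345" `
        -Headers @{ Authorization = "Bearer $token" } `
        -OutFile "C:\temp\12345.pdf"
}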

Summary

It is important to consider isolation concerns like I have outlined above when you are designing the architecture of a system.

You don’t necessarily have to implement all of your considerations straight away, but you at least need to know where your flex areas are and where you can make changes without having to rewrite the entire thing. Understand how and when your architecture could adapt to potential changes, but don’t build it until you need it.

In our case, we also have a gateway/router sitting in front of everything, so we can remap URLs as we see fit moving into the future. In the case of the designs I’ve outlined above, they come from past (painful) experience. We’ve already encountered those issues in the past, while implementing similar systems, so we decided to go straight to the design that caters for them, rather than implement something we know would have problems down the track.

It’s this sort of learning from your prior experiences that really makes a difference to the viability of an architecture in the long run.