This Bird Sure is Difficult to Handle

December 15, 2015

A few months ago we released a new service allowing our users to complete some of their work through a mobile application. For an application that is primarily locked to a computer within an office, it was a pretty big improvement. It’s not the first time we’ve tried to add this functionality, but it was one of the better attempts.

That is, until people really started hammering it. Then it went downhill kind of quickly.

Before I started working here, an architectural decision was made to use a document database to store the data for this mobile application. The idea was that the database would be a temporary store, almost like a waystation, that would allow two-way synchronization of the data between two applications (the mobile application and the client’s server). Conceptually, it’s not a bad design. It’s not necessarily the design I would have suggested, but it has merit.

The document database selected was RavenDB, specifically version 2.5.

The people who made that particular architectural decision are no longer with the company for various reasons, so it was up to my team and me to complete the work and actually release something. We did our best and after a fairly lengthy development period followed by an equally lengthy beta period, we released into the wild. As I mentioned above, it started well, but didn’t seem to scale to the number of users we started seeing. I’m not talking hundreds of thousands of users either, just a few hundred, so it definitely wasn’t one of those problems where you are cursed by your own success.

The root cause for the performance problems? It appeared to be RavenDB.

An Unkindness

I always make the assumption that if a commercial component looks like it’s not holding up its end of the bargain, it’s probably not the component’s fault. It’s almost certainly the developers’ fault, because they either configured it wrong or did not understand it well enough to know that they were using it in entirely the wrong way.

I think this is true for our problems with RavenDB, but I still don’t know exactly where we went wrong.

I’ll start at the beginning.

The first architectural design had two RavenDB instances in EC2 hidden behind a load balancer. They were configured to replicate to each other. This pair was reduced down to a single instance when we discovered that that particular structure was causing issues in the system (using my future knowledge, I now know that’s not how RavenDB does redundancy). The intention was that if load testing showed that we had issues with only one instance, we would revisit.

Our load testing picked up a bunch of problems with various things, but at no point was the RavenDB instance the bottleneck, so we assumed we would be okay.

Unfortunately, the load tests were flawed somehow, because once the system started to be used in anger, the RavenDB instance was definitely the bottleneck.

When we released, the database was initially hosted on a t2.medium. These instances are burstable (meaning their CPU can spike), but are limited by CPU credits. It became obvious very quickly that the database was consuming far more CPU credits than we expected (its CPU usage was averaging something like 80%), so we quickly shifted it to an m3.medium (which does not use CPU credits). This worked for a little while, but eventually we started experiencing performance issues again as usage increased. Another shift of the underlying hardware to an m4.large improved the performance somewhat, but not as much as we expected.
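
To see why a burstable instance struggles at that kind of sustained load, here is a rough sketch of the credit arithmetic. The numbers are assumptions, not measurements from our system: I'm assuming a t2.medium has 2 vCPUs, earns 24 CPU credits per hour and can bank at most 576, that one credit is one vCPU at 100% for one minute, and that the 80% average applies across both vCPUs. Check the current AWS documentation before relying on any of them.

```python
# Rough CPU credit arithmetic for a burstable instance.
# Every figure passed in below is an assumption for illustration only.

def net_credits_per_hour(vcpus, avg_utilisation_percent, credits_earned_per_hour):
    """Credits gained (positive) or lost (negative) each hour.

    Assumes one CPU credit buys one vCPU at 100% for one minute.
    """
    spent = vcpus * avg_utilisation_percent * 60 / 100
    return credits_earned_per_hour - spent


def hours_until_empty(balance, net_per_hour):
    """Hours until the credit balance hits zero, or None if it never does."""
    if net_per_hour >= 0:
        return None
    return balance / -net_per_hour


# Assumed t2.medium figures: 2 vCPUs, 24 credits earned per hour, 576 max balance.
net = net_credits_per_hour(vcpus=2, avg_utilisation_percent=80,
                           credits_earned_per_hour=24)
print("net credits per hour:", net)
print("hours until a full balance is gone:", hours_until_empty(576, net))
```

Under those assumptions the instance spends credits four times faster than it earns them, so even a full balance is gone within a working day, after which the CPU is throttled back to its baseline.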

When we looked into the issue, we discovered a direct correlation between the latency of requests to the service and the disk latency of the data disk that RavenDB was using for storage. What followed was a series of adjustments to the data disk, mostly related to switching to provisioned IOPS and then slowly scaling it up until the latency of the disk seemed to no longer be the issue.

But we still had performance problems, and at this point the business was (rightly) getting a bit antsy, because users were starting to notice.

After investigation, the new cause of the performance problems seemed to be paging. Specifically, the RavenDB process was consuming more memory than was available and was paging to the very slow system drive. Scaling the underlying instance up to an m4.xlarge (for more memory and compute) alleviated this particular problem.

We had a number of other issues as well:

Because we host RavenDB in IIS, the default application pool recycle that occurs every 29 hours eventually started happening during our peak times, which didn’t end well. We now schedule the restart for early in the morning. This was made somewhat more difficult by the fact that RavenDB can’t handle overlapping processes (which IIS uses to avoid downtime during restarts).
We’ve had the RavenDB process crash from time to time. IIS handles this (by automatically restarting the process), but there is still a period of downtime while the whole thing heats up again.
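
The recycle problem is easier to see with a little date arithmetic. Because 29 hours is not a multiple of 24, each recycle lands five hours later in the day than the previous one, so it is only a matter of time before one falls in business hours. This is a minimal sketch: the 29 hour interval is the one mentioned above, while the starting timestamp is a hypothetical value chosen purely for illustration.

```python
from datetime import datetime, timedelta

# Recycle interval mentioned above (29 hours).
RECYCLE_INTERVAL = timedelta(hours=29)


def recycle_times(first_recycle, count):
    """Return the first `count` recycle timestamps, starting at `first_recycle`."""
    return [first_recycle + i * RECYCLE_INTERVAL for i in range(count)]


# Hypothetical first recycle at 03:00, comfortably outside business hours.
first = datetime(2015, 12, 1, 3, 0)

for moment in recycle_times(first, 6):
    print(moment.strftime("%Y-%m-%d %H:%M"))
# The hour of day walks forward five hours each time: 03, 08, 13, 18, 23, 04.
```

Starting from 03:00, the third recycle already lands at 13:00. A fixed daily schedule avoids the drift entirely, which is why we moved to one.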

That brings us to the present. The service is running well enough, and is pretty performant, but it really does feel like we’ve thrown way too much power at it for what it accomplishes.

Where to Now?

Raven 2.5 is old. Very old.

Our next step is to upgrade to Raven 3, and then directly compare the performance of the two versions under similar load to see exactly what we’re getting ourselves into.

The logic behind the upgrade is that the newer version is far more likely to have better performance, and we’re far more likely to be able to easily get support for it.

Initial investigations show that the upgrade itself is relatively painless. The older Raven 2.5 client is completely compatible with the new server, so we don’t even need to upgrade the API components yet. All of the data appears to migrate perfectly fine (and seamlessly), so all we need to do is put some effort into comparing performance and then we should be sweet.

Secondarily, we’re going to be setting up at least one other Raven instance, primarily as a backup, but maybe also as a load balancing measure. I’ll have to look into it more before we figure out exactly what it’s capable of, but at the very least we need a replicated backup.

Summary

This post was more of an introduction to some of the issues that we’ve been having with the persistence of our service, but it does bring to light some interesting points.

Needing to support something in production is very different from just deciding to use it during development. There is a whole other set of tools and services that are required before you can successfully use something in production, and a completely different set of expertise and understanding required. Because the development process was so smooth (from the persistence point of view), we never really had to dig into the guts of Raven and really figure out what it was doing, so we were completely unprepared when everything went to hell during actual usage.

Trial by fire indeed.