Raven Pun 2: The Ravening

January 5, 2016

The last post I made about our adventures with RavenDB outlined the plan: upgrade to RavenDB 3. First step? Take two copies of our production environment, leave one at RavenDB 2.5 and upgrade the other to RavenDB 3. Slam both with our load tests in parallel and then see which one has better performance by comparing the Kibana dashboard for each environment (it shows things like CPU usage, request latency, disk latency, etc.).

The hope was that RavenDB 3 would show lower resource usage and better performance all round using approximately the same set of data and for the same requests. This would give me enough confidence to upgrade our production instance and hopefully mitigate some of the issues we’ve been having.

Unfortunately, that’s not what happened.

Upgrade, Upgrade, Upgrade, Upgrade!

Actually upgrading to RavenDB 3 was painless. For RavenDB 2.5 we build a NuGet package that contains all of the necessary binaries and configuration, along with PowerShell scripts that set up an IIS website and application pool automatically on deployment. RavenDB 3 works in a very similar way, so all I had to do was re-construct the package so that it worked in the same way except with the newer binaries. It was a little bit fiddly (primarily because of how we constructed the package the first time), but it was relatively easy.

Even better, the number of binaries and dependencies for RavenDB 3 is lower than RavenDB 2.5, which is always nice to see. Overall I think the actual combined size may have increased, but it’s still nice to have a smaller number of files to manage.

Once I had the package built, all I had to do was deploy it to the appropriate environment using Octopus Deploy.

I did a simple document count check before and after and everything was fine: exactly the same number of documents was present (all ~100K of them).
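A check like this is easy to script. What follows is only a minimal sketch in Python, not the exact thing I ran. It assumes the document count is exposed in the JSON returned by the /stats endpoint under a key named CountOfDocuments (worth confirming against your own RavenDB version), and the function names are made up for illustration.

```python
import json
from urllib.request import urlopen


def fetch_stats(base_url):
    """Fetch and parse the JSON returned by the /stats endpoint under base_url."""
    with urlopen(base_url.rstrip("/") + "/stats") as response:
        return json.loads(response.read().decode("utf-8"))


def document_count(stats, key="CountOfDocuments"):
    # Assumption: the stats JSON exposes the document count under this key.
    return int(stats[key])


def counts_match(before, after, key="CountOfDocuments"):
    """Compare two stats snapshots; return (matches, before_count, after_count)."""
    before_count = document_count(before, key)
    after_count = document_count(after, key)
    return before_count == after_count, before_count, after_count
```

Take one snapshot with fetch_stats before the upgrade, another afterwards, and pass both to counts_match. Anything other than a match is a reason to stop before pointing load tests at the environment.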

Resource usage was nominal during this upgrade and basically non-existent afterwards.

Time to simulate some load.

What a Load

I’ve written previously about our usage of JMeter for load tests, so all I had to do was reuse the structure I already had in place. I recently did some refactoring in the area as well, so it was pretty fresh in my mind (I needed to extract some generic components from the load tests repository so that we could reuse them for other load tests). I set up a couple of JMeter worker environments in AWS and started the load tests.

Knowing what I do now, I can see that the load tests that I originally put together don’t actually simulate the real load on the service. This was one of the reasons why our initial, intensive load testing did not find any of the issues with the backend that we found in production. I’d love to revisit the load profiles at some stage, but for now all I really needed was some traffic so that I could compare the different versions of the persistence store.

RavenDB 2.5 continued to do what it always did when the load tests were run. It worked just fine. Minimal memory and CPU usage, disk latency was low, all pretty standard.

RavenDB 3 ate all of the memory on the machine (16GB) over the first 10-15 minutes of the load tests. This caused disk thrashing on the system drive, which in turn annihilated performance, and eventually the process crashed and restarted.

Not a good sign.

I’ve done this test a few times now (upgrade to 3, run load tests) and each time it does the same thing. Sometimes after the crash it starts working well (minimal resource usage, good performance), but sometimes even when it comes back from the crash it does the exact same thing again.

Time to call in the experts, i.e. the people that wrote the software.

Help! I Need Somebody

We don’t currently have a support contract with Hibernating Rhinos (the makers of RavenDB). The plan was to upgrade to RavenDB 3 (based on the assumption that it’s probably a better product), and if our problems persisted, to enter into a support contract for dedicated support.

Luckily, the guys at Hibernating Rhinos are pretty awesome and interact regularly with the community at the RavenDB Google Group.

I put together a massive post describing my current issue (mentioning the history of issues we’ve had to try and give some context), which you can find here.

The RavenDB guys responded pretty quickly (the same day in fact) and asked for some more information (understandably). I re-cloned the environment (to get a clean start) and did it again, except this time I was regularly extracting statistics from RavenDB (using the /stats and /admin/stats endpoints), as well as dumping the memory when it got high (using procdump) and using the export debug information functionality built into the new Raven Studio (which is so much better than the old studio that it’s not funny). I packaged all of this information together with the RavenDB log files and posted a response.
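The regular statistics extraction is the part most worth automating, because the interesting numbers are the ones captured just before everything falls over. Here is a rough sketch of how that could look in Python. Treat it as an illustration under assumptions: the URLs are placeholders, the admin endpoint may need credentials that this sketch does not handle, and the function names are hypothetical.

```python
import json
import time
from urllib.request import urlopen


def fetch_json(url):
    """Fetch a URL and parse the response body as JSON."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def sample_stats(urls, samples=3, interval_seconds=30.0,
                 fetch=fetch_json, sleep=time.sleep):
    """Collect `samples` snapshots of every URL, `interval_seconds` apart.

    `fetch` and `sleep` are injectable so the loop can be exercised
    without a running server.
    """
    snapshots = []
    for index in range(samples):
        snapshot = {"taken_at": time.time()}
        for url in urls:
            snapshot[url] = fetch(url)
        snapshots.append(snapshot)
        if index < samples - 1:  # no need to wait after the final sample
            sleep(interval_seconds)
    return snapshots


if __name__ == "__main__":
    # Placeholder host; substitute the real server and database URLs.
    endpoints = ["http://localhost:8080/stats",
                 "http://localhost:8080/admin/stats"]
    with open("raven-stats.json", "w") as handle:
        json.dump(sample_stats(endpoints), handle, indent=2)
```

Each snapshot carries a timestamp, so the output can be lined up against the memory graphs later.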

While looking through that information, Oren Eini (the CEO/Founder of Hibernating Rhinos) noticed that there were a number of errors reported around not being able to find a Lucene.Net.dll file on the drive where I had placed the database files (we separated the database files from the libraries: the data lives on a large, high-throughput volume, while the libraries are just on the system drive). I don’t know why that file should be there, or how it should get there, but at least it was progress!

The Battle Continues

Alas, I haven’t managed to return to this particular problem just yet. The urgency has diminished somewhat (the service is generally running a lot better after the latest round of hardware upgrades), and I have been distracted by other things (our Octopus Deploy slowing down our environment provisioning because it is underpowered), so it has fallen by the wayside.

However, I have plans to continue the investigation soon. Once I get to the root of the issue, I will likely make yet another post about RavenDB, hopefully summarising the entire situation and how it was fixed.

Software developers, perpetually hopeful…