
As you may (or may not) have noticed, I’ve published exactly zero blog posts over the last 3 weeks.

I was on holidays, and it was glorious.

Well, actually it was filled with more work than I would have liked (both from the job I was actually on holidays from, as well as some other contract work I do for a company called MEDrefer), but it was still nice to be the master of my own destiny for a little while.

Anyway, I’m back now and everything is happening all at once, as these things sort of do.

There are three things going on right now: tutoring at QUT, progress on the RavenDB issue I blogged about, and some work I’m doing towards replacing RavenDB altogether (just in case). I’ll give each of those a brief explanation below. I’ve also been doing some work on running WebdriverIO tests from TeamCity via PowerShell (and including the results), as well as fixing an issue with Logstash on Windows where you can’t easily configure it not to do a full memory dump whenever it crashes (and it crashes a lot!).

Without further ado, on with the show!

How Can I Reach These Kids?

It’s that time of the year when I start up my fairly regular Agile Project Management tutoring gig at QUT (they’ve changed the course code to IAB304 for some ungodly reason this semester, but it’s basically the same thing), so I’ve got that to look forward to. Unfortunately they are still using the DSDM material, but at least it’s changed somewhat to be more closely aligned with Scrum than with some old-school project management/agile hybrid.

QUT is also offering sessional academics workshops on how to be a better teacher/tutor, which I plan on attending. There are 4 different workshops being run over the next few months, so I might follow each one with a blog post outlining anything interesting that was covered.

I enjoy tutoring at QUT on multiple levels, even if the bureaucracy there drives me nuts. It gives me an opportunity to really think about what it means to be Agile, which is always a useful thought experiment. Meeting and interacting with people from many diverse backgrounds is also extremely useful for expanding my worldview, and I enjoy helping them understand the concepts and principles in play, and how they benefit both the practitioner and whatever business they are trying to serve.

The Birds is the Word

The guys at Hibernating Rhinos have been really helpful in getting to the bottom of the most recent RavenDB issue I was having (a resource consumption issue that was preventing me from upgrading the production servers to RavenDB 3). Usually I would write a full post about the subject, but in this particular case it was mostly them investigating the issue and me supplying a large number of memory dumps, exported settings, statuses, metrics and various other bits and pieces.

It turns out the issue was caused by an optimization in RavenDB 3 that caused problems for our particular document/workload profile. I’ve given a better explanation of the issue in the thread I started in the RavenDB Google Group, and Michael Yarichuk (one of the Hibernating Rhinos guys I was working with) has followed that up with even more detail.

I learned quite a few things relating to debugging and otherwise inspecting a running copy of RavenDB, as well as how to properly use the Sysinternals Procdump tool to take memory dumps.

A short summary:

  • RavenDB has stats endpoints which can be hit via a simple HTTP call. {url}/stats and {url}/admin/stats give all sorts of great information, including memory usage and index statistics.
    • I’ve incorporated a regular poll of these endpoints into my Logstash configuration for monitoring our RavenDB instance. It doesn’t exactly insert cleanly into Elasticsearch (too many arrays), but it’s still useful, and allows us to chart various RavenDB statistics through Kibana.
  • RavenDB has config endpoints that show what settings are currently in effect (useful for checking available options and to see if your own setting customizations were applied correctly). The main endpoint is available at {url}/debug/config, but there are apparently config endpoints for specific databases as well. We only use the default system database, and there doesn’t seem to be an endpoint specific to that one.
  • The Sysinternals tool procdump can be configured to take a full memory dump if your process exceeds a certain amount of memory usage. procdump -ma -m 4000 w3wp.exe C:\temp\IIS.dmp will take a full memory dump (as opposed to a minidump) when the w3wp process exceeds 4GB of memory for at least 10 seconds, and put it in the C:\temp directory. It can be configured to take multiple dumps as well, in case you want to track memory growth over time.
    • If you’re trying to get a memory dump of the w3wp process, make sure you turn off pinging for the appropriate application pool, or IIS will detect that it’s frozen and restart it. You can turn off pinging by running the PowerShell command Set-ItemProperty "IIS:\AppPools\{application pool}" -Name processModel.pingingEnabled -Value $false. Don’t forget to turn it back on when you’re done. There’s a rough sketch combining these commands just after this list.
  • Google Drive is probably the easiest way to give specific people access to large (multiple gigabyte) files over the internet. Of course there is also S3 (which is annoying to set permissions on) and FTP/HTTP (which require setting up other stuff), but I definitely found Google Drive the easiest. OneDrive and Dropbox would probably be similarly easy.
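
For my own future reference, here’s a rough sketch of how those pieces fit together in PowerShell. The URL, application pool name, paths and memory threshold are placeholders for whatever your environment uses, and it assumes procdump is already on the PATH.

    # Snapshot the stats endpoints so there's something to compare against later.
    $ravenUrl = "http://localhost:8080"
    $stamp = Get-Date -Format "yyyyMMdd-HHmmss"
    Invoke-RestMethod "$ravenUrl/stats" | ConvertTo-Json -Depth 10 | Out-File "C:\temp\stats-$stamp.json"
    Invoke-RestMethod "$ravenUrl/admin/stats" | ConvertTo-Json -Depth 10 | Out-File "C:\temp\admin-stats-$stamp.json"

    # Stop IIS from killing the worker process while the dump is being written.
    Import-Module WebAdministration
    Set-ItemProperty "IIS:\AppPools\RavenDB" -Name processModel.pingingEnabled -Value $false

    # Take a full dump once the process holds more than ~4GB of committed memory.
    & procdump -accepteula -ma -m 4000 w3wp.exe C:\temp\IIS.dmp

    # Turn pinging back on once the dump has been captured.
    Set-ItemProperty "IIS:\AppPools\RavenDB" -Name processModel.pingingEnabled -Value $true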

Once Hibernating Rhinos provides a stable release containing the fix, we will no longer be blocked from upgrading our troubled production instance to the latest version of RavenDB, which will hopefully alleviate some of its performance issues.

More to come on this topic as it unfolds.

Quoth The Raven, Nevermore

Finally, I’ve been putting some thought into how we can move away from RavenDB (or at least experiment with moving away from RavenDB), mostly so that we have a backup plan if the latest version does not in fact fix the performance problems that we’ve been having.

We’ve had a lot of difficulty in simulating the same level and variety of traffic that we see in our production environment (which was one of the reasons why we didn’t pick up any of the issues during our long and involved load testing), so I thought, why not just deploy any experimental persistence providers directly into production and watch how they behave?

It’s not as crazy as it sounds, at least in our case.

Our API instances are hardly utilised at all, so we have plenty of spare CPU to play with in order to explore new solutions.

Our persistence layer is abstracted behind some very basic repository interfaces, so all we would have to do is provide a composite implementation of each repository interface that calls both persistence providers but only returns the response from the one that is not experimental, and everything is golden. As long as we log lots of information about the requests being made and how long they took, we can perform all sorts of interesting analysis without ever actually affecting the user experience. There’s a rough sketch of the idea below.
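
To make the composite idea a bit more concrete, here’s a very rough sketch. It’s written in PowerShell purely to keep the snippets in this post consistent; the real repositories are C# interfaces, and all of the names below are made up.

    # Hypothetical composite repository: every call goes to both providers, but only the
    # trusted (non-experimental) provider's response is ever returned to the caller.
    class CompositeEntityRepository {
        [object] $Trusted        # the existing RavenDB-backed implementation
        [object] $Experimental   # the candidate replacement being evaluated

        CompositeEntityRepository([object] $trusted, [object] $experimental) {
            $this.Trusted = $trusted
            $this.Experimental = $experimental
        }

        [object] GetById([string] $id) {
            # Time the experimental call and swallow its failures so users never notice.
            $timer = [System.Diagnostics.Stopwatch]::StartNew()
            try {
                $null = $this.Experimental.GetById($id)
                Write-Verbose ("Experimental GetById({0}) succeeded in {1} ms" -f $id, $timer.ElapsedMilliseconds)
            }
            catch {
                Write-Verbose ("Experimental GetById({0}) failed after {1} ms: {2}" -f $id, $timer.ElapsedMilliseconds, $_)
            }

            # The caller only ever sees the trusted provider's answer.
            return $this.Trusted.GetById($id)
        }
    }

In the real thing, the timing information would go out through our normal logging pipeline (and into Elasticsearch) rather than the verbose stream, and ideally the experimental call would happen off the critical path so it can never add latency to a request.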

Well, that’s the idea anyway. Whether or not it actually works is a whole different question.

I’ll likely make a followup post when I finish exploring the idea properly.

Summary

As good as my kinda-holidays were, it feels nice to be back in the thick of things, smiting problems and creating value.

I’m particularly looking forward to exploring a replacement for RavenDB in our troublesome service, because while I’m confident that the product itself is solid, it’s not something we’re very familiar with, so we’ll always be struggling to make the most of it. We don’t use it anywhere else (and are not planning on using it again), so it’s stuck in this weird place where we aren’t good at it and have little desire to get better at it in the long run.

It was definitely good to finally get to the bottom of why the new and shiny version of RavenDB was misbehaving so badly though, because most of the time when I have a problem with a product like that, I assume it’s the way I’m using it, not the product itself.

Plus, as a general rule of thumb, I don’t like it when mysteries remain unsolved. It bugs me.

Like why Firefly was cancelled.

Who does that?


The last post I made about our adventures with RavenDB outlined the plan: upgrade to RavenDB 3. First step? Take two copies of our production environment, leave one at RavenDB 2.5 and upgrade the other to RavenDB 3. Slam both with our load tests in parallel and then see which one has better performance by comparing the Kibana dashboards for each environment (which show things like CPU usage, request latency, disk latency, etc.).

The hope was that RavenDB 3 would show lower resource usage and better performance all round using approximately the same set of data and for the same requests. This would give me enough confidence to upgrade our production instance and hopefully mitigate some of the issues we’ve been having.

Unfortunately, that’s not what happened.

Upgrade, Upgrade, Upgrade, Upgrade!

Actually upgrading to RavenDB 3 was painless. For RavenDB 2.5 we build a NuGet package that contains all of the necessary binaries and configuration, along with PowerShell scripts that set up an IIS website and application pool automatically on deployment. RavenDB 3 works in a very similar way, so all I had to do was reconstruct the package so that it worked in the same way, just with the newer binaries. It was a little bit fiddly (primarily because of how we constructed the package the first time), but it was relatively easy.
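
The IIS portion of those scripts boils down to something like this simplified sketch (the pool name, site name, path and port are placeholders, and the real scripts are driven by Octopus Deploy variables).

    # Create (if necessary) a dedicated application pool and website for RavenDB.
    Import-Module WebAdministration

    $poolName = "RavenDB"
    $siteName = "RavenDB"
    $sitePath = "C:\Applications\RavenDB\Web"

    if (-not (Test-Path "IIS:\AppPools\$poolName")) {
        New-WebAppPool -Name $poolName | Out-Null
    }
    Set-ItemProperty "IIS:\AppPools\$poolName" -Name managedRuntimeVersion -Value "v4.0"

    if (-not (Test-Path "IIS:\Sites\$siteName")) {
        New-Website -Name $siteName -Port 8080 -PhysicalPath $sitePath -ApplicationPool $poolName | Out-Null
    }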

Even better, the number of binaries and dependencies for RavenDB 3 is lower than for RavenDB 2.5, which is always nice to see. Overall I think the actual combined size may have increased, but it’s still nice to have a smaller number of files to manage.

Once I had the package built, all I had to do was deploy it to the appropriate environment using Octopus Deploy.

I did a simple document count check before and after and everything was fine, exactly the same number of documents was present (all ~100K of them).
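
The check itself was nothing fancy; something along these lines, assuming the stats payload exposes a document count (CountOfDocuments is the field I recall seeing, but verify against your own version’s output). The URL is a placeholder.

    # Compare document counts before and after the upgrade.
    $statsUrl = "http://raven-environment:8080/stats"
    $before = (Invoke-RestMethod $statsUrl).CountOfDocuments
    # ... upgrade RavenDB and redeploy ...
    $after = (Invoke-RestMethod $statsUrl).CountOfDocuments
    "Documents before: $before, documents after: $after"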

Resource usage was nominal during this upgrade and basically non-existent afterwards.

Time to simulate some load.

What a Load

I’ve written previously about our usage of JMeter for load tests, so all I had to do was reuse the structure I already had in place. I recently did some refactoring in the area as well, so it was pretty fresh in my mind (I needed to extract some generic components from the load tests repository so that we could reuse them for other load tests). I set up a couple of JMeter worker environments in AWS and started the load tests.
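
Kicking the tests off is really just a matter of pointing JMeter (in non-GUI mode) at the test plan and the worker instances, roughly like the following. The test plan, worker addresses and results path are placeholders, and I’m assuming the JMeter binaries (jmeter.bat on Windows) are already on the PATH.

    # Run the test plan in non-GUI mode, distributed across the JMeter workers in AWS.
    $workers = "10.0.1.10,10.0.1.11"
    & jmeter -n `
        -t .\load-tests\mobile-service.jmx `
        -R $workers `
        -l ".\results\run-$(Get-Date -Format yyyyMMdd-HHmmss).jtl"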

Knowing what I do now I can see that the load tests that I originally put together don’t actually simulate the real load on the service. This was one of the reasons why our initial, intensive load testing did not find any of the issues with the backend that we found in production. I’d love to revisit the load profiles at some stage, but for now all I really needed was some traffic so that I could compare the different versions of the persistence store.

RavenDB 2.5 continued to do what it always did when the load tests were run. It worked just fine. Minimal memory and CPU usage, disk latency was low, all pretty standard.

RavenDB 3 ate all of the memory on the machine (16GB) over the first 10-15 minutes of the load tests. This caused disk thrashing on the system drive, which in turn annihilated performance and eventually the process crashed and restarted.

Not a good sign.

I’ve done this test a few times now (upgrade to 3, run load tests) and each time it does the same thing. Sometimes after the crash it starts working well (minimal resource usage, good performance), but sometimes even when it comes back from the crash it does the exact same thing again.

Time to call in the experts, i.e. the people that wrote the software.

Help! I Need Somebody

We don’t currently have a support contract with Hibernating Rhinos (the makers of RavenDB). The plan was to upgrade to RavenDB 3 (based on the assumption that it’s probably a better product), and if our problems persisted, to enter into a contract for dedicated support.

Luckily, the guys at Hibernating Rhinos are pretty awesome and interact regularly with the community at the RavenDB Google Group.

I put together a massive post describing my current issue (mentioning the history of issues we’ve had to try and give some context), which you can find here.

The RavenDB guys responded pretty quickly (the same day in fact) and asked for some more information (understandably). I re-cloned the environment (to get a clean start) and did it again, except this time I was regularly extracting statistics from RavenDB (using the /stats and /admin/stats endpoints), as well as dumping the memory when it got high (using procdump) and using the export debug information functionality built into the new Raven Studio (which is so much better than the old studio that it’s not funny). I packaged all of this information together with the RavenDB log files and posted a response.
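
The statistics extraction was just a dumb loop left running on the server while I reproduced the issue, so there was a timeline of snapshots to line up against the memory dumps. Something like this, with the URL, output directory and interval as placeholders.

    # Snapshot both stats endpoints every minute until stopped (Ctrl+C).
    $ravenUrl = "http://localhost:8080"
    $outDir = "C:\temp\raven-stats"
    New-Item -ItemType Directory -Path $outDir -Force | Out-Null

    while ($true) {
        $stamp = Get-Date -Format "yyyyMMdd-HHmmss"
        foreach ($endpoint in "stats", "admin/stats") {
            $name = $endpoint -replace "/", "-"
            Invoke-RestMethod "$ravenUrl/$endpoint" |
                ConvertTo-Json -Depth 10 |
                Out-File (Join-Path $outDir "$name-$stamp.json")
        }
        Start-Sleep -Seconds 60
    }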

While looking through that information, Oren Eini (the CEO/Founder of Hibernating Rhinos) noticed that there were a number of errors reported around not being able to find a Lucene.Net.dll file on the drive where I had placed the database files (we separated the database files from the libraries; the data lives on a large, high-throughput volume while the libraries are just on the system drive). I don’t know why that file should be there, or how it should get there, but at least it was progress!

The Battle Continues

Alas, I haven’t managed to return to this particular problem just yet. The urgency has diminished somewhat (the service is generally running a lot better after the latest round of hardware upgrades), and I have been distracted by other things (our Octopus Deploy server slowing down our environment provisioning because it is underpowered), so it has fallen by the wayside.

However, I have plans to continue the investigation soon. Once I get to the root of the issue, I will likely make yet another post about RavenDB, hopefully summarising the entire situation and how it was fixed.

Software developers, perpetually hopeful…


A few months ago we released a new service allowing our users to complete some of their work through a mobile application. For an application that is primarily locked to a computer within an office, it was a pretty big improvement. It’s not the first time we’ve tried to add this functionality, but it was one of the better attempts.

That is, until people really started hammering it. Then it went downhill kind of quickly.

Before I started working here, an architectural decision was made to use a document database to store the data for this mobile application. The idea was that the database would be a temporary store, almost like a waystation, allowing two-way synchronization of the data between two applications (the mobile application and the client’s server). Conceptually, it’s not a bad design. It’s not necessarily the design I would have suggested, but it has merit.

The document database selected was RavenDB, specifically version 2.5.

The people who made that particular architectural decision are no longer with the company for various reasons, so it was up to my team and me to complete the work and actually release something. We did our best, and after a fairly lengthy development period followed by an equally lengthy beta period, we released into the wild. As I mentioned above, it started well, but didn’t seem to scale to the number of users we started seeing. I’m not talking hundreds of thousands of users either, just a few hundred, so it definitely wasn’t one of those problems where you are cursed by your own success.

The root cause for the performance problems? It appeared to be RavenDB.

An Unkindness

I always make the assumption that if a commercial component looks like it’s not holding up its end of the bargain, it’s probably not the component’s fault. It’s almost certainly the developer’s fault, because they either configured it wrong or generally did not understand it well enough to know that they were using it in entirely the wrong way.

I think this is true for our problems with RavenDB, but I still don’t know exactly where we went wrong.

I’ll start at the beginning.

The first architectural design had two RavenDB instances in EC2 hidden behind a load balancer. They were configured to replicate to each other. This pair was reduced to a single instance when we discovered that particular structure was causing issues in the system (using my future knowledge, I now know that’s not how RavenDB does redundancy). The intention was that if load testing showed that we had issues with only one instance, we would revisit.

Our load testing picked up a bunch of problems with various things, but at no point was the RavenDB instance the bottleneck, so we assumed we would be okay.

Unfortunately, the load tests were flawed somehow, because once the system started to be used in anger, the RavenDB instance was definitely the bottleneck.

When we released, the database was initially hosted on a t2.medium. These instances are burstable (meaning their CPU can spike), but are limited by CPU credits. It became obvious very quickly that the database was consuming far more CPU credits than we expected (its CPU usage was averaging something like 80%), so we quickly shifted it to an m3.medium (which does not use CPU credits). This worked for a little while, but eventually we started experiencing performance issues again as usage increased. Another shift of the underlying hardware to an m4.large improved the performance somewhat, but not as much as we expected.

When we looked into the issue, we discovered a direct correlation between the latency of requests to the service and the disk latency of the data disk that RavenDB was using for storage. What followed was a series of adjustments to the data disk, mostly related to switching to provisioned IOPS and then slowly scaling it up until disk latency no longer seemed to be the issue.

But we still had performance problems, and at this point the business was (rightly) getting a bit antsy, because users were starting to notice.

After investigation, the new cause of the performance problems seemed to be paging. Specifically, the RavenDB process was consuming more memory than was available and paging to the very slow system drive. Scaling the underlying instance up to an m4.xlarge (for more memory and compute) alleviated this particular problem.

We had a number of other issues as well:

  • Because we host RavenDB in IIS, the default application pool recycle that occurs every 29 hours eventually started happening during our peak times, which didn’t end well. We now schedule the restart for early in the morning (see the sketch after this list). This was made somewhat more difficult by the fact that RavenDB can’t handle overlapping processes (which IIS uses to avoid downtime during restarts).
  • We’ve had the RavenDB process crash from time to time. IIS handles this (by automatically restarting the process), but there is still a period of downtime while the whole thing heats up again.
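
The recycle changes amount to a few property tweaks on the application pool. Roughly the following, though the pool name and recycle time are placeholders, and it’s worth double-checking the property names against your IIS version before relying on them.

    Import-Module WebAdministration
    $pool = "IIS:\AppPools\RavenDB"

    # Turn off the default periodic recycle (1740 minutes, i.e. 29 hours)...
    Set-ItemProperty $pool -Name recycling.periodicRestart.time -Value "00:00:00"

    # ...and recycle at a fixed time early in the morning instead.
    Clear-ItemProperty $pool -Name recycling.periodicRestart.schedule
    New-ItemProperty $pool -Name recycling.periodicRestart.schedule -Value @{value = "03:00:00"}

    # RavenDB can't tolerate two overlapping worker processes, so disable overlapped recycling.
    Set-ItemProperty $pool -Name recycling.disallowOverlappingRotation -Value $true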

That brings us to the present. The service is running well enough, and is pretty performant, but it really does feel like we’ve thrown way too much power at it for what it accomplishes.

Where to Now?

Raven 2.5 is old. Very old.

Our next step is to upgrade to Raven 3, and then directly contrast and compare the performance of the two versions under similar load to see exactly what we’re getting ourselves into.

The logic behind the upgrade is that the newer version is far more likely to have better performance, and we’re far more likely to be able to easily get support for it.

Initial investigations show that the upgrade itself is relatively painless. The older Raven 2.5 client is completely compatible with the new server, so we don’t even need to upgrade the API components yet. All of the data appears to migrate perfectly fine (and seamlessly), so all we need to do is put some effort into comparing performance and then we should be sweet.

Secondly, we’re going to be setting up at least one other Raven instance, primarily as a backup, but maybe also as a load balancing measure. I’ll have to look into it more before we figure out exactly what it’s capable of, but at the very least we need a replicated backup.

Summary

This post was more of an introduction into some of the issues that we’ve been having with the persistence of our service, but it does bring to light some interesting points.

Needing to support something in production is very different from just deciding to use it during development. There is a whole other set of tools and services required before you can successfully run something in production, and a completely different set of expertise and understanding. Because the development process was so smooth (from the persistence point of view), we never really had to dig into the guts of Raven and figure out what it was doing, so we were completely unprepared when everything went to hell during actual usage.

Trial by fire indeed.