
Progress!

With what? With our performance issues with RavenDB, that’s what.

Before I get into that in too much detail, I’ll try to summarise the issue at hand to give some background information. If you’re interested, the full history is available by reading the posts in the RavenDB category.

Essentially, we are using RavenDB as the persistence layer for a service that acts as a temporary waypoint for data, allowing for the connection of two disparate systems that are only intermittently connected. Unfortunately, mostly due to inexperience with the product and possibly as a result of our workload not being a good fit for the technology, we’ve had a lot of performance problems as usage has grown.

We’re not talking about a huge uptick in usage or anything like that either. Usage growth has been slow, and we’re only now dealing with around 1100 unique customer accounts and less than 120 requests per second (on average).

In the time that it's taken the service usage to grow that far, we've had to scale the underlying infrastructure used for the RavenDB database from an m3.medium all the way through to an r3.8xlarge, as well as incrementally providing it with higher and higher provisioned IOPS (6000 at last count). This felt somewhat ridiculous to us, but we did it in order to deal with periods of high request latency, the worst of which seemed to align with when the database process dropped large amounts of memory, and then thrashed the disk in order to refill it.

As far as we could see, we weren’t doing anything special, and we’ve pretty much reached the end of the line as far as throwing infrastructure at the problem, so we had to engage Hibernating Rhinos (the company behind RavenDB) directly to get to the root of the issue.

To their credit, they were constantly engaged and interested in solving the problem for us. We had a few false starts (upgrading from RavenDB 2.5 to RavenDB 3 surfaced an entirely different issue), but once we got a test environment up that they could access, with real traffic being replicated from our production environment, they started to make some real progress.

The current hypothesis? Because our documents are relatively large (50-100KB) and our queries frequent and distributed across the entire set of documents (because it's a polling-based system), RavenDB has to constantly read documents into memory in order to respond to a particular query, and then throw those documents out in order to deal with subsequent queries.

The solution? The majority of the queries causing the problem don’t actually need the whole document to return a result, just a few fields. If we use a static index that stores those fields, we take a large amount of memory churn out of the equation.

Index Fingers Are For Pointing. Coincidence?

The post I wrote about the complications of testing RavenDB due to its eventual consistency model covered indexes a little, so if you want more information, go have a read of that.

Long story short, indexes in RavenDB are the only way to query data if you don’t know the document ID. With an index you are essentially pre-planning how you want to query your documents by marking certain fields as indexed. You can have many indexes, each using different fields to provide different search capabilities.

However, based on my current understanding, all an index does is point you towards the set of documents that are the result of the query. It can’t actually return any data itself, unless you set it up specifically for that.

In our case, we were using a special kind of index called an auto index, which just means that we didn’t write any indexes ourselves, we just let RavenDB handle it (auto indexes, if enabled, are created whenever a query is made that can’t be answered by an existing index, making the system easier to use in exchange for a reduction in control).
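For context, a dynamic query that doesn't name an index looks something like the sketch below (Entity, the field names and the literal value are placeholders, and store is an already initialized document store). If no existing index covers the queried fields, RavenDB spins up an auto index to answer it.

// A dynamic query with no index specified. If nothing suitable already exists,
// RavenDB creates an auto index over the queried fields (named something like
// Auto/Entities/ByFieldB) the first time this runs.
using (var session = store.OpenSession())
{
    var results = session
        .Query<Entity>()
        .Where(x => x.FieldB == "other value")
        .ToList();
}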

As I wrote above, this meant that while the query to find the matching documents was efficient, those documents still needed to be read from disk (if not already in memory), which caused a large amount of memory churn.

The suggestion from Hibernating Rhinos was to create a static index and to store the values of the fields needed for the response in the index itself.

I didn’t even know that was a thing you had to do! I assumed that a query on an index that only projected fields already in the index would be pretty close to maximum efficiency, but field values aren’t stored directly in indexes unless specifically requested. Instead the field values are tokenized in various ways to allow for efficient searching.

Creating an index for a RavenDB database using C# is trivial. All you have to do is derive from a supplied class.

public class Entity_ByFieldAAndFieldB_StoringFieldAAndFieldC : AbstractIndexCreationTask<Entity>
{
    public Entity_ByFieldAAndFieldB_StoringFieldAAndFieldC()
    {
        // FieldC is included in the map output so that it can be stored in the
        // index and projected later without touching the document itself.
        Map = entities =>
            from e in entities
            select new
            {
                FieldA = e.FieldA,
                FieldB = e.FieldB,
                FieldC = e.FieldC
            };

        Stores.Add(i => i.FieldA, FieldStorage.Yes);
        Stores.Add(i => i.FieldC, FieldStorage.Yes);
    }
}

Using the index is just as simple. Just add another generic parameter to the query that specifies which index to use.

// FieldA, FieldB and the literal values are placeholders for the real query.
RavenQueryStatistics stats;

var query = _session
    .Query<Entity, Entity_ByFieldAAndFieldB_StoringFieldAAndFieldC>()
    .Statistics(out stats)
    .Where(x => x.FieldA < "value" && x.FieldB == "other value")
    .OrderBy(x => x.FieldA)
    .Select(e => new ProjectedModel { FieldA = e.FieldA, FieldC = e.FieldC });

As long as all of the fields that you’re using in the projection are stored in the index, no documents need to be read into memory in order to answer the query.
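One way to sanity check that the static index is actually the one answering the query (rather than Raven quietly falling back to an auto index) is to look at the query statistics after execution; a small sketch, reusing the query and stats variables from the snippet above:

// Executing the query populates the statistics, and IndexName reports
// which index actually answered it.
var results = query.ToList();
Console.WriteLine("Answered by: {0}", stats.IndexName);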

The last thing you need to do is actually install the index, so I created a simple RavenDbInitializer class that gets instantiated and executed during the startup of our service (Nancy + Ninject, so it gets executed when the RavenDB document store singleton is initialized).

public class RavenDbInitializer
{
    public void Initialize(IDocumentStore store)
    {
        store.ExecuteIndex(new Entity_ByFieldAAndFieldB_StoringFieldAAndFieldC());
    }
}
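As an aside, if the number of indexes grows beyond one or two, the RavenDB client also provides an IndexCreation helper that scans an assembly for index creation tasks and executes them all; a minimal sketch of what the initializer might look like using that approach:

using Raven.Client;
using Raven.Client.Indexes;

public class RavenDbInitializer
{
    public void Initialize(IDocumentStore store)
    {
        // Creates (or updates) every AbstractIndexCreationTask found in this assembly.
        IndexCreation.CreateIndexes(typeof(RavenDbInitializer).Assembly, store);
    }
}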

All the automated tests worked as expected so the only thing left was to validate any performance improvements.

Unfortunately, that’s where it got hard.

Performance Problems

I mentioned earlier that the way in which we got Hibernating Rhinos to eventually propose this hypothesis was to give them access to a production replica with production traffic being mirrored to it. The traffic replication is set up using a neat little tool called Gor, and when we set this replication up, we installed Gor on the database machine directly, because we were only interested in the traffic going to the database.

When it came time to test the optimizations outlined above, I initially thought that I would be able to use the same traffic replication to directly compare the old approach with the new static index based one.

This was not the case.

Because we were only replicating the database traffic (not the API traffic), all of the queries being replicated to the test database had no reference to the new index (when you make a query to Raven, part of the query is specifying which index should be used).

Normally I would use some set of load tests to compare the two approaches, but we've been completely unable to replicate enough traffic with the right shape to cause performance problems on the Raven database. It's something we need to return to eventually, but I don't have time to design and create a full load profile at this point (also, considering I was the one who wrote the first one, the one that failed miserably to detect and predict any of these issues, I know how complicated and time consuming it can be).

I could see two options to test the new approach at this point:

  1. Add traffic replication to the API instances. This would allow me to directly compare the results of two environments receiving the same traffic, and the new environment would be hitting the index correctly (because it would have a new version of the API).
  2. Just ship it at a time when we have some traffic (which is all the time), but when users are unlikely to notice if anything goes wrong.

I’m in the process of doing the first option, with the understanding that if I don’t get it done before this weekend, I’m just going to deploy the new API on Sunday and watch what happens.

I’ll let you know how it goes.

Conclusion

The sort of problem that Hibernating Rhinos identified is a perfect example of unintentionally creating issues with a technology simply due to lack of experience.

I had no idea that queries hitting an index and only asking for fields in the index would have to load the entire document before projecting those fields, but now that I know, I can understand why such queries could cause all manner of problems when dealing with relatively large documents.

The upside of this whole adventure is that the process of creating and managing indexes for RavenDB in C# is amazingly developer friendly. It's easily the best experience I've had working with a database level optimization in a long time, which was a nice change.

As I've repeatedly said, there is nothing inherently wrong with RavenDB as a product. It's easy to develop on and work with (especially with the latest version and its excellent built-in management studio). The problem is that someone made an arbitrary decision to use it without really taking all of the factors into consideration, and the people that followed that person (who left before even seeing the project come to fruition) had no idea what was going on.

This has left me in an awkward position, because even if I fix all the problems we’ve had, reduce the amount of infrastructure required and get the whole thing humming perfectly, no-one in the organisation will ever trust the product again, especially at a business level.

Which is a shame.