
As should be somewhat obvious by now, I’m a big fan of logging. I try to justify that interest in a logical fashion, citing the intelligence benefits, but really I just find it personally interesting, and as a result it tends to be something that I focus on when developing software.

Log aggregation as a concept is still pretty new to me though. It wasn’t until a few years ago that I became aware it even existed, even though something always lingered in the corners of my mind about the usefulness of logs that were trapped on hundreds of different machines and environments. The idea of taking the logs from a whole bunch of different locations (servers, client machines, everything), destructuring them and putting them all into a centralised, searchable location kind of blew my mind to be honest, and it’s been something of a passion ever since.

In an interesting twist of fate, my first exposure to log aggregation involved RavenDB. One of the previous developers had set up a RavenDB database to act as the destination for log events coming out of a piece of desktop software, and when I realised what it was, I was intent on using it to understand how the software was being used.

The idea was amazing.

The implementation?

Not so much.

A Missed Opportunity

As I mentioned above, at some point in the past the development team identified a need to raise and aggregate structured log events for a variety of reasons, like business intelligence, error reporting, general logging and so on. They picked RavenDB as the storage location, probably because it was so easy to use from a development point of view, and trivial to integrate into a C# application using the RavenDB client to write arbitrary plain old C# objects with no real messing around.

A simple custom logger interface was created, which would expose the capability to write log events that derived from some common base, and the implementation of that interface used the RavenDB client. From a storage point of view, staging and production instances of RavenDB were provisioned inside AWS, and exposed to the greater internet via an Elastic Load Balancer. The very next release of the software was writing all sorts of interesting events into RavenDB with very little difficulty, like notifications that the application was being started (including things like who started it and what version it was on) and errors that were occurring during normal usage.
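
Something like the following sketch captures the shape of that arrangement (the names, the base class and the session usage are all illustrative assumptions, not the actual implementation):

public abstract class LogEvent
{
    // Common base for all log events; the actual properties are assumptions.
    public DateTime Timestamp { get; set; }
    public string ApplicationVersion { get; set; }
}

public interface IEventLogger
{
    void Log(LogEvent logEvent);
}

public class RavenDbEventLogger : IEventLogger
{
    private readonly IDocumentStore _store;

    public RavenDbEventLogger(IDocumentStore store)
    {
        _store = store;
    }

    public void Log(LogEvent logEvent)
    {
        // The RavenDB client will happily persist a plain old C# object as a document.
        using (var session = _store.OpenSession())
        {
            session.Store(logEvent);
            session.SaveChanges();
        }
    }
}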

It was all very useful.

That is, until you tried to actually search the resulting log events and glean some information from them.

This was back in RavenDB 2.X, so the default UI was the Silverlight implementation of the RavenDB Studio, which wasn’t exactly the most amazing piece of software.

Even ignoring the usability problems, RavenDB 2.X did not have good support for ad-hoc queries, and no one had thought ahead and set up fields for indexing. This meant that the wealth of information available was damn near unusable.

How Could You Not Call It Muninn?

Shortly after I started and found out about all of this awesome information going to waste, I put together an ELK stack and hacked together a bridge between the two.

Whenever a log event was written into RavenDB, a small memory resident application would automatically write it out to a JSON file, and Logstash would pick it up and put it into the ELK stack.
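
Conceptually, that amounts to something like the following sketch (the polling approach, the names and the use of Json.NET are assumptions rather than a record of the actual implementation, and LogEvent is the hypothetical type from the earlier sketch):

public class RavenToLogstashBridge
{
    private readonly IDocumentStore _store;
    private readonly string _outputPath;
    private DateTime _lastExported = DateTime.MinValue;

    public RavenToLogstashBridge(IDocumentStore store, string outputPath)
    {
        _store = store;
        _outputPath = outputPath;
    }

    // Called periodically; exports any events newer than the last one seen.
    public void PollOnce()
    {
        using (var session = _store.OpenSession())
        {
            var newEvents = session.Query<LogEvent>()
                .Where(e => e.Timestamp > _lastExported)
                .OrderBy(e => e.Timestamp)
                .ToList();

            foreach (var logEvent in newEvents)
            {
                // One JSON object per line (via Json.NET); Logstash's file input does the rest.
                File.AppendAllText(_outputPath, JsonConvert.SerializeObject(logEvent) + Environment.NewLine);
                _lastExported = logEvent.Timestamp;
            }
        }
    }
}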

With the new ELK stack in place, whenever we wrote new code we would ensure that it could aggregate its logs as necessary, augmenting the system with new capabilities each time. In the end it was mostly just log files being picked up and processed by local Logstash instances, but we also processed ELB logs in Lambda by posting directly to Logstash over HTTP, and we exposed a public facing authenticated Logging API for software to aggregate log events from outside of our little walled garden in AWS.

The rest is history, and we’ve been using ELK ever since.

But we never changed the initial piece of desktop software, and it’s been happily writing into RavenDB this whole time, with the events being replicated into the ELK stack via that “temporary” hack.

It’s The Perfect Name

Don’t get me wrong, the system worked, and log events from the desktop software were available in our ELK stack; it just felt so inefficient to have this whole other system in place providing no real additional value.

Not only that, while it did technically work, there were definitely times when that particular log event channel failed spectacularly, from full disks (no-one had ever implemented maintenance or cleanup of any sort on the RavenDB database) through to performance problems similar to the ones that we saw in one of our APIs.

The last straw was when I did an audit on our AWS costs and discovered that we were actually paying a non-trivial amount of cash for the RavenDB logging infrastructure and that just seemed wrong (plus, if I recovered that cash I could put it into other things, like the ELK stack).

All told, there were more reasons to get rid of it than there were to keep it.

But how?

Well, as I mentioned earlier, when the development team initially implemented the log aggregation, they put a simple logging interface in place, with the RavenDB specific implementation behind that interface.

Switching the logging to use our Serilog based HTTP sink (which connected to our authenticated Logging API) was as simple as providing a new implementation of the old logging interface, so even though its usage infested the entire system, we really only had to change it once at the top level.

Of course, it wasn’t quite that simple (references to RavenDB were strewn through at least 3 different libraries that dealt with logging, so they had to be removed), but all told it was a hell of a lot easier than it would have been without that abstraction.
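
As a rough sketch, reusing the hypothetical IEventLogger from earlier, the replacement implementation amounted to little more than this (the sink configuration is an assumption):

public class SerilogEventLogger : IEventLogger
{
    private readonly ILogger _logger;

    public SerilogEventLogger(ILogger logger)
    {
        // The logger is configured elsewhere with the HTTP sink pointing
        // at the authenticated Logging API.
        _logger = logger;
    }

    public void Log(LogEvent logEvent)
    {
        // The @ tells Serilog to destructure the event into structured properties.
        _logger.Information("{@LogEvent}", logEvent);
    }
}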

Obviously we want to switch to using Serilog directly (rather than a custom logging abstraction that provides no real value), but that can happen slowly, bit by bit, and the flow of log events will be maintained the entire time.

Summary

Unfortunately, we can’t actually decommission the RavenDB logging infrastructure just yet, as we have to wait until everyone in the wild upgrades to the latest version or we risk missing out on important intelligence from the slackers. I’m keeping a close eye on the total number of logs received through the old system, and the moment it drops below a certain threshold, I’m going to nuke the entire thing.

To be honest, the removal of the RavenDB logging channel that I described above makes me happier than it really has cause to. It’s been bugging me for a while (mostly from an engineering point of view), so it felt really good to get rid of it in the latest release of our software. That sort of maintenance and consolidation doesn’t always have a direct and measurable impact on the bottom line, but it’s still important for the system as a whole and important for engineering morale.

It’s like ripping out a mostly dead plant from a garden.

Better to make room for something that might actually be useful.


I’ve mentioned a number of times about the issues we’ve had with RavenDB and performance.

My most recent post on the subject talked about optimizing some of our most common queries by storing data in the index, to avoid having to retrieve full documents only to extract a few fields. We implemented this solution as a result of guidance from Hibernating Rhinos, who believed that it would decrease the amount of memory churn on the server, because it would no longer need to constantly switch the set of documents currently in memory in order to answer incoming queries. Made sense to me, and they ran some internal tests showing that indexes which stored the indexed fields dramatically reduced the memory footprint.

Alas, it didn’t really help.

Well, that might not be fair. It didn’t seem to help, but that may have been because it only removed one bottleneck in favour of another.

I deployed the changes to our production environment a few days ago, but the server is still using a huge amount of memory, still regularly dumping large chunks of memory and still causing periods of high latency while recovering from the memory dump.

Unfortunately, we’re on our own as far as additional optimizations go, as Hibernating Rhinos have withdrawn from the investigation. Considering the amount of time and money they’ve put into this particular issue, I completely understand their decision to politely request that we engage their consulting services in order to move forward with additional improvements. Their entire support team was extremely helpful and engaged whenever we talked to them, and they did manage to fix a few bugs on the way through, even if it was eventually decided that our issues were not the result of a bug, but simply because of our document and load profile.

What Now?

It’s not all doom and gloom though, as the work on removing orphaned/abandoned data is going well.

I ran the first round of cleanup a few weeks ago, but the results were disappointing. The removal of the entities that I was targeting (accounts) and their related data only managed to remove around 8000 useless documents out of a total of approximately 400 000.

The second round is looking much more promising, with current tests indicating that we might be able to remove something like 300 000 of those 400 000 documents, which is a pretty huge reduction. The reason why this service, whose entire purpose is to be a temporary data store, is accumulating documents is currently unknown. I’ll have to get to the bottom of that once I’ve dealt with the immediate performance issues.

The testing of the second round of abandoned document removal is time consuming. I’ve just finished the development of the tool and the first round of testing that validated that the system behaved a lot better with a much smaller set of documents in it (using a partial clone of our production environment), even when hit with real traffic replicated from production, which was something of a relief.

Now I have to test that the cleanup scripts and endpoints work as expected when they have to remove data from the server that might have large amounts of binary data linked to it.

This requires a full clone of the production environment, which is a lot more time consuming than the partial clone, because it also has to copy the huge amount of binary data in S3. On the upside, we have scripts that we can use to do the clone (as a result of the migration work), and with some tweaks I should be able to limit the resulting downtime to less than an hour.

Once I’ve verified that the cleanup works on as real an environment as I can replicate, I’ll deploy it to production and everything will be puppies and roses.

Obviously.

Conclusion

This whole RavenDB journey has been a massive drain on my mental capacity for quite a while now, and that doesn’t even take into account the drain on the business. From the initial failure to load test the service properly (which I now believe was a result of having an insufficient number of documents in the database combined with unrealistic test data), to the performance issues that occurred during the first few months of release (dealt with by scaling the underlying hardware) all the way through to the recent time consuming investigative efforts, the usage of RavenDB for this particular service was a massive mistake.

The disappointing part is that it all could have easily gone the other way.

RavenDB might very well have been an amazing choice of technology for a service focused around being a temporary data store, which may have led to it being used in other software solutions in my company. The developer experience is fantastic; it’s amazingly easy to use once you’ve got the concepts firmly in hand, and it’s very well supported and polished. It’s complicated and very different from the standard way of thinking (especially regarding eventual consistency), so you really do need to know what you’re getting into, and that level of experience and understanding was just not present in our team at the time the choice was made.

Because of all the issues we’ve had, any additional usage of RavenDB would be met with a lot of resistance. It’s pretty much poison now, at least as far as the organisation is concerned.

Software development is often about choosing the right tool for the job, and the far reaching impact of those choices should not be underestimated.


Progress!

With what? With our performance issues with RavenDB, that’s what.

Before I get into that in too much detail, I’ll try to summarise the issue at hand to give some background information. If you’re interested, the full history is available by reading the posts in the RavenDB category.

Essentially, we are using RavenDB as the persistence layer for a service that acts as a temporary waypoint for data, allowing for the connection of two disparate systems that are only intermittently connected. Unfortunately, mostly due to inexperience with the product and possibly as a result of our workload not being a good fit for the technology, we’ve had a lot of performance problems as usage has grown.

We’re not talking about a huge uptick in usage or anything like that either. Usage growth has been slow, and we’re only now dealing with around 1100 unique customer accounts and less than 120 requests per second (on average).

In the time that it’s taken the service usage to grow that far, we’ve had to scale the underlying infrastructure used for the RavenDB database from an m3.medium all the way through to an r3.8xlarge, as well as incrementally providing it with higher and higher provisioned IOPS (6000 at last count). This felt somewhat ridiculous to us, but we did it in order to deal with periods of high request latency, the worst of which seemed to align with when the database process dropped large amounts of memory, and then thrashed the disk in order to refill it.

As far as we could see, we weren’t doing anything special, and we’ve pretty much reached the end of the line as far as throwing infrastructure at the problem, so we had to engage Hibernating Rhinos (the company behind RavenDB) directly to get to the root of the issue.

To their credit, they were constantly engaged and interested in solving the problem for us. We had a few false starts (upgrading from RavenDB 2.5 to RavenDB 3 surfaced an entirely different issue), but once we got a test environment up that they could access, with real traffic being replicated from our production environment, they started to make some real progress.

The current hypothesis? Because our documents are relatively large (50-100KB) and our queries frequent and distributed across the entire set of documents (because its a polling based system), RavenDB has to constantly read documents into memory in order to respond to a particular query, and then throw those documents out in order to deal with subsequent queries.

The solution? The majority of the queries causing the problem don’t actually need the whole document to return a result, just a few fields. If we use a static index that stores those fields, we take a large amount of memory churn out of the equation.

Index Fingers Are For Pointing. Coincidence?

The post I wrote about the complications of testing RavenDB due to its eventual consistency model touched on indexes, so if you want more information, go have a read of that.

Long story short, indexes in RavenDB are the only way to query data if you don’t know the document ID. With an index you are essentially pre-planning how you want to query your documents by marking certain fields as indexed. You can have many indexes, each using different fields to provide different search capabilities.

However, based on my current understanding, all an index does is point you towards the set of documents that are the result of the query. It can’t actually return any data itself, unless you set it up specifically for that.

In our case, we were using a special kind of index called an auto index, which just means that we didn’t write any indexes ourselves; we let RavenDB handle it (auto indexes, if enabled, are created whenever a query is made that can’t be answered by an existing index, making the system easier to use in exchange for a reduction in control).

As I wrote above, this meant that while the query to find the matching documents was efficient, those documents still needed to be read from disk (if not already in memory), which caused a high amount of churn with memory utilization.

The suggestion from Hibernating Rhinos was to create a static index and to store the values of the fields needed for the response in the index itself.

I didn’t even know that was a thing you had to do! I assumed that a query on an index that only projected fields already in the index would be pretty close to maximum efficiency, but field values aren’t stored directly in indexes unless specifically requested. Instead the field values are tokenized in various ways to allow for efficient searching.

Creating an index for a RavenDB database using C# is trivial. All you have to do is derive from a supplied class.

public class Entity_ByFieldAAndFieldB_StoringFieldAAndFieldC : AbstractIndexCreationTask<Entity>
{
    public Entity_ByFieldAAndFieldB_StoringFieldAAndFieldC()
    {
        // FieldA and FieldB are indexed so they can be used in Where/OrderBy clauses.
        // FieldC is included in the map output so that it can be stored (below);
        // a field has to be part of the index entry before it can be stored.
        Map = entities =>
            from e in entities
            select new
            {
                FieldA = e.FieldA,
                FieldB = e.FieldB,
                FieldC = e.FieldC
            };

        // Storing the field values in the index lets queries project them
        // directly, without loading the source documents into memory.
        Stores.Add(i => i.FieldA, FieldStorage.Yes);
        Stores.Add(i => i.FieldC, FieldStorage.Yes);
    }
}

Using the index is just as simple. Just add another generic parameter to the query that specifies which index to use.

RavenQueryStatistics stats;
var query = _session
    .Query<Entity, Entity_ByFieldAAndFieldB_StoringFieldAAndFieldC>()
    .Statistics(out stats)
    .Where(x => x.FieldA == "value" && x.FieldB == "other value")
    .OrderBy(x => x.FieldA)
    .Select(e => new ProjectedModel { FieldA = e.FieldA, FieldC = e.FieldC });

As long as all of the fields that you’re using in the projection are stored in the index, no documents need to be read into memory in order to answer the query.
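
For completeness, the projection target is just a plain class exposing the stored fields (the name and the types here are illustrative):

public class ProjectedModel
{
    public string FieldA { get; set; }
    public string FieldC { get; set; }
}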

The last thing you need to do is actually install the index, so I created a simple RavenDbInitializer class that gets instantiated and executed during the startup of our service (Nancy + Ninject, so it gets executed when the RavenDB document store singleton is initialized).

public class RavenDbInitializer
{
    public void Initialize(IDocumentStore store)
    {
        // Creates the index on the server if it does not already exist
        // (or updates it if the definition has changed), so this is safe
        // to run on every startup.
        store.ExecuteIndex(new Entity_ByFieldAAndFieldB_StoringFieldAAndFieldC());
    }
}

All the automated tests worked as expected so the only thing left was to validate any performance improvements.

Unfortunately, that’s where it got hard.

Performance Problems

I mentioned earlier that the way in which we got Hibernating Rhinos to eventually propose this hypothesis was to give them access to a production replica with production traffic being mirrored to it. The traffic replication is set up using a neat little tool called Gor, and when we set this replication up, we installed Gor on the database machine directly, because we were only interested in the traffic going to the database.

When it came time to test the optimizations outlined above, I initially thought that I would be able to use the same traffic replication to directly compare the old approach with the new static index based one.

This was not the case.

Because we were only replicating the database traffic (not the API traffic), all of the queries being replicated to the test database had no reference to the new index (when you make a query to Raven, part of the query is specifying which index should be used).

Normally I would use some set of load tests to compare the two approaches, but we’ve been completely unable to replicate enough traffic with the right shape to cause performance problems on the Raven database. It’s something we need to return to eventually, but I don’t have time to design and create a full load profile at this point (also, considering I was the one who wrote the first one, the one that failed miserably to detect and predict any of these issues, I know how complicated and time consuming it can be).

I could see two options to test the new approach at this point:

  1. Add traffic replication to the API instances. This would allow me to directly compare the results of two environments receiving the same traffic, and the new environment would be hitting the index correctly (because it would have a new version of the API).
  2. Just ship it at a time when we have some traffic (which is all the time), but where if anything goes wrong, users are unlikely to notice.

I’m in the process of doing the first option, with the understanding that if I don’t get it done before this weekend that I’m just going to deploy the new API on Sunday and watch what happens.

I’ll let you know how it goes.

Conclusion

The sort of problem that Hibernating Rhinos identified is a perfect example of unintentionally creating issues with a technology simply due to lack of experience.

I had no idea that queries hitting an index and only asking for fields in the index would have to load the entire document before projecting those fields, but now that I know, I can understand why such queries could cause all manner of problems when dealing with relatively large documents.

The upside of this whole adventure is that the process of creating and managing indexes for RavenDB in C# is amazingly developer friendly. It’s easily the best experience I’ve had working with a database level optimization in a long time, which was a nice change.

As I’ve repeatedly said, there is nothing inherently wrong with RavenDB as a product. It’s easy to develop on and work with (especially with the latest version and its excellent built-in management studio). The problem is that someone made an arbitrary decision to use it without really taking all of the factors into consideration, and the people that followed that person (who left before even seeing the project come to fruition) had no idea what was going on.

This has left me in an awkward position, because even if I fix all the problems we’ve had, reduce the amount of infrastructure required and get the whole thing humming perfectly, no-one in the organisation will ever trust the product again, especially at a business level.

Which is a shame.


As I mentioned briefly last week, I’m currently working on some maintenance endpoints that allow us to clean up data that has been identified as abandoned in one of our services (specifically, the service backed by a RavenDB database).

The concept at play is not particularly difficult. We have meta information attached to each set of data belonging to a customer, and this meta information allows us to decide whether or not the data (which is supposed to be transitory anyway) should continue to live in the service, or whether it should be annihilated to reduce load.

In building the endpoints that will allow this functionality to be reliably executed, I needed to construct a set of automated tests to verify that the functionality worked as expected with no side-effects. In other words, the normal approach.

Unfortunately, that’s where the challenge mode kicked in.

Like Oil And Water

One of the key concepts in RavenDB is the index.

Indexes are used for every query you do, except for loading a document by ID, finding the set of documents by ID prefix or finding documents by updated timestamp.

Helpfully, RavenDB will create and maintain one or more indexes for you based on your querying profile (automatic indexes), which means that if you don’t mind relinquishing control over the intricacies of your persistence, it will handle all the things for you. Honestly, it does a pretty good job of this, until you do a query that is somewhat different from all the previous queries, and RavenDB creates a new index to handle the query, which results in a massive CPU, memory and disk spike, which craters your performance.
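
For example, with no static index defined, a query as simple as the following is enough to make RavenDB spin up an automatic index behind the scenes (entity and field names are illustrative):

// RavenDB will create an automatic index (named something like
// "Auto/Entities/ByFieldA") the first time it sees this query shape.
var results = session.Query<Entity>()
    .Where(x => x.FieldA == "value")
    .ToList();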

Our inexperience with RavenDB when this service was first built meant that the difference between automatic and static indexes was somewhat lost on us, and the entire service uses nothing but automatic indexes. Being that we don’t allow for arbitrary, user defined queries, this is fine. In fact, unless automatic indexes are somehow at the root of all of our performance problems, we haven’t had any issues with them at all.

A key takeaway about indexes, regardless of whether they are automatic or static, is that they are an eventually consistent representation of the underlying data.

All writes in RavenDB take place in a similar way to writes in most relational databases like SQL Server. Once a write is accepted and completes, the data is available straight away.

The difference is that in RavenDB, that data is only available straightaway if you query for it based off the ID, the prefix (which is part of the ID) or the order in which it was updated.

Any other query must go through an index, and indexes are eventually consistent, with an unknown time delay between when the write is completed and the data is available via the index.

The eventually consistent model has benefits. It means that writes can be faster (because writes don’t have to touch every index before they are “finished”), and it also means that queries are quicker, assuming you don’t mind that the results might not be completely up to date.

That’s the major cost, of course: you trade off speed and responsiveness against the ability to get a guaranteed definitive answer at any particular point in time.
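
To make the distinction concrete, here’s a rough sketch (the entity names are illustrative; the WaitForNonStaleResults customization is a real RavenDB client facility, mostly useful in tests):

// Loading by ID is immediately consistent.
var entity = session.Load<Entity>("entities/1234");

// Querying goes through an index, so the results may be stale...
var results = session.Query<Entity>()
    .Where(x => x.FieldA == "value")
    .ToList();

// ...unless you explicitly wait for the index to catch up.
var fresh = session.Query<Entity>()
    .Customize(c => c.WaitForNonStaleResultsAsOfNow(TimeSpan.FromSeconds(30)))
    .Where(x => x.FieldA == "value")
    .ToList();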

Or Maybe Fire And Gasoline?

Eventual consistency is one of those concepts that infects everything it touches. If you have an API that uses a database that is eventually consistent, that API is also eventually consistent. If your API is eventually consistent, your clients are probably eventually consistent. So on and so forth all the way through the entire system.

If you’re used to relying on something being true after doing it, you have to completely rearrange your thinking, because that is no longer guaranteed to be true.

This is especially painful when it comes to tests, which brings us full circle back to where I started this post.

A standard test consists of some sort of setup (arrange), followed by the execution of some operation or set of operations (act), followed by some validation (assert).

In my case, I wanted to write a Functional Test validating that if I added a bunch of data and then executed the maintenance endpoint, some of that data would be removed (the stuff identified as abandoned) and the rest would still be present. A nice high level test to validate that everything was working together as it should be, to be executed against the API itself (via HTTP calls).

Our functional tests are typically executed against a component deploy via our continuous integration/deployment system. We run Unit and Integration tests after the build, deploy via Octopus to our CI environment, then run Functional tests to validate that all is well. The success of the Functional tests determines whether or not that particular build/package should be allowed to propagate out into our Staging environment and beyond.

The endpoints in play were:

GET /entities
PUT /entities/{id}
GET /admin/maintenance/entities/abandoned
DELETE /admin/maintenance/entities/abandoned

Conceptually, the test would:

  • Seed a bunch of entities to the API (some meeting the abandoned conditions, some not)
  • Run a GET on the maintenance endpoint to confirm that the appropriate number of entities were being identified as abandoned
  • Run a DELETE on the maintenance endpoint, removing a subset of the abandoned entities, returning a response detailing what was removed and how much is left
  • Run a GET on the maintenance endpoint to validate that its response concurred with the DELETE response
  • Repeatedly do DELETE/GET calls to the maintenance endpoint to delete all abandoned entities
  • Run a GET on the entities endpoint to validate that all of the entities that should still be present are, and that all of the entities that should have been deleted are gone

Like I said, a nice, high level test that covers a lot of functionality and validates that the feature is working as it would be used in reality. Combined with a set of Unit and Integration tests, this would provide some assurance that the feature was working correctly.

Unfortunately, it didn’t work.

The Right Answer Is Just A While Loop Away

The reason it didn’t work was that it was written with the wrong mindset. I was expecting to be able to add a bunch of data and have that data be immediately visible.

That’s not possible with an eventually consistent data store like RavenDB.

At first I thought I could just deal with the issue by wrapping everything in a fancy retry loop, to check that it was eventually the value that I wanted it to be. I would be able to use the same test structure that I already had; all I would have to do was make sure each part of it (GET, DELETE, GET) executed repeatedly until it matched the conditions, up to some maximum number of attempts.

To that end I wrote a simple extension method in the flavour of FluentAssertions that would allow me to check a set of assertions against the return value from a Func in such a way that it would repeatedly execute the Func until the assertions were true or it reached some sort of limit.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading;
using FluentAssertions;
using NSubstitute;
using Serilog;

namespace Api.Tests.Shared.Extensions
{
    public static class FluentAssertionsExtensions
    {
        public static void ShouldEventuallyMeetAssertions<TResult>(this Func<TResult> func, Action<TResult>[] assertions, IRetryStrategy strategy = null)
        {
            // If no strategy is supplied, default to 5 attempts, 2 seconds apart.
            strategy = strategy ?? new LinearBackoffWithCappedAttempts(Substitute.For<ILogger>(), 5, TimeSpan.FromSeconds(2));
            var result = strategy.Execute
                (
                    func, 
                    r => 
                    {
                        foreach (var assertion in assertions)
                        {
                            assertion(r);
                        }

                        return true;
                    }
                );

            if (!result.Success)
            {
                throw result.EvaluationErrors.Last();
            }
        }
    }
    
    public interface IRetryStrategy
    {
        RetryResult<TResult> Execute<TResult>(Func<TResult> f, Func<TResult, bool> isValueAcceptable = null);
    }
    
    public class LinearBackoffWithCappedAttempts : IRetryStrategy
    {
        public LinearBackoffWithCappedAttempts(ILogger logger, int maximumNumberOfAttempts, TimeSpan delayBetweenAttempts)
        {
            _logger = logger;
            _maximumNumberOfAttempts = maximumNumberOfAttempts;
            _delayBetweenAttempts = delayBetweenAttempts;
        }

        private readonly ILogger _logger;
        private readonly int _maximumNumberOfAttempts;
        private readonly TimeSpan _delayBetweenAttempts;

        public RetryResult<TResult> Execute<TResult>(Func<TResult> f, Func<TResult, bool> isValueAcceptable = null)
        {
            isValueAcceptable = isValueAcceptable ?? (r => true);
            var result = default(TResult);
            var executionErrors = new List<Exception>();

            var attempts = 0;
            var success = false;
            while (attempts < _maximumNumberOfAttempts && !success)
            {
                try
                {
                    _logger.Debug("Attempting to execute supplied retryable function to get result");
                    result = f();

                    var valueOkay = isValueAcceptable(result);
                    if (!valueOkay) throw new Exception(string.Format("The evaluation of the condition [{0}] over the value [{1}] returned false", isValueAcceptable, result));

                    success = true;
                }
                catch (Exception ex)
                {
                    _logger.Debug(ex, "An error occurred while attempting to execute the retryable function");
                    executionErrors.Add(ex);
                }
                finally
                {
                    attempts++;
                    if (!success) Thread.Sleep(_delayBetweenAttempts);
                }
            }

            var builder = new StringBuilder();
            if (success) builder.Append("Evaluation of function completed successfully");
            else builder.Append("Evaluation of function did NOT complete successfully");

            builder.AppendFormat(" after [{0}/{1}] attempts w. a delay of [{2}] seconds between attempts.", attempts, _maximumNumberOfAttempts, _delayBetweenAttempts.TotalSeconds);
            builder.AppendFormat(" [{0}] total failures.", executionErrors.Count);
            if (executionErrors.Any())
            {
                builder.AppendFormat(" Exception type summary [{0}].", string.Join(", ", executionErrors.Select(a => a.Message)));
            }

            _logger.Information("The retryable function [{function}] was executed as specified. Summary was [{summary}]", f, builder.ToString());

            return new RetryResult<TResult>(success, result, builder.ToString(), executionErrors);
        }
    }

    public class RetryResult<TResult>
    {
        public RetryResult(bool success, TResult result, string summary, List<Exception> evaluationErrors = null)
        {
            Success = success;
            Result = result;
            Summary = summary;
            EvaluationErrors = evaluationErrors;
        }

        public readonly bool Success;
        public readonly TResult Result;
        public readonly List<Exception> EvaluationErrors;
        public readonly string Summary;
    }
}

Usage of the extension method, and some rearranging, let me write the test in a way that was now aware of the eventually consistent nature of the API.
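
For example, a check against the maintenance endpoint might read something like this (the client and helper names are hypothetical):

Func<int> getAbandonedCount =
    () => client.Get("/admin/maintenance/entities/abandoned").Count;

getAbandonedCount.ShouldEventuallyMeetAssertions(
    new Action<int>[]
    {
        count => count.Should().Be(expectedAbandonedCount)
    });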

When I started thinking about it though, I realised that the approach would only work for idempotent requests, which the DELETE verb on the new endpoint was definitely not.

Repeatedly executing the DELETE was changing the output each time, and depending on whether the stars aligned with regards to the RavenDB indexes, it was unlikely to ever meet the set of assertions that I had defined.

Happiness Comes From Compromise

In the end I had to adapt to the concepts at play, rather than adapting the concepts to fit my desired method of testing.

I could still leverage the eventually consistent assertion extension that I wrote, but I had to slightly rethink what I was trying to test.

  • I would still have to seed a bunch of entities to the API (some meeting the abandoned conditions, some not)
  • Instead of doing a single GET, I would have to assert that at least all of my seeded data was eventually present.
  • Instead of running a single DELETE and checking the result, I would have to loop and delete everything until the service told me nothing was left, and then check that it deleted at least all of the things I seeded
  • Instead of doing a final GET to verify, I would have to again assert that at least all of my seeded data that was not supposed to be removed was still present

Subtly different, but still providing a lot of assurance that the feature was working as expected. I won’t copy the full test here because, including the helper methods added to improve readability, it ended up being about 150 lines long.
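
That said, the core delete-until-empty step condenses down to something like this (helper names are hypothetical):

// Keep deleting until the endpoint reports nothing left to remove.
var totalDeleted = 0;
int deletedThisPass;
do
{
    var response = client.Delete("/admin/maintenance/entities/abandoned");
    deletedThisPass = response.DeletedCount;
    totalDeleted += deletedThisPass;
} while (deletedThisPass > 0);

// At least everything we seeded as abandoned should be gone.
totalDeleted.Should().BeGreaterOrEqualTo(seededAbandonedEntities.Count);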

Conclusion

Sometimes it feels like no matter where I look, RavenDB is there causing me some sort of grief. It has definitely given me an appreciation of the simple nature of traditional databases.

This isn’t necessarily because RavenDB is bad (it’s not, even though we’ve had issues with it), but because the eventual consistency model in RavenDB (and in many other similar databases) makes testing more difficult than it has to be (and testing is already difficult and time consuming, even if it does provide a world of benefits).

In this particular case, while it did make testing a relatively simple feature more challenging than I expected, it was my own approach to testing that hurt me. I needed to work with the concepts at play rather than to try and bend them to my preferences.

Like I said earlier, working with an eventually consistent model changes how you interact with a service at every level. It is not a decision that should be taken lightly.

Unfortunately I think the selection of RavenDB in our case was more “shiny new technology” than “this is the right fit for our service”, and that decision was given far less gravitas than it deserved.


I’ve been sick this week, so this post will be incredibly short (and late).

I’ve spoken on and off about some of the issues that we’ve had keeping the RavenDB database behind one of our services running in a maintainable fashion.

I don’t think RavenDB is a bad product though (in fact, I think it’s quite good at what it does); I think the main problem is that it is not a good fit for what we’re trying to do with it, and we don’t really have the expertise required to use it well.

Regardless, we had to scale the underlying infrastructure again recently (from an r3.4xlarge to an r3.8xlarge), and it got me thinking about how well we actually understand the data that is being stored inside the database. The last time we had to scale (before the most recent one), we had something like 120K documents in the system, spread across 700-800 unique clients.

Now? Almost triple that, at 340K, but we’re only up to like 1000 unique clients.

Something didn’t add up.

I mean, the entire concept behind the service is that it is a temporary staging area. It contains no long term storage. The number of documents present should be directly related to the activity around the feature that the service supports, and it was highly unlikely that the feature had become three times as popular in the intervening period.

No Young Waynes Here

The system uses a manifest like concept to aggregate all of the data belonging to a customer’s data set in one easily correlated place (specifically using prefixes on the IDs of the documents). Each manifest (or account) contains some meta information, like the last time any data at all was touched for that account.

It was a relatively simple matter to identify all accounts that had not been touched in the last 30 days. For a system that relies on constant automatic synchronization, if an entire account has not been touched in the last 30 days, it’s a pretty good candidate for having been abandoned for some reason, the most likely of which is that the user has switched to using a different account (they are quite fluid).
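
The query itself is the easy part; a sketch, assuming a hypothetical Account document with a LastTouched field:

var cutoff = DateTime.UtcNow.AddDays(-30);
var untouchedAccountCount = session.Query<Account>()
    .Where(a => a.LastTouched < cutoff)
    .Count();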

I found 410 accounts that were untouched.

There are only 1800 accounts in the system.

The second point of investigation was to look directly at the document type with the highest count.

This document describes something that is scheduled, and abandoned data can be easily found by looking for things that were scheduled more than 30 days ago. Because of the transient nature of the design, if something is still in the system even though it was scheduled for a month in the past, it’s a pretty safe bet that it’s no longer valid.

I said before that there were around 340K documents in the system?

220K were scheduled for so far in the past that they were basically irrelevant.

Conclusion

The findings above made me sad inside, because it means that there is definitely something wrong with the way the service (and the software that uses the service) is managing its data.

I suspect (but can’t back this up) that the amount of useless data present is not helping our performance problems either, so it’s like a double gut-punch.

I suppose, the important thing to take away from this is to never become complacent about the contents of your persistence layer. Regular audits should be executed to make sure you understand exactly what is being stored and why.

Now that I know about the problem, all that’s left is to put together some sort of repeatable mechanism to clean up, and then find and fix the bugs that led to the data accumulating in the first place.

But if I hadn’t looked, we probably would have just accepted that this was the shape of the data.