Last week I wrote a high level overview of the build, package, test and deploy process that we created in order to get .NET Core C# into AWS Lambda.

As part of that post, I briefly touched on the validation part of the proces, which basically amounted to “we wrote some tests in C# using XUnit, then used the dotnet command line interface to run them as part of the build”.

Considering that the testing took a decent amount of the total time, that amount of coverage isn’t really appropriate, so this post will elaborate further, and explain some of the challenges we encountered and how we overcame them.

I’m A Little Testy

At a high level, testing a .NET Core AWS Lambda function written in C# is not really any more difficult than testing other C# code. Like the stuff I’ve been writing for years.

Unit test classes in isolation, integration test classes together, have a few end to end tests (if possible) and you’re probably in a good place.

When you write a lambda function, all you really need is a single class with a method on it (Handler::Handler or equivalent). You could, if you wanted, put literally all of your logic into this class/method combination and have a fully functional lambda function.

That would probably make it harder to test though.

A better idea is to break it down into its constituent parts, ideally taking the form of small classes with a single responsibility, working together to accomplish a greater goal.

For the ELB Logs Processor, there are three clear responsibilities:

  1. Read a file from S3, and expose the information therein as lines of text
  2. Take a line of text and transform it into some sort of structure, potentially breaking it down and adding meta information 
  3. Take a structure and output it to a Logstash endpoint

Technically there is also a fourth responsibility which is to tie the three things above together (i.e. orchestration), but we’ll get to that in a minute.

Anyway, if the responsibilities are clearly separated, we’re in unit testing territory, which is familiar enough that I don’t really want to talk about it too much. Grab your testing framework of choice (XUnit is a good pick for .NET Core, NUnit didn’t seem to work too well in Visual Studio + Resharper), mock out your dependencies and test away.

You can apply the same sort of testing to the orchestrator (the Handler) by giving it its dependencies through its constructor. Make sure you give it a parameterless constructor with default values though, or you’ll have issues running it in AWS.

This is basically just a poor mans dependency injection, but my favour dependency injection library, Ninject, does not exist in .NET Core and I didn’t want to invest the time to learn one of the other ones (like AutoFac) for such a small project.

Of course, while unit tests are valuabl, they don’t really tell us whether or not everything will work when you put it all together.

That’s where integration tests come in.

Integration Functions

In my experience, I’ve consistently found there to be two distinct flavours of integration test.

The first is the test that validates whether or not everything works when all of your things are put together. This tends to end up being a test of your dependency injection configuration (if you’re using it), where you create some top level object using your container of choice, then execute a series of methods on it (and its children) echoing what would normally happen in your program execution. You might mock out the very edges of the system (like databases, API dependencies, disk-based configuration, etc), but in general you want to validate that your tens (hundreds? thousands?) of classes play nice together and actually accomplish their overarching goal.

The second type of test is the one that validates whether or not a class that relies on an external system works as expected. These are very close to unit tests in appearance (single class, some mocked out dependencies), but are much slower and more fragile (due to the reliance on something outside of the current code base). They do have a lot of value though, because beyond application failures, they inform you when your integration with the external dependency breaks for whatever reason.

The ELB Logs Processor features both types of integration test, which is nice, because it makes for a comprehensive blog post.

The simpler integration test is the one for the component that integrates with S3 to download a file and expose it as a series of lines. All you need to do is put a file in a real S3 bucket, then run the class with an appropriate set of parameters to find that file, download it and expose it in the format that is desired. You’re free to simulate missing files (i.e. go get this file that doesn’t exist) and other error cases as desired as well, which is good.

The other integration test involves the Handler, using its default configuration and a realistic input. The idea would be to run the Handler in a way that is as close to how it will be run in reality as possible, and validate that it does what it advertises. This means we need to put some sort of sample file in S3, fire an event at the Handler describing that file (i.e. the same event that it would receive during a normal trigger in AWS), validate that the Handler ran as expected (perhaps measuring log output or something) and then validate that the events from the sample file were available in the destination ELK stack. A complicated test for sure, but one with lots of value.

As is always the case, while both of those tests are conceptually fairly simple and easy to spell out, they both have to surmount the same challenge.

AWS Authentication and Authorization.

401, Who Are You?

The various libraries supplied by AWS (Javascript, Powershell, .NET, Python, etc) all handle credential management in a similar way.

They use a sort of priority system for determining the source of their credentials, and it looks something like this:

  1. If credentials have been supplied directly in the code (i.e. key/secret), use those
  2. Else, if a profile has been supplied, look for that and use it
  3. Else, if no profile has been supplied, use the default profile
  4. Else, if environment variables have been set, use those
  5. Else, use information extracted from the local machine (i.e. IAM profiles or EC2 meta information)

The specific flow differs from language to language (.NET specific rules can be found here for example), but the general flow is pretty consistent.

What does that mean for our tests?

Well, the actual Lambda function code doesn’t need credentials at all, because the easiest way to run it is with an IAM profile applied to the Lambda function itself in AWS. Assuming no other credentials are available, this kicks in at option 5 in the list above, and is by far the least painful (and most secure) option to use.

When running the tests during development, or running them on the build server, there is no ambient IAM profile available, so we need to find some other way to get credentials to the code.

Luckily, the .NET AWS library allows you to override the logic that I described above, which is a pretty great (even though it does it with static state, which makes me sad).

With a little work we can nominate an AWS profile with a specific name to be our current “context”, and then use those credentials to both setup a temporary S3 bucket with a file in it, and to run the code being tested, like in the following Handler test.

/// <summary>
/// This test is reliant on an AWS profile being setup with the name "elb-processor-s3-test" that has permissions
/// to create and delete S3 buckets with prefixes oth.console.elb.processor.tests. and read items in those buckets
/// </summary>
public void WhenHandlingAValidS3Event_ConnectsToS3ToDownloadTheFile_AndOutputsEventsToLogstash()
    var inMemoryLogger = new InMemoryLogger();
    var xunitLogger = new XUnitLogger(_output);
    var logger = new CompositeLogger(inMemoryLogger, xunitLogger);

    var application = Guid.NewGuid().ToString();
    var config = new Dictionary<string, string>
        { Configuration.Keys.Event_Environment, Guid.NewGuid().ToString() },
        { Configuration.Keys.Event_Application, application},
        { Configuration.Keys.Event_Component, Guid.NewGuid().ToString() },
        { Configuration.Keys.Logstash_Url, "{logstash-url}" }
    using (AWSProfileContext.New("elb-processor-s3-test"))
        logger.Log("Creating test bucket and uploading test file");
        var s3 = new AmazonS3Client(new AmazonS3Config { RegionEndpoint = RegionEndpoint.APSoutheast2 });
        var testBucketManager = new TestS3BucketManager(s3, "oth.console.elb.processor.tests.");
        using (var bucket = testBucketManager.Make())
            var templateFile = Path.Combine(ApplicationEnvironment.ApplicationBasePath, @"Helpers\Data\large-sample.log");
            var altered = File.ReadAllText(templateFile).Replace("@@TIMESTAMP", DateTime.UtcNow.ToString("yyyy-MM-ddTHH:mm:ss.ffffffZ"));
            var stream = new MemoryStream(Encoding.UTF8.GetBytes(altered));
            var putResult = s3.PutObjectAsync(new PutObjectRequest { BucketName = bucket.Name, InputStream = stream, Key = "test-file" }).Result;

            var message = new S3EventBuilder().ForS3Object(bucket.Name, "test-file").Build();

            logger.Log("Executing ELB logs processor");
            var handler = new Handler(logger, config);

    // Check that there are some events in Elasticsearch with our Application
    Func<long> query = () =>
        var client = new HttpClient {BaseAddress = new Uri("{elasticsearch-url}")};
        var raw = client.GetStringAsync($"/logstash-*/_search?q=Application:{application}").Result;
        dynamic result = JsonConvert.DeserializeObject(raw);
        return (long) result.hits.total;

    query.ResultShouldEventuallyBe(hits => hits.Should().BeGreaterThan(0, $"because there should be some documents in Elasticsearch with Application:{application}"));

The code for the AWSProfileContext class is relatively simple, in that while it is in effect it inserts the profile into the first position of the credential fallback process, so that it takes priority over everything else. When disposed, it removes that override, returning everything to the way it was. A similar dispose-delete pattern is used for the temporary S3 bucket.

With a profile override in place, you can run the tests in local development just by ensuring you have a profile available with the correct name, which allows for a nice smooth development experience once a small amount of manual setup has been completed.

To run the tests on the build servers, we put together a simple helper in Powershell that creates an AWS profile, runs a script block and then removes the profile. The sensitive parts of this process (i.e. the credentials themselves) are stored in TeamCity, and supplied when the script is run, so nothing dangerous needs to go into source control at all.

To Be Continued

The most complicated part of testing the ELB Logs Processor was getting the AWS credentials working consistently across a few different execution strategies, while not compromising the safety of the credentials themselves. The rest of the testing is pretty just vanilla C# testing.

The only other thing worth mentioning is that the AWS libraries available for .NET Core aren’t quite the same as the normal .NET libraries, so some bits and pieces didn’t work as expected. The biggest issue I ran into was the how a lot of the functionality for managing credential is typically configured through the app.config, and that sort of thing doesn’t even seem to exist in .NET Core. Regardless, we still got it work in the end, even though we did have to use the obsolete StoredProfileAWSCredentials class.

Next week will be the last post in this series, drawing everything together and focusing on actually running the new .NET Core ELB Logs Processor in a production environment.


Accidentally breaking backwards compatibility is an annoying class of problem that can be particularly painful to deal with once it occurs, especially if it gets out into production before you realise.

From my experience, the main issue with this class of problem is that its unlikely to be tested for, increasing the risk that when it does occur, you don’t notice until its dangerously late.

Most of the time, you’ll write tests as you add features to a piece of software, like an API. Up until the point where you release, you don’t really need to think too hard about breaking changes, you’ll just do what is necessary to make sure that your feature is working right now and then move on. You’ll probably implement tests for it, but they will be aimed at making sure its functioning.

The moment you release though, whether it be internally or externally, you’ve agreed to maintain whatever contract you’ve set in place, and changing this contract can have disastrous consequences.

Even if you didn’t mean to.

Looking Backwards To Go Forwards

In my case, I was working with that service that uses RavenDB, upgrading it to have a few new endpoints to cleanup orphaned and abandoned data.

During development I noticed that someone had duplicated model classes from our models library inside the API itself, apparently so they could decorate them with attributes describing how to customise their serialisation into RavenDB.

As is normally the case with this sort of thing, those classes had already started to diverge, and I wanted to nip that problem in the bud before it got any worse.

Using attributes on classes as the means to define behaviour is one of those things that sounds like a good idea at the time, but quickly gets troublesome as soon as you want to apply them to classes that are in another library or that you don’t control. EF is particularly bad in this case, featuring attributes that define information about how the resulting data schema will be constructed, sometimes only relevant to one particular database provider. I much prefer an approach where classes stand on their own and there is something else that provides the meta information normally provided by attributes. The downside of this of course is that you can no longer see everything to do with the class in one place, which is a nicety that the attribute approach provides. I feel like the approach that allows for a solid separation of concerns is much more effective though, even taking into account the increased cognitive load.

I consolidated some of the duplicate classes (can’t fix everything all at once) and ensured that the custom serialisation logic was maintained without using attributes with this snippet of code that initializes the RavenDB document store.

public class RavenDbInitializer
    public void Initialize(IDocumentStore store)
        var typesWithCustomSerialization = new[]

        store.Conventions.CustomizeJsonSerializer = a => a.Converters.Add(new RavenObjectAsJsonPropertyValue(typesWithCustomSerialization));
        store.ExecuteIndex(new Custom_Index());

The RavenObjectAsJsonPropertyValue class allows for the custom serialization of a type as a pure string, assuming that the type supplies a Parse method and a ToString that accurately represents the class. This reduces the complexity of the resulting JSON document and avoids mindlessly writing out all of the types fields and properties (which we may not need if there is a more efficient representation).

After making all of the appropriate changes I ran our suite of unit, integration and functional tests. They all passed with flying colours, so I thought that everything had gone well.

I finished up the changes relating to the new cleanup endpoint (adding appropriate tests), had the code reviewed and then pushed it into master, which pushed it through our CI process and deployed it to our Staging environment, reading for a final test in a production-like environment.

Then the failures began.

Failure Is Not An Option

Both the new endpoint and pre-existing endpoints were failing to execute correctly.

Closer inspection revealed that the failures were occurring whenever a particular object was being read from the underlying RavenDB database and that the errors occurring were related to being unable to deserialize the JSON representation of the object.

As I still remembered making the changes to the custom serialization, it seemed pretty likely that that was the root cause.

It turns out that when I consolidated the duplicate classes, I had forgotten to actually delete the now useless duplicated classes. As a result of not deleting them, a few references to the deprecated objects had sneakily remained, and because I had manually specified which classes should have custom serialization (which were the non-duplicate model classes), the left over classes were just being serialized in the default way (which just dumps all the properties of the object).

But all the tests passed?

The tests passed because I had not actually broken anything from a self contained round trip point of view. Any data generated by the API was able to be read by the same API.

Data generated by previous versions of the API was no longer valid, which was unintentional.

Once I realise my mistake, the fix was easy enough to conceive. Delete the deprecated classes, fix the resulting compile errors and everything would be fine.

Whenever I come across a problem like this though I prefer to put measures in place such that the exact same thing never happens again.

In this case, I needed a test to verify that the currently accepted JSON representation of the entity continued to be able to be deserialized from a RavenDB document store.

With a failing test I could be safe in the knowledge that when it passed I had successfully fixed the problem, and I can have some assurance that the problem will likely not reoccur in the future if someone breaks something in a similar way (and honestly, its probably going to be me).

Testing Is An Option Though

Being that the failure was occurring at the RavenDB level, when previously stored data was being retrieved, I had to find a way to make sure that the old data was accessible during the test. I didn’t want to have to setup and maintain a complete RavenDB instance that just had old data in it, but I couldn’t just use the current models to add data to the database in the normal way, because that would not actually detect the problem that I experienced.

RavenDB is pretty good as far as testing goes though. Its trivially easy to create an embedded data store and while the default way of communicating with the store is to use classes directly (i.e. Store<T>), it gives access to all of the raw database commands as well.

The solution? Create a temporary data store, use the raw database commands to insert a JSON document of the old format into a specific key then use the normal RavenDB session to query the entity that was just inserted via the generic methods.

public void WhenAnEntityFromAPreviousVersionOfOfTheApplicationIsInsertedIntoTheRavenDatabaseDirectly_ThatEntityCanBeReadSuccessfully()
    using (var resolverFactory = new IntegrationTestResolverFactory())
        var assemblyLocation = Assembly.GetAssembly(this.GetType()).Location;
        var assemblyFolder = Path.GetDirectoryName(assemblyLocation);

        var EntitySamplePath = Path.Combine(assemblyFolder, @"Data\Entity.json");
        var EntityMetadataPath = Path.Combine(assemblyFolder, @"Data\Entity-Metadata.json");

        var Entity = File.ReadAllText(EntitySamplePath);
        var EntityMetadata = File.ReadAllText(EntityMetadataPath);

        var key = "Entity/1";

        var store = resolverFactory.Application().Get<IDocumentStore>();
        store.DatabaseCommands.Put(key, null, RavenJObject.Parse(Entity), RavenJObject.Parse(EntityMetadata));

        using (var session = store.OpenSession())
            var entities = session.Query<Entity>().ToArray();

The test failed, I fixed the code, the test passed and everyone was happy. Mostly me.


I try to make sure that whenever I’m making changes to something I take backwards compatibility into account. This is somewhat challenging when you’re working at the API level and thinking about models, but is downright difficult when breaking changes can occur during things like serialization at the database level. We had a full suite of tests, but they all worked on the current state of the code and didn’t take into account the chronological nature of persisted data.

In this particular case, we caught the problem before it became a real issue thanks to our staging environment accumulating data over time and our excellent tester, but that’s pretty late in the game to be catching these sorts of things.

In the future it may be useful to create a class of tests specifically for backwards compatibility shortly after we do a release, where we take a copy of any persisted data and ensure that the application continues to work with that structure moving into the future.

Honestly though, because of the disconnected nature of tests like that, it will be very hard to make sure that they are done consistently.

Which worries me.


As I mentioned briefly last week, I’m currently working on some maintenance endpoints that allow us to clean up data that has been identified as abandoned in one of our services (specifically, the service backed by a RavenDB database).

The concept at play is not particularly difficult. We have meta information attached to each set of data belonging to a customer, and this meta information allows us to decide whether or not the data (which is supposed to be transitory anyway) should continue to live in the service, or whether it should be annihilated to reduce load.

In building the endpoints that will allow this functionality to be reliably executed, I needed to construct a set of automated tests to verify that the functionality worked as expected with no side-effects. In other words, the normal approach.

Unfortunately, that’s where the challenge mode kicked in.

Like Oil And Water

One of the key concepts in RavenDB is the index.

Indexes are used for every query you do, except for loading a document by ID, finding the set of documents by ID prefix or finding documents by updated timestamp.

Helpfully, RavenDB will create and maintain one or more indexes for you based on your querying profile (automatic indexes), which means that if you don’t mind relinquishing control over the intricacies of your persistence, it will handle all the things for you. Honestly, it does a pretty good job of this, until you do a query that is somewhat different from all the previous queries, and RavenDB creates a new index to handle the query, which results in a massive CPU, memory and disk spike, which craters your performance.

Our inexperience with RavenDB when this service was first built meant that the difference between automatic and static indexes was somewhat lost on us, and the entire service uses nothing but automatic indexes. Being that we don’t allow for arbitrary, user defined queries, this is fine. In fact, unless automatic indexes are somehow at the root of all of our performance problems, we haven’t had any issues with them at all.

A key takeaway about indexes, regardless of whether they are automatic or static, is that they are an eventually consistent representation of the underlying data.

All writes in RavenDB take place in a similar way to writes in most relational databases like SQL Server. Once a write is accepted and completes the data is available straight away.

The difference is that in RavenDB, that data is only available straightaway if you query for it based off the ID, the prefix (which is part of the ID) or the order in which it was updated.

Any other query must go through an index, and indexes are eventually consistent, with an unknown time delay between when the write is completed and the data is available via the index.

The eventually consistent model has benefits. It means that writes can be faster (because writes don’t have to touch every index before they are “finished”), and it also means that queries are quicker, assuming you don’t mind that the results might not be completely up to date.

That’s the major cost or course, you trade-off speed and responsiveness against the ability to get a guaranteed definitive answer at any particular point in time.

Or Maybe Fire And Gasoline?

Eventual consistency is one of those concepts that infects everything it touches. If you have an API that uses a database that is eventually consistent, that API is also eventually consistent. If your API is eventually consistent, your clients are probably eventually consistent. So on and so forth all the way through the entire system.

If you’re used to relying on something being true after doing it, you have to completely rearrange your thinking, because that is no longer guaranteed to be true.

This is especially painful when it comes to tests, which brings us full circle back to where I started this post.

A standard test consists of some sort of setup (arrange), followed by the execution of some operation or set of operations (act), followed by some validation (assert).

In my case, I wanted to write a Functional Test that validated if I added a bunch of data and then executed the maintenance endpoint, some of that data should have been removed (the stuff identified as abandoned) and the rest should still be present. A nice high level test to validate that everything was working together as it should be, to be executed against the API itself (via HTTP calls).

Our functional tests are typically executed against a component deploy via our continuous integration/deployment system. We run Unit and Integration tests after the build, deploy via Octopus to our CI environment, then run Functional tests to validate that all is well. The success of the Functional tests determines whether or not that particular build/package should be allowed to propagate out into our Staging environment and beyond.

The endpoints in play were:

GET /entities
PUT /entities/{id}
GET /admin/maintenance/entities/abandoned
DELETE /admin/maintenance/entities/abandoned

Conceptually, the test would:

  • Seed a bunch of entities to the API (some meeting the abandoned conditions, some not)
  • Run a GET on the maintenance endpoint to confirm that the appropriate number of entities were being identified as abandoned
  • Run a DELETE on the maintenance endpoint, removing a subset of the abandoned entities, returning a response detailing what was removed and how much is left
  • Run a GET on the maintenance endpoint to validate that its response concurred with the DELETE response
  • Repeatedly do DELETE/GET calls to the maintenance endpoint to delete all abandoned entities
  • Run a GET on the entities endpoint to validate that all of entities that should still be present are, and all entities that should have been deleted are

Like I said, a nice, high level test that covers a lot of functionality and validates that the feature is working as it would be used in reality. Combined with a set of Unit and Integration tests, this would provide some assurance that the feature was working correctly.

Unfortunately, it didn’t work.

The Right Answer Is Just A While Loop Away

The reason it didn’t work was that it was written with the wrong mindset. I was expecting to be able to add a bunch of data and have that data be immediately visible.

That’s not possible with an eventually consistent data store like RavenDB.

At first I thought I could just deal with the issue by wrapping everything in a fancy retry loop, to check that it was eventually the value that I wanted it to be. I would be able to use the same test structure that I already had, all I would have to do would be make sure each part of it (GET, DELETE, GET) executed repeatedly until it matched the conditions, up to some maximum number of attempts.

To that end I wrote a simple extension method in the flavour of FluentAssertions that would allow me to check a set of assertions against the return value from a Func in such a way that it would repeatedly execute the Func until the assertions were true or it reached some sort of limit.

using System;
using System.Linq;
using FluentAssertions;
using NSubstitute;
using Serilog;

namespace Api.Tests.Shared.Extensions
    public static class FluentAssertionsExtensions
        public static void ShouldEventuallyMeetAssertions<TResult>(this Func<TResult> func, Action<TResult>[] assertions, IRetryStrategy strategy = null)
            strategy = strategy ?? new LinearBackoffWithCappedAttempts(Substitute.For<ILogger>(), 5, TimeSpan.FromSeconds(2));
            var result = strategy.Execute
                    r => 
                        foreach (var assertion in assertions)

                        return true;

            if (!result.Success)
                throw result.EvaluationErrors.Last();
    public interface IRetryStrategy
        RetryResult<TResult> Execute<TResult>(Func<TResult> f, Func<TResult, bool> isValueAcceptable = null);
    public class LinearBackoffWithCappedAttempts : IRetryStrategy
        public LinearBackoffWithCappedAttempts(ILogger logger, int maximumNumberOfAttempts, TimeSpan delayBetweenAttempts)
            _logger = logger;
            _maximumNumberOfAttempts = maximumNumberOfAttempts;
            _delayBetweenAttempts = delayBetweenAttempts;

        private readonly ILogger _logger;
        private readonly int _maximumNumberOfAttempts;
        private readonly TimeSpan _delayBetweenAttempts;

        public RetryResult<TResult> Execute<TResult>(Func<TResult> f, Func<TResult, bool> isValueAcceptable = null)
            isValueAcceptable = isValueAcceptable ?? (r => true);
            var result = default(TResult);
            var executionErrors = new List<Exception>();

            var attempts = 0;
            var success = false;
            while (attempts < _maximumNumberOfAttempts && !success)
                    _logger.Debug("Attempting to execute supplied retryable function to get result");
                    result = f();

                    var valueOkay = isValueAcceptable(result);
                    if (!valueOkay) throw new Exception(string.Format("The evaluation of the condition [{0}] over the value [{1}] returned false", isValueAcceptable, result));

                    success = true;
                catch (Exception ex)
                    _logger.Debug(ex, "An error occurred while attempting to execute the retryable function");
                    if (!success) Thread.Sleep(_delayBetweenAttempts);

            var builder = new StringBuilder();
            if (success) builder.Append("Evaluation of function completed successfully");
            else builder.Append("Evaluation of function did NOT complete successfully");

            builder.AppendFormat(" after [{0}/{1}] attempts w. a delay of [{2}] seconds between attempts.", attempts, _maximumNumberOfAttempts, _delayBetweenAttempts.TotalSeconds);
            builder.AppendFormat(" [{0}] total failures.", executionErrors.Count);
            if (executionErrors.Any())
                builder.AppendFormat(" Exception type summary [{0}].", string.Join(", ", executionErrors.Select(a => a.Message)));

            _logger.Information("The retryable function [{function}] was executed as specified. Summary was [{summary}]", f, builder.ToString());

            return new RetryResult<TResult>(success, result, builder.ToString(), executionErrors);

    public class RetryResult<TResult>
        public RetryResult(bool success, TResult result, string summary, List<Exception> evaluationErrors = null)
            Success = success;
            Result = result;
            Summary = summary;
            EvaluationErrors = evaluationErrors;

        public readonly bool Success;
        public readonly TResult Result;
        public readonly List<Exception> EvaluationErrors;
        public readonly string Summary;

Usage of the extension method, and some rearranging let me write the test in a way that was  now aware of the eventually consistent nature of the API.

When I started thinking about it though, I realised that the approach would only work for idempotent requests, which the DELETE verb on the new endpoint was definitely not.

Repeatedly executing the DELETE was changing the output each time, and depending on whether the stars aligned with regards to the RavenDB indexes, it was unlikely to ever meet the set of assertions that I had defined.

Happiness Comes From Compromise

In the end I had to adapt to the concepts at play, rather than adapting the concepts to fit my desired method of testing.

I could still leverage the eventually consistent assertion extension that I wrote, but I had to slightly rethink what I was trying to test.

  • I would still have to seed a bunch of entities to the API (some meeting the abandoned conditions, some not)
  • Instead of doing a single GET, I would have to assert that at least all of my seeded data was eventually present.
  • Instead of running a single DELETE and checking the result, I would have to loop and delete everything until the service told me nothing was left, and then check that it deleted at least all of the things I seeded
  • Instead of doing a final GET to verify, I would have to again assert that at least all of my seeded data that was not supposed to be removed was still present

Subtly different, but still providing a lot of assurance that the feature was working as expected. I won’t copy the full test here, because including helper methods to improve readability, it ended up being about 150 lines long.


Sometimes it feels like no matter where I look, RavenDB is there causing me some sort of grief. It has definitely given me an appreciation of the simple nature of traditional databases.

This isn’t necessarily because RavenDB is bad (its not, even though we’ve had issues with it), but because the eventual consistency model in RavenDB (and in many other similar databases) makes testing more difficult than it has to be (and testing is already difficult and time consuming, even if it does provide a world of benefits).

In this particular case, while it did make testing a relatively simple feature more challenging than I expected, it was my own approach to testing that hurt me. I needed to work with the concepts at play rather than to try and bend them to my preferences.

Like I said earlier, working with an eventually consistent model changes how you interact with a service at every level. It is not a decision that should be taken lightly.

Unfortunately I think the selection of RavenDB in our case was more “shiny new technology” than “this is the right fit for our service”, which is definitely less gravitas than it should have been given.


I definitely would not say that I am an expert at load testing web services. At best, I realise how valuable it is in order to validate your architecture and implementation, to help you get a handle on weak or slow areas and fix them before they can become a problem.

One thing I have definitely learned in the last 12 months however, is just how important it is to make sure that your load profile (i.e. your simulation for how you think your system will be loaded) is as close to reality as possible. If you get this wrong, not only will you not be testing your system properly, you will give yourself a false sense of confidence in how it performs when people are using it. This can lead to some pretty serious disasters when you actually do go live everything explodes (literally or figuratively, it doesn’t matter).

Putting together a good load profile is a difficult and time consuming task. You need to make assumptions about expected usage patterns, number of users, quality (and quantity) of date and all sorts of other things. While you’re building this profile, it will feel like you aren’t contributing directly to the software being written (there is code to write!), but believe me, a good load profile is worth it when it comes to validation all sorts of things later on. Like a good test suite, it keeps paying dividends in all sorts of unexpected places.

Such a Tool

It would be remiss of me to talk about load tests and load profiles without mentioning at least one of the tools you can use to accomplish them, as there are quite a few out there. In our organisation we use JMeter, mostly because that’s the first one that we really looked into in any depth, but it helps that it seems to be pretty well accepted in the industry, as there is a lot of information already out there to help you when you’re stuck. Extremely flexible, extendable and deployable, its an excellent tool (though it does have a fairly steep learning curve, and its written in Java so for a .NET person it can feel a little foreign).

Back to the meat of this post though.

As part of the first major piece of work done shortly after I started, my team completed the implementation of a service for supporting the remote access and editing of data that was previously locked to client sites. I made sure that we had some load tests to validate the behaviour of the service when it was actually being used, as opposed to when it was just kind of sitting there, doing nothing. I think it might have been the first time that our part of the organisation had ever designed and implemented load tests for validating performance, so it wasn’t necessarily the most…perfect, of solutions.

The load tests showed a bunch of issues which we dutifully fixed.

When we went into production though, there were so many more issues than we anticipated, especially related to the underlying persistence store (RavenDB, which I have talked about at length recently).

Of course, the question on everyone’s lips at that point was, why didn’t we see those issues ahead of time? Surely that was what the load tests were meant to catch?

The Missing Pieces

There were a number of reasons why our load tests didn’t catch any of the problems that started occurring in production.

The first was that we were still unfamiliar with JMeter when we wrote those tests. This mostly just limited our ability to simulate complex operations (of which there are a few), and made our profile a bit messier than it should have been. It didn’t necessarily cause the weak load tests, but it certainly didn’t help.

The second reason was that the data model used in the service is not overly easy to use. When I say easy to use, I mean that the objects involved are complex (100+KB of JSON) and thus are difficult to create realistic looking random data for. As a result, we took a number of samples and then used those repeatedly in the load tests, substituting values as appropriate to differentiate users from each other. I would say that the inability to easily create realistic looking fake data was definitely high up there on the list as to why the load tests were ineffective in finding the issues we encountered in production.

The third reason why our load tests didn’t do the job, was the actual load profile itself. The simulation for what sort of calls we expected a single user (where user describes more than just one actual person using the system) to make was just not detailed enough. It did not cover enough of the functionality of the server and definitely did not accurately represent reality. This was unfortunate and unexpected, because we spent a significant amount of time attempting to come up with a profile, and we got agreement from a number of different parties that this profile would be good enough for the purposes of testing. The root cause of this one was simply unfamiliarity with the intended usage of the system.

Finally, and what I think is probably the biggest contributor to the ineffectiveness of the load tests, we simply did not run them for long enough. Each load test we did only went for around 48 hours (at the high end) and was focused around finding immediate and obvious performance problems. A lot of the issues that we had in production did not manifest themselves until we’d been live for a week or two. If we had of implemented the load tests sooner, and then started and kept them running on our staging environment for weeks at a time, I imagine that we would have found a lot of the issues that ended up plaguing us.


Of course, there is no point thinking about these sort of things unless you actually make changes the next time you go to do the same sort of task.

So, what did we learn?

  1. Start thinking about the load tests and simulating realistic looking data early. We came into the service I’ve been talking about above pretty late (to clean up someone else’s mess) and we didn’t really get a chance to spend any time on creating realistic looking data. This hurt us when it came time to simulate users.
  2. Think very very hard about your actual load profile. What is a user? What does a user do? Do they do it sequentially or in parallel? Are there other processes going on that might impact performance? Are there things that happen irregularly that you should include in the profile at random? How big is the data? How much does it vary? All of those sorts of questions can be very useful for building a better load profile. Make sure you spend the time to build it properly in whatever tool you are using, such that you can tweak it easily when you go to run it.
  3. To run our load tests early and then for as much time as possible. To us, this means we should run them in an infinite loop on top of our staging environment pretty much as soon as we have them, forever (well, until we’re no longer actively developing that component anyway).

The good thing to come out of the above is that the service we completed did not flop hard enough that we don’t get a second chance. We’re just now developing some other services (to meet similar needs) and we’ve taken all of the lessons above to heart. Our load test profiles are much better and we’ve started incorporating soak tests to pick up issues that only manifest over long periods of time.

At least when it breaks we’ll know sooner, rather than when there are real, paying, customers trying to use it.

I imagine though, that we will probably have to go through this process a few times before we really get a good handle on it.


The last post I made about our adventures with RavenDB outlined the plan, upgrade to RavenDB 3. First step? Take two copies of our production environment, leave one at RavenDB 2.5 and upgrade the other to RavenDB 3. Slam both with our load tests in parallel and then see which one has better performance by comparing the Kibana dashboard for each environment (it shows things like CPU usage, request latency, disk latency, etc).

The hope was that RavenDB 3 would show lower resource usage and better performance all round using approximately the same set of data and for the same requests. This would give me enough confidence to upgrade our production instance and hopefully mitigate some of the issues we’ve been having.

Unfortunately, that’s not what happened.

Upgrade, Upgrade, Upgrade, Upgrade!

Actually upgrading to RavenDB 3 was painless. For RavenDB 2.5 we build a Nuget package that contains all of the necessary binaries and configuration, along with Powershell scripts that setup an IIS website and application pool automatically on deployment. RavenDB 3 works in a very similar way, so all I had to do was re-construct the package so that it worked in the same way except with the newer binaries. It was a little bit fiddly (primarily because of how we constructed the package the first time), but it was relatively easy.

Even better, the number of binaries and dependencies for RavenDB 3 is lower than RavenDB 2.5, which is always nice to see. Overall I think the actual combined sized may have increased, but its still nice to have a smaller number of files to manage.

Once I had the package built, all I had to do was deploy it to the appropriate environment using Octopus Deploy.

I did a simple document count check before and after and everything was fine, exactly the same number of documents was present (all ~100K of them).

Resource usage was nominal during this upgrade and basically non-existent afterwards.

Time to simulate some load.

What a Load

I’ve written previously about our usage of JMeter for load tests, so all I had to do was reuse the structure I already had in place. I recently did some refactoring in the area as well, so it was pretty fresh in my mind (I needed to extract some generic components from the load tests repository so that we could reuse them for other load tests). I set up a couple of JMeter worker environments in AWS and started the load tests.

Knowing what I do now I can see that the load tests that I originally put together don’t actually simulate the real load on the service. This was one of the reasons why our initial, intensive load testing did not find any of the issues with the backend that we found in production. I’d love to revisit the load profiles at some stage, but for now all I really needed was some traffic so that I could compare the different versions of the persistence store.

RavenDB 2.5 continued to do what it always did when the load tests were run. It worked just fine. Minimal memory and CPU usage, disk latency was low, all pretty standard.

RavenDB 3 ate all of the memory on the machine (16GB) over the first 10-15 minutes of the load tests. This caused disk thrashing on the system drive, which in turn annihilated performance and eventually the process crashed and restarted.

Not a good sign.

I’ve done this test a few times now (upgrade to 3, run load tests) and each time it does the same thing. Sometimes after the crash it starts working well (minimal resource usage, good performance), but sometimes even when it comes back from the crash it does the exact same thing again.

Time to call in the experts, i.e. the people that wrote the software.

Help! I Need Somebody

We don’t currently have a support contract with Hibernating Rhinos (the makers of RavenDB). The plan was to upgrade to RavenDB 3 (based on the assumption that its probably a better product), and if our problems persisted, to enter into a support contract for dedicated support.

Luckily, the guys at Hibernating Rhinos are pretty awesome and interact regularly with the community at the RavenDB Google Group.

I put together a massive post describing my current issue (mentioning the history of issues we’ve had to try and give some context), which you can find here.

The RavenDB guys responded pretty quickly (the same day in fact) and asked for some more information (understandably). I re-cloned the environment (to get a clean start) and did it again, except this time I was regularly extracting statistics from RavenDB (using the /stats and /admin/stats endpoints), as well as dumping the memory when it got high (using procdump) and using the export Debug information functionality built into the new Raven Studio (which is so much better than the old studio that it’s not funny). I packaged together all of this information together with the RavenDB log files and posted a response.

While looking through that information, Oren Eini (the CEO/Founder of Hibernating Rhinos) noticed that there were a number of errors reported around not being able to find a Lucene.Net.dll file on the drive where I had placed the database files (we separated the database files from the libraries, the data lives on a large, high throughput volume while the libraries are just on the system drive). I don’t know why that file should be there, or how it should get there, but at least it was progress!

The Battle Continues

Alas, I haven’t managed to return to this particular problem just yet. The urgency has diminished somewhat (the service is generally running a lot better after the latest round of hardware upgrades), and I have been distracted by other things (our Octopus Deploy slowing down our environment provisioning because it is underpowered), so it has fallen to the wayside.

However, I have plans to continue the investigation soon. Once I get to the root of the issue, I will likely make yet another post about RavenDB, hopefully summarising the entire situation and how it was fixed.

Software developers, perpetually hopeful…