
A number of people much smarter than me have stated that there are only two hard problems in Computer Science: naming things, cache invalidation and off by one errors. Based on my own experience, I agree completely with this sentiment, and it’s one of the reasons why I always hesitate to incorporate caching into a system until it’s absolutely necessary.

Unfortunately, absolutely necessary always comes much sooner than I would like.

Over the last few weeks we’ve been slowly deploying our data freeing functionality to our customers. The deployment occurs automatically without user involvement (we have a neat deployment pipeline) so that has been pretty painless, but we’ve had some teething issues on the server side as our traffic ramped up. We’re obviously hosting everything in AWS, so it’s not like we can’t scale as necessary (and we have), but just throwing more power at something is a road we’ve been down before, and it never ends well. Plus, we’re software engineers, and when it looks like our code is slow or inefficient, it makes us sad.

The first bottleneck we ran into was our authentication service, and mitigating that particular problem is the focus of this post.

For background, we use a relatively simple token-based approach to authentication. Consumers of our services are required to supply some credentials directly to our auth service to get one of these tokens, which can then be redeemed to authenticate against other services and to gain access to the resources the consumer is interested in. Each service knows that the auth service is the ultimate arbiter for resolving those tokens, so they connect to it as necessary for token validation and make authorization decisions based on the information returned.
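In code, the contract looks roughly like the sketch below. The names and members are illustrative rather than our actual contracts, but the shape is accurate:

public class AuthenticationContext
{
    // Placeholder fields; the real class carries whatever the services need to make authorization decisions.
    public string UserId { get; set; }
    public string[] Permissions { get; set; }
}

public interface IAuthService
{
    // Credentials go in, a token comes out.
    string IssueToken(string username, string password);

    // Any service holding a token hands it back to the auth service to find out who it
    // belongs to and what they are allowed to do. Returns null if the token is invalid.
    AuthenticationContext ResolveToken(string token);
}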

A relatively naive approach to authentication, but it works for us.

It’s pretty spammy though, and therein lies the root cause of the bottleneck.

I Already Know The Answer

One of the biggest factors at play here is that every single authenticated request (which is most of them) coming into any of our services needs to hit another service to validate the token before it can do anything else. It can’t even keep doing things in the background while the token is being validated, because if the auth fails, that would be a waste of effort (and a potential security risk if an error occurs and something leaks out).

On the upside, once a service has resolved a token, it’s unlikely that the answer will change in the near future. We don’t even do token invalidation at this stage (no need), so the answer is actually going to be valid for the entire lifetime of the token.

You can probably already see where I’m going with this, but an easy optimization is to simply remember the resolution for each token in order to bypass calls to the auth service when that token is seen again. Why ask a question when you already know the answer? This works particularly well for us in the data synchronization case because a single token will be used for a flurry of calls.

Of course, now we’re in the caching space, and historically, implementing caching can increase the overall complexity of a system by a large amount. To alleviate some of the load on the auth service, we just want a simple in-memory cache. If a particular instance of the service has seen the token, use the cache, else validate the token and then cache the result. To keep it simple, and to deal with our specific usage pattern (the aforementioned flurry), we’re not even going to cache the token resolution for the entire lifetime of the token, just for the next hour.

We could write the cache ourselves, but that would be stupid. Caching IS a hard problem, especially once you get into invalidation. Caching has been a known problem for a while though, so there are a lot of different components available for you to use in your language of choice; it’s just a matter of picking one.

But why pick just one?

There Has To Be A Better Way

As we are using C#, CacheManager seems like a pretty good bet for avoiding the whole “pick a cache and stick with it” problem.

It’s a nice abstraction over the top of many different caching providers (like the in-memory System.Runtime.Caching and the distributed Redis), so we can easily implement our cache using CacheManager and then change our caching to something more complicated later, without having to worry about breaking everything in between. It also simplifies the interfaces to those caches, and does a bunch of really smart work behind the scenes.
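For example, if the simple in-memory handle ever stops being enough, moving to the Redis backend should mostly be a different set of calls to the cache factory. Something like the following, based on my reading of the CacheManager documentation and its Redis package (I haven’t actually run this against a real Redis instance, so treat it as a sketch):

// Same cache contract as before, different backing store. The "redis" key just links
// the configuration entry to the handle that uses it.
var cache = CacheFactory.Build<AuthenticationContext>(settings => settings
    .WithRedisConfiguration("redis", "localhost:6379")
    .WithRedisCacheHandle("redis")
    .WithExpiration(ExpirationMode.Absolute, TimeSpan.FromHours(1)));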

The entire cache is implemented in the following class:

public class InMemoryNancyAuthenticationContextCache : INancyAuthenticationContextCache
{
    private readonly ICacheManager<AuthenticationContext> _manager;

    public InMemoryNancyAuthenticationContextCache()
        : this(TimeSpan.FromHours(1))
    {

    }

    public InMemoryNancyAuthenticationContextCache(TimeSpan cachedItemsValidFor)
    {
        _manager = CacheFactory.Build<AuthenticationContext>(a => a
            .WithSystemRuntimeCacheHandle()
            .WithExpiration(ExpirationMode.Absolute, cachedItemsValidFor));
    }

    public AuthenticationContext Get(string token)
    {
        return _manager.Get(token);
    }

    public void Insert(string token, AuthenticationContext authContext)
    {
        _manager.Put(token, authContext);
    }
}

Our common authentication library now checks whatever cache was injected into it before going off to the real auth service to validate tokens. I made sure to write a few tests to validate the caching behaviour (like expiration and eviction, validating a decrease in the number of calls to the auth provider when flooded with the same token and so on), and everything seems to be good.
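The check itself is just the classic cache-aside pattern. Stripped of error handling (and with _authProvider standing in for our actual client to the auth service), it amounts to something like this:

public AuthenticationContext Validate(string token)
{
    // Seen this token recently? Skip the round trip to the auth service entirely.
    var cached = _cache.Get(token);
    if (cached != null)
    {
        return cached;
    }

    // Otherwise ask the auth service (the expensive bit) and remember the answer for next time.
    var resolved = _authProvider.ResolveToken(token);
    if (resolved != null)
    {
        _cache.Insert(token, resolved);
    }

    return resolved;
}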

The one downside of using CacheManager (and the System.Runtime.Caching libraries) is that I’m putting a lot of trust in everything to “just work”. Performance testing will prove out whether or not there are any issues, but I can guarantee that if there are, they will be a massive pain to diagnose, just because our code is so many levels removed from the actual caching.

I sure do hope it does the right thing under stress.

Summary

Like I said at the start, I always hold off on implementing a cache for as long as I possibly can. Caching is a powerful tool to improve performance, but it can lead to some frustratingly hard to debug situations because of the transient nature of the information. No longer can you assume that you will be getting the most accurate information, which can make it harder to reason about execution paths and return values after something has actually happened. Of course, any decent system (databases, browsers, even disk access) usually has caching implemented to some degree, so I suppose you’re always dealing with the problem regardless of what your code actually does.

Good caching is completely invisible, feeding consumers appropriate information and mitigating performance problems without ever really showing its ugly side.

When it comes to implementing caching, it doesn’t make sense to write it yourself (which is true for a lot of things). Accept the fact that many people a lot smarter than you have probably already solved the problem and just use their work instead. The fact that CacheManager also loosely couples our code to any specific cache implementation was an unexpected bonus, which was nice.

Now if only someone could solve the whole naming things problem.


Welp, holidays finished, so back into software delivery I go.

We just finished a 3-day Hackathon, which was pretty fun. The team I was in worked on a proof of concept for providing business intelligence on our customers to internal users by connecting and correlating our many disparate data sources. It was a fairly simple React.js website backed by a C# API, which connected to a number of different systems including Microsoft Dynamics CRM, Elasticsearch (specifically, our log stack), some crazy financial database called Helpy (which I don’t even seem to be able to Google) and finally a PostgreSQL database containing customer information.

It was cool to see it all come together over just a few days, but I doubt I’ll give it its own blog post like I did for the faster S3 clone from the last Hackathon. Funnily enough, the hardest part was getting the damn thing deployed to an instance of IIS on a manually provisioned server inside our office. I’ve spent so much time using automated scripts that I wrote months ago in AWS to set up websites and APIs that I’ve completely forgotten how to just hack a webserver together.

Why didn’t I use the scripts you might ask? Well, I tried, and they really only work inside AWS on top of the AMIs we’ve built up specifically for that purpose. A bit disappointing, but I got over it.

Hackathon fun aside, I had to deal with something else when I got back from my holidays: identifying and resolving a massive, unexpected increase in Read IOPS usage for one of our production databases.

And for once, it had nothing to do with RavenDB.

Zero Hour

While I was on holidays, one of our production APIs started exhibiting some pretty degraded performance. It happened suddenly too; it was running fine one minute, and then its latency spiked hard. Unfortunately I was not there for the initial investigation, so the first half of this post is going to be a second-hand account, with all the inherent inaccuracy that entails.

Looking into the issue, all queries to the underlying database (a PostgreSQL database hosted in RDS) were taking much longer than expected, but it wasn’t immediately obvious as to why. There had been no new deployments (so nothing had changed), and it didn’t look like the amount of traffic being received had drastically increased.

The RDS monitoring statistics showed the root cause of the performance problems. There had been a massive increase in the amount of Read IOPS being consumed by the database instance, from 300ish to 700ish. This had been happening for at least a few days before the performance went downhill, so the operative question was why it had become an issue all of a sudden.

Apparently, IO credits are a thing, like CPU credits.

With the increase in Read IOPS being sustained over many hours, the IO credits available to the instance were slowly being consumed, until they were all gone and performance tanked.

The way IO credits work, to my current understanding, is that when you provision an EBS volume, the size of the volume determines the minimum guaranteed IO throughput, at a rate of 3 IOPS per GB provisioned (unless you’ve specified provisioned IOPS, which is a whole different ballgame). The volume is capable of bursting to 3000 IOPS, but operating above the minimum consumes IO credits. Running out of IO credits means no more bursting, so if the performance of your service is reliant on pulling more than the minimum IO available, it will tank. Like ours did.
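To put some rough numbers on it (these are illustrative, based on my understanding of the mechanics rather than exact figures from our account): a general purpose (gp2) volume starts with a bucket of around 5.4 million IO credits and earns them back at its baseline rate. If our baseline was around 300 IOPS (i.e. a volume of roughly 100GB at 3 IOPS per GB), then sustaining 700 IOPS drains the bucket at 700 - 300 = 400 credits per second, which would empty 5.4 million credits in roughly 13,500 seconds of continuous overage, or a bit under 4 hours. Our overage came and went rather than being constant, which is why it took days rather than hours for the credits to finally run out.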

Apart from the unfortunate fact that unlike CPU credits, there is no way to monitor IO credits, the problem was relatively easy to solve. Either increase the volume size of the RDS instance (to increase the minimum IO performance) or leave the size the same, but switch to using provisioned IOPS. We opted for the cheaper option (bigger disk = more performance) and the performance stabilised.

This left us in a precarious position though, as we didn’t know what had caused the dramatic increase in Read IOPS.

Detective Agency

PostgreSQL has some pretty good logs, which is a relief, considering no equivalent tool to SQL Server Profiler exists to my knowledge.

Getting the logs for a PostgreSQL instance hosted in RDS isn’t particularly complicated. Either edit the existing parameter group associated with the instance, or create a new one with logging enabled and then associate it with the instance. The logging options we were interested in were log_statement and log_min_duration_statement, which represent what activities to log and how long a query needs to be running before it should be logged, respectively. More information about working with PostgreSQL logs in RDS can be found at the appropriate AWS webpage. Getting to the logs manually is pretty easy. They are available for download when looking at the RDS instance in the AWS console/dashboard.
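For reference, the sort of values involved look something like this in postgresql.conf terms (illustrative, not necessarily exactly what we used):

# What to log: 'none', 'ddl', 'mod' or 'all'
log_statement = 'all'

# Log any statement that runs for longer than this many milliseconds (-1 disables it, 0 logs everything)
log_min_duration_statement = 500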

The logs showed that there was a class of query that was constantly being executed to determine the most recent piece of information in the database for a subset of the data (i.e. a select top x from y where a, b). The only piece of software that can access the database is a C# API, and it uses Entity Framework 6 for all database communication, so the queries were monstrous, but once our attention had been drawn to the appropriate place, it was pretty easy to see some issues:

  • The data type on the first segmentation column was a string instead of an integer (only integer values were ever present in the column, and conceptually only integer values would ever be present)
  • The data type on the second segmentation column was a string instead of a guid (similar to the integer comment above)
  • The primary key on the table was a simple auto ID, whereas the actual primary key should have been a compound key with column a and b above, along with a third column
  • There did not appear to be any appropriate indexes on the segmentation columns

What all of this meant was that the most frequent query being executed on the database was incredibly sub-optimal. It was likely that because the most frequent query needed to do a table scan in order to get an answer, and because the RDS instance was only specified as a t2.micro, PostgreSQL was constantly churning memory, which would account for the high Read IOPS. The issue likely only occurred recently because we finally hit some threshold in traffic or database size where the churning manifested.

Fixing the problem was simple: edit the EF models to change the types, specify appropriate indexes, and add a new EF migration to be executed on the next deployment. The migration was a little more complex than usual because it had to upgrade data already in the database (stored using the wrong column types), but all in all, the fix didn’t take a huge amount of effort.
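I obviously can’t share the real migration, but a minimal sketch of the sort of thing involved might look like the following (hypothetical table and column names; for PostgreSQL, changing a column type while converting the existing contents generally means dropping down to raw SQL with a USING clause rather than a plain AlterColumn):

public partial class FixSegmentationColumnTypes : DbMigration
{
    public override void Up()
    {
        // Convert the existing string contents in place while changing the column types.
        Sql("ALTER TABLE \"Entries\" ALTER COLUMN \"SegmentA\" TYPE integer USING \"SegmentA\"::integer;");
        Sql("ALTER TABLE \"Entries\" ALTER COLUMN \"SegmentB\" TYPE uuid USING \"SegmentB\"::uuid;");

        // Give the most frequent query something better than a table scan to work with.
        CreateIndex("Entries", new[] { "SegmentA", "SegmentB", "CreatedAt" }, name: "IX_Entries_Segments_CreatedAt");
    }

    public override void Down()
    {
        DropIndex("Entries", "IX_Entries_Segments_CreatedAt");
        Sql("ALTER TABLE \"Entries\" ALTER COLUMN \"SegmentB\" TYPE character varying USING \"SegmentB\"::text;");
        Sql("ALTER TABLE \"Entries\" ALTER COLUMN \"SegmentA\" TYPE character varying USING \"SegmentA\"::text;");
    }
}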

Like good software developers though, we had to make sure that the change actually fixed the problem, without having to deploy it to production to find out.

Validate Me!

Thanks to previous efforts in traffic replication, I made the assumption that it would be relatively easy to replicate our production traffic into an updated version of the service.

I was wrong.

Our previous traffic replication had been accomplished using Gor, replicating traffic from a single EC2 instance (a machine hosting RavenDB) to another similar instance. Pure HTTP traffic, no URLs, just IP addresses, and Gor had been manually installed and configured. I had started work on replicating traffic from our API instances to a URL of my choosing, but I hadn’t quite finished it before I went on holidays.

I needed to replicate traffic from a variable number of instances to a publicly accessible URL, and I really didn’t want to have to manually install it on each machine (and then have to do it again next time I wanted to replicate traffic).

Creating a deployable NuGet package containing Gor and some helper PowerShell scripts wasn’t overly difficult; I just used a similar pattern to the one we have for our Logstash deployment package. After deploying it though, it wouldn’t work. When I say it wouldn’t work, I mean I didn’t get any feedback at all when running it, no matter how I ran it. The issue was that Gor running on Windows is dependent on WinPcap being present in order to function at all. This is easy enough to deal with when you’re doing a manual install (just go grab WinPcap and install it), but unfortunately much harder when you want to be able to deploy it to any old machine in a completely unattended fashion. WinPcap doesn’t supply a silent installer (why god why), so I was forced to install Nmap silently on the machines as part of the Gor deployment. Why Nmap? Its installer also installs WinPcap silently, because they’ve customised it.

Not my greatest victory, but I’ll take it.

With Gor working, I still had an issue where even though Gor seemed to be aware of the traffic occurring on the machine (I could log it to stdout for example), no matter what I did I could not get the traffic to replicate to another service via a standard URL available to the greater internet. Long story short, the issue was that Gor (as of version 0.15.1) does not and cannot use a proxy. Being that these instances did not have public IP addresses (no need, they were hidden behind a load balancer), they had to go through a proxy to get anything meaningful done on the internet.

We’re switching to using NAT Gateways for all internet traffic from private instances, so in the long term, this won’t be an issue. For now, I have to live with the fact that proxies are a thing, so I was forced to create an internal load balancer pointing at the same EC2 instances as the public URL I was going to use.

With that in place, I finally managed to replicate the production traffic to a test service with the changes applied.

And it worked amazingly. With the changes in place, our Read IOPS dropped from 700ish to like 10. Much more efficient. The only cost was that the service had to be made unavailable to apply the database updates (because they involved the transformation of the column contents). That only took like 5 minutes though, so I just did it at night when nobody noticed or cared.

Conclusion

I suppose the conclusion of this entire journey is to never underestimate the impact that early design decisions can have on the long-term health of a service, and to never undervalue detailed monitoring.

Honestly, we should have picked up this sort of thing much earlier than we did (probably through load testing), but it was good that we had the systems in place to react quickly when it did happen unexpectedly.

Another benefit of the whole thing is that it proved the value of our ability to spin up an entirely self contained environment as a mechanism for testing changes under relatively real circumstances. With the ability to replicate traffic at will now (thanks to the working Gor deployment), we’re in a much better position to try things out before we go live with them.


I’ve mentioned a number of times about the issues we’ve had with RavenDB and performance.

My most recent post on the subject talked about optimizing some of our most common queries by storing data in the index, to avoid having to retrieve full documents only to extract a few fields. We implemented this solution as a result of guidance from Hibernating Rhinos, who believed that it would decrease the amount of memory churn on the server, because it would no longer need to constantly switch the set of documents currently in memory in order to answer incoming queries. Made sense to me, and they did some internal tests showing that indexes that stored the indexed fields dramatically reduced the memory footprint of their tests.

Alas, it didn’t really help.

Well, that might not be fair. It didn’t seem to help, but that may have been because it only removed one bottleneck in favour of another.

I deployed the changes to our production environment a few days ago, but the server is still using a huge amount of memory, still regularly dumping large chunks of memory and still causing periods of high latency while recovering from the memory dump.

Unfortunately, we’re on our own as far as additional optimizations go, as Hibernating Rhinos have withdrawn from the investigation. Considering the amount of time and money they’ve put into this particular issue, I completely understand their decision to politely request that we engage their consulting services in order to move forward with additional improvements. Their entire support team was extremely helpful and engaged whenever we talked to them, and they did manage to fix a few bugs on the way through, even if it was eventually decided that our issues were not the result of a bug, but simply because of our document and load profile.

What Now?

It’s not all doom and gloom though, as the work on removing orphan/abandoned data is going well.

I ran the first round of cleanup a few weeks ago, but the results were disappointing. The removal of the entities that I was targeting (accounts) and their related data only managed to remove around 8000 useless documents from the total of approximately 400 000.

The second round is looking much more promising, with current tests indicating that we might be able to remove something like 300 000 of those 400 000 documents, which is a pretty huge reduction. Why this service, whose entire purpose is to be a temporary data store, is accumulating documents at all is currently unknown. I’ll have to get to the bottom of that once I’ve dealt with the immediate performance issues.

The testing of the second round of abandoned document removal is time-consuming. I’ve just finished the development of the tool and a first round of testing, which validated that the system behaves a lot better with a much smaller set of documents in it (using a partial clone of our production environment), even when hit with real traffic replicated from production. That was something of a relief.

Now I have to test that the cleanup scripts and endpoints work as expected when they have to remove data from the server that might have large amounts of binary data linked to it.

This requires a full clone of the production environment, which is a lot more time consuming than the partial clone, because it also has to copy the huge amount of binary data in S3. On the upside, we have scripts that we can use to do the clone (as a result of the migration work), and with some tweaks I should be able to limit the resulting downtime to less than an hour.

Once I’ve verified that the cleanup works on as real of an environment as I can replicate, I’ll deploy it to production and everything will be puppies and roses.

Obviously.

Conclusion

This whole RavenDB journey has been a massive drain on my mental capacity for quite a while now, and that doesn’t even take into account the drain on the business. From the initial failure to load test the service properly (which I now believe was a result of having an insufficient number of documents in the database, combined with unrealistic test data), to the performance issues that occurred during the first few months after release (dealt with by scaling the underlying hardware), all the way through to the recent time-consuming investigative efforts, the usage of RavenDB for this particular service was a massive mistake.

The disappointing part is that it all could have easily gone the other way.

RavenDB might very well have been an amazing choice of technology for a service focused on being a temporary data store, which may have led to it being used in other software solutions in my company. The developer experience is fantastic, being amazingly easy to use once you’ve got the concepts firmly in hand, and it’s very well supported and polished. It’s complicated and very different from the standard way of thinking (especially regarding eventual consistency), so you really do need to know what you’re getting into, and that level of experience and understanding was just not present in our team at the time the choice was made.

Because of all the issues we’ve had, any additional usage of RavenDB would be met with a lot of resistance. It’s pretty much poison now, at least as far as the organisation is concerned.

Software development is often about choosing the right tool for the job, and the far-reaching impact of those choices should not be underestimated.


As part of the work I did recently to optimize our usage of RavenDB, I had to spin up a brand new environment for the service in question.

Thanks to our commitment to infrastructure as code (and our versioned environment creation scripts), this is normally a trivial process. Use TeamCity to trigger the execution of the latest environment package and then play the waiting game. To set up my test environment, all I had to do was supply the ID of the EC2 volume containing a snapshot of production data as an extra parameter.

Unfortunately, it failed.

Sometimes that happens (an unpleasant side effect of the nature of AWS), so I started it up again in the hopes that it would succeed this time.

It failed again.

Repeated failures are highly unusual, so I started digging. It looked as though the problem was that the environments were simply taking too long to “finish” provisioning. We allocate a fixed amount of time for our environments to complete, which includes the time required to create the AWS resources, deploy the necessary software and then wait until the resulting service can actually be hit from the expected URL.

Part of the environment setup is the execution of the CloudFormation template (and all of the AWS resources that go with that). The other part is deploying the software that needs to be present on those resources, which is where we use Octopus Deploy. At startup, each instance registers itself with the Octopus server, applying tags as appropriate and then a series of Octopus projects (i.e. the software) is deployed to the instance.

Everything seemed to be working fine from an AWS point of view, but the initialisation of the machines was taking too long. They were only getting through a few of their deployments before running out of time. It wasn’t a tight time limit either; we’d allocated a very generous 50 minutes for everything to be up and running, so it was disappointing to see that it was taking so long.

The cfn-init logs were very useful here, because they show start/end timestamps for each step specified. Each of the deployments was taking somewhere between 12 and 15 minutes, which is way too slow, because there are at least 8 of them on the most popular machine.

This is one of those cases where I’m glad I put the effort in to remotely extract log files as part of the reporting that happens whenever an environment fails. It’s especially valuable when the environment creation is running in a completely automated fashion, like it does when executed through TeamCity, but it’s still very useful when debugging an environment locally. The last thing I want to do is manually remote into each machine in the environment and grab its log files in order to try and determine where a failure is occurring when I can just have a computer do it for me.

Speed Deprived Octopus

When using Octopus Deploy (at least the version we have, which is 2.6.4), you can choose to either deploy a specific version of a project or deploy the “latest” version, leaving the decision of what exactly is the latest version up to Octopus. I was somewhat uncomfortable with the concept of using “latest” for anything other than a brand new environment that had never existed before, because initial testing showed that “latest” really did mean the most recent release.

The last thing I wanted to do was to cede control over exactly what was getting deployed to a new production API instance when I had to scale up due to increased usage.

In order to alleviate this, we wrote a fairly simple PowerShell script that uses the Octopus .NET Client library to work out the best version to deploy, making sure to only use releases that have actually been deployed to the environment, and discounting deployments that failed.

$deployments = $repository.Deployments.FindMany({ param($x) $x.EnvironmentId -eq $env.Id -and $x.ProjectId -eq $project.Id })

if ($deployments | Any)
{
    Write-Verbose "Deployments of project [$projectName] to environment [$environmentName] were found. Selecting the most recent successful deployment."
    $latestDeployment = $deployments |
        Sort -Descending -Property Created |
        First -Predicate { $repository.Tasks.Get($_.TaskId).FinishedSuccessfully -eq $true } -Default "latest"

    $release = $repository.Releases.Get($latestDeployment.ReleaseId)
}
else
{
    Write-Verbose "No deployments of project [$projectName] to environment [$environmentName] were found."
}

$version = if ($release -eq $null) { "latest" } else { $release.Version }

Write-Verbose "The version of the recent successful deployment of project [$projectName] to environment [$environmentName] was [$version]. 'latest' indicates no successful deployments, and will mean the very latest release version is used."

return $version

We use this script during the deployment of the Octopus projects that I mentioned above.

When it was first written, it was fast.

That was almost two years ago though, and now it was super slow, taking upwards of 10 minutes just to find the most recent release. As you can imagine, this was problematic for our environment provisioning, because each environment had multiple projects being deployed during initialization, and all of those delays added up very, very quickly.

Age Often Leads to Slowdown

The slowdown didn’t just happen all of a sudden though. We had noticed the environment provisioning getting somewhat slower from time to time, but we’d assumed it was a result of there simply being more for our Octopus server to do (more projects, more deployments, more things happening at the same time). It wasn’t until it reached the critical point where I couldn’t even spin up one of our environments for testing purposes that it had to be addressed.

Looking at the statistics of the Octopus server during an environment provisioning, I noticed that it was using 100% CPU for the entire time that it was deploying projects, up until the point where the environment timed out.

Our Octopus server isn’t exactly a powerhouse, so I thought maybe it had just reached a point where it simply did not have enough power to do what we wanted it to do. This intuition was compounded by the knowledge that Octopus 2.6 uses RavenDB as its backend, and they’ve moved to SQL Server for Octopus 3, citing performance problems.

I dialed up the power (good old AWS), and then tried to provision the environment again.

Still timed out, even though the CPU was no longer maxing out on the Octopus server.

My next suspicion was that I had made an error of some sort in the script, but there was nothing obviously slow about it. It makes a bounded query to Octopus via the client library (.FindMany with a Func specifying the environment and project, to narrow down the result set) and then iterates through the resulting deployments (by date, so working backwards in time) until it finds an acceptable deployment for identifying the best release.

Debugging the automated tests around the script, the slowest part seemed to be when it was getting the deployments from Octopus. Using Fiddler, the reason for this became obvious.

The call to .FindMany with a Func was not optimizing the resulting API calls using the information provided in the Func. The script had been getting slower and slower over time because every run had to retrieve every single deployment recorded on the server (including those unrelated to the current environment and project). This meant hundreds of queries to the Octopus API, just to work out what version needed to be deployed.

The script was initially fast because I wrote it when we only had a few hundred deployments recorded in Octopus. We have thousands now and it grows every day.

What I thought was a bounded query, wasn’t bounded in any way.

But Age Also Means Wisdom

On the upside, the problem wasn’t particularly hard to fix.

The Octopus HTTP API is pretty good. In fact, it actually has specific query parameters for specifying environment and project when you’re using the deployments endpoint (which is another reason why I assumed the client would be optimised to use them based on the incoming parameters). All I had to do was rewrite the “last release to environment” script to use the API directly instead of relying on the client.

$environmentId = $env.Id;

$headers = @{"X-Octopus-ApiKey" = $octopusApiKey}
$uri = "$octopusServerUrl/api/deployments?environments=$environmentId&projects=$projectId"
while ($true) {
    Write-Verbose "Getting the next set of deployments from Octopus using the URI [$uri]"
    $deployments = Invoke-RestMethod -Uri $uri -Headers $headers -Method Get -Verbose:$false
    if (-not ($deployments.Items | Any))
    {
        $version = "latest";
        Write-Verbose "No deployments could not found for project [$projectName ($projectId)] in environment [$environmentName ($environmentId)]. Returning the value [$version], which will indicate that the most recent release is to be used"
        return $version
    }

    Write-Verbose "Finding the first successful deployment in the set of deployments returned. There were [$($deployments.TotalResults)] total deployments, and we're currently searching through a set of [$($deployments.Items.Length)]"
    $successful = $deployments.Items | First -Predicate { 
        $uri = "$octopusServerUrl$($_.Links.Task)"; 
        Write-Verbose "Getting the task the deployment [$($_.Id)] was linked to using URI [$uri] to determine whether or not the deployment was successful"
        $task = Invoke-RestMethod -Uri $uri -Headers $headers -Method Get -Verbose:$false; 
        $task.FinishedSuccessfully; 
    } -Default "NONE"
    
    if ($successful -ne "NONE")
    {
        Write-Verbose "Finding the release associated with the successful deployment [$($successful.Id)]"
        $release = Invoke-RestMethod "$octopusServerUrl$($successful.Links.Release)" -Headers $headers -Method Get -Verbose:$false
        $version = $release.Version
        Write-Verbose "A successful deployment of project [$projectName ($projectId)] was found in environment [$environmentName ($environmentId)]. Returning the version of the release attached to that deployment, which was [$version]"
        return $version
    }

    Write-Verbose "Finished searching through the current page of deployments for project [$projectName ($projectId)] in environment [$environmentName ($environmentId)] without finding a successful one. Trying the next page"
    $next = $deployments.Links."Page.Next"
    if ($next -eq $null)
    {
        Write-Verbose "There are no more deployments available for project [$projectName ($projectId)] in environment [$environmentName ($environmentId)]. We're just going to return the string [latest] and hope for the best"
        return "latest"
    }
    else
    {
        $uri = "$octopusServerUrl$next"
    }
}

It’s a little more code than the old one, but it’s much, much faster: 200 seconds (for our test project) down to something like 6 seconds, almost all of which was the result of reducing the number of deployments that need to be interrogated to find the best release. The full test suite for our deployment scripts (which actually provisions test environments) halved its execution time too (from 60 minutes to 30), which was nice, because the tests were starting to get a bit too slow for my liking.

Summary

This was one of those cases where an issue had been present for a while, but we’d chalked it up to growth issues rather than an actual inefficiency in the code.

The most annoying part is that I did think of this specific problem when I originally wrote the script, which is why I included the winnowing parameters for environment and project. It’s pretty disappointing to see that the client did not optimize the resulting API calls by making use of those parameters, although I have a suspicion about that.

I think that if I were to run the same sort of code in .NET directly (rather than through PowerShell), the resulting call to .FindMany would actually be optimized in the way that I thought it was. I have absolutely no idea how PowerShell handles the equivalent of lambda expressions, so if it was just passing a pure Func through to the function instead of an Expression<Func>, there would be no obvious way for the client library to determine whether it could use the optimized query string parameters on the API.
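To illustrate the difference I’m talking about (this is just general .NET behaviour, nothing specific to the Octopus client):

using System;
using System.Linq.Expressions;

public static class FuncVersusExpression
{
    public static void Main()
    {
        // A compiled delegate. The only thing a library can do with this is execute it,
        // which means pulling everything back and filtering in memory.
        Func<string, bool> asDelegate = id => id == "Environments-1";

        // An expression tree describing the same predicate. A library can walk this at
        // runtime, see that it is an equality check against a particular value, and
        // translate it into something like ?environments=Environments-1 on the API call.
        Expression<Func<string, bool>> asExpression = id => id == "Environments-1";

        Console.WriteLine(asDelegate("Environments-1"));  // True
        Console.WriteLine(asExpression.Body.NodeType);    // Equal
    }
}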

Still, I’m pretty happy with the end result, especially considering how easy it was to adapt the existing code to just use the HTTP API directly.

It helps that it’s a damn good API.


Progress!

With what? With our performance issues with RavenDB, that’s what.

Before I get into that in too much detail, I’ll try to summarise the issue at hand to give some background information. If you’re interested, the full history is available by reading the posts in the RavenDB category.

Essentially, we are using RavenDB as the persistence layer for a service that acts as a temporary waypoint for data, allowing for the connection of two disparate systems that are only intermittently connected. Unfortunately, mostly due to inexperience with the product and possibly as a result of our workload not being a good fit for the technology, we’ve had a lot of performance problems as usage has grown.

We’re not talking about a huge uptick in usage or anything like that either. Usage growth has been slow, and we’re only now dealing with around 1100 unique customer accounts and less than 120 requests per second (on average).

In the time that it’s taken the service usage to grow that far, we’ve had to scale the underlying infrastructure used for the RavenDB database from an m3.medium all the way through to an r3.8xlarge, as well as incrementally providing it with higher and higher provisioned IOPS (6000 at last count). This felt somewhat ridiculous to us, but we did it in order to deal with periods of high request latency, the worst of which seemed to align with when the database process dropped large amounts of memory and then thrashed the disk in order to refill it.

As far as we could see, we weren’t doing anything special, and we’d pretty much reached the end of the line as far as throwing infrastructure at the problem goes, so we had to engage Hibernating Rhinos (the company behind RavenDB) directly to get to the root of the issue.

To their credit, they were constantly engaged and interested in solving the problem for us. We had a few false starts (upgrading from RavenDB 2.5 to RavenDB 3 surfaced an entirely different issue), but once we got a test environment up that they could access, with real traffic being replicated from our production environment, they started to make some real progress.

The current hypothesis? Because our documents are relatively large (50-100KB) and our queries frequent and distributed across the entire set of documents (because it’s a polling-based system), RavenDB has to constantly read documents into memory in order to respond to a particular query, and then throw those documents out in order to deal with subsequent queries.

The solution? The majority of the queries causing the problem don’t actually need the whole document to return a result, just a few fields. If we use a static index that stores those fields, we take a large amount of memory churn out of the equation.

Index Fingers Are For Pointing. Coincidence?

The post I wrote talking about the complications of testing RavenDB due to its eventual consistency model talked a little bit about indexes, so if you want more information, go have a read of that.

Long story short, indexes in RavenDB are the only way to query data if you don’t know the document ID. With an index you are essentially pre-planning how you want to query your documents by marking certain fields as indexed. You can have many indexes, each using different fields to provide different search capabilities.

However, based on my current understanding, all an index does is point you towards the set of documents that are the result of the query. It can’t actually return any data itself, unless you set it up specifically for that.

In our case, we were using a special kind of index called an auto index, which just means that we didn’t write any indexes ourselves, we just let RavenDB handle it (auto indexes, if enabled, are created whenever a query is made that can’t be answered by an existing index, making the system easier to use in exchange for a reduction in control).

As I wrote above, this meant that while the query to find the matching documents was efficient, those documents still needed to be read from disk (if not already in memory), which caused a high amount of churn with memory utilization.

The suggestion from Hibernating Rhinos was to create a static index and to store the values of the fields needed for the response in the index itself.

I didn’t even know that was a thing you had to do! I assumed that a query on an index that only projected fields already in the index would be pretty close to maximum efficiency, but field values aren’t stored directly in indexes unless specifically requested. Instead the field values are tokenized in various ways to allow for efficient searching.

Creating an index for a RavenDB database using C# is trivial. All you have to do is derive from a supplied class.

public class Entity_ByFieldAAndFieldB_StoringFieldAAndFieldC : AbstractIndexCreationTask<Entity>
{
    public Entity_ByFieldAAndFieldB_StoringFieldAAndFieldC()
    {
        // A field has to be emitted by the map before it can be stored in the index,
        // so FieldC is included here even though we only ever query by FieldA and FieldB.
        Map = entities =>
            from e in entities
            select new
            {
                FieldA = e.FieldA,
                FieldB = e.FieldB,
                FieldC = e.FieldC
            };

        Stores.Add(i => i.FieldA, FieldStorage.Yes);
        Stores.Add(i => i.FieldC, FieldStorage.Yes);
    }
}

Using the index is just as simple. Just add another generic parameter to the query that specifies which index to use.

// stats has to be declared before it can be passed as an out parameter.
RavenQueryStatistics stats;

var query = _session
    .Query<Entity, Entity_ByFieldAAndFieldB_StoringFieldAAndFieldC>()
    .Statistics(out stats)
    .Where(x => x.FieldA < "value" && x.FieldB == "other value")
    .OrderBy(x => x.FieldA)
    .Select(e => new ProjectedModel { FieldA = e.FieldA, FieldC = e.FieldC });

As long as all of the fields that you’re using in the projection are stored in the index, no documents need to be read into memory in order to answer the query.
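For completeness, ProjectedModel above is nothing special, just a plain container for the fields being projected (the types are placeholders to match the anonymised field names):

public class ProjectedModel
{
    public string FieldA { get; set; }
    public string FieldC { get; set; }
}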

The last thing you need to do is actually install the index, so I created a simple RavenDbInitializer class that gets instantiated and executed during the startup of our service (Nancy + Ninject, so it gets executed when the RavenDB document store singleton is initialized).

public class RavenDbInitializer
{
    public void Initialize(IDocumentStore store)
    {
        store.ExecuteIndex(new Entity_ByFieldAAndFieldB_StoringFieldAAndFieldC());
    }
}
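As an aside, if the number of indexes grows beyond a handful, the RavenDB client can also scan an assembly and create every index it finds, which saves registering each one by hand. Something like this should do the same job (assuming the version of the client we’re on behaves as the documentation describes):

IndexCreation.CreateIndexes(typeof(RavenDbInitializer).Assembly, store);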

All the automated tests worked as expected so the only thing left was to validate any performance improvements.

Unfortunately, that’s where it got hard.

Performance Problems

I mentioned earlier that the way in which we got Hibernating Rhinos to eventually propose this hypothesis was to give them access to a production replica with production traffic being mirrored to it. The traffic replication is set up using a neat little tool called Gor, and when we set this replication up, we installed Gor on the database machine directly, because we were only interested in the traffic going to the database.

When it came time to test the optimizations outlined above, I initially thought that I would be able to use the same traffic replication to directly compare the old approach with the new static index based one.

This was not the case.

Because we were only replicating the database traffic (not the API traffic), all of the queries being replicated to the test database had no reference to the new index (when you make a query to Raven, part of the query is specifying which index should be used).

Normally I would use some set of load tests to compare the two approaches, but we’ve been completely unable to replicate enough traffic with the right shape to cause performance problems on the Raven database. It’s something we need to return to eventually, but I don’t have time to design and create a full load profile at this point (also, considering I was the one who wrote the first one, the one that failed miserably to detect or predict any of these issues, I know how complicated and time-consuming it can be).

I could see two options to test the new approach at this point:

  1. Add traffic replication to the API instances. This would allow me to directly compare the results of two environments receiving the same traffic, and the new environment would be hitting the index correctly (because it would have a new version of the API).
  2. Just ship it at a time when we have some traffic (which is all the time), but where if anything goes wrong, users are unlikely to notice.

I’m in the process of doing the first option, with the understanding that if I don’t get it done before this weekend, I’m just going to deploy the new API on Sunday and watch what happens.

I’ll let you know how it goes.

Conclusion

The sort of problem that Hibernating Rhinos identified is a perfect example of unintentionally creating issues with a technology simply due to lack of experience.

I had no idea that queries hitting an index and only asking for fields in the index would have to load the entire document before projecting those fields, but now that I know, I can understand why such queries could cause all manner of problems when dealing with relatively large documents.

The upside of this whole adventure is that the process of creating and managing indexes for RavenDB in C# is amazingly developer-friendly. It’s easily the best experience I’ve had working with a database-level optimization in a long time, which was a nice change.

As I’ve repeatedly said, there is nothing inherently wrong with RavenDB as a product. It’s easy to develop on and work with (especially with the latest version and its excellent built-in management studio). The problem is that someone made an arbitrary decision to use it without really taking all of the factors into consideration, and the people that followed that person (who left before even seeing the project come to fruition) had no idea what was going on.

This has left me in an awkward position, because even if I fix all the problems we’ve had, reduce the amount of infrastructure required and get the whole thing humming perfectly, no-one in the organisation will ever trust the product again, especially at a business level.

Which is a shame.