
It's been at least a few years since I've actually worked with a dedicated DBA. I'm sure the role still exists, but it seems like a lot of organisations mostly just expect their developers to be able to do anything at least passably well, especially with the seemingly inexorable merging of development and operations responsibilities.

As a developer, I personally like the trend, because it moves me closer to the end-user and it gives me a greater sense of responsibility for the software I produce. If I'm going to be the one who has to support it, and the one who gets woken up at 1 in the morning because production servers are melting down due to a poorly optimised data update algorithm, then you'd better believe I will be writing the most maintainable, supportable and reliable software that I can.

I mean, I would have done that anyway, but now I have even more of a personal stake in the situation.

Where I work now, we still have a dedicated operations department, but they are mostly responsible for managing the IT operations of the business as a whole, rather than the deployment specifics of any software that my team writes. They do still participate in our process in an advisory role (something for which I am perpetually grateful), but other than that, we’re pretty much on our own.

Anyway, the point that this rambling preface is slowly trying to arrive at is that we've had some teething issues with one of our more recent releases that I thought might be worth talking about.

Specifically, massive, unexpected read load on our relational databases during periods of high write activity.

Everything In a Box To The Write

The service at the core of the data freeing functionality that I’ve mentioned previously is our synchronization service.

Its only purpose is to facilitate the continual synchronization of data from on-premises databases to a central cloud repository, with the help of a locally installed application. I won't go into too much detail about the actual algorithm in play, but it's mostly just batched updates to the cloud database at regular intervals through an HTTP API, where the batched data is obtained by leveraging the SQL Server row version concept.
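To give a rough idea of the on-premises side, change detection over a rowversion column looks something like the sketch below. The table, column and batch size are made-up stand-ins rather than our actual schema; it just illustrates the rowversion-based batching.

// Rough sketch of rowversion-based change detection (hypothetical table and
// column names). Requires System.Data, System.Data.SqlClient and
// System.Collections.Generic.
public static List<object[]> GetChangedRows(string connectionString, byte[] lastSyncedVersion)
{
    const string changedRowsSql =
        @"SELECT TOP (@batchSize) *
          FROM dbo.SomeTable
          WHERE RowVersion > @lastSyncedVersion
          ORDER BY RowVersion;";

    var rows = new List<object[]>();

    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand(changedRowsSql, connection))
    {
        command.Parameters.AddWithValue("@batchSize", 1000);
        // rowversion columns map to SqlDbType.Timestamp (an 8 byte value).
        command.Parameters.Add("@lastSyncedVersion", SqlDbType.Timestamp).Value = lastSyncedVersion;

        connection.Open();
        using (var reader = command.ExecuteReader())
        {
            while (reader.Read())
            {
                var values = new object[reader.FieldCount];
                reader.GetValues(values);
                rows.Add(values);
            }
        }
    }

    // These rows would then be batched up and POSTed to the cloud API.
    return rows;
}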

In the last few weeks, we’ve been deploying the on-premises application to customers in order to kick off the synchronization process, a chunk of customers at a time.

It went pretty well, releasing to each group and watching their data upload, until we hit some sort of breakpoint and the Read IOPS of our backend RDS service rose to an unsustainable level.

We were seeing numbers in excess of 1500 read IOPS (and around 200 write IOPS), which is somewhat problematic because the database is relatively small (150GB), meaning it only has around 450 baseline IOPS split between reads and writes. Considering the way that AWS volumes work (a guaranteed baseline, with spikes up to 3000 paid for by consuming IO credits), the consumption rate we were seeing would leave us high and dry within an hour or two. We scaled a number of things to deal with the traffic (which eventually subsided as the initial flurry of data gave way to more incremental updates).
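As a back-of-the-envelope check (using the published gp2 numbers, so treat it as approximate): gp2 volumes get 3 IOPS per GB of baseline, so 150GB works out to roughly 450 IOPS, plus a burst bucket of about 5.4 million I/O credits. Consuming around 1700 IOPS against a 450 IOPS baseline drains that bucket at roughly 1250 credits per second, which empties it in about 4300 seconds, or a little over an hour.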

But where did the massive amount of read load come from?

To understand that, I’m going to have to outline a little bit more about the service itself.

The API for the service exposes an endpoint for each table we’re synchronizing, differentiated by customer identity. Something like the following:

https://{url}/{customer-identity}/tables/{table-name}

A GET to this endpoint returns a manifest of sorts, describing what data is currently available in the service for that customer-table combination, which is then used to decide what to upload. A POST allows for inserts and updates, and a DELETE allows for either the removal of all data or the removal of a subset of data defined by the body.
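As a purely illustrative sketch of how a client might drive that endpoint (the manifest field, DTO names and GetRowsChangedSince helper below are assumptions, not the real contract):

// Hedged sketch only: TableManifest, TableRow and GetRowsChangedSince are
// hypothetical stand-ins, and client.BaseAddress is assumed to point at the
// service. Requires System.Net.Http, Microsoft.AspNet.WebApi.Client and
// Newtonsoft.Json.
public class TableManifest
{
    public byte[] LatestRowVersion { get; set; }
}

public async Task SynchroniseTableAsync(HttpClient client, string customerIdentity, string tableName)
{
    var tableUrl = $"{customerIdentity}/tables/{tableName}";

    // GET the manifest to find out what the service already holds for this
    // customer-table combination.
    var manifestJson = await client.GetStringAsync(tableUrl);
    var manifest = JsonConvert.DeserializeObject<TableManifest>(manifestJson);

    // Work out which local rows are newer than what the service has, then
    // POST them as a batch of inserts/updates.
    List<TableRow> rows = GetRowsChangedSince(manifest.LatestRowVersion);
    var response = await client.PostAsJsonAsync(tableUrl, rows);
    response.EnsureSuccessStatusCode();
}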

Inside the service, the POST body is handled by iterating through the rows contained therein, and executing an add or update for each one using Entity Framework, saving as we go.

Profiling the resulting queries from the execution of a POST, we discovered that Entity Framework will always do a SELECT first on each row in question, in order to determine whether to run an INSERT or an UPDATE. Given that the number of these SELECT operations dwarfed the number of reads resulting from GET requests, we concluded that that was where the majority of the read load was coming from.
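For illustration, the row-by-row shape looks something like the sketch below (the context and entity names are made up, not our actual code); the important part is the extra round trip per row before EF decides between INSERT and UPDATE.

// Minimal sketch of a per-row add-or-update loop in EF6 (hypothetical
// SyncDbContext/CustomerRow types). Requires EntityFramework and
// System.Collections.Generic.
public void Upsert(SyncDbContext context, IEnumerable<CustomerRow> rows)
{
    foreach (var row in rows)
    {
        // Find() issues a SELECT by primary key when the entity isn't already
        // tracked, which for a fresh batch it never is - this is the read
        // load we were seeing.
        var existing = context.CustomerRows.Find(row.Id);

        if (existing == null)
        {
            context.CustomerRows.Add(row);                        // becomes an INSERT
        }
        else
        {
            context.Entry(existing).CurrentValues.SetValues(row); // becomes an UPDATE
        }

        // Saving per row also means a round trip (and a transaction) per row.
        context.SaveChanges();
    }
}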

Now we just had to find out how to optimise it.

The EF Tradeoff

One of the really nice things when working with Entity Framework is that you don’t need to give too much thought to the resulting queries that it executes on the underlying data store. I mean, the whole point of the library is to reduce the visibility of those sorts of things, adding a nice abstraction layer over the top so you don’t have to deal with it.

When it comes to performance though, that is also one of the places where EF can hurt you the most.

For our purposes, EF increased the speed at which we could deliver this service (by removing some of the nastiness around talking to databases), but because it kept us so far away from the proverbial metal, when it came time to deal with performance issues there didn't seem to be much that we could do.

We tried a few things with EF, including:

  • Just adding the entities instead of doing an add/update, which forced EF to always INSERT and fail on primary key violations. This was faster, but a lot of our load is actually updates as opposed to pure inserts, so applied to our production environment it would simply have caused other performance issues (as a result of the constant exceptions).
  • Turning off the various entity tracking features in EF, as we didn't need them for the relatively simple inserts we were doing. This helped a little bit, but the same pattern of SELECT then INSERT/UPDATE was still present afterwards, so it wouldn't have resolved our read problem.
  • Updating to the very latest versions of EF6 and Npgsql (just in case there had been some optimisations that we'd missed), but it was pretty much the same. We were not that far behind the curve anyway, so it wasn't too surprising.
  • Batching our data updates under a single SaveChanges (there's a rough sketch of this shape just after this list). This helped a fair amount, but would require us to change the way we handle update errors, removing the bad rows and trying the batch again rather than simply moving on. We don't have any real statistics on how many errors we actually get during updates, so this one was plausible, but would need more investigation.
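For reference, the combination of the last two ideas looks roughly like this; again the context and entity names are hypothetical, and this is a sketch of the shape rather than our actual implementation.

// Rough sketch of turning off tracking niceties and batching SaveChanges
// (hypothetical SyncDbContext/CustomerRow types, EF6).
public void UpsertBatch(SyncDbContext context, IEnumerable<CustomerRow> rows)
{
    // We don't need automatic change detection or validation for these
    // simple writes, so turn them off for the duration of the batch.
    context.Configuration.AutoDetectChangesEnabled = false;
    context.Configuration.ValidateOnSaveEnabled = false;

    foreach (var row in rows)
    {
        var existing = context.CustomerRows.Find(row.Id); // still a SELECT per row
        if (existing == null)
        {
            context.CustomerRows.Add(row);
        }
        else
        {
            context.Entry(existing).CurrentValues.SetValues(row);
        }
    }

    // One SaveChanges for the whole batch; a single bad row now fails the
    // entire batch, which is why the error handling would have to change.
    context.SaveChanges();
}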

Then we tried something completely different.

No EF.

Just pure ADO.NET insert calls, catching primary key violations and doing updates.

This approach was so much faster it wasn't funny, but it left a bad taste in our mouths at multiple levels. For starters, we would have to write and maintain the SQL ourselves, which is something we'd rather not do (for various reasons, maintainability being one and database specific code being another). Even getting past that, we were still handling exceptions as part of normal application flow, which didn't feel right, and would almost certainly lead to performance problems of its own, even if it was dodging the extra reads.
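A simplified sketch of that insert-first approach is below. The table, columns and error handling details are illustrative only (in particular, the exact exception type and SQL state check depends on the Npgsql version in use).

// Hedged sketch of insert-then-catch-then-update via raw ADO.NET/Npgsql
// (hypothetical table and column names). Requires the Npgsql package.
public void Upsert(NpgsqlConnection connection, Guid id, string payload)
{
    const string insertSql =
        "INSERT INTO customer_rows (id, payload) VALUES (@id, @payload);";
    const string updateSql =
        "UPDATE customer_rows SET payload = @payload WHERE id = @id;";

    try
    {
        using (var insert = new NpgsqlCommand(insertSql, connection))
        {
            insert.Parameters.AddWithValue("id", id);
            insert.Parameters.AddWithValue("payload", payload);
            insert.ExecuteNonQuery();
        }
    }
    catch (PostgresException e) when (e.SqlState == "23505") // unique_violation
    {
        // The row already exists, so fall back to an UPDATE. Exceptions as
        // flow control is exactly the part that felt wrong.
        using (var update = new NpgsqlCommand(updateSql, connection))
        {
            update.Parameters.AddWithValue("id", id);
            update.Parameters.AddWithValue("payload", payload);
            update.ExecuteNonQuery();
        }
    }
}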

It was here that PostgreSQL came to our rescue.

Version 9.5 introduced native upsert support in the form of INSERT ... ON CONFLICT DO UPDATE. Now we could leave the whole thing up to the database and move on with our lives. We'd still have to maintain the SQL ourselves (or write a generator of some sort), but that would be a small price to pay for a massive performance increase.
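Using the same made-up table as the previous sketch, the statement we'd be generating looks something like this:

// Hedged sketch of a PostgreSQL 9.5+ upsert issued through Npgsql, using the
// same hypothetical table and columns as the previous example.
const string upsertSql =
    "INSERT INTO customer_rows (id, payload) VALUES (@id, @payload) " +
    "ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload;";

using (var upsert = new NpgsqlCommand(upsertSql, connection))
{
    upsert.Parameters.AddWithValue("id", id);
    upsert.Parameters.AddWithValue("payload", payload);
    upsert.ExecuteNonQuery(); // the database decides between INSERT and UPDATE
}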

Unfortunately this post sort of ends with an anti-climax, because we're still in the process of implementing the raw ADO.NET, PostgreSQL specific UPSERT as an optimisation. Once I have more information I'll follow up with another post filling in the end of the story.

In the meantime, we threw money at the problem by scaling up the RDS instance so it could keep more things in memory at once, which alleviated the issue, at least until we hit the next breakpoint.

Conclusion

It's true that no plan survives contact with the enemy, even when the enemy isn't particularly hostile. All of our data synchronization was a silent precursor to shipping some new cloud features, so no-one noticed when it exploded a few times, but it was still pretty depressing to see it be completely unable to handle the load we were throwing at it. I'm not surprised that Entity Framework doesn't lead to optimal performance, even though it's not pleasant having that particular fact slap you in the face. We can still use EF for a lot of what we need to do, and benefit from all of its goodies, like database migrations and easy in-memory testing, but we definitely need to put some things in place to bypass EF when the situation calls for it.

When it comes to this particular issue, even though throwing money at a problem doesn't really work for us in the long term (for obvious reasons), it was nice to have that option available until we had something better in place. We're still not sure if what we're planning to do will make a massive difference, but I remain hopeful.

Of course, software development and hope is never a good combination.

Look where it gets us when it comes to estimates.