
It’s the gift that keeps on giving, our data synchronization process!

Well, it keeps on giving to me anyway, because it’s fuel for the furnaces of this blog. Sometimes finding topics to write about every week can be hard, so it’s nice when they drop into your lap.

Anyway, the process has started to creak at the seams a bit, because we’re pushing more data through it than ever before.

And when I say creak at the seams, what I mean is that our Read IOPS usage on the underlying database has returned to being consistently ridiculous.

Couldn’t Eat Another Bite

The data synchronization process had been relatively stable over most of 2018. Towards the middle, we scaled the underlying database to allow for the syncing of one of the two biggest data sets in the application, and after a slow rollout, that seemed to be going okay.

Of course, with that success under our belt, we decided to sync the other biggest data set in the application. Living life on the edge.

We ended up getting about halfway through before everything started to fall apart again, with similar symptoms to last time (spiking Read IOPS capping out at the maximum allowed burst, which would consume the IO credits and then tank the performance completely). We tried a quick fix of provisioning IOPS (to guarantee performance and remove the tipping point created by the consumption of IO credits), but it wasn’t enough.

The database just could not keep up with what was being demanded of it.

I’m A Very Understanding Person

Just like last time, the first step was to have a look at the queries being run and see if there was anything obviously inefficient.

With the slow queries related to the “version” of the remote table mostly dealt with in our last round of improvements, the majority of the slow queries remaining were focused on the part of the process that gets a table “manifest”. The worst offenders were the manifest calls for one of the big tables that we had only started syncing relatively recently. Keep in mind that this table is the “special” one featuring hard deletes (compared to the soft deletes of the other tables), so it was using the manifest functionality a lot more than any of the other tables were.

Having had enough of software level optimizations last time, we decided to try a different approach.

An approach that is probably, by far, the most common one when dealing with performance issues in a database.

Indexes.

Probably The Obvious Solution

The first time we had performance problems with the database we shied away from implementing additional indexes. At the time, we thought that the indexes that we did have were the most efficient for our query load (being a Clustered Index on the two most selective fields in the schema), and we assumed we would have to look elsewhere for optimization opportunities. Additionally, we were worried that the performance issues might have an underlying cause related to total memory usage, and adding another index (or 10) is just more things to keep in memory.

Having scaled the underlying instance and seen no evidence that the core problem was memory related, we decided to pull the index lever this time.

Analysis showed that the addition of another index similar to the primary key would allow for a decent reduction in the number of reads required to service a single request (in that the index would short-circuit the need to read the entire partition of the data set into memory in order to figure out what the max value was for the un-indexed field). A quick replication on our performance testing environment proved it unequivocally, which was nice.

For implementation, it’s easy enough to use Entity Framework to add an index as part of a database migration, so that’s exactly what we did.
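In rough terms, the migration looks something like this; the table and column names are placeholders rather than our real schema:

```csharp
// A rough sketch of the migration; the table and column names are placeholders,
// not our real schema.
using System.Data.Entity.Migrations;

public partial class AddCustomerRowVersionIndex : DbMigration
{
    public override void Up()
    {
        // Lets the "what's the latest row version for this customer" style queries be answered
        // from the index, instead of pulling the customer's whole slice of the table into memory.
        CreateIndex(
            table: "public.Entities",
            columns: new[] { "CustomerId", "RowVersion" },
            name: "IX_Entities_CustomerId_RowVersion");
    }

    public override void Down()
    {
        DropIndex("public.Entities", name: "IX_Entities_CustomerId_RowVersion");
    }
}
```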

We only encountered two issues, which was nice:

  • We didn’t seem to be able to use the concurrent index creation feature in PostgreSQL with the version of EF and Npgsql that we were using (which are older than I would like)
  • Some of the down migrations would not consistently apply, no matter what we tried

Neither of those two factors could stop us though, and the indexes were created.

Now we just had to roll them out.

Be Free Indexes, Be Free!

That required a little finesse.

We had a decent number of indexes that we wanted to add, and the datasets we wanted to add them to were quite large. Some of the indexes only took a few minutes to initialise, but others took as long as twenty.

Since we couldn’t seem to get concurrent index creation working with Entity Framework data migrations, we had to roll the indexes out one at a time, across sequential releases.

Not too hard, but a little bit more time consuming than we originally desired.

Of course, the sync process being what it is, it’s okay if it goes down for half an hour every now and then, so we just took everything out of service temporarily on each deployment to ensure that the database could focus on the index creation without having to worry too much about dealing with the constant flood of requests that it usually gets.

Conclusion

At the end of the day, this round of performance investigation and optimization actually took a hell of a lot less time and effort than the last, but I think that’s kind of to be expected when you’re actively trying to minimise code changes.

With the first few of the indexes deployed, we’ve already seen a significant drop in the Read IOPS of the database, and I think we’re going to be in a pretty good place to continue to sync the remainder of the massive data set that caused the database to choke.

The best indicator of future performance is past data though, so I’m sure there will be another post one day, talking all about the next terrible problem.

And how we solved it of course, because that’s what we do.


And we’re back

It’s time to ring in the new year with some more chatter about our data synchronization algorithm, which honestly, has turned into a pretty complex beast.

The last time I wrote about the algorithm, I explained how we flipped the original implementation of the differencing scan from bottom up to top down. This change represented a hearty improvement over the original implementation, because the data at the “top” of the table (i.e. when ordered by Row Version descending) is far more likely to change than the data at the “bottom”, at least for our clients anyway.

In that post though, I pointed out an inefficiency in the differencing scan, regardless of the direction:

There is still an inherent inefficiency in the design of the scan, such that it will continue to do extra work for no benefit. Consider the situation where the scan has started and has dealt with 4 chunks. It’s a bit down in the table, and the change it originally wanted to find was in chunk 5. In the meantime, the user has deleted data in chunk 2. The process will have to finish its scan of the entire table before it returns to the top and finds that particular change, which is just wasteful. Instead, it should be able to determine if there are any differences in the remainder of the data set (i.e. from the bottom of the current chunk down) and then use that information to decide to reset its position to the top again. We haven’t actually done this bit yet, because it came up after we deployed the first version to our users.

That inherent inefficiency is the topic for today.

Wait, Come Back, You’re Going In The Wrong Direction!

As mentioned above, the problem lies in the fact that once a difference is detected (i.e. one or more missing rows locally or remotely), the scan will just keep going until it finds and fixes said difference (or it hits the “end” of its scan and resets to the “top”). Of course, time does not stop while the scan is happening, and because it happens over a period of hours, with many relatively isolated executions of the algorithm, other changes are still being made to the underlying data.

Additions and modifications are handled by an entirely different process, so we don’t have to worry about them, but hard deletions can still occur at will.

When such a deletion occurs in the section that has already been scanned, but before the original difference that triggered the scan has been located and fixed, the algorithm can get stuck scanning the remainder of the data set for no real benefit. This actually isn’t a huge issue when the data set is small, or even medium sized, as you can just eat the inefficiency and wait. Eventually it will all work itself out.

Unfortunately, as the data set gets larger, the chances of a deletion occurring in the scanned section increases. Also, since changes are more likely to occur at the “top” of the data set, and the differencing scan works top-down, the sections that were most recently scanned are actually the most likely to contain differences. As a result, the algorithm spends longer and longer doing work for no benefit, so the inefficiency gets worse and worse.

Each chunk comparison requires the data to be present in memory in the remote database as well (it’s not an index-only query), so every pointless comparison actually results in more reads, which means more IOPS, which is why we started optimizing in the first place!

Long story short, we have to do something about it.

4665, 4666, 4667! There Are 4667 Rows Both Locally And Remotely, Ah Ah Ah

The good news is that this particular problem is not overly difficult to solve. It will still take some effort, but it’s not some fundamental flaw in the algorithm.

Every time a chunk is completed as part of the differencing scan, we can use a simple count to see whether or not the remaining differences (if there are any) are above or below the current location.

Locally this is trivial: just a query to the DB for a count of rows below the current range end.

Getting the same information from the remote requires the introduction of a new endpoint to the API though:

/v1/customers/{customerId}/databases/{databaseId}/tables/{tableName}/count{?aboveRowVersion=123&belowRowVersion=456}

Slightly generalised from our specific use case, but it basically lets you get a count of records:

  • above a certain row version
  • below a certain row version
  • in between two boundary row versions

For the differencing scan, we only really care about the “below a certain row version” use case, to get a count and compare it to the same count from the local data.

If the count is the same, we can safely exit early and flip back to the top of the data set, resetting the differencing scan so that it can pick up the more recent changes.

If the count is different, we just keep going down and repeat the process once we’ve actioned the next chunk.

Nice and easy.
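To make that a little more concrete, here’s a minimal sketch of the check; the interfaces and names are hypothetical stand-ins for the real local query and the new remote count endpoint:

```csharp
using System.Threading.Tasks;

// Hypothetical stand-ins for the real local query and the new remote count endpoint.
public interface ILocalStore { Task<long> CountBelowRowVersionAsync(long rowVersion); }
public interface IRemoteApi { Task<long> CountBelowRowVersionAsync(long rowVersion); }

public enum ScanAction { ContinueDown, ResetToTop }

public class DifferencingScan
{
    private readonly ILocalStore _local;
    private readonly IRemoteApi _remote;

    public DifferencingScan(ILocalStore local, IRemoteApi remote)
    {
        _local = local;
        _remote = remote;
    }

    // Called after each chunk has been actioned.
    public async Task<ScanAction> AfterChunkAsync(long currentRangeEnd)
    {
        // Count everything below the point the scan has reached, on both sides.
        var localRemaining = await _local.CountBelowRowVersionAsync(currentRangeEnd);
        var remoteRemaining = await _remote.CountBelowRowVersionAsync(currentRangeEnd);

        // Same count below us: nothing left to find down there, so flip back to the top
        // and pick up the more recent changes instead of scanning the rest for nothing.
        return localRemaining == remoteRemaining
            ? ScanAction.ResetToTop
            : ScanAction.ContinueDown;
    }
}
```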

Of course, there are still some edge cases. It’s possible (though not likely) to get into a situation where the counts are the same but the data is actually different (a particular combination of locally deleted data and data that was never uploaded successfully), which can throw the whole thing for a bit of a loop, but that could have happened regardless, so we’re still in a better place than we would otherwise be.

Conclusion

It’s been a bit of a hard slog trying to optimise the data synchronization algorithm, and we’re still not really there. Not only that, the algorithm itself has become more and more complex over time (obviously), and is getting pretty hard to reason about.

Annoyingly enough, we haven’t run across that one magical improvement that changes everything. It’s very much been a kind of “death by a thousand cuts” sort of thing, with tens of small optimizations that alleviate the issue slightly. The improvement in this post is a good example of that sort of thing, and pretty much boils down to “don’t do something you don’t have to do”, which isn’t exactly ground-breaking.

Don’t get me wrong, the process is much better than it’s ever been, but we’re still seeing very similar patterns in read IOPS, which is problematic.

It might be that the algorithm itself just doesn’t scale as well as we want it to, and that we might need to flip to a fundamentally different approach. Perhaps something that hooks deeper into the customer data and notifies us of creations/updates/deletions as they occur, rather than us having to poll for the same information.

Still, that sort of change is not something to be embarked on lightly, even disregarding the fallacy of sunk cost.


Last week I described the re-occurrence of an issue from late last year with our data synchronization process.

Basically, read IOPS on the underlying database were much higher than expected, causing performance issues in AWS when the database volume ran out of burst balance (or IO credits, as they are sometimes called).

After identifying that the most recent problem had been caused by the addition of tables, we disabled those tables, restoring performance and stability.

Obviously we couldn’t just leave them turned off though, and we couldn’t throw money at the problem this time, like we did last time. It was time to dig in and fix the problem properly.

But first we’d need to reproduce it.

But The Read Performance Issue In My Store

The good thing about ensuring that your environment is codified is that you can always spin up another one that looks just like the existing one.

Well, theoretically anyway.

In the case of the sync API environment everything was fine. One script execution later and we had a new environment called “sync performance” with the same number of identically sized API instances running the exact same code as production.

The database was a little trickier unfortunately.

You see, the database environment was from the time before I improved our environment deployment process. This meant that it was easy to make one, but hard to update an existing one.

Unfortunately, it was hard enough to update an existing one that the most efficient course of action had been to simply update the live one each time we had to tweak it, so we had diverged from the source code.

First step? Get those two back in sync.

Second step, spin up a new one that looks just like prod, which needed to include a copy of the prod data. Luckily, RDS makes that easy with snapshots.

With a fully functional, prod-like data synchronization environment up and running, all we needed was traffic.

Good thing Gor exists. We still had a deployable Gor component from the last time I wanted to replicate traffic, so all we needed to do was make a new Octopus project, configure it appropriately and deploy it to our production API.

Now we had two (mostly) identical environments processing the same traffic, behaving pretty much the same. Because we’d turned off multiple tables in order to stop the high read IOPS, it was a simple matter to turn one back on, causing a less severe version of the issue to reoccur in both environments (higher than normal read IOPS, but not enough to eat burst balance).

With that in place we were free to investigate and experiment.

Is Still Abhorred

I’m going to cut ahead in the timeline here, but we analysed the behaviour of the test environment for a few days, trying to get a handle on what was going on.

Leveraging some of the inbuilt query statistics in PostgreSQL, it looked like the most frequent and costly type of query was related to getting a “version” of the remote table for synchronization orchestration. The second most costly type of query was related to getting a “manifest” of a subset of the table being synchronized for the differencing engine.
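For reference, the inbuilt statistics in question come from the pg_stat_statements view (assuming the extension is enabled on the instance), and the sort of thing we were looking at boils down to a query like the one below, wrapped in Npgsql here because that’s what we use everywhere else:

```csharp
// A rough sketch of pulling the most expensive queries out of pg_stat_statements via Npgsql.
// Assumes the pg_stat_statements extension is enabled; on the PostgreSQL versions we were
// running, the cumulative time column is called total_time (and is in milliseconds).
using System;
using Npgsql;

public static class QueryStatistics
{
    public static void PrintWorstOffenders(string connectionString)
    {
        const string sql = @"
            SELECT query, calls, total_time
            FROM pg_stat_statements
            ORDER BY total_time DESC
            LIMIT 10;";

        using (var connection = new NpgsqlConnection(connectionString))
        using (var command = new NpgsqlCommand(sql, connection))
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine(
                        $"{reader.GetInt64(1)} calls, {reader.GetDouble(2):F0} ms total: {reader.GetString(0)}");
                }
            }
        }
    }
}
```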

Disabling those parts of the API (but leaving the uploads alone) dropped the IOPS significantly, surprising exactly zero people. This did disagree with our hypothesis from last time though, so that was interesting.

Of course, the API is pretty useless without the ability to inform the synchronization process, so it was optimization time.

  • We could try to reduce the total number of calls, reducing the frequency that those queries are executed. We’d already done some work recently to dramatically reduce the total number of calls to the API from each synchronization process though, so it was unlikely we would be able to get any wins here
  • We could implement a cache in front of the API, but this just complicates things and all it will really result in is doing work repeatedly for no benefit (if the process syncs data then asks the API for a status, and gets the cached response, it will just sync the data again)
  • We could reduce the frequency of syncing, doing it less often. Since we already did the work I mentioned above to reduce overall calls, the potential gains here were small
  • We could try to make the queries more efficient. The problem here was that the queries were already using the primary keys of the tables in question, so I’m not entirely sure that any database level optimizations on those tables would have helped
  • We could make getting an answer to the question “give me the remote version of this table” more efficient by using a dedicated data structure to service those requests, basically a fancy database level cache

We prototyped the last option (basically a special cache within the database that contained table versions in a much easier to query format) and it had a positive effect on the overall read IOPS.

But it didn’t get rid of it entirely.
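To give a sense of the concept (and only the concept, the names and shape below are invented rather than lifted from the actual prototype), the cache boils down to a tiny table that is maintained on the write path and read with a single primary key lookup:

```csharp
// Purely illustrative; the names and shape here are invented, not taken from the prototype.
// The idea is a compact "one row per customer/table" structure that can answer the version
// question with a single indexed read.
public static class TableVersionCacheSql
{
    public const string Create = @"
        CREATE TABLE IF NOT EXISTS table_versions
        (
            customer_id  bigint NOT NULL,
            table_name   text   NOT NULL,
            row_version  bigint NOT NULL,
            PRIMARY KEY (customer_id, table_name)
        );";

    // Maintained as part of the write path, so reads never have to touch the real table.
    public const string Update = @"
        INSERT INTO table_versions (customer_id, table_name, row_version)
        VALUES (@customerId, @tableName, @rowVersion)
        ON CONFLICT (customer_id, table_name)
        DO UPDATE SET row_version = GREATEST(table_versions.row_version, EXCLUDED.row_version);";

    // The query the API actually cares about: one primary key lookup instead of an aggregate
    // over the customer's entire partition.
    public const string Read = @"
        SELECT row_version
        FROM table_versions
        WHERE customer_id = @customerId AND table_name = @tableName;";
}
```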

Within The Sound Of Synchronization

Looking into our traffic, we discovered that our baseline traffic had crept up since we’d implemented the skipping strategy in the sync process. Most of that baseline traffic appeared to be requests relating to the differencing engine (i.e. scanning the table to get primary key manifests for comparison purposes), which was one of the expensive types of queries that we identified above.

We’d made some changes to the algorithm to incorporate the ability to put a cap on the number of skips we did (for safety, to avoid de-sync edge cases) and to introduce forced skips for tables whose changes we were happy to only sync a few times a day.

A side effect of these changes was that whenever we decided NOT to skip using the local comparison, the most common result of the subsequent local vs remote comparison was choosing to execute the differencing engine. The algorithm should have been choosing to do nothing when the local and remote versions were identical, but that piece did not seem to be working due to the way the skip resolution had been implemented.

Fixing the bug and deploying it caused the read IOPS on our normal production server to drop a little bit, which was good.

The different pattern of traffic + our prototype table version cache caused a much more dramatic drop in read IOPS in our test environment though. The two things acting together apparently reduced the demands on the database enough to prevent it from having to read so much all the time.

Conclusion

We’re still working on a production quality cached table version implementation, but I am cautiously optimistic. There are some tricky bits regarding the cache (like invalidation vs updates, and where that sort of decision is made), so we’ve got a bit of work ahead of us.

At this point I’m pretty thankful that we were easily able to both spin up an entirely separate and self contained environment for testing purposes, and that we were able to replicate traffic from one environment to the other without a lot of fuss. Without the capability to reproduce the problem disconnected from our clients and experiment, I don’t think we would have been able to tackle the problem as efficiently as we did.

I’m a little disappointed that a bug in our sync process managed to slip through our quality assurance processes, but I can understand how it happened. It wasn’t strictly a bug with the process itself, as the actions it was performing were still strictly valid, just not optimal. Software with many interconnected dependent components can be a very difficult thing to reason about, and this problem was relatively subtle unless you were looking for it specifically. We might have been able to prevent it from occurring with additional tests, but it’s always possible that we actually had those tests anyway, and during one of the subsequent changes a test failed and was fixed in an incorrect way. I mean, if that was the case, then we need to be more careful about “fixing” broken tests.

Regardless, we’re starting to head into challenging territory with our sync process now, as it is a very complicated beast. So complicated in fact that it’s getting difficult to keep the entire thing in your head at the same time.

Which is scary.


In late 2016 we deployed the first version of our data synchronization API. It was a momentous occasion, and it all worked perfectly…at least until we turned on the software that actually used it. Actual traffic has a habit of doing that.

Long story short, we experienced a bunch of growing pains as we scaled our traffic up during those first few weeks, and this post is a follow-up to one of those issues in particular.

Specifically, the issue of particularly high read IOPS on the RDS instance that backed the sync API.

At the time, we investigated, didn’t find the root cause of the problem, threw money at it, and ran away to do things that the business considered more important.

Why does it matter now?

Well, it happened again, and this time we can’t just throw money at it.

Sound familiar?

I’ve Come to Read From You Again

As I mentioned above, the main symptom was overly high read IOPS on our RDS server. Much higher than we expected for what we were doing.

This was an issue because of the way that data volumes work in AWS. You get some guaranteed IOPS (read and write combined) based on the size of the volume (1GB = 3 IOPS), while at the same time you get the capability to burst up to 3000 IOPS for brief periods of time.

The brief period of time is the important part, because it is actually modelled as a “balance” of IOPS credits that you consume if you go above your baseline, which get replenished if you’re below your baseline. Very, very similar to the way that CPU credits work on EC2 instances. Back in late 2016, what exactly the balance was at any particular point in time was a bit of a mystery though, because you couldn’t graph or alarm on it through CloudWatch. You only knew when it ran out and the performance of your database tanked because it was dependent on using more IOPS than its baseline.

Anyway, you can chart and alarm on the balance now, which is nice.

We tried a few things at the time, including scaling the underlying data volume (to give us more baseline), but it wasn’t enough.

We then increased the size of the RDS instance, and the high read IOPS vanished. I mean, that was obviously the best outcome, clearly. A problem we didn’t understand reacting in a completely unexpected way when we threw more power at it.

At the time, we thought that it might be related to EF querying the database on writes (to choose between an insert or update), but we never proved that hypothesis conclusively.

No-one in the business complained too loudly about the cost and everything was working swimmingly, so after an initial period of discomfort (mostly the engineers inside us sobbing quietly), we promptly forgot about it all together.

Because Some Usage Softly Creeping

A few weeks back, our data synchronization API started behaving strangely.

The first we knew about the problem was when our external monitoring service (Pingdom) reported that the API was down. Then up. Then down again. Then up again. So on and so forth.

Looking at the RDS metrics there were thousands of connections active to the database, causing us to hit connection limits. My first thought was that someone internal to the organisation had somehow got the credentials to the database and was running insane queries for business intelligence. I mean, thinking about that conclusion now, it’s completely bogus, because:

  1. No-one has the credentials to the master database but a few members of my team
  2. No-one even knows what the URL to access the master database is
  3. We created a read replica specifically for the organisation to use for business intelligence, so why would someone use the master
  4. You can’t even connect to the master database without being on our internal AWS VPN, which most people in the organisation don’t have access to

It didn’t help that when PostgreSQL runs out of connections, you can’t even connect to it with pgAdmin or something similar to see what the connections are doing.

Anyway, after falsely bombarding some innocent people with polite requests to “please stop doing things to our master database, it’s killing our API”, I shut down our data synchronization API and all the connections disappeared.

Because it was our API.

The calls were coming from inside the house.

Wrecked Everything While I was Sleeping

Luckily, the data synchronization API is probably the one API we can take down for a few hours and not have anybody really notice. As long as the database is available, the other API that actually allows access to the data can continue to do what it normally does, just with increasingly stale information. It’s an eventually consistent model anyway, so some staleness is inherent.

We still only realistically had a few hours to dig into the issue before it became a real problem for actual users though.

Our only lead was the massive number of connections that were being established to the RDS instance. I managed to get a sample of what the connections were doing just as I shut the API down, and it wasn’t anything interesting. Just normal queries that I would expect from the API. Traffic hadn’t changed either, so we weren’t dealing with some weird DDOS attack or anything like that.

Well, to skip ahead, it turns out that the rise in the number of connections was a symptom, not the cause. The root issue was that we had burnt through our IO burst balance over the last few hours due to an increase in read IOPS. We’d almost done the same thing the day before, but it bottomed out at like 5% remaining because the day’s traffic started dropping off, so we didn’t notice. The lack of IOPS was causing each request to execute slower than normal, which just meant that there were more requests being executed in parallel.

Each API instance had a maximum number of connections that it could use, but the sum total across all of the instances was higher than the total number available to PostgreSQL (whoops).
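The arithmetic for that is simple enough once you actually sit down and do it; the numbers below are invented, but the Maximum Pool Size setting is the relevant Npgsql knob:

```csharp
// Purely illustrative numbers; the point is that (instances x per-instance pool size) has to
// stay below the PostgreSQL max_connections setting, with some headroom left for humans.
public static class ConnectionPoolSizing
{
    public static string BuildConnectionString()
    {
        const int apiInstances = 4;              // hypothetical
        const int postgresMaxConnections = 100;  // the PostgreSQL default
        const int reservedForHumans = 10;        // so pgAdmin still works when things go wrong

        var perInstancePoolSize = (postgresMaxConnections - reservedForHumans) / apiInstances;

        // Npgsql takes the pool cap from the connection string.
        return $"Host=sync-db;Database=sync;Username=api;Maximum Pool Size={perInstancePoolSize}";
    }
}
```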

But why were we eating all of our delicious burst balance all of a sudden?

Well, a few days prior we started syncing a new table, and even though it was a small one, it increased the drain on the infrastructure.

Interestingly enough, it wasn’t a case of read IOPS going from close to nothing to a massive number with that new table. Our read IOPS had actually increased to a ridiculous baseline a few weeks before that when we shipped a completely different table, and we just hadn’t noticed. The table we shipped recently had just been the straw that broke the camel’s back.

We turned both of the new tables off, restored the API and all was well.

Conclusion

Honestly, we really should have had some alarms. Any alarms. Literally any alarms on the things we were already suspicious of at the database level (connections, read IOPS, etc) would have given us some sort of warning that bad things were happening, weeks before they actually happened.

We have those alarms now, but it was still a pretty big oversight.
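For the curious, an alarm on the burst balance via the AWS SDK for .NET looks roughly like the following; the instance identifier and thresholds are illustrative rather than our actual configuration:

```csharp
// A rough sketch of the sort of alarm we now have in place. The alarm name, instance
// identifier and thresholds are made up for illustration.
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.CloudWatch;
using Amazon.CloudWatch.Model;

public static class BurstBalanceAlarm
{
    public static async Task CreateAsync()
    {
        using (var cloudWatch = new AmazonCloudWatchClient())
        {
            await cloudWatch.PutMetricAlarmAsync(new PutMetricAlarmRequest
            {
                AlarmName = "sync-db-burst-balance-low",     // hypothetical name
                Namespace = "AWS/RDS",
                MetricName = "BurstBalance",                 // percentage of IO credits remaining
                Dimensions = new List<Dimension>
                {
                    new Dimension { Name = "DBInstanceIdentifier", Value = "sync-production" } // hypothetical
                },
                Statistic = Statistic.Average,
                Period = 300,                                // five minute datapoints
                EvaluationPeriods = 3,
                Threshold = 50,                              // warn while there is still headroom
                ComparisonOperator = ComparisonOperator.LessThanThreshold,
                AlarmDescription = "Read IOPS are eating the burst balance faster than it replenishes"
            });
        }
    }
}
```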

With the new tables turned off and stability and functionality restored, we looked to fix the problem.

Throwing money at it wasn’t really feasible, as we would have basically doubled the infrastructure costs, and it was already costing us a few thousand $US each month. No, this time we would have to understand the problem and fix it properly.

But that’s a topic for next time, because this post is already large.


It’s been at least a few years since I’ve actually worked with a dedicated DBA. I’m sure the role still exists, but it seems like a lot of organisations mostly just expect their developers to be able to do anything at least passably well, especially with the seemingly inexorable merge of the development and operations responsibilities.

As a developer, I personally like the trend, because it moves me closer to the end-user and it gives me a greater sense of responsibility about the software I produce. If I’m going to be the one who has to support it and the one that gets woken up at 1 in the morning because production servers are melting down due to a poorly optimised data update algorithm, then you better believe I will be writing the most maintainable, supportable and reliable software that I can.

I mean, I would have done that anyway, but now I have even more of a personal stake in the situation.

Where I work now, we still have a dedicated operations department, but they are mostly responsible for managing the IT operations of the business as a whole, rather than the deployment specifics of any software that my team writes. They do still participate in our process in an advisory role (something for which I am perpetually grateful), but other than that, we’re pretty much on our own.

Anyway, the point that this rambling preface is slowly trying to arrive at is that we’ve had some teething issues with one of our more recent releases that I thought might be worth talking about.

Specifically, massive, unexpected read load on our relational databases during periods of high write activity.

Everything In a Box To The Write

The service at the core of the data freeing functionality that I’ve mentioned previously is our synchronization service.

Its only purpose is to facilitate the continual synchronization of data from on-premises databases to a central cloud repository, with the help of a locally installed application. I won’t go into too much detail about the actual algorithm in play, but it’s mostly just batched updates to the cloud database at regular intervals through an HTTP API, where the batched data is obtained by leveraging the SQL Server row version concept.
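Purely as an illustration (this is not our actual schema), the local side of that boils down to the classic rowversion high-water mark query:

```csharp
// Illustrative only, not our actual schema: the classic SQL Server rowversion high-water mark
// query that the local application uses to pick up the next batch of changes to upload.
public static class ChangeBatchSql
{
    public const string GetNextBatch = @"
        SELECT TOP (@batchSize) *
        FROM dbo.Entities
        WHERE RowVersion > @lastUploadedRowVersion
        ORDER BY RowVersion ASC;";
}
```

Each successful upload then just moves the high-water mark forward to the largest row version in the batch.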

In the last few weeks, we’ve been deploying the on-premises application to customers in order to kick off the synchronization process, a chunk of customers at a time.

It went pretty well, releasing to each group and watching their data upload, until we hit some sort of breaking point and the Read IOPS of our backend RDS service rose to an unsustainable level.

We were seeing numbers in excess of 1500 for read IOPS (and 200ish for write), which is somewhat problematic, because the database is relatively small (150GB), which means it only has around 450 baseline IOPS split between reads and writes. Considering the way that AWS volumes work (guaranteed baseline, spike up to 3000 by consuming IO credits), the consumption rate we were seeing would leave us high and dry within an hour or two. We scaled a number of things to deal with the traffic (which eventually subsided as the initial flurry of data gave way to more incremental updates).
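To put some rough numbers on “an hour or two”: a gp2 volume gets a bucket of around 5.4 million IO credits, and that bucket drains at whatever rate you exceed the baseline by, so with the numbers above:

```csharp
// Back of the envelope numbers; the 5.4 million credit bucket is the documented gp2 figure,
// the rest are the numbers from above.
public static class BurstBalanceMaths
{
    public static double HoursOfBurstRemaining()
    {
        const double burstBucketCredits = 5_400_000; // IO credits when the bucket is full
        const double baselineIops = 450;             // 150GB * 3 IOPS per GB
        const double observedIops = 1500 + 200;      // reads + writes we were seeing

        var drainPerSecond = observedIops - baselineIops;        // credits consumed each second
        var secondsUntilEmpty = burstBucketCredits / drainPerSecond;

        return secondsUntilEmpty / 3600;                         // roughly 1.2 hours
    }
}
```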

But where did the massive amount of read load come from?

To understand that, I’m going to have to outline a little bit more about the service itself.

The API for the service exposes an endpoint for each table we’re synchronizing, differentiated by customer identity. Something like the following:

https://{url}/{customer-identity}/tables/{table-name}

A GET to this endpoint returns a manifest of sorts, stating information about what data is currently available in the service for that customer-table combination, which is then used to decide what to upload. A POST allows for inserts and updates and a DELETE allows for either the removal of all data or the removal of a subset of data defined by the body.

Inside the service, the POST body is handled by iterating through the rows contained therein, and executing an add or update for each one using Entity Framework, saving as we go.
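In rough terms (the entity and property names below are placeholders, and the real handler obviously does more), the write path is shaped something like this:

```csharp
// A rough sketch of the write path; entity and property names are placeholders, not our
// real schema. AddOrUpdate matches on the primary key and issues an INSERT or an UPDATE
// accordingly.
using System.Collections.Generic;
using System.ComponentModel.DataAnnotations;
using System.ComponentModel.DataAnnotations.Schema;
using System.Data.Entity;
using System.Data.Entity.Migrations;   // for the AddOrUpdate extension method

// Placeholder entity.
public class EntityRow
{
    [Key, Column(Order = 0)] public long CustomerId { get; set; }
    [Key, Column(Order = 1)] public long Id { get; set; }
    public long RowVersion { get; set; }
}

public class SyncDbContext : DbContext
{
    public DbSet<EntityRow> Entities { get; set; }
}

public static class PostHandler
{
    public static void Handle(SyncDbContext context, IEnumerable<EntityRow> rows)
    {
        foreach (var row in rows)
        {
            // Looks innocent enough, but under the covers EF has to work out whether the row
            // already exists before it can decide between INSERT and UPDATE.
            context.Entities.AddOrUpdate(row);
            context.SaveChanges();   // saving as we go
        }
    }
}
```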

Profiling the resulting queries from the execution of a POST, we discovered that Entity Framework will always do a SELECT first on each row in question, in order to determine whether to run an INSERT or an UPDATE. Given that the number of these SELECT operations dwarfed the number of reads resulting from GET requests, we assumed that that was where the majority of the read load was coming from.

Now we just had to find out how to optimise it.

The EF Tradeoff

One of the really nice things when working with Entity Framework is that you don’t need to give too much thought to the resulting queries that it executes on the underlying data store. I mean, the whole point of the library is to reduce the visibility of those sorts of things, adding a nice abstraction layer over the top so you don’t have to deal with it.

When it comes to performance though, that is also one of the places where EF can hurt you the most.

For our purposes, EF increased the speed at which we could deliver this service (by removing some of the nastiness around talking to databases), but by being so far away from the proverbial metal, when it came time to try and deal with performance issues, there didn’t seem to be much that we could do.

We tried a few things with EF, including:

  • Just adding the entities, instead of add/update, which forced EF to insert and fail on primary key violations. This was faster, but a lot of our load is actually updates as opposed to pure inserts, so applied to our production environment it would have simply caused other performance issues (as a result of the constant exceptions).
  • Turning off the various entity tracking features in EF, as we didn’t need them for the relatively simple inserts we were doing. This helped a little bit, but the same pattern of SELECT, INSERT/UPDATE was still present afterwards, so it wouldn’t have resolved our read problem.
  • Updating to the very latest version of EF6 and Npgsql (just in case there had been some optimisations that we’d missed), but it was pretty much the same. We were not that far behind the curve anyway, so it wasn’t too surprising.
  • Batching our data updates under a single SaveChanges (there’s a rough sketch of this after the list). This helped a fair amount, but would require us to change the way we handle update errors to remove the bad rows and try the update again, rather than simply moving on. We don’t have any real statistics as to how many errors we actually get during updates, so this one was plausible, but would need more investigation.
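For the record, the tracking and batching tweaks were along the lines of the following, applied to the same placeholder context as the earlier sketch; the EF6 switches are real, everything else is illustrative:

```csharp
// The tracking and batching tweaks from the list, using the same placeholder context and
// entity as above. The EF6 configuration switches are real; the structure is illustrative.
using System.Collections.Generic;
using System.Data.Entity.Migrations;

public static class BatchedPostHandler
{
    public static void Handle(SyncDbContext context, IEnumerable<EntityRow> rows)
    {
        // We don't need change tracking or validation for writes this simple.
        context.Configuration.AutoDetectChangesEnabled = false;
        context.Configuration.ValidateOnSaveEnabled = false;

        foreach (var row in rows)
        {
            context.Entities.AddOrUpdate(row);   // still a SELECT per row under the covers
        }

        // A single SaveChanges (and a single transaction) for the whole batch instead of one per
        // row. The catch is that one bad row now fails the entire batch, which is the error
        // handling change mentioned above.
        context.SaveChanges();
    }
}
```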

Then we tried something completely different.

No EF.

Just pure ADO.NET insert calls, catching primary key violations and doing updates.

This approach was so much faster it wasn’t funny, but it left a bad taste in our mouths at multiple levels. For starters, we would have to write and maintain the SQL ourselves, which is something we’d rather not do (for various reasons, maintainability being one and database specific code being another). Even getting past that, we were still handling exceptions as part of normal application flow, which didn’t feel right, and would almost certainly lead to performance problems of its own, even if it was dodging the preliminary reads.

It was here that PostgreSQL came to our rescue.

Version 9.5 introduced an UPSERT command in the form of INSERT … ON CONFLICT DO UPDATE. Now we could leave the whole thing up to the database and move on with our lives. We’d still have to maintain the SQL ourselves (or write a generator of some sort), but that would be a small price to pay for a massive performance increase.
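We haven’t built the production version yet, but the shape of it is roughly the following, again with placeholder table and column names:

```csharp
// A rough sketch of where we're heading: a raw Npgsql upsert using the PostgreSQL 9.5+
// ON CONFLICT clause. Table and column names are placeholders, not our real schema.
using Npgsql;

public static class UpsertSketch
{
    private const string UpsertSql = @"
        INSERT INTO entities (customer_id, id, row_version, payload)
        VALUES (@customerId, @id, @rowVersion, @payload)
        ON CONFLICT (customer_id, id)
        DO UPDATE SET row_version = EXCLUDED.row_version,
                      payload     = EXCLUDED.payload;";

    public static void Upsert(NpgsqlConnection connection, long customerId, long id, long rowVersion, string payload)
    {
        using (var command = new NpgsqlCommand(UpsertSql, connection))
        {
            command.Parameters.AddWithValue("customerId", customerId);
            command.Parameters.AddWithValue("id", id);
            command.Parameters.AddWithValue("rowVersion", rowVersion);
            command.Parameters.AddWithValue("payload", payload);
            command.ExecuteNonQuery();
        }
    }
}
```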

Unfortunately, this sort of ends with an anti-climax, because we’re still in the process of implementing the raw ADO.NET, PostgreSQL-specific UPSERT as an optimisation. Once I have more information I’ll follow up with another post filling in the end of the story.

In the meantime, we threw money at it by scaling up the RDS instance so it could keep more things in memory at once, which alleviated the issue, at least until we hit the next breaking point.

Conclusion

It’s true that no plan survives contact with the enemy, even when the enemy isn’t particularly hostile. All of our data synchronization was a silent precursor to shipping some new cloud features, so no-one noticed when it exploded a few times, but it was still pretty depressing to see it be completely unable to handle the load we were throwing at it.

I’m not surprised that Entity Framework doesn’t lead to optimal performance, even though it’s not pleasant having that particular fact slap you in the face. We can still use EF for a lot of what we need to do, and benefit from all of its goodies, like database migrations and easy in-memory testing, but we definitely need to put some things in place to bypass EF when the situation calls for it.

When it comes to this particular issue; even though throwing money at a problem doesn’t really work for us in the long term (for obvious reasons), it was nice to have that option available until we had something better in place. We’re still not sure if what we’re planning on doing is going to make a massive difference, but I remain hopeful.

Of course, software development and hope is never a good combination.

Look where it gets us when it comes to estimates.