Its the gift that keeps on giving, our data synchronization process!

Well, it keeps on giving to me anyway, because its fuel for the furnaces of this blog. Sometimes finding topics to write about every week can be hard, so its nice when they drop into your lap.

Anyway, the process has started to creak at the seams a bit, because we’re pushing more data through it than ever before.

And when I say creak at the seams, what I mean is that our Read IOPS usage on the underlying database has returned to being consistently ridiculous.

Couldn’t Eat Another Bite

The data synchronization process had been relatively stable over most of 2018. Towards the middle, we scaled the underlying database to allow for the syncing of one of the two biggest data sets in the application, and after a slow rollout, that seemed to be going okay.

Of course, with that success under our belt, we decided to sync the other biggest data set in the application. Living life on the edge.

We ended up getting about half way through because everything started to fall apart again, with similar symptoms to last time (spiking Read IOPS capping out at the maximum allowed burst, which would consume the IO credits and then tank the performance completely). We tried a quick fix of provisioning IOPS (to guarantee performance and remove the tipping point created by the consumption of IO credits), but it wasn’t enough.

The database just could not keep up what was being demanded of it.

I’m A Very Understanding Person

Just like last time, the first step was to have a look at the queries being run and see if there was anything obviously inefficient.

With the slow queries related to the “version” of the remote table mostly dealt with in our last round of improvements, the majority of the slow queries remaining were focused on the part of the process that gets a table “manifest”. The worst offenders were the manifest calls for one of the big tables that we had only started syncing relatively recently. Keep in mind that this table is the “special” one featuring hard deletes (compared to the soft deletes of the other tables), so it was using the manifest functionality a lot more than any of the other tables were.

Having had enough of software level optimizations last time, we decided to try a different approach.

An approach that is probably, by far, the more common approach when dealing with performance issues in a database.


Probably The Obvious Solution

The first time we had performance problems with the database we shied away from implementing additional indexes. At the time, we thought that the indexes that we did have were the most efficient for our query load (being a Clustered Index on the two most selective fields in the schema), and we assumed we would have to look elsewhere for optimization opportunities. Additionally, we were worried that the performance issues might have an underlying cause related to total memory usage, and adding another index (or 10) is just more things to keep in memory.

Having scaled the underlying instance and seeing no evidence that the core problem was memory related, we decided to pull the index lever this time.

Analysis showed that the addition of another index similar to the primary key would allow for a decent reduction in the amount of reads required to service a single request (in that, the index would short circuit the need to read the entire partition of the data set into memory in order to figure out what the max value was for the un-indexed field). A quick replication on our performance testing environment proved it unequivocally, which was nice.

For implementation, its easy enough to use Entity Framework to add an index as part of a database migration, so that’s exactly what we did.

We only encountered two issues, which was nice:

  • We didn’t seem to be able to use the concurrent index creation feature in PostgreSQL with the version of EF and Npgsql that we were using (which are older than I would like)
  • Some of the down migrations would not consistently apply, no matter what we tried

Neither of those two factors could stop us though, and the indexes were created.

Now we just had to roll them out.

Be Free Indexes, Be Free!

That required a little finesse.

We had a decent number of indexes that we wanted to add, and the datasets we wanted to add them to were quite large. Some of the indexes only took a few minutes to initialise, but others took as long as twenty.

Being that we couldn’t seem to get concurrent index creation working with Entity Framework data migrations, we had to sequence them out one at a time in sequential releases.

Not too hard, but a little bit more time consuming than we originally desired.

Of course, the sync process being what it is, its okay if it goes down for a half hour every now and then, so we just took everything out of service temporarily on each deployment to ensure that the database could focus on the index creation without having worry too much about dealing with the constant flood of requests that it usually gets.


At the end of the day, this round of performance investigation and optimization actually took a hell of a lot less time and effort than the last, but I think that’s kind of to be expected when you’re actively trying to minimise code changes.

With the first few of the indexes deployed, we’ve already seen a significant drop in the Read IOPS of the database, and I think we’re going to be in a pretty good place to continue to sync the remainder of the massive data set that caused the database to choke.

The best indicator of future performance is past data though, so I’m sure there will be another post one day, talking all about the next terrible problem.

And how we solved it of course, because that’s what we do.


And thus we come to the end of this particular saga, of rewriting the our AWS Lambda ELB Logs Processor in C# using .NET Core.

To summarise the posts in this series:

  1. I went through some of the basics around how to use C# and .NET Core to implement an AWS Lambda function
  2. I explained how we constructed the automated build, test and deployment pipeline for the .NET Core ELB Logs Processor
  3. I outlined how we tested the new .NET Core Lambda function, with a specific focus around how we handled the AWS Credentials

In conjunction with those posts, I’ve also uploaded a copy of the entire source code (build and deployment scripts included) to Solavirum.Logging.ELB.Processor.Lambda, mostly as a master reference point. Its a copy of our internal repository, quickly sanitized to remove anything specific to our organization, so I can’t guarantee that it works perfectly, but there should be enough stuff there to demonstrate the entire process working end to end.

As with all the rest of our stuff, the build script is meant to run from inside TeamCity, and the deployment script is meant to run from inside Octopus Deploy, so if you don’t have either of those things, you will have to use your imagination.

Of course, everything I’ve gone through is purely academic unless its running successfully in Production…

Seize The Means Of Production

If you’ve read any of my previous posts about the Javascript/Node.js ELB Logs Processor, you’re probably aware of the fact that it tended to look like it was working fine right up to the point where I published it into production and the sheer weight of traffic caused it to have a meltdown.

Unsurprisingly, the difference between a sample file with a few hundred lines in it and a real file with hundreds of thousands of lines is significant, and that’s not even really taking into account the difference between my local development environment and AWS Lambda.

In an unprecedented result though, the .NET Core implementation worked just fine when I shipped it to our production environment.

Unfortunately it was around twenty times slower.

The old Javascript/Node.js implementation could process a file in around 9 seconds, which is  pretty decent.

The .NET Core implementation processes a file in around 200 seconds, which is a hell of a lot slower.

Now, I can explain some of that away with the fact that the .NET Core implementation uses HTTP instead of TCP to communicate with Logstash (which was the primary reason I rewrote it in the first place), but it was still pretty jarring to see.

Interestingly enough, because executing Lambda functions is so cheap, the total cost increase was like $US8 a month, even though it was twenty times slower. so who really cares.

In the end, while the increase in processing time was undesirable, I’d rather have slower C# and HTTP over faster Javascript and TCP.


Overall, I’m pretty happy with how this whole .NET Core rewrite went, for a number of reasons:

  • I got to explore and experiment with the newest .NET and C# framework, while delivering actual business value, which was great.
  • In the process of developing this particular solution, I managed to put together some more reusable scripts and components for building other .NET Core components
  • I finally switched the ELB Logs Processor to HTTP, which means that the overall feed of events into our ELK stack is more reliable (TCP was prone to failure)
  • I managed to get a handle on fully automated testing of components that require AWS Credentials, without accidentally checking the credentials into source control

In the end though, I’m mostly just happy because its not Javascript.


Last week I described the re-occurrence of an issue from late last year with our data synchronization process.

Basically, read IOPS on the underlying database much higher than expected, causing issues with performance in AWS when the database volume ran out of burst balance (or IO credits as they are sometimes called).

After identifying that the most recent problem had been caused by the addition of tables, we disabled those tables, restoring performance and stability.

Obviously we couldn’t just leave them turned off though, and we couldn’t throw money at the problem this time, like we did last time. It was time to dig in and fix the problem properly.

But first we’d need to reproduce it.

But The Read Performance Issue In My Store

The good thing about ensuring that your environment is codified is that you can always spin up another one that looks just like the existing one.

Well, theoretically anyway.

In the case of the sync API environment everything was fine. One script execution later and we had a new environment called “sync performance” with the same number of identically sized API instances running the exact same code as production.

The database was a little trickier unfortunately.

You see, the database environment was from the time before Iimproved our environment deployment process. This meant that it was easy to make one, but hard to update an existing one.

Unfortunately, it was hard enough to update an existing one that the most efficient course of action had been to simply update the live one each time we had to tweak it, so we had diverged from the source code.

First step? Get those two back in sync.

Second step, spin up a new one that looks just like prod, which needed to include a copy of the prod data. Luckily, RDS makes that easy with snapshots.

With a fully functional, prod-like data synchronization environment up and running, all we needed was traffic.

Good thing Gor exists. We still had a deployable Gor component from last time I wanted to replicate traffic, so all we needed to do was make a new Octopus project, configure it appropriately and deploy it to our production API.

Now we had two (mostly) identical environments processing the same traffic, behaving pretty much the same. Because we’d turned off multiple tables in order to stop the high read IOPS, it was a simple matter to turn one back on, causing a less severe version of the issue to reoccur in both environments (higher than normal read IOPS, but not enough to eat burst balance).

With that in place we were free to investigate and experiment.

Is Still Abhorred

I’m going to cut ahead in the timeline here, but we analysed the behaviour of the test environment for a few days, trying to get a handle on what was going on.

Leveraging some of the inbuilt query statistics in postgreSQL, it looked like the the most frequent and costly type of query was related to getting a “version” of the remote table for synchronization orchestration. The second most costly type of query was related to getting a “manifest” of a subset of the table being synchronized for the differencing engine.

Disabling those parts of the API (but leaving the uploads alone) dropped the IOPS significantly, surprising exactly zero people. This did disagree with our hypothesis from last time though, so that was interesting.

Of course, the API is pretty useless without the ability to inform the synchronization process, so it was optimization time.

  • We could try to reduce the total number of calls, reducing the frequency that those queries are executed. We’d already done some work recently to dramatically reduce the total number of calls to the API from each synchronization process though, so it was unlikely we would be able to get any wins here
  • We could implement a cache in front of the API, but this just complicates things and all it will really result in is doing work repeatedly for no benefit (if the process syncs data then asks the API for a status, and gets the cached response, it will just sync the data again)
  • We could reduce the frequency of syncing, doing it less often. Since we already did the work I mentioned above to reduce overall calls, the potential gains here were small
  • We could try to make the queries more efficient. The problem here was that the queries were already using the primary keys of the tables in question, so I’m not entirely sure that any database level optimizations on those tables would have helped
  • We could make getting an answer to the question “give me the remote version of this table” more efficient by using a dedicated data structure to service those requests, basically a fancy database level cache

We prototyped the last option (basically a special cache within the database that contained table versions in a much easier to query format) and it had a positive effect on the overall read IOPS.

But it didn’t get rid of it entirely.

Within The Sound Of Synchronization

Looking into our traffic, we discovered that our baseline traffic had crept up since we’d implemented the skipping strategy in the sync process. Most of that baseline traffic appeared to be requests relating to the differencing engine (i.e. scanning the table to get primary key manifests for comparison purposes), which was one of the expensive type of queries that we identified above.

We’d made some changes to the algorithm to incorporate the ability to put a cap on the number of skips we did (for safety, to avoid de-sync edge cases) and to introduce forced skips for tables whose changes we were happy to only sync a few times a day.

A side effect of these changes was that whenever we decided NOT to skip using the local comparison, the most common result of the subsequent local vs remote comparison was choosing to execute the differencing engine. There was a piece of the algorithm missing where it should have been choosing to do nothing if the local and remote were identical, but that did not seem to be working due to the way the skip resolution had been implemented.

Fixing the bug and deploying it cause the read IOPS on our normal production server to drop a little bit, which was good.

The different pattern of traffic + our prototype table version cached caused a much more dramatic drop in read IOPS in our test environment though. It looked like the two things acting together apparently reduced the demands on the database enough to prevent it from having to read so much all the time.


We’re still working on a production quality cached table version implementation, but I am cautiously optimistic. There are some tricky bits regarding the cache (like invalidation vs updates, and where that sort of decision is made), so we’ve got a bit of work ahead of us though.

At this point I’m pretty thankful that we were easily able to both spin up an entirely separate and self contained environment for testing purposes, and that we were able to replicate traffic from one environment to the other without a lot of fuss. Without the capability to reproduce the problem disconnected from our clients and experiment, I don’t think we would have been able to tackle the problem as efficiently as we did.

I’m a little disappointed that a bug in our sync process managed to slip through our quality assurance processes, but I can understand how it happened. It wasn’t strictly a bug with the process itself, as the actions it was performing were still strictly valid, just not optimal. Software with many interconnected dependent components can be a very difficult thing to reason about, and this problem was relatively subtle unless you were looking for it specifically.  We might have been able to prevent it from occurring with additional tests, but its always possible that we actually had those tests anyway, and during one of the subsequent changes a test failed and was fixed in an incorrect way. I mean, if that was the case, then we need to be more careful about “fixing” broken tests.

Regardless, we’re starting to head into challenging territory with our sync process now, as it is a very complicated beast. So complicated in fact that its getting difficult to keep the entire thing in your head at the same time.

Which is scary.


in late 2016 we deployed the first version of our data synchronization API. It was a momentous occasion, and it all worked perfectly…at least until we turned on the software that actually used it. Actual traffic has a habit of doing that.

Long story short, we experienced a bunch of growing pains as we scaled our traffic up during those first few weeks, and this post is a follow-up to one of those issues in particular.

Specifically, the issue of particularly high read IOPS on the RDS instance that backed the sync API.

At the time, we investigated, didn’t find the root cause of the problem, threw money at i,t and ran away to do things that the business considered more important.

Why does it matter now?

Well, it happened again, and this time we can’t just throw money at it.

Sound familiar?

I’ve Come to Read From You Again

As I mentioned above, the main symptom was overly high read IOPS on our RDS server. Much higher than we expected for what we were doing.

This was an issue because of the way that data volumes work in AWS. You get some guaranteed IOPS (read and write combined) based on the size of the volume (1GB = 3 IOPS), while at the same time you get the capability to burst up to 3000 IOPS for brief periods of time.

The brief period of time is the important part, because it is actually modelled as a “balance” of IOPS credits that you consume if you go above your baseline, which get replenished if you’re below your baseline. Very very similar to the way that CPU credits work on EC2 instances. Back in late 2016, what exactly the balance was at any particular point in time was a bit of mystery though, because you couldn’t graph or alarm on it through CloudWatch. You only knew when it ran out and the performance of your database tanked because it was dependent on using more IOPS than its baseline.

Anyway, you can chart and alarm on the balance now, which is nice.

We tried a few things at the time, including scaling the underlying data volume (to give us more baseline), but it wasn’t enough.

We then increased the size of the RDS instance, and the high read IOPS vanished. I mean, that was obviously the best outcome, clearly. A problem we didn’t understand reacting in a completely unexpected way when we threw more power at it.

At the time, we thought that it might be related to EF querying the database on writes (to choose between an insert or update), but we never proved that hypothesis conclusively.

No-one in the business complained too loudly about the cost and everything was working swimmingly, so after an initial period of discomfort (mostly the engineers inside us sobbing quietly), we promptly forgot about it all together.

Because Some Usage Softly Creeping

A few weeks back, our data synchronization API started behaving strangely.

The first we knew about the problem was when our external monitoring service (Pingdom) reported that the API was down. Then up. Then down again. Then up again. So on and so forth.

Looking at the RDS metrics there were thousands of connections active to the database, causing us to hit connection limits. My first thought was that someone internal to the organisation had somehow got the credentials to the database and was running insane queries for business intelligence. I mean, thinking about that conclusion now, its completely bogus, because:

  1. No-one has the credentials to the master database but a few members of my team
  2. No-one even knows what the URL to access the master database is
  3. We created a read replica specifically for the organisation to use for business intelligence, so why would someone use the master
  4. You can’t even connect to the master database without being on our internal AWS VPN, which most people in the organisation don’t have access to

It didn’t help that when PostgreSQL runs out of connections, you can’t even connect to it with pgAdmin or something similar to see what the connections are doing.

Anyway, after falsely bombarding some innocent people with polite requests to “please stop doing things to our master database, its killing our API”, I shut down our data synchronization API and all the connections disappeared.

Because it was our API.

The calls were coming from inside the house.

Wrecked Everything While I was Sleeping

Luckily, the data synchronization API is probably the one API we can take down for a few hours and not have anybody really notice. As long as the database is available, the other API that actually allows access to the data can continue to do what it normally does, just with increasingly stale information. Its an eventually consistent model anyway, so some staleness is inherent.

We still only realistically had a few hours to dig into the issue before it became a real problem for actual users though.

Our only lead was the massive number of connections that were being established to the RDS instance. I managed to get a sample of what the connections were doing just as I shut the API down, and it wasn’t anything interesting. Just normal queries that I would expect from the API. Traffic hadn’t changed either, so we weren’t dealing with some weird DDOS attack or anything like that.

Well, to skip ahead, it turns out that the rise in the number of connections was a symptom, not the cause. The root issue was that we had burnt through our IO burst balance over the last few hours due to an increase in read IOPS. We’d almost done the same thing the day before, but it bottomed out at like 5% remaining because the days traffic started dropping off, so we didn’t notice. The lack of IOPS was causing each request to execute slower than normal, which just meant that there were more requests being executed in parallel.

Each API instance had a maximum number of connections that it could use, but the sum total across all of the instances was higher than the total number available to postgreSQL (whoops).

But why were we eating all of our delicious burst balance all of a sudden?

Well, a few days prior we started syncing a new table, and even though it was a small one, it increased the drain on the infrastructure.

Interestingly enough, it wasn’t a case of read IOPS going from close to nothing to a massive number with that new table. Our read IOPS had actually increased to a ridiculous baseline a few weeks before that when we shipped a completely different table, and we just hadn’t noticed. The table we shipped recently had just been the straw that broke the camels back.

We turned both of the new tables off, restored the API and all was well.


Honestly, we really should have had some alarms. Any alarms. Literally any alarms on the things we were already suspicious of at the database level (connections, read IOPS, etc) would have given us some sort of warning that bad things were happening, weeks before they actually happened.

We have those alarms now, but it was still a pretty big oversight.

With the new tables turned off and stability and functionality restored, we looked to fix the problem.

Throwing money at it wasn’t really feasible, as we would have basically doubled the infrastructure costs, and it was already costing us a few thousand $US each month. No, this time we would have to understand the problem and fix is properly.

But that’s a topic for next time, because this post is already large.


Its been at least a few years since I’ve actually worked with a dedicated DBA. I’m sure the role still exists, but it seems like a lot of organisations mostly just expect their developers to be able to do anything at least passably well, especially with the seemingly inexorable merge of the development and operations responsibilities.

As a developer, I personally like the trend, because it moves me closer to the end-user and it gives me a greater sense of responsibility about the software I produce. If I’m going to be the one who has to support it and the one that gets woken up at 1 in the morning because production servers are melting down due to a poorly optimised data update algorithm, then you better believe I will be writing the most maintainable, supportable and reliable software that I can.

I mean, I would have done that anyway, but now I have even more of a personal stake in the situation.

Where I work now, we still have a dedicated operations department, but they are mostly responsible for managing the IT operations of the business as a whole, rather than the deployment specifics of any software that my team writes. They do still participate in our process in an advisory role (something for which I am perpetually grateful), but other than that, we’re pretty much on our own.

Anyway, the point that this rambling preface is slowly trying to arrive at is we’ve had some teething issues with one of our more recent releases that I thought might be worth talking about.

Specifically, massive, unexpected read load on our relational databases during periods of high write activity.

Everything In a Box To The Write

The service at the core of the data freeing functionality that I’ve mentioned previously is our synchronization service.

Its only purpose is to facilitate the continual synchronization of data from on-premises databases to a central cloud repository, with the help of a locally installed application. I won’t go into too much detail about the actual algorithm in play, but its mostly just batched updates to the cloud database at regular intervals through a HTTP API, where the batched data is obtained by leveraging the SQL Server row version concept.

In the last few weeks, we’ve been deploying the on-premises application to customers in order to kick off the synchronization process, a chunk of customers at a time.

It went pretty well, releasing to each group and watching their data upload, until we hit some sort of breakpoint and the Read IOPS of our backend RDS service rose to an unsustainable level.

We were seeing numbers in excess of 1500 for read IOPS (and 200ish for write), which is somewhat problematic, because the database is relatively small (150GB), which means it only has around 450 baseline IOPS split between reads and writes. Considering the way that AWS volumes work (guaranteed baseline, spike up to 3000 by consuming IO credits), the consumption rate we were seeing would leave us high and dry within an hour or two. We scaled a number of things to deal with the traffic (which eventually subsided as the initial flurry of data gave way to more incremental updates).

But where did the massive amount of read load come from?

To understand that, I’m going to have to outline a little bit more about the service itself.

The API for the service exposes an endpoint for each table we’re synchronizing, differentiated by customer identity. Something like the following:


A GET to this endpoint returns a manifest of sorts, stating information about what data is currently available in the service for that customer-table combination, which is then used to decide what to upload. A POST allows for inserts and updates and a DELETE allows for either the removal of all data or the removal of a subset of data defined by the body.

Inside the service, the POST body is handled by iterating through the rows contained therein, and executing an add or update for each one using Entity Framework, saving as we go.

Profiling the resulting queries from the execution of a POST, we discovered that entity framework will always do a SELECT first on each row in question, in order to determine whether to run an INSERT or an UPDATE. Given that the number of these SELECT operations dwarfed the number of reads resulting from GET requests, we assumed that that was where the majority of the read load was coming from.

Now we just had to find out how to optimise it.

The EF Tradeoff

One of the really nice things when working with Entity Framework is that you don’t need to give too much thought to the resulting queries that it executes on the underlying data store. I mean, the whole point of the library is to reduce the visibility of those sorts of things, adding a nice abstraction layer over the top so you don’t have to deal with it.

When it comes to performance though, that is also one of the places where EF can hurt you the most.

For our purposes, EF increased the speed at which we could deliver this service (by removing some of the nastyness around talking to databases), but by being so far away from the proverbial metal, when it came time to try and deal with performance issues, there didn’t seem to be much that we could do.

We tried a few things with EF, including:

  • Just adding the entities, instead of add/update, which forced EF to insert and fail on primary key violations. This was faster, but a lot of our load is actually updates as opposed to pure inserts, so applied to our production environment it would have simply caused other performance issues (as a result of the constant exceptions).
  • Turning off the various entity tracking features in EF, as we didn’t need them for the relatively simple inserts we were doing. This helped a little bit, but the same pattern of SELECT, INSERT/UPDATE was still present afterwards, so it wouldn’t have resolved out read problem.
  • Updating to the very latest version of EF6 and Npgsql (just in case there had been some opimisations that we’d missed), but it was pretty much the same. We were not that far behind the curve anyway, so it wasn’t too surprising.
  • Batching our data updates under a single SaveChanges. This helped a fair amount, but would require us to change the way we handle update errors to remove the bad rows and try the update again rather than simply moving on. We don’t have any real statistics as to how many errors we actually get during updates, so this one was plausible, but would need more investigation.

Then we tried something completely different.

No EF.

Just pure ADO.NET insert calls, catching primary key violations and doing updates.

This approach was so much faster it wasn’t funny, but it left a bad taste in our mouths at multiple levels. For starters, we would have to write and maintain the SQL ourselves, which is something we’d rather not do (for various reasons, maintainability being one and database specific code being another). Even getting past that, we were still handling exceptions as part of normal application flow, which didn’t feel right, and would almost certainly lead to performance problems of its own, even if it was dodging writes.

It was here that PostgreSQL came to our rescure.

Version 9.5 introduced an UPSERT command in the form of INSERT () ON CONFLICT UPDATE (). Now we could leave the whole thing up to the database and move on with our lives. We’d still have to maintain the SQL ourselves (or write a generator of some sort), but that would be a small price to pay for a massive performance increase.

Unfortunately this sort ends with an anti-climax, because we’re still in the process of implementing the raw ADO.NET, PostgreSQL specific UPSERT as an optimisation. Once I have more information I’ll follow up with another post filling in the end of the story.

In the meanwhile, we threw money at it by scaling up the RDS instance so it could keep more things in memory at once, which alleviated the issue, at least until we hit the next breakpoint.


Its true that no plan survives contact with the enemy, even when the enemy isn’t particularly hostile. All of our data synchronization was a silent precursor to shipping some new cloud features, so no-one notice when it exploded a few times, but it was still pretty depressing to see it be completely unable to handle the load we were throwing at it. I’m not surprised that Entity Framework doesn’t lead to optimal performance, even though its not pleasant having that particular fact slap you in the face. We can still use EF for a lot of what we need to do, and benefit from all of its goodies, like database migrations and easy in-memory testing, but we definitely need to put some things in place to bypass EF when the situation calls for it.

When it comes to this particular issue; even though throwing money at a problem doesn’t really work for us in the long term (for obvious reasons), it was nice to have that option available until we had something better in place. We’re still not sure if what we’re planning on doing is going to make a massive difference, but I remain hopeful.

Of course, software development and hope is never a good combination.

Look where it gets us when it comes to estimates.