Hello IOPS My Old Friend, Part 2
Last week I described the re-occurrence of an issue from late last year with our data synchronization process.
Basically, read IOPS on the underlying database much higher than expected, causing issues with performance in AWS when the database volume ran out of burst balance (or IO credits as they are sometimes called).
After identifying that the most recent problem had been caused by the addition of tables, we disabled those tables, restoring performance and stability.
Obviously we couldn’t just leave them turned off though, and we couldn’t throw money at the problem this time, like we did last time. It was time to dig in and fix the problem properly.
But first we’d need to reproduce it.
But The Read Performance Issue In My Store
The good thing about ensuring that your environment is codified is that you can always spin up another one that looks just like the existing one.
Well, theoretically anyway.
In the case of the sync API environment everything was fine. One script execution later and we had a new environment called “sync performance” with the same number of identically sized API instances running the exact same code as production.
The database was a little trickier unfortunately.
You see, the database environment was from the time before Iimproved our environment deployment process. This meant that it was easy to make one, but hard to update an existing one.
Unfortunately, it was hard enough to update an existing one that the most efficient course of action had been to simply update the live one each time we had to tweak it, so we had diverged from the source code.
First step? Get those two back in sync.
Second step, spin up a new one that looks just like prod, which needed to include a copy of the prod data. Luckily, RDS makes that easy with snapshots.
With a fully functional, prod-like data synchronization environment up and running, all we needed was traffic.
Good thing Gor exists. We still had a deployable Gor component from last time I wanted to replicate traffic, so all we needed to do was make a new Octopus project, configure it appropriately and deploy it to our production API.
Now we had two (mostly) identical environments processing the same traffic, behaving pretty much the same. Because we’d turned off multiple tables in order to stop the high read IOPS, it was a simple matter to turn one back on, causing a less severe version of the issue to reoccur in both environments (higher than normal read IOPS, but not enough to eat burst balance).
With that in place we were free to investigate and experiment.
Is Still Abhorred
I’m going to cut ahead in the timeline here, but we analysed the behaviour of the test environment for a few days, trying to get a handle on what was going on.
Leveraging some of the inbuilt query statistics in postgreSQL, it looked like the the most frequent and costly type of query was related to getting a “version” of the remote table for synchronization orchestration. The second most costly type of query was related to getting a “manifest” of a subset of the table being synchronized for the differencing engine.
Disabling those parts of the API (but leaving the uploads alone) dropped the IOPS significantly, surprising exactly zero people. This did disagree with our hypothesis from last time though, so that was interesting.
Of course, the API is pretty useless without the ability to inform the synchronization process, so it was optimization time.
- We could try to reduce the total number of calls, reducing the frequency that those queries are executed. We’d already done some work recently to dramatically reduce the total number of calls to the API from each synchronization process though, so it was unlikely we would be able to get any wins here
- We could implement a cache in front of the API, but this just complicates things and all it will really result in is doing work repeatedly for no benefit (if the process syncs data then asks the API for a status, and gets the cached response, it will just sync the data again)
- We could reduce the frequency of syncing, doing it less often. Since we already did the work I mentioned above to reduce overall calls, the potential gains here were small
- We could try to make the queries more efficient. The problem here was that the queries were already using the primary keys of the tables in question, so I’m not entirely sure that any database level optimizations on those tables would have helped
- We could make getting an answer to the question “give me the remote version of this table” more efficient by using a dedicated data structure to service those requests, basically a fancy database level cache
We prototyped the last option (basically a special cache within the database that contained table versions in a much easier to query format) and it had a positive effect on the overall read IOPS.
But it didn’t get rid of it entirely.
Within The Sound Of Synchronization
Looking into our traffic, we discovered that our baseline traffic had crept up since we’d implemented the skipping strategy in the sync process. Most of that baseline traffic appeared to be requests relating to the differencing engine (i.e. scanning the table to get primary key manifests for comparison purposes), which was one of the expensive type of queries that we identified above.
We’d made some changes to the algorithm to incorporate the ability to put a cap on the number of skips we did (for safety, to avoid de-sync edge cases) and to introduce forced skips for tables whose changes we were happy to only sync a few times a day.
A side effect of these changes was that whenever we decided NOT to skip using the local comparison, the most common result of the subsequent local vs remote comparison was choosing to execute the differencing engine. There was a piece of the algorithm missing where it should have been choosing to do nothing if the local and remote were identical, but that did not seem to be working due to the way the skip resolution had been implemented.
Fixing the bug and deploying it cause the read IOPS on our normal production server to drop a little bit, which was good.
The different pattern of traffic + our prototype table version cached caused a much more dramatic drop in read IOPS in our test environment though. It looked like the two things acting together apparently reduced the demands on the database enough to prevent it from having to read so much all the time.
We’re still working on a production quality cached table version implementation, but I am cautiously optimistic. There are some tricky bits regarding the cache (like invalidation vs updates, and where that sort of decision is made), so we’ve got a bit of work ahead of us though.
At this point I’m pretty thankful that we were easily able to both spin up an entirely separate and self contained environment for testing purposes, and that we were able to replicate traffic from one environment to the other without a lot of fuss. Without the capability to reproduce the problem disconnected from our clients and experiment, I don’t think we would have been able to tackle the problem as efficiently as we did.
I’m a little disappointed that a bug in our sync process managed to slip through our quality assurance processes, but I can understand how it happened. It wasn’t strictly a bug with the process itself, as the actions it was performing were still strictly valid, just not optimal. Software with many interconnected dependent components can be a very difficult thing to reason about, and this problem was relatively subtle unless you were looking for it specifically. We might have been able to prevent it from occurring with additional tests, but its always possible that we actually had those tests anyway, and during one of the subsequent changes a test failed and was fixed in an incorrect way. I mean, if that was the case, then we need to be more careful about “fixing” broken tests.
Regardless, we’re starting to head into challenging territory with our sync process now, as it is a very complicated beast. So complicated in fact that its getting difficult to keep the entire thing in your head at the same time.
Which is scary.