The Upside Down Is A Dangerous Place

November 28. 2017 0 Comments

It’s been a little while since I wrote about our data synchronization algorithm, but it still gets a fair bit of space in my mind on a day to day basis.

Most recently, we put some effort into optimizing it in the face of rising IOPS usage at the database level, which worked well enough to stabilize the system in the face of what we wanted it to do at the time.

Then we told it to sync more data.

Specifically, one of the two biggest data sets in our software which covers the storage of repeatable tasks, like appointments, phone calls, inspections and so on.

Not only is this table one of the largest in raw size, it also features the most churn AND is one of the only tables to feature hard deletes, a challenge that we had to build an entire secondary sync process for.

Funnily enough, it actually worked pretty well at first, but everything started to fray at the edges as we started syncing more and more clients. Same sort of thing as before, pressure on IOPS at the database level, mostly as a result of reads.

With the memory of the last round of optimizations fresh in our minds, it was time to enter the breach once more.

Will To Survive

We prodded at the system for a bit, analysing the traffic patterns to try and determine what might be causing the high read IOPS this time. Being that we’d already investigated similar sorts of things recently, we knew our way around it pretty well, so it wasn’t long before we noticed something suspicious.

Lets go back a step though.

The core data synchronization algorithm hasn’t changed all that much from the last time I wrote about it, even in the face of the optimizations.

Get Local Count/Version/Timestamp
Get Remote Count/Version/Timestamp
If Local Count/Version/Timestamp == Remote Count/Version/Timestamp
    Do Nothing and Exit
If Local Version/Timestamp == Remote Version/Timestamp BUT Local Count != Remote Count
    Calculate [BATCH SIZE] Using Historical Data
    Get Last Local Position
    Get Next [BATCH SIZE] Local Rows from last position
    Get Min & Max Version in Batch
    Query Remote for Manifest Between Min/Max Local Version
    Create Manifest from Local Batch
    Compare
        Find Remote Not in Local
            Delete from Remote
        Find Local Not in Remote
            Upload to Remote
If Local Version/Timestamp > Remote Version/Timestamp
    Calculate [BATCH SIZE] Using Historical Data
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload to Remote
        Record Result for [BATCH SIZE] Tuning
        If Failure & Minimum [BATCH SIZE], Skip Ahead
If Local Version/Timestamp < Remote Version/Timestamp
    Find Remote > Local Version
    Delete from Remote

We’ve tweaked the mechanisms by which each evaluation of the algorithm determines its local and remote versions (there’s some caching on both the server and client side now, and each request to put/remote data returns new meta information from the server for efficiency) but that’s about it.

This time when we analysed the traffic, we noticed that there was a lot of calls to get table chunk manifests, an important part of the diff checker component of the algorithm, which is primarily meant to remove rows that have been deleted locally from the remote (i.e. hard deletes rather than the much easier to handle soft deletes, which are really just updates).

The problem with there being so many manifest calls is that they are quite expensive, especially for large tables.

Each manifest call requires a database query that can only use the first two components of the clustered index (which is client segmentation data), but must then scan through the remaining data in order to get the “chunk” needed for analysis. This can mean a scan of millions of rows, which is quite intense on read IOPS, because all of that data needs to be brought into memory in order to be evaluated.

But why was the number of manifest calls a problem now?

Sneaky Corporation

It all ties back into the data that we were syncing this time. As I mentioned at the beginning of this article, not only is it a huge table, it also features hard-deletes as the norm, rather than as exceptions. These few things added together create a bit of a perfect storm with regards to diff checker activity, which is why the system was starting to strain as a result of the massive increase in manifest calls.

Conceptually the diff checker part of the algorithm is a batched scan. It starts on one end of the data set, and gets chunks of data sequentially from both locations (local and remote), analysing and reacting as appropriate until it either gets back in sync or it reaches the other end, when it then wraps back around to it’s start point again.

Here’s where we made a mistake that seems obvious in hindsight.

In almost all cases, changes (even hard deletes) are more likely to occur at the top end of the data set, rather than the bottom.

Our scan?

Starts at the bottom and works its way up.

Thus, any time something was deleted from close to the top of the data set, the diff checker would have to scan through the entire table before finding and fixing it. This means more reads (to get the manifest chunks) which means higher IOPS.

This One Goes To Eleven

It really didn’t take a huge amount of effort to flip the diff checker algorithm to go top down instead of bottom up.

In fact, because of the way in which we’d built the code in the first place (using the strategy pattern to implement the actual chunking algorithm) it was as simple as writing a new class and binding it up instead of the old one via our dependency injection container.

The top down logic is basically the same as it is for bottom up, just inverted.

Starting the lower boundary of the last chunk (or the very top of the table if the state was reset), get the next [BATCH SIZE] chunk of rows from the local in a downwards direction. Using that chunk, find the extremities (top and bottom) and ask the remote for an equivalent chunk. Contrast and compare, react accordingly and then remember where you were up to for next time. If the algorithm reaches a point where the local and remote are identical, reset all state and do nothing.

Nothing super fancy, and an extremely obvious performance optimization in retrospect considering how much more likely changes are to occur at the top end rather than the bottom.

Such Strange Things

There are two tricksy bits though, one specific the top down approach and the other that was present in the bottom up approach, but is more obvious when working top down.

When you get the local chunk, and you get less than the batch size, you know that you’re probably at the bottom of the table. If you follow the same rules for asking the remote at this point, you risk leaving data in the remote that is not in the local if its right at the bottom. Instead, when you reach this point, you have to infer a lower boundary of 0 to make sure you get everything.
There is still an inherent inefficiency in the design of the scan, such that it will continue to do extra work for no benefit. Consider the situation where the scan has started and has dealt with 4 chunks. Its a bit down in the table, and the change it originally wanted to find was in chunk 5. In the meantime, the user has deleted data in chunk 2. The process will have to finish its scan of the entire table before it returns to the top and finds that particular change, which is just wasteful. Instead, it should be able to determine if there are any differences in the remainder of the data set (i.e. from the bottom of the current chunk down) and then use that information to decide to reset its position to the top again. We haven’t actually done this bit yet, because it came up after the deployed the first version to our users.

The first issue is a bit of a deal breaker (because it means the process can result in an incomplete data synchronization) so we’re lucky that we noticed it during testing and fixed it right away.

The second issue is not as bad because the process still does what it is supposed to do, just slower. Not great from an optimization point of view, so we’ll still fix it, but not exactly a critical issue.

Conclusion

Like I said a little bit earlier, doing the diff check scan from top to bottom is incredibly obvious in retrospect. All the changes are at the top, why would we start at the bottom of a potentially huge data set and waste time and effort analysing a bunch of data that is not likely to be relevant? Of course, that wasn’t actually obvious until we did some analysis on top of other optimizations, but it really does feel like a bit of a boneheaded mistake.

The good news is that when we implemented the changes as described (algorithm from bottom up to top down) we saw a dramatic decrease in traffic, because the diff checker was running fewer times before it found and fixed the changes.

The bad news is that due to point number two above (the diff checker doesn’t know that it should start from the top again when there are more recent changes) the IOPS usage didn’t really change all that much at all.

Still, the algorithm continues to get better and better and we continue to be able to synchronize more and more data without having to pay for additional infrastructure to support it.

Those are some good engineering feels.