0 Comments

And we’re back

Its time to ring in the new year with some more chatter about our data synchronization algorithm, which honestly, has turned into a pretty complex beast.

The last time I wrote about the algorithm, I explained how we flipped the original implementation of the differencing scanfrom bottom up, to top down. This change represented a hearty improvement over the original implementation, because the data at the “top” of the table (i.e. when ordered by Row Version descending) is far more likely to change than the data at the “bottom”, at least for our clients anyway.

In that post though, I pointed out an inefficiency in the differencing scan, regardless of the direction”":

There is still an inherent inefficiency in the design of the scan, such that it will continue to do extra work for no benefit. Consider the situation where the scan has started and has dealt with 4 chunks. Its a bit down in the table, and the change it originally wanted to find was in chunk 5. In the meantime, the user has deleted data in chunk 2. The process will have to finish its scan of the entire table before it returns to the top and finds that particular change, which is just wasteful. Instead, it should be able to determine if there are any differences in the remainder of the data set (i.e. from the bottom of the current chunk down) and then use that information to decide to reset its position to the top again. We haven’t actually done this bit yet, because it came up after the deployed the first version to our users.

That inherent inefficiency is the topic for today.

Wait, Come Back, You’re Going In The Wrong Direction!

As mentioned above, the problem lies in the fact that once a difference is detected (i.e. one or more missing rows locally or remotely), the scan will just keep going until it finds and fixes said difference (or it hits the “end” of its scan and resets to the “top”). Of course, time does not stop while the scan is happening, and because it happens over a period of hours, with many relatively isolated executions of the algorithm, other changes are still being made to the underlying data.

Additions and modifications are handled by an entirely different process, so we don’t have to worry about them, but hard deletions can still occur at will.

When such a deletion occurs in the section that has already been scanned, but before the original difference that triggered the scan has been located and fixed, the algorithm can get stuck scanning the remainder of the data set for no real benefit. This actually isn’t a huge issue when the data set is small, or even medium sized, as you can just eat the inefficiency and wait. Eventually it will all work itself out.

Unfortunately, as the data set gets larger, the chances of a deletion occurring in the scanned section increases. Also, since changes are more likely to occur at the “top” of the data set, and the differencing scan works top-down, the sections that were most recently scanned are actually the most likely to contain differences. As a result, the algorithm spends longer and longer doing work for no benefit, so the inefficiency gets worse and worse.

Each chunk comparison requires the data to be present in memory in the remote database as well (its not an index-only query), so every pointless comparison actually results in more reads, which means more IOPS, which is why we started optimizing in the first place!

Long story short, we have to do something about it.

4665, 4666, 4667! There Are 4667 Rows Both Locally And Remotely, Ah Ah Ah

The good news is that this particular problem is not overly difficult to solve. It will still take some effort, but its not some fundamental flaw in the algorithm.

Every time a chunk is completed as part of the differencing scan, we can use a simple count to see whether or not the remaining differences (if there are any) are above or below the current location.

Locally this is trivial, just a query to the DB for count where < current range end.

Getting the same information from the remote requires the introduction of a new endpoint to the API though:

/v1/customers/{customerId}/databases/{databaseId}/tables/{tableName}/count{?aboveRowVersion=123&belowRowVersion=456}

Slightly generalised from our specific use case, but it basically lets you get a count of records:

  • above a certain row version
  • below a certain row version
  • in between two boundary row versions

For the differencing scan, we only really care about the “below a certain row version” use case, to get a count and compare it to the same count from the local data.

If the count is the same, we can safely exit early and flip back to the top of the data set, resetting the differencing scan so that it can pick up the more recent changes.

If the count is different, we just keep going down and repeat the process once we’ve actioned the next chunkl.

Nice and easy.

Of course, there are still some edge cases. Its possible (though not likely) get into a situation where the counts are the same but the data is actually different (a particular combination of locally deleted data and data that was never uploaded successfully) which can throw the whole thing for a bit of a loop, but that could have happened regardless, so we’re still in a better place than we would otherwise be.

Conclusion

Its been a bit of a hard slog trying to optimise the data synchronization algorithm, and we’re still not really there. Not only that, the algorithm itself has become more and more complex over time (obviously), and is getting pretty hard to reason about.

Annoyingly enough, we haven’t run across that one magical improvement that changes everything. Its very much been a kind of “death by a thousand cuts” sort of thing, with tens of small optimizations that alleviate the issue slightly. The improvement in this post is a good example of that sort of thing, and pretty much boils down to “don’t do something you don’t have to do”, which isn’t exactly ground-breaking.

Don’t get me wrong, the process is much better than its ever been, but we’re still seeing very similar patterns in read IOPS, which is problematic.

It might be that the algorithm itself just doesn’t scale as well as we want it to, and that we might need to flip to a fundamentally different approach. Perhaps something that hooks into deeper into the customer data and notifies us of creations/updates/deletions as they occurs, rather than us having to poll for the same information.

Still, that sort of change is not something to be embarked on lightly, even disregarding the fallacy of sunk cost.