Guess Who’s Back, Back Again

January 9. 2018 0 Comments

And we’re back

Its time to ring in the new year with some more chatter about our data synchronization algorithm, which honestly, has turned into a pretty complex beast.

The last time I wrote about the algorithm, I explained how we flipped the original implementation of the differencing scanfrom bottom up, to top down. This change represented a hearty improvement over the original implementation, because the data at the “top” of the table (i.e. when ordered by Row Version descending) is far more likely to change than the data at the “bottom”, at least for our clients anyway.

In that post though, I pointed out an inefficiency in the differencing scan, regardless of the direction”":

There is still an inherent inefficiency in the design of the scan, such that it will continue to do extra work for no benefit. Consider the situation where the scan has started and has dealt with 4 chunks. Its a bit down in the table, and the change it originally wanted to find was in chunk 5. In the meantime, the user has deleted data in chunk 2. The process will have to finish its scan of the entire table before it returns to the top and finds that particular change, which is just wasteful. Instead, it should be able to determine if there are any differences in the remainder of the data set (i.e. from the bottom of the current chunk down) and then use that information to decide to reset its position to the top again. We haven’t actually done this bit yet, because it came up after the deployed the first version to our users.

That inherent inefficiency is the topic for today.

Wait, Come Back, You’re Going In The Wrong Direction!

As mentioned above, the problem lies in the fact that once a difference is detected (i.e. one or more missing rows locally or remotely), the scan will just keep going until it finds and fixes said difference (or it hits the “end” of its scan and resets to the “top”). Of course, time does not stop while the scan is happening, and because it happens over a period of hours, with many relatively isolated executions of the algorithm, other changes are still being made to the underlying data.

Additions and modifications are handled by an entirely different process, so we don’t have to worry about them, but hard deletions can still occur at will.

When such a deletion occurs in the section that has already been scanned, but before the original difference that triggered the scan has been located and fixed, the algorithm can get stuck scanning the remainder of the data set for no real benefit. This actually isn’t a huge issue when the data set is small, or even medium sized, as you can just eat the inefficiency and wait. Eventually it will all work itself out.

Unfortunately, as the data set gets larger, the chances of a deletion occurring in the scanned section increases. Also, since changes are more likely to occur at the “top” of the data set, and the differencing scan works top-down, the sections that were most recently scanned are actually the most likely to contain differences. As a result, the algorithm spends longer and longer doing work for no benefit, so the inefficiency gets worse and worse.

Each chunk comparison requires the data to be present in memory in the remote database as well (its not an index-only query), so every pointless comparison actually results in more reads, which means more IOPS, which is why we started optimizing in the first place!

Long story short, we have to do something about it.

4665, 4666, 4667! There Are 4667 Rows Both Locally And Remotely, Ah Ah Ah

The good news is that this particular problem is not overly difficult to solve. It will still take some effort, but its not some fundamental flaw in the algorithm.

Every time a chunk is completed as part of the differencing scan, we can use a simple count to see whether or not the remaining differences (if there are any) are above or below the current location.

Locally this is trivial, just a query to the DB for count where < current range end.

Getting the same information from the remote requires the introduction of a new endpoint to the API though:

/v1/customers/{customerId}/databases/{databaseId}/tables/{tableName}/count{?aboveRowVersion=123&belowRowVersion=456}

Slightly generalised from our specific use case, but it basically lets you get a count of records:

above a certain row version
below a certain row version
in between two boundary row versions

For the differencing scan, we only really care about the “below a certain row version” use case, to get a count and compare it to the same count from the local data.

If the count is the same, we can safely exit early and flip back to the top of the data set, resetting the differencing scan so that it can pick up the more recent changes.

If the count is different, we just keep going down and repeat the process once we’ve actioned the next chunkl.

Nice and easy.

Of course, there are still some edge cases. Its possible (though not likely) get into a situation where the counts are the same but the data is actually different (a particular combination of locally deleted data and data that was never uploaded successfully) which can throw the whole thing for a bit of a loop, but that could have happened regardless, so we’re still in a better place than we would otherwise be.

Conclusion

Its been a bit of a hard slog trying to optimise the data synchronization algorithm, and we’re still not really there. Not only that, the algorithm itself has become more and more complex over time (obviously), and is getting pretty hard to reason about.

Annoyingly enough, we haven’t run across that one magical improvement that changes everything. Its very much been a kind of “death by a thousand cuts” sort of thing, with tens of small optimizations that alleviate the issue slightly. The improvement in this post is a good example of that sort of thing, and pretty much boils down to “don’t do something you don’t have to do”, which isn’t exactly ground-breaking.

Don’t get me wrong, the process is much better than its ever been, but we’re still seeing very similar patterns in read IOPS, which is problematic.

It might be that the algorithm itself just doesn’t scale as well as we want it to, and that we might need to flip to a fundamentally different approach. Perhaps something that hooks into deeper into the customer data and notifies us of creations/updates/deletions as they occurs, rather than us having to poll for the same information.

Still, that sort of change is not something to be embarked on lightly, even disregarding the fallacy of sunk cost.

The Upside Down Is A Dangerous Place

November 28. 2017 0 Comments

It’s been a little while since I wrote about our data synchronization algorithm, but it still gets a fair bit of space in my mind on a day to day basis.

Most recently, we put some effort into optimizing it in the face of rising IOPS usage at the database level, which worked well enough to stabilize the system in the face of what we wanted it to do at the time.

Then we told it to sync more data.

Specifically, one of the two biggest data sets in our software which covers the storage of repeatable tasks, like appointments, phone calls, inspections and so on.

Not only is this table one of the largest in raw size, it also features the most churn AND is one of the only tables to feature hard deletes, a challenge that we had to build an entire secondary sync process for.

Funnily enough, it actually worked pretty well at first, but everything started to fray at the edges as we started syncing more and more clients. Same sort of thing as before, pressure on IOPS at the database level, mostly as a result of reads.

With the memory of the last round of optimizations fresh in our minds, it was time to enter the breach once more.

Will To Survive

We prodded at the system for a bit, analysing the traffic patterns to try and determine what might be causing the high read IOPS this time. Being that we’d already investigated similar sorts of things recently, we knew our way around it pretty well, so it wasn’t long before we noticed something suspicious.

Lets go back a step though.

The core data synchronization algorithm hasn’t changed all that much from the last time I wrote about it, even in the face of the optimizations.

Get Local Count/Version/Timestamp
Get Remote Count/Version/Timestamp
If Local Count/Version/Timestamp == Remote Count/Version/Timestamp
    Do Nothing and Exit
If Local Version/Timestamp == Remote Version/Timestamp BUT Local Count != Remote Count
    Calculate [BATCH SIZE] Using Historical Data
    Get Last Local Position
    Get Next [BATCH SIZE] Local Rows from last position
    Get Min & Max Version in Batch
    Query Remote for Manifest Between Min/Max Local Version
    Create Manifest from Local Batch
    Compare
        Find Remote Not in Local
            Delete from Remote
        Find Local Not in Remote
            Upload to Remote
If Local Version/Timestamp > Remote Version/Timestamp
    Calculate [BATCH SIZE] Using Historical Data
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload to Remote
        Record Result for [BATCH SIZE] Tuning
        If Failure & Minimum [BATCH SIZE], Skip Ahead
If Local Version/Timestamp < Remote Version/Timestamp
    Find Remote > Local Version
    Delete from Remote

We’ve tweaked the mechanisms by which each evaluation of the algorithm determines its local and remote versions (there’s some caching on both the server and client side now, and each request to put/remote data returns new meta information from the server for efficiency) but that’s about it.

This time when we analysed the traffic, we noticed that there was a lot of calls to get table chunk manifests, an important part of the diff checker component of the algorithm, which is primarily meant to remove rows that have been deleted locally from the remote (i.e. hard deletes rather than the much easier to handle soft deletes, which are really just updates).

The problem with there being so many manifest calls is that they are quite expensive, especially for large tables.

Each manifest call requires a database query that can only use the first two components of the clustered index (which is client segmentation data), but must then scan through the remaining data in order to get the “chunk” needed for analysis. This can mean a scan of millions of rows, which is quite intense on read IOPS, because all of that data needs to be brought into memory in order to be evaluated.

But why was the number of manifest calls a problem now?

Sneaky Corporation

It all ties back into the data that we were syncing this time. As I mentioned at the beginning of this article, not only is it a huge table, it also features hard-deletes as the norm, rather than as exceptions. These few things added together create a bit of a perfect storm with regards to diff checker activity, which is why the system was starting to strain as a result of the massive increase in manifest calls.

Conceptually the diff checker part of the algorithm is a batched scan. It starts on one end of the data set, and gets chunks of data sequentially from both locations (local and remote), analysing and reacting as appropriate until it either gets back in sync or it reaches the other end, when it then wraps back around to it’s start point again.

Here’s where we made a mistake that seems obvious in hindsight.

In almost all cases, changes (even hard deletes) are more likely to occur at the top end of the data set, rather than the bottom.

Our scan?

Starts at the bottom and works its way up.

Thus, any time something was deleted from close to the top of the data set, the diff checker would have to scan through the entire table before finding and fixing it. This means more reads (to get the manifest chunks) which means higher IOPS.

This One Goes To Eleven

It really didn’t take a huge amount of effort to flip the diff checker algorithm to go top down instead of bottom up.

In fact, because of the way in which we’d built the code in the first place (using the strategy pattern to implement the actual chunking algorithm) it was as simple as writing a new class and binding it up instead of the old one via our dependency injection container.

The top down logic is basically the same as it is for bottom up, just inverted.

Starting the lower boundary of the last chunk (or the very top of the table if the state was reset), get the next [BATCH SIZE] chunk of rows from the local in a downwards direction. Using that chunk, find the extremities (top and bottom) and ask the remote for an equivalent chunk. Contrast and compare, react accordingly and then remember where you were up to for next time. If the algorithm reaches a point where the local and remote are identical, reset all state and do nothing.

Nothing super fancy, and an extremely obvious performance optimization in retrospect considering how much more likely changes are to occur at the top end rather than the bottom.

Such Strange Things

There are two tricksy bits though, one specific the top down approach and the other that was present in the bottom up approach, but is more obvious when working top down.

When you get the local chunk, and you get less than the batch size, you know that you’re probably at the bottom of the table. If you follow the same rules for asking the remote at this point, you risk leaving data in the remote that is not in the local if its right at the bottom. Instead, when you reach this point, you have to infer a lower boundary of 0 to make sure you get everything.
There is still an inherent inefficiency in the design of the scan, such that it will continue to do extra work for no benefit. Consider the situation where the scan has started and has dealt with 4 chunks. Its a bit down in the table, and the change it originally wanted to find was in chunk 5. In the meantime, the user has deleted data in chunk 2. The process will have to finish its scan of the entire table before it returns to the top and finds that particular change, which is just wasteful. Instead, it should be able to determine if there are any differences in the remainder of the data set (i.e. from the bottom of the current chunk down) and then use that information to decide to reset its position to the top again. We haven’t actually done this bit yet, because it came up after the deployed the first version to our users.

The first issue is a bit of a deal breaker (because it means the process can result in an incomplete data synchronization) so we’re lucky that we noticed it during testing and fixed it right away.

The second issue is not as bad because the process still does what it is supposed to do, just slower. Not great from an optimization point of view, so we’ll still fix it, but not exactly a critical issue.

Conclusion

Like I said a little bit earlier, doing the diff check scan from top to bottom is incredibly obvious in retrospect. All the changes are at the top, why would we start at the bottom of a potentially huge data set and waste time and effort analysing a bunch of data that is not likely to be relevant? Of course, that wasn’t actually obvious until we did some analysis on top of other optimizations, but it really does feel like a bit of a boneheaded mistake.

The good news is that when we implemented the changes as described (algorithm from bottom up to top down) we saw a dramatic decrease in traffic, because the diff checker was running fewer times before it found and fixed the changes.

The bad news is that due to point number two above (the diff checker doesn’t know that it should start from the top again when there are more recent changes) the IOPS usage didn’t really change all that much at all.

Still, the algorithm continues to get better and better and we continue to be able to synchronize more and more data without having to pay for additional infrastructure to support it.

Those are some good engineering feels.

Syncing the Ship, The Physics Of De-Syncing

May 9. 2017 0 Comments

Regularly writing blog posts is hard.

Recently, I’ve been focusing on monstrously long, multi-part posts explaining complex or involved processes, like:

The ELB Logs Processor, which was broken into many parts because it was new and we had a bunch of issues developing/deploying it. Somewhat ironically, it was meant to be a quick and simple replacement for a system we already had in place.
Regaining development and operational control over our log aggregation stack, which was broken down because of the multiple distinct pieces of work that we had to perform.
An exploration of our data synchronization algorithm, which was broken down because it’s surprisingly complex.

I’m honestly not sure whether this pattern is good or bad in the greater scheme of things. Either I just happen to be tackling (and thus writing about) bigger problems in my day to day life, or I’ve simply run out of small things to write about, leaving only the big ones.

But anyway, enough musing, because this post fits squarely into the “bunch of connected posts about one topic” theme.

I was pretty happy with the state of our data synchronization algorithm the last time I wrote about it. We’d just finished putting together some optimizations that dramatically reduced the overall traffic while still maintaining the quality of the overall syncing process, and it felt pretty good. Its been a little over a month since those changes were deployed, and everything has been going swimmingly. We’ve added a few new tables to the process, but the core algorithm hasn’t changed.

Normally this is the point where I would explain how it was all secretly going terribly wrong, but in a weird twist of fate, the algorithm is actually pretty solid.

We did find a bug which can cause the on-premises and remote locations to be out of of sync though, which was unfortunate. It happens infrequently, so a small subset of the data, but it still makes for an interesting topic to write about.

Well, interesting to me at least.

Optimizations Are Always Dangerous

The core of the bug lies in our recent optimizations.

In order to reduce the amount of busywork traffic occurring (i.e. the traffic resulting from the polling nature of the process), we implemented some changes that leverage local and remote table manifests to short-circuit the sync process if there was nothing to do. To further minimize the traffic to the API, we only queried the remote table manifest at the start of the run and then used that for the comparison against the local on the next run. Essentially we exchanged a guaranteed call on every non-skipped sync for one call each time the local and remote became identical.

The bug arises in the rare case where the captured remote from the last run is the same as the current local, even though the current remote is different.

The main way that this seems to happen is:

Table with low rate of change gets a new row.
Algorithm kicks in and syncs the new data.
In the time between the run that pushed the data and the next one, user removes the data somehow.
Current local now looks exactly like the remote did before the data was pushed.
Sync algorithm thinks that it has nothing to do.

In this case the algorithm is doing exactly what it was designed to do. Its detected that there are no changes to deal with, and will continue to skip executions until something does change (new rows, updates, deletions, anything), where it will run again. If the table changes infrequently we’re left with an annoying desync for much longer than we would like.

Like I said earlier, its a pretty specific situation, with relatively tight timings, and it only occurs for tables that are infrequently changed, but a bug is a bug, so we should fix it all the same.

Safety Dance

The obvious solution is to requery the remote after the sync operation has finished execution and store that value for the comparison against the local next time, rather than relying on the value from before the operation started.

The downside of this is that it adds another request to every single non-skipped sync, which amounts to a relatively significant amount of traffic. We’re still way ahead of the situation before the optimizations, but maybe we can do better?

Another idea is to limit the maximum number of skips that can happen in a row, taking into account how long we might want the situation described above to persist.

This approach also raises the number of requests occurring, but has the happy side effect of picking up changes at the remote end as well (i.e. nothing has changed locally, but we deleted all the data remotely in order to force a resync or something).

The compare the two possible fixes, I actually did some math to see which one would result in more requests, and with the maximum number of skips set to a value that forced a run every 30 minutes or so, they are pretty much a wash in terms of additional requests.

I’ve flip-flopped a fair bit on which solution I think we should apply, initially thinking the “limit maximum skips” approach was the best (because it essentially offers a sanity check to the concept of skipping runs), but from an engineering point of view, it just feels messy, like the sort of solution you come up with when you can’t come up with something better. Almost brute force in its approach.

I’m currently favouring amending the algorithm to query the remote after the operation executes because it feels the cleanest, but I’m not ecstatic about it either, as it feels like its doing more work than is strictly necessary.

Decisions decisions.

Conclusion

As much as it saddens me to find bugs, it pleases me to know that with each bug fixed, the algorithm is becoming stronger, like tempering steel, or building muscle. Applying stress to something causing it to break down and then be repaired with improvements.

It can be tempting to just throw a fix in whenever you find a bug like that, but I believe that hack fixes should never be tolerated without a truly exceptional reason. You should always aim to make the code better as you touch it, not worse. The hardest part of fixing bugs is to perform the repairs in such a way that it doesn’t compromise the design of the code.

Of course, if the design is actually the cause of the problem, then you’re in for a world of hurt.

Syncing the Ship, Quietly

March 28. 2017 0 Comments

Like with all software, its rare to ever actually be done with something. A few weeks back I wrote at length about the data synchronization algorithm we use to free valuable and useful data from its on-premises prison, to the benefit of both our clients (for new and exciting applications) and us (for statistical analysis.

Conceptually, the process leveraged an on-premises application, a relatively simple API and a powerful backend data store to accomplish its goals, along with the following abstracted algorithm (which I went into in depth in the series of blog posts that I linked above).

Get Local Version
Get Remote Version
If Local == Remote
    Calculate [BATCH SIZE] Using Historical Data
    Get Last Local Position
    Get Next [BATCH SIZE] Local Rows from last position
    Get Min & Max Version in Batch
    Query Remote for Manifest Between Min/Max Local Version
    Create Manifest from Local Batch
    Compare
        Find Remote Not in Local
            Delete from Remote
        Find Local Not in Remote
            Upload to Remote
If Local > Remote
    Calculate [BATCH SIZE] Using Historical Data
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload to Remote
        Record Result for [BATCH SIZE] Tuning
        If Failure & Minimum [BATCH SIZE], Skip Ahead
If Local < Remote
    Find Remote > Local Version
    Delete from Remote

One of the issues with the algorithm above is that its pretty chatty when it comes to talking to the remote API. It is polling based, so that’s somewhat to be expected, but there is a lot of requests and responses being thrown around that seem like a prime opportunity for improvement.

To give some context:

We have approximately 3500 unique clients (each one representing a potential data synchronization)
Of that 3500, approximately 2200 clients are actively using the synchronization
In order to service these clients, the API deals with approximately 450 requests a second

Not a ground shaking amount of traffic, but if we needed to service the remainder of our clients in the same way, we’d probably have to scale out to deal with the traffic. Scaling out when you use AWS is pretty trivial, but the amount of traffic in play is also overloading our log aggregation (our ELK stack), there are other factors to consider.

Digging into the traffic a bit (using our awesome logging), it looks like the majority of the requests are GET requests.

The following KIbana visualization shows a single days traffic, aggregated over time/HTTP verb. You can clearly see the increase in the amount of non-GET requests during the day as clients make changes to their local database, but the GET traffic dwarfs it.

If we want to reduce the total amount of traffic, attacking the GET requests seems like a sane place to start. But maybe we could just reduce the traffic altogether?

Frequency Overload

The plugin architecture that schedules and executes the application responsible for performing the data synchronization has a cadence of around 60 seconds (with support for backoff in the case of errors). It is smart enough to be non re-entrant (meaning it won’t start another process if one is already running), but has some weaknesses that lead to the overall cadence being closer to 5 minutes, with higher cadences for multiple registered databases (because each database runs its own application, but the number running in parallel is limited).

One easy way to reduce the total amount of traffic is to simply reduce the cadence, drawing it out to at least 10 minutes in between runs.

The downside of this is that it increases the amount of latency between local changes being made and them being represented on the remote database.

Being that one of the goals for the sync process was to minimise this latency, simply reducing the cadence in order to decrease traffic is not a good enough solution.

Just Stop Talking, Please

If we look at the GET traffic in more detail we can see that it is mostly GET requests to two endpoints.

/v1/customers/{customer-number}/databases/{database-id}/tables/{table-name}/
/v1/customers/{customer-number}/databases/{database-id}/tables/{table-name}/manifest

These two endpoints form the basis for two different parts of the sync algorithm.

The first endpoint is used to get an indication of the status for the entire table for that customer/database combination. It returns a summary of row count, maximum row version and maximum timestamp. This information is used in the algorithm above in the part where it executes the “Get remote version” statement. It is then compared to the same information locally, and the result of that comparison is used to determine what to do next.

The second endpoint is used to get a summarised chunk of information from the table, using a from and to row version. This is used in the sync algorithm to perform the diff check whenever the local and remote versions (the other endpoint) are the same.

What this means is that every single run of every application is guaranteed to hit the first endpoint for each table (to get the remote version) and pretty likely to hit the second endpoint (because the default action is to engage on the diff check whenever local and remote versions are the same).

The version comparison is flawed though. It only takes into account the maximum row version and maximum timestamp for its decision making, ignoring the row count altogether (which was there historically for informational purposes). The assumption here was that we wanted to always fall back to scanning through the table for other changes using the differencing check, so if our maximum version/timestamps are identical that’s what we should do.

If we use the row count though, we can determine if the local and remote tables are completely in sync, allowing us to dodge a large amount of work. Being that all updates will be covered by the row version construct and all deletions will be covered by the row count changing, we should be in a pretty good place to maintain the reliability of the sync.

1 Row, 2 Rows, 3 Rows, Ah Ah Ah!

The naive thing to do would be to get the local count/version/timestamp and the local count/version/timestamp from last time and compare them (including the row count). If they are the same, we don’t need to do anything! Yay!

This fails to take into account the state of the remote though, and the nature of the batching process. While there might not be any changes locally since last time, we might not have actually pushed all of the changes from last time to the remote.

Instead, what we can do is compare the local count/version/timestamp with the last remote count/version/timestamp. If they are the same, we can just do nothing because both are completely in sync.

Editing the algorithm definition from the start of this post, we get this:

Get Local Count/Version/Timestamp
Get Last Remote Count/Version/Timestamp
If Local Count/Version/Timestamp == Last Remote Count/Version/Timestamp
    Do Nothing and Exit
Get Remote Count/Version/Timestamp
Store Remote Count/Version/Timestamp For Lookup Next Time
If Local Count/Version/Timestamp == Remote Count/Version/Timestamp
    Do Nothing and Exit
If Local Version/Timestamp == Remote Version/Timestamp BUT Local Count != Remote Count
    Calculate [BATCH SIZE] Using Historical Data
    Get Last Local Position
    Get Next [BATCH SIZE] Local Rows from last position
    Get Min & Max Version in Batch
    Query Remote for Manifest Between Min/Max Local Version
    Create Manifest from Local Batch
    Compare
        Find Remote Not in Local
            Delete from Remote
        Find Local Not in Remote
            Upload to Remote
If Local Version/Timestamp > Remote Version/Timestamp
    Calculate [BATCH SIZE] Using Historical Data
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload to Remote
        Record Result for [BATCH SIZE] Tuning
        If Failure & Minimum [BATCH SIZE], Skip Ahead
If Local Version/Timestamp < Remote Version/Timestamp
    Find Remote > Local Version
    Delete from Remote

The other minor change in there is comparing the full local count/version/timestamp against the remote count/version/timestamp. If they are identical, its just another case where we need to do nothing, so we can exit safely until next time.

Conclusion

Just how much of a difference does this make though? I’ll let a picture of a partial days traffic answer that for me.

In the image below I’ve specifically set the scale of the graph to be the same as the one above for comparison purposes.

As you can see, the traffic rises from nothing (because nothing is changing overnight) to a very acceptable amount of traffic representing real work that needs to be done during the day, and then will probably continue forward with the same pattern, flattening back into nothing as the day progresses and people stop making changes.

Its a ridiculous decrease in the amount of pointless noise traffic, which is a massive victory.

Thinking about this sort of thing in more general terms, optimization is an important step in the production and maintenance of any piece of software, but its important not to engage on this path too early. You should only do it after you gather enough information to justify it, and to pinpoint exactly where the most impact will be made for the least effort. The last thing you want to do is spend a week or two chasing something you think is a terribly inefficiency only to discover that it makes less than a % difference to the system as a whole.

The most efficient way to do this sort of analysis is with good metrics and logging.

You’d be crazy not to do it.

Syncing the Ship, Well And Truly Sunk

February 28. 2017 0 Comments

And that marks the end of the series on our synchronization process. Surprising absolutely no-one, it turns out that building such a process is quite challenging, with lots of different edge cases that need to be taken into account to ensure quality at all of the various levels.

To summarise:

At the end, we’re left with the following algorithm:

Get Local Version
Get Remote Version
If Local == Remote
    Calculate [BATCH SIZE] Using Historical Data
    Get Last Local Position
    Get Next [BATCH SIZE] Local Rows from last position
    Get Min & Max Version in Batch
    Query Remote for Manifest Between Min/Max Local Version
    Create Manifest from Local Batch
    Compare
        Find Remote Not in Local
            Delete from Remote
        Find Local Not in Remote
            Upload to Remote
If Local > Remote
    Calculate [BATCH SIZE] Using Historical Data
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload to Remote
        Record Result for [BATCH SIZE] Tuning
        If Failure & Minimum [BATCH SIZE], Skip Ahead
If Local < Remote
    Find Remote > Local Version
    Delete from Remote

Algorithms are great, and form the basis of pretty much everything we do as software developers, but you can’t really deploy one directly. At least not one written mostly as plain text, like the one above.

So how do we actually put it into practice?

Software, Hardware. What About Just-Right-Ware?

Apart from a little bit of information in the first post of the series, I’ve mostly been writing about the synchronization process conceptually. To finish everything up, I’m going to explain a little bit about how we actually put it into practice and the pieces of software in play, just to make it a little bit more concrete.

Implementation-wise, there are three relevant components.

A piece of desktop software, installed on the client side (C#)
An API (C# using Nancy)
A database (multi tenant, AWS RDS w. PostgreSQL)

Using the desktop software, client’s register a database for “cloud functionality”, agreeing for their information to be synchronized for use in other applications. This registration process gives the database a unique identifier, and if we combine this with their client id, we can safely segregate databases from one another. One one or more databases are registered, the desktop software executes approximately every minute, performing a single run of the synchronization algorithm as specified above, using the API to discover information about the remote side of the equation.

The API itself is relatively simple, and is dedicated to facilitate the synchronization process, mainly acting as an application layer on top of the database. It is primarily a REST API (where the entities are customers, databases and tables) and also features some basic authentication and authorization (using the client id, database id and some other user specific information).

At a high level, it features endpoints like this:

GET /v1/customers/{client-id}/databases/{database-id}/tables/{tablename}
POST /v1/customers/{client-id}/databases/{database-id}/tables/{tablename}
DELETE /v1/customers/{client-id}/databases/{database-id}/tables/{tablename}
GET /v1/customers/{client-id}/databases/{database-id}/tables/{tablename}/manifest

Its got some other endpoints as well, but the endpoints above are the most relevant to the synchronization process (the other endpoints are for things like health checks, uptime checks and administrative functionality).

To extrapolate:

The first GET endpoint returns the versioning information for the table (i.e. the count, max version, max modified date) which is the primary input for the synchronization process (when compared to the same information locally).
The POST endpoint on table name allows for inserting/uploading data, supplied as a set of rows appropriate for the table in question.
The DELETE endpoint on table name unsurprisingly allows for the deletion of data by supplying a set of keys to delete, but also allows for delete operations based on range (i.e. everything > version X) or deleting everything all at once.
Finally, the GET endpoint on table/manifest allows for the retrieval of a manifest describing a section of the table, which is used for the differencing check.

The database is a replica of the client side database, with additional columns for data segregation based on the combination of client id and database id (as mentioned above).

Working together, these three components make up the concrete implementation of the synchronization process, replicating local, on-premises data successfully to a remote location, for use in other applications and processes as necessary.

Conclusion

Its taken me 5 weeks to describe the synchronization process, but it took us months to build it incrementally, adapting and improving as we found different situations where it didn’t quite work the way we wanted. Obviously I summarised most of that in this series of posts, but we were constantly deploying and monitoring the results of each deployment as we went through the process. The process has been running in production without major incident for a few months now, and we’ve started to expand it to include more and more tables as necessary.

Unfortunately, as we expand the range of tables included in the process, we’ve discovered some that don’t follow the most common pattern (mostly around primary keys), which means we’re going to have to adapt the process again to take that sort of situation into account. That’s software though.

To finish everything up, I have to say that all this talk about syncing has really brought a…Titanic sense of gravitas to the whole situation, don’t you think?

…

I’ll see myself out.