
Like with all software, it's rare to ever actually be done with something. A few weeks back I wrote at length about the data synchronization algorithm we use to free valuable and useful data from its on-premises prison, to the benefit of both our clients (for new and exciting applications) and us (for statistical analysis).

Conceptually, the process leveraged an on-premises application, a relatively simple API and a powerful backend data store to accomplish its goals, along with the following abstracted algorithm (which I went into in depth in the series of blog posts that I linked above).

Get Local Version
Get Remote Version
If Local == Remote
    Calculate [BATCH SIZE] Using Historical Data
    Get Last Local Position
    Get Next [BATCH SIZE] Local Rows from last position
    Get Min & Max Version in Batch
    Query Remote for Manifest Between Min/Max Local Version
    Create Manifest from Local Batch
    Compare
        Find Remote Not in Local
            Delete from Remote
        Find Local Not in Remote
            Upload to Remote
If Local > Remote
    Calculate [BATCH SIZE] Using Historical Data
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload to Remote
        Record Result for [BATCH SIZE] Tuning
        If Failure & Minimum [BATCH SIZE], Skip Ahead
If Local < Remote
    Find Remote > Local Version
    Delete from Remote

One of the issues with the algorithm above is that it's pretty chatty when it comes to talking to the remote API. It is polling based, so that's somewhat to be expected, but there are a lot of requests and responses being thrown around, which seems like a prime opportunity for improvement.

To give some context:

  • We have approximately 3500 unique clients (each one representing a potential data synchronization)
  • Of that 3500, approximately 2200 clients are actively using the synchronization
  • In order to service these clients, the API deals with approximately 450 requests a second

Not a ground-shaking amount of traffic, but if we needed to service the remainder of our clients in the same way, we'd probably have to scale out to deal with it. Scaling out on AWS is pretty trivial, but the amount of traffic in play is also overloading our log aggregation (our ELK stack), so there are other factors to consider.

Digging into the traffic a bit (using our awesome logging), it looks like the majority of the requests are GET requests.

The following Kibana visualization shows a single day's traffic, aggregated over time/HTTP verb. You can clearly see the increase in the amount of non-GET requests during the day as clients make changes to their local database, but the GET traffic dwarfs it.

If we want to reduce the total amount of traffic, attacking the GET requests seems like a sane place to start. But maybe we could just reduce the traffic altogether?

Frequency Overload

The plugin architecture that schedules and executes the application responsible for performing the data synchronization has a cadence of around 60 seconds (with support for backoff in the case of errors). It is smart enough to be non re-entrant (meaning it won't start another process if one is already running), but it has some weaknesses that push the overall cadence closer to 5 minutes, and even further out when multiple databases are registered (because each database runs its own application, but the number running in parallel is limited).

One easy way to reduce the total amount of traffic is to simply slow the cadence down, drawing it out to at least 10 minutes between runs.

The downside of this is that it increases the latency between local changes being made and those changes being represented in the remote database.

Given that one of the goals for the sync process was to minimise this latency, simply slowing the cadence down in order to decrease traffic is not a good enough solution.

Just Stop Talking, Please

If we look at the GET traffic in more detail we can see that it is mostly GET requests to two endpoints.

/v1/customers/{customer-number}/databases/{database-id}/tables/{table-name}/

/v1/customers/{customer-number}/databases/{database-id}/tables/{table-name}/manifest

These two endpoints form the basis for two different parts of the sync algorithm.

The first endpoint is used to get an indication of the status for the entire table for that customer/database combination. It returns a summary of row count, maximum row version and maximum timestamp. This information is used in the algorithm above in the part where it executes the “Get remote version” statement. It is then compared to the same information locally, and the result of that comparison is used to determine what to do next.

The second endpoint is used to get a summarised chunk of information from the table, using a from and to row version. This is used in the sync algorithm to perform the diff check whenever the local and remote versions (the other endpoint) are the same.
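To make that a little more concrete, here's roughly what those two payloads might look like. The class and property names below are my own for illustration, not the actual API contract.

// Hypothetical response shapes for the two endpoints above (names are illustrative only).
using System;

public class TableSummary
{
    public long RowCount { get; set; }            // total rows remotely for this customer/database/table
    public long MaximumRowVersion { get; set; }   // SQL Server RowVersion, treated as a 64-bit integer
    public DateTime MaximumTimestamp { get; set; }
}

public class ManifestEntry
{
    public string RowKey { get; set; }            // primary key of the row, serialised
    public long RowVersion { get; set; }          // used as a pseudo key for the differencing check
}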

What this means is that every single run of every application is guaranteed to hit the first endpoint for each table (to get the remote version) and is pretty likely to hit the second endpoint as well (because the default action is to perform the diff check whenever local and remote versions are the same).

The version comparison is flawed though. It only takes into account the maximum row version and maximum timestamp for its decision making, ignoring the row count altogether (which was there historically for informational purposes). The assumption here was that we wanted to always fall back to scanning through the table for other changes using the differencing check, so if our maximum version/timestamps are identical that’s what we should do.

If we use the row count though, we can determine whether the local and remote tables are completely in sync, allowing us to dodge a large amount of work. Given that all updates are covered by the row version construct and all deletions are covered by the row count changing, we should be in a pretty good place to maintain the reliability of the sync.

1 Row, 2 Rows, 3 Rows, Ah Ah Ah!

The naive thing to do would be to get the current local count/version/timestamp, compare it to the local count/version/timestamp from last time (including the row count) and, if they are the same, do nothing. Yay!

This fails to take into account the state of the remote though, and the nature of the batching process. While there might not be any changes locally since last time, we might not have actually pushed all of the changes from last time to the remote.

Instead, what we can do is compare the local count/version/timestamp with the last remote count/version/timestamp. If they are the same, we can just do nothing because both are completely in sync.

Editing the algorithm definition from the start of this post, we get this:

Get Local Count/Version/Timestamp
Get Last Remote Count/Version/Timestamp
If Local Count/Version/Timestamp == Last Remote Count/Version/Timestamp
    Do Nothing and Exit
Get Remote Count/Version/Timestamp
Store Remote Count/Version/Timestamp For Lookup Next Time
If Local Count/Version/Timestamp == Remote Count/Version/Timestamp
    Do Nothing and Exit
If Local Version/Timestamp == Remote Version/Timestamp BUT Local Count != Remote Count
    Calculate [BATCH SIZE] Using Historical Data
    Get Last Local Position
    Get Next [BATCH SIZE] Local Rows from last position
    Get Min & Max Version in Batch
    Query Remote for Manifest Between Min/Max Local Version
    Create Manifest from Local Batch
    Compare
        Find Remote Not in Local
            Delete from Remote
        Find Local Not in Remote
            Upload to Remote
If Local Version/Timestamp > Remote Version/Timestamp
    Calculate [BATCH SIZE] Using Historical Data
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload to Remote
        Record Result for [BATCH SIZE] Tuning
        If Failure & Minimum [BATCH SIZE], Skip Ahead
If Local Version/Timestamp < Remote Version/Timestamp
    Find Remote > Local Version
    Delete from Remote

The other minor change in there is comparing the full local count/version/timestamp against the remote count/version/timestamp. If they are identical, it's just another case where we need to do nothing, so we can exit safely until next time.
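As a rough sketch (reusing the illustrative TableSummary shape from earlier and a couple of made-up abstractions for the API client and the local store), the early exit logic looks something like this:

// A sketch only; ITableStateStore and IRemoteTableClient are hypothetical abstractions.
public interface ITableStateStore
{
    TableSummary GetLastRemote(string table);
    void SaveLastRemote(string table, TableSummary state);
}

public interface IRemoteTableClient
{
    TableSummary GetTableSummary(string table);
}

public static class SyncShortCircuit
{
    public static bool ShouldDoNothingThisRun(TableSummary local, ITableStateStore lastRemoteStore, IRemoteTableClient remoteApi, string table)
    {
        // First check: if the local state matches the remote state recorded last run,
        // nothing has changed since both sides were in sync, so skip the API entirely.
        var lastRemote = lastRemoteStore.GetLastRemote(table);
        if (lastRemote != null && SameState(local, lastRemote))
        {
            return true;
        }

        // Otherwise ask the API for the current remote state and record it for next time.
        var currentRemote = remoteApi.GetTableSummary(table);
        lastRemoteStore.SaveLastRemote(table, currentRemote);

        // Second check: identical count/version/timestamp means both sides are completely in sync.
        return SameState(local, currentRemote);
    }

    private static bool SameState(TableSummary a, TableSummary b)
    {
        return a.RowCount == b.RowCount
            && a.MaximumRowVersion == b.MaximumRowVersion
            && a.MaximumTimestamp == b.MaximumTimestamp;
    }
}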

Conclusion

Just how much of a difference does this make though? I'll let a picture of a partial day's traffic answer that for me.

In the image below I’ve specifically set the scale of the graph to be the same as the one above for comparison purposes.

As you can see, the traffic rises from nothing (because nothing is changing overnight) to a very acceptable amount representing real work that needs to be done during the day, and it will probably continue with the same pattern, flattening back into nothing as the day winds down and people stop making changes.

It's a ridiculous decrease in the amount of pointless noise traffic, which is a massive victory.

Thinking about this sort of thing in more general terms, optimization is an important step in the production and maintenance of any piece of software, but it's important not to start down this path too early. You should only do it after you've gathered enough information to justify it, and to pinpoint exactly where the most impact will be made for the least effort. The last thing you want to do is spend a week or two chasing something you think is a terrible inefficiency only to discover that it makes less than a percent of difference to the system as a whole.

The most efficient way to do this sort of analysis is with good metrics and logging.

You’d be crazy not to do it.


And that marks the end of the series on our synchronization process. Surprising absolutely no-one, it turns out that building such a process is quite challenging, with lots of different edge cases that need to be taken into account to ensure quality at all of the various levels.

To summarise, at the end we're left with the following algorithm:

Get Local Version
Get Remote Version
If Local == Remote
    Calculate [BATCH SIZE] Using Historical Data
    Get Last Local Position
    Get Next [BATCH SIZE] Local Rows from last position
    Get Min & Max Version in Batch
    Query Remote for Manifest Between Min/Max Local Version
    Create Manifest from Local Batch
    Compare
        Find Remote Not in Local
            Delete from Remote
        Find Local Not in Remote
            Upload to Remote
If Local > Remote
    Calculate [BATCH SIZE] Using Historical Data
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload to Remote
        Record Result for [BATCH SIZE] Tuning
        If Failure & Minimum [BATCH SIZE], Skip Ahead
If Local < Remote
    Find Remote > Local Version
    Delete from Remote

Algorithms are great, and form the basis of pretty much everything we do as software developers, but you can’t really deploy one directly. At least not one written mostly as plain text, like the one above.

So how do we actually put it into practice?

Software, Hardware. What About Just-Right-Ware?

Apart from a little bit of information in the first post of the series, I’ve mostly been writing about the synchronization process conceptually. To finish everything up, I’m going to explain a little bit about how we actually put it into practice and the pieces of software in play, just to make it a little bit more concrete.

Implementation-wise, there are three relevant components.

  • A piece of desktop software, installed on the client side (C#)
  • An API (C# using Nancy)
  • A database (multi-tenant, AWS RDS with PostgreSQL)

Using the desktop software, clients register a database for “cloud functionality”, agreeing for their information to be synchronized for use in other applications. This registration process gives the database a unique identifier, and if we combine this with their client id, we can safely segregate databases from one another. Once one or more databases are registered, the desktop software executes approximately every minute, performing a single run of the synchronization algorithm as specified above, using the API to discover information about the remote side of the equation.

The API itself is relatively simple, and is dedicated to facilitating the synchronization process, mainly acting as an application layer on top of the database. It is primarily a REST API (where the entities are customers, databases and tables) and also features some basic authentication and authorization (using the client id, database id and some other user specific information).

At a high level, it features endpoints like this:

GET /v1/customers/{client-id}/databases/{database-id}/tables/{tablename}
POST /v1/customers/{client-id}/databases/{database-id}/tables/{tablename}
DELETE /v1/customers/{client-id}/databases/{database-id}/tables/{tablename}
GET /v1/customers/{client-id}/databases/{database-id}/tables/{tablename}/manifest

It's got some other endpoints as well, but the ones above are the most relevant to the synchronization process (the others are for things like health checks, uptime checks and administrative functionality).

To extrapolate:

  • The first GET endpoint returns the versioning information for the table (i.e. the count, max version, max modified date) which is the primary input for the synchronization process (when compared to the same information locally).
  • The POST endpoint on table name allows for inserting/uploading data, supplied as a set of rows appropriate for the table in question.
  • The DELETE endpoint on table name unsurprisingly allows for the deletion of data by supplying a set of keys to delete, but also allows for delete operations based on range (i.e. everything > version X) or deleting everything all at once.
  • Finally, the GET endpoint on table/manifest allows for the retrieval of a manifest describing a section of the table, which is used for the differencing check.
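Putting those endpoints together, a rough sketch of the corresponding Nancy module might look something like the following. The class, helper methods and responses are illustrative only, not our actual implementation.

using Nancy;

// A rough sketch only; the real module wires these routes up to the data access layer
// and the authentication/authorization machinery.
public class TablesModule : NancyModule
{
    public TablesModule()
    {
        // Versioning summary (count, max version, max modified date) for the whole table.
        Get["/v1/customers/{clientId}/databases/{databaseId}/tables/{tableName}"] =
            p => GetSummary((string)p.clientId, (string)p.databaseId, (string)p.tableName);

        // Manifest of row version information for a from/to range, used by the differencing check.
        Get["/v1/customers/{clientId}/databases/{databaseId}/tables/{tableName}/manifest"] =
            p => GetManifest((string)p.clientId, (string)p.databaseId, (string)p.tableName);

        // Insert/upsert a batch of rows supplied in the request body.
        Post["/v1/customers/{clientId}/databases/{databaseId}/tables/{tableName}"] =
            p => UploadRows((string)p.clientId, (string)p.databaseId, (string)p.tableName);

        // Delete by key set, by version range, or everything at once.
        Delete["/v1/customers/{clientId}/databases/{databaseId}/tables/{tableName}"] =
            p => DeleteRows((string)p.clientId, (string)p.databaseId, (string)p.tableName);
    }

    private dynamic GetSummary(string clientId, string databaseId, string tableName) { return HttpStatusCode.OK; }
    private dynamic GetManifest(string clientId, string databaseId, string tableName) { return HttpStatusCode.OK; }
    private dynamic UploadRows(string clientId, string databaseId, string tableName) { return HttpStatusCode.Created; }
    private dynamic DeleteRows(string clientId, string databaseId, string tableName) { return HttpStatusCode.OK; }
}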

The database is a replica of the client side database, with additional columns for data segregation based on the combination of client id and database id (as mentioned above).

Working together, these three components make up the concrete implementation of the synchronization process, replicating local, on-premises data successfully to a remote location, for use in other applications and processes as necessary.

Conclusion

It's taken me 5 weeks to describe the synchronization process, but it took us months to build it incrementally, adapting and improving as we found different situations where it didn't quite work the way we wanted. Obviously I've summarised most of that in this series of posts, but we were constantly deploying and monitoring the results of each deployment as we went. The process has been running in production without major incident for a few months now, and we've started to expand it to include more and more tables as necessary.

Unfortunately, as we expand the range of tables included in the process, we’ve discovered some that don’t follow the most common pattern (mostly around primary keys), which means we’re going to have to adapt the process again to take that sort of situation into account. That’s software though.

To finish everything up, I have to say that all this talk about syncing has really brought a…Titanic sense of gravitas to the whole situation, don’t you think?

I’ll see myself out.


Well, now it's next week, and it's time to deal with the problem.

What problem?

The problem where we delete literally all of the data for a table and then synchronize it all over again whenever the maximum local version gets lower than the maximum remote version.

That problem.

If that sort of thing only happened very rarely, I probably wouldn’t care all that much. The problem is, it can occur whenever a database restore/rollback is performed (which is annoyingly frequent for our users) or when a user removes an entity they just added and that entity is subject to hard deletion rules (which happens quite a lot).

With the table annihilation occurring far more frequently than we would like, we’re looking at a couple of very real problems.

  • During the period where the data is re-syncing, anything using that data (third party APIs, mobile applications, websites, etc) will be out of date. Syncing a table from scratch can take a few hours (depending on table size), so it’s not a great situation to be in.
  • Repeatedly pushing the same information consumes valuable bandwidth. In Australia, where the internet is mostly run over taut pieces of string and unlimited data plans are not the norm, consuming bandwidth for no real gain is foolish.
  • Some tables with hard delete have such high churn that a constant cycle of delete and re-sync can be maintained almost indefinitely, which exacerbates the two points above.

Also, such a glaring and gross inefficiency makes me sad as an engineer, which, ultimately, is the most important reason. A happy engineer is an effective engineer after all.

Controlled Fission

Rather than going with the nuclear approach of “delete all the things”, it makes a lot more sense to act in a more controlled, surgical fashion and only remove the things that need to be removed when the version goes backwards. Identifying what needs to be removed is relatively easy: it's just everything with a version greater than the maximum local version.

Get Local Version
Get Remote Version
If Local == Remote
    Get Last Local Position
    Get Next [BATCH SIZE] Local Rows from last position
    Get Min & Max Version in Batch
    Query Remote for Manifest Between Min/Max Local Version
    Create Manifest from Local Batch
    Compare
        Find Remote Not in Local
            Delete from Remote
        Find Local Not in Remote
            Upload to Remote
If Local > Remote
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload to Remote
If Local < Remote
    Find Remote > Local Version
    Delete from Remote
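To make the “If Local < Remote” branch a little more concrete, here's a rough sketch of the targeted removal. How the version range is actually expressed on the wire is an assumption on my part (I've used a made-up query string parameter), so treat it as illustrative only.

using System.Net.Http;
using System.Threading.Tasks;

public static class PartialRollback
{
    // Illustrative only; "versionGreaterThan" is a hypothetical parameter name.
    public static Task<HttpResponseMessage> DeleteRowsAboveLocalVersion(
        HttpClient api, string customerNumber, string databaseId, string tableName, long localMaximumVersion)
    {
        // Only remove the remote rows the local database can no longer account for,
        // rather than wiping the whole table and re-uploading it from scratch.
        var uri = string.Format(
            "/v1/customers/{0}/databases/{1}/tables/{2}/?versionGreaterThan={3}",
            customerNumber, databaseId, tableName, localMaximumVersion);

        return api.DeleteAsync(uri);
    }
}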

This is a decent solution, but without the differencing check, it has a major flaw that can lead to missing data. This was the primary reason we went with the “delete it all, let God sort it out” approach originally, as we'd rather have everything eventually be correct than risk getting ourselves into a state where the remote data is not identical to the local data and the synchronization process thinks that it's done.

The sequence that can lead to missing data with the above algorithm in play is not straightforward, so I’ll try to explain it using an example.

Imagine we have two sets of data, one representing the local store (left side) and the other the remote (right side). The first column is the row ID, the second is the version.

Local (ID / Version)    Remote (ID / Version)
4 / 245                 4 / 245
3 / 234                 3 / 234
2 / 221                 2 / 221
1 / 210                 1 / 210

According to the representation above, both local and remote are in sync. Assume at this point a snapshot was made locally.

Now the user does something to update the row with ID 3, which gives it a new version.

Local (ID / Version)    Remote (ID / Version)
4 / 245                 4 / 245
3 / 257                 3 / 234
2 / 221                 2 / 221
1 / 210                 1 / 210

The sync process kicks in, detects the new row (because its version is higher than the remote) and sends it out, where it is in turn updated in the database by its primary key.

Local (ID / Version)    Remote (ID / Version)
4 / 245                 4 / 245
3 / 257                 3 / 257
2 / 221                 2 / 221
1 / 210                 1 / 210

If the user now performs a rollback/restore using the snapshot they took earlier, the data now looks like this.

Local (ID / Version)    Remote (ID / Version)
4 / 245                 4 / 245
3 / 234                 3 / 257
2 / 221                 2 / 221
1 / 210                 1 / 210

Finally, the algorithm above will react to this situation by removing all data from the remote where the version is greater than the local.

Local (ID / Version)    Remote (ID / Version)
4 / 245                 4 / 245
3 / 234
2 / 221                 2 / 221
1 / 210                 1 / 210

Unless something happens to the row with ID 3 (i.e. it is updated in some way), it will never be synced remotely, ruining everything.

The good news is that with the differencing check this situation is (eventually) fixed, because when scanning through the database it will eventually discover the missing row and upload it, allowing us to implement partial rollbacks in the interest of efficiency.

And that makes the engineer in me happy.

The Worst of the Worst

Everything is not quite puppies and roses though.

The synchronization process as described above is robust enough to handle just about any situation you can throw at it.

Except one.

What happens if the high watermark process can never upload its batch of changes, even at the minimum size? Keeping in mind, the minimum size is one.

There are a number of reasons why a batch size of one might be reached, but the most common are:

  • API unavailable for an extended period of time. This could be as a result of a local or remote issue, but once it's available again, the batch size will increase, so we don't really have anything to worry about here.
  • Some critical problem with uploading that particular row.

The second one is an issue that we actually ran into, where a single row contained something ridiculous like 50MB of crazy embedded data, which was more than the maximum request size we were allowing on the API, so it kept failing miserably.

As a last resort, we improved the algorithm to skip single rows if the minimum batch size has failed more than a few times. If the row is broken that badly, we reasoned that missing data (the bad row) is preferable to not being able to sync at all. On the off chance the row is fixed, the process will pick it up and upload it back to its rightful place.
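As a sketch, the skip-ahead rule ends up looking something like this; the failure threshold of three is illustrative, not the value we actually settled on.

public class BatchState
{
    public int BatchSize { get; set; }
    public int ConsecutiveFailuresAtMinimum { get; set; }
}

public static class SkipAheadPolicy
{
    private const int MinimumBatchSize = 1;
    private const int FailureThreshold = 3;   // illustrative; the real value was tuned in production

    public static bool ShouldSkipCurrentRow(BatchState state, bool lastUploadFailed)
    {
        if (!lastUploadFailed)
        {
            // Any success resets the counter; the normal tuning logic takes it from here.
            state.ConsecutiveFailuresAtMinimum = 0;
            return false;
        }

        if (state.BatchSize > MinimumBatchSize)
        {
            // Still room to shrink the batch before giving up on a single row.
            return false;
        }

        state.ConsecutiveFailuresAtMinimum++;

        // Only skip a row once it has failed repeatedly at the minimum batch size;
        // if the row is ever fixed locally, its new version will bring it back into the sync.
        return state.ConsecutiveFailuresAtMinimum >= FailureThreshold;
    }
}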

Get Local Version
Get Remote Version
If Local == Remote
    Calculate [BATCH SIZE] Using Historical Data
    Get Last Local Position
    Get Next [BATCH SIZE] Local Rows from last position
    Get Min & Max Version in Batch
    Query Remote for Manifest Between Min/Max Local Version
    Create Manifest from Local Batch
    Compare
        Find Remote Not in Local
            Delete from Remote
        Find Local Not in Remote
            Upload to Remote
If Local > Remote
    Calculate [BATCH SIZE] Using Historical Data
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload to Remote
        Record Result for [BATCH SIZE] Tuning
        If Failure & Minimum [BATCH SIZE], Skip Ahead
If Local < Remote
    Find Remote > Local Version
    Delete from Remote

To Be Continued

With the final flaws in the process rectified, that’s pretty much it for data synchronization. Until we discover the next flaw that is, which I’m sure will happen eventually.

I’ll make a summary post next week to draw everything together and bring the entire saga to a close.


Since I left last week's post on a terrible cliffhanger, I thought that this week I'd take a detour and write some more about our AWS Lambda ELB logs processor, just to keep the suspense going for as long as possible.

I’m sure my reader will be irate.

But seriously, let's talk about hard deletes and why they make everything more difficult.

Funnily enough, we have actually had some more issues with our AWS Lambda ELB logs processor, which I will probably have to talk about at some point. Specifically, while the connection pooling worked and stopped the execution from erroring out, the process can’t get through an entire ELB log file within its time limit (300 seconds, which is the maximum time available for the execution of a Lambda function). The files are pretty juicy, clocking in at 20ish megabytes, so either we need to optimise the Lambda function itself or we need to create ourselves a divide and conquer strategy.

At some unknown point in the past, the application at the core of this entire syncing adventure was changed to make the deletion of major entities reversible. Specifically, instead of simply removing the offending information from the database (along with any appropriate relationships), the primary entity entry was just marked as “deleted”, and then ignored for basically every interaction other than “undelete”. Being able to undo accidental (or malicious) deletes without resorting to a full database restore is pretty useful, so the soft delete pattern was applied to most of the entities in the system.

Alas, “most” is a far cry from “all”.

Some entities are just deleted the old fashioned way, gone forever unless someone does a database restore. Thinking about it, it makes sense to not retain full history for tables with high churn (like appointments), so we can’t just mandate soft deletion for everything just to make our lives easier. With that constraint in place, we need to adapt our process to take hard deletes into account.

The question then becomes: how do you find something if it's just not there anymore?

Suspicious Holes

Our versioning based delta algorithm will not detect deletions unless they occur at the very top of a table. When something is deleted from that specific location, the version will appear to be lower locally than remotely, so the entire table will be removed and re-uploaded from scratch.

I'll come back to this particular situation in another post, but let's just ignore it for now, because the solution will work in those particular cases, even if it's terrible.

The more interesting case is when the deletion occurs literally anywhere else in the table.

The RowVersion construct we’re relying on won’t allow us to detect if something has been removed, at least not in the same way that we’ve been using it to detect changes/additions.

What we can do, however, is use it to generate a manifest of sorts, describing everything that is in the table in a relatively efficient fashion and then use that manifest to compare the local to the remote. RowVersion is unique (barring exceptional circumstances), so we can use it as a pseudo key of sorts, allowing us to easily compare local and remote data for any table that already features versioning.

Since we’re dealing primarily with two lists of numbers, the comparison is relatively easy. Look for everything that is in the remote but not in the local and you’ll find everything that’s been deleted locally. You can then use that information to construct some commands to remove the offending data and bring everything back to harmonious balance.
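As a sketch, treating the manifest as a plain list of row version numbers (a simplification), the delete side of that comparison is just a set difference:

using System.Collections.Generic;
using System.Linq;

public static class ManifestComparer
{
    // Anything the remote knows about that the local no longer has must have been hard
    // deleted locally, so it becomes a candidate for removal from the remote.
    public static List<long> FindDeletedLocally(IEnumerable<long> localVersions, IEnumerable<long> remoteVersions)
    {
        var local = new HashSet<long>(localVersions);
        return remoteVersions.Where(version => !local.Contains(version)).ToList();
    }
}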

But where does this secondary differencing process fit into our top-level algorithm?

At first we tried running the two checks in parallel, but it quickly became obvious that the differencing process needed to be aware of the boundaries of the other process (which I’m going to call high watermark from this point forward in order to maintain clarity). Without the context supplied by the high watermark (i.e. the remote maximum version), the differencing process would get confused and everything became much harder to reason about. The easy solution was to only run the differencing process when the high watermark thought it was finished, so we slotted it into the otherwise dead area of the algorithm for when the remote and local version maximums were the same.

Get Local Version
Get Remote Version
If Local == Remote
    Get Last Local Position
    Get Next [BATCH SIZE] Local Rows from last position
    Get Min & Max Version in Batch
    Query Remote for Manifest Between Min/Max Local Version
    Create Manifest from Local Batch
    Compare
        Find Remote Not in Local
            Delete from Remote
If Local > Remote
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload
If Local < Remote
    Delete Everything Remotely

If you look closely, you can see that I've added batching logic for the differencing process, similar to how it works for the high watermark. The reality of the situation is that you can't easily compare two entire tables (remote vs local) all at once, even if you're only dealing with relatively simple numerical sequences. There might be millions and millions of rows at play, and that's way too much to deal with in a single HTTP request. Batching just breaks the problem down into manageable chunks, and all you have to do is remember where you were last up to and make sure you roll back to the start once you reach the end.
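As a minimal sketch of that position tracking (with the persistence of the last position left out):

public static class DifferencingScanPosition
{
    // Illustrative only; in practice the last position would be persisted between runs.
    public static long GetNextScanStart(long? lastPositionScanned, long remoteMaximumVersion)
    {
        // Start from the beginning on the first ever run, or once the previous scan has
        // walked past the high watermark boundary (i.e. a full pass has been completed).
        if (lastPositionScanned == null || lastPositionScanned.Value >= remoteMaximumVersion)
        {
            return 0;
        }

        return lastPositionScanned.Value;
    }
}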

Completely Trustworthy Holes

After editing the algorithm to take into account hard deletes, the synchronization process is in a pretty good place.

Additions and updates are identified and uploaded quickly (within minutes) and hard deletes are identified and resolved eventually (still quickly, but slower due to the scanning nature of the differencing process).

What about the other side of the differencing check though?

What if we find something that is in the local but not in the remote?

If we look at the process as it currently stands, there is no obvious way for it to get into a situation where there is missing data on the remote end to be picked up during the differencing process. In the interests of completeness though, we should really take care of that situation, because even if we don't think it can happen, many years of software development have proven to me that it probably will, maybe as a result of a bug, maybe as a result of some future change. Might as well cover it now and save time later.

Taking this into account, the algorithm becomes:

Get Local Version
Get Remote Version
If Local == Remote
    Get Last Local Position
    Get Next [BATCH SIZE] Local Rows from last position
    Get Min & Max Version in Batch
    Query Remote for Manifest Between Min/Max Local Version
    Create Manifest from Local Batch
    Compare
        Find Remote Not in Local
            Delete from Remote
        Find Local Not in Remote
            Upload to Remote
If Local > Remote
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload to Remote
If Local < Remote
    Delete Everything Remotely

The only addition is the reverse of the hard delete handler: finding everything in the difference batch that isn't stored remotely and uploading it. With the differencing process never running before the high watermark is finished (and using that high watermark to limit itself), we're in a pretty good place.

To Be Continued

The obvious weakness in the process is that crazy nuclear delete that happens whenever the local version is less than the remote. It's not the end of the world and it definitely does the job it's supposed to, but it's pretty nasty. As I mentioned earlier in this post though, it doesn't only occur when a database has been restored, it also occurs with tables that do hard deletes and experience a lot of churn. It's pretty crazy to delete an entire table just because the user added a new thing and then removed it again.

That’s a problem for next week though.


Continuing on from last week, if we're syncing entire rows at a time, versioning is taken care of by the SQL Server RowVersion construct and we only care about syncing in one direction, all that's left to do is to devise an algorithm taking all of those disparate pieces into account.

So without further ado, let's get straight into it.

Row, Row, Row Your Boat

The simplest syncing algorithm is one that obtains the versions of both the local and the remote data sources, compares them and then uses that information to decide what to do.

Get Local Version
Get Remote Version
If Local == Remote
    Do Nothing
If Local > Remote
    Get Local Rows > Remote Version
    Upload

It basically boils down to just retrieving and uploading all rows with a version greater than the remote, assuming the upload is done as an upsert and the tables contain appropriate primary keys separate from the version.
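As a sketch of the local side of that query (assuming SQL Server and ADO.NET, with a placeholder table name), it's essentially just this:

using System.Data.SqlClient;

public static class HighWatermarkQuery
{
    // Illustrative only; the real implementation builds this per synchronized table.
    public static SqlCommand Build(SqlConnection connection, long remoteMaximumVersion)
    {
        var command = connection.CreateCommand();

        // RowVersion is a binary(8) in SQL Server, so it is converted to a 64-bit
        // integer here to keep the comparison with the remote's version simple.
        command.CommandText =
            @"SELECT *
              FROM [dbo].[SomeTable]
              WHERE CONVERT(BIGINT, [RowVersion]) > @remoteVersion
              ORDER BY [RowVersion] ASC";

        command.Parameters.AddWithValue("@remoteVersion", remoteMaximumVersion);
        return command;
    }
}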

See? Simple, just like I said.

Job done, post finished.

Admitting You Have A Problem Is Hard

Alas, not quite.

The first time the algorithm runs, there is a high chance that the difference between the local and remote will be an extremely large number of rows. Some of the tables in the database (the uninteresting ones of course) only have tens of rows, but many tables have millions of rows. Even ignoring the potential impact querying for that much data might have on the usability of the application, uploading millions of rows all at once is likely to break all sorts of HTTP communication limits (request size being the obvious one).

If we’re aiming to have the process run silently in the background, without affecting anything, we need a way to break our updates into smaller batches.

Amending the algorithm to take batching into account gives us this.

Get Local Version
Get Remote Version
If Local == Remote
    Do Nothing
If Local > Remote
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload

We don’t need to keep any information about what position we were at last time because we can always find that out by asking the remote end, which is useful. Unfortunately, this does have the side effect of making the process a little more chatty than it strictly needs to be, but optimization can always be done later. Of course, the tradeoff here is timeliness. Small batches are good for performance, but bad for the overall sync time (due to the process running around every minute), while large batches are the opposite. There is a sweet spot somewhere in the middle, but that sweet spot moves over time based on changing environmental factors, both local and remote.

A good example of these sorts of factors is that because the algorithm is being run on-premises, in a location we have zero control over, someone might try to run it with an internet connection that is nothing more than a few carrier pigeons with USB sticks strapped to their legs. Even worse than that, they might be trying to run it on ADSL more than a few kilometres from an exchange.

*shudder*

However, we can help these poor unfortunate souls by making the batch size adaptive. Basically, we set a default, min and max batch size, and then if errors occur while uploading (like timeouts), we make the batch size smaller for next time. If everything is going well, we make the batch size larger. With some logic in there to avoid an indecisive situation where the batch size never stops flipping around, we’ve got a system that will intelligently adjust itself to the capabilities of the environment that it is running in.

I’ve left the adaptive batch size stuff out of the algorithm definition above in order to simplify it, but imagine that it goes into the part that gets the batch size and that it takes into account the last 10 or so runs.
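For illustration, a rough sketch of that adaptive sizing might look like the following; the specific numbers and the grow/shrink steps are made up, not what we actually tuned it to.

using System;
using System.Collections.Generic;
using System.Linq;

public class AdaptiveBatchSizer
{
    private const int Minimum = 1;
    private const int Maximum = 5000;        // illustrative bounds and default, not the real tuning
    private const int WindowSize = 10;

    private readonly Queue<bool> recentResults = new Queue<bool>();

    public int CurrentSize { get; private set; }

    public AdaptiveBatchSizer()
    {
        CurrentSize = 1000;
    }

    public void RecordResult(bool succeeded)
    {
        recentResults.Enqueue(succeeded);
        while (recentResults.Count > WindowSize)
        {
            recentResults.Dequeue();
        }

        if (!succeeded)
        {
            // Back off hard when an upload fails (timeouts on slow connections being the usual culprit).
            CurrentSize = Math.Max(Minimum, CurrentSize / 2);
        }
        else if (recentResults.Count == WindowSize && recentResults.All(result => result))
        {
            // Only grow again after a full window of clean results, to avoid the size flip-flopping.
            CurrentSize = Math.Min(Maximum, CurrentSize + 100);
        }
    }
}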

Back To The…Past?

Now that we’ve got batching in place, you might notice that the algorithm is missing a branch in the local vs remote version comparison.

What happens when the Local version is less than the Remote? Actually, how can that even happen? If the internal versioning process always results in the version increasing for every single change, then how could the system ever get into a situation where the number goes down?

The answer to that is database restores.

Unfortunately, database restores are not only possible for the application in question, they are downright common.

There is a long and storied history about why database restores are an integral part of the application, but the short version is that errors and corruption occur all too frequently, or have occurred frequently in the past, and someone decided that the capability to arbitrarily snapshot and then restore the entire database was an amazing solution to the problem.

In truth, it is an amazing solution to the problem, but it really does make integrations of all sorts incredibly difficult, especially data synchronization.

Amending the algorithm to take database restores into account leaves us with this.

Get Local Version
Get Remote Version
If Local == Remote
    Do Nothing
If Local > Remote
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload
If Local < Remote
    Delete Everything Remotely

As you can see, we went with the nuclear approach.

Removing all of the data from the remote location is obviously a sub-optimal approach, but it's also the easiest and most reliable. Until we get into a situation where we need to optimise, it's good enough. I'll be coming back to this particular decision later in this series of posts, but for now, the algorithm just removes everything when the local version goes below the remote.

Soft, Like Pudding

Looking at the algorithm I’ve defined above, it takes into account quite a few scenarios:

  • If a new entity is added in the application, one or more rows are added, they get appropriate RowVersions, are picked up by the algorithm and are uploaded to the remote location.
  • If an entity is changed in the application, one or more rows are changed, they get new RowVersions, are picked up by the algorithm and are uploaded to the remote location.
  • If an entity is deleted, one or more rows are changed (marked as deleted), they get new RowVersions, are picked up by the algorithm and are uploaded to the remote location.

Actually, that last one is a little tricky.

A good percentage of the entities in the application are only ever deleted softly. This means that they are not physically removed from the database, but are instead updated to indicate deletion and then ignored for subsequent interactions. Soft deletes make a synchronization process easy because if something is there, it will always be there, and if you have good change tracking via versioning or something similar, you’re fine.

Some things are hard deleted though, and that makes for complications.

To Be Continued

Handling hard deletes is a topic unto itself, so I’m going to leave that for the next post.

Pretty terrible cliffhanger though.