
Well, now it’s next week, and it’s time to deal with the problem.

What problem?

The problem where we delete literally all of the data for a table and then synchronize it all over again whenever the maximum local version gets lower than the maximum remote version.

That problem.

If that sort of thing only happened very rarely, I probably wouldn’t care all that much. The problem is, it can occur whenever a database restore/rollback is performed (which is annoyingly frequent for our users) or when a user removes an entity they just added and that entity is subject to hard deletion rules (which happens quite a lot).

With the table annihilation occurring far more frequently than we would like, we’re looking at a couple of very real problems.

  • During the period where the data is re-syncing, anything using that data (third party APIs, mobile applications, websites, etc) will be out of date. Syncing a table from scratch can take a few hours (depending on table size), so it’s not a great situation to be in.
  • Repeatedly pushing the same information consumes valuable bandwidth. In Australia, where the internet is mostly run over taut pieces of string and unlimited data plans are not the norm, consuming bandwidth for no real gain is foolish.
  • Some tables with hard delete have such high churn that a constant cycle of delete and re-sync can be maintained almost indefinitely, which exacerbates the two points above.

Also, such a glaring and gross inefficiency makes me sad as an engineer, which, ultimately, is the most important reason. A happy engineer is an effective engineer after all.

Controlled Fission

Rather than going with the nuclear approach of “delete all the things”, it makes a lot more sense to act in a more controlled, surgical fashion and only remove the things that need to be removed when the version goes backwards. Identifying what needs to be removed is relatively easy: it’s just everything with a version greater than the maximum local version.

Get Local Version
Get Remote Version
If Local == Remote
    Get Last Local Position
    Get Next [BATCH SIZE] Local Rows from last position
    Get Min & Max Version in Batch
    Query Remote for Manifest Between Min/Max Local Version
    Create Manifest from Local Batch
    Compare
        Find Remote Not in Local
            Delete from Remote
        Find Local Not in Remote
            Upload to Remote
If Local > Remote
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload to Remote
If Local < Remote
    Find Remote > Local Version
    Delete from Remote
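
To make that last branch a bit more concrete, here’s a minimal sketch in Python of what the controlled deletion might look like, against a purely hypothetical local store and remote API client (the method names are assumptions, not our actual implementation):

def handle_rollback(local_store, remote_api, table):
    # A rollback/restore has pushed the local maximum version below the remote
    # maximum. Instead of wiping the whole remote table, remove only the rows
    # the rollback "un-did": everything remote with a version above the local max.
    local_max = local_store.get_max_version(table)
    remote_max = remote_api.get_max_version(table)

    if local_max < remote_max:
        remote_api.delete_rows_above_version(table, version=local_max)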

This is a decent solution, but without the differencing check, it has a major flaw that can lead to missing data. This was the primary reason we went with the “delete it all, let God sort it out” approach originally, as we’d rather have everything eventually be correct than risk getting ourselves into a state where the remote data is not identical to the local data and the synchronization process thinks that it’s done.

The sequence that can lead to missing data with the above algorithm in play is not straightforward, so I’ll try to explain it using an example.

Imagine we have two sets of data, one representing the local store (left side) and the other the remote (right side). The first column is the row ID, the second is the version.

Local (ID, Version)    Remote (ID, Version)
4, 245                 4, 245
3, 234                 3, 234
2, 221                 2, 221
1, 210                 1, 210

According to the representation above, both local and remote are in sync. Assume at this point a snapshot was made locally.

Now the user does something to update the row with ID 3, which gives it a new version.

Local (ID, Version)    Remote (ID, Version)
3, 257                 4, 245
4, 245                 3, 234
2, 221                 2, 221
1, 210                 1, 210

The sync process kicks in, detects the new row (because its version is higher than the remote) and sends it out, where it is in turn updated in the database by its primary key.

Local (ID, Version)    Remote (ID, Version)
3, 257                 3, 257
4, 245                 4, 245
2, 221                 2, 221
1, 210                 1, 210

If the user now performs a rollback/restore using the snapshot they took earlier, the data now looks like this.

Local (ID, Version)    Remote (ID, Version)
4, 245                 3, 257
3, 234                 4, 245
2, 221                 2, 221
1, 210                 1, 210

Finally, the algorithm above will react to this situation by removing all data from the remote where the version is greater than the local.

Local (ID, Version)    Remote (ID, Version)
4, 245
3, 234                 4, 245
2, 221                 2, 221
1, 210                 1, 210

Unless something happens to the row with ID 3 (i.e. it’s updated in some way), it will never be synced remotely, ruining everything.

The good news is that the differencing check (eventually) fixes this situation: as it scans through the table it will discover the missing row and upload it, which is what allows us to implement partial rollbacks in the interest of efficiency.

And that makes the engineer in me happy.

The Worst of the Worst

Everything is not quite puppies and roses though.

The synchronization process as described above is robust enough to handle just about any situation you can throw at it.

Except one.

What happens if the high watermark process can never upload its batch of changes, even at the minimum size? Keep in mind that the minimum size is one.

There are a number of reasons why a batch size of one might be reached, but the most common are:

  • API unavailable for an extended period of time. This could be as a result of a local or remote issue, but once it’s available again, the batch size will increase, so we don’t really have anything to worry about here.
  • Some critical problem with uploading that particular row.

The second one is an issue that we actually ran into, where a single row contained something ridiculous like 50MB of crazy embedded data, which was more than the maximum request size we were allowing on the API, so it kept failing miserably.

As a last resort, we improved the algorithm to skip single rows if the minimum batch size has failed more than a few times. If the row is broken that badly, we reasoned that missing data (the bad row) is preferable to not being able to sync at all. On the off chance the row is fixed, the process will pick it up and upload it back into its rightful place.

Get Local Version
Get Remote Version
If Local == Remote
    Calculate [BATCH SIZE] Using Historical Data
    Get Last Local Position
    Get Next [BATCH SIZE] Local Rows from last position
    Get Min & Max Version in Batch
    Query Remote for Manifest Between Min/Max Local Version
    Create Manifest from Local Batch
    Compare
        Find Remote Not in Local
            Delete from Remote
        Find Local Not in Remote
            Upload to Remote
If Local > Remote
    Calculate [BATCH SIZE] Using Historical Data
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload to Remote
        Record Result for [BATCH SIZE] Tuning
        If Failure & Minimum [BATCH SIZE], Skip Ahead
If Local < Remote
    Find Remote > Local Version
    Delete from Remote
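
As a rough illustration of the skip-ahead rule (not our actual code; the threshold and method names are assumptions), the upload step might look something like this in Python:

MINIMUM_BATCH_SIZE = 1
MAX_FAILURES_BEFORE_SKIP = 3  # illustrative threshold, really a tuning knob

class UploadError(Exception):
    pass

def upload_next_batch(local_store, remote_api, table, batch_size, failures_so_far):
    # Grab the next batch of local rows the remote hasn't seen yet.
    remote_max = remote_api.get_max_version(table)
    rows = local_store.get_rows_above_version(table, remote_max, limit=batch_size)
    if not rows:
        return 0  # nothing to do, reset the failure counter

    try:
        remote_api.upsert(table, rows)
        return 0  # success also resets the failure counter
    except UploadError:
        failures = failures_so_far + 1
        if batch_size == MINIMUM_BATCH_SIZE and failures >= MAX_FAILURES_BEFORE_SKIP:
            # Last resort: this single row just won't go up (e.g. it's too big
            # for the API), so note it and move past it rather than stalling
            # the entire table forever.
            local_store.record_skipped_row(table, rows[0])
            return 0
        return failures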

To Be Continued

With the final flaws in the process rectified, that’s pretty much it for data synchronization. Until we discover the next flaw that is, which I’m sure will happen eventually.

I’ll make a summary post next week to draw everything together and bring the entire saga to a close.


Since I left last week’s post on a terrible cliffhanger, I thought that this week I’d take a detour and write some more about our AWS Lambda ELB logs processor, just to keep the suspense going for as long as possible.

I’m sure my reader will be irate.

But seriously, let’s talk about hard deletes and why they make everything more difficult.

Funnily enough, we have actually had some more issues with our AWS Lambda ELB logs processor, which I will probably have to talk about at some point. Specifically, while the connection pooling worked and stopped the execution from erroring out, the process can’t get through an entire ELB log file within its time limit (300 seconds, which is the maximum time available for the execution of a Lambda function). The files are pretty juicy, clocking in at 20ish megabytes, so either we need to optimise the Lambda function itself or we need to create ourselves a divide and conquer strategy.

At some unknown point in the past, the application at the core of this entire syncing adventure was changed to make the deletion of major entities reversible. Specifically, instead of simply removing the offending information from the database (along with any appropriate relationships), the primary entity entry was just marked as “deleted”, and then ignored for basically every interaction other than “undelete”. Being able to undo accidental (or malicious) deletes without resorting to a full database restore is pretty useful, so the soft delete pattern was applied to most of the entities in the system.

Alas, “most” is a far cry from “all”.

Some entities are just deleted the old fashioned way, gone forever unless someone does a database restore. Thinking about it, it makes sense to not retain full history for tables with high churn (like appointments), so we can’t just mandate soft deletion for everything just to make our lives easier. With that constraint in place, we need to adapt our process to take hard deletes into account.

The question then becomes: how do you find something if it’s just not there anymore?

Suspicious Holes

Our versioning-based delta algorithm will not detect deletions unless they occur at the very top of a table. When something is deleted from that specific location, the version will appear to be lower locally than remotely, so the entire table will be removed and re-uploaded from scratch.

I’ll come back to this particular situation in another post, but let’s ignore it for now, because the solution will work in those cases, even if it’s terrible.

The more interesting case is when the deletion occurs literally anywhere else in the table.

The RowVersion construct we’re relying on won’t allow us to detect if something has been removed, at least not in the same way that we’ve been using it to detect changes/additions.

What we can do, however, is use it to generate a manifest of sorts, describing everything that is in the table in a relatively efficient fashion, and then use that manifest to compare the local to the remote. RowVersion is unique (barring exceptional circumstances), so we can use it as a pseudo key, allowing us to easily compare local and remote data for any table that already features versioning.

Since we’re dealing primarily with two lists of numbers, the comparison is relatively easy. Look for everything that is in the remote but not in the local and you’ll find everything that’s been deleted locally. You can then use that information to construct some commands to remove the offending data and bring everything back to harmonious balance.
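
In Python terms, it’s essentially just a set difference. The sketch below is illustrative only, with the manifests reduced to bare lists of RowVersion values:

def find_hard_deleted_versions(local_manifest, remote_manifest):
    # Both manifests cover the same version range and are just collections of
    # RowVersion values. Anything the remote has that the local no longer does
    # must have been hard deleted on-premises.
    return set(remote_manifest) - set(local_manifest)

# For example: version 234 exists remotely but not locally, so the matching
# remote row should be deleted.
to_delete = find_hard_deleted_versions(
    local_manifest=[210, 221, 245],
    remote_manifest=[210, 221, 234, 245],
)
assert to_delete == {234}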

But where does this secondary differencing process fit into our top-level algorithm?

At first we tried running the two checks in parallel, but it quickly became obvious that the differencing process needed to be aware of the boundaries of the other process (which I’m going to call high watermark from this point forward in order to maintain clarity). Without the context supplied by the high watermark (i.e. the remote maximum version), the differencing process would get confused and everything would become much harder to reason about. The easy solution was to only run the differencing process when the high watermark thought it was finished, so we slotted it into the otherwise dead area of the algorithm, for when the remote and local version maximums were the same.

Get Local Version
Get Remote Version
If Local == Remote
    Get Last Local Position
    Get Next [BATCH SIZE] Local Rows from last position
    Get Min & Max Version in Batch
    Query Remote for Manifest Between Min/Max Local Version
    Create Manifest from Local Batch
    Compare
        Find Remote Not in Local
            Delete from Remote
If Local > Remote
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload
If Local < Remote
    Delete Everything Remotely

If you look closely, you can see that I’ve added batching logic for the differencing process, similar to how it works for the high watermark. The reality of the situation is that you can’t easily compare two entire tables (remote vs local) all at once, even if you’re only dealing with relatively simple numerical sequences. There might be millions and millions of rows at play, and that’s way too much to deal with in a single HTTP request. Batching just breaks the problem down into manageable chunks, and all you have to do is remember where you were last up to and make sure you roll back to the start once you reach the end.
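
To illustrate, the position tracking might look something like the following Python sketch (the method names are invented for the example); the only interesting bit is wrapping back around to the start of the table once the scan runs off the end:

def next_differencing_batch(local_store, table, last_position, batch_size):
    # Walk the local table in RowVersion order, batch_size rows at a time,
    # remembering where we got up to between runs.
    rows = local_store.get_rows_above_version(table, last_position, limit=batch_size)

    if not rows:
        # Ran off the end of the table; roll back around to the beginning.
        rows = local_store.get_rows_above_version(table, 0, limit=batch_size)

    if not rows:
        return [], None, None  # empty table, nothing to compare

    versions = [row.version for row in rows]
    # min/max bound the remote manifest query; max is also the position to
    # resume from on the next run.
    return rows, min(versions), max(versions)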

Completely Trustworthy Holes

After editing the algorithm to take into account hard deletes, the synchronization process is in a pretty good place.

Additions and updates are identified and uploaded quickly (within minutes) and hard deletes are identified and resolved eventually (still quickly, but slower due to the scanning nature of the differencing process).

What about the other side of the differencing check though?

What if we find something that is in the local but not in the remote?

If we look at the process as it currently stands, there is no obvious way for it to get into a situation where there is missing data on the remote end to be picked up during the differencing process. In the interests of completeness though, we should really take care of that situation, because even if we don’t think it can happen, many years of software development has proven to me that it probably will, maybe as a result of a bug, maybe as a result of some future change. Might as well cover it now and save time later.

Taking this into account, the algorithm becomes:

Get Local Version
Get Remote Version
If Local == Remote
    Get Last Local Position
    Get Next [BATCH SIZE] Local Rows from last position
    Get Min & Max Version in Batch
    Query Remote for Manifest Between Min/Max Local Version
    Create Manifest from Local Batch
    Compare
        Find Remote Not in Local
            Delete from Remote
        Find Local Not in Remote
            Upload to Remote
If Local > Remote
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload to Remote
If Local < Remote
    Delete Everything Remotely

The only addition is the reverse of the hard delete handler: finding everything in the difference batch that isn’t stored remotely and uploading it. With the differencing process never running before the high watermark is finished (and using that high watermark to limit itself), we’re in a pretty good place.
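
Extending the earlier comparison sketch to cover both directions might look something like this (still Python, still hypothetical method names):

def resolve_batch_differences(remote_api, table, local_batch, remote_manifest):
    # local_batch is the batch of local rows, remote_manifest the list of
    # RowVersion values the remote holds for the same version range.
    local_by_version = {row.version: row for row in local_batch}
    remote_versions = set(remote_manifest)

    # Hard deleted locally, so remove it remotely.
    to_delete = remote_versions - set(local_by_version)
    if to_delete:
        remote_api.delete_versions(table, to_delete)

    # Present locally but somehow missing remotely (a bug, a partial failure,
    # whatever), so upload it again.
    missing = set(local_by_version) - remote_versions
    if missing:
        remote_api.upsert(table, [local_by_version[v] for v in missing])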

To Be Continued

The obvious weakness in the process is that crazy nuclear delete that happens whenever the local version is less than the remote. It’s not the end of the world and it definitely does the job it’s supposed to, but it’s pretty nasty. As I mentioned earlier in this post though, it doesn’t only occur when a database has been restored, it also occurs with tables that do hard deletes and experience a lot of churn. It’s pretty crazy to delete an entire table just because the user added a new thing and then removed it again.

That’s a problem for next week though.


Continuing on from last week, if we’re syncing entire rows at a time, versioning is taken care of by the SQL Server RowVersion construct and we only care about syncing in one direction, then all that’s left to do is to devise an algorithm taking all of those disparate pieces into account.

So without further ado, let’s get straight into it.

Row, Row, Row Your Boat

The simplest syncing algorithm is one that obtains the versions of both the local and the remote data sources, compares them and then uses that information to decide what to do.

Get Local Version
Get Remote Version
If Local == Remote
    Do Nothing
If Local > Remote
    Get Local Rows > Remote Version
    Upload

It basically boils down to just retrieving and uploading all rows with a version greater than the remote, assuming the upload is done as an upsert and the tables contain appropriate primary keys separate from the version.
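
As a rough sketch (Python, with hypothetical local store and remote API objects standing in for the real moving parts), one pass of the algorithm is only a handful of lines:

def sync_pass(local_store, remote_api, table):
    local_max = local_store.get_max_version(table)
    remote_max = remote_api.get_max_version(table)

    if local_max > remote_max:
        # Push everything the remote hasn't seen; the remote end upserts
        # each row by its primary key.
        rows = local_store.get_rows_above_version(table, remote_max)
        remote_api.upsert(table, rows)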

See? Simple, just like I said.

Job done, post finished.

Admitting You Have A Problem Is Hard

Alas, not quite.

The first time the algorithm runs, there is a high chance that the difference between the local and remote will be an extremely large number of rows. Some of the tables in the database (the uninteresting ones of course) only have tens of rows, but many tables have millions of rows. Even ignoring the potential impact querying for that much data might have on the usability of the application, uploading millions of rows all at once is likely to break all sorts of HTTP communication limits (request size being the obvious one).

If we’re aiming to have the process run silently in the background, without affecting anything, we need a way to break our updates into smaller batches.

Amending the algorithm to take batching into account gives us this.

Get Local Version
Get Remote Version
If Local == Remote
    Do Nothing
If Local > Remote
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload

We don’t need to keep any information about what position we were at last time because we can always find that out by asking the remote end, which is useful. Unfortunately, this does have the side effect of making the process a little more chatty than it strictly needs to be, but optimization can always be done later. Of course, the tradeoff here is timeliness. Small batches are good for performance, but bad for the overall sync time (due to the process running around every minute), while large batches are the opposite. There is a sweet spot somewhere in the middle, but that sweet spot moves over time based on changing environmental factors, both local and remote.

A good example of these sorts of factors is that because the algorithm is being run on-premises, in a location we have zero control over, someone might try to run it with an internet connection that is nothing more than a few carrier pigeons with USB sticks strapped to their legs. Even worse than that, they might be trying to run it on ADSL more than a few kilometres from an exchange.

*shudder*

However, we can help these poor unfortunate souls by making the batch size adaptive. Basically, we set a default, min and max batch size, and then if errors occur while uploading (like timeouts), we make the batch size smaller for next time. If everything is going well, we make the batch size larger. With some logic in there to avoid an indecisive situation where the batch size never stops flipping around, we’ve got a system that will intelligently adjust itself to the capabilities of the environment that it is running in.

I’ve left the adaptive batch size stuff out of the algorithm definition above in order to simplify it, but imagine that it goes into the part that gets the batch size and that it takes into account the last 10 or so runs.
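
Something like the following Python sketch captures the general idea, though the limits are made up and the real damping logic would need more care:

MIN_BATCH, DEFAULT_BATCH, MAX_BATCH = 1, 1000, 10000  # illustrative values only
HISTORY_LENGTH = 10

def next_batch_size(current_size, recent_results):
    # recent_results is a list of booleans (True = successful upload),
    # newest last, covering roughly the last HISTORY_LENGTH runs.
    window = recent_results[-HISTORY_LENGTH:]
    if not window:
        return DEFAULT_BATCH

    if not all(window):
        # Timeouts or oversized requests in the recent past; back off.
        return max(MIN_BATCH, current_size // 2)

    # A clean run of successes lets the batch size creep back up. Only growing
    # on a fully clean window is the (crude) protection against the size
    # flip-flopping endlessly.
    return min(MAX_BATCH, current_size * 2)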

Back To The…Past?

Now that we’ve got batching in place, you might notice that the algorithm is missing a branch in the local vs remote version comparison.

What happens when the Local version is less than the Remote? Actually, how can that even happen? If the internal versioning process always results in the version increasing for every single change, then how could the system ever get into a situation where the number goes down?

The answer to that is database restores.

Unfortunately, database restores are not only possible for the application in question, they are downright common.

There is a long and storied history about why database restores are an integral part of the application, but the short version is that errors and corruption occur all too frequently, or have occurred frequently in the past, and someone decided that the capability to arbitrarily snapshot and then restore the entire database was an amazing solution to the problem.

In truth, it is an amazing solution to the problem, but it really does make integrations of all sorts incredibly difficult, especially data synchronization.

Amending the algorithm to take database restores into account leaves us with this.

Get Local Version
Get Remote Version
If Local == Remote
    Do Nothing
If Local > Remote
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload
If Local < Remote
    Delete Everything Remotely

As you can see, we went with the nuclear approach.

Removing all of the data from the remote location is obviously a sub-optimal approach, but it’s also the easiest and most reliable. Until we get into a situation where we need to optimise, it’s good enough. I’ll be coming back to this particular decision later in this series of posts, but for now, the algorithm just removes everything when the local version goes below the remote.

Soft, Like Pudding

Looking at the algorithm I’ve defined above, it takes into account quite a few scenarios:

  • If a new entity is added in the application, one or more rows are added, they get appropriate RowVersions, are picked up by the algorithm and are uploaded to the remote location.
  • If an entity is changed in the application, one or more rows are changed, they get new RowVersions, are picked up by the algorithm and are uploaded to the remote location.
  • If an entity is deleted, one or more rows are changed (marked as deleted), they get new RowVersions, are picked up by the algorithm and are uploaded to the remote location.

Actually, that last one is a little tricky.

A good percentage of the entities in the application are only ever deleted softly. This means that they are not physically removed from the database, but are instead updated to indicate deletion and then ignored for subsequent interactions. Soft deletes make a synchronization process easy because if something is there, it will always be there, and if you have good change tracking via versioning or something similar, you’re fine.

Some things are hard deleted though, and that makes for complications.

To Be Continued

Handling hard deletes is a topic unto itself, so I’m going to leave that for the next post.

Pretty terrible cliffhanger though.


On the back of the marathon series of posts around our AWS Lambda ELB Logs Processor, I’m just going to do another marathon. This time, I’m going to write about the data synchronization process that we use to facilitate Cloud integrations with our legacy application.

I’ve mentioned our long-term strategy a few times in the past (the most recent being in a post about AWS RDS Database replicas in the latter half of 2016), but I’ll note it down here again. We want to free customer data currently imprisoned in their on-premises database, in order to allow for the production of helpful websites and mobile applications while also providing easier integrations for third parties. This sort of approach is win-win: we get to develop and sell interesting and useful applications and services, and the customer gets the ability to do things on the run (which is important for real-estate agents) without having to buy completely new software and migrate all of their data.

A pretty great idea, but the logistics of actually getting at the customer data are somewhat challenging.

The application responsible for generating all of the data is a VB6 application that is over 10 years old now. While it does have some modern .NET components, it’s not exactly the place you want to be building a synchronization process. Many copies of this application run in a single office and they all connect to a central SQL Server, so orchestrating them all to act together towards a common goal is challenging. In addition to that, you run the risk of degrading the application performance for the average user if the application is busy doing things in the background that they might not necessarily care about right now.

What you really need is a centralised piece of software installed in a server environment in each office, running all the time. It can be responsible for the things that the individual application should not be.

Luckily, we have exactly that.

Central Intelligence

Each office that has access to our legacy application typically also has a companion application installed in parallel to their SQL Server. This component features a really nicely put together plugin framework that allows us to remotely deploy and control software that facilitates Cloud integrations and other typical server tasks, like automated backups.

Historically, the motivation for the customer to install and configure this server component is to gain access to the Cloud integrations they want to use, or to the other server-supported features I briefly mentioned. Unfortunately, there have been at least 3 plugins so far that accomplish the syncing of data in one shape or another, usually written in such a way that they deal with only one specific Cloud integration. A good example of this is the plugin that is responsible for pushing property inspection information to a remote location to be used in a mobile application, and then pulling down updated information as it is changed.

These one-off plugins were useful, but each of them was specific enough that it was difficult to reuse the information they pushed up for other purposes.

The next time we had to create a Cloud integration, we decided to do it far more generically than we had before. We would implement a process that would sync the contents of the local database up to a centralised remote store, in a structure as close to the original as possible. This would leave us in a solid place to build whatever integrations we might want to create moving forward. No need to write any more plugins for specific situations, just use the data that is already there.

As is always the case, while the goal might sound simple, the reality is vastly different.

Synchronize Your Watches Boys

When talking about synchronizing data, there are generally two parts to consider. The first is for all changes in the primary location to be represented in a secondary location, and the second is for all changes in the secondary location to be represented in the primary. For us, the primary is local/on-premises and the secondary is remote/cloud.

Being relatively sane people, and with business requirements that only required remote access to the data (i.e. no need to deal with remote changes), we could ignore the second part and just focus on one-way sync, from local to remote. Of course, we knew that eventually people would want to change the data remotely as well, so we never let that disquieting notion leave our minds, but for now we just had to get a process together that would make sure all local changes were represented remotely.

There were non-functional requirements too:

  • We had to make sure that the data was available in a timely fashion. Minutes is fine, hours is slow, days is ridiculous. We wanted a near real-time synchronization process, or as close to as we could get.
  • We needed to make sure that the synchronization process did not adversely affect the usability of the application in any way. It should be as invisible to the end-user as possible.

Nothing too complicated.

The Best Version Of Me

Looking at the data, and keeping in mind that we wanted to keep the remote structure as close to the local one as possible, we decided to use a table row as the fundamental unit of the synchronization process. Working with these small units, the main thing we would have to accomplish would be an efficient way to detect differences between the local and remote stores, so that we could decide how to react.

A basic differencing engine can use something as simple as a constantly increasing version in order to determine if location A is the same as location B, and luckily for us, this construct already existed in the database that we were working with. SQL Server tables can optionally have a column called RowVersion, which is a numeric column that contains unique values that constantly increase for each change made in the database. What this means is that if I make a change to a row in Table A, that row will get a new RowVersion of, say, 6. If I then make a change to Table B, that row will be versioned 7, and so on and so forth. I don’t believe the number is guaranteed to increase a single element at a time, but it’s always higher.

RowVersion is not a timestamp in the traditional sense, but it does represent the abstract progression of time for a database that is being changed (i.e. each change is a “tick” in time).
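
For illustration, the local side of that measurement might boil down to a query like the one below. This is a Python/T-SQL sketch using a generic DB-API style connection; the table and column names are placeholders rather than our actual schema:

def get_rows_changed_since(connection, table_name, last_synced_version, batch_size):
    # rowversion columns are binary(8) under the hood, but converting to BIGINT
    # makes them easy to compare against the watermark we track.
    sql = f"""
        SELECT TOP ({int(batch_size)}) *
        FROM {table_name}
        WHERE CONVERT(BIGINT, RowVersion) > ?
        ORDER BY RowVersion ASC
    """
    # table_name comes from our own configuration, never from user input.
    cursor = connection.cursor()
    cursor.execute(sql, last_synced_version)
    return cursor.fetchall()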

With a mechanism in place to measure change on the local side of the equation, all that was left was to find a way to use that measurement to act accordingly whenever changes occurred.

Simple.

To Be Continued

This post is long enough already though, so it’s time to take a break until next week, when I will outline the basic synchronization process and then poke a bunch of holes in it, showing why it wasn’t quite good enough.