On the back of the marathon series of posts around our AWS Lambda ELB Logs Processor, I’m just going to do another marathon. This time, I’m going to write about the data synchronization process that we use to facilitate Cloud integrations with our legacy application.
I’ve mentioned our long term strategy a few times in the past (the most recent being in a post about AWS RDS Database replicas in the latter half of 2016), but I’ll note it down here again. We want to free customer data currently imprisoned in their on-premises database, in order to allow for the production of helpful websites and mobile applications while also providing easier integrations for third parties. This sort of approach is win-win, we get to develop and sell interesting and useful applications and services, and the customer gets the ability to do things on the run (which is important for real-estate agents) without having to buy completely new software and migrate all of their data.
A pretty great idea, but the logistics of actually getting at the customer data are somewhat challenging.
The application responsible for generating all of the data is a VB6 application that is over 10 years old now. While it does have some modern .NET components, its not exactly the place you want to be building a synchronization process. Many copies of this application run in a single office and they all connect to a central SQL server, so orchestrating them all to act together towards a common goal is challenging. In addition to that, you run the risk of degrading the application performance for the average user if the application is busy doing things in the background that they might not necessarily care about right now.
What you really need is a centralised piece of software installed in a server environment in each office, running all the time. It can be responsible for the things the individual application should not.
Luckily, we have exactly that.
Each office that has access to our legacy application typically also has a companion application installed in parallel to their SQL Server. This component features a really nicely put together plugin framework that allows us to remotely deploy and control software that facilitates Cloud integrations and other typical server tasks, like automated backups.
Historically the motivation for the customer to install and configure this server component is to gain access to Cloud integrations they want to use, or to the other server supported features I briefly mentioned. Unfortunately, there have been at least 3 plugins so far that accomplish the syncing of data in one shape or another, usually written in such a way that they deal with only one specific Cloud integration. A good example of this is the plugin that is responsible for pushing property inspection information to a remote location to be used in a mobile application, and then pulling down updated information as it is changed.
These one-off plugins were useful, but each of them was specific enough that it was difficult to reuse the information they pushed up for other purposes.
The next time we had to create a Cloud integration, we decided to do it far more generically than we had before. We would implement a process that would sync the contents of the local database up to a centralised remote store, in a structure as close to the original as possible. This would leave us in a solid place to move forward with whatever integrations we might want to create moving forward. No need to write any more plugins for specific situations, just use the data that is already there.
As is always the case, while the goal might sound simple, the reality is vastly different.
Synchronize Your Watches Boys
When talking about synchronizing data, there are generally two parts to consider. The first is for all changes in the primary location to be represented in a secondary location and second is for all changes in the secondary location to be represented in the primary. For us, the primary is local/on-premises and the secondary is remote/cloud.
Being relatively sane people, and with business requirements that only required remote access to the data (i.e. no need to deal with remote changes), we could ignore the second part and just focus on one way sync, from the local to remote. Of course, we knew that eventually people would want to change the data remotely as well, so we never let that disquieting notion leave our minds, but for now we just had to get a process together that would make sure all local changes were represented remotely.
There were non-functional requirements too:
- We had to make sure that the data was available in a timely fashion. Minutes is fine, hours is slow, days is ridiculous. We wanted a near real-time synchronization process, or as close to as we could get.
- We needed to make sure that the synchronization process did not adversely affect the usability of the application in any way. It should be as invisible to the end-user as possible.
Nothing too complicated.
The Best Version Of Me
Looking at the data, and keeping in mind that we wanted to keep the remote structure as close to the local one as possible, we decided to use a table row as the fundamental unit of the synchronization process. Working with these small units, the main thing we would have to accomplish would be an efficient way to detect differences between the local and remote stores, so that we could decide how to react.
A basic differencing engine can use something as simple as a constantly increasing version in order to determine if location A is the same as location B, and luckily for us, this construct already existed in the database that we were working with. SQL Server tables can optionally have a column called RowVersion, which is a numeric column that contains unique values that constantly increase for each change made in the database. What this means is that if I make a change to a row in Table A, that row will have a new RowVersion of say 6. If I then make a change to Table B, that row will be versioned 7 and soon and so forth. I don’t believe the number is guaranteed to increase a single element at a time, but its always higher.
RowVersion is not a timestamp in the traditional sense, but it does represent the abstract progression of time for a database that is being changed (i.e. each change is a “tick” in time).
With a mechanism in place to measure change on the local side of the equation, all that was left was to find a way to use that measurement to act accordingly whenever changes occurred.
To Be Continued
This post is long enough already though, so its time to take a break until next week, when I will outline the basic synchronization process and then poke a bunch of holes in it, showing why it wasn’t quite good enough.