
It's been two months since I posted about our foray into continuous data migration from our legacy product to our cloud platform. It's been mulling around in the back of my head ever since though, and a few weeks ago we finally got a chance to go back and poke at it again. Our intent was to extend the original prototype and get it to a place where it could be demonstrated to a real client (using their real data).

Spending development effort to build an early, extremely rough prototype and then demonstrating it to users and gathering feedback as quickly as possible is a great way to avoid building the wrong thing. You can get early indicators about whether or not you’re headed in the right direction without having to invest too much money, assuming, of course, you pick a representative example of your target audience.

When we finished building the first prototype, it quickly became apparent we couldn’t actually show it to a user. We could barely show it to ourselves.

I can just imagine the sort of awkward statement that such a demonstration would have started with:

And when you make a change in your legacy system like this, all you have to do is wait 45+ minutes before it's available over here in the new and shiny cloud platform! How great is that!

It's a pretty hard sell, so before we could even talk to anyone, we needed to do better.

It's All So Pointless

The first prototype extended our existing migration process so that, instead of creating a brand new account in our cloud platform every time customer data was migrated, it could update an existing account.

In doing so, it just re-ran the entire migration again (query, transform, publish) over the entire customer data set, focusing its efforts on identifying whether a transformed entity was new or existing and then performing the appropriate actions via the cloud APIs.

This was something of a nuclear approach (like our original strategy for dealing with database restores in the sync algorithm) and resulted in a hell of a lot of wasted effort. More importantly, it resulted in a huge amount of wasted time, as the system still had to iterate through thousands of entities only to decide that nothing needed to be done.

The reality is that customers don’t change the entire data set all the time. They make small changes consistently throughout the day, so as long as we can identify only those changes and act on them, we should be able to do an update in a much shorter amount of time.

So that’s exactly what we did.

Tactical Strike

Whoever it was that implemented row-level versioning in our legacy database, I should send them a gift basket or something, as it was one of the major contributing factors to the success of our data synchronization algorithm. With all of that delicious versioning information available to the migration process, the optimisation is remarkably simple.

Whenever we do a migration and it results in a set of transformed entities, we just store the versioning information about those entities.

When an update is triggered, we can query the database for only the rows with a version greater than the maximum version we dealt with last time, vastly decreasing the amount of raw information we have to process and, with it, the total amount of time taken.

The only complication is that, because each transformed entity might be built up from many legacy entities, the version we store must be an aggregate: specifically, the lowest version of the constituent entities.
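In rough Kotlin terms, the bookkeeping looks something like the sketch below. Every name here (LegacySource, DeltaMigration and friends) is an illustrative placeholder rather than our actual code, but the shape of the delta query and the aggregate version is the same.

```kotlin
// Illustrative sketch of the delta-based update. Names are placeholders.
data class LegacyRow(val id: String, val version: Long, val payload: String)
data class TransformedEntity(val id: String, val constituents: List<LegacyRow>)

interface LegacySource {
    // Returns only the rows changed since the given version (the delta).
    fun rowsNewerThan(version: Long): List<LegacyRow>
}

interface CloudPublisher {
    fun publish(entity: TransformedEntity)
}

class DeltaMigration(
    private val source: LegacySource,
    private val publisher: CloudPublisher,
    private var lastVersion: Long = 0
) {
    fun update() {
        val changed = source.rowsNewerThan(lastVersion)
        if (changed.isEmpty()) return // nothing changed, nothing to do

        val transformed = transform(changed)
        transformed.forEach(publisher::publish)

        // Each transformed entity aggregates many legacy rows, so its version
        // is the LOWEST of its constituents; the new high-water mark is the
        // highest aggregate we have fully processed.
        lastVersion = transformed.maxOf { entity ->
            entity.constituents.minOf { it.version }
        }
    }

    private fun transform(rows: List<LegacyRow>): List<TransformedEntity> =
        rows.groupBy { it.id }.map { (id, parts) -> TransformedEntity(id, parts) }
}
```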

With that change in place, rather than the execution time of every migration being directly dependent on the total size of the customer's data, it's now dependent on how much has changed since we last migrated. As a result, it's actually better to run the update process frequently so that it makes many small updates over the course of the day, reducing the overall latency of the process and generally giving a much better user experience.

Excellent Timing

Speaking of running many small updates over the course of a day.

With the biggest blocker to a simple timer out of the way (the time required to execute the migration), we could actually put a timer into place.

The migration API is written in Kotlin, using Spring IO and Spring Batch, so it was a relatively simple matter to implement an in-memory job that runs every minute, identifies the migrations that should be updated (by picking the last successfully completed ones) and then executes an update operation on each.

For simplicity we execute the job synchronously, so each migration update must finish before the next can start, and the entire job cannot be rescheduled for re-execution until the previous job finishes. Obviously that approach doesn’t scale at all (every additional migration being updated increases the latency of the others), but in a controlled environment where we only have a limited set of migrations, it's perfectly fine.
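A rough sketch of what that timer looks like, assuming Spring's scheduling support is enabled on the application (MigrationRepository and MigrationUpdater are stand-in names, not our real components):

```kotlin
import org.springframework.scheduling.annotation.Scheduled
import org.springframework.stereotype.Component

data class Migration(val id: String)

interface MigrationRepository {
    fun findLastSuccessfullyCompleted(): List<Migration>
}

interface MigrationUpdater {
    fun update(migration: Migration)
}

@Component
class MigrationUpdateJob(
    private val migrations: MigrationRepository,
    private val updater: MigrationUpdater
) {
    // fixedDelay is measured from the end of the previous run, so a slow pass
    // simply delays the next one rather than overlapping with it.
    @Scheduled(fixedDelay = 60_000L)
    fun run() {
        migrations.findLastSuccessfullyCompleted().forEach { migration ->
            // Updates run synchronously, one after the other; every extra
            // migration adds to the latency of the rest.
            updater.update(migration)
        }
    }
}
```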

The only other thing we had to do in order to ensure the timer job worked as expected was to lock down the migration API to only have a single instance. Again, something that we would never advise in production, but which is acceptable for a prototype. If we do end up using a timer in production, we’d probably have to leverage some sort of locking process to ensure that it only executes once.
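We haven't built that lock, but the general shape would be something like the sketch below: a single row in a shared table acting as a mutex, so that whichever instance grabs it runs the update and the rest wait (or skip their turn). The table name is hypothetical and the exact locking statement varies by database.

```kotlin
import javax.sql.DataSource

// Hypothetical: a single row in a migration_job_lock table acts as a mutex.
// Usage: withMigrationLock(dataSource) { runUpdateJob() }
fun <T> withMigrationLock(dataSource: DataSource, body: () -> T): T =
    dataSource.connection.use { conn ->
        conn.autoCommit = false
        // Holds a row lock until commit, so a second instance blocks here
        // (or a non-blocking variant could simply skip its run).
        conn.prepareStatement("SELECT id FROM migration_job_lock WHERE id = 1 FOR UPDATE")
            .executeQuery()
        try {
            body()
        } finally {
            conn.commit()
        }
    }
```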

Oops I Did It Again

We are highly unlikely to go with the delta approach if this project pushes ahead though.

It provides just enough functionality to demonstrate the concept to the anchor customer (and maybe a few additional validation customers), but it does not cater for at least two critical cases: entities being deleted in the legacy system, and the legacy database being restored to an earlier point in time.

It could be augmented to cater for those cases of course.

It's just software, and software is infinitely mutable, but all we would be doing is re-implementing the data synchronization algorithm, and it was hard enough to get right the first time. I don’t really want to write it all over again in a different system; we’ll just mess it up in new and unique ways.

Instead, we should leverage the existing algorithm, which is already really good at identifying the various events that can happen.

So the goal would be to implement some sort of event pipeline that contains a set of changes that have occurred (e.g. new entity, updated entity, deleted entity, database restored to X and so on), and then react to those events as appropriate from the migration side.
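One possible shape for that pipeline, sketched as a sealed event hierarchy in Kotlin. Nothing here exists yet; the event and handler names just mirror the cases described above.

```kotlin
// Hypothetical events emitted by the existing sync algorithm.
sealed class SyncEvent {
    data class EntityCreated(val entityId: String) : SyncEvent()
    data class EntityUpdated(val entityId: String) : SyncEvent()
    data class EntityDeleted(val entityId: String) : SyncEvent()
    data class DatabaseRestored(val restorePoint: Long) : SyncEvent()
}

// Hypothetical actions the migration side would take in response.
interface CloudActions {
    fun upsert(entityId: String)     // re-transform and publish the entity
    fun delete(entityId: String)     // remove it from the cloud account
    fun rewindTo(restorePoint: Long) // reconcile everything after a restore
}

class MigrationEventHandler(private val cloud: CloudActions) {
    fun handle(event: SyncEvent) = when (event) {
        is SyncEvent.EntityCreated -> cloud.upsert(event.entityId)
        is SyncEvent.EntityUpdated -> cloud.upsert(event.entityId)
        is SyncEvent.EntityDeleted -> cloud.delete(event.entityId)
        is SyncEvent.DatabaseRestored -> cloud.rewindTo(event.restorePoint)
    }
}
```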

Obviously it's not that simple in practice, but it is likely the direction that we will end up going if this all pans out.

Conclusion

What we’re left with right now is a prototype that allows for changes to customer data in their legacy system to be applied automatically to the cloud platform with a latency of under 5 minutes.

Of course, it has a bunch of conditions attached to it (every additional customer makes it slower for every other customer, it doesn’t handle deletes, it doesn’t handle database restores, it was not built using normal engineering practices), but it's enough to demonstrate the concept to a real person and start a useful conversation.

As is always the case with this sort of thing, there is a very real risk that this prototype might accidentally become production code, so it's something that we as developers must be eternally vigilant against.

That’s a hill I’m prepared to die on though.