
It's been two months since I posted about our foray into continuous data migration from our legacy product to our cloud platform. It's been mulling around in the back of my head ever since, and a few weeks ago we finally got a chance to go back and poke at it again. Our intent was to extend the original prototype and get it to a place where it could be demonstrated to a real client (using their real data).

Spending development effort to build an early, extremely rough prototype and then demonstrating it to users and gathering feedback as quickly as possible is a great way to avoid building the wrong thing. You can get early indicators about whether or not you're headed in the right direction without having to invest too much money, assuming of course you pick a representative example of your target audience.

When we finished building the first prototype, it quickly became apparent we couldn’t actually show it to a user. We could barely show it to ourselves.

I can just imagine the sort of awkward statement that such a demonstration would have started with:

And when you make a change in your legacy system like this, all you have to do is wait 45+ minutes before it's available over here in the new and shiny cloud platform! How great is that!

It's a pretty hard sell, so before we could even talk to anyone, we needed to do better.

It's All So Pointless

The first prototype extended our existing migration process so that, instead of creating a brand new account in our cloud platform every time customer data was migrated, it could update an existing account.

In doing so, it just re-ran the entire migration again (query, transform, publish) over the entire customer data set, focusing its efforts on identifying whether or not a transformed entity was new or existing and then performing the appropriate actions via the cloud APIs.
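To make that a bit more concrete, the publish step looked roughly like the sketch below. All of the names (LegacyRepository, Transformer, CloudApi, IdMappingStore) are illustrative stand-ins, not our actual types.

    // Illustrative stand-ins for the real migration components.
    data class CloudEntity(val legacyId: String /* plus the transformed fields */)

    interface LegacyRepository { fun loadAll(customerId: String): List<Any> }
    interface Transformer { fun transform(legacyEntities: List<Any>): List<CloudEntity> }
    interface CloudApi {
        fun create(entity: CloudEntity): String            // returns the id of the new cloud entity
        fun update(cloudId: String, entity: CloudEntity)
    }
    interface IdMappingStore {
        fun find(legacyId: String): String?                // cloud id from a previous run, if any
        fun store(legacyId: String, cloudId: String)
    }

    class FullMigrationUpdater(
        private val legacy: LegacyRepository,
        private val transformer: Transformer,
        private val cloudApi: CloudApi,
        private val mappings: IdMappingStore
    ) {
        fun update(customerId: String) {
            // Query and transform the ENTIRE customer data set, every single time.
            val transformed = transformer.transform(legacy.loadAll(customerId))

            // Publish: decide new vs existing for every entity, even though
            // the vast majority of them haven't changed since the last run.
            for (entity in transformed) {
                val existingCloudId = mappings.find(entity.legacyId)
                if (existingCloudId == null) {
                    mappings.store(entity.legacyId, cloudApi.create(entity))
                } else {
                    cloudApi.update(existingCloudId, entity)
                }
            }
        }
    }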

This was something of a nuclear approach (like our original strategy for dealing with database restores in the sync algorithm) and resulted in a hell of a lot of wasted effort. More importantly, it resulted in a huge amount of wasted time, as the system still had to iterate through thousands of entities only to decide that nothing needed to be done.

The reality is that customers don’t change the entire data set all the time. They make small changes consistently throughout the day, so as long as we can identify only those changes and act on them, we should be able to do an update in a much shorter amount of time.

So that’s exactly what we did.

Tactical Strike

Whoever it was that implemented row level versioning in our legacy database, I should send them a gift basket or something, as it was one of the major contributing factors to the success of our data synchronization algorithm. With all of that delicious versioning information available to the migration process, the optimisation is remarkably simple.

Whenever we do a migration and it results in a set of transformed entities, we just store the versioning information about those entities.

When an update is triggered, we can query the database for only the things with a version greater than the maximum version we dealt with last time, vastly decreasing the amount of raw information that we have to deal with, and with it the total amount of time taken.

The only complication is that because each transformed entity might be built up from many legacy entities, the version must be an aggregate, specifically the lowest version of the constituent entities.
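Roughly speaking (with entirely hypothetical names), each transformed entity gets an aggregate version, we remember the highest aggregate version we processed, and the next update only asks the legacy database for rows above that watermark:

    // Hypothetical sketch of the delta optimisation; none of these types are our real ones.
    data class LegacyRow(val id: String, val rowVersion: Long)

    // A transformed entity may be built from several legacy rows, so its version is an
    // aggregate: the LOWEST row version of its constituents.
    data class TransformedEntity(val constituents: List<LegacyRow>) {
        val aggregateVersion: Long = constituents.minOf { it.rowVersion }
    }

    interface VersionStore {
        fun lastProcessedVersion(customerId: String): Long     // 0 if we've never run an update
        fun save(customerId: String, version: Long)
    }

    interface LegacyRepository {
        // Only the rows that have changed since the given version, not the whole data set.
        fun loadChangedSince(customerId: String, version: Long): List<LegacyRow>
    }

    // Step 1: only pull the rows above the watermark from the last run.
    fun selectDelta(customerId: String, legacy: LegacyRepository, versions: VersionStore): List<LegacyRow> =
        legacy.loadChangedSince(customerId, versions.lastProcessedVersion(customerId))

    // Step 2: after transforming and publishing, remember how far we got for next time.
    fun recordProgress(customerId: String, processed: List<TransformedEntity>, versions: VersionStore) {
        val newWatermark = processed.maxOfOrNull { it.aggregateVersion } ?: return
        versions.save(customerId, newWatermark)
    }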

With that change in place, rather than the execution time of every migration being directly dependent on the total size of the customer's data, it's now dependent on how much has changed since we last migrated. As a result, it's actually better to run the update process frequently so that it makes many small updates over the course of the day, reducing the overall latency of the process and generally giving a much better user experience.

Excellent Timing

Speaking of running many small updates over the course of a day.

With the biggest blocker to a simple timer out of the way (the time required to execute the migration), we could actually put a timer into place.

The migration API is written in Kotlin, using Spring IO and Spring Batch, so it was a relatively simple matter to implement an in-memory job that runs every minute, identifies the migrations that should be updated (by picking the last successfully completed ones) and then executes an update operation on each.

For simplicity we execute the job synchronously, so each migration update must finish before the next can start, and the entire job cannot be rescheduled for re-execution until the previous job finishes. Obviously that approach doesn't scale at all (every additional migration being updated increases the latency of the others), but in a controlled environment where we only have a limited set of migrations, it's perfectly fine.
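As a rough approximation of the shape of that job, here's a sketch using a plain Spring @Scheduled method rather than our actual Spring Batch configuration. It assumes @EnableScheduling is switched on somewhere, and the MigrationUpdater interface is a made-up stand-in for whatever does the real work.

    import org.springframework.scheduling.annotation.Scheduled
    import org.springframework.stereotype.Component

    // Illustrative stand-in for whatever actually performs a single migration update.
    interface MigrationUpdater {
        fun findMigrationsToUpdate(): List<String>   // the last successfully completed migrations
        fun update(migrationId: String)
    }

    // Assumes @EnableScheduling is present on a configuration class somewhere.
    @Component
    class MigrationUpdateJob(private val updater: MigrationUpdater) {

        // fixedDelay (rather than fixedRate) means the next run isn't scheduled until the
        // previous one finishes, so the job never overlaps with itself.
        @Scheduled(fixedDelay = 60_000L)
        fun run() {
            for (migrationId in updater.findMigrationsToUpdate()) {
                // Synchronous and sequential: each update must finish before the next starts.
                // Fine for a handful of migrations, but it clearly doesn't scale.
                updater.update(migrationId)
            }
        }
    }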

The only other thing we had to do in order to ensure the timer job worked as expected was to lock down the migration API to only have a single instance. Again, something that we would never advise in production, but it is acceptable for a prototype. If we do end up using a timer in production, we'd probably have to leverage some sort of locking process to ensure that it only executes once.

Oops I Did It Again

We are highly unlikely to go with the delta approach if this project pushes ahead though.

It provides just enough functionality to demonstrate the concept to the anchor customer (and maybe a few additional validation customers), but it does not cater for at least two critical cases:

  • Deletions in the legacy system
  • Database restores

It could be augmented to cater for those cases of course.

It's just software, and software is infinitely mutable, but all we would be doing is re-implementing the data synchronization algorithm, and it was hard enough to get right the first time. I don't really want to write it all over again in a different system; we'll just mess it up in new and unique ways.

Instead, we should leverage the existing algorithm, which is already really good at identifying the various events that can happen.

So the goal would be to implement some sort of event pipeline that contains a set of changes that have occurred (i.e. new entity, updated entity, deleted entity, database restored to X and so on), and then react to those events as appropriate from the migration side.
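Purely speculatively, the events might look something like the sealed hierarchy below. None of these types exist yet, and the handlers are just placeholders.

    // Speculative sketch of the change events the existing sync algorithm could publish.
    sealed class SyncEvent {
        abstract val customerId: String

        data class EntityCreated(override val customerId: String, val table: String, val rowId: String) : SyncEvent()
        data class EntityUpdated(override val customerId: String, val table: String, val rowId: String) : SyncEvent()
        data class EntityDeleted(override val customerId: String, val table: String, val rowId: String) : SyncEvent()
        data class DatabaseRestored(override val customerId: String, val restoredToVersion: Long) : SyncEvent()
    }

    // Placeholder reactions - in reality these would drive the migration update process.
    fun createOrUpdateInCloud(customerId: String, table: String, rowId: String) { /* migrate the entity */ }
    fun deleteFromCloud(customerId: String, table: String, rowId: String) { /* remove the mapped cloud entity */ }
    fun rewindTo(customerId: String, version: Long) { /* re-run the migration from a known good point */ }

    // The migration side just reacts to whatever the sync side says happened.
    fun handle(event: SyncEvent) = when (event) {
        is SyncEvent.EntityCreated -> createOrUpdateInCloud(event.customerId, event.table, event.rowId)
        is SyncEvent.EntityUpdated -> createOrUpdateInCloud(event.customerId, event.table, event.rowId)
        is SyncEvent.EntityDeleted -> deleteFromCloud(event.customerId, event.table, event.rowId)
        is SyncEvent.DatabaseRestored -> rewindTo(event.customerId, event.restoredToVersion)
    }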

Obviously it's not that simple in practice, but it is likely the direction that we will end up going if this all pans out.

Conclusion

What we’re left with right now is a prototype that allows for changes to customer data in their legacy system to be applied automatically to the cloud platform with a latency of under 5 minutes.

Of course, it has a bunch of conditions attached to it (every additional customer makes it slower for every other customer, it doesn't handle deletes, it doesn't handle database restores, it was not built using normal engineering practices), but it's enough to demonstrate the concept to a real person and start a useful conversation.

As is always the case with this sort of thing, there is a very real risk that this prototype might accidentally become production code, so it's something that we as developers have to be eternally vigilant against.

That’s a hill I’m prepared to die on though.


As I've written about a bunch of times before, we have a data synchronization process for freeing customer data from the restrictive confines of their office. We built it to allow the user to get at their valuable information when they are on the road, as our users tend to be, and as something of a stopgap while we continue to develop and improve our cloud platform.

A fantastic side effect of having the data is that we can make it extremely easy for our customers to migrate from our legacy product to the new cloud platform, because we’ve already dealt with the step of getting their information into a place where we can easily access it. There are certainly challenges inherent in this approach, but anything we can do to make it easier to move from one system to another is going to make our customers happier when they do decide to switch.

And happy customers are the best customers.

As is always the case with software though, while the new cloud platform is pretty amazing, it's still competing with a piece of software that's been actively developed for almost 20 years, and not everyone wants to take the plunge just yet.

In the meantime, it would be hugely beneficial to let our customers use at least some part of the new system that we've been developing, both because we honestly believe it's a simpler and more focused experience, and because the more people that use the new system, the more feedback we get, which means a better product for everyone.

Without a link between the two systems though, it's all very pointless. Customers aren't going to want to maintain two sets of identical information, and rightly so.

But what if we did it for them?

Data…Remember Who You Were

We already have a decent automated migration process. It's not perfect, but it ticks a lot of boxes with regards to bringing across the core entities that our customers rely on day to day.

All we need to do is execute that migration process on a regular basis, following some sort of cadence and acceptable latency that we figure out with the help of some test users.

This is all very good in theory, but the current migration process results in a brand new cloud account every time it's executed on a customer's data, which is pretty unusable if we're trying to make our customers' lives easier. Ideally we want to re-execute the migration, but targeting an existing account. New entities get created, existing entities get updated and deleted entities get deleted.

It's basically another sync algorithm, except this time it's more complicated. The current on-premises to remote sync is a relatively simple table by table, row by row process, because both sides need to look identical at the end. This time though, transforms occur, entities are broken down (or consolidated) and generally there is no one-to-one relationship that's easy to follow.

On the upside, the same transformations are executed for both the initial migration and the update, so we can deal with the complexity somewhat by simply remembering the outcome of the previous process (actually all previous processes), and using that to identify what the destination was for all previously migrated entities.

What we're left with is an update process that can be executed on top of any previous migration, which will (see the sketch after the list):

  • Identify any previously migrated entities by their IDs, and perform appropriate update or replacement operations
  • Identify new entities and perform appropriate creation operations, storing the ID mapping (i.e. original to new) in a place where it can be retrieved later
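As a rough sketch (with hypothetical names throughout), the mapping we keep between runs and the create-or-update decision could look something like this:

    import java.time.Instant

    // Hypothetical shape of the mapping: one record per migrated entity, keyed by its
    // legacy identity, remembering where it ended up in the cloud.
    data class MigrationMapping(
        val customerId: String,
        val legacyType: String,
        val legacyId: String,
        val cloudId: String,
        val migratedAt: Instant
    )

    interface MigrationMappingStore {
        fun find(customerId: String, legacyType: String, legacyId: String): MigrationMapping?
        fun save(mapping: MigrationMapping)
    }

    // Illustrative stand-ins for the transformed entity and the cloud client.
    data class TransformedEntity(val legacyType: String, val legacyId: String /* plus transformed fields */)
    interface CloudApi {
        fun create(entity: TransformedEntity): String    // returns the new cloud id
        fun update(cloudId: String, entity: TransformedEntity)
    }

    // Mapping exists -> update the existing cloud entity; no mapping -> create one and remember it.
    fun publish(customerId: String, entity: TransformedEntity, store: MigrationMappingStore, cloud: CloudApi) {
        val existing = store.find(customerId, entity.legacyType, entity.legacyId)
        if (existing == null) {
            val cloudId = cloud.create(entity)
            store.save(MigrationMapping(customerId, entity.legacyType, entity.legacyId, cloudId, Instant.now()))
        } else {
            cloud.update(existing.cloudId, entity)
        }
    }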

But what about deletions?

Well, at this point in time we’re just prototyping, so we kind of just swept that under the rug.

I’m sure it will be easy.

I mean, look at how easy deletions were to handle in the original data synchronization algorithm.

It's All Just Too Much

Unfortunately, while the prototype I described above works perfectly fine to demonstrate that an account in the cloud system can be kept up to date with data from the legacy product, we can't give it to a customer yet, because every migration “update” is a full migration of all of the customer's data, not just the stuff that's changed.

Incredibly wasteful.

For our test data set, this isn't really a problem. A migration only takes a minute at most, usually less, and it's only really moving tens of entities around.

For a relatively average customer though, a migration takes something like 45 minutes and moves thousands to tens of thousands of entities. It's never really been a huge problem before, because we only did it once or twice while moving them to the new cloud platform, but it quickly becomes unmanageable if you want to do it in some sort of constant loop.

Additionally, we can't easily use a dumb timer to automate the process or we'll get into problems with re-entrancy, where two migrations for the same data are happening at the same time. Taking that into account though, even if we started the next migration the moment the previous one finished (in some sort of eternal chain), we're still looking at a total latency that is dependent on the size of the customer's complete data set, not just the stuff they are actively working with. Also, we'd be using resources pointlessly, which is a scaling nightmare.
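To be fair, the re-entrancy part on its own is not hard to guard against; something like the sketch below (hypothetical, not something we actually built) would stop two updates for the same customer from running at the same time. It does nothing for the latency or the wasted resources though.

    import java.util.concurrent.ConcurrentHashMap

    // Hypothetical guard against re-entrancy: skip an update if one is already
    // in flight for the same customer, rather than running two concurrently.
    class MigrationUpdateGuard {
        private val inFlight = ConcurrentHashMap.newKeySet<String>()

        fun runExclusively(customerId: String, update: () -> Unit) {
            if (!inFlight.add(customerId)) return   // an update for this customer is already running
            try {
                update()
            } finally {
                inFlight.remove(customerId)
            }
        }
    }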

We should only be acting on a delta, using only the things that have changed since the last “update”.

A solvable problem, but it does add to the overall complexity, as we can’t fall back on the row based versioning that we used for the core data synchronization algorithm because of the transforms, so we have to create some sort of aggregate version from the constituent elements and store that for later use.

It's Time To Move On

Even if we do limit our migration updates to only the things that have changed since last time, we still have a problem if we’re just blindly running the process on a timer (or eternal chain). If there are no changes, why do a bunch of work for nothing?

Instead, we should be able to react to the changes as the existing synchronization system detects them.

To be clear, the existing process is also timer based, so it's not perfect. It would be nice to not add another timer to the entire thing though.

With some effort we should be able to produce a stream of information from our current sync API that describes the changes being made. We can then consume that stream in order to optimise the migration update process, focusing only on those entities that have changed, and more importantly, only doing work when changes actually occur.
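Mechanically, that stream could be as simple as the speculative sketch below: the sync API publishes small change descriptions, and the migration side blocks waiting for them, so no work happens unless something has actually changed. All of the names here are made up.

    import java.util.concurrent.LinkedBlockingQueue

    // Speculative: a change description the sync API could emit for each detected change.
    data class ChangeDescription(val customerId: String, val table: String, val rowId: String, val version: Long)

    class ChangeStream {
        private val queue = LinkedBlockingQueue<ChangeDescription>()

        fun publish(change: ChangeDescription) { queue.put(change) }

        // Blocks until a change arrives, so the consumer is idle when nothing is happening.
        fun next(): ChangeDescription = queue.take()
    }

    // The migration side consumes the stream and only touches the entities that changed.
    fun consume(stream: ChangeStream, applyToCloud: (ChangeDescription) -> Unit) {
        while (true) {
            applyToCloud(stream.next())
        }
    }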

Conclusion

Ignoring the limitations for a moment, the good news is that the prototype was enough to prove to interested stakeholders that the approach has merit. At least enough merit to warrant a second prototype, under the assumption that the second prototype will alleviate some of the limitations and we can get it in front of some real customers for feedback.

I have no doubt that there will be some usability issues that come out during the user testing, but that's to be expected when we do something as lean as this. Whether or not those usability issues are enough to cause us to run away from the idea altogether is a different question, but we won't know until we try.

Even if the idea of using two relatively disparate systems on a day to day basis is unpalatable, we still get some nice things from being able to silently migrate our customers into our cloud platform and keep that information up to date.

I mean, what better way is there to demo a new piece of software to a potential customer than to do it with their own data?