The Ramifications of Time Travel

April 17, 2018

Posted in:
programming

The pieces of software that are easiest to manage are often the most isolated, mostly because of a combination of two things:

You control the entire world that the software cares about (maybe it’s all in a database, maybe the file system, maybe written into the fabric of the universe, but you control it all)
You don’t really have to care about any other parties in the relationship except for your users, and you control their world, so that’s easy

Unfortunately, the most isolated software is also often the least valuable, because it does not participate in a wider community.

How is this relevant to me?

Well, my team is responsible for a piece of software that was originally very well isolated. It used to be that everything the user cared about (all of their data) was encapsulated, nice and neat, in an SQL Server database. Connect to a different database, and the user can easily change their entire context.

This database-centric view naturally led to an excellent backup and restore process, and over time the users became familiar with using restores for a number of purposes, from resolving actual software bugs that caused data corruption to simply undoing actions that they didn’t mean to do.

All was well and the world was happy, but everything changed when the third party integrations attacked.

Back To The….Past?

From a software point of view, when you no longer control the entire universe, things get substantially more difficult. This is not new information, but from our point of view, the users’ much-loved (and frequently executed) backup and restore operations became something of a liability.

Integrating with an entirely separate system is challenging at the best of times, but when you add into the mix the capability for one of the participants in the relationship to arbitrarily regress to an earlier point in its timeline, it gets positively frustrating.

We’ve done a lot of integrations in the time that I’ve been working on the software in question, but a few stick out in my mind as good examples of the variety of ways we’ve had to deal with the issue:

A third party service that made use of the client data, to in turn create things that went back into the client data. This one wasn’t too bad, as the worst thing that could happen would be an entity coming back to the client based on data that was no longer relevant. In this case, we can discard the irrelevant information, usually confirming with the user first.
A data synchronization process that replicates client data to a remote location. The nice thing about this one is that the data is only flowing one way, so all we really needed to do was detect the database restore and react appropriately. The impact of the database restore on the overall complexity of the synchronization algorithm was relatively minimal.
A third party service that we integrated with directly (as opposed to integrating with us like in the first point). Third party entities are linked to local entities, but have separate lifetimes, so the biggest issue as a result of database restores was orphans. Unfortunately, this third party service was a payment provider, so an orphan entity still capable of taking real money is a pretty big problem. I’ll get into it in more detail later on in this post, but another thing we had to consider here was detecting and avoiding duplicate operations, as a result of the user repeating actions that had been undone by a database restore.

Of those three integrations, the third one is the most recent, and has proven to be the most challenging for a variety of reasons.

Localised Area Of Effect

As a result of the restore capabilities inherent in the product, we’ve had to repeatedly take certain measures in order to ensure that our users don’t get themselves into difficult situations.

Where possible, we try to store as much state as we can in the local database, so if a database restore occurs, at least everything goes back to an earlier time in unison. This works remarkably well as long as the source of truth is the local data source, so if a third party integration differs from the local, trust the local and remove the offending information. Essentially, we are extending the scope of the database restore to the third party, as much as you can anyway.
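As a rough sketch of what "trust the local and remove the offending information" can look like, assuming a remote that lets you list and delete entities by ID (the names reconcile_after_restore and delete_remote, and the in-memory remote_store, are made up for illustration and are not from our product):

```python
from typing import Callable, Iterable, List


def reconcile_after_restore(
    local_ids: Iterable[str],
    remote_ids: Iterable[str],
    delete_remote: Callable[[str], None],
) -> List[str]:
    """Remove remote entities that the restored local database no longer knows about."""
    known = set(local_ids)
    # Anything the remote holds that local does not is left over from the undone timeline
    removed = [rid for rid in remote_ids if rid not in known]
    for rid in removed:
        delete_remote(rid)
    return removed


# In-memory stand-in for the third party (an assumption for this example)
remote_store = {"a": 1, "b": 2, "c": 3}


def delete_remote(remote_id: str) -> None:
    del remote_store[remote_id]


# The restored local database only knows about "a" and "b"
removed = reconcile_after_restore(["a", "b"], list(remote_store), delete_remote)
```

In practice you would probably want to show the user the list of entities to be removed before actually calling the delete.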

Of course, just annihilating user data is not always possible (or desirable), even if that’s technically what they asked us to do by performing a restore.

Our second approach is to allow the data to live elsewhere, but force it to synchronize automatically to the local database as necessary. This technically splits the source of truth responsibility, so it can have unfortunate side effects, like synchronizing information that is no longer relevant to the restored state of the database, which isn’t great. A technical mechanism that can make this synchronization process easier is for each entity to have a unique, monotonically increasing numeric identifier, and to be able to query the entities by that identifier and its range (i.e. give me the next 100 > X). We’ve used this approach a few times now, and assuming the remote entities don’t change over time, it’s fairly effective.
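To make the "give me the next 100 > X" idea a bit more concrete, here is a minimal sketch. It assumes positive integer IDs and a remote query that returns entities in ascending ID order; sync_new_entities, fetch_after and the in-memory REMOTE list are hypothetical names for the example, not a real API:

```python
from typing import Callable, Dict, List


def sync_new_entities(
    local: Dict[int, dict],
    fetch_after: Callable[[int, int], List[dict]],
    batch_size: int = 100,
) -> int:
    """Copy remote entities whose ID is greater than the highest ID held locally."""
    copied = 0
    # The cursor comes from local state, so a restore simply rewinds it
    cursor = max(local, default=0)
    while True:
        batch = fetch_after(cursor, batch_size)  # "give me the next 100 > X"
        if not batch:
            return copied
        for entity in batch:
            local[entity["id"]] = entity
        copied += len(batch)
        cursor = batch[-1]["id"]


# In-memory stand-in for the remote side (an assumption for this example)
REMOTE = [{"id": i, "value": f"entity-{i}"} for i in range(1, 251)]


def fetch_after(after_id: int, limit: int) -> List[dict]:
    return [e for e in REMOTE if e["id"] > after_id][:limit]


# Local database as it looks just after a restore: only the first 40 entities
local_db = {i: REMOTE[i - 1] for i in range(1, 41)}
copied = sync_new_entities(local_db, fetch_after)
```

Because the cursor is just the highest ID in the local database, nothing special has to happen after a restore. The next synchronization picks up from wherever the restored data ends.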

As you can imagine, both of these things involve a significant amount of communication overhead, both technically and cognitively, and can result in some seriously confusing bugs and behaviour if it all goes pear shaped.

A Potent Concept

The approaches that I was just writing about are a bit more relevant when we are acting as our own third party (i.e. we’re writing both sides of the integration, which happens a fair bit).

Sometimes we have to integrate with services and systems provided by actual third parties.

In those cases, we have to get creative, or, unfortunately, put some of the responsibility on the users themselves. If we take orphan data as an example, when detected, we let the user know, and they are then required to deal with the problems and bring the system back into an acceptable state before they can move on. Obviously you have to be careful with how you identify the orphans, especially if you’re not the only integrator, but that is a solvable problem.
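Orphan detection itself is not complicated. A minimal sketch, assuming the third party lets you tag the entities you create and read back the local ID they were linked to (RemoteEntity, created_by and find_orphans are illustrative names, not any real provider's API):

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class RemoteEntity:
    remote_id: str
    local_ref: str   # ID of the local entity this was linked to
    created_by: str  # which integrator created it


def find_orphans(
    remote: Iterable[RemoteEntity],
    local_ids: Set[str],
    our_integrator: str,
) -> List[RemoteEntity]:
    """Return remote entities we created whose local counterpart no longer exists."""
    return [
        e for e in remote
        # Never touch entities that belong to another integrator
        if e.created_by == our_integrator and e.local_ref not in local_ids
    ]


remote_entities = [
    RemoteEntity("r1", "local-1", "us"),
    RemoteEntity("r2", "local-2", "us"),         # local-2 vanished in the restore
    RemoteEntity("r3", "local-9", "someone-else"),
]
orphans = find_orphans(remote_entities, {"local-1"}, "us")
```

The result is only a list. What happens next (showing it to the user, blocking further work until it is resolved) is a separate decision.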

If you’re lucky, the third party service will allow you to do things in an idempotent manner, such that even if you accidentally do the same operation twice (or more), the ramifications are controlled. This is particularly relevant when integrating with services that deal with actual money, because a double (or triple) transaction is pretty catastrophic from a reputation point of view.

Of course, in order for the command or operation to be idempotent, you need to have some sort of consistent way to identify and classify it. Usually this means an ID of some sort that doesn’t change.
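A common way for a service to offer this is an idempotency key supplied by the caller. The sketch below uses an in-memory stand-in (FakePaymentProvider and its charge method are invented for the example and do not represent any specific provider) to show the behaviour you are hoping for:

```python
from typing import Dict


class FakePaymentProvider:
    """In-memory stand-in for a provider that supports idempotency keys."""

    def __init__(self) -> None:
        self._results: Dict[str, dict] = {}
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A repeated key returns the original result instead of charging again
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {"charge_id": f"ch_{self.charges_made}", "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


provider = FakePaymentProvider()
first = provider.charge("invoice-1042", 5000)
retry = provider.charge("invoice-1042", 5000)  # same key, so no second charge
other = provider.charge("invoice-1043", 5000)  # different key, so a new charge
```

All of the protection comes from the key being the same on the second call.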

This gets challenging for us, because a database restore is a brutally efficient operation. Everything that was, ceases to exist, as if it never was to begin with. If the user then chooses to recreate some entity that was interacting with the third party service, there is no guarantee that its identity will be the same, even if it represents the same conceptual thing. For example, IDs generated at the database level might differ based on the order of operations, or might just be different altogether, as in the case when automatically generating GUIDs.

When this happens, we typically have to rely on the user to know what to do, which is a bit of a cop-out, but compromises have to be made sometimes.

Conclusion

Easily one of the most important lessons from supporting arbitrary database backups and restores is just how costly and complicated they can make integrations.

The older and more well established a piece of software gets, the more likely it will be to have revenue growth opportunities involving other parties, especially if it forms a community of sorts. I mean, it’s probably not cost effective to try and do everything yourself, no matter how amazing your organisation is.

Keep in mind, to the user, all of this should just work, because they don’t see the complexity inherent in the situation.

That means if you screw it up, the dissatisfaction level will be high, but if you get it right, at best they will be neutral.

To quote a great episode of Futurama:

When you do things right, people won’t be sure you’ve done anything at all