I’ve written a lot of words on this blog about the data synchronization algorithm. Probably too many, to be honest, but it’s an interesting technical topic for me, so the words come easily.

Not much has changed since the optimization to the differencing check that stopped it from wastefully scanning the entire table. It’s just been quietly chugging along, happily grabbing data whenever clients opt in, and generally being useful.

As we accumulate more data though, a flaw in the system is becoming more obvious.

Duplicates.

Once Uploaded, Data Lives Forever

I like to think that the data synchronization algorithm is really good at what it does.

Given a connection to a legacy database and some identifying information (i.e. the identity of the client), it will make sure that a copy of the data in that database exists remotely, and then, as the underlying database changes, ensure those changes are also present remotely.

The actual sync algorithm is provided to clients in the form of a plugin for a component that enables services for their entire office, like server side automated backups and integrations with (other) cloud services. All of this hangs on a centralised store of registered databases, which the client is responsible for maintaining. The whole underlying system was built before my time, and while it’s a little rough around the edges, it’s pretty good.

Unfortunately, it does have one major flaw.

When a client registers a database (generally done by supplying a connection string), that database is given a unique identifier.

If the client registers the same physical database again (maybe they moved servers, maybe they lost their settings due to a bug, maybe support personnel think that re-registering databases is like doing a computer reboot), they get a new database identifier.
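To make the flaw concrete, here is a minimal sketch of a registry that behaves the way the paragraphs above describe. It is not the real component: the `DatabaseRegistry` class, its `register` method, and the connection string are all hypothetical, invented purely for illustration.

```python
import uuid


class DatabaseRegistry:
    """Hypothetical stand-in for the centralised store of registered databases."""

    def __init__(self):
        # database_id -> (customer_id, connection_string)
        self.registrations = {}

    def register(self, customer_id, connection_string):
        # Every call mints a fresh identifier. Nothing checks whether the
        # same physical database has already been registered.
        database_id = str(uuid.uuid4())
        self.registrations[database_id] = (customer_id, connection_string)
        return database_id


registry = DatabaseRegistry()
first = registry.register("customer-1", "Server=office;Database=legacy")
second = registry.register("customer-1", "Server=office;Database=legacy")
# Same physical database, two different identifiers, two registrations.
```

Because the identifier is the only thing that distinguishes one data set from another for a given client, the second registration looks like a brand new database to everything downstream.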

For the sync process, this means that all of the data gets uploaded again, appearing as another (separate) database belonging to the same client. In most cases the old registration continues to exist, but stops being updated. Occasionally a client is actively uploading data from the same database under more than one registration, but that is pretty rare.

For most use cases this sort of thing is mostly just annoying, as the client will select the right database whenever they interact with whatever system is pulling from the data in the cloud (and we generally try to hide databases that look like they would have no value to the customer, like ones that haven’t been actively updated in the last 24 hours).

For business intelligence though, it means a huge amount of useless duplicate data, which has all sorts of negative effects on the generated metrics.

Well, Until We Delete It And Take Its Power

From an engineering point of view, we should fix the root flaw, ensuring that the same physical database is identified correctly whenever it participates in the system.

As always, reality tends to get in the way, and unpicking that particular beast is not a simple task. It’s not off the table completely, it’s just less palatable than it could be.

Even if the flaw is fixed though, the duplicate data that already exists is not going to magically up and disappear out of a respect for our engineering prowess. We’re going to have to deal with it anyway, so we might as well start there.

Algorithmically, a data set (customer_id-database_id pair) can be considered a duplicate of another data set if and only if:

  • The customer_id matches (we ignore duplicates across clients, for now anyway)
  • The data set contains at least 25 GUID identifiers that also appear in the other data set (each entity in the database generally has both a numerical and GUID identifier, so we just use the most common entity)

Nothing particularly complicated or fancy.
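The rule above can be sketched in a few lines. This is an illustration in Python rather than the C# the actual tool is written in, and the shape of a data set (a dict with `customer_id`, `database_id` and a set of `guids`) is an assumption made for the example, not the real schema.

```python
from itertools import combinations

MIN_SHARED_GUIDS = 25  # threshold taken from the rule above


def is_duplicate(a, b, min_shared=MIN_SHARED_GUIDS):
    """a and b are hypothetical dicts: customer_id, database_id, guids (a set)."""
    if a["customer_id"] != b["customer_id"]:
        return False  # duplicates across clients are ignored
    return len(a["guids"] & b["guids"]) >= min_shared


def find_duplicate_pairs(data_sets):
    """Return (database_id, database_id) pairs that look like duplicates."""
    return [
        (a["database_id"], b["database_id"])
        for a, b in combinations(data_sets, 2)
        if is_duplicate(a, b)
    ]


shared = {f"guid-{i}" for i in range(30)}
data_sets = [
    {"customer_id": "c1", "database_id": "db-old", "guids": shared},
    {"customer_id": "c1", "database_id": "db-new", "guids": shared | {"guid-extra"}},
    {"customer_id": "c2", "database_id": "db-other", "guids": shared},
]
pairs = find_duplicate_pairs(data_sets)
```

Here only `db-old` and `db-new` are paired, because `db-other` belongs to a different client. Comparing every pair in memory is fine for a sketch; against a real database the overlap check would more likely be pushed into a query.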

For the automated process itself, there are a few things worth considering:

  • It needs to communicate clearly what it did and why
  • There is value in separating the analysis of the data sets from the actions that were performed (and their results)
  • We’ll be using TeamCity for actually scheduling and running the process, so we can store a full history of what the process has done over time
  • To minimise risk, it’s useful to be able to tell the process “identify all the duplicates, but only delete the first X”, just in case it tries to delete hundreds and causes terrible performance problems

Taking all of the above into account, we created a simple C# command line application that could be run like this:

Cleanup.exe –c {connection-string} –a {path-to-analysis-file} –d {path-to-deletion-results-file} –limitDeletionCount {number-to-delete}
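As a rough sketch of how the analysis, the deletion results and the deletion limit fit together (again in Python, not the real C# code), something like the following would do. The function names, the JSON file layout and the `delete` callback are all assumptions made for illustration.

```python
import json


def plan_deletions(duplicate_ids, limit=None):
    """Split identified duplicates into those to delete now and those deferred."""
    if limit is None:
        return list(duplicate_ids), []
    return list(duplicate_ids[:limit]), list(duplicate_ids[limit:])


def run_cleanup(duplicate_ids, delete, analysis_path, results_path, limit=None):
    to_delete, deferred = plan_deletions(duplicate_ids, limit)

    # The analysis is written before anything is deleted, so there is a
    # record of what was identified even if a deletion fails part way.
    with open(analysis_path, "w") as f:
        json.dump({"identified": list(duplicate_ids), "deferred": deferred}, f, indent=2)

    results = []
    for database_id in to_delete:
        try:
            delete(database_id)  # hypothetical callback that removes one data set
            results.append({"database_id": database_id, "deleted": True})
        except Exception as error:
            results.append(
                {"database_id": database_id, "deleted": False, "error": str(error)}
            )

    # What was actually done is recorded separately from the analysis.
    with open(results_path, "w") as f:
        json.dump(results, f, indent=2)
    return results
```

Keeping the two files separate means the analysis can be reviewed on its own, and anything over the limit simply shows up as deferred until the next run.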

Like everything we do, it gets built, tested, packaged (versioned), and uploaded to our private Nuget feed. For execution, there is a daily task in TeamCity to download the latest package and run it against our production database.

It’s Like A Less Impressive Quickening

The last thing to do is make sure that we don’t ever delete any data that might still have value, to either us or the client.

As I mentioned above, the main reason that duplicates happen is when a client re-registers the same database for some reason. Upon re-registration, the “new” database will begin its data synchronization from scratch.

During the period of time where data is still uploading for the “new” database, but all the “old” data is still hanging around, how can we reasonably say which data set is the duplicate and should be deleted?

If we go off number of records, we’d almost certainly delete the “new” database mistakenly, which would just start it uploading again from scratch, and we’d get into an infinite stupidity loop.

We need some sort of indication that the data is “recent”, but we can’t use the timestamps on the data itself, because they are just copies from the local database, and the oldest data uploads first.

Instead we need to use timing information from when the data set last participated in the sync process, i.e. a recency indicator.

A small modification of the tool later, and its execution looks like this:

Cleanup.exe –c {connection-string} –a {path-to-analysis-file} –d {path-to-deletion-results-file} –limitDeletionCount {number-to-delete} –recencyLimitDays {dont-delete-if-touched-this-recently}

We currently use 7 days as our recency limit, but once we’re more comfortable with the process, we’ll probably tune it down to 1 or 2 days (just to get rid of the data as soon as we can).
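A minimal sketch of the recency check follows, in Python rather than the tool’s C#. The `last_synced_at` field and both function names are hypothetical, and always keeping the most recently synced copy in a group is my assumption for the example, not something the tool is confirmed to do.

```python
from datetime import datetime, timedelta, timezone


def safe_to_delete(last_synced_at, now, recency_limit_days=7):
    """True only if the data set has not participated in a sync recently."""
    return now - last_synced_at > timedelta(days=recency_limit_days)


def choose_deletions(duplicate_group, now, recency_limit_days=7):
    """duplicate_group: hypothetical dicts with database_id and last_synced_at."""
    newest_first = sorted(
        duplicate_group, key=lambda ds: ds["last_synced_at"], reverse=True
    )
    # Assumption: the most recently synced copy is always kept.
    candidates = newest_first[1:]
    return [
        ds["database_id"]
        for ds in candidates
        if safe_to_delete(ds["last_synced_at"], now, recency_limit_days)
    ]


now = datetime(2017, 1, 15, tzinfo=timezone.utc)  # arbitrary fixed "now" for the example
group = [
    {"database_id": "db-old", "last_synced_at": now - timedelta(days=30)},
    {"database_id": "db-new", "last_synced_at": now - timedelta(hours=1)},
    {"database_id": "db-recent", "last_synced_at": now - timedelta(days=3)},
]
stale = choose_deletions(group, now)
```

With the 7 day limit only `db-old` is selected. `db-recent` was touched 3 days ago, so it survives until the limit is tuned down or it goes quiet for long enough.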

Conclusion

To be honest, we’ve known about the duplicate data flaw for a while now, but as I mentioned earlier, it didn’t really affect customers all that much. We’d already put some systems in place to allow customers to only select recently synced databases, so from their point of view, there might be a period where they could see multiple databases, but that would usually go away relatively quickly.

It wasn’t until we noticed the duplicates seeping into our metrics (which we use to make business decisions!) that we realised we really needed to do something about them, thus the automated cleanup.

A nice side effect of this is that when we did the duplicate analysis, we realised that something like 30% of the entire database was worthless duplicate data, so there might actually be significant performance gains once we get rid of it all, which is always nice.

To be honest, we probably should have just fixed the flaw as soon as we noticed it, but it’s in a component that is not well tested or understood, so there was a significant amount of risk in doing so.

Of course, as is always the case when you make that sort of decision, now we’re paying a different price altogether.

And who can really say which one is more expensive in the end?