Regularly writing blog posts is hard.
Recently, I’ve been focusing on monstrously long, multi-part posts explaining complex or involved processes, like:
- The ELB Logs Processor, which was broken into many parts because it was new and we had a bunch of issues developing/deploying it. Somewhat ironically, it was meant to be a quick and simple replacement for a system we already had in place.
- Regaining development and operational control over our log aggregation stack, which was broken down because of the multiple distinct pieces of work that we had to perform.
- An exploration of our data synchronization algorithm, which was broken down because it’s surprisingly complex.
I’m honestly not sure whether this pattern is good or bad in the greater scheme of things. Either I just happen to be tackling (and thus writing about) bigger problems in my day to day life, or I’ve simply run out of small things to write about, leaving only the big ones.
But anyway, enough musing, because this post fits squarely into the “bunch of connected posts about one topic” theme.
I was pretty happy with the state of our data synchronization algorithm the last time I wrote about it. We’d just finished putting together some optimizations that dramatically reduced the overall traffic while still maintaining the quality of the syncing process, and it felt pretty good. It’s been a little over a month since those changes were deployed, and everything has been going swimmingly. We’ve added a few new tables to the process, but the core algorithm hasn’t changed.
Normally this is the point where I would explain how it was all secretly going terribly wrong, but in a weird twist of fate, the algorithm is actually pretty solid.
We did find a bug that can cause the on-premises and remote locations to fall out of sync though, which was unfortunate. It happens infrequently and only affects a small subset of the data, but it still makes for an interesting topic to write about.
Well, interesting to me at least.
Optimizations Are Always Dangerous
The core of the bug lies in our recent optimizations.
In order to reduce the amount of busywork traffic occurring (i.e. the traffic resulting from the polling nature of the process), we implemented some changes that leverage local and remote table manifests to short-circuit the sync process if there was nothing to do. To further minimize the traffic to the API, we only queried the remote table manifest at the start of the run and then used that for the comparison against the local on the next run. Essentially we exchanged a guaranteed call on every non-skipped sync for one call each time the local and remote became identical.
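Here’s a minimal sketch of that short-circuit, in Python for brevity. It is not our actual implementation: the `Manifest` fields, the `TableSync` class and the two callbacks are all made up for illustration, and the real manifests carry whatever the real tables need.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Manifest:
    """Hypothetical table summary; the real manifest fields are an assumption here."""
    row_count: int
    max_version: int


class TableSync:
    """Skips a run when local matches the remote manifest captured on the last run."""

    def __init__(self,
                 fetch_remote_manifest: Callable[[], Manifest],
                 push_changes: Callable[[Manifest, Manifest], None]) -> None:
        self._fetch = fetch_remote_manifest   # assumed to cost one API request
        self._push = push_changes             # stand-in for the real sync work
        self._cached_remote: Optional[Manifest] = None

    def run(self, local: Manifest) -> str:
        # Short-circuit: no API traffic at all when nothing appears to have changed.
        if self._cached_remote is not None and local == self._cached_remote:
            return "skipped"
        # The remote manifest is queried once, at the START of the run,
        # and remembered for the comparison on the next run.
        remote = self._fetch()
        self._cached_remote = remote
        if local != remote:
            self._push(local, remote)
        return "ran"
```

Note that the run immediately after a real sync is not skipped, because the remembered remote still describes the pre-sync state. That run is the “one call each time the local and remote became identical”.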
The bug arises in the rare case where the captured remote from the last run is the same as the current local, even though the current remote is different.
The main way that this seems to happen is:
- Table with low rate of change gets a new row.
- Algorithm kicks in and syncs the new data.
- In the time between the run that pushed the data and the next one, user removes the data somehow.
- Current local now looks exactly like the remote did before the data was pushed.
- Sync algorithm thinks that it has nothing to do.
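The sequence above can be reproduced with a tiny in-memory simulation. This is a sketch under some loud assumptions: a set of row ids stands in for a real table manifest, the “sync” simply makes the remote mirror the local, and `run_sync` and `reproduce_desync` are hypothetical names.

```python
from typing import FrozenSet, Optional, Set, Tuple

Manifest = FrozenSet[int]  # hypothetical: the set of row ids stands in for a manifest


def run_sync(local: Set[int], remote: Set[int],
             cached_remote: Optional[Manifest]) -> Tuple[str, Optional[Manifest]]:
    """One polling run. Returns (action, manifest to remember for the next run)."""
    if cached_remote is not None and frozenset(local) == cached_remote:
        return "skipped", cached_remote
    captured = frozenset(remote)   # remote manifest captured BEFORE the push
    remote.clear()
    remote.update(local)           # stand-in for the real sync: remote mirrors local
    return "synced", captured


def reproduce_desync() -> Tuple[str, Set[int], Set[int]]:
    local, remote = {1, 2}, {1, 2}
    _, cached = run_sync(local, remote, None)     # warm-up run, remembers {1, 2}
    local.add(3)                                  # table gets a new row
    _, cached = run_sync(local, remote, cached)   # pushes row 3, remembers pre-push {1, 2}
    local.discard(3)                              # user removes the row before the next run
    action, _ = run_sync(local, remote, cached)   # local matches the stale manifest
    return action, local, remote


action, local, remote = reproduce_desync()
print(action, sorted(local), sorted(remote))  # skipped [1, 2] [1, 2, 3]
```

The final run is skipped even though the remote still holds row 3, which is exactly the desync described.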
In this case the algorithm is doing exactly what it was designed to do. It’s detected that there are no changes to deal with, and will continue to skip executions until something does change (new rows, updates, deletions, anything), at which point it will run again. If the table changes infrequently, we’re left with an annoying desync for much longer than we would like.
Like I said earlier, it’s a pretty specific situation, with relatively tight timings, and it only occurs for tables that are infrequently changed, but a bug is a bug, so we should fix it all the same.
The obvious solution is to requery the remote after the sync operation has finished execution and store that value for the comparison against the local next time, rather than relying on the value from before the operation started.
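As a rough sketch of that requery approach (again in Python, with a set of row ids standing in for the manifest and a hypothetical `FakeRemote` that counts manifest requests):

```python
from typing import FrozenSet, Optional, Set, Tuple


class FakeRemote:
    """In-memory stand-in for the remote API; counts manifest requests."""

    def __init__(self, rows: Set[int]) -> None:
        self.rows = set(rows)
        self.manifest_requests = 0

    def manifest(self) -> FrozenSet[int]:
        self.manifest_requests += 1
        return frozenset(self.rows)


def run_sync_with_requery(
        local: Set[int], remote: FakeRemote,
        cached_remote: Optional[FrozenSet[int]]) -> Tuple[str, Optional[FrozenSet[int]]]:
    """One polling run that remembers the remote manifest as it is AFTER the sync."""
    if cached_remote is not None and frozenset(local) == cached_remote:
        return "skipped", cached_remote
    remote.manifest()                    # the request the run already made at its start
    remote.rows = set(local)             # stand-in for the real sync work
    return "synced", remote.manifest()   # the extra request this fix introduces
```

With this version the add-then-remove sequence no longer goes unnoticed: after the push the remembered manifest includes the new row, so removing it locally makes the next comparison fail and the run goes ahead. The cost is visible in `manifest_requests`, which goes up by two for every non-skipped run instead of one.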
The downside of this is that it adds another request to every single non-skipped sync, which amounts to a relatively significant amount of traffic. We’re still way ahead of the situation before the optimizations, but maybe we can do better?
Another idea is to limit the maximum number of skips that can happen in a row, taking into account how long we might want the situation described above to persist.
This approach also raises the number of requests occurring, but has the happy side effect of picking up changes at the remote end as well (i.e. nothing has changed locally, but we deleted all the data remotely in order to force a resync or something).
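A minimal sketch of the skip limit might look like the following. `SkipLimiter` and `max_skips_for` are hypothetical names, and the 60 second polling interval in the usage line is an assumption picked purely for the example.

```python
class SkipLimiter:
    """Allows at most `max_consecutive_skips` skipped runs before forcing a real one."""

    def __init__(self, max_consecutive_skips: int) -> None:
        self.max_consecutive_skips = max_consecutive_skips
        self._skips = 0

    def should_skip(self, nothing_to_do: bool) -> bool:
        # Any run with real work, or one that hits the limit, resets the counter.
        if not nothing_to_do or self._skips >= self.max_consecutive_skips:
            self._skips = 0
            return False
        self._skips += 1
        return True


def max_skips_for(poll_interval_s: int, max_gap_s: int) -> int:
    """Largest skip limit that still forces a full run at least every max_gap_s seconds."""
    return max(0, max_gap_s // poll_interval_s - 1)


# Assumed numbers: polling every 60 seconds, forced run at least every 30 minutes.
limiter = SkipLimiter(max_skips_for(poll_interval_s=60, max_gap_s=30 * 60))
```

Because a forced run goes through the normal path and queries the remote, it also notices changes made only at the remote end, which the manifest comparison on its own never would.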
To compare the two possible fixes, I actually did some math to see which one would result in more requests, and with the maximum number of skips set to a value that forced a run every 30 minutes or so, they are pretty much a wash in terms of additional requests.
I’ve flip-flopped a fair bit on which solution I think we should apply, initially thinking the “limit maximum skips” approach was the best (because it essentially offers a sanity check to the concept of skipping runs), but from an engineering point of view, it just feels messy, like the sort of solution you come up with when you can’t come up with something better. Almost brute force in its approach.
I’m currently favouring amending the algorithm to query the remote after the operation executes because it feels the cleanest, but I’m not ecstatic about it either, as it feels like it’s doing more work than is strictly necessary.
As much as it saddens me to find bugs, it pleases me to know that with each bug fixed, the algorithm is becoming stronger, like tempering steel, or building muscle. Applying stress to something causes it to break down and then be repaired with improvements.
It can be tempting to just throw a fix in whenever you find a bug like that, but I believe that hack fixes should never be tolerated without a truly exceptional reason. You should always aim to make the code better as you touch it, not worse. The hardest part of fixing bugs is performing the repairs in a way that doesn’t compromise the design of the code.
Of course, if the design is actually the cause of the problem, then you’re in for a world of hurt.