
And we’re back

It’s time to ring in the new year with some more chatter about our data synchronization algorithm, which, honestly, has turned into a pretty complex beast.

The last time I wrote about the algorithm, I explained how we flipped the original implementation of the differencing scan from bottom up to top down. This change represented a hearty improvement over the original implementation, because the data at the “top” of the table (i.e. when ordered by Row Version descending) is far more likely to change than the data at the “bottom”, at least for our clients anyway.

In that post though, I pointed out an inefficiency in the differencing scan, regardless of the direction:

There is still an inherent inefficiency in the design of the scan, such that it will continue to do extra work for no benefit. Consider the situation where the scan has started and has dealt with 4 chunks. It’s a bit down in the table, and the change it originally wanted to find was in chunk 5. In the meantime, the user has deleted data in chunk 2. The process will have to finish its scan of the entire table before it returns to the top and finds that particular change, which is just wasteful. Instead, it should be able to determine if there are any differences in the remainder of the data set (i.e. from the bottom of the current chunk down) and then use that information to decide to reset its position to the top again. We haven’t actually done this bit yet, because it came up after we deployed the first version to our users.

That inherent inefficiency is the topic for today.

Wait, Come Back, You’re Going In The Wrong Direction!

As mentioned above, the problem lies in the fact that once a difference is detected (i.e. one or more missing rows locally or remotely), the scan will just keep going until it finds and fixes said difference (or it hits the “end” of its scan and resets to the “top”). Of course, time does not stop while the scan is happening, and because it happens over a period of hours, with many relatively isolated executions of the algorithm, other changes are still being made to the underlying data.

Additions and modifications are handled by an entirely different process, so we don’t have to worry about them, but hard deletions can still occur at will.

When such a deletion occurs in the section that has already been scanned, but before the original difference that triggered the scan has been located and fixed, the algorithm can get stuck scanning the remainder of the data set for no real benefit. This actually isn’t a huge issue when the data set is small, or even medium sized, as you can just eat the inefficiency and wait. Eventually it will all work itself out.

Unfortunately, as the data set gets larger, the chances of a deletion occurring in the scanned section increase. Also, since changes are more likely to occur at the “top” of the data set, and the differencing scan works top-down, the sections that were most recently scanned are actually the most likely to contain differences. As a result, the algorithm spends longer and longer doing work for no benefit, so the inefficiency gets worse and worse.

Each chunk comparison requires the data to be present in memory in the remote database as well (it’s not an index-only query), so every pointless comparison actually results in more reads, which means more IOPS, which is why we started optimizing in the first place!

Long story short, we have to do something about it.

4665, 4666, 4667! There Are 4667 Rows Both Locally And Remotely, Ah Ah Ah

The good news is that this particular problem is not overly difficult to solve. It will still take some effort, but it’s not some fundamental flaw in the algorithm.

Every time a chunk is completed as part of the differencing scan, we can use a simple count to see whether or not the remaining differences (if there are any) are above or below the current location.

Locally this is trivial: just a query to the DB for a count of rows with a row version less than the current range end.

Getting the same information from the remote requires the introduction of a new endpoint to the API though:

/v1/customers/{customerId}/databases/{databaseId}/tables/{tableName}/count{?aboveRowVersion=123&belowRowVersion=456}

Slightly generalised from our specific use case, but it basically lets you get a count of records:

  • above a certain row version
  • below a certain row version
  • in between two boundary row versions
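
To make the parameter combinations concrete, here is a rough sketch of how a client might build those requests. The helper class and method are hypothetical stand-ins invented for this post, not our actual client code; only the URI template above comes from the real API.

// Illustrative only: a helper for building requests against the count endpoint above.
using System.Collections.Generic;

public static class CountRequests
{
    public static string Build(string customerId, string databaseId, string tableName,
                               long? aboveRowVersion = null, long? belowRowVersion = null)
    {
        var uri = $"/v1/customers/{customerId}/databases/{databaseId}/tables/{tableName}/count";

        var query = new List<string>();
        if (aboveRowVersion.HasValue) query.Add($"aboveRowVersion={aboveRowVersion.Value}");
        if (belowRowVersion.HasValue) query.Add($"belowRowVersion={belowRowVersion.Value}");

        return query.Count == 0 ? uri : uri + "?" + string.Join("&", query);
    }
}

// Above a row version:     CountRequests.Build(c, d, t, aboveRowVersion: 123)
// Below a row version:     CountRequests.Build(c, d, t, belowRowVersion: 456)
// Between two boundaries:  CountRequests.Build(c, d, t, aboveRowVersion: 123, belowRowVersion: 456)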

For the differencing scan, we only really care about the “below a certain row version” use case, to get a count and compare it to the same count from the local data.

If the count is the same, we can safely exit early and flip back to the top of the data set, resetting the differencing scan so that it can pick up the more recent changes.

If the count is different, we just keep going down and repeat the process once we’ve actioned the next chunk.

Nice and easy.
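
Put together, the whole check is only a handful of lines. Here’s a minimal sketch of the idea; the interfaces are hypothetical stand-ins (one counts local rows below the current position, the other would call the new count endpoint with belowRowVersion set), not our actual components.

public interface ILocalRowCounter  { long CountBelow(long rowVersion); }
public interface IRemoteRowCounter { long CountBelow(long rowVersion); }

public static class EarlyExitCheck
{
    // Runs after each chunk has been actioned. Returns true when the remainder of the
    // data set (below the current chunk) already matches, meaning the scan can reset
    // to the top and pick up the more recent changes instead of grinding on downwards.
    public static bool RemainderIsInSync(ILocalRowCounter local, IRemoteRowCounter remote,
                                         long currentLowerBoundary)
    {
        var localCount = local.CountBelow(currentLowerBoundary);
        var remoteCount = remote.CountBelow(currentLowerBoundary);

        return localCount == remoteCount;   // equal counts: nothing left to find down there
    }
}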

Of course, there are still some edge cases. It’s possible (though not likely) to get into a situation where the counts are the same but the data is actually different (a particular combination of locally deleted data and data that was never uploaded successfully), which can throw the whole thing for a bit of a loop, but that could have happened regardless, so we’re still in a better place than we would otherwise be.

Conclusion

It’s been a bit of a hard slog trying to optimise the data synchronization algorithm, and we’re still not really there. Not only that, the algorithm itself has become more and more complex over time (obviously), and is getting pretty hard to reason about.

Annoyingly enough, we haven’t run across that one magical improvement that changes everything. It’s very much been a kind of “death by a thousand cuts” sort of thing, with tens of small optimizations that alleviate the issue slightly. The improvement in this post is a good example of that sort of thing, and pretty much boils down to “don’t do something you don’t have to do”, which isn’t exactly ground-breaking.

Don’t get me wrong, the process is much better than it’s ever been, but we’re still seeing very similar patterns in read IOPS, which is problematic.

It might be that the algorithm itself just doesn’t scale as well as we want it to, and that we might need to flip to a fundamentally different approach. Perhaps something that hooks deeper into the customer data and notifies us of creations/updates/deletions as they occur, rather than us having to poll for the same information.

Still, that sort of change is not something to be embarked on lightly, even disregarding the fallacy of sunk cost.


It’s been a little while since I wrote about our data synchronization algorithm, but it still gets a fair bit of space in my mind on a day to day basis.

Most recently, we put some effort into optimizing it in the face of rising IOPS usage at the database level, which worked well enough to stabilize the system in the face of what we wanted it to do at the time.

Then we told it to sync more data.

Specifically, one of the two biggest data sets in our software, which covers the storage of repeatable tasks, like appointments, phone calls, inspections and so on.

Not only is this table one of the largest in raw size, it also features the most churn AND is one of the only tables to feature hard deletes, a challenge that we had to build an entire secondary sync process for.

Funnily enough, it actually worked pretty well at first, but everything started to fray at the edges as we started syncing more and more clients. Same sort of thing as before, pressure on IOPS at the database level, mostly as a result of reads.

With the memory of the last round of optimizations fresh in our minds, it was time to enter the breach once more.

Will To Survive

We prodded at the system for a bit, analysing the traffic patterns to try and determine what might be causing the high read IOPS this time. Being that we’d already investigated similar sorts of things recently, we knew our way around it pretty well, so it wasn’t long before we noticed something suspicious.

Let’s go back a step though.

The core data synchronization algorithm hasn’t changed all that much from the last time I wrote about it, even in the face of the optimizations.

Get Local Count/Version/Timestamp
Get Remote Count/Version/Timestamp
If Local Count/Version/Timestamp == Remote Count/Version/Timestamp
    Do Nothing and Exit
If Local Version/Timestamp == Remote Version/Timestamp BUT Local Count != Remote Count
    Calculate [BATCH SIZE] Using Historical Data
    Get Last Local Position
    Get Next [BATCH SIZE] Local Rows from last position
    Get Min & Max Version in Batch
    Query Remote for Manifest Between Min/Max Local Version
    Create Manifest from Local Batch
    Compare
        Find Remote Not in Local
            Delete from Remote
        Find Local Not in Remote
            Upload to Remote
If Local Version/Timestamp > Remote Version/Timestamp
    Calculate [BATCH SIZE] Using Historical Data
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload to Remote
        Record Result for [BATCH SIZE] Tuning
        If Failure & Minimum [BATCH SIZE], Skip Ahead
If Local Version/Timestamp < Remote Version/Timestamp
    Find Remote > Local Version
    Delete from Remote

We’ve tweaked the mechanisms by which each evaluation of the algorithm determines its local and remote versions (there’s some caching on both the server and client side now, and each request to put/remove data returns new meta information from the server for efficiency) but that’s about it.

This time when we analysed the traffic, we noticed that there were a lot of calls to get table chunk manifests, an important part of the diff checker component of the algorithm, which is primarily meant to remove rows that have been deleted locally from the remote (i.e. hard deletes rather than the much easier to handle soft deletes, which are really just updates).

The problem with there being so many manifest calls is that they are quite expensive, especially for large tables.

Each manifest call requires a database query that can only use the first two components of the clustered index (which is client segmentation data), but must then scan through the remaining data in order to get the “chunk” needed for analysis. This can mean a scan of millions of rows, which is quite intense on read IOPS, because all of that data needs to be brought into memory in order to be evaluated.
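
To make that concrete, the query behind each manifest call looks roughly like the sketch below. The table and column names are assumptions for illustration (the real schema differs); the point is that only the leading client segmentation columns are seekable, while the row version range is not, so the remaining rows have to be read to filter them.

// Illustrative only: roughly the shape of a manifest query, with assumed names.
using System;

public sealed record ManifestEntry(Guid Id, long RowVersion);

public static class ManifestQuery
{
    public const string Sql = @"
        SELECT id, row_version
        FROM   tasks
        WHERE  customer_id = @customerId   -- seekable: leading clustered index column
        AND    database_id = @databaseId   -- seekable: second clustered index column
        AND    row_version BETWEEN @minVersion AND @maxVersion   -- not covered by the index
        ORDER  BY row_version DESC;";
}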

But why was the number of manifest calls a problem now?

Sneaky Corporation

It all ties back into the data that we were syncing this time. As I mentioned at the beginning of this article, not only is it a huge table, it also features hard-deletes as the norm, rather than as exceptions. These few things added together create a bit of a perfect storm with regards to diff checker activity, which is why the system was starting to strain as a result of the massive increase in manifest calls.

Conceptually the diff checker part of the algorithm is a batched scan. It starts on one end of the data set, and gets chunks of data sequentially from both locations (local and remote), analysing and reacting as appropriate until it either gets back in sync or it reaches the other end, when it then wraps back around to its start point again.

Here’s where we made a mistake that seems obvious in hindsight.

In almost all cases, changes (even hard deletes) are more likely to occur at the top end of the data set, rather than the bottom.

Our scan?

Starts at the bottom and works its way up.

Thus, any time something was deleted from close to the top of the data set, the diff checker would have to scan through the entire table before finding and fixing it. This means more reads (to get the manifest chunks) which means higher IOPS.

This One Goes To Eleven

It really didn’t take a huge amount of effort to flip the diff checker algorithm to go top down instead of bottom up.

In fact, because of the way in which we’d built the code in the first place (using the strategy pattern to implement the actual chunking algorithm) it was as simple as writing a new class and binding it up instead of the old one via our dependency injection container.

The top down logic is basically the same as it is for bottom up, just inverted.

Starting at the lower boundary of the last chunk (or the very top of the table if the state was reset), get the next [BATCH SIZE] chunk of rows from the local in a downwards direction. Using that chunk, find the extremities (top and bottom) and ask the remote for an equivalent chunk. Contrast and compare, react accordingly and then remember where you were up to for next time. If the algorithm reaches a point where the local and remote are identical, reset all state and do nothing.
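
Because the chunking behaviour sits behind the strategy pattern, flipping direction really is just another class. The sketch below is a hypothetical rendering of the top-down logic described above; every interface and name in it is a stand-in invented for this post, not our actual code.

using System;
using System.Collections.Generic;
using System.Linq;

public sealed record Row(Guid Id, long RowVersion);
public sealed record Manifest(IReadOnlyList<Row> Rows);

public interface ILocalStore
{
    // Next [BATCH SIZE] rows strictly below the given row version, ordered descending.
    IReadOnlyList<Row> GetBatchBelow(long rowVersion, int batchSize);
}

public interface IRemoteApi
{
    Manifest GetManifest(long minRowVersion, long maxRowVersion);
    void Delete(IEnumerable<Guid> ids);
    void Upload(IEnumerable<Row> rows);
}

public interface IScanState
{
    long? LastLowerBoundary { get; set; }   // null means start from the very top
    void Reset();
}

public class TopDownChunkingStrategy
{
    public void Execute(ILocalStore local, IRemoteApi remote, IScanState state, int batchSize)
    {
        // Start at the lower boundary of the last chunk, or the top of the table if reset.
        var startBelow = state.LastLowerBoundary ?? long.MaxValue;

        var batch = local.GetBatchBelow(startBelow, batchSize);
        if (batch.Count == 0)
        {
            // Nothing below the last position: wrap back around to the top next time.
            state.Reset();
            return;
        }

        // Find the extremities of the local chunk and ask the remote for the equivalent chunk.
        var max = batch[0].RowVersion;
        var min = batch[batch.Count - 1].RowVersion;
        var remoteManifest = remote.GetManifest(min, max);

        // Compare and react: remote-only rows are deleted, local-only rows are uploaded.
        var localIds = new HashSet<Guid>(batch.Select(r => r.Id));
        var remoteIds = new HashSet<Guid>(remoteManifest.Rows.Select(r => r.Id));

        remote.Delete(remoteIds.Except(localIds));
        remote.Upload(batch.Where(r => !remoteIds.Contains(r.Id)));

        // Remember where we got up to for next time.
        state.LastLowerBoundary = min;
    }
}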

Nothing super fancy, and an extremely obvious performance optimization in retrospect considering how much more likely changes are to occur at the top end rather than the bottom.

Such Strange Things

There are two tricksy bits though, one specific to the top-down approach and the other that was present in the bottom-up approach, but is more obvious when working top down.

  1. When you get the local chunk, and you get less than the batch size, you know that you’re probably at the bottom of the table. If you follow the same rules for asking the remote at this point, you risk leaving data in the remote that is not in the local if it’s right at the bottom. Instead, when you reach this point, you have to infer a lower boundary of 0 to make sure you get everything (there’s a small sketch of this just after the list).
  2. There is still an inherent inefficiency in the design of the scan, such that it will continue to do extra work for no benefit. Consider the situation where the scan has started and has dealt with 4 chunks. It’s a bit down in the table, and the change it originally wanted to find was in chunk 5. In the meantime, the user has deleted data in chunk 2. The process will have to finish its scan of the entire table before it returns to the top and finds that particular change, which is just wasteful. Instead, it should be able to determine if there are any differences in the remainder of the data set (i.e. from the bottom of the current chunk down) and then use that information to decide to reset its position to the top again. We haven’t actually done this bit yet, because it came up after we deployed the first version to our users.
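
For the first point, the guard is tiny. A hypothetical sketch follows (the method name and parameters are invented for illustration):

using System.Collections.Generic;

public static class LowerBoundaryInference
{
    // When the local batch comes back short, we are probably at the bottom of the table,
    // so the remote chunk request has to be anchored at row version 0 rather than at the
    // smallest row version we happened to fetch; otherwise remote-only rows below that
    // point would never be compared (and never be cleaned up).
    public static long DetermineRemoteLowerBoundary(IReadOnlyList<long> localBatchRowVersions, int batchSize)
    {
        if (localBatchRowVersions.Count < batchSize)
        {
            return 0;
        }

        // A full batch means the lowest row version in it is a safe boundary.
        return localBatchRowVersions[localBatchRowVersions.Count - 1];
    }
}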

The first issue is a bit of a deal breaker (because it means the process can result in an incomplete data synchronization) so we’re lucky that we noticed it during testing and fixed it right away.

The second issue is not as bad because the process still does what it is supposed to do, just slower. Not great from an optimization point of view, so we’ll still fix it, but not exactly a critical issue.

Conclusion

Like I said a little bit earlier, doing the diff check scan from top to bottom is incredibly obvious in retrospect. All the changes are at the top, why would we start at the bottom of a potentially huge data set and waste time and effort analysing a bunch of data that is not likely to be relevant? Of course, that wasn’t actually obvious until we did some analysis on top of other optimizations, but it really does feel like a bit of a boneheaded mistake.

The good news is that when we implemented the changes as described (algorithm from bottom up to top down) we saw a dramatic decrease in traffic, because the diff checker was running fewer times before it found and fixed the changes.

The bad news is that due to point number two above (the diff checker doesn’t know that it should start from the top again when there are more recent changes) the IOPS usage didn’t really change all that much at all.

Still, the algorithm continues to get better and better and we continue to be able to synchronize more and more data without having to pay for additional infrastructure to support it.

Those are some good engineering feels.


Last week I described the re-occurrence of an issue from late last year with our data synchronization process.

Basically, read IOPS on the underlying database were much higher than expected, causing issues with performance in AWS when the database volume ran out of burst balance (or IO credits as they are sometimes called).

After identifying that the most recent problem had been caused by the addition of tables, we disabled those tables, restoring performance and stability.

Obviously we couldn’t just leave them turned off though, and we couldn’t throw money at the problem this time, like we did last time. It was time to dig in and fix the problem properly.

But first we’d need to reproduce it.

But The Read Performance Issue In My Store

The good thing about ensuring that your environment is codified is that you can always spin up another one that looks just like the existing one.

Well, theoretically anyway.

In the case of the sync API environment everything was fine. One script execution later and we had a new environment called “sync performance” with the same number of identically sized API instances running the exact same code as production.

The database was a little trickier unfortunately.

You see, the database environment was from the time before I improved our environment deployment process. This meant that it was easy to make one, but hard to update an existing one.

Unfortunately, it was hard enough to update an existing one that the most efficient course of action had been to simply update the live one each time we had to tweak it, so we had diverged from the source code.

First step? Get those two back in sync.

Second step, spin up a new one that looks just like prod, which needed to include a copy of the prod data. Luckily, RDS makes that easy with snapshots.

With a fully functional, prod-like data synchronization environment up and running, all we needed was traffic.

Good thing Gor exists. We still had a deployable Gor component from last time I wanted to replicate traffic, so all we needed to do was make a new Octopus project, configure it appropriately and deploy it to our production API.

Now we had two (mostly) identical environments processing the same traffic, behaving pretty much the same. Because we’d turned off multiple tables in order to stop the high read IOPS, it was a simple matter to turn one back on, causing a less severe version of the issue to reoccur in both environments (higher than normal read IOPS, but not enough to eat burst balance).

With that in place we were free to investigate and experiment.

Is Still Abhorred

I’m going to cut ahead in the timeline here, but we analysed the behaviour of the test environment for a few days, trying to get a handle on what was going on.

Leveraging some of the inbuilt query statistics in PostgreSQL, it looked like the most frequent and costly type of query was related to getting a “version” of the remote table for synchronization orchestration. The second most costly type of query was related to getting a “manifest” of a subset of the table being synchronized for the differencing engine.

Disabling those parts of the API (but leaving the uploads alone) dropped the IOPS significantly, surprising exactly zero people. This did disagree with our hypothesis from last time though, so that was interesting.

Of course, the API is pretty useless without the ability to inform the synchronization process, so it was optimization time.

  • We could try to reduce the total number of calls, reducing the frequency that those queries are executed. We’d already done some work recently to dramatically reduce the total number of calls to the API from each synchronization process though, so it was unlikely we would be able to get any wins here
  • We could implement a cache in front of the API, but this just complicates things and all it will really result in is doing work repeatedly for no benefit (if the process syncs data then asks the API for a status, and gets the cached response, it will just sync the data again)
  • We could reduce the frequency of syncing, doing it less often. Since we already did the work I mentioned above to reduce overall calls, the potential gains here were small
  • We could try to make the queries more efficient. The problem here was that the queries were already using the primary keys of the tables in question, so I’m not entirely sure that any database level optimizations on those tables would have helped
  • We could make getting an answer to the question “give me the remote version of this table” more efficient by using a dedicated data structure to service those requests, basically a fancy database level cache

We prototyped the last option (basically a special cache within the database that contained table versions in a much easier to query format) and it had a positive effect on the overall read IOPS.
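
The idea behind the prototype is easiest to show as a sketch. The schema, table and column names below are assumptions made for this post, not the real thing; the point is that the orchestration question “what is the remote version of this table?” becomes a single-row lookup instead of an aggregate over millions of rows.

public static class TableVersionCacheSql
{
    // Kept up to date whenever the sync API writes to (or deletes from) the real table.
    public const string Upsert = @"
        INSERT INTO table_versions (customer_id, database_id, table_name, row_version, row_count)
        VALUES (@customerId, @databaseId, @tableName, @rowVersion, @rowCount)
        ON CONFLICT (customer_id, database_id, table_name)
        DO UPDATE SET row_version = EXCLUDED.row_version,
                      row_count   = EXCLUDED.row_count;";

    // The hot path: a primary key lookup on a table with one row per (customer, database, table).
    public const string Get = @"
        SELECT row_version, row_count
        FROM   table_versions
        WHERE  customer_id = @customerId
        AND    database_id = @databaseId
        AND    table_name  = @tableName;";
}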

But it didn’t get rid of it entirely.

Within The Sound Of Synchronization

Looking into our traffic, we discovered that our baseline traffic had crept up since we’d implemented the skipping strategy in the sync process. Most of that baseline traffic appeared to be requests relating to the differencing engine (i.e. scanning the table to get primary key manifests for comparison purposes), which was one of the expensive types of queries that we identified above.

We’d made some changes to the algorithm to incorporate the ability to put a cap on the number of skips we did (for safety, to avoid de-sync edge cases) and to introduce forced skips for tables whose changes we were happy to only sync a few times a day.

A side effect of these changes was that whenever we decided NOT to skip using the local comparison, the most common result of the subsequent local vs remote comparison was choosing to execute the differencing engine. There was a piece of the algorithm missing where it should have been choosing to do nothing if the local and remote were identical, but that did not seem to be working due to the way the skip resolution had been implemented.

Fixing the bug and deploying it caused the read IOPS on our normal production server to drop a little bit, which was good.

The different pattern of traffic + our prototype table version cache caused a much more dramatic drop in read IOPS in our test environment though. It looked like the two things acting together reduced the demands on the database enough to prevent it from having to read so much all the time.

Conclusion

We’re still working on a production quality cached table version implementation, but I am cautiously optimistic. There are some tricky bits regarding the cache (like invalidation vs updates, and where that sort of decision is made), so we’ve got a bit of work ahead of us though.

At this point I’m pretty thankful that we were easily able to both spin up an entirely separate and self contained environment for testing purposes, and that we were able to replicate traffic from one environment to the other without a lot of fuss. Without the capability to reproduce the problem disconnected from our clients and experiment, I don’t think we would have been able to tackle the problem as efficiently as we did.

I’m a little disappointed that a bug in our sync process managed to slip through our quality assurance processes, but I can understand how it happened. It wasn’t strictly a bug with the process itself, as the actions it was performing were still strictly valid, just not optimal. Software with many interconnected dependent components can be a very difficult thing to reason about, and this problem was relatively subtle unless you were looking for it specifically. We might have been able to prevent it from occurring with additional tests, but it’s always possible that we actually had those tests anyway, and during one of the subsequent changes a test failed and was fixed in an incorrect way. I mean, if that was the case, then we need to be more careful about “fixing” broken tests.

Regardless, we’re starting to head into challenging territory with our sync process now, as it is a very complicated beast. So complicated in fact that it’s getting difficult to keep the entire thing in your head at the same time.

Which is scary.


In late 2016 we deployed the first version of our data synchronization API. It was a momentous occasion, and it all worked perfectly…at least until we turned on the software that actually used it. Actual traffic has a habit of doing that.

Long story short, we experienced a bunch of growing pains as we scaled our traffic up during those first few weeks, and this post is a follow-up to one of those issues in particular.

Specifically, the issue of particularly high read IOPS on the RDS instance that backed the sync API.

At the time, we investigated, didn’t find the root cause of the problem, threw money at it and ran away to do things that the business considered more important.

Why does it matter now?

Well, it happened again, and this time we can’t just throw money at it.

Sound familiar?

I’ve Come to Read From You Again

As I mentioned above, the main symptom was overly high read IOPS on our RDS server. Much higher than we expected for what we were doing.

This was an issue because of the way that data volumes work in AWS. You get some guaranteed IOPS (read and write combined) based on the size of the volume (1GB = 3 IOPS), while at the same time you get the capability to burst up to 3000 IOPS for brief periods of time.

The brief period of time is the important part, because it is actually modelled as a “balance” of IOPS credits that you consume if you go above your baseline, and which get replenished if you’re below it. Very, very similar to the way that CPU credits work on EC2 instances. Back in late 2016, what exactly the balance was at any particular point in time was a bit of a mystery though, because you couldn’t graph or alarm on it through CloudWatch. You only knew it had run out when the performance of your database tanked, because the workload depended on using more IOPS than the baseline.
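
As a rough worked example of that arithmetic (the volume size here is made up, it’s only the shape of the calculation that matters):

// Made-up numbers, just to show the shape of the gp2 arithmetic described above.
const int volumeGb = 500;
const int baselineIops = 3 * volumeGb;                              // 1,500 IOPS guaranteed
const int sustainedReadIops = 2500;                                 // what the workload is actually doing
const int creditDrainPerSecond = sustainedReadIops - baselineIops;  // 1,000 credits/second
// Anything above the baseline eats the balance; anything below it slowly refills it.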

Anyway, you can chart and alarm on the balance now, which is nice.

We tried a few things at the time, including scaling the underlying data volume (to give us more baseline), but it wasn’t enough.

We then increased the size of the RDS instance, and the high read IOPS vanished. I mean, that was obviously the best outcome, clearly. A problem we didn’t understand reacting in a completely unexpected way when we threw more power at it.

At the time, we thought that it might be related to EF querying the database on writes (to choose between an insert or update), but we never proved that hypothesis conclusively.

No-one in the business complained too loudly about the cost and everything was working swimmingly, so after an initial period of discomfort (mostly the engineers inside us sobbing quietly), we promptly forgot about it altogether.

Because Some Usage Softly Creeping

A few weeks back, our data synchronization API started behaving strangely.

The first we knew about the problem was when our external monitoring service (Pingdom) reported that the API was down. Then up. Then down again. Then up again. So on and so forth.

Looking at the RDS metrics there were thousands of connections active to the database, causing us to hit connection limits. My first thought was that someone internal to the organisation had somehow got the credentials to the database and was running insane queries for business intelligence. I mean, thinking about that conclusion now, it’s completely bogus, because:

  1. No-one has the credentials to the master database but a few members of my team
  2. No-one even knows what the URL to access the master database is
  3. We created a read replica specifically for the organisation to use for business intelligence, so why would someone use the master
  4. You can’t even connect to the master database without being on our internal AWS VPN, which most people in the organisation don’t have access to

It didn’t help that when PostgreSQL runs out of connections, you can’t even connect to it with pgAdmin or something similar to see what the connections are doing.

Anyway, after falsely bombarding some innocent people with polite requests to “please stop doing things to our master database, it’s killing our API”, I shut down our data synchronization API and all the connections disappeared.

Because it was our API.

The calls were coming from inside the house.

Wrecked Everything While I was Sleeping

Luckily, the data synchronization API is probably the one API we can take down for a few hours and not have anybody really notice. As long as the database is available, the other API that actually allows access to the data can continue to do what it normally does, just with increasingly stale information. It’s an eventually consistent model anyway, so some staleness is inherent.

We still only realistically had a few hours to dig into the issue before it became a real problem for actual users though.

Our only lead was the massive number of connections that were being established to the RDS instance. I managed to get a sample of what the connections were doing just as I shut the API down, and it wasn’t anything interesting. Just normal queries that I would expect from the API. Traffic hadn’t changed either, so we weren’t dealing with some weird DDOS attack or anything like that.

Well, to skip ahead, it turns out that the rise in the number of connections was a symptom, not the cause. The root issue was that we had burnt through our IO burst balance over the last few hours due to an increase in read IOPS. We’d almost done the same thing the day before, but it bottomed out at like 5% remaining because the day’s traffic started dropping off, so we didn’t notice. The lack of IOPS was causing each request to execute slower than normal, which just meant that there were more requests being executed in parallel.

Each API instance had a maximum number of connections that it could use, but the sum total across all of the instances was higher than the total number available to PostgreSQL (whoops).

But why were we eating all of our delicious burst balance all of a sudden?

Well, a few days prior we started syncing a new table, and even though it was a small one, it increased the drain on the infrastructure.

Interestingly enough, it wasn’t a case of read IOPS going from close to nothing to a massive number with that new table. Our read IOPS had actually increased to a ridiculous baseline a few weeks before that when we shipped a completely different table, and we just hadn’t noticed. The table we shipped recently had just been the straw that broke the camel’s back.

We turned both of the new tables off, restored the API and all was well.

Conclusion

Honestly, we really should have had some alarms. Any alarms. Literally any alarms on the things we were already suspicious of at the database level (connections, read IOPS, etc) would have given us some sort of warning that bad things were happening, weeks before they actually happened.

We have those alarms now, but it was still a pretty big oversight.

With the new tables turned off and stability and functionality restored, we looked to fix the problem.

Throwing money at it wasn’t really feasible, as we would have basically doubled the infrastructure costs, and it was already costing us a few thousand $US each month. No, this time we would have to understand the problem and fix it properly.

But that’s a topic for next time, because this post is already large.


Regularly writing blog posts is hard.

Recently, I’ve been focusing on monstrously long, multi-part posts explaining complex or involved processes.

I’m honestly not sure whether this pattern is good or bad in the greater scheme of things. Either I just happen to be tackling (and thus writing about) bigger problems in my day to day life, or I’ve simply run out of small things to write about, leaving only the big ones.

But anyway, enough musing, because this post fits squarely into the “bunch of connected posts about one topic” theme.

I was pretty happy with the state of our data synchronization algorithm the last time I wrote about it. We’d just finished putting together some optimizations that dramatically reduced the overall traffic while still maintaining the quality of the overall syncing process, and it felt pretty good. It’s been a little over a month since those changes were deployed, and everything has been going swimmingly. We’ve added a few new tables to the process, but the core algorithm hasn’t changed.

Normally this is the point where I would explain how it was all secretly going terribly wrong, but in a weird twist of fate, the algorithm is actually pretty solid.

We did find a bug which can cause the on-premises and remote locations to be out of sync though, which was unfortunate. It happens infrequently, so it only affects a small subset of the data, but it still makes for an interesting topic to write about.

Well, interesting to me at least.

Optimizations Are Always Dangerous

The core of the bug lies in our recent optimizations.

In order to reduce the amount of busywork traffic occurring (i.e. the traffic resulting from the polling nature of the process), we implemented some changes that leverage local and remote table manifests to short-circuit the sync process if there was nothing to do. To further minimize the traffic to the API, we only queried the remote table manifest at the start of the run and then used that for the comparison against the local on the next run. Essentially we exchanged a guaranteed call on every non-skipped sync for one call each time the local and remote became identical.
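
The short-circuit itself boils down to something like the sketch below. Whatever summary the comparison actually uses (version, count, manifest), the key detail is that the remote side comes from the previous run, not from a fresh request. The names here are invented for illustration.

public sealed record TableState(long RowVersion, long RowCount);

public class SkipCheck
{
    // Captured from the remote during the last non-skipped run. This is the
    // optimization, and also where the bug lives: it can be stale.
    public TableState LastKnownRemote { get; set; }

    public bool ShouldSkip(TableState currentLocal)
    {
        return LastKnownRemote != null
            && LastKnownRemote.RowVersion == currentLocal.RowVersion
            && LastKnownRemote.RowCount == currentLocal.RowCount;
    }
}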

The bug arises in the rare case where the captured remote from the last run is the same as the current local, even though the current remote is different.

The main way that this seems to happen is:

  • Table with low rate of change gets a new row.
  • Algorithm kicks in and syncs the new data.
  • In the time between the run that pushed the data and the next one, user removes the data somehow.
  • Current local now looks exactly like the remote did before the data was pushed.
  • Sync algorithm thinks that it has nothing to do.

In this case the algorithm is doing exactly what it was designed to do. It’s detected that there are no changes to deal with, and will continue to skip executions until something does change (new rows, updates, deletions, anything), at which point it will run again. If the table changes infrequently we’re left with an annoying desync for much longer than we would like.

Like I said earlier, it’s a pretty specific situation, with relatively tight timings, and it only occurs for tables that are infrequently changed, but a bug is a bug, so we should fix it all the same.

Safety Dance

The obvious solution is to requery the remote after the sync operation has finished execution and store that value for the comparison against the local next time, rather than relying on the value from before the operation started.

The downside of this is that it adds another request to every single non-skipped sync, which amounts to a relatively significant amount of traffic. We’re still way ahead of the situation before the optimizations, but maybe we can do better?

Another idea is to limit the maximum number of skips that can happen in a row, taking into account how long we might want the situation described above to persist.

This approach also raises the number of requests occurring, but has the happy side effect of picking up changes at the remote end as well (i.e. nothing has changed locally, but we deleted all the data remotely in order to force a resync or something).
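
Both options are mechanically simple. The sketch below is illustrative only; the names and shapes are invented for this post rather than lifted from our codebase.

using System;

public interface IRemoteVersionSource { long GetCurrentRemoteVersion(); }

public static class SkipFixes
{
    // Option 1: after the sync actions have executed, re-capture the remote version so
    // that the next run compares against a post-operation value rather than a stale one.
    public static long RecaptureAfterSync(Action runSyncActions, IRemoteVersionSource remote)
    {
        runSyncActions();
        return remote.GetCurrentRemoteVersion();   // one extra request per non-skipped sync
    }

    // Option 2: cap the number of consecutive skips, so a stale comparison can only
    // suppress the process for a bounded amount of time before a run is forced.
    public static bool ShouldSkip(bool localMatchesCachedRemote, ref int consecutiveSkips, int maxSkips)
    {
        if (localMatchesCachedRemote && consecutiveSkips < maxSkips)
        {
            consecutiveSkips++;
            return true;
        }

        consecutiveSkips = 0;   // a forced (or genuine) run resets the counter
        return false;
    }
}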

To compare the two possible fixes, I actually did some math to see which one would result in more requests, and with the maximum number of skips set to a value that forced a run every 30 minutes or so, they are pretty much a wash in terms of additional requests.

I’ve flip-flopped a fair bit on which solution I think we should apply. Initially I thought the “limit maximum skips” approach was the best (because it essentially offers a sanity check to the concept of skipping runs), but from an engineering point of view, it just feels messy, like the sort of solution you come up with when you can’t come up with something better. Almost brute force in its approach.

I’m currently favouring amending the algorithm to query the remote after the operation executes, because it feels the cleanest, but I’m not ecstatic about it either, as it feels like it’s doing more work than is strictly necessary.

Decisions decisions.

Conclusion

As much as it saddens me to find bugs, it pleases me to know that with each bug fixed, the algorithm is becoming stronger, like tempering steel, or building muscle. Applying stress to something causes it to break down and then be repaired with improvements.

It can be tempting to just throw a fix in whenever you find a bug like that, but I believe that hack fixes should never be tolerated without a truly exceptional reason. You should always aim to make the code better as you touch it, not worse. The hardest part of fixing bugs is to perform the repairs in such a way that it doesn’t compromise the design of the code.

Of course, if the design is actually the cause of the problem, then you’re in for a world of hurt.