Like with all software, it's rare to ever actually be done with something. A few weeks back I wrote at length about the data synchronization algorithm we use to free valuable and useful data from its on-premises prison, to the benefit of both our clients (for new and exciting applications) and us (for statistical analysis).
Conceptually, the process leverages an on-premises application, a relatively simple API and a powerful backend data store to accomplish its goals, along with the following abstracted algorithm (which I went into in depth in the series of blog posts that I linked above).
Get Local Version
Get Remote Version
If Local == Remote
    Calculate [BATCH SIZE] Using Historical Data
    Get Last Local Position
    Get Next [BATCH SIZE] Local Rows from last position
    Get Min & Max Version in Batch
    Query Remote for Manifest Between Min/Max Local Version
    Create Manifest from Local Batch
    Compare
        Find Remote Not in Local
            Delete from Remote
        Find Local Not in Remote
            Upload to Remote
If Local > Remote
    Calculate [BATCH SIZE] Using Historical Data
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload to Remote
    Record Result for [BATCH SIZE] Tuning
    If Failure & Minimum [BATCH SIZE], Skip Ahead
If Local < Remote
    Find Remote > Local Version
    Delete from Remote
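The heart of the equal-versions branch is a set difference between the local and remote manifests. A minimal sketch of that comparison (names and shapes are illustrative, not the real implementation):

```python
# Hypothetical sketch of the manifest comparison step: rows present
# remotely but not locally get deleted from the remote, and rows present
# locally but not remotely get uploaded. Manifests are modelled here as
# simple collections of row identifiers.

def diff_manifests(local_manifest, remote_manifest):
    """Return (rows_to_upload, rows_to_delete) as sorted lists of row ids."""
    local_ids = set(local_manifest)
    remote_ids = set(remote_manifest)
    to_upload = sorted(local_ids - remote_ids)   # local rows missing remotely
    to_delete = sorted(remote_ids - local_ids)   # remote rows no longer local
    return to_upload, to_delete

to_upload, to_delete = diff_manifests([1, 2, 3, 5], [2, 3, 4])
print(to_upload, to_delete)  # [1, 5] [4]
```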
One of the issues with the algorithm above is that it's pretty chatty when it comes to talking to the remote API. It is polling based, so that's somewhat to be expected, but there are a lot of requests and responses being thrown around, which seems like a prime opportunity for improvement.
To give some context:
- We have approximately 3500 unique clients (each one representing a potential data synchronization)
- Of that 3500, approximately 2200 clients are actively using the synchronization
- In order to service these clients, the API deals with approximately 450 requests a second
Not a ground-shaking amount of traffic, but if we needed to service the remainder of our clients in the same way, we'd probably have to scale out to deal with it. Scaling out on AWS is pretty trivial, but the traffic in play is also overloading our log aggregation (our ELK stack), so there are other factors to consider.
Digging into the traffic a bit (using our awesome logging), it looks like the majority of the requests are GET requests.
The following Kibana visualization shows a single day's traffic, aggregated over time/HTTP verb. You can clearly see the increase in the amount of non-GET requests during the day as clients make changes to their local databases, but the GET traffic dwarfs it.
If we want to reduce the total amount of traffic, attacking the GET requests seems like a sane place to start. But maybe we could just reduce the traffic altogether?
The plugin architecture that schedules and executes the application responsible for performing the data synchronization has a cadence of around 60 seconds (with support for backoff in the case of errors). It is smart enough to be non re-entrant (meaning it won't start another process if one is already running), but it has some weaknesses that push the overall cadence closer to 5 minutes, and even longer when multiple databases are registered (because each database runs its own application, and the number running in parallel is limited).
One easy way to reduce the total amount of traffic is to simply reduce the cadence, drawing it out to at least 10 minutes in between runs.
The downside of this is that it increases the amount of latency between local changes being made and them being represented on the remote database.
Being that one of the goals for the sync process was to minimise this latency, simply reducing the cadence in order to decrease traffic is not a good enough solution.
Just Stop Talking, Please
If we look at the GET traffic in more detail we can see that it is mostly GET requests to two endpoints.
These two endpoints form the basis for two different parts of the sync algorithm.
The first endpoint is used to get an indication of the status for the entire table for that customer/database combination. It returns a summary of row count, maximum row version and maximum timestamp. This information is used in the algorithm above in the part where it executes the “Get remote version” statement. It is then compared to the same information locally, and the result of that comparison is used to determine what to do next.
The second endpoint is used to get a summarised chunk of information from the table, using a from and to row version. This is used in the sync algorithm to perform the diff check whenever the local and remote versions (the other endpoint) are the same.
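As a rough sketch, the two responses can be modelled like this (field names and types are assumptions for illustration, not the actual API contract):

```python
from typing import List, NamedTuple

class TableStatus(NamedTuple):
    """Summary returned by the first endpoint for a customer/database table."""
    row_count: int
    max_row_version: int
    max_timestamp: str  # e.g. an ISO-8601 string

class ManifestEntry(NamedTuple):
    """One row summary from the second endpoint's from/to version chunk."""
    row_id: int
    row_version: int

def manifest_for_range(entries: List[ManifestEntry],
                       from_version: int,
                       to_version: int) -> List[ManifestEntry]:
    """Filter a manifest down to the requested row version window."""
    return [e for e in entries if from_version <= e.row_version <= to_version]
```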
What this means is that every single run of every application is guaranteed to hit the first endpoint for each table (to get the remote version) and pretty likely to hit the second endpoint (because the default action is to run the diff check whenever local and remote versions are the same).
The version comparison is flawed though. It only takes into account the maximum row version and maximum timestamp for its decision making, ignoring the row count altogether (which was there historically for informational purposes). The assumption here was that we wanted to always fall back to scanning through the table for other changes using the differencing check, so if our maximum version/timestamps are identical that’s what we should do.
If we use the row count though, we can determine if the local and remote tables are completely in sync, allowing us to dodge a large amount of work. Being that all updates will be covered by the row version construct and all deletions will be covered by the row count changing, we should be in a pretty good place to maintain the reliability of the sync.
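That reasoning can be sketched as a pair of hypothetical predicates (not the production code), where a status is a `(row_count, max_row_version, max_timestamp)` tuple:

```python
# Hypothetical helpers, not the production code. Updates always move the
# row version/timestamp forward, and deletions change the row count, so
# comparing all three components is enough to declare the tables in sync.

def fully_in_sync(local, remote):
    """All three components must agree before we can skip all work."""
    return local == remote

def needs_diff_scan(local, remote):
    """Versions/timestamps agree but counts differ: rows were removed
    (or otherwise changed) without moving the maximum version."""
    return local[1:] == remote[1:] and local[0] != remote[0]
```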
1 Row, 2 Rows, 3 Rows, Ah Ah Ah!
The naive thing to do would be to get the local count/version/timestamp and the local count/version/timestamp from last time and compare them (including the row count). If they are the same, we don’t need to do anything! Yay!
This fails to take into account the state of the remote though, and the nature of the batching process. While there might not be any changes locally since last time, we might not have actually pushed all of the changes from last time to the remote.
Instead, what we can do is compare the local count/version/timestamp with the last remote count/version/timestamp. If they are the same, we can just do nothing because both are completely in sync.
Editing the algorithm definition from the start of this post, we get this:
Get Local Count/Version/Timestamp
Get Last Remote Count/Version/Timestamp
If Local Count/Version/Timestamp == Last Remote Count/Version/Timestamp
    Do Nothing and Exit
Get Remote Count/Version/Timestamp
Store Remote Count/Version/Timestamp For Lookup Next Time
If Local Count/Version/Timestamp == Remote Count/Version/Timestamp
    Do Nothing and Exit
If Local Version/Timestamp == Remote Version/Timestamp BUT Local Count != Remote Count
    Calculate [BATCH SIZE] Using Historical Data
    Get Last Local Position
    Get Next [BATCH SIZE] Local Rows from last position
    Get Min & Max Version in Batch
    Query Remote for Manifest Between Min/Max Local Version
    Create Manifest from Local Batch
    Compare
        Find Remote Not in Local
            Delete from Remote
        Find Local Not in Remote
            Upload to Remote
If Local Version/Timestamp > Remote Version/Timestamp
    Calculate [BATCH SIZE] Using Historical Data
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload to Remote
    Record Result for [BATCH SIZE] Tuning
    If Failure & Minimum [BATCH SIZE], Skip Ahead
If Local Version/Timestamp < Remote Version/Timestamp
    Find Remote > Local Version
    Delete from Remote
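A minimal sketch of the revised decision tree (illustrative, not the real implementation), where a status is a `(row_count, max_row_version, max_timestamp)` tuple. The key point is that the remote status endpoint is only called once the cached comparison fails, which is what saves the API round-trip:

```python
def next_action(local, last_remote, fetch_remote):
    """Return a label for what the sync run would do next.

    local and last_remote are (count, version, timestamp) tuples;
    fetch_remote is a callable hitting the (hypothetical) status endpoint."""
    if last_remote is not None and local == last_remote:
        return "exit: matched cached remote state, no API call made"
    remote = fetch_remote()  # GET the status endpoint (and cache the result)
    if local == remote:
        return "exit: fully in sync"
    local_vt, remote_vt = local[1:], remote[1:]
    if local_vt == remote_vt:
        return "diff scan"           # counts differ, versions/timestamps equal
    if local_vt > remote_vt:
        return "upload next batch"
    return "delete remote rows above local version"
```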
The other minor change in there is comparing the full local count/version/timestamp against the current remote count/version/timestamp. If they are identical, it's just another case where we need to do nothing, so we can exit safely until next time.
Just how much of a difference does this make though? I’ll let a picture of a partial days traffic answer that for me.
In the image below I’ve specifically set the scale of the graph to be the same as the one above for comparison purposes.
As you can see, the traffic rises from nothing overnight (because nothing is changing) to a very acceptable amount during the day, representing real work that needs to be done, and will likely continue with the same pattern, flattening back into nothing as the day progresses and people stop making changes.
It's a ridiculous decrease in the amount of pointless noise traffic, which is a massive victory.
Thinking about this sort of thing in more general terms, optimization is an important step in the production and maintenance of any piece of software, but it's important not to engage on this path too early. You should only do it after you've gathered enough information to justify it, and to pinpoint exactly where the most impact can be made for the least effort. The last thing you want to do is spend a week or two chasing something you think is a terrible inefficiency, only to discover that it makes less than a percent of difference to the system as a whole.
The most efficient way to do this sort of analysis is with good metrics and logging.
You’d be crazy not to do it.