
I’ve been sick this week, so this post will be incredibly short (and late).

I’ve spoken on and off about some of the issues that we’ve had keeping the RavenDB database behind one of our services running in a maintainable fashion.

I don’t think RavenDB is a bad product though (in fact, I think it’s quite good at what it does). I think the main problem is that it is not a good fit for what we’re trying to do with it, and we don’t really have the expertise required to use it well.

Regardless, we had to scale the underlying infrastructure again recently (from an r3.4xlarge to an r3.8xlarge), and it got me thinking about how well we actually understand the data that is being stored inside the database. The last time we had to scale (before the most recent one), we had something like 120K documents in the system, spread across 700-800 unique clients.

Now? Almost triple that, at 340K, but we’re only up to like 1000 unique clients.

Something didn’t add up.

I mean, the entire concept behind the service is that it is a temporary staging area. It contains no long term storage. The number of documents present should be directly related to the activity around the feature that the service supports, and it was highly unlikely that the feature had become three times as popular in the intervening period.

No Young Waynes Here

The system uses a manifest-like concept to aggregate all of the data belonging to a customer’s data set in one easily correlated place (specifically, using prefixes on the IDs of the documents). Each manifest (or account) contains some meta information, like the last time any data at all was touched for that account.
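To make that a bit more concrete, here’s a minimal sketch (in Python, with made-up IDs and field names, not the real document structure) of how the prefix-based correlation hangs together:

```python
# Hypothetical illustration only: the ID scheme and field names are assumptions.
manifest = {
    "Id": "accounts/1337/manifest",
    # Updated whenever any data belonging to the account changes.
    "LastTouched": "2016-06-01T09:30:00Z",
}

documents = [
    {"Id": "accounts/1337/schedules/1", "ScheduledFor": "2016-06-15T00:00:00Z"},
    {"Id": "accounts/1337/schedules/2", "ScheduledFor": "2016-07-01T00:00:00Z"},
]

# Everything belonging to the account can be found via its ID prefix.
prefix = "accounts/1337/"
account_documents = [d for d in documents if d["Id"].startswith(prefix)]
```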

It was a relatively simple matter to identify all accounts that had not been touched in the last 30 days. For a system that relies on constant automatic synchronization, if an entire account has not been touched in the last 30 days, it’s a pretty good candidate for having been abandoned for some reason, the most likely of which is that the user has switched to using a different account (they are quite fluid).
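For illustration, the check itself boils down to something like the following. This is a Python sketch over plain dictionaries rather than the actual RavenDB query, and the field names are assumptions:

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(days=30)

def find_stale_accounts(manifests, now=None):
    """Return manifests whose data has not been touched in the last 30 days."""
    now = now or datetime.utcnow()
    cutoff = now - STALE_AFTER
    return [m for m in manifests if m["LastTouched"] < cutoff]

# Hypothetical usage.
manifests = [
    {"Id": "accounts/1/manifest", "LastTouched": datetime(2016, 1, 10)},
    {"Id": "accounts/2/manifest", "LastTouched": datetime(2016, 6, 20)},
]
print([m["Id"] for m in find_stale_accounts(manifests, now=datetime(2016, 6, 30))])
# ['accounts/1/manifest']
```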

I found 410 accounts that were untouched.

There are only 1800 accounts in the system.

The second point of investigation was to look directly at the document type with the highest count.

This document describes something that is scheduled, and abandoned data can be easily found by looking for things that were scheduled over 30 days ago. Because of the transient nature of the design, if something is still in the system even though it was scheduled for a month in the past, it’s a pretty safe bet that it’s no longer valid.
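The check is conceptually the same as the stale account one, just keyed off the scheduled date instead of the last-touched timestamp. Again, a rough sketch with assumed field names rather than the real query:

```python
from datetime import datetime, timedelta

def find_expired_schedules(documents, now=None):
    """Return scheduled documents whose date is more than 30 days in the past."""
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=30)
    return [d for d in documents if d["ScheduledFor"] < cutoff]
```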

Remember how I said there were around 340K documents in the system?

220K were scheduled for so far in the past that they were basically irrelevant.

Conclusion

The findings above made me sad inside, because it means that there is definitely something wrong with the way the service (and the software that uses the service) is managing its data.

I suspect (but can’t back this up) that the amount of useless data present is not helping our performance problems either, so it’s like a double gut-punch.

I suppose the important thing to take away from this is to never become complacent about the contents of your persistence layer. Regular audits should be executed to make sure you understand exactly what is being stored and why.

Now that I know about the problem, all that’s left is to put together some sort of repeatable mechanism to clean up, and then find and fix the bugs that led to the data accumulating in the first place.

But if I hadn’t looked, we probably would have just accepted that this was the shape of the data.