As should be somewhat obvious by now, I’m a big fan of logging. I try to justify that interest in a logical fashion, stating the intelligence benefits but really, I just personally find it interesting and as a result it tends to be something that I focus on when developing software.
Log aggregation as a concept is still pretty new to me though. It wasn’t until a few years ago that I even became aware that it even existed, even though something always lingered in the corners of my mind about the usefulness of logs that were trapped on hundreds of different machines and environments. The idea to take the logs from a whole bunch of different locations (servers, client machines, everything), destructure them and put them all into a centralised, searchable location kind of blew my mind to be honest, and its been something of a passion ever since.
In an interesting twist of fate, my first exposure to log aggregation involved RavenDB. One of the previous developers had set up a RavenDB database to act as the destination for log events coming out of a piece of desktop software, and when i realized what it was, I was intent on using it to understand how the software was being used.
The idea was amazing.
Not so much.
A Missed Opportunity
As I mentioned above, at some point in the past the development team identified a need to raise and aggregate structured log events for a variety of reasons, like business intelligence, error reporting, general logging and so on. They picked RavenDB as the storage location, probably because it was so easy to use from a development point of view, and trivial to integrate into a C# application using the RavenDB client to write arbitrary plain old C# objects with no real messing around.
A simple custom logger interface was created, which would expose the capability to write log events that derived from some common base, and the implementation of that interface used the RavenDB client. From a storage point of view, staging and production instances of RavenDB were provisioned inside AWS, and exposed to the greater internet via an Elastic Load Balancer. The very next release of the software was writing all sorts of interesting events into RavenDB with very little difficulty, like notifications that the application was being started (including things like who started it and what version it was on) and errors that were occurring during normal usage.
It was all very useful.
That is, until you tried to actually search the resulting log events and glean some information from them.
This was back in RavenDB 2.X, so the default UI was the Silverlight implementation of the RavenDB Studio, which wasn’t exactly the most amazing piece of software.
Even ignoring the usability problems, RavenDB 2.X did not have good support for ad-hoc queries, and the no one had thought ahead and set up fields for indexing. This meant that the wealth of information available damn near unusable.
How Could You Not Call It Muninn?
Shortly after I started and found out about all of this awesome information going to waste I put together an ELK stack and hacked together a bridge between the two.
Whenever a log event was written into RavenDB, a small memory resident application would automatically write it out to a JSON file, and Logstash would pick it up and put it into the ELK stack.
With the new ELK stack in place, whenever we wrote new code we would ensure that it could aggregate its logs as necessary, augmenting the system with new capabilities each time. In the end it was mostly just log files being picked up and processed by local Logstash instances, but we process ELB logs in Lambda by posting directly to Logstash over HTTP and we exposed a public facing authenticated Logging API for software to aggregate log events outside of our little walled garden in AWS.
The rest is history, and we’ve been using ELK ever since.
But we never changed the initial piece of desktop software and its been happily writing into RavenDB this whole time, with the events being replicated into the ELK stack via that “temporary” hack.
Its The Perfect Name
Don’t get me wrong, the system worked, and log events from the desktop software were available in our ELK stack, it just felt so inefficient to have this whole other system in place providing no real additional value.
Not only that, while it did technically work, there were definitely times when that particularly log event channel failed spectacularly, from full disks (no-one had ever implemented maintenance or cleanup of any sort on the RavenDB database) through to performance problems similar to the ones that we saw in one of our APIs.
The last straw was when I did an audit on our AWS costs and discovered that we were actually paying a non-trivial amount of cash for the RavenDB logging infrastructure and that just seemed wrong (plus, if I recovered that cash I could put it into other things, like the ELK stack).
All told, there were more reasons to get rid of it than there were to keep it.
Well, as I mentioned earlier, when the development team initially implementing the log aggregation, they put a simple logging interface in place, with the RavenDB specific implementation behind that interface.
Switching the logging to use our Serilog based HTTP sink (which connected to our authenticated Logging API) was as simple as providing a new implementation of the old logging interface, so even though its usage infested the entire system, we really only had to change it once at the top level.
Of course, it wasn’t quite that simple (references to RavenDB were strewn through at least 3 different libraries that dealt with logging, so they had to be removed), but all told it was a hell of a lot easier than it would have been without that abstraction.
Obviously we want to switch to using Serilog directly (rather than a custom logging abstraction that provides no real value), but that can happen slowly, bit by bit, and the flow of log events will be maintained the entire time.
Unfortunately, we can’t actually decommission the RavenDB logging infrastructure just yet, as we have to wait until everyone in the wild upgrades to the latest version or we risk missing out on important intelligence from the slackers. I’m keeping a close eye on the total number of logs received through the old system, and the moment it drops below a certain threshold, I’m going to nuke the entire thing.
To be honest, the removal of the RavenDB logging channel that I described above makes me happier than it really has cause to. It’s been bugging me for a while (mostly from an engineering point of view), so it felt really good to get rid of it in the latest release of our software. That sort of maintenance and consolidation doesn’t always have a direct and measurable impact on the bottom line, but its still important for the system as a whole and important for engineering morale.
Its like ripping out a mostly dead plant from a garden.
Better to make room for something that might actually be useful.