Building a Better Beast, Part 1
Way back in March 2015, I wrote a few posts explaining how we set up our log aggregation. I’ve done a lot of posts since then about logging in general and about specific problems we’ve encountered in various areas, but I’ve never really revisited the infrastructure underneath the stack itself.
The main reason for the lack of posts is that the infrastructure I described back then is not the infrastructure we’re using now. As we built more things and started pushing more and more data into the stack we had, we began to experience some issues, mostly related to the reliability of the Elasticsearch process. At the time, the organization decided that it would be better if our internal operations team were responsible for dealing with these issues, and they built a new stack as a result.
This was good and bad. The good part was the arguments that if we didn’t have to spend our time building and maintaining the system, it should theoretically leave more time and brainspace for us to focus on actual software development. Bad for almost exactly the same reason, problems with the service would need to be resolved by a different team, one with their own set of priorities and their own schedule.
The arrangement worked okay for a while, until the operations team were engaged on a relatively complex set of projects and no longer had the time to extend and maintain the log stack as necessary. They did their best, but with no resources dedicated to dealing with the maintenance on the existing stack, it started to degrade surprisingly quickly.
This came to a head when we had a failure in the new stack that required us to replace some EC2 instances via the Auto Scaling Group, and the operations team was unavailable to help. When we executed the scaling operation, we discovered that it was creating instances that didn’t actually have all of the required software setup in order to fulfil their intended role. At some point in the past someone had manually made changes to the instances already in service and these changes had not been made in the infrastructure as code.
After struggling with this for a while, we decided to reclaim the stack and make it our responsibility again.
Architecturally, the new log stack was a lot better than the old one, even taking into account the teething issues that we did have.
The old stack was basically an Auto Scaling Group capable of creating EC2 instances with Elasticsearch, Logstash and Kibana along with a few load balancers for access purposes. While the old stack could theoretically scale out to better handle load, we never really tested that capability in production, and I’m pretty sure it wouldn’t have worked (looking back I doubt the Elasticseach clustering was setup correctly, in addition to some other issues with the way the Logstash indexes were being configured).
The new stack looked a lot like the reference architecture described on the Logstash website, which was good, because those guys know their stuff.
At a high level, log events would be shipped from many different places to a Broker layer (auto scaling Logstash instances behind a Load Balancer) which would then cache those events in a queue of some description (initially RabbitMQ, later Redis). An Indexer layer (auto scaling Logstash instances) would pull events off the queue at a sustainable pace, process them and place them into Elasticsearch. Users would then use Kibana (hosted on the Elasticsearch instances for ease of access) to interact with the data.
There are a number of benefits to the architecture described above, but a few of the biggest ones are:
- Its not possible for an influx of log events to shut down Elasticsearch because the Indexer layer is pulling events out of the cache at a sustainable rate. The cache might start to fill up if the number of events rises, but we’ll still be able to use Elasticsearch.
- The cache provides a buffer if something goes wrong with either the Indexer layer or Elasticsearch. We had some issues with Elasticsearch crashing in our old log stack, so having some protection against losing log events in the event of downtime was beneficial.
There were downsides as well, the most pertinent of which was that the new architecture was a lot more complicated than the old one, with a lot of moving parts. This made it harder to manage and understand, and increased the number of different ways in which it could break.
Taking all of the above into account, when we reclaimed the stack we decided to keep the architecture intact, and just improve it.
The way in which the stack was described was not bad. It just wasn’t quite as controlled was the way we’d been handling our other environments, mostly as a factor of being created/maintained by a different team.
Configuration for the major components (Logstash Broker, Logstash Indexer, Elasticsearch, Kibana) was Source Controlled, with builds in TeamCity and deployment was handled by deploying Nuget packages through Octopus. This was good, and wouldn’t require much work to bring into line with the rest of our stuff. All we would have to do was ensure all of the pertinent deployment logic was encapsulated in the Git repositories and maybe add some tests.
The infrastructure needed some more effort. It was all defined using CloudFormation templates, which was excellent, but there was no build/deployment pipeline for the templates and they were not versioned. In order to put such a pipeline in place, we would need to have CI and Staging deployments of the infrastructure as well, which did not yet exist. The infrastructure definition for each layer also shared a repository with the relevant configuration (i.e. Broker Environment with Broker Config), which was against our existing patterns. Finally, the cache/queue layer did not have an environment definition at all, because the current one (Elasticache w. Redis) had been manually created to replace the original one (RabbitMQ) as a result of some issues where the cache/queue filled up and then became unrecoverable.
In addition to the above, once we’ve improved all of the processes and got everything under control, we need to work on fixing the actual bugs in the stack (like Logstash logs filling up disks, Elasticsearch mapping templates not being setup correctly, no alerting/monitoring on the various layers, etc). Some of these things will probably be fixed as we make the process improvements, but others will require dedicated effort.
To Be Continued
With the total scope of the work laid out (infrastructure build/deploy, clean up configuration deployment, re-create infrastructure in appropriate AWS accounts, fix bugs) its time to get cracking.
The first cab off the rank is the work required to create a process that will allow us to fully automate the build and deployment of the infrastructure. Without that sort of system in place, we would have to do all of the other things manually.
The Broker layer is the obvious starting point, so next week I’ll outline how we went about using a combination of TeamCity, Nuget, Octopus and Powershell to accomplish a build and deployment pipeline for the infrastructure.