We’ve come a long way in our log aggregation journey. Don’t get me wrong, we still have a long way to go, but bit by bit, we’re getting better at it.
A good example of getting better, is the way in which we process our Elastic Load Balancer (ELB) logs. Over a year ago I put together a system for processing these logs into our ELK stack with way too many moving parts. It used Logstash (for processing) and Powershell (for downloading files from S3) hosted on an EC2 instance to aggregate ELB logs from S3 into to our ELK stack. Somewhat complicated in practice, but it worked, even if I was never particularly happy with it.
As is the way with these things though, because it did work, we’ve had no reason to revisit it, and we’re successfully applied the same approach to at least 3 other environments we’ve setup since.
It wasn’t without its share of problems though:
- The EC2 instances hosting the solution had a tendency to cap themselves at 100% CPU for long periods. They were initially t2.mediums, but they kept expending all of their CPU credits, so we had to upgrade them to m3.mediums, which was a 50% increase in cost ($US 115/month). Never did figure out exactly what needed all that CPU, but the leading theory was Logstash.
- For a while, the logs simply stopped processing after a period of time (days/weeks). This turned out to be an issue with accumulating memory dumps from Java as a result of Logstash crashing and NSSM automatically restarting it.
- These were the machines most vulnerable to the memory leak in Logstash that causes its TCP driver to accumulate non-paged memory on Windows AWS instances due to some driver problem.
Good old Logstash.
To turn the discussion back to getting better, we had the opportunity to revisit this process when building some new environments, using all of the knowledge and experience that we’d gained in the intervening period. I think we came up with a much more efficient and understandable solution, but it wasn’t without its share of difficulties, which makes for a good post.
One of the primary weaknesses in the previous approach for processing ELB logs was that it required an entire EC2 instance all to itself, for each environment that we spun up. We did this in order to keep each log processor isolated from the other and to allow us to be able to spin up an entirely self-contained environment without having to worry about some common machine that processed all of the logs in a bunch of different buckets.
Another weakness in the process that bothered me was that it had way too many moving parts. Sometimes you have to have a lot of moving parts working together in order to accomplish a goal, but you should always strive for simplicity, both from an operational point of view and from a maintenance point of view. Less is almost always better in software development.
AWS has come a long way since we jammed the initial solution together, so we decided to use this opportunity to simplify the process and experiment with some AWS tools that we don’t frequently use.
After some discussion, the we formed an idea of what we would like the new log processor to look like. We wanted to use Lambda to process the ELB logs as they were created, pushing them to the same Logstash ingress endpoint that we’ve been using consistently for the last year or so. The benefits we were expecting were a reduction in complexity (no need to have 4 different things working together), a reduction in running cost (mostly due to the removal of the EC2 instance) and a reduction in latency (the Lambda function would trigger whenever a file was written to the S3 bucket by the ELB, which meant no more polling for changes).
For those of you unfamiliar with Lamba, its a service offered by AWS that lets you configure code to run whenever a variety of events occurs. I’ve used it before to create a quicker S3 bucket clone, so if you want some more information, feel free to read up on that adventure.
In order to accomplish our goal, we would need to deal with 3 things:
- The configuration/creation of the Lambda function, which would need to be done via CloudFormation as part of our normal environment setup (i.e. spin up an environment, it comes with the Lambda function that will process any ELB logs encountered)
- The deployment of the code to the Lambda function via Octopus (because that’s how we roll)
Nothing particularly insane there, but definitely a few things that we’d never done before.
To Be Continued
In order to avoid creating a single monstrous post with more words than a small novel, I’m going to break it here.
Until then, may all your Lambda functions execute quickly and your S3 buckets not turn into ghosts.