Logging is one of the most important components of a good piece of software.
That was a statement, not a question, and is not up for debate.
Logging enables you to do useful things, like identify usage patterns (helpful when deciding which sections of the application need the most attention), investigate failures (because there are always failures, and you need to be able to get to their root causes) and keep an eye on performance (which is a feature, no matter what anyone else tells you). Good logging enables a piece of software to be supported long after the software developers who wrote it have moved on, extending the life expectancy of the application and thus improving its return on investment.
It is a shame, really, that logging is not always treated like a first-class citizen. Often it is an afterthought, added later after some issue or failure proves that it would have been useful, and then barely maintained from that point forward.
Making sure your application has excellent logging is only the first part, though; you also need somewhere to put the logs so that the people who need them can access them.
The most common approach is to have logs output to a file, somewhere on the local file system relative to the location where the software is installed and running. Obviously this is better than not having logs, but only just barely. When you have log files locally, you are stuck in a reactive mindset, using the logs as a diagnostic tool when a problem is either observed or reported through some other channel (like a user complaining).
The better approach is to send the logs somewhere. Somewhere they can be watched, analysed and alerted on. You can be proactive when you have logs in a central location, finding issues before the users even notice and fixing them even faster.
I’m going to refer to that centralised place where the logs go as a Log Aggregator, although I’m not sure if that is the common term. It will do for now though.
Bucket O’ Logs
At my current job, we recently did some work to codify the setup of our environments. Specifically, we used AWS CloudFormation and PowerShell to set up an auto scaling group + load balancer (and supporting constructs) to be the home for a new API that we were working on.
When you have a single machine, you can usually make do with local logs, even if it's not the greatest of ideas (as mentioned above). When you have a variable number of machines whose endpoints are constantly shifting and changing, you really need a central location where you can keep an eye on the log output.
Thus I’ve spent the last week and a bit working on exactly that: implementing a log aggregator.
After some initial investigation, we decided to go with an ELK stack. ELK stands for Elasticsearch, Logstash and Kibana, three components that each serve a different purpose. Elasticsearch is a document database with strong analysis and search capabilities. Logstash is an ETL (Extract, Transform, Load) system, used for moving logs around as well as transforming and mutating them into appropriate structures to be stored in Elasticsearch. Kibana is a front-end visualisation and analysis tool that sits on top of Elasticsearch.
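To make the division of labour concrete, here is a minimal sketch of a Logstash pipeline of the kind such a stack runs: events arrive on an input, optionally get mutated by filters, and are written into Elasticsearch. The port and host below are placeholder assumptions, and the exact option names vary between Logstash versions.

```
# Hypothetical minimal Logstash pipeline (option names vary by version).
input {
  tcp {
    port  => 5514        # placeholder; anything that sends JSON lines
    codec => "json"
  }
}

filter {
  # Transform/mutate events here, e.g. normalise a timestamp field.
}

output {
  elasticsearch {
    host => "localhost"  # older Logstash style; newer versions use "hosts"
  }
}
```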
We went with ELK because a few other teams in the organisation had already experimented with it, so there was at least a little organisational knowledge floating around to exploit. Alas, the other teams had not treated their ELK stacks as anything reusable, so we still had to start from scratch in order to get anything up and running.
We did look at a few other options (Splunk, Loggly, Seq) but it seemed like ELK was the best fit for our organisation and needs, so that was what we went with.
As is my pattern, I didn’t just want to jam something together and call that our log aggregator, hacking away at a single instance or machine until it worked “enough”. I wanted to make sure that the entire process was codified and reproducible. I particularly liked the way in which we had done the environment setup using CloudFormation, so I decided that would be a good thing to aim for.
Luckily someone else had already had the same idea, so in the true spirit of software development, I stole their work to bootstrap my own.
Stole in the sense that they had published a public repository on GitHub containing a CloudFormation template that set up an ELK stack.
I cloned the repository, wrote a few scripts around executing the CloudFormation template and that was that. ELK stack up and running.
Ha! It’s never that easy.
Throughout the rest of this post, keep in mind that I haven't used a Unix-based operating system in anger in a long time. The ELK stack used an Ubuntu distro, so I was at a disadvantage from the word go. On the upside, having been using cmder a lot recently, I was far more comfortable inside a command line environment than I ever have been before. Certainly more comfortable than I was back when I last used Unix.
The structure of the CloudFormation template was fairly straightforward. There were two load balancers, backed by an auto scaling group. One of the load balancers was public, intended to expose Kibana. The other was internal (i.e. only accessible from within the specified VPC), intended to expose Logstash. There were some Route53 entries to give everything nice names, and the auto scaling group had a LaunchConfig to define the configuration of the instances themselves.
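As a rough sketch of that shape (in YAML for readability; the actual template was JSON, and all of the logical names here are illustrative, not taken from the repository):

```yaml
Resources:
  PublicKibanaLoadBalancer:       # internet-facing, exposes Kibana
    Type: AWS::ElasticLoadBalancing::LoadBalancer
  InternalLogstashLoadBalancer:   # only reachable from inside the VPC
    Type: AWS::ElasticLoadBalancing::LoadBalancer
  ElkLaunchConfig:                # AMI, instance size, UserData init script
    Type: AWS::AutoScaling::LaunchConfiguration
  ElkAutoScalingGroup:            # the instances behind both load balancers
    Type: AWS::AutoScaling::AutoScalingGroup
  KibanaDnsRecord:                # friendly Route53 names
    Type: AWS::Route53::RecordSet
  LogstashDnsRecord:
    Type: AWS::Route53::RecordSet
```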
The auto scaling group defaulted to a single instance, which is what I went with. I’ll look into scaling later, when it actually becomes necessary and we have many applications using the aggregator.
As I said earlier, the template didn’t just work straight out of the repository, which was disappointing.
The first issue I ran into was that the template called for the creation of an IAM role. The credentials I was using to execute the template did not have permissions to do that, so I simply removed it until I could get the appropriate permissions from our AWS account managers. It turns out I didn’t really need it anyway, as the only thing I needed to access using AWS credentials was an S3 bucket (for dependency distribution) which I could configure credentials for inside the template, supplied as parameters.
Removing the IAM role allowed me to successfully execute the template, and it eventually reached that glorious state of “Create Complete”. Yay!
It still didn’t work though. Booooo!
It’s Always a Proxy
The initial template assumed that the instance would be accessible over port 8080. The public load balancer relied on that fact, and its health check queried the __es path. The first sign that something was wrong was the load balancer marking the instance inside it as unhealthy, because it was failing that health check.
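For reference, an ELB health check of that shape looks roughly like this in a template (YAML for readability; the thresholds are illustrative, not the repository's actual values):

```yaml
HealthCheck:
  Target: HTTP:8080/__es   # expects the instance to answer on port 8080
  Interval: 30
  Timeout: 5
  HealthyThreshold: 2
  UnhealthyThreshold: 5
```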
Unfortunately, the instance was not configured to signal failure back to CloudFormation if its setup failed, so although CloudFormation had successfully created all of its resources, when I looked into the cloud-init-output.log file in /var/log, it turned out that large swathes of the init script (configured in the UserData section of the LaunchConfig) had simply failed to execute.
The issue here was that we require all internet access from within our VPC to go through a proxy. Obviously the instance was not configured to use the proxy (how could it be? It came from a public Git repo), so all communications to the internet were being blocked, including calls to apt-get and the download of various configuration files directly from Git.
Simple enough to fix: set the http_proxy and https_proxy environment variables to the appropriate value.
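The fix itself is only a few lines at the top of the init script. The proxy address below is a placeholder; note that some tools only look at the uppercase variants of these variables, so it is safest to set both, and that sudo strips environment variables unless you preserve them or configure apt's proxy separately.

```shell
#!/bin/bash
# Placeholder proxy address; substitute your organisation's real proxy.
PROXY="http://proxy.example.com:3128"

export http_proxy="$PROXY"
export https_proxy="$PROXY"
# Some tools only read the uppercase variants, so set those as well.
export HTTP_PROXY="$PROXY"
export HTTPS_PROXY="$PROXY"
```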
It was at this point that I also added a call to install the AWS CloudFormation components on the instance during initialisation, so that I could use cfn-signal to indicate failures. This at least gave me an indication of whether or not the instance had actually succeeded in its initialisation, without having to remote into the machine to look at the logs.
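The wiring for that is small. Below is a hypothetical fragment of a UserData script: it traps the script's exit and reports the exit code back to CloudFormation via cfn-signal (which requires a CreationPolicy or WaitCondition on the resource being signalled). The stack, resource and region names are made up; in a real template they would be injected into UserData via CloudFormation references. The guard means the helper degrades to a log line on a machine without the CloudFormation helpers installed.

```shell
#!/bin/bash
set -e

# Hypothetical names; a real template injects these into UserData.
STACK_NAME="elk-aggregator"
RESOURCE="ElkAutoScalingGroup"
REGION="ap-southeast-2"

signal_cfn() {
  local code="$1"
  if command -v cfn-signal > /dev/null 2>&1; then
    # Tell CloudFormation whether initialisation succeeded (0) or failed.
    cfn-signal -e "$code" --stack "$STACK_NAME" --resource "$RESOURCE" --region "$REGION"
  else
    echo "cfn-signal unavailable; would have signalled exit code $code"
  fi
}

# Report whatever exit code the init script finishes with.
trap 'signal_cfn $?' EXIT
```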
When working on CloudFormation templates, it's always useful to have some sort of repeatable test that you can run in order to execute the template, ideally from the command line. You don't want to have to go into the AWS Dashboard to do that sort of thing, and it's good to have some tests outside the template itself to check its external API. As I was already executing the template through PowerShell, it was a simple matter to include a Pester test that executed the template, checked that the outputs worked (the outputs being the Kibana and Logstash URLs) and then tore the whole thing down if everything passed.
At this point I also tried to setup some CloudWatch logs that would automatically extract the contents of the various initialization log files to a common location, so that I could view them from the AWS Dashboard when things were not going as expected. I did not, in the end, manage to get this working. The irony of needing a log aggregator to successfully setup a log aggregator was not lost on me.
Setting the environment variables fixed the majority of the proxy issues, but there was one other proxy-related problem left that I didn't pick up until much later. All of the Elasticsearch plugins were failing to install, for exactly the same reason: no proxy settings. Apparently Java does not read the system proxy settings (bad Java!), so I had to manually supply the proxy address and port to the call to the Elasticsearch plugin installation script.
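For the record, the fix looked something like the following. The plugin name and proxy details are examples, the exact property names the plugin script accepts vary between Elasticsearch versions (this is the 1.x-era form), and the call is guarded so it only runs where Elasticsearch is actually installed.

```shell
#!/bin/bash
# Placeholder proxy details; substitute your real proxy.
PROXY_HOST="proxy.example.com"
PROXY_PORT="3128"

# Java ignores http_proxy/https_proxy, so the proxy has to be handed
# to the plugin script as system properties instead.
PLUGIN_PROXY_OPTS="-DproxyHost=${PROXY_HOST} -DproxyPort=${PROXY_PORT}"

if [ -x /usr/share/elasticsearch/bin/plugin ]; then
  # Example plugin install; the real init script installed several.
  /usr/share/elasticsearch/bin/plugin $PLUGIN_PROXY_OPTS --install mobz/elasticsearch-head
fi
```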
The initialisation log now showed no errors, and everything appeared to be installed correctly.
But it still wasn’t working.
To Be Continued
Tune in next week for the thrilling conclusion, where I discover a bug caused by the specific combination of Ubuntu version and Java version, get fed up with the components being installed, start from scratch, and then struggle with Nxlog in order to get some useful information into the stack.