Last time I outlined the start of setting up an ELK based Log Aggregator via CloudFormation. I went through some of the problems I had with executing someone else's template (permissions! proxies!), and then called it there, because the post was already a mile long.
Now for the thrilling conclusion where I talk about errors that mean nothing to me, accepting defeat, rallying and finally shipping some real data.
You Have Failed Me Again, Java
Once I managed to get all of the proxy issues sorted, and everything was being downloaded and installed properly, the instance was still not responding to HTTP requests over the appropriate ports. Well, it seemed like it wasn’t responding anyway.
Looking into the syslog, I saw repeated attempts to start Elasticsearch and Logstash, with an equal number of failures, where the process had terminated immediately and unexpectedly.
The main issue appeared to be an error about “Bad Page Map”, which of course makes no sense to me.
Looking it up, it appears as though there was an issue with the Linux kernel shipped with the version of Ubuntu that I was using (kernel 3.13? It's really hard to tell which version means what), and it was not actually specific to Java. I’m going to blame Java anyway. Apparently the issue is fixed in kernel 3.15.
After swapping the AMI to the latest distro of Ubuntu, the exceptions no longer appeared inside syslog, but the damn thing still wasn’t working.
I could get to the appropriate pages through the load balancer, which would redirect me to Google to supply OAuth credentials, but after supplying appropriate credentials, nothing else ever loaded. No Kibana, no anything. This meant of course that ES was (somewhat) working, as the load balancer was passing its health checks, and Kibana was (somewhat) working because it was at least executing the OAuth bit.
Do It Yourself
It was at this point that I decided to just take it back to basics and start from scratch. I realised that I didn’t understand some of the components being installed (LogCabin for example, something to do with the OAuth implementation?), and that they were getting in the way of me accomplishing a minimum viable product. I stripped all of the extra components out of the UserData script, looked up the latest compatible versions of ES, Logstash and Kibana, installed them, and started them as services. I had to make some changes to the CloudFormation template as well: ES defaults to port 9200 and Kibana 4 to 5601, so I had to expose the appropriate ports and make some minor changes to the health check (Logstash was fine).
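For illustration, the stripped-down UserData ended up doing something along these lines. This is a sketch, not the exact template: the package versions, download URLs and paths here are assumptions.

```yaml
Properties:
  UserData:
    Fn::Base64: |
      #!/bin/bash -xe
      # Elasticsearch (listens on 9200 by default)
      wget https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.7.1.deb
      dpkg -i elasticsearch-1.7.1.deb
      service elasticsearch start
      # Logstash
      wget https://download.elastic.co/logstash/logstash/packages/debian/logstash_1.5.3-1_all.deb
      dpkg -i logstash_1.5.3-1_all.deb
      service logstash start
      # Kibana 4 (ships with its own web server on 5601)
      wget https://download.elastic.co/kibana/kibana/kibana-4.1.1-linux-x64.tar.gz
      tar -xzf kibana-4.1.1-linux-x64.tar.gz -C /opt
      nohup /opt/kibana-4.1.1-linux-x64/bin/kibana &
```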
The latest version of Kibana is more self contained than previous ones, which is nice. It comes with its own web server, so all you have to do is start it up and it will listen and respond to requests on port 5601 (which can be changed). This is different to the version that I was originally working with (3?), which seemed to be hosted directly inside Elasticsearch? I’m still not sure what the original template was doing to be honest; all I know is that it didn’t work.
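For the record, the only settings I really needed to touch live in Kibana’s config/kibana.yml. The key names below are per Kibana 4, to the best of my recollection; the Elasticsearch address is an assumption for your setup.

```yaml
# config/kibana.yml - Kibana 4 serves itself on this port
port: 5601
# where Kibana goes to query Elasticsearch
elasticsearch_url: "http://localhost:9200"
```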
A Kibana dashboard, load balancers working, ES responding. Finally everything was up and running. I still didn’t fully understand it, but it was a hell of a lot more promising than it was before.
Now all I had to do was get some useful information into it.
Nxlog Sounds Like a Noise You Make When You Get Stabbed
There are a number of ways to get logs into ES via Logstash. Logstash itself can be installed on other machines and forward local logs to a remote Logstash, but it’s kind of heavyweight for that sort of thing. Someone has written a smaller component called Logstash-Forwarder which does a similar job. You can also write directly to ES using Logstash-compatible index names if you want (Serilog offers a sink that does just that).
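To make the “Logstash-compatible index names” bit concrete, here’s a rough Python sketch of what a sink like Serilog’s is doing under the hood: picking the daily index name and shaping the document. The naming convention and @timestamp field are the Logstash defaults as I understand them; the endpoint URL and field names are hypothetical.

```python
from datetime import datetime, timezone
import json

def logstash_index(ts: datetime) -> str:
    # Logstash's default convention: one index per UTC day, logstash-YYYY.MM.DD
    return ts.astimezone(timezone.utc).strftime("logstash-%Y.%m.%d")

def make_event(message: str, ts: datetime, **fields) -> dict:
    # A minimal Logstash-style document: @timestamp (ISO 8601) plus your own fields
    doc = {"@timestamp": ts.astimezone(timezone.utc).isoformat(), "message": message}
    doc.update(fields)
    return doc

ts = datetime(2015, 8, 5, 3, 30, tzinfo=timezone.utc)
print(logstash_index(ts))  # logstash-2015.08.05
# The document would then be POSTed to http://<es-host>:9200/logstash-2015.08.05/<type>
print(json.dumps(make_event("GET /api/things 200", ts, Application="Api")))
```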
The Logstash solutions above seem to assume that you are gathering logs on a Unix-based system though, and don’t really offer much in the way of documentation or help if you have a Windows-based system.
After a small amount of investigation, a piece of software called Nxlog appears to be the most commonly used log shipper as far as Windows is concerned.
As with everything in automation, I couldn’t just go onto our API hosting instances and install and configure Nxlog by hand. I had to script it, and then add those scripts to the CloudFormation template for our environment setup.
Installing Nxlog from the command line is relatively simple using msiexec and the appropriate “don't show the damn UI” flags, and configuring it is simple as well. All you need to do is have an nxlog.conf file configured with what you need (in my case, iis and application logs being forwarded to the Logstash endpoint) and then copy it to the appropriate conf folder in the installation directory.
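The install-and-configure step boils down to something like the following (a sketch only; the MSI file name, version and paths are assumptions, not the exact values from our scripts):

```
rem Install nxlog silently - /quiet /norestart are the "don't show the damn UI" flags
msiexec /i nxlog-ce-2.9.1347.msi /quiet /norestart
rem Drop the prepared configuration into the conf folder
copy /Y nxlog.conf "C:\Program Files (x86)\nxlog\conf\nxlog.conf"
rem Start the service
net start nxlog
```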
The nxlog configuration file takes some getting used to, but their documentation is pretty good, so it’s just a matter of working through it. The best tip I can give is to output to a file until you are sure that nxlog is doing what you think it’s doing, and only then flip everything over to output to Logstash. You’ll save a lot of frustration if you know exactly where the failures are (and believe me, there will be failures).
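By way of example, a pared-down nxlog.conf doing the file-output-first trick looks roughly like this. Paths, the Logstash hostname and the port are assumptions; double-check the module names against the nxlog documentation.

```
define ROOT C:\Program Files (x86)\nxlog

Moduledir %ROOT%\modules
CacheDir  %ROOT%\data
LogFile   %ROOT%\data\nxlog.log

<Extension json>
    Module xm_json
</Extension>

<Input iis>
    Module im_file
    File   "C:\\inetpub\\logs\\LogFiles\\W3SVC1\\u_ex*.log"
</Input>

# Write to a local file first, until you trust what nxlog is shipping...
<Output file_debug>
    Module om_file
    File   "C:\\logs\\nxlog-debug.txt"
</Output>

# ...then switch the route over to Logstash once it looks right.
<Output logstash>
    Module om_tcp
    Host   logstash.example.internal
    Port   5544
    Exec   to_json();
</Output>

<Route r>
    Path   iis => file_debug
</Route>
```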
After setting up Nxlog, it all started to happen! Stuff was appearing in Kibana! It was one of those magical moments where you actually get a result, and it felt good.
Types? We Need Types?
I got everything working nicely in my test environment, so I saved my configuration, tore down the environments and created them again (to verify they could be recreated). Imagine my surprise when I was getting Nxlog internal messages into ES, but nothing from IIS. I assumed that I had messed up Nxlog somehow, so I spent a few hours trying to debug what was going wrong. My Nxlog config appeared to be fine, so I assumed that there was something wrong with the way I had configured Logstash. Again, it seemed to be fine.
It wasn't until I looked into the Elasticsearch logs that I found out why all of my IIS logs were not making it. The first document sent to Elasticsearch had a field called EventReceivedTime (from the Nxlog internal source) which was a date, represented as ticks since X, i.e. a gigantic number. ES had inferred the type of this field as a long. The IIS source also had a field called EventReceivedTime, but its value was an actual date (i.e. YYYY-MM-DD HH:mm). When any IIS entry arrived in ES from Logstash, ES errored out trying to parse the datetime into a long, and discarded it. Because of the asynchronous nature of the system, there was no way for Logstash to communicate the failure back anywhere that I could see it.
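To make the failure mode concrete, here’s a toy Python model of ES’s dynamic mapping behaviour: the first document to use a field fixes its type, and later documents must conform. The inference rules are heavily simplified and the tick value is made up.

```python
from datetime import datetime

def infer_type(value):
    # Crude stand-in for ES's dynamic type inference
    if isinstance(value, int):
        return "long"
    if isinstance(value, str):
        try:
            datetime.strptime(value, "%Y-%m-%d %H:%M")
            return "date"
        except ValueError:
            return "string"
    return "unknown"

mapping = {}  # field name -> type fixed by the first document seen

def index_document(doc):
    """Returns True if accepted, False if a field's type conflicts with the mapping."""
    for field, value in doc.items():
        inferred = infer_type(value)
        existing = mapping.setdefault(field, inferred)
        if existing != inferred:
            return False  # ES would throw a mapping/parse exception here
    return True

# nxlog's internal log arrives first: EventReceivedTime as a tick count
print(index_document({"EventReceivedTime": 635744224000000000}))  # True
# then an IIS entry arrives with EventReceivedTime as a date string
print(index_document({"EventReceivedTime": "2015-08-05 03:30"}))  # False - discarded
```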
After making sure that both EventReceivedTimes were dates, everything worked swimmingly.
I suppose this might reoccur in the future, with a different field name conflicting. I’m not sure exactly what the best way to deal with this would be. Maybe the Elasticsearch logs should be put into ES as well? At least then I could track it. You could set up an index template to strongly type the fields as well, but due to the fluidity of ES, there are always going to be new fields, and ES will always try to infer an appropriate type for them, so having a template won’t stop the problem from occurring entirely.
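For what it’s worth, a template along these lines would pin the offending field. This is a sketch against the ES 1.x template API as I remember it; the format string is an assumption.

```json
{
  "template": "logstash-*",
  "mappings": {
    "_default_": {
      "properties": {
        "EventReceivedTime": {
          "type": "date",
          "format": "yyyy-MM-dd HH:mm||dateOptionalTime"
        }
      }
    }
  }
}
```

PUT this to http://<es-host>:9200/_template/logstash and every new logstash-* index picks it up, but it only protects the fields you thought to include ahead of time.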
Look at this dashboard.
Look at it.
I haven’t even begun to plumb the depths of the information now at my fingertips. Most of those charts are simple (average latency, total requests, response codes, requested paths), but it still provides a fantastic picture of the web applications in question.
Last time I wrote a CloudFormation template, I didn’t manage to get it into a publicly available repository, which kind of made the blog posts around it significantly less useful.
This time I thought ahead. You can find all of the scripts and templates for the Log Aggregator in this repository. This is a copy of our actual repository (private, in Bitbucket), so I’m not sure if I will be able to keep it up to date as we make changes, but at least there is some context to what I’ve been speaking about.
I’ve included the scripts that setup and configure Nxlog as well. These are actually located in the repository that contains our environment setup, but I think they are useful inside this repository as a point of reference for setting up log shipping on a Windows system. Some high level instructions are available in the readme of the repository.
Having a Log Aggregator, even though it only contains IIS and application logs for a single application, has already proved useful. It adds a huge amount of transparency to what is going on, and Kibana’s visualisations are extremely helpful in making sense of the data.
Now to do some load testing on the web application and see what Kibana looks like when everything breaks.