Are you monitoring a production environment with a system that raises alerts when something strange happens?

Do you also have some sort of pre-production or staging environment where you smoke test changes before pushing them live?

Does the pre-production environment have exactly the same monitoring as the production environment?

It really should.

Silence Definitely Doesn’t Equal Consent

A few months back we were revisiting one of our older API’s.

We’d just made a release into the pre-production environment and verified that the API was doing what it needed to do with a quick smoke test. Everything seemed fine.

Promote to production and a few moments later a whole bunch of alarms went off as they detected a swathe of errors occurring in the backend. I honestly can’t remember what the nature of the errors were, but they were real problems that were causing subtle failures in the migration process.

Of course, we were surprised, because we didn’t receive any such indication from the pre-production environment.

When we dug into it a bit deeper though, exactly the same things were happening in pre-prod, the environment was just missing equivalent alarms.


I’m honestly not sure how we got ourselves into this particular situation, where there was a clear difference in behaviour between two environments that should be basically identical. Perhaps the production environment was manually tweaked to include different alarms? I’m not as familiar with the process used for this API as I am for others (Jenkins and Ansible vs TeamCity and Octopus Deploy), but regardless of the technology involved, its easy to accidentally fall into the trap of “I’ll just manually create this alarm here” when you’re in the belly of the beast during a production incident.

Thar be dragons that way though.

Ideally you should be treating your infrastructure as code and deploying it similarly to how you deploy your applications. Of course, this assumes you have an equivalent, painless deployment pipeline for your infrastructure, which can be incredibly difficult to put together.

We’ve had some good wins in the past with this approach (like our log stack rebuild), where the entire environment is encapsulated within a single Nuget package (using AWS CloudFormation), and then deployed using Octopus Deploy.

Following such a process strictly can definitely slow you down when you need to get something done fast (because you have to add it, test it, review it and then deploy it through the chain of CI, Staging, Production), but it does prevent situations like this from arising.

Our Differences Make Us Special

As always, there is at least one caveat.

Sometimes you don’t WANT your production and pre-production systems to have the same alarms.

For example, imagine if you had an alarm that fired when the traffic dropped below a certain threshold. Production is always getting enough traffic, such that a drop indicates a serious problem.

Your pre-production environment might not be getting as much traffic, and the alarm might always be firing.

An alarm that is always firing is a great way to get people to ignore all of the alarms, even when they matter.

By that logic, there might be good reasons to have some differences between the two environments, but they are almost certainly going to be exceptions rather than the rule.


To be honest, not having the alarms go off in the pre-production environment wasn’t exactly the worst thing that’s ever happened to us, but it was annoying.

Short term it was easy enough to just remember to check the logs before pushing to production, but the manual and unreliable nature of that check is why we have alarms altogether.

Machines are really good at doing repetitive things after all.


Another week, another post.

Stepping away the data synchronization algorithm for a bit, its time to explore some AWS weirdness..

Well, strictly speaking, AWS is not really all that weird. Its just that sometimes things happen, and they don’t immediately make any sense, and it takes a few hours and some new information before you realise “oh, that makes perfect sense”.

That intervening period can be pretty confusing though.

Todays post is about a disagreement between a Load Balancer and a CloudWatch alarm about just how healthy some EC2 instances were.

Schrödinger's Symptoms

At the start of this whole adventure, we got an alert email from CloudWatch saying that one of our Load Balancers contained some unhealthy instances.

This happens from time to time, but its one of those things that you need to get to the bottom of quickly, just in case its the first sign of something more serious.

In this particular case, the alert was for one of the very small number of services whose infrastructure we manually manage. That is, the service is hosted on hand crafted EC2 instances, manually placed into a load balancer. That means no auto scaling group, so no capability to self heal or scale if necessary, at least not without additional effort. Not the greatest situation to be in, but sometimes compromises must be made.

Our first step for this sort of thing is typically to log into the AWS dashboard and have a look around, starting with the entities involved in the alert.

After logging in and checking the Load Balancer though, everything seemed fine. No unhealthy instances were being reported. Fair enough, maybe the alarm had already reset and we just hadn’t gotten the followup email (its easy enough to forgot to configure an alert to send emails when the alarm returns back to the OK state).

But no, checking on the CloudWatch alarm showed that it was still triggered.

Two views on the same information, one says “BAD THING HAPPENING” the only says “nah, its cool, I’m fine”.

But which one was right?

Diagnosis: Aggregations

When you’re working with instance health, one of the most painful things is that AWS does not offer detailed logs showing the results of Load Balancer health checks. Sure, you can see whether or not there were any healthy/unhealthy instances, but if you want to know how the machines are actually responding, you pretty much just have to execute the health check yourself and see what happens.

In our case, that meant hitting the EC2 instances directly over HTTP with a request for /index.html.

Unfortunately (fortunately?) everything appeared to be working just fine, and each request returned 200 OK (which is what the health check expects).

Our suspicion then fell to the CloudWatch alarm itself. Perhaps we had misconfigured it and it wasn’t doing what we expected? Maybe it was detecting missing data as an error and then alarming on that. It might still indicate a problem of some sort, but would at least confirm that the instances were functional, which is what everything else appeared to be saying.

The alarm was correctly configured though, saying the equivalent of “alert when there is more than 0 unhealthy instances in the last 5 minutes”.

We poked around a bit into the other metrics available on the Load Balancer (request count, latency, errors, etc) and discovered that latency had spiked a bit, request count had dropped a bit and that there was a tonne of backend errors, so something was definitely wrong.

Returning back to the alarm we noticed that the aggregation on the data point was “average”, which meant that it was actually saying “alert when there is more than 0 unhealthy instances on average over the last 5 minutes”. Its not obvious what the average value of a health check is over time, but changing the aggregation to minimum showed that there were zero unhealthy instances over the same time period, and changing it to maximum showed that all four of the instances were unhealthy over the same time period.

Of course, this meant that the instances were flapping between up and down, which meant the health checks were sometimes failing and sometimes succeeding.

It was pure chance that when we looked at the unhealthy instances directly in the Load Balancer that it had never shown any, and similarly when we had manually hit the health endpoint it had always responded appropriately. The CloudWatch alarm remembered though, and the average of [healthy, healthy, unhealthy] was unhealthy as far as it was concerned, so it was alerting correctly.

Long story short, both views of the data were strictly correct, and were just showing slightly different interpretations.

Cancerous Growth

The root cause of the flapping was exceedingly high CPU usage on the EC2 instances in question, which was leading to timeouts.

We’d done a deployment earlier that day, and it had increased the overall CPU usage of two of the services hosted on those instances by enough that the machines were starting to strain with the load.

More concerning though was the fact that the deployment had only really spiked the CPU from 85% to 95-100%. We’d actually been running those particularly instances hot for quite a while.

In fact, looking back at the CPU usage over time, there was a clear series of step changes leading up to the latest increase, and we just hadn’t been paying attention.


It can be quite surprising when two different views on the same data suddenly start disagreeing, especially when you’ve never seen that sort of thing happen before.

Luckily for us, both views were actually correct, and it just took a little digging to understand what was going on. There was a moment there where we started thinking that maybe we’d found a bug in CloudWatch (or the Load Balancer), but realistically, at this point in the lifecycle of AWS, that sort of thing is pretty unlikely, especially considering that neither of those systems are exactly new or experimental.

More importantly, now we’re aware of the weirdly high CPU usage on the underlying EC2 instances, so we’re going to have to deal with that.

Its not like we can auto scale them to deal with the traffic.


Alerting on potential problems before they become real problems is something that we have been historically bad at. We’ve got a wealth of information available both through AWS and in our ELK Stack, but we’ve never really put together anything to use that information to notify us when something interesting happens.

In an effort to alleviate that, we’ve recently started using CloudWatch alarms to do some basic alerting. I’m not exactly sure why I shied away from them to begin with to be honest, as they are easy to setup and maintain. It might have been that the old way that we managed our environments didn’t lend itself to easy, non-destructive updates, making tweaking and creating alarms a relatively onerous process.

Regardless of my initial hesitation, when I was polishing the ELK stack environment recently I made sure to include some CloudWatch alarms to let us know if certain pieces fell over.

The alerting setup isn’t anything special:

  • Dedicated SNS topics for each environment (i.e. CI, Staging, Production, Scratch)
  • Each environment has two topics, one for alarms and one for notifications
  • The entire team is signed up to emails from both Staging and Production, but CI/Scratch are optional

Email topic subscriptions are enough for us for now, but we have plans to use the topics to trigger messages in HipChat as well.

I started with some basic alarms, like unhealthy instances > 0 for Elasticsearch/Logstash. Both of those components expose HTTP endpoints that can easily be pinged, so if they stop responding to traffic for any decent amount of time, we want to be notified. Other than some tweaks to the Logstash health check (I tuned it too tightly initially, so it was going off whenever a HTTP request took longer than 5 seconds), these alarms have served us well in picking up things like Logstash/Elasticsearch crashing or not starting properly.

As far as Elasticsearch is concerned, this is pretty much enough.

With Logstash being Logstash though, more work needed to be done.

Transport Failure

As a result of the issues I had with HTTP outputs in Logstash, we’re still using the Logstash TCP input/output combo.

The upside of this is that it works.

The downside is that sometimes the TCP input on the Broker side seems to crash and get into an unrecoverable state.

That wouldn’t be so terrible if it took Logstash with it, but it doesn’t. Instead, Logstash continues to accept HTTP requests, and just fails all of the TCP stuff. I’m not actually sure if the log events received via HTTP during this time are still processed through the pipeline correctly, but all of the incoming TCP traffic is definitely black holed.

As a result of the HTTP endpoint continuing to return 200 OK, the alarms I setup for unhealthy instances completely fail to pick up this particular issue.

In fact, because of the nature of TCP traffic through an ELB, and the relatively poor quality of TCP metrics, it can be very difficult to tell whether or not its working at a glance. Can’t use requests or latency, because they have no bearing on TCP traffic, and certainly can’t use status codes (obviously). Maybe network traffic, but that doesn’t seem like the best idea due to natural traffic fluctuations.

The only metric that I could find was “Backend Connection Errors”. This metric appears to be a measure of how many low level connection errors occurred between the ELB and the underlying EC2 instances, and seems like a good fit. Even better, when the Logstash TCP input falls over, it is this metric that actually changes, as all of the TCP traffic being forwarded through the ELB fails miserably.

One simple alarm initialization through CloudFormation later, and we were safe and secure in the knowledge that the next time the TCP input fell over, we wouldn’t find out about it 2 days later.

"BrokerLoadBalancerBackendConnectionErrorsAlarm": {
  "Type" : "AWS::CloudWatch::Alarm",
  "Properties" : {
    "AlarmName" : { "Fn::Join": [ "", [ "elk-broker-", { "Ref": "OctopusEnvironment" }, "-backend-errors" ] ] },
    "AlarmDescription" : "Alarm for when there is an increase in the backend connection errors on the Broker load balancer, typically indicating a problem with the Broker EC2 instances. Suggest restarting them",
    "AlarmActions" : [ { "Ref" : "AlarmsTopicARN" } ],
    "OKActions": [ { "Ref" : "AlarmsTopicARN" } ],
    "TreatMissingData": "notBreaching",
    "MetricName" : "BackendConnectionErrors",
    "Namespace" : "AWS/ELB",
    "Statistic" : "Maximum",
    "Period" : "60",
    "EvaluationPeriods" : "5",
    "Threshold" : "100",
    "ComparisonOperator" : "GreaterThanThreshold",
    "Dimensions" : [ {
      "Name" : "LoadBalancerName",
      "Value" : { "Ref" : "BrokerLoadBalancer" }
    } ]

Mistakes Were Made

Of course, a few weeks later the TCP input crashed again and we didn’t find out for 2 days.

But why?

It turns out that the only statistic worth a damn for Backend Connection Errors when alerting is SUM, and I created the alarm on the MAXIMUM, assuming that it would act like requests and other metrics (which give you the maximum number of X that occurred during the time period). Graphing the maximum backend connection errors during a time period where the TCP input was broken gives a flat line at y = 1, which is definitely not greater than the threshold of 100 that I entered.

I switched the alarm to SUM and as far as I can see the next time the TCP input goes down, we should get a notification.

But I’ve been fooled before.


I’ll be honest, even though I did make a mistake with this particular CloudWatch alarm, I’m glad that we started using them.

Easy to implement (via our CloudFormation templates) and relatively easy to use, they provide an element of alerting on our infrastructure that was sorely missing. I doubt we will go about making thousands of alarms for all of the various things we want to be alerted on (like increases in 500s and latency problems), but we’ll definitely include a few alarms in each stack we create to yell at us when something simple goes wrong (like unhealthy instances).

To bring it back to the start, I think one of the reasons we’ve hesitated to use the CloudWatch alarms was because we were waiting for the all singing all dancing alerting solution based off the wealth of information in our log stack, but I don’t think that is going to happen in a hurry.

Its been years already.


The way we structure our services for deployment should be familiar to anyone who’s worked with AWS before. An ELB (Elastic Load Balancer) containing one or more EC2 (Elastic Compute Cloud) instances, each of which has our service code deployed on it by Octopus Deploy. The EC2 instances are maintained by an ASG (Auto Scaling Group) which uses a Launch Configuration that describes how to create and initialize an instance. Nothing too fancy.

An ELB has logic in it to direct incoming traffic to its instance members, in order to balance the load across multiple resources. In order to do this, it needs to know whether or not a member instance is healthy, and it accomplishes this by using a health check, configured when the ELB is created.

Health checks can be very simple (does a member instance respond to a TCP packet over a certain port), or a little bit more complex (hit some API endpoint on a member instance via HTTP and look for non 200 response codes), but conceptually they are meant to provide an indication that the ELB is okay to continue to direct traffic to this member instance. If a certain number of status checks fail over a period of time (configurable, but something like “more than 5 within 5 minutes”, the ELB will mark a member instance as unhealthy and stop directing traffic to it. It will continue to execute the configured health check in the background (in case the instance comes back online because the problem was transitory), but can also be configured (when used in conjunction with an ASG) to terminate the instance and replace it with one that isn’t terrible.

Up until recently, out services have consistently used a /status endpoint for the purposes of the ELB health check.

Its a pretty simple endpoint, checking some basic things like “can I connect to the database” and “can I connect to S3”, returning a 503 if anything is wrong.

When we started using an external website monitoring service called Pingdom, it made sense to just use the /status endpoint as well.

Then we complicated things.

Only I May Do Things

Earlier this year I posted about how we do environment migrations. The contents of that post are still accurate (even though I am becoming less and less satisfied with the approach as time goes on), but we’ve improved the part about “making the environment unavailable” before starting the migration.

Originally, we did this by just putting all of the instances in the auto scaling group into standby mode (and shutting down any manually created databases, like ones hosted on EC2 instances), with the intent that this would be sufficient to stop customer traffic while we did the migration. When we introduced the concept of the statistics endpoint, and started comparing data before and after the migration, this was no longer acceptable (because there was no way for us to hit the service ourselves when it was down to everybody).

What we really wanted to do was put the service into some sort of mode that let us interact with it from an administrative point of view, but no-one else. That way we could execute the statistics endpoint (and anything else we needed access to), but all external traffic would be turned away with a 503 Service Unavailable (and a message saying something about maintenance).

This complicated the /status endpoint though. If we made it return a 503 when maintenance mode was on, the instances would eventually be removed from the ELB, which would make the service inaccessible to us. If we didn’t make it return a 503, the external monitoring in Pingdom would falsely report that everything was working just fine, when in reality normal users would be completely unable to use the service.

The conclusion that we came to was that we made a mistake using /status for two very different purposes (ELB membership and external service availability), even though they looked very similar at a glance.

The solutoion?

Split the /status endpoint into two new endpoints, /health for the ELB and /test for the external monitoring service.

Healthy Services

The /health endpoint is basically just the /status endpoint with a new name. Its response payload looks like this:

    "BuildVersion" : "1.0.16209.8792",
    "BuildTimestamp" : "2016-07-27T04:53:04+00:00",
    "DatabaseConnectivity" : true,
    "MaintenanceMode" : false

There’s not really much more to say about it, other than its purpose now being clearer and more focused. Its all about making sure the particular member instance is healthy and should continue to receive traffic. We can probably extend it to do other health related things (like available disk space, available memory, and so on), but it’s good enough for now.

Testing Is My Bag Baby

The /test endpoint though, is a different beast.

At the very least, the /test endpoint needs to return the information present in the /health endpoint. If /health is returning a 503, /test should too. But it needs to do more.

The initial idea was to simply check whether or not we were in maintenance mode and then just make the /test endpoint return a 503 when maintenance mode was on as the only differentiator between /health and /test. That’s not quite good enough though, as what we want conceptually is the /test endpoint to act more like a smoke test of common user interactions and just checking the maintenance mode flag doesn’t give us enough faith that the service itself is working from a user point of view.

So we added some tests that get executed when the endpoint is called, checking commonly executed features.

The first service we implemented the /test endpoint for, was our auth service, so the tests included things like “can I generate a token using a known user” and “can I query the statistics endpoint”, but eventually we’ll probably extend it to include other common user operations like “add a customer” and “list customer databases”.

Anyway, the payload for the /test endpoint response ended up looking like this:

    "Health" : {
        "BuildVersion" : "1.0.16209.8792",
        "BuildTimestamp" : "2016-07-27T04:53:04+00:00",
        "DatabaseConnectivity" : true,
        "MaintenanceMode" : false
    "Tests" : [{
            "Name" : "QueryApiStatistics",
            "Description" : "A test which returns the api statistics.",
            "Endpoint" : "/admin/statistics",
            "Success" : true,
            "Error" : null
        }, {
            "Name" : "GenerateClientToken",
            "Description" : "A test to check that a client token can be generated.",
            "Endpoint" : "/auth/token",
            "Success" : true,
            "Error" : null

I think it turned out quite informative in the end, and it definitely meets the need of detecting when the service is actually available for normal usage, as opposed to the very specific case of detecting maintenance mode.

The only trick here was that the tests needed to hit the API via HTTP (i.e. localhost), rather than just shortcutting through the code directly. The reason for this is that otherwise each test would not be an accurate representation of actual usage, and would give false results if we added something in the Nancy pipeline that modified the response before it got to the code that the test was executing.

Subtle, but important.


After going to the effort of clearly identifying the two purposes that the /status endpoint was fulfilling, it was pretty clear that we would be better served by having two endpoints rather than just one. It was obvious in retrospect, but it wasn’t until we actually encountered a situation that required them to be different that we really noticed.

A great side effect of this was that we realised we needed to have some sort of smoke tests for our external monitoring to access on an ongoing basis, to give us a good indication of whether or not everything was going well from a users point of view.

Now our external monitoring actually stands a chance of tracking whether or not users are having issues.

Which makes me happy inside, because that’s the sort of thing I want to know about.


Logstash is a harsh mistress. Master. Whatever gender you want to apply to it, its the sort of application that is so incredibly useful that you’re usually willing to overlook its various flaws and problems, as long as it just keeps shipping those sweet sweet log events into your aggregator. It doesn’t help when the only other alternative on Windows that I know of is Nxlog.

Don’t get me wrong, Nxlog is a fine, stable, established product. It does what it needs to do and it does it reliably. Its just that its a complete nightmare to configure. Logstash is just so easy to configure, with its nice JSON configuration files and its plugins. I can do just about anything in Logstash, but I struggled to do anything more than the basics with Nxlog.

I’ve blogged a couple of times about Logstash (the most recent being a post about a mysterious memory leak occurring on machines with Logstash installed and running), but the problem I’m about to describe is only really partially the fault of Logstash.

More specifically, its Java’s fault.

I’ve Got 99 Issues

Recently, we’ve noticed an annoying trend on the machines acting as log shippers (using Logstash, aggregating mostly ELB logs from S3 buckets). After a certain amount of time, they would just stop shipping logs. When we investigated, we discovered that the disks were full, which pretty quickly puts a stop to any sort of useful activities on Windows.

Our initial suspicions were that the log files themselves (of which there can be quite a lot) were simply not being cleaned up. It wouldn’t be the first time that we’d had a bug in our log file cleanup script and honestly, it probably won’t be the last. A execution of the following simple Powershell script was usually enough to get them up and running again, so we noted it as an issue and moved on to more important things.

Get-ChildItem –Path C:\logs –Recurse –Filter *.log | Where { $_.LastWriteTime –lt (Get-Date).AddDays(-7) } ) | Remove-Item –Force

But it kept happening.

Eventually we realised that something was pretty seriously wrong (and I got tired of manually running Powershell to restore functionality), so we put some more effort into finding the root cause.

The system disk was definitely filling up, but with what? Because we use headless Windows instances (who needs a UI?), it was a little more difficult than normal to find exactly what was taking up all of the disk (a combination of using the Sysinternals DU tool and using notepad to browse folders with its handy little Windows explorer equivalent), but in the end we found a lot of very large (500MB+) memory dump files, in the root directory of our Logstash installation.

It looked like Logstash was crashing (which it does) and on crash it was dumping all of its memory to a file. Now, we wrap our Logstash installations inside NSSM to turn them into Windows services (for exactly this reason), so NSSM would detect that the process had crashed and restart it. It would keep running for a while, and then crash again, repeating the cycle (a “while” in this context was usually a few days).

Over time, this meant that the memory dump files were just accumulating on the disk, eventually causing the machine to choke and die.

Fixed! Ship It?

Logstash is Java based, so I did a little digging and found that the most likely cause of .mdmp files was simply the JVM doing its thing when it crashed. In particular, there seemed to be one option which controlled whether or not a memory dump was made whenever the process crashed. Using the options supplied to the JVM, it seemed like it would be relatively easy to turn this option off (assuming that it was on by default). All you need to do is add –XX:-CreateMinidumpOnCrash to the process and everything should be fine.

The process in question was Logstash, and I had already added JVM options before (for proxy settings and memory limits) so I added the new option to JAVA_OPTS environment variable, which was where I had been putting the other ones. Doing a little reading, I discovered that documentation suggested using the LS_JAVA_OPTS environment variable instead (because it was additive with default options), so I switched to that and ran a test Logstash instance locally to check that everything seemed okay.

Using Sysinternals Process Explorer (procexp) I viewed the process tree of Logstash, showing that Powershell started a batch file, which started JRuby which started Java and so on.

None of my options had been applied to the Java process though…

Dear God Why

On Windows, Logstash runs via a batch file. This batch file in turn calls into another batch file, called setup. Setup is responsible for setting up various things (obviously), one of which is the JVM. Over time, I imagine that many people have experienced various issues with Logstash in a Windows environment, so being open source, they add intelligent default settings to the JVM so that other people don’t have to feel the same pain.

As I mentioned above, you are supposed to be able to append additional JVM settings to Logstash using the LS_JAVA_OPTS environment variable and completely override the options that Logstash uses by setting the JAVA_OPTS environment variable. Originally I didn’t realise that the JAVA_OPTS variable was supposed to be a complete override (which I didn’t want) so it was good that I had chosen to switch to LS_JAVA_OPTS.

The problem was, the usage of JAVA_OPTS and LS_JAVA_OPTS is not how the Windows version of Logstash works at all.

For starters, the Windows version does nothing with the LS_JAVA_OPTS environment variable. Nothing at all.

Secondarily, regardless of what you set JAVA_OPTS to, it will add a bunch of its own JAVA_OPTS (including duplicates of yours, which I assume results in only the latest one being used by the JVM) before it runs Logstash.

By switching to the LS_JAVA_OPTS environment variable, I had completely broken my custom JVM options.

I forked the elastic/logstash repository on Github, fixed the setup.bat file (if JAVA_OPTS, don’t add any, if LS_JAVA_OPTS, add those to the defaults) and created a pull request. Unfortunately, though Onthehouse already had a contributor agreement, but it didn’t seem to recognise me, so that particular pull request is sitting in some weird purgatory and I need to do something about it.

For my purposes though, because we wrap Logstash to make a package deployable through Octopus, I just edited the setup.bat that we use and pushed.

Once it was deployed, I checked the settings applied to the JVM and it was correctly showing my setting to not dump memory on a crash. Yay!

Of course, only time will tell whether or not it actually keeps dumping memory to disk on crash though.


The problem that I outlined above is one of those cases where you really do need to understand how something works before you use it in anger.

I had assumed that the behaviour of Logstash between Windows and Linux would be consistent, when in reality that was definitely not the case. It wasn’t until I really started digging into how Logstash starts, specifically on my OS (rather than just using it) that I realised the truth.

It wasn’t even difficult to fix, the functionality was simply not there, probably as a result of someone changing the Linux implementation at some distant point in the past and equal changes not being made to the Windows scripts. I’m surprised no-one noticed until now (or they might have noticed and just not fixed it in the elastic/logstash repository I suppose), but maybe that just means I’m the only person insane enough to run Logstash on Windows.

I hope not.

There is also the matter of Logstash constantly crashing, which is somewhat concerning, but that’s a task for another day.