
So, last week was mostly an introduction to .NET Core and AWS Lambda. Not a lot of meat.

This week I’ll be writing about the build and deployment pipeline that we set up, and how it ties in with our normal process.

Hopefully it will be a bit more nutritious.

The Story So Far

Over the last few years we’ve put together a fairly generic build and deployment process using Powershell that focuses on ensuring the process can be run painlessly on a fresh checkout of the repository. The goal is to have everything either embedded or bootstrapped, and to have any and all logic encapsulated within the scripts.

What this means is that we have a relatively simple setup in TeamCity, which is basically just parameter management and a small bit of logic for kicking off the scripts that live inside the repositories.

The same approach is extended to deployment as well; we have Powershell (or an equivalent scripting language) that does the actual deployment, and then only use Octopus Deploy for orchestration and variable management.

Because we need to build, test and deploy a variety of things (like C# desktop applications, Javascript websites, the old Javascript ELB Logs Processor and CloudFormation environments), the common scripts feature pluggable processes in the form of Powershell script blocks, allowing the user to choose what sort of build is needed (a simple Nuget package over some files? A full MSBuild execution that generates an .exe in a Nuget package?), how and when it is tested (NUnit over .dlls containing the phrase “Tests” before deployment? Selenium using Webdriver IO through npm after deployment?) and how it is deployed (pretty much just Octopus Deploy).

The API for the pluggable processes (which I’m going to call engines from this point forward) is relatively lax, in that the Powershell just calls the supplied script block with known parameters, and fails miserably if they don’t exist.
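
To make that a little more concrete, here’s a minimal sketch of what an engine looks like from the point of view of the common scripts (the parameter names and values here are purely illustrative, not our actual contract):

# A minimal sketch of a pluggable engine: just a Powershell script block that the common
# scripts invoke with a known set of parameters. Parameter names and values are illustrative.
$buildEngine = {
    param
    (
        [string]$repositoryRoot,
        [string]$outputDirectory,
        [string]$version
    )

    # Whatever "build" means for this particular repository goes here (msbuild, dotnet publish,
    # webpack, etc), finishing with a versioned Nuget package in $outputDirectory.
    Write-Verbose "Building [$repositoryRoot] at version [$version] into [$outputDirectory]"
}

# The common scripts don't care what the engine does internally; they just call it with the
# parameters they know about.
& $buildEngine -repositoryRoot "C:\repo" -outputDirectory "C:\repo\build-output" -version "1.2.3"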

In retrospect, we probably should have just used something already available, like psake, but it just didn’t seem necessary at the time. We’re probably going to get hurt if we try to do anything that only works on Linux systems (no Powershell there), but it hasn’t happened yet, so I’m sure it will never happen.

To summarise, we have a framework that allows for configurable approaches to build, test and deployment, and a .NET Core AWS Lambda function should be able to slot right in.

Vrmm Vrmm Goes The Engine

In order to reliably get the lambda function code from the git repository in which it lives, to the actual AWS Lambda function where it will be executed, we needed to do three things:

  1. Creation of an artefact containing a versioned and validated copy of the code to be executed
  2. The validation logic itself (i.e. testing)
  3. Deploying said artefact to the AWS Lambda function

Coincidentally, these three things line up nicely with the pluggable components in our generic build and deployment process, so all we have to do is provide an engine for each, then run the overall process, and everything should be sweet.

Bob The Builder

The first thing to deal with is the actual creation of something that can be deployed, which for us means a Nuget package.

Referring back to the Javascript version of the ELB Logs Processor, there really wasn’t a lot to the creation of the artefact. Use npm to make sure the appropriate dependencies are present, a quick consolidation step with webpack, then wrap everything up in a Nuget package.

For the C# version, it’s a similar story, except instead of using node/npm, we can use the dotnet executable available in the SDK, followed by the Nuget command line tool to wrap it all up neatly.

Basically, it looks like this:

dotnet restore {solution-path}
dotnet build -c Release {solution-path}
dotnet publish -o {output-path} -c Release {solution-path}
nuget pack {nuspec-path} -OutputDirectory {output-directory} -Version {version}

The actual script block for the engine is a little bit more complicated than that (TeamCity integration, bootstrapping the .NET Core SDK, error handling/checking, etc), but it’s still relatively straightforward.
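
Stripped of the TeamCity and bootstrapping noise, the shape of it is something like the following (a sketch only; the placeholder paths, values and the Invoke-BuildStep helper are illustrative, not our actual scripts):

# A simplified sketch of the build engine internals. The paths and values are placeholders,
# and Invoke-BuildStep is a hypothetical helper; the real engine also handles TeamCity
# integration and bootstrapping the .NET Core SDK and Nuget executables.
$dotnet = "C:\tools\dotnet-sdk\dotnet.exe"
$nuget = "C:\tools\nuget\nuget.exe"

$solutionPath = "C:\repo\src\Solution.sln"
$outputPath = "C:\repo\build-output\publish"
$nuspecPath = "C:\repo\src\Lambda.nuspec"
$outputDirectory = "C:\repo\build-output"
$version = "1.2.3"

function Invoke-BuildStep([string]$executable, [string[]]$arguments)
{
    # Run the external tool and fail the build immediately if it fails.
    & $executable @arguments | Write-Host
    if ($LASTEXITCODE -ne 0)
    {
        throw "[$executable $arguments] failed with exit code [$LASTEXITCODE]"
    }
}

Invoke-BuildStep $dotnet @("restore", $solutionPath)
Invoke-BuildStep $dotnet @("build", "-c", "Release", $solutionPath)
Invoke-BuildStep $dotnet @("publish", "-o", $outputPath, "-c", "Release", $solutionPath)
Invoke-BuildStep $nuget @("pack", $nuspecPath, "-OutputDirectory", $outputDirectory, "-Version", $version)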

As I mentioned last week, the only thing to watch out for here is to make sure that you’re targeting the appropriate runtime with your solution configuration, and not trying to build a standalone, self-contained executable. The code will actually be executed on a Linux machine behind the scenes, and AWS Lambda will make sure the .NET Core runtime is available, so all you have to worry about is your code. I’m pretty sure that if you try to supply a self-contained executable it will fail miserably.

Emotional Validation

The second thing to deal with is the validation of the artefact.

Similar to the build process, if we refer back to the Javascript ELB Logs Processor, we had no testing at all. Literally nothing. Our validation was “deploy it and see if the logs stop”.

Don’t get me wrong, there are tonnes of ways to write tests and automate them when using node/npm, we just never went to the effort of actually doing it.

I’m much more familiar with C# (even taking into account the differences of .NET Core), and we don’t write code without tests, so this was an ideal opportunity to put some stuff together for validation purposes.

I’m going to dedicate an entire post to just how we went about testing the rewrite (some interesting stuff there about managing AWS credentials in a secure way), but at a high level we followed our standard pattern of having dedicated test projects that sit in parallel to the code (i.e. Lambda.Tests.Unit, Lambda.Tests.Integration) with automatic execution through the generic build process via an engine.

This particular engine is applied pre-deployment (i.e. before the Nuget package is pushed to Octopus and a project executed), and is pretty straightforward:

dotnet test {project-path} -c Release --no-build --logger {output-file-path}

The only caveat here was the general immaturity of .NET Core, as I had some problems using NUnit (which is the testing framework we typically use for C#), but XUnit worked perfectly fine. They aren’t that dissimilar, so it wasn’t a huge problem.

Tentacle Joke

The last thing to deal with is the actual deployment of the artefact to AWS.

Assuming step 1 and 2 have gone well (i.e. we have a versioned Nuget package that has been validated), all that’s left is to leverage Octopus Deploy.

For us this means a relatively simple project inside Octopus that just deploys the Nuget package to a worker tentacle, and a Powershell script embedded in the package that does the bulk of the work.

$awsKey = Get-OctopusParameter "AWS.Deployment.Key";
$awsSecret = Get-OctopusParameter "AWS.Deployment.Secret";
$awsRegion = Get-OctopusParameter "AWS.Deployment.Region";

$version = Get-OctopusParameter "Octopus.Release.Number";

$functionName = Get-OctopusParameter "AWS.Lambda.Function.Name";

$aws = Get-AwsCliExecutablePath

# The AWS CLI picks up credentials and region from these environment variables.
$env:AWS_ACCESS_KEY_ID = $awsKey
$env:AWS_SECRET_ACCESS_KEY = $awsSecret
$env:AWS_DEFAULT_REGION = $awsRegion

# Resolve the directory this script is executing from (i.e. the root of the extracted package).
$here = Split-Path -Parent $MyInvocation.MyCommand.Path
$functionPath = "$here\function"

Write-Verbose "Compressing lambda code file"
Add-Type -AssemblyName System.IO.Compression.FileSystem
[System.IO.Compression.ZipFile]::CreateFromDirectory($functionPath, "index.zip")

Write-Verbose "Updating Log Processor lambda function [$functionName] to version [$version]"
(& $aws lambda update-function-configuration --function-name $functionName --runtime "dotnetcore1.0" --handler "Lambda::Handler::Handle") | Write-Verbose
(& $aws lambda update-function-code --function-name $functionName --zip-file fileb://index.zip) | Write-Verbose
(& $aws lambda update-function-configuration --function-name $functionName --description $version) | Write-Verbose

In summary, the script takes the contents of the Nuget package that was just deployed, zips up the relevant parts of it (in this case, everything under the “function” directory) and then uses the AWS CLI and some Octopus parameters to push the zip file to a known Lambda function.

Really the only interesting thing here is that we take the version from the Nuget package and apply it directly to the function in AWS for backwards traceability.

To Be Continued

The steps and processes that I’ve outlined above only describe the key parts of the build and deployment process. A lot of the detail is missing, and that is on purpose, because the entire thing is quite complicated and describing it comprehensively would make this blog post into more of a novel.

It’s already pretty long.

The key things to take away are the .NET Core build and publish process (using the dotnet cmdline application), the .NET Core testing process (again, through the cmdline application) and the simplest way that I’ve found to actually push code to a Lambda function in an automated fashion.

Next week I’ll elaborate on the testing of the lambda function, specifically around managing and using AWS credentials to provide an end to end test of function logic.


Over two months ago I commented that I was going to rewrite our ELB Logs Processor in C#. You know, a language that I actually like and respect.

No, it didn’t take me two months to actually do the rewrite (what a poor investment that would have been), I finished it up a while ago. I just kept forgetting to write about it.

Thus, here we are.

HTTP Is The Way For Me

To set the stage, let’s go back a step.

The whole motivation for even touching the ELB Logs Processor again was to stop it pushing all of its log events to our Broker via TCP. The reasoning here was that TCP traffic is much harder to reason about, and Logstash has much better support around tracking, retrying and otherwise handling HTTP traffic than it does for TCP, so the switch makes sense as a way to improve the overall health of the system.

I managed to flip the Javascript ELB Logs Processor to HTTP using the Axios library (which is pretty sweet), but it all fell apart when I pushed it into production, and it failed miserably with what looked like memory issues.

Now, this is the point where a saner person might have just thrown enough memory at the Lambda function to get it to work and called it a day. Me? I’d had enough of Javascript at that point, so I started looking elsewhere.

At this point I was well aware that AWS Lambda offers execution of C# via .NET Core, as an alternative to Node.js.

Even taking into account my inexperience with doing anything at all using .NET Core, C# in Lambda was still infinitely preferable to Javascript. The assumption was that the ELB Logs Processor doesn’t actually do much (handle incoming S3 event, read file, send log events to Logstash), so rewriting it shouldn’t take too much effort.

Thus, here we are.

You Got .NET Core In My Lambda!

The easiest way to get started with using C# via .NET Core in AWS Lambda is to install the latest version of the AWS Toolkit and then just use Visual Studio. The toolkit provides a wealth of functionality for working with AWS in Visual Studio, including templates for creating Lambda functions as well as functionality for publishing your code to the place where it will actually be executed.

Actually setting up the Lambda function to run .NET Core is also relatively trivial. Assuming you’ve already set up a Lambda function with the appropriate listeners, it’s literally a combo box in the AWS Dashboard where you select the language/runtime version that your code requires.

Using the integrated toolkit and Visual Studio is a great way to get a handle on the general concepts in play and the components that you need to manage, but it’s not how you should do it in a serious development environment, especially when it comes to publishing.

A better way to do it is to go for something a bit more portable, like the actual .NET Core SDK, which is available for download as a simple compressed archive, with no installation required. Once you’ve got a handle on the command line options, it’s the best way to put something together that will also work on a build server of some description.

I’ll talk about the build and deployment process in more detail in the second post in this series, but for now, it’s enough to know that you can use the combination of the command line tools from the .NET Core SDK along with Visual Studio to create, develop, build, package and publish your .NET Core AWS Lambda function without too much trouble.

At that point you can choose to either create your Lambda function manually, or automate its creation through something like CloudFormation. It doesn’t really matter from the point of view of the function itself.
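
For reference, manually creating the function via the AWS CLI looks something like this (a sketch only; the function name, role ARN, handler string, zip path, timeout and memory size are all placeholders):

# A rough sketch of creating the function via the AWS CLI; every value here is a placeholder.
# The IAM role must already exist and grant the usual Lambda execution permissions, and the
# handler string follows the format described a little further down.
$functionName = "elb-logs-processor"
$roleArn = "arn:aws:iam::123456789012:role/lambda-execution-role"

& aws lambda create-function `
    --function-name $functionName `
    --runtime "dotnetcore1.0" `
    --role $roleArn `
    --handler "Lambda::Lambda.Handler::Handle" `
    --zip-file "fileb://index.zip" `
    --timeout 300 `
    --memory-size 256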

You Got Lambda In My .NET Core!

From the perspective of C#, all you really need to write is a class with a handler function.

namespace Lambda
{
    public class Handler
    {
        public void Handle(System.IO.Stream e)
        {
            // do some things here
        }
    }
}

A stream is the most basic input type, leaving the handling of the incoming data entirely up to you, using whatever knowledge you have about its format and structure.

Once you have the code, you can compile and publish it to a directory of your choosing (to get artefacts that you can move around), and then push it to the AWS Lambda function. As long as you’ve targeted the appropriate runtime as part of the publish (i.e. a portable application targeting the same runtime that you specified in the AWS Lambda function configuration, not a standalone, self-contained one), the only other thing to be aware of is how to tell the function where to start execution.

Specifying the entry point for the function is done via the Handler property, which you can set via the command line (using the AWS CLI), via Powershell Cmdlets, or using the AWS dashboard. Regardless of how you set it, it’s in the format “{assembly}::{namespace}.{class}::{function}”, so if you decide to arbitrarily change your namespaces or class names, you’ll have to keep your lambda function in sync or it will stop working the next time you publish your code.

At this point you should have enough pieces in place that you can trigger whatever event you’re listening for (like a new S3 file appearing in a bucket) and track the execution of the function via CloudWatch or something similar.
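
If you just want to poke at it from the command line instead, the CloudWatch Logs CLI is enough to see whether anything is happening (a sketch; the function name is a placeholder, and I’m assuming the standard /aws/lambda/{function-name} log group naming):

# A quick way to eyeball recent executions (the function name is a placeholder). By default,
# Lambda writes its output to a CloudWatch Logs group named after the function.
$functionName = "elb-logs-processor"

& aws logs filter-log-events --log-group-name "/aws/lambda/$functionName" --limit 50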

Helpful Abstractions

If you look back at the simple function handler above, you can probably conclude that working with raw streams isn’t the most efficient way of handling incoming data. A far more common approach is to get AWS Lambda to deserialize your incoming event for you. This allows you to use an actually structured object of your choosing, ranging from the presupported AWS events (like an S3 event), through to a POCO of your own devising.

In order to get this working all you have to do is add a reference to the Amazon.Lambda.Serialization.Json nuget package, and then annotate your handler with the appropriate attribute, like so:

using Amazon.Lambda.Core;    // home of the LambdaSerializer attribute

namespace Lambda
{
    public class Handler
    {
        [LambdaSerializer(typeof(Amazon.Lambda.Serialization.Json.JsonSerializer))]
        public void Handle(Amazon.Lambda.S3Events.S3Event e)
        {
            // Do stuff here
        }
    }
}

In the function above, I’ve also made use of the Amazon.Lambda.S3Events nuget package in order to strongly type my input object, because I know that the only listener configured will be for S3 events.

I’m actually not sure what you would have to do to create a function that handles multiple input types, so I assume the model is to just fall back to a more generic input type.

Or maybe to have multiple functions or something.

To Be Continued

I think that’s as good a place as any to stop and take a breath.

Next week I’ll continue on the same theme and talk about the build, package and publish process that we put together in order to actually get the code deployed.


In last week’s post I explained how we would have liked to do updates to our Elasticsearch environment using CloudFormation. Reality disagreed with that approach, and we encountered timing problems as a result of the ES cluster and CloudFormation not talking with one another during the update.

Of course, that means that we need to come up with something ourselves to accomplish the same result.

Move In, Now Move Out

Obviously the first thing we have to do is turn off the Update Policy for the Auto Scaling Groups containing the master and data nodes. With that out of the way, we can safely rely on CloudFormation to update the rest of the environment (including the Launch Configuration describing the EC2 instances that make up the cluster), safe in the knowledge that CloudFormation is ready to create new nodes, but will not until we take some sort of action.

At that point it’s just a matter of controlled node replacement using the auto healing capabilities of the cluster.

If you terminate one of the nodes directly, the AWS Auto Scaling Group will react by creating a replacement EC2 instance, and it will use the latest Launch Configuration for this purpose. When that instance starts up it will get some configuration deployed to it by Octopus Deploy, and shortly afterwards will join the cluster. With a new node in play, the cluster will react accordingly and rebalance, moving shards and replicas to the new node as necessary until everything is balanced and green.

This sort of approach can be written in just about any scripting language; our poison of choice is Powershell, which was then embedded inside the environment nuget package to be executed whenever an update occurs.

I’d copy the script here, but it’s pretty long and verbose, so here is the high level algorithm instead (with a rough sketch of the data node loop after the list):

  • Iterate through the master nodes in the cluster
    • Check the version tag of the EC2 instance behind the node
    • If equal to the current version, move on to the next node
    • If not equal to the current version
      • Get the list of current nodes in the cluster
      • Terminate the current master node
      • Wait for the cluster to report that the old node is gone
      • Wait for the cluster to report that the new node exists
  • Iterate through the data nodes in the cluster
    • Check the version tag of the EC2 instance behind the node
    • If equal to the current version, move on to the next node
    • If not equal to the current version
      • Get the list of current nodes in the cluster
      • Terminate the current data node
      • Wait for the cluster to report that the old node is gone
      • Wait for the cluster to report that the new node exists
      • Wait for the cluster to go yellow (indicating rebalancing is occurring)
      • Wait for the cluster to go green (indicating rebalancing is complete). This can take a while, depending on the amount of data in the cluster
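
As a very rough sketch, the data node half of that loop looks something like this in Powershell (the Get-DataNodeInstances helper, URL, tag name and timeouts are all stand-ins for our actual implementation):

# A heavily simplified sketch of the data node replacement loop. Get-DataNodeInstances is a
# hypothetical helper (ours queries the Auto Scaling Group for its EC2 instances), and the
# URL, tag name and timeouts are placeholders.
$elasticsearchUrl = "http://elasticsearch.internal:9200"
$currentVersion = "1.2.3"

function Get-ClusterHealth
{
    return Invoke-RestMethod -Uri "$elasticsearchUrl/_cluster/health"
}

function Wait-ForNodeCount([int]$count)
{
    while ((Get-ClusterHealth).number_of_nodes -ne $count) { Start-Sleep -Seconds 10 }
}

function Wait-ForClusterStatus([string]$status, [int]$timeoutSeconds)
{
    # The health API can block until the desired status is reached (or the timeout expires).
    Invoke-RestMethod -Uri "$elasticsearchUrl/_cluster/health?wait_for_status=$status&timeout=$($timeoutSeconds)s" | Out-Null
}

foreach ($instance in (Get-DataNodeInstances))
{
    $instanceVersion = ($instance.Tags | Where-Object { $_.Key -eq "Version" }).Value
    if ($instanceVersion -eq $currentVersion) { continue }

    $nodesBefore = (Get-ClusterHealth).number_of_nodes

    # Terminate the old node; the Auto Scaling Group replaces it using the latest Launch Configuration.
    Remove-EC2Instance -InstanceId $instance.InstanceId -Force

    Wait-ForNodeCount ($nodesBefore - 1)   # the old node has left the cluster
    Wait-ForNodeCount $nodesBefore         # the new node has joined

    Wait-ForClusterStatus "yellow" 300     # rebalancing has started
    Wait-ForClusterStatus "green" 7200     # rebalancing is complete (this is the slow part)
}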

As you can see, there isn’t really all that much to the algorithm, and the hardest part of the whole thing is knowing that you should wait for the node to leave/join the cluster and for the cluster to rebalance before moving on to the next replacement.

If you don’t do that, you risk destroying the cluster by taking away too many of its parts before it’s ready (which was exactly the problem with leaving the update to CloudFormation).

Hands Up, Now Hands Down

For us, the most common reason to run an update on the ELK environment is when there is a new version of Elasticsearch available. Sure we run updates to fix bugs and tweak things, but those are generally pretty rare (and will get rarer as time goes on and the stack gets more stable).

As a general rule of thumb, assuming you don’t try to jump too far all at once, new versions of Elasticsearch are pretty easily integrated.

In fact, you can usually have nodes in your cluster at the new version while there are still active nodes on the old version, which is nice.

There are at least two caveats that I’m aware of though:

  • The latest version of Kibana generally doesn’t work when you point it towards a mixed cluster. It requires that all nodes are running the same version.
  • If new indexes are created in a mixed cluster, and the primary shards for those indexes live on nodes with the latest version, nodes with the old version cannot be assigned the replicas.

The first one isn’t too problematic. As long as we do the upgrade overnight (unattended), no-one will notice that Kibana is down for a little while.

The second one is a problem though, especially for our production cluster.

We use hourly indexes for Logstash, so a new index is generally created every hour or so. Unfortunately it takes longer than an hour for the cluster to rebalance after a node is replaced.

This means that the cluster is almost guaranteed to be stuck in the yellow status (indicating unassigned shards, in this case the replicas from the new index that cannot be assigned to the old nodes), so our whole process of “wait for green before continuing” is not going to work properly when we do a version upgrade on the environment that actually matters: production.

Lucky for us, the API for Elasticsearch is pretty amazing, and allows you to get all of the unassigned shards, along with the reason why they were unassigned.

What this means is that we can keep our process the same, and when the “wait for green” part of the algorithm times out, we can check to see whether or not the remaining unassigned shards are just version conflicts, and if they are, just move on.
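
As a rough sketch of what that check looks like (the URL and the set of “acceptable” reasons are placeholders; the exact values depend on the Elasticsearch version in play), the cat shards API will happily tell you the state and unassigned reason for every shard:

# A rough sketch of the check that runs when "wait for green" times out. The URL is a
# placeholder, and the set of "acceptable" unassigned reasons is illustrative; the exact
# values depend on the Elasticsearch version being upgraded from/to.
$elasticsearchUrl = "http://elasticsearch.internal:9200"

# The cat shards API can report the state and unassigned reason for every shard in the cluster.
$shards = Invoke-RestMethod -Uri "$elasticsearchUrl/_cat/shards?format=json&h=index,shard,prirep,state,unassigned.reason"

$unassigned = @($shards | Where-Object { $_.state -eq "UNASSIGNED" })
$reasons = $unassigned | Select-Object -ExpandProperty "unassigned.reason" -Unique
Write-Verbose "[$($unassigned.Length)] unassigned shards remain, for reasons [$($reasons -join ", ")]"

# If everything left unassigned is just a replica that can't go onto an old-version node,
# it's safe to move on to the next node; anything else means something has actually gone wrong.
$acceptable = @("INDEX_CREATED", "REPLICA_ADDED")   # illustrative values only
$safeToContinue = -not ($unassigned | Where-Object { $_."unassigned.reason" -notin $acceptable })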

Works like a charm.

Tell Me What You’re Gonna Do Now

The last thing that we need to take into account during an upgrade is related to Octopus Tentacles.

Each Elasticsearch node that is created by the Auto Scaling Group registers itself as a Tentacle so that it can have the Elasticsearch configuration deployed to it after coming online.

With us terminating nodes constantly during the upgrade, we generate a decent number of dead Tentacles in Octopus Deploy, which is not a situation you want to be in.

The latest versions (3+ I think) of Octopus Deploy allow you to automatically remove dead tentacles whenever a deployment occurs, but I’m still not sure how comfortable I am with that piece of functionality. It seems like if your Tentacle is dead for a bad reason (i.e. it’s still there, but broken) then you probably don’t want to just clean it up and keep on chugging along.

At this point I would rather clean up the Tentacles that I know to be dead because of my actions.

As a result of this, one of the outputs from the upgrade process is a list of the EC2 instances that were terminated. We can easily use the instance name to look up the Tentacle in Octopus Deploy, and remove it.
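
A rough sketch of that cleanup looks something like the following, assuming the standard Octopus REST API endpoints for machines (the URL, API key and naming convention are placeholders, not our actual setup):

# A rough sketch of removing a known-dead Tentacle via the Octopus REST API.
# The server URL, API key and machine naming convention are placeholders.
$octopusUrl = "https://octopus.internal"
$headers = @{ "X-Octopus-ApiKey" = "API-XXXXXXXXXXXXXXXX" }

$terminatedInstanceName = "i-0123456789abcdef0"   # one of the outputs of the upgrade process

# Find the machine registered with the same name as the terminated EC2 instance and delete it.
$machines = Invoke-RestMethod -Uri "$octopusUrl/api/machines/all" -Headers $headers
$dead = @($machines | Where-Object { $_.Name -eq $terminatedInstanceName })

foreach ($machine in $dead)
{
    Write-Verbose "Removing dead Tentacle [$($machine.Name)] with id [$($machine.Id)]"
    Invoke-RestMethod -Uri "$octopusUrl/api/machines/$($machine.Id)" -Headers $headers -Method Delete | Out-Null
}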

Conclusion

What we’re left with at the end of this whole adventure is a fully automated process that allows us to easily deploy changes to our ELK environment and be confident that not only have all of the AWS components updated as we expect them to, but that Elasticsearch has been upgraded as well.

Essentially exactly what we would have had if the CloudFormation update policy had worked the way that I initially expected it to.

Speaking of which, it would be nice if AWS gave a little bit more control over that update policy (like timing, or waiting for a signal from a system component before moving on), but you can’t win them all.

Honestly, I wouldn’t be surprised if there was a way to override the whole thing with a custom behaviour, or maybe a custom CloudFormation resource or something, but I wouldn’t even know where to start with that.

I’ve probably run the update process around 10 times at this point, and while I usually discover something each time I run it, each tweak makes it more and more stable.

The real test will be what happens when Elastic.co releases version 6 of Elasticsearch and I try to upgrade.

I foresee explosions.


It’s been a little while since I made a post on Elasticsearch. Time to remedy that.

Our log stack has been humming along relatively well ever since we took control of it. It’s not perfect, but it’s much better than it was.

One of the nicest side effects of the restructure has been the capability to test our changes in the CI/Staging environments before pushing them into Production. It’s saved us from a few boneheaded mistakes already (mostly just ES configuration blunders), which has been great to see. It does make pushing things into the environment we actually care about a little bit slower than it otherwise would be, but I’m willing to make that tradeoff for a bit more safety.

When I was putting together the process for deploying our log stack (via Nuget, Powershell and Octopus Deploy), I tried to keep in mind what it would be like when I needed to deploy an Elasticsearch version upgrade. To be honest, I thought I had a pretty good handle on it:

  • Make an AMI with the new version of Elasticsearch on it
  • Change the environment definition to reference this new AMI instead of the old one
  • Deploy the updated package, leveraging the Auto Scaling Group instance replacement functionality
  • Dance like no-one is watching

The dancing part worked perfectly. I am a graceful swan.

The rest? Not so much.

Rollin’, Rollin’

I think the core issue was that I had a little bit too much faith in Elasticsearch to react quickly and robustly in the face of random nodes dying and being replaced.

Don’t get me wrong, it’s pretty amazing at what it does, but there are definitely situations where it is understandably incapable of adjusting and balancing itself.

Case in point, the process that occurs when an AWS Auto Scaling Group starts doing a rolling update because the definition of its EC2 instance launch configuration has changed.

When you use CloudFormation to initialize an Auto Scaling Group, you define the instances inside that group through a configuration structure called a Launch Configuration. This structure contains the definition of your EC2 instances, including the base AMI, security groups, tags and other meta information, along with any initialization that needs to be performed on startup (user data, CFN init, etc).

Inside the Auto Scaling Group definition in the template, you decide what should be the appropriate reaction upon detecting changes to the launch configuration, which mostly amounts to a choice between “do nothing” or “start replacing the instances in a sane way”. That second option is referred to as a “rolling update”, and you can specify a policy in the template for how you would like it to occur.

For our environment, a new ES version means a new AMI, so theoretically, it should be a simple matter to update the Launch Configuration with the new AMI and push out an update, relying on the Auto Scaling Group to replace the old nodes with the new ones, and relying on Elasticsearch to rebalance and adjust as appropriate.

Not that simple unfortunately, as I learned when I tried to apply it to the ES Master and Data ASGs in our ELK template.

Whenever changes were detected, CloudFormation would spin up a new node, wait for it to complete its initialization (which was just machine up + octopus tentacle registration), then it would terminate an old node and rinse and repeat until everything was replaced. This happened for both the master nodes and data nodes at the same time (two different Auto Scaling Groups).

Elasticsearch didn’t stand a chance.

With no feedback loop between ES and CloudFormation, there was no way for ES to tell CloudFormation to wait until it had rebalanced the cluster, replicated the shards and generally recovered from the traumatic event of having a piece of itself ripped out and thrown away.

The end result? Pretty much every scrap of data in the environment disappeared.

Good thing it was a scratch environment.

Rollin’, Rollin’

Sticking with the whole “we should probably leverage CloudFormation” approach, I implemented a script to wait for the node to join the cluster and for the cluster to be green (bash scripting is fun!). The intent was that this script would be present in the baseline ES AMI, would be executed as part of the user data during EC2 instance initialization, and would essentially force the auto scaling process to wait for Elasticsearch to actually be functional before moving on.

This wreaked havoc with the initial environment creation though, as the cluster isn’t even valid until enough master nodes exist to elect a master (which is 3 in our case), so while it kind of worked for the updates, initial creation was broken.

Not only that, but in a cluster with a decent amount of data, the whole “wait for green” thing takes longer than the maximum time allowed for CloudFormation Auto Scaling Group EC2 instance replacements, which would cause the auto scaling to time out and the entire stack to fail.

So we couldn’t use CloudFormation directly.

The downside of that is that CloudFormation is really good at detecting changes and determining if it actually has to do anything, so not only did we need to find another way to update our nodes, we needed to find a mechanism that would safely know when that node update should be applied.

To Be Continued

That’s enough Elasticsearch for now I think, so next time I’ll continue with the approach we actually settled on.


Alerting on potential problems before they become real problems is something that we have been historically bad at. We’ve got a wealth of information available both through AWS and in our ELK Stack, but we’ve never really put together anything to use that information to notify us when something interesting happens.

In an effort to alleviate that, we’ve recently started using CloudWatch alarms to do some basic alerting. I’m not exactly sure why I shied away from them to begin with to be honest, as they are easy to setup and maintain. It might have been that the old way that we managed our environments didn’t lend itself to easy, non-destructive updates, making tweaking and creating alarms a relatively onerous process.

Regardless of my initial hesitation, when I was polishing the ELK stack environment recently I made sure to include some CloudWatch alarms to let us know if certain pieces fell over.

The alerting setup isn’t anything special:

  • Dedicated SNS topics for each environment (i.e. CI, Staging, Production, Scratch)
  • Each environment has two topics, one for alarms and one for notifications
  • The entire team is signed up to emails from both Staging and Production, but CI/Scratch are optional

Email topic subscriptions are enough for us for now, but we have plans to use the topics to trigger messages in HipChat as well.

I started with some basic alarms, like unhealthy instances > 0 for Elasticsearch/Logstash. Both of those components expose HTTP endpoints that can easily be pinged, so if they stop responding to traffic for any decent amount of time, we want to be notified. Other than some tweaks to the Logstash health check (I tuned it too tightly initially, so it was going off whenever a HTTP request took longer than 5 seconds), these alarms have served us well in picking up things like Logstash/Elasticsearch crashing or not starting properly.

As far as Elasticsearch is concerned, this is pretty much enough.

With Logstash being Logstash though, more work needed to be done.

Transport Failure

As a result of the issues I had with HTTP outputs in Logstash, we’re still using the Logstash TCP input/output combo.

The upside of this is that it works.

The downside is that sometimes the TCP input on the Broker side seems to crash and get into an unrecoverable state.

That wouldn’t be so terrible if it took Logstash with it, but it doesn’t. Instead, Logstash continues to accept HTTP requests, and just fails all of the TCP stuff. I’m not actually sure if the log events received via HTTP during this time are still processed through the pipeline correctly, but all of the incoming TCP traffic is definitely black holed.

As a result of the HTTP endpoint continuing to return 200 OK, the alarms I setup for unhealthy instances completely fail to pick up this particular issue.

In fact, because of the nature of TCP traffic through an ELB, and the relatively poor quality of TCP metrics, it can be very difficult to tell whether or not it’s working at a glance. Can’t use requests or latency, because they have no bearing on TCP traffic, and certainly can’t use status codes (obviously). Maybe network traffic, but that doesn’t seem like the best idea due to natural traffic fluctuations.

The only metric that I could find was “Backend Connection Errors”. This metric appears to be a measure of how many low level connection errors occurred between the ELB and the underlying EC2 instances, and seems like a good fit. Even better, when the Logstash TCP input falls over, it is this metric that actually changes, as all of the TCP traffic being forwarded through the ELB fails miserably.

One simple alarm initialization through CloudFormation later, and we were safe and secure in the knowledge that the next time the TCP input fell over, we wouldn’t find out about it 2 days later.

"BrokerLoadBalancerBackendConnectionErrorsAlarm": {
  "Type" : "AWS::CloudWatch::Alarm",
  "Properties" : {
    "AlarmName" : { "Fn::Join": [ "", [ "elk-broker-", { "Ref": "OctopusEnvironment" }, "-backend-errors" ] ] },
    "AlarmDescription" : "Alarm for when there is an increase in the backend connection errors on the Broker load balancer, typically indicating a problem with the Broker EC2 instances. Suggest restarting them",
    "AlarmActions" : [ { "Ref" : "AlarmsTopicARN" } ],
    "OKActions": [ { "Ref" : "AlarmsTopicARN" } ],
    "TreatMissingData": "notBreaching",
    "MetricName" : "BackendConnectionErrors",
    "Namespace" : "AWS/ELB",
    "Statistic" : "Maximum",
    "Period" : "60",
    "EvaluationPeriods" : "5",
    "Threshold" : "100",
    "ComparisonOperator" : "GreaterThanThreshold",
    "Dimensions" : [ {
      "Name" : "LoadBalancerName",
      "Value" : { "Ref" : "BrokerLoadBalancer" }
    } ]
  }
}

Mistakes Were Made

Of course, a few weeks later the TCP input crashed again and we didn’t find out for 2 days.

But why?

It turns out that the only statistic worth a damn for Backend Connection Errors when alerting is SUM, and I created the alarm on the MAXIMUM, assuming that it would act like requests and other metrics (which give you the maximum number of X that occurred during the time period). Graphing the maximum backend connection errors during a time period where the TCP input was broken gives a flat line at y = 1, which is definitely not greater than the threshold of 100 that I entered.

I switched the alarm to SUM and as far as I can see the next time the TCP input goes down, we should get a notification.

But I’ve been fooled before.

Conclusion

I’ll be honest, even though I did make a mistake with this particular CloudWatch alarm, I’m glad that we started using them.

Easy to implement (via our CloudFormation templates) and relatively easy to use, they provide an element of alerting on our infrastructure that was sorely missing. I doubt we will go about making thousands of alarms for all of the various things we want to be alerted on (like increases in 500s and latency problems), but we’ll definitely include a few alarms in each stack we create to yell at us when something simple goes wrong (like unhealthy instances).

To bring it back to the start, I think one of the reasons we’ve hesitated to use the CloudWatch alarms was because we were waiting for the all singing all dancing alerting solution based off the wealth of information in our log stack, but I don’t think that is going to happen in a hurry.

It’s been years already.