
If anybody actually reads this blog, it would be easy to see that my world has been all about log processing and aggregation for a while now. I mean, I do other things as well, but the questions and problems that end up here on this blog have been pretty focused around logging for as far back as I care to remember. The main reason is that I just like logging, and appreciate the powerful insights that it can give about your software. Another reason is that we don’t actually have anyone who sits in the traditional “operations” role (i.e. infrastructure/environment improvement and maintenance), so in order to help keep the rest of the team focused on our higher level deliverables, I end up doing most of that stuff.

Anyway, I don’t see that pattern changing any time soon, so on with the show.

While I was going about regaining control over our log stack, I noticed that it was extremely difficult to reason about TCP traffic when using an AWS ELB.

Why does this matter?

Well, the Broker layer in our ELK stack (i.e. the primary ingress point) uses a Logstash configuration with a TCP input, and as a result, all of the things that write to the Broker (our externally accessible Ingress API, other instances of Logstash, the ELB Logs Processor) use TCP. That’s a significant amount of traffic, and something that I’d really like to be able to monitor and understand.
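For context, the TCP input on the Broker is only a few lines of Logstash configuration; something like the following minimal sketch (the port and codec here are illustrative, not our actual values):

input {
  tcp {
    port => 5000
    codec => json_lines
  }
}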

Stealthy Traffic

As far as I currently understand it, when you make a TCP connection through an ELB, the ELB records the initial creation of the connection as a “request” (one of the metrics you can track with CloudWatch) and then pretty much nothing else after that. I mean, this makes sense, as it’s the ELB’s job to essentially pick an underlying machine to route traffic to, and connections created and used specifically as TCP connections tend to be long lived (as opposed to TCP connections created as part of HTTP requests and responses).

As far as our three primary contributors are concerned:

  • The Logging Ingress API is pretty oblivious. It just makes a new TCP connection for each incoming log event, so unless the .NET Framework is silently caching TCP connections for optimization purposes, it’s going to cause one ELB request per log event.
  • The ELB Logs Processor definitely caches TCP connections. We went through a whole ordeal with connection pooling and reuse before it would function in production, so it’s definitely pushing multiple log events through a single socket.
  • The Logstash instances that we have distributed across our various EC2 instances (local machine log aggregation, like IIS and application logs) are using the Logstash TCP output. I assume it uses one (or many) long-lived connections to do its business, but I don’t really know. Logstash is very mysterious.

This sort of usage makes it very hard to tell just how many log events are coming through the system via CloudWatch, which is a useful metric, especially when things start to go wrong and you need to debug which part of the stack is actually causing the failure.

Unfortunately, the monitoring issue isn’t the only problem with using the Logstash TCP input/output. Both input and output have, at separate times, been…flakey. I’ve experienced both sides of the pipeline going down for no obvious reason, or simply not working after running uninterrupted for a long time.

The final nail in the coffin for TCP came recently, when Elastic.co released the Logstash Persistent Queue feature for Logstash 5.4.0, which does not work with TCP at all (it only supports inputs that use the request-response model). I want to use persistent queues to remove both the Cache and Indexer layers from our log stack, so it was time for TCP to die.
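For reference, persistent queues are switched on through logstash.yml; a minimal sketch might look like the following (the size limit is an arbitrary example, not a recommendation):

queue.type: persistent
queue.max_bytes: 4gb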

Socket Surgery

Adding an HTTP input to our Broker layer was easy enough. In fact, such an input was already present, because the ELB uses an HTTP request to check whether or not the Broker EC2 instances are healthy.

Conceptually, changing our Logstash instances to use an HTTP output instead of TCP should also be easy. Just change some configuration and deploy through Octopus. Keep in mind I haven’t actually done it yet, but it feels simple enough.
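As a rough sketch of what that looks like (hostnames and ports are placeholders, not our actual configuration), the Broker keeps an http input and the shippers swap their tcp output for the http output plugin:

# Broker (input side)
input {
  http {
    port => 8080
  }
}

# Shipper (output side)
output {
  http {
    url => "http://broker.example.com:8080"
    http_method => "post"
    format => "json"
  }
}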

In a similar vein, changing our Logging Ingress API to output through HTTP instead of TCP should also be easy. A small code change to use HttpClient or RestSharp or something, a new deployment and everything is puppies and roses. Again, I haven’t actually done it yet, so who knows what dragons lurk there.

Then we have the ELB Logs Processor, which is a whole different kettle of fish.

It took a significant amount of effort to get it working with TCP in the first place (connection pooling was the biggest problem), and due to the poor quality of the Javascript (entirely my fault), it’s pretty tightly coupled to that particular mechanism.

Regardless of the difficulty, TCP has to go, for the good of the log stack.

The first issue I ran into was “how do you even do HTTP requests in Node 4.3.2 anyway?”. There are many answers to this question, but the most obvious one is to use the HTTP API that comes with Node. Poking around this for a while showed that it wasn’t too bad, as long as I didn’t want to deal with a response payload, which I didn’t.

The biggest issue with the native Node HTTP API was that it was all callbacks, all the time. In my misadventures with the ELB Logs Processor I’d become very attached to promises and the effect they have on the readability of the resulting code, and didn’t really want to give that up so easily. I dutifully implemented a simple promise wrapper around our specific usage of the native Node HTTP API (which was just a POST of a JSON payload), and incorporated it into the Lambda function.
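Something like the following sketch captures the general idea (this isn’t the original code, and the host/port/path parameters are illustrative):

// A sketch of a promise wrapper around the native Node HTTP API, not the original code.
// All it needs to do is POST a JSON payload and report success or failure.
var http = require('http');

function postJson(host, port, path, payload) {
    return new Promise(function (resolve, reject) {
        var body = JSON.stringify(payload);
        var request = http.request({
            host: host,
            port: port,
            path: path,
            method: 'POST',
            headers: {
                'Content-Type': 'application/json',
                'Content-Length': Buffer.byteLength(body)
            }
        }, function (response) {
            // We don't care about the response payload, just whether or not it worked.
            response.resume();
            if (response.statusCode >= 200 && response.statusCode < 300) {
                resolve(response.statusCode);
            } else {
                reject(new Error('Unexpected status code [' + response.statusCode + ']'));
            }
        });

        request.on('error', reject);
        request.write(body);
        request.end();
    });
}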

Unfortunately, this is where my memory gets a little bit fuzzy (it was a few weeks ago), and I don’t really remember how well it went. I don’t think it went well, because I decided to switch to a package called Axios which offered promise based HTTP requests out of the box.

Axios of Evil

Axios was pretty amazing. Well, I mean, Axios IS pretty amazing, but I suppose that sentence gave it away that the relationship didn’t end well.

The library did exactly what it said it did, and with its native support for promises, it was relatively easy to incorporate into the existing code, as you can see from the following excerpt:

// snip, whole bunch of setup, including summary object initialization

let axios = require('axios').create({
    baseURL: 'http://' + config.logstashHost,
    headers: {'Content-Type': 'application/json'}
});

// snip, more setup, other functions

function handleLine(line) {
    summary.lines.encountered += 1;

    var entry = {
        Component: config.component,
        SourceModuleName: config.sourceModuleName,
        Environment: config.environment,
        Application: config.application,
        message: line,
        type: config.type,
        Source: {
            S3: s3FileDetails
        }
    };
        
    var promise = axios
        .post("/", entry)
        .then((response) => {
            summary.lines.sent += 1;
        })
        .catch((error) => { 
            summary.failures.sending.total += 1;
            if (summary.failures.sending.lastFew.length >= 5) {
                summary.failures.sending.lastFew.shift();
            }
            summary.failures.sending.lastFew.push(error);
        });

    promises.push(promise);
}

// snip, main body of function (read S3, stream to line processor, wait for promises to finish)

Even though it took a lot of effort to write, it was nice to remove all of the code relating to TCP sockets and connection pooling, as it simplified the whole thing.

The (single, manual) test proved that it still did its core job (contents of file written into the ELK stack), it worked in CI and it worked in Staging, so I was pretty happy.

For about 15 minutes that is, until I deployed it into Production.

Crushing Disappointment

Just like last time, the implementation simply could not deal with the amount of traffic that was being thrown at it. Even worse, it wasn’t actually logging any errors or giving me any indication as to why it was failing. After a brief and frustrating investigation, it looked like it was running out of memory (the Lambda function was only configured to use 192 MB, which had been enough for the TCP approach) and falling over once it hit that limit. That was only ever a hypothesis, and I was never able to conclusively prove it, but the function was definitely using all of the memory available to it each time it ran.

I could have just increased the available memory, but I wanted to understand where all the memory was going first.

Then I realised I would have to learn how to do memory analysis in Javascript, and I just gave up.

On Javascript that is.

Instead, I decided to rewrite the ELB Logs Processor in .NET Core, using a language that I actually like (C#).

Conclusion

This is one of those cases where looking back, with the benefit of hindsight, I probably should have just increased the memory until it worked and then walked away.

But I was just so tired of struggling with Javascript and Node that it was incredibly cathartic to just abandon it all in favour of something that actually made sense to me.

Of course, implementing the thing in C# via .NET Core wasn’t exactly painless, but that’s a topic for another time.

Probably next week.


With environments and configuration out of the way, it’s time to put all of the pieces together.

Obviously this isn’t the first time that both of those things have been put together. In order to validate that everything was working as expected, I was constantly creating/updating environments and deploying new versions of the configuration to them. Not only that, but with the way our deployment pipeline works (commit, push, build, test, deploy [test], [deploy]), the CI environment had been up and running for some time.

What’s left to do then?

Well, we still need to create the Staging and Production environments, which should be easy because those are just deployments inside Octopus now.

The bigger chunk of work is to use those new environments and to redirect all of our existing log traffic as appropriate.

Hostile Deployment

This is a perfect example of why I spend time on automating things.

With the environments set up to act just like everything else in Octopus, all I had to do to create a Staging environment was push a button. Once the deployment finished and the infrastructure was created, it was just another button push to deploy the configuration for that environment to make it operational. Rinse and repeat for all of the layers (Broker, Indexer, Cache, Elasticsearch) and Staging is up and running.

Production was almost the same, with one caveat. We use an entirely different AWS account for all our production resources, so we had to override all of the appropriate Octopus variables for each environment project (like AWS Credentials, VPC ID, Subnet IDs, etc). With those overrides in place, all that’s left is to make new releases (to capture the variables) and deploy to the appropriate environments.

It’s nice when everything works.

Redirecting Firepower

Of course, the new logging environments are worthless without log events. Luckily, we have plenty of those:

  • IIS logs from all of our APIs
  • Application logs from all of our APIs
  • ELB logs from a subset of our load balancers, most of which are APIs, but at least one is an Nginx router
  • Simple performance counter statistics (like CPU, Memory, Disk, etc) from basically every EC2 instance
  • Logs from on-premises desktop applications

We generally have CI, Staging and prod-X (green/blue) environments for all of our services/APIs (because it’s how our build/test/deployment pipeline works), so now that we have similarly partitioned logging environments, all we have to do is line them up (CI to CI, Staging to Staging and so on).

For the on-premises desktop applications, there is no CI, but they do generally have the capability to run in Staging mode, so we can use that setting to direct log traffic.

There are a few ways in which the log events hit the Broker layer:

  • Internal Logstash instance running on an EC2 instance with a TCP output pointing at the Broker hostname
  • Internal Lambda function writing directly to the Broker hostname via TCP (this is the ELB logs processor)
  • External application writing to an authenticated Logging API, which in turn writes to the Broker via TCP (this is for the on-premises desktop applications)

We can change the hostname used by all of these mechanisms simply by changing some variables in Octopus Deploy, making a new release and deploying it through the environments.

And that’s exactly what we did, making sure to monitor the log event traffic for each one to make sure we didn’t lose anything.

With all the logs going to their new homes, all that was left to do was destroy the old log stack, easily done manually through CloudFormation.

You might be wondering what happened to the log events stored in the old stack. Well, we generally only keep around 14 days worth of log events in the stack itself (because there are so many), so we pretty much just left the old stack up for searching purposes until it was no longer relevant, and then destroyed it.

Conclusion

And that basically brings us to the end of this series of posts about our logging environment and the reclamation thereof.

We’ve now got our standard deployment pipeline in place for both infrastructure and configuration and have separated our log traffic accordingly.

This puts us in a much better position moving forward. Not only is the entire thing fully under our control, but we now have the capability to test changes to infrastructure/configuration before just deploying them into production, something we couldn’t do before when we only had a single stack for everything.

In all fairness though, all we really did was polish an existing system so that it was a better fit for our specific usage.

Evolutionary, not revolutionary.


Continuing on from last week, it’s time to talk software and the configuration thereof.

With the environments themselves being managed by Octopus (i.e. the underlying infrastructure), we need to deal with the software side of things.

Four of the five components in the new ELK stack require configuration of some sort in order to work properly:

  • Logstash requires configuration to tell it how to get events, filter/process them and where to put them, so the Broker and Indexer layers require different, but similar, configuration.
  • Elasticsearch requires configuration for a variety of reasons including cluster naming, port setup, memory limits and so on.
  • Kibana requires configuration to know where Elasticsearch is, and a few other things.

For me, configuration is a hell of a lot more likely to change than the software itself, although with the pace that some organisations release software, that might not always be strictly true. Also, coming from a primarily Windows background, software is traditionally a lot more difficult to install and setup, and by that logic is not something you want to do all the time.

Taking those things into account, I’ve found it helpful to separate the installation of the software from the configuration of the software. What this means in practice is that a particular version of the software itself will be baked into an AMI, and then the configuration of that software will be handled via Octopus Deploy whenever a machine is created from the AMI.

Using AMIs or Docker Images to create immutable software + configuration artefacts is also a valid approach, and is superior in a lot of respects. It makes dynamic scaling easier (by facilitating a quick startup of fully functional nodes), helps with testing and generally simplifies the entire process. Docker Images in particular are something that I would love to explore in the future, just not right at this moment.

The good news is that this is pretty much exactly what the new stack was already doing, so we only need to make a few minor improvements and we’re good to go.

The Lament Configuration

As I mentioned in the first post in this series, software configuration was already being handled by TeamCity/Nuget/Octopus Deploy, it just needed to be cleaned up a bit. First thing was to move the configuration out into its own repository as appropriate for each layer and rework the relevant TeamCity Build Configurations as necessary. The Single Responsibility Principle doesn’t just apply to classes after all.

The next part is something of a personal preference, and it relates to the logic around deployment. All of the existing configuration deployments in the new stack had their logic (i.e. where to copy the files on the target machine, how to start/restart services and so on) encapsulated entirely inside Octopus Deploy. I’m not a fan of that. I much prefer to have all of this logic inside scripts in source control alongside the artefacts that will be deployed. This leaves projects in Octopus Deploy relatively simple, only responsible for deploying Nuget packages, managing variables (which is hard to encapsulate in source control because of sensitive values) and generally overseeing the whole process. This is the same sort of approach that I use for building software, with TeamCity acting as a relatively stupid orchestration tool, executing scripts that live inside source control with the code.

Octopus actually makes using source controlled scripts pretty easy, as it will automatically execute scripts named a certain way at particular times during the deployment of a Nuget package (for example, any script called deploy.ps1 at the root of the package will be executed after the package has been copied to the appropriate location on the target machine). The nice thing is that this also works with bash scripts for Linux targets (i.e. deploy.sh), which is particularly relevant here, because all of the ELK stuff happens on Linux.

Actually deploying most of the configuration is pretty simple. For example, this is the deploy.sh script for the ELK Broker configuration.

# The deploy script is automatically run by Octopus during a deployment, after Octopus does its thing.
# Octopus deploys the contents of the package to /tmp/elk-broker/
# At this point, the logstash configuration directory has been cleared by the pre-deploy script

# Echo commands after expansion
set -x

# Copy the settings file
cp /tmp/elk-broker/logstash.yml /etc/logstash/logstash.yml || exit 1

# Copy the config files
cp /tmp/elk-broker/*.conf /etc/logstash/conf.d/ || exit 1

# Remove the UTF-8 BOM from the config files
sed -i '1 s/^\xef\xbb\xbf//' /etc/logstash/conf.d/*.conf || exit 1

# Test the configuration
sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/ -t --path.settings /etc/logstash/logstash.yml || exit 1

# Set the ownership of the config files to the logstash user, which is what the service runs as
sudo chown -R logstash:logstash /etc/logstash/conf.d || exit 1

# Restart logstash - don't use the restart command, it throws errors when you try to restart a stopped service
sudo initctl stop logstash || true
sudo initctl start logstash

Prior to the script based approach, this logic was spread across five or six different steps inside an Octopus Project, which I found much harder to read and reason about.

Or Lemarchand’s Box If You Prefer

The only other difference worth talking about is the way in which we actually trigger configuration deployments.

Traditionally, we have asked Octopus to deploy the most appropriate version of the necessary projects during the initialization of a machine. For example, the ELK Broker EC2 instances had logic inside their LaunchConfiguration:UserData that said “register as Tentacle”, “deploy X”, “deploy Y” etc.

This time I tried something a little different, but which feels a hell of a lot smarter.

Instead of the machine being responsible for asking for projects to be deployed to it, we can just let Octopus react to the registration of a new Tentacle and deploy whatever Projects are appropriate. This is relatively easy to setup as well. All you need to do is add a trigger to your Project that says “deploy whenever a new machine comes online”. Octopus takes care of the rest, including picking what version is best (which is just the last successful deployment to the environment).

This is a lot cleaner than hardcoding project deployment logic inside the environment definition, and allows for changes to what software gets deployed where without actually having to edit or update the infrastructure definition. This sort of automatic deployment approach is probably more useful for our old way of handling environments (i.e. that whole terrible migration process with no update logic) than it is for the newer, easier to update environment deployments, but it’s still nice all the same.

Conclusion

There really wasn’t much effort required to clean up the configuration for each of the layers in the new ELK stack, but it was a great opportunity to try out the new trigger based deployments in Octopus Deploy, which was pretty cool.

With the configuration out of the way, and the environment creation also sorted, all that’s left is to actually create some new environments and start using them instead of the old one.

That’s a topic for next time though.


With all of the general context and solution outlining done for now, it’s time to delve into some of the details. Specifically, the build/test/deploy pipeline for the log stack environments.

Unfortunately, we use the term environment to describe two things. The first is an Octopus environment, which is basically a grouping construct inside Octopus Deploy, like CI or prod-green. The second is a set of infrastructure intended for a specific purpose, like an Auto Scaling Group and Load Balancer intended to host an API. In the case of the log stack, we have distinct environments for the infrastructure for each layer, like the Broker and the Indexer.

Our environments are all conceptually similar: there is a Git repository that contains everything necessary to create or update the infrastructure (CloudFormation templates, Powershell scripts, etc), along with the logic for what it means to build and validate a Nuget package that can be used to manage the environment. The repository is hooked up to a Build Configuration in TeamCity which runs the build script, and the resulting versioned package is uploaded to our Nuget server. The package is then used in TeamCity via other Build Configurations to allow us to Create, Delete, Migrate and otherwise interact with the environment in question.

The creation of this process has happened in bits and pieces over the last few years, most of which I’ve written about on this blog.

It’s a decent system, and I’m proud of how far we’ve come and how much automation is now in place, but it’s certainly not without its flaws.

Bestial Rage

The biggest problem with the current process is that while the environment is fully encapsulated as code inside a validated and versioned Nuget package, actually using that package to create or delete an environment is not as simple as it could be. As I mentioned above, we have a set of TeamCity Build Configurations for each environment that allow for the major operations like Create and Delete. If you’ve made changes to an environment and want to deploy them, you have to decide what sort of action is necessary (i.e. “it’s the first time, Create” or “it already exists, Migrate”) and “run” the build, which will download the package and run the appropriate script.

This is where it gets a bit onerous, especially for production. If you want to change any of the environment parameters from the default values the package was built with, you need to provide a set of parameter overrides when you run the build. For production, this means you often end up overriding everything (because production is a separate AWS account), which can be upwards of 10 different parameters, all of which are only visible if you go and look at the source CloudFormation template. You have to do this every time you want to execute that operation (although you can copy the parameters from previous runs, which acts as a small shortcut).

The issue with this is that it means production deployments become vulnerable to human error, which is one of the things we’re trying to avoid by automating in the first place!

Another issue is that we lack a true “Update” operation. We only have Create, Delete, Clone and Migrate.

This is entirely my fault, because when I initially put the system together I had a bad experience with the CloudFormation Update command where I accidentally wiped out an S3 bucket containing customer data. As is often the case, that fear then led to an alternate (worse) solution involving cloning, checking, deleting, cloning, checking and deleting (in that order). This was safer, but incredibly slow and prone to failure.

The existence of these two problems (hard to deploy, slow failure-prone deployments) is reason enough for me to consider exploring alternative approaches for the log stack infrastructure.

Fantastic Beasts And Where To Find Them

The existing process does do a number of things well though, and has:

  • A Nuget package that contains everything necessary to interact with the environment.
  • Environment versioning, because that’s always important for traceability.
  • Environment validation via tests executed as part of the build (when possible).

Keeping those three things in mind, and combining them with the desire to ease the actual environment deployment, an improved approach looks a lot like our typical software development/deployment flow.

  1. Changes to environment are checked in
  2. Changes are picked up by TeamCity, and a build is started
  3. Build is tested (i.e. a test environment is created, validated and destroyed)
  4. Versioned Nuget package is created
  5. Package is uploaded to Octopus
  6. Octopus Release is created
  7. Octopus Release is deployed to CI
  8. Secondary validation (i.e. test CI environment to make sure it does what it’s supposed to do after deployment)
  9. [Optional] Propagation of release to Staging

In comparison to our current process, the main difference is the deployment. Prior to this, we were treating our environments as libraries (i.e. they were built, tested, packaged and uploaded to MyGet to be used by something else). Now we’re treating them as self contained deployable components, responsible for knowing how to deploy themselves.

With the approach settled, all that’s left is to come up with an actual deployment process for an environment.

Beast Mastery

There are two main cases we need to take care of when deploying a CloudFormation stack to an Octopus environment.

The first case is what to do when the CloudFormation stack doesn’t exist.

This is the easy case, all we need to do is execute New-CFNStack with the appropriate parameters and then wait for the stack to finish.

The second case is what we should do when the CloudFormation stack already exists, which is the case that is not particularly well covered by our current environment management process.

Luckily, CloudFormation makes this relatively easy with the Update-CFNStack command. Updates are dangerous (as I mentioned above), but if you’re careful with resources that contain state, they are pretty efficient. The implementation of the update is quite smart as well, and will only update the things that have changed in the template (i.e. if you’ve only changed the Load Balancer, it won’t recreate all of your EC2 instances).

The completed deployment script is shown in full below.

[CmdletBinding()]
param
(

)

$here = Split-Path $script:MyInvocation.MyCommand.Path;
$rootDirectory = Get-Item ($here);
$rootDirectoryPath = $rootDirectory.FullName;

$ErrorActionPreference = "Stop";

$component = "unique-stack-name";

if ($OctopusParameters -ne $null)
{
    $parameters = ConvertFrom-StringData ([System.IO.File]::ReadAllText("$here/cloudformation.parameters.octopus"));

    $awsKey = $OctopusParameters["AWS.Deployment.Key"];
    $awsSecret = $OctopusParameters["AWS.Deployment.Secret"];
    $awsRegion = $OctopusParameters["AWS.Deployment.Region"];
}
else 
{
    $parameters = ConvertFrom-StringData ([System.IO.File]::ReadAllText("$here/cloudformation.parameters.local"));
    
    $path = "C:\creds\credentials.json";
    Write-Verbose "Attempting to load credentials (AWS Key, Secret, Region, Octopus Url, Key) from local, non-repository stored file at [$path]. This is done this way to allow for a nice development experience in vscode"
    $creds = ConvertFrom-Json ([System.IO.File]::ReadAllText($path));
    $awsKey = $creds.aws."aws-account".key;
    $awsSecret = $creds.aws."aws-account".secret;
    $awsRegion = $creds.aws."aws-account".region;

    $parameters["OctopusAPIKey"] = $creds.octopus.key;
    $parameters["OctopusServerURL"] = $creds.octopus.url;
}

$parameters["Component"] = $component;

$environment = $parameters["OctopusEnvironment"];

. "$here/scripts/common/Functions-Aws.ps1";
. "$here/scripts/common/Functions-Aws-CloudFormation.ps1";

Ensure-AwsPowershellFunctionsAvailable

$tags = @{
    "environment"=$environment;
    "environment:version"=$parameters["EnvironmentVersion"];
    "application"=$component;
    "function"="logging";
    "team"=$parameters["team"];
}

$stackName = "$environment-$component";

$exists = Test-CloudFormationStack -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion -StackName $stackName

$cfParams = ConvertTo-CloudFormationParameters $parameters;
$cfTags = ConvertTo-CloudFormationTags $tags;
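# Common arguments, splatted into either New-CFNStack or Update-CFNStack below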
$args = @{
    "-StackName"=$stackName;
    "-TemplateBody"=[System.IO.File]::ReadAllText("$here\cloudformation.template");
    "-Parameters"=$cfParams;
    "-Tags"=$cfTags;
    "-AccessKey"=$awsKey;
    "-SecretKey"=$awsSecret;
    "-Region"=$awsRegion;
    "-Capabilities"="CAPABILITY_IAM";
};

if ($exists)
{
    Write-Verbose "The stack [$stackName] exists, so I'm going to update it. Its better this way"
    $stackId = Update-CFNStack @args;

    $desiredStatus = [Amazon.CloudFormation.StackStatus]::UPDATE_COMPLETE;
    $failingStatuses = @(
        [Amazon.CloudFormation.StackStatus]::UPDATE_FAILED,
        [Amazon.CloudFormation.StackStatus]::UPDATE_ROLLBACK_IN_PROGRESS,
        [Amazon.CloudFormation.StackStatus]::UPDATE_ROLLBACK_COMPLETE
    );
    Wait-CloudFormationStack -StackName $stackName -DesiredStatus  $desiredStatus -FailingStates $failingStatuses -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion

    Write-Verbose "Stack [$stackName] Updated";
}
else 
{
    Write-Verbose "The stack [$stackName] does not exist, so I'm going to create it. Just watch me"
    $args.Add("-DisableRollback", $true);
    $stackId = New-CFNStack @args;

    $desiredStatus = [Amazon.CloudFormation.StackStatus]::CREATE_COMPLETE;
    $failingStatuses = @(
        [Amazon.CloudFormation.StackStatus]::CREATE_FAILED,
        [Amazon.CloudFormation.StackStatus]::ROLLBACK_IN_PROGRESS,
        [Amazon.CloudFormation.StackStatus]::ROLLBACK_COMPLETE
    );
    Wait-CloudFormationStack -StackName $stackName -DesiredStatus  $desiredStatus -FailingStates $failingStatuses -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion

    Write-Verbose "Stack [$stackName] Created";
}

Other than the Create/Update logic that I’ve already talked about, the only other interesting thing in the deployment script is the way that it deals with parameters.

Basically, if the script detects that it’s being run from inside Octopus Deploy (via the presence of an $OctopusParameters variable), it will load all of its parameters (as a hashtable) from a particular local file. This file leverages the Octopus variable substitution feature, so that when we deploy the infrastructure to the various environments, it gets the appropriate values (like a different VPC because prod is a separate AWS account to CI). When it’s not running in Octopus, it just uses a different file, structured very similarly, with test/scratch values in it.
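As an illustration, the Octopus flavoured file is just key=value pairs (which is what ConvertFrom-StringData expects), with Octopus variable substitution placeholders for the values that differ between environments; Octopus replaces the placeholders during deployment, before the script ever reads the file. The exact keys and variable names below are hypothetical, not our real parameter set:

OctopusEnvironment=#{Octopus.Environment.Name}
EnvironmentVersion=#{EnvironmentVersion}
team=#{Team}
VpcId=#{AWS.VPC.Id}
SubnetIds=#{AWS.Subnet.Ids}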

With the deployment script in place, we plug the whole thing into our existing “deployable” component structure and we have automatic deployment of tested, versioned infrastructure via Octopus Deploy.

Conclusion

Of course, being a first version, the deployment logic that I’ve described above is not perfect. For example, there is no support for deploying to an environment where the stack is in error (failing stacks can’t be updated, but they already exist, so you have to delete them and start again) and there is little to no feedback available if a stack creation/update fails for some reason.

Additionally, the code could benefit from being extracted to a library for reuse.

All in all, the deployment process I just described is a lot simpler than the one I described at the start of this post, and it’s managed by Octopus, which makes it consistent with the way that we do everything else, which is nice.

With a little bit more polish, and some pretty strict usage of the CloudFormation features that stop you accidentally deleting databases full of valuable data, I think it will be a good replacement for what we do now.


Way back in March 2015, I wrote a few posts explaining how we set up our log aggregation. I’ve done a lot of posts since then about logging in general and about specific problems we’ve encountered in various areas, but I’ve never really revisited the infrastructure underneath the stack itself.

The main reason for the lack of posts is that the infrastructure I described back then is not the infrastructure we’re using now. As we built more things and started pushing more and more data into the stack we had, we began to experience some issues, mostly related to the reliability of the Elasticsearch process. At the time, the organisation decided that it would be better if our internal operations team were responsible for dealing with these issues, and they built a new stack as a result.

This was good and bad. The good part was the argument that if we didn’t have to spend our time building and maintaining the system, it should theoretically leave more time and brainspace for us to focus on actual software development. It was bad for almost exactly the same reason: problems with the service would need to be resolved by a different team, one with their own set of priorities and their own schedule.

The arrangement worked okay for a while, until the operations team were engaged on a relatively complex set of projects and no longer had the time to extend and maintain the log stack as necessary. They did their best, but with no resources dedicated to dealing with the maintenance on the existing stack, it started to degrade surprisingly quickly.

This came to a head when we had a failure in the new stack that required us to replace some EC2 instances via the Auto Scaling Group, and the operations team was unavailable to help. When we executed the scaling operation, we discovered that it was creating instances that didn’t actually have all of the required software setup in order to fulfil their intended role. At some point in the past someone had manually made changes to the instances already in service and these changes had not been made in the infrastructure as code.

After struggling with this for a while, we decided to reclaim the stack and make it our responsibility again.

Beast Mode

Architecturally, the new log stack was a lot better than the old one, even taking into account the teething issues that we did have.

The old stack was basically an Auto Scaling Group capable of creating EC2 instances with Elasticsearch, Logstash and Kibana, along with a few load balancers for access purposes. While the old stack could theoretically scale out to better handle load, we never really tested that capability in production, and I’m pretty sure it wouldn’t have worked (looking back, I doubt the Elasticsearch clustering was set up correctly, in addition to some other issues with the way the Logstash indexes were being configured).

The new stack looked a lot like the reference architecture described on the Logstash website, which was good, because those guys know their stuff.

At a high level, log events would be shipped from many different places to a Broker layer (auto scaling Logstash instances behind a Load Balancer) which would then cache those events in a queue of some description (initially RabbitMQ, later Redis). An Indexer layer (auto scaling Logstash instances) would pull events off the queue at a sustainable pace, process them and place them into Elasticsearch. Users would then use Kibana (hosted on the Elasticsearch instances for ease of access) to interact with the data.

There are a number of benefits to the architecture described above, but a few of the biggest ones are:

  • It’s not possible for an influx of log events to shut down Elasticsearch, because the Indexer layer is pulling events out of the cache at a sustainable rate. The cache might start to fill up if the number of events rises, but we’ll still be able to use Elasticsearch.
  • The cache provides a buffer if something goes wrong with either the Indexer layer or Elasticsearch. We had some issues with Elasticsearch crashing in our old log stack, so having some protection against losing log events in the event of downtime was beneficial.

There were downsides as well, the most pertinent of which was that the new architecture was a lot more complicated than the old one, with a lot of moving parts. This made it harder to manage and understand, and increased the number of different ways in which it could break.

Taking all of the above into account, when we reclaimed the stack we decided to keep the architecture intact, and just improve it.

But how?

Beast-Like Vigour

The way in which the stack was described was not bad. It just wasn’t quite as controlled as the way we’d been handling our other environments, mostly as a factor of being created/maintained by a different team.

Configuration for the major components (Logstash Broker, Logstash Indexer, Elasticsearch, Kibana) was source controlled, with builds in TeamCity, and deployment was handled by pushing Nuget packages through Octopus. This was good, and wouldn’t require much work to bring into line with the rest of our stuff. All we would have to do was ensure all of the pertinent deployment logic was encapsulated in the Git repositories and maybe add some tests.

The infrastructure needed some more effort. It was all defined using CloudFormation templates, which was excellent, but there was no build/deployment pipeline for the templates and they were not versioned. In order to put such a pipeline in place, we would need to have CI and Staging deployments of the infrastructure as well, which did not yet exist. The infrastructure definition for each layer also shared a repository with the relevant configuration (i.e. Broker Environment with Broker Config), which was against our existing patterns. Finally, the cache/queue layer did not have an environment definition at all, because the current one (Elasticache w. Redis) had been manually created to replace the original one (RabbitMQ) as a result of some issues where the cache/queue filled up and then became unrecoverable.

In addition to the above, once we’ve improved all of the processes and got everything under control, we need to work on fixing the actual bugs in the stack (like Logstash logs filling up disks, Elasticsearch mapping templates not being setup correctly, no alerting/monitoring on the various layers, etc). Some of these things will probably be fixed as we make the process improvements, but others will require dedicated effort.

To Be Continued

With the total scope of the work laid out (infrastructure build/deploy, clean up configuration deployment, re-create infrastructure in appropriate AWS accounts, fix bugs) it’s time to get cracking.

The first cab off the rank is the work required to create a process that will allow us to fully automate the build and deployment of the infrastructure. Without that sort of system in place, we would have to do all of the other things manually.

The Broker layer is the obvious starting point, so next week I’ll outline how we went about using a combination of TeamCity, Nuget, Octopus and Powershell to accomplish a build and deployment pipeline for the infrastructure.