
A long long time ago, a valuable lesson was learned about segregating production infrastructure from development/staging infrastructure. Feel free to go and read that post if you want, but to summarise: I ran some load tests on an environment (one specifically built for load testing, separate from our normal CI/Staging/Production), and the tests flooded a shared component (a proxy), bringing down production services and impacting customer experience. The outage didn’t last that long once we realised what was happening, but it was still pretty embarrassing.

Shortly after that adventure, we created a brand new AWS account to isolate our production infrastructure and slowly moved all of that important stuff into it, protecting it from developers doing what they do best (breaking stuff in new and interesting ways).

This arrangement complicated a few things, but the most relevant to this discussion was the creation and management of AMIs.

We were already using Packer to create and maintain said AMIs, so it wasn’t a manual process, but an AMI in AWS is owned by, and by default only accessible from, a single AWS account.

With two completely different AWS accounts, it was easy to imagine a situation where each account has slightly different AMIs available, with slightly different behaviour, leading to weird things happening in production environments that don’t happen during development or in staging.

That sounds terrible, and it would be neat if we could ensure it doesn’t happen.

A Packaged Deal

The easiest thing to do is share the AMIs in question.

AWS makes it relatively easy to make an AMI accessible to a different AWS account, likely for this exact purpose. I think it’s also used to enable companies to sell pre-packaged AMIs, but that’s a space I know little to nothing about, so I’m not sure.

As long as you know the account number of the AWS account that you want to grant access to, it’s a simple matter to use the dashboard or the API to share the AMI, which can then be freely used from the other account to create EC2 instances.

One thing to be careful of is to make sure you grant the other account access to the snapshots backing the AMI as well, or you’ll run into permission problems when you actually try to use the AMI to make an EC2 instance.
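
As a rough sketch of what that looks like with the AWS Powershell tools (the account number and AMI ID below are made up):

# Illustrative only: share an AMI (and its backing snapshots) with another AWS account.
$amiId = "ami-11111111";
$otherAccountId = "222222222222";

# Allow the other account to launch instances from the AMI.
Edit-EC2ImageAttribute -ImageId $amiId -Attribute launchPermission -OperationType add -UserId $otherAccountId;

# Allow the other account to read the EBS snapshots backing the AMI.
$image = Get-EC2Image -ImageId $amiId;
foreach ($mapping in ($image.BlockDeviceMappings | Where-Object { $_.Ebs -ne $null }))
{
    Edit-EC2SnapshotAttribute -SnapshotId $mapping.Ebs.SnapshotId -Attribute createVolumePermission -OperationType add -UserId $otherAccountId;
}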

Sharing AMIs is alright, but it has risks.

If you create AMIs in your development account and then share them with production, then you’ve technically got production infrastructure inside your development account, which was one of the things we desperately wanted to avoid. The main problem here is that people will not assume that a resource living inside the relatively free-for-all development environment could have any impact on production, and they might delete it or something equally dangerous. Without the AMI, auto scaling won’t work, and the most likely time to figure that sort of thing out is right when you need it the most.

A slightly better approach is to copy the AMI to the other account. In order to do this, you share the AMI from the owner account (i.e. dev) and then make a permanent copy in the other account (i.e. prod). Once the copy is complete, you unshare (to prevent accidental usage).

This breaks the linkage between the two accounts while ensuring that the AMIs are identical, so it’s a step up from simple sharing, but there are limitations.
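
Before getting to those, a sketch of the share/copy/unshare dance itself, assuming named credential profiles for each account (profile names, IDs and region are made up):

# Illustrative only: copy a shared AMI into the production account, then revoke the share.
$sharedAmiId = "ami-11111111";
$region = "ap-southeast-2";

# Run against the production account, which the AMI has already been shared with.
$copiedAmiId = Copy-EC2Image -SourceImageId $sharedAmiId -SourceRegion $region -Region $region -Name "api-host" -ProfileName "prod";

# (in practice you would wait for the copy to reach the available state before revoking the share)

# Back in the dev (owner) account, remove the launch permission so the shared original can't be used by accident.
Edit-EC2ImageAttribute -ImageId $sharedAmiId -Attribute launchPermission -OperationType remove -UserId "222222222222" -ProfileName "dev";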

For example, everything works swimmingly until you try to copy a Windows AMI, at which point it fails miserably as a result of the way in which AWS licences Windows. On the upside, the copy operation itself fails fast, rather than making a copy that then fails when you try to use it, so that’s nice.

So, two solutions, neither of which is ideal.

Surely we can do better?

Pack Of Wolves

For us, the answer is yes. We just run our Packer templates twice, once for each account.

This has actually been our solution for a while. We execute our Packer templates through TeamCity Build Configurations, so it is a relatively simple matter to just run the build twice, once for each account.

Well, “relatively simple” is probably understating it actually.

Running in dev is easy. Just click the button, wait and a wild AMI appears.

Prod is a different question.

When creating an AMI, Packer needs to know some things that are AWS account specific, like subnets, VPC, security groups and so on (mostly networking concerns). The source code contained parameters relevant for dev (hence the easy AMI creation for dev), but didn’t contain anything relevant for prod. Instead, whenever you ran a prod build in TeamCity, you had to supply a hashtable of parameter overrides, which would be used to alter the defaults and make the build work in the prod AWS account.

As you can imagine, this is error prone.

Additionally, you actually have to remember to click the build button a second time and supply the overrides in order to make a prod image, or you’ll end up in a situation where you deployed your environment changes successfully through CI and Staging, but it all explodes (or even worse, subtly doesn’t do what it’s supposed to) when you deploy them into Production because there is no equivalent AMI. Then you have to go and make one using TeamCity, which is error prone, and if the source has diverged since you made the dev one…well, it’s just bad times all around.

Leader Of The Pack

With some minor improvements, we can avoid that whole problem though.

Basically, whenever we do a build in TeamCity, it creates the dev AMI first, and then automatically creates the prod one as well. If the dev build fails, there is no prod AMI. If the prod build fails, the dev AMI is deleted.

To keep things in sync, both AMIs are tagged with a version attribute created during the build (just like software), so that we have a way to trace the AMI back to the git commit it was created from (just like software).
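
The tagging itself is a one liner per AMI; something along these lines (the tag names and the way the version is derived are illustrative, not our exact implementation):

# Illustrative only: tag a freshly baked AMI with the build version and the commit it came from.
$amiId = "ami-11111111";
$version = "1.2.345";
$commit = (& git rev-parse HEAD);

New-EC2Tag -Resource $amiId -Tag @( @{ Key = "version"; Value = $version }, @{ Key = "commit"; Value = $commit } );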

To accomplish this approach, we now have a relatively simple configuration hierarchy, with default parameters, dev specific parameters and prod specific parameters. When you start the AMI execution, you tell the function what environment you’re targeting (dev/prod) and it loads the defaults, then merges in the appropriate overrides.
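
A minimal sketch of that defaults-plus-overrides merge feeding into the Packer execution (file names, parameter keys and layout are assumptions, not our actual repository structure):

# Illustrative only: load default parameters, overlay the environment specific ones, hand the lot to Packer.
param([Parameter(Mandatory=$true)][ValidateSet("dev", "prod")][string]$environment)

$parameters = ConvertFrom-StringData ([System.IO.File]::ReadAllText(".\parameters\defaults.properties"));
$overrides = ConvertFrom-StringData ([System.IO.File]::ReadAllText(".\parameters\$environment.properties"));
foreach ($key in $overrides.Keys) { $parameters[$key] = $overrides[$key]; }

# Each parameter becomes a -var=key=value argument on the Packer command line.
$packerArguments = @($parameters.Keys | ForEach-Object { "-var=$_=$($parameters[$_])" }) + ".\template.json";
& packer build @packerArguments;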

This was a relatively easy way to deal with things that are different and non-sensitive (like VPC, subnets, security groups, etc).

What about credentials though?

Since an…incident…waaaay back in 2015, I’m pretty wary of credentials, particularly ones that give access to AWS.

So they can’t go in source control with the rest of the parameters.

That leaves TeamCity as the only sane place to put them, which it can easily do, assuming we don’t mind writing some logic to pick the appropriate credentials depending on our targeted destination.
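
The logic itself is nothing special; a sketch of the sort of thing I mean, assuming TeamCity hands the credentials to the script as parameters (all names hypothetical):

# Illustrative only: select the appropriate AWS credentials based on the target account.
param
(
    [Parameter(Mandatory=$true)][ValidateSet("dev", "prod")][string]$targetAccount,
    [string]$devAwsKey, [string]$devAwsSecret,
    [string]$prodAwsKey, [string]$prodAwsSecret
)

switch ($targetAccount)
{
    "dev"  { $awsKey = $devAwsKey; $awsSecret = $devAwsSecret; }
    "prod" { $awsKey = $prodAwsKey; $awsSecret = $prodAwsSecret; }
}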

We could technically have used some combination of IAM roles and AWS profiles as well, but we already have mechanisms and experience dealing with raw credential usage, so this was not the time to re-invent that particular wheel. That’s a fight for another day.

With account specific parameters and credentials taken care of, everything is good, and every build results in 2 AMIs, one for each account.

I’ve uploaded a copy of our Packer repository containing all of this logic (and a copy of the script we embed into TeamCity) to Github for reference purposes.

Conclusion

I’m much happier with the process I described above for creating our AMIs. If a build succeeds, it creates resources in both of our active AWS accounts, keeping them in sync and reducing the risk of subtle problems come deployment time. Not only that, but it also tags those resources with a version that can be traced back to a git commit, which is always more useful than you think.

There are still some rough edges around actually using the AMIs though. Most of our newer environments specify their AMIs directly via parameter files, so you have to remember to change the values for each environment target when you want to use a new AMI. This is dangerous, because if someone forgets it could lead to a disconnect between CI/Staging and Production, which was pretty much the entire problem we were trying to avoid in the first place.

Honestly, it’s going to be me that forgets.

Ah well, all in all, it’s a lot more consistent than it was before, which is pretty much the best I could hope for.


With all that talk about logging environments and their improvements out of the way, it’s time to return to familiar territory yet again.

Our ELB logs processor has been chugging along for a while now, dutifully aggregating all of the ELB logs for one of our APIs. We have quite a few APIs though, and almost all of them have ELB logs to process, so there was no sense keeping the awesome sequestered in one place.

Time to spread the love.

Honestly, we always planned on this sort of reuse anyway, so the logs processor was built in such a way that it could be relatively easily used in another location.

The actual Javascript code is completely generic, not specific to a particular Lambda function/API combo. There are things that need to change between instances obviously, but they are covered by variables, which are easily changed via Octopus Deploy. In order to set up the processor for a new API, all you need to do is add the Lambda function/permissions to the environment definition (i.e. the CloudFormation template), make a new Octopus project and add a small piece of Powershell into the environment scripts to deploy that project.
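
That small piece of Powershell is essentially just a call to the Octopus command line tool; a hedged sketch (project name, server URL and environment are made up, and this is not necessarily how our actual script invokes it):

# Illustrative only: create a release of the logs processor project and deploy it to the target environment.
$octopusApiKey = "API-XXXXXXXXXXXXXXXX";  # would come from a secured variable, not source control
& ".\tools\octo.exe" create-release `
    --project "NEWAPI Logs Processor" `
    --deployto "CI" `
    --server "https://octopus.example.com" `
    --apiKey $octopusApiKey `
    --waitfordeployment;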

What could possibly go wrong?

Surface Tension

Surprisingly, very little actually went wrong.

I mean, it didn’t work, and we had no log events inside the ELK stack for the new API, but at a glance, there were no catastrophic failures.

The Lambda function existed, it was receiving events from S3, downloading files, processing them and pumping them out via TCP. There were no errors (at the Lambda function level OR within the function itself) and all of the function logging showed everything working as expected.

After digging into it a little bit, it turned out that not only were the ELB logs for the new API missing, the CI and Staging logs for the older API were missing as well. Weirdly enough, the Production logs for the older API were working exactly as expected.

Mysteriously, the code was exactly the same in all 4 Lambda functions, with the only differences being in configuration (which was relatively simple).

Forget About Freeman!

To cut a relatively long story short, the core of the problem was related to the asynchronous nature of TCP connections (via sockets) in Node.

This was the problem code:

create: function() {
    return new Promise(function(resolve, reject) {
        // createConnection returns the socket immediately, before it has actually connected
        const socket = net.createConnection(config.logstashPort, config.logstashHost);
        summary.connections.created += 1;
        // so a potentially still-connecting socket goes straight into the pool
        resolve(socket);
    })
}

This snippet is the mechanism by which TCP connections are created and added to the pool in the Lambda function, prior to being used to actually send data to our log stack.

It turns out that a socket can be returned by net.createConnection before it has actually connected. Even worse, sockets that are still connecting will not throw errors when you attempt to write to them. I’m not sure what actually happens to the writes in this case, but giving Javascript/Node the benefit of the doubt, I assume they are probably queued in some way.

Unfortunately, the Lambda function was exiting before any sort of queued writes could actually be applied, but only when there was a small total number of ELB log lines being processed. This is why it was working fine in production (thousands of messages every execution) and failing miserably in CI/Staging and the new environment (tens of messages). The function was just too quick, and didn’t know that it needed to wait for the socket to be connected and for all pending writes to execute before it was allowed to finish. As far as the promise chain was concerned, everything was done and dusted.

The solution is actually pretty simple:

create: function() {
    return new Promise(function(resolve, reject) {
        const socket = new net.Socket();

        // only resolve (and therefore add to the pool) once the socket has actually connected
        socket.on("connect", () => {
            summary.connections.created += 1;
            resolve(socket);
        });

        // record the failure, clean up the socket and reject so the caller knows about it
        socket.on("error", error => {
            summary.failures.sockets.total += 1;
            if (summary.failures.sockets.lastFew.length >= 5) {
                summary.failures.sockets.lastFew.shift();
            }
            summary.failures.sockets.lastFew.push(error);
            socket.destroy(error);
            reject(error);
        });

        socket.connect(config.logstashPort, config.logstashHost);
    })
}

Instead of immediately returning the socket, hook a listener up to its connect event and resolve the promise at that point. I added in some extra error handling/reporting as well, but it’s all pretty straightforward.

What this means is that a socket is never added to the pool unless it’s already connected, which makes the rest of the code work exactly as expected.

Conclusion

It was nice to return briefly to the ELB logs processor to witness its reuse, especially because the return actually led to the discovery and eventual fix of a bug that I missed the first time.

I’m still wrapping my head around the intrinsically asynchronous/callback nature of Node though, which was the root cause of the issue. To be honest, while I really like the way that promises work, I really really hate the callback hell that vanilla Node seems to encourage. It was so easy to create a socket and write to it, without the socket actually being valid, and the only way to ensure the socket was valid was via an asynchronous event handler.

To complicate matters, I’m sure it doesn’t help that AWS Lambda can only run a relatively old version of Node (4.3.2). The latest version of Node is miles ahead of that, and it probably has hundreds of bugfixes that might be relevant.

Anyway, I’m just happy I got to steal more Half-Life chapter titles as headings.

Seriously, they fit so well.


With environments and configuration out of the way, it’s time to put all of the pieces together.

Obviously this isn’t the first time that both of those things have been put together. In order to validate that everything was working as expected, I was constantly creating/updating environments and deploying new versions of the configuration to them. Not only that, but with the way our deployment pipeline works (commit, push, build, test, deploy [test], [deploy]), the CI environment had been up and running for some time.

What’s left to do then?

Well, we still need to create the Staging and Production environments, which should be easy because those are just deployments inside Octopus now.

The bigger chunk of work is to use those new environments and to redirect all of our existing log traffic as appropriate.

Hostile Deployment

This is a perfect example of why I spend time on automating things.

With the environments setup to act just like everything else in Octopus, all I had to do to create a Staging environment was push a button. Once the deployment finished and the infrastructure was created, it was just another button push to deploy the configuration for that environment to make it operational. Rinse and repeat for all of the layers (Broker, Indexer, Cache, Elasticsearch) and Staging is up and running.

Production was almost the same, with one caveat. We use an entirely different AWS account for all our production resources, so we had to override all of the appropriate Octopus variables for each environment project (like AWS credentials, VPC ID, subnet IDs, etc). With those overrides in place, all that’s left is to make new releases (to capture the variables) and deploy them to the appropriate environments.

It’s nice when everything works.

Redirecting Firepower

Of course, the new logging environments are worthless without log events. Luckily, we have plenty of those:

  • IIS logs from all of our APIs
  • Application logs from all of our APIs
  • ELB logs from a subset of our load balancers, most of which are APIs, but at least one is an Nginx router
  • Simple performance counter statistics (like CPU, Memory, Disk, etc) from basically every EC2 instance
  • Logs from on-premises desktop applications

We generally have CI, Staging and prod-X (green/blue) environments for all of our services/APIs (because that’s how our build/test/deployment pipeline works), so now that we have similarly partitioned logging environments, all we have to do is line them up (CI to CI, Staging to Staging and so on).

For the on-premises desktop applications, there is no CI, but they do generally have the capability to run in Staging mode, so we can use that setting to direct log traffic.

There are a few ways in which the log events hit the Broker layer:

  • Internal Logstash instance running on an EC2 instance with a TCP output pointing at the Broker hostname
  • Internal Lambda function writing directly to the Broker hostname via TCP (this is the ELB logs processor)
  • External application writing to an authenticated Logging API, which in turn writes to the Broker via TCP (this is for the on-premises desktop applications)

We can change the hostname used by all of these mechanisms simply by changing some variables in Octopus Deploy, making a new release and deploying it through the environments.

And that’s exactly what we did, making sure to monitor the log event traffic for each one to make sure we didn’t lose anything.

With all the logs going to their new homes, all that was left to do was destroy the old log stack, easily done manually through CloudFormation.

You might be wondering about the log events that were stored in the old stack. Well, we generally only keep around 14 days’ worth of log events in the stack itself (because there are so many), so we pretty much just left the old stack up for searching purposes until it was no longer relevant, and then destroyed it.
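
If you’d rather not click through the CloudFormation dashboard, the teardown is a one liner with the AWS Powershell tools anyway (stack name and region made up):

# Illustrative only: delete the old, no longer relevant log stack.
Remove-CFNStack -StackName "old-log-stack" -Region "ap-southeast-2" -Force;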

Conclusion

And that basically brings us to the end of this series of posts about our logging environment and the reclamation thereof.

We’ve now got our standard deployment pipeline in place for both infrastructure and configuration and have separated our log traffic accordingly.

This puts us in a much better position moving forward. Not only is the entire thing fully under our control, but we now have the capability to test changes to infrastructure/configuration before just deploying them into production, something we couldn’t do before when we only had a single stack for everything.

In all fairness though, all we really did was polish an existing system so that it was a better fit for our specific usage.

Evolutionary, not revolutionary.


With all of the general context and solution outlining done for now, it’s time to delve into some of the details. Specifically, the build/test/deploy pipeline for the log stack environments.

Unfortunately, we use the term environment to describe two things. The first is an Octopus environment, which is basically a grouping construct inside Octopus Deploy, like CI or prod-green. The second is a set of infrastructure intended for a specific purpose, like an Auto Scaling Group and Load Balancer intended to host an API. In the case of the log stack, we have distinct environments for the infrastructure for each layer, like the Broker and the Indexer.

Our environments are all conceptually similar: there is a Git repository that contains everything necessary to create or update the infrastructure (CloudFormation templates, Powershell scripts, etc), along with the logic for what it means to build and validate a Nuget package that can be used to manage the environment. The repository is hooked up to a Build Configuration in TeamCity which runs the build script, and the resulting versioned package is uploaded to our Nuget server. The package is then used in TeamCity via other Build Configurations to allow us to Create, Delete, Migrate and otherwise interact with the environment in question.

The creation of this process has happened in bits and pieces over the last few years, most of which I’ve written about on this blog.

It’s a decent system, and I’m proud of how far we’ve come and how much automation is now in place, but it’s certainly not without its flaws.

Bestial Rage

The biggest problem with the current process is that while the environment is fully encapsulated as code inside a validated and versioned Nuget package, actually using that package to create or delete an environment is not as simple as it could be. As I mentioned above, we have a set of TeamCity Build Configurations for each environment that allow for the major operations like Create and Delete. If you’ve made changes to an environment and want to deploy them, you have to decide what sort of action is necessary (i.e. “it’s the first time, Create” or “it already exists, Migrate”) and “run” the build, which will download the package and run the appropriate script.

This is where it gets a bit onerous, especially for production. If you want to change any of the environment parameters from the default values the package was built with, you need to provide a set of parameter overrides when you run the build. For production, this means you often end up overriding everything (because production is a separate AWS account), which can be upwards of 10 different parameters, all of which are only visible if you go and look at the source CloudFormation template. You have to do this every time you want to execute that operation (although you can copy the parameters from previous runs, which acts as a small shortcut).

The issue with this is that it means production deployments become vulnerable to human error, which is one of the things we’re trying to avoid by automating in the first place!

Another issue is that we lack a true “Update” operation. We only have Create, Delete, Clone and Migrate.

This is entirely my fault, because when I initially put the system together I had a bad experience with the CloudFormation Update command where I accidentally wiped out an S3 bucket containing customer data. As is often the case, that fear then led to an alternate (worse) solution involving cloning, checking, deleting, cloning, checking and deleting (in that order). This was safer, but incredibly slow and prone to failure.

The existence of these two problems (hard to deploy, slow failure-prone deployments) is reason enough for me to consider exploring alternative approaches for the log stack infrastructure.

Fantastic Beasts And Where To Find Them

The existing process does do a number of things well though, and has:

  • A Nuget package that contains everything necessary to interact with the environment.
  • Environment versioning, because that’s always important for traceability.
  • Environment validation via tests executed as part of the build (when possible).

Keeping those three things in mind, and combining them with the desire to ease the actual environment deployment, an improved approach looks a lot like our typical software development/deployment flow.

  1. Changes to environment are checked in
  2. Changes are picked up by TeamCity, and a build is started
  3. Build is tested (i.e. a test environment is created, validated and destroyed)
  4. Versioned Nuget package is created
  5. Package is uploaded to Octopus
  6. Octopus Release is created
  7. Octopus Release is deployed to CI
  8. Secondary validation (i.e. test CI environment to make sure it does what it’s supposed to do after deployment)
  9. [Optional] Propagation of release to Staging

In comparison to our current process, the main difference is the deployment. Prior to this, we were treating our environments as libraries (i.e. they were built, tested, packaged and uploaded to MyGet to be used by something else). Now we’re treating them as self contained deployable components, responsible for knowing how to deploy themselves.

With the approach settled, all that’s left is to come up with an actual deployment process for an environment.

Beast Mastery

There are two main cases we need to take care of when deploying a CloudFormation stack to an Octopus environment.

The first case is what to do when the CloudFormation stack doesn’t exist.

This is the easy case, all we need to do is execute New-CFNStack with the appropriate parameters and then wait for the stack to finish.

The second case is what we should do when the CloudFormation stack already exists, which is the case that is not particularly well covered by our current environment management process.

Luckily, CloudFormation makes this relatively easy with the Update-CFNStack command. Updates are dangerous (as I mentioned above), but if you’re careful with resources that contain state, they are pretty efficient. The implementation of the update is quite smart as well, and will only update the things that have changed in the template (i.e. if you’ve only changed the Load Balancer, it won’t recreate all of your EC2 instances).

The completed deployment script is shown in full below.

[CmdletBinding()]
param
(

)

$here = Split-Path $script:MyInvocation.MyCommand.Path;
$rootDirectory = Get-Item ($here);
$rootDirectoryPath = $rootDirectory.FullName;

$ErrorActionPreference = "Stop";

$component = "unique-stack-name";

if ($OctopusParameters -ne $null)
{
    $parameters = ConvertFrom-StringData ([System.IO.File]::ReadAllText("$here/cloudformation.parameters.octopus"));

    $awsKey = $OctopusParameters["AWS.Deployment.Key"];
    $awsSecret = $OctopusParameters["AWS.Deployment.Secret"];
    $awsRegion = $OctopusParameters["AWS.Deployment.Region"];
}
else 
{
    $parameters = ConvertFrom-StringData ([System.IO.File]::ReadAllText("$here/cloudformation.parameters.local"));
    
    $path = "C:\creds\credentials.json";
    Write-Verbose "Attempting to load credentials (AWS Key, Secret, Region, Octopus Url, Key) from local, non-repository stored file at [$path]. This is done this way to allow for a nice development experience in vscode"
    $creds = ConvertFrom-Json ([System.IO.File]::ReadAllText($path));
    $awsKey = $creds.aws."aws-account".key;
    $awsSecret = $creds.aws."aws-account".secret;
    $awsRegion = $creds.aws."aws-account".region;

    $parameters["OctopusAPIKey"] = $creds.octopus.key;
    $parameters["OctopusServerURL"] = $creds.octopus.url;
}

$parameters["Component"] = $component;

$environment = $parameters["OctopusEnvironment"];

. "$here/scripts/common/Functions-Aws.ps1";
. "$here/scripts/common/Functions-Aws-CloudFormation.ps1";

Ensure-AwsPowershellFunctionsAvailable

$tags = @{
    "environment"=$environment;
    "environment:version"=$parameters["EnvironmentVersion"];
    "application"=$component;
    "function"="logging";
    "team"=$parameters["team"];
}

$stackName = "$environment-$component";

$exists = Test-CloudFormationStack -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion -StackName $stackName

$cfParams = ConvertTo-CloudFormationParameters $parameters;
$cfTags = ConvertTo-CloudFormationTags $tags;
$args = @{
    "-StackName"=$stackName;
    "-TemplateBody"=[System.IO.File]::ReadAllText("$here\cloudformation.template");
    "-Parameters"=$cfParams;
    "-Tags"=$cfTags;
    "-AccessKey"=$awsKey;
    "-SecretKey"=$awsSecret;
    "-Region"=$awsRegion;
    "-Capabilities"="CAPABILITY_IAM";
};

if ($exists)
{
    Write-Verbose "The stack [$stackName] exists, so I'm going to update it. Its better this way"
    $stackId = Update-CFNStack @args;

    $desiredStatus = [Amazon.CloudFormation.StackStatus]::UPDATE_COMPLETE;
    $failingStatuses = @(
        [Amazon.CloudFormation.StackStatus]::UPDATE_FAILED,
        [Amazon.CloudFormation.StackStatus]::UPDATE_ROLLBACK_IN_PROGRESS,
        [Amazon.CloudFormation.StackStatus]::UPDATE_ROLLBACK_COMPLETE
    );
    Wait-CloudFormationStack -StackName $stackName -DesiredStatus  $desiredStatus -FailingStates $failingStatuses -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion

    Write-Verbose "Stack [$stackName] Updated";
}
else 
{
    Write-Verbose "The stack [$stackName] does not exist, so I'm going to create it. Just watch me"
    $args.Add("-DisableRollback", $true);
    $stackId = New-CFNStack @args;

    $desiredStatus = [Amazon.CloudFormation.StackStatus]::CREATE_COMPLETE;
    $failingStatuses = @(
        [Amazon.CloudFormation.StackStatus]::CREATE_FAILED,
        [Amazon.CloudFormation.StackStatus]::ROLLBACK_IN_PROGRESS,
        [Amazon.CloudFormation.StackStatus]::ROLLBACK_COMPLETE
    );
    Wait-CloudFormationStack -StackName $stackName -DesiredStatus  $desiredStatus -FailingStates $failingStatuses -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion

    Write-Verbose "Stack [$stackName] Created";
}

Other than the Create/Update logic that I’ve already talked about, the only other interesting thing in the deployment script is the way that it deals with parameters.

Basically, if the script detects that it’s being run from inside Octopus Deploy (via the presence of an $OctopusParameters variable), it will load all of its parameters (as a hashtable) from a particular local file. This file leverages the Octopus variable substitution feature, so that when we deploy the infrastructure to the various environments, it gets the appropriate values (like a different VPC, because prod is a separate AWS account to CI). When it’s not running in Octopus, it just uses a different file, structured very similarly, with test/scratch values in it.
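
For illustration, the sort of content that file might contain (the keys are a mix of ones visible in the script above and made up networking values, and the Octopus variable names are assumptions):

# Illustrative only: the kind of key=value content the parameters file might contain. Octopus replaces
# the #{...} markers with environment specific values before the deployment script reads the file,
# at which point ConvertFrom-StringData turns it into the hashtable used above.
$exampleParametersFile = @'
OctopusEnvironment=#{Octopus.Environment.Name}
EnvironmentVersion=#{EnvironmentVersion}
team=#{Team}
VpcId=#{AWS.VPC.Id}
PrivateSubnetIds=#{AWS.Subnets.Private}
'@;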

With the deployment script in place, we plug the whole thing into our existing “deployable” component structure and we have automatic deployment of tested, versioned infrastructure via Octopus Deploy.

Conclusion

Of course, being a first version, the deployment logic that I’ve described above is not perfect. For example, there is no support for deploying to an environment where the stack is in error (failing stacks can’t be updated, but they already exist, so you have to delete the stack and start again) and there is little to no feedback available if a stack creation/update fails for some reason.

Additionally, the code could benefit from being extracted to a library for reuse.

All in all, the deployment process I just described is a lot simpler than the one I described at the start of this post, and it’s managed by Octopus, which makes it consistent with the way that we do everything else, which is nice.

With a little bit more polish, and some pretty strict usage of the CloudFormation features that stop you accidentally deleting databases full of valuable data, I think it will be a good replacement for what we do now.


Way back in March 2015, I wrote a few posts explaining how we set up our log aggregation. I’ve done a lot of posts since then about logging in general and about specific problems we’ve encountered in various areas, but I’ve never really revisited the infrastructure underneath the stack itself.

The main reason for the lack of posts is that the infrastructure I described back then is not the infrastructure we’re using now. As we built more things and started pushing more and more data into the stack we had, we began to experience some issues, mostly related to the reliability of the Elasticsearch process. At the time, the organization decided that it would be better if our internal operations team were responsible for dealing with these issues, and they built a new stack as a result.

This was good and bad. The good part was the argument that if we didn’t have to spend our time building and maintaining the system, it should theoretically leave more time and brainspace for us to focus on actual software development. Bad for almost exactly the same reason: problems with the service would need to be resolved by a different team, one with their own set of priorities and their own schedule.

The arrangement worked okay for a while, until the operations team were engaged on a relatively complex set of projects and no longer had the time to extend and maintain the log stack as necessary. They did their best, but with no resources dedicated to dealing with the maintenance on the existing stack, it started to degrade surprisingly quickly.

This came to a head when we had a failure in the new stack that required us to replace some EC2 instances via the Auto Scaling Group, and the operations team was unavailable to help. When we executed the scaling operation, we discovered that it was creating instances that didn’t actually have all of the required software setup in order to fulfil their intended role. At some point in the past someone had manually made changes to the instances already in service and these changes had not been made in the infrastructure as code.

After struggling with this for a while, we decided to reclaim the stack and make it our responsibility again.

Beast Mode

Architecturally, the new log stack was a lot better than the old one, even taking into account the teething issues that we did have.

The old stack was basically an Auto Scaling Group capable of creating EC2 instances with Elasticsearch, Logstash and Kibana, along with a few load balancers for access purposes. While the old stack could theoretically scale out to better handle load, we never really tested that capability in production, and I’m pretty sure it wouldn’t have worked (looking back, I doubt the Elasticsearch clustering was set up correctly, in addition to some other issues with the way the Logstash indexes were being configured).

The new stack looked a lot like the reference architecture described on the Logstash website, which was good, because those guys know their stuff.

At a high level, log events would be shipped from many different places to a Broker layer (auto scaling Logstash instances behind a Load Balancer) which would then cache those events in a queue of some description (initially RabbitMQ, later Redis). An Indexer layer (auto scaling Logstash instances) would pull events off the queue at a sustainable pace, process them and place them into Elasticsearch. Users would then use Kibana (hosted on the Elasticsearch instances for ease of access) to interact with the data.

There are a number of benefits to the architecture described above, but a few of the biggest ones are:

  • It’s not possible for an influx of log events to shut down Elasticsearch, because the Indexer layer is pulling events out of the cache at a sustainable rate. The cache might start to fill up if the number of events rises, but we’ll still be able to use Elasticsearch.
  • The cache provides a buffer if something goes wrong with either the Indexer layer or Elasticsearch. We had some issues with Elasticsearch crashing in our old log stack, so having some protection against losing log events in the event of downtime was beneficial.

There were downsides as well, the most pertinent of which was that the new architecture was a lot more complicated than the old one, with a lot of moving parts. This made it harder to manage and understand, and increased the number of different ways in which it could break.

Taking all of the above into account, when we reclaimed the stack we decided to keep the architecture intact, and just improve it.

But how?

Beast-Like Vigour

The way in which the stack was described was not bad. It just wasn’t quite as controlled as the way we’d been handling our other environments, mostly as a factor of being created/maintained by a different team.

Configuration for the major components (Logstash Broker, Logstash Indexer, Elasticsearch, Kibana) was source controlled, with builds in TeamCity, and deployment was handled by pushing Nuget packages through Octopus. This was good, and wouldn’t require much work to bring into line with the rest of our stuff. All we would have to do was ensure all of the pertinent deployment logic was encapsulated in the Git repositories and maybe add some tests.

The infrastructure needed some more effort. It was all defined using CloudFormation templates, which was excellent, but there was no build/deployment pipeline for the templates and they were not versioned. In order to put such a pipeline in place, we would need to have CI and Staging deployments of the infrastructure as well, which did not yet exist. The infrastructure definition for each layer also shared a repository with the relevant configuration (i.e. Broker Environment with Broker Config), which was against our existing patterns. Finally, the cache/queue layer did not have an environment definition at all, because the current one (ElastiCache with Redis) had been manually created to replace the original one (RabbitMQ) as a result of some issues where the cache/queue filled up and then became unrecoverable.

In addition to the above, once we’ve improved all of the processes and got everything under control, we need to work on fixing the actual bugs in the stack (like Logstash logs filling up disks, Elasticsearch mapping templates not being setup correctly, no alerting/monitoring on the various layers, etc). Some of these things will probably be fixed as we make the process improvements, but others will require dedicated effort.

To Be Continued

With the total scope of the work laid out (infrastructure build/deploy, clean up configuration deployment, re-create infrastructure in appropriate AWS accounts, fix bugs), it’s time to get cracking.

The first cab off the rank is the work required to create a process that will allow us to fully automate the build and deployment of the infrastructure. Without that sort of system in place, we would have to do all of the other things manually.

The Broker layer is the obvious starting point, so next week I’ll outline how we went about using a combination of TeamCity, Nuget, Octopus and Powershell to accomplish a build and deployment pipeline for the infrastructure.