
This post is not as technical as some of my others. I really just want to bring attention to a tool for Elasticsearch that I honestly don’t think I could do without.

Cerebro.

From my experience, one of the hardest things to wrap my head around when working with Elasticsearch was visualizing how everything fit together. My background is primarily C# and .NET in a very Microsoft world, so I’m used to things like SQL Server, which comes with an excellent exploration and interrogation tool in the form of SQL Server Management Studio. When it comes to Elasticsearch though, there seems to be no equivalent, so I felt particularly blind.

Since starting to use Elasticsearch, I’ve become more and more fond of the command line, and as a result I’ve come to appreciate its amazing HTTP API, but that initial learning curve was pretty vicious.
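
If you’re in the same boat, even a couple of requests against that HTTP API go a long way towards making a cluster feel less opaque. A quick sketch, assuming Elasticsearch is listening locally on its default port:

# Overall cluster health (status colour, node count, unassigned shards and so on)
curl -s "http://localhost:9200/_cluster/health?pretty"

# A human readable list of indexes with their document counts and sizes
curl -s "http://localhost:9200/_cat/indices?v"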

Anyway, to bring it back around, my first port of call when I started using Elasticsearch was to find a tool conceptually similar to SQL Server Management Studio. Something I could use to both visualize the storage system (however it worked) and possibly even query it as necessary.

I found Kopf.

Kopf did exactly what I wanted it to do. It provided a nice interface on top of Elasticsearch that helped me visualize how everything was structured and what sort of things I could do. To this day, if I attempt to visualize an Elasticsearch cluster in my head, the pictures that come to mind are of the Kopf interface. I can thank it for my understanding of the cluster, the nodes that make it up and the indexes stored therein, along with the excellent Elasticsearch documentation of course.

Later on I learnt that Kopf didn’t have to be used from the creator’s demonstration website (which is how I had been using it, connecting from my local machine to our ES ELK cluster), but could in fact be installed as a plugin inside Elasticsearch itself, which was even better, because you could access it from {es-url}/_plugin/kopf, which was a hell of a lot easier.

Unfortunately, everything changed when the fire nation attacked…

No wait, that’s not right.

Everything changed when Elasticsearch 5 was released.

I’m The Juggernaut

Elasticsearch 5 deprecated site plugins. No more site plugins meant no more Kopf, or at least no more Kopf hosted within Elasticsearch. This made me sad, obviously, but I could still use the standalone site, so it wasn’t the end of the world.

My memory of the next bit is a little bit fuzzy, but I think even the standalone site stopped working properly when connecting to Elasticsearch 5. The creator of Kopf was no longer maintaining the project either, so it was unlikely that the problems would be solved.

I was basically blind.

Enter Cerebro.

No bones about it, Cerebro IS Kopf. It’s made by the same guy and is still being actively developed. It’s pretty much a standalone Kopf (i.e. it comes with a built-in web server), and any differences between the two (other than some cosmetic stuff and the capability to easily save multiple Elasticsearch addresses) are lost on me.

As of this post, it’s up to 0.6.5, but as far as I can tell, it’s fully functional.

For my usage, I’ve incorporated Cerebro into our ELK stack, with a simple setup (ELB + single instance ASG), pre-configured with the appropriate Elasticsearch address in each environment that we spin up. As is the normal pattern, I’ve set it up on an AMI via Packer, and I deploy its configuration via Octopus Deploy, but there is nothing particularly complicated there.
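
For the curious, the only Cerebro specific configuration that really matters is the list of Elasticsearch addresses in its application.conf, which is exactly the sort of thing that gets stamped with environment appropriate values during that Octopus deployment. A rough sketch (the install path, address and cluster name are invented for illustration):

# Hypothetical example: pre-populate Cerebro's list of known clusters so nobody has to type addresses in manually
cat > /opt/cerebro/conf/application.conf <<'EOF'
hosts = [
  {
    host = "http://elasticsearch.logging.internal:9200"
    name = "logging"
  }
]
EOF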

Kitty, It’s Just A Phase

This post is pretty boring so far, so let’s talk about Cerebro a little with the help of a screenshot.

This is the main screen of Cerebro, and it contains a wealth of information to help you get a handle on your Elasticsearch cluster.

It shows an overview of the cluster status, data nodes, indexes and their shards and replicas.

  • The cluster status is shown at the top of the screen, mostly via colour. Green good, yellow troublesome, red bad. Helpfully enough, the icon in the browser also changes colour according to the cluster status.
  • Data nodes are shown on the left, and display information like memory, CPU and disk, as well as IP address and name.
  • Indexes pretty much fill the rest of the screen, displaying important statistics like the number of documents and size, while allowing you to access things like mappings and index operations (like delete).
  • The intersection of index and data node gives information about shard/replica allocation. In the example above, we have 3 shards, 2 replicas and 3 nodes, so each node has a full copy of the data (see the example request just after this list). Solid squares indicate the primary shard.
  • If you have unassigned or relocating shards, this information appears directly above the nodes, and shards currently being moved are shown in the same place as normal shards, except in blue.
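
For reference, the layout in that example is just a consequence of the index settings; a hypothetical request that would produce it (the index name is made up) looks like this:

# Create an index with 3 primary shards, each with 2 replicas, giving 9 shards spread across the 3 data nodes
curl -XPUT "http://localhost:9200/logstash-2017.04.01" -H 'Content-Type: application/json' -d '
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  }
}'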

Honestly, I don’t really use the other screens in Cerebro very much, or at least nowhere near as much as I use the overview screen. The dedicated nodes screen can be useful to view your master nodes (which aren’t shown on the overview), and to get a more performance focused display. I’ve also used the index templates screen for managing/viewing our logstash index template, but that’s mostly done through an Octopus deployment now.

There are others (including an ad-hoc query screen), but again, I haven’t really dug into them in depth. At least not enough to talk about them anyway.

That first screen though, the overview, is worth its weight in gold as far as I’m concerned.

Conclusion

I doubt I would understand Elasticsearch anywhere near as much as I do without Kopf/Cerebro. Realistically, I don’t really understand it much at all, but that little understanding I do have would be non-existent without these awesome tools.

It’s not a one-horse town though. Elastic.co provides some equivalent tools as well (like Monitoring, formerly Marvel) which offer similar capabilities, but they are mostly paid services as far as I can tell, so I’ve been hesitant to explore them in more depth.

I’m already spending way too much on the hardware for our log stack, so adding software costs on top of that is a challenging battle that I’m not quite ready to fight.

It doesn’t help that the last time I tried to price it, their answer for “How much for the things?” was basically “How much you got?”.


With environments and configuration out of the way, it’s time to put all of the pieces together.

Obviously this isn’t the first time that both of those things have been put together. In order to validate that everything was working as expected, I was constantly creating/updating environments and deploying new versions of the configuration to them. Not only that, but with the way our deployment pipeline works (commit, push, build, test, deploy [test], [deploy]), the CI environment had been up and running for some time.

What’s left to do then?

Well, we still need to create the Staging and Production environments, which should be easy because those are just deployments inside Octopus now.

The bigger chunk of work is to use those new environments and to redirect all of our existing log traffic as appropriate.

Hostile Deployment

This is a perfect example of why I spend time on automating things.

With the environments setup to act just like everything else in Octopus, all I had to do to create a Staging environment was push a button. Once the deployment finished and the infrastructure was created, it was just another button push to deploy the configuration for that environment to make it operational. Rinse and repeat for all of the layers (Broker, Indexer, Cache, Elasticsearch) and Staging is up and running.

Production was almost the same, with one caveat. We use an entirely different AWS account for all our production resources, so we had to override all of the appropriate Octopus variables for each environment project (like AWS credentials, VPC ID, Subnet IDs, etc). With those overrides in place, all that’s left is to make new releases (to capture the variables) and deploy to the appropriate environments.

It’s nice when everything works.

Redirecting Firepower

Of course, the new logging environments are worthless without log events. Luckily, we have plenty of those:

  • IIS logs from all of our APIs
  • Application logs from all of our APIs
  • ELB logs from a subset of our load balancers, most of which are APIs, but at least one is an Nginx router
  • Simple performance counter statistics (like CPU, Memory, Disk, etc) from basically every EC2 instance
  • Logs from on-premises desktop applications

We generally have CI, Staging and prod-X (green/blue) environments for all of our services/APIs (because that’s how our build/test/deployment pipeline works), so now that we have similarly partitioned logging environments, all we have to do is line them up (CI to CI, Staging to Staging and so on).

For the on-premises desktop applications, there is no CI, but they do generally have the capability to run in Staging mode, so we can use that setting to direct log traffic.

There are a few ways in which the log events hit the Broker layer:

  • Internal Logstash instance running on an EC2 instance with a TCP output pointing at the Broker hostname (a rough example of this output is shown just after this list)
  • Internal Lambda function writing directly to the Broker hostname via TCP (this is the ELB logs processor)
  • External application writing to an authenticated Logging API, which in turn writes to the Broker via TCP (this is for the on-premises desktop applications)
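
As a rough example of that first mechanism, the output configuration for a shipper is tiny; the Broker address is just an Octopus variable that gets substituted into the source controlled file at deployment time (the file name, port and variable name below are invented for illustration):

# Hypothetical shipper output configuration; #{Logging.Broker.Hostname} is an Octopus variable
# (in the real pipeline Octopus substitutes the value before the file is copied into /etc/logstash/conf.d/)
cat > /etc/logstash/conf.d/99-output-broker.conf <<'EOF'
output {
  tcp {
    host => "#{Logging.Broker.Hostname}"
    port => 5000
    codec => "json_lines"
  }
}
EOF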

We can change the hostname used by all of these mechanisms simply by changing some variables in Octopus Deploy, making a new release and deploying it through the environments.

And that’s exactly what we did, making sure to monitor the log event traffic for each one to make sure we didn’t lose anything.

With all the logs going to their new homes, all that was left to do was destroy the old log stack, easily done manually through CloudFormation.
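
If you’d rather not click through the AWS console, it’s a one liner from the command line as well (the stack name here is invented):

# Tear down the old logging stack once nothing is pointing at it any more
aws cloudformation delete-stack --stack-name old-log-stack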

You might be wondering about the log events that were stored in the old stack. Well, we generally only keep around 14 days’ worth of log events in the stack itself (because there are so many), so we pretty much just left the old stack up for searching purposes until it was no longer relevant, and then destroyed it.

Conclusion

And that basically brings us to the end of this series of posts about our logging environment and the reclamation thereof.

We’ve now got our standard deployment pipeline in place for both infrastructure and configuration and have separated our log traffic accordingly.

This puts us in a much better position moving forward. Not only is the entire thing fully under our control, but we now have the capability to test changes to infrastructure/configuration before just deploying them into production, something we couldn’t do before when we only had a single stack for everything.

In all fairness though, all we really did was polish an existing system so that it was a better fit for our specific usage.

Evolutionary, not revolutionary.


Continuing on from last week, it’s time to talk about software and the configuration thereof.

With the environments themselves being managed by Octopus (i.e. the underlying infrastructure), we need to deal with the software side of things.

Four of the five components in the new ELK stack require configuration of some sort in order to work properly:

  • Logstash requires configuration to tell it how to get events, filter/process them and where to put them, so the Broker and Indexer layers require different, but similar, configuration.
  • Elasticsearch requires configuration for a variety of reasons including cluster naming, port setup, memory limits and so on.
  • Kibana requires configuration to know where Elasticsearch is, and a few other things.

For me, configuration is a hell of a lot more likely to change than the software itself, although with the pace that some organisations release software, that might not always be strictly true. Also, coming from a primarily Windows background, software is traditionally a lot more difficult to install and set up, and by that logic is not something you want to do all the time.

Taking those things into account, I’ve found it helpful to separate the installation of the software from the configuration of the software. What this means in practice is that a particular version of the software itself will be baked into an AMI, and then the configuration of that software will be handled via Octopus Deploy whenever a machine is created from the AMI.
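
To make that split concrete, the provisioning script Packer runs when baking (for example) a Logstash AMI is little more than a package installation, deliberately leaving the configuration directory empty for Octopus to fill in later. A sketch, assuming a recent Ubuntu base image and the standard Elastic package repository (the specifics are illustrative, not our actual script):

#!/bin/bash
# Bake a specific major version of Logstash into the image. Configuration is deliberately NOT included;
# /etc/logstash/conf.d stays empty until Octopus Deploy supplies it when a machine comes online.
set -e
sudo apt-get update
sudo apt-get install -y apt-transport-https openjdk-8-jre-headless
wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
echo "deb https://artifacts.elastic.co/packages/5.x/apt stable main" | sudo tee /etc/apt/sources.list.d/elastic-5.x.list
sudo apt-get update
sudo apt-get install -y logstash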

Using AMIs or Docker Images to create immutable software + configuration artefacts is also a valid approach, and is superior in a lot of respects. It makes dynamic scaling easier (by facilitating a quick startup of fully functional nodes), helps with testing and generally simplifies the entire process. Docker Images in particular are something that I would love to explore in the future, just not right at this moment.

The good news is that this is pretty much exactly what the new stack was already doing, so we only need to make a few minor improvements and we’re good to go.

The Lament Configuration

As I mentioned in the first post in this series, software configuration was already being handled by TeamCity/Nuget/Octopus Deploy, it just needed to be cleaned up a bit. First thing was to move the configuration out into its own repository for each layer and rework the TeamCity Build Configurations as necessary. The Single Responsibility Principle doesn’t just apply to classes after all.

The next part is something of a personal preference, and it relates to the logic around deployment. All of the existing configuration deployments in the new stack had their logic (i.e. where to copy the files on the target machine, how to start/restart services and so on) encapsulated entirely inside Octopus Deploy. I’m not a fan of that. I much prefer to have all of this logic inside scripts in source control alongside the artefacts that will be deployed. This leaves projects in Octopus Deploy relatively simple, only responsible for deploying Nuget packages, managing variables (which is hard to encapsulate in source control because of sensitive values) and generally overseeing the whole process. This is the same sort of approach that I use for building software, with TeamCity acting as a relatively stupid orchestration tool, executing scripts that live inside source control with the code.

Octopus actually makes using source controlled scripts pretty easy, as it will automatically execute scripts named a certain way at particular times during the deployment of a Nuget package (for example, any script called deploy.ps1 at the root of the package will be executed after the package has been copied to the appropriate location on the target machine). The nice thing is that this also works with bash scripts for Linux targets (i.e. deploy.sh), which is particularly relevant here, because all of the ELK stuff happens on Linux.

Actually deploying most of the configuration is pretty simple. For example, this is the deploy.sh script for the ELK Broker configuration.

#!/bin/bash

# The deploy script is automatically run by Octopus during a deployment, after Octopus does its thing.
# Octopus deploys the contents of the package to /tmp/elk-broker/
# At this point, the logstash configuration directory has been cleared by the pre-deploy script

# Echo commands after expansion
set -x

# Copy the settings file
cp /tmp/elk-broker/logstash.yml /etc/logstash/logstash.yml || exit 1

# Copy the config files
cp /tmp/elk-broker/*.conf /etc/logstash/conf.d/ || exit 1

# Remove the UTF-8 BOM from the config files
sed -i '1 s/^\xef\xbb\xbf//' /etc/logstash/conf.d/*.conf || exit 1

# Test the configuration
sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/ -t --path.settings /etc/logstash || exit 1

# Set the ownership of the config files to the logstash user, which is what the service runs as
sudo chown -R logstash:logstash /etc/logstash/conf.d || exit 1

# Restart logstash - don't use the restart command, it throws errors when you try to restart a stopped service
sudo initctl stop logstash || true
sudo initctl start logstash

Prior to the script based approach, this logic was spread across five or six different steps inside an Octopus Project, which I found much harder to read and reason about.

Or Lemarchand’s Box If You Prefer

The only other difference worth talking about is the way in which we actually trigger configuration deployments.

Traditionally, we have asked Octopus to deploy the most appropriate version of the necessary projects during the initialization of a machine. For example, the ELK Broker EC2 instances had logic inside their LaunchConfiguration:UserData that said “register as Tentacle”, “deploy X”, “deploy Y” etc.

This time I tried something a little different, but which feels a hell of a lot smarter.

Instead of the machine being responsible for asking for projects to be deployed to it, we can just let Octopus react to the registration of a new Tentacle and deploy whatever Projects are appropriate. This is relatively easy to setup as well. All you need to do is add a trigger to your Project that says “deploy whenever a new machine comes online”. Octopus takes care of the rest, including picking what version is best (which is just the last successful deployment to the environment).

This is a lot cleaner than hardcoding project deployment logic inside the environment definition, and allows for changes to what software gets deployed where without actually having to edit or update the infrastructure definition. This sort of automatic deployment approach is probably more applicable to our old way of handling environments (i.e. that whole terrible migration process with no update logic) than it is to the newer, easier to update environment deployments, but it’s still nice all the same.

Conclusion

There really wasn’t much effort required to clean up the configuration for each of the layers in the new ELK stack, but it was a great opportunity to try out the new trigger based deployments in Octopus Deploy, which was pretty cool.

With the configuration out of the way, and the environment creation also sorted, all that’s left is to actually create some new environments and start using them instead of the old one.

That’s a topic for next time though.


With all of the general context and solution outlining done for now, it’s time to delve into some of the details. Specifically, the build/test/deploy pipeline for the log stack environments.

Unfortunately, we use the term environment to describe two things. The first is an Octopus environment, which is basically a grouping construct inside Octopus Deploy, like CI or prod-green. The second is a set of infrastructure intended for a specific purpose, like an Auto Scaling Group and Load Balancer intended to host an API. In the case of the log stack, we have distinct environments for the infrastructure of each layer, like the Broker and the Indexer.

Our environments are all conceptually similar: there is a Git repository that contains everything necessary to create or update the infrastructure (CloudFormation templates, Powershell scripts, etc), along with the logic for what it means to build and validate a Nuget package that can be used to manage the environment. The repository is hooked up to a Build Configuration in TeamCity which runs the build script, and the resulting versioned package is uploaded to our Nuget server. The package is then used in TeamCity via other Build Configurations to allow us to Create, Delete, Migrate and otherwise interact with the environment in question.

The creation of this process has happened in bits and pieces over the last few years, most of which I’ve written about on this blog.

It’s a decent system, and I’m proud of how far we’ve come and how much automation is now in place, but it’s certainly not without its flaws.

Bestial Rage

The biggest problem with the current process is that while the environment is fully encapsulated as code inside a validated and versioned Nuget package, actually using that package to create or delete an environment is not as simple as it could be. As I mentioned above, we have a set of TeamCity Build Configurations for each environment that allow for the major operations like Create and Delete. If you’ve made changes to an environment and want to deploy them, you have to decide what sort of action is necessary (i.e. “it’s the first time, Create” or “it already exists, Migrate”) and “run” the build, which will download the package and run the appropriate script.

This is where it gets a bit onerous, especially for production. If you want to change any of the environment parameters from the default values the package was built with, you need to provide a set of parameter overrides when you run the build. For production, this means you often end up overriding everything (because production is a separate AWS account), which can be upwards of 10 different parameters, all of which are only visible if you go and look at the source CloudFormation template. You have to do this every time you want to execute that operation (although you can copy the parameters from previous runs, which acts as a small shortcut).

The issue with this is that it means production deployments become vulnerable to human error, which is one of the things we’re trying to avoid by automating in the first place!

Another issue is that we lack a true “Update” operation. We only have Create, Delete, Clone and Migrate.

This is entirely my fault, because when I initially put the system together I had a bad experience with the CloudFormation Update command where I accidentally wiped out an S3 bucket containing customer data. As is often the case, that fear then led to an alternate (worse) solution involving cloning, checking, deleting, cloning, checking and deleting (in that order). This was safer, but incredibly slow and prone to failure.

The existence of these two problems (hard to deploy, slow failure-prone deployments) is reason enough for me to consider exploring alternative approaches for the log stack infrastructure.

Fantastic Beasts And Where To Find Them

The existing process does do a number of things well though, and has:

  • A Nuget package that contains everything necessary to interact with the environment.
  • Environment versioning, because that’s always important for traceability.
  • Environment validation via tests executed as part of the build (when possible).

Keeping those three things in mind, and combining them with the desire to ease the actual environment deployment, an improved approach looks a lot like our typical software development/deployment flow.

  1. Changes to environment are checked in
  2. Changes are picked up by TeamCity, and a build is started
  3. Build is tested (i.e. a test environment is created, validated and destroyed)
  4. Versioned Nuget package is created
  5. Package is uploaded to Octopus
  6. Octopus Release is created
  7. Octopus Release is deployed to CI
  8. Secondary validation (i.e. test CI environment to make sure it does what it’s supposed to do after deployment)
  9. [Optional] Propagation of release to Staging

In comparison to our current process, the main difference is the deployment. Prior to this, we were treating our environments as libraries (i.e. they were built, tested, packaged and uploaded to MyGet to be used by something else). Now we’re treating them as self-contained deployable components, responsible for knowing how to deploy themselves.

With the approach settled, all that’s left is to come up with an actual deployment process for an environment.

Beast Mastery

There are two main cases we need to take care of when deploying a CloudFormation stack to an Octopus environment.

The first case is what to do when the CloudFormation stack doesn’t exist.

This is the easy case, all we need to do is execute New-CFNStack with the appropriate parameters and then wait for the stack to finish.

The second case is what we should do when the CloudFormation stack already exists, which is the case that is not particularly well covered by our current environment management process.

Luckily, CloudFormation makes this relatively easy with the Update-CFNStack command. Updates are dangerous (as I mentioned above), but if you’re careful with resources that contain state, they are pretty efficient. The implementation of the update is quite smart as well, and will only update the things that have changed in the template (i.e. if you’ve only changed the Load Balancer, it won’t recreate all of your EC2 instances).

The completed deployment script is shown in full below.

[CmdletBinding()]
param
(

)

$here = Split-Path $script:MyInvocation.MyCommand.Path;
$rootDirectory = Get-Item ($here);
$rootDirectoryPath = $rootDirectory.FullName;

$ErrorActionPreference = "Stop";

$component = "unique-stack-name";

# When this script runs inside Octopus Deploy, $OctopusParameters is defined automatically, so credentials
# and parameters come from Octopus variables; otherwise we fall back to local, non-repository files.
if ($OctopusParameters -ne $null)
{
    $parameters = ConvertFrom-StringData ([System.IO.File]::ReadAllText("$here/cloudformation.parameters.octopus"));

    $awsKey = $OctopusParameters["AWS.Deployment.Key"];
    $awsSecret = $OctopusParameters["AWS.Deployment.Secret"];
    $awsRegion = $OctopusParameters["AWS.Deployment.Region"];
}
else 
{
    $parameters = ConvertFrom-StringData ([System.IO.File]::ReadAllText("$here/cloudformation.parameters.local"));
    
    $path = "C:\creds\credentials.json";
    Write-Verbose "Attempting to load credentials (AWS Key, Secret, Region, Octopus Url, Key) from local, non-repository stored file at [$path]. This is done this way to allow for a nice development experience in vscode"
    $creds = ConvertFrom-Json ([System.IO.File]::ReadAllText($path));
    $awsKey = $creds.aws."aws-account".key;
    $awsSecret = $creds.aws."aws-account".secret;
    $awsRegion = $creds.aws."aws-account".region;

    $parameters["OctopusAPIKey"] = $creds.octopus.key;
    $parameters["OctopusServerURL"] = $creds.octopus.url;
}

$parameters["Component"] = $component;

$environment = $parameters["OctopusEnvironment"];

. "$here/scripts/common/Functions-Aws.ps1";
. "$here/scripts/common/Functions-Aws-CloudFormation.ps1";

Ensure-AwsPowershellFunctionsAvailable

$tags = @{
    "environment"=$environment;
    "environment:version"=$parameters["EnvironmentVersion"];
    "application"=$component;
    "function"="logging";
    "team"=$parameters["team"];
}

$stackName = "$environment-$component";

$exists = Test-CloudFormationStack -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion -StackName $stackName

# Convert the hashtables into the parameter/tag structures the CloudFormation cmdlets expect, then build
# a single set of arguments that can be splatted into either New-CFNStack or Update-CFNStack.
$cfParams = ConvertTo-CloudFormationParameters $parameters;
$cfTags = ConvertTo-CloudFormationTags $tags;
$args = @{
    "-StackName"=$stackName;
    "-TemplateBody"=[System.IO.File]::ReadAllText("$here\cloudformation.template");
    "-Parameters"=$cfParams;
    "-Tags"=$cfTags;
    "-AccessKey"=$awsKey;
    "-SecretKey"=$awsSecret;
    "-Region"=$awsRegion;
    "-Capabilities"="CAPABILITY_IAM";
};

if ($exists)
{
    Write-Verbose "The stack [$stackName] exists, so I'm going to update it. Its better this way"
    $stackId = Update-CFNStack @args;

    $desiredStatus = [Amazon.CloudFormation.StackStatus]::UPDATE_COMPLETE;
    $failingStatuses = @(
        [Amazon.CloudFormation.StackStatus]::UPDATE_FAILED,
        [Amazon.CloudFormation.StackStatus]::UPDATE_ROLLBACK_IN_PROGRESS,
        [Amazon.CloudFormation.StackStatus]::UPDATE_ROLLBACK_COMPLETE
    );
    Wait-CloudFormationStack -StackName $stackName -DesiredStatus  $desiredStatus -FailingStates $failingStatuses -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion

    Write-Verbose "Stack [$stackName] Updated";
}
else 
{
    Write-Verbose "The stack [$stackName] does not exist, so I'm going to create it. Just watch me"
    $args.Add("-DisableRollback", $true);
    $stackId = New-CFNStack @args;

    $desiredStatus = [Amazon.CloudFormation.StackStatus]::CREATE_COMPLETE;
    $failingStatuses = @(
        [Amazon.CloudFormation.StackStatus]::CREATE_FAILED,
        [Amazon.CloudFormation.StackStatus]::ROLLBACK_IN_PROGRESS,
        [Amazon.CloudFormation.StackStatus]::ROLLBACK_COMPLETE
    );
    Wait-CloudFormationStack -StackName $stackName -DesiredStatus  $desiredStatus -FailingStates $failingStatuses -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion

    Write-Verbose "Stack [$stackName] Created";
}

Other than the Create/Update logic that I’ve already talked about, the only other interesting thing in the deployment script is the way that it deals with parameters.

Basically, if the script detects that it’s being run from inside Octopus Deploy (via the presence of an $OctopusParameters variable), it will load all of its parameters (as a hashtable) from a particular local file. This file leverages the Octopus variable substitution feature, so that when we deploy the infrastructure to the various environments, it gets the appropriate values (like a different VPC, because prod is a separate AWS account to CI). When it’s not running in Octopus, it just uses a different file, structured very similarly, with test/scratch values in it.

With the deployment script in place, we plug the whole thing into our existing “deployable” component structure and we have automatic deployment of tested, versioned infrastructure via Octopus Deploy.

Conclusion

Of course, being a first version, the deployment logic that I’ve described above is not perfect. For example, there is no support for deploying to an environment where the stack is in error (failing stacks can’t be updated, but they already exist, so you have to delete the stack and start again), and there is little to no feedback available if a stack creation/update fails for some reason.

Additionally, the code could benefit from being extracted to a library for reuse.

All in all, the deployment process I just described is a lot simpler than the one I described at the start of this post, and it’s managed by Octopus, which makes it consistent with the way that we do everything else, which is nice.

With a little bit more polish, and some pretty strict usage of the CloudFormation features that stop you accidentally deleting databases full of valuable data, I think it will be a good replacement for what we do now.


Way back in March 2015, I wrote a few posts explaining how we set up our log aggregation. I’ve done a lot of posts since then about logging in general and about specific problems we’ve encountered in various areas, but I’ve never really revisited the infrastructure underneath the stack itself.

The main reason for the lack of posts is that the infrastructure I described back then is not the infrastructure we’re using now. As we built more things and started pushing more and more data into the stack we had, we began to experience some issues, mostly related to the reliability of the Elasticsearch process. At the time, the organization decided that it would be better if our internal operations team were responsible for dealing with these issues, and they built a new stack as a result.

This was good and bad. The good part was the argument that if we didn’t have to spend our time building and maintaining the system, it should theoretically leave more time and brainspace for us to focus on actual software development. The bad part was almost exactly the same reason: problems with the service would need to be resolved by a different team, one with their own set of priorities and their own schedule.

The arrangement worked okay for a while, until the operations team were engaged on a relatively complex set of projects and no longer had the time to extend and maintain the log stack as necessary. They did their best, but with no resources dedicated to dealing with the maintenance on the existing stack, it started to degrade surprisingly quickly.

This came to a head when we had a failure in the new stack that required us to replace some EC2 instances via the Auto Scaling Group, and the operations team was unavailable to help. When we executed the scaling operation, we discovered that it was creating instances that didn’t actually have all of the required software setup in order to fulfil their intended role. At some point in the past someone had manually made changes to the instances already in service and these changes had not been made in the infrastructure as code.

After struggling with this for a while, we decided to reclaim the stack and make it our responsibility again.

Beast Mode

Architecturally, the new log stack was a lot better than the old one, even taking into account the teething issues that we did have.

The old stack was basically an Auto Scaling Group capable of creating EC2 instances with Elasticsearch, Logstash and Kibana, along with a few load balancers for access purposes. While the old stack could theoretically scale out to better handle load, we never really tested that capability in production, and I’m pretty sure it wouldn’t have worked (looking back, I doubt the Elasticsearch clustering was set up correctly, in addition to some other issues with the way the Logstash indexes were being configured).

The new stack looked a lot like the reference architecture described on the Logstash website, which was good, because those guys know their stuff.

At a high level, log events would be shipped from many different places to a Broker layer (auto scaling Logstash instances behind a Load Balancer) which would then cache those events in a queue of some description (initially RabbitMQ, later Redis). An Indexer layer (auto scaling Logstash instances) would pull events off the queue at a sustainable pace, process them and place them into Elasticsearch. Users would then use Kibana (hosted on the Elasticsearch instances for ease of access) to interact with the data.
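
To make the Indexer layer’s role a bit more concrete, its Logstash configuration is conceptually just a Redis input and an Elasticsearch output (the hostnames, key and index pattern below are invented for illustration, and the real configuration obviously has a pile of filters in the middle):

# Hypothetical Indexer pipeline: pull events from the cache at whatever rate Logstash can sustain,
# then write them into Elasticsearch
cat > /etc/logstash/conf.d/indexer.conf <<'EOF'
input {
  redis {
    host => "cache.logging.internal"
    data_type => "list"
    key => "logstash"
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch.logging.internal:9200"]
    index => "logstash-%{+YYYY.MM.dd}"
  }
}
EOF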

There are a number of benefits to the architecture described above, but a few of the biggest ones are:

  • It’s not possible for an influx of log events to shut down Elasticsearch, because the Indexer layer is pulling events out of the cache at a sustainable rate. The cache might start to fill up if the number of events rises, but we’ll still be able to use Elasticsearch.
  • The cache provides a buffer if something goes wrong with either the Indexer layer or Elasticsearch. We had some issues with Elasticsearch crashing in our old log stack, so having some protection against losing log events in the event of downtime was beneficial.

There were downsides as well, the most pertinent of which was that the new architecture was a lot more complicated than the old one, with a lot of moving parts. This made it harder to manage and understand, and increased the number of different ways in which it could break.

Taking all of the above into account, when we reclaimed the stack we decided to keep the architecture intact, and just improve it.

But how?

Beast-Like Vigour

The way in which the stack was described was not bad. It just wasn’t quite as controlled as the way we’d been handling our other environments, mostly as a consequence of it being created/maintained by a different team.

Configuration for the major components (Logstash Broker, Logstash Indexer, Elasticsearch, Kibana) was Source Controlled, with builds in TeamCity and deployment was handled by deploying Nuget packages through Octopus. This was good, and wouldn’t require much work to bring into line with the rest of our stuff. All we would have to do was ensure all of the pertinent deployment logic was encapsulated in the Git repositories and maybe add some tests.

The infrastructure needed some more effort. It was all defined using CloudFormation templates, which was excellent, but there was no build/deployment pipeline for the templates and they were not versioned. In order to put such a pipeline in place, we would need to have CI and Staging deployments of the infrastructure as well, which did not yet exist. The infrastructure definition for each layer also shared a repository with the relevant configuration (i.e. Broker Environment with Broker Config), which was against our existing patterns. Finally, the cache/queue layer did not have an environment definition at all, because the current one (ElastiCache with Redis) had been manually created to replace the original one (RabbitMQ) as a result of some issues where the cache/queue filled up and then became unrecoverable.

In addition to the above, once we’ve improved all of the processes and got everything under control, we need to work on fixing the actual bugs in the stack (like Logstash logs filling up disks, Elasticsearch mapping templates not being setup correctly, no alerting/monitoring on the various layers, etc). Some of these things will probably be fixed as we make the process improvements, but others will require dedicated effort.

To Be Continued

With the total scope of the work laid out (infrastructure build/deploy, clean up configuration deployment, re-create infrastructure in appropriate AWS accounts, fix bugs), it’s time to get cracking.

The first cab off the rank is the work required to create a process that will allow us to fully automate the build and deployment of the infrastructure. Without that sort of system in place, we would have to do all of the other things manually.

The Broker layer is the obvious starting point, so next week I’ll outline how we went about using a combination of TeamCity, Nuget, Octopus and Powershell to accomplish a build and deployment pipeline for the infrastructure.