
A while back (god, almost a full year ago), I posted about the way in which we handle environment migrations, and to be honest, it hasn’t changed all that much. We have made some improvements to the way we handle our environments (for example, our newest environments are now built into tested, versioned packages, rather than running directly from source), which is good, but the general migration process of clone to temp, tear down old, clone back to active, tear down temp hasn’t really changed at all.

Over time, we’ve come to realise that there are a number of weaknesses in that strategy though. It’s slow (double clone!), it’s not overly clean and, in rare cases, it can lead to all of the data for the environment under migration being destroyed.

Yes, destroyed, i.e. lost forever.

This post is about that last weakness (the others will have to continue existing…for now).

Explosions!

In the original cloning scripts, there was an ominous comment, which simply said “# compare environment data here?”. That was a pretty big red flag in retrospect, but you can’t always do everything, and the various pressures applied to the development team meant that that step was left as a manual one.

That was a mistake.

After running a number of migrations across a few different environments (using basically the same concepts), we finally triggered that particular tripwire.

An otherwise uninteresting environment upgrade for one of our production services completely annihilated the underlying database (an EC2 instance running RavenDB), but the script gave no indication that anything went wrong.

Luckily, this particular service was more of a temporary waystation, acting as a holding area facilitating the connection of two applications through a common web interface. This meant that while the loss of the data was bad (very bad), it wasn’t a problem for all of our customers. Only those people who had items sitting in the holding area waiting to be picked up were affected.

Obviously, the affected customers were quite unhappy, and rightfully so.

To this day I actually have no idea what went wrong with the actual migration. I had literally run the exact same scripts on a staging environment earlier that day, and verified that the same data was present before and after. After extensive investigation, we agreed that we would probably not get to the root of the issue in a timely fashion and that it might have just been an AWS thing (for a platform based on computers, sometimes AWS is amazingly non-deterministic). Instead, we agreed to attack the code that made it possible for the data loss to occur at all.

The migration scripts themselves.

Give Me More Statistics…Stat!

Returning to that ominous comment in the migration scripts, we realised that we needed an easy way to compare the data in two environments, at least at a high level. Using a basic comparison like that would enable us to make a decision about whether to proceed with the migration (specifically the part that destroys the old environment).

The solution is to implement a statistics endpoint.

The idea is pretty simple. We provide a set of information from the endpoint that summarises the content of the service (at least as best we can summarise it). Things like how many of a certain type of entity are present are basically all we have to deal with for now (simple services), but the concept could easily be extended to include information about any piece of data in the environment.

Something as simple as the example below fills our needs:

{
    "data": {
        "customers": {
            "count": 57
        },
        "databases": {
            "count": 129
        }
    }
}

A side effect of having an endpoint like this is that we can easily (at least using the http_poller input in Logstash) extract this information on a regular basis and put it into our log aggregation so that we can chart its change over time.
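As a sketch of what that polling might look like, a Logstash input along these lines would do the job. The URL, schedule and tag are placeholders rather than our actual configuration, and the exact options available depend on the version of the http_poller plugin:

```conf
input {
  http_poller {
    # Hypothetical endpoint; substitute the real statistics URL per environment
    urls => {
      statistics => "https://service.example.com/statistics"
    }
    # Poll every 5 minutes
    schedule => { every => "5m" }
    codec => "json"
    tags => ["statistics"]
  }
}
```

Each poll then lands in the log aggregation as a document containing the counts, which is all you need to chart them over time.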

Making It Work

With the statistics endpoint written and deployed (after all it must be present in the environment being migrated before we can use it), all that’s left to do is incorporate it into the migration script.

I won’t rewrite the entirety of the migration script here, but I’ve included a skeleton below to provide an idea of how we use the comparison to make sure we haven’t lost anything important on the way through.

function Migrate
{
    [CmdletBinding()]
    param
    (
        # bunch of params here, mostly relating to credentials
    )

    try
    {
        # make current environment unavailable to normal traffic

        # clone current to temporary

        if (-not (Compare-Environments $current $temp))
        {
            # delete the temporary environment and exit with an error
        }

        # delete current environment
        # clone temporary environment into the place where the current environment used to be

        if (-not (Compare-Environments $current $temp))
        {
            # delete the new environment
            # keep the temporary environment because it's the only one with the data
        }
    }
    catch
    {
        # if the current environment still exists, delete the temporary environment
        # if the current environment still exists, restore its availability
    }
}

function Compare-Environments
{
    [CmdletBinding()]
    param
    (
        $a,
        $b
    )

    $aEndpoint = "some logic for URL creation based off environment"
    $bEndpoint = "some logic for URL creation based off environment"

    $aStatistics = Invoke-RestMethod $aEndpoint # credentials, accept header, methods etc
    $bStatistics = Invoke-RestMethod $bEndpoint # credentials, accept header, methods etc

    if ((ConvertTo-Json $aStatistics.data) -eq (ConvertTo-Json $bStatistics.data))
    {
        return $true
    }

    return $false
}

Summary

The unfortunate truth of this whole saga is that the person who originally implemented the migration scripts (I’m pretty sure it was me, so I take responsibility) was aware that the migration could potentially lead to loss of data. At the time, the protection against that was to ensure that we never deleted the old environment until we were absolutely sure that the new environment had been successfully created, making the assumption that the data had come over okay.

In the end, that assumption proved to be our undoing, because while everything appeared peachy, it actually failed spectacularly.

The introduction of a statistics endpoint (almost an environment data hash) is an elegant solution to the problem of potential data loss, which also has some nice side effects for tracking metrics that might not have been easily accessible outside of direct database access.

A double victory is a rare occurrence, so I think I’ll try to savour this one for a little while, even if I was the root cause of the problem.


I’ve talked at length previously about the usefulness of ensuring that your environments are able to be easily spun up and down. Typically this means that they need to be represented as code and that code should be stored in some sort of Source Control (Git is my personal preference). Obviously this is much easier with AWS (or other cloud providers) than it is with traditionally provisioned infrastructure, but you can at least control configurations and other things when you are close to the iron.

We’ve come a long way on our journey to represent our environments as code, but there has been one hole that’s been nagging me for some time.

Versioning.

Our current environment pattern looks something like this:

  • A repository called X.Environment, where X describes the component the environment is for.
  • A series of Powershell scripts and CloudFormation templates that describe how to construct the environment.
  • A series of TeamCity Build Configurations that allow anyone to Create and Delete named versions of the environment (sometimes there are also Clone and Migrate scripts to allow for copying and updating).

When an environment is created via a TeamCity Build Configuration, the appropriate commit in the repository is tagged with something to give some traceability as to where the environment configuration came from. Unfortunately, the environment itself (typically represented as a CloudFormation stack), is not tagged for the reverse. There is currently no easy way for us to look at an environment and determine exactly the code that created it and, more importantly, how many changes have been made to the underlying description since it was created.

Granted, this information is technically available using timestamps and other pieces of data, but extracting it is a difficult, time-consuming, manual task, so it’s unlikely to be done with any regularity.

All of the TeamCity Build Configurations that I mentioned simply use the HEAD of the repository when they run. There is no concept of using an old Delete script or being able to (easily) spin up an old version of an environment for testing purposes.

The Best Version

The key to solving some of the problems above is to really immerse ourselves in the concept of treating the environment blueprint as code.

When dealing with code, you would never publish raw from a repository, so why would we do that for the environment?

Instead, you compile (if you need to), you test and then you package, creating a compact artefact that represents a validated copy of the code that can be used for whatever purpose you need to use it for (typically deployment). This artefact has some version associated with it (whatever versioning strategy you might use) which is traceable both ways (look at the repo, see the version, find artefact, look at the artefact, see the version, go to repository).

Obviously, for a set of Powershell scripts and CloudFormation templates, there is no real compilation step. There is a testing step though (Powershell tests written using Pester) and there can easily be a packaging step, so we have all of the bits and pieces that we need in order to provide a versioned package, and then use that package whenever we need to perform environment operations.

Versioning Details

As a general rule, I prefer to not encapsulate complicated build and test logic into TeamCity itself. Instead, I much prefer to have a self contained script within the repository, that is then used both within TeamCity and whenever you need to build locally. This typically takes the form of a build.ps1 script file with a number of common inputs, and leverages a number of common tools that I’m not going to go into any depth about. The output of the script is a versioned Nupkg file and some test results (so that TeamCity knows whether or not the build failed).

Adapting our environment repository pattern to build a nuget package is fairly straightforward (similar to the way in which we handle Logstash, just package up all the files necessary to execute the scripts using a nuspec file). Voila, a self contained package that can be used at a later date to spin up that particular version of the environment.
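For illustration, a minimal nuspec for an environment repository might look something like the fragment below. The id, author, and file paths are hypothetical; the real package contents depend on the repository in question:

```xml
<?xml version="1.0"?>
<package>
  <metadata>
    <id>Solavirum.Environment.Example</id>
    <version>$version$</version>
    <authors>Solavirum</authors>
    <description>Versioned environment blueprint (Powershell scripts and CloudFormation templates)</description>
  </metadata>
  <files>
    <!-- package up everything necessary to execute the environment scripts -->
    <file src="scripts\**\*" target="scripts" />
    <file src="src\**\*.template" target="src" />
  </files>
</package>
```

The `$version$` token is replaced at pack time with whatever the build script calculated, producing the self contained, versioned artefact.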

The only difficult part here was the actual versioning of the environment itself.

Prior to this, when an environment was created it did not have any versioning information attached to it.

The easiest way to attach that information? Introduce a new common CloudFormation template parameter called EnvironmentVersion and make sure that it is populated when an environment is created. The CloudFormation stack is also tagged with the version, for easy lookup.
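In template terms the parameter is nothing special. Something like the following fragment does the trick (illustrative only, not our actual template):

```json
"Parameters": {
    "EnvironmentVersion": {
        "Type": "String",
        "Description": "The version of the environment package this stack was created from"
    }
}
```

The stack tag itself is supplied alongside the template at creation time (stack tags live outside the template), so the wrapper scripts pass the same version value through to the CloudFormation tooling.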

For backwards compatibility, I made the environment version optional when you execute the New-Environment Powershell cmdlet (which is our wrapper around the AWS CFN tools). If not specified it will default to something that looks like 0.0.YYDDDD.SSSSS, making it very obvious that the version was not specified correctly.
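As a sketch of the idea (the exact format of our real default may differ), a date-derived fallback version could be generated like this:

```powershell
# Illustrative only: derive an obviously-fake default version from the current date/time,
# e.g. 0.0.{two digit year}{day of year}.{seconds since midnight}
$now = [DateTime]::UtcNow
$defaultVersion = "0.0.{0:yy}{1:D3}.{2}" -f $now, $now.DayOfYear, [int]$now.TimeOfDay.TotalSeconds
```

Anything starting with 0.0 stands out immediately when you look at the stack, which is the whole point.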

For the proper versioning inside an environment’s source code, I simply reused some code we already had for dealing with AssemblyInfo files. It might not be the best approach, but including an AssemblyInfo file (along with the appropriate Assembly attributes) inside the repository and then reading from that file during environment creation is easy enough and consistency often beats optimal.
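That reuse might look something like the function below. This is a hypothetical reconstruction, not the actual shared code:

```powershell
# Hypothetical: extract the version from an AssemblyVersion attribute via regex
function Get-AssemblyVersion
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [string]$assemblyInfoPath
    )

    $content = Get-Content $assemblyInfoPath | Out-String
    if ($content -match 'AssemblyVersion\("([0-9\.]+)"\)')
    {
        return $matches[1]
    }

    throw "No AssemblyVersion attribute found in [$assemblyInfoPath]"
}
```

The build script reads the version this way, bumps it as appropriate, and stamps it into both the package and the EnvironmentVersion parameter.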

Improving Versioning

What I’ve described above is really just one step in a larger plan.

I would vastly prefer if the mechanism for controlling what versions of an environment are present and where was delegated to Octopus Deploy, just like with the rest of our deployable components.

With a little bit of extra effort, we should be able to create a release for an appropriately named Octopus project and then push to that project whenever a new version of the environment is available.

This would give excellent visibility into what versions of the environment are where, and also allow us to leverage something I have planned for helping us see just how different the version in environment X is from the version in environment Y.

Ad-hoc environments will still need to be managed via TeamCity, but known environments (like CI, Staging and Production) should be able to be handled within Octopus.

Summary

I much prefer the versioned and packaged approach to environment management that I’ve outlined above. It seems much neater and allows for a lot of traceability and repeatability, something that was lacking when environments were being managed directly from HEAD.

It helps that it looks very similar to the way that we manage our code (both libraries and deployable components) so the pattern is already familiar and understandable.

You can see an example of what a versioned, packageable environment repository would look like here. Keep in mind that the common scripts inside that repository are not usually included directly like that. They are typically downloaded and installed via a bootstrapping process (using a Nuget package), but for this example I had to include them directly so that I didn’t have to bring along the rest of our build pipeline.

Speaking of the common scripts, unfortunately they are a constant reminder of a lack of knowledge about how to create reusable Powershell components. I’m hoping to restructure them into a number of separate modules with greater cohesion, but until then they are a bit unwieldy (just a massive chunk of scripts that are dot-included wherever they are needed).

That would probably make a good blog post actually.

How to unpick a mess of Powershell that past you made.

Sometimes I hate past me.


So, we have all these logs now, which is awesome. Centralised and easily searchable, containing lots of information relating to application health, feature usage and infrastructure utilisation, along with many other things I’m not going to bother to list here.

But who is actually looking at them?

The answer to that question is something we’ve been struggling with for a little while. Sure, we go into Kibana and do arbitrary searches, we use dashboards, we do our best to keep an eye on things like CPU usage, free memory, number of errors and so on, but we often have other things to do. Nobody has a full time job of just watching this mountain of data for interesting patterns and problems.

We’ve missed things:

  • We had an outage recently that was the result of a disk filling up with log files, because an old version of the log management script had a bug in it. The free disk space was clearly trending downwards when you looked at the Kibana dashboard, but it was happening so gradually that it was never really brought up as a top priority.
  • We had a different outage recently where we had a gradual memory leak in the non-paged memory pool on some of our API instances. Similar to above, we were recording free memory and it was clearly dropping over time, but no-one noticed.

There have been other instances (like an increase in the total number of 500s being returned from an API, indicating a bug), but I won’t go into too much more detail about the fact that we miss things. We’re human, we have other things to do, it happens.

Instead, let’s attack the root of the issue: the human element.

We can’t reasonably expect anyone to keep an eye on all of the data hurtling towards us. It’s too much. On the other hand, all of the above problems could have easily been detected by a computer. All we need is something that can do the analysis for us, and then let us know when there is something to action. It doesn’t have to be incredibly fancy (no learning algorithms…yet); all it has to do is compare a few points in time and alert on a trend in the wrong direction.

One of my colleagues was investigating solutions to this problem, and they settled on Sensu.

Latin: In The Sense Of

I won’t go into too much detail about Sensu here, because I think the documentation will do a much better job than I will.

My understanding of it, however, is that it is a very generic, messaging based check/handle system, where a check can be almost anything (run an Elasticsearch query, go get some current system stats, respond to some incoming event) and a handler is an arbitrary reaction (send an email, restart a server, launch the missiles).

Sensu has a number of components, including servers (wiring logic, check –> handler), clients (things that get checks executed on them) and an API (separate from the server). All communication happens through RabbitMQ and there is some usage of Redis for storage (which I’m not fully across yet).

I am by no means any sort of expert in Sensu, as I did not implement our current proof of concept. I am, however, hopefully going to use it to deal with some of the alerting problems that I outlined above.

The first check/handler to implement?

Alert us via email/SMS when the available memory on an API instance is below a certain threshold.

Alas I have not actually done this yet. This post is more going to outline the conceptual approach, and I will return later with more information about how it actually worked (or didn’t work).

Throw Out Broken Things

One of the things that you need to come to terms with early when using AWS is that everything will break. It might be your fault, it might not be, but you should accept right from the beginning that at some point, your things will break. This is good in a way, because it forces you to not have any single points of failure (unless you are willing to accept the risk that they might go down and you will have an outage, which is a business decision).

I mention this because the problem with the memory in our API instances that I mentioned above is pretty mysterious. It’s not being taken by any active process (regardless of user), so it looks like a driver problem. It could be one of those weird AWS things (there are a lot), and it goes away if you reboot, so the easiest solution is to just recycle the bad API instance and move on. It’s already in an auto-scaling group for redundancy, and there is always more than one, so it’s better to just murder it, relax, and let the ASG do its work.

Until we’re comfortable automating that sort of recycling, we’ll settle for an alert that someone can use to make a decision and execute the recycle themselves.

By installing the Sensu client on the machines in question (incorporating it into the environment setup itself), we can create a check that allows us to remotely query the available free memory and compare it against some configured value that we deem too low (let’s say 100MB). We can then configure two handlers for the check result, one that emails a set of known addresses and another that does the same via SMS.
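In Sensu terms, the check definition might look something like the JSON below. The plugin name, thresholds, subscription and handler names are all assumptions for the sake of illustration:

```json
{
    "checks": {
        "check-api-free-memory": {
            "command": "check-memory.rb -w 250 -c 100",
            "subscribers": ["api"],
            "interval": 60,
            "handlers": ["email", "sms"]
        }
    }
}
```

Whatever script actually gathers the free memory just needs to exit with the appropriate warning/critical status codes; Sensu takes care of routing the result to the configured handlers.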

Seems simple enough. I wonder if it will actually be that simple in practice.

Summary

Alerting on your aggregate of information (logs, stats, etc) is a pretty fundamental ability that you need to have.

AWS does provide some alerting in the form of CloudWatch alarms, but we decided to investigate a different (more generic) route instead, mostly because of the wealth of information that we already had available inside our ELK stack (and our burning desire to use it for something other than pretty graphs).

As I said earlier, this post is more of an outline of how we plan to attack the situation using Sensu, so it’s a bit light on details I’m afraid.

I’m sure the followup will be amazing though.

Right?


I’m pretty happy with the way our environment setup scripts work.

Within TeamCity, you generally only have to push a single button to get an environment provisioned (with perhaps a few parameters filled in, like environment name and whatnot), and even outside TeamCity, it’s a single script that only requires some credentials and a few other things to start.

Failures are detected (primarily by CloudFormation) and the scripts have the ability to remote onto AWS instances for you and extract errors from logs to give you an idea as to the root cause of the failure, so you have to do as little manual work as possible. If a failure is detected, everything is cleaned up automatically (CloudFormation stack deleted, Octopus environment and machines deleted, etc), unless you turn off automatic cleanup for investigation purposes.

Like I said, overall I’m pretty happy with how everything works, but one of the areas that I’m not entirely happy with is the last part of environment provisioning. When an environment creation is completed, you know that all components installed correctly (including Octopus deploys) and that no errors were encountered with any of the provisioning itself (EC2 instances, Auto Scaling Groups, RDS, S3, etc). What you don’t know is whether or not the environment is actually doing what it should be doing.

You don’t know whether or not its working.

That seems like a fixable problem.

Smoke On The Water

As part of developing environments, we’ve implemented automated tests using the Powershell testing framework called Pester.

Each environment has at least one test that verifies the environment is created as expected and works from the point of view of the service it offers. For example, in our proxy environment (which uses Squid) one of the outputs is the proxy URL. The test takes that URL and does a simple Invoke-WebRequest through it to a known address, validating that the proxy actually works as a proxy should.

The issue with these tests is that they are not executed at creation time. They are usually only used during development, to validate that whatever changes you are making haven’t broken the environment and that everything is still working.

Unfortunately, beyond git tagging, our environment creation scripts/templates are not versioned. I would vastly prefer for our build scripts to take some set of source code that represents an environment setup, test it, replace some parameters (like version) and then package it up, perhaps into a nuget package. It’s something that’s been on my mind for a while, but I haven’t had time to put it together yet. If I do, I’ll be sure to post about it here.

The simplest solution is to extract the parts of the tests that perform validation into dedicated functions and then to execute them as part of the environment creation. If the validation fails, the environment should be considered a failure and should notify the appropriate parties and clean itself up.

Where There Is Smoke There Is Fire

The easiest way to implement the validation (hereafter referred to as smoke tests) in a reusable fashion is to incorporate the concept into the common environment provisioning scripts.

We’ve created a library that contains scripts that we commonly use for deployment, environment provisioning and other things. I made a copy of the source for that library and posted it to Solavirum.Scripts.Common a while ago, but it’s a bit out of date now (I really should update it).

Within the library is a Functions-Environment file.

This file contains a set of Powershell cmdlets for provisioning and deleting environments. The assumption is that it will be used within libraries for specific environments (like the Proxy environment mentioned above) and will allow us to take care of all of the common concerns (like uploading dependencies, setting parameters in CloudFormation, waiting on the CloudFormation initialization, etc).

Inside this file is a function called New-Environment, whose signature looks like this:

function New-Environment
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$environmentName,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsRegion,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$octopusServerUrl,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$octopusApiKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$uniqueComponentIdentifier,
        [System.IO.FileInfo]$templateFile,
        [hashtable]$additionalTemplateParameters,
        [scriptblock]$customiseEnvironmentDetailsHashtable={param([hashtable]$environmentDetailsHashtableToMutate,$stack) },
        [switch]$wait,
        [switch]$disableCleanupOnFailure,
        [string[]]$s3Buckets
    )

    # function body would be here, but it's super long
}

As you can see, it has a lot of parameters. It’s responsible for all of the bits and pieces that go into setting up an environment, like Octopus initialization, CloudFormation execution, gathering information in the case of a failure, etc. It’s also responsible for triggering a cleanup when an environment is deemed a failure, so it’s the ideal place to put some smoke testing functionality.

Each specific environment repository typically contains a file called Invoke-NewEnvironment. This file is what is executed to actually create an environment of the specific type. It puts together all of the environment specific stuff (output customisation, template location, customised parameters) and uses that to execute the New-Environment function, which takes care of all of the common things.

In order to add a configurable smoke test, all we need to do is add an optional script block parameter to the New-Environment function. Specific environment implementations can supply a value for it if they like, but they don’t have to. If we assume that the contract for the script block is that it will throw an exception if the smoke test fails, then all we need to do is wrap its invocation in a try..catch and fail the environment provisioning if an error occurs. Pretty straightforward.
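Conceptually, the wiring inside New-Environment ends up looking something like the fragment below. The variable names are illustrative, not the real implementation:

```powershell
# Illustrative fragment: invoke the optional smoke test and fail the environment on error
if ($smokeTest -ne $null)
{
    try
    {
        # Hand the stack to the script block so it can derive URLs etc, and
        # record whatever it returns in the environment creation result
        $environmentDetails.SmokeTestResult = & $smokeTest $stack
    }
    catch
    {
        # Throwing here causes the normal failure path (cleanup, notification) to kick in
        throw "Smoke test for environment [$environmentName] failed: $_"
    }
}
```

Because the failure surfaces as a normal exception, the existing cleanup-on-failure logic applies without any special casing.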

To support the smoke test functionality, I wrote two new Pester tests. One verifies that a failing smoke test correctly fails the environment creation and the other verifies that the result of a successful smoke test is included in the environment creation result. You can see them below:

Describe -Tags @("Ignore") "Functions-Environment.New-Environment.SmokeTest" {
    Context "When supplied with a smoke test script that throws an exception (indicating smoke test failure)" {
        It "The stack creation is aborted and deleted" {
            $creds = Get-AwsCredentials
            $octoCreds = Get-OctopusCredentials
            $environmentName = Create-UniqueEnvironmentName
            $uniqueComponentIdentifier = "Test"
            $templatePath = "$rootDirectoryPath\src\TestEnvironment\Test.CloudFormation.template"
            $testBucket = [Guid]::NewGuid().ToString("N")
            $customTemplateParameters = @{
                "LogsS3BucketName"=$testBucket;
            }

            try
            {
                try
                {
                    $createArguments = @{
                        "-EnvironmentName"=$environmentName;
                        "-TemplateFile"=$templatePath;
                        "-AdditionalTemplateParameters"=$CustomTemplateParameters;
                        "-UniqueComponentIdentifier"=$uniqueComponentIdentifier;
                        "-S3Buckets"=@($testBucket);
                        "-SmokeTest"={ throw "FORCED FAILURE" };
                        "-Wait"=$true;
                        "-AwsKey"=$creds.AwsKey;
                        "-AwsSecret"=$creds.AwsSecret;
                        "-AwsRegion"=$creds.AwsRegion;
                        "-OctopusApiKey"=$octoCreds.ApiKey;
                        "-OctopusServerUrl"=$octoCreds.Url;
                    }
                    $environmentCreationResult = New-Environment @createArguments
                }
                catch
                {
                    $error = $_
                }

                $error | Should Not Be $null
                $error | Should Match "smoke"

                try
                {
                    $getArguments = @{
                        "-EnvironmentName"=$environmentName;
                        "-UniqueComponentIdentifier"=$uniqueComponentIdentifier;
                        "-AwsKey"=$creds.AwsKey;
                        "-AwsSecret"=$creds.AwsSecret;
                        "-AwsRegion"=$creds.AwsRegion;                
                    }
                    $environment = Get-Environment @getArguments
                }
                catch
                {
                    Write-Warning $_
                }

                $environment | Should Be $null
            }
            finally
            {
                $deleteArguments = @{
                    "-EnvironmentName"=$environmentName;
                    "-UniqueComponentIdentifier"=$uniqueComponentIdentifier;
                    "-S3Buckets"=@($testBucket);
                    "-Wait"=$true;
                    "-AwsKey"=$creds.AwsKey;
                    "-AwsSecret"=$creds.AwsSecret;
                    "-AwsRegion"=$creds.AwsRegion;
                    "-OctopusApiKey"=$octoCreds.ApiKey;
                    "-OctopusServerUrl"=$octoCreds.Url;
                }
                Delete-Environment @deleteArguments
            }
        }
    }

    Context "When supplied with a valid smoke test script" {
        It "The stack creation is successful" {
            $creds = Get-AwsCredentials
            $octoCreds = Get-OctopusCredentials
            $environmentName = Create-UniqueEnvironmentName
            $uniqueComponentIdentifier = "Test"
            $templatePath = "$rootDirectoryPath\src\TestEnvironment\Test.CloudFormation.template"
            $testBucket = [Guid]::NewGuid().ToString("N")
            $customTemplateParameters = @{
                "LogsS3BucketName"=$testBucket;
            }

            try
            {
                $createArguments = @{
                    "-EnvironmentName"=$environmentName;
                    "-TemplateFile"=$templatePath;
                    "-AdditionalTemplateParameters"=$CustomTemplateParameters;
                    "-UniqueComponentIdentifier"=$uniqueComponentIdentifier;
                    "-S3Buckets"=@($testBucket);
                    "-SmokeTest"={ return $_.StackId + " SMOKE TESTED"}; 
                    "-Wait"=$true;
                    "-AwsKey"=$creds.AwsKey;
                    "-AwsSecret"=$creds.AwsSecret;
                    "-AwsRegion"=$creds.AwsRegion;
                    "-OctopusApiKey"=$octoCreds.ApiKey;
                    "-OctopusServerUrl"=$octoCreds.Url;
                }
                $environmentCreationResult = New-Environment @createArguments

                Write-Verbose (ConvertTo-Json $environmentCreationResult)

                $environmentCreationResult.SmokeTestResult | Should Match "SMOKE TESTED"
            }
            finally
            {
                $deleteArguments = @{
                    "-EnvironmentName"=$environmentName;
                    "-UniqueComponentIdentifier"=$uniqueComponentIdentifier;
                    "-S3Buckets"=@($testBucket);
                    "-Wait"=$true;
                    "-AwsKey"=$creds.AwsKey;
                    "-AwsSecret"=$creds.AwsSecret;
                    "-AwsRegion"=$creds.AwsRegion;
                    "-OctopusApiKey"=$octoCreds.ApiKey;
                    "-OctopusServerUrl"=$octoCreds.Url;
                }
                Delete-Environment @deleteArguments
            }
        }
    }
}

Smoke And Mirrors

On the specific environment side (the Proxy in this example), all we need to do is supply a script block that will execute the smoke test.

The smoke test itself needs to be somewhat robust, so we use a generic wait function to repeatedly execute an HTTP request through the proxy until it succeeds or the wait times out.

function Wait
{
    [CmdletBinding()]
    param
    (
        [scriptblock]$ScriptToFillActualValue,
        [scriptblock]$Condition,
        [int]$TimeoutSeconds=30,
        [int]$IncrementSeconds=2
    )

    write-verbose "Waiting for the output of the script block [$ScriptToFillActualValue] to meet the condition [$Condition]"

    $totalWaitTimeSeconds = 0
    while ($true)
    {
        try
        {
            $actual = & $ScriptToFillActualValue
        }
        catch
        {
            Write-Warning "An error occurred while evaluating the script to get the actual value (which is evaluated by the condition for waiting purposes). As a result, the actual value is undefined (NULL)"
            Write-Warning $_
        }

        try
        {
            $result = & $condition
        }
        catch
        {
            Write-Warning "An error occurred while evaluating the condition to determine if the wait is over"
            Write-Warning $_

            $result = $false
        }

        
        if ($result)
        {
            write-verbose "The output of the script block [$ScriptToFillActualValue] (Variable:actual = [$actual]) met the condition [$condition]"
            return $actual
        }

        write-verbose "The current output of the condition [$condition] (Variable:actual = [$actual]) is [$result]. Waiting [$IncrementSeconds] and trying again."

        Sleep -Seconds $IncrementSeconds
        $totalWaitTimeSeconds = $totalWaitTimeSeconds + $IncrementSeconds

        if ($totalWaitTimeSeconds -ge $TimeoutSeconds)
        {
            throw "The output of the script block [$ScriptToFillActualValue] (Variable:actual = [$actual]) did not meet the condition [$Condition] after [$totalWaitTimeSeconds] seconds."
        }
    }
}

function Test-Proxy
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [string]$proxyUrl
    )

    if ($rootDirectory -eq $null) { throw "RootDirectory script scoped variable not set. That's bad, it's used to find dependencies." }
    $rootDirectoryPath = $rootDirectory.FullName

    . "$rootDirectoryPath\scripts\common\Functions-Waiting.ps1"
    
    $result = Wait -ScriptToFillActualValue { return (Invoke-WebRequest -Uri "http://www.google.com" -Proxy $proxyUrl -Method GET).StatusCode } -Condition { $actual -eq 200 } -TimeoutSeconds 600 -IncrementSeconds 60
}

The main reason for this repeated try..wait loop is that sometimes a CloudFormation stack will complete successfully, but the service may be unavailable from an external point of view until the Load Balancer or similar component manages to settle properly.
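As a concrete example of the same Wait function in a simpler context (the file path here is purely illustrative), waiting for a flag file to appear looks like this:

```powershell
# Hypothetical example: poll for a flag file every 2 seconds, for up to 30 seconds.
# $actual is set by Wait from the first script block, and is visible to the
# condition script block thanks to PowerShell's dynamic scoping.
$file = Wait -ScriptToFillActualValue { Get-Item "C:\temp\ready.flag" -ErrorAction SilentlyContinue } `
    -Condition { $actual -ne $null } `
    -TimeoutSeconds 30 `
    -IncrementSeconds 2
```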

Conclusion

I feel much more comfortable with our environment provisioning after moving the smoke tests into their own functions and executing them during the actual environment creation, rather than just in the tests.

Now whenever an environment completes its creation, I know that it actually works from an external observation point. The smoke tests aren’t particularly complex, but they definitely add a lot to our ability to reliably provision environments containing services.

Alas, I don’t have any more smoke puns or references to finish off this blog post…

Oh wait, yes I do!

*disappears in a puff of smoke*


Now that we’ve (somewhat) successfully codified our environment setup and are executing it automatically every day with TeamCity, we have a new challenge. Our setup scripts create an environment that has some set of features/bugs associated with it. Not officially (we’re not really tracking environment versions like that), but definitely in spirit. As a result, we need to update environments to the latest version of the “code” whenever we fix a bug or add a feature. Just like deploying a piece of software.

To be honest, I haven’t fully nailed the whole codified environment thing just yet, but I am getting closer. Giving it some thought, I think I will probably move towards a model where the environment is built and tested (just like a piece of software) and then packaged and versioned, ready to be executed. Each environment package should consist of installation and uninstallation logic, along with any other supporting actions, in order to make them as self-contained as possible.

That might be the future. For now, we simply have a repository with scripts in it for each of our environments, supported by a set of common scripts.

The way I see it, environments fall into two categories.

  1. Environments created for a particular task, like load testing or some sort of experimental development.
  2. Environments that take part in your deployment pipeline.

The fact that we have entirely codified our environment setup gives us the power to create an environment for either of the above. The first point is not particularly interesting, but the second one is.

We have three standard environments, which are probably familiar to just about anyone (though maybe under different names). They are CI, Staging and Production.

CI is the environment that is recreated every morning through TeamCity. It is used for continuous integration/deployment, and is typically not used directly for manual testing/demonstration/anything else. It forms an important part of the pipeline, as after deployment, automated functional tests are run on it, and if successful that component is (usually) automatically propagated to Staging.

Staging is, for all intents and purposes, a Production level environment. It is stable (only components that fully pass all of their tests are deployed here) and is used primarily for manual testing and feature validation, with a secondary focus on early integration within a trusted group of people (which may include external parties and exceptional customers).

Production is, of course, production. It’s the environment that the greater world uses for any and all executions of the software component (or components) in question. It is strictly controlled and protected, to make sure that we don’t accidentally break it, inconveniencing our customers and making them unhappy.

The problem is, how do you get changes to the underlying environment (i.e. a new version of it) into Staging/Production, without losing any state held within the system? You can’t just recreate the environment (like we do each morning for CI), because the environment contains the state, and that destroys it.

You need another process.

Migration.

Birds Fly South for the Winter

Migration, for being such a short word, is actually incredibly difficult.

Most approaches that I’ve seen in the past involved some sort of manual migration strategy (usually written down and approved by 2+ people), which is executed by some long suffering operations person at midnight, when hopefully no-one is trying to use the environment for its intended purpose.

A key component to any migration strategy: What happens if it goes wrong? Otherwise known as a rollback procedure.

This is, incidentally, where everything gets pretty hard.

With our environments being entirely codified in a mixture of Powershell and CloudFormation, I wanted to create something that would automatically update an environment to the latest version, without losing any of the data currently stored in the environment, and in a safe way.

CloudFormation offers the ability to update a stack after it has been created. This way you can change the template to include a new resource (or to change existing resources) and then execute the update and have AWS handle all of the updating. This probably works fine for most people, but I was uneasy at the prospect. Our environments are already completely self contained and I didn’t understand how CloudFormation updates would handle rollbacks, or how updates would work for all components involved. I will go back and investigate it in more depth at some point in the future, but for now I wanted a more generic solution that targeted the environment itself.
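For reference, an in-place update through the AWS PowerShell cmdlets would look roughly like the sketch below. To be clear, this is the approach we did not adopt, and the variables are illustrative:

```powershell
# Sketch only: in-place CloudFormation stack update (NOT the approach we took).
# Assumes the AWS PowerShell module is loaded and the credentials are valid.
Update-CFNStack -StackName $stackName `
    -TemplateBody (Get-Content $templatePath -Raw) `
    -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

# Block until the update finishes (or rolls back).
Wait-CFNStack -StackName $stackName -Status UPDATE_COMPLETE -Timeout 1800 `
    -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
```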

My idea was fairly simple.

What if I could clone an environment? I could make a clone of the environment I wanted to migrate, test the clone to make sure all the data came through okay and its behaviour was still the same, delete the old environment and then clone the temporary environment again, into the original environment’s name. At any point up to the deletion of the old environment I could just stop, and everything would be the same as it was before. No need for messy rollbacks that might only do a partial job.
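In script form, the happy path looks something like this (the function names here are hypothetical stand-ins for our actual environment scripts, which take credentials and many more parameters):

```powershell
# Sketch of the double clone migration. Clone-Environment and Test-Environment
# are illustrative names, not real functions from our repository.
$temp = "$environmentName-migration-temp"

Clone-Environment -SourceEnvironment $environmentName -DestinationEnvironment $temp
Test-Environment -EnvironmentName $temp

# Point of no return: everything above this line is non-destructive.
Delete-Environment -EnvironmentName $environmentName
Clone-Environment -SourceEnvironment $temp -DestinationEnvironment $environmentName
Delete-Environment -EnvironmentName $temp
```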

Of course, the idea is not actually all that simple in practice.

A Perfect Clone

In order to clone an environment, you need to identify the parts of the environment that contain persistent data (and would not automatically be created by the environment setup). Databases and file storage (S3, disk, etc.) are examples of persistent data. Log files are another example of persistent data, except they don’t really matter from a migration point of view, mostly because all of our log entries are aggregated into an ELK stack. Even if they weren’t aggregated, they probably still wouldn’t be worth spending time on.

In the case of the specific environment I’m working on for the migration this time, there is an RDS instance (the database) and at least one S3 bucket containing user data. Everything else about the environment is transient, and I won’t need to worry about it.

Luckily for me, cloning an RDS instance and an S3 bucket is relatively easy.

With RDS you can simply take a snapshot and then use that snapshot as an input into the RDS instance creation on the new environment. Fairly straightforward.

function _WaitRdsSnapshotAvailable
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [string]$snapshotId,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsRegion,
        [int]$timeoutSeconds=3000
    )

    write-verbose "Waiting for the RDS Snapshot with Id [$snapshotId] to be [available]."
    $incrementSeconds = 15
    $totalWaitTime = 0
    while ($true)
    {
        $a = Get-RDSDBSnapshot -DBSnapshotIdentifier $snapshotId -Region $awsRegion -AccessKey $awsKey -SecretKey $awsSecret
        $status = $a.Status

        if ($status -eq "available")
        {
            write-verbose "The RDS Snapshot with Id [$snapshotId] has exited [$testStatus] into [$status] taking [$totalWaitTime] seconds."
            return $a
        }

        write-verbose "Current status of RDS Snapshot with Id [$snapshotId] is [$status]. Waiting [$incrementSeconds] seconds and checking again for change."

        Sleep -Seconds $incrementSeconds
        $totalWaitTime = $totalWaitTime + $incrementSeconds
        if ($totalWaitTime -gt $timeoutSeconds)
        {
            throw "The RDS Snapshot with Id [$snapshotId] was not [available] within [$timeoutSeconds] seconds."
        }
    }
}

... snip some scripts getting CFN stacks ...

$resources = Get-CFNStackResources -StackName $sourceStack.StackId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

$rds = $resources |
    Single -Predicate { $_.ResourceType -eq "AWS::RDS::DBInstance" }

$timestamp = [DateTime]::UtcNow.ToString("yyyyddMMHHmmss")
$snapshotId = "$sourceEnvironment-for-clone-to-$destinationEnvironment-$timestamp"
$snapshot = New-RDSDBSnapshot -DBInstanceIdentifier $rds.PhysicalResourceId -DBSnapshotIdentifier $snapshotId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

_WaitRdsSnapshotAvailable -SnapshotId $snapshotId -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion

With S3, you can just copy the bucket contents. I say just, but in reality there is no support for a “sync” command in the AWS Powershell cmdlets. There is a sync command on the AWS CLI though, so I wrote a wrapper around the CLI and execute the sync command there. It works pretty nicely. Essentially it’s broken into two parts: the part that deals with actually locating and extracting the AWS CLI to a known location, and the part that actually does the clone. The only difficult bit was that you don’t seem to be able to just supply credentials to the AWS CLI executable, at least in a way that I would expect (i.e. as parameters). Instead you have to use a profile, or use environment variables.

function Get-AwsCliExecutablePath
{
    if ($rootDirectory -eq $null) { throw "rootDirectory script scoped variable not set. Thats bad, its used to find dependencies." }
    $rootDirectoryPath = $rootDirectory.FullName

    $commonScriptsDirectoryPath = "$rootDirectoryPath\scripts\common"

    . "$commonScriptsDirectoryPath\Functions-Compression.ps1"

    $toolsDirectoryPath = "$rootDirectoryPath\tools"
    $nugetPackagesDirectoryPath = "$toolsDirectoryPath\packages"

    $packageId = "AWSCLI64"
    $packageVersion = "1.7.41"

    $expectedDirectory = "$nugetPackagesDirectoryPath\$packageId.$packageVersion"
    if (-not (Test-Path $expectedDirectory))
    {
        $extractedDir = 7Zip-Unzip "$toolsDirectoryPath\dist\$packageId.$packageVersion.7z" "$toolsDirectoryPath\packages"
    }

    $executable = "$expectedDirectory\aws.exe"

    return $executable
}

function Clone-S3Bucket
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [string]$sourceBucketName,
        [Parameter(Mandatory=$true)]
        [string]$destinationBucketName,
        [Parameter(Mandatory=$true)]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [string]$awsRegion
    )

    if ($rootDirectory -eq $null) { throw "RootDirectory script scoped variable not set. That's bad, its used to find dependencies." }
    $rootDirectoryPath = $rootDirectory.FullName

    . "$rootDirectoryPath\scripts\common\Functions-Aws.ps1"

    $awsCliExecutablePath = Get-AwsCliExecutablePath

    $previousAWSKey = $env:AWS_ACCESS_KEY_ID
    $previousAWSSecret = $env:AWS_SECRET_ACCESS_KEY
    $previousAWSRegion = $env:AWS_DEFAULT_REGION

    $env:AWS_ACCESS_KEY_ID = $awsKey
    $env:AWS_SECRET_ACCESS_KEY = $awsSecret
    $env:AWS_DEFAULT_REGION = $awsRegion

    & $awsCliExecutablePath s3 sync s3://$sourceBucketName s3://$destinationBucketName

    $env:AWS_ACCESS_KEY_ID = $previousAWSKey
    $env:AWS_SECRET_ACCESS_KEY = $previousAWSSecret
    $env:AWS_DEFAULT_REGION = $previousAWSRegion
}

I do have some concerns that as the bucket gets bigger, the clone will take longer and longer. I’ll cross that bridge when I come to it.

Using the identified areas of persistence above, the only change I need to make is to alter the new environment script to take them as optional inputs (specifically the RDS snapshot). If they are supplied, it will use them, if they are not, it will default to normal creation.
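The shape of that change is simple enough, something like the following sketch (the parameter and template parameter names are illustrative, not our actual ones):

```powershell
# Sketch: optional persistence inputs on the environment creation script.
param
(
    [Parameter(Mandatory=$true)]
    [string]$environmentName,
    [string]$rdsSnapshotId # optional; empty means "create a fresh database"
)

$templateParameters = @{}
if (-not [string]::IsNullOrEmpty($rdsSnapshotId))
{
    # The CloudFormation template would feed this into the DBSnapshotIdentifier
    # property of the RDS resource, restoring the cloned data on creation.
    $templateParameters["RdsSnapshotIdentifier"] = $rdsSnapshotId
}
```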

Job done, right?

A Clean Snapshot

The clone approach works well enough, but in order to perform a migration on a system that is actively being used, you need to make sure that the content does not change while you are doing it. If you don’t do this, you can potentially lose data during a migration. The most common example would be if you clone the environment, but after the clone some requests occur and the data changes. If you then delete the original and migrate back, you’ve lost that data. There are other variations as well.

This means that you need the ability to put an environment into standby mode, where it is still running, and everything is working, but it is no longer accepting user requests.

Most of our environments are fairly simple and are based around web services. They have a number of instances behind a load balancer, managed by an auto scaling group. Behind those instances are backend services, like databases and other forms of persistence/scheduled task management.

AWS Auto Scaling Groups allow you to set instances into Standby mode, which removes them from the load balancer (meaning they will no longer have requests forwarded to them) but does not delete or otherwise terminate them. More importantly, instances in Standby can count towards the desired number of instances in the Auto Scaling Group, meaning it won’t go and create X more instances to service user requests, which obviously would muck the whole plan up.

This is exactly what we need to set our environment into a Standby mode (at least until we have scheduled tasks that deal with underlying data anyway). I took the ability to shift instances into Standby mode and wrapped it into a function for setting the availability of the environment (because that’s the concept that I’m interested in, the Standby mode instances are just a mechanism to accomplish that).

function _ChangeEnvironmentAvailability
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$environment,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsRegion,
        [Parameter(Mandatory=$true)]
        [ValidateSet("Available", "Unavailable")]
        [string]$availability,
        [switch]$wait
    )

    if ($rootDirectory -eq $null) { throw "RootDirectory script scoped variable not set. That's bad, its used to find dependencies." }
    $rootDirectoryPath = $rootDirectory.FullName

    . "$rootDirectoryPath\scripts\common\Functions-Enumerables.ps1"

    . "$rootDirectoryPath\scripts\common\Functions-Aws.ps1"
    Ensure-AwsPowershellFunctionsAvailable

    $sourceStackName = Get-StackName -Environment $environment -UniqueComponentIdentifier (_GetUniqueComponentIdentifier)
    $sourceStack = Get-CFNStack -StackName $sourceStackName -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

    $resources = Get-CFNStackResources -StackName $sourceStack.StackId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

    Write-Verbose "_ChangeEnvironmentAvailability([$availability]): Locating auto scaling group in [$environment]"
    $asg = $resources |
        Single -Predicate { $_.ResourceType -eq "AWS::AutoScaling::AutoScalingGroup" }

    $asg = Get-ASAutoScalingGroup -AutoScalingGroupName $asg.PhysicalResourceId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

    $instanceIds = @()
    $standbyActivities = @()
    if ($availability -eq "Unavailable")
    {
        Write-Verbose "_ChangeEnvironmentAvailability([$availability]): Switching all instances in [$($asg.AutoScalingGroupName)] to Standby"
        $instanceIds = $asg.Instances | Where { $_.LifecycleState -eq [Amazon.AutoScaling.LifecycleState]::InService } | Select -ExpandProperty InstanceId
        if ($instanceIds | Any)
        {
            $standbyActivities = Enter-ASStandby -AutoScalingGroupName $asg.AutoScalingGroupName -InstanceId $instanceIds -ShouldDecrementDesiredCapacity $true -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
        }
    }
    elseif ($availability -eq "Available")
    {
        Write-Verbose "_ChangeEnvironmentAvailability([$availability]): Switching all instances in [$($asg.AutoScalingGroupName)] to Inservice"
        $instanceIds = $asg.Instances | Where { $_.LifecycleState -eq [Amazon.AutoScaling.LifecycleState]::Standby } | Select -ExpandProperty InstanceId
        if ($instanceIds | Any)
        {
            $standbyActivities = Exit-ASStandby -AutoScalingGroupName $asg.AutoScalingGroupName -InstanceId $instanceIds -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
        }
    }

    $anyStandbyActivities = $standbyActivities | Any
    if ($wait -and $anyStandbyActivities)
    {
        Write-Verbose "_ChangeEnvironmentAvailability([$availability]): Waiting for all scaling activities to complete"
        _WaitAutoScalingGroupActivitiesComplete -Activities $standbyActivities -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion
    }
}

With the only mechanism to affect the state of persisted data disabled, we can have some more confidence that the clone is a robust and clean copy.
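Tying it together, the clone of a live environment gets bracketed by availability changes, roughly like the simplified sketch below (the actual scripts do more error handling and logging):

```powershell
# Sketch: take the source environment out of service for the duration of the clone.
try
{
    _ChangeEnvironmentAvailability -Environment $sourceEnvironment -Availability "Unavailable" -Wait `
        -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion

    # ... snapshot the RDS instance, sync the S3 buckets, create the new environment ...
}
finally
{
    # Whatever happens, put the source environment back into service.
    _ChangeEnvironmentAvailability -Environment $sourceEnvironment -Availability "Available" -Wait `
        -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion
}
```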

To the Future!

I don’t think that the double clone solution is the best one, it’s just the best that I could come up with without having to make a lot of changes to the way we manage our environments.

Another approach would be to maintain 2 environments during migration (A and B), but only have one of those environments be active during normal operations. So to do a migration, you would spin up Prod A if Prod B already existed. At the entry point, you have a single (or multiple) DNS record that points to either A or B based on your needs. This one still involves cloning and downtime though, so for a high availability service, it won’t really work (our services can have some amount of downtime, as long as it is organized and communicated ahead of time).
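The switch in that model would just be a Route 53 record update, something like the sketch below (the hosted zone id, record name and target are all illustrative, and I haven't actually built this):

```powershell
# Sketch: repoint a CNAME from Prod B's load balancer to Prod A's.
$record = New-Object Amazon.Route53.Model.ResourceRecordSet
$record.Name = "service.example.com"
$record.Type = "CNAME"
$record.TTL = 60
$record.ResourceRecords.Add([Amazon.Route53.Model.ResourceRecord]@{ Value = $prodALoadBalancerDns })

$change = New-Object Amazon.Route53.Model.Change
$change.Action = "UPSERT"
$change.ResourceRecordSet = $record

Edit-R53ResourceRecordSet -HostedZoneId $hostedZoneId -ChangeBatch_Change $change `
    -AccessKey $awsKey -SecretKey $awsSecret
```

A short TTL on the record keeps the cutover window small, at the cost of more DNS traffic during normal operation.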

Speaking of downtime, there is another approach that you can follow in order to do zero downtime migrations. I haven’t actually done it, but if you had a mechanism to replicate incoming requests to both environments, you could conceivably bring up the new environment, let it deal with the same requests as the old environment for long enough to synchronize and to validate that it works (without responding to the user, just processing the requests) and then perform the top level switch so that the new environment becomes the active one. At some point in the future you can destroy the old environment, when you are confident that it works as expected.

This is a lot more complicated, and involves some external component managing the requests (and storing a record of all requests ever made, at least back to the last backup point) as well as each environment knowing which request it last processed. It’s certainly not impossible, but if you can tolerate downtime, it’s probably not worth the effort.

Summary

Managing your environments is not a simple task, especially when you have actual users (and if you don’t have users, why are you bothering at all?). It’s very difficult to make sure that your production (or other live environment) does not stagnate in the face of changes, and is always kept up to date.

What I’ve outlined in this blog post is a single approach that I’ve been working on over the last week or so, to deal with our specific environments. It’s not something that will work for everyone, but I thought it was worthwhile to write it down anyway, to show some of the thought processes that I needed to go through in order to accomplish the migration in a safe, robust fashion.

There is at least one nice side effect from my approach, in that we will now be able to clone any environment we want (without doing a migration, just the clone) and then use it for experiments or analysis.

I’m sure that I will run into issues eventually, but I’m pretty happy with the way the migration process is happening. It was weighing on my mind for a while, so it’s good to get it done.