
Now that we’ve (somewhat) successfully codified our environment setup and are executing it automatically every day with TeamCity, we have a new challenge. Our setup scripts create an environment that has some set of features/bugs associated with it. Not officially (we’re not really tracking environment versions like that), but definitely in spirit. As a result, we need to update environments to the latest version of the “code” whenever we fix a bug or add a feature. Just like deploying a piece of software.

To be honest, I haven’t fully nailed the whole codified environment thing just yet, but I am getting closer. Giving it some thought, I think I will probably move towards a model where the environment is built and tested (just like a piece of software) and then packaged and versioned, ready to be executed. Each environment package should consist of installation and uninstallation logic, along with any other supporting actions, in order to make it as self-contained as possible.

That might be the future. For now, we simply have a repository with scripts in it for each of our environments, supported by a set of common scripts.

The way I see it, environments fall into two categories.

  1. Environments created for a particular task, like load testing or some sort of experimental development.
  2. Environments that take part in your deployment pipeline.

The fact that we have entirely codified our environment setup gives us the power to create an environment for either of the above. The first point is not particularly interesting, but the second one is.

We have 3 standard environments, which are probably familiar to just about anyone (though maybe under different names). They are: CI, Staging and Production.

CI is the environment that is recreated every morning through TeamCity. It is used for continuous integration/deployment, and is typically not used directly for manual testing/demonstration/anything else. It forms an important part of the pipeline, as after deployment, automated functional tests are run on it, and if successful that component is (usually) automatically propagated to Staging.

Staging is, for all intents and purposes, a Production level environment. It is stable (only components that fully pass all of their tests are deployed here) and is used primarily for manual testing and feature validation, with a secondary focus on early integration within a trusted group of people (which may include external parties and exceptional customers).

Production is of course production. It’s the environment that the greater world uses for any and all executions of the software component (or components) in question. It is strictly controlled and protected, to make sure that we don’t accidentally break it, inconveniencing our customers and making them unhappy.

The problem is, how do you get changes to the underlying environment (i.e. a new version of it) into Staging/Production, without losing any state held within the system? You can’t just recreate the environment (like we do each morning for CI), because the environment contains the state, and recreating it would destroy that state.

You need another process.

Migration.

Birds Fly South for the Winter

Migration, for being such a short word, is actually incredibly difficult.

Most approaches that I’ve seen in the past involved some sort of manual migration strategy (usually written down and approved by 2+ people), which is executed by some long-suffering operations person at midnight, when hopefully no-one is trying to use the environment for its intended purpose.

A key component of any migration strategy: what happens if it goes wrong? Otherwise known as a rollback procedure.

This is, incidentally, where everything gets pretty hard.

With our environments being entirely codified in a mixture of Powershell and CloudFormation, I wanted to create something that would automatically update an environment to the latest version, without losing any of the data currently stored in the environment, and in a safe way.

CloudFormation offers the ability to update a stack after it has been created. This way you can change the template to include a new resource (or to change existing resources) and then execute the update and have AWS handle all of the updating. This probably works fine for most people, but I was uneasy at the prospect. Our environments are already completely self-contained and I didn’t understand how CloudFormation updates would handle rollbacks, or how updates would work for all components involved. I will go back and investigate it in more depth at some point in the future, but for now I wanted a more generic solution that targeted the environment itself.
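For reference, the update path I decided against looks something like this via the AWS PowerShell cmdlets. It is only a minimal sketch; the stack name and template path are placeholders, and Wait-CFNStack is only available in more recent versions of the tools.

# A minimal sketch of the CloudFormation update path I decided against (for now).
# The stack name and template path are placeholders, not our real ones.
$templateBody = Get-Content -Path ".\templates\environment.template" -Raw

# -Capability is only required if the template creates IAM resources.
Update-CFNStack -StackName "my-environment-stack" `
    -TemplateBody $templateBody `
    -Capability CAPABILITY_IAM `
    -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

# Block until the update completes (or rolls back) before doing anything else.
Wait-CFNStack -StackName "my-environment-stack" -Timeout 3600 `
    -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion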

My idea was fairly simple.

What if I could clone an environment? I could make a clone of the environment I wanted to migrate, test the clone to make sure all the data came through okay and its behaviour was still the same, delete the old environment and then clone the temporary environment again, into the original environment’s name. At any point up to the delete of the old environment I could just stop, and everything would be the same as it was before. No need for messy rollbacks that might only do a partial job.

Of course, the idea is not actually all that simple in practice.

A Perfect Clone

In order to clone an environment, you need to identify the parts of the environment that contain persistent data (and would not automatically be created by the environment setup).  Databases and file storage (S3, disk, etc) are examples of persistent data. Log files are another example of persistent data, except they don’t really matter from a migration point of view, mostly because all of our log entries are aggregated into an ELK stack. Even if they weren’t aggregated, they probably still wouldn’t be worth spending time on.

In the case of the specific environment I’m working on for the migration this time, there is an RDS instance (the database) and at least one S3 bucket containing user data. Everything else about the environment is transient, and I won’t need to worry about it.

Luckily for me, cloning an RDS instance and an S3 bucket is relatively easy.

With RDS you can simply take a snapshot and then use that snapshot as an input into the RDS instance creation on the new environment. Fairly straightforward.

function _WaitRdsSnapshotAvailable
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [string]$snapshotId,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsRegion,
        [int]$timeoutSeconds=3000
    )

    write-verbose "Waiting for the RDS Snapshot with Id [$snapshotId] to be [available]."
    $incrementSeconds = 15
    $totalWaitTime = 0
    while ($true)
    {
        $a = Get-RDSDBSnapshot -DBSnapshotIdentifier $snapshotId -Region $awsRegion -AccessKey $awsKey -SecretKey $awsSecret
        $status = $a.Status

        if ($status -eq "available")
        {
            write-verbose "The RDS Snapshot with Id [$snapshotId] has exited [$testStatus] into [$status] taking [$totalWaitTime] seconds."
            return $a
        }

        write-verbose "Current status of RDS Snapshot with Id [$snapshotId] is [$status]. Waiting [$incrementSeconds] seconds and checking again for change."

        Start-Sleep -Seconds $incrementSeconds
        $totalWaitTime = $totalWaitTime + $incrementSeconds
        if ($totalWaitTime -gt $timeoutSeconds)
        {
            throw "The RDS Snapshot with Id [$snapshotId] was not [available] within [$timeoutSeconds] seconds."
        }
    }
}

... snip some scripts getting CFN stacks ...

$resources = Get-CFNStackResources -StackName $sourceStack.StackId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

$rds = $resources |
    Single -Predicate { $_.ResourceType -eq "AWS::RDS::DBInstance" }

$timestamp = [DateTime]::UtcNow.ToString("yyyyddMMHHmmss")
$snapshotId = "$sourceEnvironment-for-clone-to-$destinationEnvironment-$timestamp"
$snapshot = New-RDSDBSnapshot -DBInstanceIdentifier $rds.PhysicalResourceId -DBSnapshotIdentifier $snapshotId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

_WaitRdsSnapshotAvailable -SnapshotId $snapshotId -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion
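The other half of that equation (feeding the snapshot into the new environment’s database) happens at stack creation time. A rough sketch, assuming the environment template exposes a parameter called RdsSnapshotIdentifier that it plugs into the DBSnapshotIdentifier property of its AWS::RDS::DBInstance resource ($templateBody and the parameter name are assumptions, not something shown above):

# Hand the snapshot over to the new environment as a CloudFormation parameter.
# "RdsSnapshotIdentifier" and $templateBody are assumptions about the environment template.
$parameters = @(
    @{ ParameterKey = "RdsSnapshotIdentifier"; ParameterValue = $snapshotId }
)

New-CFNStack -StackName "$destinationEnvironment-environment" `
    -TemplateBody $templateBody `
    -Parameter $parameters `
    -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion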

With S3, you can just copy the bucket contents. I say just, but in reality there is no support for a “sync” command in the AWS Powershell cmdlets. There is a sync command on the AWS CLI though, so I wrote a wrapper around the CLI and execute the sync command through it. It works pretty nicely. Essentially it’s broken into two parts: the part that deals with locating and extracting the AWS CLI to a known location, and the part that actually does the clone. The only difficult bit was that you don’t seem to be able to just supply credentials to the AWS CLI executable, at least not in the way that I would expect (i.e. as parameters). Instead you have to use a profile, or use environment variables.

function Get-AwsCliExecutablePath
{
    if ($rootDirectory -eq $null) { throw "rootDirectory script scoped variable not set. That's bad, it's used to find dependencies." }
    $rootDirectoryPath = $rootDirectory.FullName

    $commonScriptsDirectoryPath = "$rootDirectoryPath\scripts\common"

    . "$commonScriptsDirectoryPath\Functions-Compression.ps1"

    $toolsDirectoryPath = "$rootDirectoryPath\tools"
    $nugetPackagesDirectoryPath = "$toolsDirectoryPath\packages"

    $packageId = "AWSCLI64"
    $packageVersion = "1.7.41"

    $expectedDirectory = "$nugetPackagesDirectoryPath\$packageId.$packageVersion"
    if (-not (Test-Path $expectedDirectory))
    {
        $extractedDir = 7Zip-Unzip "$toolsDirectoryPath\dist\$packageId.$packageVersion.7z" "$toolsDirectoryPath\packages"
    }

    $executable = "$expectedDirectory\aws.exe"

    return $executable
}

function Clone-S3Bucket
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [string]$sourceBucketName,
        [Parameter(Mandatory=$true)]
        [string]$destinationBucketName,
        [Parameter(Mandatory=$true)]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [string]$awsRegion
    )

    if ($rootDirectory -eq $null) { throw "RootDirectory script scoped variable not set. That's bad, it's used to find dependencies." }
    $rootDirectoryPath = $rootDirectory.FullName

    . "$rootDirectoryPath\scripts\common\Functions-Aws.ps1"

    $awsCliExecutablePath = Get-AwsCliExecutablePath

    $previousAWSKey = $env:AWS_ACCESS_KEY_ID
    $previousAWSSecret = $env:AWS_SECRET_ACCESS_KEY
    $previousAWSRegion = $env:AWS_DEFAULT_REGION

    $env:AWS_ACCESS_KEY_ID = $awsKey
    $env:AWS_SECRET_ACCESS_KEY = $awsSecret
    $env:AWS_DEFAULT_REGION = $awsRegion

    & $awsCliExecutablePath s3 sync s3://$sourceBucketName s3://$destinationBucketName

    $env:AWS_ACCESS_KEY_ID = $previousAWSKey
    $env:AWS_SECRET_ACCESS_KEY = $previousAWSSecret
    $env:AWS_DEFAULT_REGION = $previousAWSRegion
}
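Calling it is then just a matter of pulling the bucket names out of the source and destination stacks, in the same way the RDS instance was located earlier. This sketch assumes a single bucket per stack and that $destinationStack has already been retrieved like $sourceStack was:

# Locate the user data bucket in each environment and copy the contents across.
$sourceBucket = $resources |
    Single -Predicate { $_.ResourceType -eq "AWS::S3::Bucket" }

$destinationResources = Get-CFNStackResources -StackName $destinationStack.StackId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
$destinationBucket = $destinationResources |
    Single -Predicate { $_.ResourceType -eq "AWS::S3::Bucket" }

Clone-S3Bucket -sourceBucketName $sourceBucket.PhysicalResourceId `
    -destinationBucketName $destinationBucket.PhysicalResourceId `
    -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion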

I do have some concerns that as the bucket gets bigger, the clone will take longer and longer. I’ll cross that bridge when I come to it.

Using the identified areas of persistence above, the only change I need to make is to alter the new environment script to take them as optional inputs (specifically the RDS snapshot). If they are supplied, it will use them, if they are not, it will default to normal creation.
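In practice that just means the environment creation function grows a non-mandatory parameter and conditionally turns it into a template parameter. The following is illustrative only; New-Environment and the parameter names are stand-ins for the real script:

# Illustrative sketch of the optional input pattern, not the real script.
function New-Environment
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$environment,
        [string]$rdsSnapshotIdentifier
    )

    $parameters = @()
    if (-not [String]::IsNullOrEmpty($rdsSnapshotIdentifier))
    {
        # Restore the database from the supplied snapshot instead of creating a fresh one.
        $parameters += @{ ParameterKey = "RdsSnapshotIdentifier"; ParameterValue = $rdsSnapshotIdentifier }
    }

    # ... assemble the rest of the parameters and call New-CFNStack as normal ...
}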

Job done, right?

A Clean Snapshot

The clone approach works well enough, but in order to perform a migration on a system that is actively being used, you need to make sure that the content does not change while you are doing it. If you don’t do this, you can potentially lose data during a migration. The most common example would be if you clone the environment, but after the clone some requests occur and the data changes. If you then delete the original and migrate back, you’ve lost that data. There are other variations as well.

This means that you need the ability to put an environment into standby mode, where it is still running, and everything is working, but it is no longer accepting user requests.

Most of our environments are fairly simple and are based around web services. They have a number of instances behind a load balancer, managed by an auto scaling group. Behind those instances are backend services, like databases and other forms of persistence/scheduled task management.

AWS Auto Scaling Groups allow you to put instances into Standby mode, which removes them from the load balancer (meaning they will no longer have requests forwarded to them) but does not delete or otherwise terminate them. More importantly, when you move instances into Standby you can decrement the desired capacity of the Auto Scaling Group at the same time, meaning it won’t go and create X more instances to service user requests, which obviously would muck the whole plan up.

This is exactly what we need to put our environment into a Standby mode (at least until we have scheduled tasks that deal with underlying data anyway). I took the ability to shift instances into Standby mode and wrapped it into a function for setting the availability of the environment (because that’s the concept that I’m interested in; the Standby mode instances are just a mechanism to accomplish it).

function _ChangeEnvironmentAvailability
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$environment,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsRegion,
        [Parameter(Mandatory=$true)]
        [ValidateSet("Available", "Unavailable")]
        [string]$availability,
        [switch]$wait
    )

    if ($rootDirectory -eq $null) { throw "RootDirectory script scoped variable not set. That's bad, it's used to find dependencies." }
    $rootDirectoryPath = $rootDirectory.FullName

    . "$rootDirectoryPath\scripts\common\Functions-Enumerables.ps1"

    . "$rootDirectoryPath\scripts\common\Functions-Aws.ps1"
    Ensure-AwsPowershellFunctionsAvailable

    $sourceStackName = Get-StackName -Environment $environment -UniqueComponentIdentifier (_GetUniqueComponentIdentifier)
    $sourceStack = Get-CFNStack -StackName $sourceStackName -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

    $resources = Get-CFNStackResources -StackName $sourceStack.StackId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

    Write-Verbose "_ChangeEnvironmentAvailability([$availability]): Locating auto scaling group in [$environment]"
    $asg = $resources |
        Single -Predicate { $_.ResourceType -eq "AWS::AutoScaling::AutoScalingGroup" }

    $asg = Get-ASAutoScalingGroup -AutoScalingGroupName $asg.PhysicalResourceId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

    $instanceIds = @()
    $standbyActivities = @()
    if ($availability -eq "Unavailable")
    {
        Write-Verbose "_ChangeEnvironmentAvailability([$availability]): Switching all instances in [$($asg.AutoScalingGroupName)] to Standby"
        $instanceIds = $asg.Instances | Where { $_.LifecycleState -eq [Amazon.AutoScaling.LifecycleState]::InService } | Select -ExpandProperty InstanceId
        if ($instanceIds | Any)
        {
            $standbyActivities = Enter-ASStandby -AutoScalingGroupName $asg.AutoScalingGroupName -InstanceId $instanceIds -ShouldDecrementDesiredCapacity $true -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
        }
    }
    elseif ($availability -eq "Available")
    {
        Write-Verbose "_ChangeEnvironmentAvailability([$availability]): Switching all instances in [$($asg.AutoScalingGroupName)] to Inservice"
        $instanceIds = $asg.Instances | Where { $_.LifecycleState -eq [Amazon.AutoScaling.LifecycleState]::Standby } | Select -ExpandProperty InstanceId
        if ($instanceIds | Any)
        {
            $standbyActivities = Exit-ASStandby -AutoScalingGroupName $asg.AutoScalingGroupName -InstanceId $instanceIds -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
        }
    }

    $anyStandbyActivities = $standbyActivities | Any
    if ($wait -and $anyStandbyActivities)
    {
        Write-Verbose "_ChangeEnvironmentAvailability([$availability]): Waiting for all scaling activities to complete"
        _WaitAutoScalingGroupActivitiesComplete -Activities $standbyActivities -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion
    }
}

With the only mechanism to affect the state of persisted data disabled, we can have some more confidence that the clone is a robust and clean copy.
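Stitched together, the migration itself is conceptually just a sequence of those building blocks. A simplified sketch follows; Clone-Environment, Test-Environment and Delete-Environment are stand-ins for the real scripts, and the real process has far more verification and error handling:

# Simplified sketch of the double clone migration, using the functions above.
# Clone-Environment, Test-Environment and Delete-Environment are stand-ins.
$temporaryEnvironment = "$environment-migration-temp"

# Stop user requests from changing the persisted data while it is being copied.
_ChangeEnvironmentAvailability -Environment $environment -Availability Unavailable -Wait -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion

# Clone into a temporary environment (RDS snapshot, S3 sync, fresh stack) and check it.
Clone-Environment -SourceEnvironment $environment -DestinationEnvironment $temporaryEnvironment
Test-Environment -Environment $temporaryEnvironment

# Point of no return: remove the old environment and clone back under the original name.
Delete-Environment -Environment $environment
Clone-Environment -SourceEnvironment $temporaryEnvironment -DestinationEnvironment $environment
Test-Environment -Environment $environment

# Clean up the temporary environment once everything checks out.
Delete-Environment -Environment $temporaryEnvironment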

To the Future!

I don’t think that the double clone solution is the best one; it’s just the best that I could come up with without having to make a lot of changes to the way we manage our environments.

Another approach would be to maintain 2 environments during migration (A and B), but only have one of those environments be active during normal operations. So to do a migration, you would spin up Prod A if Prod B already existed. At the entry point, you have a single (or multiple) DNS record that points to either A or B based on your needs. This one still involves cloning and downtime though, so for a high availability service, it won’t really work (our services can have some amount of downtime, as long as it is organized and communicated ahead of time).
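Flipping between A and B at the entry point would then just be a Route 53 record change, something like the sketch below. The hosted zone id, record name and load balancer address are all placeholders.

# Sketch of the A/B switch: UPSERT the public CNAME so it points at whichever
# environment's load balancer should be active. All identifiers are placeholders.
$change = New-Object Amazon.Route53.Model.Change
$change.Action = "UPSERT"
$change.ResourceRecordSet = New-Object Amazon.Route53.Model.ResourceRecordSet
$change.ResourceRecordSet.Name = "service.example.com"
$change.ResourceRecordSet.Type = "CNAME"
$change.ResourceRecordSet.TTL = 60
$change.ResourceRecordSet.ResourceRecords.Add(@{ Value = "prod-b-elb-123456.ap-southeast-2.elb.amazonaws.com" })

Edit-R53ResourceRecordSet -HostedZoneId "ZXXXXXXXXXXXXX" `
    -ChangeBatch_Comment "Switching the active production environment to B" `
    -ChangeBatch_Change $change `
    -AccessKey $awsKey -SecretKey $awsSecret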

Speaking of downtime, there is another approach that you can follow in order to do zero downtime migrations. I haven’t actually done it, but if you had a mechanism to replicate incoming requests to both environments, you could conceivably bring up the new environment, let it deal with the same requests as the old environment for long enough to synchronize and to validate that it works (without responding to the user, just processing the requests) and then perform the top level switch so that the new environment becomes the active one. At some point in the future you can destroy the old environment, when you are confident that it works as expected.

This is a lot more complicated, and involves some external component managing the requests (and storing a record of all requests ever made, at least back to the last backup point) as well as each environment knowing what request it last processed. It’s certainly not impossible, but if you can tolerate downtime, it’s probably not worth the effort.

Summary

Managing your environments is not a simple task, especially when you have actual users (and if you don’t have users, why are you bothering at all?). It’s very difficult to make sure that your production (or other live environment) does not stagnate in the face of changes, and is always kept up to date.

What I’ve outlined in this blog post is a single approach that I’ve been working on over the last week or so, to deal with our specific environments. It’s not something that will work for everyone, but I thought it was worthwhile to write it down anyway, to show some of the thought processes that I needed to go through in order to accomplish the migration in a safe, robust fashion.

There is at least one nice side effect from my approach, in that we will now be able to clone any environment we want (without doing a migration, just the clone) and then use it for experiments or analysis.

I’m sure that I will run into issues eventually, but I’m pretty happy with the way the migration process is happening. It was weighing on my mind for a while, so it’s good to get it done.