
Now that we’ve (somewhat) successfully codified our environment setup and are executing it automatically every day with TeamCity, we have a new challenge. Our setup scripts create an environment that has some set of features/bugs associated with it. Not officially (we’re not really tracking environment versions like that), but definitely in spirit. As a result, we need to update environments to the latest version of the “code” whenever we fix a bug or add a feature. Just like deploying a piece of software.

To be honest, I haven’t fully nailed the whole codified environment thing just yet, but I am getting closer. Giving it some thought, I think I will probably move towards a model where the environment is built and tested (just like a piece of software) and then packaged and versioned, ready to be executed. Each environment package should consist of installation and uninstallation logic, along with any other supporting actions, in order to make them as self-contained as possible.

That might be the future. For now, we simply have a repository with scripts in it for each of our environments, supported by a set of common scripts.

The way I see it, environments fall into two categories.

  1. Environments created for a particular task, like load testing or some sort of experimental development.
  2. Environments that take part in your deployment pipeline.

The fact that we have entirely codified our environment setup gives us the power to create an environment for either of the above. The first point is not particularly interesting, but the second one is.

We have three standard environments, which are probably familiar to just about anyone (though maybe under different names). They are CI, Staging and Production.

CI is the environment that is recreated every morning through TeamCity. It is used for continuous integration/deployment, and is typically not used directly for manual testing/demonstration/anything else. It forms an important part of the pipeline, as after deployment, automated functional tests are run on it, and if successful that component is (usually) automatically propagated to Staging.

Staging is, for all intents and purposes, a Production level environment. It is stable (only components that fully pass all of their tests are deployed here) and is used primarily for manual testing and feature validation, with a secondary focus on early integration within a trusted group of people (which may include external parties and exceptional customers).

Production is, of course, production. It’s the environment that the greater world uses for any and all executions of the software component (or components) in question. It is strictly controlled and protected, to make sure that we don’t accidentally break it, inconveniencing our customers and making them unhappy.

The problem is, how do you get changes to the underlying environment (i.e. a new version of it) into Staging/Production, without losing any state held within the system? You can’t just recreate the environment (like we do each morning for CI), because the environment contains the state, and that destroys it.

You need another process.

Migration.

Birds Fly South for the Winter

Migration, for being such a short word, is actually incredibly difficult.

Most approaches that I’ve seen in the past involved some sort of manual migration strategy (usually written down and approved by 2+ people), which is executed by some long-suffering operations person at midnight, when hopefully no one is trying to use the environment for its intended purpose.

A key component to any migration strategy: What happens if it goes wrong? Otherwise known as a rollback procedure.

This is, incidentally, where everything gets pretty hard.

With our environments being entirely codified in a mixture of Powershell and CloudFormation, I wanted to create something that would automatically update an environment to the latest version, without losing any of the data currently stored in the environment, and in a safe way.

CloudFormation offers the ability to update a stack after it has been created. This way you can change the template to include a new resource (or to change existing resources) and then execute the update and have AWS handle all of the updating. This probably works fine for most people, but I was uneasy at the prospect. Our environments are already completely self-contained and I didn’t understand how CloudFormation updates would handle rollbacks, or how updates would work for all components involved. I will go back and investigate it in more depth at some point in the future, but for now I wanted a more generic solution that targeted the environment itself.

My idea was fairly simple.

What if I could clone an environment? I could make a clone of the environment I wanted to migrate, test the clone to make sure all the data came through okay and its behaviour was still the same, delete the old environment and then clone the temporary environment again, into the original environment’s name. At any point up to the delete of the old environment I could just stop, and everything would be the same as it was before. No need for messy rollbacks that might only do a partial job.
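In Powershell terms, the overall process looks something like this. This is a sketch only; Clone-Environment, Test-Environment and Delete-Environment are hypothetical wrappers, not the real functions from our scripts repository:

```powershell
# Hypothetical sketch of the double-clone migration process.
# Clone-Environment, Test-Environment and Delete-Environment are
# placeholder names, not actual functions from our repository.
function Migrate-Environment
{
    param
    (
        [Parameter(Mandatory=$true)]
        [string]$environment
    )

    $temp = "$environment-migration-temp"

    # 1. Clone the live environment into a temporary one.
    Clone-Environment -Source $environment -Destination $temp

    # 2. Verify the clone before touching the original.
    if (-not (Test-Environment -Environment $temp))
    {
        # Nothing has been destroyed yet, so stopping here is safe.
        throw "Clone [$temp] failed verification. Original [$environment] is untouched."
    }

    # 3. The point of no return: delete the original...
    Delete-Environment -Environment $environment

    # 4. ...and clone the temporary environment back into the original
    # name, which builds it from the latest version of the scripts.
    Clone-Environment -Source $temp -Destination $environment

    # 5. Clean up the temporary environment.
    Delete-Environment -Environment $temp
}
```

The nice property is that every step before the delete is non-destructive, so "rollback" is just stopping and deleting the temporary clone.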

Of course, the idea is not actually all that simple in practice.

A Perfect Clone

In order to clone an environment, you need to identify the parts of the environment that contain persistent data (and would not automatically be created by the environment setup). Databases and file storage (S3, disk, etc) are examples of persistent data. Log files are another example of persistent data, except they don’t really matter from a migration point of view, mostly because all of our log entries are aggregated into an ELK stack. Even if they weren’t aggregated, they probably still wouldn’t be worth spending time on.

In the case of the specific environment I’m working on for the migration this time, there is an RDS instance (the database) and at least one S3 bucket containing user data. Everything else about the environment is transient, and I won’t need to worry about it.

Luckily for me, cloning an RDS instance and an S3 bucket is relatively easy.

With RDS you can simply take a snapshot and then use that snapshot as an input into the RDS instance creation on the new environment. Fairly straightforward.

function _WaitRdsSnapshotAvailable
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [string]$snapshotId,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsRegion,
        [int]$timeoutSeconds=3000
    )

    write-verbose "Waiting for the RDS Snapshot with Id [$snapshotId] to be [available]."
    $incrementSeconds = 15
    $totalWaitTime = 0
    while ($true)
    {
        $a = Get-RDSDBSnapshot -DBSnapshotIdentifier $snapshotId -Region $awsRegion -AccessKey $awsKey -SecretKey $awsSecret
        $status = $a.Status

        if ($status -eq "available")
        {
            write-verbose "The RDS Snapshot with Id [$snapshotId] is now [$status], taking [$totalWaitTime] seconds."
            return $a
        }

        write-verbose "Current status of RDS Snapshot with Id [$snapshotId] is [$status]. Waiting [$incrementSeconds] seconds and checking again for change."

        Start-Sleep -Seconds $incrementSeconds
        $totalWaitTime = $totalWaitTime + $incrementSeconds
        if ($totalWaitTime -gt $timeoutSeconds)
        {
            throw "The RDS Snapshot with Id [$snapshotId] was not [available] within [$timeoutSeconds] seconds."
        }
    }
}

... snip some scripts getting CFN stacks ...

$resources = Get-CFNStackResources -StackName $sourceStack.StackId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

$rds = $resources |
    Single -Predicate { $_.ResourceType -eq "AWS::RDS::DBInstance" }

$timestamp = [DateTime]::UtcNow.ToString("yyyyMMddHHmmss")
$snapshotId = "$sourceEnvironment-for-clone-to-$destinationEnvironment-$timestamp"
$snapshot = New-RDSDBSnapshot -DBInstanceIdentifier $rds.PhysicalResourceId -DBSnapshotIdentifier $snapshotId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

_WaitRdsSnapshotAvailable -SnapshotId $snapshotId -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion

With S3, you can just copy the bucket contents. I say just, but in reality there is no support for a “sync” command in the AWS Powershell cmdlets. There is a sync command on the AWS CLI though, so I wrote a wrapper around the CLI and execute the sync command there. It works pretty nicely. Essentially it’s broken into two parts, the part that deals with actually locating and extracting the AWS CLI to a known location, and then the part that actually does the clone. The only difficult bit was that you don’t seem to be able to just supply credentials to the AWS CLI executable, at least in a way that I would expect (i.e. as parameters). Instead you have to use a profile, or use environment variables.

function Get-AwsCliExecutablePath
{
    if ($rootDirectory -eq $null) { throw "rootDirectory script scoped variable not set. That's bad, it's used to find dependencies." }
    $rootDirectoryPath = $rootDirectory.FullName

    $commonScriptsDirectoryPath = "$rootDirectoryPath\scripts\common"

    . "$commonScriptsDirectoryPath\Functions-Compression.ps1"

    $toolsDirectoryPath = "$rootDirectoryPath\tools"
    $nugetPackagesDirectoryPath = "$toolsDirectoryPath\packages"

    $packageId = "AWSCLI64"
    $packageVersion = "1.7.41"

    $expectedDirectory = "$nugetPackagesDirectoryPath\$packageId.$packageVersion"
    if (-not (Test-Path $expectedDirectory))
    {
        $extractedDir = 7Zip-Unzip "$toolsDirectoryPath\dist\$packageId.$packageVersion.7z" "$toolsDirectoryPath\packages"
    }

    $executable = "$expectedDirectory\aws.exe"

    return $executable
}

function Clone-S3Bucket
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [string]$sourceBucketName,
        [Parameter(Mandatory=$true)]
        [string]$destinationBucketName,
        [Parameter(Mandatory=$true)]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [string]$awsRegion
    )

    if ($rootDirectory -eq $null) { throw "RootDirectory script scoped variable not set. That's bad, it's used to find dependencies." }
    $rootDirectoryPath = $rootDirectory.FullName

    . "$rootDirectoryPath\scripts\common\Functions-Aws.ps1"

    $awsCliExecutablePath = Get-AwsCliExecutablePath

    $previousAWSKey = $env:AWS_ACCESS_KEY_ID
    $previousAWSSecret = $env:AWS_SECRET_ACCESS_KEY
    $previousAWSRegion = $env:AWS_DEFAULT_REGION

    $env:AWS_ACCESS_KEY_ID = $awsKey
    $env:AWS_SECRET_ACCESS_KEY = $awsSecret
    $env:AWS_DEFAULT_REGION = $awsRegion

    & $awsCliExecutablePath s3 sync s3://$sourceBucketName s3://$destinationBucketName

    $env:AWS_ACCESS_KEY_ID = $previousAWSKey
    $env:AWS_SECRET_ACCESS_KEY = $previousAWSSecret
    $env:AWS_DEFAULT_REGION = $previousAWSRegion
}

I do have some concerns that as the bucket gets bigger, the clone will take longer and longer. I’ll cross that bridge when I come to it.

Using the identified areas of persistence above, the only change I need to make is to alter the environment creation script to take them as optional inputs (specifically the RDS snapshot). If they are supplied, it will use them; if they are not, it will default to normal creation.
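As a sketch, the creation script’s parameter handling might look like this. Parameter and template key names are illustrative only, not the real script:

```powershell
# Sketch: the environment creation script takes an optional snapshot
# identifier. Parameter and template key names are illustrative.
param
(
    [Parameter(Mandatory=$true)]
    [string]$environment,
    [string]$rdsSnapshotIdentifier
)

$parameters = @{ "Environment" = $environment }
if (-not [string]::IsNullOrEmpty($rdsSnapshotIdentifier))
{
    # When present, the CloudFormation template uses this value for the
    # DBSnapshotIdentifier property of the RDS resource, restoring the
    # database from the snapshot instead of creating a fresh one.
    $parameters["RdsSnapshotIdentifier"] = $rdsSnapshotIdentifier
}
```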

Job done, right?

A Clean Snapshot

The clone approach works well enough, but in order to perform a migration on a system that is actively being used, you need to make sure that the content does not change while you are doing it. If you don’t do this, you can potentially lose data during a migration. The most common example would be if you clone the environment, but after the clone some requests occur and the data changes. If you then delete the original and migrate back, you’ve lost that data. There are other variations as well.

This means that you need the ability to put an environment into standby mode, where it is still running, and everything is working, but it is no longer accepting user requests.

Most of our environments are fairly simple and are based around web services. They have a number of instances behind a load balancer, managed by an auto scaling group. Behind those instances are backend services, like databases and other forms of persistence/scheduled task management.

AWS Auto Scaling Groups allow you to set instances into Standby mode, which removes them from the load balancer (meaning they will no longer have requests forwarded to them) but does not delete or otherwise terminate them. More importantly, instances in Standby can count towards the desired number of instances in the Auto Scaling Group, meaning it won’t go and create X more instances to service user requests, which obviously would muck the whole plan up.

This is exactly what we need to set our environment into a Standby mode (at least until we have scheduled tasks that deal with underlying data anyway). I took the ability to shift instances into Standby mode and wrapped it into a function for setting the availability of the environment (because that’s the concept that I’m interested in, the Standby mode instances are just a mechanism to accomplish that).

function _ChangeEnvironmentAvailability
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$environment,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsRegion,
        [Parameter(Mandatory=$true)]
        [ValidateSet("Available", "Unavailable")]
        [string]$availability,
        [switch]$wait
    )

    if ($rootDirectory -eq $null) { throw "RootDirectory script scoped variable not set. That's bad, it's used to find dependencies." }
    $rootDirectoryPath = $rootDirectory.FullName

    . "$rootDirectoryPath\scripts\common\Functions-Enumerables.ps1"

    . "$rootDirectoryPath\scripts\common\Functions-Aws.ps1"
    Ensure-AwsPowershellFunctionsAvailable

    $sourceStackName = Get-StackName -Environment $environment -UniqueComponentIdentifier (_GetUniqueComponentIdentifier)
    $sourceStack = Get-CFNStack -StackName $sourceStackName -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

    $resources = Get-CFNStackResources -StackName $sourceStack.StackId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

    Write-Verbose "_ChangeEnvironmentAvailability([$availability]): Locating auto scaling group in [$environment]"
    $asg = $resources |
        Single -Predicate { $_.ResourceType -eq "AWS::AutoScaling::AutoScalingGroup" }

    $asg = Get-ASAutoScalingGroup -AutoScalingGroupName $asg.PhysicalResourceId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

    $instanceIds = @()
    $standbyActivities = @()
    if ($availability -eq "Unavailable")
    {
        Write-Verbose "_ChangeEnvironmentAvailability([$availability]): Switching all instances in [$($asg.AutoScalingGroupName)] to Standby"
        $instanceIds = $asg.Instances | Where { $_.LifecycleState -eq [Amazon.AutoScaling.LifecycleState]::InService } | Select -ExpandProperty InstanceId
        if ($instanceIds | Any)
        {
            $standbyActivities = Enter-ASStandby -AutoScalingGroupName $asg.AutoScalingGroupName -InstanceId $instanceIds -ShouldDecrementDesiredCapacity $true -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
        }
    }
    elseif ($availability -eq "Available")
    {
        Write-Verbose "_ChangeEnvironmentAvailability([$availability]): Switching all instances in [$($asg.AutoScalingGroupName)] to Inservice"
        $instanceIds = $asg.Instances | Where { $_.LifecycleState -eq [Amazon.AutoScaling.LifecycleState]::Standby } | Select -ExpandProperty InstanceId
        if ($instanceIds | Any)
        {
            $standbyActivities = Exit-ASStandby -AutoScalingGroupName $asg.AutoScalingGroupName -InstanceId $instanceIds -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
        }
    }

    $anyStandbyActivities = $standbyActivities | Any
    if ($wait -and $anyStandbyActivities)
    {
        Write-Verbose "_ChangeEnvironmentAvailability([$availability]): Waiting for all scaling activities to complete"
        _WaitAutoScalingGroupActivitiesComplete -Activities $standbyActivities -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion
    }
}

With the only mechanism to affect the state of persisted data disabled, we can have some more confidence that the clone is a robust and clean copy.

To the Future!

I don’t think that the double clone solution is the best one, it’s just the best that I could come up with without having to make a lot of changes to the way we manage our environments.

Another approach would be to maintain 2 environments during migration (A and B), but only have one of those environments be active during normal operations. So to do a migration, you would spin up Prod A if Prod B already existed. At the entry point, you have a single (or multiple) DNS record that points to either A or B based on your needs. This one still involves cloning and downtime though, so for a high availability service, it won’t really work (our services can have some amount of downtime, as long as it is organized and communicated ahead of time).
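The entry point switch for that A/B approach could be done with Route 53. I haven’t implemented this, so the following is a hedged sketch only; the zone id, record names and cmdlet parameter shapes are illustrative and may differ between AWS Tools for Powershell versions:

```powershell
# Sketch: flipping a shared DNS name to point at environment B via
# Route 53. Zone id and record names are examples only.
$recordSet = New-Object Amazon.Route53.Model.ResourceRecordSet
$recordSet.Name = "service.example.com."
$recordSet.Type = "CNAME"
$recordSet.TTL = 60
$record = New-Object Amazon.Route53.Model.ResourceRecord
$record.Value = "prod-b-elb.example.com"
$recordSet.ResourceRecords.Add($record)

# UPSERT creates the record if it is missing, or overwrites it if it
# exists, which is what a switch between A and B needs.
$change = New-Object Amazon.Route53.Model.Change
$change.Action = "UPSERT"
$change.ResourceRecordSet = $recordSet

Edit-R53ResourceRecordSet -HostedZoneId "Z_EXAMPLE" -ChangeBatch_Change $change -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
```

A low TTL on the record keeps the switchover window short, at the cost of more DNS lookups during normal operation.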

Speaking of downtime, there is another approach that you can follow in order to do zero downtime migrations. I haven’t actually done it, but if you had a mechanism to replicate incoming requests to both environments, you could conceivably bring up the new environment, let it deal with the same requests as the old environment for long enough to synchronize and to validate that it works (without responding to the user, just processing the requests) and then perform the top level switch so that the new environment becomes the active one. At some point in the future you can destroy the old environment, when you are confident that it works as expected.

This is a lot more complicated, and involves some external component managing the requests (and storing a record of all requests ever made, at least back to the last backup point) as well as each environment knowing what request they last processed. It’s certainly not impossible, but if you can tolerate downtime, it’s probably not worth the effort.

Summary

Managing your environments is not a simple task, especially when you have actual users (and if you don’t have users, why are you bothering at all?). It’s very difficult to make sure that your production (or other live environment) does not stagnate in the face of changes, and is always kept up to date.

What I’ve outlined in this blog post is a single approach that I’ve been working on over the last week or so, to deal with our specific environments. It’s not something that will work for everyone, but I thought it was worthwhile to write it down anyway, to show some of the thought processes that I needed to go through in order to accomplish the migration in a safe, robust fashion.

There is at least one nice side effect from my approach, in that we will now be able to clone any environment we want (without doing a migration, just the clone) and then use it for experiments or analysis.

I’m sure that I will run into issues eventually, but I’m pretty happy with the way the migration process is happening. It was weighing on my mind for a while, so it’s good to get it done.


In my quest to automate all the things, I’ve failed a lot.

A lot.

Typically the failures are configuration related (forgot to do X, didn’t line up the names of Y and Z properly, accidentally used an uninitialized variable (Powershell is pretty “flexible”)), but sometimes they are just transient. The cloud can be like that and you need to make sure you build with fault tolerance in mind. For example, sometimes an AWS EC2 instance will just take ages to complete its initialization, and your environment will fail to create with a timeout.

Using CloudFormation, you can use WaitConditions and WaitHandles to keep track of the setup of the more complicated components of your environment (like EC2 instances). Generally you take the exit code from your initialization on the EC2 instance (for us that is typically cfn-init) and push it through cfn-signal pointing at a particular WaitHandle. In the case where the exit code is non-zero, CloudFormation considers it a failure and it bubbles up and fails the stack (or whatever your error handling strategy is). If you need to you can supply a reason as well (a short string) which can help in identifying the root cause of the failure.
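The shape of this in a template is roughly the following (trimmed to the relevant resources; names are examples). On the instance itself, initialization ends with something like cfn-signal.exe -e %ERRORLEVEL% "<wait handle URL>", where a zero exit code signals success:

```json
{
    "Resources": {
        "InstanceWaitHandle": {
            "Type": "AWS::CloudFormation::WaitConditionHandle"
        },
        "InstanceWaitCondition": {
            "Type": "AWS::CloudFormation::WaitCondition",
            "DependsOn": "Instance",
            "Properties": {
                "Handle": { "Ref": "InstanceWaitHandle" },
                "Timeout": "1800"
            }
        }
    }
}
```

If no signal arrives within the Timeout (in seconds), the wait condition fails and the stack follows whatever failure handling you’ve configured.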

Up until recently, when my environment setups failed, I would report the failure (i.e. this stack failed to initialize) and then automatically clean it up. My scripts would wait to see if the stack status changed to CREATE_COMPLETE within some timeout, and in the case of a failure, simply reported that the stack failed. If I was executing from a test, I would typically break on the line that does the cleanup, and investigate the root cause of the failure before I let the stack be torn down.

Now that I’m automatically executing environment creation/destruction through TeamCity, that kind of approach doesn’t really cut it.

If I arrive at work in the morning and see that the CI environment for service A failed to automatically re-create I need to know why. The stack will have already been cleaned up, so unless the scripts report as much information as possible, my only real recourse is to run it again locally and see what happened. This only works if the failure is consistent. It might run perfectly the second time, which is incredibly annoying when you’re trying to track down a failure.

It was time to think about failing a little bit harder.

Investigating Failures

When a stack fails I typically do 3 things.

The first is to check to see if the stack failed on creation (i.e. on the actual call to New-CfnStack). These failures are usually related to template issues, so the environment setup scripts already handle this. When New-CfnStack throws an exception it bubbles up to the top and is logged. All you have to do is read the message and it will (hopefully) point you at the root cause.

An interesting fact. If your template file is long-ish (it happened to me when mine was about 1000 lines long, but I’m pretty sure it’s dependent on characters, not lines) then the call to New-CfnStack can and will fail with an incredibly useless error about incorrectly formatted XML. This only happens if you are uploading the content as part of the call, instead of uploading to S3 first. I’m not entirely sure why this happens, but I have a sneaking suspicion that the Powershell/.NET AWS components have some hard limit on the length of the request that isn’t being checked properly. Just upload your template to S3 (I upload to the same place I put my dependencies) and it will work just fine, regardless of the size of the template.
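A sketch of that workaround, assuming a deployment bucket already exists. Bucket and key names are examples only, and the exact parameter names may vary with the AWS Powershell cmdlet version:

```powershell
# Sketch: upload the template to S3 and create the stack from the URL,
# avoiding the size limit on inline template bodies. Bucket and key
# names are examples only.
$bucket = "my-deployment-bucket"
$key = "templates/environment.template"

Write-S3Object -BucketName $bucket -Key $key -File $templateFilePath -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

# Pass the S3 URL instead of the template body.
$templateUrl = "https://s3.amazonaws.com/$bucket/$key"
$stackId = New-CFNStack -StackName $stackName -TemplateURL $templateUrl -Parameter $parameters -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
```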

The second is to check the Events attached to the stack for failures, specifically the first one chronologically, as failures typically have a knock on effect. These events are available in the AWS dashboard, under the CloudFormation section. Sometimes this is all it takes, as most of the failure events are pretty descriptive. If you fail to have permissions to do something, this is where it normally comes up, along with some errors that the template verification system won’t find (like an RDS instance using a snapshot but also supplying a DBName).

The third is to use the events to determine which component failed (usually an EC2 instance), remote onto the machine and look at some log files. For Windows EC2 instances there are two log files that are useful, cfn-init and EC2Config.

You can typically find the cfn-init log file at C:\cfn\log\cfn-init.log. It contains the full log trace from everything that happened as a result of the call to cfn-init on that machine. As far as log files go, it’s pretty useful, and I’ve plumbed its depths many times trying to figure out why my initialization was failing.

Second interesting fact, also related to length. If the output from one of your commands in cfn-init is too long, it will fail with an incredibly unhelpful error.

You can typically find the EC2Config log at C:\Program Files\Amazon\EC2Config\logs\EC2Config.log. This one is useful if your cfn-init log file doesn’t have anything in it, which can happen if you’re failing so hard that your EC2 instance can’t even start executing cfn-init. It happens.

The 3 step process I outlined above usually identifies the root cause of the failure, which can then be rectified. Following the process isn’t difficult, but it is annoying, and manual. The last step is especially egregious, requiring me to find the machine/s in question, get their private IPs, be connected to our AWS VPN, remote on, locate the log files, read them, etc.

If I have to leave the confines of my environment script in order to go check why it failed, then I’m losing valuable time and context. It would be much better if the environment setup script detected failures and went and did those things for me.

The Right Kind of Lazy

Essentially, a good lazy programmer is one who doesn’t have the stomach for constantly doing a manual task, so automates it instead. This frees them up to do more interesting things, and the automated task can then be run with minimal human involvement, hopefully acting as a multiplier for the effectiveness of the team.

In my case, I needed to automate the last two steps above. The first step is taken care of because the exceptions being thrown from the various calls to the AWS API are already being logged in TeamCity.

For the second step (interrogating the Events of the Stack), AWS offers APIs for pretty much everything, which is awesome (Octopus is similar, which is equally awesome, maybe more so because they are Windows native). All you need is a call to Get-CfnStackEvent, and then you can filter the resulting events to those that failed, which helps to cut down on the amount of output. The following piece of Powershell demonstrates just that.

function _TryGetFailingStackEvents
{
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$stackId,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsRegion
    )

    try
    {
        if ($rootDirectory -eq $null) { throw "RootDirectory script scoped variable not set. That's bad, it's used to find dependencies." }
        $rootDirectoryDirectoryPath = $rootDirectory.FullName
        $commonScriptsDirectoryPath = "$rootDirectoryDirectoryPath\scripts\common"
        
        . "$commonScriptsDirectoryPath\Functions-Enumerables.ps1"

        $events = Get-CFNStackEvent -StackName $stackId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
        $failingEvents = @($events | Where { $_.ResourceStatus.Value -match "FAILED" })
        if ($failingEvents | Any -Predicate { $true })
        {
            return $failingEvents
        }
        else
        {
            return @()
        }
    }
    catch
    {
        Write-Warning "Could not get events for stack [$stackId]."
        Write-Warning $_
    }
}

Automating the third step is a little more difficult.

As a general rule, all of our EC2 instances can have Powershell remotely executed on them from our secure AWS VPN CIDR. This is already part of our general EC2 setup (security group configuration to allow the traffic, Windows firewall exception to unblock the Powershell Remote Execution port, etc).

Whenever a failure event occurs that contains a valid EC2 instance ID (identified by a regular expression), the script creates a remote session to the machine using its private IP address and reads the last couple of hundred lines from the log files I mentioned above. You can see the full script below. It actually feeds off the failure events identified in the previous step in order to find the ID of the instance that it needs to connect to.

function _TryExtractLogFilesFromInstanceFailingViaCfnInit
{
    param
    (
        [Parameter(Mandatory=$true)]
        $failingStackEvents,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsRegion,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$adminPassword
    )

    if ($failingStackEvents -eq $null) { return "No events were supplied, could not determine if anything failed as a result of CFN-INIT failure" }

    try
    {
        if ($rootDirectory -eq $null) { throw "RootDirectory script scoped variable not set. That's bad, it's used to find dependencies." }
        $rootDirectoryDirectoryPath = $rootDirectory.FullName
        $commonScriptsDirectoryPath = "$rootDirectoryDirectoryPath\scripts\common"

        $cfnInitFailedIndicator = "CFN-INIT-FAILED"
        Write-Verbose "Attempting to identify and extract information from failure events containing the string [$cfnInitFailedIndicator]"
        $instanceIdRegexExtractor = "(i\-[0-9a-zA-Z]+)"
        $cfnFailureEvent = $failingStackEvents | 
            Where {$_.ResourceStatusReason -match $cfnInitFailedIndicator} | 
            Select -First 1

        if ($cfnFailureEvent.ResourceStatusReason -match $instanceIdRegexExtractor)
        {
            $instanceId = $matches[0];
            Write-Verbose "Found a failure event for instance [$instanceId]"
            Write-Verbose "Attempting to extract some information from the logs from that machine"

            . "$commonScriptsDirectoryPath\Functions-Aws-Ec2.ps1"

            $instance = Get-AwsEc2Instance -InstanceId $instanceId -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion

            $ipAddress = $instance.PrivateIpAddress

            $remoteUser = "Administrator"
            $remotePassword = $adminPassword
            $securePassword = ConvertTo-SecureString $remotePassword -AsPlainText -Force
            $cred = New-Object System.Management.Automation.PSCredential($remoteUser, $securePassword)
            $session = New-PSSession -ComputerName $ipAddress -Credential $cred
    
            $remoteScript = {
                $lines = 200
                $cfnInitLogPath = "C:\cfn\log\cfn-init.log"
                Write-Output "------------------------------------------------------"
                Write-Output "Last [$lines] lines of [$cfnInitLogPath]"
                Get-Content $cfnInitLogPath -Tail $lines
                Write-Output "------------------------------------------------------"
            }
            $remotelyExtractedData = Invoke-Command -Session $session -ScriptBlock $remoteScript
            # If you don't do this, when you do a JSON convert later it spits out a whole bunch of useless
            # information about the machine the line was extracted from, files, etc.
            $remotelyExtractedData = $remotelyExtractedData | foreach { $_.ToString() }
            
            return $remotelyExtractedData
        }
        else
        {
            Write-Verbose "Could not find a failure event about CFN-INIT failing"
            return "No events failed with a reason containing the string [$cfnInitFailedIndicator]"
        }
    }
    catch
    {
        Write-Warning "An error occurred while attempting to gather more information about an environment setup failure"
        Write-Warning $_       
    }
}

I execute both of these functions in the catch block of the main environment setup function. Basically, if anything goes wrong during environment creation, it tries to gather some additional information to print out to the Warning channel. Sometimes failures occur before you actually manage to create the stack, so it needs to be robust in the face of not having a stack ID. As it lives inside the catch block, it also needs to be wrapped in a try…catch itself, otherwise you’re likely to lose the root cause of the error if an unexpected error happens during your information gathering.

if ($stackId -ne $null)
{
    Write-Warning "Environment creation failed. Attempting to gather additional information."
    $failingStackEvents = _TryGetFailingStackEvents -StackId $stackId -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion
    $cfnInitFailureLogs = _TryExtractLogFilesFromInstanceFailingViaCfnInit -failingStackEvents $failingStackEvents -adminPassword $adminPassword -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion
    $customError = @{
        "StackId"=$stackId;
        "Stack"=$stack;
        "FailingStackEvents"=$failingStackEvents;
        "CfnInitFailingInstanceLogs"=$cfnInitFailureLogs;
    }

    $failingDetails = new-object PSObject $customError
    Write-Warning (ConvertTo-Json $failingDetails)
}

With everything above working, you get a nice dump of information whenever an environment fails to create, without having to do anything else, which is nice.

The full source that the snippets above were extracted from will be available at Solavirum.Scripts.Common at some point. The content available there now is out of date, but I need to check through the scripts carefully before I upload them to Github. We all remember what happened last time.

Room for Improvement

At some point in the future I would prefer if the output from CloudFormation (its events) and the various log files from the EC2 instances being provisioned were automatically pumped into our Log Aggregation server (ELK) so that we could analyse them in more detail.

The hard part about this is that log aggregation would require the installation of something on the machine, and if we’re in a failure state during environment setup, those components are probably not installed. The whole point of the environment setup is to create an environment and deploy software components into it; requiring other components to already be in place just to report failures is probably not going to work.

I’ll keep thinking about it and make a post if I find a good solution.

Conclusion

I hesitated to do this automated investigation for a long time, because I knew it would be finicky and complex.

I shouldn’t have.

Every time an environment setup fails I now get detailed information right in the same window that I was watching for completion messages. It is beyond useful. It also means that whenever an environment refresh fails in the morning before I get to work, I can still investigate to see what happened, even though the stack was already cleaned up.

I should have followed my instincts and automated this kind of pointless busywork earlier.


If you’ve ever read any of my previous posts, you would realise that I’m pretty big on testing. From my point of view, any code without automated tests is probably a liability in the long run. Some code doesn’t need tests of course (like prototypes and spikes), but if you intend for your code to be used or maintained (i.e. most of it) you should be writing tests for it.

I’ve blogged before about the 3 different classes of automated tests, Unit, Integration and Functional. I fully admit that the lines between them can get blurred from time to time, but I feel these 3 classifications help to think about tests at different levels, which in turn assists in creating a quality piece of software.

A lot of the work I do in my current job has dependencies on a legacy SQL Server database. Not one gigantic database in a central location, but many databases with the same schema, distributed at client locations. It’s a bit of a monster, but workable, and we control the schema, so we’ve got that going for us, which is nice.

Quite a few of the older components take a hard dependency on this database. Some of the newer ones use EF at least, which is nice, but others simply use direct SQL execution. Tests have typically been written using an approach I see a lot: use a local database in a known state and write some tests on top of it. Typically this database is specific to a developer’s machine, and as such the tests will not work inside our CI environment, so they just get marked as [Ignored] (so the build won’t fail) until someone wants to use them again.

Unacceptable.

Why go to all that effort writing a test if you aren’t going to execute it all the time? Tests that aren’t run are worthless after all.

Making it Work

Testing on top of a real SQL server is a fantastic idea and I would classify this sort of test as an Integration test, especially if you’re doing it directly from the classes themselves (instead of the normal entry point for the application). It verifies that the components that you have written (or at least a subset of them) work as expected when you sit them on top of a real database.

The problem comes in automating those sort of tests.

In an ideal world, your test would be responsible for setting up and tearing down all of its dependencies. This would mean it creates a database, fills it with some data, runs the code that it needs to run, verifies the results and then cleans up after itself.

That’s not simple though and looks like a lot of unnecessary work to some developers. They think, why don’t I just create a local database and put some data in it. I can run my test while I write my code, and I can reset my database whenever I want using scripts/restores/etc. These developers don’t realise that tests live just as long as the code that they verify, and if you’re developing for the long run, you need to put that effort in or suffer the consequences later on.

I’ll take a moment to mention that not every piece of code needs to be written for the long run (most do), and that it is possible to have so many tests that it becomes hard to change your code due to the barrier of requiring that you change your tests (which can be a lot of work). As with most things in software, it’s a balancing act. Just like your code, your tests should be easy to change and maintain, or they will end up causing the same pain that you were trying to avoid in the first place, just in a different place.

In order to facilitate the approach where each test is responsible for its own test data you need to put some infrastructure in place.

  1. You need a common location where the tests can create a database. If your CI environment is in the same network as your developers, this can simply be a machine at a known location with the required software installed. It’s a little more complicated if they are in two different networks (our CI is in AWS, for example).
  2. You need to have a reusable set of tools for creating, initialising and destroying database resources, to make sure your tests are self contained. These tools must be able to be run in an automated fashion.

Where the Wild Databases Are

The first step in allowing us to create executable database integration tests is to have a database server (specifically MSSQL server) available to both our development environment and our CI environment.

Since our CI is in AWS, the best location for the database is there as well. The main reason for this is that it will be easy enough to expose the database securely to the office, whereas it would be hard to expose resources in the office to AWS.

We can use the Amazon supplied SQL Server 2014 Express AMI as a baseline, and create a CloudFormation template that puts all the bits in place (instance, security groups, host record, etc).

The template is fairly trivial, so I won’t go into detail about it. It’s very much the same as any other environment setup I’ve done before (I think the only publicly available example is the JMeter Workers, but that’s a pretty good example).

In order to expose the SQL instance on the canned AMI I had to make some changes to the server itself. It’s easy enough to execute a Powershell script during initialisation (via cfn-init), so I just ran this script to enable TCP/IP, switch to mixed mode authentication, and enable the sa account and change its password. The instance is only accessible via a secure channel (internally in our VPC and via our AWS VPN) so I’m not too concerned about exposing the sa username directly, especially with a unique and secure password. Its only purpose is to hold temporary test data anyway, and it doesn’t have the ability to connect to any other resources, so I’m not particularly worried.

[CmdletBinding()]
param
(
    [Parameter(Mandatory=$true)]
    [string]$SaPassword
)

$ErrorActionPreference = "Stop"

Import-Module sqlps -DisableNameChecking

$localhostname = hostname
$instanceName = "mssqlserver"

$smo = 'Microsoft.SqlServer.Management.Smo.'
$wmi = new-object ($smo + 'Wmi.ManagedComputer')

Write-Verbose "Enabling TCP/IP."
$uri = "ManagedComputer[@Name='$localhostname']/ServerInstance[@Name='$instanceName']/ServerProtocol[@Name='Tcp']"
$Tcp = $wmi.GetSmoObject($uri)
$Tcp.IsEnabled = $true
$Tcp.Alter()

Write-Verbose "Enabling Mixed Mode Authentication."
$s = new-object Microsoft.SqlServer.Management.Smo.Server($localhostname)

$s.Settings.LoginMode = [Microsoft.SqlServer.Management.SMO.ServerLoginMode]::Mixed
$s.Alter()

Write-Verbose "Restarting [$instanceName] so TCP/IP and Auth changes are applied."
$service = get-service $instanceName
$service.Stop()
$service.WaitForStatus("Stopped")
$service.Start()
$service.WaitForStatus("Running")

Write-Verbose "Editing sa user for remote access."
$SQLUser = $s.Logins | ? {$_.Name -eq "sa"}
$SQLUser.PasswordPolicyEnforced = 0
$SQLUser.Alter()
$SQLUser.Refresh()
$SQLUser.ChangePassword($SaPassword)
$SQLUser.Enable()
$SQLUser.Alter()
$SQLUser.Refresh()

I know the script above isn’t necessarily the neatest way to do the configuration I needed, but it’s enough for now.

How Are Baby Databases Made?

The next step is having a library that we can re-use to easily make test databases.

We already had some Entity Framework classes and a database context buried deep inside another solution, so I extracted those, put them into their own repository and built a standalone Nuget package from that. It’s not the same code that actually creates a brand new database for a client (that’s a series of SQL scripts embedded in a VB6 application), but it’s close enough for testing purposes. I hope that eventually we will use EF instead of the hardcoded scripts (leveraging the migrations functionality for database version management), but that’s probably a long way away.

I’d previously completed a small refactor on the EF project, so it already had the concept of a DbContextFactory; all I had to do was implement a new one that connected to a known SQL server and created a randomly named database. I made it disposable, so that it would destroy the database once it was done.

EF took care of actually creating the database to match the schema defined by the DTO classes (which were already there), so I didn’t have to worry about that too much.

In the code below, the ITestDatabaseConnectionStringFactory implementation is responsible for knowing where the server is and how to connect to it (there are a few implementations; one takes values from the app.config, one is hardcoded, etc.). The INamedDbFactory has a single Create method that returns a derived DbContext, nothing fancy.

using System;
using System.Data.SqlClient;
using System.Diagnostics;
using System.Linq;
using Serilog;

namespace Solavirum.Database.EF.Tests.Common
{
    public class TestDatabaseDbFactory : INamedDbFactory, IDisposable
    {
        public TestDatabaseDbFactory(ILogger logger, ITestDatabaseConnectionStringFactory connFactory)
        {
            _connFactory = connFactory;
            var databaseName = _random.GenerateString(20);
            ConnectionString = _connFactory.Create(databaseName);

            _logger = logger
                .ForContext("databaseName", ConnectionString.DatabaseName)
                .ForContext("databaseHost", ConnectionString.Host);

            _logger.Information("New factory for database {databaseName} on host {databaseHost} created");
        }

        private readonly IRandomTestData _random = new DefaultRandomTestData();
        private readonly ILogger _logger;
        private ITestDatabaseConnectionStringFactory _connFactory;

        public readonly TestDatabaseConnectionString ConnectionString;

        public NamedDb Create()
        {
            _logger.Information("Creating a new NamedDb for database {databaseName}");
            var context = new NamedDb(new SqlConnection(ConnectionString.GetFullConnectionString()));

            if (!context.Database.Exists())
            {
                _logger.Information("This is the first time {databaseName} has been created, creating backing database on host {databaseHost}");
                context.Database.CreateIfNotExists();
            }

            return context;
        }

        public void Cleanup()
        {
            using (var context = new NamedDb(ConnectionString.GetFullConnectionString()))
            {
                _logger.Information("Deleting backing database {databaseName} on host {databaseHost}");
                context.Database.Delete();
            }
        }

        bool _disposed;

        public void Dispose()
        {
            Dispose(true);
            GC.SuppressFinalize(this);
        }

        ~TestDatabaseDbFactory()
        {
            Dispose(false);
        }

        protected virtual void Dispose(bool disposing)
        {
            if (_disposed)
                return;

            if (disposing)
            {
                try
                {
                    Cleanup();
                }
                catch (Exception ex)
                {
                    Trace.WriteLine(string.Format("An unexpected error occurred while attempting to clean up the database named [{0}] spawned from this database factory. The database may not have been cleaned up.", ConnectionString.DatabaseName));
                    Trace.WriteLine(ex.ToString());
                }
                
            }

            _disposed = true;
        }
    }

    public interface IRandomTestData
    {
        string GenerateString(int length);
    }

    public class DefaultRandomTestData : IRandomTestData
    {
        private readonly Random _random = new Random();

        public string GenerateString(int length)
        {
            var chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
            var result = new string(Enumerable.Repeat(chars, length)
                .Select(s => s[_random.Next(s.Length)])
                .ToArray());

            return result;
        }
    }
}

The Cleanup

Of course, by extracting the database classes above into their own repository/package, I had to replace all references to the old classes. It turns out the database component was quite heavily referenced in its original location, so it was a non-trivial amount of work to incorporate it properly.

As is almost always the way when you start doing things correctly after a long period of not doing that, I had to shave a few yaks to make the change happen.

On the upside, as a result of the extraction and the creation of the TestDatabaseFactory, I managed to create reliably executable tests for the actual database component itself, proving that it works when connected to a real SQL server database.

Summary

For me, the takeaway from this activity is that it takes effort to set up good testing infrastructure so that you can easily create tests that can be repeatedly executed. It’s not something that just happens, and you need to be willing to accept the fact that you have to go slower to start off with in order to go faster later. Like delayed gratification.

I could have just set up an SQL server in a remote location without automating any of it, but that’s just another case of the same principle. I now have the ability to set up any version of SQL server that I want, or to deploy a custom version (assuming we had an Octopus package to install SQL server, which is more than possible). I’ve also automated its destruction every night and recreation every morning, allowing us to easily incorporate changes to the base AMI or to alter the server in other ways.

It can be very hard to get some people to understand this. I find the best approach is to just do it (taking into account hard deadlines, of course) and to let it prove itself in the field as time goes on, when you are able to make large changes without fear that you’ve broken everything.

I like knowing that I haven’t broken everything personally.

Or even better, knowing that I have.


I mentioned last week that I was struggling with an issue with my environment setup scripts via CloudFormation.

Well, I finally got to the bottom of it, and I thought it would be a good thing to share, because I don’t want anyone else to have to struggle with finding the root cause behind the incredibly unhelpful error that I was getting. Hopefully, my pain becomes your knowledge.

To give a little bit of background, I use a combination of CloudFormation templates, Powershell and Octopus Deploy to setup environments for our products and services in AWS. This approach has proven to be extremely useful for a number of reasons, including codification of what exactly goes into the environment, reducing the amount of time to setup an environment (in comparison to using typical infrastructure requests) and reducing the reliance on one particular person or team who knows how to put it all together.

A little while ago I noticed that the environment setup that I use for my JMeter Load Test Workers was failing to run to completion. Every time I tried to setup the environment, it would fail with a “Create Failed” status, stating that one or more of the instances failed their configuration.

Looking into the issue, I found the following error in the cfn-init.log file (available at C:\cfn\log\cfn-init.log).

In case you are unfamiliar with CloudFormation, cfn-init is a program supplied by Amazon that works with CloudFormation and allows you to configure a series of steps to occur during the startup of an EC2 instance. Typically you use the UserData for an instance to execute cfn-init with the appropriate parameters; it connects to the meta data service, downloads the CloudFormation (cfn) template and executes it on the local machine. It’s awesome, and I use it a lot.
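To make that a bit more concrete, the UserData for a Windows instance typically boils down to something like the following sketch. The stack name, resource name, region and wait handle URL here are all placeholders, and the cfn-bootstrap install location can vary by AMI:

```powershell
# Hypothetical UserData excerpt (all names are placeholders). cfn-init executes
# the config attached to this instance's logical resource in the template, then
# cfn-signal reports the result so a WaitCondition can pass or fail the stack.
$cfnBootstrap = "C:\Program Files\Amazon\cfn-bootstrap"
& "$cfnBootstrap\cfn-init.exe" -v `
    --stack "MyEnvironmentStack" `
    --resource "LoadTestWorkerInstance" `
    --region "ap-southeast-2"
& "$cfnBootstrap\cfn-signal.exe" -e $LASTEXITCODE "<presigned wait handle URL>"
```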

The relevant excerpt from the log file:

2015-06-12 03:06:02,674 [DEBUG] Running command a-deploy-loadtest
2015-06-12 03:06:02,674 [DEBUG] No test for command a-deploy-loadtest
2015-06-12 03:06:59,846 [INFO] Command a-deploy-loadtest succeeded
2015-06-12 03:06:59,846 [ERROR] -----------------------BUILD FAILED!------------------------
2015-06-12 03:06:59,846 [ERROR] Unhandled exception during build: 'utf8' codec can't decode byte 0xf2 in position 644: invalid continuation byte
Traceback (most recent call last):
    File "cfn-init", line 123, in <module>
    File "cfnbootstrap\construction.pyc", line 117, in build
    File "cfnbootstrap\construction.pyc", line 502, in build
    File "cfnbootstrap\construction.pyc", line 513, in run_config
    File "cfnbootstrap\construction.pyc", line 125, in run_commands
    File "cfnbootstrap\command_tool.pyc", line 116, in apply
    File "encodings\utf_8.pyc", line 16, in decode
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf2 in position 644: invalid continuation byte

Of course, that doesn’t make any sense, as I hadn’t changed anything to do with the encoding of the output. Even more confusingly, when I ran cfn-init using the exact same mechanism (the execution of the UserData batch file), on the exact same machine, it would work just fine. It only failed during instance startup.

I looked into it for a little while, but couldn’t find an obvious root cause. I assumed that I was doing something slightly different from the norm or that something had changed upstream to cause the error. This made sense, because all of the other places where I was using the exact same code were working just fine, and it seemed to be localised to just my load test worker environment.

I altered the environment setup to not require instances to setup successfully (by removing the call to cfn-signal after cfn-init, and removing the WaitCondition from the template) and manually initialized each instance after it was created before I ran my load tests.

Not great, but an acceptable work around until it mattered more. I had bigger fish to fry.

The Virus Spreads

Then it happened somewhere else.

The new cloud based service we are working on needed an environment refresh to take advantage of scalable persistence in the form of RDS. We tore down the old CI environment, and then attempted to create a new one.

The mysterious “invalid UTF8 character” error reared its ugly head again.

I couldn’t brush it away this time, as this environment was much more important than my load test workers, so it was time to dig in and see what was really going on.

Taking the error at face value, I assumed that something had changed in the pipeline (Octopus, AWS, etc) and that there really was an invalid character. It was fairly obvious that cfn-init didn’t fail until after the first deployment had actually completed, so something seemed to be off in the output of that particular step.

"a-deploy-system-stats-logger": {
    "command": {
        "Fn::Join": [ "", [
            "powershell.exe -ExecutionPolicy Bypass -Command c:\\cfn\\dependencies\\scripts\\environment\\Invoke-DeployLatestProjectForEnvironment.ps1 ",
            " -Verbose ",
            " -OctopusServerApiKey ", { "Ref": "OctopusAPIKey" },
            " -OctopusServerUrl ", { "Ref": "OctopusServerURL" },
            " -OctopusEnvironment ", { "Ref": "OctopusEnvironment" },
            " -ProjectName 'System Stats Logger'"
        ] ]
    },
    "waitAfterCompletion": "0"
}

I ran the script manually through C# and captured the raw output stream. I parsed the output as UTF8, and it came out just fine, so clearly the output itself wasn’t the problem.

At this point I took a slightly different approach. I knew that an old version of the environment setup for our new cloud service had worked just fine a few days ago, so I branched from that point in Git and executed the setup script.

It worked.

Now I at least had two copies of similar codebases that I could compare for differences to find what was breaking the deployments, and I knew that it wasn’t an environmental issue related to either Octopus Deploy or AWS components.

I used an application called WinMerge to diff the two repositories (after extracting them to different directories on disk), and found a number of differences, but nothing that was obviously the root cause. I thought that perhaps the problem was caused by upgrading to the latest version of the AWS Powershell package, but I rolled that back and still had the same issue.

Heads up WinMerge people, if you are on the internet at all. The statuses for “File Identical” and “File Different” in your application look incredibly similar at a glance. I thought all my files were identical initially, at least until I looked a little closer. Colours would be nice.

The Root Cause

While looking at the diff of the two directories, I noticed that I had added additional Write-Verbose statements to some of my generic Powershell enumerable functions (like Any, Single, All, etc). In fact, in all of the environment setup scripts that were working, this additional logging was missing (because they were using an older version of the common scripts). The logging was only present in those environments that were broken.

But how could additional logging statements cause a UTF8 error in cfn-init, and only during startup?

It took me a fair bit of time to realise that it was an issue with the length of the output. My deployment scripts all had -Verbose turned on by default. I did this on purpose, to make sure that I received the maximum amount of information possible from the environment setup.

Apparently cfn-init has a limit on the amount of output it can process from a command, and I was pushing that limit.

As to why it was failing during startup and not when manually executed? The code that was run at startup did some archive extraction which didn’t happen in future calls (because it was already on disk) and the items being extracted were logged. This was evidently enough to push the output over the invisible, undocumented limit.

I removed the -Verbose flag from the calls to the deployment scripts and everything was fine.

Problem solved.

Why?

Of course, the question remains. Why?

Why does cfn-init give a completely unrelated error when the output from a single command is too long? Why not an “output has been truncated” or “output too long” error of some description?

I have no idea.

I suspect the reason for the UTF8 error is that the output was truncated between two bytes or something, and when the parser underneath cfn-init tried to parse it, it encountered what appeared to be an invalid UTF8 character. I still have no idea why it would do arbitrary truncation though, especially without telling the user.
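The truncation theory is easy enough to demonstrate in isolation. This little sketch (using .NET’s strict UTF8 decoder, nothing to do with cfn-init itself) cuts a byte stream in the middle of a multi-byte character and gets the same class of decode failure:

```powershell
# "café" encodes to 5 bytes in UTF-8; the final character (é) takes two bytes.
$bytes = [System.Text.Encoding]::UTF8.GetBytes("café")

# A strict decoder that throws on invalid bytes, much like the Python codec.
$strict = New-Object System.Text.UTF8Encoding($false, $true)

try
{
    # Drop the last byte, leaving a dangling lead byte, and try to decode.
    $strict.GetString($bytes[0..($bytes.Length - 2)])
}
catch
{
    Write-Output "Decode failed: incomplete multi-byte sequence"
}
```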

Ah well.

Conclusion

This is the sort of thing that you just have to deal with sometimes. I do wonder how many other people have run into this issue though. Considering that when I did a Google search for the error, I found almost nothing, I’m not sure if anyone else has. If they have, they certainly didn’t post online about it. Luckily, for everyone else in the future, I did, both here and on the AWS forums, so maybe that will help.

On the upside, running into these sorts of things makes you think more about your own code, and how it reports problems to the user when assumptions that you’ve made break down in reality. It takes a lot of effort to write robust code that is communicative when it fails, but I think it’s definitely worth the effort.

Remember, your code will be used and read a hell of a lot more than it’s written.


So, as per my last post, I built a scalable, deployable, codified proxy environment in AWS, leveraging Cloud Formation, Octopus and Squid.

Since then I have attempted to use this proxy for things. Specifically load tests, which was the entire reason I built the proxy in the first place.

In my attempts to use the proxy, I have learned a few lessons that I thought would be worthwhile to share, so that others might benefit from my pain.

This will be a relatively short post, because:

  1. I’ve been struggling with these issues over the last week and haven’t done much else,
  2. For all the trouble I had, I really only ran into 2 issues,
  3. I’m still struggling with a mysterious, probably unrelated issue, and it’s taking up most of my mental space (unexpected UTF8 characters in a Powershell/Octopus deployment output stream, which might be a good future blog post if I ever figure it out).

Weird Differences

Initially I assumed that my proxy would slot into the same hole as our current proxy. They were both Squid, I had replicated the old config onto the new one, and both were referenced simply by a URL and port number.

I was mistaken.

I’m still not sure how it did it, but the old proxy was somehow allowing connections to the special EC2 meta information address (169.254.169.254) to pass through correctly. The moment I swapped in my new proxy, cfn-init and cfn-signal no longer worked.

For cfn-init, the error was incredibly unhelpful. It insisted that my instance was not a member of the CloudFormation group/template/configuration that I was trying to initialise from.

For cfn-signal, it just didn’t do anything. It said it signalled, but it was lying.

In hindsight, this makes perfect sense. The request would have gone through the proxy, which was a CloudFormation resource itself, and it would have tried to use the proxy’s CloudFormation template as the container for the meta data, which would fail, giving a technically correct error message in the first case, and signalling something non-existent in the second.

From my point of view, it looked insane.

I assumed I had put some sort of incorrect details into the cfn-init call, or that I had failed to meet the arcane requirements for cfn-signal (must base64 encode the wait handle on windows only for example), but I hadn’t changed anything. The only thing I changed was the proxy configuration.

Long story short, for my proxy, I had to add a bypass entry (on each EC2 instance, configured in the same place as the proxy, the UserData script) which would stop cfn-init (and other tools) from trying to go through the proxy to hit the meta information address. I still have no idea how the old proxy did not require the same sort of configuration. I have a hunch that it might be because it was Linux and the original creators of the box did something special to make it work. Maybe they ran into the same issue, but just fixed it a different way? Maybe Linux automatically handles the situation better? Who knows.
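For reference, the bypass ends up being something like the following sketch, executed during instance initialisation. The proxy host and port are placeholders; setting both the WinHTTP bypass list and the NO_PROXY environment variable covers most of the tooling on a Windows instance:

```powershell
# Hypothetical UserData excerpt. Route traffic through the proxy, but bypass it
# for the EC2 meta information address so cfn-init and friends still work.
$proxyHost = "proxy.internal.placeholder"
$proxyPort = 3128

netsh winhttp set proxy proxy-server="$($proxyHost):$proxyPort" bypass-list="169.254.169.254;<local>"

[Environment]::SetEnvironmentVariable("HTTP_PROXY", "http://$($proxyHost):$proxyPort", "Machine")
[Environment]::SetEnvironmentVariable("NO_PROXY", "169.254.169.254", "Machine")
```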

Very frustrating.

Location, Location, Location

The second pain point I ran into was more insane and just as frustrating.

After reviewing the results of an initial load test, I hypothesised that maybe the proxy was a bottleneck. All of the traffic for the load test had to pass through the proxy (including image uploads) and I couldn’t see anything obvious in the service logs to account for the level of failure I was seeing, except high load. In the interests of getting a better subsequent load test, I wanted to make sure that the proxy boxes could not possibly be a bottleneck, so I planned to beef up their instance type.

I was originally using t2.medium instances, which have some limitations, mostly around network performance and CPU credits. I wanted to switch to something a bit beefier, just for the proxy specific to the load tests.

When I switched to an m3.large, the proxy stopped working.

Looking into it, the expected installation directory (C:\Squid) was empty of anything that even vaguely looked like a proxy.

Following the installation log, I found out that Squid had decided to install itself to Z drive. Z drive was an ephemeral drive. You know, the ones whose content is transitory, and which tend to get annihilated if the instance goes down for any reason?

I tried so very hard to get Squid to just install to the C drive, including checking the registry settings for program installation locations (which were all correctly C based) and manually overriding TARGETFOLDER, ROOTDRIVE and INSTALLDIR in the msi execution parameters.
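The attempted override looked roughly like this. Whether an MSI actually honours properties like these depends entirely on how the installer was authored, which is presumably why none of them had any effect:

```powershell
# Attempting to force the install onto C drive via public MSI properties.
# The property names are the ones I tried; Squid's installer ignored them all.
msiexec /i "squid.msi" /qn /l*v "squid-install.log" `
    TARGETFOLDER="C:\Squid" `
    ROOTDRIVE="C:\" `
    INSTALLDIR="C:\Squid"
```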

Alas, it was not to be. No matter what I did, Squid insisted on installing to Z drive.

I still have no idea why, I just turned the instance type back to one that didn’t have ephemeral drives available.

Like any good software user, I logged a bug. Well, I assume it’s a bug, because that’s a really weird feature.

Conclusion

There is no conclusion. Just a glimpse into some of the traps that sap your time, motivation and will to live when doing this sort of thing.

I only hope that someone runs across this blog post one day and it helps them. Or at least lets them know someone else out there understands their pain.