Smoking Out Fires

October 20. 2015 0 Comments

I’m pretty happy with the way our environment setup scripts work.

Within TeamCity, you generally only have to push a single button to get an environment provisioned (with perhaps a few parameters filled in, like environment name and whatnot) and even outside TeamCity, its a single script that only requires some credentials and a few other things to start.

Failures are detected (primarily by CloudFormation) and the scripts have the ability to remote onto AWS instances for you and extract errors from logs to give you an idea as to the root cause of the failure, so you have to do as little manual work as possible. If a failure is detected, everything is cleaned up automatically (CloudFormation stack deleted, Octopus environment and machines deleted, etc), unless you turn off automatic cleanup for investigation purposes.

Like I said, overall I’m pretty happy with how everything works, but one of the areas that I’m not entirely happy with is the last part of environment provisioning. When an environment creation is completed, you know that all components installed correctly (including Octopus deploys) and that no errors were encountered with any of the provisioning itself (EC2 instances, Auto Scaling Groups, RDS, S3, etc). What you don’t know is whether or not the environment is actually doing what it should be doing.

You don’t know whether or not its working.

That seems like a fixable problem.

Smoke On The Water

As part of developing environments, we’ve implemented automated tests using the Powershell testing framework called Pester.

Each environment has at least one test, that verifies the environment is created as expected and works from the point of view of the service it offers. For example, in our proxy environment (which uses SQUID) one of the outputs is the proxy URL. The test takes that url and does a simple Invoke-WebRequest through it to a known address, validating that the proxy works as a proxy actually should.

The issue with these tests is that they are not executed at creation time. They are usually only used during development, to validate that whatever changes you are making haven’t broken the environment and that everything is still working.

Unfortunately, beyond git tagging, our environment creation scripts/templates are not versioned. I would vastly prefer for our build scripts to take some set of source code that represents an environment setup, test it, replace some parameters (like version) and then package it up, perhaps into a nuget package. It’s something that’s been on my mind for a while, but I haven’t had time to put it together yet. If I do, I’ll be sure to post about it here.

The simplest solution is to extract the parts of the tests that perform validation into dedicated functions and then to execute them as part of the environment creation. If the validation fails, the environment should be considered a failure and should notify the appropriate parties and clean itself up.

Where There Is Smoke There Is Fire

The easiest way to implement the validation (hereafter referred to as smoke tests) in a reusable fashion is to incorporate the concept into the common environment provisioning scripts.

We’ve created a library that contains scripts that we commonly use for deployment, environment provisioning and other things. I made a copy of the source for that library and posted it to Solavirum.Scripts.Common a while ago, but its a bit out of date now (I really should update it).

Within the library is a Functions-Environment file.

This file contains a set of Powershell cmdlets for provisioning and deleting environments. The assumption is that it will be used within libraries for specific environments (like the Proxy environment mentioned above) and will allow us to take care of all of the common concerns (like uploading dependencies, setting parameters in CloudFormation, waiting on the CloudFormation initialization, etc).

Inside this file is a function called New-Environment, whose signature looks like this:

function New-Environment
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$environmentName,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsRegion,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$octopusServerUrl,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$octopusApiKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$uniqueComponentIdentifier,
        [System.IO.FileInfo]$templateFile,
        [hashtable]$additionalTemplateParameters,
        [scriptblock]$customiseEnvironmentDetailsHashtable={param([hashtable]$environmentDetailsHashtableToMutate,$stack) },
        [switch]$wait,
        [switch]$disableCleanupOnFailure,
        [string[]]$s3Buckets
    )

    function body would be here, but its super long
}

As you can see, it has a lot of parameters. It’s responsible for all of the bits of pieces that go into setting up an environment, like Octopus initialization, CloudFormation execution, gathering information in the case of a failure, etc. Its also responsible for triggering a cleanup when an environment is deemed a failure, so is the ideal place to put some smoke testing functionality.

Each specific environment repository typically contains a file called Invoke-NewEnvironment. This file is what is executed to actually create an environment of the specific type. It puts together all of the environment specific stuff (output customisation, template location, customised parameters) and uses that to execute the New-Environment function, which takes care of all of the common things.

In order to add a configurable smoke test, all we need to do is add an optional script block to the New-Environment function. Specific environment implementations can supply a value to it they like, but they don’t have to. If we assume that the interface for the script block is that it will throw an exception if it fails, then all we need to do is wrap it in a try..catch and fail the environment provisioning if an error occurs. Pretty straightforward.

To support the smoke test functionality, I wrote two new Pester tests. One verifies that a failing smoke test correctly fails the environment creation and the other verifies that the result of a successful smoke test is included in the environment creation result. You can see them below:

Describe -Tags @("Ignore") "Functions-Environment.New-Environment.SmokeTest" {
    Context "When supplied with a smoke test script that throws an exception (indicating smoke test failure)" {
        It "The stack creation is aborted and deleted" {
            $creds = Get-AwsCredentials
            $octoCreds = Get-OctopusCredentials
            $environmentName = Create-UniqueEnvironmentName
            $uniqueComponentIdentifier = "Test"
            $templatePath = "$rootDirectoryPath\src\TestEnvironment\Test.CloudFormation.template"
            $testBucket = [Guid]::NewGuid().ToString("N")
            $customTemplateParameters = @{
                "LogsS3BucketName"=$testBucket;
            }

            try
            {
                try
                {
                    $createArguments = @{
                        "-EnvironmentName"=$environmentName;
                        "-TemplateFile"=$templatePath;
                        "-AdditionalTemplateParameters"=$CustomTemplateParameters;
                        "-UniqueComponentIdentifier"=$uniqueComponentIdentifier;
                        "-S3Buckets"=@($testBucket);
                        "-SmokeTest"={ throw "FORCED FAILURE" };
                        "-Wait"=$true;
                        "-AwsKey"=$creds.AwsKey;
                        "-AwsSecret"=$creds.AwsSecret;
                        "-AwsRegion"=$creds.AwsRegion;
                        "-OctopusApiKey"=$octoCreds.ApiKey;
                        "-OctopusServerUrl"=$octoCreds.Url;
                    }
                    $environmentCreationResult = New-Environment @createArguments
                }
                catch
                {
                    $error = $_
                }

                $error | Should Not Be $null
                $error | Should Match "smoke"

                try
                {
                    $getArguments = @{
                        "-EnvironmentName"=$environmentName;
                        "-UniqueComponentIdentifier"=$uniqueComponentIdentifier;
                        "-AwsKey"=$creds.AwsKey;
                        "-AwsSecret"=$creds.AwsSecret;
                        "-AwsRegion"=$creds.AwsRegion;                
                    }
                    $environment = Get-Environment @getArguments
                }
                catch
                {
                    Write-Warning $_
                }

                $environment | Should Be $null
            }
            finally
            {
                $deleteArguments = @{
                    "-EnvironmentName"=$environmentName;
                    "-UniqueComponentIdentifier"=$uniqueComponentIdentifier;
                    "-S3Buckets"=@($testBucket);
                    "-Wait"=$true;
                    "-AwsKey"=$creds.AwsKey;
                    "-AwsSecret"=$creds.AwsSecret;
                    "-AwsRegion"=$creds.AwsRegion;
                    "-OctopusApiKey"=$octoCreds.ApiKey;
                    "-OctopusServerUrl"=$octoCreds.Url;
                }
                Delete-Environment @deleteArguments
            }
        }
    }

    Context "When supplied with a valid smoke test script" {
        It "The stack creation is successful" {
            $creds = Get-AwsCredentials
            $octoCreds = Get-OctopusCredentials
            $environmentName = Create-UniqueEnvironmentName
            $uniqueComponentIdentifier = "Test"
            $templatePath = "$rootDirectoryPath\src\TestEnvironment\Test.CloudFormation.template"
            $testBucket = [Guid]::NewGuid().ToString("N")
            $customTemplateParameters = @{
                "LogsS3BucketName"=$testBucket;
            }

            try
            {
                $createArguments = @{
                    "-EnvironmentName"=$environmentName;
                    "-TemplateFile"=$templatePath;
                    "-AdditionalTemplateParameters"=$CustomTemplateParameters;
                    "-UniqueComponentIdentifier"=$uniqueComponentIdentifier;
                    "-S3Buckets"=@($testBucket);
                    "-SmokeTest"={ return $_.StackId + " SMOKE TESTED"}; 
                    "-Wait"=$true;
                    "-AwsKey"=$creds.AwsKey;
                    "-AwsSecret"=$creds.AwsSecret;
                    "-AwsRegion"=$creds.AwsRegion;
                    "-OctopusApiKey"=$octoCreds.ApiKey;
                    "-OctopusServerUrl"=$octoCreds.Url;
                }
                $environmentCreationResult = New-Environment @createArguments

                Write-Verbose (ConvertTo-Json $environmentCreationResult)

                $environmentCreationResult.SmokeTestResult | Should Match "SMOKE TESTED"
            }
            finally
            {
                $deleteArguments = @{
                    "-EnvironmentName"=$environmentName;
                    "-UniqueComponentIdentifier"=$uniqueComponentIdentifier;
                    "-S3Buckets"=@($testBucket);
                    "-Wait"=$true;
                    "-AwsKey"=$creds.AwsKey;
                    "-AwsSecret"=$creds.AwsSecret;
                    "-AwsRegion"=$creds.AwsRegion;
                    "-OctopusApiKey"=$octoCreds.ApiKey;
                    "-OctopusServerUrl"=$octoCreds.Url;
                }
                Delete-Environment @deleteArguments
            }
        }
    }
}

Smoke And Mirrors

On the specific environment side (the Proxy in this example), all we need to do is supply a script block that will execute the smoke test.

The smoke test itself needs to be somewhat robust, so we use a generic wait function to repeatedly execute a HTTP request through the proxy until it succeeds or it runs out of time.

function Wait
{
    [CmdletBinding()]
    param
    (
        [scriptblock]$ScriptToFillActualValue,
        [scriptblock]$Condition,
        [int]$TimeoutSeconds=30,
        [int]$IncrementSeconds=2
    )

    write-verbose "Waiting for the output of the script block [$ScriptToFillActualValue] to meet the condition [$Condition]"

    $totalWaitTimeSeconds = 0
    while ($true)
    {
        try
        {
            $actual = & $ScriptToFillActualValue
        }
        catch
        {
            Write-Warning "An error occurred while evaluating the script to get the actual value (which is evaluated by the condition for waiting purposes). As a result, the actual value is undefined (NULL)"
            Write-Warning $_
        }

        try
        {
            $result = & $condition
        }
        catch
        {
            Write-Warning "An error occurred while evaluating the condition to determine if the wait is over"
            Write-Warning $_

            $result = $false
        }

        
        if ($result)
        {
            write-verbose "The output of the script block [$ScriptToFillActualValue] (Variable:actual = [$actual]) met the condition [$condition]"
            return $actual
        }

        write-verbose "The current output of the condition [$condition] (Variable:actual = [$actual]) is [$result]. Waiting [$IncrementSeconds] and trying again."

        Sleep -Seconds $IncrementSeconds
        $totalWaitTimeSeconds = $totalWaitTimeSeconds + $IncrementSeconds

        if ($totalWaitTimeSeconds -ge $TimeoutSeconds)
        {
            throw "The output of the script block [$ScriptToFillActualValue] (Variable:actual = [$actual]) did not meet the condition [$Condition] after [$totalWaitTimeSeconds] seconds."
        }
    }
}

function Test-Proxy
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [string]$proxyUrl
    )

    if ($rootDirectory -eq $null) { throw "RootDirectory script scoped variable not set. That's bad, its used to find dependencies." }
    $rootDirectoryPath = $rootDirectory.FullName

    . "$rootDirectoryPath\scripts\common\Functions-Waiting.ps1"
    
    $result = Wait -ScriptToFillActualValue { return (Invoke-WebRequest -Uri "www.google.com" -Proxy $proxyUrl -Method GET).StatusCode }  -Condition { $actual -eq 200 } -TimeoutSeconds 600 -IncrementSeconds 60
}

The main reason for this repeated try..wait loop is because sometimes a CloudFormation stack will complete successfully, but the service may be unavailable from an external point of view until the Load Balancer or similar component manages to settle properly.

Conclusion

I feel much more comfortable with our environment provisioning after moving the smoke tests into their own functions and executing them during the actual environment creation, rather than just in the tests.

Now whenever an environment completes its creation, I know that it actually works from an external observation point. The smoke tests aren’t particularly complex, but they definitely add a lot to our ability to reliably provision environments containing services.

Alas, I don’t have any more smoke puns or references to finish off this blog post…

Oh wait, yes I do!

*disappears in a puff of smoke*