I’m pretty happy with the way our environment setup scripts work.

Within TeamCity, you generally only have to push a single button to get an environment provisioned (with perhaps a few parameters filled in, like environment name and whatnot), and even outside TeamCity, it’s a single script that only requires some credentials and a few other things to start.

Failures are detected (primarily by CloudFormation) and the scripts have the ability to remote onto AWS instances for you and extract errors from logs to give you an idea as to the root cause of the failure, so you have to do as little manual work as possible. If a failure is detected, everything is cleaned up automatically (CloudFormation stack deleted, Octopus environment and machines deleted, etc), unless you turn off automatic cleanup for investigation purposes.

Like I said, overall I’m pretty happy with how everything works, but one of the areas that I’m not entirely happy with is the last part of environment provisioning. When an environment creation is completed, you know that all components installed correctly (including Octopus deploys) and that no errors were encountered with any of the provisioning itself (EC2 instances, Auto Scaling Groups, RDS, S3, etc). What you don’t know is whether or not the environment is actually doing what it should be doing.

You don’t know whether or not it’s working.

That seems like a fixable problem.

Smoke On The Water

As part of developing environments, we’ve implemented automated tests using the Powershell testing framework called Pester.

Each environment has at least one test that verifies the environment is created as expected and works from the point of view of the service it offers. For example, in our proxy environment (which uses Squid) one of the outputs is the proxy URL. The test takes that URL and does a simple Invoke-WebRequest through it to a known address, validating that the proxy actually works as a proxy should.
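
To make that concrete, a cut-down sketch of what one of these Pester tests looks like is below. The proxy URL would normally come out of the CloudFormation stack outputs; the value here is just a placeholder.

Describe "Proxy Environment" {
    Context "When a request is made through the proxy" {
        It "Returns a successful response" {
            # Placeholder only; the real test reads this value from the stack outputs.
            $proxyUrl = "http://proxy.example.internal:3128"
            $response = Invoke-WebRequest -Uri "http://www.google.com" -Proxy $proxyUrl -Method GET
            $response.StatusCode | Should Be 200
        }
    }
}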

The issue with these tests is that they are not executed at creation time. They are usually only used during development, to validate that whatever changes you are making haven’t broken the environment and that everything is still working.

Unfortunately, beyond git tagging, our environment creation scripts/templates are not versioned. I would vastly prefer for our build scripts to take some set of source code that represents an environment setup, test it, replace some parameters (like version) and then package it up, perhaps into a nuget package. It’s something that’s been on my mind for a while, but I haven’t had time to put it together yet. If I do, I’ll be sure to post about it here.

The simplest solution is to extract the parts of the tests that perform validation into dedicated functions and then to execute them as part of the environment creation. If the validation fails, the environment should be considered a failure and should notify the appropriate parties and clean itself up.

Where There Is Smoke There Is Fire

The easiest way to implement the validation (hereafter referred to as smoke tests) in a reusable fashion is to incorporate the concept into the common environment provisioning scripts.

We’ve created a library that contains scripts that we commonly use for deployment, environment provisioning and other things. I made a copy of the source for that library and posted it to Solavirum.Scripts.Common a while ago, but it’s a bit out of date now (I really should update it).

Within the library is a Functions-Environment file.

This file contains a set of Powershell cmdlets for provisioning and deleting environments. The assumption is that it will be used within libraries for specific environments (like the Proxy environment mentioned above) and will allow us to take care of all of the common concerns (like uploading dependencies, setting parameters in CloudFormation, waiting on the CloudFormation initialization, etc).

Inside this file is a function called New-Environment, whose signature looks like this:

function New-Environment
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$environmentName,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsRegion,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$octopusServerUrl,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$octopusApiKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$uniqueComponentIdentifier,
        [System.IO.FileInfo]$templateFile,
        [hashtable]$additionalTemplateParameters,
        [scriptblock]$customiseEnvironmentDetailsHashtable={param([hashtable]$environmentDetailsHashtableToMutate,$stack) },
        [switch]$wait,
        [switch]$disableCleanupOnFailure,
        [string[]]$s3Buckets
    )

    # function body would be here, but it's super long
}

As you can see, it has a lot of parameters. It’s responsible for all of the bits and pieces that go into setting up an environment, like Octopus initialization, CloudFormation execution, gathering information in the case of a failure, etc. It’s also responsible for triggering a cleanup when an environment is deemed a failure, so it is the ideal place to put some smoke testing functionality.

Each specific environment repository typically contains a file called Invoke-NewEnvironment. This file is what is executed to actually create an environment of the specific type. It puts together all of the environment specific stuff (output customisation, template location, customised parameters) and uses that to execute the New-Environment function, which takes care of all of the common things.

In order to add a configurable smoke test, all we need to do is add an optional script block parameter to the New-Environment function. Specific environment implementations can supply a value for it if they like, but they don’t have to. If we assume that the interface for the script block is that it will throw an exception if it fails, then all we need to do is wrap it in a try..catch and fail the environment provisioning if an error occurs. Pretty straightforward.
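
A rough sketch of the relevant part of New-Environment is below. Variable names are illustrative and all of the surrounding provisioning logic is elided; the stack is piped into the script block so that it is available as $_, which is how the tests below use it.

# Sketch only: $smokeTest is the assumed name of the new optional [scriptblock] parameter,
# and $result is a stand-in for whatever object New-Environment eventually returns.
if ($smokeTest -ne $null)
{
    try
    {
        Write-Verbose "Executing smoke test against the newly created environment."
        # Pipe the stack in so the smoke test script block can reference it as $_.
        $result.SmokeTestResult = $stack | ForEach-Object $smokeTest
    }
    catch
    {
        # A smoke test failure is treated exactly like any other provisioning failure,
        # which triggers the normal notification and cleanup behaviour.
        throw (New-Object System.Exception("Smoke test failed for environment [$environmentName].", $_.Exception))
    }
}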

To support the smoke test functionality, I wrote two new Pester tests. One verifies that a failing smoke test correctly fails the environment creation and the other verifies that the result of a successful smoke test is included in the environment creation result. You can see them below:

Describe -Tags @("Ignore") "Functions-Environment.New-Environment.SmokeTest" {
    Context "When supplied with a smoke test script that throws an exception (indicating smoke test failure)" {
        It "The stack creation is aborted and deleted" {
            $creds = Get-AwsCredentials
            $octoCreds = Get-OctopusCredentials
            $environmentName = Create-UniqueEnvironmentName
            $uniqueComponentIdentifier = "Test"
            $templatePath = "$rootDirectoryPath\src\TestEnvironment\Test.CloudFormation.template"
            $testBucket = [Guid]::NewGuid().ToString("N")
            $customTemplateParameters = @{
                "LogsS3BucketName"=$testBucket;
            }

            try
            {
                try
                {
                    $createArguments = @{
                        "-EnvironmentName"=$environmentName;
                        "-TemplateFile"=$templatePath;
                        "-AdditionalTemplateParameters"=$CustomTemplateParameters;
                        "-UniqueComponentIdentifier"=$uniqueComponentIdentifier;
                        "-S3Buckets"=@($testBucket);
                        "-SmokeTest"={ throw "FORCED FAILURE" };
                        "-Wait"=$true;
                        "-AwsKey"=$creds.AwsKey;
                        "-AwsSecret"=$creds.AwsSecret;
                        "-AwsRegion"=$creds.AwsRegion;
                        "-OctopusApiKey"=$octoCreds.ApiKey;
                        "-OctopusServerUrl"=$octoCreds.Url;
                    }
                    $environmentCreationResult = New-Environment @createArguments
                }
                catch
                {
                    $error = $_
                }

                $error | Should Not Be $null
                $error | Should Match "smoke"

                try
                {
                    $getArguments = @{
                        "-EnvironmentName"=$environmentName;
                        "-UniqueComponentIdentifier"=$uniqueComponentIdentifier;
                        "-AwsKey"=$creds.AwsKey;
                        "-AwsSecret"=$creds.AwsSecret;
                        "-AwsRegion"=$creds.AwsRegion;                
                    }
                    $environment = Get-Environment @getArguments
                }
                catch
                {
                    Write-Warning $_
                }

                $environment | Should Be $null
            }
            finally
            {
                $deleteArguments = @{
                    "-EnvironmentName"=$environmentName;
                    "-UniqueComponentIdentifier"=$uniqueComponentIdentifier;
                    "-S3Buckets"=@($testBucket);
                    "-Wait"=$true;
                    "-AwsKey"=$creds.AwsKey;
                    "-AwsSecret"=$creds.AwsSecret;
                    "-AwsRegion"=$creds.AwsRegion;
                    "-OctopusApiKey"=$octoCreds.ApiKey;
                    "-OctopusServerUrl"=$octoCreds.Url;
                }
                Delete-Environment @deleteArguments
            }
        }
    }

    Context "When supplied with a valid smoke test script" {
        It "The stack creation is successful" {
            $creds = Get-AwsCredentials
            $octoCreds = Get-OctopusCredentials
            $environmentName = Create-UniqueEnvironmentName
            $uniqueComponentIdentifier = "Test"
            $templatePath = "$rootDirectoryPath\src\TestEnvironment\Test.CloudFormation.template"
            $testBucket = [Guid]::NewGuid().ToString("N")
            $customTemplateParameters = @{
                "LogsS3BucketName"=$testBucket;
            }

            try
            {
                $createArguments = @{
                    "-EnvironmentName"=$environmentName;
                    "-TemplateFile"=$templatePath;
                    "-AdditionalTemplateParameters"=$CustomTemplateParameters;
                    "-UniqueComponentIdentifier"=$uniqueComponentIdentifier;
                    "-S3Buckets"=@($testBucket);
                    "-SmokeTest"={ return $_.StackId + " SMOKE TESTED"}; 
                    "-Wait"=$true;
                    "-AwsKey"=$creds.AwsKey;
                    "-AwsSecret"=$creds.AwsSecret;
                    "-AwsRegion"=$creds.AwsRegion;
                    "-OctopusApiKey"=$octoCreds.ApiKey;
                    "-OctopusServerUrl"=$octoCreds.Url;
                }
                $environmentCreationResult = New-Environment @createArguments

                Write-Verbose (ConvertTo-Json $environmentCreationResult)

                $environmentCreationResult.SmokeTestResult | Should Match "SMOKE TESTED"
            }
            finally
            {
                $deleteArguments = @{
                    "-EnvironmentName"=$environmentName;
                    "-UniqueComponentIdentifier"=$uniqueComponentIdentifier;
                    "-S3Buckets"=@($testBucket);
                    "-Wait"=$true;
                    "-AwsKey"=$creds.AwsKey;
                    "-AwsSecret"=$creds.AwsSecret;
                    "-AwsRegion"=$creds.AwsRegion;
                    "-OctopusApiKey"=$octoCreds.ApiKey;
                    "-OctopusServerUrl"=$octoCreds.Url;
                }
                Delete-Environment @deleteArguments
            }
        }
    }
}

Smoke And Mirrors

On the specific environment side (the Proxy in this example), all we need to do is supply a script block that will execute the smoke test.
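
Trimmed right down, the proxy’s Invoke-NewEnvironment ends up doing something like the sketch below. Most of the environment specific parameters are omitted, and the ProxyUrl property is an assumption standing in for however the real stack output is exposed; the Test-Proxy function it calls is shown further down.

# Trimmed, illustrative sketch of the proxy-specific Invoke-NewEnvironment.
$arguments = @{
    "-EnvironmentName"=$environmentName;
    "-TemplateFile"=$templatePath;
    "-UniqueComponentIdentifier"="Proxy";
    "-SmokeTest"={ Test-Proxy -ProxyUrl $_.ProxyUrl };
    "-Wait"=$true;
    # AWS and Octopus credentials and the other common parameters omitted for brevity.
}

New-Environment @arguments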

The smoke test itself needs to be somewhat robust, so we use a generic wait function to repeatedly execute a HTTP request through the proxy until it succeeds or it runs out of time.

function Wait
{
    [CmdletBinding()]
    param
    (
        [scriptblock]$ScriptToFillActualValue,
        [scriptblock]$Condition,
        [int]$TimeoutSeconds=30,
        [int]$IncrementSeconds=2
    )

    write-verbose "Waiting for the output of the script block [$ScriptToFillActualValue] to meet the condition [$Condition]"

    $totalWaitTimeSeconds = 0
    while ($true)
    {
        try
        {
            $actual = & $ScriptToFillActualValue
        }
        catch
        {
            Write-Warning "An error occurred while evaluating the script to get the actual value (which is evaluated by the condition for waiting purposes). As a result, the actual value is undefined (NULL)"
            Write-Warning $_
        }

        try
        {
            $result = & $condition
        }
        catch
        {
            Write-Warning "An error occurred while evaluating the condition to determine if the wait is over"
            Write-Warning $_

            $result = $false
        }

        
        if ($result)
        {
            write-verbose "The output of the script block [$ScriptToFillActualValue] (Variable:actual = [$actual]) met the condition [$condition]"
            return $actual
        }

        write-verbose "The current output of the condition [$condition] (Variable:actual = [$actual]) is [$result]. Waiting [$IncrementSeconds] and trying again."

        Sleep -Seconds $IncrementSeconds
        $totalWaitTimeSeconds = $totalWaitTimeSeconds + $IncrementSeconds

        if ($totalWaitTimeSeconds -ge $TimeoutSeconds)
        {
            throw "The output of the script block [$ScriptToFillActualValue] (Variable:actual = [$actual]) did not meet the condition [$Condition] after [$totalWaitTimeSeconds] seconds."
        }
    }
}

function Test-Proxy
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [string]$proxyUrl
    )

    if ($rootDirectory -eq $null) { throw "RootDirectory script scoped variable not set. That's bad, its used to find dependencies." }
    $rootDirectoryPath = $rootDirectory.FullName

    . "$rootDirectoryPath\scripts\common\Functions-Waiting.ps1"
    
    $result = Wait -ScriptToFillActualValue { return (Invoke-WebRequest -Uri "www.google.com" -Proxy $proxyUrl -Method GET).StatusCode }  -Condition { $actual -eq 200 } -TimeoutSeconds 600 -IncrementSeconds 60
}

The main reason for this repeated try..wait loop is that sometimes a CloudFormation stack will complete successfully, but the service may be unavailable from an external point of view until the Load Balancer or similar component manages to settle properly.

Conclusion

I feel much more comfortable with our environment provisioning after moving the smoke tests into their own functions and executing them during the actual environment creation, rather than just in the tests.

Now whenever an environment completes its creation, I know that it actually works from an external observation point. The smoke tests aren’t particularly complex, but they definitely add a lot to our ability to reliably provision environments containing services.

Alas, I don’t have any more smoke puns or references to finish off this blog post…

Oh wait, yes I do!

*disappears in a puff of smoke*


If you’ve ever read any of my previous posts, you would realise that I’m pretty big on testing. From my point of view, any code without automated tests is probably a liability in the long run. Some code doesn’t need tests of course (like prototypes and spikes), but if you intend for your code to be used or maintained (i.e. most of it) you should be writing tests for it.

I’ve blogged before about the 3 different classes of automated tests, Unit, Integration and Functional. I fully admit that the lines between them can get blurred from time to time, but I feel these 3 classifications help to think about tests at different levels, which in turn assists in creating a quality piece of software.

A lot of the work I do in my current job has dependencies on a legacy SQL Server database. Not one gigantic database in a central location, but many databases with the same schema, distributed at client locations. It’s a bit of a monster, but workable, and we control the schema, so we’ve got that going for us, which is nice.

Quite a few of the older components take a hard dependency on this database. Some of the newer ones use EF at least, which is nice, but others simply use direct SQL execution. Tests have typically been written using an approach I see a lot: use a local database in a known state and write some tests on top of it. Typically this database is specific to a developer’s machine, and as such the tests will not work inside our CI environment, so they just get marked as [Ignored] (so the build won’t fail) until someone wants to use them again.

Unacceptable.

Why go to all that effort writing a test if you aren’t going to execute it all the time? Tests that aren’t run are worthless after all.

Making it Work

Testing on top of a real SQL Server is a fantastic idea and I would classify this sort of test as an Integration test, especially if you’re doing it directly from the classes themselves (instead of the normal entry point for the application). It verifies that the components that you have written (or at least a subset of them) work as expected when you sit them on top of a real database.

The problem comes in automating those sort of tests.

In an ideal world, your test would be responsible for setting up and tearing down all of its dependencies. This would mean it creates a database, fills it with some data, runs the code that it needs to run, verifies the results and then cleans up after itself.

That’s not simple though, and looks like a lot of unnecessary work to some developers. They think: why don’t I just create a local database and put some data in it? I can run my tests while I write my code, and I can reset my database whenever I want using scripts/restores/etc. These developers don’t realise that tests live just as long as the code that they verify, and if you’re developing for the long run, you need to put that effort in or suffer the consequences later on.

I’ll take a moment to mention that not every piece of code needs to be written for the long run (most do) and that it is possible to have so many tests that it becomes hard to change your code due to the barrier of requiring that you change your tests (which can be a lot of work). As with most things in software, it’s a balancing act. Just like your code, your tests should be easy to change and maintain, or they will end up causing the same pain that you were trying to avoid in the first place, just in a different place.

In order to facilitate the approach where each test is responsible for its own test data you need to put some infrastructure in place.

  1. You need a common location where the tests can create a database. If your CI environment is in the same network as your developers, this can simply be a machine at a known location with the required software installed. It’s a little more complicated if they are in two different networks (our CI is in AWS for example).
  2. You need to have a reusable set of tools for creating, initialising and destroying database resources, to make sure your tests are self contained. These tools must be able to be run in an automated fashion.

Where the Wild Databases Are

The first step in allowing us to create executable database integration tests is to have a database server (specifically MSSQL server) available to both our development environment and our CI environment.

Since our CI is in AWS, the best location for the database is there as well. The main reason for this is that it will be easy enough to expose the database securely to the office, whereas it would be hard to expose resources in the office to AWS.

We can use the Amazon supplied SQL Server 2014 Express AMI as a baseline, and create a CloudFormation template that puts all the bits in place (instance, security groups, host record, etc).

The template is fairly trivial, so I won’t go into detail about it. It’s very much the same as any other environment setup I’ve done before (I think the only publicly available example is the JMeter Workers, but that’s a pretty good example).

In order to expose the SQL instance on the canned AMI I had to make some changes to the server itself. It’s easy enough to execute a Powershell script during initialisation (via cfn-init), so I just ran this script to enable TCP/IP, switch to mixed mode and enable and change the password for the sa account. The instance is only accessible via a secure channel (internally in our VPC and via our AWS VPN) so I’m not too concerned about exposing the sa username directly, especially with a unique and secure password. Its only purpose is to hold temporary test data anyway, and it doesn’t have the ability to connect to any other resources, so I’m not particularly worried.

[CmdletBinding()]
param
(
    [Parameter(Mandatory=$true)]
    [string]$SaPassword
)

$ErrorActionPreference = "Stop"

Import-Module sqlps -DisableNameChecking

$localhostname = hostname
$instanceName = "mssqlserver"

$smo = 'Microsoft.SqlServer.Management.Smo.'
$wmi = New-Object ($smo + 'Wmi.ManagedComputer')

Write-Verbose "Enabling TCP/IP."
$uri = "ManagedComputer[@Name='$localhostname']/ ServerInstance[@Name='$instanceName']/ServerProtocol[@Name='Tcp']"
$Tcp = $wmi.GetSmoObject($uri)
$Tcp.IsEnabled = $true
$Tcp.Alter()

Write-Verbose "Enabling Mixed Mode Authentication."
$s = new-object Microsoft.SqlServer.Management.Smo.Server($localhostname)

$s.Settings.LoginMode = [Microsoft.SqlServer.Management.SMO.ServerLoginMode]::Mixed
$s.Alter()

Write-Verbose "Restarting [$instanceName] so TCP/IP and Auth changes are applied."
$service = get-service $instanceName
$service.Stop()
$service.WaitForStatus("Stopped")
$service.Start()
$service.WaitForStatus("Running")

Write-Verbose "Editing sa user for remote access."
$SQLUser = $s.Logins | ? {$_.Name -eq "sa"}
$SQLUser.PasswordPolicyEnforced = 0
$SQLUser.Alter()
$SQLUser.Refresh()
$SQLUser.ChangePassword($SaPassword)
$SqlUser.Enable()
$SQLUser.Alter()
$SQLUser.Refresh()

I know the script above isn’t necessarily the neatest way to do the configuration I needed, but it’s enough for now.

How Are Baby Databases Made?

The next step is having a library that we can re-use to easily make test databases.

We already had some Entity Framework classes and a database context buried deep inside another solution, so I extracted those, put them into their own repository and built a standalone Nuget package from that. It’s not the same code that actually creates a brand new database for a client (that’s a series of SQL scripts embedded in a VB6 application), but it’s close enough for testing purposes. I hope that eventually we will use EF instead of the hardcoded scripts (leveraging the migrations functionality for database version management), but that’s probably a long way away.

I’d previously completed a small refactor on the EF project, so it already had the concept of a DbContextFactory; all I had to do was implement a new one that connected to a known SQL server and created a randomly named database. I made it disposable, so that it would destroy the database once it was done.

EF took care of actually creating the database to match the schema defined by the DTO classes (which were already there), so I didn’t have to worry about that too much.

In the code below, the ITestDatabaseConnectionStringFactory implementation is responsible for knowing where the server is and how to connect to it (there’s a few implementations, one takes values from the app.config, one is hardcoded, etc). The INamedDbFactory has a single Create method that returns a derived DbContext, nothing fancy.

using System;
using System.Data.SqlClient;
using System.Diagnostics;
using System.Linq;
using Serilog;

namespace Solavirum.Database.EF.Tests.Common
{
    public class TestDatabaseDbFactory : INamedDbFactory, IDisposable
    {
        public TestDatabaseDbFactory(ILogger logger, ITestDatabaseConnectionStringFactory connFactory)
        {
            _connFactory = connFactory;
            var databaseName = _random.GenerateString(20);
            ConnectionString = _connFactory.Create(databaseName);

            _logger = logger
                .ForContext("databaseName", ConnectionString.DatabaseName)
                .ForContext("databaseHost", ConnectionString.Host);

            _logger.Information("New factory for database {databaseName} on host {databaseHost} created");
        }

        private readonly IRandomTestData _random = new DefaultRandomTestData();
        private readonly ILogger _logger;
        private ITestDatabaseConnectionStringFactory _connFactory;

        public readonly TestDatabaseConnectionString ConnectionString;

        public NamedDb Create()
        {
            _logger.Information("Creating a new NamedDb for database {databaseName}");
            var context = new NamedDb(new SqlConnection(ConnectionString.GetFullConnectionString()));

            if (!context.Database.Exists())
            {
                _logger.Information("This is the first time {databaseName} has been created, creating backing database on host {databaseHost}");
                context.Database.CreateIfNotExists();
            }

            return context;
        }

        public void Cleanup()
        {
            using (var context = new NamedDb(ConnectionString.GetFullConnectionString()))
            {
                _logger.Information("Deleting backing database {databaseName} on host {databaseHost}");
                context.Database.Delete();
            }
        }

        bool _disposed;

        public void Dispose()
        {
            Dispose(true);
            GC.SuppressFinalize(this);
        }

        ~TestDatabaseDbFactory()
        {
            Dispose(false);
        }

        protected virtual void Dispose(bool disposing)
        {
            if (_disposed)
                return;

            if (disposing)
            {
                try
                {
                    Cleanup();
                }
                catch (Exception ex)
                {
                    Trace.WriteLine(string.Format("An unexpected error occurred while attempting to clean up the database named [{0}] spawned from this database factory. The database may not have been cleaned up.", ConnectionString.DatabaseName));
                    Trace.WriteLine(ex.ToString());
                }
                
            }

            _disposed = true;
        }
    }

    public interface IRandomTestData
    {
        string GenerateString(int length);
    }

    public class DefaultRandomTestData : IRandomTestData
    {
        private readonly Random _random = new Random();

        public string GenerateString(int length)
        {
            var chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
            var result = new string(Enumerable.Repeat(chars, length)
                .Select(s => s[_random.Next(s.Length)])
                .ToArray());

            return result;
        }
    }
}

The Cleanup

Of course, by extracting the database classes above into their own repository/package, I had to replace all references to the old classes. It turns out the database component was quite heavily referenced in its original location, so it was a non-trivial amount of work to incorporate it properly.

As is almost always the way when you start doing things correctly after a long period of not doing that, I had to shave a few yaks to make the change happen.

On the upside, as a result of the extraction and the creation of the TestDatabaseFactory, I managed to create reliably executable tests for the actual database component itself, proving that it works when connected to a real SQL server database.

Summary

For me, the takeaway from this activity is that it takes effort to set up good testing infrastructure so that you can easily create tests that can be repeatedly executed. It’s not something that just happens, and you need to be willing to accept the fact that you have to go slower to start off with in order to go faster later. Like delayed gratification.

I could have just set up an SQL server in a remote location without automating any of it, but that’s just another case of the same principle. I now have the ability to set up any version of SQL server that I want, or to change it to deploy a custom version (assuming we had an Octopus package to install SQL server, which is more than possible), and I’ve also automated its destruction every night and recreation every morning, allowing us to easily incorporate changes to the base AMI or to alter it in other ways (like a different version of SQL server).

It can be very hard to get some people to understand this. I find the best approach is to just do it (taking into account hard deadlines of course) and to let it prove itself in the field as time goes on, when you are able to make large changes without fear that you’ve broken everything.

I like knowing that I haven’t broken everything personally.

Or even better, knowing that I have.


So, as per my last post, I built a scalable, deployable, codified proxy environment in AWS, leveraging Cloud Formation, Octopus and Squid.

Since then I have attempted to use this proxy for things. Specifically load tests, which was the entire reason I built the proxy in the first place.

In my attempts to use the proxy, I have learned a few lessons that I thought would be worthwhile to share, so that others might benefit from my pain.

This will be a relatively short post, because:

  1. I’ve been struggling with these issues over the last week and haven’t done much else,
  2. For all the trouble I had, I really only ran into 2 issues,
  3. I’m still struggling with a mysterious, probably unrelated issue, and it’s taking up most of my mental space (unexpected UTF8 characters in a Powershell/Octopus deployment output stream, which might be a good future blog post if I ever figure it out).

Weird Differences

Initially I assumed that my proxy would slot into the same hole as our current proxy. They were both Squid, I had replicated the config from the old one on the new one and both were referenced simply by a URL and port number.

I was mistaken.

I’m still not sure how it did it, but the old proxy was somehow allowing connections to the special EC2 meta information address (169.254.169.254) to pass through correctly. The moment I swapped in my new proxy, cfn-init and cfn-signal no longer worked.

For cfn-init, the error was incredibly unhelpful. It insisted that my instance was not a member of the CloudFormation group/template/configuration that I was trying to initialise from.

For cfn-signal, it just didn’t do anything. It said it signalled, but it was lying.

In hindsight, this makes perfect sense. The request would have gone through the proxy, which was a CloudFormation resource itself, and it would have tried to use the proxy’s CloudFormation template as the container for the meta data, which would fail, giving a technically correct error message in the first case, and signalling something non-existent in the second.

From my point of view, it looked insane.

I assumed I had put some sort of incorrect details into the cfn-init call, or that I had failed to meet the arcane requirements for cfn-signal (must base64 encode the wait handle on windows only for example), but I hadn’t changed anything. The only thing I changed was the proxy configuration.

Long story short, for my proxy, I had to add a bypass entry (on each EC2 instance, configured in the same place as the proxy, the UserData script) which would stop cfn-init (and other tools) from trying to go through the proxy to hit the meta information address. I still have no idea how the old proxy did not require the same sort of configuration. I have a hunch that it might be because it was Linux and the original creators of the box did something special to make it work. Maybe they ran into the same issue, but just fixed it a different way? Maybe Linux automatically handles the situation better? Who knows.
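
For the record, the bypass itself is nothing special. The sketch below shows the general idea (the proxy address is a placeholder, and exactly which setting a given tool honours varies); the important part is making sure requests to 169.254.169.254 never go through the proxy.

# Illustrative UserData fragment; the proxy address is a placeholder.
$proxy = "http://internal-proxy.placeholder:3128"
$bypass = "169.254.169.254;localhost;127.0.0.1"

# WinHTTP level proxy with a bypass list containing the meta information address.
& netsh winhttp set proxy "proxy-server=$proxy" "bypass-list=$bypass"

# Some tools read the conventional environment variables instead, so set those as well.
[Environment]::SetEnvironmentVariable("HTTP_PROXY", $proxy, "Machine")
[Environment]::SetEnvironmentVariable("NO_PROXY", "169.254.169.254", "Machine")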

Very frustrating.

Location, Location, Location

The second pain point I ran into was more insane and just as frustrating.

After reviewing the results of an initial load test, I hypothesised that maybe the proxy was a bottleneck. All of the traffic for the load test had to pass through the proxy (including image uploads) and I couldn’t see anything obvious in the service logs to account for the level of failure I was seeing, except high load. In the interests of getting a better subsequent load test, I wanted to make sure that the proxy boxes could not possibly be a bottleneck, so I planned to beef up their instance type.

I was originally using t2.medium instances, which have some limitations, mostly around network performance and CPU credits. I wanted to switch to something a bit beefier, just for the proxy specific to the load tests.

When I switched to an m3.large, the proxy stopped working.

Looking into it, the expected installation directory (C:\Squid) was empty of anything that even vaguely looked like a proxy.

Following the installation log, I found out that Squid had decided to install itself to Z drive. Z drive was an ephemeral drive. You know, the ones whose content is transitory, and which tend to get annihilated if the instance goes down for any reason?

I tried so very hard to get Squid to just install to the C drive, including checking the registry settings for program installation locations (which were all correctly C based) and manually overriding TARGETFOLDER, ROOTDRIVE and INSTALLDIR in the msi execution parameters.

Alas, it was not to be. No matter what I did, Squid insisted on installing to Z drive.

I still have no idea why, I just turned the instance type back to one that didn’t have ephemeral drives available.

Like any good software user, I logged a bug. Well, I assume it’s a bug, because that’s a really weird feature.

Conclusion

There is no conclusion. Just a glimpse into some of the traps that sap your time, motivation and will to live when doing this sort of thing.

I only hope that someone runs across this blog post one day and it helps them. Or at least lets them know someone else out there understands their pain.


We’ve spent a significant amount of effort recently ensuring that our software components are automatically built and deployed. It’s not something new, and it’s certainly something that some of our components already had in place, but nothing was ever made generic enough to reuse. The weak spot in our build/deploy pipeline is definitely tests though. We’ve had a few attempts in the past to get test automation happening as part of the build, and while it has worked on an individual component basis, we’ve never really taken a holistic look at the process and made it easy to apply to a range of components.

I’ve mentioned this before, but to me tests fall into 3 categories: Unit, Integration and Functional. Unit tests cover the smallest piece of functionality, usually algorithms or classes with all dependencies stubbed or mocked out. Integration tests cover whether all of the bits are configured to work together properly, and can be used to verify features in a controlled environment. Functional tests cover the application from a feature point of view. For example, functional tests for a web service would be run on it after it is deployed, verifying users can interact with it as expected.

From my point of view, the ideal flow is as follows:

Checkin – Build – Unit and Integration Tests – Deploy (CI) – Functional Tests – Deploy (Staging)

Obviously I’m talking about web components here (sites, services, etc), but you could definitely apply it to any component if you tried hard enough.

The nice part of this flow is that you can do any manual testing/exploration/early integration on the Staging environment, with the guarantee that it will probably not be broken by a bad deploy (because the functional tests will protect against that and prevent the promotion to staging).

Aren’t All Cities Full of Teams

We use Team City as our build platform and Octopus as our deployment platform, and thanks to these components we have the checkin, build and deployment parts of the pipeline pretty much taken care of.

My only issue with these products is that they are so configurable and powerful that people often use them to store complex build/deployment logic. This makes me sad, because that logic belongs as close to the code as possible, ideally in the same repository. I think you should be able to grab a repository and build it, without having to use an external tool to put all the pieces together. It’s also an issue if you need to change your build logic, but still allow for older builds (maybe a hotfix branch or something). If you store your build logic in source control, then this situation just works, because the logic is right there with the code.

So I mostly use Team City to trigger builds and collect history about previous builds (and their output), which it does a fine job at. Extending that thought I use Octopus to manage environments and machines, but all the logic for how to install a component lives in the deployable package (which can be built with minimal fuss from the repository).

I do have to mention that these tools do have elements of change control, and do allow you to version your Build Configurations (TeamCity)/Projects (Octopus). I just prefer that this logic lives with the source, because then the same version is applied to everything.

All of our build and deployment logic lives in source control, right next to the code. There is a single powershell script (unsurprisingly called build.ps1) per repository, acting as the entry point. The build script in each repository is fairly lightweight, leveraging a set of common scripts downloaded from our Nuget server, to avoid duplicating logic.

Team City calls this build script with some appropriate parameters, and it takes care of the rest.
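
To give an idea of its shape, a repository’s build.ps1 usually isn’t much more than the sketch below (the bootstrap script name and project name are illustrative, not the real ones). It grabs the common scripts and hands off to them.

# Illustrative only: the entry point TeamCity calls.
[CmdletBinding()]
param
(
    [switch]$deploy,
    [string]$environment,
    [string]$octopusServerUrl,
    [string]$octopusServerApiKey
)

$ErrorActionPreference = "Stop"

$here = Split-Path $script:MyInvocation.MyCommand.Path

# Downloads the common deployment scripts package from our Nuget server (version pinned
# per repository) and dot sources its entry point. Details elided; the name is illustrative.
. "$here\tools\bootstrap-common-scripts.ps1"

Build-DeployableComponent -deploy:$deploy -environment $environment -OctopusServerUrl $octopusServerUrl -OctopusServerApiKey $octopusServerApiKey -projects @("MyComponent")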

Testy Testy Test Test

Until recently, our generic build script didn’t automatically execute tests, which was an obvious weakness. Being that we are in the process of setting up a brand new service, I thought this would be the ideal time to fix that.

To tie in with the types of tests I mentioned above, we generally have 2 projects that live in the same solution as the main body of code (X.Tests.Unit and X.Tests.Integration, where X is the component name), and then another project that lives in parallel called X.Tests.Functional. The Functional tests project is kind of a new thing that we’re trying out, so is still very much in flux. The other two projects are well accepted at this point, and consistently applied.

Both Unit and Integration tests are written using NUnit. We went with NUnit over MSTEST for reasons that seemed valid at the time, but which I can no longer recall with any level of clarity. I think it might have been something about the support for data driven tests, or the ability to easily execute the tests from the command line? MSTEST offers both of those things though, so I’m honestly not sure. I’m sure we had valid reasons though.

The good thing about NUnit is that the NUnit Runner is a NuGet package of its own, which fits nicely into our dependency management strategy. We’ve written powershell scripts to manage external components (like Nuget, 7Zip, Octopus Command Line Tools, etc) and the general pattern I’ve been using is to introduce a Functions-Y.ps1 file into our CommonDeploymentScripts package, where Y is the name of the external component. This powershell file contains functions that we need from the external component (for example for Nuget it would be Restore, Install, etc) and also manages downloading the dependent package and getting a reference to the appropriate executable.

This approach has worked fairly well up to this point, so my plan was to use the same pattern for test execution. I’d need to implement functions to download and get a reference to the NUnit runner, as well as expose something to run the tests as appropriate. I didn’t only require a reference to NUnit though, as we also use OpenCover (and ReportGenerator) to get code coverage results when running the NUnit tests. Slightly more complicated, but really just another dependency to manage just like NUnit.
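
As an indicative sketch of that pattern applied to NUnit (the function, package and version names below are illustrative, and the real file also wires in OpenCover), Functions-NUnit.ps1 ends up looking something like this:

# Nuget-EnsurePackageAvailable stands in for whatever function in Functions-Nuget.ps1
# actually downloads a package and returns its installation directory.
function Get-NUnitConsoleExecutable
{
    [CmdletBinding()]
    param
    (
        [string]$version="2.6.4"
    )

    $packageDirectory = Nuget-EnsurePackageAvailable -Package "NUnit.Runners" -Version $version
    return (Get-ChildItem -Path $packageDirectory -Recurse -Filter "nunit-console.exe" | Select-Object -First 1)
}

function NUnit-ExecuteTests
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [System.IO.FileInfo]$library
    )

    $nunit = Get-NUnitConsoleExecutable
    # The NUnit 2.x console runner returns the number of failed tests as its exit code.
    # Runner output goes to the host so only the result hashtable is returned.
    & $nunit.FullName $library.FullName /noshadow | Write-Host
    return @{ "LibraryName"=$library.Name; "NumberOfFailingTests"=$LASTEXITCODE }
}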

Weirdly Smooth

In a rare twist of fate, I didn’t actually encounter any major issues implementing the functions for running tests. I was surprised, as I always run into some crazy thing that saps my time and will to live. It was nice to have something work as intended, but it was probably primarily because this was a refactor of existing functionality. We already had the script that ran the tests and got the coverage metrics, I was just restructuring it and moving it into a place where it could be easily reused.

I wrote some very rudimentary tests to verify that the automatic downloading of the dependencies was working, and then set to work incorporating the execution of the tests into our build scripts.

function FindAndExecuteNUnitTests
{
    [CmdletBinding()]
    param
    (
        [System.IO.DirectoryInfo]$searchRoot,
        [System.IO.DirectoryInfo]$buildOutput
    )

    Write-Host "##teamcity[blockOpened name='Unit and Integration Tests']"

    if ($rootDirectory -eq $null) { throw "rootDirectory script scoped variable not set. Thats bad, its used to find dependencies." }
    $rootDirectoryPath = $rootDirectory.FullName

    . "$rootDirectoryPath\scripts\common\Functions-Enumerables.ps1"
    . "$rootDirectoryPath\scripts\common\Functions-OpenCover.ps1"

    $testAssemblySearchPredicate = { 
            $_.FullName -like "*release*" -and 
            $_.FullName -notlike "*obj*" -and
            (
                $_.Name -like "*integration*" -or 
                $_.Name -like "*unit*"
            )
        }
    Write-Verbose "Locating test assemblies using predicate [$testAssemblySearchPredicate]."
    $testLibraries = Get-ChildItem -File -Path $searchRoot.FullName -Recurse -Filter "*.Test*.dll" |
        Where $testAssemblySearchPredicate
            
    $failingTestCount = 0
    foreach ($testLibrary in $testLibraries)
    {
        $testSuiteName = $testLibrary.Name
        Write-Host "##teamcity[testSuiteStarted name='$testSuiteName']"
        $result = OpenCover-ExecuteTests $testLibrary
        $failingTestCount += $result.NumberOfFailingTests
        $newResultsPath = "$($buildOutput.FullName)\$($result.LibraryName).TestResults.xml"
        Copy-Item $result.TestResultsFile "$newResultsPath"
        Write-Host "##teamcity[importData type='nunit' path='$newResultsPath']"

        Copy-Item $result.CoverageResultsDirectory "$($buildOutput.FullName)\$($result.LibraryName).CodeCoverageReport" -Recurse

        Write-Host "##teamcity[testSuiteFinished name='$testSuiteName']"
    }

    write-host "##teamcity[publishArtifacts '$($buildDirectory.FullName)']"
    Write-Host "##teamcity[blockClosed name='Unit and Integration Tests']"

    if ($failingTestCount -gt 0)
    {
        throw "[$failingTestCount] Failing Tests. Aborting Build."
    }
}

As you can see, it’s fairly straightforward. After a successful build, the source directory is searched for all DLLs with Tests in their name that also appear in the release directory and are named with either Unit or Integration. These DLLs are then looped through, and the tests executed on each one (using the OpenCover-ExecuteTests function from the Functions-OpenCover.ps1 file), with the results being added to the build output directory. A record of the number of failing tests is kept and if we get to the end with any failing tests, an exception is thrown, which is intended to prevent the deployment of faulty code.

The build script that I extracted the excerpt above from lives inside our CommonDeploymentScripts package, which I have replicated into this Github repository.

I also took this opportunity to write some tests to verify that the build script was working as expected. In order to do that, I had to create a few dummy Visual Studio projects (one for a deployable component via Octopack and another for a simple library component). At the start of each test, these dummy projects are copied to a working directory, and then mutated as necessary in order to provide the appropriate situation that the test needs to verify.

The best example of this is the following test:

Describe "Build-DeployableComponent" {
    Context "When deployable component with failing tests supplied and valid deploy" {
        It "An exception is thrown indicating build failure" {
            $creds = Get-OctopusCredentials

            $testDirectoryPath = Get-UniqueTestWorkingDirectory
            $newSourceDirectoryPath = "$testDirectoryPath\src"
            $newBuildOutputDirectoryPath = "$testDirectoryPath\build-output"

            $referenceDirectoryPath = "$rootDirectoryPath\src\TestDeployableComponent"
            Copy-Item $referenceDirectoryPath $testDirectoryPath -Recurse

            MakeTestsFail $testDirectoryPath
            
            $project = "TEST_DeployableComponent"
            $environment = "CI"
            try
            {
                $result = Build-DeployableComponent -deploy -environment $environment -OctopusServerUrl $creds.Url -OctopusServerApiKey $creds.ApiKey -projects @($project) -DI_sourceDirectory { return $testDirectoryPath } -DI_buildOutputDirectory { return $newBuildOutputDirectoryPath }
            }
            catch 
            {
                $exception = $_
            }

            $exception | Should Not Be $null

            . "$rootDirectoryPath\scripts\common\Functions-OctopusDeploy.ps1"

            $projectRelease = Get-LastReleaseToEnvironment -ProjectName $project -EnvironmentName $environment -OctopusServerUrl $creds.Url -OctopusApiKey $creds.ApiKey
            $projectRelease | Should Not Be $result.VersionInformation.New
        }
    }
}

As you can see, there is a step in this test to make the dummy tests fail. All this does is rewrite one of the classes to return a different value than is expected, but it’s enough to fail the tests in the solution. By doing this, we can verify that a failing test does in fact lead to no deployment.
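
MakeTestsFail itself is nothing clever; it just finds one of the classes in the copied dummy project and flips its hardcoded return value. Something along these lines (the file name and values are stand-ins for whatever is actually in the dummy project):

function MakeTestsFail
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [string]$testDirectoryPath
    )

    # Stand-in file name and values; the real dummy project has its own equivalents.
    $classFile = Get-ChildItem -Path $testDirectoryPath -Recurse -Filter "ReturnsAValue.cs" | Select-Object -First 1
    (Get-Content $classFile.FullName) -replace "return 42;", "return 0;" | Set-Content $classFile.FullName
}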

Summary

Nothing that I’ve said or done above is particularly ground-breaking. It’s all very familiar to anyone who is doing continuous integration/deployment. Having tests is fantastic, but unless they take part in your build/deploy pipeline they are almost useless. That’s probably a bit harsh, but if you can deploy code without running the tests on it, you will (with the best of intentions no doubt) and that doesn’t lead anywhere good.

Our approach doesn’t leverage the power of TeamCity directly, due to my reluctance to store complex logic there. There are upsides and downsides to this, mostly that you trade off owning the implementation of the test execution against keeping all your logic in one place.

Obviously I prefer the second approach, but your mileage may vary.


The service that I’ve mentioned previously (and the iOS app it supports) has been in beta now for a few weeks. People seem relatively happy with it, both from a performance standpoint and due to the fact that it doesn’t just arbitrarily lose their information, unlike the previous version, so we’ve got that going for us, which is nice.

We did a fair amount of load testing on it before it went out to beta, but only for small numbers of concurrent users (< 100), to make sure that our beta experience would be acceptable. That load testing picked up a few issues, including one where the service would happily (accidentally of course) delete other peoples data. It wasn’t a permissions issue, it was due to the way in which we were keying our image storage. More importantly, the load testing found issues with the way in which we were storing images (we were using Raven 2.5 attachments) and how it just wasn’t working from a performance point of view. We switched to storing the files in S3, and it was much better.

I believe the newer version of Raven has a new file storage mechanism that is much better. I don’t even think Ayende recommends that you use the attachments built into Raven 2.5 for any decent amount of file storage.

Before going live, we knew that we needed to find the breaking point of the service: the number of concurrent users at which its performance degraded to the point where it was unusable (at least for the configuration that we were planning on going live with). If that number was too low, we knew we would need to make some additional changes, either in terms of infrastructure (beefier AWS instances, more instances in the Auto Scaling Group) or in terms of code.

We tried to simply run a huge number of users through our load tests locally (which is how we did the first batch of load testing, using JMeter) but we capped out our available upload bandwidth pretty quickly, well below the level of traffic that the service could handle.

It was time to farm the work out to somewhere else, somewhere with a huge amount of easily accessibly computing resources.

Where else but Amazon Web Services?

I’ve Always Wanted to be a Farmer

The concept was fairly straightforward. We had a JMeter configuration file that contained all of our load tests. It was parameterised by the number of users, so conceptually the path would be to spin up some worker instances in EC2, push JMeter, its dependencies and our config to them, then execute the tests. This way we could tune the number users per instance along with the total number of worker instances, and we would be able to easily put enough pressure on the service to find its breaking point.

JMeter gives you the ability to set the value of variables via the command line. Be careful though, as the variable names are case sensitive. That one screwed me over for a while, as I couldn’t figure out why the value of my variables was still the default on every machine I started the tests on. For the variable that defined the maximum number of users it wasn’t so bad, if a bit confusing. The other variable that defined the seed for the user identity was more of an issue when it wasn’t working, because it meant the same user was doing similar things from multiple machines. Still a valid test, but not the one I was aiming to do, as the service isn’t defined for concurrent access like that.
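
For reference, the values go in via -J switches on the command line and come out via the __P function inside the test plan, and the names have to match exactly, including case. A sketch with made-up property names:

# Inside the .jmx these would be referenced as ${__P(maxUsers)} and ${__P(customerNumberSeed)}.
$jmeter = "C:\jmeter\bin\jmeter.bat"

& $jmeter -n -t "C:\load-tests\load-test.jmx" `
    -JmaxUsers=100 `
    -JcustomerNumberSeed=1 `
    -l "C:\load-tests\results.jtl"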

We wouldn’t want to put all of that load on the service all at once though, so we needed to stagger when each instance started its tests.

Leveraging the work I’d done previously for setting up environments, I created a Cloud Formation template containing an Auto Scaling Group with a variable number of worker instances. Each instance would have the JMeter config file and all of its dependencies (Java, JMeter, any supporting scripts) installed during setup, and then be available for remote execution via Powershell.

The plan was to hook into that environment (or setup a new one if one could not be found), find the worker instances and then iterate through them, starting the load tests on each one, making sure to stagger the time between starts to some reasonable amount. The Powershell script for doing exactly that is below:

[CmdletBinding()]
param
(
    [Parameter(Mandatory=$true)]
    [ValidateNotNullOrEmpty()]
    [string]$environmentName,
    [Parameter(Mandatory=$true)]
    [ValidateNotNullOrEmpty()]
    [string]$awsKey,
    [Parameter(Mandatory=$true)]
    [ValidateNotNullOrEmpty()]
    [string]$awsSecret,
    [string]$awsRegion="ap-southeast-2"
)

$ErrorActionPreference = "Stop"

$currentDirectoryPath = Split-Path $script:MyInvocation.MyCommand.Path
write-verbose "Script is located at [$currentDirectoryPath]."

. "$currentDirectoryPath\_Find-RepositoryRoot.ps1"

$repositoryRoot = Find-RepositoryRoot $currentDirectoryPath

$repositoryRootDirectoryPath = $repositoryRoot.FullName
$commonScriptsDirectoryPath = "$repositoryRootDirectoryPath\scripts\common"

. "$repositoryRootDirectoryPath\scripts\environment\Functions-Environment.ps1"

. "$commonScriptsDirectoryPath\Functions-Aws.ps1"

Ensure-AwsPowershellFunctionsAvailable

$stack = $null
try
{
    $stack = Get-Environment -EnvironmentName $environmentName -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion
}
catch 
{
    Write-Warning $_
}

if ($stack -eq $null)
{
    $update = ($stack -ne $null)

    $stack = New-Environment -EnvironmentName $environmentName -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion -UpdateExisting:$update -Wait -disableCleanupOnFailure
}

$autoScalingGroupName = $stack.AutoScalingGroupName

$asg = Get-ASAutoScalingGroup -AutoScalingGroupNames $autoScalingGroupName -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
$instances = $asg.Instances

. "$commonScriptsDirectoryPath\Functions-Aws-Ec2.ps1"

$remoteUser = "Administrator"
$remotePassword = "ObviouslyInsecurePasswordsAreTricksyMonkeys"
$securePassword = ConvertTo-SecureString $remotePassword -AsPlainText -Force
$cred = New-Object System.Management.Automation.PSCredential($remoteUser, $securePassword)

$usersPerMachine = 100
$nextAvailableCustomerNumber = 1
$jobs = @()
foreach ($instance in $instances)
{
    # Get the instance
    $instance = Get-AwsEc2Instance -InstanceId $instance.InstanceId -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion

    $ipAddress = $instance.PrivateIpAddress
    
    $session = New-PSSession -ComputerName $ipAddress -Credential $cred

    $remoteScript = {
        param
        (
            [int]$totalNumberOfUsers,
            [int]$startingCustomerNumber
        )
        Set-ExecutionPolicy -ExecutionPolicy Bypass
        & "C:\cfn\dependencies\scripts\jmeter\execute-load-test-no-gui.ps1" -totalNumberOfUsers $totalNumberOfUsers -startingCustomerNumber $startingCustomerNumber -AllocatedMemory 512
    }
    $job = Invoke-Command -Session $session -ScriptBlock $remoteScript -ArgumentList $usersPerMachine,$nextAvailableCustomerNumber -AsJob
    $jobs += $job
    $nextAvailableCustomerNumber += $usersPerMachine

    #Sleep -Seconds ([TimeSpan]::FromHours(2).TotalSeconds)
    Sleep -Seconds 300

    # Can use Get-Job or record list of jobs and then terminate them. I suppose we could also wait on all of them to be complete. Might be good to get some feedback from
    # the remote process somehow, to indicate whether or not it is still running/what it is doing.
}

Additionally, I’ve recreated and reuploaded the repository from my first JMeter post, containing the environment template and scripts for executing the template, as well as the script above. You can find it here.

The last time I uploaded this repository I accidentally compromised our AWS deployment credentials, so I tore it down again very quickly. Not my brightest moment, but you can rest assured I’m not making the same mistake twice. If you look at the repository, you’ll notice that I implemented the mechanism for asking for credentials for tests so I never feel tempted to put credentials in a file ever again.

We could watch the load tests kick into gear via Kibana, and keep an eye on when errors start to occur and why.

Obviously we didn’t want to run the load tests on any of the existing environments (which are in use for various reasons), so we spun up a brand new environment for the service, fired up the script to farm out the load tests (with a 2 hour delay between instance starts) and went home for the night.

15 minutes later, Production (the environment actively being used for the external beta) went down hard, and so did all of the others, including the new load test environment.

Separately Dependent

We had gone to great lengths to make sure that our environments were independent. That was the entire point behind codifying the environment setup, so that we could spin up all resources necessary for the environment, and keep it isolated from all of the other ones.

It turns out they weren’t quite as isolated as we would have liked.

Like a lot of AWS setups, we have an internet gateway, allowing resources internal to our VPC (like EC2 instances) access to the internet. By default, only resources with an external IP can access the internet through the gateway. Other resources have to use some other mechanism for accessing the internet. In our case, the other mechanism is a Squid proxy.

It was this proxy that was the bottleneck. Both the service under test and the load test workers themselves were slamming it: the service in order to talk to S3, and the load test workers in order to hit the service (through its external URL).

We recently increased the specs on the proxy machine (because of a similar problem discovered during load testing with fewer users) and we thought that maybe it would be powerful enough to handle the incoming requests. It probably would have been if it wasn’t for the double load (i.e. if the load test requests had been coming from an external party and the only traffic going through the proxy was to S3 from the service).

In the end the load tests did exactly what they were supposed to do, even if they did it in an unexpected way. They pushed the system to its breaking point, allowing us to identify where it broke and schedule improvements to prevent the situation from occurring again.

Actions Speak Louder Than Words

What are we going to do about it? There are a number of things I have in mind.

The first is to not have a single proxy instance and instead have an auto scaling group that scales as necessary based on load. I like this idea and I will probably be implementing it at some stage in the future. To be honest, as a shared piece of infrastructure, this is how it should have been implemented in the first place. I understand that the single instance (configured lovingly by hand) was probably quicker and easier initially, but for such a critical piece of infrastructure, you really do need to spend the time to do it properly.

The second is to have environment specific proxies, probably as auto scaling groups anyway. This would give me more confidence that we won’t accidentally murder production services when doing internal things, just from an isolation point of view. Essentially, we should treat the proxy just like we treat any other service, and be able to spin them up and down as necessary for whatever purposes.

The third is to isolate our production services entirely, either with another VPC just for production, or even another AWS account just for production. I like this one a lot, because as long as we have shared environments, I’m always terrified I’ll screw up a script and accidentally delete everything. If production wasn’t located in the same account, that would literally be impossible. I’ll be trying to make this happen over the coming months, but I’ll need to move quickly, as the more stuff we have in production, the harder it will be to move.

The last optimisation is to use the new VPC endpoint feature in AWS to avoid having to go to the internet in order to access S3, which I have already done. This really just delays the root issue (shared single point of failure), but it certainly solves the immediate problem and should also provide a nice performance boost, as it removes the proxy from the picture entirely for interactions with S3, which is nice.
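
For completeness, the endpoint itself is a small change. Assuming the AWS Tools for Powershell, it is something like the snippet below (the IDs are obviously placeholders); the same thing can also be expressed as an AWS::EC2::VPCEndpoint resource in the CloudFormation template.

# Placeholder IDs; routes S3 traffic for the given route table through the endpoint
# instead of out via the internet gateway (and therefore the proxy).
New-EC2VpcEndpoint -VpcId "vpc-11111111" `
    -ServiceName "com.amazonaws.ap-southeast-2.s3" `
    -RouteTableId @("rtb-22222222") `
    -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion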

Conclusion

To me, this entire event proved just how valuable load testing is. As I stated previously, it did exactly what I expected it to do: find where the service breaks. It broke in an entirely unexpected way (and broke other things as well), but honestly this is probably the best outcome, because that would have happened at some point in the future anyway (whenever we hit the saturation point for the proxy), and I’d prefer it to happen now, when we’re in beta and managing communications with every user closely, rather than later, when everybody and their dog are using the service.

Of course, now we have a whole lot more infrastructure work to complete before we can go live, but honestly, the work is never really done anyway.

I still hate proxies.