In my quest to automate all the things, I’ve failed a lot.

A lot.

Typically the failures are configuration related (forgot to do X, didn’t line up the names of Y and Z properly, accidentally used an uninitialized variable (Powershell is pretty “flexible”)), but sometimes they are just transient. The cloud can be like that, and you need to make sure you build with fault tolerance in mind. For example, sometimes an AWS EC2 instance will just take ages to complete its initialization, and your environment will fail to create with a timeout.

Using CloudFormation, you can use WaitConditions and WaitHandles to keep track of the setup of the more complicated components of your environment (like EC2 instances). Generally you take the exit code from your initialization on the EC2 instance (for us that is typically cfn-init) and push it through cfn-signal pointing at a particular WaitHandle. In the case where the exit code is non-zero, CloudFormation considers it a failure and it bubbles up and fails the stack (or whatever your error handling strategy is). If you need to you can supply a reason as well (a short string) which can help in identifying the root cause of the failure.
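For illustration, here is a minimal sketch of what that signalling step can look like on a Windows instance. The stack/resource names and the WaitHandle URL placeholder are assumptions, not values from our actual templates.

# Hedged sketch: run cfn-init, then signal the result to a WaitHandle.
& cfn-init.exe --stack "MyStack" --resource "InstanceLaunchConfig" --region "ap-southeast-2"

# cfn-signal reports the exit code (0 = success, non-zero = failure) and a
# short reason string to the supplied WaitHandle URL.
& cfn-signal.exe -e $lastexitcode --reason "cfn-init exited with code [$lastexitcode]" "<WaitHandle URL, usually supplied via Ref>"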

Up until recently, when my environment setups failed, I would report the failure (i.e. this stack failed to initialize) and then automatically clean it up. My scripts would wait to see if the stack status changed to CREATE_COMPLETE within some timeout and, in the case of a failure, would simply report that the stack failed. If I was executing from a test, typically I would break on the line that does the cleanup and investigate the root cause of the failure before I let the stack be torn down.
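For context, the waiting part is just a polling loop over the stack status. A minimal sketch (the timeout and polling interval here are assumptions, not the values from our actual scripts):

# Poll the stack status until it completes, fails or times out.
$timeout = [TimeSpan]::FromMinutes(30)
$stopwatch = [System.Diagnostics.Stopwatch]::StartNew()
while ($true)
{
    $stack = Get-CFNStack -StackName $stackId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
    if ($stack.StackStatus.Value -eq "CREATE_COMPLETE") { break }
    if ($stack.StackStatus.Value -match "FAILED|ROLLBACK") { throw "Stack [$stackId] entered status [$($stack.StackStatus)]." }
    if ($stopwatch.Elapsed -gt $timeout) { throw "Stack [$stackId] did not reach CREATE_COMPLETE within [$timeout]." }
    Start-Sleep -Seconds 30
}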

Now that I’m automatically executing environment creation/destruction through TeamCity, that kind of approach doesn’t really cut it.

If I arrive at work in the morning and see that the CI environment for service A failed to automatically re-create I need to know why. The stack will have already been cleaned up, so unless the scripts report as much information as possible, my only real recourse is to run it again locally and see what happened. This only works if the failure is consistent. It might run perfectly the second time, which is incredibly annoying when you’re trying to track down a failure.

It was time to think about failing a little bit harder.

Investigating Failures

When a stack fails I typically do 3 things.

The first is to check to see if the stack failed on creation (i.e. on the actual call to New-CfnStack). These failures are usually related to template issues, so the environment setup scripts already handle this. When New-CfnStack throws an exception it bubbles up to the top and is logged. All you have to do is read the message and it will (hopefully) point you at the root cause.

An interesting fact. If your template file is long-ish (it happened to me when mine was about 1000 lines long, but I’m pretty sure it’s dependent on characters, not lines) then the call to New-CfnStack can and will fail with an incredibly useless error about incorrectly formatted XML. This only happens if you are uploading the content as part of the call, instead of uploading to S3 first. I’m not entirely sure why this happens, but I have a sneaking suspicion that the Powershell/.NET AWS components have some hard limit on the length of the request that isn’t being checked properly. Just upload your template to S3 (I upload to the same place I put my dependencies) and it will work just fine, regardless of the size of the template.
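A hedged sketch of that workaround, using the standard AWS Powershell cmdlets (the bucket and key names are assumptions):

# Upload the template to S3 first, then point New-CFNStack at the URL
# instead of supplying the template body inline.
$templateKey = "templates/environment.cloudformation.template"
Write-S3Object -BucketName $dependenciesBucket -Key $templateKey -File ".\environment.cloudformation.template" -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

$templateUrl = "https://s3-$awsRegion.amazonaws.com/$dependenciesBucket/$templateKey"
$stackId = New-CFNStack -StackName $stackName -TemplateURL $templateUrl -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion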

The second is to check the Events attached to the stack for failures, specifically the first one chronologically, as failures typically have a knock on effect. These events are available in the AWS dashboard, under the CloudFormation section. Sometimes this is all it takes, as most of the failure events are pretty descriptive. If you don’t have permission to do something, this is where it normally comes up, along with some errors that the template verification system won’t find (like an RDS instance using a snapshot but also supplying a DBName).

The third is to use the events to determine which component failed (usually an EC2 instance), remote onto the machine and look at some log files. For Windows EC2 instances there are two useful log files: cfn-init and EC2Config.

You can typically find the cfn-init log file at C:\cfn\log\cfn-init.log. It contains the full log trace from everything that happened as a result of the call to cfn-init on that machine. As far as log files go, it’s pretty useful, and I’ve plumbed its depths many times trying to figure out why my initialization was failing.

Second interesting fact, also related to length. If the output from one of your commands in cfn-init is too long, it will fail with an incredibly unhelpful error.

You can typically find the EC2Config log at C:\Program Files\Amazon\EC2Config\logs\EC2Config.log. This one is useful if your cfn-init log file doesn’t have anything in it, which can happen if you’re failing so hard that your EC2 instance can’t even start executing cfn-init. It happens.

The three step process I outlined above usually identifies the root cause of the failure, which can then be rectified. Following the process isn’t difficult, but it is annoying and manual. The last step is especially egregious, requiring me to find the machine/s in question, get their private IPs, be connected to our AWS VPN, remote on, locate the log files, read them, etc.

If I have to leave the confines of my environment script in order to go check why it failed, then I’m losing valuable time and context. It would be much better if the environment setup script detected failures and went and did those things for me.

The Right Kind of Lazy

Essentially a good lazy programmer is the one who doesn’t have the stomach for constantly doing a manual task, so instead automates it. This frees them up to do more interesting things, and the automated task can then be run with minimal human involvement, hopefully acting as a multiplier for the effectiveness of the team.

In my case, I needed to automate the last two steps above. The first step is taken care of because the exceptions being thrown from the various calls to the AWS API are already being logged in TeamCity.

For the second step (interrogating the Events of the Stack), AWS offers APIs for pretty much everything, which is awesome (Octopus is similar, which is equally awesome, maybe more so because they are Windows native). All you need is a call to Get-CfnStackEvent, and then you can filter the resulting events to those that failed, which helps to cut down on the amount of output. The following piece of Powershell demonstrates just that.

function _TryGetFailingStackEvents
{
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$stackId,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsRegion
    )

    try
    {
        if ($rootDirectory -eq $null) { throw "RootDirectory script scoped variable not set. That's bad, it's used to find dependencies." }
        $rootDirectoryDirectoryPath = $rootDirectory.FullName
        $commonScriptsDirectoryPath = "$rootDirectoryDirectoryPath\scripts\common"
        
        . "$commonScriptsDirectoryPath\Functions-Enumerables.ps1"

        # Grab every event for the stack and keep only those whose status indicates failure.
        $events = Get-CFNStackEvent -StackName $stackId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
        $failingEvents = @($events | Where { $_.ResourceStatus.Value -match "FAILED" })
        if ($failingEvents | Any -Predicate { $true })
        {
            return $failingEvents
        }
        else
        {
            return @()
        }
    }
    catch
    {
        Write-Warning "Could not get events for stack [$stackId]."
        Write-Warning $_
    }
}

Automating the third step is a little more difficult.

As a general rule, all of our EC2 instances can have Powershell remotely executed on them from our secure AWS VPN CIDR. This is already part of our general EC2 setup (security group configuration to allow the traffic, Windows firewall exception to unblock the Powershell Remote Execution port, etc).
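The instance-side part of that setup is tiny; something along these lines (the firewall rule name is an assumption, and 5985 is the default WinRM HTTP port):

# Enable Powershell remoting and unblock the default WinRM HTTP port.
Enable-PSRemoting -Force
netsh advfirewall firewall add rule name="WinRM HTTP In" dir=in action=allow protocol=TCP localport=5985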

Whenever a failure event occurs that contains a valid EC2 instance ID (identified by a regular expression), the script creates a remote session to the machine using its private IP address and reads the last couple of hundred lines from the log files I mentioned above. You can see the full script below. It actually feeds off the failure events identified in the previous step in order to find the ID of the instance that it needs to connect to.

function _TryExtractLogFilesFromInstanceFailingViaCfnInit
{
    param
    (
        [Parameter(Mandatory=$true)]
        $failingStackEvents,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsRegion,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$adminPassword
    )

    if ($failingStackEvents -eq $null) { return "No events were supplied, could not determine if anything failed as a result of CFN-INIT failure" }

    try
    {
        if ($rootDirectory -eq $null) { throw "RootDirectory script scoped variable not set. That's bad, it's used to find dependencies." }
        $rootDirectoryDirectoryPath = $rootDirectory.FullName
        $commonScriptsDirectoryPath = "$rootDirectoryDirectoryPath\scripts\common"

        $cfnInitFailedIndicator = "CFN-INIT-FAILED"
        Write-Verbose "Attempting to identify and extract information from failure events containing the string [$cfnInitFailedIndicator]"
        $instanceIdRegexExtractor = "(i\-[0-9a-zA-Z]+)"
        $cfnFailureEvent = $failingStackEvents | 
            Where {$_.ResourceStatusReason -match $cfnInitFailedIndicator} | 
            Select -First 1

        if ($cfnFailureEvent.ResourceStatusReason -match $instanceIdRegexExtractor)
        {
            $instanceId = $matches[0];
            Write-Verbose "Found a failure event for instance [$instanceId]"
            Write-Verbose "Attempting to extract some information from the logs from that machine"

            . "$commonScriptsDirectoryPath\Functions-Aws-Ec2.ps1"

            $instance = Get-AwsEc2Instance -InstanceId $instanceId -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion

            $ipAddress = $instance.PrivateIpAddress

            $remoteUser = "Administrator"
            $remotePassword = $adminPassword
            $securePassword = ConvertTo-SecureString $remotePassword -AsPlainText -Force
            $cred = New-Object System.Management.Automation.PSCredential($remoteUser, $securePassword)
            $session = New-PSSession -ComputerName $ipAddress -Credential $cred
    
            $remoteScript = {
                $lines = 200
                # Read the tail of both log files mentioned earlier: cfn-init for
                # initialization failures, EC2Config for failures that happen before
                # cfn-init even gets a chance to run.
                $logFilePaths = @(
                    "C:\cfn\log\cfn-init.log",
                    "C:\Program Files\Amazon\EC2Config\logs\EC2Config.log"
                )
                foreach ($logFilePath in $logFilePaths)
                {
                    Write-Output "------------------------------------------------------"
                    Write-Output "Last [$lines] lines of [$logFilePath]"
                    Get-Content $logFilePath -Tail $lines
                    Write-Output "------------------------------------------------------"
                }
            }
            $remotelyExtractedData = Invoke-Command -Session $session -ScriptBlock $remoteScript
            Remove-PSSession $session
            # If you don't do this, when you do a JSON convert later it spits out a whole bunch of useless
            # information about the machine the line was extracted from, files, etc.
            $remotelyExtractedData = $remotelyExtractedData | foreach { $_.ToString() }
            
            return $remotelyExtractedData
        }
        else
        {
            Write-Verbose "Could not find a failure event about CFN-INIT failing"
            return "No events failed with a reason containing the string [$cfnInitFailedIndicator]"
        }
    }
    catch
    {
        Write-Warning "An error occurred while attempting to gather more information about an environment setup failure"
        Write-Warning $_       
    }
}

I execute both of these functions in the catch block of the main environment setup function. Basically, if anything goes wrong during the environment creation, it tries to gather some additional information to print out to the Warning channel. Sometimes failures occur before you’ve actually managed to create the stack, so it needs to be robust in the face of not having a stack ID. As it is inside the catch block, it also needs to be wrapped in a try…catch itself, otherwise you’re likely to lose the root cause of the error if an unexpected error happens during your information gathering.

if ($stackId -ne $null)
{
    Write-Warning "Environment creation failed. Attempting to gather additional information."
    $failingStackEvents = _TryGetFailingStackEvents -StackId $stackId -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion
    $cfnInitFailureLogs = _TryExtractLogFilesFromInstanceFailingViaCfnInit -failingStackEvents $failingStackEvents -adminPassword $adminPassword -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion
    $customError = @{
        "StackId"=$stackId;
        "Stack"=$stack;
        "FailingStackEvents"=$failingStackEvents;
        "CfnInitFailingInstanceLogs"=$cfnInitFailureLogs;
    }

    $failingDetails = new-object PSObject -Property $customError
    Write-Warning (ConvertTo-Json $failingDetails)
}

With everything above working, you get a nice dump of information whenever an environment fails to create, without having to do anything else, which is nice.

The full source that the snippets above were extracted from will be available in Solavirum.Scripts.Common at some point. The content available there now is out of date, but I need to check through the scripts carefully before I upload them to GitHub. We all remember what happened last time.

Room for Improvement

At some point in the future I would prefer if the output from CloudFormation (its events) and the various log files from the EC2 instances being provisioned were automatically pumped into our Log Aggregation server (ELK) so that we could analyse them in more detail.

The hard part about this is that log aggregation would require the installation of something on the machine, and if we’re in a failure state during environment setup, those components are probably not installed. The whole point of the environment setup is to create an environment and deploy the software components that make it up, so requiring other components to be in place just to report failures is probably not going to work.

I’ll keep thinking about it and make a post if I find a good solution.

Conclusion

I hesitated to do this automated investigation for a long time, because I knew it would be finicky and complex.

I shouldn’t have.

Every time an environment setup fails I now get detailed information right in the same window that I was watching for completion messages. It is beyond useful. It also means that whenever an environment refresh fails in the morning before I get to work, I can still investigate to see what happened, even though the stack was already cleaned up.

I should have followed my instincts and automated this kind of pointless busywork earlier.

God that title is terrible. Sorry.

If you’ve ever read any of my blog before, you would know that I’ve put a significant amount of effort into making sure that our environments can be spun up and down easily. A good example of this is the environment setup inside the Solavirum.Testing.JMeter repository, which allows you to easily setup a group of JMeter worker machines for the purposes of load testing.

These environment setup scripts are all well and good, but I really don’t like the fact that I end up executing them from my local machine. This can breed a host of problems, in the same way that compiling and distributing code from your own machine can. It’s fine for rarely provisioned environments (like the JMeter workers I mentioned above), but it’s definitely not suitable for provisioning a CI environment for a web service or anything similar.

Additionally, when you provision environments from a developer machine you risk accidentally provisioning an environment with changes that have not been committed to source control. This can be useful for testing purposes, but ultimately leads to issues with configuration management. It’s nice to tag the commit that the environment was provisioned from as well, on both sides (in source control and on the environment itself), just like versioning a library or executable.

Luckily we already have a platform in place for centralizing our compiling and packaging, and I should be able to use it to do environment management as well.

TeamCity.

I’m Not Sure Why It’s Called TeamCity

If you’re unfamiliar with TeamCity, it’s similar to Jenkins. If you’re unfamiliar with Jenkins, it’s similar to TeamCity.

Ha.

Anyway, TeamCity is a CI (Continuous Integration) service. It allows you to set up build definitions and then run them on various triggers to produce artifacts (like installers, or Nuget packages). It does a lot more to be honest, but its core idea is automating the build process in a controlled environment, free from the tyranny of developer machines.

The team here was using TeamCity before I started, so they already had some components being built, including a 20 step monstrosity that I will one day smooth out.

As a general rule, I love a good CI server, but I much prefer to automate in something like Powershell, so that it can be run locally if need be (for debugging/investigation purposes). As a result, I’m wary of putting too much logic inside the actual CI server configuration itself. I definitely like to use the CI server for scheduling, history and other ancillary services (like tagging on successful builds and so on) though; the things that you don’t necessarily need when running a build locally.

Anyway, the meat of this blog post is about automating environment management using TeamCity, so I should probably talk about that now.

Most of my environment provisioning scripts have the same structure (with a few notable exceptions), so it was easy enough to create a Template in TeamCity to automate the destruction and recreation of an environment via scripts that already exist. The template was simple: a link to a git repository (configurable by repo name), a simple build step that just runs some Powershell, a trigger and some build parameters.

The only thing I can really copy here is the Powershell build step, so here it is:

try
{
    if ("%teamcity.build.branch.is_default%" -eq "false") 
    {
        Write-Error "Cannot create environment from non-default branch (i.e. not master)."
        exit 1
    }

    . ".\scripts\build\_Find-RootDirectory.ps1"

    $rootDirectory = Find-RootDirectory "."
    $rootDirectoryPath = $rootDirectory.FullName

    # Consider adding Migration check and run that instead if it exists.
    $environment = "%environment%"
    
    $restrictedEnvironmentRegex = "prod"
    if ($environment -match $restrictedEnvironmentRegex)
    {
        write-error "No. You've selected the environment named [$environment] to create, and it matches the regex [$restrictedEnvironmentRegex]. Think harder before you do this."
        Write-Host "##teamcity[buildProblem description='Restricted Environment Selected']"
        exit 1
    }
    
    $invokeNewEnvironmentPath = ".\scripts\environment\Invoke-NewEnvironment.ps1"
    $invokeDeleteEnvironmentPath = ".\scripts\environment\Invoke-DeleteEnvironment.ps1"

    if(-not (test-path $invokeNewEnvironmentPath) -or -not (test-path $invokeDeleteEnvironmentPath))
    {
        write-error "One of the expected environment management scripts (New: [$invokeNewEnvironmentPath], Delete: [$invokeDeleteEnvironmentPath]) could not be found."
        Write-Host "##teamcity[buildProblem description='Missing Environment Management Scripts']"
        exit 1
    }
    
    $bootstrapPath = ".\scripts\Functions-Bootstrap.ps1"
    if (-not (test-path $bootstrapPath))
    {
        Write-Warning "The bootstrap functions were not available at [$bootstrapPath]. This might not be important if everything is already present in the repository."
    }
    else
    {
        . $bootstrapPath
        Ensure-CommonScriptsAvailable
    }
    
    $octopusUrl = "%octopusdeploy-server-url%"
    $octopusApiKey = "%octopusdeploy-apikey%"
    $awsKey = "%environment-deployment-aws-key%"
    $awsSecret = "%environment-deployment-aws-secret%"
    $awsRegion =  "%environment-deployment-aws-region%"
    
    $arguments = @{}
    $arguments.Add("-Verbose", $true)
    $arguments.Add("-AwsKey", $awsKey)
    $arguments.Add("-AwsSecret", $awsSecret)
    $arguments.Add("-AwsRegion", $awsRegion)
    $arguments.Add("-OctopusApiKey", $octopusApiKey)
    $arguments.Add("-OctopusServerUrl", $octopusUrl)
    $arguments.Add("-EnvironmentName", $environment)
    
    try
    {
        Write-Host "##teamcity[blockOpened name='Delete Environment']"
        Write-Host "##teamcity[buildStatus text='Deleting $environment']"
        & $invokeDeleteEnvironmentPath @arguments
        Write-Host "##teamcity[buildStatus text='$environment Deleted']"
        Write-Host "##teamcity[blockClosed name='Delete Environment']"
    }
    catch
    {
        write-error $_
        Write-Host "##teamcity[buildProblem description='$environment Deletion Failed']"
        exit 1
    }

    try
    {
        $recreate = "%environment-recreate%"
        if ($recreate -eq "true")
        {
            Write-Host "##teamcity[blockOpened name='Create Environment']"
            Write-Host "##teamcity[buildStatus text='Creating $environment']"
            & $invokeNewEnvironmentPath @arguments
            Write-Host "##teamcity[buildStatus text='$environment Created']"
            Write-Host "##teamcity[blockClosed name='Create Environment']"
        }
    }
    catch
    {
        write-error $_
        Write-Host "##teamcity[buildProblem description='$environment Created Failed']"
        exit 1
    }
}
catch 
{
    write-error $_
    Write-Host "##teamcity[buildProblem description='$environment Created Failed']"
    exit 1
}

Once I had the template, I created new build configurations for each environment I was interested in, and filled them out appropriately.

Now I could recreate an entire environment just by clicking a button in TeamCity, and every successful recreation was tagged appropriately in Source Control, which was awesome. Now I had some traceability.

The final step was to schedule an automatic recreation of each CI environment every morning, to constantly validate our scripts and make sure they work appropriately.

Future Improvements

Alas, I ran into one of the most annoying parts of TeamCity: after the initial 20 free ones, licensing is partially based on build configurations. We already had a significant number of configs, so I ran out before I could implement a build configuration to do a nightly tear down of environments that don’t need to exist overnight (for example, all our CI environments). I had to settle for merely recreating them each morning (tear down followed by spin up), which at least verifies that the scripts continue to work.

If I could change build parameters based on a Trigger in TeamCity, that would also work, but that’s a missing feature for now. I could then simply set up two triggers, one for the morning to recreate and the other for the evening to tear down (where they both execute the same script, just with different inputs). This has been a requested feature of TeamCity for a while now, so I hope they get to it at some stage.

I’ll rectify this as soon as we get more build configurations. Which actually leads nicely into my next point.

So, What’s It Cost

It’s free! Kinda.

It’s free for 3 build agents and 20 build configurations. You can choose to buy another agent + 10 build configs for a token amount (currently $300 US), or you can choose to buy unlimited build configurations (the Enterprise edition) for another token amount (currently $2000 US).

If you’re anything like me, and you love to automate all the things, you will run out of build configurations far before you need more build agents.

I made the mistake of getting two build agent + configs packs through our organization before I realized that I should have just bought the Enterprise edition, and now I’m having a hard time justifying its purchase to the people who control the money. Unfortunate, but I’m sure I’ll convince them in time, and we’ve got an extra 2 build agents as well, so that’s always nice.

Jetbrains (the creators of TeamCity) were kind of annoying in this situation actually. We wanted to switch to Enterprise, and realized we didn’t need the build agents (yet), but they wouldn’t do us a deal. I can understand that it’s probably just their policy, but it’s still annoying.

Summary

I’m never happy unless what I’ve done can be executed on a machine completely different from my own, without a lot of futzing about. I’m a big proponent for “it should just work”, and having the environments triggered from TeamCity enforces that sort of thinking. Our build agents are pretty vanilla as far as our newer code is concerned (our legacy code has some nasty installed dependencies that I won’t go into detail about), so being able to execute the environment provisioning through TeamCity constantly validates that the code works straight out of source control.

It also lets other people create environments too, and essentially documents the usage of the environment provisioning scripts.

I get some nice side effects from doing environment management in TeamCity as well, the most important of which is the ability to easily tag in source control when environments were provisioned (and from what commit).

Now I just need more build configurations…

Managing subnets in AWS makes me sad. Don’t get me wrong, AWS (as per normal) gives you full control over that kind of thing; I’m mostly complaining from an automation point of view.

Ideally, when you design a self contained environment, you want to ensure that it is isolated in as many ways as possible from other environments. Yes you can re-use shared infrastructure from a cost optimization point of view, but conceptually you really do want to make sure that Environment A can’t possibly affect anything in Environment B and vice versa.

As is fairly standard, all of our AWS CloudFormation templates use subnets.

In AWS, a subnet defines a set of available IP addresses (i.e. using CIDR notation 1.198.143.0/28, representing the 16 addresses 1.198.143.0 – 1.198.143.15). Subnets also define an availability zone (for redundancy, i.e. ap-southeast-2a vs ap-southeast-2b) and whether or not resources using the subnet automatically get a public IP address, and can be used to define routing rules to restrict access. Route tables and security groups are the main mechanisms by which you can lock down access to your machines outside of the OS level, so it’s important to use them as much as you can. You should always assume that any one of your machines might be compromised and minimise possible communication channels accordingly.

Typically, in a CloudFormation template each resource will have a dependency on one or more subnets (more subnets for highly available resources, like auto scaling groups and RDS instances). The problem is, while it is possible to set up one or many subnets inside a CloudFormation template, there are no real tools available to select an appropriate IP range for your new subnet/s from the available range in the VPC.

What we’ve had to do as a result of this is set up a couple of known subnets with high capacity (mostly just blocks of 255 addresses) and then use those subnets statically in the templates. We’ve got a few subnets for publicly accessible resources (usually just load balancers), a few for private web servers (typically only accessible from the load balancers) and a few for

This is less than ideal for various reasons (hard dependency on resources created outside of the template, can’t leverage route tables as cleanly, etc). What I would prefer is the ability to query the AWS infrastructure for a block of IP addresses at the time the template is executed, and to dynamically create subnets from that (setting up route tables as appropriate). To me this feels like a much better way of managing the network infrastructure in the cloud, keeping in line with my general philosophy of self contained environment setup.

Technically the template would probably have a dependency on a VPC, but you could search for that dynamically if you wanted to. Our accounts only have one VPC in them anyway.

The Dream

I can see the set of tools that I want to access in my head, they just don’t seem to exist.

The first thing needed would be a library of some sort, that allows you to supply a VPC (and its meta information) and a set of subnets (also with their meta information) and then can produce for you a new subnet of the desired capacity. For example, if I know that I only need a few IP addresses for the public facing load balancer in my environment, I would get the tool to generate 2 subnets, one in each availability zone in ap-southeast-2, of size 16 or something similarly tiny.
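To make the idea concrete, here is a minimal sketch of the allocation part, assuming IPv4 and power-of-two aligned blocks. Everything here is hypothetical; as far as I know this library doesn’t exist.

# Hypothetical allocation helper: given a VPC CIDR and the CIDRs of existing
# subnets, find the first free aligned block of the desired size.
function Get-NextAvailableCidr
{
    param
    (
        [Parameter(Mandatory=$true)][string]$vpcCidr,     # e.g. "10.0.0.0/16"
        [string[]]$usedCidrs = @(),                       # e.g. @("10.0.0.0/24")
        [Parameter(Mandatory=$true)][int]$desiredPrefix   # e.g. 28 for 16 addresses
    )

    function ConvertTo-AddressValue([string]$ip)
    {
        $bytes = ([System.Net.IPAddress]::Parse($ip)).GetAddressBytes()
        [Array]::Reverse($bytes)
        return [BitConverter]::ToUInt32($bytes, 0)
    }

    function ConvertTo-AddressString([uint32]$value)
    {
        $bytes = [BitConverter]::GetBytes($value)
        [Array]::Reverse($bytes)
        return (New-Object System.Net.IPAddress -ArgumentList @(,$bytes)).ToString()
    }

    $vpcBase, $vpcPrefix = $vpcCidr -split "/"
    $vpcStart = ConvertTo-AddressValue $vpcBase
    $vpcEnd = $vpcStart + [math]::Pow(2, 32 - [int]$vpcPrefix) - 1
    $blockSize = [math]::Pow(2, 32 - $desiredPrefix)

    # Convert each used CIDR into a [start, end] range for overlap checks.
    $usedRanges = @($usedCidrs | foreach {
        $base, $prefix = $_ -split "/"
        $start = ConvertTo-AddressValue $base
        @{ Start = $start; End = $start + [math]::Pow(2, 32 - [int]$prefix) - 1 }
    })

    # Walk the VPC range in aligned blocks of the desired size, returning the
    # first block that does not overlap any existing subnet.
    for ($candidate = $vpcStart; $candidate + $blockSize - 1 -le $vpcEnd; $candidate += $blockSize)
    {
        $candidateEnd = $candidate + $blockSize - 1
        $overlapping = @($usedRanges | Where { $_.Start -le $candidateEnd -and $_.End -ge $candidate })
        if ($overlapping.Length -eq 0)
        {
            return "$(ConvertTo-AddressString ([uint32]$candidate))/$desiredPrefix"
        }
    }

    throw "No free /$desiredPrefix block available in [$vpcCidr]."
}

Calling Get-NextAvailableCidr -vpcCidr "10.0.0.0/16" -usedCidrs @("10.0.0.0/24") -desiredPrefix 28 would return 10.0.1.0/28, the first /28 that doesn’t collide with the existing /24.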

The second thing would be a visualization tool built on top of the library that lets you view your address space as a giant grid, zoomable, with important elements noted, like coloured subnets, resources currently using IP addresses and, if you wanted to get really fancy, route tables and their effects on communication.

Now you may be thinking, you’re a programmer, why don’t you do it? The answer is, I’m considering it pretty hard, but while the situation does annoy me, it hasn’t annoyed me enough to spur me into action yet. I’m posting up the idea on the off chance someone who is more motivated than me grabs it and runs with it.

Downfall

There is at least one downside that I can think of with using a library to create subnets of the appropriate size.

It’s a similar issue to memory allocation and management. As it is likely that the way in which you need IP address ranges changes from template to template, the addressable space will eventually suffer from fragmentation. In memory management, this is solved by doing some sort of compacting or other de-fragmentation activity. For IP address ranges, I’m not sure how you could solve that issue. You could probably update the environment to use new subnets, re-allocated to minimise fragmentation, but I think it’s likely to be more trouble than it’s worth.

Summary

To summarise, I really would like a tool to help me visualize the VPC (and its subnets, security groups, etc) in my AWS account. I’d settle for something that just lets me visualize my subnets in the context of the total addressable space.

I might write it.

You might write it.

Someone should.

We’ve spent a significant amount of effort recently ensuring that our software components are automatically built and deployed. It’s not something new, and it’s certainly something that some of our components already had in place, but nothing was ever made generic enough to reuse. The weak spot in our build/deploy pipeline is definitely tests though. We’ve had a few attempts in the past to get test automation happening as part of the build, and while it has worked on an individual component basis, we’ve never really taken a holistic look at the process and made it easy to apply to a range of components.

I’ve mentioned this before, but to me tests fall into 3 categories: Unit, Integration and Functional. Unit tests cover the smallest piece of functionality, usually algorithms or classes with all dependencies stubbed or mocked out. Integration tests cover whether all of the bits are configured to work together properly, and can be used to verify features in a controlled environment. Functional tests cover the application from a feature point of view. For example, functional tests for a web service would be run on it after it is deployed, verifying users can interact with it as expected.

From my point of view, the ideal flow is as follows:

Checkin – Build – Unit and Integration Tests – Deploy (CI) – Functional Tests – Deploy (Staging)

Obviously I’m talking about web components here (sites, services, etc), but you could definitely apply it to any component if you tried hard enough.

The nice part of this flow is that you can do any manual testing/exploration/early integration on the Staging environment, with the guarantee that it will probably not be broken by a bad deploy (because the functional tests will protect against that and prevent the promotion to staging).

Aren’t All Cities Full of Teams

We use Team City as our build platform and Octopus as our deployment platform, and thanks to these components we have the checkin, build and deployment parts of the pipeline pretty much taken care of.

My only issue with these products is that they are so configurable and powerful that people often use them to store complex build/deployment logic. This makes me sad, because that logic belongs as close to the code as possible, ideally in the same repository. I think you should be able to grab a repository and build it, without having to use an external tool to put all the pieces together. It’s also an issue if you need to change your build logic, but still allow for older builds (maybe a hotfix branch or something). If you store your build logic in source control, then this situation just works, because the logic is right there with the code.

So I mostly use Team City to trigger builds and collect history about previous builds (and their output), which it does a fine job at. Extending that thought, I use Octopus to manage environments and machines, but all the logic for how to install a component lives in the deployable package (which can be built with minimal fuss from the repository).

I do have to mention that these tools do have elements of change control, and do allow you to version your Build Configurations (TeamCity)/Projects (Octopus). I just prefer that this logic lives with the source, because then the same version is applied to everything.

All of our build and deployment logic lives in source control, right next to the code. There is a single powershell script (unsurprisingly called build.ps1) per repository, acting as the entry point. The build script in each repository is fairly lightweight, leveraging a set of common scripts downloaded from our Nuget server, to avoid duplicating logic.

Team City calls this build script with some appropriate parameters, and it takes care of the rest.
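To give a feel for it, here is a stripped down sketch of what one of those entry points might look like. The function names match snippets shown later in this post; the parameters and the project name are assumptions, not our real script.

# Hedged sketch of a per-repository build.ps1 entry point.
param
(
    [switch]$deploy,
    [string]$environment,
    [string]$octopusServerUrl,
    [string]$octopusServerApiKey
)

$ErrorActionPreference = "Stop"

. ".\scripts\build\_Find-RootDirectory.ps1"
$rootDirectory = Find-RootDirectory "."

# Download the common scripts package from our Nuget server if it's not already
# present, making the shared functions available for dot-sourcing.
. ".\scripts\Functions-Bootstrap.ps1"
Ensure-CommonScriptsAvailable

# MyComponent is a placeholder for the actual project being built.
Build-DeployableComponent -deploy:$deploy -environment $environment -OctopusServerUrl $octopusServerUrl -OctopusServerApiKey $octopusServerApiKey -projects @("MyComponent")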

Testy Testy Test Test

Until recently, our generic build script didn’t automatically execute tests, which was an obvious weakness. Being that we are in the process of setting up a brand new service, I thought this would be the ideal time to fix that.

To tie in with the types of tests I mentioned above, we generally have 2 projects that live in the same solution as the main body of code (X.Tests.Unit and X.Tests.Integration, where X is the component name), and then another project that lives in parallel called X.Tests.Functional. The Functional tests project is kind of a new thing that we’re trying out, so is still very much in flux. The other two projects are well accepted at this point, and consistently applied.

Both Unit and Integration tests are written using NUnit. We went with NUnit over MSTEST for reasons that seemed valid at the time, but which I can no longer recall with any level of clarity. I think it might have been something about the support for data driven tests, or the ability to easily execute the tests from the command line? MSTEST offers both of those things though, so I’m honestly not sure. I’m sure we had valid reasons though.

The good thing about NUnit, is that the NUnit Runner is a NuGet package of its own, which fits nicely into our dependency management strategy. We’ve written powershell scripts to manage external components (like Nuget, 7Zip, Octopus Command Line Tools, etc) and the general pattern I’ve been using is to introduce a Functions-Y.ps1 file into our CommonDeploymentScripts package, where Y is the name of the external component. This powershell file contains functions that we need from the external component (for example for Nuget it would be Restore, Install, etc) and also manages downloading the dependent package and getting a reference to the appropriate executable.

This approach has worked fairly well up to this point, so my plan was to use the same pattern for test execution. I’d need to implement functions to download and get a reference to the NUnit runner, as well as expose something to run the tests as appropriate. I didn’t only require a reference to NUnit though, as we also use OpenCover (and ReportGenerator) to get code coverage results when running the NUnit tests. Slightly more complicated, but really just another dependency to manage just like NUnit.
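As a sketch of that pattern applied to NUnit (the function name, package version and Nuget-Install helper are assumptions; the real implementations live in our common scripts package):

function Get-NUnitConsoleExecutable
{
    # Cache the reference so repeated calls don't hit the Nuget server again.
    if ($script:nunitConsoleExecutable -ne $null) { return $script:nunitConsoleExecutable }

    . "$commonScriptsDirectoryPath\Functions-Nuget.ps1"

    # Download the runner package into a tools directory and grab the console runner.
    $packageId = "NUnit.Runners"
    $version = "2.6.4"
    $toolsDirectoryPath = "$rootDirectoryPath\tools"
    Nuget-Install -PackageId $packageId -Version $version -OutputDirectory $toolsDirectoryPath

    $script:nunitConsoleExecutable = new-object System.IO.FileInfo("$toolsDirectoryPath\$packageId.$version\tools\nunit-console.exe")
    return $script:nunitConsoleExecutable
}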

Weirdly Smooth

In a rare twist of fate, I didn’t actually encounter any major issues implementing the functions for running tests. I was surprised, as I always run into some crazy thing that saps my time and will to live. It was nice to have something work as intended, but it was probably primarily because this was a refactor of existing functionality. We already had the script that ran the tests and got the coverage metrics, I was just restructuring it and moving it into a place where it could be easily reused.

I wrote some very rudimentary tests to verify that the automatic downloading of the dependencies was working, and then set to work incorporating the execution of the tests into our build scripts.

function FindAndExecuteNUnitTests
{
    [CmdletBinding()]
    param
    (
        [System.IO.DirectoryInfo]$searchRoot,
        [System.IO.DirectoryInfo]$buildOutput
    )

    Write-Host "##teamcity[blockOpened name='Unit and Integration Tests']"

    if ($rootDirectory -eq $null) { throw "rootDirectory script scoped variable not set. That's bad, it's used to find dependencies." }
    $rootDirectoryPath = $rootDirectory.FullName

    . "$rootDirectoryPath\scripts\common\Functions-Enumerables.ps1"
    . "$rootDirectoryPath\scripts\common\Functions-OpenCover.ps1"

    $testAssemblySearchPredicate = { 
            $_.FullName -like "*release*" -and 
            $_.FullName -notlike "*obj*" -and
            (
                $_.Name -like "*integration*" -or 
                $_.Name -like "*unit*"
            )
        }
    Write-Verbose "Locating test assemblies using predicate [$testAssemblySearchPredicate]."
    $testLibraries = Get-ChildItem -File -Path $searchRoot.FullName -Recurse -Filter "*.Test*.dll" |
        Where $testAssemblySearchPredicate
            
    $failingTestCount = 0
    foreach ($testLibrary in $testLibraries)
    {
        $testSuiteName = $testLibrary.Name
        Write-Host "##teamcity[testSuiteStarted name='$testSuiteName']"
        $result = OpenCover-ExecuteTests $testLibrary
        $failingTestCount += $result.NumberOfFailingTests
        $newResultsPath = "$($buildOutput.FullName)\$($result.LibraryName).TestResults.xml"
        Copy-Item $result.TestResultsFile "$newResultsPath"
        Write-Host "##teamcity[importData type='nunit' path='$newResultsPath']"

        Copy-Item $result.CoverageResultsDirectory "$($buildOutput.FullName)\$($result.LibraryName).CodeCoverageReport" -Recurse

        Write-Host "##teamcity[testSuiteFinished name='$testSuiteName']"
    }

    write-host "##teamcity[publishArtifacts '$($buildOutput.FullName)']"
    Write-Host "##teamcity[blockClosed name='Unit and Integration Tests']"

    if ($failingTestCount -gt 0)
    {
        throw "[$failingTestCount] Failing Tests. Aborting Build."
    }
}

As you can see, it’s fairly straightforward. After a successful build, the source directory is searched for all DLLs with Tests in their name that also appear in the release directory and are named with either Unit or Integration. These DLLs are then looped through and the tests executed on each one (using the OpenCover-ExecuteTests function from the Functions-OpenCover.ps1 file), with the results being added to the build output directory. A record of the number of failing tests is kept, and if we get to the end with any failing tests, an exception is thrown, which is intended to prevent the deployment of faulty code.
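For reference, a hedged sketch of roughly what the heart of OpenCover-ExecuteTests looks like; the Get-*Executable helpers are assumptions following the dependency management pattern described earlier, and the flags are the standard OpenCover/NUnit command line ones.

# Hedged sketch only; the real implementation lives in Functions-OpenCover.ps1.
$openCoverExecutable = Get-OpenCoverExecutable
$nunitExecutable = Get-NUnitConsoleExecutable

$testResultsPath = "$($testLibrary.DirectoryName)\$($testLibrary.Name).TestResults.xml"
$coverageResultsPath = "$($testLibrary.DirectoryName)\coverage.xml"

# OpenCover hosts the NUnit runner as its profiling target; NUnit writes the
# test results file, OpenCover writes the coverage file, and -returntargetcode
# propagates the runner's exit code so failing tests can be detected.
& $openCoverExecutable.FullName -register:user "-target:$($nunitExecutable.FullName)" "-targetargs:$($testLibrary.FullName) /noshadow /xml:$testResultsPath" "-output:$coverageResultsPath" -returntargetcode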

The build script that I extracted the FindAndExecuteNUnitTests excerpt from lives inside our CommonDeploymentScripts package, which I have replicated into this Github repository.

I also took this opportunity to write some tests to verify that the build script was working as expected. In order to do that, I had to create a few dummy Visual Studio projects (one for a deployable component via Octopack and another for a simple library component). At the start of each test, these dummy projects are copied to a working directory, and then mutated as necessary in order to provide the appropriate situation that the test needs to verify.

The best example of this is the following test:

Describe "Build-DeployableComponent" {
    Context "When deployable component with failing tests supplied and valid deploy" {
        It "An exception is thrown indicating build failure" {
            $creds = Get-OctopusCredentials

            $testDirectoryPath = Get-UniqueTestWorkingDirectory
            $newSourceDirectoryPath = "$testDirectoryPath\src"
            $newBuildOutputDirectoryPath = "$testDirectoryPath\build-output"

            $referenceDirectoryPath = "$rootDirectoryPath\src\TestDeployableComponent"
            Copy-Item $referenceDirectoryPath $testDirectoryPath -Recurse

            MakeTestsFail $testDirectoryPath
            
            $project = "TEST_DeployableComponent"
            $environment = "CI"
            try
            {
                $result = Build-DeployableComponent -deploy -environment $environment -OctopusServerUrl $creds.Url -OctopusServerApiKey $creds.ApiKey -projects @($project) -DI_sourceDirectory { return $testDirectoryPath } -DI_buildOutputDirectory { return $newBuildOutputDirectoryPath }
            }
            catch 
            {
                $exception = $_
            }

            $exception | Should Not Be $null

            . "$rootDirectoryPath\scripts\common\Functions-OctopusDeploy.ps1"

            $projectRelease = Get-LastReleaseToEnvironment -ProjectName $project -EnvironmentName $environment -OctopusServerUrl $creds.Url -OctopusApiKey $creds.ApiKey
            $projectRelease | Should Not Be $result.VersionInformation.New
        }
    }
}

As you can see, there is a step in this test to make the dummy tests fail. All this does is rewrite one of the classes to return a different value than is expected, but it’s enough to fail the tests in the solution. By doing this, we can verify that a failing test does in fact lead to no deployment.

Summary

Nothing that I’ve said or done above is particularly ground-breaking. It’s all very familiar to anyone who is doing continuous integration/deployment. Having tests is fantastic, but unless they take part in your build/deploy pipeline, they are almost useless. That’s probably a bit harsh, but if you can deploy code without running the tests on it, you will (with the best of intentions no doubt), and that doesn’t lead anywhere good.

Our approach doesn’t leverage the power of TeamCity directly, due to my reluctance to store complex logic there. There are upsides and downsides to this, mostly that you trade off owning the implementation of the test execution against keeping all your logic in one place.

Obviously I prefer the second approach, but your mileage may vary.

I’ve been doing a lot of work with AWS recently.

For the last service component that we developed, we put together a CloudFormation template and a series of Powershell scripts to set up, tear down and migrate environments (like CI, Staging, Production, etc). It was extremely effective, barring some issues that we still haven’t quite solved with data migration between environment versions and updating machine configuration settings.

In the first case, an environment is obviously not stateless once you start using it, and you need a good story about maintaining user data between environment versions, at the very least for Production.

In the second case, tearing down an entire environment just to update a configuration setting is obviously sub-optimal. We try to make sure that most of our settings are encapsulated within components that we deploy, but not everything can be done this way. CloudFormation does have update mechanisms, I just haven’t had a chance to investigate them yet.

But I digress; let’s switch to an entirely different topic for this post: how to give secure access to objects in an S3 bucket during initialization of EC2 instances while executing a CloudFormation template.

That was a mouthful.

Don’t Do What Donny Don’t Does

My first CloudFormation template/environment setup system had a fairly simple rule. Minimise dependencies.

There were so many example templates on the internet that just downloaded arbitrary scripts or files from GitHub or S3, and to me that’s the last thing you want. When I run my environment setup (ideally from within a controlled environment, like TeamCity) I want it to use the versions of the resources that are present in the location I’m running the script from. It should be self contained.

Based on that rule, I put together a fairly simple process where the Git Repository housing my environment setup contained all the necessary components required by the resources in the CloudFormation template, and the script was responsible for collecting and then uploading those components to some location that the resources could access.

At the time, I was not very experienced with S3, so I struggled a lot with getting the right permissions.

Eventually I solved the issue by handing off the AWS Key/Secret to the CloudFormation template, and then using those credentials in the AWS::CloudFormation::Authentication block inside the resource (LaunchConfig/Instance). The URL of the dependencies archive was then supplied to the source element of the first initialization step in the AWS::CloudFormation::Init block, which used the supplied credentials to download the file and extract its contents (via cfn-init) to a location on disk, ready to be executed by subsequent components.

This worked, but it left a bad taste in my mouth once I learnt about IAM roles.

IAM roles give you the ability to essentially organise sets of permissions that can be applied to resources, like EC2 instances. For example, we have a logs bucket per environment that is used to capture ELB logs. Those logs are then processed by Logstash (indirectly, because I can’t get the goddamn S3 input to work with a proxy, but I digress) on a dedicated logs processing instance. I could have gone about this in two ways. The first would have been to supply the credentials to the instance, like I had in the past. This exposes those credentials on the instance though, which can be dangerous. The second option is to apply a role to the instance that says “you are allowed to access this S3 bucket, and you can do these things to it”.

I went with the second option, and it worked swimmingly (once I got it all configured).

Looking back at the way I had done the dependency distribution, I realised that using IAM roles would be a more secure option, closer to best practice. Now I just needed a justifiable opportunity to implement it.

New Environment, Time to Improve

We’ve started work on a new service, which means new environment setup. This is a good opportunity to take what you’ve done previously and reuse it, improving it along the way. For me, this was the perfect chance to try and use IAM roles for the dependency distribution, removing all of those nasty “credentials in the clear” situations.

I followed the same process that I had for the logs processing. Setup a role describing the required policy (readonly access to the S3 bucket that contains the dependencies) and then link that role to a profile. Finally, apply the profile to the instances in question.

"ReadOnlyAccessToDependenciesBucketRole": {
    "Type": "AWS::IAM::Role",
    "Properties": {
        "AssumeRolePolicyDocument": {
            "Version" : "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": { "Service": [ "ec2.amazonaws.com" ] },
                    "Action": [ "sts:AssumeRole" ]
                }
            ]
        },
        "Path": "/",
        "Policies" : [
            {
                "Version" : "2012-10-17",
                "Statement": [
                    {
                        "Effect": "Allow",
                        "Action": [ "s3:GetObject", "s3:GetObjectVersion" ],
                        "Resource": { "Fn::Join": [ "", [ "arn:aws:s3:::", { "Ref": "DependenciesS3Bucket" }, "/*" ] ] }
                    },
                    {
                        "Effect": "Allow",
                        "Action": [ "s3:ListBucket" ],
                        "Resource": { "Fn::Join": [ "", [ "arn:aws:s3:::", { "Ref": "DependenciesS3Bucket" } ] ] }
                    }
                ]
            }
        ]
    }
},
"ReadOnlyDependenciesBucketInstanceProfile": {    
    "Type": "AWS::IAM::InstanceProfile",    
    "Properties": { 
        "Path": "/", 
        "Roles": [ { "Ref": "ReadOnlyDependenciesBucketRole" }, { "Ref": "FullControlLogsBucketRole" } ] 
    }
},
"InstanceLaunchConfig": {    
    "Type": "AWS::AutoScaling::LaunchConfiguration",
    "Metadata": {        
        * snip *    
    },    
    "Properties": {        
        "KeyName": { "Ref": "KeyName" },        
        "ImageId": { "Ref": "AmiId" },        
        "SecurityGroups": [ { "Ref": "InstanceSecurityGroup" } ],        
        "InstanceType": { "Ref": "InstanceType" },        
        "IamInstanceProfile": { "Ref": "ReadOnlyDependenciesBucketInstanceProfile" },        
        "UserData": {            
            * snip *        
        }    
    }
}

It worked before, so it should work again, right? I’m sure you can probably guess that that was not the case.

The first mistake I made was attempting to specify multiple roles in a single profile. I wanted to do this because the logs processor needed to maintain its permissions to the logs bucket, but it needed the new permissions to the dependencies bucket as well. Even though the roles element is defined as an array, it can only accept a single element. I now hate whoever designed that, even though I’m sure they probably had a good reason.

At least that was an easy fix: flip the relationship between roles and policies. I split the inline policies out of the roles, then linked the policies back to the roles instead. Each profile only had one role, so everything should have been fine.

"ReadOnlyDependenciesBucketPolicy": {
    "Type":"AWS::IAM::Policy",
    "Properties": {
        "PolicyName": "ReadOnlyDependenciesBucketPolicy",
        "PolicyDocument": {
            "Version" : "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Action": [ "s3:GetObject", "s3:GetObjectVersion" ],
                    "Resource": { "Fn::Join": [ "", [ "arn:aws:s3:::", { "Ref": "DependenciesS3Bucket" }, "/*" ] ] }
                },
                {
                    "Effect": "Allow",
                    "Action": [ "s3:ListBucket" ],
                    "Resource": { "Fn::Join": [ "", [ "arn:aws:s3:::", { "Ref": "DependenciesS3Bucket" } ] ] }
                }
            ]
        },
        "Roles": [
            { "Ref" : "InstanceRole" },
            { "Ref" : "OtherInstanceRole" }
        ]
    }
},
"InstanceRole": {
    "Type": "AWS::IAM::Role",
    "Properties": {
        "AssumeRolePolicyDocument": {
            "Version" : "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": { "Service": [ "ec2.amazonaws.com" ] },
                    "Action": [ "sts:AssumeRole" ]
                }
            ]
        },
        "Path": "/"
    }
},
"InstanceProfile": {
    "Type": "AWS::IAM::InstanceProfile",
    "Properties": { "Path": "/", "Roles": [ { "Ref": "InstanceRole" } ] }
}

Ha ha ha ha ha, no.

The cfn-init logs showed that the process was getting 403s when trying to access the S3 object URL. I had incorrectly assumed that because the instance was running with the appropriate role (and it was: if I remoted onto the instance and attempted to download the object from S3 via the AWS Powershell Cmdlets, it worked just fine), cfn-init would use that role.

It does not.

You still need to specify the AWS::CloudFormation::Authentication element, naming the role and the bucket that it will be used for. This feels a little crap to be honest. Surely the cfn-init application is using the same AWS components, so why doesn’t it just pick up the credentials from the instance profile like everything else does?

Anyway, I added the Authentication element with appropriate values, like so.

"InstanceLaunchConfig": {
    "Type": "AWS::AutoScaling::LaunchConfiguration",
    "Metadata": {
        "Comment": "Set up instance",
        "AWS::CloudFormation::Init": {
            * snip *
        },
        "AWS::CloudFormation::Authentication": {
          "S3AccessCreds": {
            "type": "S3",
            "roleName": { "Ref" : "InstanceRole" },
            "buckets" : [ { "Ref" : "DependenciesS3Bucket" } ]
          }
        }
    },
    "Properties": {
        "KeyName": { "Ref": "KeyName" },
        "ImageId": { "Ref": "AmiId" },
        "SecurityGroups": [ { "Ref": "InstanceSecurityGroup" } ],
        "InstanceType": { "Ref": "ApiInstanceType" },
        "IamInstanceProfile": { "Ref": "InstanceProfile" },
        "UserData": {
            * snip *
        }
    }
}

Then I started getting different errors. You may think this is a bad thing, but I disagree. Different errors mean progress. I’d switched from getting 403 responses (access denied) to getting 404s (not found).

Like I said, progress!

The Dependencies Archive is a Lie

It was at this point that I gave up trying to use the IAM roles. I could not for the life of me figure out why it was returning a 404 for a file that clearly existed. I checked and double checked the path, and even used the same path to download the file via the AWS Powershell Cmdlets on the machines that were having the issues. It all worked fine.

Assuming the issue was with my IAM role implementation, I rolled back to the solution that I knew worked: specifying the Access Key and Secret in the AWS::CloudFormation::Authentication element of the LaunchConfig. I also removed the new IAM role resources (for readonly access to the dependencies archive).

"InstanceLaunchConfig": {
    "Type": "AWS::AutoScaling::LaunchConfiguration",
    "Metadata": {
        "Comment": "Set up instance",
        "AWS::CloudFormation::Init": {
            * snip *
        },
        "AWS::CloudFormation::Authentication": {
            "S3AccessCreds": {
                "type": "S3",
                "accessKeyId" : { "Ref" : "DependenciesS3BucketAccessKey" },
                "secretKey" : { "Ref": "DependenciesS3BucketSecretKey" },
                "buckets" : [ { "Ref":"DependenciesS3Bucket" } ]
            }
        }
    },
    "Properties": {
        "KeyName": { "Ref": "KeyName" },
        "ImageId": { "Ref": "AmiId" },
        "SecurityGroups": [ { "Ref": "InstanceSecurityGroup" } ],
        "InstanceType": { "Ref": "ApiInstanceType" },
        "IamInstanceProfile": { "Ref": "InstanceProfile" },
        "UserData": {
            * snip *
        }
    }
}

Imagine my surprise when it also didn’t work, throwing back the same response, 404 not found.

I tried quite a few things over the next few hours, and there was much gnashing and wailing of teeth. I’ve seen some weird crap with S3 and bucket names (too long and you get errors, weird characters in your key and you get errors, etc) but as far as I could tell, everything was kosher. Yet it just wouldn’t work.

After doing a line by line diff against the template/scripts that were working (the other environment setup) and my new template/scripts I realised my error.

While working on the IAM role stuff, trying to get it to work, I had attempted to remove case sensitivity from the picture by calling ToLowerInvariant on the dependencies archive URL that I was passing to my template. The old script/template combo didn’t do that.

When I took that out, it worked fine.

The issue was that the key of the file being uploaded was not being turned into lower case, only the URL of the resulting file was, and AWS keys are case sensitive.
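In code, the mismatch boiled down to something like this (variable names are illustrative, not from the original scripts):

# The upload used the key exactly as supplied (mixed case)...
$key = "Dependencies/Solavirum.Scripts.Common.7z"
Write-S3Object -BucketName $bucketName -Key $key -File $archivePath -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

# ...but the URL handed to the CloudFormation template was lower cased, which
# points at a different (non-existent) object as far as S3 is concerned, hence the 404.
$dependenciesArchiveUrl = "https://s3-$awsRegion.amazonaws.com/$bucketName/$key".ToLowerInvariant()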

Goddamn it.

Summary

I lost basically an entire day to case sensitivity. It’s not even the first time this has happened to me (well, it’s the first time it’s happened in S3, I think). I come from a heavy Windows background. I don’t even consider case sensitivity to be a thing. I can understand why it’s a thing (technically different characters and all), but it’s just not a thing on Windows, so it’s not even on my radar most of the time. I assume the case sensitivity in S3 is a result of the AWS backend being Unix/Linux based, but it’s still a shock to find a case sensitive URL.

It turns out that my IAM stuff had started working just fine, and I was getting 404s for an entirely different reason. I had assumed that I was still doing something wrong with my permissions and the API was just giving a crappy response (i.e. not really a 404, some sort of permission-based can’t-find-file error masquerading as a 404).

At the very least I didn’t make the silliest mistake you can make in software (assuming the platform is broken), I just assumed I had configured it wrong somehow. That’s generally a fairly safe assumption when you’re using a widely distributed system. Sometimes you do find a feature that is broken, but it is far more likely that you are just doing it wrong. In my case, the error message was completely accurate, and was telling me exactly the right thing, I just didn’t realise why.

Somewhat ironically, the root cause of my 404 issue was my attempt to remove case sensitivity from the picture when I was working on getting the IAM stuff up and running. I just didn’t apply the case insensitivity consistently.

Ah well.