
In my quest to automate all the things, I’ve failed a lot.

A lot.

Typically the failures are configuration related (forgot to do X, didn’t line up the names of Y and Z properly, accidentally used an uninitialized variable (Powershell is pretty “flexible”)), but sometimes they are just transient. The cloud can be like that, and you need to make sure you build with fault tolerance in mind. For example, sometimes an AWS EC2 instance will just take ages to complete its initialization, and your environment will fail to create with a timeout.

Using CloudFormation, you can use WaitConditions and WaitHandles to keep track of the setup of the more complicated components of your environment (like EC2 instances). Generally you take the exit code from your initialization on the EC2 instance (for us that is typically cfn-init) and push it through cfn-signal, pointing at a particular WaitHandle. If the exit code is non-zero, CloudFormation considers it a failure and it bubbles up and fails the stack (or whatever your error handling strategy is). If you need to, you can also supply a reason (a short string), which can help in identifying the root cause of the failure.
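As a rough illustration (assuming cfn-init.exe and cfn-signal.exe are on the PATH; the resource name and the variable holding the WaitHandle URL are placeholders, not lifted from our actual templates), the instance-side signalling looks something like this:

# Run cfn-init for this instance's metadata, then push the resulting exit code
# through cfn-signal so the WaitCondition (and therefore the stack) reflects it.
& cfn-init.exe --stack $stackName --resource "ApiInstance" --region $awsRegion
$initExitCode = $LASTEXITCODE

# A non-zero exit code fails the WaitCondition; the reason string shows up in the stack events.
& cfn-signal.exe --exit-code $initExitCode --reason "cfn-init exited with code [$initExitCode]" "$waitHandleUrl"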

Up until recently, when my environment setups failed, I would report the failure (i.e. this stack failed to initialize) and then automatically clean it up. My scripts would wait for the stack status to change to CREATE_COMPLETE within some timeout, and in the case of a failure, would simply report that the stack had failed. If I was executing from a test, typically I would break on the line that does the cleanup, and investigate the root cause of the failure before I let the stack be torn down.
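For context, the wait itself is nothing fancy. A minimal sketch of that sort of polling loop (the timeout, interval and status handling here are illustrative, not lifted from our actual scripts) looks something like this:

$timeout = [TimeSpan]::FromMinutes(30)
$stopwatch = [System.Diagnostics.Stopwatch]::StartNew()
while ($true)
{
    # Poll the stack until it completes, fails or exceeds the timeout.
    $stack = Get-CFNStack -StackName $stackId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
    $status = $stack.StackStatus.Value
    if ($status -eq "CREATE_COMPLETE") { break }
    if ($status -match "FAILED" -or $status -match "ROLLBACK") { throw "Stack [$stackId] entered status [$status]" }
    if ($stopwatch.Elapsed -gt $timeout) { throw "Stack [$stackId] did not reach CREATE_COMPLETE within [$timeout]" }
    Start-Sleep -Seconds 30
}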

Now that I’m automatically executing environment creation/destruction through TeamCity, that kind of approach doesn’t really cut it.

If I arrive at work in the morning and see that the CI environment for service A failed to automatically re-create I need to know why. The stack will have already been cleaned up, so unless the scripts report as much information as possible, my only real recourse is to run it again locally and see what happened. This only works if the failure is consistent. It might run perfectly the second time, which is incredibly annoying when you’re trying to track down a failure.

It was time to think about failing a little bit harder.

Investigating Failures

When a stack fails I typically do 3 things.

The first is to check whether the stack failed on creation (i.e. on the actual call to New-CfnStack). These failures are usually related to template issues, so the environment setup scripts already handle this. When New-CfnStack throws an exception it bubbles up to the top and is logged. All you have to do is read the message and it will (hopefully) point you at the root cause.
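A sketch of what that looks like (the variable names are illustrative, and in reality the exception simply bubbles up to the top-level logging rather than being handled right here):

try
{
    # If the template is malformed or a parameter is wrong, this call throws
    # immediately, before any resources are created.
    $stackId = New-CFNStack -StackName $environmentName -TemplateURL $templateUrl -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
}
catch
{
    # The exception message usually points straight at the template issue.
    Write-Warning "Stack creation call failed for environment [$environmentName]."
    Write-Warning $_
    throw
}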

An interesting fact: if your template file is long-ish (it happened to me when mine was about 1000 lines long, but I’m pretty sure it’s dependent on characters, not lines), then the call to New-CfnStack can and will fail with an incredibly useless error about incorrectly formatted XML. This only happens if you are uploading the content as part of the call, instead of uploading it to S3 first. I’m not entirely sure why this happens, but I have a sneaking suspicion that the Powershell/.NET AWS components have some hard limit on the length of the request that isn’t being checked properly. Just upload your template to S3 (I upload to the same place I put my dependencies) and it will work just fine, regardless of the size of the template.
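If you’re not already uploading the template, it’s not much extra work. Something along these lines (the bucket name and key are placeholders, and the path-style URL format may need adjusting depending on your region):

# Upload the template to S3 and point New-CFNStack at the resulting URL instead
# of supplying the template body directly as part of the call.
Write-S3Object -BucketName "my-deployment-bucket" -Key "templates/environment.template" -File $templateFilePath -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
$templateUrl = "https://s3-$awsRegion.amazonaws.com/my-deployment-bucket/templates/environment.template"
$stackId = New-CFNStack -StackName $environmentName -TemplateURL $templateUrl -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion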

The second is to check the Events attached to the stack for failures, specifically the first one chronologically, as failures typically have a knock-on effect. These events are available in the AWS dashboard, under the CloudFormation section. Sometimes this is all it takes, as most of the failure events are pretty descriptive. If you don’t have permission to do something, this is where it normally comes up, along with errors that the template verification system won’t find (like an RDS instance using a snapshot but also supplying a DBName).

The third is to use the events to determine which component failed (usually an EC2 instance), remote onto the machine and look at some log files. For Windows EC2 instances there are two log files that are useful, cfn-init and EC2Config.

You can typically find the cfn-init log file at C:\cfn\log\cfn-init.log. It contains the full log trace from everything that happened as a result of the call to cfn-init on that machine. As far as log files go, it’s pretty useful, and I’ve plumbed its depths many times trying to figure out why my initialization was failing.

A second interesting fact, also related to length: if the output from one of your commands in cfn-init is too long, it will fail with an incredibly unhelpful error.

You can typically find the EC2Config log at C:\Program Files\Amazon\EC2Config\logs\EC2Config.log. This one is useful if your cfn-init log file doesn’t have anything in it, which can happen if you’re failing so hard that your EC2 instance can’t even start executing cfn-init. It happens.

The 3 step process I outlined above usually identifies the root cause of the failure, which can then be rectified. Following the process isn’t difficult, but it is annoying, and manual. The last step is especially egregious, requiring me to find the machine(s) in question, get their private IPs, make sure I’m connected to our AWS VPN, remote on, locate the log files, read them, etc.

If I have to leave the confines of my environment script in order to go check why it failed, then I’m losing valuable time and context. It would be much better if the environment setup script detected failures and went and did those things for me.

The Right Kind of Lazy

Essentially, a good lazy programmer is one who doesn’t have the stomach for constantly doing a manual task, and so automates it instead. This frees them up to do more interesting things, and the automated task can then be run with minimal human involvement, hopefully acting as a multiplier for the effectiveness of the team.

In my case, I needed to automate the last two steps above. The first step is taken care of, because the exceptions thrown from the various calls to the AWS API are already being logged in TeamCity.

For the second step (interrogating the Events of the Stack), AWS offers APIs for pretty much everything, which is awesome (Octopus is similar, which is equally awesome, maybe more so because they are Windows native). All you need is a call to Get-CfnStackEvent, and then you can filter the resulting events down to those that failed, which helps to cut down on the amount of output. The following piece of Powershell demonstrates just that.

function _TryGetFailingStackEvents
{
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$stackId,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsRegion
    )

    try
    {
        if ($rootDirectory -eq $null) { throw "RootDirectory script scoped variable not set. That's bad, it's used to find dependencies." }
        $rootDirectoryDirectoryPath = $rootDirectory.FullName
        $commonScriptsDirectoryPath = "$rootDirectoryDirectoryPath\scripts\common"
        
        . "$commonScriptsDirectoryPath\Functions-Enumerables.ps1"

        $events = Get-CFNStackEvent -StackName $stackId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
        $failingEvents = @($events | Where { $_.ResourceStatus.Value -match "FAILED" })
        if ($failingEvents | Any -Predicate { $true })
        {
            return $failingEvents
        }
        else
        {
            return @()
        }
    }
    catch
    {
        Write-Warning "Could not get events for stack [$stackId]."
        Write-Warning $_
    }
}

Automating the third step is a little more difficult.

As a general rule, all of our EC2 instances can have Powershell remotely executed on them from our secure AWS VPN CIDR. This is already part of our general EC2 setup (security group configuration to allow the traffic, Windows firewall exception to unblock the Powershell Remote Execution port, etc).
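On the instance side, that setup amounts to something like the following sketch (the rule name and CIDR are placeholders, and the matching security group rule is omitted):

# Enable remote Powershell execution and unblock the HTTP WinRM port for
# traffic coming from the VPN CIDR.
Enable-PSRemoting -Force
New-NetFirewallRule -DisplayName "Powershell Remoting (HTTP)" -Direction Inbound -Protocol TCP -LocalPort 5985 -RemoteAddress "10.250.0.0/16" -Action Allow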

Whenever a failure event occurs that contains a valid EC2 instance ID (identified by a regular expression), the script creates a remote session to the machine using its private IP address and reads the last couple of hundred lines from the log files I mentioned above. You can see the full script below. It actually feeds off the failure events identified in the previous step in order to find the ID of the instance that it needs to connect to.

function _TryExtractLogFilesFromInstanceFailingViaCfnInit
{
    param
    (
        [Parameter(Mandatory=$true)]
        $failingStackEvents,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsRegion,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$adminPassword
    )

    if ($failingStackEvents -eq $null) { return "No events were supplied, could not determine if anything failed as a result of CFN-INIT failure" }

    try
    {
        if ($rootDirectory -eq $null) { throw "RootDirectory script scoped variable not set. That's bad, it's used to find dependencies." }
        $rootDirectoryDirectoryPath = $rootDirectory.FullName
        $commonScriptsDirectoryPath = "$rootDirectoryDirectoryPath\scripts\common"

        $cfnInitFailedIndicator = "CFN-INIT-FAILED"
        Write-Verbose "Attempting to identify and extract information from failure events containing the string [$cfnInitFailedIndicator]"
        $instanceIdRegexExtractor = "(i\-[0-9a-zA-Z]+)"
        $cfnFailureEvent = $failingStackEvents | 
            Where {$_.ResourceStatusReason -match $cfnInitFailedIndicator} | 
            Select -First 1

        if ($cfnFailureEvent.ResourceStatusReason -match $instanceIdRegexExtractor)
        {
            $instanceId = $matches[0];
            Write-Verbose "Found a failure event for instance [$instanceId]"
            Write-Verbose "Attempting to extract some information from the logs from that machine"

            . "$commonScriptsDirectoryPath\Functions-Aws-Ec2.ps1"

            $instance = Get-AwsEc2Instance -InstanceId $instanceId -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion

            $ipAddress = $instance.PrivateIpAddress

            $remoteUser = "Administrator"
            $remotePassword = $adminPassword
            $securePassword = ConvertTo-SecureString $remotePassword -AsPlainText -Force
            $cred = New-Object System.Management.Automation.PSCredential($remoteUser, $securePassword)
            $session = New-PSSession -ComputerName $ipAddress -Credential $cred
    
            $remoteScript = {
                $lines = 200
                # Pull the tail of the cfn-init log, which contains the full trace of the initialization.
                $cfnInitLogPath = "C:\cfn\log\cfn-init.log"
                Write-Output "------------------------------------------------------"
                Write-Output "Last [$lines] lines of [$cfnInitLogPath]"
                Get-Content $cfnInitLogPath -Tail $lines
                Write-Output "------------------------------------------------------"
                # Also pull the tail of the EC2Config log, in case the instance never got far enough to run cfn-init.
                $ec2ConfigLogPath = "C:\Program Files\Amazon\EC2Config\logs\EC2Config.log"
                Write-Output "Last [$lines] lines of [$ec2ConfigLogPath]"
                Get-Content $ec2ConfigLogPath -Tail $lines
                Write-Output "------------------------------------------------------"
            }
            $remotelyExtractedData = Invoke-Command -Session $session -ScriptBlock $remoteScript
            Remove-PSSession $session
            # If you don't do this, when you do a JSON convert later it spits out a whole bunch of useless
            # information about the machine the line was extracted from, files, etc.
            $remotelyExtractedData = $remotelyExtractedData | foreach { $_.ToString() }
            
            return $remotelyExtractedData
        }
        else
        {
            Write-Verbose "Could not find a failure event about CFN-INIT failing"
            return "No events failed with a reason containing the string [$cfnInitFailedIndicator]"
        }
    }
    catch
    {
        Write-Warning "An error occurred while attempting to gather more information about an environment setup failure"
        Write-Warning $_       
    }
}

I execute both of these functions in the catch block of the main environment setup function. Basically, if anything goes wrong during the environment creation, it tries to gather some additional information to print out to the Warning channel. Sometimes failures occur before you’ve actually managed to create the stack, so it needs to be robust in the face of not having a stack ID. As it is inside the catch block, it also needs to be wrapped in a try…catch itself, otherwise you’re likely to lose the root cause of the error if an unexpected error happens during your information gathering.

if ($stackId -ne $null)
{
    Write-Warning "Environment creation failed. Attempting to gather additional information."
    $failingStackEvents = _TryGetFailingStackEvents -StackId $stackId -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion
    $cfnInitFailureLogs = _TryExtractLogFilesFromInstanceFailingViaCfnInit -failingStackEvents $failingStackEvents -adminPassword $adminPassword -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion
    $customError = @{
        "StackId"=$stackId;
        "Stack"=$stack;
        "FailingStackEvents"=$failingStackEvents;
        "CfnInitFailingInstanceLogs"=$cfnInitFailureLogs;
    }

    $failingDetails = new-object PSObject $customError
    Write-Warning (ConvertTo-Json $failingDetails)
}

With everything above working, you get a nice dump of information whenever an environment fails to create, without having to do anything else, which is nice.

The full source that the snippets above were extracted from will be available at Solavirum.Scripts.Common at some point. The content available there now is out of date, but I need to check through the scripts carefully before I upload them to GitHub. We all remember what happened last time.

Room for Improvement

At some point in the future I would prefer if the output from CloudFormation (its events) and the various log files from the EC2 instances being provisioned were automatically pumped into our Log Aggregation server (ELK) so that we could analyse them in more detail.

The hard part about this is that the log aggregation would require the installation of something on the machine, and if we’re in a failure state during environment setup, those components are probably not installed. I mean, the whole point of the environment setup is to create an environment and deploy some software components; having to have other components already in place just to report failures is probably not going to work.

I’ll keep thinking about it and make a post if I find a good solution.

Conclusion

I hesitated to do this automated investigation for a long time, because I knew it would be finicky and complex.

I shouldn’t have.

Every time an environment setup fails I now get detailed information right in the same window that I was watching for completion messages. It is beyond useful. It also means that whenever an environment refresh fails in the morning before I get to work, I can still investigate to see what happened, even though the stack was already cleaned up.

I should have followed my instincts and automated this kind of pointless busywork earlier.