
In my recent adventures with trying to create and maintain Windows Scheduled Tasks that execute Powershell, I’ve discovered a need for logging exactly what happens in the scripts. Mostly this is due to the fact that getting the output from a scheduled task (beyond the exit code) is a massive pain, but I’ve wanted a good Powershell logging system/framework for a while now. I mean, it’s basically code, and I would never write code without good logging, so this shouldn’t be an exception.

Luckily, Powershell offers a mechanism for logging, usable via the X-Transcript commands.

Unluckily, it’s a bit crap.

Transcribing

Powershell offers two cmdlets for dealing with transcription. Start-Transcript and Stop-Transcript.

When you start a transcript, you can optionally supply a filepath, and the transcription will take place there. If not, it will create a file for you and tell you where it is.
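As a rough sketch, usage looks something like this (the path is illustrative, and C:\logs is assumed to already exist):

# A minimal sketch of transcript-based logging; the log path is illustrative.
Start-Transcript -Path "C:\logs\my-script.log"

Write-Output "Doing some work"
# ... the actual body of the script ...

Stop-Transcript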

As for what is actually transcribed, you get all of the Output of the script plus errors and other exceptions. You also get some meta information, including times and the script being executed.

All of this sounds pretty perfect for logging on the surface. Just start a transcript at the beginning of your script and stop it at the end. So I tried it.

It was…okay. Once I figured out that it doesn’t automatically create missing directories in the path supplied (it doesn’t fail, it just does everything really really slowly), that permissions can really ruin its day (same sort of thing, no failures, just super slow while it fails all of its writes) and that if you have square braces in your path it fails miserably with the least helpful error message ever, it at least created a record of what was happening. Sort of.

Okay, it was terrible and it made me sad.

Apart from the weird behaviour with directories/permissions/braces, transcribing does not record the Verbose or Debug streams. This is kind of annoying, because that’s where I put the majority of my log statements (Output is bad because it messes up stream processing and return values and Host is just generally not advised). I believe later versions of transcribing can record additional streams, but I didn’t look into it any further, because I was already annoyed with the feature.

I also hated the output it created. If I’m looking at a log file, at minimum I want timestamps, severity and messages. For a language/framework with a mature logging tool (like .NET with log4net or Serilog) I expect a lot more, but for a simple log file, those three things I listed are enough.

Time to try an alternate approach. I hesitate to say better, but honestly, I was pretty unimpressed with transcription, so it seemed very possible.

Redirection

Powershell offers the ability to redirect the various available streams, using the standard Windows command line redirection, which seems like it would be the easiest way to create a log file. Simply redirect the streams you’re interested in to a file (pretty much all of them) and Bob’s your uncle.
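For reference, redirection of a command inside Powershell looks something like the following sketch (Do-Work is a placeholder for whatever actually does the work, and the log path is illustrative):

# Merge verbose (4), warning (3) and error (2) into the output stream (1) and send it all to a file.
Do-Work -Verbose 4>&1 3>&1 2>&1 > "C:\logs\do-work.log"

# Powershell 3.0 and later also have an all-streams form.
Do-Work *> "C:\logs\do-work.log"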

In my case, I had a lot of issues getting this to work. Because I was running Powershell from a Windows Scheduled Task, using redirection didn’t seem to work. It turns out that one of my mistakes was assuming you could redirect the output from Powershell.exe, which you don’t seem to be able to do. Instead the redirection appears to be primarily for commands within Powershell.

I tried wrapping my Powershell in more Powershell (which actually did the redirection, as well as setting up the file name via the current timestamp and ensuring the appropriate directories existed), and while it seemed to work, it still wasn’t good enough. No timestamps on the lines in the file left me with no real information as to the time that elapsed during the script. Additionally, the syntax for getting the redirection to work properly was a bit strange, with the all-streams redirection seemingly not working as expected (9>&1 didn’t seem to work, and neither did *>&1).

It was still better than transcribing though.

It was time to go deeper.

Interception

Powershell is an interesting language. Coming from a heavy .NET/C# background, I was initially confused by Powershell and its untyped variables and extremely loose, interpreted syntax. I still definitely prefer C#, but Powershell has its place in my environment, though only when it’s supported by tests using Pester.

One of the wonderful things Powershell lets you do is function overriding.

You can create an implementation of a function, and have that implementation be used within a limited scope, instead of the default implementation.
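As a trivial illustration (nothing to do with logging yet), defining a function with the same name as a cmdlet shadows that cmdlet for the current scope, because functions take precedence over cmdlets during command resolution:

# Shadows the built-in Get-Date for the current scope.
function Get-Date { return [DateTime]::MinValue }

Get-Date # now returns 0001-01-01 instead of the actual date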

You might be able to see where I’m going with this.

All of my logging was already in place via a combination of Write-Verbose, Write-Debug and Write-Warning (and the other write functions as well, just more rarely). All I needed to do was intercept those calls and write the messages to some file instead of whatever the default did.

It was easy enough to implement a file containing implementations of all of the Write-X cmdlets, and redirect the output to a file. I could easily add timestamps and any other information I wanted, and could control the output format, allowing it to be easily parsed by something else (like Logstash). To use the feature you simply setup a global variable containing the directory to put the log files into and then dot include the file.

The only downside to this was that any function I overwrote in this way no longer worked normally (which is kind of obvious), so during my testing I broke my IDE a number of times, which made development a bit harder than expected. Luckily, I ran across this StackOverflow answer from someone much smarter than me, which showed how to wrap the existing implementations instead of simply replacing them.

function Get-TimestampForLogFileName
{
    return [DateTime]::Now.ToString('yyyyMMdd');
}

function Get-CurrentLogFile
{
    if ([string]::IsNullOrEmpty($logsDirectoryPath)) { throw "LogsDirectoryPath script scoped variable not set. That's bad, it's used to create log files for Powershell output." }

    $timestamp = Get-TimestampForLogFileName

    $logFilePath = "$logsDirectoryPath\$timestamp.log";
    if (-not(Test-Path $logFilePath))
    {
        $logFile = New-Item -Type File -Path $logFilePath -Force;
        $max = 5
        $current = 0
        while (-not ($logFile.Exists))
        {
            Sleep -Seconds 1
            $current++
            if ($current -gt $max) { break }
        }
    }

    return $logFilePath
}

function Get-TimestampForLogContent
{
    return [DateTime]::Now.ToString('yyyyMMddHHmmss');
}

function Write-Debug 
{
    [CmdletBinding()]
    param
    (
       [Parameter(Mandatory=$true, Position=0, ValueFromPipeline=$true)]
       [AllowEmptyString()]
       [System.String]${Message}
    )

    begin 
    {
       try 
       {
           $outBuffer = $null
           if ($PSBoundParameters.TryGetValue('OutBuffer', [ref]$outBuffer))
           {
               $PSBoundParameters['OutBuffer'] = 1
           }
           $wrappedCmd = $ExecutionContext.InvokeCommand.GetCommand('Write-Debug', [System.Management.Automation.CommandTypes]::Cmdlet)
           $scriptCmd = {& $wrappedCmd @PSBoundParameters }
           $steppablePipeline = $scriptCmd.GetSteppablePipeline($myInvocation.CommandOrigin)
           $steppablePipeline.Begin($PSCmdlet)
       } 
       catch 
       {
           throw
       }
    }

    process 
    {
       try 
       {
            $logFilePath = Get-CurrentLogFile;
            Add-Content -Path $logFilePath "$(Get-TimestampForLogContent):DEBUG:$message" -Force;
            $steppablePipeline.Process($_)
       } 
       catch 
       {
           throw
       }
    }

    end 
    {
       try 
       {
           $steppablePipeline.End()
       } catch 
       {
           throw
       }
    }
}

I’ve only included the Write-Debug implementation, but all of the other ones look the same.
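Using the overrides from a script then looks something like this sketch (Logging-Overrides.ps1 is a placeholder name for the file containing the wrapped Write-X implementations):

# Hypothetical usage of the overrides file described above.
$logsDirectoryPath = "C:\logs\my-component"

. "$PSScriptRoot\Logging-Overrides.ps1"

# From here on, Write-Debug calls also end up in C:\logs\my-component\<yyyyMMdd>.log.
Write-Debug "Starting work"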

Finally I had a logging implementation that didn’t make me sad. It wrote reliably to a file, and I could easily use it from within a Windows Scheduled Task (after I started creating an actual .ps1 file for the scheduled task to execute that is).

Conclusion

The logging solution that I ended up putting into place isn’t perfect, but it seems to be a hell of a lot better than the other two approaches that I tried.

It outputs in a format that people are used to seeing, and is easily parsable by both machines and people. If I wanted to, I could import the log information into ELK using Logstash, which would let me keep track of the execution of the scheduled tasks remotely. Honestly that’s probably something I’ll put into place eventually, just not right now.

It was nice to be able to wrap functionality provided by something else without having to have written the code explicitly for that in the first place. Coming from C#, I typically code to interfaces, so I’m used to being able to replace things as necessary, but that requires effort up front. Here I simply replaced some system provided cmdlets and made them do something slightly different based on my requirements. I can’t count the number of times I’ve wanted the ability to do that sort of thing with System.Diagnostics.Trace or Console or DateTime (usually for testing purposes).

Having that sort of power is both amazing and dangerous (amazingly dangerous?). It’s very easy to completely break other, unrelated things when you are essentially replacing the implementation of a global static function.

Ah well, with great power, comes great responsibility I suppose.

Wouldn’t want Uncle Bob to be killed by some rogue code in a tragically ironic accident.


As I’ve said repeatedly before, logging is easily one of the most important things that comes out of a piece of software. Good logging can be used for all sorts of purposes, but most importantly, it makes maintaining and supporting a piece of software much much easier. I won’t go into any more detail about how important I think logging is right now, as I’m going to use this post to talk about dealing with the bits and pieces that logging often leaves behind.

Mostly that means log files.

In our environment setup, we have drifted towards a pattern where all log files are written into a common logs directory, with subdirectories for each component.

If we want to find anything out about our components, we can just go to the logs directory. Nice and easy, no need to waste time hunting over the file system for logs that matter. Of course, it’s only used as a backup, as all of our logs are aggregated into ELK, but it’s still useful.

At this point in time, most of our components simply create log files on disk, and write everything there. It’s then up to an external component (Logstash in most cases, nxlog in very rare cases) to take those files, do something to them and write the events to some other location for aggregation. We have log files being created by Squid proxies, performance counters, web services, IIS, Windows services and Powershell scripts, and probably many other things that aren't occurring to me right now.

Of course, the downside of this extensive logging is that you are constantly consuming more and more disk space. Some logging frameworks and mechanisms manage that for you (either with a maximum directory size or some maximum number of days of log files), but not all.

A great side effect of having all logs go to the same directory, is that it’s easy to write something to clean up that directory based on some retention policy (not written in the last 14 days for example).

So we did just that.

Scheduled Annihilation

Every environment that we create has two Powershell scripts executed on all machines by default.

The first sets up the logs directory at the common location, ensuring that it is writable by all appropriate users, and that all subdirectories and files are also appropriately writable.

We implemented this because when we set up a Performance Counter to write a log file containing system stats, it created the directory with an owner of System, and everything else just failed miserably from that point forward, completely breaking the common log location right off the bat. It was, for some reason, stupidly difficult to claim the folder back and fix its permissions as well, so we just implemented a script to configure the folder ahead of time.

function InitialiseCommonLogsDirectory
{
    $directoryPath = "C:\logs"
    if (-not (Test-Path $directoryPath))
    {
        Write-Verbose "Creating common logs directory at [$directoryPath]"
        New-Item $directoryPath -Type Directory 
    }

    Write-Verbose "Making common logs directory [$directoryPath] writeable by everyone."
    $Acl = Get-Acl $directoryPath

    $Ar = New-Object System.Security.AccessControl.FileSystemAccessRule("Everyone", "FullControl", "ContainerInherit,ObjectInherit", "None", "Allow")

    $Acl.SetAccessRule($Ar)
    Set-Acl $directoryPath $Acl
}

The second script installs a Scheduled Task to run a fairly simple piece of Powershell that just deletes all files that haven’t been written to in the last 14 days. The assumption was that the physical files on disk were relatively unimportant assuming the log aggregation was working correctly, but we keep a small window just in case.

Annoyingly, the commands for setting up a scheduled task via Powershell are a bit arcane, mostly because you have to resort to using COM services to get the job done. The script to setup the scheduled task looked like this:

function CreateLogsClearingTask
{
    $directoryPath = "C:\logs"

    $taskName = "Clear Old Logs"

    $service = new-object -ComObject("Schedule.Service")
    $service.Connect()
    $rootFolder = $service.GetFolder("\")
    try
    {
        $rootFolder.DeleteTask($taskName, 0)
    }
    catch
    {
        if (-not ($_.Exception.Message -like "*The system cannot find the file specified*"))
        {
            throw $_.Exception
        }
    }

    $daysOld = 14
    $taskDescription = "Clears Logs that are older than [$daysOld] days"
    $taskCommand = "powershell"
    $taskArguments = "-Command `"&{ Get-ChildItem $directoryPath\* -Include *.log -Recurse | Where LastWriteTime -lt (Get-Date).AddDays(-$daysOld) | Remove-Item }`" -ExecutionPolicy Bypass"
 
    write-verbose "Creating Scheduled Task to automatically execute [$taskCommand] with arguments [$taskArguments] on a regular basis."

    $TaskStartTime = [DateTime]::Now.Date.AddHours(1)
 
    $TaskDefinition = $service.NewTask(0) 
    $TaskDefinition.RegistrationInfo.Description = "$taskDescription"
    $TaskDefinition.Settings.Enabled = $true
    $TaskDefinition.Settings.AllowDemandStart = $true
 
    $triggers = $TaskDefinition.Triggers
    #http://msdn.microsoft.com/en-us/library/windows/desktop/aa383915(v=vs.85).aspx
    $trigger = $triggers.Create(2)
    $trigger.StartBoundary = $TaskStartTime.ToString("yyyy-MM-dd'T'HH:mm:ss")
    $trigger.Enabled = $true
    $trigger.DaysInterval = 1
 
    # http://msdn.microsoft.com/en-us/library/windows/desktop/aa381841(v=vs.85).aspx
    $Action = $TaskDefinition.Actions.Create(0)
    $action.Path = "$taskCommand"
    $action.Arguments = "$taskArguments"
 
    #http://msdn.microsoft.com/en-us/library/windows/desktop/aa381365(v=vs.85).aspx
    $rootFolder.RegisterTaskDefinition("$TaskName",$TaskDefinition,6,"System",$null,5)

    $task = $rootFolder.GetTask($taskName)

    Write-Verbose "Executing Scheduled Task."
    $task.Run($null)
}

Not that friendly.

You may be thinking, why not just use the awesome scheduled task cmdlets available in later versions of Windows, New-ScheduledTaskTrigger, Get-ScheduledTask and Register-ScheduledTask. These commands are extremely useful and easy to understand, but unfortunately, in our particular case, we needed to target operating systems where those features are not available. Specifically, our build server (which runs the automated tests for the Powershell scripts) is running an older version of Windows Server, where the cmdlets are unavailable. Sometimes you just have to target the lowest common denominator.

Straightforward, right? I mean, it looks like it does the job.

It did not.

Logs! Logs Everywhere!

We didn’t notice for a pretty long time, because the growth was tiny on our non-production environments, but once we started to have some real traffic, the log files for IIS and the Web Service started to add up quickly. Luckily, we have dashboards that track the disk usage of our various server components, and someone happened to see the pattern of the % free space on machines dropping steadily over time.

A closer look at the script above will show you that it was failing because it was trying to clean up C:\logs\logs, which of course doesn’t exist. The directory variable had C:\logs already in it, and the Powershell script was appending another \logs to it before it was being run.

It turns out that the task had been failing continuously, pretty much since we implemented it, and because the leading indicator of failure was missing (a failing task), no-one noticed. The task thought it had succeeded because Powershell returns an exit code of 0 when an exception is thrown.

I cleaned up the script, wrote a test to verify that it was actually deleting old files (without error), incorporated the fixed scripts into our environment setup and deployed the fix to the live machines.

Once deployed, the task said it succeeded, but looking at the log files, they hadn’t actually been cleaned up at all. It had failed silently again.

Powershell and Failure: Close Friends

The root cause for the task not indicating failure correctly was that the Powershell executable, for some ungodly reason, returns an exit code of 0 when an exception is thrown. If you want to make it act like a normal application would (non-zero exit code = bad, zero = good) you have to wrap your script in a try…catch at the root and exit with a non-zero code yourself.
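The shape of the wrapper is simple enough; a sketch, where the script path is a placeholder for whatever the task actually runs:

# Without the explicit exit code, powershell.exe reports success even when the script throws.
try
{
    & "C:\scripts\clear-old-logs.ps1"
}
catch
{
    Write-Error $_
    exit 1
}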

A little bit more refactoring around the Powershell scripts that create the scheduled task, and I had it automatically wrapping the supplied script with a try..catch and exiting with a failure if an exception occurred. I’ve included the script that creates the task below, but its only a part of a bigger system (which you can see in the Solavirum.Scripts.Common repository, when I finish updating it).

function New-DailyPowershellScheduledTask
{
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$taskName,
        [string]$taskDescription,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$script,
        [TimeSpan]$TimeOfDay="0:00",
        [switch]$ExecuteImmediately=$true
    )

    $service = new-object -ComObject("Schedule.Service")
    $service.Connect()
    $rootFolder = $service.GetFolder("\")

    try
    {
        $command = "powershell.exe"
        $arguments = _ConvertPowershellScriptIntoArgumentsForScheduledTask -script $script -taskName $taskName

        $TaskStartTime = [DateTime]::Now.Date.Add($TimeOfDay)
 
        $TaskDefinition = $service.NewTask(0) 
        $TaskDefinition.RegistrationInfo.Description = "$taskDescription"
        $TaskDefinition.Settings.Enabled = $true
        $TaskDefinition.Settings.AllowDemandStart = $true
 
        $triggers = $TaskDefinition.Triggers
        #http://msdn.microsoft.com/en-us/library/windows/desktop/aa383915(v=vs.85).aspx
        $trigger = $triggers.Create(2)
        $trigger.StartBoundary = $TaskStartTime.ToString("yyyy-MM-dd'T'HH:mm:ss")
        $trigger.Enabled = $true
        $trigger.DaysInterval = 1
 
        # http://msdn.microsoft.com/en-us/library/windows/desktop/aa381841(v=vs.85).aspx
        $Action = $TaskDefinition.Actions.Create(0)
        $action.Path = "$command"
        $action.Arguments = "$arguments"
 
        #http://msdn.microsoft.com/en-us/library/windows/desktop/aa381365(v=vs.85).aspx
        $rootFolder.RegisterTaskDefinition("$TaskName",$TaskDefinition,6,"System",$null,5)

        $task = $rootFolder.GetTask($taskName)

        if ($ExecuteImmediately)
        {
            Write-Verbose "An immediate execution of the just created scheduled task [$taskName] was requested. Executing now."
            $task.Run($null)
        }
    }
    finally
    {
        [System.Runtime.Interopservices.Marshal]::ReleaseComObject($service)
    }
}

function _ConvertPowershellScriptIntoArgumentsForScheduledTask
{
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$script,
        [string]$taskName
    )

    $taskLogsDirectoryName = "$($taskName.Replace(' ', '_'))"
    $rootLogsDirectoryPath = "C:\logs\scheduled-tasks"
    $logsDirectoryPath = "$rootLogsDirectoryPath\$taskLogsDirectoryName"
    $logDirectory = New-Item -ItemType Directory -Path $logsDirectoryPath -Force
    return "-command `"& { try { `$timestamp = [DateTime]::Now.ToString('yyyyMMdd'); Start-Transcript -Path $logsDirectoryPath\`$timestamp.log -Append; $script } catch { Stop-Transcript; exit 1; } }`" -ExecutionPolicy Bypass"
}

Now that I had fixed the issue with it failing to fail in an obvious way, it was time to add some logging, so I could see why it wasn’t deleting the logs.

Using the code that wraps the incoming script in a try…catch, I added a call to Start-Transcript at the start of the script, and a call to Stop-Transcript at the end. Start-Transcript is a useful feature built into Powershell that lets you write to a file everything interesting that happened during the transcript block. It’s not perfect (it doesn’t seem to record Verbose statements), but it does record the output stream and any errors and warnings, which is enough.

Be careful. If you supply Start-Transcript with a file path inside a directory that does not exist, it appears to just fail silently. I’m pretty sure I tried the -Force flag (which usually deals with these things), but it didn’t seem to help. Make sure your destination directory exists before you try to start the transcript.
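Guarding against that is easy enough; a sketch, mirroring the task wrapper described above:

# Make sure the destination directory exists before starting the transcript,
# otherwise the transcript fails silently.
$logsDirectoryPath = "C:\logs\scheduled-tasks\Clear_Old_Logs"
New-Item -ItemType Directory -Path $logsDirectoryPath -Force | Out-Null

Start-Transcript -Path "$logsDirectoryPath\$([DateTime]::Now.ToString('yyyyMMdd')).log" -Append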

I made sure that the log files coming out of the scheduled task were appropriately date stamped (so they too, could in turn be cleaned up) and deployed the updated scheduled task again.

The Root of My Problems

The transcript file showed that I was getting an Access Denied exception when the script was running as a task. Obviously this did NOT happen in my tests, so it was something specific to the deployed environments.

Fair enough, that can happen with scheduled tasks, because they default to running as the System account, which doesn’t necessarily have normal permissions.

The interesting thing was that the root logs directory was specifically marked as full control to all appropriate users, and that included the System account. Checking the permissions on the directory with Get-Acl proved that to be true. System clearly had the ability to do whatever it liked to the logs directory and its content.

So where was the Access Denied error coming from?

Time to go back to debugging basics. It was time for Process Monitor (procmon). Process Monitor is one of those awesome Sysinternals tools supplied by Microsoft/someone at Microsoft. It lets you monitor every single event happening within Windows, including interactions with the registry, file system and all sorts of other things. Also, more importantly for the maintenance of sanity, it lets you filter those events.

Firing up procmon and running the scheduled task showed that it was clearly deleting the very first file that it found that was old enough to be deleted, and then stopping. There were no obvious errors coming from the file system; the process just seemed to stop trying. If I ran the task again, it actually had deleted the first file, and it would delete the next one (which was now the first) and stop again.

I still have no idea why it was doing this. I suspect it might have had something to do with deleting files during a streaming operation in Powershell (i.e. Get|Where|Delete), but if that were the case, I would think it would happen during my test as well.

I fixed the issue by wrapping the Get|Where in brackets, to make sure it was evaluated before it was streamed to the Delete functions, kind of like using a ToArray or ToList in C# when enumerating through a collection you intend to remove values from.
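In practice, the fix was just a pair of brackets around the first part of the pipeline (the path and cutoff mirror the task described above):

# Before: files are deleted while Get-ChildItem is still streaming its results.
Get-ChildItem C:\logs\* -Include *.log -Recurse | Where LastWriteTime -lt (Get-Date).AddDays(-14) | Remove-Item

# After: the brackets force the Get|Where portion to be fully evaluated before anything is deleted.
(Get-ChildItem C:\logs\* -Include *.log -Recurse | Where LastWriteTime -lt (Get-Date).AddDays(-14)) | Remove-Item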

Once I made that change everything worked. The files were deleted, the task indicated success and all was well in the world.

Conclusion

There are a number of lessons to come out of this, but the most important is to make sure that your failure indicators actually indicate failure when a failure occurs, which is really just a specific application of “test all the things”.

It’s not enough to just test it yourself either (in a unit test or similar); you must test in the environment that it will actually be deployed to, because the differences catch you out. In addition to testing, you need to have indicators or events that you monitor in order to make sure everything is acting as you expected. In our case, the monitoring that we had on the machines allowed us to notice the issue, but if no-one was paying attention, we would have probably had issues when the disk finally filled up.

It would have been better to have alerts that warned us when something was trending the wrong way. We probably glossed over that sort of thing because we thought it was a solved problem (after all, we had a script), so I suppose the moral is to question and check everything, even if you think it’s fine.

Because it’s probably not.


So it turns out that Amazon delete their pre-packaged Windows AMI’s much quicker than they delete their Linux ones.

I’ve known about this for a little while, but it wasn’t until recently that it bit me.

One of the first environments that we created using the new strategy of “fully codified and automated environment setup”, recently hit a little snag. It was a production environment, primarily intended for beta testing, and the understanding was that it would be refreshed (using the latest version of the environment setup scripts) before it became available to the general public.

Late last week, one of our scripts that shuts down AWS EC2 instances based on time of day (to limit costs) accidentally terminated both of the EC2 instances that make up the API layer in that particular environment. Normally, this wouldn’t be an issue. The Auto Scaling Group would realise that it no longer had as many instances as it should and it would recreate them. The API instances are mostly stateless, so after a small amount of time, everything would be fine again.

As I’m sure you can imagine, this did not happen.

Amazon has since removed the AMI that the API instances were based off, so the ASG couldn’t spin up any new instances to replace the ones that were terminated.

The service was down, and it was going to stay that way for a while until we managed to fix the root issue.

The Immediate Response

The first thing we did was update the existing CloudFormation stack to use the latest version of the Windows AMI that we were previously using. This at least allowed the API instances to be created. However, they never finished their initialization.

In the time between when the environment was initially provisioned and the time when it was accidentally destroyed, we had made quite a few changes to the common scripts that back our environment provisioning process. One of those was to specify the version of the Octopus Tentacle that was going to be installed on the machine. We had previously run into an issue when Octopus 3 was released where the latest tentacle no longer worked the same way, and with little time to investigate it, we simply hardcoded the version that was installed to the one we had previously been using.

In order to fix this issue in the old environment we had to fix the script. Whenever an environment is provisioned, the scripts that it depends on are uploaded to S3, ready to be downloaded by EC2 instances and other resources that need access to them. Rather than manually dig in to the dependencies, it was just easier to do the planned environment refresh.

This went…okay. Not great, not terrible, but the problem was fixed and everything came back online before anyone tried to use the service the next day.

Fixing It Properly

I had actually been aware of the AMI missing issue for some time and was already working on a longer term fix. In fact, I had scheduled an environment refresh of the production/beta environment for this service for a bit later in the same week the incident happened. It was just unfortunate that the incident forced my hand.

The root cause of the issue is that we did not control all of the elements in the environment, specifically the AMI used. Having external dependencies isn’t always an issue (for example we use Nuget extensively, but old Nuget packages are generally left alone), but Amazon makes no guarantees as to the availability of the AMI’s they supply as time progresses. The solution is to create your own AMI’s, so that Amazon can’t just delete them out from underneath you.

There are upsides and downsides to managing your own AMI’s.

The primary upside is that you avoid issues like I’ve described above. Nobody external to your company is going to go and delete an AMI when you still have a dependency on it. Obviously if someone internal to your organization deletes the AMI you still have the same issue, but you at least have much more control over that situation.

Another upside is that you can include commonly installed software for your environments in your custom AMI’s. For us, that would be things like an unregistered Octopus Tentacle or the .NET Framework and ASP.NET (not for all machines, but at least for all API instances).

The primary downside is that you can no longer easily take advantage of the fact that new Amazon AMI’s are released on a regular basis, containing Windows Updates and other fixes (which are critically important to apply to machines that are exposed to the greater internet). You can still take advantage of those new AMI’s, it’s just a little bit more difficult.

Another downside is that you now have to manage your own AMI’s. This isn’t particularly difficult to be honest, but it is one more thing that you need to take care of, and I much prefer to simplify things rather than add more complexity.

The Mechanism

In an effort to avoid much of the manual work that can go into creating an AMI, I looked for a solution that was automated. I wanted to be able to run a process that simply spat out a customised AMI at the end, so that we could easily take advantage of new AMI’s as Amazon released them, and then refresh our environments as required.

Initially I looked into automating the process myself, using the various API’s available for AWS. I’d already done some work previously in creating an EC2 instance for the purposes of updating an AMI, so I started with that.

Shortly after, someone informed me of the existence of Packer.

Packer is a wonderful little application that allows you to create AMI’s and virtual machines for a number of virtualisation platforms. It even works on Windows, without having to install some arcane dependency chain through the command line. It’s just a collection of executables, which is nice.

Using Packer, I could put together the following configuration file that describes how I want my AMI to be structured.

{
    "variables" : {
        "aws_access_key" : "",
        "aws_secret_key" : "",
        "aws_region" : "",
        "source_ami" : "",
        "ami_name" : "",
        "user_data_file_path" : "",
        "octopus_api_key" : "",
        "octopus_server_url" : ""
    },
    "builders" : [{
            "type" : "amazon-ebs",
            "access_key" : "{{user `aws_access_key`}}",
            "secret_key" : "{{user `aws_secret_key`}}",
            "region" : "{{user `aws_region`}}",
            "source_ami" : "{{user `source_ami`}}",
            "instance_type" : "m3.large",
            "ami_name" : "{{user `ami_name`}}-{{timestamp}}",
            "user_data_file" : "{{user `user_data_file_path`}}",
            "vpc_id" : "vpc-a0a6aec9",
            "subnet_id" : "subnet-5908182d",
            "security_group_ids" : ["sg-0b65076e", "sg-4d188f28", "sg-faaf429f"],
            "ssh_keypair_name" : "YourKeyPair",
            "ssh_private_key_file":"C:\\temp\\YourKeyPair.pem",
            "communicator" : "winrm",
            "winrm_username" : "Administrator",
            "winrm_port" : 5985
        }
    ],
    "provisioners" : [
        {
            "type" : "powershell",
            "inline" : [
                "try",
                "{",
                    "$signalFilePath = \"C:\\signal\"",
                    "$content = Get-Content $signalFilePath",
                    "$maxWaitSeconds = 3000",
                    "$currentWaitSeconds = 0",
                    "$waitSeconds = 30",
                    "while ($content -eq \"1\" -and $currentWaitSeconds -lt $maxWaitSeconds) { Sleep -Seconds $waitSeconds; Write-Output \"Checking signal\"; $currentWaitSeconds += $waitSeconds; $content = Get-Content $signalFilePath; if ($content -eq \"-1\") { Write-Output \"User data signalled -1, indicating failure.\"; exit 1 } }",
                "}",
                "catch",
                "{",
                    "Write-Ouput \"An unexpected error occurred.\"",
                    "Write-Output $_",
                    "exit 1",
                "}"
            ]
        },
        {
            "type":"powershell",
            "scripts": [
                "@@ROOT_DIRECTORY_PATH\\scripts\\packer\\Ec2Config.ps1"
            ]
        }
    ]
}

The first part of the template describes various things about the EC2 instance that will be used to create the AMI, and the second part describes operations to perform on the instance in order to configure it the way you want it.

Note that the security groups used in the template above simply describe (in order): the ability to connect via RDP and Windows Remote Management, Unfiltered Access Out, and Octopus Tentacle Port In.

The configuration I’ve shared above is from our baseline Octopus Tentacle capable image. It comes with an Octopus Tentacle installed, but not configured (because it’s much more reliable to configure it at initialization time in CloudFormation).

The instance configuration is broken into a few parts:

  1. Use UserData to run some scripts that configure the proxy (so the machine can get to the internet) and download some dependencies, plus some other miscellaneous configuration (WinRM, Firewall, etc).
  2. Use the Powershell script execution from Packer to run some scripts from the dependencies downloaded in 1.) to download and install an Octopus Tentacle.
  3. Some other miscellaneous configuration.

Nothing too fancy.

The Windows support for Packer is still a bit on the rough side, mostly due to the fact that doing this sort of thing with Windows machines is inherently more complicated than it is for Linux boxes. Luckily for me, I started using Packer after the Windows plugins were incorporated into the primary build of the application, so I didn’t have to do anything special to get Windows support.

Gotchas

It definitely wasn’t all smooth sailing though.

The documentation for the creation of Windows AMI’s from Packer is a little sparse, so I had to do some trial and error in order to figure out how everything fit together.

The main mechanism for executing scripts remotely on Windows is WinRM (Windows Remote Management), which is basically Powershell remote execution. As such, you need to make sure that you allow access to the machine over port 5985 or nothing will work. It won’t fail straight away either; it will time out, which can take upwards of 10 minutes.

You also need to make sure that you specify WinRM as the communication method. Most of the template examples on the web use SSH (because Linux), so it’s not immediately obvious that you can actually switch to a different communication method.

Finally, you need to include the EC2 and Bundle config files to tell the instance that it needs to run sysprep; otherwise it won’t regenerate a new Administrator password when you use the AMI to create a new EC2 instance (and thus you wouldn’t be able to retrieve the password from the AWS API). It will also have saved state on it from last time, so it’s definitely better to run sysprep for an AMI that will be used generically.

Conclusion

I’ve uploaded a sanitised copy of the repository containing my Packer templates and scripts to Github. If you look, you can see that I haven’t done anything particularly fancy. All I’ve done is wrap the execution of Packer in some Powershell scripts to make it easier to run. I have two different scripts to create the two AMI’s that we need right now (Octopus capable + pre-installed IIS/.NET Framework), and when you run either of them with the appropriate parameters a brand new, timestamped AMI will be created in the appropriate AWS account.

Creating our own AMI’s fixes the scaling issue that started this whole blog post. Since we control them, we can be sure that they won’t be deleted and our ability to scale via Auto Scaling Groups is maintained for the life of the environment. Another benefit of this approach is that the provisioning of an environment is now quicker, as some of the components (especially IIS/.NET Framework) are now pre-installed for the components that require them. Considering our environments can take upwards of 20 minutes to provision, every minute counts.

The whole process of creating these AMI’s via Packer took me about a day or two, so it definitely wasn’t the most time consuming task I’ve ever completed.

Incorporating the AMI’s into our environment provisioning scripts was trivial, as they already searched for the appropriate AMI to use dynamically; I just had to change the search parameters.

In the end I’m fairly pleased with Packer and how easy it made the AMI creation process. If I had to use the AWS Powershell cmdlets (or the CLI app) directly for all of this, I probably would have wasted a lot of time.

And sanity.


Now that we’ve (somewhat) successfully codified our environment setup and are executing it automatically every day with TeamCity, we have a new challenge. Our setup scripts create an environment that has some set of features/bugs associated with it. Not officially (we’re not really tracking environment versions like that), but definitely in spirit. As a result, we need to update environments to the latest version of the “code” whenever we fix a bug or add a feature. Just like deploying a piece of software.

To be honest, I haven’t fully nailed the whole codified environment thing just yet, but I am getting closer. Giving it some thought, I think I will probably move towards a model where the environment is built and tested (just like a piece of software) and then packaged and versioned, ready to be executed. Each environment package should consist of installation and uninstallation logic, along with any other supporting actions, in order to make them as self contained as possible.

That might be the future. For now, we simply have a repository with scripts in it for each of our environments, supported by a set of common scripts.

The way I see it, environments fall into two categories.

  1. Environments created for a particular task, like load testing or some sort of experimental development.
  2. Environments that take part in your deployment pipeline.

The fact that we have entirely codified our environment setup gives us the power to create an environment for either of the above. The first point is not particularly interesting, but the second one is.

We have 3 standard environments, which are probably familiar to just about anyone (though maybe under different names). They are CI, Staging and Production.

CI is the environment that is recreated every morning through TeamCity. It is used for continuous integration/deployment, and is typically not used directly for manual testing/demonstration/anything else. It forms an important part of the pipeline, as after deployment, automated functional tests are run on it, and if successful that component is (usually) automatically propagated to Staging.

Staging is, for all intents and purposes, a Production level environment. It is stable (only components that fully pass all of their tests are deployed here) and is used primarily for manual testing and feature validation, with a secondary focus on early integration within a trusted group of people (which may include external parties and exceptional customers).

Production is of course production. It’s the environment that the greater world uses for any and all executions of the software component (or components) in question. It is strictly controlled and protected, to make sure that we don’t accidentally break it, inconveniencing our customers and making them unhappy.

The problem is, how do you get changes to the underlying environment (i.e. a new version of it) into Staging/Production, without losing any state held within the system? You can’t just recreate the environment (like we do each morning for CI), because the environment contains the state, and that destroys it.

You need another process.

Migration.

Birds Fly South for the Winter

Migration, for being such a short word, is actually incredibly difficult.

Most approaches that I’ve seen in the past involved some sort of manual migration strategy (usually written down and approved by 2+ people), executed by some long suffering operations person at midnight, when hopefully no-one is trying to use the environment for its intended purpose.

A key component to any migration strategy: What happens if it goes wrong? Otherwise known as a rollback procedure.

This is, incidentally, where everything gets pretty hard.

With our environments being entirely codified in a mixture of Powershell and CloudFormation, I wanted to create something that would automatically update an environment to the latest version, without losing any of the data currently stored in the environment, and in a safe way.

CloudFormation offers the ability to update a stack after it has been created. This way you can change the template to include a new resource (or to change existing resources) and then execute the update and have AWS handle all of the updating. This probably works fine for most people, but I was uneasy at the prospect. Our environments are already completely self contained and I didn’t understand how CloudFormation updates would handle rollbacks, or how updates would work for all components involved. I will go back and investigate it in more depth at some point in the future, but for now I wanted a more generic solution that targeted the environment itself.

My idea was fairly simple.

What if I could clone an environment? I could make a clone of the environment I wanted to migrate, test the clone to make sure all the data came through okay and its behaviour was still the same, delete the old environment and then clone the temporary environment again, into the original environment’s name. At any point up to the delete of the old environment I could just stop, and everything would be the same as it was before. No need for messy rollbacks that might only do a partial job.

Of course, the idea is not actually all that simple in practice.

A Perfect Clone

In order to clone an environment, you need to identify the parts of the environment that contain persistent data (and would not automatically be created by the environment setup).  Databases and file storage (S3, disk, etc) are examples of persistent data. Log files are another example of persistent data, except they don’t really matter from a migration point of view, mostly because all of our log entries are aggregated into an ELK stack. Even if they weren’t aggregated, they probably still wouldn’t be worth spending time on.

In the case of the specific environment I’m working on for the migration this time, there is an RDS instance (the database) and at least one S3 bucket containing user data. Everything else about the environment is transient, and I won’t need to worry about it.

Luckily for me, cloning an RDS instance and an S3 bucket is relatively easy.

With RDS you can simply take a snapshot and then use that snapshot as an input into the RDS instance creation on the new environment. Fairly straightforward.

function _WaitRdsSnapshotAvailable
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [string]$snapshotId,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsRegion,
        [int]$timeoutSeconds=3000
    )

    write-verbose "Waiting for the RDS Snapshot with Id [$snapshotId] to be [available]."
    $incrementSeconds = 15
    $totalWaitTime = 0
    while ($true)
    {
        $a = Get-RDSDBSnapshot -DBSnapshotIdentifier $snapshotId -Region $awsRegion -AccessKey $awsKey -SecretKey $awsSecret
        $status = $a.Status

        if ($status -eq "available")
        {
            write-verbose "The RDS Snapshot with Id [$snapshotId] has exited [$testStatus] into [$status] taking [$totalWaitTime] seconds."
            return $a
        }

        write-verbose "Current status of RDS Snapshot with Id [$snapshotId] is [$status]. Waiting [$incrementSeconds] seconds and checking again for change."

        Sleep -Seconds $incrementSeconds
        $totalWaitTime = $totalWaitTime + $incrementSeconds
        if ($totalWaitTime -gt $timeoutSeconds)
        {
            throw "The RDS Snapshot with Id [$snapshotId] was not [available] within [$timeoutSeconds] seconds."
        }
    }
}

... snip some scripts getting CFN stacks ...

$resources = Get-CFNStackResources -StackName $sourceStack.StackId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

$rds = $resources |
    Single -Predicate { $_.ResourceType -eq "AWS::RDS::DBInstance" }

$timestamp = [DateTime]::UtcNow.ToString("yyyyddMMHHmmss")
$snapshotId = "$sourceEnvironment-for-clone-to-$destinationEnvironment-$timestamp"
$snapshot = New-RDSDBSnapshot -DBInstanceIdentifier $rds.PhysicalResourceId -DBSnapshotIdentifier $snapshotId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

_WaitRdsSnapshotAvailable -SnapshotId $snapshotId -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion
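The other half of the process is feeding the snapshot into the RDS instance creation for the new environment. In our case that happens through the environment creation scripts, but as a sketch of the underlying operation, a direct restore via the AWS Powershell cmdlets looks something like this (the destination instance identifier is a placeholder):

# A sketch only; our actual scripts pass the snapshot through the environment creation
# process rather than restoring the instance directly.
Restore-RDSDBInstanceFromDBSnapshot -DBSnapshotIdentifier $snapshotId -DBInstanceIdentifier "$destinationEnvironment-db" -Region $awsRegion -AccessKey $awsKey -SecretKey $awsSecret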

With S3, you can just copy the bucket contents. I say just, but in reality there is no support for a “sync” command in the AWS Powershell cmdlets. There is a sync command on the AWS CLI though, so I wrote a wrapper around the CLI and execute the sync command there. It works pretty nicely. Essentially it’s broken into two parts: the part that deals with actually locating and extracting the AWS CLI to a known location, and the part that actually does the clone. The only difficult bit was that you don’t seem to be able to just supply credentials to the AWS CLI executable, at least in a way that I would expect (i.e. as parameters). Instead you have to use a profile, or use environment variables.

function Get-AwsCliExecutablePath
{
    if ($rootDirectory -eq $null) { throw "rootDirectory script scoped variable not set. That's bad, it's used to find dependencies." }
    $rootDirectoryPath = $rootDirectory.FullName

    $commonScriptsDirectoryPath = "$rootDirectoryPath\scripts\common"

    . "$commonScriptsDirectoryPath\Functions-Compression.ps1"

    $toolsDirectoryPath = "$rootDirectoryPath\tools"
    $nugetPackagesDirectoryPath = "$toolsDirectoryPath\packages"

    $packageId = "AWSCLI64"
    $packageVersion = "1.7.41"

    $expectedDirectory = "$nugetPackagesDirectoryPath\$packageId.$packageVersion"
    if (-not (Test-Path $expectedDirectory))
    {
        $extractedDir = 7Zip-Unzip "$toolsDirectoryPath\dist\$packageId.$packageVersion.7z" "$toolsDirectoryPath\packages"
    }

    $executable = "$expectedDirectory\aws.exe"

    return $executable
}

function Clone-S3Bucket
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [string]$sourceBucketName,
        [Parameter(Mandatory=$true)]
        [string]$destinationBucketName,
        [Parameter(Mandatory=$true)]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [string]$awsRegion
    )

    if ($rootDirectory -eq $null) { throw "RootDirectory script scoped variable not set. That's bad, it's used to find dependencies." }
    $rootDirectoryPath = $rootDirectory.FullName

    . "$rootDirectoryPath\scripts\common\Functions-Aws.ps1"

    $awsCliExecutablePath = Get-AwsCliExecutablePath

    $previousAWSKey = $env:AWS_ACCESS_KEY_ID
    $previousAWSSecret = $env:AWS_SECRET_ACCESS_KEY
    $previousAWSRegion = $env:AWS_DEFAULT_REGION

    $env:AWS_ACCESS_KEY_ID = $awsKey
    $env:AWS_SECRET_ACCESS_KEY = $awsSecret
    $env:AWS_DEFAULT_REGION = $awsRegion

    & $awsCliExecutablePath s3 sync s3://$sourceBucketName s3://$destinationBucketName

    $env:AWS_ACCESS_KEY_ID = $previousAWSKey
    $env:AWS_SECRET_ACCESS_KEY = $previousAWSSecret
    $env:AWS_DEFAULT_REGION = $previousAWSRegion
}

I do have some concerns that as the bucket gets bigger, the clone will take longer and longer. I’ll cross that bridge when I come to it.

Using the identified areas of persistence above, the only change I need to make is to alter the new environment script to take them as optional inputs (specifically the RDS snapshot). If they are supplied, it will use them, if they are not, it will default to normal creation.
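As a sketch of what that looks like (the parameter name is illustrative, not the actual script):

# Illustrative only: an optional snapshot identifier, passed through to the RDS
# creation when supplied, ignored otherwise.
param
(
    [string]$rdsSnapshotIdentifier
)

if (-not [string]::IsNullOrEmpty($rdsSnapshotIdentifier))
{
    Write-Verbose "Creating the environment database from RDS snapshot [$rdsSnapshotIdentifier]."
    # ... supply the snapshot to the RDS instance creation ...
}
else
{
    Write-Verbose "No snapshot supplied. Creating the environment database from scratch."
    # ... normal creation ...
}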

Job done, right?

A Clean Snapshot

The clone approach works well enough, but in order to perform a migration on a system that is actively being used, you need to make sure that the content does not change while you are doing it. If you don’t do this, you can potentially lose data during a migration. The most common example would be if you clone the environment, but after the clone some requests occur and the data changes. If you then delete the original and migrate back, you’ve lost that data. There are other variations as well.

This means that you need the ability to put an environment into standby mode, where it is still running, and everything is working, but it is no longer accepting user requests.

Most of our environments are fairly simple and are based around web services. They have a number of instances behind a load balancer, managed by an auto scaling group. Behind those instances are backend services, like databases and other forms of persistence/scheduled task management.

AWS Auto Scaling Groups allow you to set instances into Standby mode, which removes them from the load balancer (meaning they will no longer have requests forwarded to them) but does not delete or otherwise terminate them. More importantly, instances in Standby can count towards the desired number of instances in the Auto Scaling Group, meaning it won’t go and create X more instances to service user requests, which obviously would muck the whole plan up.

This is exactly what we need to set our environment into a Standby mode (at least until we have scheduled tasks that deal with underlying data anyway). I took the ability to shift instances into Standby mode and wrapped it into a function for setting the availability of the environment (because that’s the concept that I’m interested in; the Standby mode instances are just a mechanism to accomplish that).

function _ChangeEnvironmentAvailability
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$environment,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsRegion,
        [Parameter(Mandatory=$true)]
        [ValidateSet("Available", "Unavailable")]
        [string]$availability,
        [switch]$wait
    )

    if ($rootDirectory -eq $null) { throw "RootDirectory script scoped variable not set. That's bad, it's used to find dependencies." }
    $rootDirectoryPath = $rootDirectory.FullName

    . "$rootDirectoryPath\scripts\common\Functions-Enumerables.ps1"

    . "$rootDirectoryPath\scripts\common\Functions-Aws.ps1"
    Ensure-AwsPowershellFunctionsAvailable

    $sourceStackName = Get-StackName -Environment $environment -UniqueComponentIdentifier (_GetUniqueComponentIdentifier)
    $sourceStack = Get-CFNStack -StackName $sourceStackName -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

    $resources = Get-CFNStackResources -StackName $sourceStack.StackId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

    Write-Verbose "_ChangeEnvironmentAvailability([$availability]): Locating auto scaling group in [$environment]"
    $asg = $resources |
        Single -Predicate { $_.ResourceType -eq "AWS::AutoScaling::AutoScalingGroup" }

    $asg = Get-ASAutoScalingGroup -AutoScalingGroupName $asg.PhysicalResourceId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

    $instanceIds = @()
    $standbyActivities = @()
    if ($availability -eq "Unavailable")
    {
        Write-Verbose "_ChangeEnvironmentAvailability([$availability]): Switching all instances in [$($asg.AutoScalingGroupName)] to Standby"
        $instanceIds = $asg.Instances | Where { $_.LifecycleState -eq [Amazon.AutoScaling.LifecycleState]::InService } | Select -ExpandProperty InstanceId
        if ($instanceIds | Any)
        {
            $standbyActivities = Enter-ASStandby -AutoScalingGroupName $asg.AutoScalingGroupName -InstanceId $instanceIds -ShouldDecrementDesiredCapacity $true -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
        }
    }
    elseif ($availability -eq "Available")
    {
        Write-Verbose "_ChangeEnvironmentAvailability([$availability]): Switching all instances in [$($asg.AutoScalingGroupName)] to Inservice"
        $instanceIds = $asg.Instances | Where { $_.LifecycleState -eq [Amazon.AutoScaling.LifecycleState]::Standby } | Select -ExpandProperty InstanceId
        if ($instanceIds | Any)
        {
            $standbyActivities = Exit-ASStandby -AutoScalingGroupName $asg.AutoScalingGroupName -InstanceId $instanceIds -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
        }
    }

    $anyStandbyActivities = $standbyActivities | Any
    if ($wait -and $anyStandbyActivities)
    {
        Write-Verbose "_ChangeEnvironmentAvailability([$availability]): Waiting for all scaling activities to complete"
        _WaitAutoScalingGroupActivitiesComplete -Activities $standbyActivities -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion
    }
}

With the only mechanism to affect the state of persisted data disabled, we can have some more confidence that the clone is a robust and clean copy.

To the Future!

I don’t think that the double clone solution is the best one; it’s just the best that I could come up with without having to make a lot of changes to the way we manage our environments.

Another approach would be to maintain 2 environments during migration (A and B), but only have one of those environments be active during normal operations. So to do a migration, you would spin up Prod A if Prod B already existed. At the entry point, you have a single (or multiple) DNS record that points to either A or B based on your needs. This one still involves cloning and downtime though, so for a high availability service, it won’t really work (our services can have some amount of downtime, as long as it is organized and communicated ahead of time).

Speaking of downtime, there is another approach that you can follow in order to do zero downtime migrations. I haven’t actually done it, but if you had a mechanism to replicate incoming requests to both environments, you could conceivably bring up the new environment, let it deal with the same requests as the old environment for long enough to synchronize and to validate that it works (without responding to the user, just processing the requests) and then perform the top level switch so that the new environment becomes the active one. At some point in the future you can destroy the old environment, when you are confident that it works as expected.

This is a lot more complicated, and involves some external component managing the requests (and storing a record of all requests ever made, at least back to the last backup point) as well as each environment knowing what request they last processed. It’s certainly not impossible, but if you can tolerate downtime, it’s probably not worth the effort.

Summary

Managing your environments is not a simple task, especially when you have actual users (and if you don’t have users, why are you bothering at all?). It’s very difficult to make sure that your production (or other live environment) does not stagnate in the face of changes, and is always kept up to date.

What I’ve outlined in this blog post is a single approach that I’ve been working on over the last week or so to deal with our specific environments. It’s not something that will work for everyone, but I thought it was worth writing down anyway, to show some of the thought processes that I needed to go through in order to accomplish the migration in a safe, robust fashion.

There is at least one nice side effect of this approach: we will now be able to clone any environment we want (without doing a migration, just the clone) and then use it for experiments or analysis.

I’m sure that I will run into issues eventually, but I’m pretty happy with the way the migration process is going. It was weighing on my mind for a while, so it’s good to get it done.


I mentioned last week that I was struggling with an issue with my environment setup scripts via CloudFormation.

Well, I finally got to the bottom of it, and I thought it would be a good thing to share, because I don’t want anyone else to have to struggle with finding the root cause behind the incredibly unhelpful error that I was getting. Hopefully, my pain becomes your knowledge.

To give a little bit of background, I use a combination of CloudFormation templates, Powershell and Octopus Deploy to set up environments for our products and services in AWS. This approach has proven to be extremely useful for a number of reasons, including codifying exactly what goes into an environment, reducing the amount of time it takes to set one up (compared to typical infrastructure requests) and reducing the reliance on one particular person or team knowing how to put it all together.

A little while ago I noticed that the environment setup that I use for my JMeter Load Test Workers was failing to run to completion. Every time I tried to setup the environment, it would fail with a “Create Failed” status, stating that one or more of the instances failed their configuration.

Looking into the issue, I found the following error in the cfn-init.log file (available at C:\cfn\log\cfn-init.log).

In case you are unfamiliar with CloudFormation, cfn-init is a program supplied by Amazon that works with CloudFormation and allows you to configure a series of steps to occur during the startup of an EC2 instance. Typically you use the UserData for an instance to execute cfn-init with the appropriate parameters; it then connects to CloudFormation, downloads the configuration metadata defined for that resource in the template and executes it on the local machine. It’s awesome, and I use it a lot.
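
To make that a little more concrete, the startup call ends up looking something like the snippet below. The stack, resource and region values here are placeholders for illustration; in the real templates the whole thing lives inside the instance UserData (wrapped in <powershell> tags for a Windows instance) and is assembled via Fn::Join and Refs.

# Placeholders only: the real stack, resource and region values are injected by the template.
cfn-init.exe -v --stack MyStack --resource LoadTestWorkerInstance --region ap-southeast-2

# Report the result back to the stack so a WaitCondition (or CreationPolicy) can pass or fail the create.
cfn-signal.exe -e $LASTEXITCODE --stack MyStack --resource LoadTestWorkerWaitCondition --region ap-southeast-2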

The relevant excerpt from the log file:

2015-06-12 03:06:02,674 [DEBUG] Running command a-deploy-loadtest
2015-06-12 03:06:02,674 [DEBUG] No test for command a-deploy-loadtest
2015-06-12 03:06:59,846 [INFO] Command a-deploy-loadtest succeeded
2015-06-12 03:06:59,846 [ERROR] -----------------------BUILD FAILED!------------------------
2015-06-12 03:06:59,846 [ERROR] Unhandled exception during build: 'utf8' codec can't decode byte 0xf2 in position 644: invalid continuation byte
Traceback (most recent call last):
    File "cfn-init", line 123, in <module>
    File "cfnbootstrap\construction.pyc", line 117, in build
    File "cfnbootstrap\construction.pyc", line 502, in build
    File "cfnbootstrap\construction.pyc", line 513, in run_config
    File "cfnbootstrap\construction.pyc", line 125, in run_commands
    File "cfnbootstrap\command_tool.pyc", line 116, in apply
    File "encodings\utf_8.pyc", line 16, in decode
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf2 in position 644: invalid continuation byte

Of course, that doesn’t make any sense, as I hadn’t changed anything to do with the encoding of the output. Even more confusingly, when I ran cfn-init using the exact same mechanism (the execution of the UserData batch file), on the exact same machine, it would work just fine. It only failed during instance startup.

I looked into it for a little while, but couldn’t find an obvious root cause. I assumed that I was doing something slightly different from the norm or that something had changed upstream to cause the error. This made sense, because all of the other places where I was using the exact same code were working just fine, and it seemed to be localised to just my load test worker environment.

I altered the environment setup so that it no longer required instances to set up successfully (by removing the call to cfn-signal after cfn-init, and removing the WaitCondition from the template), and manually initialized each instance after it was created before I ran my load tests.

Not great, but an acceptable workaround until it mattered more. I had bigger fish to fry.

The Virus Spreads

Then it happened somewhere else.

The new cloud based service we are working on needed an environment refresh to take advantage of scalable persistence in the form of RDS. We tore down the old CI environment, and then attempted to create a new one.

The mysterious “invalid UTF8 character” error reared its ugly head again.

I couldn’t brush it away this time, as this environment was much more important than my load test workers, so it was time to dig in and see what was really going on.

Taking the error at face value, I assumed that something had changed in the pipeline (Octopus, AWS, etc) and that there really was an invalid character. It was fairly obvious that cfn-init didn’t fail until after the first deployment had actually completed, so something seemed to be off in the output of that particular step.

"a-deploy-system-stats-logger": { "command": { "Fn::Join": [ "", [ "powershell.exe -ExecutionPolicy Bypass -Command c:\\cfn\\dependencies\\scripts\\environment\\Invoke-DeployLatestProjectForEnvironment.ps1 ",

" –Verbose ", " -OctopusServerApiKey ", { "Ref": "OctopusAPIKey" }, " -OctopusServerUrl ", { "Ref": "OctopusServerURL" }, " -OctopusEnvironment ", { "Ref": "OctopusEnvironment" }, " -ProjectName 'System Stats Logger'" ] ] }, "waitAfterCompletion": "0" }

I ran the script manually through C# and captured the raw output stream. I parsed the output as UTF8, and it came out just fine, so clearly the output itself wasn’t the problem.

At this point I took a slightly different approach. I knew that an old version of the environment setup for our new cloud service had worked just fine a few days ago, so I branched from that point in Git and executed the setup script.

It worked.

Now I at least had two copies of similar codebases that I could compare for differences to find what was breaking the deployments, and I knew that it wasn’t an environmental issue related to either Octopus Deploy or AWS components.

I used an application called WinMerge to diff the two repositories (after extracting them to different directories on disk), and found a number of differences, but nothing that was obviously the root cause. I thought that perhaps the problem was caused by upgrading to the latest version of the AWS Powershell package, but I rolled that back and still had the same issue.

Heads up WinMerge people, if you are on the internet at all: the statuses for “File Identical” and “File Different” in your application look incredibly similar at a glance. I thought all my files were identical initially, at least until I looked a little closer. Colours would be nice.

The Root Cause

While looking at the diff of the two directories, I noticed that I had added additional Write-Verbose statements to some of my generic Powershell enumerable functions (like Any, Single, All, etc). In fact, in all of the environment setup scripts that were working, this additional logging was missing (because they were using an older version of the common scripts). The logging was only present in those environments that were broken.
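
For context, these are simple pipeline helpers in the spirit of the LINQ operators. The actual implementations aren’t reproduced here, but a minimal sketch of the sort of function I mean (minus the extra Write-Verbose statements that turned out to matter) looks something like this:

function Any
{
    [CmdletBinding()]
    param
    (
        [Parameter(ValueFromPipeline=$true)]
        $Item
    )

    begin { $any = $false }
    process { if ($Item -ne $null) { $any = $true } }
    end { return $any }
}

# Used from the pipeline, as in the environment scripts earlier in this post.
if ($instanceIds | Any) { Write-Verbose "There is at least one instance to act on" }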

But how could additional logging statements cause a UTF8 error in cfn-init, and only during startup?

It took me a fair bit of time to realise that it was an issue with the length of the output. My deployment scripts all had -Verbose turned on by default. I did this on purpose, to make sure that I received the maximum amount of information possible from the environment setup.

Apparently cfn-init has a limit on the amount of output it can process from a command, and I was pushing that limit.

As to why it failed during startup and not when manually executed? The code that ran at startup did some archive extraction that didn’t happen on subsequent calls (because the files were already on disk), and every extracted item was logged. That was evidently enough to push the output over the invisible, undocumented limit.

I removed the -Verbose flag from the calls to the deployment scripts and everything was fine.

Problem solved.

Why?

Of course, the question remains. Why?

Why does cfn-init give a completely unrelated error when the output from a single command is too long? Why not an “output has been truncated” or “output too long” error of some description?

I have no idea.

I suspect the output was truncated in the middle of a multi-byte character, so when the parser underneath cfn-init tried to decode it, it encountered what looked like an invalid UTF8 sequence. I still have no idea why it would do arbitrary truncation though, especially without telling the user.
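
You can demonstrate the idea easily enough in Powershell: encode a string containing a multi-byte character, chop the byte array in the middle of that character, and a strict UTF8 decoder will refuse it for much the same reason.

# Encode a string containing a 2-byte character (é is 0xC3 0xA9), then truncate the
# byte array in the middle of that character.
$bytes = [System.Text.Encoding]::UTF8.GetBytes("héllo")
[byte[]]$truncated = $bytes[0..1]   # 'h' plus the first half of 'é'

# A UTF8 decoder configured to throw on invalid bytes (similar to Python's strict decoder)
# fails when it hits the orphaned lead byte.
$strict = New-Object System.Text.UTF8Encoding($false, $true)
$strict.GetString($truncated)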

Ah well.

Conclusion

This is the sort of thing that you just have to deal with sometimes. I do wonder how many other people have run into this issue though. Considering that when I did a Google search for the error, I found almost nothing, I’m not sure if anyone else has. If they have, they certainly didn’t post online about it. Luckily, for everyone else in the future, I did, both here and on the AWS forums, so maybe that will help.

On the upside, running into these sorts of things makes you think more about your own code, and how it reports problems to the user when the assumptions you’ve made break down in reality. It takes a lot of effort to write robust code that is communicative when it fails, but I think it’s definitely worth the effort.

Remember, your code will be used and read a hell of a lot more than it’s written.