0 Comments

A few months back I made a quick post about some automation that we put into place when running TeamCity Build Agents on spot-price instances in AWS. Long story short, we used EC2 userdata to automatically configure and register the Build Agent whenever a new spot instance was spun up, primarily as a way to deal with the instability in the spot price which was constantly nuking our machines.

The kicker in that particular post was that when we were editing the TeamCity Build Agent configuration, Powershell was changing the encoding of the file such that it looked perfectly normal at a glance, but the build agent was completely unable to read it. This lead to some really confusing errors about things not being set when they were clearly set and so on.

All in all, it was one of those problems that just make you hate software.

What does all of this have to do with this weeks post?

Well, history has a weird habit of repeating itself in slightly different ways.

More Robots

As I said above, we’ve put in some effort to make sure that our TeamCity Build Agent AMI’s can mostly take care of themselves on boot if you’ve configured the appropriate userdata in EC2.

Unfortunately, each time we wanted a brand new instance (i.e. to add one to the pool or to recreate existing ones because we’d updated the underlying AMI) we still had to go into the AWS Management Dashboard and set it all up manually, which meant that we needed to remember to set the userdata from a template, making sure the replace the appropriate tokens.

Prone to failure.

Being that I had recently made some changes to the underlying AMI (to add Powershell 5, MSBuild 2015 and .NET Framework 4.5.1) I was going to have to do the manual work.

That’s no fun. Time to automate.

A little while later I had a relatively simple Powershell script scraped together that would spin up an EC2 instance (spot or on-demand) using our AMI, with all of our requirements in place (tagging, names, etc).

[CmdletBinding()]
param
(
    [Parameter(Mandatory=$true)]
    [ValidateNotNullOrEmpty()]
    [string]$awsKey,
    [Parameter(Mandatory=$true)]
    [ValidateNotNullOrEmpty()]
    [string]$awsSecret,
    [string]$awsRegion="ap-southeast-2",
    [switch]$spot=$false,
    [int]$number
)

$here = Split-Path $script:MyInvocation.MyCommand.Path
. "$here\_Find-RootDirectory.ps1"

$rootDirectory = Find-RootDirectory $here
$rootDirectoryPath = $rootDirectory.FullName

. "$rootDirectoryPath\scripts\common\Functions-Aws.ps1"

Ensure-AwsPowershellFunctionsAvailable

$name = "[{team}]-[dev]-[teamcity]-[buildagent]-[$number]"

$token = [Guid]::NewGuid().ToString();

$userData = [System.IO.File]::ReadAllText("$rootDirectoryPath\scripts\buildagent\ec2-userdata-template.txt");
$userData = $userData.Replace("@@AUTH_TOKEN@@", $token);
$userData = $userData.Replace("@@NAME@@", $name);

$amiId = "{ami}";
$instanceProfile = "{instance-profile}"
$instanceType = "c3.large";
$subnetId = "{subnet}";
$securityGroupId = "{security-group}";
$keyPair = "{key-pair}";

if ($spot)
{
    $groupIdentifier = new-object Amazon.EC2.Model.GroupIdentifier;
    $groupIdentifier.GroupId = $securityGroupId;
    $name = "$name-[spot]"
    $params = @{
        "InstanceCount"=1;
        "AccessKey"=$awsKey;
        "SecretKey"=$awsSecret;
        "Region"=$awsRegion;
        "IamInstanceProfile_Arn"=$instanceProfile;
        "LaunchSpecification_InstanceType"=$instanceType;
        "LaunchSpecification_ImageId"=$amiId;
        "LaunchSpecification_KeyName"=$keyPair;
        "LaunchSpecification_AllSecurityGroup"=@($groupIdentifier);
        "LaunchSpecification_SubnetId"=$subnetId;
        "LaunchSpecification_UserData"=[System.Convert]::ToBase64String([System.Text.Encoding]::Unicode.GetBytes($userData));
        "SpotPrice"="0.238";
        "Type"="persistent";
    }
    $request = Request-EC2SpotInstance @params;
}
else
{
    . "$rootDirectoryPath\scripts\common\Functions-Aws-Ec2.ps1"

    $params = @{
        ImageId = $amiId;
        MinCount = "1";
        MaxCount = "1";
        KeyName = $keyPair;
        SecurityGroupId = $securityGroupId;
        InstanceType = $instanceType;
        SubnetId = $subnetId;
        InstanceProfile_Arn=$instanceProfile;
        UserData=$userData;
        EncodeUserData=$true;
    }

    $instance = New-AwsEc2Instance -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion -InstanceParameters $params -IsTemporary:$false -InstancePurpose "DEV"
    Tag-Ec2Instance -InstanceId $instance.InstanceId -Tags @{"Name"=$name;"auto:start"="0 8 ALL ALL 1-5";"auto:stop"="0 20 ALL ALL 1-5";} -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion
}

Nothing special here. The script leverages some of our common scripts (partially available here) to do some of the work like creating the EC2 instance itself and tagging it, but its pretty much just a switch statement and a bunch of parameter configuration.

On-Demand instances worked fine, spinning up the new Build Agent and registering it with TeamCity as expected, but for some reason instances created with the –Spot switch didn’t.

The spot request would be created, and the instance would be spawned as expected, but it would never configure itself as a Build Agent.

Thank God For Remote Desktop

As far as I could tell, instances created via either path were identical. Same AMI, same Security Groups, same VPC/Subnet, and so on.

Remoting onto the bad spot instances I could see that the Powershell script supplied as part of the instance userdata was not executing. In fact, it wasn’t even there at all. Typically, any Powershell script specified in userdata with the <powershell></powershell> tags is automatically downloaded by the EC2 Config Service on startup and placed inside C:/Program Files (x86)/Amazon/EC2ConfigService/Scripts/UserData.ps1, so it was really unusual for there to be nothing there even though I had clearly specified it.

I have run into this sort of thing before though, and the most common root cause is that someone (probably me) forgot to enable the re-execution of userdata when updating the AMI, but that couldn’t be the case this time, because the on-demand instances were working perfectly and they were all using the same image.

Checking the userdata from the instance itself (both via the AWS Management Dashboard and the local meta data service at http://169.254.169.254/latest/user-data) I could clearly see my Powershell script.

So why wasn’t it running?

It turns out that the primary difference between a spot request and an on-demand request is that you have to base64 encode the data yourself for the spot request (whereas the library takes care of it for the on-demand request). I knew this (as you can see in the code above), but what I didn’t know was that the EC2 Config Service is very particular about the character encoding of the underlying userdata. For the base64 conversion, I had elected to interpret the string as Unicode bytes, which meant that while everything looked fine after the round trip, the EC2 Config Service had no idea what was going on. Interpreting the string as UTF8 bytes before encoding it made everything work just fine.

Summary

This is another one of those cases that you run into in software development where it looks like something has made an assumption about its inputs, but hasn’t put the effort in to test that assumption before failing miserably. Just like with the TeamCity configuration file, the software required that the content be encoded as UTF8, but didn’t tell me when it wasn’t.

Or maybe it did? I couldn’t find anything in the normal places (the EC2 Config Service log files), but those files can get pretty big, so I might have overlooked it. AWS is a pretty well put together set of functionality, so its unlikely that something as common as an encoding issue is completely unknown to them.

Regardless, this whole thing cost me a few hours that I could have spent doing something else.

Like shaving a different yak.