
When we first started putting together our data synchronization algorithm, it was intended to provide the users of our legacy software with access to their data outside of the confines of their offices, starting with a rudimentary web portal. The idea was that this would help improve the longevity of the software that they were using, while also giving them some exposure to what a cloud based offering would be like, easing the eventual transition to our new product.

Unfortunately the web portal part of the system ended up dead-ending pretty hard as the business moved in a different direction.

The data synchronization that powered it though?

That ended up becoming something incredibly valuable in its own right, as it very quickly became clear that if our goal was to ease the transition from our classic desktop software to a new and shiny cloud platform, it would help to have a sexy migration experience.

We Named Our Squad Humpback

In the scheme of things, data migrations from our legacy product to the cloud platform are old news. In fact, we’ve been doing that sort of thing since not long after the cloud platform was ready for the very first customer.

My team hasn’t owned the process that entire time though (even though we owned the data synchronization), so it wasn’t until relatively recently that I really started digging into how the whole shebang works.

At a high level, we have a team focused on actually stewarding new customers through the onboarding and migrations process. This team makes extensive use of a tool that automates parts of that process, including the creation of the overarching customer entity and the migration of data from their old system. The tool is…okay at doing what it needs to do, but it definitely has a bunch of areas that could be improved.

For example, when a migration fails, it’s not immediately apparent to the user why. You have to do a little bit of digging to get to the root of the problem.

Speaking of failures, they can generally be broken down into two very broad categories:

  • Complete and total failure to execute the migration
  • Failure to do parts of the migration, even though it technically completed

The second point is the most common, and the users take that information (“these things did not migrate”) and supply it to the client, so that they can either fix their underlying problems or sign off that they are okay with those limitations.

The first point is rare, but brutal, and usually encapsulates some sort of fundamental error, like the database going missing, or some critical communications problem.

Recently we hit a brand new complete and total failure which presented with the following error:

java.sql.SQLException: Incorrect string value: '\xF0\x9F\x92\x9B\xF0\x9F...' for column 'data' at row 1

Those escaped characters in the error message?

That’s the 💛 emoji (and then some other stuff).

Because Whales Migrate

Of course, we didn’t know that the escaped characters in the error message were the 💛 emoji at first. We just knew that it failed miserably.

At a high level, the entire migration process is a pretty simple ETL:

  1. Grabs the data for the customer (extract)
  2. Transforms that data into an appropriate structure (transform)
  3. Makes a series of API requests to create the appropriate entities in the cloud platform (load)

In between steps one and two, the data is actually inserted into a staging database, which is MySQL (Aurora specifically when it’s deployed in AWS).

As near as we could tell (the error message and logging weren’t great), when some of the data from the legacy system was inserted into the staging database, the whole thing exploded.

A quick internet search on the error message later, and:

  • The first character was totally the 💛 emoji
  • We had almost certainly set the charset wrong on the database, which is why it was rejecting the data

The 💛 emoji specifically (but also other important emojis, like [NOTE: For some weird reason I could not get the pile of poop emoji to work here, so you’ll just have to imagine it]) requires four bytes to be represented in UTF-8, and unfortunately the utf8 charset in MySQL tops out at three bytes per character. In fact, strictly speaking, the utf8 charset in MySQL (an alias for utf8mb3) only partially implements UTF-8, which can lead to all sorts of weird problems much more important than simple emoji loss.
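
If you want to see the four bytes for yourself, a throwaway Powershell snippet (purely illustrative, nothing to do with the migration tooling itself) does the trick:

# Build the offending character from its code point (the \xF0\x9F\x92\x9B in the error message).
$emoji = [char]::ConvertFromUtf32(0x1F49B)
$bytes = [System.Text.Encoding]::UTF8.GetBytes($emoji) | ForEach-Object { "{0:X2}" -f $_ }
$bytes -join " "   # F0 9F 92 9B - four bytes, one more than MySQL's utf8 charset will store per character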

Luckily, the fix for us is simple enough. Even if the final destination does not support emojis (the cloud platform), the migration process shouldn’t just explode. This means we need to change the charset of the staging database to utf8mb4, and then do a bit of a song and dance to get everything working as expected for all of the relevant columns.

Once we have the data, the transform can safely trim out the emojis, and we can give appropriate messages to the client being migrated explaining what we did to their data and why. It’s not even a fatal error, just a warning that we had to munge their data on the way through in order for it to work properly.

Conclusion

I’m honestly surprised that the emojis made it all the way to the migration before the first failure.

I fully expected the legacy software to explode as soon as it encountered an emoji, or maybe the data synchronization process at the very least. Being that the sync process goes from an MSSQL database (on the client side), through a .NET API (involving JSON serializations) and then into a PostgreSQL database (server side), I figured at least one of those steps would have had some sort of issue, but the only thing we encountered was PostgreSQL’s dislike of the null character (and that was ages ago).

In the end, the problem was surmountable, but it was very unexpected all the same.

The saddest thing is that this isn’t even the craziest user data that we’ve seen.

One time we found 50 MB of meaningless RTF embedded in a varchar(MAX) column.


A few months back I made a quick post about some automation that we put into place when running TeamCity Build Agents on spot-price instances in AWS. Long story short, we used EC2 userdata to automatically configure and register the Build Agent whenever a new spot instance was spun up, primarily as a way to deal with the instability in the spot price which was constantly nuking our machines.

The kicker in that particular post was that when we were editing the TeamCity Build Agent configuration, Powershell was changing the encoding of the file such that it looked perfectly normal at a glance, but the build agent was completely unable to read it. This led to some really confusing errors about things not being set when they were clearly set and so on.

All in all, it was one of those problems that just make you hate software.

What does all of this have to do with this week’s post?

Well, history has a weird habit of repeating itself in slightly different ways.

More Robots

As I said above, we’ve put in some effort to make sure that our TeamCity Build Agent AMIs can mostly take care of themselves on boot if you’ve configured the appropriate userdata in EC2.

Unfortunately, each time we wanted a brand new instance (i.e. to add one to the pool or to recreate existing ones because we’d updated the underlying AMI) we still had to go into the AWS Management Dashboard and set it all up manually, which meant that we needed to remember to set the userdata from a template, making sure to replace the appropriate tokens.

Prone to failure.

Being that I had recently made some changes to the underlying AMI (to add Powershell 5, MSBuild 2015 and .NET Framework 4.5.1), I was going to have to do the manual work.

That’s no fun. Time to automate.

A little while later I had a relatively simple Powershell script scraped together that would spin up an EC2 instance (spot or on-demand) using our AMI, with all of our requirements in place (tagging, names, etc).

[CmdletBinding()]
param
(
    [Parameter(Mandatory=$true)]
    [ValidateNotNullOrEmpty()]
    [string]$awsKey,
    [Parameter(Mandatory=$true)]
    [ValidateNotNullOrEmpty()]
    [string]$awsSecret,
    [string]$awsRegion="ap-southeast-2",
    [switch]$spot=$false,
    [int]$number
)

$here = Split-Path $script:MyInvocation.MyCommand.Path
. "$here\_Find-RootDirectory.ps1"

$rootDirectory = Find-RootDirectory $here
$rootDirectoryPath = $rootDirectory.FullName

. "$rootDirectoryPath\scripts\common\Functions-Aws.ps1"

Ensure-AwsPowershellFunctionsAvailable

$name = "[{team}]-[dev]-[teamcity]-[buildagent]-[$number]"

$token = [Guid]::NewGuid().ToString();

$userData = [System.IO.File]::ReadAllText("$rootDirectoryPath\scripts\buildagent\ec2-userdata-template.txt");
$userData = $userData.Replace("@@AUTH_TOKEN@@", $token);
$userData = $userData.Replace("@@NAME@@", $name);

$amiId = "{ami}";
$instanceProfile = "{instance-profile}"
$instanceType = "c3.large";
$subnetId = "{subnet}";
$securityGroupId = "{security-group}";
$keyPair = "{key-pair}";

if ($spot)
{
    $groupIdentifier = new-object Amazon.EC2.Model.GroupIdentifier;
    $groupIdentifier.GroupId = $securityGroupId;
    $name = "$name-[spot]"
    $params = @{
        "InstanceCount"=1;
        "AccessKey"=$awsKey;
        "SecretKey"=$awsSecret;
        "Region"=$awsRegion;
        "IamInstanceProfile_Arn"=$instanceProfile;
        "LaunchSpecification_InstanceType"=$instanceType;
        "LaunchSpecification_ImageId"=$amiId;
        "LaunchSpecification_KeyName"=$keyPair;
        "LaunchSpecification_AllSecurityGroup"=@($groupIdentifier);
        "LaunchSpecification_SubnetId"=$subnetId;
        "LaunchSpecification_UserData"=[System.Convert]::ToBase64String([System.Text.Encoding]::Unicode.GetBytes($userData));
        "SpotPrice"="0.238";
        "Type"="persistent";
    }
    $request = Request-EC2SpotInstance @params;
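    # NOTE: interpreting the userdata as Unicode (UTF-16) in the launch specification above
    # is what eventually causes trouble - see the encoding discussion further down.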
}
else
{
    . "$rootDirectoryPath\scripts\common\Functions-Aws-Ec2.ps1"

    $params = @{
        ImageId = $amiId;
        MinCount = "1";
        MaxCount = "1";
        KeyName = $keyPair;
        SecurityGroupId = $securityGroupId;
        InstanceType = $instanceType;
        SubnetId = $subnetId;
        InstanceProfile_Arn=$instanceProfile;
        UserData=$userData;
        EncodeUserData=$true;
    }

    $instance = New-AwsEc2Instance -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion -InstanceParameters $params -IsTemporary:$false -InstancePurpose "DEV"
    Tag-Ec2Instance -InstanceId $instance.InstanceId -Tags @{"Name"=$name;"auto:start"="0 8 ALL ALL 1-5";"auto:stop"="0 20 ALL ALL 1-5";} -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion
}

Nothing special here. The script leverages some of our common scripts (partially available here) to do some of the work, like creating the EC2 instance itself and tagging it, but it’s pretty much just a branch on the -Spot switch and a bunch of parameter configuration.

On-Demand instances worked fine, spinning up the new Build Agent and registering it with TeamCity as expected, but for some reason instances created with the -Spot switch didn’t.

The spot request would be created, and the instance would be spawned as expected, but it would never configure itself as a Build Agent.

Thank God For Remote Desktop

As far as I could tell, instances created via either path were identical. Same AMI, same Security Groups, same VPC/Subnet, and so on.

Remoting onto the bad spot instances I could see that the Powershell script supplied as part of the instance userdata was not executing. In fact, it wasn’t even there at all. Typically, any Powershell script specified in userdata with the <powershell></powershell> tags is automatically downloaded by the EC2 Config Service on startup and placed inside C:/Program Files (x86)/Amazon/EC2ConfigService/Scripts/UserData.ps1, so it was really unusual for there to be nothing there even though I had clearly specified it.

I have run into this sort of thing before though, and the most common root cause is that someone (probably me) forgot to enable the re-execution of userdata when updating the AMI, but that couldn’t be the case this time, because the on-demand instances were working perfectly and they were all using the same image.

Checking the userdata from the instance itself (both via the AWS Management Dashboard and the local metadata service at http://169.254.169.254/latest/user-data) I could clearly see my Powershell script.

So why wasn’t it running?

It turns out that the primary difference between a spot request and an on-demand request is that you have to base64 encode the data yourself for the spot request (whereas the library takes care of it for the on-demand request). I knew this (as you can see in the code above), but what I didn’t know was that the EC2 Config Service is very particular about the character encoding of the underlying userdata. For the base64 conversion, I had elected to interpret the string as Unicode (UTF-16) bytes, which meant that while everything looked fine after the round trip, the EC2 Config Service had no idea what was going on. Interpreting the string as UTF8 bytes before encoding it made everything work just fine.
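
In concrete terms, the change boils down to swapping the encoding used in the base64 conversion in the script above, something along these lines (pulled out as a standalone snippet):

# UTF8 is what the EC2 Config Service expects to find once the userdata has been decoded;
# Unicode (UTF-16) round-trips fine through base64 but leaves the service unable to read it.
$encodedUserData = [System.Convert]::ToBase64String([System.Text.Encoding]::UTF8.GetBytes($userData))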

Summary

This is another one of those cases that you run into in software development where it looks like something has made an assumption about its inputs, but hasn’t put the effort in to test that assumption before failing miserably. Just like with the TeamCity configuration file, the software required that the content be encoded as UTF8, but didn’t tell me when it wasn’t.

Or maybe it did? I couldn’t find anything in the normal places (the EC2 Config Service log files), but those files can get pretty big, so I might have overlooked it. AWS is a pretty well put together set of functionality, so it’s unlikely that something as common as an encoding issue is completely unknown to them.

Regardless, this whole thing cost me a few hours that I could have spent doing something else.

Like shaving a different yak.


We use TeamCity as our Continuous Integration tool.

Unfortunately, our setup hasn’t been given as much love as it should have. It’s not bad (or broken or any of those things), it’s just not quite as well set up as it could be, which increases the risk that it will break and makes it harder to manage (and change and keep up to date) than it could be.

As with everything that has got a bit crusty around the edges over time, the only real way to attack it while still delivering value to the business is by doing it slowly, piece by piece, over what seems like an inordinate amount of time. The key is to minimise disruption, while still making progress on the bigger picture.

Our setup is fairly simple. A centralised TeamCity server and at least 3 Build Agents capable of building all of our software components. We host all of this in AWS, but unfortunately, it was built before we started consistently using CloudFormation and infrastructure as code, so it was all manually set up.

Recently, we started using a few EC2 spot instances to provide extra build capabilities without dramatically increasing our costs. This worked fairly well, up until the spot price spiked and we lost the build agents. We used persistent requests, so they came back, but they needed to be configured again before they would hook up to TeamCity because of the manual way in which they were provisioned.

There’s been a lot of instability in the spot price recently, so we were dealing with this manual setup on a daily basis (sometimes multiple times per day), which got old really quickly.

You know what they say.

“If you want something painful automated, make a developer responsible for doing it manually and then just wait.”

It’s Automatic

The goal was simple.

We needed to configure the spot Build Agents to automatically bootstrap themselves on startup.

On the upside, the entire process wasn’t completely manual. We were at least spinning up the instances from a pre-built AMI that already had all of the dependencies for our older, crappier components as well as an unconfigured TeamCity Build Agent on it, so we didn’t have to automate absolutely everything.

The bootstrapping would need to tag the instance appropriately (because for some reason spot instances don’t inherit the tags of the spot request), configure the Build Agent and then start it up so it would connect to TeamCity. Ideally, it would also register and authorize the Build Agent, but if we used controlled authorization tokens we could avoid this step by just authorizing the agents once. Then they would automatically reappear each time the spot instance came back.

So tagging, configuring, service start, using Powershell, with the script baked into the AMI. During provisioning we would supply some UserData that would execute the script.
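
For illustration, the userdata itself ends up being not much more than a call to that baked-in script, wrapped in the tags the EC2 Config Service looks for (the script path and parameter names here are placeholders, not the real thing; the tokens are the ones substituted during provisioning):

<powershell>
& "C:\Scripts\Initialise-BuildAgent.ps1" -AuthorizationToken "@@AUTH_TOKEN@@" -AgentName "@@NAME@@"
</powershell>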

Not too complicated.

Like Graffiti, Except Useful

Tagging an EC2 instance is pretty easy thanks to the multitude of toolsets that Amazon provides. Our tool of choice is the Powershell cmdlets, so the actual tagging was a simple task.

Getting permission to do the tagging was another story.

We’re pretty careful with our credentials these days, for some reason, so we wanted to make sure that we weren’t supplying and persisting any credentials in the bootstrapping script. That means IAM.

One of the key features of the Powershell cmdlets (and most of the Amazon supplied tools) is that they are supposed to automatically grab credentials if they are being run on an EC2 instance that currently has an instance profile associated with it.

For some reason, this would just not work. We tried a number of different things to get this to work (including updating the version of the Powershell cmdlets we were using), but in the end we had to resort to calling the instance metadata service directly to grab some credentials.
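
That fallback is only a few lines of Powershell. A rough sketch (assuming a Powershell recent enough to have Invoke-RestMethod; the variable names are just for illustration):

# The metadata service hands out temporary credentials for the role attached to the
# instance profile, so nothing sensitive has to be stored on the machine.
$metadataRoot = "http://169.254.169.254/latest/meta-data"
$roleName = Invoke-RestMethod "$metadataRoot/iam/security-credentials/"
$credentials = Invoke-RestMethod "$metadataRoot/iam/security-credentials/$roleName"
# $credentials exposes AccessKeyId, SecretAccessKey and Token, which can then be handed
# to the tagging cmdlets explicitly instead of relying on automatic discovery.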

Obviously the instance profile that we applied to the instance represented a role with a policy that only had permissions to alter tags. Minimal permission set and all that.

Service With a Smile

Starting/stopping services with Powershell is trivial, and for once, something weird didn’t happen causing us to burn days while we tracked down some obscure bug that only manifests in our particular use case.

I was as surprised as you are.
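
For completeness, it really is a one-liner (the service name is whatever the TeamCity agent installer registered on the AMI, shown here as a guess):

# Start the TeamCity Build Agent service once its configuration is in place.
Start-Service -Name "TCBuildAgent"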

Configuration Is Key

The final step should have been relatively simple.

Take a file with some replacement tokens, read it, replace the tokens with appropriate values, write it back.

Except it just wouldn’t work.

After editing the file with Powershell (a relatively simple Get-Content | ForEach-Object { $_ -replace {token}, {value} } | Out-File) the TeamCity Build Agent would refuse to load.

Checking the log file, its biggest (and only) complaint was that the serverUrl (which is the location of the TeamCity server) was not set.

This was incredibly confusing, because the file clearly had a serverUrl value in it.

I tried a number of different things to determine the root cause of the issue, including:

  • Permissions? Was the file accidentally locked by TeamCity such that the Build Agent service couldn’t access it?
  • Did the rewrite of the tokens somehow change the format of the file (extra spaces, CR LF when it was just expecting LF)?
  • Was the serverUrl actually configured, but inaccessible for some reason (machine proxy settings for example) and the problem was actually occurring not when the file was rewritten but when the script was setting up the AWS Powershell cmdlets proxy settings?

Long story short, it turns out that Powershell doesn’t preserve the existing file encoding when using the Out-File functionality in the way we were using it; it just writes its default, which is Unicode (UTF-16 Little Endian). Our ASCII configuration file was coming back out as UTF-16 LE, complete with a Byte Order Mark, and the Build Agent did not like that (it didn’t throw an encoding error either, which is super annoying, but whatever).

The error message was both a red herring (yes, it was configured) and also truthful (the Build Agent was incapable of reading the serverUrl).
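
For what it’s worth, the rewrite itself doesn’t need to change much; it just has to be told what encoding to write. A sketch, with a placeholder path, token and value rather than the real configuration:

# Read the whole file first, substitute the token, then write it back with an explicit
# encoding so Out-File doesn't silently switch the file to its default (UTF-16 LE).
$configPath = "C:\BuildAgent\conf\buildAgent.properties"
(Get-Content $configPath) |
    ForEach-Object { $_ -replace "@@SERVER_URL@@", "http://teamcity.example.com" } |
    Out-File $configPath -Encoding ASCII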

Putting It All Together

With all the pieces in place, it was a relatively simple matter to create a new AMI with those scripts baked into it and put it to work straightaway.

Of course, I’d been doing this the whole time in order to test the process, so I certainly had a lot of failures building up to the final deployment.

Conclusion

Even simple automation can prove to be time consuming, especially when you run into weird unforeseen problems like components not performing as advertised or even reporting correct errors for you to use for debugging purposes.

Still, it was worth it.

Now I never have to manually configure those damn spot instances when they come back.

And satisfaction is worth its weight in gold.