0 Comments

I mentioned last week that I was struggling with an issue with my environment setup scripts via CloudFormation.

Well, I finally got to the bottom of it, and I thought it would be a good thing to share, because I don’t want anyone else to have to struggle with finding the root cause behind the incredibly unhelpful error that I was getting. Hopefully, my pain becomes your knowledge.

To give a little bit of background, I use a combination of CloudFormation templates, Powershell and Octopus Deploy to setup environments for our products and services in AWS. This approach has proven to be extremely useful for a number of reasons, including codification of what exactly goes into the environment, reducing the amount of time to setup an environment (in comparison to using typical infrastructure requests) and reducing the reliance on one particular person or team who knows how to put it all together.

A little while ago I noticed that the environment setup that I use for my JMeter Load Test Workers was failing to run to completion. Every time I tried to setup the environment, it would fail with a “Create Failed” status, stating that one or more of the instances failed their configuration.

Looking into the issue, I found the following error in the cfn-init.log file (available at C:\cfn\log\cfn-init.log).

In case you are unfamiliar with CloudFormation, cfn-init is a program supplied by Amazon that works with CloudFormation, and allows you to configure a series of steps to occur during the startup of an EC2 instance. Typically you  use the UserData for an instance to execute cfn-init with the appropriate parameters, and it connects to the meta data service, downloads the CloudFormation (cfn) template and executes it on the local machine. Its awesome, and I use it a lot.

The relevant excerpt from the log file:

2015-06-12 03:06:02,674 [DEBUG] Running command a-deploy-loadtest
2015-06-12 03:06:02,674 [DEBUG] No test for command a-deploy-loadtest
2015-06-12 03:06:59,846 [INFO] Command a-deploy-loadtest succeeded
2015-06-12 03:06:59,846 [ERROR] -----------------------BUILD FAILED!------------------------
2015-06-12 03:06:59,846 [ERROR] Unhandled exception during build: 'utf8' codec can't decode byte 0xf2 in position 644: invalid continuation byte
Traceback (most recent call last):
    File "cfn-init", line 123, in <module>
    File "cfnbootstrap\construction.pyc", line 117, in build
    File "cfnbootstrap\construction.pyc", line 502, in build
    File "cfnbootstrap\construction.pyc", line 513, in run_config
    File "cfnbootstrap\construction.pyc", line 125, in run_commands
    File "cfnbootstrap\command_tool.pyc", line 116, in apply
    File "encodings\utf_8.pyc", line 16, in decode UnicodeDecodeError: 'utf8' codec can't decode byte 0xf2 in position 644: invalid continuation byte

Of course, that doesn’t make any sense, as I hadn’t changed anything to do with the encoding of the output. Even more confusingly, when I ran cfn-init using the exact same mechanism (the execution of the UserData batch file), on the exact same machine, it would work just fine. It only failed during instance startup.

I looked into it for a little while, but couldn’t find an obvious root cause. I assumed that I was doing something slightly different from the norm or that something had changed upstream to cause the error. This made sense, because all of the other places where I was using the exact same code were working just fine, and it seemed to be localised to just my load test worker environment.

I altered the environment setup to not require instances to setup successfully (by removing the call to cfn-signal after cfn-init, and removing the WaitCondition from the template) and manually initialized each instance after it was created before I ran my load tests.

Not great, but an acceptable work around until it mattered more. I had bigger fish to fry.

The Virus Spreads

Then it happened somewhere else.

The new cloud based service we are working on needed an environment refresh to take advantage of scalable persistence in the form of RDS. We tore down the old CI environment, and then attempted to create a new one.

The mysterious “invalid UTF8 character” error reared its ugly head again.

I couldn’t brush it away this time, as this environment was much more important than my load test workers, so it was time to dig in and see what was really going on.

Taking the error at face value, I assumed that something had changed in the pipeline (Octopus, AWS, etc) and that there really was an invalid character. It was fairly obvious that cfn-init didn’t fail until after the first deployment had actually completed, so something seemed to be off in the output of that particular step.

"a-deploy-system-stats-logger": { "command": { "Fn::Join": [ "", [ "powershell.exe -ExecutionPolicy Bypass -Command c:\\cfn\\dependencies\\scripts\\environment\\Invoke-DeployLatestProjectForEnvironment.ps1 ",

" –Verbose ", " -OctopusServerApiKey ", { "Ref": "OctopusAPIKey" }, " -OctopusServerUrl ", { "Ref": "OctopusServerURL" }, " -OctopusEnvironment ", { "Ref": "OctopusEnvironment" }, " -ProjectName 'System Stats Logger'" ] ] }, "waitAfterCompletion": "0" }

I ran the script manually through C# and captured the raw output stream. I parsed the output as UTF8, and it came out just fine, so clearly the output itself wasn’t the problem.

At this point I took a slightly different approach. I knew that an old version of the environment setup for our new cloud service had worked just fine a few days ago, so I branched from that point in Git and executed the setup script.

It worked.

Now I at least had two copies of similar codebases that I could compare for differences to find what was breaking the deployments, and I knew that it wasn’t an environmental issue related to either Octopus Deploy or AWS components.

I used an application called WinMerge to diff the two repositories (after extracting them to different directories on disk), and found a number of differences, but nothing that was obviously the root cause. I thought that perhaps the problem was caused by upgrading to the latest version of the AWS Powershell package, but I rolled that back and still had the same issue.

Heads up WinMerge people, if you are on the internet at all. The statuses for “File Identical” and “File Different” in your application look incrediblysimilar at a glance. I thought all my files were identical initially, at least until I looked a little closer. Colours would be nice.

The Root Cause

While looking at the diff of the two directories, I noticed that I had added additional Write-Verbose statements to some of my generic Powershell enumerable functions (like Any, Single, All, etc). In fact, in all of the environment setup scripts that were working, this additional logging was missing (because they were using an older version of the common scripts). The logging was only present in those environments that were broken.

But how could additional logging statements cause a UTF8 error in cfn-init, and only during startup?

It took me a fair bit of time to realise that it was an issue with the length of the output. My deployment scripts all had –Verbose turned on by default. I did this on purpose, to make sure that I received the maximum amount of information possible from the environment setup.

Apparently cfn-init has a limit on the amount of output it can process from a command, and I was pushing that limit.

As to why it was failing during startup and not when manually executed? The code that was run at startup did some archive extraction which didn’t happen in future calls (because it was already on disk) and the items being extracted were logged. This was evidently enough to push the output over the invisible, undocumented limit.

I removed the –Verbose flag from the calls deployment scripts and everything was fine.

Problem solved.

Why?

Of course, the question remains. Why?

Why does cfn-init give a completely unrelated error when the output from a single command is too long? Why not a “output has been truncated” or “output too long” error of some description?

I have no idea.

I suspect the reason for the UTF8 error is that the output was truncated between two bytes or something, and when the parser underneath cfn-init tried to parse it, it encountered what appeared to be an invalid UTF8 character. I still have no idea why it would do arbitrary truncation though, especially without telling the user.

Ah well.

Conclusion

This is the sort of thing that you just have to deal with sometimes. I do wonder how many other people have run into this issue though. Considering that when I did a Google search for the error, I found almost nothing, I’m not sure if anyone else has. If they have, they certainly didn’t post online about it. Luckily, for everyone else in the future, I did, both here and on the AWS forums, so maybe that will help.

On the upside, running into these sorts of things makes you think more about your own code, and how it reports problems to the user when assumptions that you’ve made break down in reality. It takes a lot of effort to write robust code that is communicative when it fails, but I think its definitely worth the effort.

Remember, your code will be used and read a hell of a lot more than its written.