
A while back I made a post about how cfn-init was failing with an error that didn’t seem to make any sense (utf8 codec can’t decode, invalid continuation byte). At the time, I came to the conclusion that the reason why it was failing was the length of the output.

That was actually incorrect.

I came to that conclusion because the thing that seemed to determine whether the issue occurred was the presence of the -Verbose flag on the PowerShell script being called. If I had the flag on, I would get the error. Off, and everything was fine.
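For context, a PowerShell script only picks up -Verbose as a switch if it declares CmdletBinding, and passing that switch flips $VerbosePreference to Continue inside the script, which is what makes all of the Write-Verbose output suddenly appear in whatever is capturing the script’s streams. A minimal sketch of the mechanism (the script name and content here are illustrative, not my actual deployment script):

# Initialise-Instance.ps1 (hypothetical name). Declaring CmdletBinding gives
# the script the common parameters, including -Verbose.
[CmdletBinding()]
param()

# This line only produces output when the script is invoked with -Verbose,
# because the switch sets $VerbosePreference to Continue for the script scope.
Write-Verbose "Initialising instance..."

Running .\Initialise-Instance.ps1 -Verbose produces the extra output; running it without the switch does not, which was exactly the difference between the failing and working deployments.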

At first I trusted the error. I assumed that somewhere in my output there was actually an invalid set of bytes, at least as far as UTF-8 is concerned. It seemed entirely plausible, considering it was Ruby parsing the output from a PowerShell script, which was in turn running an executable directly. So many layers, so many different ways in which it could break, not the least of which would be output stream encoding incompatibilities.

My initial investigation into the output stream didn’t seem to show any invalid bytes, so I assumed that cfn-init was doing something stupid and truncating the output stream because it was too long. If it truncated the stream, it was feasible that it was shearing a multi-byte UTF-8 character in half, hence the invalid continuation byte error (there’s a small sketch of that failure mode after the list below). It made sense to me, because:

  1. I had only recently added a large amount of extra verbose logging to the scripts, and that seemed to be the primary difference between the version that worked and the version that failed.
  2. I’ve had issues before with AWS components giving arcane errors when you exceed some arbitrary length. The PowerShell cmdlet New-CFNStack will fail if you try to upload a template directly and the template is too long, and it gives an incredibly unhelpful error about malformed XML when in reality it’s a length issue.
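To illustrate the shearing idea from the first point (this snippet is purely an illustration on my part, not anything from the deployment scripts): cut a two byte UTF-8 character in half and a strict decoder fails in much the same way cfn-init was failing.

# Take a two byte UTF-8 character, keep only the lead byte, and then try to
# decode it with a decoder that throws on invalid bytes.
$bytes = [System.Text.Encoding]::UTF8.GetBytes("é")   # 0xC3 0xA9
$sheared = [byte[]]@($bytes[0])                       # drop the continuation byte
$strictUtf8 = New-Object System.Text.UTF8Encoding($false, $true)
try
{
    $strictUtf8.GetString($sheared)
}
catch
{
    # The strict decoder throws because the sequence is incomplete, which is
    # the same class of failure the cfn-init error message was describing.
    Write-Host "Strict decode failed: $($_.Exception.Message)"
}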

I accepted that I could not use verbose logging for the deployment and moved on, but it always bugged me.

I don’t like losing verbose logging. It’s extremely useful for when things go bad, which they do. All the time.

Round Two (or Three?)

I got a chance to go back to the deployment scripts recently, because some deployments were failing and I wasn’t sure why. I needed the verbose logging back, so I re-added the -Verbose flag and tried to get to the bottom of why it was failing.

My first attempt simply commented out the lines that I knew could be particularly spammy (search functions testing a predicate against a collection of objects).
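For reference, the spammy lines lived in functions roughly like this (the names here are invented for illustration), where every element in a collection produces its own line of verbose output:

# Hypothetical example of the kind of search function I mean: one Write-Verbose
# per element tested, which adds up quickly over a large collection.
function Find-FirstMatching
{
    [CmdletBinding()]
    param
    (
        [object[]]$Items,
        [scriptblock]$Predicate
    )

    foreach ($item in $Items)
    {
        Write-Verbose "Testing [$item] against the supplied predicate."
        if (& $Predicate $item)
        {
            return $item
        }
    }
}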

The error still occurred.

I ran the script by itself (in a clean environment) and the output wasn’t even that long. I realised that I had made an incorrect conclusion from my initial investigation. It definitely wasn’t a length issue.

It was time to do a manual binary search.

I knew which script was failing, so I suppressed verbose output in half the script and tested it again to see if it failed. It’s easy enough to temporarily suppress verbose output in PowerShell, if a bit messy.

# Remember the caller's preference so it can be restored afterwards.
$pref = $VerbosePreference
$VerbosePreference = "SilentlyContinue"

# ... operations here ...

# Put the preference back the way it was.
$VerbosePreference = $pref
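One thing worth noting (a variation on the snippet above, not something the original scripts did): if anything in the middle throws, the preference never gets restored, so wrapping the operations in try/finally is a little safer.

$pref = $VerbosePreference
$VerbosePreference = "SilentlyContinue"
try
{
    # ... operations here ...
}
finally
{
    # Restore the caller's preference even if one of the operations throws.
    $VerbosePreference = $pref
}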

What followed was a few hours of manually searching for the piece of script that broke cfn-init. I couldn’t just run it locally, because everything worked just fine there; I had to instantiate an environment and get it to initialise itself as it usually would to see whether the script failed or not. It had to be a fresh environment too, because if I ran cfn-init again on an environment that had already failed, it would work just fine.

An environment takes at least 10 minutes to reach the point of failure.

It was the slowest binary search ever.

Something About Rooting Issues?

Eventually I got to the root of the problem. I was piping the output from nuget.exe (used for package installation) to the Verbose stream. Somewhere in the output from nuget, there was actually (maybe?) an invalid UTF-8 character, according to the code used by cfn-init anyway. The reason it didn’t fail when you ran it a second time was that the component was already successfully installed, so it didn’t try to install it again.
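The offending pattern looked roughly like this (the nuget.exe path, arguments and package name below are placeholders rather than the exact call from my scripts): every line the executable writes gets forwarded to the verbose stream, which cfn-init then captures and tries to decode.

# Roughly the failing pattern: forward everything nuget.exe writes (including
# stderr) to the verbose stream. Paths and package names are placeholders.
$nugetExe = "C:\tools\nuget.exe"
& $nugetExe install "Octopus.Client" -OutputDirectory "C:\packages" 2>&1 |
    ForEach-Object { Write-Verbose -Message "$_" }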

I could leave every Write-Verbose statement in place except for that one, and it would all work fine. This meant that I could finally get the verbose output from the Octopus deployment back, which was the main thing I wanted. Sure I could see it through Octopus, but I like all of my information to be in one place, because it just makes everything easier.

To complicate matters further, it wasn’t all nuget output that triggered the issue. For example, the 7-Zip command line tools are one of the first components to be installed, in order to make sure the AWS cmdlets are available (they are distributed to each machine as a 7-Zip archive). That particular component would install fine. It seemed to be only the Octopus Client package (the one that contains the .NET libraries) that caused the issue. I’m still not 100% sure of that; I was mostly just happy to get my verbose deployment errors back in the cfn-init logs, and I had to move on to something else.
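For what it’s worth, pinning it down wouldn’t be hard: capture the raw bytes of the nuget output for each package and run them through a strict UTF-8 decoder to see which one actually contains the offending bytes. A sketch (the log path is a placeholder; the deployment scripts don’t currently do this):

# Read the captured output as raw bytes and decode with a decoder that throws
# on anything that is not valid UTF-8.
$bytes = [System.IO.File]::ReadAllBytes("C:\logs\nuget-output.log")
$strictUtf8 = New-Object System.Text.UTF8Encoding($false, $true)
try
{
    $strictUtf8.GetString($bytes) | Out-Null
    Write-Host "Output decodes cleanly as UTF-8."
}
catch
{
    Write-Host "Output contains bytes that are not valid UTF-8: $($_.Exception.Message)"
}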

To be honest, the fact that I don’t know for sure why the nuget installation output breaks the script is another instance of the same mistake I made the first time, and I’m almost certainly going to have to revisit it at some point in the future, at which point I will no doubt discover that something else entirely is the actual root cause. This solution is still better than the original one though, which is enough for now.

Conclusion

Sometimes in software you think you know the answer, but it turns out you’ve made some poor assumptions or drawn some poor conclusions and just completely missed the mark. In my case the ramifications were not particularly bad (I was just missing out on informational output), but that’s not always the case. Sometimes the results of a poor assumption or conclusion can be much worse.

The important thing to note is that I stopped investigating once I found something that made the bug go away, instead of digging deeper into why that action made the bug go away. Sometimes this is a compromise (usually as a result of time constraints or pressure), but I’ve found it’s almost always worthwhile to spend the time to understand the why of an issue before accepting a solution. Obviously at some stage you do need to just accept that you don’t understand the problem fully, so like everything in software development, it’s a balancing act.

At the very least if you investigate for a little bit longer, you’ll understand the problem better, which can only be a good thing.