Painful Proxy Lessons

June 23. 2015 0 Comments

Posted in:
amazon
testing
proxy

So, as per my last post, I built a scalable, deployable, codified proxy environment in AWS, leveraging Cloud Formation, Octopus and Squid.

Since then I have attempted to use this proxy for things. Specifically load tests, which was the entire reason I built the proxy in the first place.

In my attempts to use the proxy, I have learned a few lessons that I thought would be worthwhile to share, so that others might benefit from my pain.

This will be a relatively short post, because:

I’ve been struggling with these issues over the last week and haven’t done much else,
For all the trouble I had, I really only ran into 2 issues,
I’m still struggling with a mysterious, probably unrelated issue, and its taking up most of my mental space (unexpected UTF8 characters in a Powershell/Octopus deployment output stream, which might be a good future blog post if I ever figure it out).

Weird Differences

Initially I assumed that my proxy would slot into the same hole as our current proxy. They were both Squid, I had replicated the config from the old one on the new one and both were referenced simply by a URL and port number.

I was mistaken.

I’m still not sure how it did it, but the old proxy was somehow allowing connections to the special EC2 meta information address (169.254.169.254) to pass through correctly. The moment I swapped in my new proxy, cfn-init and cfn-signal no longer worked.

For cfn-init, the error was incredibly unhelpful. It insisted that my instance was not a member of the CloudFormation group/template/configuration that I was trying to initialise from.

For cfn-signal, it just didn’t do anything. It said it signalled, but it was lying.

In hindsight, this makes perfect sense. The request would have gone through the proxy, which was a CloudFormation resource itself, and it would have tried to use the proxy’s CloudFormation template as the container for the meta data, which would fail, giving a technically correct error message in the first case, and signalling something non-existent in the second.

From my point of view, it looked insane.

I assumed I had put some sort of incorrect details into the cfn-init call, or that I had failed to meet the arcane requirements for cfn-signal (must base64 encode the wait handle on windows only for example), but I hadn’t changed anything. The only thing I changed was the proxy configuration.

Long story short, for my proxy, I had to add a bypass entry (on each EC2 instance, configured in the same place as the proxy, the UserData script) which would stop cfn-init (and other tools) from trying to go through the proxy to hit the meta information address. I still have no idea how the old proxy did not require the same sort of configuration. I have a hunch that it might be because it was Linux and the original creators of the box did something special to make it work. Maybe they ran into the same issue, but just fixed it a different way? Maybe Linux automatically handles the situation better? Who knows.

Very frustrating.

Location, Location, Location

The second pain point I ran into was more insane and just as frustrating.

After reviewing the results of an initial load test, I hypothesised that maybe the proxy was a bottleneck. All of the traffic for the load test had to pass through the proxy (including image uploads) and I couldn’t see anything obvious in the service logs to account for the level of failure I was seeing, except high load. In the interests of getting a better subsequent load test, I wanted to make sure that the proxy boxes could not possibly be a bottleneck, so I planned to beef up their instance type.

I was originally using t2.medium instances, which have some limitations, mostly around network performance and CPU credits. I wanted to switch to something a bit beefier, just for the proxy specific to the load tests.

When i switched to an m3.large, the proxy stopped working.

Looking into it, the expected installation directory (C:\Squid) was empty of anything that even vaguely looked like a proxy.

Following the installation log, I found out that Squid had decided to install itself to Z drive. Z drive was an ephemeral drive. You know, the ones whose content is transitory, and which tend to get annihilated if the instance goes down for any reason?

I tried so very hard to get Squid to just install to the C drive, including checking the registry settings for program installation locations (which were all correctly C based) and manually overriding TARGETFOLDER, ROOTDRIVE and INSTALLDIR in the msi execution parameters.

Alas, it was not to be. No matter what I did, Squid insisted on installing to Z drive.

I still have no idea why, I just turned the instance type back to one that didn’t have ephemeral drives available.

Like any good software user, I logged a bug. Well, I assume its a bug, because that’s a really weird feature.

Conclusion

There is no conclusion. Just a glimpse into some of the traps that sap your time, motivation and will to live when doing this sort of thing.

I only hope that someone runs across this blog post one day and it helps them. Or at least lets them know someone else out there understands their pain.