Unfortunately for us, we had to move our TeamCity build server into another AWS account recently, which is never a pleasant experience, though it is often eye opening.
We had to do this for a number of reasons, but the top two were:
- We’re consolidating some of our many AWS accounts, to more easily manage them across the organisation.
- We recently sold part of our company, and one of the assets included in that sale was a domain name that we were using to host our TeamCity server on.
Not the end of the world, but annoying all the same.
Originally our goal was to do a complete refresh of TeamCity, copying it into a new AWS account, with a new URL and upgrading it to the latest version. We were already going to be disrupted for a few days, so we thought we might as well make it count. We’re using TeamCity 8, and the latest is 10, which is a pretty big jump, but we were hopeful that it would upgrade without a major incident.
I should have remembered that in software development, hope it for fools. After the upgrade to TeamCity 10, the server hung on initializing for long enough that we got tired of waiting (and I’m pretty patient).
So we abandoned the upgrade, settling for moving TeamCity and rehosting it at a different URL that we still had legal ownership of.
That went relatively well. We needed to adapt some of our existing security groups in order to correctly grant access to various resources from the new TeamCity Server/Build Agents, but nothing we hadn’t dealt with a hundred times before.
As we executed more and more builds though, some of them started to fail.
Failures Are Opportunities To Get Angry
All failing builds were stopping in the same place, when uploading the nuget package at the end of the process. Builds uploading to Octopus Deploy were fine (its a server within the same AWS VPC, so that’s not surprising), but a random sampling of builds uploading packages to MyGet had issues.
Investigating, the common theme of the failing builds was largish packages. Not huge, but at least 10 MB. The nuget push call would timeout after 100s, trying a few times, but always experiencing the same issue.
With 26 MB of data required to be uploaded for one of our packages (13 MB package, 13 MB symbols, probably should optimize that), this meant that the total upload speed we were getting were < 300 KBps, which is pretty ridiculously low for something literally inside a data centre.
The strange thing was, we’d never had an issue with uploading large packages before. It wasn’t until we moved TeamCity and the Build Agents into a new AWS account that we started having problems.
Looking into the network configuration, the main differences I could determine were:
- The old configuration used a proxy to get to the greater internet. Proxies are the devil, and I hate them, so when we moved into the new AWS account, we put NAT gateways in place instead. Invisible to applications, a NAT gateway is a far easier way to give internet access to machines that do not need to be exposed on the internet directly.
- Being a completely different AWS account means that there is a good chance those resources would be spun up on entirely different hardware. Our previous components were pretty long lived, so they had consistently been running on the same stuff for months.
At first I thought maybe the NAT gateway had some sort of upload limit, but uploading large payloads to other websites was incredibly fast. With no special rules in place for accessing the greater internet, the slow uploads to MyGet were an intensely annoying mystery.
There was another thing as well. We wrap our usages of Nuget.exe in Powershell functions, specifically to ensure we’re using the various settings consistently. One of the settings we were setting by default with each usage of push, was the timeout. It wasn’t set to 100 seconds though, it was set to 600.
Bugs, Bugs, Bugs
A while back I had to upgrade to the latest Nuget 3.5 release candidate in order to get a fix for a bug that was stopping us from deploying empty files from a package. Its a long story, but it wasn’t something we could easily change. Unfortunately, the latest release candidate also has a regression in it where the timeout for the push is locked at 100 seconds, no matter what you do.
Its been fixed since, but there isn’t another release candidate yet.
Rolling back to a version that allows the timeout to work correctly, stops the other thing from working.
That whole song and dance is how software feels sometimes.
With no obvious way to simply increase the timeout, and because all other traffic seemed to be perfectly fine, it was time to contact MyGet support.
They responded that its something they’ve seen before, but they do not know the root cause. It appears to be an issue with the way that AWS is routing traffic to their Azure hosts. It doesn’t happen all the time, but when it does, it tanks performance. They suggested recycling the NAT gateway to potentially get it on new hardware (and thus give it a chance at getting access to better routes), but we tried that and it didn’t make a difference. We’ve since sent them some detailed fiddler and network logs to help them diagnose the issue, but I wouldn’t be surprised if it was something completely out of their control.
On the upside, we did actually have a solution that was already working.
Our old proxy.
It hadn’t been shut down yet, so we configured the brand new shiny build agents to use the old proxy and lo and behold, packages uploaded in a reasonable time.
This at least unblocked our build pipeline so that other things could happen while we continue to investigate.
Disappointingly, that’s where this blog post ends. The solution that we put into place temporarily with the old proxy (and I really hate proxies) is a terrible hack, and we’re going to have to spend some significant effort fixing it properly because if that proxy instance dies, we could be returned to exactly the same place without warning (if the underlying issue is something to do with routing that is out of control).
Networking issues like the one I’ve described above are some of the most frustrating, especially because they can happen when you least expect it.
Not only are they basically unfathomable, there is very little you can do to actively fix the issue other than moving your traffic around trying to find a better spot.
Of course, we can also optimise the content of our packages to be as small as possible, hopefully making all of our artifacts small enough to be uploaded with the paltry amount of bandwidth available.
Having a fixed version of Nuget would be super nice as well.