
Unfortunately for us, we had to move our TeamCity build server into another AWS account recently, which is never a pleasant experience, though it is often eye opening.

We had to do this for a number of reasons, but the top two were:

  • We’re consolidating some of our many AWS accounts, to more easily manage them across the organisation.
  • We recently sold part of our company, and one of the assets included in that sale was a domain name that we were using to host our TeamCity server on.

Not the end of the world, but annoying all the same.

Originally our goal was to do a complete refresh of TeamCity: copying it into a new AWS account, giving it a new URL and upgrading it to the latest version. We were already going to be disrupted for a few days, so we thought we might as well make it count. We were on TeamCity 8, and the latest is 10, which is a pretty big jump, but we were hopeful that it would upgrade without a major incident.

I should have remembered that in software development, hope is for fools. After the upgrade to TeamCity 10, the server hung on initializing for long enough that we got tired of waiting (and I’m pretty patient).

So we abandoned the upgrade, settling for moving TeamCity and rehosting it at a different URL that we still had legal ownership of.

That went relatively well. We needed to adapt some of our existing security groups in order to correctly grant access to various resources from the new TeamCity Server/Build Agents, but nothing we hadn’t dealt with a hundred times before.

Our builds seemed to be working fine, compiling, running tests and uploading NuGet packages to either MyGet or Octopus Deploy as necessary.

As we executed more and more builds though, some of them started to fail.

Failures Are Opportunities To Get Angry

All failing builds were stopping at the same place: uploading the NuGet package at the end of the process. Builds uploading to Octopus Deploy were fine (it’s a server within the same AWS VPC, so that’s not surprising), but a random sampling of builds uploading packages to MyGet had issues.

Investigating, the common theme among the failing builds was largish packages. Not huge, but at least 10 MB. The NuGet push call would time out after 100 seconds, retry a few times, and hit the same issue every time.

With 26 MB of data required to be uploaded for one of our packages (13 MB package, 13 MB symbols, probably should optimize that), and a 100 second timeout, the total upload speed we were getting was less than 300 KB/s, which is ridiculously low for something sitting literally inside a data centre.

The strange thing was, we’d never had an issue with uploading large packages before. It wasn’t until we moved TeamCity and the Build Agents into a new AWS account that we started having problems.

Looking into the network configuration, the main differences I could determine were:

  • The old configuration used a proxy to get to the greater internet. Proxies are the devil, and I hate them, so when we moved into the new AWS account, we put NAT gateways in place instead. Invisible to applications, a NAT gateway is a far easier way to give internet access to machines that do not need to be exposed on the internet directly. 
  • Being a completely different AWS account means that there is a good chance those resources would be spun up on entirely different hardware. Our previous components were pretty long lived, so they had consistently been running on the same stuff for months.

At first I thought maybe the NAT gateway had some sort of upload limit, but uploading large payloads to other websites was incredibly fast. With no special rules in place for accessing the greater internet, the slow uploads to MyGet were an intensely annoying mystery.

There was another thing as well. We wrap our usages of NuGet.exe in PowerShell functions, specifically to ensure we’re using the various settings consistently. One of the settings we apply by default on every push is the timeout, and it wasn’t set to 100 seconds; it was set to 600.
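As a rough illustration, the wrapper looks something like the sketch below. It is simplified for the example; the function name, parameter names and nuget.exe location are all made up, not our actual build scripts.

# Simplified sketch of a NuGet push wrapper; names and paths are assumptions.
function Push-NuGetPackage
{
    param
    (
        [Parameter(Mandatory=$true)]
        [string]$PackageFile,
        [Parameter(Mandatory=$true)]
        [string]$Source,
        [Parameter(Mandatory=$true)]
        [string]$ApiKey,
        [int]$TimeoutSeconds = 600
    )

    $nugetExe = ".\tools\nuget.exe" # assumed location of nuget.exe

    # -Timeout is specified in seconds; this is the value the regression described below ignores.
    & $nugetExe push $PackageFile -Source $Source -ApiKey $ApiKey -Timeout $TimeoutSeconds -NonInteractive

    if ($LASTEXITCODE -ne 0)
    {
        throw "nuget push failed with exit code [$LASTEXITCODE] for [$PackageFile]"
    }
}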

Bugs, Bugs, Bugs

A while back I had to upgrade to the latest NuGet 3.5 release candidate in order to get a fix for a bug that was stopping us from deploying empty files from a package. It’s a long story, but it wasn’t something we could easily change. Unfortunately, the latest release candidate also has a regression in it where the timeout for the push is locked at 100 seconds, no matter what you do.

It’s been fixed since, but there isn’t another release candidate yet.

Rolling back to a version that allows the timeout to work correctly stops the other fix from working.

That whole song and dance is how software feels sometimes.

With no obvious way to simply increase the timeout, and because all other traffic seemed to be perfectly fine, it was time to contact MyGet support.

They responded that it’s something they’ve seen before, but they do not know the root cause. It appears to be an issue with the way that AWS is routing traffic to their Azure hosts. It doesn’t happen all the time, but when it does, it tanks performance. They suggested recycling the NAT gateway to potentially get it onto new hardware (and thus give it a chance at better routes), but we tried that and it didn’t make a difference. We’ve since sent them some detailed Fiddler and network logs to help them diagnose the issue, but I wouldn’t be surprised if it was something completely out of their control.

On the upside, we did actually have a solution that was already working.

Our old proxy.

It hadn’t been shut down yet, so we configured the brand new shiny build agents to use the old proxy and lo and behold, packages uploaded in a reasonable time.

This at least unblocked our build pipeline so that other things could happen while we continue to investigate.

Conclusion

Disappointingly, that’s where this blog post ends. The solution that we put into place temporarily with the old proxy (and I really hate proxies) is a terrible hack, and we’re going to have to spend some significant effort fixing it properly, because if that proxy instance dies we could be returned to exactly the same place without warning (if the underlying issue really is routing that is out of our control).

Networking issues like the one I’ve described above are some of the most frustrating, especially because they can happen when you least expect it.

Not only are they basically unfathomable, but there is also very little you can do to actively fix the issue, other than moving your traffic around trying to find a better spot.

Of course, we can also optimise the content of our packages to be as small as possible, hopefully making all of our artifacts small enough to be uploaded with the paltry amount of bandwidth available.

Having a fixed version of NuGet would be super nice as well.


For anyone following the saga of my adventures with RavenDB, good news! It turns out it’s much easier to run a RavenDB server on reasonable hardware when you just put less data in it. I cleaned out the massive chunk of abandoned data over the last few weeks and everything is running much better now.

That’s not what this blog post is about though, which I’m sure is disappointing at a fundamental level.

This post is a quick one about some of the fun times that we had setting up access to customer data for the business to use for analysis purposes. What data? Well, let’s go back a few steps to set the stage.

Our long term strategy has been to free customer data stuck in on-premises databases so that the customer can easily use it remotely, in mobile applications and webpages, without having to have physical access to their database server (i.e. be in the office, or connected to their office network). This sort of strategy benefits both parties, because the customer gains access to new, useful services for when they are on the move (very important for real estate agents) and we get to develop new and exciting tools and services that leverage their data. Win-win.

Of course, everything we do involving this particular product is a stopgap until we migrate those customers to a completely cloud based offering, but that’s a while away yet, and we need to stay competitive in the meantime.

As part of the creation of one of our most recent services, we consolidated the location of the customer data in the cloud, and so now we have a multi-tenant database in AWS that contains a subset of all of the data produced by our customers. We built the database to act as the backend for a system that allows the customer to easily view their data remotely (read only access), but the wealth of information available in that repository piqued the interest of the rest of the business, mostly around using it to calculate statistics and comparison points across our entire customer base.

Now, as a rule of thumb, I’m not going to give anyone access to a production database in order to perform arbitrary, ad-hoc queries, no matter how hard they yell at me. There are a number of concerns that lead towards this mindset, but the most important one is that the database has been optimized to work best for the applications that run on it. It is not written with generic, ad-hoc intelligence queries in mind, and any such queries could potentially have an impact on the operation of the database for its primary purpose. The last thing I want is for someone to decide they want to calculate some heavy statistics over all of the data present, tying up resources that are necessary to answer queries that customers are actually asking. Maintaining quality of service is critically important.

However, the business desire is reasonable and real value could be delivered to the customer with any intelligence gathered.

So what were we to do?

Stop Copying Me

The good thing about working with AWS is that someone, somewhere has probably already tried to do what you’re trying to do, and if you’re really lucky, Amazon has already built in features to make doing the thing easy.

Such was the case with us.

An RDS read-replica neatly resolves all of my concerns. The data will be asynchronously copied from the master to the replica, allowing business intelligence queries to be performed with wild abandon without having to be concerned with affecting the customer experience. You do have to be aware of the eventually consistent nature of the replica, but that’s not as important when the queries being done aren’t likely to be time critical. Read-replicas can even be made publicly accessible (without affecting the master), allowing you to provision access to them without requiring a VPN connection or something similarly complicated.

Of course, if it was that easy, I wouldn’t have written a blog post about it.

Actually creating a read-replica is easy. We use CloudFormation to initialise our AWS resources, so it’s a fairly simple matter to extend our existing template with another resource describing the replica. You can easily specify different security groups for the replica, so we can lock it down to be publicly resolvable but only accessible from approved IP addresses without too much trouble (you’ll have to provision a security group with the appropriate rules to allow traffic from your authorised IP addresses, either as part of the template, or as a parameter injected into the template).
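If you’re not using CloudFormation, the same thing can be done imperatively. As a rough illustration only (this is not what we actually did, and the identifiers are placeholders), a minimal sketch using the AWS Tools for PowerShell would look something like this:

# Minimal sketch: create a publicly accessible read-replica from an existing RDS instance.
# Instance identifiers and region are placeholders; most other settings (instance class,
# subnet group and so on) are inherited from the master unless you override them.
Import-Module AWSPowerShell

New-RDSDBInstanceReadReplica `
    -SourceDBInstanceIdentifier "production-master" `
    -DBInstanceIdentifier "production-bi-replica" `
    -PubliclyAccessible $true `
    -Region "ap-southeast-2"

You still have to sort out security groups and networking for the replica yourself, regardless of how you create it.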

There are some tricks and traps though.

If you want to mark a replica as publicly accessible (i.e. it gets a public IP address) you need to make sure you have DNS Resolution and DNS Hostnames enabled on the host VPC. Not a big deal to be honest, but I think DNS Hostnames default to Off, so something to watch out for. CloudFormation gives a nice error message in this case, so it’s easy to tell what to do.
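If you’re flipping those settings by hand rather than in the VPC’s template, it looks roughly like this (the VPC id is obviously a placeholder):

# ModifyVpcAttribute only accepts one attribute per call, hence two calls.
Edit-EC2VpcAttribute -VpcId "vpc-12345678" -EnableDnsSupport $true
Edit-EC2VpcAttribute -VpcId "vpc-12345678" -EnableDnsHostnames $true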

What’s not so easy is that if you have the standard public/private split of subnets (where a public subnet routes all traffic through the internet gateway and a private subnet either specifies nothing or a NAT) you must make sure to put your replica in the public subnets. I think this applies to any instance that is going to be given a public IP address. If you don’t do this, no traffic will be able to escape from the replica, because the route table will try to push it through the NAT on the way out. This complicates things with the master RDS instance too, because both replica and master must share the same subnet group, so the master must be placed in the public subnets as well.

With all the CloudFormation/AWS/RDS chicanery out of the way, you still need to manage access to the replica using the standard PostgreSQL mechanisms.

The Devil Is In The Details

The good thing about PostgreSQL read replicas is that they don’t allow any changes at all, even if you’re using the root account. They are fundamentally read-only, which is fantastic.

There was no way that I was going to hand out the master password for the production RDS instance though, so I wanted to create a special user just for the rest of the business, one that could access the replica at will but had as few permissions as possible.

Because of the aforementioned read-only-ness of the replica, you have to create the user inside the master instance, which will then propagate it across to the replica in time. When it comes to actually managing permissions for users in a PostgreSQL database though, it’s a little bit different to the RDBMS that I’m most familiar with, SQL Server. I don’t think it’s better or worse, it’s just different.

A PostgreSQL server hosts many databases, and each database hosts many schemas. Users, however, exist at the server level, so in order to manage access you need to grant the user access to the databases, the schemas, and then the tables (and sequences) inside those schemas that you want them to be able to use.

At the time our RDS instance is initialised there are no databases, so we had to do this after the fact. We could provision the user and give it login/list database rights, but it couldn’t select anything from tables until we gave it access to those tables using the master user.
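Provisioning the user itself is just more SQL run against the master instance as the master user. A rough sketch, driven from PowerShell (the endpoint, role, database and passwords are placeholders, not our real ones, and psql.exe is assumed to be on the PATH):

# Rough sketch: create a login-only role on the master, which then replicates across.
$masterEndpoint = "production-master.xxxxxxxx.ap-southeast-2.rds.amazonaws.com" # placeholder
$env:PGPASSWORD = "the-master-password" # placeholder

$sql = @"
CREATE ROLE business_intelligence WITH LOGIN PASSWORD 'also-a-placeholder';
GRANT CONNECT ON DATABASE customer_data TO business_intelligence;
"@

$sql | & psql --host $masterEndpoint --username master_user --dbname customer_data

With that role in place, the table level access came down to a handful of grants (again, run as the master user):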

GRANT USAGE ON SCHEMA {schema} TO {username};

GRANT SELECT ON ALL TABLES IN SCHEMA {schema} TO {username};

GRANT SELECT ON ALL SEQUENCES IN SCHEMA {schema} TO {username};

Granting access once is not enough though, because any additional tables created after the statement is executed will not be accessible. To fix that you have to alter the default privileges of the schema, granting the appropriate permissions for the user you are interested in.

ALTER DEFAULT PRIVILEGES IN SCHEMA {schema}
    GRANT SELECT ON TABLES TO {username};

With all of that out of the way, we had our replica.

Conclusion

Thanks to AWS, creating and managing a read-replica is a relatively painless procedure. There are some tricks and traps along the way, but they are very much surmountable. It’s nice to be able to separate our concerns cleanly, and to have support for doing that at the infrastructure level.

I shudder to think how complicated something like this would have been to setup manually.

I really do hope AWS never goes full evil and decides to just triple or quadruple their prices though, because it would take months to years to replicate some of the things we’re doing in AWS now.

We’d probably just be screwed.