0 Comments

A long long time ago, a valuable lesson was learned about segregating production infrastructure from development/staging infrastructure. Feel free to go and read that post if you want, but to summarise: I ran some load tests on an environment (one specifically built for load testing, separate to our normal CI/Staging/Production), and the tests flooded a shared component (a proxy), brining down production services and impacting customer experience. The outage didn’t last that long once we realised what was happening, but it was still pretty embarrassing.

Shortly after that adventure, we created a brand new AWS account to isolate our production infrastructure and slowly moved all of that important stuff into it, protecting it from developers doing what they do best (breaking stuff in new and interesting ways).

This arrangement complicated a few things, but the most relevant to this discussion was the creation and management of AMIs.

We were already using Packer to create and maintain said AMI’s, so it wasn’t a manual process, but an AMI in AWS is always owned by and accessible from one AWS account by default.

With two completely different AWS accounts, it was easy to imagine a situation where each account has slightly different AMIs available, which have slightly different behaviour, leading to weird things happening on production environments that don’t happen during development or in staging.

That sounds terrible, and it would be neat if we could ensure it doesn’t happen.

A Packaged Deal

The easiest thing to do is share the AMIs in question.

AWS makes it relatively easy to make an AMI accessible to a different AWS account, likely for this exact purpose. I think it’s also used to enable companies to sell pre-packaged AMIs as well, but that’s a space I know little to nothing about, so I’m not sure.

As long as you know the account number of the AWS account that you want to grant access to, its a simple matter to use the dashboard or the API to share the AMI, which can then be freely used from the other account to create EC2 instances.

One thing to be careful of is to make sure you grant the ability for the other account to access the AMI snapshot as well, or you’ll run into permission problems if you try to actually use the AMI to make an EC2 instance.

Sharing AMI’s is alright, but it has risks.

If you create AMIs in your development account and then share them with production, then you’ve technically got production infrastructure inside your development account, which was one of the things we desperately wanted to avoid. The main problem here is that people will not assume that a resource living inside the relatively free-for-all development environment could have any impact on production, and they might delete it or something equally dangerous. Without the AMI, auto scaling won’t work, and the most likely time to figure that sort of thing out is right when you need it the most.

A slightly better approach is to copy the AMI to the other account. In order to do this you share the AMI from the owner account (i.e. dev) and then make a permanent copy on the other account (i.e. prod). Once the copy is complete, you unshare (to prevent accidental usage).

This breaks the linkage between the two accounts while ensuring that the AMIs are identical, so its a step up from simple sharing, but there are limitations.

For example, everything works swimmingly until you try to copy a Windows AMI, then it fails miserably as a result of the way in which AWS licences Windows. On the upside, the copy operation itself fails fast, rather than making a copy that then fails when you try to use it, so that’s nice.

So, two solutions, neither of which is ideal.

Surely we can do better?

Pack Of Wolves

For us, the answer is yes. We just run our Packer templates twice, once for each account.

This has actually been our solution for a while. We execute our Packer templates through TeamCity Build Configurations, so it is a relatively simple matter to just run the build twice, once for each account.

Well, “relatively simple” is probably understating it actually.

Running in dev is easy. Just click the button, wait and a wild AMI appears.

Prod is a different question.

When creating an AMI Packer needs to know some things that are AWS account specific, like subnets, VPC, security groups and so on (mostly networking concerns). The source code contained parameters relevant for dev (hence the easy AMI creation for dev), but didn’t contain anything relevant for prod. Instead, whenever you ran a prod build in TeamCity, you had to supply a hashtable of parameter overrides, which would be used to alter the defaults and make it work in the prod AWS account.

As you can imagine, this is error prone.

Additionally, you actually have to remember to click the build button a second time and supply the overrides in order to make a prod image, or you’ll end up in a situation where you deployed your environment changes successfully through CI and Staging, but it all explodes (or even worse, subtly doesn’t do what its supposed to) when you deploy them into Production because there is no equivalent AMI. Then you have to go and make one using TeamCity, which is error prone, and if the source has diverged since you made the dev one…well, its just bad times all around.

Leader Of The Pack

With some minor improvements, we can avoid that whole problem though.

Basically, whenever we do a build in TeamCity, it creates the dev AMI first, and then automatically creates the prod one as well. If the dev fails, no prod. If the prod fails, dev is deleted.

To keep things in sync, both AMIs are tagged with a version attribute created during the build (just like software), so that we have a way to trace the AMI back to the git commit it was created from (just like software).

To accomplish this approach, we now have a relatively simple configuration hierarchy, with default parameters, dev specific parameters and prod specific parameters. When you start the AMI execution, you tell the function what environment you’re targetting (dev/prod) and it loads defaults, then merges in the appropriate overrides.

This was a relatively easy way to deal with things that are different and non-sensitive (like VPC, subnets, security groups, etc).

What about credentials though?

Since an…incident…waaaay back in 2015, I’m pretty wary of credentials, particularly ones that give access to AWS.

So they can’t go in source control with the rest of the parameters.

That leaves TeamCity as the only sane place to put them, which it can easily do, assuming we don’t mind writing some logic to pick the appropriate credentials depending on our targeted destination.

We could technically have used some combination of IAM roles and AWS profiles as well, but we already have mechanisms and experience dealing with raw credential usage, so this was not the time to re-invent that particular wheel. That’s a fight for another day.

With account specific parameters and credentials taken care of, everything is good, and every build results in 2 AMIs, one for each account.

I’ve uploaded a copy of our Packer repository containing all of this logic (and a copy of the script we embed into TeamCity) to Github for reference purposes.

Conclusion

I’m much happier with the process I described above for creating our AMIs. If a build succeeds, it creates resources in both of our active AWS accounts, keeping them in sync and reducing the risk of subtle problems come deployment time. Not only that, but it also tags those resources with a version that can be traced back to a git commit, which is always more useful than you think.

There are still some rough edges around actually using the AMIs though. Most of our newer environments specify their AMIs directly via parameter files, so you have to remember to change the values for each environment target when you want to use a new AMI. This is dangerous, because if someone forgets it could lead to a disconnect between CI/Staging and Production, which was pretty much the entire problem we were trying to avoid in the first place.

Honestly, its going to be me that forgets.

Ah well, all in all, its a lot more consistent than it was before, which is pretty much the best I could hope for.