As I mentioned in a previous post, I recently started a new position at Onthehouse.
Onthehouse uses Amazon EC2 for their cloud based virtualisation, including that of the build environment (TeamCity). Its common for a build environment to be largely ignored as long as it is still working, until the day it breaks and then it all goes to hell.
Luckily that is not what happened.
Instead, the team identified that the build environment needed some maintenance, specifically around one of the application specific Build Agents.
Its an ongoing process, but the reason for there being an application specific Build Agent is because the application has a number of arcane, installed, licenced third-party components. Its VB6, so its hard to manage those dependencies in a way that is mobile. Something to work on in the future, but not a priority for right now.
My first task at Onthehouse, was to ensure that changes made to the running Instance of the Build Agent had been appropriately merged into the base Image. As someone who had never before used the Amazon virtualisation platform (Amazon EC2) I was somewhat confused.
This post follows my journey through that confusion and out the other side into understanding and I hope it will be of use to someone else out there.
As an aside, I think that getting new developers to start with build services is a great way to familiarise them with the most important part of an application, how to build it. Another fantastic first step is to get them to fix bugs.
As I mentioned previously, I’ve never used AWS (Amazon Web Services) before, other than uploading some files to an Amazon S3 account, let alone the virtualization platform (Amazon EC2).
My main experience with virtualisation comes from using Virtual Box on my own PC. Sure, I’ve used Azure to spin up machines and websites, and I’ve occasionally interacted with VMWare and Hyper-V, but Virtual Box is what I use every single day to build, create, maintain and execute sandbox environments for testing, exploration and all sorts of other things.
I find Virtual Box straightforward.
You have a virtual machine (which has settings, like CPU Cores, Memory, Disks, etc) and each machine has a set of snapshots.
Snapshots are a record of the state of the virtual machine and its settings at a point in time chosen by the user. I take Snapshots all the time, and I use them to easily roll back to important moments, like a specific version of an application, or before I wrecked everything by doing something stupid and destructive (it happens more than you think).
Thinking about this now, I’m not sure how Virtual Box and Snapshots interact with multiple disks. Snapshots seems to be primarily machine based, not disk based, encapsulating all of the things about the machine. I suppose that probably includes the disks. I guess I don’t tend to use multiple disks in the machines I’m snapshotting all the time, only using them rarely for specific tasks.
Images and Instances
Amazon EC2 (Elastic Compute) does not work the same way as Virtual Box.
I can see why its different to Virtual Box as they have entirely different purposes. Virtual Box is intended to facilitate virtualisation for the single user. EC2 is about using virtualisation to leverage the power of the cloud. Single users are a completely foreign concept. Its all about concurrency and scale.
In Amazon EC2 the core concept is an Image(or Amazon Machine Image, AMI). Images describe everything about a virtual machine, kind of like a Virtual Machine in Virtual Box. However, in order to actually use an Image, you must spin up an Instance of that Image.
At the point in time you spin up an Instance of an Image, they have diverged. The Instance typically contains a link back to its Image, but its not a hard link. The Instance and Image are distinctly separate and you can delete the Image (which if you are using an Amazon supplied Image, will happen regularly) without negatively impacting on the running instance.
Instances generally have Volumes, which I think are essentially virtual disks. Snapshots come into play here as well, but I don’t understand Volumes and Snapshots all that well at this point in time, so I’m going to conveniently gloss over them. Snapshots definitely don’t work like VirtualBox snapshots though, I know that much.
Instances can generally be rebooted, stopped, started and terminated.
Reboot, stop and start do what you expect.
Terminating an instance kills it forever. It also kills the Volume attached to the instance if you have that option selected. If you don’t have the Image that the Instance was created from, you’re screwed, its gone for good. Even if you do, you will have lost any change made to that Image since the Instance began running.
Build It Up
Back to the Build environment.
The application specific Build Agent had an Image, and an active Instance, as normal.
This Instance had been tweaked, updated and changed in various ways since the Image was made, so much so that no-one could remember exactly what had been done. Typically this wouldn’t be a major issue, as Instances don’t just up and disappear.
Except this Instance could, and had in the past.
The reason for its apparently ephemeral nature was because Amazon offers a spot pricing option for Instances. Spot pricing allows you to create a spot request and set your own price for an hour of compute time. As long as the spot price is below that price, your Instance will run. If the spot price goes above your price, your Instance dies. You can setup your spot price request to be reoccurring, such that the Instance will restart when the price goes down again, but you will have lost all information not on the baseline Image (an event like that is equivalent to terminating the instance and starting another one).
Obviously we needed to ensure that the baseline Image was completely able to run a build of the application in question, requiring the minimal amount of configuration on first start.
Thus began a week long adventure to take the current base Image, create an Instance from it, and get a build working, so we could be sure that if our Instance was terminated it would come back and we wouldn’t have to spend a week getting the build working again.
I won’t go into detail about the whole process, but it mostly involved lots of manual steps to find out what was thing was wrong this time, fixing it in as nice a way as time permitted and then trying again. It mostly involved waiting. Waiting for instances, waiting for builds, waiting for images. Not very interesting.
A Better Approach
Knowing what I know now (and how long the whole process would take), my approach would be slightly different.
Take a snapshot of the currently running Instance, spin up an Instance of it, change all of the appropriate unique settings to be invalid (Build Agent name mostly) and then take another Image. That’s your baseline.
Don’t get me wrong, it was a much better learning experience the first way, but it wasn’t exactly an excellent return on investment from the point of view of the organisation.
Ah well, hindsight.
A Better Architecture
The better architecture is to have TeamCity managed the lifetime of its Build Agents, which it is quite happy to do via Amazon EC2. TeamCity can then manage the instances as it sees fit, spinning them down during idle periods, and even starting more during periods of particularly high load (I’m looking at you, end of the iteration crunch time).
I think this is definitely the approach we will take in the future, but that’s a task for another day.
Honestly, the primary obstacle in this particular task was learning how Amazon handles virtualization, and wrapping my head around the differences between that and Virtual Box (which is where my mental model was coming from). After I got my head around that I was in mostly familiar territory, diagnosing build issues and determining the best fix that would maximise mobility in the future, while not requiring a massive amount of time.
From the point of view of me, a new starter, this exercise was incredibly valuable. It taught me an immense amount about the application, its dependencies, the way its built and all sorts of other, half-forgotten tidbits.
From the point of view of the business, I should have definitely realized that there was a quicker path to the end goal (make sure we can recover from a lost Build Agent instance) and taken that into consideration, rather than try to work my way through the arcane dependencies of the application. There’s always the risk that I missed something subtle as well, which will rear its ugly head next time we lose the Build Agent instance.
Which could happen.
At any moment.
(Cue Ominous Music)