It makes me highly uncomfortable if someone suggests that I support a piece of software without an automated build process.

In my opinion it’s one of the cornerstones on top of which software delivery is built. It just makes everything that comes afterwards easier, and enables you to think and plan at a much higher level, allowing you to worry about much more complicated topics.

Like continuous delivery.

But let’s shy away from that for a moment, because sometimes you have to deal with the basics before you can do anything else.

In my current position I’m responsible for the engineering behind all of the legacy products in our organization. At this point in time those products make almost all of the money (yay!), but contain 90%+ of the technical terror (boo!) so the most important thing from my point of view is to ensure that we can at least deliver them reliably and regularly.

Now, some of the products are in a good place regarding delivery.

Some, however, are not.

Someone’s Always Playing Continuous Integration Games

One specific product comes to mind. Now, that’s not to say that the other products are perfect (far from it in fact), but this product in particular is lacking some of the most fundamental elements of good software delivery, and it makes me uneasy.

In fairness, the product is still quite successful (multiple millions of dollars of revenue), but from an engineering point of view, that’s only because of the heroic efforts of the individuals involved.

With no build process, you suffer from the following shortcomings:

  • No versioning (or maybe ad-hoc versioning if you’re lucky). This makes it hard to reason about what variety of software the customer has, and can make support a nightmare. Especially true when you’re dealing with desktop software.
  • Mysterious or arcane build procedures. If no-one has taken the time to recreate the build environment (assuming there is one), then it probably has all sorts of crazy dependencies. This has the side effect of making it really hard to get a new developer involved as well.
  • No automated tests. With no build process running the tests, if you do have tests, they are probably not being run regularly. That’s if you have tests at all of course, because with no process running them, people probably aren’t writing them.
  • A poor or completely ad-hoc distribution mechanism. Without a build process to form a foundation, whatever distribution mechanism does exist tends to be improvised and hard to follow.

But there is no point in dwelling on what we don’t have.

Instead, let’s do something about it.

Who Cares They’re Always Changing Continuous Integration Names

The first step is a build script.

Now, as I’ve mentioned before on this blog, I’m a big fan of including the build script in the repository, so that anyone with the appropriate dependencies can just clone the repo and run the script to get a deliverable. Release candidates will obviously be built on some sort of controlled build server, but I’ve found it’s important to be able to execute the same logic both locally and remotely in order to be able to react to unexpected issues.

Of course, the best number of dependencies outside of the repository is zero, but sometimes that’s not possible. Aim to minimise them at least, either by isolating them and including them directly, or by providing some form of automated bootstrapping.
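To make that concrete, here is a heavily simplified sketch of the sort of entry point I mean. The file name, parameters and the NuGet bootstrap are purely illustrative, not exactly what our framework does:

    # build.ps1 (hypothetical name), checked into the root of the repository.
    # The goal: clone the repo, run this script, get a deliverable.
    param
    (
        [int]$buildNumber = 0
    )

    $ErrorActionPreference = "Stop"
    $here = Split-Path $script:MyInvocation.MyCommand.Path

    # Bootstrap the dependencies that can't live directly in the repository,
    # for example a copy of NuGet.exe.
    $tools = "$here\tools"
    if (-not (Test-Path $tools)) { New-Item $tools -ItemType Directory | Out-Null }
    $nuget = "$tools\nuget.exe"
    if (-not (Test-Path $nuget))
    {
        Invoke-WebRequest "https://dist.nuget.org/win-x86-commandline/latest/nuget.exe" -OutFile $nuget
    }

    # ... compile, test and package steps go here ...

The important part is that everything the script needs is either in the repository or fetched by the script itself.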

This particular product is built in a mixture of Delphi (specifically Delphi 7) and .NET, so it wasn’t actually all that difficult to use our existing build framework (a horrific aberration built in Powershell) to get something up and running fairly quickly.

The hardest part was figuring out how to get the Delphi compiler to work from the command line, while still producing the same output as it would if you just followed the current build process (i.e. compilation from within the IDE).
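For the curious, driving the Delphi 7 compiler from Powershell ends up looking something like the sketch below. The paths, project name and switches are illustrative only (and they tend to differ between Delphi versions and project settings), but dcc32.exe is the command line compiler that ships alongside the IDE:

    # Hypothetical compilation step for the Delphi portion of the product.
    $dcc32 = "C:\Program Files\Borland\Delphi7\Bin\dcc32.exe"
    $project = ".\src\LegacyProduct.dpr"

    # -B forces a full build, -E sets the executable output directory, -U adds unit search paths.
    & $dcc32 -B "-E.\build-output" "-U.\src\lib" $project
    if ($LASTEXITCODE -ne 0) { throw "Delphi compilation failed with exit code $LASTEXITCODE" }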

With the compilation out of the way, the second hardest part was creating an artifact that looked and acted like the artifact that was being manually created. This comes in the form of a self-extracting zip file containing an assortment of libraries and executables that make up the “update” package.
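If you were assembling a similar artifact with something like 7-Zip, it might look like the following sketch; our actual packaging step differs in the details, so treat the paths and names as placeholders:

    # Hypothetical packaging step: bundle the staged libraries and executables into a
    # self-extracting archive using 7-Zip (-sfx produces an .exe instead of a plain archive).
    $sevenZip = "C:\Program Files\7-Zip\7z.exe"
    & $sevenZip a -sfx ".\build-output\update-package.exe" ".\build-output\staging\*"
    if ($LASTEXITCODE -ne 0) { throw "Packaging failed with exit code $LASTEXITCODE" }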

Having dealt with both of those challenges, it’s nothing but smooth sailing.

We Just Want to Dance Here, But We Need An AMI

Ha ha ha ha no.

This being a piece of legacy software, the second step was to create a build environment that could be used from TeamCity.

This means an AMI with everything required in order to execute the build script.

For Delphi 7, that means an old version of the Delphi IDE and build tools. Good thing we still had the CD that the installer came on, so we just made an ISO and remotely mounted it in order to install the required software.

Then came the multitude of library and tool dependencies specific to this particular piece of software. Luckily, someone had actually documented enough instructions on how to set up a development environment, so we used that information to complete the configuration of the machine.

A few minor hiccups later and we had a build artifact coming out of TeamCity for this product for the very first time.

A considerable victory.

But it wasn’t versioned yet.

They Call Us Irresponsible, The Versioning Is A Lie

This next step is actually still under construction, but the plan is to use the TeamCity build number input and some static version configuration stored inside the repository to create a SemVer styled version for each build that passes through TeamCity.

Any build not passing through TeamCity, or being built from a branch should be tagged with an appropriate pre-release string (i.e. 1.0.0-[something]), allowing us to distinguish good release candidates (off master via TeamCity) from dangerous builds that should never be released to a real customer.
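A rough sketch of the plan follows. The configuration file name, environment variables and pre-release tag are assumptions for illustration, not the final implementation:

    # Hypothetical version calculation. Major/minor come from a file in the repository,
    # patch comes from the TeamCity build number, and anything that isn't master-via-TeamCity
    # gets a pre-release tag.
    $config = Get-Content ".\version.json" | ConvertFrom-Json      # e.g. { "major": 1, "minor": 3 }
    $buildNumber = if ($env:BUILD_NUMBER) { $env:BUILD_NUMBER } else { 0 }
    $branch = & git rev-parse --abbrev-ref HEAD

    $version = "$($config.major).$($config.minor).$buildNumber"
    if (($env:TEAMCITY_VERSION -eq $null) -or ($branch -ne "master"))
    {
        $version = "$version-unofficial"
    }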

The abomination of a Powershell build framework allows for most of this sort of stuff, but assumes that a .NET style AssemblyInfo.cs file will exist somewhere in the source.

At the end of the day, we decided to just include such a file for ease of use, and then propagate that version generated via the script into the Delphi executables through means that I am currently unfamiliar with.
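The .NET side of that is simple enough. Continuing the sketch above (file location and regexes are illustrative), stamping the calculated version into the AssemblyInfo.cs looks roughly like this; AssemblyVersion has to stay purely numeric, so the full SemVer (pre-release tag and all) goes into AssemblyInformationalVersion:

    # Hypothetical version stamping into the shared AssemblyInfo.cs.
    $assemblyInfo = ".\src\SolutionItems\AssemblyInfo.cs"
    $numeric = "$($config.major).$($config.minor).$buildNumber.0"

    $content = Get-Content $assemblyInfo
    $content = $content -replace 'AssemblyVersion\("[^"]*"\)', "AssemblyVersion(""$numeric"")"
    $content = $content -replace 'AssemblyInformationalVersion\("[^"]*"\)', "AssemblyInformationalVersion(""$version"")"
    Set-Content $assemblyInfo $content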

Finally, all builds automatically tag the appropriate commit in Git, but that’s pretty much built into TeamCity anyway, so barely worth mentioning.

Conclusion

Like I said at the start of the post, if you don’t have an automated build process, you’re definitely doing it wrong.

I managed to summarise the whole “let’s construct a build process” journey into a single, fairly lightweight blog post, but a significant amount of work went into it over the course of a few months. I was only superficially involved (as is mostly the case these days), so I have to give all of the credit to my colleagues.

The victory that this build process represents cannot be overstated though, as it will form a solid foundation for everything to come.

A small step in the greater scheme of things, but I’m sure everyone knows the quote at this point.


After that brief leadership interlude, it’s time to get back into the technical stuff with a weird and incredibly frustrating issue that I encountered recently when building a small command-line application using .NET 4.7.

More specifically, it failed to compile on our build server citing problems with a dependency that it shouldn’t have even required.

So without further ado, on with the show.

One Build Process To Rule Them All

One of the rules I like to follow for a good build process is that it should be executable outside of the build environment.

There are a number of reasons for this rule, but the two that are most relevant are:

  1. If something goes wrong during the build process you can try and run it yourself and see what’s happening, without having to involve the build server
  2. As a result of having to execute the process outside of the build environment, it’s likely that the build logic will be encapsulated in source control, alongside the code

With the way that a lot of software compilation works though, it can be hard to create build processes that automatically bootstrap the necessary components on a clean machine.

For example, there is no real way to compile a .NET Framework 4.7 codebase without using software that has to be installed. As far as I know you have to use either MSBuild, Visual Studio or some other component to do the dirty work. .NET Core is a lot better in this respect, because it’s all command line driven and doesn’t feature any components that must be installed on the machine before it will work. All you have to do is bootstrap the self-contained SDK.
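As a sketch of what that bootstrapping looks like (the folder names are illustrative), the official install script can pull a copy of the SDK into a folder local to the repository without touching the machine itself:

    # Hypothetical bootstrap: download the official install script and put the .NET Core SDK
    # in a folder next to the source, rather than installing anything machine-wide.
    New-Item ".\tools" -ItemType Directory -Force | Out-Null
    Invoke-WebRequest "https://dot.net/v1/dotnet-install.ps1" -OutFile ".\tools\dotnet-install.ps1"
    & ".\tools\dotnet-install.ps1" -InstallDir ".\tools\dotnet" -Channel LTS

    # Use that local copy for the actual build.
    & ".\tools\dotnet\dotnet.exe" build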

Thus while the dream is for the build process to be painless to execute on a clean git clone (with the intent that that is exactly what the build server does), sometimes dreams don’t come true, no matter how hard you try.

For us, our build server comes with a small number of components pre-installed, including MSBuild, and then our build scripts rely on those components existing in order to work correctly. There is a little bit of fudging involved though, so you don’t have to have exactly the same components installed locally, and it will dynamically find MSBuild for you.
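Conceptually, the discovery looks something like the sketch below. This is just one way to do it (using vswhere, which ships with the Visual Studio 2017 installer), not necessarily exactly what our framework does:

    # Hypothetical MSBuild discovery: ask vswhere where the newest MSBuild lives,
    # falling back to the old .NET Framework location.
    $vswhere = "${env:ProgramFiles(x86)}\Microsoft Visual Studio\Installer\vswhere.exe"
    $msbuild = $null
    if (Test-Path $vswhere)
    {
        $msbuild = & $vswhere -latest -requires Microsoft.Component.MSBuild -find "MSBuild\**\Bin\MSBuild.exe" | Select-Object -First 1
    }
    if (-not $msbuild)
    {
        $msbuild = "$env:windir\Microsoft.NET\Framework\v4.0.30319\MSBuild.exe"
    }

    & $msbuild ".\src\Solution.sln" /p:Configuration=Release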

This was exactly how the command-line application build process was working before I touched it.

Then I touched it and it all went to hell.

Missing Without A Trace

Whenever you go back to a component that hasn’t been actively developed for a while, you always have to decide whether or not you should go to the effort of updating its dependencies that are now probably out of date.

Of course, some upgrades are a lot easier to action than others (i.e. a NuGet package update is generally a lot less painful than updating to a later version of the .NET Framework), but the general idea is to put some effort into making sure you’ve got a strong base to work from.

So that’s exactly what I did when I resurrected the command-line application used for metrics generation. I updated the build process, renamed the repository/namespaces (to be more appropriate), did a pass over the readme and updated the NuGet packages. No .NET version changes though, because that stuff can get hairy and it was already at 4.7, so it wasn’t too bad.

Everything compiled perfectly fine in Visual Studio and the self-contained build process continued to work on my local machine, so I pushed ahead and implemented the necessary changes.

Then I pushed my code and the automated build process on our build server failed consistently with a bunch of compilation errors like the following:

Framework\IntegrationTestKernel.cs(64,13): error CS0012: The type 'ValueType' is defined in an assembly that is not referenced. You must add a reference to assembly 'netstandard, Version=2.0.0.0, Culture=neutral, PublicKeyToken=cc7b13ffcd2ddd51'.

The most confusing part?

I had taken no dependency on netstandard as far as I knew.

More importantly, my understanding of netstandard was that it is basically a set of common interfaces to allow for interoperability between the .NET Framework and .NET Core. I had no idea why my code would fail to compile citing a dependency I didn’t even ask for.

Also, it worked perfectly on my machine, so clearly something was awry.

The Standard Response

The obvious first response is to add a reference to netstandard.

This is apparently possible via the NETStandard.Library NuGet package, so I added that, verified that it compiled locally and pushed again.

Same compilation errors.

My next hypothesis was that maybe something had gone weird with .NET Framework 4.7. There are a number of articles on the internet about similar looking topics, and some of them read like later versions of the .NET Framework (which are in-place upgrades, for god only knows what reason) have changes relating to netstandard and .NET Framework integration and compatibility. It was a shaky hypothesis though, because this application had always specifically targeted .NET 4.7.

Anyway, I flipped the projects to all target an earlier version of the .NET Framework (4.6.2) and then reinstalled all the NuGet packages (thank god for the Update-Package -reinstall command).

Still no luck.

The last thing I tried was removing all references to the C# 7 Value Tuple feature (super helpful when creating methods with complex return types), but that didn’t help either.

I Compromised; In That I Did Exactly What It Wanted

In the end I accepted defeat and made the Visual Studio Build Tools 2017 available on our build server by installing them on our current build agent AMI, taking a new snapshot and then updating TeamCity to use that snapshot instead. In order to get everything to compile cleanly, I had to specifically install the .NET Core Build Tools, which made me sad, because .NET Core was actually pretty clean from a build standpoint. Now if someone puts together a .NET Core repository incorrectly, it will probably still continue to compile just fine on the build server, leaving a tripwire for the next time someone cleanly checks out the repo.

Ah well, can’t win them all.

Conclusion

I suspect that the root cause of the issue was updating some of the NuGet packages, specifically the packages that are only installed in the test projects (like the Ninject.MockingKernel and its NSubstitute implementation), as the test projects were the only ones that were failing to compile.

I’m not entirely sure why a package update would cause compilation errors though, which is pretty frustrating. I’ve never experienced anything similar before, so perhaps those libraries were compiled to target a specific framework (netstandard 2.0) and those dependencies flowed through into the main projects they were installed into?

Anyway, our build agents are slightly less clean now as a result, which makes me sad, but I can live with it for now.

I really do hate system installed components though.


We use TeamCity as our Continuous Integration tool.

Unfortunately, our setup hasn’t been given as much love as it should have. It’s not bad (or broken or any of those things), it’s just not quite as well set up as it could be, which increases the risk that it will break and makes it harder to manage (and change and keep up to date) than it could be.

As with everything that has got a bit crusty around the edges over time, the only real way to attack it while still delivering value to the business is by doing it slowly, piece by piece, over what seems like an inordinate amount of time. The key is to minimise disruption, while still making progress on the bigger picture.

Our setup is fairly simple. A centralised TeamCity server and at least 3 Build Agents capable of building all of our software components. We host all of this in AWS, but unfortunately, it was built before we started consistently using CloudFormation and infrastructure as code, so it was all manually setup.

Recently, we started using a few EC2 spot instances to provide extra build capabilities without dramatically increasing our costs. This worked fairly well, up until the spot price spiked and we lost the build agents. We used persistent requests, so they came back, but they needed to be configured again before they would hook up to TeamCity because of the manual way in which they were provisioned.

There’s been a lot of instability in the spot price recently, so we were dealing with this manual setup on a daily basis (sometimes multiple times per day), which got old really quickly.

You know what they say.

“If you want something painful automated, make a developer responsible for doing it manually and then just wait.”

Its Automatic

The goal was simple.

We needed to configure the spot Build Agents to automatically bootstrap themselves on startup.

On the upside, the entire process wasn’t completely manual. We were at least spinning up the instances from a pre-built AMI that already had all of the dependencies for our older, crappier components as well as an unconfigured TeamCity Build Agent on it, so we didn’t have to automate absolutely everything.

The bootstrapping would need to tag the instance appropriately (because for some reason spot instances don’t inherit the tags of the spot request), configure the Build Agent and then start it up so it would connect to TeamCity. Ideally, it would also register and authorize the Build Agent, but if we used controlled authorization tokens we could avoid this step by just authorizing the agents once. Then they would automatically reappear each time the spot instance came back.

So: tagging, configuring and service start, all in Powershell, with the script baked into the AMI. During provisioning we would supply some UserData that would execute the script.
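For a Windows instance, the UserData itself can stay tiny, because all of the interesting logic lives in the script baked into the AMI. Something like the sketch below, where the script path and parameter are hypothetical:

    <powershell>
    # Executed at boot by the EC2 config service. The heavy lifting lives in the
    # bootstrap script baked into the AMI, so the UserData stays tiny.
    & "C:\scripts\bootstrap-buildagent.ps1" -TeamCityUrl "http://teamcity.internal:8111"
    </powershell>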

Not too complicated.

Like Graffiti, Except Useful

Tagging an EC2 instance is pretty easy thanks to the multitude of toolsets that Amazon provides. Our tool of choice is the Powershell cmdlets, so the actual tagging was a simple task.
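The happy path is only a couple of lines (the tag key and value are illustrative), relying on the instance profile to supply credentials automatically:

    # Ask the instance metadata service who we are, then tag ourselves,
    # relying on the instance profile to supply credentials automatically.
    Import-Module AWSPowerShell
    $instanceId = Invoke-RestMethod "http://169.254.169.254/latest/meta-data/instance-id"
    New-EC2Tag -Resource $instanceId -Tag @{ Key = "Name"; Value = "[TeamCity] Spot Build Agent" }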

Getting permission to do the tagging was another story.

We’re pretty careful with our credentials these days, for some reason, so we wanted to make sure that we weren’t supplying and persisting any credentials in the bootstrapping script. That means IAM.

One of the key features of the Powershell cmdlets (and most of the Amazon supplied tools) is that they are supposed to automatically grab credentials if they are being run on an EC2 instance that currently has an instance profile associated with it.

For some reason, this would just not work. We tried a number of different things to get this to work (including updating the version of the Powershell cmdlets we were using), but in the end we had to resort to calling the instance metadata service directly to grab some credentials.
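Calling the metadata service is not particularly glamorous, but it is at least simple. Roughly (the role discovery and credential plumbing below are a sketch, not our exact script):

    # The instance profile exposes temporary credentials via the metadata service.
    $base = "http://169.254.169.254/latest/meta-data/iam/security-credentials"
    $role = Invoke-RestMethod $base
    $credentials = Invoke-RestMethod "$base/$role"

    # Hand them to the AWS cmdlets explicitly for the rest of the session.
    Set-AWSCredentials -AccessKey $credentials.AccessKeyId -SecretKey $credentials.SecretAccessKey -SessionToken $credentials.Token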

Obviously the instance profile that we applied to the instance represented a role with a policy that only had permissions to alter tags. Minimal permission set and all that.

Service With a Smile

Starting/stopping services with Powershell is trivial, and for once, something weird didn’t happen causing us to burn days while we tracked down some obscure bug that only manifests in our particular use case.

I was as surprised as you are.

Configuration Is Key

The final step should have been relatively simple.

Take a file with some replacement tokens, read it, replace the tokens with appropriate values, write it back.

Except it just wouldn’t work.

After editing the file with Powershell (a relatively simple Get-Content | ForEach-Object { $_ -replace {token}, {value} } | Out-File) the TeamCity Build Agent would refuse to load.

Checking the log file, its biggest (and only) complaint was that the serverUrl (which is the location of the TeamCity server) was not set.

This was incredibly confusing, because the file clearly had a serverUrl value in it.

I tried a number of different things to determine the root cause of the issue, including:

  • Permissions? Was the file accidentally locked by TeamCity such that the Build Agent service couldn’t access it?
  • Did the rewrite of the tokens somehow change the format of the file (extra spaces, CR LF when it was just expecting LF)?
  • Was the serverUrl actually configured, but inaccessible for some reason (machine proxy settings for example) and the problem was actually occurring not when the file was rewritten but when the script was setting up the AWS Powershell cmdlets proxy settings?

Long story short, it turns out that Powershell doesn’t preserve file encoding when using the Out-File functionality in the way we were using it. It was rewriting the file from plain ASCII to Unicode Little Endian (complete with a Byte Order Mark), and the Build Agent did not like that (it didn’t throw an encoding error either, which is super annoying, but whatever).

The error message was both a red herring (yes, the value was configured) and also truthful (the Build Agent was incapable of reading the serverUrl).
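The fix is just to be explicit about the encoding when writing the file back out. Something along these lines, where the token and paths are illustrative and -Encoding ascii matches what the file originally was:

    # Replace the tokens and write the file back with an explicit encoding,
    # so the BOM doesn't silently change underneath the Build Agent.
    $properties = "C:\BuildAgent\conf\buildAgent.properties"
    (Get-Content $properties) |
        ForEach-Object { $_ -replace "@@SERVER_URL@@", "http://teamcity.internal:8111" } |
        Out-File $properties -Encoding ascii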

Putting It All Together

With all the pieces in place, it was a relatively simple matter to create a new AMI with those scripts baked into it and put it to work straightaway.

Of course, I’d been doing this the whole time in order to test the process, so I certainly had a lot of failures building up to the final deployment.

Conclusion

Even simple automation can prove to be time consuming, especially when you run into weird, unforeseen problems, like components not performing as advertised, or not even reporting correct errors for you to use for debugging purposes.

Still, it was worth it.

Now I never have to manually configure those damn spot instances when they come back.

And satisfaction is worth its weight in gold.


You might have noticed a pattern in my recent posts. They’re all about build scripts, automation, AWS and other related things. It seems that I have fallen into a dev-ops role. Not officially, but it’s basically all I’ve been doing for the past few months.

I’m not entirely sure how it happened. A little automation here, a little scripting there. I see an unreliable manual process and I want to automate it to make it reproducible.

The weird thing is, I don’t really mind. I’m still solving problems, just different ones. It feels a little strange, but it’s nice to have your client/end-user be a technical person (i.e. a fellow programmer) instead of the usual business person with only a modicum of technical ability.

I’m not sure how my employer feels about it, but they must be okay with it, or surely someone would have pulled me aside and asked some tough questions. I’m very vocal about what I’m working on and why, so it’s not like I’m just quietly doing the wrong work in the background without making a peep.

Taking into account the above comments, it’s unsurprising then that this blog post will continue on in the same vein as the last ones.

Walking Skeletons are Scary

As I mentioned at the end of my previous post, we’ve started to develop a web API to replace a database that was being directly accessed from a mobile application. We’re hoping this will tie us less to the specific database used, and allow us some more control over performance, monitoring, logging and other similar things.

Replacing the database is something that we want to do incrementally though, as we can’t afford to develop the API all at once and then just drop it in. That’s not smart, it just leads to issues with the integration at the end.

No, we want to replace the direct database access bit by bit, giving us time to adapt to any issues that we encounter.

In Growing Object Oriented Software Guided By Tests, the authors refer to the concept of a walking skeleton. A walking skeleton is when you develop the smallest piece of functionality possible, and focus on sorting out the entire delivery chain so that that piece of functionality can be repeatably built and deployed, end-to-end, without human interaction. This differs from the approach I’ve commonly witnessed, where teams focus on getting the functionality together and then deal with the delivery closer to the “end”, often leading to integration issues and other unforeseen problems, like certificates!

It’s always certificates.

The name comes from the fact that you focus on getting the framework up and running (the bones) and then flesh it out incrementally (more features and functionality).

Our goal was to be able to reliably and automatically publish the latest build of the API to an environment dedicated to continuous integration. A developer would push some commits to a specified branch (master) in BitBucket and it would be automatically built, packaged and published to the appropriate environment, ready for someone to demo or test, all without human interaction.

A Pack of Tools

Breaking the problem down, we identified 4 main chunks of work: automatically building the application, packaging it up for deployment, actually deploying it (and tracking the versions deployed, so some form of release management), and setting up the actual environment that would be receiving the deployment.

The build problem is already solved, as we use TeamCity. The only difference from some of our other TeamCity builds would be that the entire build process would be encapsulated in a Powershell script, so that we can keep it in version control and run it separately from TeamCity if necessary. I love what TeamCity is capable of, but I’m always uncomfortable when there is so much logic about the build process separate from the actual source. I much prefer to put it all in the one place, aiming towards the ideal of “git clone, build” and it just works.

We can use the same tool for both packaging and deployment, Octopus Deploy. Originally we were going to use NuGet packages to contain our application (created via NuGet.exe), but we’ve since found that it’s much better to use OctoPack to create the package, as it structures the internals in a way that makes it easy for Octopus Deploy to deal with.
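Once the OctoPack NuGet package is installed into the project, creating the package is just a matter of setting some MSBuild properties. A sketch, where the solution name, MSBuild path and version variable are placeholders assumed to have been worked out earlier in the build script:

    # OctoPack is installed into the project as a NuGet package and switched on via MSBuild properties.
    # $msbuild and $version are assumed to have been determined earlier in the script.
    & $msbuild ".\src\Api.sln" /p:Configuration=Release /p:RunOctoPack=true /p:OctoPackPackageVersion=$version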

Lastly we needed an environment that we could deploy to using Octopus, and this is where the meat of my work over the last week and a bit actually occurs.

I’ve setup environments before, but I’ve always been uncomfortable with the manual process by which the setup usually occurs. You might provision a machine (virtual if you are lucky) and then spend a few hours manually installing and tweaking the various dependencies on it so your application works as you expect. Nobody ever documents all the things that they did to make it work, so you have this machine (or set of machines) that lives in this limbo state, where no-one is really sure how it works, just that it does. Mostly. God help you if you want to create another environment for testing or if the machine that was so carefully configured burns down.

This time I wanted to do it properly. I wanted to be able to, with the execution of a single script, create an entire environment for the API, from scratch. The environment would be regularly torn down and rebuilt, to ensure that we can always create it from scratch and we know exactly how it has been configured (as described in the script). A big ask, but more than possible with some of the tools available today.

Enter Amazon CloudFormation.

Cloud Pun

Amazon is a no brainer at this point for us. It’s where our experience as an organisation lies and it’s what I’ve been doing a lot of recently. There are obviously other cloud offerings out there (hi Azure!), but it’s better to stick with what you know unless you have a pressing reason to try something different.

CloudFormation is another service offered by Amazon (like EC2 and S3), allowing you to leverage template files written in JSON that describe in detail the components of your environment and how its disparate pieces are connected. It’s amazing and I wish I had known about it earlier.

In retrospect, I’m kind of glad I didn't know about it earlier, as by using the EC2 and S3 services directly (and all the bits and pieces that they interact with) I had gained enough understanding of the basic components to know how to fit them together in a template effectively. If I had started with CloudFormation I probably would have been overwhelmed. It was overwhelming enough with the knowledge that I did have, I can’t imagine what it would be like to hit CloudFormation from nothing.

Each CloudFormation template consists of some set of parameters (names, credentials, whatever), a set of resources and some outputs. Each resource can refer to other resources as necessary (like an EC2 instance referring to a Security Group) and you can setup dependencies between resources as well (like A must complete provisioning before B can start). The outputs are typically something that you want at the end of the environment setup, like a URL for a service or something similar.

I won’t go into detail about the template that I created (it’s somewhat large), but I will highlight some of the important pieces that I needed to get working in order for the environment to fit into our walking skeleton. I imagine that the template will need to be tweaked and improved as we progress through developing the API, but that's how incremental development works. For now it’s simple enough: a Load Balancer, Auto Scaling Group and a machine definition for the instances in the Auto Scaling Group (along with some supporting resources, like security groups and wait handles).
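Purely as a structural illustration (a single instance and security group rather than the Load Balancer and Auto Scaling Group in the real template, with a placeholder AMI id), a template skeleton looks something like this:

    {
        "Parameters": {
            "KeyName": { "Type": "String", "Description": "Existing EC2 key pair for the instances" }
        },
        "Resources": {
            "ApiSecurityGroup": {
                "Type": "AWS::EC2::SecurityGroup",
                "Properties": {
                    "GroupDescription": "Allow HTTP to the API",
                    "SecurityGroupIngress": [ { "IpProtocol": "tcp", "FromPort": "80", "ToPort": "80", "CidrIp": "0.0.0.0/0" } ]
                }
            },
            "ApiInstance": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "ImageId": "ami-placeholder",
                    "InstanceType": "t2.medium",
                    "KeyName": { "Ref": "KeyName" },
                    "SecurityGroups": [ { "Ref": "ApiSecurityGroup" } ]
                }
            }
        },
        "Outputs": {
            "ApiAddress": { "Value": { "Fn::GetAtt": [ "ApiInstance", "PublicDnsName" ] } }
        }
    }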

This Cloud Comes in Multiple Parts

This blog post is already 1300+ words, so it’s probably a good idea to cut it in two. I have a habit of writing posts that are too long, so this is my attempt to get that under control.

Next time I’ll talk about Powershell Desired State Configuration, deploying dependencies to be accessed by instances during startup, automating deployments with Octopus and many other wondrous things that I still don’t quite fully understand.


As I mentioned in a previous post, I recently started a new position at Onthehouse.

Onthehouse uses Amazon EC2 for their cloud based virtualisation, including that of the build environment (TeamCity). It’s common for a build environment to be largely ignored as long as it is still working, until the day it breaks and then it all goes to hell.

Luckily that is not what happened.

Instead, the team identified that the build environment needed some maintenance, specifically around one of the application specific Build Agents.

It’s an ongoing process, but the reason for there being an application specific Build Agent is because the application has a number of arcane, installed, licenced third-party components. It’s VB6, so it’s hard to manage those dependencies in a way that is mobile. Something to work on in the future, but not a priority for right now.

My first task at Onthehouse was to ensure that changes made to the running Instance of the Build Agent had been appropriately merged into the base Image. As someone who had never before used the Amazon virtualisation platform (Amazon EC2), I was somewhat confused.

This post follows my journey through that confusion and out the other side into understanding and I hope it will be of use to someone else out there.

As an aside, I think that getting new developers to start with build services is a great way to familiarise them with the most important part of an application, how to build it. Another fantastic first step is to get them to fix bugs.

Virtually Awesome

As I mentioned previously, I’ve never used AWS (Amazon Web Services) before, other than uploading some files to an Amazon S3 account, let alone the virtualization platform (Amazon EC2).

My main experience with virtualisation comes from using Virtual Box on my own PC. Sure, I’ve used Azure to spin up machines and websites, and I’ve occasionally interacted with VMWare and Hyper-V, but Virtual Box is what I use every single day to build, create, maintain and execute sandbox environments for testing, exploration and all sorts of other things.

I find Virtual Box straightforward.

You have a virtual machine (which has settings, like CPU Cores, Memory, Disks, etc) and each machine has a set of snapshots.

Snapshots are a record of the state of the virtual machine and its settings at a point in time chosen by the user. I take Snapshots all the time, and I use them to easily roll back to important moments, like a specific version of an application, or before I wrecked everything by doing something stupid and destructive (it happens more than you think).

Thinking about this now, I’m not sure how Virtual Box and Snapshots interact with multiple disks. Snapshots seem to be primarily machine based, not disk based, encapsulating all of the things about the machine. I suppose that probably includes the disks. I guess I don’t tend to use multiple disks in the machines I’m snapshotting all the time, only using them rarely for specific tasks.

Images and Instances

Amazon EC2 (Elastic Compute) does not work the same way as Virtual Box.

I can see why it’s different to Virtual Box as they have entirely different purposes. Virtual Box is intended to facilitate virtualisation for the single user. EC2 is about using virtualisation to leverage the power of the cloud. Single users are a completely foreign concept. It’s all about concurrency and scale.

In Amazon EC2 the core concept is an Image (or Amazon Machine Image, AMI). Images describe everything about a virtual machine, kind of like a Virtual Machine in Virtual Box. However, in order to actually use an Image, you must spin up an Instance of that Image.

At the point in time you spin up an Instance of an Image, they have diverged. The Instance typically contains a link back to its Image, but it’s not a hard link. The Instance and Image are distinctly separate, and you can delete the Image (which, if you are using an Amazon supplied Image, will happen regularly) without negatively impacting on the running Instance.

Instances generally have Volumes, which I think are essentially virtual disks. Snapshots come into play here as well, but I don’t understand Volumes and Snapshots all that well at this point in time, so I’m going to conveniently gloss over them. Snapshots definitely don’t work like VirtualBox snapshots though, I know that much.

Instances can generally be rebooted, stopped, started and terminated.

Reboot, stop and start do what you expect.

Terminating an instance kills it forever. It also kills the Volume attached to the instance if you have that option selected. If you don’t have the Image that the Instance was created from, you’re screwed, it’s gone for good. Even if you do, you will have lost any changes made to the Instance since it began running.

Build It Up

Back to the Build environment.

The application specific Build Agent had an Image, and an active Instance, as normal.

This Instance had been tweaked, updated and changed in various ways since the Image was made, so much so that no-one could remember exactly what had been done. Typically this wouldn’t be a major issue, as Instances don’t just up and disappear.

Except this Instance could, and had in the past.

The reason for its apparently ephemeral nature was because Amazon offers a spot pricing option for Instances. Spot pricing allows you to create a spot request and set your own price for an hour of compute time. As long as the spot price is below that price, your Instance will run. If the spot price goes above your price, your Instance dies. You can setup your spot price request to be reoccurring, such that the Instance will restart when the price goes down again, but you will have lost all information not on the baseline Image (an event like that is equivalent to terminating the instance and starting another one).

Obviously we needed to ensure that the baseline Image was completely able to run a build of the application in question, requiring the minimal amount of configuration on first start.

Thus began a week long adventure to take the current base Image, create an Instance from it, and get a build working, so we could be sure that if our Instance was terminated it would come back and we wouldn’t have to spend a week getting the build working again.

I won’t go into detail about the whole process, but it mostly involved lots of manual steps to find out what was wrong this time, fixing it in as nice a way as time permitted and then trying again. It mostly involved waiting. Waiting for instances, waiting for builds, waiting for images. Not very interesting.

A Better Approach

Knowing what I know now (and how long the whole process would take), my approach would be slightly different.

Take a snapshot of the currently running Instance, spin up an Instance of it, change all of the appropriate unique settings to be invalid (Build Agent name mostly) and then take another Image. That’s your baseline.

Don’t get me wrong, it was a much better learning experience the first way, but it wasn’t exactly an excellent return on investment from the point of view of the organisation.

Ah well, hindsight.

A Better Architecture

The better architecture is to have TeamCity manage the lifetime of its Build Agents, which it is quite happy to do via Amazon EC2. TeamCity can then manage the instances as it sees fit, spinning them down during idle periods, and even starting more during periods of particularly high load (I’m looking at you, end of the iteration crunch time).

I think this is definitely the approach we will take in the future, but that’s a task for another day.

Conclusion

Honestly, the primary obstacle in this particular task was learning how Amazon handles virtualization, and wrapping my head around the differences between that and Virtual Box (which is where my mental model was coming from). After I got my head around that I was in mostly familiar territory, diagnosing build issues and determining the best fix that would maximise mobility in the future, while not requiring a massive amount of time.

From the point of view of me, a new starter, this exercise was incredibly valuable. It taught me an immense amount about the application, its dependencies, the way its built and all sorts of other, half-forgotten tidbits.

From the point of view of the business, I should have definitely realized that there was a quicker path to the end goal (make sure we can recover from a lost Build Agent instance) and taken that into consideration, rather than try to work my way through the arcane dependencies of the application. There’s always the risk that I missed something subtle as well, which will rear its ugly head next time we lose the Build Agent instance.

Which could happen.

At any moment.

(Cue Ominous Music)