
Last time I outlined the start of setting up an ELK based Log Aggregator via CloudFormation. I went through some of the problems I had with executing someone else's template (permissions! proxies!), and then called it there, because the post was already a mile long.

Now for the thrilling conclusion where I talk about errors that mean nothing to me, accepting defeat, rallying and finally shipping some real data.

You Have Failed Me Again, Java

Once I managed to get all of the proxy issues sorted, and everything was being downloaded and installed properly, the instance was still not responding to HTTP requests over the appropriate ports. Well, it seemed like it wasn’t responding anyway.

Looking into the syslog, I saw repeated attempts to start Elasticsearch and Logstash, with an equal number of failures, where the process had terminated immediately and unexpectedly.

The main issue appeared to be an error about “Bad Page Map”, which of course makes no sense to me.

Looking it up, it appears as though there was an issue with the Linux kernel version in the Ubuntu release I was using (3.13? It's really hard to tell which version means what), and it was not actually specific to Java. I’m going to blame Java anyway. Apparently the issue is fixed in kernel 3.15.

After swapping the AMI to the latest distro of Ubuntu, the exceptions no longer appeared inside syslog, but the damn thing still wasn’t working.

I could get to the appropriate pages through the load balancer, which would redirect me to Google to supply OAuth credentials, but after supplying appropriate credentials, nothing else ever loaded. No Kibana, no anything. This meant of course that ES was (somewhat) working, as the load balancer was passing its health checks, and Kibana was (somewhat) working because it was at least executing the OAuth bit.

Victory?

Do It Yourself

It was at this point that I decided to just take it back to basics and start from scratch. I realised that I didn’t understand some of the components being installed (LogCabin for example, something to do with the OAuth implementation?), and that they were getting in the way of me accomplishing a minimum viable product. I stripped out all of the components from the UserData script, looked up the latest compatible versions of ES, Logstash and Kibana, installed them, and started them as services. I had to make some changes to the CloudFormation template as well (ES defaults to port 9200 and Kibana 4 to 5601, so I had to expose the appropriate ports and make some minor changes to the health check; Logstash was fine).
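
As a rough illustration of the sanity check I ended up doing once the services were installed (the hostnames here are placeholders, not the names from our template), hitting the two ports from another machine is enough to tell whether ES and Kibana are actually up:

# Hypothetical hostnames; substitute the DNS names from your own CloudFormation outputs.
$elasticsearch = "http://internal-es.example.com:9200"
$kibana = "http://kibana.example.com:5601"

# Elasticsearch answers on 9200 with a small JSON document describing the node.
$esInfo = Invoke-RestMethod -Uri $elasticsearch
write-host "Elasticsearch status: $($esInfo.status)"

# Kibana 4 hosts its own web server on 5601.
$kibanaResponse = Invoke-WebRequest -Uri $kibana -UseBasicParsing
write-host "Kibana responded with $($kibanaResponse.StatusCode)"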

The latest version of Kibana is more self-contained than previous ones, which is nice. It comes with its own web server, so all you have to do is start it up and it will listen to and respond to requests via port 5601 (which can be changed). This is different to the version that I was originally working with (3?), which seemed to be hosted directly inside Elasticsearch. I’m still not sure what the original template was doing, to be honest; all I know is that it didn't work.

Success!

A Kibana dashboard, load balancers working, ES responding. Finally everything was up and running. I still didn’t fully understand it, but it was a hell of a lot more promising than it was before.

Now all I had to do was get some useful information into it.

Nxlog Sounds Like a Noise You Make When You Get Stabbed

There are a number of ways to get logs into ES via Logstash. Logstash itself can be installed on other machines and forward local logs to a remote Logstash, but it's kind of heavyweight for that sort of thing. Someone has written a smaller component called Logstash-Forwarder which does a similar thing. You can also write directly to ES using Logstash-compatible index names (Serilog offers a sink that does just that).

The Logstash solutions above seem to assume that you are gathering logs on a Unix based system though, and don’t really offer much in the way of documentation or help if you have a Windows based system.

After a small amount of investigation, a piece of software called Nxlog appears to be the most commonly used log shipper as far as Windows is concerned.

As with everything in automation, I couldn’t just go onto our API hosting instances and install and configure Nxlog by hand. I had to script it, and then add those scripts to the CloudFormation template for our environment setup.

Installing Nxlog from the command line is relatively simple using msiexec and the appropriate “don't show the damn UI” flags, and configuring it is simple as well. All you need to do is have an nxlog.conf file configured with what you need (in my case, IIS and application logs being forwarded to the Logstash endpoint) and then copy it to the appropriate conf folder in the installation directory.
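
A minimal sketch of the install step, assuming the MSI has already been downloaded locally and that nxlog goes to its default location (the paths and file names here are mine, not necessarily what is in the repository):

# Install nxlog silently ("don't show the damn UI").
$msi = "C:\temp\nxlog-ce.msi"
Start-Process "msiexec.exe" -ArgumentList "/i `"$msi`" /qn" -Wait

# Drop the pre-baked configuration into the conf directory and restart the service so it picks it up.
# The install directory below is the default for the 32-bit community edition.
Copy-Item "C:\temp\nxlog.conf" "C:\Program Files (x86)\nxlog\conf\nxlog.conf" -Force
Restart-Service -Name "nxlog"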

The nxlog configuration file takes some getting used to, but their documentation is pretty good, so it's just a matter of working through it. The best tip I can give is to create a file output until you are sure that nxlog is doing what you think it's doing, and then flip everything over to output to Logstash. You’ll save a lot of frustration if you know exactly where the failures are (and believe me, there will be failures).
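
To make that concrete, here is the rough shape of a configuration that reads IIS logs and routes them to a local file while you are still debugging; the paths, host and port are placeholders, and the real configuration lives in the repository linked at the end of the post. Once you trust what is coming out, you point the route at the Logstash output instead.

# Hypothetical nxlog.conf sketch - adjust paths, host and port to suit.
<Input iis>
    Module  im_file
    File    "C:\\inetpub\\logs\\LogFiles\\W3SVC1\\u_ex*.log"
</Input>

# File output, useful while you are working out what nxlog is actually doing.
<Output debug_file>
    Module  om_file
    File    "C:\\logs\\nxlog-debug.log"
</Output>

# The real destination, the internal Logstash load balancer.
<Output logstash>
    Module  om_tcp
    Host    logstash.example.com
    Port    5544
</Output>

# Flip the path from debug_file to logstash once you are happy with the output.
<Route iis_route>
    Path    iis => debug_file
</Route>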

After setting up Nxlog, it all started to happen! Stuff was appearing in Kibana! It was one of those magical moments where you actually get a result, and it felt good.

Types? We Need Types?

I got everything working nicely in my test environment, so I saved my configuration, tore down the environments and created them again (to verify they could be recreated). Imagine my surprise when I was getting Nxlog internal messages into ES, but nothing from IIS. I assumed that I had messed up Nxlog somehow, so I spent a few hours trying to debug what was going wrong. My Nxlog config appeared to be fine, so I assumed that there was something wrong with the way I had configured Logstash. Again, it seemed to be fine.

It wasn't until I looked into the Elasticsearch logs that I found out why all of my IIS logs were not making it. The first document sent to Elasticsearch had a field called EventReceivedTime (from the Nxlog internal source) which was a date, represented as ticks since X, i.e. a gigantic number. ES had inferred the type of this field as a long. The IIS source also had a field called EventReceivedTime, which was an actual date (i.e. YYYY-MM-DD HH:mm). When any IIS entry arrived into ES from Logstash, ES errored out trying to parse the datetime into a long, and discarded it. Because of the asynchronous nature of the system, there was no way for Logstash to communicate the failure back to anywhere that I could see it.

After making sure that both EventReceivedTimes were dates, everything worked swimmingly.

I suppose this might reoccur in the future, with a different field name conflicting. I’m not sure exactly what the best way to deal with this would be. Maybe the Elasticsearch logs should be put into ES as well? At least then I could track it. You could set up a template to strongly type the fields as well, but due to the fluidity of ES, there are always going to be new fields, and ES will always try to infer an appropriate type, so having a template won’t stop it from occurring.
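
For what it's worth, such a template is just a PUT against the _template endpoint. A hedged sketch against Elasticsearch 1.x, with the host and index pattern made up:

$body = @"
{
  "template": "logstash-*",
  "mappings": {
    "_default_": {
      "properties": {
        "EventReceivedTime": { "type": "date" }
      }
    }
  }
}
"@

# Hostname is a placeholder; point this at your actual Elasticsearch endpoint.
Invoke-RestMethod -Uri "http://internal-es.example.com:9200/_template/eventreceivedtime" -Method Put -Body $body -ContentType "application/json"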

Mmmmmmm Kibana

Look at this dashboard.

Look at it.

So nice.

I haven’t even begun to plumb the depths of the information now at my fingertips. Most of those charts are simple (average latency, total requests, response codes, requested paths), but it still provides a fantastic picture of the web applications in question.

Conclusion

Last time I wrote a CloudFormation template, I didn’t manage to get it into a publicly available repository, which kind of made the blog posts around it significantly less useful.

This time I thought ahead. You can find all of the scripts and templates for the Log Aggregator in this repository. This is a copy of our actual repository (private, in Bitbucket), so I’m not sure if I will be able to keep it up to date as we make changes, but at least there is some context to what I’ve been speaking about.

I’ve included the scripts that setup and configure Nxlog as well. These are actually located in the repository that contains our environment setup, but I think they are useful inside this repository as a point of reference for setting up log shipping on a Windows system. Some high level instructions are available in the readme of the repository.

Having a Log Aggregator, even though it only contains IIS and application logs for a single application, has already proved useful. It adds a huge amount of transparency to what is going on, and Kibana’s visualisations are extremely helpful in making sense of the data.

Now to do some load testing on the web application and see what Kibana looks like when everything breaks.


Logging is one of the most important components of a good piece of software.

That was a statement, not a question, and is not up for debate.

Logging enables you to do useful things, like identify usage patterns (helpful when deciding which sections of the application need the most attention), investigate failures (because there are always failures, and you need to be able to get to their root causes) and keep an eye on performance (which is a feature, no matter what anyone else tells you). Good logging enables a piece of software to be supported long after the software developers who wrote it have moved on, extending the life expectancy of the application and thus improving its return on investment.

It is a shame really, that logging is not always treated like a first class citizen. Often it is an afterthought, added in later after some issue or failure proves that it would have been useful, and then barely maintained from that point forward.

Making sure your application has excellent logging is only the first part though; you also need somewhere to put the logs so that the people who need them can access them.

The most common approach is to have logs be output to a file, somewhere on the local file system relative to the location where the software is installed and running. Obviously this is better than not having logs, but only just barely. When you have log files locally, you are stuck in a reactive mindset, using the logs as a diagnostic tool when a problem is either observed or reported through some other channel (like the user complaining).

The better approach is to send the logs somewhere. Somewhere they can be watched and analysed and alerted on. You can be proactive when you have logs in a central location, finding issues before the users even notice and fixing them even faster.

I’m going to refer to that centralised place where the logs go as a Log Aggregator, although I’m not sure if that is the common term. It will do for now though.

Bucket O’ Logs

At my current job, we recently did some work to codify the setup of our environments. Specifically, we used AWS CloudFormation and Powershell to set up an auto scaling group + load balancer (and supporting constructs) to be the home for a new API that we were working on.

When you have a single machine, you can usually make do with local logs, even if it's not the greatest of ideas (as mentioned above). When you have a variable number of machines, whose endpoints are constantly shifting and changing, you really need a central location where you can keep an eye on the log output.

Thus I’ve spent the last week and a bit working on exactly that: implementing a log aggregator.

After some initial investigation, we decided to go with an ELK stack. ELK stands for Elasticsearch, Logstash and Kibana, three components that each serve a different purpose. Elasticsearch is a document database with strong analysis and search capabilities. Logstash is an ETL (Extract, Transform, Load) system, used for moving logs around as well as transforming and mutating them into appropriate structures to be stored in Elasticsearch. Kibana is a front-end visualisation and analysis tool that sits on top of Elasticsearch.

We went with ELK because a few other teams in the organization had already experimented with it, so there was at least a little organizational knowledge floating around to exploit. Alas, the other teams had not treated their ELK stacks as anything reusable, so we still had to start from scratch in order to get anything up and running.

We did look at a few other options (Splunk, Loggly, Seq) but it seemed like ELK was the best fit for our organisation and needs, so that was what we went with.

Unix?

As is my pattern, I didn’t just want to jam something together and call that our log aggregator, hacking away at a single instance or machine until it worked “enough”. I wanted to make sure that the entire process was codified and reproducible. I particularly liked the way in which we had done the environment setup using CloudFormation, so I decided that would be a good thing to aim for.

Luckily someone else had already had the same idea, so in the true spirit of software development, I stole their work to bootstrap my own.

Stole in the sense that they had published a public repository on GitHub with a CloudFormation template to setup an ELK stack inside it.

I cloned the repository, wrote a few scripts around executing the CloudFormation template and that was that. ELK stack up and running.

Right?

Right?

Ha! It’s never that easy.

Throughout the rest of this post, keep in mind that I haven't used a Unix based operating system in anger in a long time. The ELK stack used a Ubuntu distro, so I was at a disadvantage from the word go. On the upside, having been using cmder a lot recently, I was far more comfortable inside a command line environment than I ever have been before. Certainly more comfortable than I was when I last used Unix.

Structural Integrity

The structure of the CloudFormation template was fairly straightforward. There were two load balancers, backed by an auto scaling group. One of the load balancers was public, intended to expose Kibana. The other was internal (i.e. only accessible from within the specified VPC) intended to expose Logstash. There were some Route53 entries to give everything nice names, and an Auto Scaling Group with a LaunchConfig to define the configuration of the instances themselves.

The auto scaling group defaulted to a single instance, which is what I went with. I’ll look into scaling later, when it actually becomes necessary and we have many applications using the aggregator.

As I said earlier, the template didn’t just work straight out of the repository, which was disappointing.

The first issue I ran into was that the template called for the creation of an IAM role. The credentials I was using to execute the template did not have permissions to do that, so I simply removed it until I could get the appropriate permissions from our AWS account managers. It turns out I didn’t really need it anyway, as the only thing I needed to access using AWS credentials was an S3 bucket (for dependency distribution) which I could configure credentials for inside the template, supplied as parameters.

Removing the IAM role allowed me to successfully execute the template, and it eventually reached that glorious state of “Create Complete”. Yay!

It still didn’t work though. Booooo!

It's Always a Proxy

The initial template assumed that the instance would be accessible over port 8080. The public load balancer relied on that fact and its health check queried the __es path. The first sign that something was wrong was that the load balancer thought that the instance inside it was unhealthy, so it was failing its health check.

Unfortunately, the instance was not configured to signal failure back to CloudFormation if its setup failed, so although CloudFormation had successfully created all of its resources, when I looked into the cloud-init-output.log file in /var/log, it turned out that large swathes of the init script (configured in the UserData section of the LaunchConfig) had simply failed to execute.

The issue here was that we require all internet access from within our VPC to the outside world to go through a proxy. Obviously the instance was not configured to use the proxy (how could it be, it was from a public git repo), so all communications to the internet were being blocked, including calls to apt-get and the download of various configuration files directly from git.

Simple enough to fix: set the http_proxy and https_proxy environment variables to the appropriate value.

It was at this point that I also added a call to install the AWS CloudFormation components on the instance during initialisation, so that I could use cfn-signal to indicate failures. This at least gave me an indication of whether or not the instance had actually succeeded its initialization, without having to remote into the machine to look at the logs.

When working on CloudFormation templates, it's always useful to have some sort of repeatable test that you can run in order to execute the template, ideally from the command line. You don’t want to have to go into the AWS Dashboard to do that sort of thing, and it's good to have some tests outside the template itself to check its external API. As I was already executing the template through Powershell, it was a simple matter to include a Pester test that executed the template, checked that the outputs worked (the outputs being the Kibana and Logstash URLs) and then tore the whole thing down if everything passed.
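
The test itself is not much more than a wrapper around the AWS cmdlets. Something like the following hedged sketch (stack name, template file name and output keys are made up, and the polling is deliberately crude):

Describe "Log Aggregator CloudFormation template" {
    It "creates a stack whose Kibana URL responds" {
        $stackName = "log-aggregator-test"
        try {
            # Template file name is a placeholder; parameters (proxy, OAuth settings, etc.) omitted for brevity.
            New-CFNStack -StackName $stackName -TemplateBody (Get-Content "$root\log-aggregator.template" -Raw)

            # Crude poll until CloudFormation reaches a terminal state.
            do {
                Start-Sleep -Seconds 30
                $stack = Get-CFNStack -StackName $stackName
            } while ("$($stack.StackStatus)" -like "*IN_PROGRESS")

            "$($stack.StackStatus)" | Should Be "CREATE_COMPLETE"

            # Output key is an assumption; use whatever the template actually calls it.
            $kibanaUrl = ($stack.Outputs | Where-Object { $_.OutputKey -eq "KibanaUrl" }).OutputValue
            (Invoke-WebRequest -Uri $kibanaUrl -UseBasicParsing).StatusCode | Should Be 200
        }
        finally {
            Remove-CFNStack -StackName $stackName -Force
        }
    }
}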

At this point I also tried to setup some CloudWatch logs that would automatically extract the contents of the various initialization log files to a common location, so that I could view them from the AWS Dashboard when things were not going as expected. I did not, in the end, manage to get this working. The irony of needing a log aggregator to successfully setup a log aggregator was not lost on me.

Setting the environment variables fixed the majority of the proxy issues, but there was one other proxy related problem left that I didn’t pick up until much later. All of the Elasticsearch plugins were failing to install, for exactly the same reason. No proxy settings. Apparently Java does not read the system proxy settings (bad Java!) so I had to manually supply the proxy address and port to the call to the Elasticsearch plugin installation script.

The initialisation log now showed no errors, and everything appeared to be installed correctly.

But it still wasn’t working.

To Be Continued

Tune in next week for the thrilling conclusion, where I discover a bug caused by the specific combination of Ubuntu version and Java version, get fed up with the components being installed and start from scratch and then struggle with Nxlog in order to get some useful information into the stack.


In my last blog post, I mentioned the 3 classifications that I think tests fall into, Unit, Integration and Functional.

Of course, regardless of classification, all tests are only valuable if they are actually being executed. It's wonderful to say you have tests, but if you’re not running them all the time, and actually looking at the results, they are worthless. Worse than worthless if you think about it, because the presence of tests gives a false sense of security about your system.

Typically executing Unit Tests (and Integration Tests if they are using the same framework) is trivial, made vastly easier by having a build server. It's not that bad even if you don’t have a build server, because those sorts of tests can typically be run on a developer's machine, without a huge amount of fanfare. The downside of not having a build server is that the developers in question need to remember to run the tests. As creative people, following a checklist that includes “wait for tests to run” is sometimes not our strongest quality.

Note that I’m not saying developers should not be running tests on their own machines, because they definitely should be. I would usually limit this to Unit tests though, or very self-contained Integration tests. You need to be very careful about complicating the process of actually writing and committing code if you want to produce features and improvements in a reasonable amount of time. It's very helpful to encourage people to run the tests themselves regularly, but to also have a fallback position. Just in case.

Compared to running Unit and Integration tests, Functional tests are a different story. Regardless of your software, you’ll want to run your Functional tests in a controlled environment, and this usually involves spinning up virtual machines, installing software, configuring the software and so on. To get good test results, and to lower the risk that the results have been corrupted by previous test runs, you’ll want to use a clean environment each time you run the tests. Setting up and running the tests then becomes a time consuming and boring task, something that developers hate.

What happens when you give a developer a task to do that is time consuming and boring?

Automation happens.

Procedural Logic

Before you start doing anything, its helpful to have a high level overview of what you want to accomplish.

At a high level, the automated execution of functional tests needed to:

  • Set up a test environment.
    • Spin up a fresh virtual machine.
    • Install the software under test.
    • Configure software under test.
  • Execute the functional tests.
  • Report the results.

Fairly straightforward. As with everything related to software though, the devil is in the details.

For anyone who doesn’t want to listen to me blather, here is a link to a GitHub repository containing sanitized versions of the scripts. Note that the scripts were not complete at the time of this post, but will be completed later.

Now, on to the blather!

Automatic Weapons

In order to automate any of the above, I would need to select a scripting language.

It would need to be able to do just about anything (which is true of most scripting languages), but would also have to be able to allow me to remotely execute a script on a machine without having to log onto it or use the UI in any way.

I’ve been doing a lot of work with Powershell recently, mostly using it to automate build, package and publish processes. I’d hesitated to learn Powershell for a long time, because every time I encountered something that I thought would have been made easier by using Powershell, I realised I would have to spend a significant amount of time learning just the basics of Powershell before I could do anything useful. I finally bit the bullet and did just that, and it's snowballed from there.

Powershell is the hammer and everything is a nail now.

Powershell is obviously a well established scripting language, and it is installed on basically every modern version of Windows. Powerful by itself, its integration with the .NET framework gives a C# developer like me the power to fall back to the familiar .NET BCL for anything I can’t accomplish using just Powershell and its cmdlets. Finally, Powershell Remote Execution allows you to configure a machine and allow authenticated users to remotely execute scripts on it.

So, Powershell it was.

A little bit more about Powershell Remote Execution. It leverages Windows Remote Management (WinRM), and once you’ve got all the bits and pieces set up on the target machine, is very easy to use.

A couple of things to be aware of with remote execution (there's a small setup sketch after the list):

  1. By default the Windows Remoting Service is not enabled on some versions of Windows. Obviously this needs to be running.
  2. Powershell Remote Execution communicates over port 5985 (HTTP) and 5986 (HTTPS). Earlier versions used 80 and 443. These ports need to be configured in the Firewall on the machine in question.
  3. The user you are planning on using for the remote execution (and I highly suggest using a brand new user just for this purpose) needs to be a member of the [GROUP HERE] group.
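
Getting the target machine ready looks roughly like the following, run on the target itself (the firewall rule names are mine, and this assumes a version of Windows recent enough to have the NetSecurity cmdlets):

# Enable the WinRM service and create the default listeners.
Enable-PSRemoting -Force

# Open the WinRM ports. HTTPS (5986) only matters if you actually configure an HTTPS listener.
New-NetFirewallRule -DisplayName "WinRM HTTP" -Direction Inbound -Protocol TCP -LocalPort 5985 -Action Allow
New-NetFirewallRule -DisplayName "WinRM HTTPS" -Direction Inbound -Protocol TCP -LocalPort 5986 -Action Allow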

Once you’ve sorted the things above, actually remotely executing a script can be accomplished using the Invoke-Command cmdlet, like so:

# Build a credential object for the remote user (password placeholder left as-is).
$pw = ConvertTo-SecureString '[REMOTE USER PASSWORD]' -AsPlainText -Force
$cred = New-Object System.Management.Automation.PSCredential('[REMOTE USERNAME]', $pw)

# Open a remoting session to the target machine, by IP address.
$session = New-PSSession -ComputerName $ipaddress -Credential $cred

write-host "Beginning remote execution on [$ipaddress]."

# Run the script on the remote machine, passing through everything it needs to
# download the build artefacts and run the tests.
$testResult = Invoke-Command -Session $session -FilePath "$root\remote-download-files-and-run-functional-tests.ps1" -ArgumentList $awsKey, $awsSecret, $awsRegion, $awsBucket, $buildIdentifier

Notice that I don’t have to use a machine name at all. IP Addresses work fine in the ComputerName parameter. How do I know the IP address? That information is retrieved when starting the Amazon EC2 instance.

Environmental Concerns

In order to execute the functional tests, I wanted to be able to create a brand new, clean virtual machine without any human interaction. As I’ve stated previously, we primarily use Amazon EC2 for our virtualisation needs.

The creation of a virtual machine for functional testing would need to be done from another AWS EC2 instance, the one running the TeamCity build agent. The idea being that the build agent instance is responsible for building the software/installer, and would in turn farm out the execution of the functional tests to a completely different machine, to keep a good separation of concerns.

Amazon supplies two methods of interacting with AWS EC2 (Elastic Compute Cloud) via Powershell on a Windows machine.

The first is a set of cmdlets (Get-EC2Instance, New-EC2Instance, etc).

The second is the classes available in the .NET SDK for AWS.

The upside of running on an EC2 instance that was based off an Amazon supplied image is that both of those methods are already installed, so I didn’t have to mess around with any dependencies.

I ended up using a combination of both (cmdlets and .NET SDK objects) to get an instance up and running, mostly because the cmdlets didn’t expose all of the functionality that I needed.

There were 3 distinct parts to using Amazon EC2 for the test environment: Creation, Configuration and Waiting, and Clean Up. All of these needed to be automated.

Creation

Obviously an instance needs to be created. The reason this part is split from the Configuration and Waiting is because I’m still not all that accomplished at error handling and returning values in Powershell. Originally I had creation and configuration/waiting in the same script, but if the call to New-EC2Instance returned successfully and then something else failed, I had a hard time returning the instance information in order to terminate it in the finally block of the wrapping script.

The full content of the creation script is available at create-new-ec2-instance.ps1. It's called from the main script (functional-tests.ps1).
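
In condensed form, the interesting part of that script is a single call to New-EC2Instance; the identifiers below are placeholders rather than the values in the repository:

# Credentials and region come in as script parameters.
Set-AWSCredentials -AccessKey $awsKey -SecretKey $awsSecret
Set-DefaultAWSRegion $awsRegion

# Image, key pair, security group and subnet ids are placeholders.
$reservation = New-EC2Instance -ImageId "ami-xxxxxxxx" `
    -MinCount 1 -MaxCount 1 `
    -InstanceType "m3.medium" `
    -KeyName "functional-tests" `
    -SecurityGroupId "sg-xxxxxxxx" `
    -SubnetId "subnet-xxxxxxxx"

# Property name varies between versions of the AWS tools (RunningInstance vs Instances).
$instance = $reservation.Instances | Select-Object -First 1
$instance.InstanceId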

Configuration and Waiting

Beyond the configuration done as part of creation, instances can be tagged to add additional information. Also, the script needs to wait on a number of important indicators to ensure that the instance is ready to be interacted with. It made sense to do these two things together for reasons.

The tags help to identify the instance (the name) and also mark the instance as being acceptable to be terminated as part of a scheduled cleanup script that runs over all of our EC2 instances in order to ensure we don’t run expensive instances longer than we expected to.
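
Tagging itself is a one-liner with New-EC2Tag; the tag keys below are examples of the sort of thing we used rather than the exact ones:

# Example tags only - a name for identification and a flag for the scheduled cleanup.
$tags = @(
    (New-Object Amazon.EC2.Model.Tag -Property @{ Key = "Name"; Value = "functional-tests-$buildIdentifier" }),
    (New-Object Amazon.EC2.Model.Tag -Property @{ Key = "auto-cleanup"; Value = "true" })
)
New-EC2Tag -Resource $instanceId -Tag $tags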

As for the waiting indicators, the first indicator is whether or not the instance is running. This is an easy one, as the state of the instance is very easy to get at. You can see the function below, but all it does is poll the instance every 5 seconds to check whether or not it has entered the desired state yet.
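
Since the original function isn't reproduced here, this is a hedged reconstruction of the idea: poll Get-EC2Instance until the state matches what you want.

function Wait-ForEC2InstanceState
{
    param([string]$instanceId, [string]$desiredState)

    while ($true)
    {
        # Get-EC2Instance returns a reservation; dig the instance out of it.
        # (Older versions of the AWS tools call the parameter -Instance and the property RunningInstance.)
        $instance = (Get-EC2Instance -InstanceId $instanceId).Instances | Select-Object -First 1
        if ("$($instance.State.Name)" -eq $desiredState) { return }
        Start-Sleep -Seconds 5
    }
}

Wait-ForEC2InstanceState -instanceId $instanceId -desiredState "running"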

The second indicator is a bit harder to get at, but it's actually much more important. EC2 instances can be configured with status checks, and one of those status checks is whether or not the instance is actually reachable. I’m honestly not sure if this is something that someone before me set up, or if it is standard on all EC2 instances, but it's extremely useful.

Anyway, accessing this status check is a bit of a rabbit hole. You can see the function below, but it uses a similar approach to the running check. It polls some information about the instance every 5 seconds until it meets certain criteria. This is the one spot in the entire script that I had to use the .NET SDK classes, as I couldn’t find a way to get this information out of a cmdlet.
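
Again, the original function isn't shown here. I went through the .NET SDK at the time, but for illustration the same poll can be done with the Get-EC2InstanceStatus cmdlet available in the AWS tools; a hedged sketch, with property names per the current tools:

function Wait-ForEC2InstanceReachable
{
    param([string]$instanceId)

    while ($true)
    {
        $status = Get-EC2InstanceStatus -InstanceId $instanceId
        # Both the instance and system status checks need to report ok before the
        # machine is actually reachable.
        if ("$($status.Status.Status)" -eq "ok" -and "$($status.SystemStatus.Status)" -eq "ok") { return }
        Start-Sleep -Seconds 5
    }
}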

The full content of the configuration and wait script is available at tag-and-wait-for-ec2-instance.ps1, and is just called from the main script.

Clean Up

Since you don’t want to leave instances hanging around, burning money, the script needs to clean up after itself when it's done.

Programmatically terminating an instance is quite easy, but I had a lot of issues around the robustness of the script itself, as I couldn’t quite grasp the correct path to ensure that a clean up was always run if an instance was successfully created. The solution to this was to split the creation and tag/wait into different scripts, to ensure that if creation finished it would always return identifying information about the instance for clean up.

Termination happens in the finally block of the main script (functional-tests.ps1).
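
The overall shape of the main script is therefore just a try/finally. This is a rough sketch rather than the real thing (argument lists and variable names are approximations of what is in the repository):

$instanceId = $null
try
{
    $instanceId = & "$root\create-new-ec2-instance.ps1" $awsKey $awsSecret $awsRegion
    & "$root\tag-and-wait-for-ec2-instance.ps1" $awsKey $awsSecret $awsRegion $instanceId $buildIdentifier
    # ... remote execution of the tests, as shown earlier ...
}
finally
{
    if ($instanceId)
    {
        # Terminate the instance (and its attached volume) no matter how the tests went.
        Remove-EC2Instance -InstanceId $instanceId -Force
    }
}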

Instant Machine

Of course all of the instance creation above is dependent on actually having an AMI (Amazon Machine Image) available that holds all of the baseline information about the instance to be created, as well as other things like VPC (Virtual Private Cloud, basically how the instance fits into a network) and security groups (for defining port accessibility). I’d already gone through this process last time I was playing with EC2 instances, so it was just a matter of identifying the various bits and pieces that needed to be done on the machine in order to make it work, while keeping it as clean as possible in order to get good test results.

I went through the image creation process a lot as I evolved the automation script. One thing I found to be useful was to create a change log for the machine in question (I used a page in Confluence) and to version any images made. This helped me to keep the whole process repeatable, as well as documenting the requirements of a machine able to perform the functional tests.

To Be Continued

I think that’s probably enough for now, so next time I’ll continue and explain about automating the installation of the software under test and then actually running the tests and reporting the results.

Until next time!


As I mentioned in a previous post, I recently started a new position at Onthehouse.

Onthehouse uses Amazon EC2 for their cloud based virtualisation, including that of the build environment (TeamCity). It's common for a build environment to be largely ignored as long as it is still working, until the day it breaks and then it all goes to hell.

Luckily that is not what happened.

Instead, the team identified that the build environment needed some maintenance, specifically around one of the application specific Build Agents.

It's an ongoing process, but the reason for there being an application specific Build Agent is because the application has a number of arcane, installed, licensed third-party components. It's VB6, so it's hard to manage those dependencies in a way that is mobile. Something to work on in the future, but not a priority for right now.

My first task at Onthehouse was to ensure that changes made to the running Instance of the Build Agent had been appropriately merged into the base Image. As someone who had never before used the Amazon virtualisation platform (Amazon EC2), I was somewhat confused.

This post follows my journey through that confusion and out the other side into understanding and I hope it will be of use to someone else out there.

As an aside, I think that getting new developers to start with build services is a great way to familiarise them with the most important part of an application, how to build it. Another fantastic first step is to get them to fix bugs.

Virtually Awesome

As I mentioned previously, I’ve never used AWS (Amazon Web Services) before, other than uploading some files to an Amazon S3 account, let alone the virtualization platform (Amazon EC2).

My main experience with virtualisation comes from using Virtual Box on my own PC. Sure, I’ve used Azure to spin up machines and websites, and I’ve occasionally interacted with VMWare and Hyper-V, but Virtual Box is what I use every single day to build, create, maintain and execute sandbox environments for testing, exploration and all sorts of other things.

I find Virtual Box straightforward.

You have a virtual machine (which has settings, like CPU Cores, Memory, Disks, etc) and each machine has a set of snapshots.

Snapshots are a record of the state of the virtual machine and its settings at a point in time chosen by the user. I take Snapshots all the time, and I use them to easily roll back to important moments, like a specific version of an application, or before I wrecked everything by doing something stupid and destructive (it happens more than you think).

Thinking about this now, I’m not sure how Virtual Box and Snapshots interact with multiple disks. Snapshots seem to be primarily machine based, not disk based, encapsulating all of the things about the machine. I suppose that probably includes the disks. I guess I don’t tend to use multiple disks in the machines I’m snapshotting all the time, only using them rarely for specific tasks.

Images and Instances

Amazon EC2 (Elastic Compute) does not work the same way as Virtual Box.

I can see why it's different to Virtual Box, as they have entirely different purposes. Virtual Box is intended to facilitate virtualisation for the single user. EC2 is about using virtualisation to leverage the power of the cloud. Single users are a completely foreign concept. It's all about concurrency and scale.

In Amazon EC2 the core concept is an Image (or Amazon Machine Image, AMI). Images describe everything about a virtual machine, kind of like a Virtual Machine in Virtual Box. However, in order to actually use an Image, you must spin up an Instance of that Image.

At the point in time you spin up an Instance of an Image, they diverge. The Instance typically contains a link back to its Image, but it's not a hard link. The Instance and Image are distinctly separate, and you can delete the Image (which, if you are using an Amazon supplied Image, will happen regularly) without negatively impacting the running instance.

Instances generally have Volumes, which I think are essentially virtual disks. Snapshots come into play here as well, but I don’t understand Volumes and Snapshots all that well at this point in time, so I’m going to conveniently gloss over them. Snapshots definitely don’t work like VirtualBox snapshots though, I know that much.

Instances can generally be rebooted, stopped, started and terminated.

Reboot, stop and start do what you expect.

Terminating an instance kills it forever. It also kills the Volume attached to the instance if you have that option selected. If you don’t have the Image that the Instance was created from, you’re screwed, its gone for good. Even if you do, you will have lost any change made to that Image since the Instance began running.

Build It Up

Back to the Build environment.

The application specific Build Agent had an Image, and an active Instance, as normal.

This Instance had been tweaked, updated and changed in various ways since the Image was made, so much so that no-one could remember exactly what had been done. Typically this wouldn’t be a major issue, as Instances don’t just up and disappear.

Except this Instance could, and had in the past.

The reason for its apparently ephemeral nature was because Amazon offers a spot pricing option for Instances. Spot pricing allows you to create a spot request and set your own price for an hour of compute time. As long as the spot price is below that price, your Instance will run. If the spot price goes above your price, your Instance dies. You can setup your spot price request to be reoccurring, such that the Instance will restart when the price goes down again, but you will have lost all information not on the baseline Image (an event like that is equivalent to terminating the instance and starting another one).

Obviously we needed to ensure that the baseline Image was completely able to run a build of the application in question, requiring the minimal amount of configuration on first start.

Thus began a week long adventure to take the current base Image, create an Instance from it, and get a build working, so we could be sure that if our Instance was terminated it would come back and we wouldn’t have to spend a week getting the build working again.

I won’t go into detail about the whole process, but it mostly involved lots of manual steps to find out what was wrong this time, fixing it in as nice a way as time permitted and then trying again. Mostly it involved waiting. Waiting for instances, waiting for builds, waiting for images. Not very interesting.

A Better Approach

Knowing what I know now (and how long the whole process would take), my approach would be slightly different.

Take a snapshot of the currently running Instance, spin up an Instance of it, change all of the appropriate unique settings to be invalid (Build Agent name mostly) and then take another Image. That’s your baseline.
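
In EC2 terms that amounts to creating a new Image from the running Instance and then instancing that; a hedged sketch (ids, names and instance type are made up):

# Create a new AMI from the currently running build agent instance.
$imageId = New-EC2Image -InstanceId "i-xxxxxxxx" -Name "build-agent-baseline-v2" -Description "Build agent baseline"

# Spin up an instance of the new image, change the unique bits (Build Agent name mostly),
# then image it again to get the real baseline.
New-EC2Instance -ImageId $imageId -MinCount 1 -MaxCount 1 -InstanceType "m3.medium" -KeyName "build-agents"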

Don’t get me wrong, it was a much better learning experience the first way, but it wasn’t exactly an excellent return on investment from the point of view of the organisation.

Ah well, hindsight.

A Better Architecture

The better architecture is to have TeamCity manage the lifetime of its Build Agents, which it is quite happy to do via Amazon EC2. TeamCity can then manage the instances as it sees fit, spinning them down during idle periods, and even starting more during periods of particularly high load (I’m looking at you, end of the iteration crunch time).

I think this is definitely the approach we will take in the future, but that’s a task for another day.

Conclusion

Honestly, the primary obstacle in this particular task was learning how Amazon handles virtualization, and wrapping my head around the differences between that and Virtual Box (which is where my mental model was coming from). After I got my head around that I was in mostly familiar territory, diagnosing build issues and determining the best fix that would maximise mobility in the future, while not requiring a massive amount of time.

From the point of view of me, a new starter, this exercise was incredibly valuable. It taught me an immense amount about the application, its dependencies, the way its built and all sorts of other, half-forgotten tidbits.

From the point of view of the business, I should definitely have realised that there was a quicker path to the end goal (make sure we can recover from a lost Build Agent instance) and taken that into consideration, rather than trying to work my way through the arcane dependencies of the application. There’s always the risk that I missed something subtle as well, which will rear its ugly head next time we lose the Build Agent instance.

Which could happen.

At any moment.

(Cue Ominous Music)