
Full disclosure: most of the Elastalert-related work was actually done by a colleague of mine; I’m just writing about it because I thought it was interesting.

Last week I did a bit of an introduction to Elastalert, as it is the new mechanism that we use to alert on the data in our ELK stack.

We take our infrastructure pretty seriously though, so I didn’t want to just manually create an Elastalert instance and set it up to do things. It all needs to be codified and controlled, with a deployment pipeline for distributing changes (like new rules or changed rules), and everything needs to be versioned as appropriate.

After doing some very high level playing around (just to make sure it all worked relatively as advertised), it was time to do it properly and set up an auto-scaling, auto-healing Elastalert environment, just like all of the other ones.

Packing It Away

Installing Elastalert is pretty straightforward.

It’s all Python based, so it’s a fairly simple matter to use pip to install the package:

pip install elastalert

This doesn’t quite work out of the box on an Amazon Linux EC2 instance though, as you also have to install some dependencies that are not immediately obvious.

sudo yum update -y;
sudo yum install gcc gcc-c++ -y;
sudo yum install libffi-devel -y;
sudo yum install openssl-devel -y;
sudo pip install elastalert;

With that out of the way, the machine is basically ready to run Elastalert, assuming you configure it correctly (as per the documentation).
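Configuration at that level is just Elastalert’s single global config.yaml, which tells it where Elasticsearch is and how often to run. A minimal sketch looks something like this (the host and values are placeholders, not our real settings):

rules_folder: rules
run_every:
  minutes: 1
buffer_time:
  minutes: 15
es_host: elasticsearch.example.com
es_port: 9200
writeback_index: elastalert_status
alert_time_limit:
  days: 2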

With a relatively self-contained installation script out of the way, it was time to use Packer to create an AMI, to be used inside the impending environment.

The Packer configuration for an AMI with Elastalert installed on it is pretty straightforward, and just follows the normal pattern, which I described in this post and which you can see directly in this Github repository. The only meaningful difference is the script that installs Elastalert itself, which you can see above.
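For reference, wiring that script into the template is just a standard shell provisioner, something along these lines (the script path is illustrative, not our actual repository layout):

{
    "type": "shell",
    "script": "./scripts/install-elastalert.sh"
}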

Cumulonimbus Clouds Are My Favourite

With an AMI created and ready to go, all that’s left is to create a simple environment to run it in.

Nothing fancy, just a CloudFormation template with a single auto scaling group in it, such that accidental or unexpected terminations self-heal. No need for a load balancer, DNS entries or anything like that; it’s purely a background process that sits quietly and yells at us as appropriate.
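As a rough sketch (resource and parameter names here are illustrative, not pulled from our actual template), the core of that template is just an Auto Scaling Group with a CreationPolicy, so CloudFormation waits for the instance to report in before declaring the environment healthy:

"ElastalertAutoScalingGroup": {
    "Type": "AWS::AutoScaling::AutoScalingGroup",
    "Properties": {
        "MinSize": "1",
        "MaxSize": "1",
        "LaunchConfigurationName": { "Ref": "ElastalertLaunchConfiguration" },
        "VPCZoneIdentifier": { "Ref": "PrivateSubnetIds" }
    },
    "CreationPolicy": {
        "ResourceSignal": {
            "Count": "1",
            "Timeout": "PT15M"
        }
    }
}

If the instance never sends that signal (typically via cfn-signal in its setup), the stack creation fails, which is exactly the behaviour described a little further down.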

Again, this is a problem that we’ve solved before, and we have a decent pattern in place for putting this sort of thing together.

  • A dedicated repository for the environment, containing the CloudFormation template, configuration and deployment logic
  • A TeamCity Build Configuration, which uses the contents of this repository and builds and tests a versioned package
  • An Octopus project, which contains all of the logic necessary to target the deployment, along with any environment level variables (like target ES cluster)

The good news was that the standard environment stuff worked perfectly. It built, a package was created and that package was deployed.

The bad news was that the deployment never actually completed successfully because the Elastalert AMI failed to result in a working EC2 instance, which meant that the environment failed miserably as the Auto Scaling Group never received a success signal.

But why?

Snakes Are Tricky

It actually took us a while to get to the bottom of the problem, because Elastalert appeared to be fully functional at the end of the Packer process, but the AMI created from that EC2 instance seemed to be fundamentally broken.

Any EC2 instance created from that AMI just didn’t work, regardless of how we used it (i.e. CloudFormation vs manual instance creation, nothing mattered).

The instance would be created and it would “go green” (i.e. the AWS status checks and whatnot would complete successfully) but we couldn’t connect to it using any of the normal mechanisms (SSH using the specified key being the most obvious). It was like none of the normal EC2 setup was being executed, which was weird, because we’ve created many different AMIs through Packer and we hadn’t done anything differently this time.

Looking at the system log for the broken EC2 instances (via the AWS Dashboard), we could see that the core setup procedure of the EC2 instance (cloud-init, which among other things uses the supplied key to set up SSH access) was failing due to problems with Python.

What else uses Python?

That’s right, Elastalert.

It turned out that our Elastalert installation script was updating some dependencies that the EC2 initialization relied on, and those updates had completely broken the normal setup procedure.

The AMI was functionally useless.

Dock Worker

We went through a few different approaches to try and fix the underlying dependency conflicts, but in the end we settled on using Docker.

At a very high level, Docker is a kind of virtualization platform, except that instead of virtualizing an entire OS it sits a little bit above that, isolating individual applications while leveraging the host OS rather than simulating the whole thing. Each Docker image generally hosts a single application in a completely isolated environment, which makes it a perfect fit when you have system software conflicts like we did.

Of course, we had to change our strategy somewhat in order for this to work.

Instead of using Packer to create an AMI with Elastalert installed, we now have to create an AMI with Docker (and Octopus) installed and available.

Same pattern as before, just different software being installed.
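On Amazon Linux, the Docker part of that installation is only a few commands (a sketch of the general idea, rather than our exact script):

sudo yum update -y;
sudo yum install docker -y;
sudo service docker start;
sudo usermod -a -G docker ec2-user;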

Nothing much changed in the environment though, as it’s still just an Auto Scaling Group spinning up an EC2 instance using the specified AMI.

The big changes were in the Elastalert configuration deployment, which now had to be responsible for both deploying the actual configuration and making sure the Elastalert Docker image was correctly configured and running.

To Be Continued

And that is as good a place as any to stop for now.

Next week I’ll explain what our original plan was for the Elastalert configuration deployment and how that changed when we switched to using Docker to host an Elastalert image.


Well, it’s been almost 2 years now since I made a post about Sensu as a generic alerting/alarming mechanism. It ended on a hopeful note, explaining that the content of the post was relatively theoretical and that we hoped to put some of it in place in the coming weeks/months.

Yeah, that never happened.

It’s not like we didn’t have any alerts or alarms during that time; we just never continued on with the whole theme of “let’s put something together to yell at us whenever weird stuff happens in our ELK stack”. We’ve been using Pingdom ever since our first service went live (to monitor HTTP endpoints and websites) and we’ve been slowly increasing our usage of CloudWatch alarms, but all of that juicy intelligence in the ELK stack is still languishing in alerting limbo.

Until now.

Attention Deficit Disorder

As I’ve previously outlined, we have a wealth of information available in our ELK stack, including things like IIS logs, application logs, system statistics for infrastructure (i.e. memory, CPU, disk space, etc), ELB logs and various intelligence events (like “user used feature X”).

This information has proven to be incredibly valuable for general analysis (bug identification and resolution is a pretty common case), but historically the motivation to start digging into the logs has come through some other channel, like a customer complaining via our support team or someone just noticing that “hey, this thing doesn’t look right”.

It’s all very reactive, and in the past we’ve missed early warning signs and let issues affect real people, which is sloppy at best.

We can do better.

Ideally what we need to do is identify symptoms or leading indicators that things are starting to go wrong or degrade, and then dynamically alert the appropriate people when these things are detected, so we can action them ASAP. In a perfect world, these sorts of triggers would be identified and put in place as an integral part of feature delivery, but for now it would be enough that they just exist at some point in time.

And that’s where Elastalert comes in.

It’s Not That We Can’t Pay Attention

Elastalert is a relatively straightforward piece of software that allows you to do things when the data in an Elasticsearch cluster meets certain criteria.

It was created at Yelp to work in conjunction with their ELK stack for exactly the purpose that we’re chasing, so it’s basically a perfect fit.

Also, it’s free.

Elastic.co offers an alerting solution themselves, in the form of X-Pack Alerting (formerly Watcher). As far as I know it’s pretty amazing, and integrates smoothly with Kibana. However, it costs money, and it’s one of those things where you actually have to request a quote, rather than just being a price on a website, so you know it’s expensive. I think we looked into it briefly, but I can’t remember what the actual price would have been for us. I remember it being crazy though.

The Elastalert documentation is pretty awesome, but at a high level the tool offers a number of different ways to trigger alerts and a number of notification channels (like Hipchat, Slack, Email, etc) to execute when an alert is triggered.

All of the configuration is YAML based, which is a pretty common format these days, and all of the rules are just files, so it’s all easy to manage.

Here’s an example rule that we use for detecting spikes in the amount of 50X response codes occurring for any of our services:

name: Spike in 5xxs
type: spike
index: logstash-*

timeframe:
  seconds: @@ELASTALERT_CHECK_FREQUENCY_SECONDS@@

spike_height: 2
spike_type: up
threshold_cur: @@general-spike-5xxs.yaml.threshold_cur@@

filter:
- query:
    query_string:
      query: "Status: [500 TO 599]"
alert: "hipchat"
alert_text_type: alert_text_only
alert_text: |
  <b>{0}</b>
  <a href="@@KIBANA_URL@@">5xxs spiked {1}x. Was {2} in the last {3}, compared to {4} the previous {3}</a>
hipchat_message_format: html
hipchat_from: Elastalert
hipchat_room_id: "@@HIPCHAT_ROOM@@"
hipchat_auth_token: "@@HIPCHAT_TOKEN@@"
alert_text_args:
- name
- spike_height
- spike_count
- reference_count

The only thing in the rule above not covered extensively in the documentation is the @@SOMETHING@@ notation that we use to do some substitutions during deployment. I’ll talk about that a little bit later, but essentially it’s just a way to customise the rules on a per-environment basis without having to rewrite the entire rule (so CI rules can execute every 30 seconds over the last 4 hours, but production might check every few minutes over the last hour and so on).
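Conceptually the substitution is nothing more than a find and replace over each rule file at deployment time. A rough sketch of the idea (the tokens, values and path are illustrative, and this is not our actual deployment script):

$rule = "./rules/general-spike-5xxs.yaml"
$substitutions = @{
    "@@ELASTALERT_CHECK_FREQUENCY_SECONDS@@" = "30";
    "@@HIPCHAT_ROOM@@" = "Our Alerts Room"
}

# Read the rule, replace each token with its environment specific value, write it back
$content = Get-Content $rule -Raw
foreach ($token in $substitutions.Keys) {
    $content = $content.Replace($token, $substitutions[$token])
}
Set-Content -Path $rule -Value $content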

There’s Just More Important Thi….Oh A Butterfly!

With the general introduction to Elastalert out of the way, the plan for this series of posts is eerily similar to what I did for the ELK stack refresh.

Hopefully I can put together a publicly accessible repository in Github with all of the Elastalert work in it before the end of this series of posts, but I can’t make any promises. It’s pretty time consuming to take one of our internal repositories and sanitise it for consumption by the greater internet, even if it is pretty useful.

To Be Continued

Before I finish up, I should make it clear that we’ve already implemented the Elastalert stuff, so it’s not in the same boat as our plans for Sensu. We’re literally using Elastalert right now to yell at us whenever interesting things happen in our ELK stack, and it’s already proven to be quite useful in that respect.

Next week, I’ll go through the Elastalert environment we set up, and why the Elastalert application and Amazon Linux EC2 instances don’t get along very well.


A long long time ago, a valuable lesson was learned about segregating production infrastructure from development/staging infrastructure. Feel free to go and read that post if you want, but to summarise: I ran some load tests on an environment (one specifically built for load testing, separate to our normal CI/Staging/Production), and the tests flooded a shared component (a proxy), bringing down production services and impacting customer experience. The outage didn’t last that long once we realised what was happening, but it was still pretty embarrassing.

Shortly after that adventure, we created a brand new AWS account to isolate our production infrastructure and slowly moved all of that important stuff into it, protecting it from developers doing what they do best (breaking stuff in new and interesting ways).

This arrangement complicated a few things, but the most relevant to this discussion was the creation and management of AMIs.

We were already using Packer to create and maintain said AMIs, so it wasn’t a manual process, but an AMI in AWS is owned by (and, by default, only accessible from) a single AWS account.

With two completely different AWS accounts, it was easy to imagine a situation where each account has slightly different AMIs available, which have slightly different behaviour, leading to weird things happening on production environments that don’t happen during development or in staging.

That sounds terrible, and it would be neat if we could ensure it doesn’t happen.

A Packaged Deal

The easiest thing to do is share the AMIs in question.

AWS makes it relatively easy to make an AMI accessible to a different AWS account, likely for this exact purpose. I think it’s also used to enable companies to sell pre-packaged AMIs, but that’s a space I know little to nothing about, so I’m not sure.

As long as you know the account number of the AWS account that you want to grant access to, it’s a simple matter to use the dashboard or the API to share the AMI, which can then be freely used from the other account to create EC2 instances.

One thing to be careful of is to make sure you also grant the other account access to the snapshots backing the AMI, or you’ll run into permission problems when you actually try to use the AMI to make an EC2 instance.
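Using the AWS PowerShell cmdlets, the sharing looks roughly like this (the AMI id and account number are obviously placeholders):

$ami = "ami-12345678"
$prodAccount = "123456789012"

# Allow the production account to launch instances from the AMI
Edit-EC2ImageAttribute -ImageId $ami -Attribute launchPermission -OperationType add -UserId $prodAccount

# Also share the snapshots backing the AMI, or instance creation will fail with permission errors
$image = Get-EC2Image -ImageId $ami
foreach ($mapping in $image.BlockDeviceMappings) {
    if ($mapping.Ebs) {
        Edit-EC2SnapshotAttribute -SnapshotId $mapping.Ebs.SnapshotId -Attribute createVolumePermission -OperationType add -UserId $prodAccount
    }
}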

Sharing AMIs is alright, but it has risks.

If you create AMIs in your development account and then share them with production, then you’ve technically got production infrastructure inside your development account, which was one of the things we desperately wanted to avoid. The main problem here is that people will not assume that a resource living inside the relatively free-for-all development environment could have any impact on production, and they might delete it or something equally dangerous. Without the AMI, auto scaling won’t work, and the most likely time to figure that sort of thing out is right when you need it the most.

A slightly better approach is to copy the AMI to the other account. In order to do this you share the AMI from the owner account (i.e. dev) and then make a permanent copy on the other account (i.e. prod). Once the copy is complete, you unshare (to prevent accidental usage).
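The copy itself is a single call made from the destination (prod) account while the AMI is still shared, something along these lines (region and naming are illustrative):

# Run with prod credentials; the source AMI must still be shared at this point
Copy-EC2Image -SourceImageId "ami-12345678" -SourceRegion "ap-southeast-2" -Region "ap-southeast-2" -Name "teamcity-buildagent-1.2.3"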

This breaks the linkage between the two accounts while ensuring that the AMIs are identical, so it’s a step up from simple sharing, but there are limitations.

For example, everything works swimmingly until you try to copy a Windows AMI, at which point it fails miserably as a result of the way in which AWS licences Windows. On the upside, the copy operation itself fails fast, rather than making a copy that then fails when you try to use it, so that’s nice.

So, two solutions, neither of which is ideal.

Surely we can do better?

Pack Of Wolves

For us, the answer is yes. We just run our Packer templates twice, once for each account.

This has actually been our solution for a while. We execute our Packer templates through TeamCity Build Configurations, so it is a relatively simple matter to just run the build twice, once for each account.

Well, “relatively simple” is probably understating it actually.

Running in dev is easy. Just click the button, wait and a wild AMI appears.

Prod is a different question.

When creating an AMI, Packer needs to know some things that are AWS account specific, like subnets, VPC, security groups and so on (mostly networking concerns). The source code contained parameters relevant for dev (hence the easy AMI creation for dev), but didn’t contain anything relevant for prod. Instead, whenever you ran a prod build in TeamCity, you had to supply a hashtable of parameter overrides, which would be used to alter the defaults and make it work in the prod AWS account.

As you can imagine, this is error prone.

Additionally, you actually have to remember to click the build button a second time and supply the overrides in order to make a prod image, or you’ll end up in a situation where you deployed your environment changes successfully through CI and Staging, but it all explodes (or even worse, subtly doesn’t do what it’s supposed to) when you deploy them into Production because there is no equivalent AMI. Then you have to go and make one using TeamCity, which is error prone, and if the source has diverged since you made the dev one…well, it’s just bad times all around.

Leader Of The Pack

With some minor improvements, we can avoid that whole problem though.

Basically, whenever we do a build in TeamCity, it creates the dev AMI first, and then automatically creates the prod one as well. If the dev fails, no prod. If the prod fails, dev is deleted.

To keep things in sync, both AMIs are tagged with a version attribute created during the build (just like software), so that we have a way to trace the AMI back to the git commit it was created from (just like software).

To accomplish this, we now have a relatively simple configuration hierarchy, with default parameters, dev-specific parameters and prod-specific parameters. When you start the AMI execution, you tell the function which environment you’re targeting (dev/prod) and it loads the defaults, then merges in the appropriate overrides.

This was a relatively easy way to deal with things that are different and non-sensitive (like VPC, subnets, security groups, etc).
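The merge itself is about as simple as it sounds. A rough sketch of the idea (keys and values invented purely for illustration):

$defaults = @{ "instance_type" = "t2.medium" }
$environments = @{
    "dev"  = @{ "vpc_id" = "vpc-11111111"; "subnet_id" = "subnet-11111111" };
    "prod" = @{ "vpc_id" = "vpc-22222222"; "subnet_id" = "subnet-22222222" }
}

function Get-AmiParameters([string]$environment) {
    # Start from the defaults and layer the environment specific values over the top
    $merged = $defaults.Clone()
    foreach ($key in $environments[$environment].Keys) {
        $merged[$key] = $environments[$environment][$key]
    }
    return $merged
}

$parameters = Get-AmiParameters "prod"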

What about credentials though?

Since an…incident…waaaay back in 2015, I’m pretty wary of credentials, particularly ones that give access to AWS.

So they can’t go in source control with the rest of the parameters.

That leaves TeamCity as the only sane place to put them, which it can easily do, assuming we don’t mind writing some logic to pick the appropriate credentials depending on our targeted destination.

We could technically have used some combination of IAM roles and AWS profiles as well, but we already have mechanisms and experience dealing with raw credential usage, so this was not the time to re-invent that particular wheel. That’s a fight for another day.

With account specific parameters and credentials taken care of, everything is good, and every build results in 2 AMIs, one for each account.

I’ve uploaded a copy of our Packer repository containing all of this logic (and a copy of the script we embed into TeamCity) to Github for reference purposes.

Conclusion

I’m much happier with the process I described above for creating our AMIs. If a build succeeds, it creates resources in both of our active AWS accounts, keeping them in sync and reducing the risk of subtle problems come deployment time. Not only that, but it also tags those resources with a version that can be traced back to a git commit, which is always more useful than you think.

There are still some rough edges around actually using the AMIs though. Most of our newer environments specify their AMIs directly via parameter files, so you have to remember to change the values for each environment target when you want to use a new AMI. This is dangerous, because if someone forgets it could lead to a disconnect between CI/Staging and Production, which was pretty much the entire problem we were trying to avoid in the first place.

Honestly, it’s going to be me that forgets.

Ah well, all in all, it’s a lot more consistent than it was before, which is pretty much the best I could hope for.


I use Packer on and off. Mostly I use it to make Amazon Machine Images (AMIs) for our environment management packages, specifically by creating Packer templates that operate on top of the Amazon-supplied Windows Server images.

You should never use an Amazon-supplied Windows Server AMI directly in your Auto Scaling Group Launch Configurations. These images are regularly retired, so if you’ve taken a dependency on one, there is a good chance it will disappear just when you need it most. Like when you need to auto-scale your API cluster because you’ve unknowingly burnt through all of the CPU credits on the machines slowly over the course of the last few months. What you should do is create an AMI of your own from the ones supplied by AWS, so you can control its lifetime. Packer is a great tool for this.

A Packer template is basically a set of steps to execute on a virtual machine of some sort, where the core goal is to take some baseline image, apply a set of steps to it programmatically and end up with a reusable artefact out the other end. Like I mentioned earlier, we mostly deal in AWS AMIs, but it can do a bunch of other things as well (VMWare, Docker, etc).

The main benefit of using a Packer template for this sort of thing (instead of just doing it all manually) is reproducibility. Specifically, if you built your custom image using the AWS AMI for Windows Server 2012 six months ago, you can go and grab the latest one from yesterday (with all of the patches and security upgrades), execute your template on it, and you’ll be in a great position to upgrade all of the existing usages of your old custom AMI with minimal effort.
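One way to avoid hardcoding an AMI id that will eventually be retired is the amazon-ebs builder’s source_ami_filter, which resolves the newest matching base image at build time. A rough excerpt (region, naming and instance size here are illustrative, not our actual template):

{
    "type": "amazon-ebs",
    "region": "ap-southeast-2",
    "instance_type": "t2.medium",
    "communicator": "winrm",
    "winrm_username": "Administrator",
    "ami_name": "windows-server-2012-r2-custom-{{timestamp}}",
    "source_ami_filter": {
        "filters": {
            "name": "Windows_Server-2012-R2_RTM-English-64Bit-Base-*",
            "virtualization-type": "hvm",
            "root-device-type": "ebs"
        },
        "owners": ["amazon"],
        "most_recent": true
    }
}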

When using Packer templates though, you need to be cognizant of how errors are dealt with. Specifically:

Step failures appear to be indicated entirely by the exit code of the tool used in the step.

I’ve been bitten by this on two separate occasions.

A Powerful Cry For Help

Packer has much better support for Windows than it once did, but even taking that into account, Powershell steps can still be a troublesome beast.

The main issue with the Powershell executable is that if an error or exception occurs and terminates the process (i.e. it’s a terminating error or you have ErrorActionPreference set to Stop), the Powershell process itself still exits with zero.

In a sane world, an exit code of zero indicates success, which is what Packer expects (as do most other automation tools, like TeamCity and Octopus Deploy).

If you don’t take this into account, your Powershell steps may fail but the Packer execution will still succeed, giving you an artefact that hasn’t been configured the way it should have been.

Packer is pretty configurable though, and is very clear about the command that it uses to execute your Powershell steps. The great thing is, it also enables you to override that command, so you can customise your Powershell steps to exit with a non-zero code if an error occurs without actually having to change every line in your step to take that sort of thing into account.

Take this template excerpt below, which uses Powershell to set the timezone of the machine and turn off negative DNS result caching.

{
    "type": "powershell",
    "inline": [
        "tzutil.exe /s \"AUS Eastern Standard Time_dstoff\"",
        "[Microsoft.Win32.Registry]::SetValue('HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Services\\Dnscache\\Parameters','NegativeCacheTime',0,[Microsoft.Win32.RegistryValueKind]::DWord)"
    ],
    "execute_command": "powershell -Command \"$ErrorActionPreference = 'Stop'; try { & '{{.Path}}' } catch { Write-Warning $_; exit 1; } \""
}

The “execute_command” is the customisation, providing error handling for exceptions that occur during the execution of the Powershell snippet. Packer will take each line in that inline array, copy it to a file on the machine being set up (using WinRM) and then execute it using the command you specify. The {{.Path}} syntax is Packer variable substitution, and specifically refers to the path on the virtual machine that Packer has copied the current command to. With this custom command in place, you have a much better chance of catching errors in your Powershell commands before they come back to bite you later on.

So Tasty

In a similar vein to the failures with Powershell above, be careful when doing package installs via yum on Linux.

The standard “yum install” command will not necessarily exit with a non-zero code when a package fails to install. Sometimes it will, but if a package couldn’t be found (maybe you misconfigured the repository or something) it still exits with a zero.

This can throw a pretty big spanner in the works when you’re expecting your AMI to have Elasticsearch on it (for example) and it just doesn’t because the package installation failed but Packer thought everything was fine.

Unfortunately, there is no easy way to get around this like there is for the Powershell example above, but you can mitigate it by just adding an extra step after your package install that validates the package was actually installed.

{
    "type" : "shell",
    "inline" : [
        "sudo yum remove java-1.7.0-openjdk -y",
        "sudo yum install java-1.8.0 -y",
        "sudo yum update -y",
        "sudo sh -c 'echo \"[logstash-5.x]\" >> /etc/yum.repos.d/logstash.repo'",
        "sudo sh -c 'echo \"name=Elastic repository for 5.x packages\" >> /etc/yum.repos.d/logstash.repo'",
        "sudo sh -c 'echo \"baseurl=https://artifacts.elastic.co/packages/5.x/yum\" >> /etc/yum.repos.d/logstash.repo'",
        "sudo sh -c 'echo \"gpgcheck=1\" >> /etc/yum.repos.d/logstash.repo'",
        "sudo sh -c 'echo \"gpgkey=http://packages.elastic.co/GPG-KEY-elasticsearch\" >> /etc/yum.repos.d/logstash.repo'",
        "sudo sh -c 'echo \"enabled=1\" >> /etc/yum.repos.d/logstash.repo'",
        "sudo rpm --import https://packages.elastic.co/GPG-KEY-elasticsearch",
        "sudo yum install logstash-5.2.2 -y",
        "sudo rpm --query logstash-5.2.2"
    ]
}

In the example above, the validation is the rpm --query command after the yum install. It will return a non-zero exit code (and thus fail the Packer execution) if the package with that version is not installed.

Conclusion

Packer is an incredibly powerful automation tool for dealing with a variety of virtual machine platforms and I highly recommend using it.

If you’re going to use it though, you need to understand what failure means in your specific case, and you need to take that into account when you decide how to signal to the Packer engine that something isn’t right.

For me, I prefer to treat every error as critical and deal with it at the time the AMI is created, rather than 6 months later when I try to use the AMI and can’t figure out why the Windows Firewall on an internal API instance is blocking requests from its ELB. Not that that has ever happened of course.

In order to accomplish this lofty goal of dealing with errors ASAP you need to understand how each one of your steps (and the applications and tools they use) communicate failure, and then make sure they all communicate that appropriately in a way Packer can understand.

Understanding how to deal with failure is useful outside Packer too.


We use TeamCity as our Continuous Integration tool.

Unfortunately, our setup hasn’t been given as much love as it should have. It’s not bad (or broken or any of those things), it’s just not quite as well set up as it could be, which increases the risk that it will break and makes it harder to manage (and change and keep up to date) than it could be.

As with everything that has got a bit crusty around the edges over time, the only real way to attack it while still delivering value to the business is by doing it slowly, piece by piece, over what seems like an inordinate amount of time. The key is to minimise disruption, while still making progress on the bigger picture.

Our setup is fairly simple. A centralised TeamCity server and at least 3 Build Agents capable of building all of our software components. We host all of this in AWS, but unfortunately, it was built before we started consistently using CloudFormation and infrastructure as code, so it was all manually set up.

Recently, we started using a few EC2 spot instances to provide extra build capabilities without dramatically increasing our costs. This worked fairly well, up until the spot price spiked and we lost the build agents. We used persistent requests, so they came back, but they needed to be configured again before they would hook up to TeamCity because of the manual way in which they were provisioned.

There’s been a lot of instability in the spot price recently, so we were dealing with this manual setup on a daily basis (sometimes multiple times per day), which got old really quickly.

You know what they say.

“If you want something painful automated, make a developer responsible for doing it manually and then just wait.”

It’s Automatic

The goal was simple.

We needed to configure the spot Build Agents to automatically bootstrap themselves on startup.

On the upside, the entire process wasn’t completely manual. We were at least spinning up the instances from a pre-built AMI that already had all of the dependencies for our older, crappier components as well as an unconfigured TeamCity Build Agent on it, so we didn’t have to automate absolutely everything.

The bootstrapping would need to tag the instance appropriately (because for some reason spot instances don’t inherit the tags of the spot request), configure the Build Agent and then start it up so it would connect to TeamCity. Ideally, it would also register and authorize the Build Agent, but if we used controlled authorization tokens we could avoid this step by just authorizing the agents once; they would then automatically reappear each time the spot instance came back.

So tagging, configuring, service start, using Powershell, with the script baked into the AMI. During provisioning we would supply some UserData that would execute the script.

Not too complicated.
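The UserData supplied with the spot request then just becomes a thin wrapper that calls the baked-in script (the path and parameters below are invented for illustration):

<powershell>
# EC2Config executes anything between these tags on first boot
& "C:\scripts\bootstrap-buildagent.ps1" -ServerUrl "https://teamcity.example.com" -AgentNamePrefix "spot"
</powershell>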

Like Graffiti, Except Useful

Tagging an EC2 instance is pretty easy thanks to the multitude of toolsets that Amazon provides. Our tool of choice is the Powershell cmdlets, so the actual tagging was a simple task.

Getting permission to do the tagging was another story.

We’re pretty careful with our credentials these days, for some reason, so we wanted to make sure that we weren’t supplying and persisting any credentials in the bootstrapping script. That means IAM.

One of the key features of the Powershell cmdlets (and most of the Amazon supplied tools) is that they are supposed to automatically grab credentials if they are being run on an EC2 instance that currently has an instance profile associated with it.

For some reason, this would just not work. We tried a number of different things to get this to work (including updating the version of the Powershell cmdlets we were using), but in the end we had to resort to calling the instance metadata service directly to grab some credentials.

Obviously the instance profile that we applied to the instance represented a role with a policy that only had permissions to alter tags. Minimal permission set and all that.
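The relevant part of the bootstrapping ended up looking something like this sketch (tag values, region and the role handling are illustrative; the cmdlets are from the AWS Tools for PowerShell module):

# Ask the metadata service who we are
$instanceId = Invoke-RestMethod "http://169.254.169.254/latest/meta-data/instance-id"

# Workaround: grab the temporary credentials for the instance profile role directly from metadata
$roleName = Invoke-RestMethod "http://169.254.169.254/latest/meta-data/iam/security-credentials/"
$credentials = Invoke-RestMethod "http://169.254.169.254/latest/meta-data/iam/security-credentials/$roleName"

# Tag the instance using those temporary credentials
$tag = New-Object Amazon.EC2.Model.Tag
$tag.Key = "Name"
$tag.Value = "teamcity-spot-build-agent"
New-EC2Tag -Resource $instanceId -Tag $tag -AccessKey $credentials.AccessKeyId -SecretKey $credentials.SecretAccessKey -SessionToken $credentials.Token -Region "ap-southeast-2"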

Service With a Smile

Starting/stopping services with Powershell is trivial, and for once nothing weird happened to cause us to burn days tracking down some obscure bug that only manifests in our particular use case.

I was as surprised as you are.

Configuration Is Key

The final step should have been relatively simple.

Take a file with some replacement tokens, read it, replace the tokens with appropriate values, write it back.

Except it just wouldn’t work.

After editing the file with Powershell (a relatively simple Get-Content | ForEach-Object { $_ -replace {token}, {value} } | Out-File), the TeamCity Build Agent would refuse to load.

Checking the log file, its biggest (and only) complaint was that the serverUrl (which is the location of the TeamCity server) was not set.

This was incredibly confusing, because the file clearly had a serverUrl value in it.

I tried a number of different things to determine the root cause of the issue, including:

  • Permissions? Was the file accidentally locked by TeamCity such that the Build Agent service couldn’t access it?
  • Did the rewrite of the tokens somehow change the format of the file (extra spaces, CR LF when it was just expecting LF)?
  • Was the serverUrl actually configured, but inaccessible for some reason (machine proxy settings for example) and the problem was actually occurring not when the file was rewritten but when the script was setting up the AWS Powershell cmdlets proxy settings?

Long story short, it turns out that Powershell doesn’t preserve the file encoding when using Out-File the way we were using it. It was rewriting the file from ASCII to Unicode (UTF-16 Little Endian, complete with a Byte Order Mark), and the Build Agent did not like that (it didn’t throw an encoding error either, which is super annoying, but whatever).

The error message was both a red herring (yes, the value was configured) and also truthful (the Build Agent was incapable of reading the serverUrl).
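The fix is simply to be explicit about the encoding when writing the file back out; something along these lines would do it (the token name is invented for illustration):

# Force ASCII output so the Build Agent can still parse the properties file
(Get-Content $buildAgentProperties) |
    ForEach-Object { $_ -replace "@@SERVER_URL@@", $serverUrl } |
    Out-File $buildAgentProperties -Encoding ascii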

Putting It All Together

With all the pieces in place, it was a relatively simple matter to create a new AMI with those scripts baked into it and put it to work straightaway.

Of course, I’d been doing this the whole time in order to test the process, so I certainly had a lot of failures building up to the final deployment.

Conclusion

Even simple automation can prove to be time consuming, especially when you run into weird unforeseen problems, like components not performing as advertised, or not even reporting errors correctly for you to use for debugging purposes.

Still, it was worth it.

Now I never have to manually configure those damn spot instances when they come back.

And satisfaction is worth its weight in gold.