
Full disclosure: most of the Elastalert related work was actually done by a colleague of mine; I’m just writing about it because I thought it was interesting.

Last week I did a bit of an introduction to Elastalert, as it is the new mechanism that we use to alert on the data in our ELK stack.

We take our infrastructure pretty seriously though, so I didn’t want to just manually create an Elastalert instance and set it up to do things. It all needs to be codified and controlled, with a deployment pipeline for distributing changes (like new rules or changed rules) and everything needs to be versioned as appropriate.

After doing some very high level playing around (just to make sure it all worked relatively as advertised), it was time to do it properly and set up an auto-scaling, auto-healing Elastalert environment, just like all of the other ones.

Packing It Away

Installing Elastalert is pretty straightforward.

It’s all Python based, so it’s a fairly simple matter to use pip to install the package:

pip install elastalert

This doesn’t quite work out of the box on an Amazon Linux EC2 instance though, as you also have to install some dependencies that are not immediately obvious.

sudo yum update -y;
sudo yum install gcc gcc-c++ -y;
sudo yum install libffi-devel -y;
sudo yum install openssl-devel -y;
sudo pip install elastalert;

With that out of the way, the machine is basically ready to run Elastalert, assuming you configure it correctly (as per the documentation).
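Once the configuration file is in place, starting it up is only a couple more commands. A minimal sketch, assuming the config lives at /opt/elastalert/config.yaml (the path is illustrative):

# Create the Elasticsearch index Elastalert uses to store its own metadata
# (it will prompt for, or read, the Elasticsearch connection details),
# then start the daemon against the supplied configuration.
elastalert-create-index
elastalert --config /opt/elastalert/config.yaml --verbose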

With a relatively self contained installation script out of the way, it was time to use Packer to create an AMI containing Elastalert, to be used inside the impending environment.

The Packer configuration for an AMI with Elastalert installed on it is pretty straightforward, and just follows the normal pattern, which I described in this post and which you can see directly in this Github repository. The only meaningful difference is the script that installs Elastalert itself, which you can see above.

Cumulonimbus Clouds Are My Favourite

With an AMI created and ready to go, all that’s left is to create a simple environment to run it in.

Nothing fancy, just a CloudFormation template with a single auto scaling group in it, such that accidental or unexpected terminations self-heal. No need for a load balancer, DNS entries or anything like that; it’s purely a background process that sits quietly and yells at us as appropriate.

Again, this is a problem that we’ve solved before, and we have a decent pattern in place for putting this sort of thing together.

  • A dedicated repository for the environment, containing the CloudFormation template, configuration and deployment logic
  • A TeamCity Build Configuration, which uses the contents of this repository and builds and tests a versioned package
  • An Octopus project, which contains all of the logic necessary to target the deployment, along with any environment level variables (like target ES cluster)

The good news was that the standard environment stuff worked perfectly. It built, a package was created and that package was deployed.

The bad news was that the deployment never actually completed successfully, because the Elastalert AMI failed to produce a working EC2 instance, which meant that the Auto Scaling Group never received a success signal and the environment failed miserably.
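For context, that success signal typically comes from the instance itself, right at the end of its UserData, paired with a CreationPolicy on the Auto Scaling Group. A minimal sketch of what that usually looks like on an Amazon Linux instance (the setup script, stack and resource names are illustrative):

# Tail end of the instance UserData: run the actual setup, then tell
# CloudFormation whether it worked. The CreationPolicy on the Auto Scaling
# Group waits for this signal and fails the deployment if it never arrives.
/opt/elastalert/setup.sh
/opt/aws/bin/cfn-signal --exit-code $? \
    --stack ExampleElastalertEnvironment \
    --resource AutoScalingGroup \
    --region ap-southeast-2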

But why?

Snakes Are Tricky

It actually took us a while to get to the bottom of the problem, because Elastalert appeared to be fully functional at the end of the Packer process, but the AMI created from that EC2 instance seemed to be fundamentally broken.

Any EC2 instance created from that AMI just didn’t work, regardless of how we used it (i.e. CloudFormation vs manual instance creation, nothing mattered).

The instance would be created and it would “go green” (i.e. the AWS status checks and whatnot would complete successfully) but we couldn’t connect to it using any of the normal mechanisms (SSH using the specified key being the most obvious). It was like none of the normal EC2 setup was being executed, which was weird, because we’ve created many different AMIs through Packer and we hadn’t done anything differently this time.

Looking at the system log for the broken EC2 instances (via the AWS Dashboard) we could see that the core setup procedure of the EC2 instance (where it uses the supplied key file to set up access, among other things) was failing due to problems with Python.
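You don’t actually need the dashboard for that either; the same log can be pulled via the CLI (the instance id is a placeholder):

# Fetch the system log for a broken instance without going near the console.
aws ec2 get-console-output --instance-id i-0123456789abcdef0 --output text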

What else uses Python?

That’s right, Elastalert.

It turned out that our Elastalert installation script was updating some Python dependencies that the EC2 initialization relied on, and those updates had completely broken the normal setup procedure.

The AMI was functionally useless.

Dock Worker

We went through a few different approaches to try and fix the underlying dependency conflicts, but in the end we settled on using Docker.

At a very high level, Docker is a kind of virtualization platform, except it doesn’t virtualize the entire OS; it sits a little bit above that, virtualizing a set of applications instead and leveraging the OS rather than simulating the entire thing. Each Docker image generally hosts a single application in a completely isolated environment, which makes it the perfect solution when you have system software conflicts like we did.

Of course, we had to change our strategy somewhat in order for this to work.

Instead of using Packer to create an AMI with Elastalert installed, we now have to create an AMI with Docker (and Octopus) installed and available.

Same pattern as before, just different software being installed.

Nothing much changed in the environment though, as it’s still just an Auto Scaling Group spinning up an EC2 instance using the specified AMI.

The big changes were in the Elastalert configuration deployment, which now had to be responsible for both deploying the actual configuration and making sure the Elastalert Docker image was correctly configured and running.
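As a rough sketch of where that ends up (the image name and mount paths here are purely illustrative, not our actual setup), running Elastalert in a container looks something like this:

# Pull a hypothetical Elastalert image and run it with the deployed
# configuration and rules mounted in, restarting it if it ever dies.
docker pull example/elastalert:latest
docker run --detach \
    --name elastalert \
    --restart always \
    --volume /opt/elastalert/config.yaml:/opt/elastalert/config.yaml \
    --volume /opt/elastalert/rules:/opt/elastalert/rules \
    example/elastalert:latest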

To Be Continued

And that is as good a place as any to stop for now.

Next week I’ll explain what our original plan was for the Elastalert configuration deployment and how that changed when we switched to using Docker to host an Elastalert image.


A long long time ago, a valuable lesson was learned about segregating production infrastructure from development/staging infrastructure. Feel free to go and read that post if you want, but to summarise: I ran some load tests on an environment (one specifically built for load testing, separate to our normal CI/Staging/Production), and the tests flooded a shared component (a proxy), bringing down production services and impacting customer experience. The outage didn’t last that long once we realised what was happening, but it was still pretty embarrassing.

Shortly after that adventure, we created a brand new AWS account to isolate our production infrastructure and slowly moved all of that important stuff into it, protecting it from developers doing what they do best (breaking stuff in new and interesting ways).

This arrangement complicated a few things, but the most relevant to this discussion was the creation and management of AMIs.

We were already using Packer to create and maintain said AMIs, so it wasn’t a manual process, but an AMI in AWS is always owned by and accessible from one AWS account by default.

With two completely different AWS accounts, it was easy to imagine a situation where each account has slightly different AMIs available, which have slightly different behaviour, leading to weird things happening on production environments that don’t happen during development or in staging.

That sounds terrible, and it would be neat if we could ensure it doesn’t happen.

A Packaged Deal

The easiest thing to do is share the AMIs in question.

AWS makes it relatively easy to make an AMI accessible to a different AWS account, likely for this exact purpose. I think it’s also used to enable companies to sell pre-packaged AMIs, but that’s a space I know little to nothing about, so I’m not sure.

As long as you know the account number of the AWS account that you want to grant access to, it’s a simple matter to use the dashboard or the API to share the AMI, which can then be freely used from the other account to create EC2 instances.

One thing to be careful of is to make sure you also grant the other account access to the underlying AMI snapshot, or you’ll run into permission problems when you actually try to use the AMI to make an EC2 instance.
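Via the CLI, the two grants look something like this (the account number, AMI id and snapshot id are placeholders):

# Allow the other account to launch instances from the AMI...
aws ec2 modify-image-attribute \
    --image-id ami-11111111 \
    --launch-permission "Add=[{UserId=123456789012}]"

# ...and to read the EBS snapshot that backs it.
aws ec2 modify-snapshot-attribute \
    --snapshot-id snap-22222222 \
    --attribute createVolumePermission \
    --operation-type add \
    --user-ids 123456789012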

Sharing AMIs is alright, but it has risks.

If you create AMIs in your development account and then share them with production, then you’ve technically got production infrastructure inside your development account, which was one of the things we desperately wanted to avoid. The main problem here is that people will not assume that a resource living inside the relatively free-for-all development environment could have any impact on production, and they might delete it or something equally dangerous. Without the AMI, auto scaling won’t work, and the most likely time to figure that sort of thing out is right when you need it the most.

A slightly better approach is to copy the AMI to the other account. In order to do this you share the AMI from the owner account (i.e. dev) and then make a permanent copy on the other account (i.e. prod). Once the copy is complete, you unshare (to prevent accidental usage).
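The copy itself is a single call made with the destination account’s credentials (the ids and regions are placeholders):

# Run from the prod account: copy the shared dev AMI into prod, producing a
# brand new AMI that prod owns outright.
aws ec2 copy-image \
    --source-region ap-southeast-2 \
    --source-image-id ami-11111111 \
    --region ap-southeast-2 \
    --name "copied-from-dev-ami-11111111"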

This breaks the linkage between the two accounts while ensuring that the AMIs are identical, so it’s a step up from simple sharing, but there are limitations.

For example, everything works swimmingly until you try to copy a Windows AMI, then it fails miserably as a result of the way in which AWS licenses Windows. On the upside, the copy operation itself fails fast, rather than making a copy that then fails when you try to use it, so that’s nice.

So, two solutions, neither of which is ideal.

Surely we can do better?

Pack Of Wolves

For us, the answer is yes. We just run our Packer templates twice, once for each account.

This has actually been our solution for a while. We execute our Packer templates through TeamCity Build Configurations, so it is a relatively simple matter to just run the build twice, once for each account.

Well, “relatively simple” is probably understating it actually.

Running in dev is easy. Just click the button, wait and a wild AMI appears.

Prod is a different question.

When creating an AMI, Packer needs to know some things that are AWS account specific, like subnets, VPC, security groups and so on (mostly networking concerns). The source code contained parameters relevant for dev (hence the easy AMI creation for dev), but didn’t contain anything relevant for prod. Instead, whenever you ran a prod build in TeamCity, you had to supply a hashtable of parameter overrides, which would be used to alter the defaults and make it work in the prod AWS account.

As you can imagine, this is error prone.

Additionally, you actually have to remember to click the build button a second time and supply the overrides in order to make a prod image, or you’ll end up in a situation where you deployed your environment changes successfully through CI and Staging, but it all explodes (or even worse, subtly doesn’t do what it’s supposed to) when you deploy them into Production because there is no equivalent AMI. Then you have to go and make one using TeamCity, which is error prone, and if the source has diverged since you made the dev one…well, it’s just bad times all around.

Leader Of The Pack

With some minor improvements, we can avoid that whole problem though.

Basically, whenever we do a build in TeamCity, it creates the dev AMI first, and then automatically creates the prod one as well. If the dev fails, no prod. If the prod fails, dev is deleted.

To keep things in sync, both AMIs are tagged with a version attribute created during the build (just like software), so that we have a way to trace the AMI back to the git commit it was created from (just like software).
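One way to apply those tags is a simple CLI call after the build finishes (Packer’s builder can also set tags directly); the values here are illustrative:

# Tag the freshly built AMI with the build version and the git commit it came
# from, so it can always be traced back to source.
aws ec2 create-tags \
    --resources ami-11111111 \
    --tags Key=Version,Value=1.2.3 Key=Commit,Value=abc1234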

To accomplish this approach, we now have a relatively simple configuration hierarchy, with default parameters, dev specific parameters and prod specific parameters. When you start the AMI execution, you tell the function what environment you’re targeting (dev/prod) and it loads the defaults, then merges in the appropriate overrides.
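Our hierarchy is implemented in the PowerShell that wraps Packer, but the same layering can be expressed directly with Packer’s variable files, where later files override earlier ones (the file names are illustrative):

# Defaults first, then the environment specific overrides win.
packer build \
    -var-file=variables/defaults.json \
    -var-file=variables/prod.json \
    templates/octopus-tentacle.json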

This was a relatively easy way to deal with things that are different and non-sensitive (like VPC, subnets, security groups, etc).

What about credentials though?

Since an…incident…waaaay back in 2015, I’m pretty wary of credentials, particularly ones that give access to AWS.

So they can’t go in source control with the rest of the parameters.

That leaves TeamCity as the only sane place to put them, which it can easily do, assuming we don’t mind writing some logic to pick the appropriate credentials depending on our targeted destination.
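In concrete terms, the credentials end up as protected TeamCity parameters that get handed to Packer at execution time; conceptually it’s the equivalent of something like this (the variable and environment variable names are illustrative):

# Credentials come from protected TeamCity parameters, never from the repository.
packer build \
    -var "aws_access_key=${PROD_AWS_ACCESS_KEY}" \
    -var "aws_secret_key=${PROD_AWS_SECRET_KEY}" \
    -var-file=variables/prod.json \
    templates/octopus-tentacle.json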

We could technically have used some combination of IAM roles and AWS profiles as well, but we already have mechanisms and experience dealing with raw credential usage, so this was not the time to re-invent that particular wheel. That’s a fight for another day.

With account specific parameters and credentials taken care of, everything is good, and every build results in 2 AMIs, one for each account.

I’ve uploaded a copy of our Packer repository containing all of this logic (and a copy of the script we embed into TeamCity) to Github for reference purposes.

Conclusion

I’m much happier with the process I described above for creating our AMIs. If a build succeeds, it creates resources in both of our active AWS accounts, keeping them in sync and reducing the risk of subtle problems come deployment time. Not only that, but it also tags those resources with a version that can be traced back to a git commit, which is always more useful than you think.

There are still some rough edges around actually using the AMIs though. Most of our newer environments specify their AMIs directly via parameter files, so you have to remember to change the values for each environment target when you want to use a new AMI. This is dangerous, because if someone forgets it could lead to a disconnect between CI/Staging and Production, which was pretty much the entire problem we were trying to avoid in the first place.

Honestly, it’s going to be me that forgets.

Ah well, all in all, it’s a lot more consistent than it was before, which is pretty much the best I could hope for.


I use Packer on and off. Mostly I use it to make Amazon Machine Images (AMIs) for our environment management packages, specifically by creating Packer templates that operate on top of the Amazon supplied Windows Server images.

You should never use an Amazon supplied Windows Server AMI in your Auto Scaling Group Launch Configurations. These images are regularly retired, so if you’ve taken a dependency on one, there is a good chance it will disappear just when you need it most. Like when you need to auto-scale your API cluster because you’ve unknowingly burnt through all of the CPU credits you had on the machines slowly over the course of the last few months. What you should do is create an AMI of your own from the ones supplied by AWS so you can control its lifetime. Packer is a great tool for this.

A Packer template is basically a set of steps to execute on a virtual machine of some sort, where the core goal is to take some sort of baseline thing, apply a set of steps to it programmatically and end up with some sort of reusable thing out the other end. Like I mentioned earlier, we mostly deal in AWS AMIs, but it can do a bunch of other things as well (VMware, Docker, etc).

The main benefit of using a Packer template for this sort of thing (instead of just doing it all manually) is reproducibility. Specifically, if you built your custom image using the AWS AMI for Windows Server 2012 6 months ago, you can go and grab the latest one from yesterday (with all of the patches and security upgrades), execute your template on it and you’ll be in a great position to upgrade all of the existing usages of your old custom AMI with minimal effort.
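Finding that newer baseline is itself easy to script. Something along these lines returns the most recent Amazon supplied Windows Server 2012 R2 base image in a region (the name filter is indicative; adjust it to the image family you actually use):

# Look up the newest Amazon owned Windows Server 2012 R2 base AMI.
aws ec2 describe-images \
    --owners amazon \
    --filters "Name=name,Values=Windows_Server-2012-R2_RTM-English-64Bit-Base-*" \
    --query "sort_by(Images, &CreationDate)[-1].ImageId" \
    --output text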

When using Packer templates though, you need to be cognizant of how errors are dealt with. Specifically:

Step failures appear to be indicated entirely by the exit code of the tool used in the step.

I’ve been bitten by this on two separate occasions.

A Powerful Cry For Help

Packer has much better support for Windows than it once did, but even taking that into account, Powershell steps can still be a troublesome beast.

The main issue with the Powershell executable is that if an error or exception occurs and terminates the process (i.e. it’s a terminating error or you have ErrorActionPreference set to Stop) the Powershell process itself still exits with zero.

In a sane world, an exit code of zero indicates success, which is what Packer expects (and most other automation tools like TeamCity/Octopus Deploy).

If you don’t take this into account, your Powershell steps may fail but the Packer execution will still succeed, giving you an artefact that hasn’t been configured the way it should have been.

Packer is pretty configurable though, and is very clear about the command that it uses to execute your Powershell steps. The great thing is, it also enables you to override that command, so you can customise your Powershell steps to exit with a non-zero code if an error occurs without actually having to change every line in your step to take that sort of thing into account.

Take this template excerpt below, which uses Powershell to set the timezone of the machine and turn off negative DNS result caching.

{
    "type": "powershell",
    "inline": [
        "tzutil.exe /s \"AUS Eastern Standard Time_dstoff\"",
        "[Microsoft.Win32.Registry]::SetValue('HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Services\\Dnscache\\Parameters','NegativeCacheTime',0,[Microsoft.Win32.RegistryValueKind]::DWord)"
    ],
    "execute_command": "powershell -Command \"$ErrorActionPreference = 'Stop'; try { & '{{.Path}}' } catch { Write-Warning $_; exit 1; } \""
}

The “execute_command” is the customisation, providing error handling for exceptions that occur during the execution of the Powershell snippet. Packer will take each line in that inline array, copy it to a file on the machine being set up (using WinRM) and then execute it using the command you specify. The {{.Path}} syntax is the Packer variable substitution and specifically refers to the path on the virtual machine that Packer has copied the current command to. With this custom command in place, you have a much better chance of catching errors in your Powershell commands before they come back to bite you later on.

So Tasty

In a similar vein to the failures with Powershell above, be careful when doing package installs via yum on Linux.

The standard “yum install” command will not necessarily exit with a non-zero code when a package fails to install. Sometimes it will, but if a package couldn’t be found (maybe you misconfigured the repository or something) it still exits with a zero.

This can throw a pretty big spanner in the works when you’re expecting your AMI to have Elasticsearch on it (for example) and it just doesn’t because the package installation failed but Packer thought everything was fine.

Unfortunately, there is no easy way to get around this like there is for the Powershell example above, but you can mitigate it by just adding an extra step after your package install that validates the package was actually installed.

{
    "type" : "shell",
    "inline" : [
        "sudo yum remove java-1.7.0-openjdk -y",
        "sudo yum install java-1.8.0 -y",
        "sudo yum update -y",
        "sudo sh -c 'echo \"[logstash-5.x]\" >> /etc/yum.repos.d/logstash.repo'",
        "sudo sh -c 'echo \"name=Elastic repsitory for 5.x packages\" >> /etc/yum.repos.d/logstash.repo'",
        "sudo sh -c 'echo \"baseurl=https://artifacts.elastic.co/packages/5.x/yum\" >> /etc/yum.repos.d/logstash.repo'",
        "sudo sh -c 'echo \"gpgcheck=1\" >> /etc/yum.repos.d/logstash.repo'",
        "sudo sh -c 'echo \"gpgkey=http://packages.elastic.co/GPG-KEY-elasticsearch\" >> /etc/yum.repos.d/logstash.repo'",
        "sudo sh -c 'echo \"enabled=1\" >> /etc/yum.repos.d/logstash.repo'",
        "sudo rpm --import https://packages.elastic.co/GPG-KEY-elasticsearch",
        "sudo yum install logstash-5.2.2 -y",
        "sudo rpm --query logstash-5.2.2"
    ]
}

In the example above, the validation is the rpm --query command after the yum install. It will return a non-zero exit code (and thus fail the Packer execution) if the package with that version is not installed.

Conclusion

Packer is an incredibly powerful automation tool for dealing with a variety of virtual machine platforms and I highly recommend using it.

If you’re going to use it though, you need to understand what failure means in your specific case, and you need to take that into account when you decide how to signal to the Packer engine that something isn’t right.

For me, I prefer to treat every error as critical, because I prefer to deal with them at the time the AMI is created, rather than 6 months later when I try to use the AMI and can’t figure out why the Windows Firewall on an internal API instance is blocking requests from its ELB. Not that that has ever happened of course.

In order to accomplish this lofty goal of dealing with errors ASAP you need to understand how each one of your steps (and the applications and tools they use) communicate failure, and then make sure they all communicate that appropriately in a way Packer can understand.

Understanding how to deal with failure is useful outside Packer too.


So it turns out that Amazon deletes their pre-packaged Windows AMIs much quicker than they delete their Linux ones.

I’ve known about this for a little while, but it wasn’t until recently that it bit me.

One of the first environments that we created using the new strategy of “fully codified and automated environment setup”, recently hit a little snag. It was a production environment, primarily intended for beta testing, and the understanding was that it would be refreshed (using the latest version of the environment setup scripts) before it became available to the general public.

Late last week, one of our scripts that shuts down AWS EC2 instances based on time of day (to limit costs) accidentally terminated both of the EC2 instances that make up the API layer in that particular environment. Normally, this wouldn’t be an issue. The Auto Scaling Group would realise that it no longer had as many instances as it should and it would recreate them. The API instances are mostly stateless, so after a small amount of time, everything would be fine again.

As I’m sure you can imagine, this did not happen.

Amazon has since removed the AMI that the API instances were based off, so the ASG couldn’t spin up any new instances to replace the ones that were terminated.

The service was down, and it was going to stay that way for a while until we managed to fix the root issue.

The Immediate Response

The first thing we did was update the existing CloudFormation stack to use the latest version of the Windows AMI that we were previously using. This at least allowed the API instances to be created. However, they never finished their initialization.

In the time between when the environment was initially provisioned and the time when it was accidentally destroyed, we had made quite a few changes to the common scripts that back our environment provisioning process. One of those was to specify the version of the Octopus Tentacle that was going to be installed on the machine. We had previously run into an issue when Octopus 3 was released, where the latest tentacle no longer worked the same way, and with little time to investigate, we simply hardcoded the installed version to the one we had previously been using.

In order to fix this issue in the old environment we had to fix the script. Whenever an environment is provisioned, the scripts that it depends on are uploaded to S3, ready to be downloaded by EC2 instances and other resources that need access to them. Rather than manually dig into the dependencies, it was just easier to do the planned environment refresh.

This went…okay. Not great, not terrible, but the problem was fixed and everything came back online before anyone tried to use the service the next day.

Fixing It Properly

I had actually been aware of the AMI missing issue for some time and was already working on a longer term fix. In fact, I had scheduled an environment refresh of the production/beta environment for this service for a bit later in the same week the incident happened. It was just unfortunate that the incident forced my hand.

The root cause of the issue is that we did not control all of the elements in the environment, specifically the AMI used. Having external dependencies isn’t always an issue (for example we use Nuget extensively, but old Nuget packages are generally left alone), but Amazon makes no guarantees as to the availability of the AMIs they supply as time progresses. The solution is to create your own AMIs, so that Amazon can’t just delete them out from underneath you.

There are upsides and downsides to managing your own AMIs.

The primary upside is that you avoid issues like I’ve described above. Nobody external to your company is going to go and delete an AMI when you still have a dependency on it. Obviously if someone internal to your organization deletes the AMI you still have the same issue, but you at least have much more control over that situation.

Another upside is that you can include commonly installed software for your environments in your custom AMIs. For us, that would be things like an unregistered Octopus Tentacle or the .NET Framework and ASP.NET (not for all machines, but at least for all API instances).

The primary downside is that you can no longer easily take advantage of the fact that new Amazon AMIs are released on a regular basis, containing Windows Updates and other fixes (which are critically important to apply to machines that are exposed to the greater internet). You can still take advantage of those new AMIs, it’s just a little bit more difficult.

Another downside is that you now have to manage your own AMIs. This isn’t particularly difficult to be honest, but it is one more thing that you need to take care of, and I much prefer to simplify things rather than add more complexity.

The Mechanism

In an effort to avoid much of the manual work that can go into creating an AMI, I looked for a solution that was automated. I wanted to be able to run a process that simply spat out a customised AMI at the end, so that we could easily take advantage of new AMIs as Amazon released them, and then refresh our environments as required.

Initially I looked into automating the process myself, using the various APIs available for AWS. I’d already done some work previously in creating an EC2 instance for the purposes of updating an AMI, so I started with that.

Shortly after, someone informed me of the existence of Packer.

Packer is a wonderful little application that allows you to create AMIs and virtual machines for a number of virtualisation platforms. It even works on Windows, without having to install some arcane dependency chain through the command line. It’s just a collection of executables, which is nice.

Using Packer, I could put together the following configuration file that describes how I want my AMI to be structured.

{
    "variables" : {
        "aws_access_key" : "",
        "aws_secret_key" : "",
        "aws_region" : "",
        "source_ami" : "",
        "ami_name" : "",
        "user_data_file_path" : "",
        "octopus_api_key" : "",
        "octopus_server_url" : ""
    },
    "builders" : [{
            "type" : "amazon-ebs",
            "access_key" : "{{user `aws_access_key`}}",
            "secret_key" : "{{user `aws_secret_key`}}",
            "region" : "{{user `aws_region`}}",
            "source_ami" : "{{user `source_ami`}}",
            "instance_type" : "m3.large",
            "ami_name" : "{{user `ami_name`}}-{{timestamp}}",
            "user_data_file" : "{{user `user_data_file_path`}}",
            "vpc_id" : "vpc-a0a6aec9",
            "subnet_id" : "subnet-5908182d",
            "security_group_ids" : ["sg-0b65076e", "sg-4d188f28", "sg-faaf429f"],
            "ssh_keypair_name" : "YourKeyPair",
            "ssh_private_key_file":"C:\\temp\\YourKeyPair.pem",
            "communicator" : "winrm",
            "winrm_username" : "Administrator",
            "winrm_port" : 5985
        }
    ],
    "provisioners" : [
        {
            "type" : "powershell",
            "inline" : [
                "try",
                "{",
                    "$signalFilePath = \"C:\\signal\"",
                    "$content = Get-Content $signalFilePath",
                    "$maxWaitSeconds = 3000",
                    "$currentWaitSeconds = 0",
                    "$waitSeconds = 30",
                    "while ($content -eq \"1\" -and $currentWaitSeconds -lt $maxWaitSeconds) { Sleep -Seconds $waitSeconds; Write-Output \"Checking signal\"; $currentWaitSeconds += $waitSeconds; $content = Get-Content $signalFilePath; if ($content -eq \"-1\") { Write-Output \"User data signalled -1, indicating failure.\"; exit 1 } }",
                "}",
                "catch",
                "{",
                    "Write-Ouput \"An unexpected error occurred.\"",
                    "Write-Output $_",
                    "exit 1",
                "}"
            ]
        },
        {
            "type":"powershell",
            "scripts": [
                "@@ROOT_DIRECTORY_PATH\\scripts\\packer\\Ec2Config.ps1"
            ]
        }
    ]
}

The first part of the template describes various things about the EC2 instance that will be used to create the AMI, and the second part describes operations to perform on the instance in order to configure it the way you want it.

Note that the security groups used in the template above simply describe (in order) ability to connect via RDP and Windows Remote Management, Unfiltered Access Out and Octopus Tentacle Port In.

The configuration I’ve shared above is from our baseline Octopus Tentacle capable image. It comes with an Octopus Tentacle installed, but not configured (because it’s much more reliable to configure it at initialization time in CloudFormation).

The instance configuration is broken into three parts:

  1. Use UserData to run some scripts that configure the proxy (so the machine can get to the internet) and download some dependencies, plus some other miscellaneous configuration (WinRM, Firewall, etc).
  2. Use the Powershell script execution from Packer to run some scripts from the dependencies downloaded in 1.) to download and install an Octopus Tentacle.
  3. Some other miscellaneous configuration.

Nothing too fancy.

The Windows support for Packer is still a bit on the rough side, mostly due to the fact that doing this sort of thing with Windows machines is inherently more complicated than it is for Linux boxes. Luckily for me, I started using Packer after the Windows plugins were incorporated into the primary build of the application, so I didn’t have to do anything special to get Windows support.

Gotchas

It definitely wasn’t all smooth sailing though.

The documentation for the creation of Windows AMIs from Packer is a little sparse, so I had to do some trial and error in order to figure out how everything fit together.

The main mechanism for executing scripts remotely on Windows is WinRM (Windows Remote Management), which is basically Powershell remote execution. As such, you need to make sure that you allow access to the machine over port 5985 or nothing will work. It won’t fail straightaway either, it will time out, which can take upwards of 10 minutes.
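If you’re setting up the relevant security group with the CLI, the rule in question is roughly this (the group id and source range are placeholders):

# Allow WinRM (HTTP transport) in from wherever Packer is building.
aws ec2 authorize-security-group-ingress \
    --group-id sg-0b650000 \
    --protocol tcp \
    --port 5985 \
    --cidr 10.0.0.0/16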

You also need to make sure that you specify WinRM as the communication method. Most of the template examples on the web use SSH (because Linux), so it’s not immediately obvious that you can actually switch to a different communication method.

Finally, you need to include the Ec2Config and BundleConfig settings files, to tell the instance that it needs to run sysprep, otherwise it won’t regenerate a new Administrator password when you use the AMI to create a new EC2 instance (and thus you wouldn’t be able to retrieve the password from the AWS API). It will also have saved state on it from last time, so it’s definitely better to run sysprep for an AMI that will be used generically.

Conclusion

I’ve uploaded a sanitised copy of the repository containing my Packer templates and scripts to Github. If you look, you can see that I haven’t done anything particularly fancy. All I’ve done is wrap the execution of Packer in some Powershell scripts to make it easier to run. I have two different scripts to create the two AMIs that we need right now (Octopus capable + pre-installed IIS/.NET Framework), and when you run either of them with the appropriate parameters a brand new, timestamped AMI will be created in the appropriate AWS account.

Creating our own AMIs fixes the scaling issue that started this whole blog post. Since we control them, we can be sure that they won’t be deleted and our ability to scale via Auto Scaling Groups is maintained for the life of the environment. Another benefit of this approach is that the provisioning of an environment is now quicker, as some of the components (especially IIS/.NET Framework) are now pre-installed for the components that require them. Considering our environments can take upwards of 20 minutes to provision, every minute counts.

The whole process of creating these AMIs via Packer took me about a day or two, so it definitely wasn’t the most time-consuming task I’ve ever completed.

Incorporating the AMIs into our environment provisioning scripts was trivial, as they already searched for the appropriate AMI to use dynamically; I just had to change the search parameters.

In the end I’m fairly pleased with Packer and how easy it made the AMI creation process. If I had to use the AWS Powershell cmdlets (or the CLI app) directly for all of this, I probably would have wasted a lot of time.

And sanity.