
So it turns out that Amazon deletes its pre-packaged Windows AMIs much quicker than it deletes its Linux ones.

I’ve known about this for a little while, but it wasn’t until recently that it bit me.

One of the first environments that we created using the new strategy of “fully codified and automated environment setup” recently hit a little snag. It was a production environment, primarily intended for beta testing, and the understanding was that it would be refreshed (using the latest version of the environment setup scripts) before it became available to the general public.

Late last week, one of our scripts that shuts down AWS EC2 instances based on time of day (to limit costs) accidentally terminated both of the EC2 instances that make up the API layer in that particular environment. Normally, this wouldn’t be an issue. The Auto Scaling Group would realise that it no longer had as many instances as it should and it would recreate them. The API instances are mostly stateless, so after a small amount of time, everything would be fine again.

As I’m sure you can imagine, this did not happen.

Amazon had since removed the AMI that the API instances were based on, so the ASG couldn’t spin up any new instances to replace the ones that were terminated.

The service was down, and it was going to stay that way for a while until we managed to fix the root issue.

The Immediate Response

The first thing we did was update the existing CloudFormation stack to use the latest version of the Windows AMI that we were previously using. This at least allowed the API instances to be created. However, they never finished their initialization.

In the time between when the environment was initially provisioned and the time when it was accidentally destroyed, we had made quite a few changes to the common scripts that back our environment provisioning process. One of those was to specify the version of the Octopus Tentacle that was going to be installed on the machine. We had previously run into an issue when Octopus 3 was released where the latest Tentacle no longer worked the same way, and with little time to investigate it, we simply hardcoded the version that was installed to the one we had previously been using.

In order to fix this issue in the old environment we had to fix the script. Whenever an environment is provisioned, the scripts that it depends on are uploaded to S3, ready to be downloaded by EC2 instances and other resources that need access to them. Rather than manually dig into the dependencies, it was just easier to do the planned environment refresh.

This went…okay. Not great, not terrible, but the problem was fixed and everything came back online before anyone tried to use the service the next day.

Fixing It Properly

I had actually been aware of the missing AMI issue for some time and was already working on a longer term fix. In fact, I had scheduled an environment refresh of the production/beta environment for this service for a bit later in the same week the incident happened. It was just unfortunate that the incident forced my hand.

The root cause of the issue is that we did not control all of the elements in the environment, specifically the AMI used. Having external dependencies isn’t always an issue (for example we use NuGet extensively, but old NuGet packages are generally left alone), but Amazon makes no guarantees as to the availability of the AMIs they supply as time progresses. The solution is to create your own AMIs, so that Amazon can’t just delete them out from underneath you.

There are upsides and downsides to managing your own AMIs.

The primary upside is that you avoid issues like I’ve described above. Nobody external to your company is going to go and delete an AMI when you still have a dependency on it. Obviously if someone internal to your organization deletes the AMI you still have the same issue, but you at least have much more control over that situation.

Another upside is that you can include commonly installed software for your environments in your custom AMIs. For us, that would be things like an unregistered Octopus Tentacle or the .NET Framework and ASP.NET (not for all machines, but at least for all API instances).

The primary downside is that you can no longer easily take advantage of the fact that new Amazon AMIs are released on a regular basis, containing Windows Updates and other fixes (which are critically important to apply to machines that are exposed to the greater internet). You can still take advantage of those new AMIs, it’s just a little bit more difficult.

Another downside is that you now have to manage your own AMIs. This isn’t particularly difficult to be honest, but it is one more thing that you need to take care of, and I much prefer to simplify things rather than add more complexity.

The Mechanism

In an effort to avoid much of the manual work that can go into creating an AMI, I looked for a solution that was automated. I wanted to be able to run a process that simply spat out a customised AMI at the end, so that we could easily take advantage of new AMIs as Amazon released them, and then refresh our environments as required.

Initially I looked into automating the process myself, using the various APIs available for AWS. I’d already done some work previously in creating an EC2 instance for the purposes of updating an AMI, so I started with that.

Shortly after, someone informed me of the existence of Packer.

Packer is a wonderful little application that allows you to create AMIs and virtual machines for a number of virtualisation platforms. It even works on Windows, without having to install some arcane dependency chain through the command line. It’s just a collection of executables, which is nice.

Using Packer, I could put together the following configuration file that describes how I want my AMI to be structured.

{
    "variables" : {
        "aws_access_key" : "",
        "aws_secret_key" : "",
        "aws_region" : "",
        "source_ami" : "",
        "ami_name" : "",
        "user_data_file_path" : "",
        "octopus_api_key" : "",
        "octopus_server_url" : ""
    },
    "builders" : [{
            "type" : "amazon-ebs",
            "access_key" : "{{user `aws_access_key`}}",
            "secret_key" : "{{user `aws_secret_key`}}",
            "region" : "{{user `aws_region`}}",
            "source_ami" : "{{user `source_ami`}}",
            "instance_type" : "m3.large",
            "ami_name" : "{{user `ami_name`}}-{{timestamp}}",
            "user_data_file" : "{{user `user_data_file_path`}}",
            "vpc_id" : "vpc-a0a6aec9",
            "subnet_id" : "subnet-5908182d",
            "security_group_ids" : ["sg-0b65076e", "sg-4d188f28", "sg-faaf429f"],
            "ssh_keypair_name" : "YourKeyPair",
            "ssh_private_key_file":"C:\\temp\\YourKeyPair.pem",
            "communicator" : "winrm",
            "winrm_username" : "Administrator",
            "winrm_port" : 5985
        }
    ],
    "provisioners" : [
        {
            "type" : "powershell",
            "inline" : [
                "try",
                "{",
                    "$signalFilePath = \"C:\\signal\"",
                    "$content = Get-Content $signalFilePath",
                    "$maxWaitSeconds = 3000",
                    "$currentWaitSeconds = 0",
                    "$waitSeconds = 30",
                    "while ($content -eq \"1\" -and $currentWaitSeconds -lt $maxWaitSeconds) { Sleep -Seconds $waitSeconds; Write-Output \"Checking signal\"; $currentWaitSeconds += $waitSeconds; $content = Get-Content $signalFilePath; if ($content -eq \"-1\") { Write-Output \"User data signalled -1, indicating failure.\"; exit 1 } }",
                "}",
                "catch",
                "{",
                    "Write-Output \"An unexpected error occurred.\"",
                    "Write-Output $_",
                    "exit 1",
                "}"
            ]
        },
        {
            "type":"powershell",
            "scripts": [
                "@@ROOT_DIRECTORY_PATH\\scripts\\packer\\Ec2Config.ps1"
            ]
        }
    ]
}

The first part of the template describes various things about the EC2 instance that will be used to create the AMI, and the second part describes operations to perform on the instance in order to configure it the way you want it.
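As a rough illustration of how a wrapper script might feed such a template to Packer, here is a minimal Python sketch. It assumes the `@@ROOT_DIRECTORY_PATH` token is substituted by plain text replacement before the build (the template itself does not say how that happens), and it passes each user variable with Packer's `-var "name=value"` flag. The function names and the example values are hypothetical.

```python
def render_template(template_text: str, root_directory: str) -> str:
    """Replace the @@ROOT_DIRECTORY_PATH token with a JSON-safe path."""
    # Backslashes are doubled because the path lands inside a JSON string.
    escaped = root_directory.replace("\\", "\\\\")
    return template_text.replace("@@ROOT_DIRECTORY_PATH", escaped)


def build_packer_command(template_path: str, variables: dict) -> list:
    """Return the argument list for `packer build`, one -var flag per variable."""
    command = ["packer", "build"]
    for name, value in variables.items():
        command += ["-var", f"{name}={value}"]
    command.append(template_path)
    return command


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(build_packer_command("tentacle.json", {"aws_region": "us-east-1",
                                                 "ami_name": "octopus-tentacle"}))
```

The resulting list can be handed to `subprocess.run(command, check=True)`, which avoids quoting problems with values that contain spaces.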

Note that the security groups used in the template above simply describe (in order) ability to connect via RDP and Windows Remote Management, Unfiltered Access Out and Octopus Tentacle Port In.

The configuration I’ve shared above is from our baseline Octopus Tentacle capable image. It comes with an Octopus Tentacle installed, but not configured (because it’s much more reliable to configure it at initialization time in CloudFormation).

The instance configuration is broken into three parts:

  1. Use UserData to run some scripts that configure the proxy (so the machine can get to the internet) and download some dependencies, plus some other miscellaneous configuration (WinRM, Firewall, etc).
  2. Use the PowerShell script execution from Packer to run some scripts from the dependencies downloaded in step 1 to download and install an Octopus Tentacle.
  3. Some other miscellaneous configuration.

Nothing too fancy.
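The hand-off between UserData and the Packer provisioner relies on the signal file polled by the inline script in the template. The same logic can be sketched in Python, assuming (as the template implies) that the file holds "1" while UserData is still running, "-1" on failure, and anything else once it has finished. The function name and the injected `read_signal` and `sleep` callables are hypothetical, added so the loop can be exercised without a real file or real waiting.

```python
import time
from typing import Callable


def wait_for_user_data(read_signal: Callable[[], str],
                       max_wait_seconds: int = 3000,
                       wait_seconds: int = 30,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll the signal until UserData finishes; True on success, False otherwise."""
    waited = 0
    content = read_signal().strip()
    while content == "1" and waited < max_wait_seconds:
        sleep(wait_seconds)
        waited += wait_seconds
        content = read_signal().strip()
    # "-1" is an explicit failure; a value still at "1" means the wait timed out.
    return content not in ("1", "-1")
```

Unlike the inline script in the template, this sketch also reports failure when the wait times out or when the very first read is already "-1". That is a deliberate difference, not a description of what the template does.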

The Windows support for Packer is still a bit on the rough side, mostly due to the fact that doing this sort of thing with Windows machines is inherently more complicated than it is for Linux boxes. Luckily for me, I started using Packer after the Windows plugins were incorporated into the primary build of the application, so I didn’t have to do anything special to get Windows support.

Gotchas

It definitely wasn’t all smooth sailing though.

The documentation for the creation of Windows AMIs from Packer is a little sparse, so I had to do some trial and error in order to figure out how everything fit together.

The main mechanism for executing scripts remotely on Windows is WinRM (Windows Remote Management), which is basically PowerShell remote execution. As such, you need to make sure that you allow access to the machine over port 5985 or nothing will work. It won’t fail straight away either; it will time out, which can take upwards of 10 minutes.
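Because a blocked port only shows up as a long timeout, a quick reachability check can save some waiting. The following is a minimal Python sketch using only the standard library. The function name is hypothetical, and the check only proves that a TCP connection can be opened, not that WinRM itself is configured correctly.

```python
import socket


def can_reach_winrm(host: str, port: int = 5985, timeout_seconds: float = 5.0) -> bool:
    """Return True if a TCP connection to the WinRM port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Running this from the machine that executes Packer against the temporary instance gives an answer in seconds rather than minutes if a security group or firewall rule is missing.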

You also need to make sure that you specify WinRM as the communication method. Most of the template examples on the web use SSH (because Linux), so it’s not immediately obvious that you can actually switch to a different communication method.

Finally, you need to include the EC2 and Bundle config files, to tell the instance that it needs to run sysprep, otherwise it won’t regenerate a new Administrator password when you use the AMI to create a new EC2 instance (and thus you wouldn’t be able to retrieve the password from the AWS API). It will also have saved state on it from last time, so it’s definitely better to run sysprep for an AMI that will be used generically.
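To give a feel for what those config changes amount to, here is a hedged Python sketch that edits the two XML documents in memory. It assumes the usual EC2Config layout: `Plugin` elements with `Name` and `State` children in the settings file, and `Property` elements with `Name` and `Value` children in the bundle file. The `Ec2SetPassword` and `AutoSysprep` names come from that layout rather than from this post, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def enable_plugin(config_xml: str, plugin_name: str) -> str:
    """Set the State of the named EC2Config plugin to Enabled."""
    root = ET.fromstring(config_xml)
    for plugin in root.iter("Plugin"):
        if plugin.findtext("Name") == plugin_name:
            plugin.find("State").text = "Enabled"
    return ET.tostring(root, encoding="unicode")


def set_bundle_property(bundle_xml: str, property_name: str, value: str) -> str:
    """Set the Value of the named property in the bundle config."""
    root = ET.fromstring(bundle_xml)
    for prop in root.iter("Property"):
        if prop.findtext("Name") == property_name:
            prop.find("Value").text = value
    return ET.tostring(root, encoding="unicode")
```

With that layout, enabling `Ec2SetPassword` and setting `AutoSysprep` to `Yes` should cover the password regeneration and sysprep behaviour described above, though the actual script in the repository may go about it differently.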

Conclusion

I’ve uploaded a sanitised copy of the repository containing my Packer templates and scripts to GitHub. If you look, you can see that I haven’t done anything particularly fancy. All I’ve done is wrap the execution of Packer in some PowerShell scripts to make it easier to run. I have two different scripts to create the two AMIs that we need right now (Octopus capable + pre-installed IIS/.NET Framework), and when you run either of them with the appropriate parameters a brand new, timestamped AMI will be created in the appropriate AWS account.

Creating our own AMIs fixes the scaling issue that started this whole blog post. Since we control them, we can be sure that they won’t be deleted and our ability to scale via Auto Scaling Groups is maintained for the life of the environment. Another benefit of this approach is that the provisioning of an environment is now quicker, as some of the components (especially IIS/.NET Framework) are now pre-installed for the components that require them. Considering our environments can take upwards of 20 minutes to provision, every minute counts.

The whole process of creating these AMIs via Packer took me about a day or two, so it definitely wasn’t the most time consuming task I’ve ever completed.

Incorporating the AMIs into our environment provisioning scripts was trivial. They already searched for the appropriate AMI to use dynamically, so I just had to change the search parameters.
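A dynamic search like that can be as simple as "newest image whose name starts with a known prefix". The sketch below assumes the images have already been fetched as dictionaries with the `Name`, `CreationDate` and `ImageId` fields that the EC2 DescribeImages call returns. The function name and the prefix convention are hypothetical, though the prefix idea lines up with the `{{user `ami_name`}}-{{timestamp}}` naming in the template.

```python
from typing import Optional


def find_latest_ami(images: list, name_prefix: str) -> Optional[str]:
    """Return the ImageId of the newest image whose Name starts with name_prefix."""
    candidates = [image for image in images if image["Name"].startswith(name_prefix)]
    if not candidates:
        return None
    # CreationDate is an ISO 8601 UTC string, so string order matches time order.
    newest = max(candidates, key=lambda image: image["CreationDate"])
    return newest["ImageId"]
```

Restricting the search to images owned by your own account keeps a similarly named public image from ever being picked up.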

In the end I’m fairly pleased with Packer and how easy it made the AMI creation process. If I had to use the AWS PowerShell cmdlets (or the CLI app) directly for all of this, I probably would have wasted a lot of time.

And sanity.