
Choo choo goes the Elasticsearch train.

After the last few blog posts about rolling updates to our Elasticsearch environment, I thought I might as well continue with the Elasticsearch theme and do a quick post about reindexing.

Cartography

An index in Elasticsearch is kind of similar to a table in a relational database, but not really. In the same vein, index templates are kind of like schemas, and field mappings are kind of like columns.

But not really.

If you were using Elasticsearch purely for searching through some set of data, you might create an index and then add some mappings to it manually. For example, if you wanted to make all of the addresses in your system searchable, you might create fields for street, number, state, postcode and other common address elements, and maybe another field for the full address combined (like 111 None St, Brisbane, QLD, 4000 or something), to give you good coverage over the various sorts of searches that might be requested.

Then you jam a bunch of documents into that index, each one representing a different address that needs to be searchable.

Over time, you might discover that you could really use a field to represent the unit or apartment number, to help narrow down those annoying queries that involve a unit complex or something.

Well, with Elasticsearch you can add a new field to the index, in a similar way to how you add a new column to a table in a relational database.

Except again, not really.

You can definitely add a new field mapping, but it will only work for documents added to the index after you’ve applied the change. You can’t make that new mapping retroactive. That is to say, you can’t magically make it apply to every document that was already in the index when you created the new mapping.
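
For example, adding a hypothetical unit field to an existing index is a single call to the mapping API. The sketch below assumes Elasticsearch 5.x, that $elasticsearchUrl points at the cluster, and that the index, type and field names are purely illustrative:

# Hypothetical example for Elasticsearch 5.x; index, type and field names are illustrative
$body = '{ "properties": { "unit": { "type": "keyword" } } }';
Invoke-RestMethod -Method Put -Uri "$elasticsearchUrl/addresses/_mapping/address" -ContentType "application/json" -Body $body;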

When it comes to your stock standard ELK stack, your data indexes are generally time based and generated from an index template, which adds another layer of complexity. If you want to change the mappings, you typically just change the template and then wait for the current time period to roll over.

This leaves you in an unfortunate place for historical data, especially if you’ve been conservative with your field mappings.

Or does it?

Dexterous Storage

In both of the cases above (the manually created and maintained index, and the swarm of indexes created automatically via a template) it’s easy enough to add new field mappings and have them take effect moving forward.

The hard part is always the data that already exists.

That’s where reindexing comes in.

Conceptually, reindexing is taking all of the documents that are already in an index and moving them to another index, where the new index has all the field mappings you want in it. In moving the raw documents like that, Elasticsearch will redo everything that it needs to do in order to analyse and break down the data into the appropriate fields, exactly like the first time the document was seen.

For older versions of Elasticsearch, the actual document migration had to be done with an external tool or script, but the latest versions (we use 5.5.1) have a reindex endpoint on the API, which is a lot simpler to use.

curl -XPUT "{elasticsearch_url}/{new_index}?pretty" -H "accept: application/json"
curl -XPOST "{elasticsearch_url}/_reindex?pretty" -H "content-type: application/json" -H "accept: application/json" -d '{ "source": { "index": "{old_index}" }, "dest": { "index": "{new_index}", "version_type": "external" } }'

It doesn’t have to be a brand new index (there are options for how to handle documents that conflict if you’re reindexing into an index that already has data in it), but I imagine that a new index is the most common usage.

The useful side effect of this is that in requiring a different index, the old one is left intact and unchanged. It’s then completely up to you how to use both the new and old indexes, the most common operation being to delete the old one when you’re happy with how and where the new one is being used.

Seamless Replacement

We’ve changed our field mappings in our ELK stack over time, so while the most recent indexes do what we want them to, the old indexes have valuable historical data sitting around that we can’t really query or aggregate on.

The naive implementation is just to iterate through all the indexes we want to reindex (maybe using a regex or something to identify them), create a brand new index with a suffix (like logstash-2017.08.21-r) and then run the reindex operation via the Elasticsearch API, similar to the example above.

That leaves us with two indexes with the same data in them, which is less than ideal, especially considering that Kibana will quite happily query both indexes when you ask for data for a time period, so we can’t really leave the old one around or we’ll run into issues with duplicate data.

So we probably want to delete the old index once we’re finished reindexing into the new one.

But how do we know that we’re finished?

The default mode for the reindex operation is to wait for completion before returning a response from the API, which is handy, because that is exactly what we want.

The only other thing we needed to consider was that after a reindex, all of the indexes would have a suffix of -r, and our Curator configuration wouldn’t pick them up without some changes. In the interest of minimising the amount of things we had to touch just to reindex, we decided to do the reindex again from the temporary index back into an index named the same as the one we started with, deleting the temporary index once that second operation was done.

When you do things right, people won’t be sure you’ve done anything at all.

Danger Will Robinson

Of course, the first time I ran the script (iterate through indexes, reindex to temporary index, delete source, reindex back, delete temp) on a real Elasticsearch cluster I lost a bunch of documents.

Good thing we have a staging environment specifically for this sort of thing.

I’m still not entirely sure what happened, but I think it had something to do with the eventually consistent nature of Elasticsearch, the fact that we connect to the data nodes via an AWS ELB, and the reindex being “complete” according to the API but not necessarily synced across all nodes, so the deletion of the source index threw a massive spanner in the works.

Long story short, I switched the script to start the reindex asynchronously and then poll the destination index until it returned the same number of documents as the source. As a bonus, this fixed another problem I had with the HTTP request for the reindex timing out on large indexes, which was nice.
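
The core of that change looks something like the sketch below, assuming $elasticsearchUrl, $source and $destination are already set appropriately:

# Minimal sketch: start the reindex without waiting, then poll document counts until they match
$body = "{ ""source"": { ""index"": ""$source"" }, ""dest"": { ""index"": ""$destination"", ""version_type"": ""external"" } }";
Invoke-RestMethod -Method Post -Uri "$elasticsearchUrl/_reindex?wait_for_completion=false" -ContentType "application/json" -Body $body | Out-Null;

$expected = (Invoke-RestMethod -Method Get -Uri "$elasticsearchUrl/$source/_count").count;
do
{
    Start-Sleep -Seconds 30;
    $actual = (Invoke-RestMethod -Method Get -Uri "$elasticsearchUrl/$destination/_count").count;
    Write-Verbose "Destination index [$destination] contains [$actual] of [$expected] documents";
}
while ($actual -lt $expected)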

The only downside of this is that we can’t reindex an index that is currently being written to (because the document counts will definitely change over the period of time the reindex occurs), but I didn’t want to do that anyway.

Conclusion

I’ve uploaded the full script to Github. Looking at it now, it’s a bit more complicated than you would expect, even taking into account the content of this post, but as far as I can tell, it’s pretty robust.

All told, I probably spent a bit longer on this than I should have, especially taking into account that it’s not something we do every day.

The flip side of that is that it’s extremely useful to know that old data is not just useless when we update our field mappings, which is nice.


With all of the general context and solution outlining done for now, its time to delve into some of the details. Specifically, the build/test/deploy pipeline for the log stack environments.

Unfortunately, we use the term environment to describe two things. The first is an Octopus environment, which is basically a grouping construct inside Octopus Deploy, like CI or prod-green. The second is a set of infrastructure intended for a specific purpose, like an Auto Scaling Group and Load Balancer intended to host an API. In the case of the log stack, we have distinct environments for the infrastructure for each layer, like the Broker and the Indexer.

Our environments are all conceptually similar: there is a Git repository that contains everything necessary to create or update the infrastructure (CloudFormation templates, Powershell scripts, etc), along with the logic for what it means to build and validate a Nuget package that can be used to manage the environment. The repository is hooked up to a Build Configuration in TeamCity which runs the build script, and the resulting versioned package is uploaded to our Nuget server. The package is then used in TeamCity via other Build Configurations to allow us to Create, Delete, Migrate and otherwise interact with the environment in question.

The creation of this process has happened in bits and pieces over the last few years, most of which I’ve written about on this blog.

It’s a decent system, and I’m proud of how far we’ve come and how much automation is now in place, but it’s certainly not without its flaws.

Bestial Rage

The biggest problem with the current process is that while the environment is fully encapsulated as code inside a validated and versioned Nuget package, actually using that package to create or delete an environment is not as simple as it could be. As I mentioned above, we have a set of TeamCity Build Configurations for each environment that allow for the major operations like Create and Delete. If you’ve made changes to an environment and want to deploy them, you have to decide what sort of action is necessary (i.e. “it’s the first time, Create” or “it already exists, Migrate”) and “run” the build, which will download the package and run the appropriate script.

This is where it gets a bit onerous, especially for production. If you want to change any of the environment parameters from the default values the package was built with, you need to provide a set of parameter overrides when you run the build. For production, this means you often end up overriding everything (because production is a separate AWS account), which can be upwards of 10 different parameters, all of which are only visible if you go and look at the source CloudFormation template. You have to do this every time you want to execute that operation (although you can copy the parameters from previous runs, which acts as a small shortcut).

The issue with this is that it means production deployments become vulnerable to human error, which is one of the things we’re trying to avoid by automating in the first place!

Another issue is that we lack a true “Update” operation. We only have Create, Delete, Clone and Migrate.

This is entirely my fault, because when I initially put the system together I had a bad experience with the CloudFormation Update command where I accidentally wiped out an S3 bucket containing customer data. As is often the case, that fear then led to an alternate (worse) solution involving cloning, checking, deleting, cloning, checking and deleting (in that order). This was safer, but incredibly slow and prone to failure.

The existence of these two problems (hard to deploy, slow failure-prone deployments) is reason enough for me to consider exploring alternative approaches for the log stack infrastructure.

Fantastic Beasts And Where To Find Them

The existing process does do a number of things well though, and has:

  • A Nuget package that contains everything necessary to interact with the environment.
  • Environment versioning, because that’s always important for traceability.
  • Environment validation via tests executed as part of the build (when possible).

Keeping those three things in mind, and combining them with the desire to ease the actual environment deployment, an improved approach looks a lot like our typical software development/deployment flow.

  1. Changes to environment are checked in
  2. Changes are picked up by TeamCity, and a build is started
  3. Build is tested (i.e. a test environment is created, validated and destroyed)
  4. Versioned Nuget package is created
  5. Package is uploaded to Octopus
  6. Octopus Release is created
  7. Octopus Release is deployed to CI
  8. Secondary validation (i.e. test CI environment to make sure it does what it’s supposed to do after deployment)
  9. [Optional] Propagation of release to Staging

In comparison to our current process, the main difference is the deployment. Prior to this, we were treating our environments as libraries (i.e. they were built, tested, packaged and uploaded to MyGet to be used by something else). Now we’re treating them as self contained deployable components, responsible for knowing how to deploy themselves.

With the approach settled, all that’s left is to come up with an actual deployment process for an environment.

Beast Mastery

There are two main cases we need to take care of when deploying a CloudFormation stack to an Octopus environment.

The first case is what to do when the CloudFormation stack doesn’t exist.

This is the easy case, all we need to do is execute New-CFNStack with the appropriate parameters and then wait for the stack to finish.

The second case is what we should do when the CloudFormation stack already exists, which is the case that is not particularly well covered by our current environment management process.

Luckily, CloudFormation makes this relatively easy with the Update-CFNStack command. Updates are dangerous (as I mentioned above), but if you’re careful with resources that contain state, they are pretty efficient. The implementation of the update is quite smart as well, and will only update the things that have changed in the template (i.e. if you’ve only changed the Load Balancer, it won’t recreate all of your EC2 instances).

The completed deployment script is shown in full below.

[CmdletBinding()]
param
(

)

$here = Split-Path $script:MyInvocation.MyCommand.Path;
$rootDirectory = Get-Item ($here);
$rootDirectoryPath = $rootDirectory.FullName;

$ErrorActionPreference = "Stop";

$component = "unique-stack-name";

if ($OctopusParameters -ne $null)
{
    $parameters = ConvertFrom-StringData ([System.IO.File]::ReadAllText("$here/cloudformation.parameters.octopus"));

    $awsKey = $OctopusParameters["AWS.Deployment.Key"];
    $awsSecret = $OctopusParameters["AWS.Deployment.Secret"];
    $awsRegion = $OctopusParameters["AWS.Deployment.Region"];
}
else 
{
    $parameters = ConvertFrom-StringData ([System.IO.File]::ReadAllText("$here/cloudformation.parameters.local"));
    
    $path = "C:\creds\credentials.json";
    Write-Verbose "Attempting to load credentials (AWS Key, Secret, Region, Octopus Url, Key) from local, non-repository stored file at [$path]. This is done this way to allow for a nice development experience in vscode"
    $creds = ConvertFrom-Json ([System.IO.File]::ReadAllText($path));
    $awsKey = $creds.aws."aws-account".key;
    $awsSecret = $creds.aws."aws-account".secret;
    $awsRegion = $creds.aws."aws-account".region;

    $parameters["OctopusAPIKey"] = $creds.octopus.key;
    $parameters["OctopusServerURL"] = $creds.octopus.url;
}

$parameters["Component"] = $component;

$environment = $parameters["OctopusEnvironment"];

. "$here/scripts/common/Functions-Aws.ps1";
. "$here/scripts/common/Functions-Aws-CloudFormation.ps1";

Ensure-AwsPowershellFunctionsAvailable

$tags = @{
    "environment"=$environment;
    "environment:version"=$parameters["EnvironmentVersion"];
    "application"=$component;
    "function"="logging";
    "team"=$parameters["team"];
}

$stackName = "$environment-$component";

$exists = Test-CloudFormationStack -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion -StackName $stackName

$cfParams = ConvertTo-CloudFormationParameters $parameters;
$cfTags = ConvertTo-CloudFormationTags $tags;
$args = @{
    "-StackName"=$stackName;
    "-TemplateBody"=[System.IO.File]::ReadAllText("$here\cloudformation.template");
    "-Parameters"=$cfParams;
    "-Tags"=$cfTags;
    "-AccessKey"=$awsKey;
    "-SecretKey"=$awsSecret;
    "-Region"=$awsRegion;
    "-Capabilities"="CAPABILITY_IAM";
};

if ($exists)
{
    Write-Verbose "The stack [$stackName] exists, so I'm going to update it. Its better this way"
    $stackId = Update-CFNStack @args;

    $desiredStatus = [Amazon.CloudFormation.StackStatus]::UPDATE_COMPLETE;
    $failingStatuses = @(
        [Amazon.CloudFormation.StackStatus]::UPDATE_FAILED,
        [Amazon.CloudFormation.StackStatus]::UPDATE_ROLLBACK_IN_PROGRESS,
        [Amazon.CloudFormation.StackStatus]::UPDATE_ROLLBACK_COMPLETE
    );
    Wait-CloudFormationStack -StackName $stackName -DesiredStatus  $desiredStatus -FailingStates $failingStatuses -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion

    Write-Verbose "Stack [$stackName] Updated";
}
else 
{
    Write-Verbose "The stack [$stackName] does not exist, so I'm going to create it. Just watch me"
    $args.Add("-DisableRollback", $true);
    $stackId = New-CFNStack @args;

    $desiredStatus = [Amazon.CloudFormation.StackStatus]::CREATE_COMPLETE;
    $failingStatuses = @(
        [Amazon.CloudFormation.StackStatus]::CREATE_FAILED,
        [Amazon.CloudFormation.StackStatus]::ROLLBACK_IN_PROGRESS,
        [Amazon.CloudFormation.StackStatus]::ROLLBACK_COMPLETE
    );
    Wait-CloudFormationStack -StackName $stackName -DesiredStatus  $desiredStatus -FailingStates $failingStatuses -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion

    Write-Verbose "Stack [$stackName] Created";
}

Other than the Create/Update logic that I’ve already talked about, the only other interesting thing in the deployment script is the way that it deals with parameters.

Basically, if the script detects that it’s being run from inside Octopus Deploy (via the presence of an $OctopusParameters variable), it will load all of its parameters (as a hashtable) from a particular local file. This file leverages the Octopus variable substitution feature, so that when we deploy the infrastructure to the various environments, it gets the appropriate values (like a different VPC, because prod is a separate AWS account to CI). When it’s not running in Octopus, it just uses a different file, structured very similarly, with test/scratch values in it.
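
As a hypothetical example, cloudformation.parameters.octopus might look something like the snippet below. ConvertFrom-StringData turns the key=value lines into the hashtable used above, and Octopus substitutes the #{...} expressions at deployment time. The keys shown are the ones referenced by the script; the variable bindings on the right are illustrative, not the real ones.

# Illustrative subset; the real file contains more parameters
OctopusEnvironment=#{Octopus.Environment.Name}
EnvironmentVersion=#{Octopus.Release.Number}
OctopusServerURL=#{OctopusServerUrl}
OctopusAPIKey=#{OctopusApiKey}
team=#{Team}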

With the deployment script in place, we plug the whole thing into our existing “deployable” component structure and we have automatic deployment of tested, versioned infrastructure via Octopus Deploy.

Conclusion

Of course, being a first version, the deployment logic that I’ve described above is not perfect. For example, there is no support for deploying to an environment where the stack is in error (failing stacks can’t be updated, but they still exist, so you have to delete them and start again) and there is little to no feedback available if a stack creation/update fails for some reason.

Additionally, the code could benefit from being extracted to a library for reuse.

All in all, the deployment process I just described is a lot simpler than the one I described at the start of this post, and it’s managed by Octopus, which makes it consistent with the way that we do everything else, which is nice.

With a little bit more polish, and some pretty strict usage of the CloudFormation features that stop you accidentally deleting databases full of valuable data, I think it will be a good replacement for what we do now.


As part of the work I did recently to optimize our usage of RavenDB, I had to spin up a brand new environment for the service in question.

Thanks to our commitment to infrastructure as code (and our versioned environment creation scripts), this is normally a trivial process. Use TeamCity to trigger the execution of the latest environment package and then play the waiting game. To setup my test environment, all I had to do was supply the ID of the EC2 volume containing a snapshot of production data as an extra parameter.

Unfortunately, it failed.

Sometimes that happens (an unpleasant side effect of the nature of AWS), so I started it up again in the hopes that it would succeed this time.

It failed again.

Repeated failures are highly unusual, so I started digging. It looked as though the problem was that the environments were simply taking too long to “finish” provisioning. We allocate a fixed amount of time for our environments to complete, which includes the time required to create the AWS resources, deploy the necessary software and then wait until the resulting service can actually be hit from the expected URL.

Part of the environment setup is the execution of the CloudFormation template (and all of the AWS resources that go with that). The other part is deploying the software that needs to be present on those resources, which is where we use Octopus Deploy. At startup, each instance registers itself with the Octopus server, applying tags as appropriate and then a series of Octopus projects (i.e. the software) is deployed to the instance.

Everything seemed to be working fine from an AWS point of view, but the initialisation of the machines was taking too long. They were only getting through a few of their deployments before running out of time. It wasn’t a tight time limit either; we’d allocated a very generous 50 minutes for everything to be up and running, so it was disappointing to see that it was taking so long.

The cfn-init logs were very useful here, because they show start/end timestamps for each step specified. Each of the deployments was taking somewhere between 12-15 minutes, which is way too slow, because there are at least 8 of them on the most popular machine.

This is one of those cases where I’m glad I put the effort in to remotely extract log files as part of the reporting that happens whenever an environment fails. It’s especially valuable when the environment creation is running in a completely automated fashion, like it does when executed through TeamCity, but it’s still very useful when debugging an environment locally. The last thing I want to do is manually remote into each machine in the environment and grab its log files in order to try and determine where a failure is occurring when I can just have a computer do it for me.

Speed Deprived Octopus

When using Octopus Deploy (at least the version we have, which is 2.6.4), you can choose to either deploy a specific version of a project or deploy the “latest” version, leaving the decision of what exactly is the latest version up to Octopus. I was somewhat uncomfortable with the concept of using “latest” for anything other than a brand new environment that had never existed before, because initial testing showed that “latest” really did mean the most recent release.

The last thing I wanted to do was to cede control over exactly what was getting deployed to a new production API instance when I had to scale up due to increased usage.

In order to alleviate this, we wrote a fairly simple Powershell script that uses the Octopus .NET Client library to find what the best version to deploy is, making sure to only use those releases that have actually been deployed to the environment, and discounting deployments that failed.

$deployments = $repository.Deployments.FindMany({ param($x) $x.EnvironmentId -eq $env.Id -and $x.ProjectId -eq $project.Id })

if ($deployments | Any)
{
    Write-Verbose "Deployments of project [$projectName] to environment [$environmentName] were found. Selecting the most recent successful deployment."
    $latestDeployment = $deployments |
        Sort -Descending -Property Created |
        First -Predicate { $repository.Tasks.Get($_.TaskId).FinishedSuccessfully -eq $true } -Default "latest"

    $release = $repository.Releases.Get($latestDeployment.ReleaseId)
}
else
{
    Write-Verbose "No deployments of project [$projectName] to environment [$environmentName] were found."
}

$version = if ($release -eq $null) { "latest" } else { $release.Version }

Write-Verbose "The version of the recent successful deployment of project [$projectName] to environment [$environmentName] was [$version]. 'latest' indicates no successful deployments, and will mean the very latest release version is used."

return $version

We use this script during the deployment of the Octopus projects that I mentioned above.

When it was first written, it was fast.

That was almost two years ago though, and now it was super slow, taking upwards of 10 minutes just to find what the most recent release was. As you can imagine, this was problematic for our environment provisioning, because each one had multiple projects being deployed during initialization and all of those delays added up very very quickly.

Age Often Leads to Slowdown

The slowdown didn’t just happen all of a sudden though. We had noticed the environment provisioning getting somewhat slower from time to time, but we’d assumed it was a result of there simply being more for our Octopus server to do (more projects, more deployments, more things happening at the same time). It wasn’t until it reached the critical point where I couldn’t even spin up one of our environments for testing purposes that it needed to be addressed.

Looking at the statistics of the Octopus server during an environment provisioning, I noticed that it was using 100% CPU for the entire time that it was deploying projects, up until the point where the environment timed out.

Our Octopus server isn’t exactly a powerhouse, so I thought maybe it had just reached a point where it simply did not have enough power to do what we wanted it to do. This intuition was compounded by the knowledge that Octopus 2.6 uses RavenDB as its backend, and they’ve moved to SQL Server for Octopus 3, citing performance problems.

I dialed up the power (good old AWS), and then tried to provision the environment again.

Still timed out, even though the CPU was no longer maxing out on the Octopus server.

My next suspicion was that I had made an error of some sort in the script, but there was nothing obviously slow about it. It makes a bounded query to Octopus via the client library (.FindMany with a Func specifying the environment and project, to narrow down the result set) and then iterates through the resulting deployments (by date, so working backwards in time) until it finds an acceptable deployment for identifying the best release.

Debugging the automated tests around the script, the slowest part seemed to be when it was getting the deployments from Octopus. Using Fiddler, the reason for this became obvious.

The call to .FindMany with a Func was not optimizing the resulting API calls using the information provided in the Func. The script had been getting slower and slower over time because every single new deployment would have to retrieve every single previous deployment from the server (including those unrelated to the current environment and project). This meant hundreds of queries to the Octopus API, just to get what version needed to be deployed.

The script was initially fast because I wrote it when we only had a few hundred deployments recorded in Octopus. We have thousands now and it grows every day.

What I thought was a bounded query, wasn’t bounded in any way.

But Age Also Means Wisdom

On the upside, the problem wasn’t particularly hard to fix.

The Octopus HTTP API is pretty good. In fact, it actually has specific query parameters for specifying environment and project when you’re using the deployments endpoint (which is another reason why I assumed the client would be optimised to use them based on the incoming parameters). All I had to do was rewrite the “last release to environment” script to use the API directly instead of relying on the client.

$environmentId = $env.Id;

$headers = @{"X-Octopus-ApiKey" = $octopusApiKey}
$uri = "$octopusServerUrl/api/deployments?environments=$environmentId&projects=$projectId"
while ($true) {
    Write-Verbose "Getting the next set of deployments from Octopus using the URI [$uri]"
    $deployments = Invoke-RestMethod -Uri $uri -Headers $headers -Method Get -Verbose:$false
    if (-not ($deployments.Items | Any))
    {
        $version = "latest";
        Write-Verbose "No deployments could not found for project [$projectName ($projectId)] in environment [$environmentName ($environmentId)]. Returning the value [$version], which will indicate that the most recent release is to be used"
        return $version
    }

    Write-Verbose "Finding the first successful deployment in the set of deployments returned. There were [$($deployments.TotalResults)] total deployments, and we're currently searching through a set of [$($deployments.Items.Length)]"
    $successful = $deployments.Items | First -Predicate { 
        $uri = "$octopusServerUrl$($_.Links.Task)"; 
        Write-Verbose "Getting the task the deployment [$($_.Id)] was linked to using URI [$uri] to determine whether or not the deployment was successful"
        $task = Invoke-RestMethod -Uri $uri -Headers $headers -Method Get -Verbose:$false; 
        $task.FinishedSuccessfully; 
    } -Default "NONE"
    
    if ($successful -ne "NONE")
    {
        Write-Verbose "Finding the release associated with the successful deployment [$($successful.Id)]"
        $release = Invoke-RestMethod "$octopusServerUrl$($successful.Links.Release)" -Headers $headers -Method Get -Verbose:$false
        $version = $release.Version
        Write-Verbose "A successful deployment of project [$projectName ($projectId)] was found in environment [$environmentName ($environmentId)]. Returning the version of the release attached to that deployment, which was [$version]"
        return $version
    }

    Write-Verbose "Finished searching through the current page of deployments for project [$projectName ($projectId)] in environment [$environmentName ($environmentId)] without finding a successful one. Trying the next page"
    $next = $deployments.Links."Page.Next"
    if ($next -eq $null)
    {
        Write-Verbose "There are no more deployments available for project [$projectName ($projectId)] in environment [$environmentName ($environmentId)]. We're just going to return the string [latest] and hope for the best"
        return "latest"
    }
    else
    {
        $uri = "$octopusServerUrl$next"
    }
}

It’s a little more code than the old one, but it’s much, much faster. 200 seconds (for our test project) down to something like 6 seconds, almost all of which was the result of optimizing the number of deployments that need to be interrogated to find the best release. The full test suite for our deployment scripts (which actually provisions test environments) halved its execution time too (from 60 minutes to 30), which was nice, because the tests were starting to get a bit too slow for my liking.

Summary

This was one of those cases where an issue had been present for a while, but we’d chalked it up to growth issues rather than an actual inefficiency in the code.

The most annoying part is that I did think of this specific problem when I originally wrote the script, which is why I included the winnowing parameters for environment and project. It’s pretty disappointing to see that the client did not optimize the resulting API calls by making use of those parameters, although I have a suspicion about that.

I think that if I were to run the same sort of code in .NET directly (rather than through Powershell) that the resulting call to .FindMany would actually be optimized in the way that I thought it was. I have absolutely no idea how Powershell handles the equivalent of Lambda expressions, so if it was just passing a pure Func through to the function instead of an Expression<Func>, there would be no obvious way for the client library to determine if it could use the optimized query string parameters on the API.

Still, I’m pretty happy with the end result, especially considering how easy it was to adapt the existing code to just use the HTTP API directly.

It helps that it’s a damn good API.


We use TeamCity as our Continuous Integration tool.

Unfortunately, our setup hasn’t been given as much love as it should have. It’s not bad (or broken or any of those things), it’s just not quite as well set up as it could be, which increases the risk that it will break and makes it harder to manage (and change and keep up to date) than it could be.

As with everything that has got a bit crusty around the edges over time, the only real way to attack it while still delivering value to the business is by doing it slowly, piece by piece, over what seems like an inordinate amount of time. The key is to minimise disruption, while still making progress on the bigger picture.

Our setup is fairly simple. A centralised TeamCity server and at least 3 Build Agents capable of building all of our software components. We host all of this in AWS, but unfortunately, it was built before we started consistently using CloudFormation and infrastructure as code, so it was all manually setup.

Recently, we started using a few EC2 spot instances to provide extra build capabilities without dramatically increasing our costs. This worked fairly well, up until the spot price spiked and we lost the build agents. We used persistent requests, so they came back, but they needed to be configured again before they would hook up to TeamCity because of the manual way in which they were provisioned.

There’s been a lot of instability in the spot price recently, so we were dealing with this manual setup on a daily basis (sometimes multiple times per day), which got old really quickly.

You know what they say.

“If you want something painful automated, make a developer responsible for doing it manually and then just wait.”

Its Automatic

The goal was simple.

We needed to configure the spot Build Agents to automatically bootstrap themselves on startup.

On the upside, the entire process wasn’t completely manual. We were at least spinning up the instances from a pre-built AMI that already had all of the dependencies for our older, crappier components as well as an unconfigured TeamCity Build Agent on it, so we didn’t have to automate absolutely everything.

The bootstrapping would need to tag the instance appropriately (because for some reason spot instances don’t inherit the tags of the spot request), configure the Build Agent and then start it up so it would connect to TeamCity. Ideally, it would also register and authorize the Build Agent, but if we used controlled authorization tokens we could avoid this step by just authorizing the agents once. Then they would automatically reappear each time the spot instance came back.

So tagging, configuring, service start, using Powershell, with the script baked into the AMI. During provisioning we would supply some UserData that would execute the script.

Not too complicated.
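
The UserData itself ends up being tiny; a hypothetical version (the script path is illustrative) just calls the script that was baked into the AMI:

<powershell>
# Hypothetical path; the bootstrapping script is baked into the AMI
& "C:\scripts\Invoke-BuildAgentBootstrap.ps1" -Verbose
</powershell>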

Like Graffiti, Except Useful

Tagging an EC2 instance is pretty easy thanks to the multitude of toolsets that Amazon provides. Our tool of choice is the Powershell cmdlets, so the actual tagging was a simple task.

Getting permission to do the tagging was another story.

We’re pretty careful with our credentials these days, for some reason, so we wanted to make sure that we weren’t supplying and persisting any credentials in the bootstrapping script. That means IAM.

One of the key features of the Powershell cmdlets (and most of the Amazon supplied tools) is that they are supposed to automatically grab credentials if they are being run on an EC2 instance that currently has an instance profile associated with it.

For some reason, this would just not work. We tried a number of different things to get this to work (including updating the version of the Powershell cmdlets we were using), but in the end we had to resort to calling the instance metadata service directly to grab some credentials.

Obviously the instance profile that we applied to the instance represented a role with a policy that only had permissions to alter tags. Minimal permission set and all that.
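
Putting those pieces together, the tagging step of the bootstrapping script looked something like the sketch below. The tag key/value and the use of the monolithic AWSPowerShell module are illustrative; the metadata service calls are the workaround mentioned above.

Import-Module AWSPowerShell;

# Everything needed (instance id, region, temporary credentials) comes from the instance metadata service
$metadata = "http://169.254.169.254/latest/meta-data";
$instanceId = Invoke-RestMethod -Uri "$metadata/instance-id";
$availabilityZone = Invoke-RestMethod -Uri "$metadata/placement/availability-zone";
$region = $availabilityZone.Substring(0, $availabilityZone.Length - 1);
$roleName = Invoke-RestMethod -Uri "$metadata/iam/security-credentials/";
$credentials = (Invoke-WebRequest -Uri "$metadata/iam/security-credentials/$roleName" -UseBasicParsing).Content | ConvertFrom-Json;

# Illustrative tag key/value
$tag = New-Object Amazon.EC2.Model.Tag;
$tag.Key = "role";
$tag.Value = "teamcity-build-agent";

New-EC2Tag -Resource $instanceId -Tag $tag -AccessKey $credentials.AccessKeyId -SecretKey $credentials.SecretAccessKey -SessionToken $credentials.Token -Region $region;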

Service With a Smile

Starting/stopping services with Powershell is trivial, and for once, something weird didn’t happen causing us to burn days while we tracked down some obscure bug that only manifests in our particular use case.

I was as surprised as you are.

Configuration Is Key

The final step should have been relatively simple.

Take a file with some replacement tokens, read it, replace the tokens with appropriate values, write it back.

Except it just wouldn’t work.

After editing the file with Powershell (a relatively simple Get-Content | ForEach-Object { $_ -replace {token}, {value} } | Out-File) the TeamCity Build Agent would refuse to load.

Checking the log file, its biggest (and only) complaint was that the serverUrl (which is the location of the TeamCity server) was not set.

This was incredibly confusing, because the file clearly had a serverUrl value in it.

I tried a number of different things to determine the root cause of the issue, including:

  • Permissions? Was the file accidentally locked by TeamCity such that the Build Agent service couldn’t access it?
  • Did the rewrite of the tokens somehow change the format of the file (extra spaces, CR LF when it was just expecting LF)?
  • Was the serverUrl actually configured, but inaccessible for some reason (machine proxy settings for example) and the problem was actually occurring not when the file was rewritten but when the script was setting up the AWS Powershell cmdlets proxy settings?

Long story short, it turns out that Powershell doesn’t remember file encoding when using the Out-File functionality in the way we were using it. It was changing the encoding of the file from ASCII to Unicode Little Endian (adding a Byte Order Mark in the process), and the Build Agent did not like that (it didn’t throw an encoding error either, which is super annoying, but whatever).

The error message was both a red herring (yes, it was configured) and also truthful (the Build Agent was incapable of reading the serverUrl).
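
The fix was simply to be explicit about the encoding when writing the file back out. A minimal sketch, assuming the default Build Agent install path and an illustrative token name:

# Token name and config path are illustrative; the important part is the explicit -Encoding
$configPath = "C:\BuildAgent\conf\buildAgent.properties";
(Get-Content $configPath) |
    ForEach-Object { $_ -replace "@@SERVER_URL@@", "http://{teamcity_url}" } |
    Out-File $configPath -Encoding ASCII;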

Putting It All Together

With all the pieces in place, it was a relatively simple matter to create a new AMI with those scripts baked into it and put it to work straightaway.

Of course, I’d been doing this the whole time in order to test the process, so I certainly had a lot of failures building up to the final deployment.

Conclusion

Even simple automation can prove to be time consuming, especially when you run into weird unforeseen problems like components not performing as advertised or even reporting correct errors for you to use for debugging purposes.

Still, it was worth it.

Now I never have to manually configure those damn spot instances when they come back.

And satisfaction is worth its weight in gold.


A long time ago, I wrote a post about fixing up our log management scripts to actually delete logs properly. Logs were being aggregated into our ELK stack and were then being deleted after reaching a certain age. It fixed everything perfectly and it was all fine, forever. The End

Or not.

The log management script itself was very simple (even though it was wrapped in a bunch of Powershell to schedule it via a Windows Scheduled Task). It looked at a known directory (C:\logs), found all the files matching *.log, narrowed it down to only those files that had not been written to in the last 7 days and then deleted them. Not exactly rocket science.
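
In Powershell terms, the core of it boiled down to something like this:

# Delete anything matching *.log in C:\logs that hasn't been written to in the last 7 days
Get-ChildItem -Path "C:\logs" -Filter "*.log" |
    Where-Object { $_.LastWriteTime -lt (Get-Date).AddDays(-7) } |
    Remove-Item -Force;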

As we got more and more usage on the services in question though, the log file generation started to outstrip the ability of the script to clean up.

They Can Fill You Up Inside

The problem was twofold: the log management script was hardcoded to only delete things that hadn’t been touched in the last 7 days, and the installation of the scheduled task that ran the script was done during machine initialisation (as part of the cfn-init configuration). Changing the script would require refreshing the environment, which requires a migration (which is time consuming and still not perfect). Not only that, but changing the script for one service (i.e. to delete all logs older than a day) might not be the best thing for other services.

The solution was to not be stupid and deploy log management in the same way we deploy everything else: Octopus Deploy.

With Octopus, we could build a package containing all of the logic for how to deploy a log management solution, and use that package in any number of projects with customised parameters, like retention period, directory, filter, whatever.

More importantly, it would give us the ability to update existing log management deployments without having to alter their infrastructure configuration, which is important for minimising downtime and just generally staying sane.

They Are Stronger Than Your Drive

It was easy enough to create a Nuget package containing the logic for installing the log management scheduled task. All I had to do was create a new repository, add a reference to our common scripts and then create a deploy.ps1 file describing how the existing scheduled task setup script should be called during deployment.

In fact, the only difference from calling the script directly like we were previously, was that I grabbed variable values from the available Octopus parameters, and then slotted them into the script.
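
A hypothetical deploy.ps1 along those lines might look like the following; the parameter names and the setup function are illustrative stand-ins for the real ones in our common scripts:

# Illustrative only; the real script dot-sources our shared functions and uses the real parameter names
$directory = $OctopusParameters["LogsDirectory"];
$filter = $OctopusParameters["LogsFilter"];
$retentionDays = $OctopusParameters["LogsRetentionPeriodDays"];

. ".\scripts\common\Functions-LogManagement.ps1";

Install-LogManagementScheduledTask -Directory $directory -Filter $filter -RetentionPeriodDays $retentionDays;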

With a nice self contained Nuget package that would install a scheduled task for log management, the only thing left to do was create a new Octopus project to deploy the package.

I mostly use a 1-1 Nuget Package to Project strategy when using Octopus. I’ve found that it’s much easier to get a handle on the versions of components being deployed when each project only has 1 thing in it. In this case, the Nuget package was generic, and was then referenced from at least two projects, one to manage the logs on some API instances and the other to manage logs on a log shipper component (which gets and processes ELB logs).

In order to support a fully automated build, test and deployment cycle, I also created a TeamCity Build Configuration. Like all Build Configurations, it watches a repository for changes, runs tests, packages and then deploys to the appropriate environments. This one was slightly different though, in that it deployed multiple Octopus projects, rather than a single one like almost all of our other Build Configurations. I’m not sure how I feel about this yet (because it’s different from what we usually do), but it’s definitely easier to manage, especially since all the projects just use the same package anyway.

Conclusion

This post might not look like much, and to be honest it isn’t. Last week’s post on Expression Trees was a bit exhausting so I went for something much smaller this week.

That is not to say that the concepts covered herein are not important.

One thing that I’ve taken away from this is that if you ever think that installing something just once on machine initialization is enough, you’re probably wrong. Baking the installation of log management into our cfn-init steps caused us a whole bunch of problems once we realised we needed to update them because they didn’t quite do the job. It was much harder to change them on the fly without paying a ridiculous penalty in terms of downtime. It also felt really strange to have to migrate an entire environment just because the log files weren’t being cleaned up correctly.

When it comes down to it, the real lesson is to always pick the path that lets you react to change and deploy with the least amount of effort.