
As part of the work I did recently to optimize our usage of RavenDB, I had to spin up a brand new environment for the service in question.

Thanks to our commitment to infrastructure as code (and our versioned environment creation scripts), this is normally a trivial process. Use TeamCity to trigger the execution of the latest environment package and then play the waiting game. To set up my test environment, all I had to do was supply the ID of the EC2 volume containing a snapshot of production data as an extra parameter.

Unfortunately, it failed.

Sometimes that happens (an unpleasant side effect of the nature of AWS), so I started it up again in the hopes that it would succeed this time.

It failed again.

Repeated failures are highly unusual, so I started digging. It looked as though the problem was that the environments were simply taking too long to “finish” provisioning. We allocate a fixed amount of time for our environments to complete, which includes the time required to create the AWS resources, deploy the necessary software and then wait until the resulting service can actually be hit from the expected URL.

Part of the environment setup is the execution of the CloudFormation template (and all of the AWS resources that go with that). The other part is deploying the software that needs to be present on those resources, which is where we use Octopus Deploy. At startup, each instance registers itself with the Octopus server, applying tags as appropriate and then a series of Octopus projects (i.e. the software) is deployed to the instance.

Everything seemed to be working fine from an AWS point of view, but the initialisation of the machines was taking too long. They were only getting through a few of their deployments before running out of time. It wasn’t a tight time limit either; we’d allocated a very generous 50 minutes for everything to be up and running, so it was disappointing to see it taking so long.

The cfn-init logs were very useful here, because they show start/end timestamps for each step specified. Each of the deployments was taking somewhere between 12 and 15 minutes, which is far too slow; with at least 8 of them on the most popular machine, that’s 96-120 minutes of deployments alone, well past the 50 minute limit.

This is one of those cases where I’m glad I put the effort in to remotely extract log files as part of the reporting that happens whenever an environment fails. It’s especially valuable when the environment creation is running in a completely automated fashion, like it does when executed through TeamCity, but it’s still very useful when debugging an environment locally. The last thing I want to do is manually remote into each machine in the environment and grab its log files in order to try and determine where a failure is occurring when I can just have a computer do it for me.

Speed Deprived Octopus

When using Octopus Deploy (at least the version we have, which is 2.6.4), you can choose to either deploy a specific version of a project or deploy the “latest” version, leaving the decision of what exactly is the latest version up to Octopus. I was somewhat uncomfortable with the concept of using “latest” for anything other than a brand new environment that had never existed before, because initial testing showed that “latest” really did mean the most recent release.

The last thing I wanted to do was to cede control over exactly what was getting deployed to a new production API instance when I had to scale up due to increased usage.

In order to alleviate this, we wrote a fairly simple Powershell script that uses the Octopus .NET Client library to determine the best version to deploy, making sure to only consider releases that have actually been deployed to the environment, and discounting deployments that failed.

# Find all deployments of this project to this environment via the Octopus .NET Client library.
$deployments = $repository.Deployments.FindMany({ param($x) $x.EnvironmentId -eq $env.Id -and $x.ProjectId -eq $project.Id })

# Any and First are helper functions from our shared Powershell script library.
if ($deployments | Any)
{
    Write-Verbose "Deployments of project [$projectName] to environment [$environmentName] were found. Selecting the most recent successful deployment."
    $latestDeployment = $deployments |
        Sort -Descending -Property Created |
        First -Predicate { $repository.Tasks.Get($_.TaskId).FinishedSuccessfully -eq $true } -Default $null

    # Guard against the case where deployments exist but none of them succeeded.
    if ($latestDeployment -ne $null)
    {
        $release = $repository.Releases.Get($latestDeployment.ReleaseId)
    }
}
else
{
    Write-Verbose "No deployments of project [$projectName] to environment [$environmentName] were found."
}

$version = if ($release -eq $null) { "latest" } else { $release.Version }

Write-Verbose "The version of the most recent successful deployment of project [$projectName] to environment [$environmentName] was [$version]. 'latest' indicates no successful deployments, in which case the very latest release version will be used."

return $version

We use this script during the deployment of the Octopus projects that I mentioned above.

When it was first written, it was fast.

That was almost two years ago though, and now it was super slow, taking upwards of 10 minutes just to determine the most recent release. As you can imagine, this was problematic for our environment provisioning, because each environment had multiple projects being deployed during initialization and all of those delays added up very quickly.

Age Often Leads to Slowdown

The slowdown didn’t just happen all of a sudden though. We had noticed the environment provisioning getting somewhat slower from time to time, but we’d assumed it was a result of there simply being more for our Octopus server to do (more projects, more deployments, more things happening at the same time). It wasn’t until it reached the critical point where I couldn’t even spin up one of our environments for testing purposes that it needed to be addressed.

Looking at the statistics of the Octopus server during an environment provisioning, I noticed that it was using 100% CPU for the entire time that it was deploying projects, up until the point where the environment timed out.

Our Octopus server isn’t exactly a powerhouse, so I thought maybe it had just reached a point where it simply did not have enough power to do what we wanted it to do. This intuition was compounded by the knowledge that Octopus 2.6 uses RavenDB as its backend, and they’ve moved to SQL Server for Octopus 3, citing performance problems.

I dialed up the power (good old AWS), and then tried to provision the environment again.

Still timed out, even though the CPU was no longer maxing out on the Octopus server.

My next suspicion was that I had made an error of some sort in the script, but there was nothing obviously slow about it. It makes a bounded query to Octopus via the client library (.FindMany with a Func specifying the environment and project, to narrow down the result set) and then iterates through the resulting deployments (by date, so working backwards in time) until it finds an acceptable deployment for identifying the best release.

Debugging the automated tests around the script, the slowest part seemed to be when it was getting the deployments from Octopus. Using Fiddler, the reason for this became obvious.

The call to .FindMany with a Func was not using the information provided in the Func to optimize the resulting API calls. The script had been getting slower and slower over time because every run had to retrieve every single previous deployment from the server (including those unrelated to the current environment and project) before filtering locally. That meant hundreds of queries to the Octopus API, just to determine which version needed to be deployed.

The script was initially fast because I wrote it when we only had a few hundred deployments recorded in Octopus. We have thousands now and it grows every day.

What I thought was a bounded query, wasn’t bounded in any way.
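The difference between the two approaches is easy to demonstrate with a toy model. The Python sketch below (purely illustrative; the names and page size are my own, not the Octopus client’s) simulates a paged deployments endpoint and counts how many page requests each strategy makes. A client-side predicate has to walk every page of every deployment ever recorded, while a server-side filter only pages through the matching rows.

```python
# Toy model of client-side vs server-side filtering against a paged API.
# All names here are hypothetical; this is not the Octopus client itself.

DEPLOYMENTS = [
    {"Id": i, "EnvironmentId": f"Environments-{i % 50}", "ProjectId": f"Projects-{i % 20}"}
    for i in range(5000)  # thousands of deployments, as described in the post
]
PAGE_SIZE = 30

def fetch_page(items, skip):
    """Pretend HTTP call: returns one page and records that a request happened."""
    fetch_page.calls += 1
    return items[skip:skip + PAGE_SIZE]
fetch_page.calls = 0

def find_many_client_side(predicate):
    """Mimics .FindMany with an opaque Func: pulls every page, filters locally."""
    results, skip = [], 0
    while True:
        page = fetch_page(DEPLOYMENTS, skip)
        if not page:
            return results
        results.extend(d for d in page if predicate(d))
        skip += PAGE_SIZE

def find_many_server_side(environment_id, project_id):
    """Mimics an endpoint that accepts filter parameters: the server applies
    the filter, so only the matching rows are paged back to the caller."""
    matching = [d for d in DEPLOYMENTS
                if d["EnvironmentId"] == environment_id and d["ProjectId"] == project_id]
    results, skip = [], 0
    while True:
        page = fetch_page(matching, skip)
        if not page:
            return results
        results.extend(page)
        skip += PAGE_SIZE

fetch_page.calls = 0
client = find_many_client_side(
    lambda d: d["EnvironmentId"] == "Environments-0" and d["ProjectId"] == "Projects-0")
client_calls = fetch_page.calls  # walks all ~167 pages, plus one empty page

fetch_page.calls = 0
server = find_many_server_side("Environments-0", "Projects-0")
server_calls = fetch_page.calls  # only the handful of pages that match

print(client_calls, server_calls)
```

Both strategies return the same 50 deployments, but the client-side version makes 168 simulated requests to the server-side version’s 3, and that gap widens with every new deployment recorded.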

But Age Also Means Wisdom

On the upside, the problem wasn’t particularly hard to fix.

The Octopus HTTP API is pretty good. In fact, it actually has specific query parameters for specifying environment and project when you’re using the deployments endpoint (which is another reason why I assumed the client would be optimised to use them based on the incoming parameters). All I had to do was rewrite the “last release to environment” script to use the API directly instead of relying on the client.

$environmentId = $env.Id;

$headers = @{"X-Octopus-ApiKey" = $octopusApiKey}
$uri = "$octopusServerUrl/api/deployments?environments=$environmentId&projects=$projectId"
while ($true) {
    Write-Verbose "Getting the next set of deployments from Octopus using the URI [$uri]"
    $deployments = Invoke-RestMethod -Uri $uri -Headers $headers -Method Get -Verbose:$false
    if (-not ($deployments.Items | Any))
    {
        $version = "latest";
        Write-Verbose "No deployments could be found for project [$projectName ($projectId)] in environment [$environmentName ($environmentId)]. Returning the value [$version], which will indicate that the most recent release is to be used"
        return $version
    }

    Write-Verbose "Finding the first successful deployment in the set of deployments returned. There were [$($deployments.TotalResults)] total deployments, and we're currently searching through a set of [$($deployments.Items.Length)]"
    $successful = $deployments.Items | First -Predicate { 
        $uri = "$octopusServerUrl$($_.Links.Task)"; 
        Write-Verbose "Getting the task the deployment [$($_.Id)] was linked to using URI [$uri] to determine whether or not the deployment was successful"
        $task = Invoke-RestMethod -Uri $uri -Headers $headers -Method Get -Verbose:$false; 
        $task.FinishedSuccessfully; 
    } -Default "NONE"
    
    if ($successful -ne "NONE")
    {
        Write-Verbose "Finding the release associated with the successful deployment [$($successful.Id)]"
        $release = Invoke-RestMethod "$octopusServerUrl$($successful.Links.Release)" -Headers $headers -Method Get -Verbose:$false
        $version = $release.Version
        Write-Verbose "A successful deployment of project [$projectName ($projectId)] was found in environment [$environmentName ($environmentId)]. Returning the version of the release attached to that deployment, which was [$version]"
        return $version
    }

    Write-Verbose "Finished searching through the current page of deployments for project [$projectName ($projectId)] in environment [$environmentName ($environmentId)] without finding a successful one. Trying the next page"
    $next = $deployments.Links."Page.Next"
    if ($next -eq $null)
    {
        Write-Verbose "There are no more deployments available for project [$projectName ($projectId)] in environment [$environmentName ($environmentId)]. We're just going to return the string [latest] and hope for the best"
        return "latest"
    }
    else
    {
        $uri = "$octopusServerUrl$next"
    }
}

It’s a little more code than the old one, but it’s much, much faster. 200 seconds (for our test project) down to something like 6 seconds, almost all of which was the result of optimizing the number of deployments that need to be interrogated to find the best release. The full test suite for our deployment scripts (which actually provisions test environments) halved its execution time too (from 60 minutes to 30), which was nice, because the tests were starting to get a bit too slow for my liking.

Summary

This was one of those cases where an issue had been present for a while, but we’d chalked it up to growth issues rather than an actual inefficiency in the code.

The most annoying part is that I did think of this specific problem when I originally wrote the script, which is why I included the winnowing parameters for environment and project. It’s pretty disappointing to see that the client did not optimize the resulting API calls by making use of those parameters, although I have a suspicion about that.

I think that if I were to run the same sort of code in .NET directly (rather than through Powershell), the resulting call to .FindMany would actually be optimized in the way that I thought it was. I have absolutely no idea how Powershell handles the equivalent of lambda expressions, so if it was just passing a plain Func through to the function instead of an Expression<Func>, there would be no obvious way for the client library to determine whether it could use the optimized query string parameters on the API.
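That distinction is worth spelling out, and it isn’t unique to .NET. In the Python sketch below (purely illustrative; these helper names are made up and have nothing to do with the Octopus client), a query builder handed a plain callable has no way to inspect which fields it compares, so it can’t derive any query string parameters, while one handed a declarative description of the filter (the analogue of an Expression<Func>) can translate it directly.

```python
from urllib.parse import urlencode

def build_query_from_callable(predicate):
    """An opaque callable is a black box: we cannot see which fields it
    compares, so the only safe option is to fetch everything and filter
    client-side. This mirrors passing a plain Func."""
    return "/api/deployments"  # no parameters can be derived from the predicate

def build_query_from_filters(filters):
    """A declarative description of the filter (like an Expression<Func> in
    .NET) can be inspected and translated into query string parameters."""
    return "/api/deployments?" + urlencode(filters)

opaque = build_query_from_callable(lambda d: d["EnvironmentId"] == "Environments-1")
declarative = build_query_from_filters(
    {"environments": "Environments-1", "projects": "Projects-1"})

print(opaque)       # the whole collection must be fetched and filtered locally
print(declarative)  # the server does the filtering
```

The callable version is forced into exactly the unbounded behaviour I hit, while the declarative version produces the same filtered URL my rewritten script builds by hand.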

Still, I’m pretty happy with the end result, especially considering how easy it was to adapt the existing code to just use the HTTP API directly.

It helps that it’s a damn good API.