
And now, for the thrilling conclusion! The thrilling conclusion to what, you might ask? Check back here for the first part in this saga.

I mentioned very briefly in the first part of this epic that we attempted to fix a strange threading bug with token generation via the ASP.NET Identity Framework by making sure that the Entity Framework DbContext was fully initialized (i.e. model created, connection established, etc) before it left the factory method. Initial tests were promising, but it turns out this did not fix the issue.

I mention this because I had absolutely no luck reproducing the connection leak when I was running locally (with or without a profiler attached). I could easily force timeouts when getting a connection from the pool (because it was too busy), but I couldn’t reproduce the apparent situation where there were connections established that could not be actively used.

When going through the combination of CloudWatch logs for RDS (to track connection usage) and our own ELK stack I found a correlation between the errors that sometimes occurred when generating tokens and the increase in the usage of connections. This pattern was pretty consistent. Whenever there was a cluster of errors related to token generation, there was an increase in the total number of connections used by the service, which never went down again until the application pool was recycled at the default time of 29 hours from the last recycle.

Token Failure

We’ve been struggling with the root cause of the token generation failures for a while now. The most annoying part is that it doesn’t fail all the time. In fact, my initial load tests showed only around a 1% failure rate, which is pretty low in the scheme of things. The problem manifests itself in exceptions occurring when a part of the Identity Framework attempts to use the Entity Framework DbContext that it was given. It looks as though there is some sort of threading issue with Entity Framework, which makes sense conceptually. Generally EF DbContext objects are not thread safe, so you shouldn’t attempt to use them on two different threads at the same time.

The errors were many and varied, but all consistently came from our implementation of the OAuthAuthorizationServerProvider. A few examples are below:

System.Data.Entity.Core.EntityCommandExecutionException: An error occurred while executing the command definition. See the inner exception for details. ---> System.InvalidOperationException: Operation is not valid due to the current state of the object.
   at Npgsql.NpgsqlConnector.StartUserAction(ConnectorState newState)
   at Npgsql.NpgsqlCommand.ExecuteDbDataReaderInternal(CommandBehavior behavior)
   at Npgsql.NpgsqlCommand.ExecuteDbDataReader(CommandBehavior behavior)
   at System.Data.Entity.Infrastructure.Interception.InternalDispatcher`1.Dispatch[TTarget,TInterceptionContext,TResult](TTarget target, Func`3 operation, TInterceptionContext interceptionContext, Action`3 executing, Action`3 executed)
   at System.Data.Entity.Infrastructure.Interception.DbCommandDispatcher.Reader(DbCommand command, DbCommandInterceptionContext interceptionContext)
   at System.Data.Entity.Core.EntityClient.Internal.EntityCommandDefinition.ExecuteStoreCommands(EntityCommand entityCommand, CommandBehavior behavior)
   --- End of inner exception stack trace ---
   at System.Data.Entity.Core.EntityClient.Internal.EntityCommandDefinition.ExecuteStoreCommands(EntityCommand entityCommand, CommandBehavior behavior)
   at System.Data.Entity.Core.Objects.Internal.ObjectQueryExecutionPlan.Execute[TResultType](ObjectContext context, ObjectParameterCollection parameterValues)
   at System.Data.Entity.Core.Objects.ObjectContext.ExecuteInTransaction[T](Func`1 func, IDbExecutionStrategy executionStrategy, Boolean startLocalTransaction, Boolean releaseConnectionOnSuccess)
   at System.Data.Entity.Core.Objects.ObjectQuery`1.<>c__DisplayClass7.<GetResults>b__5()
   at System.Data.Entity.Core.Objects.ObjectQuery`1.GetResults(Nullable`1 forMergeOption)
   at System.Data.Entity.Core.Objects.ObjectQuery`1.<System.Collections.Generic.IEnumerable<T>.GetEnumerator>b__0()
   at System.Data.Entity.Internal.LazyEnumerator`1.MoveNext()
   at System.Linq.Enumerable.FirstOrDefault[TSource](IEnumerable`1 source)
   at System.Linq.Queryable.FirstOrDefault[TSource](IQueryable`1 source, Expression`1 predicate)
   at [OBFUSCATION!].Infrastructure.Repositories.AuthorizationServiceRepository.GetApplicationByKey(String appKey, String appSecret) in c:\[OBFUSCATION!]\Infrastructure\Repositories\AuthorizationServiceRepository.cs:line 412
   at [OBFUSCATION!].Infrastructure.Providers.AuthorizationServiceProvider.ValidateClientAuthentication(OAuthValidateClientAuthenticationContext context) in c:\[OBFUSCATION!]\Infrastructure\Providers\AuthorizationServiceProvider.cs:line 42
   
System.NullReferenceException: Object reference not set to an instance of an object.
   at Npgsql.NpgsqlConnector.StartUserAction(ConnectorState newState)
   at Npgsql.NpgsqlCommand.ExecuteDbDataReaderInternal(CommandBehavior behavior)
   at Npgsql.NpgsqlCommand.ExecuteDbDataReader(CommandBehavior behavior)
   at System.Data.Entity.Infrastructure.Interception.InternalDispatcher`1.Dispatch[TTarget,TInterceptionContext,TResult](TTarget target, Func`3 operation, TInterceptionContext interceptionContext, Action`3 executing, Action`3 executed)
   at System.Data.Entity.Infrastructure.Interception.DbCommandDispatcher.Reader(DbCommand command, DbCommandInterceptionContext interceptionContext)
   at System.Data.Entity.Core.EntityClient.Internal.EntityCommandDefinition.ExecuteStoreCommands(EntityCommand entityCommand, CommandBehavior behavior)
   at System.Data.Entity.Core.Objects.Internal.ObjectQueryExecutionPlan.Execute[TResultType](ObjectContext context, ObjectParameterCollection parameterValues)
   at System.Data.Entity.Core.Objects.ObjectContext.ExecuteInTransaction[T](Func`1 func, IDbExecutionStrategy executionStrategy, Boolean startLocalTransaction, Boolean releaseConnectionOnSuccess)
   at System.Data.Entity.Core.Objects.ObjectQuery`1.<>c__DisplayClass7.<GetResults>b__5()
   at System.Data.Entity.Core.Objects.ObjectQuery`1.GetResults(Nullable`1 forMergeOption)
   at System.Data.Entity.Core.Objects.ObjectQuery`1.<System.Collections.Generic.IEnumerable<T>.GetEnumerator>b__0()
   at System.Data.Entity.Internal.LazyEnumerator`1.MoveNext()
   at System.Linq.Enumerable.FirstOrDefault[TSource](IEnumerable`1 source)
   at System.Linq.Queryable.FirstOrDefault[TSource](IQueryable`1 source, Expression`1 predicate)
   at [OBFUSCATION!].Infrastructure.Repositories.AuthorizationServiceRepository.GetApplicationByKey(String appKey, String appSecret) in c:\[OBFUSCATION!]\Infrastructure\Repositories\AuthorizationServiceRepository.cs:line 412
   at [OBFUSCATION!].Infrastructure.Providers.AuthorizationServiceProvider.ValidateClientAuthentication(OAuthValidateClientAuthenticationContext context) in c:\[OBFUSCATION!]\Infrastructure\Providers\AuthorizationServiceProvider.cs:line 42

In the service, this doesn’t make a huge amount of sense. There is one DbContext created per request (via Owin), and while the Owin middleware is asynchronous by nature (meaning that execution can jump between threads), there is no parallelism. The DbContext should never have been in use on multiple threads at once, but apparently it was.

It was either that, or something was going seriously wrong in the connection pooling code for Npgsql.

Scope Increase

As I didn’t quite understand how the dependency injection/object lifetime management worked via the OwinContext, I had my suspicions that something was going awry there. Either the DbContext was not in fact generated once per request, or there was some strange race condition that allowed a DbContext to be reused on more than one thread.

I decided to rewrite the way in which dependencies are obtained in the service. Instead of generating a DbContext per request, I would supply a DbContextFactory to everything, letting each consumer generate its own temporarily scoped DbContext that it is responsible for disposing.
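To illustrate the shape of that restructure, here is a minimal, self-contained sketch (all of the names are mine, not the service's, and the fake context stands in for an EF DbContext): everything receives a factory and owns the short-lived context it creates.

```csharp
using System;

// Hedged sketch of the factory approach: consumers create a context, use it,
// and dispose it within a single scope, so a context can never be observed
// by two threads at once.
public sealed class FakeDbContext : IDisposable
{
    public void Dispose() { /* the underlying connection would return to the pool here */ }
}

public interface IDbContextFactory
{
    FakeDbContext Create();
}

public sealed class DbContextFactory : IDbContextFactory
{
    public FakeDbContext Create() { return new FakeDbContext(); }
}

public class Repository
{
    private readonly IDbContextFactory _factory;

    public Repository(IDbContextFactory factory) { _factory = factory; }

    public string DoWork()
    {
        // Each call gets its own context and disposes it on the way out.
        using (var db = _factory.Create())
        {
            return "done";
        }
    }
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(new Repository(new DbContextFactory()).DoWork());
    }
}
```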

In order to accomplish this I switched to an IoC container that I was more familiar with, Ninject. Not a small amount of work, and not without added complexity, but I felt that it made the code more consistent with the rest of our code bases and generally better.

In retrospect, I should have verified that I could reproduce the token generation errors at will first, but I didn’t. I wrote the test after I’d spent the better part of a day switching out the dependency injection mechanisms. This was a mistake.

Since the errors always occurred during the execution of a single endpoint, I wrote a test that uses 10 tasks to spam that particular endpoint. If none of the tasks fault within a time limit (i.e. no exceptions are thrown), then the test is considered a success. Basically a very small, focused, stress test to be run automatically as part of our functional test suite.

[Test]
[Category("functional")]
public void WhenAttemptingToGenerateMultipleTokensAtTheSameTime_NoRequestsFail()
{
    var authClientFactory = _resolver.Get<IAuthClientFactory>();
    var app = new ApplicationBuilder(authClientFactory.CreateSeeder())
        .WithRole("auth_token_generate")
        .WithRole("auth_customer_register")
        .WithRole("auth_database_register")
        .WithRole("auth_user_register")
        .WithRole("auth_query")
        .Build();

    var userBuilder = new UserBuilder(authClientFactory.CreateFromApplication(app.ApplicationDetails.Key, app.ApplicationSecret));
    userBuilder.Build();

    List<Task> tokenGenerationTasks = new List<Task>();
    var cancellation = new CancellationTokenSource();
    for (int i = 0; i < 10; i++)
    {
        var task = Task.Factory.StartNew
        (
            () =>
            {
                var client = authClientFactory.CreateFromApplication(app.ApplicationDetails.Key, app.ApplicationSecret);
                while (true)
                {
                    if (cancellation.Token.IsCancellationRequested) break;
                    var token = client.TokenGenerate(userBuilder.CustomerId + "/" + userBuilder.DatabaseId + "/" + userBuilder.UserCode, userBuilder.Password);
                }
            },
            cancellation.Token,
            TaskCreationOptions.LongRunning,
            TaskScheduler.Default
        );

        tokenGenerationTasks.Add(task);
    }

    // Task.WaitAny does not rethrow exceptions itself; it just returns early
    // if any of the tasks complete (including completion by fault) within the
    // time limit. Any faulted task is detected and its exception rethrown below.
    Task.WaitAny(tokenGenerationTasks.ToArray(), TimeSpan.FromSeconds(15));
    cancellation.Cancel();

    var firstFaulted = tokenGenerationTasks.FirstOrDefault(a => a.IsFaulted);
    if (firstFaulted != null) throw firstFaulted.Exception;
}

The first time I ran the test against a local service it passed successfully…

Now, I don’t know about anyone else, but when a test works the first time I am immediately suspicious.

I rolled my changes back and ran the test again, and it failed.

So my restructuring successfully fixed the issue, but why?

The Root Of The Problem

I hadn’t actually understood the issue; all I had done was make enough changes that it seemed to no longer occur. Without that understanding, if it recurred, I would have to start all over again, possibly misdirected by the changes that I made last time.

Using the test that guaranteed a reproduction, I investigated in more depth. Keeping all of my changes reverted, I was still getting a weird sampling of lots of different errors, but they were all consistently coming from one of our repositories (classes which wrap a DbContext and add extra functionality) whenever it was used within our OAuthAuthorizationServerProvider implementation.

Staring at the code for a while, the obviousness of the issue hit me.

At startup, a single OAuthAuthorizationServerProvider implementation is created and assigned to generate tokens for requests to the /auth/token endpoint.

This of course meant that all of the functions in that provider needed to be thread safe.

They were not.

Of the two functions in the class, both set and then used a class-level variable, which in turn had a dependency on a DbContext.

This was the smoking gun. If two requests came in quickly enough, one would set the variable (using the DbContext for its request), the other would do the same (using a different DbContext), and then the first would attempt to use the other thread's DbContext (indirectly through the variable). This would rightly cause an error (as multiple threads tried to use the same DbContext) and throw an exception, failing the token generation.
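The interleaving described above can be reproduced deterministically in isolation. The sketch below (no EF involved; all names are illustrative) uses two events to force the ordering that real traffic only hits occasionally: both requests write the shared field before either reads it back.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// A single provider instance (like the singleton OAuthAuthorizationServerProvider)
// stores per-request state in an instance field. That field is the smoking gun:
// shared mutable state used by concurrent requests.
class SingletonProvider
{
    private object _perRequestContext;

    public object Handle(object requestContext, EventWaitHandle wroteMine, EventWaitHandle otherWrote)
    {
        _perRequestContext = requestContext; // this request writes...
        wroteMine.Set();
        otherWrote.WaitOne();                // ...the other request also writes...
        return _perRequestContext;           // ...so this read sees the last writer's context
    }
}

class Program
{
    static void Main()
    {
        var provider = new SingletonProvider();
        var contextA = new object();
        var contextB = new object();
        var aWrote = new ManualResetEvent(false);
        var bWrote = new ManualResetEvent(false);

        var requestA = Task.Run(() => provider.Handle(contextA, aWrote, bWrote));
        var requestB = Task.Run(() => provider.Handle(contextB, bWrote, aWrote));
        Task.WaitAll(requestA, requestB);

        // Both requests observed the same object, i.e. at least one of them
        // used a context belonging to the other request.
        Console.WriteLine(ReferenceEquals(requestA.Result, requestB.Result)
            ? "RACE: both requests saw the same context"
            : "requests saw their own contexts");
    }
}
```

Removing the field (keeping the state local to each call) removes the race entirely, which is exactly what the fix amounted to.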

I abandoned my changes (though I will probably make them again over time), removed the class variable and re-ran the test.

It was all good. No errors at all, even after running for a few hours.

But why did the error cause a resource leak at the database connection level?

Leaky Logic

In the end I didn’t find out exactly why threading errors with Entity Framework (using Npgsql) were causing connection leaks. I plan to investigate in more depth in the future, and I’ll probably blog about my findings, but for now I was just happy to have the problem solved.

With the bug fixed, profiling over a period of at least 24 hours showed no obvious connection leaks as a result of normal traffic. Previously this would have guaranteed at least 10 leaked connections, possibly more. So for now the problem is solved and I need to move on.

Summary

Chasing down resource leaks can be a real pain, especially when you don’t have a reliable reproduction.

If I had realised earlier that the token generation failures and connection leaks were related, I would have put more effort into reproducing the first in order to reproduce the second. It wasn’t immediately obvious that they were linked though, so I spent a lot of time analysing the code, trying to figure out what could possibly be leaking valuable resources. This was a time-consuming and frustrating process, ultimately leading nowhere.

Once I finally connected the dots between the token failures and the connection leak, everything came together, even if I didn’t completely understand why the connections were leaking in error situations.

Ah well, can’t win ’em all.


A very short post this week, as I’m still struggling with my connection leak and a number of other things (RavenDB production server performance issues being the biggest, but also automating a Node/NPM built website into our current CI architecture, which is mostly based around Powershell/MSBuild). It’s been a pretty discombobulated week.

So this incredibly short post?

Phantom buckets in S3.

There Is A Hole In The Bucket

Our environments often include S3 buckets, and those buckets are typically created via the same CloudFormation template as the other components (like EC2 instances, ELB, Auto Scaling Groups, etc).

Until now, the names of these buckets have been relatively straightforward. A combination of a company name + environment (i.e. ci, staging, etc) + the component (like auth service) + the purpose of the bucket (logs, images, documents, whatever).

This works great. Your buckets have sane names, so you know where to look for things, and it’s easy to apply different lifecycle management depending on the bucket purpose.

Unfortunately it’s not all wonderful happy time.

The first issue is that CloudFormation will not delete a bucket with contents. I can understand this from a safety point of view, but when the actual AWS API allows you to just delete buckets with contents, the disconnect is frustrating.

What this means is that you now need to delete the bucket contents outside of the actual stack deletion. It’s especially annoying for buckets being used to contain ELB logs, as there is an extremely good chance of files being written after you’ve cleared the bucket ready for CloudFormation to delete it. I’ve solved this issue by just deleting the bucket outside of the stack teardown (we already do some other things there, like Octopus management, so it’s not entirely unprecedented).
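For reference, the out-of-stack teardown can be as simple as the sketch below (AWS Tools for PowerShell; the function name and parameters are mine, not the real script's). The -DeleteBucketContent switch on Remove-S3Bucket empties the bucket first, which covers ELB log files that land after a manual clear.

```powershell
# Hedged sketch: delete a bucket and its contents in one step, before (or
# instead of) letting CloudFormation attempt it during stack deletion.
function Remove-EnvironmentBucket
{
    param
    (
        [string]$bucketName,
        [string]$awsKey,
        [string]$awsSecret,
        [string]$awsRegion
    )

    # -DeleteBucketContent empties the bucket (including late-arriving ELB
    # log files) before the bucket itself is removed.
    Remove-S3Bucket -BucketName $bucketName -DeleteBucketContent -Force -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
}
```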

The second issue is phantom buckets.

OooOOooOoo

I’ve encountered this issue twice now. Once for our proxy environment, and now once for one of our APIs.

What happens is that when the environment attempts to spin up (our CI environments are recreated every morning to verify that our environment creation scripts work as expected), it will fail because it cannot create the bucket. The actual error is incredibly unhelpful:

{
    "EventId" : "LogsBucket-CREATE_FAILED-2015-11-02T21:49:55.907Z",
    "LogicalResourceId" : "LogsBucket",
    "PhysicalResourceId" : "OBFUSCATED_BUCKET_NAME",
    "ResourceProperties" : "{\"BucketName\":\"OBFUSCATED_BUCKET_NAME\",\"LifecycleConfiguration\":{\"Rules\":[{\"Status\":\"Enabled\",\"Id\":\"1\",\"ExpirationInDays\":\"7\"}]}}\n",
    "ResourceStatus" : "CREATE_FAILED",
    "ResourceStatusReason" : "The specified bucket does not exist",
    "ResourceType" : "AWS::S3::Bucket",
    "StackId" : "OBFUSCATED_STACK_ID",
    "StackName" : "OBFUSCATED_STACK_NAME",
    "Timestamp" : "\/Date(1446500995907)\/"
}

If I go into the AWS dashboard and look at my buckets, it’s clearly not there.

If I try to create a bucket with the expected name, it fails, saying the bucket already exists.

It’s a unique enough name that it seems incredibly unlikely that someone else has stolen it (bucket names being globally unique), so I can only assume that something has gone wrong in AWS and the bucket still technically exists somehow, but we’ve lost control over it.

Somehow.

Of course, because the bucket is an intrinsic part of the environment, I now can’t create my CI environment for that particular service. Which means we can’t successfully build/deploy anything involving that service, because CI is typically used for functional test validation.

Who Ya Gunna Call? Ghostbusters!

The only solution I could come up with, was to make sure that every time an environment is created, the buckets have completely unique names. With only 63 characters to work with, this is somewhat challenging, especially if we want to maintain nice sane bucket names that a human could read.

What I ended up doing was shortening the human readable part (just environment + component + purpose) and appending a GUID onto the end.
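The name generation itself is trivial; the sketch below (prefix is a made-up example) shows the arithmetic: a GUID without hyphens is 32 characters, which leaves around 30 for the readable prefix and separator within S3's 63 character limit.

```powershell
# Illustrative unique bucket name: short human-readable part + GUID suffix.
# The "N" format strips hyphens and produces lowercase hex, which also keeps
# the name valid under S3's lowercase naming rules.
$prefix = "ci-auth-logs"
$bucketName = "$prefix-$([Guid]::NewGuid().ToString('N'))"

# Guard against a prefix long enough to blow the 63 character limit.
if ($bucketName.Length -gt 63)
{
    throw "Bucket name [$bucketName] is too long at [$($bucketName.Length)] characters"
}

$bucketName
```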

Now that I couldn’t predict the name of the bucket though, I had to fix up a couple of other loose ends.

The first was that the bucket deletion (during environment tear down) now had to query the stack itself to find out the bucket resources. Not overly difficult.

try
{
    if ($environment -ne $null)
    {
        $resources = Get-CFNStackResources -StackName $environment.StackId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
        $s3buckets = $resources | Where { $_.ResourceType -eq "AWS::S3::Bucket" }
        foreach ($s3Bucket in $s3Buckets)
        {
            try
            {
                $bucketName = $s3Bucket.PhysicalResourceId
                _RemoveBucket -bucketName $bucketName -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion
            }
            catch
            {
                Write-Warning "Error occurred while trying to delete bucket [$bucketName] prior to stack destruction."
                Write-Warning $_
            }
        }
    }
}
catch
{
    Write-Warning "Error occurred while attempting to get S3 buckets to delete from the CloudFormation stack."
    Write-Warning $_
}

The second was that our Octopus projects used the predictable bucket name during deployments, so I had to change the environment setup code to update the project variables to have the correct value. This was a little more difficult, but due to Octopus being awesome from an automation point of view, it eventually worked.

Summary

I can see how this sort of situation can arise in a disconnected, eventually consistent architecture, but that doesn’t make it any less frustrating.

It could be my fault for constantly creating/deleting buckets as part of the environment management scripts, but being that it doesn’t happen all the time, it really does feel like a bug of some sort.

Plus, ghost buckets are scary. Does that mean there is some of my data up there in AWS that I no longer have control over? I mean, I can’t even see it, let alone manage it.

A sobering thought.


I’m pretty happy with the way our environment setup scripts work.

Within TeamCity, you generally only have to push a single button to get an environment provisioned (with perhaps a few parameters filled in, like environment name and whatnot), and even outside TeamCity, it’s a single script that only requires some credentials and a few other things to start.

Failures are detected (primarily by CloudFormation) and the scripts have the ability to remote onto AWS instances for you and extract errors from logs to give you an idea as to the root cause of the failure, so you have to do as little manual work as possible. If a failure is detected, everything is cleaned up automatically (CloudFormation stack deleted, Octopus environment and machines deleted, etc), unless you turn off automatic cleanup for investigation purposes.

Like I said, overall I’m pretty happy with how everything works, but one of the areas that I’m not entirely happy with is the last part of environment provisioning. When an environment creation is completed, you know that all components installed correctly (including Octopus deploys) and that no errors were encountered with any of the provisioning itself (EC2 instances, Auto Scaling Groups, RDS, S3, etc). What you don’t know is whether or not the environment is actually doing what it should be doing.

You don’t know whether or not it’s working.

That seems like a fixable problem.

Smoke On The Water

As part of developing environments, we’ve implemented automated tests using the Powershell testing framework called Pester.

Each environment has at least one test that verifies the environment is created as expected and works from the point of view of the service it offers. For example, in our proxy environment (which uses Squid) one of the outputs is the proxy URL. The test takes that URL and does a simple Invoke-WebRequest through it to a known address, validating that the proxy works as a proxy actually should.
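A cut-down, illustrative version of that check is below (the function name and target URL are placeholders, not the real test). Invoke-WebRequest has a built-in -Proxy parameter, so the validation really is just a request to a known address routed through the freshly created proxy.

```powershell
# Sketch of the proxy validation: succeed if a request through the proxy
# comes back with a 200, throw otherwise (throwing is the failure contract
# used by the smoke tests discussed below).
function Test-ProxyWorks
{
    param
    (
        [string]$proxyUrl,
        [string]$knownAddress = "http://www.example.com"
    )

    $response = Invoke-WebRequest -Uri $knownAddress -Proxy $proxyUrl -UseBasicParsing
    if ($response.StatusCode -ne 200)
    {
        throw "Request through proxy [$proxyUrl] returned status [$($response.StatusCode)]"
    }
}
```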

The issue with these tests is that they are not executed at creation time. They are usually only used during development, to validate that whatever changes you are making haven’t broken the environment and that everything is still working.

Unfortunately, beyond git tagging, our environment creation scripts/templates are not versioned. I would vastly prefer for our build scripts to take some set of source code that represents an environment setup, test it, replace some parameters (like version) and then package it up, perhaps into a nuget package. It’s something that’s been on my mind for a while, but I haven’t had time to put it together yet. If I do, I’ll be sure to post about it here.

The simplest solution is to extract the parts of the tests that perform validation into dedicated functions and then to execute them as part of the environment creation. If the validation fails, the environment should be considered a failure and should notify the appropriate parties and clean itself up.

Where There Is Smoke There Is Fire

The easiest way to implement the validation (hereafter referred to as smoke tests) in a reusable fashion is to incorporate the concept into the common environment provisioning scripts.

We’ve created a library that contains scripts that we commonly use for deployment, environment provisioning and other things. I made a copy of the source for that library and posted it to Solavirum.Scripts.Common a while ago, but it’s a bit out of date now (I really should update it).

Within the library is a Functions-Environment file.

This file contains a set of Powershell cmdlets for provisioning and deleting environments. The assumption is that it will be used within libraries for specific environments (like the Proxy environment mentioned above) and will allow us to take care of all of the common concerns (like uploading dependencies, setting parameters in CloudFormation, waiting on the CloudFormation initialization, etc).

Inside this file is a function called New-Environment, whose signature looks like this:

function New-Environment
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$environmentName,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsRegion,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$octopusServerUrl,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$octopusApiKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$uniqueComponentIdentifier,
        [System.IO.FileInfo]$templateFile,
        [hashtable]$additionalTemplateParameters,
        [scriptblock]$customiseEnvironmentDetailsHashtable={param([hashtable]$environmentDetailsHashtableToMutate,$stack) },
        [switch]$wait,
        [switch]$disableCleanupOnFailure,
        [string[]]$s3Buckets
    )

    function body would be here, but its super long
}

As you can see, it has a lot of parameters. It’s responsible for all of the bits and pieces that go into setting up an environment, like Octopus initialization, CloudFormation execution, gathering information in the case of a failure, etc. It’s also responsible for triggering a cleanup when an environment is deemed a failure, so it is the ideal place to put some smoke testing functionality.

Each specific environment repository typically contains a file called Invoke-NewEnvironment. This file is what is executed to actually create an environment of the specific type. It puts together all of the environment specific stuff (output customisation, template location, customised parameters) and uses that to execute the New-Environment function, which takes care of all of the common things.

In order to add a configurable smoke test, all we need to do is add an optional script block parameter to the New-Environment function. Specific environment implementations can supply a value for it if they like, but they don’t have to. If we assume that the contract for the script block is that it will throw an exception if it fails, then all we need to do is wrap it in a try..catch and fail the environment provisioning if an error occurs. Pretty straightforward.
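The hook itself is small; a hedged sketch is below (Invoke-SmokeTest is my name for illustration, everything else follows the contract just described). A throw from the supplied script block fails the environment, triggering the normal cleanup path, and the rethrown message contains "smoke" so that failure mode is easy to tell apart.

```powershell
# Sketch of the optional smoke test hook: run the supplied script block if
# there is one, return its result on success, and wrap any failure in an
# identifiable "smoke test" error.
function Invoke-SmokeTest
{
    param
    (
        [scriptblock]$smokeTest,
        [string]$environmentName
    )

    if ($smokeTest -eq $null) { return $null }

    try
    {
        return & $smokeTest
    }
    catch
    {
        throw "Smoke test failed for environment [$environmentName]: $_"
    }
}
```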

To support the smoke test functionality, I wrote two new Pester tests. One verifies that a failing smoke test correctly fails the environment creation and the other verifies that the result of a successful smoke test is included in the environment creation result. You can see them below:

Describe -Tags @("Ignore") "Functions-Environment.New-Environment.SmokeTest" {
    Context "When supplied with a smoke test script that throws an exception (indicating smoke test failure)" {
        It "The stack creation is aborted and deleted" {
            $creds = Get-AwsCredentials
            $octoCreds = Get-OctopusCredentials
            $environmentName = Create-UniqueEnvironmentName
            $uniqueComponentIdentifier = "Test"
            $templatePath = "$rootDirectoryPath\src\TestEnvironment\Test.CloudFormation.template"
            $testBucket = [Guid]::NewGuid().ToString("N")
            $customTemplateParameters = @{
                "LogsS3BucketName"=$testBucket;
            }

            try
            {
                try
                {
                    $createArguments = @{
                        "-EnvironmentName"=$environmentName;
                        "-TemplateFile"=$templatePath;
                        "-AdditionalTemplateParameters"=$CustomTemplateParameters;
                        "-UniqueComponentIdentifier"=$uniqueComponentIdentifier;
                        "-S3Buckets"=@($testBucket);
                        "-SmokeTest"={ throw "FORCED FAILURE" };
                        "-Wait"=$true;
                        "-AwsKey"=$creds.AwsKey;
                        "-AwsSecret"=$creds.AwsSecret;
                        "-AwsRegion"=$creds.AwsRegion;
                        "-OctopusApiKey"=$octoCreds.ApiKey;
                        "-OctopusServerUrl"=$octoCreds.Url;
                    }
                    $environmentCreationResult = New-Environment @createArguments
                }
                catch
                {
                    $error = $_
                }

                $error | Should Not Be $null
                $error | Should Match "smoke"

                try
                {
                    $getArguments = @{
                        "-EnvironmentName"=$environmentName;
                        "-UniqueComponentIdentifier"=$uniqueComponentIdentifier;
                        "-AwsKey"=$creds.AwsKey;
                        "-AwsSecret"=$creds.AwsSecret;
                        "-AwsRegion"=$creds.AwsRegion;                
                    }
                    $environment = Get-Environment @getArguments
                }
                catch
                {
                    Write-Warning $_
                }

                $environment | Should Be $null
            }
            finally
            {
                $deleteArguments = @{
                    "-EnvironmentName"=$environmentName;
                    "-UniqueComponentIdentifier"=$uniqueComponentIdentifier;
                    "-S3Buckets"=@($testBucket);
                    "-Wait"=$true;
                    "-AwsKey"=$creds.AwsKey;
                    "-AwsSecret"=$creds.AwsSecret;
                    "-AwsRegion"=$creds.AwsRegion;
                    "-OctopusApiKey"=$octoCreds.ApiKey;
                    "-OctopusServerUrl"=$octoCreds.Url;
                }
                Delete-Environment @deleteArguments
            }
        }
    }

    Context "When supplied with a valid smoke test script" {
        It "The stack creation is successful" {
            $creds = Get-AwsCredentials
            $octoCreds = Get-OctopusCredentials
            $environmentName = Create-UniqueEnvironmentName
            $uniqueComponentIdentifier = "Test"
            $templatePath = "$rootDirectoryPath\src\TestEnvironment\Test.CloudFormation.template"
            $testBucket = [Guid]::NewGuid().ToString("N")
            $customTemplateParameters = @{
                "LogsS3BucketName"=$testBucket;
            }

            try
            {
                $createArguments = @{
                    "-EnvironmentName"=$environmentName;
                    "-TemplateFile"=$templatePath;
                    "-AdditionalTemplateParameters"=$CustomTemplateParameters;
                    "-UniqueComponentIdentifier"=$uniqueComponentIdentifier;
                    "-S3Buckets"=@($testBucket);
                    "-SmokeTest"={ return $_.StackId + " SMOKE TESTED"}; 
                    "-Wait"=$true;
                    "-AwsKey"=$creds.AwsKey;
                    "-AwsSecret"=$creds.AwsSecret;
                    "-AwsRegion"=$creds.AwsRegion;
                    "-OctopusApiKey"=$octoCreds.ApiKey;
                    "-OctopusServerUrl"=$octoCreds.Url;
                }
                $environmentCreationResult = New-Environment @createArguments

                Write-Verbose (ConvertTo-Json $environmentCreationResult)

                $environmentCreationResult.SmokeTestResult | Should Match "SMOKE TESTED"
            }
            finally
            {
                $deleteArguments = @{
                    "-EnvironmentName"=$environmentName;
                    "-UniqueComponentIdentifier"=$uniqueComponentIdentifier;
                    "-S3Buckets"=@($testBucket);
                    "-Wait"=$true;
                    "-AwsKey"=$creds.AwsKey;
                    "-AwsSecret"=$creds.AwsSecret;
                    "-AwsRegion"=$creds.AwsRegion;
                    "-OctopusApiKey"=$octoCreds.ApiKey;
                    "-OctopusServerUrl"=$octoCreds.Url;
                }
                Delete-Environment @deleteArguments
            }
        }
    }
}

Smoke And Mirrors

On the specific environment side (the Proxy in this example), all we need to do is supply a script block that will execute the smoke test.

The smoke test itself needs to be somewhat robust, so we use a generic wait function to repeatedly execute an HTTP request through the proxy until it succeeds or it runs out of time.

function Wait
{
    [CmdletBinding()]
    param
    (
        [scriptblock]$ScriptToFillActualValue,
        [scriptblock]$Condition,
        [int]$TimeoutSeconds=30,
        [int]$IncrementSeconds=2
    )

    write-verbose "Waiting for the output of the script block [$ScriptToFillActualValue] to meet the condition [$Condition]"

    $totalWaitTimeSeconds = 0
    while ($true)
    {
        try
        {
            $actual = & $ScriptToFillActualValue
        }
        catch
        {
            Write-Warning "An error occurred while evaluating the script to get the actual value (which is evaluated by the condition for waiting purposes). As a result, the actual value is undefined (NULL)"
            Write-Warning $_

            $actual = $null
        }

        try
        {
            $result = & $condition
        }
        catch
        {
            Write-Warning "An error occurred while evaluating the condition to determine if the wait is over"
            Write-Warning $_

            $result = $false
        }

        
        if ($result)
        {
            write-verbose "The output of the script block [$ScriptToFillActualValue] (Variable:actual = [$actual]) met the condition [$condition]"
            return $actual
        }

        write-verbose "The current output of the condition [$condition] (Variable:actual = [$actual]) is [$result]. Waiting [$IncrementSeconds] and trying again."

        Sleep -Seconds $IncrementSeconds
        $totalWaitTimeSeconds = $totalWaitTimeSeconds + $IncrementSeconds

        if ($totalWaitTimeSeconds -ge $TimeoutSeconds)
        {
            throw "The output of the script block [$ScriptToFillActualValue] (Variable:actual = [$actual]) did not meet the condition [$Condition] after [$totalWaitTimeSeconds] seconds."
        }
    }
}

function Test-Proxy
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [string]$proxyUrl
    )

    if ($rootDirectory -eq $null) { throw "RootDirectory script scoped variable not set. That's bad, it's used to find dependencies." }
    $rootDirectoryPath = $rootDirectory.FullName

    . "$rootDirectoryPath\scripts\common\Functions-Waiting.ps1"
    
    $result = Wait -ScriptToFillActualValue { return (Invoke-WebRequest -Uri "http://www.google.com" -Proxy $proxyUrl -Method GET).StatusCode } -Condition { $actual -eq 200 } -TimeoutSeconds 600 -IncrementSeconds 60
}

The main reason for this repeated try..wait loop is that sometimes a CloudFormation stack will complete successfully, but the service may be unavailable from an external point of view until the Load Balancer or similar component manages to settle properly.
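As a concrete illustration, assuming the Wait function above has been dot-sourced, a caller outside of the proxy scenario might look something like this (the endpoint and timings are made up for the example):

```powershell
# Hypothetical usage of the Wait function defined above: poll a health
# endpoint every 5 seconds, for up to 2 minutes, until it returns HTTP 200.
# The URL is a placeholder, not a real endpoint.
$statusCode = Wait `
    -ScriptToFillActualValue { return (Invoke-WebRequest -Uri "http://internal-service/health" -Method GET).StatusCode } `
    -Condition { $actual -eq 200 } `
    -TimeoutSeconds 120 `
    -IncrementSeconds 5
```

Note that the Condition script block reads $actual via PowerShell's dynamic scoping, which is how the two script blocks communicate without any explicit parameter passing.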

Conclusion

I feel much more comfortable with our environment provisioning after moving the smoke tests into their own functions and executing them during the actual environment creation, rather than just in the tests.

Now whenever an environment completes its creation, I know that it actually works from an external observation point. The smoke tests aren’t particularly complex, but they definitely add a lot to our ability to reliably provision environments containing services.

Alas, I don’t have any more smoke puns or references to finish off this blog post…

Oh wait, yes I do!

*disappears in a puff of smoke*


So it turns out that Amazon deletes their pre-packaged Windows AMIs much quicker than they delete their Linux ones.

I’ve known about this for a little while, but it wasn’t until recently that it bit me.

One of the first environments that we created using the new strategy of “fully codified and automated environment setup”, recently hit a little snag. It was a production environment, primarily intended for beta testing, and the understanding was that it would be refreshed (using the latest version of the environment setup scripts) before it became available to the general public.

Late last week, one of our scripts that shuts down AWS EC2 instances based on time of day (to limit costs) accidentally terminated both of the EC2 instances that make up the API layer in that particular environment. Normally, this wouldn’t be an issue. The Auto Scaling Group would realise that it no longer had as many instances as it should and it would recreate them. The API instances are mostly stateless, so after a small amount of time, everything would be fine again.

As I’m sure you can imagine, this did not happen.

Amazon has since removed the AMI that the API instances were based off, so the ASG couldn’t spin up any new instances to replace the ones that were terminated.

The service was down, and it was going to stay that way for a while until we managed to fix the root issue.

The Immediate Response

The first thing we did was update the existing CloudFormation stack to use the latest version of the Windows AMI that we were previously using. This at least allowed the API instances to be created. However, they never finished their initialization.

In the time between when the environment was initially provisioned and the time when it was accidentally destroyed, we had made quite a few changes to the common scripts that back our environment provisioning process. One of those was to specify the version of the Octopus Tentacle that was going to be installed on the machine. We had previously run into an issue when Octopus 3 was released where the latest tentacle no longer worked the same way, and with little time to investigate it, simply hardcoded the version that was installed to the one we had previously been using.

In order to fix this issue in the old environment we had to fix the script. Whenever an environment is provisioned, the scripts that it depends on are uploaded to S3, ready to be downloaded by EC2 instances and other resources that need access to them. Rather than manually dig in to the dependencies, it was just easier to do the planned environment refresh.
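The upload itself is a straightforward recursive copy; with the AWS Tools for PowerShell it can be sketched as follows (the bucket name and paths are placeholders, not our real ones):

```powershell
# Hypothetical sketch of pushing the dependency scripts to S3 so that
# EC2 instances can download them during initialisation. Write-S3Object
# with -Recurse mirrors the folder structure under the key prefix.
$scriptsDirectory = "C:\source\environment\scripts"
Write-S3Object -BucketName "my-environment-dependencies" `
    -Folder $scriptsDirectory `
    -KeyPrefix "scripts" `
    -Recurse
```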

This went…okay. Not great, not terrible, but the problem was fixed and everything came back online before anyone tried to use the service the next day.

Fixing It Properly

I had actually been aware of the AMI missing issue for some time and was already working on a longer term fix. In fact, I had scheduled an environment refresh of the production/beta environment for this service for a bit later in the same week the incident happened. It was just unfortunate that the incident forced my hand.

The root cause of the issue is that we did not control all of the elements in the environment, specifically the AMI used. Having external dependencies isn’t always an issue (for example, we use NuGet extensively, but old NuGet packages are generally left alone), but Amazon makes no guarantees as to the availability of the AMIs they supply as time progresses. The solution is to create your own AMIs, so that Amazon can’t just delete them out from underneath you.

There are upsides and downsides to managing your own AMIs.

The primary upside is that you avoid issues like I’ve described above. Nobody external to your company is going to go and delete an AMI when you still have a dependency on it. Obviously if someone internal to your organization deletes the AMI you still have the same issue, but you at least have much more control over that situation.

Another upside is that you can include commonly installed software for your environments in your custom AMIs. For us, that would be things like an unregistered Octopus Tentacle or the .NET Framework and ASP.NET (not for all machines, but at least for all API instances).

The primary downside is that you can no longer easily take advantage of the fact that new Amazon AMIs are released on a regular basis, containing Windows Updates and other fixes (which are critically important to apply to machines that are exposed to the greater internet). You can still take advantage of those new AMIs; it’s just a little bit more difficult.

Another downside is that you now have to manage your own AMIs. This isn’t particularly difficult to be honest, but it is one more thing that you need to take care of, and I much prefer to simplify things rather than add more complexity.

The Mechanism

In an effort to avoid much of the manual work that can go into creating an AMI, I looked for a solution that was automated. I wanted to be able to run a process that simply spat out a customised AMI at the end, so that we could easily take advantage of new AMIs as Amazon released them, and then refresh our environments as required.

Initially I looked into automating the process myself, using the various APIs available for AWS. I’d already done some work previously in creating an EC2 instance for the purposes of updating an AMI, so I started with that.

Shortly after, someone informed me of the existence of Packer.

Packer is a wonderful little application that allows you to create AMIs and virtual machines for a number of virtualisation platforms. It even works on Windows, without having to install some arcane dependency chain through the command line. It’s just a collection of executables, which is nice.

Using Packer, I could put together the following configuration file that describes how I want my AMI to be structured.

{
    "variables" : {
        "aws_access_key" : "",
        "aws_secret_key" : "",
        "aws_region" : "",
        "source_ami" : "",
        "ami_name" : "",
        "user_data_file_path" : "",
        "octopus_api_key" : "",
        "octopus_server_url" : ""
    },
    "builders" : [{
            "type" : "amazon-ebs",
            "access_key" : "{{user `aws_access_key`}}",
            "secret_key" : "{{user `aws_secret_key`}}",
            "region" : "{{user `aws_region`}}",
            "source_ami" : "{{user `source_ami`}}",
            "instance_type" : "m3.large",
            "ami_name" : "{{user `ami_name`}}-{{timestamp}}",
            "user_data_file" : "{{user `user_data_file_path`}}",
            "vpc_id" : "vpc-a0a6aec9",
            "subnet_id" : "subnet-5908182d",
            "security_group_ids" : ["sg-0b65076e", "sg-4d188f28", "sg-faaf429f"],
            "ssh_keypair_name" : "YourKeyPair",
            "ssh_private_key_file":"C:\\temp\\YourKeyPair.pem",
            "communicator" : "winrm",
            "winrm_username" : "Administrator",
            "winrm_port" : 5985
        }
    ],
    "provisioners" : [
        {
            "type" : "powershell",
            "inline" : [
                "try",
                "{",
                    "$signalFilePath = \"C:\\signal\"",
                    "$content = Get-Content $signalFilePath",
                    "$maxWaitSeconds = 3000",
                    "$currentWaitSeconds = 0",
                    "$waitSeconds = 30",
                    "while ($content -eq \"1\" -and $currentWaitSeconds -lt $maxWaitSeconds) { Sleep -Seconds $waitSeconds; Write-Output \"Checking signal\"; $currentWaitSeconds += $waitSeconds; $content = Get-Content $signalFilePath; if ($content -eq \"-1\") { Write-Output \"User data signalled -1, indicating failure.\"; exit 1 } }",
                "}",
                "catch",
                "{",
                    "Write-Ouput \"An unexpected error occurred.\"",
                    "Write-Output $_",
                    "exit 1",
                "}"
            ]
        },
        {
            "type":"powershell",
            "scripts": [
                "@@ROOT_DIRECTORY_PATH\\scripts\\packer\\Ec2Config.ps1"
            ]
        }
    ]
}

The first part of the template describes various things about the EC2 instance that will be used to create the AMI, and the second part describes operations to perform on the instance in order to configure it the way you want it.

Note that the security groups used in the template above simply describe (in order): the ability to connect via RDP and Windows Remote Management, Unfiltered Access Out, and Octopus Tentacle Port In.

The configuration I’ve shared above is from our baseline Octopus Tentacle capable image. It comes with an Octopus Tentacle installed, but not configured (because it’s much more reliable to configure it at initialization time in CloudFormation).

The instance configuration is broken into three parts:

  1. Use UserData to run some scripts that configure the proxy (so the machine can get to the internet) and download some dependencies, plus some other miscellaneous configuration (WinRM, Firewall, etc).
  2. Use the Powershell script execution from Packer to run some scripts from the dependencies downloaded in step 1 to download and install an Octopus Tentacle.
  3. Some other miscellaneous configuration.

Nothing too fancy.
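For context, the user data half of that signal file handshake looks something like this (a simplified, hypothetical sketch; the actual proxy/WinRM/firewall configuration is elided):

```powershell
# Hypothetical sketch of the UserData script that pairs with the Packer
# provisioner above. It writes "1" (in progress) to the signal file, does
# its configuration work, then writes "0" on success or "-1" on failure,
# which the provisioner's polling loop reacts to.
$signalFilePath = "C:\signal"
Set-Content -Path $signalFilePath -Value "1"
try
{
    # ... configure proxy, WinRM and firewall, download dependencies ...
    Set-Content -Path $signalFilePath -Value "0"
}
catch
{
    Write-Output $_
    Set-Content -Path $signalFilePath -Value "-1"
}
```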

The Windows support for Packer is still a bit on the rough side, mostly due to the fact that doing this sort of thing with Windows machines is inherently more complicated than it is for Linux boxes. Luckily for me, I started using Packer after the Windows plugins were incorporated into the primary build of the application, so I didn’t have to do anything special to get Windows support.

Gotchas

It definitely wasn’t all smooth sailing though.

The documentation for the creation of Windows AMI’s from Packer is a little sparse, so I had to do some trial and error in order to figure out how everything fit together.

The main mechanism for executing scripts remotely on Windows is WinRM (Windows Remote Management), which is basically Powershell remote execution. As such, you need to make sure that you allow access to the machine over port 5985 or nothing will work. It won’t fail straight away either; it will time out, which can take upwards of 10 minutes.

You also need to make sure that you specify WinRM as the communication method. Most of the template examples on the web use SSH (because Linux), so it’s not immediately obvious that you can actually switch to a different communication method.

Finally, you need to include the EC2 and Bundle config files, to tell the instance that it needs to run sysprep, otherwise it won’t regenerate a new Administrator password when you use the AMI to create a new EC2 instance (and thus you wouldn’t be able to retrieve the password from the AWS API). It will also have saved state on it from last time, so it’s definitely better to run sysprep for an AMI that will be used generically.
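Those config changes can be scripted as part of the image preparation. The sketch below is an assumption based on the EC2Config service layout of the era; the plugin name, file paths and the AutoSysprep property should all be treated as hypothetical and verified against the actual AMI:

```powershell
# Hypothetical sketch: enable the Ec2SetPassword plugin so that instances
# created from the AMI generate a fresh Administrator password on first
# boot. File paths and plugin names are assumptions, not verified values.
$settingsPath = "C:\Program Files\Amazon\Ec2ConfigService\Settings"
$configPath = Join-Path $settingsPath "Config.xml"
$config = [xml](Get-Content $configPath)
$passwordPlugin = $config.Ec2ConfigurationSettings.Plugins.Plugin |
    Where-Object { $_.Name -eq "Ec2SetPassword" }
$passwordPlugin.State = "Enabled"
$config.Save($configPath)
# BundleConfig.xml in the same directory contains an AutoSysprep property
# that needs to be set to "Yes" so the instance is generalised via sysprep
# before the AMI snapshot is taken.
```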

Conclusion

I’ve uploaded a sanitised copy of the repository containing my Packer templates and scripts to GitHub. If you look, you can see that I haven’t done anything particularly fancy. All I’ve done is wrap the execution of Packer in some Powershell scripts to make it easier to run. I have two different scripts to create the two AMIs that we need right now (Octopus capable + pre-installed IIS/.NET Framework), and when you run either of them with the appropriate parameters a brand new, timestamped AMI will be created in the appropriate AWS account.
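The wrapper scripts don’t need to be clever; at their core they just assemble the variable arguments and shell out to the Packer executable, along these lines (all paths and values here are illustrative, not the real ones):

```powershell
# Hypothetical sketch of a wrapper around packer.exe: supply the template
# variables via -var arguments and fail loudly if the build fails.
# $awsKey and $awsSecret would come from script parameters or a
# credential store; they are placeholders here.
$packerExe = "C:\tools\packer\packer.exe"
$arguments = @(
    "build",
    "-var", "aws_access_key=$awsKey",
    "-var", "aws_secret_key=$awsSecret",
    "-var", "aws_region=ap-southeast-2",
    "-var", "ami_name=octopus-tentacle-base",
    "-var", "user_data_file_path=.\scripts\packer\UserData.ps1",
    ".\templates\octopus-tentacle.json"
)
& $packerExe @arguments
if ($LASTEXITCODE -ne 0) { throw "Packer build failed with exit code [$LASTEXITCODE]" }
```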

Creating our own AMIs fixes the scaling issue that started this whole blog post. Since we control them, we can be sure that they won’t be deleted, and our ability to scale via Auto Scaling Groups is maintained for the life of the environment. Another benefit of this approach is that the provisioning of an environment is now quicker, as some of the components (especially IIS/.NET Framework) are now pre-installed for the components that require them. Considering our environments can take upwards of 20 minutes to provision, every minute counts.

The whole process of creating these AMIs via Packer took me about a day or two, so it definitely wasn’t the most time consuming task I’ve ever completed.

Incorporating the AMIs into our environment provisioning scripts was trivial; they already searched for the appropriate AMI to use dynamically, so I just had to change the search parameters.
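Because Packer appends {{timestamp}} to the ami_name, "find the latest matching AMI" is just a name filter plus a sort; with the AWS Tools for PowerShell it might look like this (the name prefix is a placeholder):

```powershell
# Hypothetical sketch: locate the most recent custom AMI whose name
# matches the prefix used in the Packer template, relying on the
# timestamp suffix (via CreationDate) to pick the newest one.
$latestAmi = Get-EC2Image -Owner "self" -Filter @{ Name = "name"; Values = "octopus-tentacle-base-*" } |
    Sort-Object -Property CreationDate -Descending |
    Select-Object -First 1
$amiId = $latestAmi.ImageId
```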

In the end I’m fairly pleased with Packer and how easy it made the AMI creation process. If I had to use the AWS Powershell cmdlets (or the CLI app) directly for all of this, I probably would have wasted a lot of time.

And sanity.


A while back I made a post about how cfn-init was failing with an error that didn’t seem to make any sense (utf8 codec can’t decode, invalid continuation byte). At the time, I came to the conclusion that the reason why it was failing was the length of the output.

That was actually incorrect.

The reason I came to that conclusion was that the action that seemed to fix the issue was whether or not the -Verbose flag was present on the Powershell script being called. If I had the flag on, I would get the error. Off, and everything was fine.

At first I trusted the error. I assumed that somewhere in my output there was actually an invalid set of bytes, at least as far as utf8 is concerned. It seemed entirely plausible, considering it was Ruby, parsing the output from a Powershell script, which was running an executable directly. So many layers, so many different ways in which it could break, the least of which would be output stream encoding incompatibilities.

My initial investigation into the output stream didn’t seem to show any invalid bytes, so I assumed that cfn-init was doing something stupid, and truncating the output stream because it was too long. If it truncated the stream, it was feasible that it was shearing a UTF8 byte pair in half, hence the invalid continuation byte error. It made sense to me, because:

  1. I had only recently added a large amount of extra verbose logging to the scripts, and that seemed to be the primary difference between the version that worked and the version that failed.
  2. I’ve had issues with AWS components giving arcane errors when you exceed some arbitrary length before. The Powershell cmdlet New-CfnStack will fail if you try to upload a template directly and the template is too long. It also gives incredibly unhelpful errors about malformed XML when, in reality, it’s a length issue.

I accepted that I could not use verbose logging for the deployment and moved on, but it always bugged me.

I don’t like losing Verbose logging. It’s extremely useful for when things go bad, which they do. All the time.

Round Two (or Three?)

I got a chance to go back to the deployment scripts recently, because some deployments were failing and I wasn’t sure why. I needed the verbose logging back, so I re-added the -Verbose flag and tried to get to the bottom of why it was failing.

My first attempt simply commented out the lines that I knew could be particularly spammy (search functions testing a predicate against a collection of objects).

The error still occurred.

I ran the script by itself (in a clean environment) and the output wasn’t even that long. I realised that I had made an incorrect conclusion from my initial investigation. It definitely wasn’t a length issue.

It was time to do a manual binary search.

I knew which script was failing, so I suppressed verbose output in half the script and tested it again to see if it failed. It’s easy enough to temporarily suppress verbose output in Powershell, if a bit messy.

$pref = $VerbosePreference
$VerbosePreference = "SilentlyContinue"

# ... operations here ...

$VerbosePreference = $pref
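If that pattern crops up a lot, it’s worth wrapping in a small helper with a try/finally, so the preference gets restored even when the operations throw. The function name here is mine, not from the real scripts:

```powershell
# Hypothetical helper around the suppress/restore pattern above. The
# try/finally guarantees $VerbosePreference is put back even if the
# supplied script block fails part way through.
function Invoke-WithoutVerbose
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [scriptblock]$Script
    )

    $previous = $VerbosePreference
    $VerbosePreference = "SilentlyContinue"
    try
    {
        & $Script
    }
    finally
    {
        $VerbosePreference = $previous
    }
}
```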

What followed was a few hours of manually searching for the piece of script that broke cfn-init. I couldn’t just run it locally, because everything worked just fine, I had to instantiate an environment and get it to initialise itself as it usually would to see if the script would fail or not. It had to be a fresh environment too, because if I ran cfn-init again on an environment that had already failed, it would work just fine.

An environment takes at least 10 minutes to reach the point of failure.

It was the slowest binary search ever.

Something About Rooting Issues?

Eventually I got to the root of the problem. I was piping the output from nuget.exe (used for package installation) to the Verbose stream. Somewhere in the output from nuget, there was actually (maybe?) an invalid UTF8 character, according to the code used by cfn-init anyway. The reason it didn’t fail if you ran it a second time, was because that component was already successfully installed, so it didn’t try to install it again.

I could leave every Write-Verbose statement in place except for that one, and it would all work fine. This meant that I could finally get the verbose output from the Octopus deployment back, which was the main thing I wanted. Sure I could see it through Octopus, but I like all of my information to be in one place, because it just makes everything easier.

To complicate matters further, it wasn’t all nuget output that triggered the issue. For example, the 7Zip command line tools are one of the first components to be installed, in order to make sure the AWS cmdlets are available (they are distributed to each machine as a 7zip file). That particular component would install fine. It seemed to only be the Octopus Client package (the one that contains the .NET libraries) that caused the issue. I’m still not 100% sure of that to be honest, I was mostly just happy to get my verbose deployment errors back in the cfn-init logs, and I had to move on to something else.
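The eventual fix was just deleting that single statement, but if the nuget output is still wanted, a variation like the following would keep it without ever handing it to cfn-init (hypothetical, not what the real scripts do; the paths are placeholders):

```powershell
# Hypothetical alternative to piping nuget.exe output into the Verbose
# stream: merge stdout/stderr and append it to a log file instead, so
# cfn-init never has to decode whatever bytes nuget happens to emit.
$nugetOutput = & nuget.exe install "Octopus.Client" -OutputDirectory "C:\packages" 2>&1
$nugetOutput | Out-File -FilePath "C:\logs\nuget-install.log" -Append
```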

To be honest, the fact that I don’t know for sure why the nuget installation output breaks the script is another application of the same mistake I made the first time, and I’m almost certainly going to have to revisit it at some point in the future, where I will no doubt discover something else entirely is actually the root cause. This solution is still better than the original one though, which is enough for now.

Conclusion

Sometimes in software you think you know the answer, but it turns out you’ve made some poor assumptions or conclusions, and just completely missed the mark. In my case, the ramifications were not particularly bad, I was just missing out on informational output, but that’s not always the case. Sometimes the results of a poor assumption/conclusions can be much worse.

The important thing to note is that I stopped investigating once I found something that made the bug go away, instead of investigating in more depth as to why the action made the bug go away. Sometimes this is a compromise (usually as a result of time constraints/pressure), but I’ve found it’s almost always worthwhile to spend the time to understand the why of an issue before accepting a solution. Obviously at some stage you do need to just accept that you don’t understand the problem fully, so like everything in software development, it’s a balancing act.

At the very least if you investigate for a little bit longer, you’ll understand the problem better, which can only be a good thing.