
Full disclosure: most of the Elastalert-related work was actually done by a colleague of mine; I’m just writing about it because I thought it was interesting.

Last week I did a bit of an introduction to Elastalert, as it is the new mechanism that we use to alert on the data in our ELK stack.

We take our infrastructure pretty seriously though, so I didn’t want to just manually create an Elastalert instance and set it up by hand. It all needs to be codified and controlled, with a deployment pipeline for distributing changes (like new or modified rules), and everything needs to be versioned appropriately.

After doing some very high level playing around (just to make sure it all worked more or less as advertised), it was time to do it properly and set up an auto-scaling, auto-healing Elastalert environment, just like all of the other ones.

Packing It Away

Installing Elastalert is pretty straightforward.

It’s all Python based, so it’s a fairly simple matter to use pip to install the package:

pip install elastalert

This doesn’t quite work out of the box on an Amazon Linux EC2 instance though, as you also have to install some dependencies that are not immediately obvious:

sudo yum update -y;
sudo yum install gcc gcc-c++ -y;
sudo yum install libffi-devel -y;
sudo yum install openssl-devel -y;
sudo pip install elastalert;

With that out of the way, the machine is basically ready to run Elastalert, assuming you configure it correctly (as per the documentation).
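For reference, a minimal config.yaml might look something like the following (the hostname and values here are placeholders, not our real settings):

# Directory holding the rule files
rules_folder: rules

# How often Elastalert queries Elasticsearch for new data
run_every:
  minutes: 1

# Window of additional data to query, to account for ingestion lag
buffer_time:
  minutes: 15

# The Elasticsearch cluster to query (placeholder hostname)
es_host: elasticsearch.example.com
es_port: 9200

# Index used by Elastalert to store its own state
writeback_index: elastalert_status

# If an alert fails, keep retrying it for up to two days
alert_time_limit:
  days: 2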

With a relatively self contained installation script out of the way, it was time to create an AMI containing Elastalert using Packer, to be used inside the impending environment.

The Packer configuration for an AMI with Elastalert installed on it is pretty straightforward, and just follows the normal pattern, which I described in this post and which you can see directly in this Github repository. The only meaningful difference is the script that installs Elastalert itself, which you can see above.
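As a rough sketch (the region, AMI ID and paths below are illustrative, not lifted from our actual configuration), the interesting part of such a Packer template boils down to something like this:

{
    "builders": [
        {
            "type": "amazon-ebs",
            "region": "ap-southeast-2",
            "source_ami": "ami-xxxxxxxx",
            "instance_type": "t2.micro",
            "ssh_username": "ec2-user",
            "ami_name": "elastalert-{{timestamp}}"
        }
    ],
    "provisioners": [
        {
            "type": "shell",
            "script": "./scripts/install-elastalert.sh"
        }
    ]
}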

Cumulonimbus Clouds Are My Favourite

With an AMI created and ready to go, all that’s left is to create a simple environment to run it in.

Nothing fancy, just a CloudFormation template with a single Auto Scaling Group in it, such that accidental or unexpected terminations self-heal. No need for a load balancer, DNS entries or anything like that; it’s a purely background process that sits quietly and yells at us as appropriate.
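The guts of that template end up looking something like this (a simplified fragment; the resource and parameter names are illustrative). The CreationPolicy is the part that makes CloudFormation wait for a success signal from the instance (via cfn-signal), which becomes relevant shortly:

"ElastalertAutoScalingGroup": {
    "Type": "AWS::AutoScaling::AutoScalingGroup",
    "Properties": {
        "LaunchConfigurationName": { "Ref": "ElastalertLaunchConfiguration" },
        "MinSize": "1",
        "MaxSize": "1",
        "VPCZoneIdentifier": { "Ref": "PrivateSubnets" }
    },
    "CreationPolicy": {
        "ResourceSignal": { "Count": "1", "Timeout": "PT15M" }
    }
}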

Again, this is a problem that we’ve solved before, and we have a decent pattern in place for putting this sort of thing together.

  • A dedicated repository for the environment, containing the CloudFormation template, configuration and deployment logic
  • A TeamCity Build Configuration, which uses the contents of this repository and builds and tests a versioned package
  • An Octopus project, which contains all of the logic necessary to target the deployment, along with any environment level variables (like target ES cluster)

The good news was that the standard environment stuff worked perfectly. It built, a package was created and that package was deployed.

The bad news was that the deployment never actually completed successfully, because the Elastalert AMI failed to produce a working EC2 instance, which meant that the environment failed miserably as the Auto Scaling Group never received a success signal.

But why?

Snakes Are Tricky

It actually took us a while to get to the bottom of the problem, because Elastalert appeared to be fully functional at the end of the Packer process, but the AMI created from that EC2 instance seemed to be fundamentally broken.

Any EC2 instance created from that AMI just didn’t work, regardless of how we used it (i.e. CloudFormation vs manual instance creation, nothing mattered).

The instance would be created and it would “go green” (i.e. the AWS status checks and whatnot would complete successfully) but we couldn’t connect to it using any of the normal mechanisms (SSH using the specified key being the most obvious). It was like none of the normal EC2 setup was being executed, which was weird, because we’ve created many different AMIs through Packer and we hadn’t done anything differently this time.

Looking at the system log for the broken EC2 instances (via the AWS Dashboard), we could see that the core setup procedure of the EC2 instance (where it uses the supplied key file to set up access, among other things) was failing due to problems with Python.

What else uses Python?

That’s right, Elastalert.

It turned out that our Elastalert installation script was updating some Python dependencies that the EC2 initialization process relied on, and those updates had completely broken the normal setup procedure.

The AMI was functionally useless.

Dock Worker

We went through a few different approaches to try and fix the underlying dependency conflicts, but in the end we settled on using Docker.

At a very high level, Docker is a kind of virtualization platform, except it doesn’t virtualize the entire OS; instead it sits a little bit above that, virtualizing a set of applications and leveraging the OS rather than simulating the entire thing. Each Docker container generally hosts a single application in a completely isolated environment, which makes it the perfect solution when you have system software conflicts like we did.

Of course, we had to change our strategy somewhat in order for this to work.

Instead of using Packer to create an AMI with Elastalert installed, we now have to create an AMI with Docker (and Octopus) installed and available.

Same pattern as before, just different software being installed.
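For the record, installing Docker on Amazon Linux is mercifully short compared to the Elastalert script (the Octopus Tentacle installation is its own thing, omitted here):

sudo yum update -y;
sudo yum install docker -y;
sudo service docker start;
sudo usermod -a -G docker ec2-user;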

Nothing much changed in the environment though, as it’s still just an Auto Scaling Group spinning up an EC2 instance using the specified AMI.

The big changes were in the Elastalert configuration deployment, which now had to be responsible for both deploying the actual configuration and making sure the Elastalert Docker image was correctly configured and running.
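At its core, that deployment boils down to something like the following (a sketch only; the image name and mount paths are placeholders, not our actual values):

# Stop and remove any existing Elastalert container, ignoring errors if there isn't one
docker stop elastalert || true;
docker rm elastalert || true;

# Start a new container, mounting the freshly deployed configuration and rules
docker run -d --name elastalert \
    -v /opt/elastalert/config.yaml:/opt/elastalert/config.yaml \
    -v /opt/elastalert/rules:/opt/elastalert/rules \
    our-registry/elastalert:latest;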

To Be Continued

And that is as good a place as any to stop for now.

Next week I’ll explain what our original plan was for the Elastalert configuration deployment and how that changed when we switched to using Docker to host an Elastalert image.


With environments and configuration out of the way, it’s time to put all of the pieces together.

Obviously this isn’t the first time that both of those things have been put together. In order to validate that everything was working as expected, I was constantly creating/updating environments and deploying new versions of the configuration to them. Not only that, but with the way our deployment pipeline works (commit, push, build, test, deploy [test], [deploy]), the CI environment had been up and running for some time.

What’s left to do then?

Well, we still need to create the Staging and Production environments, which should be easy because those are just deployments inside Octopus now.

The bigger chunk of work is to use those new environments and to redirect all of our existing log traffic as appropriate.

Hostile Deployment

This is a perfect example of why I spend time on automating things.

With the environments set up to act just like everything else in Octopus, all I had to do to create a Staging environment was push a button. Once the deployment finished and the infrastructure was created, it was just another button push to deploy the configuration for that environment and make it operational. Rinse and repeat for all of the layers (Broker, Indexer, Cache, Elasticsearch) and Staging is up and running.

Production was almost the same, with one caveat. We use an entirely different AWS account for all our production resources, so we had to override all of the appropriate Octopus variables for each environment project (like AWS credentials, VPC ID, subnet IDs, etc). With those overrides in place, all that’s left is to make new releases (to capture the variables) and deploy them to the appropriate environments.

It’s nice when everything works.

Redirecting Firepower

Of course, the new logging environments are worthless without log events. Luckily, we have plenty of those:

  • IIS logs from all of our APIs
  • Application logs from all of our APIs
  • ELB logs from a subset of our load balancers, most of which are APIs, but at least one is an Nginx router
  • Simple performance counter statistics (like CPU, Memory, Disk, etc) from basically every EC2 instance
  • Logs from on-premises desktop applications

We generally have CI, Staging and prod-X (green/blue) environments for all of our services/APIs (because that’s how our build/test/deployment pipeline works), so now that we have similarly partitioned logging environments, all we have to do is line them up (CI to CI, Staging to Staging and so on).

For the on-premises desktop applications, there is no CI, but they do generally have the capability to run in Staging mode, so we can use that setting to direct log traffic.

There are a few ways in which the log events hit the Broker layer:

  • Internal Logstash instance running on an EC2 instance with a TCP output pointing at the Broker hostname
  • Internal Lambda function writing directly to the Broker hostname via TCP (this is the ELB logs processor)
  • External application writing to an authenticated Logging API, which in turn writes to the Broker via TCP (this is for the on-premises desktop applications)

We can change the hostname used by all of these mechanisms simply by changing some variables in Octopus Deploy, making a new release and deploying it through the environments.
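For example, the Logstash case is just a tcp output whose host comes from an Octopus variable binding (the variable name and port below are illustrative):

output {
    tcp {
        # Octopus substitutes this binding at deployment time with the Broker
        # hostname appropriate to the target environment
        host => "#{Broker.Hostname}"
        port => 6379
        codec => "json_lines"
    }
}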

And that’s exactly what we did, making sure to monitor the log event traffic for each one to make sure we didn’t lose anything.

With all the logs going to their new homes, all that was left to do was destroy the old log stack, easily done manually through CloudFormation.

You might be wondering about the log events that were stored in the old stack. Well, we generally only keep around 14 days’ worth of log events in the stack itself (because there are so many), so we pretty much just left the old stack up for searching purposes until it was no longer relevant, and then destroyed it.

Conclusion

And that basically brings us to the end of this series of posts about our logging environment and the reclamation thereof.

We’ve now got our standard deployment pipeline in place for both infrastructure and configuration and have separated our log traffic accordingly.

This puts us in a much better position moving forward. Not only is the entire thing fully under our control, but we now have the capability to test changes to infrastructure/configuration before just deploying them into production, something we couldn’t do before when we only had a single stack for everything.

In all fairness though, all we really did was polish an existing system so that it was a better fit for our specific usage.

Evolutionary, not revolutionary.


With all of the general context and solution outlining done for now, it’s time to delve into some of the details. Specifically, the build/test/deploy pipeline for the log stack environments.

Unfortunately, we use the term environment to describe two things. The first is an Octopus environment, which is basically a grouping construct inside Octopus Deploy, like CI or prod-green. The second is a set of infrastructure intended for a specific purpose, like an Auto Scaling Group and Load Balancer intended to host an API. In the case of the log stack, we have distinct environments for the infrastructure for each layer, like the Broker and the Indexer.

Our environments are all conceptually similar, there is a Git repository that contains everything necessary to create or update the infrastructure (CloudFormation templates, Powershell scripts, etc), along with the logic for what it means to build and validate a Nuget package that can be used to manage the environment. The repository is hooked up to a Build Configuration in TeamCity which runs the build script and the resulting versioned package is uploaded to our Nuget server. The package is then used in TeamCity via other Build Configurations to allow us to Create, Delete, Migrate and otherwise interact with the environment in question.

The creation of this process has happened in bits and pieces over the last few years, most of which I’ve written about on this blog.

It’s a decent system, and I’m proud of how far we’ve come and how much automation is now in place, but it’s certainly not without its flaws.

Bestial Rage

The biggest problem with the current process is that while the environment is fully encapsulated as code inside a validated and versioned Nuget package, actually using that package to create or delete an environment is not as simple as it could be. As I mentioned above, we have a set of TeamCity Build Configurations for each environment that allow for the major operations like Create and Delete. If you’ve made changes to an environment and want to deploy them, you have to decide what sort of action is necessary (i.e. “it’s the first time, Create” or “it already exists, Migrate”) and “run” the build, which will download the package and run the appropriate script.

This is where it gets a bit onerous, especially for production. If you want to change any of the environment parameters from the default values the package was built with, you need to provide a set of parameter overrides when you run the build. For production, this means you often end up overriding everything (because production is a separate AWS account), which can be upwards of 10 different parameters, all of which are only visible if you go and look at the source CloudFormation template. You have to do this every time you want to execute that operation (although you can copy the parameters from previous runs, which acts as a small shortcut).

The issue with this is that it means production deployments become vulnerable to human error, which is one of the things we’re trying to avoid by automating in the first place!

Another issue is that we lack a true “Update” operation. We only have Create, Delete, Clone and Migrate.

This is entirely my fault, because when I initially put the system together I had a bad experience with the CloudFormation Update command where I accidentally wiped out an S3 bucket containing customer data. As is often the case, that fear then led to an alternate (worse) solution involving cloning, checking, deleting, cloning, checking and deleting (in that order). This was safer, but incredibly slow and prone to failure.

The existence of these two problems (hard to deploy, slow failure-prone deployments) is reason enough for me to consider exploring alternative approaches for the log stack infrastructure.

Fantastic Beasts And Where To Find Them

The existing process does do a number of things well though, and has:

  • A Nuget package that contains everything necessary to interact with the environment.
  • Environment versioning, because that’s always important for traceability.
  • Environment validation via tests executed as part of the build (when possible).

Keeping those three things in mind, and combining them with the desire to ease the actual environment deployment, an improved approach looks a lot like our typical software development/deployment flow.

  1. Changes to environment are checked in
  2. Changes are picked up by TeamCity, and a build is started
  3. Build is tested (i.e. a test environment is created, validated and destroyed)
  4. Versioned Nuget package is created
  5. Package is uploaded to Octopus
  6. Octopus Release is created
  7. Octopus Release is deployed to CI
  8. Secondary validation (i.e. test CI environment to make sure it does what it’s supposed to do after deployment)
  9. [Optional] Propagation of release to Staging

In comparison to our current process, the main difference is the deployment. Prior to this, we were treating our environments as libraries (i.e. they were built, tested, packaged and uploaded to MyGet to be used by something else). Now we’re treating them as self contained deployable components, responsible for knowing how to deploy themselves.

With the approach settled, all that’s left is to come up with an actual deployment process for an environment.

Beast Mastery

There are two main cases we need to take care of when deploying a CloudFormation stack to an Octopus environment.

The first case is what to do when the CloudFormation stack doesn’t exist.

This is the easy case, all we need to do is execute New-CFNStack with the appropriate parameters and then wait for the stack to finish.

The second case is what we should do when the CloudFormation stack already exists, which is the case that is not particularly well covered by our current environment management process.

Luckily, CloudFormation makes this relatively easy with the Update-CFNStack command. Updates are dangerous (as I mentioned above), but if you’re careful with resources that contain state, they are pretty efficient. The implementation of the update is quite smart as well, and will only update the things that have changed in the template (i.e. if you’ve only changed the Load Balancer, it won’t recreate all of your EC2 instances).

The completed deployment script is shown in full below.

[CmdletBinding()]
param
(

)

$here = Split-Path $script:MyInvocation.MyCommand.Path;
$rootDirectory = Get-Item ($here);
$rootDirectoryPath = $rootDirectory.FullName;

$ErrorActionPreference = "Stop";

$component = "unique-stack-name";

if ($OctopusParameters -ne $null)
{
    $parameters = ConvertFrom-StringData ([System.IO.File]::ReadAllText("$here/cloudformation.parameters.octopus"));

    $awsKey = $OctopusParameters["AWS.Deployment.Key"];
    $awsSecret = $OctopusParameters["AWS.Deployment.Secret"];
    $awsRegion = $OctopusParameters["AWS.Deployment.Region"];
}
else 
{
    $parameters = ConvertFrom-StringData ([System.IO.File]::ReadAllText("$here/cloudformation.parameters.local"));
    
    $path = "C:\creds\credentials.json";
    Write-Verbose "Attempting to load credentials (AWS Key, Secret, Region, Octopus Url, Key) from local, non-repository stored file at [$path]. This is done this way to allow for a nice development experience in vscode"
    $creds = ConvertFrom-Json ([System.IO.File]::ReadAllText($path));
    $awsKey = $creds.aws."aws-account".key;
    $awsSecret = $creds.aws."aws-account".secret;
    $awsRegion = $creds.aws."aws-account".region;

    $parameters["OctopusAPIKey"] = $creds.octopus.key;
    $parameters["OctopusServerURL"] = $creds.octopus.url;
}

$parameters["Component"] = $component;

$environment = $parameters["OctopusEnvironment"];

. "$here/scripts/common/Functions-Aws.ps1";
. "$here/scripts/common/Functions-Aws-CloudFormation.ps1";

Ensure-AwsPowershellFunctionsAvailable

$tags = @{
    "environment"=$environment;
    "environment:version"=$parameters["EnvironmentVersion"];
    "application"=$component;
    "function"="logging";
    "team"=$parameters["team"];
}

$stackName = "$environment-$component";

$exists = Test-CloudFormationStack -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion -StackName $stackName

$cfParams = ConvertTo-CloudFormationParameters $parameters;
$cfTags = ConvertTo-CloudFormationTags $tags;
# Use an explicitly named hashtable for splatting, rather than clobbering the automatic $args variable
$stackArgs = @{
    "StackName"=$stackName;
    "TemplateBody"=[System.IO.File]::ReadAllText("$here\cloudformation.template");
    "Parameters"=$cfParams;
    "Tags"=$cfTags;
    "AccessKey"=$awsKey;
    "SecretKey"=$awsSecret;
    "Region"=$awsRegion;
    "Capabilities"="CAPABILITY_IAM";
};

if ($exists)
{
    Write-Verbose "The stack [$stackName] exists, so I'm going to update it. Its better this way"
    $stackId = Update-CFNStack @stackArgs;

    $desiredStatus = [Amazon.CloudFormation.StackStatus]::UPDATE_COMPLETE;
    $failingStatuses = @(
        [Amazon.CloudFormation.StackStatus]::UPDATE_FAILED,
        [Amazon.CloudFormation.StackStatus]::UPDATE_ROLLBACK_IN_PROGRESS,
        [Amazon.CloudFormation.StackStatus]::UPDATE_ROLLBACK_COMPLETE
    );
    Wait-CloudFormationStack -StackName $stackName -DesiredStatus  $desiredStatus -FailingStates $failingStatuses -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion

    Write-Verbose "Stack [$stackName] Updated";
}
else 
{
    Write-Verbose "The stack [$stackName] does not exist, so I'm going to create it. Just watch me"
    $stackArgs.Add("DisableRollback", $true);
    $stackId = New-CFNStack @stackArgs;

    $desiredStatus = [Amazon.CloudFormation.StackStatus]::CREATE_COMPLETE;
    $failingStatuses = @(
        [Amazon.CloudFormation.StackStatus]::CREATE_FAILED,
        [Amazon.CloudFormation.StackStatus]::ROLLBACK_IN_PROGRESS,
        [Amazon.CloudFormation.StackStatus]::ROLLBACK_COMPLETE
    );
    Wait-CloudFormationStack -StackName $stackName -DesiredStatus  $desiredStatus -FailingStates $failingStatuses -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion

    Write-Verbose "Stack [$stackName] Created";
}

Other than the Create/Update logic that I’ve already talked about, the only other interesting thing in the deployment script is the way that it deals with parameters.

Basically, if the script detects that it’s being run from inside Octopus Deploy (via the presence of an $OctopusParameters variable), it will load all of its parameters (as a hashtable) from a particular local file. This file leverages the Octopus variable substitution feature, so that when we deploy the infrastructure to the various environments, it gets the appropriate values (like a different VPC, because prod is a separate AWS account to CI). When it’s not running in Octopus, it just uses a different file, structured very similarly, with test/scratch values in it.
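As an illustration, the Octopus flavour of that file is just the key=value format that ConvertFrom-StringData understands, with Octopus variable bindings for the values (the keys shown here are examples, not the complete set; AWS.VPC.Id and Team.Name are hypothetical project variables):

OctopusEnvironment=#{Octopus.Environment.Name}
EnvironmentVersion=#{Octopus.Release.Number}
VpcId=#{AWS.VPC.Id}
team=#{Team.Name}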

With the deployment script in place, we plug the whole thing into our existing “deployable” component structure and we have automatic deployment of tested, versioned infrastructure via Octopus Deploy.

Conclusion

Of course, being a first version, the deployment logic that I’ve described above is not perfect. For example, there is no support for deploying to an environment where the stack is in error (failing stacks can’t be updated, but they already exist, so you have to delete them and start again), and there is little to no feedback available if a stack creation/update fails for some reason.
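Dealing with that first gap would probably look something like the following check before the Create/Update decision (a sketch of a possible approach, not something the script currently does):

# A stack that rolled back during creation can't be updated; it has to be
# deleted and recreated from scratch
$stack = Get-CFNStack -StackName $stackName -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion;
if ($stack.StackStatus -eq [Amazon.CloudFormation.StackStatus]::ROLLBACK_COMPLETE)
{
    Remove-CFNStack -StackName $stackName -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion -Force;
    # wait for the deletion to complete, then fall through to the creation logic
}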

Additionally, the code could benefit from being extracted to a library for reuse.

All in all, the deployment process I just described is a lot simpler than the one I described at the start of this post, and it’s managed by Octopus, which makes it consistent with the way we do everything else, which is nice.

With a little bit more polish, and some pretty strict usage of the CloudFormation features that stop you accidentally deleting databases full of valuable data, I think it will be a good replacement for what we do now.


It’s that time again kids: time to continue the series of posts about how we improved the processing of our ELB logs into our ELK stack using AWS Lambda.

You can find the introduction to this whole adventure here, but last time I wrote about the Javascript content of the Lambda function that does the bulk of the work.

This time I’m going to write about how we incorporated the creation of that Lambda function into our environment management strategy and some of the tricks and traps therein.

On a completely unrelated note, it would be funny if this blog post turned up in search results for Half-Life 3.

We’ve Got Hostiles!

I’ve pushed hard to codify our environment setup where I work. The main reason for this is reproducibility, but the desire comes from a long history of interacting with manually set up environments that are lorded over by one guy who just happened to know the guy who originally set them up, and where everyone is terrified of changing or otherwise touching said environments.

It’s a nightmare.

As far as environment management goes, I’ve written a couple of times about environment related things on this blog, one of the most recent being the way in which we version our environments. To give some more context for this post, I recommend you go and read at least the versioning post in order to get a better understanding of how we do environment management. Our strategy is still a work in progress, but it’s getting better all the time.

Regardless of whether or not you followed my suggestion, we use a combination of versioned Nuget packages, Powershell, CloudFormation and Octopus Deploy to create an environment, where an environment is a self contained chunk of infrastructure and code that performs some sort of role, the most common of which is acting as an API. We work primarily with EC2 instances (behind Elastic Load Balancers managed via Auto Scaling Groups), and historically, we’ve deployed Logstash to each instance alongside the code to provide log aggregation (IIS, Application, System Stats, etc). When it comes to capturing and aggregating ELB logs, we include a standalone EC2 instance in the environment, also using Logstash. This standalone instance is the part of the system that we are aiming to replace with the Lambda function.

Because we make extensive use of CloudFormation, incorporating the creation of a Lambda function into an environment that needs to have ELB logs processed is a relatively simple affair.

Simple in that it fits nicely with our current approach. Getting it all to work as expected was still a massive pain.

Blast Pit

Below is a fragment of a completed CloudFormation template for reference purposes.

In the interests of full disclosure, I did not write most of the following fragment, another member of my team was responsible. I just helped.

{
    "Description": "This template is a fragment of a larger template that creates an environment. This fragment in particular contains all of the necessary bits and pieces for a Lambda function that processes ELB logs from S3.",
    "Parameters": {
        "ComponentName": {
            "Description": "The name of the component that this stack makes up. This is already part of the stack name, but is here so it can be used for naming/tagging purposes.",
            "Type": "String"
        },
        "OctopusEnvironment": {
            "Description": "Octopus Environment",
            "Type": "String"
        },
        "PrivateSubnets": {
            "Type": "List<AWS::EC2::Subnet::Id>",
            "Description": "Public subnets (i.e. ones that are automatically assigned public IP addresses) spread across availability zones, intended to contain load balancers and other externally accessible components.",
            "ConstraintDescription": "must be a list of an existing subnets in the selected Virtual Private Cloud."
        },
        "LogsS3BucketName": {
            "Description": "The name of the bucket where log files for the ELB and other things will be placed.",
            "Type": "String"
        },
        "VpcId": {
            "Description": "The ID of the VPC that the Lambda function's security group is created in (referenced by LambdaSecurityGroup below).",
            "Type": "AWS::EC2::VPC::Id"
        }
    },
    "Resources": {
        "LogsBucket" : {
            "Type" : "AWS::S3::Bucket",
            "Properties" : {
                "BucketName" : { "Ref": "LogsS3BucketName" },
                "LifecycleConfiguration": {
                    "Rules": [
                        {
                            "Id": 1,
                            "ExpirationInDays": 7,
                            "Status": "Enabled"
                        }
                    ]
                },
                "Tags" : [
                    {
                        "Key": "function",
                        "Value": "log-storage"
                    }
                ],
                "NotificationConfiguration" : {
                  "LambdaConfigurations": [
                    {
                      "Event" : "s3:ObjectCreated:*",
                      "Function" : { "Fn::GetAtt" : [ "ELBLogProcessorFunction", "Arn" ] }
                    }
                  ]
                }
            }
        },
        "ELBLogProcessorFunctionPermission": {
            "Type" : "AWS::Lambda::Permission",
            "Properties" : {
                "Action":"lambda:invokeFunction",
                "FunctionName": { "Fn::GetAtt": [ "ELBLogProcessorFunction", "Arn" ]},
                "Principal": "s3.amazonaws.com",
                "SourceAccount": {"Ref" : "AWS::AccountId" },
                "SourceArn": {
                    "Fn::Join": [":", [ "arn","aws","s3","", "" ,{"Ref" : "LogsS3BucketName"}]]
                }
            }
        },
        "LambdaSecurityGroup": {
            "Type": "AWS::EC2::SecurityGroup",
            "Properties": {
                "GroupDescription": "Enabling all outbound communications",
                "VpcId": {
                    "Ref": "VpcId"
                },
                "SecurityGroupEgress": [
                    {
                        "IpProtocol": "tcp",
                        "FromPort": "0",
                        "ToPort": "65535",
                        "CidrIp": "0.0.0.0/0"
                    }
                ]
            }
        },
        "ELBLogProcessorFunction": {
          "Type": "AWS::Lambda::Function",
          "Properties": {
            "FunctionName": { "Fn::Join": [ "", [ { "Ref" : "ComponentName" }, "-", { "Ref" : "OctopusEnvironment" }, "-ELBLogProcessorFunction"  ] ] },
            "Description": "ELB Log Processor",
            "Handler": "index.handler",
            "Runtime": "nodejs4.3",
            "Code": {
              "ZipFile": "console.log('placeholder for lambda code')"
            },
            "Role": { "Fn::GetAtt" : ["LogsBucketAccessorRole", "Arn"]},
            "VpcConfig": {
              "SecurityGroupIds": [{"Fn::GetAtt": ["LambdaSecurityGroup", "GroupId"]}],
              "SubnetIds": { "Ref": "PrivateSubnets" }
            }
          }
        },
        "LogsBucketAccessorRole": {
          "Type": "AWS::IAM::Role",
          "Properties": {
            "AssumeRolePolicyDocument": {
                "Version": "2012-10-17",
                "Statement": [
                    {
                        "Effect": "Allow",
                        "Principal": { "Service" : ["lambda.amazonaws.com"]},
                        "Action": [
                            "sts:AssumeRole"
                        ]
                    }
                ]
            },
            "Path": "/",
            "Policies": [{ 
              "PolicyName": "access-s3-read",
              "PolicyDocument": {
                "Version": "2012-10-17",
                "Statement": [
                    {
                        "Effect": "Allow",
                        "Action": [
                            "s3:GetObject"
                        ],
                        "Resource": {
                            "Fn::Join": [":", [ "arn","aws","s3","", "" ,{"Ref" : "LogsS3BucketName"}, "/*"]]
                        }
                    }
                ]
              }
            },
            {
              "PolicyName": "access-logs-write",
              "PolicyDocument": {
                "Version": "2012-10-17",
                "Statement": [
                    {
                        "Effect": "Allow",
                        "Action": [
                            "logs:CreateLogGroup",
                            "logs:CreateLogStream",
                            "logs:PutLogEvents",
                            "logs:DescribeLogStreams"
                        ],
                        "Resource": {
                            "Fn::Join": [":", [ "arn","aws","logs", { "Ref": "AWS::Region" }, {"Ref": "AWS::AccountId"} , "*", "/aws/lambda/*"]]
                        }
                    }
                ]
              }
            },
            {
              "PolicyName": "access-ec2",
              "PolicyDocument": {
                "Version": "2012-10-17",
                "Statement": [
                    {
                        "Effect": "Allow",
                        "Action": [
                          "ec2:*"
                        ],
                        "Resource": "arn:aws:ec2:::*"
                    }
                ]
              }
            },
            {
              "PolicyName": "access-ec2-networkinterface",
              "PolicyDocument": {
                  "Version": "2012-10-17",
                  "Statement": [
                    {
                      "Effect": "Allow",
                      "Action": [
                        "ec2:DescribeInstances",
                        "ec2:CreateNetworkInterface",
                        "ec2:AttachNetworkInterface",
                        "ec2:DescribeNetworkInterfaces",
                        "ec2:DeleteNetworkInterface",
                        "ec2:DetachNetworkInterface",
                        "ec2:ModifyNetworkInterfaceAttribute",
                        "ec2:ResetNetworkInterfaceAttribute",
                        "autoscaling:CompleteLifecycleAction"
                      ],
                      "Resource": "*"
                    }
                  ]
                }
              }
            ]
          }
        }
    }
}

The most important part of the template above is the ELBLogProcessorFunction. This is where the actual Lambda function is specified, although you might notice that it does not actually have the code from the previous post attached to it in any way. The reason for this is that we create the Lambda function with placeholder code, and then use Octopus Deploy afterwards to deploy a versioned package containing the actual code to the Lambda function, like we do for everything else. Packaging and deploying the Lambda function code is a topic for another blog post though (the next one, hopefully).

Other things to note in the template fragment:

  • Lambda functions require a surprising amount of permissions to do what they do. When creating a function using the AWS website, most of this complexity is dealt with for you. When using CloudFormation however, you have to be aware of what the Lambda function needs and give it the appropriate permissions. You could just give the Lambda function as many permissions as possible, but that would be stupid in the face of “least privilege”, and would represent a significant security risk (compromised Lambda code being able to do all sorts of crazy things for example).
    • Logging is a particularly important example for  Lambda permissions. Without the capability to create log streams, your function is going to be damn near impossible to debug.
  • If you’re using S3 as the trigger for your Lambda function, you need to make sure that S3 has permissions to execute the function. This is the ELBLogProcessorFunctionPermission logical resource in the template fragment. Without this, your Lambda function will never trigger, even if you have setup the NotificationConfiguration on the bucket itself.
  • If your Lambda function needs to access external resources (like S3) you will likely have to use Private Subnets + a NAT Gateway to give it that ability. Technically you could also use a proxy, but god why would you do that to yourself. If you put your Lambda function into a Public Subnet, I’m pretty sure it doesn’t automatically get access to the greater internet like an EC2 instance does, and you will be intensely confused as to why all your external calls timeout.
  • Make sure you apply an appropriate Security Group to your Lambda function so that it can communicate externally, or you’ll get mysterious timeouts that look exactly the same as the ones you get when you haven’t set up general internet access correctly.

To Be Continued

So that’s how we setup the Lambda function as part of any environment that needs to process ELB logs. Remember, the template fragment above is incomplete, and is missing Auto Scaling Groups, Launch Configurations, Load Balancers, Host Records and a multitude of other things that make up an actual environment. What I’ve shown above is enough to get the pipeline up and running, where any object introduced into the LogsBucket will trigger an execution of the Lambda function, so its enough to illustrate our approach.

Of course, the function doesn’t do anything yet, which ties in with the next post.

How we get the code into Lambda.

Until then, may all your Lambda function executions be swift and free of errors.


A while back (god, almost a full year ago), I posted about the way in which we handle environment migrations, and to be honest, it hasn’t changed all that much. We have made some improvements to the way we handle our environments (for example, our newest environments are built into tested, versioned packages, rather than running directly from source), which is good, but the general migration process of clone temp, tear down old, clone back to active, tear down temp hasn’t really changed.

Over time, we’ve come to realise that there are a number of weaknesses in that strategy though. It’s slow (double clone!), it’s not overly clean, and in rare cases it can lead to all of the data for the environment under migration being destroyed.

Yes, destroyed, i.e. lost forever.

This post is about that last weakness (the others will have to continue existing…for now).

Explosions!

In the original cloning scripts, there was an ominous comment, which simply said “# compare environment data here?”, which was a pretty big red flag in retrospect. You can’t always do everything though, and the various pressures applied to the development team meant that that step became somewhat manual.

That was a mistake.

After running a number of migrations across a few different environments (using basically the same concepts), we finally triggered that particular tripwire.

An otherwise uninteresting environment upgrade for one of our production services completely annihilated the underlying database (an EC2 instance running RavenDB), but the script gave no indication that anything went wrong.

Luckily, this particular service was more of a temporary waystation, acting as a holding area facilitating the connection of two applications through a common web interface. This meant that while the loss of the data was bad (very bad), it wasn’t a problem for all of our customers. Only those people who had items sitting in the holding area waiting to be picked up were affected.

Obviously, the affected customers were quite unhappy, and rightfully so.

To this day I actually have no idea what went wrong with the actual migration. I had literally run the exact same scripts on a staging environment earlier that day, and verified that the same data was present before and after. After extensive investigation, we agreed that we would probably not get to the root of the issue in a timely fashion and that it might have just been an AWS thing (for a platform based on computers, sometimes AWS is amazingly non-deterministic). Instead, we agreed to attack the code that made it possible for the data loss to occur at all.

The migration scripts themselves.

Give Me More Statistics…Stat!

Returning to that ominous comment in the migration scripts, we realised that we needed an easy way to compare the data in two environments, at least at a high level. Using a basic comparison like that would enable us to make a decision about whether to proceed with the migration (specifically the part that destroys the old environment).

The solution is to implement a statistics endpoint.

The idea is pretty simple. We provide a set of information from the endpoint that summarises the content of the service (at least as best we can summarise it). Things like how many of a certain type of entity are present are basically all we have to deal with for now (simple services), but the concept could easily be extended to include information about any piece of data in the environment.

Something as simple as the example below fills our needs:

{
    "data": {
        "customers": {
            "count": 57
        },
        "databases": {
            "count": 129
        }
    }
}

A side effect of having an endpoint like this is that we can easily (at least using the http_poller input in Logstash) extract this information on a regular basis and put it into our log aggregation so that we can chart its change over time.
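That polling is about as simple as it gets (the URL and interval below are illustrative):

input {
    http_poller {
        # Hit the statistics endpoint every five minutes and parse the response as JSON
        urls => {
            statistics => "https://service.example.com/stats"
        }
        schedule => { every => "5m" }
        codec => "json"
    }
}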

Making It Work

With the statistics endpoint written and deployed (after all it must be present in the environment being migrated before we can use it), all that’s left to do is incorporate it into the migration script.

I won’t rewrite the entirety of the migration script here, but I’ve included a skeleton below to provide an idea of how we use the comparison to make sure we haven’t lost anything important on the way through.

function Migrate
{
    param
    (
        #bunch of params here, mostly relating to credentials
    )

    try
    {
        # make current environment unavailable to normal traffic

        # clone current to temporary

        if (-not (Compare-Environments $current $temp))
        {
            # delete the temporary environment and exit with an error
        }

        # delete current environment
        # clone temporary environment into the place where the current environment used to be

        if (-not (Compare-Environments $current $temp))
        {
            # delete the new environment
            # keep the temporary environment because it's the only one with the data
        }
    }
    catch
    {
        # if the current environment still exists, delete the temporary environment
        # if the current environment still exists, restore its availability
    }
}

function Compare-Environments
{
    param
    (
        $a,
        $b
    )

    $aEndpoint = "some logic for URL creation based off environment"
    $bEndpoint = "some logic for URL creation based off environment"

    $aStatistics = Invoke-RestMethod $aEndpoint #credentials, accept header, methods etc
    $bStatistics = Invoke-RestMethod $bEndpoint #credentials, accept header, methods etc

    if ((ConvertTo-Json $aStatistics.data) -eq (ConvertTo-Json $bStatistics.data))
    {
        return $true;
    }

    return $false;
}

Summary

The unfortunate truth of this whole saga, is that the person who originally implemented the migration scripts (I’m pretty sure it was me, so I take responsibility) was aware of the fact that the migration could potentially lead to loss of data. At the time, the protection against that was to ensure that we never deleted the old environment until we were absolutely sure that the new environment had been successfully created, making the assumption that the data had come over okay.

In the end, that assumption proved to be our undoing, because while everything appeared peachy, it actually failed spectacularly.

The introduction of a statistics endpoint (almost an environment data hash) is an elegant solution to the problem of potential data loss, which also has some nice side effects for tracking metrics that might not have been easily accessible outside of direct database access.

A double victory is a rare occurrence, so I think I’ll try to savour this one for a little while, even if I was the root cause of the problem.