
Over the last few weeks, I’ve been sharing the bits and pieces that went into our construction of an ELB logs processing pipeline using AWS Lambda.

As I said in the introduction, the body of work around the processor can be broken down into three pieces:

  • The Lambda function code itself, written in Javascript, which parses an ELB log file and pushes the results to our Logstash ingress endpoint
  • The configuration/creation of that Lambda function via CloudFormation, as part of our normal environment setup
  • The build and deployment of the Lambda function code via Octopus Deploy

I’ve gone through each of these components in detail in the posts in this series, so this last post is really just to tie it all together and reflect on the process, as well as provide a place to mention a few things that didn’t neatly fit into any of the buckets above.

Also, I’ll obviously be continuing the pattern of naming every sub-header after a chapter title in the original Half-Life, almost all of which have proven to be surprisingly apt for whatever topic is being discussed. I mean, seriously, look at the next one. How is that not perfectly fitting for a conclusion/summary post?

Residue Processing

It took a fair amount of effort for us to get to the solution we have in place now, and a decent amount of time. The whole thing was put together over the course of a few weeks by one of the people I work with, with some guidance and feedback from other members of the team from time to time. This timeframe was to develop, test and then deploy the solution into a real production environment, by a person with little to no working knowledge of the AWS toolset, so I think it was a damn good effort.

The most time consuming part was the long turnaround on environment builds, because each build needs to run a suite of tests which involve creating and destroying at least one environment, sometimes more. In reality, this means a wait time of something like 30-60 minutes per build, which is so close to eternity as to be effectively indistinguishable from it. I’ll definitely have to come up with some sort of way to tighten this feedback loop, but given that most of it is actually just waiting for AWS resources, I’m not really sure what I can do.

The hardest part of the whole process was probably just working with Lambda for the first time outside of the AWS management website.

As a team, we’d used Lambda before (back when I tried to make something to clone S3 buckets more quickly), but we’d never tried to manage the various Lambda bits and pieces through CloudFormation.

It turns out that the AWS website does a hell of a lot of things in order to make sure that your Lambda function runs, including dealing with profiles and permissions, network interfaces, listeners and so on. Having to do all of that explicitly through CloudFormation was something of a learning process.

Speaking of CloudFormation and Lambda, we ran into a nasty bug with Elastic Network Interfaces and VPC hosted Lambda functions created through CloudFormation, where the CloudFormation stack doesn’t delete cleanly because the ENI is still in use. It looks like it’s a known issue, so I assume it will be fixed at some point in the future, but as a result we had to include some additional cleanup in the Powershell that wraps our environment management to check the stack for Lambda functions and manually detach and delete the ENI before we try to delete the stack.
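For illustration, that cleanup looks something like the following sketch. It uses the AWS Tools for PowerShell, and rather than walking the stack’s resources it just filters on the description Lambda typically gives its interfaces, so treat the filter and the overall shape as assumptions rather than our exact implementation:

# Lambda usually gives the ENIs it creates a description like "AWS Lambda VPC ENI: ...",
# so find any such interfaces that are still hanging around (assumption, not guaranteed).
$interfaces = Get-EC2NetworkInterface -Filter @{ Name = "description"; Values = "AWS Lambda VPC ENI*" }

foreach ($interface in $interfaces)
{
    if ($interface.Attachment -ne $null)
    {
        Write-Verbose "Detaching ENI [$($interface.NetworkInterfaceId)]"
        Dismount-EC2NetworkInterface -AttachmentId $interface.Attachment.AttachmentId
        # The detach is asynchronous, so give it a little while to complete before deleting.
        Start-Sleep -Seconds 30
    }

    Write-Verbose "Deleting ENI [$($interface.NetworkInterfaceId)]"
    Remove-EC2NetworkInterface -NetworkInterfaceId $interface.NetworkInterfaceId -Force
}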

This isn’t the first time we’ve had to manually clean up resources “managed” by CloudFormation. We do the same thing with S3 buckets, because CloudFormation won’t delete a bucket with anything in it (and some of our buckets, like the ELB logs ones, are constantly being written to by other AWS services).
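The bucket cleanup is much simpler. A minimal sketch using the AWS CLI, assuming $aws already points at the CLI executable (as in our other scripts) and $bucketName is the bucket in question:

# CloudFormation refuses to delete a bucket with objects in it, so empty it first.
# Note: this is not sufficient for versioned buckets, which also retain old versions and delete markers.
Write-Verbose "Removing all objects from bucket [$bucketName] so the stack can delete it"
(& $aws s3 rm "s3://$bucketName" --recursive) | Write-Verbose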

The only other difficult part of the whole thing was one I’ve already mentioned in the deployment post: figuring out how we could incorporate non-machine based Octopus deployments into our environments. For now they just happen after the actual AWS stack is created (as part of the Powershell scripts wrapping the entire process) and rely on an Octopus Tentacle registered in each environment on the Octopus Server machine, used as a script execution point.

Conclusion

Having put this whole system in place, the obvious question is “Was it worth it?”.

For me, the answer is “definitely”.

We’ve managed to retire a few hacky components (a Windows service running Powershell scripts via NSSM to download files from an S3 bucket, for example) and removed an entire machine from every environment that needs to process ELB logs. It’s not often that you get to reduce both running and maintenance costs in one blow, so it was nice to get that accomplished.

Ignoring the reduced costs to the business for a second, we’ve also decreased the latency for receiving our ELB logs for analysis because rather than relying on a polling system, we’re now triggering the processing directly when the ELB writes the log file into S3.

Finally, we’ve gained some more experience with systems and services that we haven’t really had a chance to look into, allowing us to leverage that knowledge and tooling for other, potentially more valuable purposes.

All in all, I consider this exercise a resounding success, and I’m happy I was able to dedicate some time to improving an existing process, even though it was already “working”.

Improving existing engineering like this is incredibly valuable to the morale of a development team, which is an important and limited resource.


Another week, another piece of the puzzle.

This time around, I’ll go through how we’ve setup the build and deployment pipeline for the Lambda function code that processes the ELB log files. It’s not a particularly complex system, but it was something completely different compared to things we’ve done in the past.

In general terms, building software is relatively straightforward, as long as you have two things:

  • A mechanism of tracking changes to the source code, i.e. source control.
  • The ability to validate the code and create a versioned package, where the behaviour of the code can be reasoned about.

With at least those two pieces in place (preferably fully automated), your builds are taken care of. Granted, accomplishing those things is not as simple as two dot points would lead you to believe, but conceptually there is not a lot going on.

Once you have versioned packages that can be reasoned about, all you need to do is figure out how to deliver them to the appropriate places. Again, ideally the whole process is automated. You shouldn’t have to remote onto a machine and manually copy files around, as that hardly ever ends well. This can get quite complicated depending on exactly what it is that you are deploying: on-premises software can be notoriously painful to deploy without some supporting system, but deploying static page websites to S3 is ridiculously easy.
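To make that contrast concrete, a static site deployment can literally be a single sync of the build output into a bucket (bucket name and local path are illustrative):

# Mirror the local build output into the bucket, removing anything that no longer exists locally.
& aws s3 sync .\site-output "s3://my-static-website-bucket" --delete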

My team uses TeamCity, Nuget and Powershell to accomplish the build part and Octopus Deploy to deploy almost all of our code, and I don’t plan on changing any of that particularly soon.

Some people seem to think that because it’s so easy to manage Lambda functions from the AWS management website, they don’t need a build and deployment pipeline. After all, just paste the code into the website and you’re good to go, right?

I disagree vehemently.

Power Up

Our ELB logs processor Lambda function follows our normal repository structure, just like any other piece of software we write.

The code for the Lambda function goes into the /src folder, along with a Nuspec file that describes how to construct the resulting versioned package at build time.

Inside a /scripts folder is a build script, written in Powershell, containing all of the logic necessary to build and verify a deployable package. It mostly just leverages a library of common functions (so our builds are consistent), and its goal is to facilitate all of the parts of the pipeline, like compilation (ha Javascript), versioning, testing, packaging and deployment.

Some build systems are completely encapsulated inside the software that does the build, like Jenkins or TeamCity. I don’t like this approach, because you can’t easily run/debug the build on a different machine for any reason. I much prefer the model where the repository has all of the knowledge necessary to do the entire build and deployment process, barring the bits that it needs to accomplish via an external system.

The build script for the ELB logs processor function is included below, but keep in mind, this is just the entry point; a lot of the bits and pieces are inside the common functions that you can see referenced.

[CmdletBinding()]
param
(
    [switch]$deploy,
    [string]$octopusServerUrl,
    [string]$octopusApiKey,
    [string]$component,
    [string]$commaSeparatedDeploymentEnvironments,
    [string[]]$projects,
    [int]$buildNumber,
    [switch]$prerelease,
    [string]$prereleaseTag
)

$error.Clear()

$ErrorActionPreference = "Stop"

$here = Split-Path $script:MyInvocation.MyCommand.Path

. "$here\_Find-RootDirectory.ps1"

$rootDirectory = Find-RootDirectory $here
$rootDirectoryPath = $rootDirectory.FullName

. "$rootDirectoryPath\scripts\Invoke-Bootstrap.ps1"
. "$rootDirectoryPath\scripts\common\Functions-Build.ps1"

$arguments = @{}
$arguments.Add("Deploy", $deploy)
$arguments.Add("CommaSeparatedDeploymentEnvironments", $commaSeparatedDeploymentEnvironments)
$arguments.Add("OctopusServerUrl", $octopusServerUrl)
$arguments.Add("OctopusServerApiKey", $octopusApiKey)
$arguments.Add("Projects", $projects)
$arguments.Add("VersionStrategy", "SemVerWithPatchFilledAutomaticallyWithBuildNumber")
$arguments.Add("buildNumber", $buildNumber)
$arguments.Add("Prerelease", $prerelease)
$arguments.Add("PrereleaseTag", $prereleaseTag)
$arguments.Add("BuildEngineName", "nuget")

Build-DeployableComponent @arguments

I’m pretty sure I’ve talked about our build process and common scripts before, so I’m not going to go into any more detail.
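For context, the build server ends up invoking that entry point with something like the following (the script name and all of the values are illustrative; in reality most of them are supplied by TeamCity parameters):

.\scripts\build.ps1 -deploy `
    -octopusServerUrl "https://octopus.example.com" `
    -octopusApiKey $octopusApiKey `
    -commaSeparatedDeploymentEnvironments "CI" `
    -projects @("ELBLogProcessor") `
    -buildNumber 123 `
    -prerelease `
    -prereleaseTag "beta"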

Prior to deployment, the only interesting output is the versioned Nuget file, containing all of the dependencies necessary to run the Lambda function (which in our case is just the file in my previous post).

On A Rail

When it comes to deploying the Lambda function code, it’s a little bit more complicated than our standard software deployments via Octopus Deploy.

In most cases, we create a versioned package containing all of the necessary logic to deploy the software, so in the case of an API it contains a deploy.ps1 script that gets run automatically during deployment, responsible for creating a website and configuring IIS on the local machine. The most important thing Octopus does for us is provide the mechanism to get this package to the place where it needs to be.

It usually does this via an Octopus Tentacle, a service that runs on the deployment target and allows for communication between the Octopus server and the machine in question.

This system kind of falls apart when you’re trying to deploy to an AWS Lambda function, which cannot have a tentacle installed on it.

Instead, we rely on the AWS API and what amounts to a worker machine sitting in each environment.

When we do a deployment of our Lambda function project, it gets copied to the worker machine (which is actually just the Octopus server) and it runs the deployment script baked into the package. This script then uses Octopus variables to package the code again (into the simple zip file that AWS expects) and uses the AWS API to upload the changed code to the appropriate Lambda function (by following a naming convention).

The deployment script is pretty straightforward:

function Get-OctopusParameter
{
    [CmdletBinding()]
    param
    (
        [string]$key
    )

    if ($OctopusParameters -eq $null)
    {
        throw "No variable called OctopusParameters is available. This script should be executed as part of an Octopus deployment."
    }

    if (-not($OctopusParameters.ContainsKey($key)))
    {
        throw "The key [$key] could not be found in the set of OctopusParameters."
    }

    return $OctopusParameters[$key]
}

$VerbosePreference = "Continue"
$ErrorActionPreference = "Stop"

$here = Split-Path -Parent $MyInvocation.MyCommand.Path
. "$here\_Find-RootDirectory.ps1"


$rootDirectory = Find-RootDirectory $here
$rootDirectoryPath = $rootDirectory.FullName

$awsKey = $OctopusParameters["AWS.Deployment.Key"]
$awsSecret = $OctopusParameters["AWS.Deployment.Secret"]
$awsRegion = $OctopusParameters["AWS.Deployment.Region"]

$environment = $OctopusParameters["Octopus.Environment.Name"]
$version = $OctopusParameters["Octopus.Release.Number"]

. "$rootDirectoryPath\scripts\common\Functions-Aws.ps1"
$aws = Get-AwsCliExecutablePath

$env:AWS_ACCESS_KEY_ID = $awsKey
$env:AWS_SECRET_ACCESS_KEY = $awsSecret
$env:AWS_DEFAULT_REGION = $awsRegion

$functionPath = "$here\src\function"

Write-Verbose "Compressing lambda code file"
Add-Type -AssemblyName System.IO.Compression.FileSystem
[system.io.compression.zipfile]::CreateFromDirectory($functionPath, "index.zip")

Write-Verbose "Updating Log Processor lambda function to version [$environment/$version]"
(& $aws lambda update-function-code --function-name $environment-ELBLogProcessorFunction --zip-file fileb://index.zip) | Write-Verbose

Nothing fancy, just using the AWS CLI to deploy code to a known function.

Apprehension

AWS Lambda does provide some other mechanisms to deploy code, and we probably could have used them to accomplish the same sort of thing, but I like our patterns to stay consistent and I’m a big fan of the functionality that Octopus Deploy provides, so I didn’t want to give that up.

We had to make a few changes to our environment pattern to allow for non-machine based deployment, like:

  • Ensuring every automatically created environment has a machine registered in it representing the Octopus server (for running scripts that affect external systems)
    • This meant that our environment cleanup also needed to take this into account, deregistering the machine before trying to remove the Octopus environment
  • Providing a section at the end of the environment setup to deploy Octopus projects that aren’t related to specific machines (sketched below)
    • Previously all of our deployments happened via cfn-init on the EC2 instances in question
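The non-machine deployment section mentioned above is just a series of Octopus CLI calls made from the environment setup scripts, along these lines (the executable path variable, server URL, API key, project and environment names are all illustrative):

# Create a release for the Lambda function project and deploy it into the freshly created
# environment, waiting for the deployment to complete so that failures fail the environment creation.
& $octoExecutable create-release `
    --project "ELB Logs Processor" `
    --server $octopusServerUrl `
    --apiKey $octopusApiKey `
    --deployto $environmentName `
    --waitfordeployment `
    --progress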

Once all of that was in place, it was pretty easy to deploy code to a Lambda function as part of setting up the environment, just like we would deploy code to an EC2 instance. It was one of those cases where I’m glad we wrap our usage of CloudFormation in Powershell, because if we were just using raw CloudFormation, I’m not sure how we would have integrated the usage of Octopus Deploy.

To Be Summarised

I’ve only got one more post left in this series, which will summarise the entire thing and link back to all the constituent parts.

Until then, I don’t really have anything else to say.


It’s that time again, kids: time to continue the series of posts about how we improved the processing of our ELB logs into our ELK stack using AWS Lambda.

You can find the introduction to this whole adventure here, but last time I wrote about the Javascript content of the Lambda function that does the bulk of the work.

This time I’m going to write about how we incorporated the creation of that Lambda function into our environment management strategy and some of the tricks and traps therein.

On a completely unrelated note, it would be funny if this blog post turned up in search results for Half-Life 3.

We’ve Got Hostiles!

I’ve pushed hard to codify our environment setup where I work. The main reason for this is reproducibility, but the desire comes from a long history of interacting with manually setup environments that are lorded over by one guy who just happened to know the guy who originally set them up and where everyone is terrified of changing or otherwise touching said environments.

It’s a nightmare.

As far as environment management goes, I’ve written a couple of times about environment related things on this blog, one of the most recent being the way in which we version our environments. To give some more context for this post, I recommend you go and read at least the versioning post in order to get a better understanding of how we do environment management. Our strategy is still a work in progress, but it’s getting better all the time.

Regardless of whether or not you followed my suggestion, we use a combination of versioned Nuget packages, Powershell, CloudFormation and Octopus Deploy to create an environment, where an environment is a self-contained chunk of infrastructure and code that performs some sort of role, the most common of which is acting as an API. We work primarily with EC2 instances (behind Elastic Load Balancers managed via Auto Scaling Groups), and historically, we’ve deployed Logstash to each instance alongside the code to provide log aggregation (IIS, Application, System Stats, etc). When it comes to capturing and aggregating ELB logs, we include a standalone EC2 instance in the environment, also using Logstash. This standalone instance is the part of the system that we are aiming to replace with the Lambda function.

Because we make extensive use of CloudFormation, incorporating the creation of a Lambda function into an environment that needs to have ELB logs processed is a relatively simple affair.

Simple in that it fits nicely with our current approach. Getting it all to work as expected was still a massive pain.

Blast Pit

Below is a fragment of a completed CloudFormation template for reference purposes.

In the interests of full disclosure, I did not write most of the following fragment, another member of my team was responsible. I just helped.

{
    "Description": "This template is a fragment of a larger template that creates an environment. This fragment in particular contains all of the necessary bits and pieces for a Lambda function that processes ELB logs from S3.",
    "Parameters": {
        "ComponentName": {
            "Description": "The name of the component that this stack makes up. This is already part of the stack name, but is here so it can be used for naming/tagging purposes.",
            "Type": "String"
        },
        "OctopusEnvironment": {
            "Description": "Octopus Environment",
            "Type": "String"
        },
        "PrivateSubnets": {
            "Type": "List<AWS::EC2::Subnet::Id>",
            "Description": "Public subnets (i.e. ones that are automatically assigned public IP addresses) spread across availability zones, intended to contain load balancers and other externally accessible components.",
            "ConstraintDescription": "must be a list of an existing subnets in the selected Virtual Private Cloud."
        },
        "LogsS3BucketName": {
            "Description": "The name of the bucket where log files for the ELB and other things will be placed.",
            "Type": "String"
        }
    },
    "Resources": {
        "LogsBucket" : {
            "Type" : "AWS::S3::Bucket",
            "Properties" : {
                "BucketName" : { "Ref": "LogsS3BucketName" },
                "LifecycleConfiguration": {
                    "Rules": [
                        {
                            "Id": 1,
                            "ExpirationInDays": 7,
                            "Status": "Enabled"
                        }
                    ]
                },
                "Tags" : [
                    {
                        "Key": "function",
                        "Value": "log-storage"
                    }
                ],
                "NotificationConfiguration" : {
                  "LambdaConfigurations": [
                    {
                      "Event" : "s3:ObjectCreated:*",
                      "Function" : { "Fn::GetAtt" : [ "ELBLogProcessorFunction", "Arn" ] }
                    }
                  ]
                }
            }
        },
        "ELBLogProcessorFunctionPermission": {
            "Type" : "AWS::Lambda::Permission",
            "Properties" : {
                "Action":"lambda:invokeFunction",
                "FunctionName": { "Fn::GetAtt": [ "ELBLogProcessorFunction", "Arn" ]},
                "Principal": "s3.amazonaws.com",
                "SourceAccount": {"Ref" : "AWS::AccountId" },
                "SourceArn": {
                    "Fn::Join": [":", [ "arn","aws","s3","", "" ,{"Ref" : "LogsS3BucketName"}]]
                }
            }
        },
        "LambdaSecurityGroup": {
            "Type": "AWS::EC2::SecurityGroup",
            "Properties": {
                "GroupDescription": "Enabling all outbound communications",
                "VpcId": {
                    "Ref": "VpcId"
                },
                "SecurityGroupEgress": [
                    {
                        "IpProtocol": "tcp",
                        "FromPort": "0",
                        "ToPort": "65535",
                        "CidrIp": "0.0.0.0/0"
                    }
                ]
            }
        },
        "ELBLogProcessorFunction": {
          "Type": "AWS::Lambda::Function",
          "Properties": {
            "FunctionName": { "Fn::Join": [ "", [ { "Ref" : "ComponentName" }, "-", { "Ref" : "OctopusEnvironment" }, "-ELBLogProcessorFunction"  ] ] },
            "Description": "ELB Log Processor",
            "Handler": "index.handler",
            "Runtime": "nodejs4.3",
            "Code": {
              "ZipFile": "console.log('placeholder for lambda code')"
            },
            "Role": { "Fn::GetAtt" : ["LogsBucketAccessorRole", "Arn"]},
            "VpcConfig": {
              "SecurityGroupIds": [{"Fn::GetAtt": ["LambdaSecurityGroup", "GroupId"]}],
              "SubnetIds": { "Ref": "PrivateSubnets" }
            }
          }
        },
        "LogsBucketAccessorRole": {
          "Type": "AWS::IAM::Role",
          "Properties": {
            "AssumeRolePolicyDocument": {
                "Version": "2012-10-17",
                "Statement": [
                    {
                        "Effect": "Allow",
                        "Principal": { "Service" : ["lambda.amazonaws.com"]},
                        "Action": [
                            "sts:AssumeRole"
                        ]
                    }
                ]
            },
            "Path": "/",
            "Policies": [{ 
              "PolicyName": "access-s3-read",
              "PolicyDocument": {
                "Version": "2012-10-17",
                "Statement": [
                    {
                        "Effect": "Allow",
                        "Action": [
                            "s3:GetObject"
                        ],
                        "Resource": {
                            "Fn::Join": [":", [ "arn","aws","s3","", "" ,{"Ref" : "LogsS3BucketName"}, "/*"]]
                        }
                    }
                ]
              }
            },
            {
              "PolicyName": "access-logs-write",
              "PolicyDocument": {
                "Version": "2012-10-17",
                "Statement": [
                    {
                        "Effect": "Allow",
                        "Action": [
                            "logs:CreateLogGroup",
                            "logs:CreateLogStream",
                            "logs:PutLogEvents",
                            "logs:DescribeLogStreams"
                        ],
                        "Resource": {
                            "Fn::Join": [":", [ "arn","aws","logs", { "Ref": "AWS::Region" }, {"Ref": "AWS::AccountId"} , "*", "/aws/lambda/*"]]
                        }
                    }
                ]
              }
            },
            {
              "PolicyName": "access-ec2",
              "PolicyDocument": {
                "Version": "2012-10-17",
                "Statement": [
                    {
                        "Effect": "Allow",
                        "Action": [
                          "ec2:*"
                        ],
                        "Resource": "arn:aws:ec2:::*"
                    }
                ]
              }
            },
            {
              "PolicyName": "access-ec2-networkinterface",
              "PolicyDocument": {
                  "Version": "2012-10-17",
                  "Statement": [
                    {
                      "Effect": "Allow",
                      "Action": [
                        "ec2:DescribeInstances",
                        "ec2:CreateNetworkInterface",
                        "ec2:AttachNetworkInterface",
                        "ec2:DescribeNetworkInterfaces",
                        "ec2:DeleteNetworkInterface",
                        "ec2:DetachNetworkInterface",
                        "ec2:ModifyNetworkInterfaceAttribute",
                        "ec2:ResetNetworkInterfaceAttribute",
                        "autoscaling:CompleteLifecycleAction"
                      ],
                      "Resource": "*"
                    }
                  ]
                }
              }
            ]
          }
        }
    }
}

The most important part of the template above is the ELBLogProcessorFunction. This is where the actual Lambda function is specified, although you might notice that it does not actually have the code from the previous post attached to it in any way. The reason for this is that we create the Lambda function with placeholder code, and then use Octopus Deploy afterwards to deploy a versioned package containing the actual code to the Lambda function, like we do for everything else. Packaging and deploying the Lambda function code is a topic for another blog post though (the next one, hopefully).

Other things to note in the template fragment:

  • Lambda functions require a surprising amount of permissions to do what they do. When creating a function using the AWS website, most of this complexity is dealt with for you. When using CloudFormation however, you have to be aware of what the Lambda function needs and give it the appropriate permissions. You could just give the Lambda function as many permissions as possible, but that would be stupid in the face of “least privilege”, and would represent a significant security risk (compromised Lambda code being able to do all sorts of crazy things for example).
    • Logging is a particularly important example of Lambda permissions. Without the capability to create log streams, your function is going to be damn near impossible to debug.
  • If you’re using S3 as the trigger for your Lambda function, you need to make sure that S3 has permissions to execute the function. This is the ELBLogProcessorFunctionPermission logical resource in the template fragment. Without this, your Lambda function will never trigger, even if you have setup the NotificationConfiguration on the bucket itself (see the CLI sketch after this list for a quick way to check).
  • If your Lambda function needs to access external resources (like S3) you will likely have to use Private Subnets + a NAT Gateway to give it that ability. Technically you could also use a proxy, but god why would you do that to yourself. If you put your Lambda function into a Public Subnet, I’m pretty sure it doesn’t automatically get access to the greater internet like an EC2 instance does, and you will be intensely confused as to why all your external calls timeout.
  • Make sure you apply an appropriate Security Group to your Lambda function so that it can communicate externally, or you’ll get mysterious timeouts that look exactly the same as the ones you get when you haven’t setup general internet access correctly.

To Be Continued

So that’s how we setup the Lambda function as part of any environment that needs to process ELB logs. Remember, the template fragment above is incomplete, and is missing Auto Scaling Groups, Launch Configurations, Load Balancers, Host Records and a multitude of other things that make up an actual environment. What I’ve shown above is enough to get the pipeline up and running, where any object introduced into the LogsBucket will trigger an execution of the Lambda function, so it’s enough to illustrate our approach.

Of course, the function doesn’t do anything yet, which ties in with the next post.

How we get the code into Lambda.

Until then, may all your Lambda function executions be swift and free of errors.


Now that I am feeling less like dying horribly from a terrible plague, it’s time to continue talking about processing ELB logs into an ELK stack via AWS Lambda.

Last time I talked about our current situation, our motivations for wanting to switch to something better and the general plan moving forward.

To summarise the plan, the new log processor system needs 3 main parts:

  • The Lambda function code, which is written in Javascript, and would need to parse an ELB log file one line at a time, break it up, convert it into a JSON structure and then push it to a Logstash TCP endpoint to be written into Elasticsearch
  • The configuration/creation of the Lambda function, which would need to be done via CloudFormation as part of our normal environment setup (i.e. spin up an environment, it comes with the Lambda function that will process any ELB logs encountered)
  • The deployment of the code to the Lambda function via Octopus (because that’s how we roll)

I’m basically going to do a short blog post on each of those pieces of work, and maybe one at the end to tie it all together.

With that in mind, lets talk Javascript.

Unforeseen Consequences

When you’re writing Lambda functions, Javascript (via Node.js) is probably your best bet. Sure you can run Java or Python (and maybe one day C# using .NET Core), but Javascript is almost certainly going to be the easiest. It’s what we chose when we put together the faster S3 clone prototype, and while the fundamentally asynchronous nature of Node.js took some getting used to, it worked well enough.

When it comes to Javascript, I don’t personally write a lot of it. If I’m writing a server side component, I’m probably going to pick C# as my language of choice (with all its fancy features like “a compiler” and “type safety”), and I don’t find myself writing things like websites or small Javascript applications very often, if at all. My team definitely writes websites though, and we use React.js to do it, so it’s not like Javascript is an entirely foreign concept.

For the purposes of reading in and parsing an ELB log file via a Lambda function, we didn’t need a particularly complex piece of Javascript. Something that reads the specified file from S3 after the Lambda function triggers, something to process the contents of that file line by line, something to parse and format those lines in a way that a Logstash input will accept, and something to push that JSON payload to the Logstash listener over raw TCP.

Without further ado, I give you the completed script:

'use strict';

let aws = require('aws-sdk');
let s3 = new aws.S3({ apiVersion: '2006-03-01' });
let readline = require('readline');
let net = require('net');

const _type = 'logs';
const _sourceModuleName = 'ELB';
const _logHost = '#{LogHost}';
const _logPort = #{LogPort};
const _environment = '#{Octopus.Environment.Name}';
const _component = '#{Component}';
const _application = '#{Application}';

function postToLogstash(entry){
    console.log("INFO: Posting event to logstash...");

    var socket = net.createConnection(_logPort, _logHost);
    var message = JSON.stringify(entry) + "\n";
    socket.write(message);
    socket.end();

    console.log("INFO: Posting to logstash...done");
}

exports.handler = (event, context, callback) => {
    console.log('INFO: Retrieving log from S3 bucket...');

    const bucket = event.Records[0].s3.bucket.name;
    const key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, ' '));
    const params = {
        Bucket: bucket,
        Key: key
    };

    const reader = readline.createInterface({
        input: s3.getObject(params).createReadStream()
    });

    const expectedColumns = 12;

    reader.on('line', function(line) {
        console.log("INFO: Parsing S3 line entry...");

        const columns = line.split(/ (?=(?:(?:[^"]*"){2})*[^"]*$)/);

        if(columns.length >= expectedColumns){
            var entry = {
                EventReceivedTime: columns[0],
                LoadBalancerName: columns[1],
                PublicIpAndPort: columns[2],
                InternalIpAndPort: columns[3],
                TimeToForwardRequest: parseFloat(columns[4]) * 1000,
                TimeTaken: parseFloat(columns[5]) * 1000,
                TimeToForwardResponse: parseFloat(columns[6]) * 1000,
                Status: columns[7],
                BackendStatus: columns[8],
                BytesUploadedFromClient: parseInt(columns[9]),
                BytesDownloadedByClient: parseInt(columns[10]),
                FullRequest: columns[11],
                Component: _component,
                SourceModuleName: _sourceModuleName,
                Environment: _environment,
                Application: _application,
                Type: _type
            };
            postToLogstash(entry);
        } else {
            console.log("ERROR: Invalid record length was expecting " + expectedColumns.length + " but found " + columns.length);
            console.log('ERROR: -------');
            console.log(line);
            console.log('ERROR: -------');
        }
    });
};

Nothing too fancy.

In the interest of full disclosure, I did not write the script above. It was written by a few guys from my team initially as a proof of concept, then improved/hardened as a more controlled piece of work.

Office Complex

You might notice some strange variable names at the top of the script (i.e. #{Octopus.Environment.Name}).

We use Octopus Deploy for all of our deployments, and this includes the deployment of Lambda functions (which we package via Nuget and then deploy via the AWS Powershell Cmdlets/CLI inside Octopus). The #{NAME} notation is a way for Octopus to substitute variable values into files during deployment. This substitution is very useful, and can be scoped via a variety of things (like Environment, Machine, Role, etc), so by the time the script actually gets into AWS those variables are filled in with actual values.
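Conceptually, the substitution is just a token replacement pass over the packaged files at deployment time, something along these lines (purely illustrative; Octopus does this for you via its variable substitution feature, and the values and file name shown here are made up):

# Replace Octopus style #{Variable} tokens in the Lambda source with concrete values.
$substitutions = @{
    "LogHost" = "10.0.1.100";
    "LogPort" = "6379";
    "Octopus.Environment.Name" = "CI";
    "Component" = "API";
    "Application" = "LogProcessing"
}

$content = Get-Content "$functionPath\index.js" -Raw
foreach ($key in $substitutions.Keys)
{
    $content = $content.Replace("#{$key}", $substitutions[$key])
}
Set-Content -Path "$functionPath\index.js" -Value $content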

Other than our use of Octopus variables, other things to note in this piece of Javascript are:

  • At no point does the function specify which credentials are being used to access the S3 bucket containing the ELB log files. This is because the Lambda function has been configured with an IAM role allowing it to access the required resources. The AWS Javascript library has built in support for running inside supported AWS contexts (like EC2 instances and Lambda functions), such that it can use the role applied to the context to get appropriate temporary credentials. This is a much better approach than using specific credentials in the script, which can result in unintended consequences if you aren’t eternally vigilant.
  • You need to make sure that your Lambda function is configured with an appropriate security group that allows it to use the expected outbound channel (i.e. make sure it can get to the Logstash host you’re trying to connect to). This was somewhat of an issue for us as our ELK stack is hosted inside another AWS account (our OPs account), so we had to make sure that all of the appropriate VPC peering was configured before it would work correctly. It can be a bit of a pain to setup the smallest possible surface area, but don’t be tempted to just configure the Lambda function to be able to do everything and go anywhere. Smaller surface area = more security, and the policy of least privilege is one you should always follow.
  • It’s important to ensure that if you’re doing TCP communication in a Lambda function, you close your socket when you’re done with it, or the Lambda function might not exit. It might, but it also might not, and it can really throw you for a loop if you don’t know why.

To Be Continued

That’s it for this week. The Javascript I’ve included above is pretty generic (apart from the specific set of fields that we like to have in our log events) and will successfully process an ELB log file from S3 to a Logstash instance listening on a port of your choosing (probably 6379) when used in a Lambda function. Feel free to reuse it for your own purposes.

Next week I’ll continue this series of posts with information about how we use CloudFormation to setup our Lambda function as part of one of our environment definitions.

CloudFormation and Lambda aren’t the best of friends yet, so there is some interesting stuff that you have to be aware of.


We’ve come a long way in our log aggregation journey. Don’t get me wrong, we still have a long way to go, but bit by bit, we’re getting better at it.

A good example of getting better is the way in which we process our Elastic Load Balancer (ELB) logs. Over a year ago I put together a system for processing these logs into our ELK stack with way too many moving parts. It used Logstash (for processing) and Powershell (for downloading files from S3) hosted on an EC2 instance to aggregate ELB logs from S3 into our ELK stack. Somewhat complicated in practice, but it worked, even if I was never particularly happy with it.

As is the way with these things though, because it did work, we’ve had no reason to revisit it, and we’ve successfully applied the same approach to at least 3 other environments we’ve setup since.

It wasn’t without its share of problems though:

  • The EC2 instances hosting the solution had a tendency to cap themselves at 100% CPU for long periods. They were initially t2.mediums, but they kept expending all of their CPU credits, so we had to upgrade them to m3.mediums, which was a 50% increase in cost ($US 115/month). Never did figure out exactly what needed all that CPU, but the leading theory was Logstash.
  • For a while, the logs simply stopped processing after a period of time (days/weeks). This turned out to be an issue with accumulating memory dumps from Java as a result of Logstash crashing and NSSM automatically restarting it.
  • These were the machines most vulnerable to the memory leak in Logstash that causes its TCP driver to accumulate non-paged memory on Windows AWS instances due to some driver problem.

Good old Logstash.

To turn the discussion back to getting better, we had the opportunity to revisit this process when building some new environments, using all of the knowledge and experience that we’d gained in the intervening period. I think we came up with a much more efficient and understandable solution, but it wasn’t without its share of difficulties, which makes for a good post.

Anomalous Materials

One of the primary weaknesses in the previous approach for processing ELB logs was that it required an entire EC2 instance all to itself, for each environment that we spun up. We did this in order to keep each log processor isolated from the other and to allow us to be able to spin up an entirely self-contained environment without having to worry about some common machine that processed all of the logs in a bunch of different buckets.

Another weakness in the process that bothered me was that it had way too many moving parts. Sometimes you have to have a lot of moving parts working together in order to accomplish a goal, but you should always strive for simplicity, both from an operational point of view and from a maintenance point of view. Less is almost always better in software development.

AWS has come a long way since we jammed the initial solution together, so we decided to use this opportunity to simplify the process and experiment with some AWS tools that we don’t frequently use.

After some discussion, we formed an idea of what we would like the new log processor to look like. We wanted to use Lambda to process the ELB logs as they were created, pushing them to the same Logstash ingress endpoint that we’ve been using consistently for the last year or so. The benefits we were expecting were a reduction in complexity (no need to have 4 different things working together), a reduction in running cost (mostly due to the removal of the EC2 instance) and a reduction in latency (the Lambda function would trigger whenever a file was written to the S3 bucket by the ELB, which meant no more polling for changes).

For those of you unfamiliar with Lambda, it’s a service offered by AWS that lets you configure code to run whenever a variety of events occurs. I’ve used it before to create a quicker S3 bucket clone, so if you want some more information, feel free to read up on that adventure.

In order to accomplish our goal, we would need to deal with 3 things:

  • The Lambda function code itself, written in Javascript, which would parse an ELB log file line by line, convert each line into a JSON structure and push it to a Logstash TCP endpoint to be written into Elasticsearch
  • The configuration/creation of that Lambda function via CloudFormation, as part of our normal environment setup
  • The deployment of the code to the Lambda function via Octopus Deploy

Nothing particularly insane there, but definitely a few things that we’d never done before.

To Be Continued

In order to avoid creating a single monstrous post with more words than a small novel, I’m going to break it here.

Next week I’ll continue, explaining the Javascript code that we put together to process the log files (it’s not particularly complicated) and how we configured the Lambda function by incorporating its setup into our environment setup.

Until then, may all your Lambda functions execute quickly and your S3 buckets not turn into ghosts.