
If anybody actually reads this blog, it would be easy to see that my world has been all about log processing and aggregation for a while now. I mean, I do other things as well, but the questions and problems that end up here on this blog have been pretty focused around logging for as far back as I care to remember. The main reason is that I just like logging, and appreciate the powerful insights that it can give about your software. Another reason is that we don’t actually have anyone who sits in the traditional “operations” role (i.e. infrastructure/environment improvement and maintenance), so in order to help keep the rest of the team focused on our higher level deliverables, I end up doing most of that stuff.

Anyway, I don’t see that pattern changing any time soon, so on with the show.

While I was going about regaining control over our log stack, I noticed that it was extremely difficult to reason about TCP traffic when using an AWS ELB.

Why does this matter?

Well, the Broker layer in our ELK stack (i.e. the primary ingress point) uses a Logstash configuration with a TCP input, and as a result all of the things that write to the Broker (our externally accessible Ingress API, other instances of Logstash, the ELB Logs Processor) use TCP. That's a significant amount of traffic, and something that I'd really like to be able to monitor and understand.

Stealthy Traffic

As far as I currently understand it, when you make a TCP connection through an ELB, the ELB records the initial creation of the connection as a "request" (one of the metrics you can track with CloudWatch) and then pretty much nothing else after that. I mean, this makes sense, as it's the ELB's job to essentially pick an underlying machine to route traffic to, and most TCP connections created and used specifically as TCP connections tend to be long-lived (as opposed to TCP connections created as part of HTTP requests and responses).

As far as our three primary contributors are concerned:

  • The Logging Ingress API is pretty oblivious. It just makes a new TCP connection for each incoming log event, so unless the .NET Framework is silently caching TCP connections for optimization purposes, it’s going to cause one ELB request per log event.
  • The ELB Logs Processor definitely caches TCP connections. We went through a whole ordeal with connection pooling and reuse before it would function in production, so it's definitely pushing multiple log events through a single socket.
  • The Logstash instances that we have distributed across our various EC2 instances (local machine log aggregation, like IIS and application logs) are using the Logstash TCP output. I assume it uses one (or many) long-lived connections to do its business, but I don't really know. Logstash is very mysterious.

This sort of usage makes it very hard to tell just how many log events are coming through the system via CloudWatch, which is a useful metric, especially when things start to go wrong and you need to debug which part of the stack is actually causing the failure.

Unfortunately, the monitoring issue isn’t the only problem with using the Logstash TCP input/output. Both input and output have, at separate times, been…flakey. I’ve experienced both sides of the pipeline going down for no obvious reason, or simply not working after running uninterrupted for a long time.

The final nail in the coffin for TCP came recently, when Elastic.co released the Logstash Persistent Queue feature for Logstash 5.4.0, which does not work with TCP at all (it only supports inputs that use the request-response model). I want to use persistent queues to remove both the Cache and Indexer layers from our log stack, so it was time for TCP to die.

Socket Surgery

Adding an HTTP input to our Broker layer was easy enough. In fact, such an input was already present, because the ELB uses an HTTP request to check whether or not the Broker EC2 instances are healthy.

Conceptually, changing our Logstash instances to use an HTTP output instead of TCP should also be easy. I just need to change some configuration and deploy it through Octopus. Keep in mind I haven't actually done it yet, but it feels simple enough.

In a similar vein, changing our Logging Ingress API to output through HTTP instead of TCP should also be easy. A small code change to use HttpClient or RestSharp or something, a new deployment and everything is puppies and roses. Again, I haven’t actually done it yet, so who knows what dragons lurk there.

Then we have the ELB Logs Processor, which is a whole different kettle of fish.

It took a significant amount of effort to get it working with TCP in the first place (connection pooling was the biggest problem), and due to the poor quality of the Javascript (entirely my fault), it's pretty tightly coupled to that particular mechanism.

Regardless of the difficulty, TCP has to go, for the good of the log stack.

The first issue I ran into was “how do you even do HTTP requests in Node 4.3.2 anyway?”. There are many answers to this question, but the most obvious one is to use the HTTP API that comes with Node. Poking around this for a while showed that it wasn’t too bad, as long as I didn’t want to deal with a response payload, which I didn’t.

The biggest issue with the native Node HTTP API was that it was all callbacks, all the time. In my misadventures with the ELB Logs Processor I’d become very attached to promises and the effect they have on the readability of the resulting code, and didn’t really want to give that up so easily. I dutifully implemented a simple promise wrapper around our specific usage of the native Node HTTP API (which was just a POST of a JSON payload), and incorporated it into the Lambda function.
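
I no longer have the exact wrapper to hand, but conceptually it looked something like the sketch below (a hedged reconstruction rather than the original code; the function name and parameters are just for illustration):

var http = require('http');

// Minimal promise wrapper around a JSON POST using the native Node HTTP API.
// We never cared about the response payload, only whether the request succeeded.
function postJson(host, port, path, payload) {
    return new Promise((resolve, reject) => {
        var body = JSON.stringify(payload);
        var request = http.request({
            host: host,
            port: port,
            path: path,
            method: 'POST',
            headers: {
                'Content-Type': 'application/json',
                'Content-Length': Buffer.byteLength(body)
            }
        }, (response) => {
            // Drain the response so the socket is released cleanly.
            response.on('data', () => {});
            response.on('end', () => {
                if (response.statusCode >= 200 && response.statusCode < 300) {
                    resolve(response.statusCode);
                } else {
                    reject(new Error('Unexpected status code: ' + response.statusCode));
                }
            });
        });

        request.on('error', reject);
        request.write(body);
        request.end();
    });
}

Using it was then just a matter of calling postJson once per log event and collecting the resulting promises, which is essentially the same shape as the Axios version further down.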

Unfortunately, this is where my memory gets a little bit fuzzy (it was a few weeks ago), and I don't really remember how well it went. I don't think it went well, because I decided to switch to a package called Axios, which offered promise-based HTTP requests out of the box.

Axios of Evil

Axios was pretty amazing. Well, I mean, Axios IS pretty amazing, but I suppose that sentence gave it away that the relationship didn’t end well.

The library did exactly what it said it did and with its native support for promises, was relatively easy to incorporate into the existing code, as you can see from the following excerpt:

// snip, whole bunch of setup, including summary object initialization

let axios = require('axios').create({
    baseURL: 'http://' + config.logstashHost,
    headers: {'Content-Type': 'application/json'}
});

// snip, more setup, other functions

function handleLine(line) {
    summary.lines.encountered += 1;

    var entry = {
        Component: config.component,
        SourceModuleName: config.sourceModuleName,
        Environment: config.environment,
        Application: config.application,
        message: line,
        type: config.type,
        Source: {
            S3: s3FileDetails
        }
    };
        
    var promise = axios
        .post("/", entry)
        .then((response) => {
            summary.lines.sent += 1;
        })
        .catch((error) => { 
            summary.failures.sending.total += 1;
            if (summary.failures.sending.lastFew.length >= 5) {
                summary.failures.sending.lastFew.shift();
            }
            summary.failures.sending.lastFew.push(error);
        });

    promises.push(promise);
}

// snip, main body of function (read S3, stream to line processor, wait for promises to finish)
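
The snipped main body just reads the file from S3, streams it through handleLine, and then waits for all of the accumulated promises to settle before completing the Lambda invocation. From memory (so treat this as a sketch rather than the exact code), the tail end of it amounted to something like this, where callback is the standard Lambda completion callback:

Promise
    .all(promises)
    .then(() => {
        console.log("INFO: Lines sent: " + summary.lines.sent + "/" + summary.lines.encountered);
        callback(null, summary);
    })
    .catch((error) => {
        console.log("ERROR: " + error);
        callback(error);
    });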

Even though the new version took a lot of effort to write, it was nice to remove all of the code relating to TCP sockets and connection pooling, as doing so simplified the whole thing considerably.

The (single, manual) test proved that it still did its core job (contents of file written into the ELK stack), it worked in CI and it worked in Staging, so I was pretty happy.

For about 15 minutes that is, until I deployed it into Production.

Crushing Disappointment

Just like last time, the implementation simply could not deal with the amount of traffic that was being thrown at it. Even worse, it wasn't actually logging any errors or giving me any indication as to why it was failing. After a brief and frustrating investigation, it looked like it was simply running out of memory (the Lambda function was only configured to use 192 MB, which had been enough for the TCP approach) and falling over once it hit that limit. That was only ever a hypothesis, and I was never able to conclusively prove it, but the function was definitely using all of the memory available to it each time it ran.

I could have just increased the available memory, but I wanted to understand where all the memory was going first.
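
(In hindsight, the cheapest first step would probably have been to log Node's built-in memory counters at a few interesting points, like after every few hundred lines, and watch how quickly they climbed. The snippet below is purely illustrative, not instrumentation we actually had in place; anything deeper than this means heap snapshots and profiling tools.)

// Logs the resident set size and heap usage in megabytes, to give a rough
// idea of how close the function is getting to its configured ceiling.
function logMemoryUsage(label) {
    var usage = process.memoryUsage();
    var toMb = function (bytes) { return Math.round(bytes / 1024 / 1024); };
    console.log("INFO: [" + label + "] rss: " + toMb(usage.rss) + "MB, heapTotal: " +
        toMb(usage.heapTotal) + "MB, heapUsed: " + toMb(usage.heapUsed) + "MB");
}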

Then I realised I would have to learn how to do memory analysis in Javascript, and I just gave up.

On Javascript that is.

Instead, I decided to rewrite the ELB Logs Processor in .NET Core, using a language that I actually like (C#).

Conclusion

This is one of those cases where, with the benefit of hindsight, I probably should have just increased the memory until it worked and then walked away.

But I was just so tired of struggling with Javascript and Node that it was incredibly cathartic to just abandon it all in favour of something that actually made sense to me.

Of course, implementing the thing in C# via .NET Core wasn’t exactly painless, but that’s a topic for another time.

Probably next week.


Over the last few weeks, I’ve been sharing the bits and pieces that went into our construction of an ELB logs processing pipeline using AWS Lambda.

As I said in the introduction, the body of work around the processor can be broken down into three pieces:

  • The Lambda function code itself, written in Javascript, which parses each ELB log file and pushes the resulting events into our log stack
  • The configuration and creation of the Lambda function via CloudFormation, as part of our normal environment setup
  • The build and deployment of the function code via TeamCity, Nuget and Octopus Deploy

I've gone through each of these components in detail in the posts I've linked above, so this last post is really just to tie it all together and reflect on the process, as well as provide a place to mention a few things that didn't neatly fit into any of the buckets above.

Also, I'll obviously be continuing the pattern of naming every sub-header after a chapter title in the original Half-Life, almost all of which have proven to be surprisingly apt for whatever topic is being discussed. I mean, seriously, look at this next one. How is that not perfectly fitting for a conclusion/summary post?

Residue Processing

It took a fair amount of effort for us to get to the solution we have in place now, and a decent amount of time. The whole thing was put together over the course of a few weeks by one of the people I work with, with some guidance and feedback from other members of the team from time to time. That timeframe covered developing, testing and then deploying the solution into a real production environment, by a person with little to no prior working knowledge of the AWS toolset, so I think it was a damn good effort.

The most time consuming part was the long turnaround on environment builds, because each build needs to run a suite of tests which involve creating and destroying at least one environment, sometimes more. In reality, this means a wait time of something like 30-60 minutes per build, which is so close to eternity as to be effectively indistinguishable from it. I'll definitely have to come up with some sort of way to tighten this feedback loop, but given that most of it is actually just waiting for AWS resources, I'm not really sure what I can do.

The hardest part of the whole process was probably just working with Lambda for the first time outside of the AWS management website.

As a team, we’d used Lambda before (back when I tried to make something to clone S3 buckets more quickly), but we’d never tried to manage the various Lambda bits and pieces through CloudFormation.

It turns out that the AWS website does a hell of a lot of things in order to make sure that your Lambda function runs, including dealing with profiles and permissions, network interfaces, listeners and so on. Having to do all of that explicitly through CloudFormation was something of a learning process.

Speaking of CloudFormation and Lambda, we ran into a nasty bug with Elastic Network Interfaces and VPC hosted Lambda functions created through CloudFormation, where the CloudFormation stack doesn't delete cleanly because the ENI is still in use. It looks like it's a known issue, so I assume it will be fixed at some point in the future, but as a result we had to include some additional cleanup in the Powershell that wraps our environment management, to check the stack for Lambda functions and manually detach and delete the ENI before we try to delete the stack.

This isn't the first time we've had to manually clean up resources "managed" by CloudFormation. We do the same thing with S3 buckets, because CloudFormation won't delete a bucket with anything in it (and some of our buckets, like the ELB logs ones, are constantly being written to by other AWS services).

The only other difficult part of the whole thing is one I've already mentioned in the deployment post: figuring out how we could incorporate non-machine based Octopus deployments into our environments. For now they just happen after the actual AWS stack is created (as part of the Powershell scripts wrapping the entire process) and rely on having an Octopus Tentacle registered in each environment on the Octopus Server machine itself, used as a script execution point.

Conclusion

Having put this whole system in place, the obvious question is “Was it worth it?”.

For me, the answer is “definitely”.

We've managed to retire a few hacky components (a Windows service running Powershell scripts via NSSM to download files from an S3 bucket, for example) and removed an entire machine from every environment that needs to process ELB logs. It's not often that you get to reduce both running and maintenance costs in one blow, so it was nice to get that accomplished.

Ignoring the reduced costs to the business for a second, we’ve also decreased the latency for receiving our ELB logs for analysis because rather than relying on a polling system, we’re now triggering the processing directly when the ELB writes the log file into S3.

Finally, we’ve gained some more experience with systems and services that we haven’t really had a chance to look into, allowing us to leverage that knowledge and tooling for other, potentially more valuable purposes.

All in all, I consider this exercise a resounding success, and I’m happy I was able to dedicate some time to improving an existing process, even though it was already “working”.

Improving existing engineering like this is incredibly valuable to the morale of a development team, which is an important and limited resource.


Another week, another piece of the puzzle.

This time around, I'll go through how we've set up the build and deployment pipeline for the Lambda function code that processes the ELB log files. It's not a particularly complex system, but it was something completely different compared to things we've done in the past.

In general terms, building software is relatively straightforward, as long as you have two things:

  • A mechanism of tracking changes to the source code, i.e. source control.
  • The ability to validate the code and create a versioned package, where the behaviour of the code can be reasoned about.

With at least those two pieces in place (preferably fully automated), your builds are taken care of. Granted, accomplishing those things is not as simple as two dot points would lead you to believe, but conceptually there is not a lot going on.

Once you have versioned packages that can be reasoned about, all you need to do is figure out how to deliver them to the appropriate places. Again, ideally the whole process is automated. You shouldn’t have to remote onto a machine and manually copy files around, as that hardly ever ends well. This can get quite complicated depending on exactly what it is that you are deploying: on-premises software can be notoriously painful to deploy without some supporting system, but deploying static page websites to S3 is ridiculously easy.

My team uses TeamCity, Nuget and Powershell to accomplish the build part and Octopus Deploy to deploy almost all of our code, and I don’t plan on changing any of that particularly soon.

Some people seem to think that because it's so easy to manage Lambda functions from the AWS management website, they don't need a build and deployment pipeline. After all, just paste the code into the website and you're good to go, right?

I disagree vehemently.

Power Up

Our ELB logs processor Lambda function follows our normal repository structure, just like any other piece of software we write.

The code for the Lambda function goes into the /src folder, along with a Nuspec file that describes how to construct the resulting versioned package at build time.

Inside a /scripts folder is a build script, written in Powershell, containing all of the logic necessary to build and verify a deployable package. It mostly just leverages a library of common functions (so our builds are consistent), and its goal is to facilitate all of the parts of the pipeline, like compilation (ha Javascript), versioning, testing, packaging and deployment.

Some build systems are completely encapsulated inside the software that does the build, like Jenkins or TeamCity. I don’t like this approach, because you can’t easily run/debug the build on a different machine for any reason. I much prefer the model where the repository has all of the knowledge necessary to do the entire build and deployment process, barring the bits that it needs to accomplish via an external system.

The build script for the ELB logs processor function is included below, but keep in mind that this is just the entry point; a lot of the bits and pieces live inside the common functions that you can see referenced.

[CmdletBinding()]
param
(
    [switch]$deploy,
    [string]$octopusServerUrl,
    [string]$octopusApiKey,
    [string]$component,
    [string]$commaSeparatedDeploymentEnvironments,
    [string[]]$projects,
    [int]$buildNumber,
    [switch]$prerelease,
    [string]$prereleaseTag
)

$error.Clear()

$ErrorActionPreference = "Stop"

$here = Split-Path $script:MyInvocation.MyCommand.Path

. "$here\_Find-RootDirectory.ps1"

$rootDirectory = Find-RootDirectory $here
$rootDirectoryPath = $rootDirectory.FullName

. "$rootDirectoryPath\scripts\Invoke-Bootstrap.ps1"
. "$rootDirectoryPath\scripts\common\Functions-Build.ps1"

$arguments = @{}
$arguments.Add("Deploy", $deploy)
$arguments.Add("CommaSeparatedDeploymentEnvironments", $commaSeparatedDeploymentEnvironments)
$arguments.Add("OctopusServerUrl", $octopusServerUrl)
$arguments.Add("OctopusServerApiKey", $octopusApiKey)
$arguments.Add("Projects", $projects)
$arguments.Add("VersionStrategy", "SemVerWithPatchFilledAutomaticallyWithBuildNumber")
$arguments.Add("buildNumber", $buildNumber)
$arguments.Add("Prerelease", $prerelease)
$arguments.Add("PrereleaseTag", $prereleaseTag)
$arguments.Add("BuildEngineName", "nuget")

Build-DeployableComponent @arguments

I’m pretty sure I’ve talked about our build process and common scripts before, so I’m not going to go into any more detail.

Prior to deployment, the only interesting output is the versioned Nuget file, containing all of the dependencies necessary to run the Lambda function (which in our case is just the file in my previous post).

On A Rail

When it comes to deploying the Lambda function code, it's a little bit more complicated than our standard software deployments via Octopus Deploy.

In most cases, we create a versioned package containing all of the necessary logic to deploy the software, so in the case of an API it contains a deploy.ps1 script that gets run automatically during deployment, responsible for creating a website and configuring IIS on the local machine. The most important thing Octopus does for us is provide the mechanisms to get this package to the place where it needs to be.

It usually does this via an Octopus Tentacle, a service that runs on the deployment target and allows for communication between the Octopus server and the machine in question.

This system kind of falls apart when you’re trying to deploy to an AWS Lambda function, which cannot have a tentacle installed on it.

Instead, we rely on the AWS API and what amounts to a worker machine sitting in each environment.

When we do a deployment of our Lambda function project, it gets copied to the worker machine (which is actually just the Octopus server) and it runs the deployment script baked into the package. This script then uses Octopus variables to package the code again (in a way that AWS likes, a simple zip file) and uses the AWS API to upload the changed code to the appropriate Lambda function (by following a naming convention).

The deployment script is pretty straightforward:

function Get-OctopusParameter
{
    [CmdletBinding()]
    param
    (
        [string]$key
    )

    if ($OctopusParameters -eq $null)
    {
        throw "No variable called OctopusParameters is available. This script should be executed as part of an Octopus deployment."
    }

    if (-not($OctopusParameters.ContainsKey($key)))
    {
        throw "The key [$key] could not be found in the set of OctopusParameters."
    }

    return $OctopusParameters[$key]
}

$VerbosePreference = "Continue"
$ErrorActionPreference = "Stop"

$here = Split-Path -Parent $MyInvocation.MyCommand.Path
. "$here\_Find-RootDirectory.ps1"


$rootDirectory = Find-RootDirectory $here
$rootDirectoryPath = $rootDirectory.FullName

$awsKey = $OctopusParameters["AWS.Deployment.Key"]
$awsSecret = $OctopusParameters["AWS.Deployment.Secret"]
$awsRegion = $OctopusParameters["AWS.Deployment.Region"]

$environment = $OctopusParameters["Octopus.Environment.Name"]
$version = $OctopusParameters["Octopus.Release.Number"]

. "$rootDirectoryPath\scripts\common\Functions-Aws.ps1"
$aws = Get-AwsCliExecutablePath

$env:AWS_ACCESS_KEY_ID = $awsKey
$env:AWS_SECRET_ACCESS_KEY = $awsSecret
$env:AWS_DEFAULT_REGION = $awsRegion

$functionPath = "$here\src\function"

Write-Verbose "Compressing lambda code file"
Add-Type -AssemblyName System.IO.Compression.FileSystem
[system.io.compression.zipfile]::CreateFromDirectory($functionPath, "index.zip")

Write-Verbose "Updating Log Processor lambda function to version [$environment/$version]"
(& $aws lambda update-function-code --function-name $environment-ELBLogProcessorFunction --zip-file fileb://index.zip) | Write-Verbose

Nothing fancy, just using the AWS CLI to deploy code to a known function.

Apprehension

AWS Lambda does provide some other mechanisms to deploy code, and we probably could have used them to accomplish the same sort of thing, but I like our patterns to stay consistent and I’m a big fan of the functionality that Octopus Deploy provides, so I didn’t want to give that up.

We had to make a few changes to our environment pattern to allow for non-machine based deployment, like:

  • Ensuring every automatically created environment has a machine registered in it representing the Octopus server (for running scripts that affect external systems)
    • This meant that our environment cleanup also needed to take this into account, deregistering the machine before trying to remove the Octopus environment
  • Providing a section at the end of the environment setup to deploy Octopus projects that aren’t related to specific machines
    • Previously all of our deployments happened via cfn-init on the EC2 instances in question

Once all of that was in place, it was pretty easy to deploy code to a Lambda function as part of setting up the environment, just like we would deploy code to an EC2 instance. It was one of those cases where I’m glad we wrap our usage of CloudFormation in Powershell, because if we were just using raw CloudFormation, I’m not sure how we would have integrated the usage of Octopus Deploy.

To Be Summarised

I’ve only got one more post left in this series, which will summarise the entire thing and link back to all the constituent parts.

Until then, I don’t really have anything else to say.


It's that time again kids: time to continue the series of posts about how we improved the processing of our ELB logs into our ELK stack using AWS Lambda.

You can find the introduction to this whole adventure here, but last time I wrote about the Javascript content of the Lambda function that does the bulk of the work.

This time I’m going to write about how we incorporated the creation of that Lambda function into our environment management strategy and some of the tricks and traps therein.

On a completely unrelated note, it would be funny if this blog post turned up in search results for Half-Life 3.

We’ve Got Hostiles!

I’ve pushed hard to codify our environment setup where I work. The main reason for this is reproducibility, but the desire comes from a long history of interacting with manually setup environments that are lorded over by one guy who just happened to know the guy who originally set them up and where everyone is terrified of changing or otherwise touching said environments.

It's a nightmare.

As far as environment management goes, I've written a couple of times about environment-related things on this blog, one of the most recent being the way in which we version our environments. To give some more context for this post, I recommend you go and read at least the versioning post in order to get a better understanding of how we do environment management. Our strategy is still a work in progress, but it's getting better all the time.

Regardless of whether or not you followed my suggestion, we use a combination of versioned Nuget packages, Powershell, CloudFormation and Octopus Deploy to create an environment, where an environment is a self-contained chunk of infrastructure and code that performs some sort of role, the most common of which is acting as an API. We work primarily with EC2 instances (behind Elastic Load Balancers managed via Auto Scaling Groups), and historically we've deployed Logstash to each instance alongside the code to provide log aggregation (IIS, Application, System Stats, etc). When it comes to capturing and aggregating ELB logs, we include a standalone EC2 instance in the environment, also using Logstash. This standalone instance is the part of the system that we are aiming to replace with the Lambda function.

Because we make extensive use of CloudFormation, incorporating the creation of a Lambda function into an environment that needs to have ELB logs processed is a relatively simple affair.

Simple in that it fits nicely with our current approach. Getting it all to work as expected was still a massive pain.

Blast Pit

Below is a fragment of a completed CloudFormation template for reference purposes.

In the interests of full disclosure, I did not write most of the following fragment, another member of my team was responsible. I just helped.

{
    "Description": "This template is a fragment of a larger template that creates an environment. This fragment in particular contains all of the necessary bits and pieces for a Lambda function that processes ELB logs from S3.",
    "Parameters": {
        "ComponentName": {
            "Description": "The name of the component that this stack makes up. This is already part of the stack name, but is here so it can be used for naming/tagging purposes.",
            "Type": "String"
        },
        "OctopusEnvironment": {
            "Description": "Octopus Environment",
            "Type": "String"
        },
        "PrivateSubnets": {
            "Type": "List<AWS::EC2::Subnet::Id>",
            "Description": "Public subnets (i.e. ones that are automatically assigned public IP addresses) spread across availability zones, intended to contain load balancers and other externally accessible components.",
            "ConstraintDescription": "must be a list of an existing subnets in the selected Virtual Private Cloud."
        },
        "LogsS3BucketName": {
            "Description": "The name of the bucket where log files for the ELB and other things will be placed.",
            "Type": "String"
        }
    },
    "Resources": {
        "LogsBucket" : {
            "Type" : "AWS::S3::Bucket",
            "Properties" : {
                "BucketName" : { "Ref": "LogsS3BucketName" },
                "LifecycleConfiguration": {
                    "Rules": [
                        {
                            "Id": 1,
                            "ExpirationInDays": 7,
                            "Status": "Enabled"
                        }
                    ]
                },
                "Tags" : [
                    {
                        "Key": "function",
                        "Value": "log-storage"
                    }
                ],
                "NotificationConfiguration" : {
                  "LambdaConfigurations": [
                    {
                      "Event" : "s3:ObjectCreated:*",
                      "Function" : { "Fn::GetAtt" : [ "ELBLogProcessorFunction", "Arn" ] }
                    }
                  ]
                }
            }
        },
        "ELBLogProcessorFunctionPermission": {
            "Type" : "AWS::Lambda::Permission",
            "Properties" : {
                "Action":"lambda:invokeFunction",
                "FunctionName": { "Fn::GetAtt": [ "ELBLogProcessorFunction", "Arn" ]},
                "Principal": "s3.amazonaws.com",
                "SourceAccount": {"Ref" : "AWS::AccountId" },
                "SourceArn": {
                    "Fn::Join": [":", [ "arn","aws","s3","", "" ,{"Ref" : "LogsS3BucketName"}]]
                }
            }
        },
        "LambdaSecurityGroup": {
            "Type": "AWS::EC2::SecurityGroup",
            "Properties": {
                "GroupDescription": "Enabling all outbound communications",
                "VpcId": {
                    "Ref": "VpcId"
                },
                "SecurityGroupEgress": [
                    {
                        "IpProtocol": "tcp",
                        "FromPort": "0",
                        "ToPort": "65535",
                        "CidrIp": "0.0.0.0/0"
                    }
                ]
            }
        },
        "ELBLogProcessorFunction": {
          "Type": "AWS::Lambda::Function",
          "Properties": {
            "FunctionName": { "Fn::Join": [ "", [ { "Ref" : "ComponentName" }, "-", { "Ref" : "OctopusEnvironment" }, "-ELBLogProcessorFunction"  ] ] },
            "Description": "ELB Log Processor",
            "Handler": "index.handler",
            "Runtime": "nodejs4.3",
            "Code": {
              "ZipFile": "console.log('placeholder for lambda code')"
            },
            "Role": { "Fn::GetAtt" : ["LogsBucketAccessorRole", "Arn"]},
            "VpcConfig": {
              "SecurityGroupIds": [{"Fn::GetAtt": ["LambdaSecurityGroup", "GroupId"]}],
              "SubnetIds": { "Ref": "PrivateSubnets" }
            }
          }
        },
        "LogsBucketAccessorRole": {
          "Type": "AWS::IAM::Role",
          "Properties": {
            "AssumeRolePolicyDocument": {
                "Version": "2012-10-17",
                "Statement": [
                    {
                        "Effect": "Allow",
                        "Principal": { "Service" : ["lambda.amazonaws.com"]},
                        "Action": [
                            "sts:AssumeRole"
                        ]
                    }
                ]
            },
            "Path": "/",
            "Policies": [{ 
              "PolicyName": "access-s3-read",
              "PolicyDocument": {
                "Version": "2012-10-17",
                "Statement": [
                    {
                        "Effect": "Allow",
                        "Action": [
                            "s3:GetObject"
                        ],
                        "Resource": {
                            "Fn::Join": [":", [ "arn","aws","s3","", "" ,{"Ref" : "LogsS3BucketName"}, "/*"]]
                        }
                    }
                ]
              }
            },
            {
              "PolicyName": "access-logs-write",
              "PolicyDocument": {
                "Version": "2012-10-17",
                "Statement": [
                    {
                        "Effect": "Allow",
                        "Action": [
                            "logs:CreateLogGroup",
                            "logs:CreateLogStream",
                            "logs:PutLogEvents",
                            "logs:DescribeLogStreams"
                        ],
                        "Resource": {
                            "Fn::Join": [":", [ "arn","aws","logs", { "Ref": "AWS::Region" }, {"Ref": "AWS::AccountId"} , "*", "/aws/lambda/*"]]
                        }
                    }
                ]
              }
            },
            {
              "PolicyName": "access-ec2",
              "PolicyDocument": {
                "Version": "2012-10-17",
                "Statement": [
                    {
                        "Effect": "Allow",
                        "Action": [
                          "ec2:*"
                        ],
                        "Resource": "arn:aws:ec2:::*"
                    }
                ]
              }
            },
            {
              "PolicyName": "access-ec2-networkinterface",
              "PolicyDocument": {
                  "Version": "2012-10-17",
                  "Statement": [
                    {
                      "Effect": "Allow",
                      "Action": [
                        "ec2:DescribeInstances",
                        "ec2:CreateNetworkInterface",
                        "ec2:AttachNetworkInterface",
                        "ec2:DescribeNetworkInterfaces",
                        "ec2:DeleteNetworkInterface",
                        "ec2:DetachNetworkInterface",
                        "ec2:ModifyNetworkInterfaceAttribute",
                        "ec2:ResetNetworkInterfaceAttribute",
                        "autoscaling:CompleteLifecycleAction"
                      ],
                      "Resource": "*"
                    }
                  ]
                }
              }
            ]
          }
        }
    }
}

The most important part of the template above is the ELBLogProcessorFunction. This is where the actual Lambda function is specified, although you might notice that it does not actually have the code from the previous post attached to it in any way. The reason for this is that we create the Lambda function with placeholder code, and then use Octopus Deploy afterwards to deploy a versioned package containing the actual code to the Lambda function, like we do for everything else. Packaging and deploying the Lambda function code is a topic for another blog post though (the next one, hopefully).

Other things to note in the template fragment:

  • Lambda functions require a surprising amount of permissions to do what they do. When creating a function using the AWS website, most of this complexity is dealt with for you. When using CloudFormation however, you have to be aware of what the Lambda function needs and give it the appropriate permissions. You could just give the Lambda function as many permissions as possible, but that would be stupid in the face of “least privilege”, and would represent a significant security risk (compromised Lambda code being able to do all sorts of crazy things for example).
    • Logging is a particularly important example of these Lambda permissions. Without the capability to create log streams, your function is going to be damn near impossible to debug.
  • If you’re using S3 as the trigger for your Lambda function, you need to make sure that S3 has permissions to execute the function. This is the ELBLogProcessorFunctionPermission logical resource in the template fragment. Without this, your Lambda function will never trigger, even if you have setup the NotificationConfiguration on the bucket itself.
  • If your Lambda function needs to access external resources (like S3) you will likely have to use Private Subnets + a NAT Gateway to give it that ability. Technically you could also use a proxy, but god why would you do that to yourself. If you put your Lambda function into a Public Subnet, I'm pretty sure it doesn't automatically get access to the greater internet like an EC2 instance does, and you will be intensely confused as to why all your external calls time out.
  • Make sure you apply an appropriate Security Group to your Lambda function so that it can communicate externally, or you'll get mysterious timeouts that look exactly the same as the ones you get when you haven't set up general internet access correctly.

To Be Continued

So that's how we set up the Lambda function as part of any environment that needs to process ELB logs. Remember, the template fragment above is incomplete, and is missing Auto Scaling Groups, Launch Configurations, Load Balancers, Host Records and a multitude of other things that make up an actual environment. What I've shown above is enough to get the pipeline up and running though, where any object introduced into the LogsBucket will trigger an execution of the Lambda function, so it's enough to illustrate our approach.

Of course, the function doesn’t do anything yet, which ties in with the next post.

How we get the code into Lambda.

Until then, may all your Lambda function executions be swift and free of errors.


Now that I am feeling less like dying horribly from a terrible plague, it's time to continue talking about processing ELB logs into an ELK stack via AWS Lambda.

Last time I talked about our current situation, our motivations for wanting to switch to something better and the general plan moving forward.

To summarise the plan, the new log processor system needs 3 main parts:

  • The Lambda function code, which is written in Javascript, and would need to parse an ELB log file one line at a time, break it up, convert it into a JSON structure and then push it to a Logstash TCP endpoint to be written into Elasticsearch
  • The configuration/creation of the Lambda function, which would need to be done via CloudFormation as part of our normal environment setup (i.e. spin up an environment, it comes with the Lambda function that will process any ELB logs encountered)
  • The deployment of the code to the Lambda function via Octopus (because that’s how we roll)

I’m basically going to do a short blog post on each of those pieces of work, and maybe one at the end to tie it all together.

With that in mind, let's talk Javascript.

Unforeseen Consequences

When you're writing Lambda functions, Javascript (via Node.js) is probably your best bet. Sure, you can run Java or Python (and maybe one day C# using .NET Core), but Javascript is almost certainly going to be the easiest. It's what we chose when we put together the faster S3 clone prototype, and while the fundamentally asynchronous nature of Node.js took some getting used to, it worked well enough.

When it comes to Javascript, I don't personally write a lot of it. If I'm writing a server-side component, I'm probably going to pick C# as my language of choice (with all its fancy features like "a compiler" and "type safety"), and I don't find myself writing things like websites or small Javascript applications very often, if at all. My team definitely writes websites though, and we use React.js to do it, so it's not like Javascript is an entirely foreign concept.

For the purposes of reading in and parsing an ELB log file via a Lambda function, we didn't need a particularly complex piece of Javascript. All we needed was something that reads the specified file from S3 when the Lambda function triggers, something to process the contents of that file line by line, something to parse and format those lines in a way that a Logstash input will accept, and something to push the resulting JSON payload to the Logstash listener over raw TCP.

Without further ado, I give you the completed script:

'use strict';

let aws = require('aws-sdk');
let s3 = new aws.S3({ apiVersion: '2006-03-01' });
let readline = require('readline');
let net = require('net');

const _type = 'logs';
const _sourceModuleName = 'ELB';
const _logHost = '#{LogHost}'
const _logPort = #{LogPort}
const _environment = '#{Octopus.Environment.Name}'
const _component = '#{Component}'
const _application = '#{Application}'

function postToLogstash(entry){
    console.log("INFO: Posting event to logstash...");

    var socket = net.createConnection(_logPort, _logHost);
    var message = JSON.stringify(entry) + "\n";
    socket.write(message);
    socket.end();

    console.log("INFO: Posting to logstash...done");
}

exports.handler = (event, context, callback) => {
    console.log('INFO: Retrieving log from S3 bucket...');

    const bucket = event.Records[0].s3.bucket.name;
    const key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, ' '));
    const params = {
        Bucket: bucket,
        Key: key
    };

    const reader = readline.createInterface({
        input: s3.getObject(params).createReadStream()
    });

    const expectedColumns = 12;

    reader.on('line', function(line) {
        console.log("INFO: Parsing S3 line entry...");

        // Split on spaces that are not inside double quotes (the quoted request field contains spaces).
        const columns = line.split(/ (?=(?:(?:[^"]*"){2})*[^"]*$)/);

        if(columns.length >= expectedColumns){
            var entry = {
                EventReceivedTime: columns[0],
                LoadBalancerName: columns[1],
                PublicIpAndPort: columns[2],
                InternalIpAndPort: columns[3],
                TimeToForwardRequest: parseFloat(columns[4]) * 1000,
                TimeTaken: parseFloat(columns[5]) * 1000,
                TimeToForwardResponse: parseFloat(columns[6]) * 1000,
                Status: columns[7],
                BackendStatus: columns[8],
                BytesUploadedFromClient: parseInt(columns[9]),
                BytesDownloadedByClient: parseInt(columns[10]),
                FullRequest: columns[11],
                Component: _component,
                SourceModuleName: _sourceModuleName,
                Environment: _environment,
                Application: _application,
                Type: _type
            };
            postToLogstash(entry);
        } else {
            console.log("ERROR: Invalid record length was expecting " + expectedColumns.length + " but found " + columns.length);
            console.log('ERROR: -------');
            console.log(line);
            console.log('ERROR: -------');
        }
    });
};

Nothing too fancy.

In the interest of full disclosure, I did not write the script above. It was written by a few guys from my team initially as a proof of concept, then improved/hardened as a more controlled piece of work.

Office Complex

You might notice some strange variable names at the top of the script (i.e. #{Octopus.Environment.Name}).

We use Octopus Deploy for all of our deployments, and this includes the deployment of Lambda functions (which we package via Nuget and then deploy via the AWS Powershell Cmdlets/CLI inside Octopus). The #{NAME} notation is a way for Octopus to substitute variable values into files during deployment. This substitution is very useful, and can be scoped via a variety of things (like Environment, Machine, Role, etc), so by the time the script actually gets into AWS those variables are filled in with actual values.

Other than our use of Octopus variables, other things to note in this piece of Javascript are:

  • At no point does the function specify which credentials are being used to access the S3 bucket containing the ELB log files. This is because the Lambda function has been configured with an IAM role allowing it to access the required resources. The AWS Javascript library has built in support for running inside supported AWS contexts (like EC2 instances and Lambda functions), such that it can use the role applied to the context to get appropriate temporary credentials. This is a much better approach than using specific credentials in the script, which can result in unintended consequences if you aren’t eternally vigilant.
  • You need to make sure that your Lambda function is configured with an appropriate security group that allows it to use the expected outbound channel (i.e. make sure it can get to the Logstash host you're trying to connect to). This was somewhat of an issue for us, as our ELK stack is hosted inside another AWS account (our OPs account), so we had to make sure that all of the appropriate VPC peering was configured before it would work correctly. It can be a bit of a pain to set up the smallest possible surface area, but don't be tempted to just configure the Lambda function to be able to do everything and go anywhere. Smaller surface area = more security, and the policy of least privilege is one you should always follow.
  • It's important to ensure that if you're doing TCP communication in a Lambda function, you close your socket when you're done with it, or the Lambda function might not exit. It might, but it also might not, and it can really throw you for a loop if you don't know why. There's a small sketch of what I mean just below this list.
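
To illustrate that last point, one way to make sure nothing is left dangling is to wrap the write in a promise that only resolves once the socket has actually closed, and have the handler wait on those promises before finishing. This is just a sketch of the idea (with an assumed function name), not code from the function above:

var net = require('net');

// Writes a single newline-delimited JSON event over raw TCP and resolves only
// once the socket has been fully closed, so the Lambda function can exit cleanly.
function sendOverTcp(host, port, entry) {
    return new Promise((resolve, reject) => {
        var socket = net.createConnection(port, host);

        socket.on('error', (error) => {
            socket.destroy();
            reject(error);
        });

        // 'close' fires once end() has flushed the data and the connection is torn down.
        socket.on('close', () => resolve());

        socket.write(JSON.stringify(entry) + "\n");
        socket.end();
    });
}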

To Be Continued

That’s it for this week. The Javascript I’ve included above is pretty generic (apart from the specific set of fields that we like to have in our log events) and will successfully process an ELB log file from S3 to a Logstash instance listening on a port of your choosing (probably 6379) when used in a Lambda function. Feel free to reuse it for your own purposes.

Next week I’ll continue this series of posts with information about how we use CloudFormation to setup our Lambda function as part of one of our environment definitions.

CloudFormation and Lambda aren’t the best of friends yet, so there is some interesting stuff that you have to be aware of.