That Neat Little Symbol From Half-Life, Part 3

December 6. 2016

It’s that time again, kids: time to continue the series of posts about how we improved the processing of our ELB logs into our ELK stack using AWS Lambda.

You can find the introduction to this whole adventure here, but last time I wrote about the Javascript content of the Lambda function that does the bulk of the work.

This time I’m going to write about how we incorporated the creation of that Lambda function into our environment management strategy and some of the tricks and traps therein.

On a completely unrelated note, it would be funny if this blog post turned up in search results for Half-Life 3.

We’ve Got Hostiles!

I’ve pushed hard to codify our environment setup where I work. The main reason for this is reproducibility, but the desire comes from a long history of interacting with manually set up environments that are lorded over by one guy who just happened to know the guy who originally set them up, and where everyone is terrified of changing or otherwise touching said environments.

It’s a nightmare.

As far as environment management goes, I’ve written a couple of times about environment related things on this blog, one of the most recent being the way in which we version our environments. To give some more context for this post, I recommend you go and read at least the versioning post in order to get a better understanding of how we do environment management. Our strategy is still a work in progress, but it’s getting better all the time.

Regardless of whether or not you followed my suggestion, we use a combination of versioned Nuget packages, Powershell, CloudFormation and Octopus Deploy to create an environment, where an environment is a self-contained chunk of infrastructure and code that performs some sort of role, the most common of which is acting as an API. We work primarily with EC2 instances (managed via Auto Scaling Groups and sitting behind Elastic Load Balancers), and historically, we’ve deployed Logstash to each instance alongside the code to provide log aggregation (IIS, Application, System Stats, etc). When it comes to capturing and aggregating ELB logs, we include a standalone EC2 instance in the environment, also using Logstash. This standalone instance is the part of the system that we are aiming to replace with the Lambda function.

Because we make extensive use of CloudFormation, incorporating the creation of a Lambda function into an environment that needs to have ELB logs processed is a relatively simple affair.

Simple in that it fits nicely with our current approach. Getting it all to work as expected was still a massive pain.

Blast Pit

Below is a fragment of a completed CloudFormation template for reference purposes.

In the interests of full disclosure, I did not write most of the following fragment; another member of my team was responsible. I just helped.

{
    "Description": "This template is a fragment of a larger template that creates an environment. This fragment in particular contains all of the necessary bits and pieces for a Lambda function that processes ELB logs from S3.",
    "Parameters": {
        "ComponentName": {
            "Description": "The name of the component that this stack makes up. This is already part of the stack name, but is here so it can be used for naming/tagging purposes.",
            "Type": "String"
        },
        "OctopusEnvironment": {
            "Description": "Octopus Environment",
            "Type": "String"
        },
        "PrivateSubnets": {
            "Type": "List<AWS::EC2::Subnet::Id>",
            "Description": "Private subnets (i.e. ones that are not automatically assigned public IP addresses) spread across availability zones, intended to contain the Lambda function and other components that should not be directly accessible from outside.",
            "ConstraintDescription": "must be a list of existing subnets in the selected Virtual Private Cloud."
        },
        "LogsS3BucketName": {
            "Description": "The name of the bucket where log files for the ELB and other things will be placed.",
            "Type": "String"
        }
    },
    "Resources": {
        "LogsBucket" : {
            "Type" : "AWS::S3::Bucket",
            "Properties" : {
                "BucketName" : { "Ref": "LogsS3BucketName" },
                "LifecycleConfiguration": {
                    "Rules": [
                        {
                            "Id": 1,
                            "ExpirationInDays": 7,
                            "Status": "Enabled"
                        }
                    ]
                },
                "Tags" : [
                    {
                        "Key": "function",
                        "Value": "log-storage"
                    }
                ],
                "NotificationConfiguration" : {
                  "LambdaConfigurations": [
                    {
                      "Event" : "s3:ObjectCreated:*",
                      "Function" : { "Fn::GetAtt" : [ "ELBLogProcessorFunction", "Arn" ] }
                    }
                  ]
                }
            }
        },
        "ELBLogProcessorFunctionPermission": {
            "Type" : "AWS::Lambda::Permission",
            "Properties" : {
                "Action":"lambda:invokeFunction",
                "FunctionName": { "Fn::GetAtt": [ "ELBLogProcessorFunction", "Arn" ]},
                "Principal": "s3.amazonaws.com",
                "SourceAccount": {"Ref" : "AWS::AccountId" },
                "SourceArn": {
                    "Fn::Join": [":", [ "arn","aws","s3","", "" ,{"Ref" : "LogsS3BucketName"}]]
                }
            }
        },
        "LambdaSecurityGroup": {
            "Type": "AWS::EC2::SecurityGroup",
            "Properties": {
                "GroupDescription": "Enabling all outbound communications",
                "VpcId": {
                    "Ref": "VpcId"
                },
                "SecurityGroupEgress": [
                    {
                        "IpProtocol": "tcp",
                        "FromPort": "0",
                        "ToPort": "65535",
                        "CidrIp": "0.0.0.0/0"
                    }
                ]
            }
        },
        "ELBLogProcessorFunction": {
          "Type": "AWS::Lambda::Function",
          "Properties": {
            "FunctionName": { "Fn::Join": [ "", [ { "Ref" : "ComponentName" }, "-", { "Ref" : "OctopusEnvironment" }, "-ELBLogProcessorFunction"  ] ] },
            "Description": "ELB Log Processor",
            "Handler": "index.handler",
            "Runtime": "nodejs4.3",
            "Code": {
              "ZipFile": "console.log('placeholder for lambda code')"
            },
            "Role": { "Fn::GetAtt" : ["LogsBucketAccessorRole", "Arn"]},
            "VpcConfig": {
              "SecurityGroupIds": [{"Fn::GetAtt": ["LambdaSecurityGroup", "GroupId"]}],
              "SubnetIds": { "Ref": "PrivateSubnets" }
            }
          }
        },
        "LogsBucketAccessorRole": {
          "Type": "AWS::IAM::Role",
          "Properties": {
            "AssumeRolePolicyDocument": {
                "Version": "2012-10-17",
                "Statement": [
                    {
                        "Effect": "Allow",
                        "Principal": { "Service" : ["lambda.amazonaws.com"]},
                        "Action": [
                            "sts:AssumeRole"
                        ]
                    }
                ]
            },
            "Path": "/",
            "Policies": [{ 
              "PolicyName": "access-s3-read",
              "PolicyDocument": {
                "Version": "2012-10-17",
                "Statement": [
                    {
                        "Effect": "Allow",
                        "Action": [
                            "s3:GetObject"
                        ],
                        "Resource": {
                            "Fn::Join": ["", [ "arn:aws:s3:::", {"Ref" : "LogsS3BucketName"}, "/*"]]
                        }
                    }
                ]
              }
            },
            {
              "PolicyName": "access-logs-write",
              "PolicyDocument": {
                "Version": "2012-10-17",
                "Statement": [
                    {
                        "Effect": "Allow",
                        "Action": [
                            "logs:CreateLogGroup",
                            "logs:CreateLogStream",
                            "logs:PutLogEvents",
                            "logs:DescribeLogStreams"
                        ],
                        "Resource": {
                            "Fn::Join": [":", [ "arn","aws","logs", { "Ref": "AWS::Region" }, {"Ref": "AWS::AccountId"} , "*", "/aws/lambda/*"]]
                        }
                    }
                ]
              }
            },
            {
              "PolicyName": "access-ec2",
              "PolicyDocument": {
                "Version": "2012-10-17",
                "Statement": [
                    {
                        "Effect": "Allow",
                        "Action": [
                          "ec2:*"
                        ],
                        "Resource": "arn:aws:ec2:::*"
                    }
                ]
              }
            },
            {
              "PolicyName": "access-ec2-networkinterface",
              "PolicyDocument": {
                  "Version": "2012-10-17",
                  "Statement": [
                    {
                      "Effect": "Allow",
                      "Action": [
                        "ec2:DescribeInstances",
                        "ec2:CreateNetworkInterface",
                        "ec2:AttachNetworkInterface",
                        "ec2:DescribeNetworkInterfaces",
                        "ec2:DeleteNetworkInterface",
                        "ec2:DetachNetworkInterface",
                        "ec2:ModifyNetworkInterfaceAttribute",
                        "ec2:ResetNetworkInterfaceAttribute",
                        "autoscaling:CompleteLifecycleAction"
                      ],
                      "Resource": "*"
                    }
                  ]
                }
              }
            ]
          }
        }
    }
}
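
The Fn::Join calls with empty strings in them look odd until you remember that S3 ARNs have no region or account in them, so those two slots are simply left blank. The sketch below is plain Javascript, not CloudFormation; fnJoin and the bucket name are made-up stand-ins, used only to show what such joins evaluate to. It also shows that the delimiter goes between every pair of parts, so the trailing /* of an object-level ARN has to be attached with an empty delimiter rather than added as one more colon-separated part.

```javascript
'use strict';

// Hypothetical stand-in for Fn::Join: glue the parts together with a delimiter.
function fnJoin(delimiter, parts) {
    return parts.join(delimiter);
}

// Made-up bucket name, standing in for { "Ref": "LogsS3BucketName" }.
const logsS3BucketName = 'example-component-logs';

// The bucket itself (what the Lambda permission's SourceArn points at).
// The two empty strings are the unused region and account slots.
const bucketArn = fnJoin(':', ['arn', 'aws', 's3', '', '', logsS3BucketName]);

// Every object in the bucket: join with an empty delimiter so nothing sits before "/*".
const objectsArn = fnJoin('', ['arn:aws:s3:::', logsS3BucketName, '/*']);

// Adding "/*" as one more ":"-joined part puts a stray colon in front of it.
const strayColonArn = fnJoin(':', ['arn', 'aws', 's3', '', '', logsS3BucketName, '/*']);

console.log(bucketArn);
console.log(objectsArn);
console.log(strayColonArn);
```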

The most important part of the template above is the ELBLogProcessorFunction. This is where the actual Lambda function is specified, although you might notice that it does not actually have the code from the previous post attached to it in any way. The reason for this is that we create the Lambda function with placeholder code, and then use Octopus Deploy afterwards to deploy a versioned package containing the actual code to the Lambda function, like we do for everything else. Packaging and deploying the Lambda function code is a topic for another blog post though (the next one, hopefully).

Other things to note in the template fragment:

- Lambda functions require a surprising number of permissions to do what they do. When creating a function using the AWS website, most of this complexity is dealt with for you. When using CloudFormation however, you have to be aware of what the Lambda function needs and give it the appropriate permissions. You could just give the Lambda function as many permissions as possible, but that would be stupid in the face of “least privilege”, and would represent a significant security risk (compromised Lambda code being able to do all sorts of crazy things for example).
  - Logging is a particularly important example for Lambda permissions. Without the capability to create log streams, your function is going to be damn near impossible to debug.
- If you’re using S3 as the trigger for your Lambda function, you need to make sure that S3 has permissions to execute the function. This is the ELBLogProcessorFunctionPermission logical resource in the template fragment. Without this, your Lambda function will never trigger, even if you have set up the NotificationConfiguration on the bucket itself.
- If your Lambda function needs to access external resources (like S3) you will likely have to use Private Subnets + a NAT Gateway to give it that ability. Technically you could also use a proxy, but god why would you do that to yourself. If you put your Lambda function into a Public Subnet, I’m pretty sure it doesn’t automatically get access to the greater internet like an EC2 instance does, and you will be intensely confused as to why all your external calls time out.
- Make sure you apply an appropriate Security Group to your Lambda function so that it can communicate externally, or you’ll get mysterious timeouts that look exactly the same as the ones you get when you haven’t set up general internet access correctly.
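
When chasing those mysterious timeouts, it can help to temporarily swap the function body for something that only tests connectivity. What follows is a rough sketch rather than anything we actually deployed; probeTcp, the host name and the timeout are all made up for illustration, and it relies on nothing beyond the standard Node.js net module.

```javascript
'use strict';

const net = require('net');

// Hypothetical helper: try to open a TCP connection and report how it went.
// Resolves with 'connected', 'timeout' or 'error: <code>'; it never rejects.
function probeTcp(host, port, timeoutMs) {
    return new Promise((resolve) => {
        const socket = net.createConnection(port, host);
        let settled = false;

        function finish(result) {
            if (settled) { return; }
            settled = true;
            clearTimeout(timer);
            socket.destroy(); // always release the socket so the function can exit
            resolve(result);
        }

        // A blocked route or security group usually shows up as silence, not as an error.
        const timer = setTimeout(() => finish('timeout'), timeoutMs);

        socket.on('connect', () => finish('connected'));
        socket.on('error', (err) => finish('error: ' + err.code));
    });
}

// Possible use inside a throwaway handler (host name is an assumption):
// exports.handler = (event, context, callback) => {
//     probeTcp('logstash.example.internal', 6379, 3000)
//         .then((result) => callback(null, result));
// };
```

A result of 'timeout' points at routing or security groups, while an immediate error such as a refused connection means the network path is fine and the problem is at the other end.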

To Be Continued

So that’s how we set up the Lambda function as part of any environment that needs to process ELB logs. Remember, the template fragment above is incomplete, and is missing Auto Scaling Groups, Launch Configurations, Load Balancers, Host Records and a multitude of other things that make up an actual environment. What I’ve shown above is enough to get the pipeline up and running, where any object introduced into the LogsBucket will trigger an execution of the Lambda function, so it’s enough to illustrate our approach.

Of course, the function doesn’t do anything yet, which ties in with the next post.

How we get the code into Lambda.

Until then, may all your Lambda function executions be swift and free of errors.

That Neat Little Symbol From Half-Life, Part 2

November 29. 2016

Now that I am feeling less like dying horribly from a terrible plague, it’s time to continue to talk about processing ELB logs into an ELK stack via AWS Lambda.

Last time I talked about our current situation, our motivations for wanting to switch to something better and the general plan moving forward.

To summarise the plan, the new log processor system needs 3 main parts:

- The Lambda function code, which is written in Javascript, and would need to parse an ELB log file one line at a time, break it up, convert it into a JSON structure and then push it to a Logstash TCP endpoint to be written into Elasticsearch
- The configuration/creation of the Lambda function, which would need to be done via CloudFormation as part of our normal environment setup (i.e. spin up an environment, it comes with the Lambda function that will process any ELB logs encountered)
- The deployment of the code to the Lambda function via Octopus (because that’s how we roll)

I’m basically going to do a short blog post on each of those pieces of work, and maybe one at the end to tie it all together.

With that in mind, let’s talk Javascript.

Unforeseen Consequences

When you’re writing Lambda functions, Javascript (via Node.js) is probably your best bet. Sure, you can run Java or Python (and maybe one day C# using .NET Core), but Javascript is almost certainly going to be the easiest. It’s what we chose when we put together the faster S3 clone prototype, and while the fundamentally asynchronous nature of Node.js took some getting used to, it worked well enough.

When it comes to Javascript, I don’t personally write a lot of it. If I’m writing a server side component, I’m probably going to pick C# as my language of choice (with all its fancy features like “a compiler” and “type safety”) and I don’t find myself writing things like websites or small Javascript applications very often, if at all. My team definitely writes websites though, and we use React.js to do it, so it’s not like Javascript is an entirely foreign concept.

For the purposes of reading in and parsing an ELB log file via a Lambda function, we didn’t need a particularly complex piece of Javascript. Something that reads the specified file from S3 after the Lambda function triggers, something to process the contents of that file line by line, something to parse and format those lines in a way that a Logstash input will accept, and something to push that JSON payload to the Logstash listener over raw TCP.

Without further ado, I give you the completed script:

'use strict';

let aws = require('aws-sdk');
let s3 = new aws.S3({ apiVersion: '2006-03-01' });
let readline = require('readline');
let net = require('net');

const _type = 'logs';
const _sourceModuleName = 'ELB';
const _logHost = '#{LogHost}'
const _logPort = #{LogPort}
const _environment = '#{Octopus.Environment.Name}'
const _component = '#{Component}'
const _application = '#{Application}'

function postToLogstash(entry){
    console.log("INFO: Posting event to logstash...");

    var socket = net.createConnection(_logPort, _logHost);
    var message = JSON.stringify(entry) + "\n";
    socket.write(message);
    socket.end();

    console.log("INFO: Posting to logstash...done");
}

exports.handler = (event, context, callback) => {
    console.log('INFO: Retrieving log from S3 bucket...');

    const bucket = event.Records[0].s3.bucket.name;
    const key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, ' '));
    const params = {
        Bucket: bucket,
        Key: key
    };

    const reader = readline.createInterface({
        input: s3.getObject(params).createReadStream()
    });

    const expectedColumns = 12;

    reader.on('line', function(line) {
        console.log("INFO: Parsing S3 line entry...");

        const columns = line.split(/ (?=(?:(?:[^"]*"){2})*[^"]*$)/);

        if(columns.length >= expectedColumns){
            var entry = {
                EventReceivedTime: columns[0],
                LoadBalancerName: columns[1],
                PublicIpAndPort: columns[2],
                InternalIpAndPort: columns[3],
                TimeToForwardRequest: parseFloat(columns[4]) * 1000,
                TimeTaken: parseFloat(columns[5]) * 1000,
                TimeToForwardResponse: parseFloat(columns[6]) * 1000,
                Status: columns[7],
                BackendStatus: columns[8],
                BytesUploadedFromClient: parseInt(columns[9], 10),
                BytesDownloadedByClient: parseInt(columns[10], 10),
                FullRequest: columns[11],
                Component: _component,
                SourceModuleName: _sourceModuleName,
                Environment: _environment,
                Application: _application,
                Type: _type
            };
            postToLogstash(entry);
        } else {
            console.log("ERROR: Invalid record length was expecting " + expectedColumns + " but found " + columns.length);
            console.log('ERROR: -------');
            console.log(line);
            console.log('ERROR: -------');
        }
    });
};

Nothing too fancy.

In the interest of full disclosure, I did not write the script above. It was written by a few guys from my team initially as a proof of concept, then improved/hardened as a more controlled piece of work.
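
The split expression is the least obvious part of the script, so here is a small standalone sketch of it at work. The sample line is modelled on the classic ELB access log format with illustrative values; it is not one of our real log entries.

```javascript
'use strict';

// Sample entry in the classic ELB access log format (values are illustrative).
const sampleLine = '2015-05-13T23:39:43.945958Z my-loadbalancer 192.168.131.39:2817 10.0.0.1:80 ' +
    '0.000073 0.001048 0.000057 200 200 0 29 ' +
    '"GET http://www.example.com:80/ HTTP/1.1" "curl/7.38.0" - -';

// Same expression as the script: split on a space only when an even number of
// double quotes follows it, i.e. when the space is not inside a quoted field.
const columns = sampleLine.split(/ (?=(?:(?:[^"]*"){2})*[^"]*$)/);

console.log(columns.length);                 // number of fields found
console.log(columns[11]);                    // the quoted request, kept as a single field
console.log(parseFloat(columns[5]) * 1000);  // backend processing time, seconds to milliseconds
```

For this sample the split produces 15 columns, three more than the 12 the script maps, which is why the length check uses >= rather than an exact match. Note that the quoted fields keep their surrounding double quotes.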

Office Complex

You might notice some strange variable names at the top of the script (i.e. #{Octopus.Environment.Name}).

We use Octopus Deploy for all of our deployments, and this includes the deployment of Lambda functions (which we package via Nuget and then deploy via the AWS Powershell Cmdlets/CLI inside Octopus). The #{NAME} notation is a way for Octopus to substitute variable values into files during deployment. This substitution is very useful, and can be scoped via a variety of things (like Environment, Machine, Role, etc), so by the time the script actually gets into AWS those variables are filled in with actual values.
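
To make the effect of that substitution concrete, here is a toy version of it. This is emphatically not how Octopus implements it; substituteVariables and the values passed in are invented for illustration, and the real feature does a lot more (scoping, for a start).

```javascript
'use strict';

// Hypothetical stand-in for the substitution step; this is not Octopus code.
// Replaces every #{Name} token with the matching value and leaves unknown tokens alone.
function substituteVariables(text, variables) {
    return text.replace(/#\{([^}]+)\}/g, (token, name) =>
        Object.prototype.hasOwnProperty.call(variables, name) ? String(variables[name]) : token);
}

// Three of the lines from the top of the script, as they look in the package.
const before = [
    "const _logHost = '#{LogHost}'",
    "const _logPort = #{LogPort}",
    "const _environment = '#{Octopus.Environment.Name}'"
].join('\n');

// Made-up values for illustration only.
const after = substituteVariables(before, {
    'LogHost': 'logstash.example.internal',
    'LogPort': 6379,
    'Octopus.Environment.Name': 'CI'
});

console.log(after);
```

This is also why _logPort is not wrapped in quotes: the file only becomes valid Javascript once the token has been replaced with a number, so the script cannot be run as-is straight out of source control.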

Other than our use of Octopus variables, other things to note in this piece of Javascript are:

- At no point does the function specify which credentials are being used to access the S3 bucket containing the ELB log files. This is because the Lambda function has been configured with an IAM role allowing it to access the required resources. The AWS Javascript library has built in support for running inside supported AWS contexts (like EC2 instances and Lambda functions), such that it can use the role applied to the context to get appropriate temporary credentials. This is a much better approach than using specific credentials in the script, which can result in unintended consequences if you aren’t eternally vigilant.
- You need to make sure that your Lambda function is configured with an appropriate security group that allows it to use the expected outbound channel (i.e. make sure it can get to the Logstash host you’re trying to connect to). This was somewhat of an issue for us as our ELK stack is hosted inside another AWS account (our OPs account), so we had to make sure that all of the appropriate VPC peering was configured before it would work correctly. It can be a bit of a pain to set up the smallest possible surface area, but don’t be tempted to just configure the Lambda function to be able to do everything and go anywhere. Smaller surface area = more security and the policy of least privilege is one you should always follow.
- It’s important to ensure that if you’re doing TCP communication in a Lambda function, you close your socket when you’re done with it, or the Lambda function might not exit. It might, but it also might not, and it can really throw you for a loop if you don’t know why.
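
On the socket point, one way to be certain a socket has been released is to wait for its close event. The following is a sketch of a possible variation on postToLogstash, not the code we run; the name postToLogstashAndWait and the explicit host/port arguments are assumptions made to keep the example self-contained.

```javascript
'use strict';

const net = require('net');

// Hypothetical variation on postToLogstash: returns a Promise that settles once
// the socket has fully closed, so the caller knows nothing is left dangling.
function postToLogstashAndWait(entry, host, port) {
    return new Promise((resolve, reject) => {
        const socket = net.createConnection(port, host);

        // Connection problems (refused, unreachable, reset) surface here.
        socket.on('error', reject);

        // 'close' fires after both directions of the connection have shut down.
        socket.on('close', (hadError) => {
            if (!hadError) { resolve(); }
        });

        // end() queues the message, then closes our side once it has been flushed.
        socket.end(JSON.stringify(entry) + '\n');
    });
}
```

Collecting the returned promises and waiting on them with Promise.all before finishing would give the handler a definite point at which every socket is known to be closed.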

To Be Continued

That’s it for this week. The Javascript I’ve included above is pretty generic (apart from the specific set of fields that we like to have in our log events) and will successfully process an ELB log file from S3 to a Logstash instance listening on a port of your choosing (probably 6379) when used in a Lambda function. Feel free to reuse it for your own purposes.

Next week I’ll continue this series of posts with information about how we use CloudFormation to set up our Lambda function as part of one of our environment definitions.

CloudFormation and Lambda aren’t the best of friends yet, so there is some interesting stuff that you have to be aware of.

That Neat Little Symbol From Half-Life, Part 1

November 15. 2016

We’ve come a long way in our log aggregation journey. Don’t get me wrong, we still have a long way to go, but bit by bit, we’re getting better at it.

A good example of getting better is the way in which we process our Elastic Load Balancer (ELB) logs. Over a year ago I put together a system for processing these logs into our ELK stack with way too many moving parts. It used Logstash (for processing) and Powershell (for downloading files from S3) hosted on an EC2 instance to aggregate ELB logs from S3 into our ELK stack. Somewhat complicated in practice, but it worked, even if I was never particularly happy with it.

As is the way with these things though, because it did work, we’ve had no reason to revisit it, and we’ve successfully applied the same approach to at least 3 other environments we’ve set up since.

It wasn’t without its share of problems though:

- The EC2 instances hosting the solution had a tendency to cap themselves at 100% CPU for long periods. They were initially t2.mediums, but they kept expending all of their CPU credits, so we had to upgrade them to m3.mediums, which was a 50% increase in cost ($US 115/month). Never did figure out exactly what needed all that CPU, but the leading theory was Logstash.
- For a while, the logs simply stopped processing after a period of time (days/weeks). This turned out to be an issue with accumulating memory dumps from Java as a result of Logstash crashing and NSSM automatically restarting it.
- These were the machines most vulnerable to the memory leak in Logstash that causes its TCP driver to accumulate non-paged memory on Windows AWS instances due to some driver problem.

Good old Logstash.

To turn the discussion back to getting better, we had the opportunity to revisit this process when building some new environments, using all of the knowledge and experience that we’d gained in the intervening period. I think we came up with a much more efficient and understandable solution, but it wasn’t without its share of difficulties, which makes for a good post.

Anomalous Materials

One of the primary weaknesses in the previous approach for processing ELB logs was that it required an entire EC2 instance all to itself, for each environment that we spun up. We did this in order to keep each log processor isolated from the others and to allow us to spin up an entirely self-contained environment without having to worry about some common machine that processed all of the logs in a bunch of different buckets.

Another weakness in the process that bothered me was that it had way too many moving parts. Sometimes you have to have a lot of moving parts working together in order to accomplish a goal, but you should always strive for simplicity, both from an operational point of view and from a maintenance point of view. Less is almost always better in software development.

AWS has come a long way since we jammed the initial solution together, so we decided to use this opportunity to simplify the process and experiment with some AWS tools that we don’t frequently use.

After some discussion, we formed an idea of what we would like the new log processor to look like. We wanted to use Lambda to process the ELB logs as they were created, pushing them to the same Logstash ingress endpoint that we’ve been using consistently for the last year or so. The benefits we were expecting were a reduction in complexity (no need to have 4 different things working together), a reduction in running cost (mostly due to the removal of the EC2 instance) and a reduction in latency (the Lambda function would trigger whenever a file was written to the S3 bucket by the ELB, which meant no more polling for changes).

For those of you unfamiliar with Lambda, it’s a service offered by AWS that lets you configure code to run in response to a variety of events. I’ve used it before to create a quicker S3 bucket clone, so if you want some more information, feel free to read up on that adventure.

In order to accomplish our goal, we would need to deal with 3 things:

- The Lambda function code itself, which is written in Javascript, and would need to parse an ELB log file one line at a time, break it up, convert it into a JSON structure and then push it to a Logstash TCP endpoint to be written into Elasticsearch
- The configuration/creation of the Lambda function, which would need to be done via CloudFormation as part of our normal environment setup (i.e. spin up an environment, it comes with the Lambda function that will process any ELB logs encountered)
- The deployment of the code to the Lambda function via Octopus (because that’s how we roll)

Nothing particularly insane there, but definitely a few things that we’d never done before.

To Be Continued

In order to avoid creating a single monstrous post with more words than a small novel, I’m going to break it here.

Next week I’ll continue, explaining the Javascript code that we put together to process the log files (it’s not particularly complicated) and how we configured the Lambda function by incorporating its setup into our environment setup.

Until then, may all your Lambda functions execute quickly and your S3 buckets not turn into ghosts.

Faster S3! Clone! Clone!

April 19. 2016

A little over 4 months ago, I wrote a post about trying to improve the speed of cloning a large S3 bucket. At the time, I tried to simply parallelise the execution of the AWS CLI sync command, which actually proved to be much slower than simply leaving the CLI alone to do its job. It was an unsurprising result in retrospect, but you never know unless you try.

Unwilling to let the idea die, I decided to make it my focus during our recent hack days.

If you are unfamiliar with the concept of a hack day (or Hackathon as they are sometimes known), have a look at this Wikipedia article. At my current company, we’re only just starting to include hack days on a regular basis, but it’s a good sign of a healthy development environment.

Continuing on with the original train of thought (parallelise via prefixes), I needed to find a way to farm out the work to something (whether it was a pool of our own workers or some other mechanism). In the end, I chose to use AWS Lambda.

Enter Node.js on Lambda.

At A High Level

AWS Lambda is a relatively new offering, allowing you to configure some code to automatically execute following a trigger from one of a number of different events, including an SNS Topic Notification, changes to an S3 bucket or an HTTP call. You can use Python, Java or Javascript (through Node.js) as code natively, but you can technically use anything you can compile into a Linux compatible executable and make accessible to the function via S3 or something similar.

Since Javascript seems to be everywhere now (even though it’s hard to call it a real language), it was a solid choice. No point being afraid of new things.

Realistically, I should have been at least a little afraid of new things.

Conceptually the idea can be explained as a simple divide and conquer strategy, managed by files in an S3 bucket (because S3 was the triggering mechanism I was most familiar with).

If something wants to trigger a clone, it writes a file into a known S3 bucket detailing the desired operation (source, destination, some sort of id) with a key of {id}-{source}-{destination}/clone-request.

In response, the Lambda function will trigger, segment the work and write a file for each segment with a key of {id}-{source}-{destination}/{prefix}-segment-request. When it has finished breaking down the work, it will write another file with the key {id}-{source}-{destination}/clone-response, containing a manifest of the breakdown, indicating that it is done with the division of work.

As each segment file is being written, another Lambda function will be triggered, doing the actual copy work and finally writing a file with the key {id}-{source}-{destination}/{prefix}-segment-response to indicate that it’s done.
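
The key scheme above boils down to a handful of string templates. As a rough sketch (the helper names and the sample values are invented for illustration, they are not lifted from the actual functions):

```javascript
'use strict';

// Hypothetical helpers that build the keys described above.
function operationFolder(id, source, destination) {
    return [id, source, destination].join('-');
}

function cloneRequestKey(id, source, destination) {
    return operationFolder(id, source, destination) + '/clone-request';
}

function cloneResponseKey(id, source, destination) {
    return operationFolder(id, source, destination) + '/clone-response';
}

function segmentRequestKey(id, source, destination, prefix) {
    return operationFolder(id, source, destination) + '/' + prefix + '-segment-request';
}

function segmentResponseKey(id, source, destination, prefix) {
    return operationFolder(id, source, destination) + '/' + prefix + '-segment-response';
}

// Made-up operation: clone "source-bucket" into "destination-bucket" with id 42.
console.log(cloneRequestKey('42', 'source-bucket', 'destination-bucket'));
console.log(segmentRequestKey('42', 'source-bucket', 'destination-bucket', 'images'));
console.log(segmentResponseKey('42', 'source-bucket', 'destination-bucket', 'images'));
console.log(cloneResponseKey('42', 'source-bucket', 'destination-bucket'));
```

Because every file for a given operation shares the same leading folder, everything related to one clone can be found (or cleaned up) by listing a single prefix.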

File Formats Are Interesting

Each clone-request file looks like this:

{
    id: {id},
    source: {
        name: {source-bucket-name}
    },
    destination: {
        name: {destination-bucket-name}
    }
}

It’s a relatively simple file that would be easy to extend as necessary (for example, if you needed to specify the region, credentials to access the bucket, etc).

The clone-response file (the manifest), looks like this:

{
    id: {id},
    source: {
        name: {source-bucket-name}
    },
    destination: {
        name: {destination-bucket-name}
    },
    segments: {
        count: {number-of-segments},
        values: [
            {segment-key},
            {segment-key},
            ...
        ]
    }
}

Again, another relatively simple file. The only additional information is the segments that the task was broken into. These segments are used for tracking purposes, as the code that requests a clone needs some way to know when the clone is done.

Each segment-request file looks like this:

{
    id: {id},
    source: {
        name: {source-bucket-name},
        prefix: {prefix}
    },
    destination: {
        name: {destination-bucket-name}
    }
}

And finally, each segment-response file looks like this:

{
    id: {id},
    source: {
        name: {source-bucket-name},
        prefix: {prefix}
    },
    destination: {
        name: {destination-bucket-name}
    },    
    files: [        
        {key},
        {key},
        ...
    ]
}

Nothing fancy or special, just straight JSON files with all the information needed.
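
To show how the manifest could be used for tracking, here is a minimal sketch of a completion check. It leans on an assumption that the formats above don’t actually confirm: that each entry in segments.values is a segment-request key, and that the matching response key differs only in its suffix. The function names and sample data are made up.

```javascript
'use strict';

// Assumption: each manifest entry is a segment-request key, and its response
// key is the same string with the suffix swapped.
function pendingSegments(manifest, existingKeys) {
    const present = new Set(existingKeys);
    return manifest.segments.values.filter((requestKey) => {
        const responseKey = requestKey.replace(/-segment-request$/, '-segment-response');
        return !present.has(responseKey);
    });
}

function isCloneComplete(manifest, existingKeys) {
    return pendingSegments(manifest, existingKeys).length === 0;
}

// Made-up manifest and bucket listing for illustration.
const manifest = {
    id: '42',
    source: { name: 'source-bucket' },
    destination: { name: 'destination-bucket' },
    segments: {
        count: 2,
        values: [
            '42-source-bucket-destination-bucket/images-segment-request',
            '42-source-bucket-destination-bucket/logs-segment-request'
        ]
    }
};

const keysInBucket = [
    '42-source-bucket-destination-bucket/clone-request',
    '42-source-bucket-destination-bucket/clone-response',
    '42-source-bucket-destination-bucket/images-segment-request',
    '42-source-bucket-destination-bucket/images-segment-response',
    '42-source-bucket-destination-bucket/logs-segment-request'
];

console.log(isCloneComplete(manifest, keysInBucket)); // false, the logs segment has no response yet
console.log(pendingSegments(manifest, keysInBucket));
```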

Breaking It All Down

First up, the segmentation function.

Each Javascript Lambda function already comes with access to the aws-sdk, which is super useful, because honestly if you’re using Lambda, you’re probably doing it because you need to talk to other AWS offerings.

The segmentation function has to read in the triggering file from S3, parse it (it’s Javascript and JSON so that’s trivial at least), iterate through the available prefixes (using a delimiter, and sticking with the default “/”), write out a file for each unique prefix and finally write out a file containing the manifest.

As I very quickly learned, using Node.js to accomplish the apparently simple task outlined above was made not simple at all thanks to its fundamentally asynchronous nature, and the fact that async calls don’t seem to return a traceable component (unlike in C#, where if you were using async tasks you would get a task object that could be used to track whether or not the task succeeded/failed).

To complicate this even further, the aws-sdk will only return a maximum of 1000 results when listing the prefixes in a bucket (or doing anything with a bucket really), which means you have to loop using the callbacks. This makes accumulating some sort of result set annoying difficult, especially if you want to know when you are done.
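
The looping pattern is easier to see in isolation. The sketch below is not the real function: it swaps S3 for a hypothetical in-memory lister (`makeFakeLister`) so it can run anywhere, and `listAll` is an illustrative name. It shows pages being requested one after another from inside the callback, with the results accumulated in an array and a single `done` callback fired when the last page arrives.

```javascript
// Hypothetical stand-in for s3.listObjects: returns at most pageSize items
// after the given marker, plus IsTruncated/NextMarker fields used for paging.
function makeFakeLister(allItems, pageSize) {
    return function fakeListPage(marker, callback) {
        var start = marker ? allItems.indexOf(marker) + 1 : 0;
        var page = allItems.slice(start, start + pageSize);
        var truncated = start + pageSize < allItems.length;
        // setImmediate mimics the asynchronous nature of a real network call.
        setImmediate(function () {
            callback(null, {
                Items: page,
                IsTruncated: truncated,
                NextMarker: truncated ? page[page.length - 1] : undefined
            });
        });
    };
}

// Recursively request pages, accumulating into `results`, and call `done`
// exactly once: with an error, or with every item once the last page arrives.
function listAll(listPage, marker, results, done) {
    listPage(marker, function (err, data) {
        if (err) { return done(err); }
        Array.prototype.push.apply(results, data.Items);
        if (data.IsTruncated) {
            listAll(listPage, data.NextMarker, results, done);
        } else {
            done(null, results);
        }
    });
}

var lister = makeFakeLister(['a/', 'b/', 'c/', 'd/', 'e/'], 2);
listAll(lister, '', [], function (err, prefixes) {
    console.log(err || prefixes.join(','));
});
```

Passing the accumulator and the `done` callback down through each recursive call is what lets the caller know when the whole listing has finished, rather than just the first page.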

Anyway, the segmentation function is as follows:

console.log('Loading function');

var aws = require('aws-sdk');
var s3 = new aws.S3({ apiVersion: '2006-03-01' });

function putCallback(err, data)
{
    if (err)
    {
        console.log('Failed to Upload Clone Segment ', err);
    }
}

function generateCloneSegments(s3Source, command, commandBucket, marker, context, segments)
{
    var params = { Bucket: command.source.name, Marker: marker, Delimiter: '/' };
    console.log("Listing Prefixes: ", JSON.stringify(params));
    s3Source.listObjects(params, function(err, data) {
        if (err)
        {
            context.fail(err);
        }
        else
        {
            for (var i = 0; i < data.CommonPrefixes.length; i++)
            {
                var item = data.CommonPrefixes[i];
                var segmentRequest = {
                    id: command.id,
                    source : {
                        name: command.source.name,
                        prefix: item.Prefix
                    },
                    destination : {
                        name: command.destination.name
                    }
                };
                
                var segmentKey = command.id + '/' + item.Prefix.replace('/', '') + '-segment-request';
                segments.push(segmentKey);
                console.log("Uploading: ", segmentKey);
                var segmentUploadParams = { Bucket: commandBucket, Key: segmentKey, Body: JSON.stringify(segmentRequest), ContentType: 'application/json'};
                s3.putObject(segmentUploadParams, putCallback);
            }
            
            if(data.IsTruncated)
            {
                generateCloneSegments(s3Source, command, commandBucket, data.NextMarker, context, segments);
            }
            else
            {
                // Write a clone-response file to the commandBucket, stating the segments generated
                console.log('Total Segments: ', segments.length);
                
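                var cloneResponse = {
                    // Echo the request details back so the manifest matches the format described earlier
                    id: command.id,
                    source: { name: command.source.name },
                    destination: { name: command.destination.name },
                    segments: {
                        count: segments.length,
                        values: segments
                    }
                };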
                
                var responseKey = command.id + '/' + 'clone-response';
                var cloneResponseUploadParams = { Bucket: commandBucket, Key: responseKey, Body: JSON.stringify(cloneResponse), ContentType: 'application/json'};
                
                console.log("Uploading: ", responseKey);
                s3.putObject(cloneResponseUploadParams, putCallback);
            }
        }
    });
}

exports.handler = function(event, context) {
    //console.log('Received event:', JSON.stringify(event, null, 2));
    
    var commandBucket = event.Records[0].s3.bucket.name;
    var key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, ' '));
    var params = {
        Bucket: commandBucket,
        Key: key
    };
    
    s3.getObject(params, function(err, data) 
    {
        if (err) 
        {
            context.fail(err);
        }
        else 
        {
            var command = JSON.parse(data.Body);
            var s3Source = new aws.S3({ apiVersion: '2006-03-01', region: 'ap-southeast-2' });
            
            var segments = [];
            generateCloneSegments(s3Source, command, commandBucket, '', context, segments);
        }
    });
};

I’m sure some improvements could be made to the Javascript (I’d love to find a way to automate tests on it), but it’s not bad for being written directly into the AWS console.

Hi Ho, Hi Ho, It’s Off To Work We Go

The actual cloning function is remarkably similar to the segmenting function.

It still has to loop through items in the bucket, except it limits itself to items that match a certain prefix. It still has to do something for each item (execute a copy and add the key to its own result set) and it still has to write a file right at the end when everything is done.

console.log('Loading function');

var aws = require('aws-sdk');
var commandS3 = new aws.S3({ apiVersion: '2006-03-01' });

function copyCallback(err, data)
{
    if (err)
    {
        console.log('Failed to Copy ', err);
    }
}

function copyFiles(s3, command, commandBucket, marker, context, files)
{
    var params = { Bucket: command.source.name, Marker: marker, Prefix: command.source.prefix };
    s3.listObjects(params, function(err, data) {
        if (err)
        {
            context.fail(err);
        }
        else
        {
            for (var i = 0; i < data.Contents.length; i++)
            {
                var key = data.Contents[i].Key;
                files.push(key);
                console.log("Copying [", key, "] from [", command.source.name, "] to [", command.destination.name, "]");
                
                var copyParams = {
                    Bucket: command.destination.name,
                    CopySource: command.source.name + '/' + key,
                    Key: key
                };
                s3.copyObject(copyParams, copyCallback);
            }
            
            if(data.IsTruncated)
            {
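                // Without a Delimiter, S3 does not return NextMarker, so fall back to the last key listed.
                var nextMarker = data.NextMarker || data.Contents[data.Contents.length - 1].Key;
                copyFiles(s3, command, commandBucket, nextMarker, context, files);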
            }
            else
            {
                // Write a segment-response file
                console.log('Total Files: ', files.length);
                
                var segmentResponse = {
                    id: command.id,
                    source: command.source,
                    destination : {
                        name: command.destination.name
                    },
                    // Top-level list of copied keys, matching the segment-response format described earlier
                    files: files
                };
                
                var responseKey = command.id + '/' + command.source.prefix.replace('/', '') + '-segment-response';
                var segmentResponseUploadParams = { Bucket: commandBucket, Key: responseKey, Body: JSON.stringify(segmentResponse), ContentType: 'application/json'};
                
                console.log("Uploading: ", responseKey);
                commandS3.putObject(segmentResponseUploadParams, function(err, data) { });
            }
        }
    });
}

exports.handler = function(event, context) {
    //console.log('Received event:', JSON.stringify(event, null, 2));
    
    var commandBucket = event.Records[0].s3.bucket.name;
    var key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, ' '));
    var params = {
        Bucket: commandBucket,
        Key: key
    };
    
    commandS3.getObject(params, function(err, data) 
    {
        if (err) 
        {
            context.fail(err);
        }
        else 
        {
            var command = JSON.parse(data.Body);
            var s3 = new aws.S3({ apiVersion: '2006-03-01', region: 'ap-southeast-2' });
            
            var files = [];
            copyFiles(s3, command, commandBucket, '', context, files);
        }
    });
};

Tricksy Trickses

You may notice that there is no mention of credentials in the code above. That’s because the Lambda functions run under a role with a policy that gives them the ability to list, read and put into any bucket in our account. Roles are handy for accomplishing things in AWS, avoiding the need to supply credentials. When a role is applied to a resource and no credentials are supplied, the aws-sdk will automatically obtain a short term token using the role, reducing the likelihood of leaked credentials.

As I mentioned above, the asynchronous nature of Node.js made everything a little bit more difficult than expected. It was hard to determine when anything was done (somewhat important for writing manifest files). Annoyingly enough, it was even hard to determine when the function itself was finished. I kept running into issues where the function execution had finished, and it looked like it had done all of the work I expected it to do, but AWS Lambda was reporting that it did not complete successfully.

In the initial version of Node.js I was using (v0.10.42), the AWS supplied context object had a number of methods on it to indicate completion (whether success or failure). If I called the succeed method after I set up my callbacks, the function would terminate without doing anything, because it didn’t automatically wait for the callbacks to complete. If I didn’t call it, the function would be marked as “did not complete successfully”. Extremely annoying.

As is often the case with AWS though, on literally the second hack day, AWS released support for Node.js v4.3, which automatically waits for all pending callbacks to complete before completing the function, completely changing the interaction model for the better. I did upgrade to the latest version during the second hack day (after I had accepted that my function was going to error out in the control panel but actually do all the work it needed to), but it wasn’t until later that I realised that the upgrade had fixed my problem.
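
For a runtime that does not wait for pending callbacks, one general way to know when it is safe to signal completion is to count outstanding operations. The sketch below illustrates that technique only; it is not what the functions above do. `createTracker` and `fakePut` are hypothetical names, and `fakePut` stands in for calls like `s3.putObject` so the example runs without AWS.

```javascript
// Hypothetical helper: counts outstanding async operations and invokes
// `onAllDone` once seal() has been called and every tracked callback has fired.
function createTracker(onAllDone) {
    var pending = 0;
    var sealed = false;   // true once no more operations will be started
    var errors = [];

    function check() {
        if (sealed && pending === 0) { onAllDone(errors); }
    }

    return {
        // Returns a node-style callback that the tracker can watch.
        track: function (callback) {
            pending++;
            return function (err, data) {
                if (err) { errors.push(err); }
                if (callback) { callback(err, data); }
                pending--;
                check();
            };
        },
        seal: function () { sealed = true; check(); }
    };
}

// Stand-in for s3.putObject / s3.copyObject; fails for the key 'bad'.
function fakePut(key, callback) {
    setImmediate(function () {
        callback(key === 'bad' ? new Error('failed: ' + key) : null, { key: key });
    });
}

var tracker = createTracker(function (errors) {
    // In a Lambda handler, this is where context.succeed() or context.fail() would go.
    console.log('all done, failures: ' + errors.length);
});
['a', 'b', 'bad'].forEach(function (key) { fakePut(key, tracker.track()); });
tracker.seal();
```

The `seal` call matters: without it, the counter could briefly hit zero between two operations and report completion too early.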

The last tripwire I ran into was related to AWS Lambda not being available in all regions yet. Specifically, it’s not in ap-southeast-2 (Sydney), which is where all of our infrastructure lives. S3 is weird in relation to regions, as bucket names are globally unique and buckets are globally accessible, but they do actually have a home region. What does this have to do with Lambda? Well, the S3 bucket triggers I used as the impetus for the function execution only work if the S3 bucket is in the same region as the Lambda function (so us-west-1), even though once you get inside the Lambda function you can read/write to any bucket you like. Weird.

Conclusion

I’ve omitted the Powershell code responsible for executing the clone for brevity. It writes the request to the bucket, reads the response and then polls waiting for all of the segments to be completed, so it’s not particularly interesting, although the polling for segment completion was my first successful application of the Invoke-Parallel function from Script Center.

Profiling the AWS Lambda approach versus the original AWS CLI sync command approach over a test bucket (7500 objects, 195 distinct prefixes, 8000 MB of data) showed a decent improvement in performance. The sync approach took 142 seconds and the Lambda approach took 55 seconds, a little under 40% of the time, which was good to see considering the last time I tried to parallelise the clone it actually decreased the performance. I think with some tweaking the Lambda approach could be improved further, with tighter polling tolerances and an increased number of parallel Lambda executions allowed.

Unfortunately, I have not had the chance to execute the AWS Lambda implementation on the huge bucket that is the entire reason it exists, but I suspect that it won’t work.

Lambda allows at maximum 5 minutes of execution time per function, and I suspect that the initial segmentation for a big enough bucket will probably take longer than that. It might be possible to chain Lambda functions together (i.e. trigger one from the next one, perhaps per 1000 results returned from S3), but I’m not entirely sure how to do that yet (maybe using SNS notifications instead of S3?). Additionally, with a big enough bucket, the manifest file itself (detailing the segments) might become unwieldy. I think the problem bucket has something like 200K unique prefixes, so the size of the manifest file can add up quickly.

Regardless, the whole experience was definitely useful from a technical growth point of view. It’s always a good idea to remove yourself from your comfort zone and try some new things, and AWS Lambda + Node.js are definitely well outside my comfort zone.
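
One possible shape for the chaining idea is to stop paging when the time budget is nearly spent and hand back the marker to resume from. This is an untested sketch under stated assumptions, not a working solution: `context.getRemainingTimeInMillis()` is a real method on the Lambda context object, but `processPages`, the fake lister and the fake context are hypothetical, and how the follow-up invocation gets triggered is deliberately left open.

```javascript
// Processes pages until the listing ends or the time budget is nearly exhausted.
// `listPage(marker, cb)` yields { Items, IsTruncated, NextMarker };
// `finish(err, result)` receives { done: true } or { done: false, marker: ... }.
function processPages(listPage, handleItem, context, marker, safetyMs, finish) {
    if (context.getRemainingTimeInMillis() < safetyMs) {
        // Not enough time left: return the marker so a follow-up
        // invocation (however it is triggered) can resume from here.
        return finish(null, { done: false, marker: marker });
    }
    listPage(marker, function (err, data) {
        if (err) { return finish(err); }
        data.Items.forEach(handleItem);
        if (!data.IsTruncated) { return finish(null, { done: true }); }
        processPages(listPage, handleItem, context, data.NextMarker, safetyMs, finish);
    });
}

// Fake three-page listing and a fake context that loses 100ms per check.
var pages = {
    '': { Items: [1, 2], IsTruncated: true, NextMarker: 'm1' },
    'm1': { Items: [3, 4], IsTruncated: true, NextMarker: 'm2' },
    'm2': { Items: [5], IsTruncated: false }
};
function fakeListPage(marker, callback) {
    setImmediate(function () { callback(null, pages[marker]); });
}
var remaining = 250;
var fakeContext = {
    getRemainingTimeInMillis: function () { var now = remaining; remaining -= 100; return now; }
};

var handled = [];
processPages(fakeListPage, function (item) { handled.push(item); }, fakeContext, '', 100,
    function (err, result) {
        // With this fake budget the run stops before the third page.
        console.log(JSON.stringify(result), handled.join(','));
    });
```

The returned marker could be written into a continuation file or message, which would then trigger the next invocation to carry on from that point.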

A whole different continent in fact.