
With all that talk about logging environments and their improvements out of the way, it's time to return to familiar territory yet again.

Our ELB logs processor has been chugging along for a while now, dutifully aggregating all of the ELB logs for one of our APIs. We have quite a few APIs though, and almost all of them have ELB logs to process, so there was no sense keeping the awesome sequestered in one place.

Time to spread the love.

Honestly, we always planned on this sort of reuse anyway, so the logs processor was built in such a way that it could be relatively easily used in another location.

The actual Javascript code is completely generic, not specific to a particular Lambda function/API combo. There are things that need to change between instances obviously, but they are covered by variables whose values are easily changed via Octopus Deploy. In order to set up the processor for a new API, all you need to do is add the Lambda function/permissions to the environment definition (i.e. the CloudFormation template), make a new Octopus Project and add a small piece of Powershell into the environment scripts to deploy that project.

What could possibly go wrong?

Surface Tension

Surprisingly, very little actually went wrong.

I mean, it didn’t work, and we had no log events inside the ELK stack for the new API, but at a glance, there were no catastrophic failures.

The Lambda function existed, it was receiving events from S3, downloading files, processing them and pumping them out via TCP. There were no errors (at the Lambda function level OR within the function itself) and all of the function logging showed everything working as expected.

After digging into it a little bit, it turned out that not only were the ELB logs for the new API missing, the CI and Staging logs for the older API were missing as well. Weirdly enough, the Production logs for the older API were working exactly as expected.

Mysteriously, the code was exactly the same in all 4 Lambda functions, with the only differences being in configuration (which was relatively simple).

Forget About Freeman!

To cut a relatively long story short, the core of the problem was related to the asynchronous nature of TCP connections (via sockets) in Node.

This was the problem code:

create: function() {
    return new Promise(function(resolve, reject) {
        const socket = net.createConnection(config.logstashPort, config.logstashHost);
        summary.connections.created += 1;
        resolve(socket);
    })
}

This snippet is the mechanism by which TCP connections are created and added to the pool in the Lambda function, prior to being used to actually send data to our log stack.

It turns out that a socket can be returned by net.createConnection before it has actually connected. Even worse, sockets that are still connecting will not throw errors when you attempt to write to them. I’m not sure what actually happens to the writes in this case, but giving Javascript/Node the benefit of the doubt, I assume they are probably queued in some way.
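
To make that behaviour concrete, here is a tiny sketch (not taken from the actual function; the port and host are placeholders) demonstrating that the socket object exists, and accepts writes, before the connection is actually established:

const net = require('net');

// The socket object is returned immediately, even though the TCP handshake has not completed yet.
const socket = net.createConnection(5000, "localhost");

// This write does not throw, even if the socket is still connecting; Node buffers it
// until the connection is established (or the connection attempt fails).
socket.write("this may never arrive if the process exits too soon\n");

socket.on("connect", () => console.log("actually connected"));
socket.on("error", (error) => console.log("connection failed", error.message));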

Unfortunately, the Lambda function was exiting before any sort of queued writes could actually be applied, but only when there was a small total number of ELB log lines being processed. This is why it was working fine in production (thousands of messages every execution) and failing miserably in CI/Staging and the new environment (tens of messages). The function was just too quick and didn't know that it needed to wait for the socket to be connected and for all pending writes to execute before it was allowed to finish. As far as the promise chain was concerned, everything was done and dusted.

The solution is actually pretty simple:

create: function() {
    return new Promise(function(resolve, reject) {
        const socket = new net.Socket();

        socket.on("connect", () => {
            summary.connections.created += 1;
            resolve(socket);
        });

        socket.on("error", error => {
            summary.failures.sockets.total += 1;
            if (summary.failures.sockets.lastFew.length >= 5) {
                summary.failures.sockets.lastFew.shift();
            }
            summary.failures.sockets.lastFew.push(error);
            socket.destroy(error);
            reject(error);
        })

        socket.connect(config.logstashPort, config.logstashHost);
    })
}

Instead of immediately returning the socket, hook a listener up to its connect event and resolve the promise then. I added in some extra error handling/reporting as well, but it's all pretty straightforward.

What this means is that a socket is never added to the pool unless it's already connected, which makes the rest of the code work exactly as expected.

Conclusion

It was nice to return briefly to the ELB logs processor to witness its reuse, especially because the return actually led to the discovery and eventual fix of a bug that I missed the first time.

I’m still wrapping my head around the intrinsically asynchronous/callback nature of Node though, which was the root cause of the issue. To be honest, while I really like the way that promises work, I really really hate the callback hell that vanilla Node seems to encourage. It was so easy to create a socket and write to it, without the socket actually being valid, and the only way to ensure the socket was valid was via an asynchronous event handler.

To complicate matters, I'm sure it doesn't help that AWS Lambda can only run a relatively old version of Node (4.3.2). The latest version of Node is miles ahead of that, and it probably has hundreds of bugfixes that might be relevant.

Anyway, I’m just happy I got to steal more Half-Life chapter titles as headings.

Seriously, they fit so well.


It's getting harder and harder to come up with relevant subtitles for this particular topic. I should probably just start using numbers.

Totally nailed this one though.

To the topic at hand though, you might be forgiven for thinking that we had our AWS Lambda ELB Logs Processor well under control from my last post. It was processing files happily all the way through and everything seemed to be arriving in our log stack in an acceptable timeframe.

To give a complete picture, you can see the source for the Lambda function here.

Inside the source, the #{something} notation is an Octopus Deploy variable substitution. Basically, when this code is deployed into AWS using Octopus Deploy, those strings are replaced with real values appropriate for the environment being deployed to.
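
As a purely hypothetical illustration (the variable names below are mine, not the ones from the real project), the configuration section might look something like this in source control:

// When deployed via Octopus Deploy, each #{...} token below is replaced with the value
// configured for the target environment (e.g. a real hostname and port).
var config = {
    logstashHost: "#{LogstashHost}",
    logstashPort: "#{LogstashPort}",
    connectionCountLimit: "#{ConnectionCountLimit}"
};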

It's long enough that I didn't just want to paste it into this post, but still short enough to fit comfortably in a single file. It streams a file from S3 line by line, parses each line into its constituent parts and then posts each line to a logstash TCP endpoint. It has some basic connection pooling (homegrown) and features some basic logging.
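
As a rough sketch of that overall shape (assuming the same aws-sdk and readline modules; the helper and its parameters are my naming, not the real code), the streaming part looks something like this:

var aws = require('aws-sdk');
var readline = require('readline');

var s3 = new aws.S3({ apiVersion: '2006-03-01' });

function processLogFile(bucket, key, handleLine, handleReaderClose) {
    // Stream the ELB log file straight out of S3 rather than reading the whole thing into memory.
    var stream = s3.getObject({ Bucket: bucket, Key: key }).createReadStream();
    var reader = readline.createInterface({ input: stream });

    reader.on('line', handleLine);          // parse/post one ELB log line at a time
    reader.on('close', handleReaderClose);  // all lines read; time to clean up and log the summary
}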

Unfortunately, the logging turned out to be pretty painful to use in production. This was primarily because there were multiple places inside the function that write log messages for each line encountered. In our sample data this is fine, because there are tens of lines. In production, we get over 60000 lines every run of the function, which means the CloudWatch logs are basically unusable. Not only that, but the information that comes out in the logs is pretty useless at scale, stating things like “starting posting to logstash”, “processing file”, “posted to logstash” and so on.

The other issue was the connection pooling. It seemed to be working (in that it reduced the pace of the function such that it no longer ran out of TCP connections), but I wasn't entirely sure it was doing what we thought it was. To be honest, the code was pretty hard to understand, mostly as a result of the use of asynchronous callbacks. To me, it seemed like a connection pool should be something that other people have needed, so surely there was a package available that would meet our needs.

Taking these two points into consideration, I set about improving the function by:

  1. Finding and using a library for the connection pool, hopefully making it easier to read and reason about.
  2. Making the logging output actually useful when the function was executed in production.

Car Pool

Looking around, there are a few packages available in NPM that enable pooling of various resources. As seems to be the pattern though, they are of varying quality and completeness. In the end I settled on a package called "generic-pool" (Github here), which seemed pretty good. A major point in its favour was that it had a version greater than 0, which seems rare for an NPM package.

Unfortunately for me, it was promise based, and I had no idea what promises were or how to use them. I am, however, familiar with the Task Parallel Library in C#, which seems to be similar conceptually, so I didn't have to learn a whole bunch of concepts entirely from scratch.

Using the library to set up a pool of TCP connections was relatively straightforward, as the snippet below shows.

const poolFactory = {
    create: function() {
        return new Promise(function(resolve, reject) {
            const socket = net.createConnection(config.logstashPort, config.logstashHost);
            summary.connections.created += 1;
            resolve(socket);
        })
    },
    destroy: function(socket) {
        return new Promise(function(resolve, reject) {
            socket.end();
            socket.destroy();
            summary.connections.destroyed += 1;
            resolve();
        })
    },
    validate: function(socket) {
        return new Promise(function(resolve, reject){
            resolve(!socket.destroyed);
        })
    }
};

var poolOptions = {
    max: config.connectionCountLimit,
    min: 0,
    acquireTimeoutMillis: config.connectionWaitMilliseconds,
    testOnBorrow: true
};

var pool = gPool.createPool(poolFactory, poolOptions);

With a pool in place, all that was left to do was to incorporate it into the actual line parsing/processing engine, which meant a shift away from asynchronous callbacks to promises.

summary.lines.parsed += 1;

var promise = pool.acquire()
    .then((socket) => { return post(socket, entry); })
    .then((socket) => { pool.release(socket); })
    .catch((error) => { 
        summary.failures.sending.total += 1;
        if (summary.failures.sending.lastFew.length >= 5) {
            summary.failures.sending.lastFew.shift();
        }
        summary.failures.sending.lastFew.push(error);
    });

promises.push(promise);

The promises collection/array is used to determine when the function is complete, which is the opportunity to do anything that needs to happen right at the end.

Like logging.

Log Ride

Speaking of logging, the first improvement was to remove all instances where the function wrote a message for each line. Easy. Now at least the production logs wouldn’t be an unreadable mess.

The next step was to add some metrics to the logging, so we could track what the function was doing. This needed to include some statistics like lines read, lines parsed and lines sent to logstash.

This wasn’t so bad (simply introduce a function scoped variable for the metrics, and then increment the various counters whenever the specified event occurred), but it did start to cause issues when I tried to refactor the function to break it down into smaller, more easily reasoned about components. I couldn’t easily move any of the inner functions around because then they wouldn’t have the metrics object in scope. I think I might have been able to solve this problem by adding the metrics object in as a parameter to each of the functions that needed to edit it, but this would have made using those functions in the various asynchronous callbacks much harder.

With the line based metrics sorted, it was a relatively easy matter to add metrics for connection creation and destruction via the pool construct outlined above, so I don’t need to go into too much detail about that.

Finally, I also wanted to include some information about errors that occur in the process, making sure that if every single line failed for some reason, the CloudWatch logs would still remain readable. The easiest way to do this was to accumulate a small queue of the most recent errors, pushing older ones out in favour of newer ones with the understanding that the last 5 errors is probably enough information to diagnose issues. You can see the code for doing this in the promise catch handler above.
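
The same keep-only-the-last-few logic could be pulled out into a tiny helper (the helper itself is my suggestion, not something that exists in the function):

function rememberFailure(failures, error, limit) {
    failures.total += 1;
    // Drop the oldest error once the queue is full, so the logged summary stays readable
    // even if every single line fails.
    if (failures.lastFew.length >= limit) {
        failures.lastFew.shift();
    }
    failures.lastFew.push(error);
}

// e.g. rememberFailure(summary.failures.sending, error, 5);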

The only thing left to do was to actually log the metrics object that contained all of this juicy information, and the only place where this was possible was in the handler for when the readline module had finished reading the entire file.

function handleReaderClose() {
    console.log('File reader for ELB log file is closing because all lines have been read. Waiting for all promises (for sending parsed lines to logstash) to resolve');
    Promise
        .all(promises)
        .then(() => { console.log("Cleaning up the connection pool, which has [%s/%s] (current/max) connections", pool.size, pool.max); return pool.drain(); })
        .then(() => pool.clear())
        .then(() => { 
            console.log("All processing complete. Summary follows"); 
            console.log("%s", JSON.stringify(summary, fixStringifyError, 4)); 
        });
}

There are a few other things in there that I'm not going into too much detail about, like cleaning up the pool and a custom function to stringify errors, but the important part for this discussion is the logging.

Conclusion

After doing the work above, I'm a bit happier with the current state of the AWS Lambda ELB Logs Processor. I feel like it's better structured and easier to reason about, mostly due to the introduction of promises instead of callbacks. In addition to the quality of the code itself, the improved log output makes it easier to see how the function is going on a day to day basis as it chews through thousands and thousands of lines every run.

In particular, the summary that gets output right at the end is extremely useful for getting a high level picture of the output of a single run without overloading the reader.

An example of the sort of output we see from a normal (successful) run is below:

START RequestId: c0e94f4d-051f-11e7-8fbb-2b8356f39769 Version: $LATEST
2017-03-09T23:25:59.785Z    c0e94f4d-051f-11e7-8fbb-2b8356f39769    Retrieving ELK log file from S3 bucket/key specified in the initiating event. Bucket: [REDACTED], Key: [REDACTED]
2017-03-09T23:27:31.344Z    c0e94f4d-051f-11e7-8fbb-2b8356f39769    File reader for ELB log file is closing because all lines have been read. Waiting for all promises (for sending parsed lines to logstash) to resolve
2017-03-09T23:27:34.924Z    c0e94f4d-051f-11e7-8fbb-2b8356f39769    Cleaning up the connection pool, which has [48/100] (current/max) connections
2017-03-09T23:27:34.967Z    c0e94f4d-051f-11e7-8fbb-2b8356f39769    All processing complete. Summary follows
2017-03-09T23:27:35.003Z    c0e94f4d-051f-11e7-8fbb-2b8356f39769
{
    "lines": {
        "encountered": 76464,
        "parsed": 76464,
        "sent": 76464
    },
    "connections": {
        "created": 48,
        "destroyed": 48
    },
    "failures": {
        "parsing": {
            "total": 0,
            "lastFew": []
        },
        "sending": {
            "total": 0,
            "lastFew": []
        }
    }
}
END RequestId: c0e94f4d-051f-11e7-8fbb-2b8356f39769
REPORT RequestId: c0e94f4d-051f-11e7-8fbb-2b8356f39769    Duration: 95227.56 ms    Billed Duration: 95300 ms Memory Size: 192 MB    Max Memory Used: 165 MB

There are still many improvements to be made (especially regarding the structure of the repo and its complete lack of automated tests), but it's definitely more betterer than it was.


As I mentioned briefly last week, our AWS Lambda ELB Logs Processor did not quite work when we pushed it to production. This was a shame, because we were pretty sure we got it right this time. Ah well, every failure is a learning experience.

In this post I’m going to briefly talk about how we deployed the new logs processor, how we identified the issue (and what it was) and finally how we fixed it.

Colour Blind

In order to maximise our ability to adapt quickly, we try to deploy as frequently as possible, especially when it comes to services that we fully control (like websites and APIs). The most important thing to get right when you're deploying frequently is the ability to do so without disturbing the end-user. Historically, we dodged this problem by simply deploying during periods of low use, which for us means Sundays or late at night. I certainly don't want to deal with that though, and I don't want to delegate it to some poor sap, so instead we just make sure our change sets are small and easy to reason about, and our deployments happen in such a way that the service is never fully down while the deployment is occurring (easily accomplished in AWS with multiple deployment locations combined with rolling deployments).

For bigger changes though, we’ve started using blue/green deployments. Essentially this means having two completely separate environments active at the one time, with some top level mechanism for directing traffic to the appropriate one as necessary. For us this is a top level URL (like service.thing.com) which acts as a DNS alias for an environment specific URL (like prod-green-service.thing.com). We then use Route53 and Weighted DNS to direct traffic as we so desire.
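
As a hedged sketch of the weighted DNS part (the hosted zone id, record names and weights below are placeholders, and in reality the adjustment is done through our deployment tooling rather than hand written code), the aws-sdk call looks something like this:

var aws = require('aws-sdk');
var route53 = new aws.Route53();

var params = {
    HostedZoneId: "Z_PLACEHOLDER",
    ChangeBatch: {
        Changes: [{
            Action: "UPSERT",
            ResourceRecordSet: {
                Name: "service.thing.com",
                Type: "CNAME",
                SetIdentifier: "blue",
                Weight: 10, // relative to the weight on the matching green record
                TTL: 60,
                ResourceRecords: [{ Value: "prod-blue-service.thing.com" }]
            }
        }]
    }
};

route53.changeResourceRecordSets(params, function (err, data) {
    if (err) console.log("Failed to update weighted record set", err);
});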

For websites, blue/green deployments are trivial, assuming the backend services still work the same (or have already been deployed to be compatible with both versions). For those titular backend services though, blue/green deployments can be challenging, especially when you have to take into account data persistence and continuity of service.

When it comes to whatever persistence layer is necessary, our newer services usually feature data storage in a separately maintained environment (like RDS instances, S3 buckets and so on), specifically to help us do blue/green deployments on parts of the architecture that can freely change without causing any issues with data. Some of our earlier services did not do this, and as such are problematic when it comes to blue/green deployments. In those cases we usually test the new system by using traffic replication and then resort to traditional downtime when it comes time to do the actual deployment.

Blue/green deployments have proved to be particularly useful for us because of the way we handle our environments as code.

To tie this all back in with the ELB logs processor, we used our environment package to create a blue environment for the service with the ELB logs processor in it (because our currently active one was green). Once the environment was up and running, we used weighted DNS inside Route53 to shunt a small amount of traffic to the blue environment to see if the ELB logs were being processed into our ELK stack correctly.

And they were, and the victory dance was done and everyone lived happily ever after.

Oh Yeah, Scale

Well, not quite.

The small amount of test traffic worked perfectly. All of the traffic was being represented inside our ELK stack as expected. Continuing on with the traditional blue/green deployment, we increased the amount of traffic hitting the new blue environment by increasing its weight in Route53.

Once we got to around 10% of the total traffic, things started to fall down. The new blue environment was handling the traffic perfectly well, but we were not seeing the expected number of log events from the ELB inside the ELK stack.

Looking at the CloudWatch logs for the Lambda function, it looked like the Lambda function was simply taking longer to run than the default timeout provided (which is 3 seconds), which was a silly oversight on our part. Considering it was trying to process a few megabytes of raw log data, it's not surprising that it wasn't getting through the entire thing.

AWS Lambda charges are calculated from the actual execution time combined with the resources allocated to the function, so the timeout basically represents the maximum amount of money you will be charged for each function execution. If you have an unbounded number of function executions (i.e. they occur on some unreliable trigger), then this can be very useful for limiting your potential costs. For our usage, we know that the ELB logs are generated approximately once every 5 minutes, so we're pretty safe to set the timeout to the maximum (300 seconds) to give the function as much time as possible to process the log file.
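
For illustration (assuming the change is made via the aws-sdk rather than the console, and with a placeholder function name), bumping the timeout is a single call:

var aws = require('aws-sdk');
var lambda = new aws.Lambda();

lambda.updateFunctionConfiguration({
    FunctionName: "elb-logs-processor", // placeholder; not the real function name
    Timeout: 300                        // maximum allowed execution time, in seconds
}, function (err, data) {
    if (err) console.log("Failed to update function configuration", err);
});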

With that small change in place more of the log file was being processed, but it still wasn't processing the entire thing. The good news was that it was no longer just timing out and terminating itself, but the bad news was that it was now just plain old crashing after processing some of the file.

No It's Fine, I Don't Need Those Connections Again

Javascript Lambda functions automatically write their output to CloudWatch, which is really useful from a debugging and tracing point of view. I mean, I don't know how it would work otherwise (because you literally cannot access the underlying operating system they run on), but it's still nice that it just works out of the box.

In our case, the error was as follows:

Error: connect EMFILE {ip}:{port} - Local (undefined:undefined)
at Object.exports._errnoException (util.js:870:11)
at exports._exceptionWithHostPort (util.js:893:20)
at connect (net.js:843:14)
at net.js:985:7
at GetAddrInfoReqWrap.asyncCallback [as callback] (dns.js:63:16)
at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:82:10)

A small amount of digging showed that this error occurs when a TCP connection cannot be established to the specified IP over the specified port.

If you look back at the code that actually processes the ELB log file, the only place where a TCP connection is being made is after a line in the file has been processed and transformed into a JSON payload, ready to be pushed to our ELK stack via Logstash. Being that the error only manifests after part of the file has already been processed successfully, it looked like the issue was one of resource exhaustion.

The obvious candidate was that the Lambda function was simply trying to open too many TCP connections at once. This made sense based on my understanding of Node.js at the time, so we implemented a connection pooling mechanism to prevent it from occurring (i.e. instead of simply trying to establish a connection, it would try to draw one from a global pool with a limit and if one was not available, wait for a few moments until it was).

Because each connection was only required for a few moments, the solution would essentially throttle the processing to whatever limit we impose, hopefully dodging the perceived problem with too many parallel connections.

function getConnection(callback) {
    if (connectionPool.length < connectionCountLimit) {
        console.log('creating new connection');
        const newSocket = net.createConnection(_logPort, _logHost);
        connectionPool.push(newSocket);
        return callback(newSocket);
    }

    const activeConnections = connectionPool.filter(function (socket) {
        return !socket.destroyed;
    });
    if (activeConnections.length != connectionCountLimit) {
        connectionPool = activeConnections;
    }

    setTimeout(function () {
        getConnection(callback);
    }, waitForConnectionDuration);
}

function postToLogstash(connection) {
    return function (entry) {
        console.log("INFO: Posting event to logstash... ");
        var message = JSON.stringify(entry) + "\n";
        message = message.replace("Timestamp", "@timestamp");
        connection.write(message, null, function () {
            console.log("INFO: Posting to logstash...done");
            connection.end();
        });
    }
}

Conclusion

I’d love to say that after we implemented the simple connection pooling, everything worked just fine and dandy and the victory dance was enjoyed by all.

And for once, I can!

With the simple connection pooling we implemented (which had a maximum connection count of like 50 in the first iteration), we managed to process an entire ELB log file from S3 without getting the error we were getting before. We still need to do some more investigation around whether or not we’re actually getting all of the messages we expect to get, but its looking good.

I’m still not entirely sure how the issue occurred though. Originally, because of my limited knowledge of Node.js, I thought that it was creating connections in parallel. It turns out, Node.js is not actually multi-threaded at all, unlike some other, real programming languages (cough, c#, cough), so it couldn’t possibly have been opening a bunch of connections all at the same time as I pictured.

What it might have been doing, however, is opening up a lot of connections very quickly. Node is really good at making use of the typical dead time in an application, like when executing HTTP requests and waiting for results, or when opening a file and waiting for IO to return. It's possible that each connection being opened was ceding control of the main execution pipeline for just long enough for another connection to be opened, so on and so forth until the underlying hardware ran out of resources.

Another possibility is that there was actually a connection leak, and the simple connection pool alleviated it by reusing the connection objects instead of always creating new ones.

Regardless, it processes through large ELB log files without breaking, which is good enough for this post.


A little over 4 months ago, I wrote a post about trying to improve the speed of cloning a large S3 bucket. At the time, I tried to simply parallelise the execution of the AWS CLI sync command, which actually proved to be much slower than simply leaving the CLI alone to do its job. It was an unsurprising result in retrospect, but you never know unless you try.

Unwilling to let the idea die, I decided to make it my focus during our recent hack days.

If you are unfamiliar with the concept of a hack day (or Hackathon as they are sometimes known), have a look at this Wikipedia article. At my current company, we're only just starting to include hack days on a regular basis, but it's a good sign of a healthy development environment.

Continuing on with the original train of thought (parallelise via prefixes), I needed to find a way to farm out the work to something (whether it was a pool of our own workers or some other mechanism). In the end, I chose to use AWS Lambda.

Enter Node.js on Lambda.

At A High Level

AWS Lambda is a relatively new offering, allowing you to configure some code to automatically execute following a trigger from one of a number of different events, including an SNS Topic Notification, changes to an S3 bucket or an HTTP call. You can use Python, Java or Javascript (through Node.js) as code natively, but you can technically use anything you can compile into a Linux compatible executable and make accessible to the function via S3 or something similar.

Since Javascript seems to be everywhere now (even though it's hard to call it a real language), it was a solid choice. No point being afraid of new things.

Realistically, I should have been at least a little afraid of new things.

Conceptually the idea can be explained as a simple divide and conquer strategy, managed by files in an S3 bucket (because S3 was the triggering mechanism I was most familiar with).

If something wants to trigger a clone, it writes a file into a known S3 bucket detailing the desired operation (source, destination, some sort of id) with a key of {id}-{source}-{destination}/clone-request.

In response, the Lambda function will trigger, segment the work and write a file for each segment with a key of {id}-{source}-{destination}/{prefix}-segment-request. When it has finished breaking down the work, it will write another file with the key {id}-{source}-{destination}/clone-response, containing a manifest of the breakdown, indicating that it is done with the division of work.

As each segment file is being written, another Lambda function will be triggered, doing the actual copy work and finally writing a file with the key {id}-{source}-{destination}/{prefix}-segment-response to indicate that it's done.

File Formats Are Interesting

Each clone-request file looks like this:

{
    id: {id},
    source: {
        name: {source-bucket-name}
    },
    destination: {
        name: {destination-bucket-name}
    }
}

It's a relatively simple file that would be easy to extend as necessary (for example, if you needed to specify the region, credentials to access the bucket, etc).

The clone-response file (the manifest), looks like this:

{
    id: {id},
    source: {
        name: {source-bucket-name}
    },
    destination: {
        name: {destination-bucket-name}
    },
    segments: {
        count: {number-of-segments},
        values: [
            {segment-key},
            {segment-key}
            ...
        ]
    }
}

Again, another relatively simple file. The only additional information is the segments that the task was broken into. These segments are used for tracking purposes, as the code that requests a clone needs some way to know when the clone is done.

Each segment-request file looks like this:

{
    id: {id},
    source: {
        name: {source-bucket-name},
        prefix: {prefix}
    },
    destination: {
        name: {destination-bucket-name}
    }
}

And finally, each segment-response file looks like this:

{
    id: {id},
    source: {
        name: {source-bucket-name},
        prefix: {prefix}
    },
    destination: {
        name: {destination-bucket-name}
    },    
    files: [        
        {key},
        {key},
        ...
    ]
}

Nothing fancy or special, just straight JSON files with all the information needed.

Breaking It All Down

First up, the segmentation function.

Each Javascript Lambda function already comes with access to the aws-sdk, which is super useful, because honestly if you’re using Lambda, you’re probably doing it because you need to talk to other AWS offerings.

The segmentation function has to read in the triggering file from S3, parse it (it's Javascript and JSON so that's trivial at least), iterate through the available prefixes (using a delimiter, and sticking with the default "/"), write out a file for each unique prefix and finally write out a file containing the manifest.

As I very quickly learned, using Node.js to accomplish the apparently simple task outlined above was made not simple at all thanks to its fundamentally asynchronous nature, and the fact that async calls don’t seem to return a traceable component (unlike in C#, where if you were using async tasks you would get a task object that could be used to track whether or not the task succeeded/failed).

To complicate this even further, the aws-sdk will only return a maximum of 1000 results when listing the prefixes in a bucket (or doing anything with a bucket really), which means you have to loop using the callbacks. This makes accumulating some sort of result set annoyingly difficult, especially if you want to know when you are done.

Anyway, the segmentation function is as follows:

console.log('Loading function');

var aws = require('aws-sdk');
var s3 = new aws.S3({ apiVersion: '2006-03-01' });

function putCallback(err, data)
{
    if (err)
    {
        console.log('Failed to Upload Clone Segment ', err);
    }
}

function generateCloneSegments(s3Source, command, commandBucket, marker, context, segments)
{
    var params = { Bucket: command.source.name, Marker: marker, Delimiter: '/' };
    console.log("Listing Prefixes: ", JSON.stringify(params));
    s3Source.listObjects(params, function(err, data) {
        if (err)
        {
            context.fail(err);
        }
        else
        {
            for (var i = 0; i < data.CommonPrefixes.length; i++)
            {
                var item = data.CommonPrefixes[i];
                var segmentRequest = {
                    id: command.id,
                    source : {
                        name: command.source.name,
                        prefix: item.Prefix
                    },
                    destination : {
                        name: command.destination.name
                    }
                };
                
                var segmentKey = command.id + '/' + item.Prefix.replace('/', '') + '-segment-request';
                segments.push(segmentKey);
                console.log("Uploading: ", segmentKey);
                var segmentUploadParams = { Bucket: commandBucket, Key: segmentKey, Body: JSON.stringify(segmentRequest), ContentType: 'application/json'};
                s3.putObject(segmentUploadParams, putCallback);
            }
            
            if(data.IsTruncated)
            {
                generateCloneSegments(s3Source, command, commandBucket, data.NextMarker, context, segments);
            }
            else
            {
                // Write a clone-response file to the commandBucket, stating the segments generated
                console.log('Total Segments: ', segments.length);
                
                var cloneResponse = {
                    segments: {
                        count: segments.length,
                        values: segments
                    }
                };
                
                var responseKey = command.id + '/' + 'clone-response';
                var cloneResponseUploadParams = { Bucket: commandBucket, Key: responseKey, Body: JSON.stringify(cloneResponse), ContentType: 'application/json'};
                
                console.log("Uploading: ", responseKey);
                s3.putObject(cloneResponseUploadParams, putCallback);
            }
        }
    });
}

exports.handler = function(event, context) {
    //console.log('Received event:', JSON.stringify(event, null, 2));
    
    var commandBucket = event.Records[0].s3.bucket.name;
    var key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, ' '));
    var params = {
        Bucket: commandBucket,
        Key: key
    };
    
    s3.getObject(params, function(err, data) 
    {
        if (err) 
        {
            context.fail(err);
        }
        else 
        {
            var command = JSON.parse(data.Body);
            var s3Source = new aws.S3({ apiVersion: '2006-03-01', region: 'ap-southeast-2' });
            
            var segments = [];
            generateCloneSegments(s3Source, command, commandBucket, '', context, segments);
        }
    });
};

I'm sure some improvements could be made to the Javascript (I'd love to find a way to automate tests on it), but it's not bad for being written directly into the AWS console.

Hi Ho, Hi Ho, It's Off To Work We Go

The actual cloning function is remarkably similar to the segmenting function.

It still has to loop through items in the bucket, except it limits itself to items that match a certain prefix. It still has to do something for each item (execute a copy and add the key to its own result set) and it still has to write a file right at the end when everything is done.

console.log('Loading function');

var aws = require('aws-sdk');
var commandS3 = new aws.S3({ apiVersion: '2006-03-01' });

function copyCallback(err, data)
{
    if (err)
    {
        console.log('Failed to Copy ', err);
    }
}

function copyFiles(s3, command, commandBucket, marker, context, files)
{
    var params = { Bucket: command.source.name, Marker: marker, Prefix: command.source.prefix };
    s3.listObjects(params, function(err, data) {
        if (err)
        {
            context.fail(err);
        }
        else
        {
            for (var i = 0; i < data.Contents.length; i++)
            {
                var key = data.Contents[i].Key;
                files.push(key);
                console.log("Copying [", key, "] from [", command.source.name, "] to [", command.destination.name, "]");
                
                var copyParams = {
                    Bucket: command.destination.name,
                    CopySource: command.source.name + '/' + key,
                    Key: key
                };
                s3.copyObject(copyParams, copyCallback);
            }
            
            if(data.IsTruncated)
            {
                // NextMarker is only returned when a Delimiter is specified, so fall back to
                // the last key in this batch when continuing the listing.
                var nextMarker = data.NextMarker || data.Contents[data.Contents.length - 1].Key;
                copyFiles(s3, command, commandBucket, nextMarker, context, files);
            }
            else
            {
                // Write a segment-response file
                console.log('Total Files: ', files.length);
                
                var segmentResponse = {
                    id: command.id,
                    source: command.source,
                    destination : {
                        name: command.destination.name,
                        files: {
                            count: files.length,
                            files: files
                        }
                    }
                };
                
                var responseKey = command.id + '/' + command.source.prefix.replace('/', '') + '-segment-response';
                var segmentResponseUploadParams = { Bucket: commandBucket, Key: responseKey, Body: JSON.stringify(segmentResponse), ContentType: 'application/json'};
                
                console.log("Uploading: ", responseKey);
                commandS3.putObject(segmentResponseUploadParams, function(err, data) { });
            }
        }
    });
}

exports.handler = function(event, context) {
    //console.log('Received event:', JSON.stringify(event, null, 2));
    
    var commandBucket = event.Records[0].s3.bucket.name;
    var key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, ' '));
    var params = {
        Bucket: commandBucket,
        Key: key
    };
    
    commandS3.getObject(params, function(err, data) 
    {
        if (err) 
        {
            context.fail(err);
        }
        else 
        {
            var command = JSON.parse(data.Body);
            var s3 = new aws.S3({ apiVersion: '2006-03-01', region: 'ap-southeast-2' });
            
            var files = [];
            copyFiles(s3, command, commandBucket, '', context, files);
        }
    });
};

Tricksy Trickses

You may notice that there is no mention of credentials in the code above. That's because the Lambda functions run under a role with a policy that gives them the ability to list, read and put into any bucket in our account. Roles are handy for accomplishing things in AWS, avoiding the need to supply credentials. When a role is applied to the resource and no credentials are supplied, the aws-sdk will automatically generate a short term token using the role, reducing the likelihood of leaked credentials.
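
A hedged example of what the role's policy document might look like (the real policy isn't shown here, and a production version would probably scope the resources down further than this):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": "*"
        }
    ]
}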

As I mentioned above, the asynchronous nature of Node.js made everything a little bit more difficult than expected. It was hard to determine when anything was done (somewhat important for writing manifest files). Annoyingly enough, it was even hard to determine when the function itself was finished. I kept running into issues where the function execution had finished, and it looked like it had done all of the work I expected it to do, but AWS Lambda was reporting that it did not complete successfully.

In the initial version of Node.js I was using (v0.10.42), the AWS supplied context object had a number of methods on it to indicate completion (whether success or failure). If I called the Succeed method after I setup my callbacks, the function would terminate without doing anything, because it didn’t automatically wait for the callbacks to complete. If I didn’t call it, the function would be marked as “did not complete successfully”. Extremely annoying.

As is often the case with AWS though, on literally the second hack day, AWS released support for Node.js v4.3, which automatically waits for all pending callbacks to complete before completing the function, completely changing the interaction model for the better. I did upgrade to the latest version during the second hack day (after I had accepted that my function was going to error out in the control panel but actually do all the work it needed to), but it wasn’t until later that I realised that the upgrade had fixed my problem.

The last tripwire I ran into was related to AWS Lambda not being available in all regions yet. Specifically, it's not in ap-southeast-2 (Sydney), which is where all of our infrastructure lives. S3 is weird in relation to regions, as buckets are globally unique and accessible, but they do actually have a home region. What does this have to do with Lambda? Well, the S3 bucket triggers I used as the impetus for the function execution only work if the S3 bucket is in the same region as the Lambda function (so us-west-1), even though once you get inside the Lambda function you can read/write to any bucket you like. Weird.

Conclusion

I've omitted the Powershell code responsible for executing the clone for brevity. It writes the request to the bucket, reads the response and then polls waiting for all of the segments to be completed, so it's not particularly interesting, although the polling for segment completion was my first successful application of the Invoke-Parallel function from Script Center.

Profiling the AWS Lambda approach versus the original AWS CLI sync command approach over a test bucket (7500 objects, 195 distinct prefixes, 8000 MB of data) showed a decent improvement in performance. The sync approach took 142 seconds and the Lambda approach took 55 seconds, approximately a third of the time, which was good to see considering the last time I tried to parallelise the clone it actually decreased the performance. I think with some tweaking the Lambda approach could be improved further, with tighter polling tolerances and an increased number of parallel Lambda executions allowed.

Unfortunately, I have not had the chance to execute the AWS Lambda implementation on the huge bucket that is the entire reason it exists, but I suspect that it won’t work.

Lambda allows at maximum 5 minutes of execution time per function, and I suspect that the initial segmentation for a big enough bucket will probably take longer than that. It might be possible to chain lambda functions together (i.e. trigger one from the next one, perhaps per 1000 results returned from S3), but I'm not entirely sure how to do that yet (maybe using SNS notifications instead of S3?). Additionally, with a big enough bucket, the manifest file itself (detailing the segments) might become unwieldy. I think the problem bucket has something like 200K unique prefixes, so the size of the manifest file can add up quickly.

Regardless, the whole experience was definitely useful from a technical growth point of view. It's always a good idea to remove yourself from your comfort zone and try some new things, and AWS Lambda + Node.js are definitely well outside my comfort zone.

A whole different continent in fact.