
As I mentioned briefly last week, our AWS Lambda ELB Logs Processor did not quite work when we pushed it to production. This was a shame, because we were pretty sure we got it right this time. Ah well, every failure is a learning experience.

In this post I’m going to briefly talk about how we deployed the new logs processor, how we identified the issue (and what it was) and finally how we fixed it.

Colour Blind

In order to maximise our ability to adapt quickly, we try to deploy as frequently as possible, especially when it comes to services that we fully control (like websites and APIs). The most important thing to get right when you’re deploying frequently is the ability to do so without disturbing the end-user. Historically, we dodged this problem by simply deploying during periods of low use, which for us means Sundays or late at night. I certainly don’t want to deal with that though, and I don’t want to delegate it to some poor sap, so instead we just make sure our change sets are small and easy to reason about, and our deployments happen in such a way that the service is never fully down while the deployment is occurring (easily accomplished in AWS with multiple deployment locations combined with rolling deployments).

For bigger changes though, we’ve started using blue/green deployments. Essentially this means having two completely separate environments active at the one time, with some top-level mechanism for directing traffic to the appropriate one as necessary. For us this is a top-level URL (like service.thing.com) which acts as a DNS alias for an environment-specific URL (like prod-green-service.thing.com). We then use Route53 and Weighted DNS to direct traffic as we so desire.
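As a rough sketch only (the hosted zone ID here is a placeholder, the blue record name is assumed and the 90/10 split is just an example), shifting a slice of traffic between the two environments with the AWS SDK for JavaScript looks something like this:

var AWS = require('aws-sdk');
var route53 = new AWS.Route53();

var params = {
    // Placeholder hosted zone ID, not a real one.
    HostedZoneId: 'Z1EXAMPLE',
    ChangeBatch: {
        Comment: 'Shift a small slice of traffic to the new blue environment',
        Changes: [
            {
                Action: 'UPSERT',
                ResourceRecordSet: {
                    Name: 'service.thing.com',
                    Type: 'CNAME',
                    SetIdentifier: 'green',
                    Weight: 90,
                    TTL: 60,
                    ResourceRecords: [{ Value: 'prod-green-service.thing.com' }]
                }
            },
            {
                Action: 'UPSERT',
                ResourceRecordSet: {
                    Name: 'service.thing.com',
                    Type: 'CNAME',
                    SetIdentifier: 'blue',
                    Weight: 10,
                    TTL: 60,
                    ResourceRecords: [{ Value: 'prod-blue-service.thing.com' }]
                }
            }
        ]
    }
};

route53.changeResourceRecordSets(params, function (error, data) {
    if (error) { return console.error('Failed to update weights', error); }
    console.log('Change submitted:', data.ChangeInfo.Id);
});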

For websites, blue/green deployments are trivial, assuming the backend services still work the same (or have already been deployed to be compatible with both versions). For the backend services themselves though, blue/green deployments can be challenging, especially when you have to take into account data persistence and continuity of service.

When it comes to whatever persistence layer is necessary, our newer services usually feature data storage in a separately maintained environment (like RDS instances, S3 buckets and so on), specifically to help us do blue/green deployments on parts of the architecture that can freely change without causing any issues with data. Some of our earlier services did not do this, and as such are problematic when it comes to blue/green deployments. In those cases we usually test the new system by using traffic replication and then resort to traditional downtime when it comes time to do the actual deployment.

Blue/green deployments have proved to be particularly useful for us because of the way we handle our environments as code.

To tie this all back in with the ELB logs processor, we used our environment package to create a blue environment for the service with the ELB logs processor in it (because our currently active one was green). Once the environment was up and running, we used weighted DNS inside Route53 to shunt a small amount of traffic to the blue environment to see if the ELB logs were being processed into our ELK stack correctly.

And they were, and the victory dance was done and everyone lived happily ever after.

Oh Yeah, Scale

Well, not quite.

The small amount of test traffic worked perfectly. All of the traffic was being represented inside our ELK stack as expected. Continuing on with the traditional blue/green deployment, we increased the amount of traffic hitting the new blue environment by increasing its weight in Route53.

Once we got to around 10% of the total traffic, things started to fall down. The new blue environment was handling the traffic perfectly well, but we were not seeing the expected number of log events from the ELB inside the ELK stack.

Looking at the CloudWatch logs, it appeared that the Lambda function was simply taking longer to run than the default timeout (which is 3 seconds), which was a silly oversight on our part. Considering it was trying to process a few megabytes of raw log data, it’s not surprising that it wasn’t getting through the entire thing.

AWS Lambda is billed on actual execution time combined with the resources allocated to the function, so the timeout basically represents the maximum amount of money you will be charged for each function execution. If you have an unbounded number of function executions (i.e. they occur on some unreliable trigger), then this can be very useful for limiting your potential costs. For our usage, we know that the ELB logs are generated approximately once every 5 minutes, so we’re pretty safe to set the timeout to the maximum (300 seconds) to give the function as much time as possible to process the log file.
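For illustration only, bumping the timeout looks something like this via the AWS SDK for JavaScript (the function name and region here are placeholders, not necessarily what our environment actually uses):

var AWS = require('aws-sdk');
var lambda = new AWS.Lambda({ region: 'ap-southeast-2' });

// Placeholder function name; 300 seconds is the maximum timeout available.
lambda.updateFunctionConfiguration(
    { FunctionName: 'elb-logs-processor', Timeout: 300 },
    function (error, data) {
        if (error) { return console.error('Failed to update timeout', error); }
        console.log('Timeout is now', data.Timeout, 'seconds');
    }
);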

With that small change in place, more of the log file was being processed, but it still wasn’t processing the entire thing. The good news was that it was no longer just timing out and terminating itself, but the bad news was that it was now just plain old crashing after processing some of the file.

No It’s Fine, I Don’t Need Those Connections Again

JavaScript Lambda functions automatically write their console output to CloudWatch Logs, which is really useful from a debugging and tracing point of view. I mean, I don’t know how it would work otherwise (because you literally cannot access the underlying operating system they run on), but it’s still nice that it just works out of the box.

In our case, the error was as follows:

Error: connect EMFILE {ip}:{port} - Local (undefined:undefined)
at Object.exports._errnoException (util.js:870:11)
at exports._exceptionWithHostPort (util.js:893:20)
at connect (net.js:843:14)
at net.js:985:7
at GetAddrInfoReqWrap.asyncCallback [as callback] (dns.js:63:16)
at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:82:10)

A small amount of digging showed that EMFILE means the process has run out of available file descriptors, at which point it can no longer establish a TCP connection to the specified IP and port.

If you look back at the code that actually processes the ELB log file, the only place where a TCP connection is being made is after a line in the file has been processed and transformed into a JSON payload, ready to be pushed to our ELK stack via Logstash. Given that the error only manifested after part of the file had already been processed successfully, it looked like the issue was one of resource exhaustion.

The obvious candidate was that the Lambda function was simply trying to open too many TCP connections at once. This made sense based on my understanding of Node.js at the time, so we implemented a connection pooling mechanism to prevent it from occurring (i.e. instead of simply trying to establish a connection, it would try to draw one from a global pool with a limit, and if one was not available, wait for a few moments until it was).

Because each connection was only required for a few moments, the solution would essentially throttle the processing to whatever limit we impose, hopefully dodging the perceived problem with too many parallel connections.

// Assumes module level state declared elsewhere in the handler; the values
// shown here are illustrative only.
// var net = require('net');
// var connectionPool = [];
// var connectionCountLimit = 50;
// var waitForConnectionDuration = 100; // milliseconds
function getConnection(callback) {
    // Below the limit: open a new connection straight away and hand it back.
    if (connectionPool.length < connectionCountLimit) {
        console.log('creating new connection');
        const newSocket = net.createConnection(_logPort, _logHost);
        connectionPool.push(newSocket);
        return callback(newSocket);
    }

    // At the limit: prune any sockets that have already been destroyed
    // so their slots become available again.
    const activeConnections = connectionPool.filter(function (socket) {
        return !socket.destroyed;
    });
    if (activeConnections.length != connectionCountLimit) {
        connectionPool = activeConnections;
    }

    // Wait a little while and try again.
    setTimeout(function () {
        getConnection(callback);
    }, waitForConnectionDuration);
}

function postToLogstash(connection) {
    return function (entry) {
        console.log("INFO: Posting event to logstash... ");

        // Serialise the entry as newline terminated JSON and rename the
        // Timestamp field to the @timestamp convention our ELK stack expects
        // (note this only replaces the first occurrence).
        var message = JSON.stringify(entry) + "\n";
        message = message.replace("Timestamp", "@timestamp");

        // Write the payload and close the connection once the write completes.
        connection.write(message, null, function () {
            console.log("INFO: Posting to logstash...done");
            connection.end();
        });
    }
}
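For context, this is roughly how the two functions above fit together for each log entry; parseElbLogLine and logFileContents are hypothetical stand-ins for the parsing and S3 download that the real handler does before this point:

// Hypothetical wiring only: parseElbLogLine and logFileContents stand in for
// the line parsing and file retrieval done earlier in the real handler.
logFileContents.split('\n').forEach(function (line) {
    if (!line) { return; }
    var entry = parseElbLogLine(line);
    getConnection(function (connection) {
        postToLogstash(connection)(entry);
    });
});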

Conclusion

I’d love to say that after we implemented the simple connection pooling, everything worked just fine and dandy and the victory dance was enjoyed by all.

And for once, I can!

With the simple connection pooling we implemented (which had a maximum connection count of like 50 in the first iteration), we managed to process an entire ELB log file from S3 without hitting the error we were getting before. We still need to do some more investigation around whether or not we’re actually getting all of the messages we expect to get, but it’s looking good.

I’m still not entirely sure how the issue occurred though. Originally, because of my limited knowledge of Node.js, I thought that it was creating connections in parallel. It turns out, Node.js is not actually multi-threaded at all, unlike some other, real programming languages (cough, C#, cough), so it couldn’t possibly have been opening a bunch of connections all at the same time as I had pictured.

What it might have been doing, however, is opening up a lot of connections very quickly. Node is really good at making use of the typical dead time in an application, like when executing HTTP requests and waiting for results, or when opening a file and waiting for IO to return. It’s possible that each connection being opened was ceding control of the main execution pipeline for just long enough for another connection to be opened, and so on and so forth until the process ran out of file descriptors.
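As a contrived illustration of that idea (not our actual code), a single threaded loop like the following still starts every connection attempt before any of them completes, because the callbacks cannot run until the loop has finished and control returns to the event loop:

var net = require('net');

// Every one of these connection attempts is started before any of them
// completes or fails; the callbacks only run once the loop is done.
for (var i = 0; i < 10000; i++) {
    var socket = net.createConnection(5000, 'localhost');
    socket.on('connect', function () {
        this.end();
    });
    socket.on('error', function (error) {
        // ECONNREFUSED if nothing is listening, EMFILE once the process
        // runs out of file descriptors.
        console.log(error.code);
    });
}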

Another possibility is that there was actually a connection leak, and the simple connection pool alleviated it by reusing the connection objects instead of always creating new ones.

Regardless, it now processes large ELB log files without breaking, which is good enough for this post.