With all that talk about logging environments and their improvements out of the way, it's time to return to familiar territory yet again.

Our ELB logs processor has been chugging along for a while now, dutifully aggregating all of the ELB logs for one of our APIs. We have quite a few APIs though, and almost all of them have ELB logs to process, so there was no sense keeping the awesome sequestered in one place.

Time to spread the love.

Honestly, we always planned on this sort of reuse anyway, so the logs processor was built in such a way that it could be reused in another location relatively easily.

The actual JavaScript code is completely generic, not specific to a particular Lambda function/API combo. There are obviously things that need to change between instances, but they are covered by variables, which are easy to change via Octopus Deploy. In order to set up the processor for a new API, all you need to do is add the Lambda function/permissions to the environment definition (i.e. the CloudFormation template), make a new Octopus project and add a small piece of PowerShell into the environment scripts to deploy that project.

What could possibly go wrong?

Surface Tension

Surprisingly, very little actually went wrong.

I mean, it didn’t work, and we had no log events inside the ELK stack for the new API, but at a glance, there were no catastrophic failures.

The Lambda function existed, it was receiving events from S3, downloading files, processing them and pumping them out via TCP. There were no errors (at the Lambda function level OR within the function itself) and all of the function logging showed everything working as expected.

After digging into it a little bit, it turned out that not only were the ELB logs for the new API missing, but the CI and Staging logs for the older API were missing as well. Weirdly enough, the Production logs for the older API were working exactly as expected.

Mysteriously, the code was exactly the same in all 4 Lambda functions, with the only differences being in configuration (which was relatively simple).

Forget About Freeman!

To cut a relatively long story short, the core of the problem was related to the asynchronous nature of TCP connections (via sockets) in Node.

This was the problem code:

create: function() {
    return new Promise(function(resolve, reject) {
        // The socket is handed back (and counted) as soon as it is created;
        // nothing here waits for it to actually finish connecting.
        const socket = net.createConnection(config.logstashPort, config.logstashHost);
        summary.connections.created += 1;
        resolve(socket);
    })
}

This snippet is the mechanism by which TCP connections are created and added to the pool in the Lambda function, prior to being used to actually send data to our log stack.

It turns out that a socket can be returned by net.createConnection before it has actually connected. Even worse, sockets that are still connecting will not throw errors when you attempt to write to them. I’m not sure what actually happens to the writes in this case, but giving Javascript/Node the benefit of the doubt, I assume they are probably queued in some way.

Unfortunately, the Lambda function was exiting before any of those queued writes could actually be applied, but only when there was a small total number of ELB log lines being processed. This is why it was working fine in Production (thousands of messages every execution) and failing miserably in CI/Staging and the new environment (tens of messages). The function was just too quick; it didn't know that it needed to wait for the socket to be connected and for all pending writes to execute before it was allowed to finish. As far as the promise chain was concerned, everything was done and dusted.
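
To make that failure mode concrete, here's a minimal sketch of the behaviour (not taken from the actual Lambda function); the host and port are placeholders for wherever a Logstash TCP input happens to be listening.

const net = require("net");

// Placeholder endpoint - substitute whatever your TCP listener is bound to.
const socket = net.createConnection(5678, "localhost");

// createConnection hands the socket back immediately, before the TCP handshake
// has finished, and writing to it at this point does not throw.
socket.write("some log event that looks like it was sent\n");

// Exiting now - which is effectively what the Lambda function did once its
// promise chain resolved - means the buffered write is never flushed, so the
// listener on the other end sees nothing.
process.exit(0);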

The solution is actually pretty simple:

create: function() {
    return new Promise(function(resolve, reject) {
        const socket = new net.Socket();

        // Only resolve once the socket has actually finished connecting,
        // so nothing downstream ever writes to a half-open socket.
        socket.on("connect", () => {
            summary.connections.created += 1;
            resolve(socket);
        });

        // Track the failure (keeping the last few errors for the summary),
        // clean the socket up and reject so the caller knows about it.
        socket.on("error", error => {
            summary.failures.sockets.total += 1;
            if (summary.failures.sockets.lastFew.length >= 5) {
                summary.failures.sockets.lastFew.shift();
            }
            summary.failures.sockets.lastFew.push(error);
            socket.destroy(error);
            reject(error);
        });

        socket.connect(config.logstashPort, config.logstashHost);
    })
}

Instead of immediately returning the socket, hook a listener up to its connect event and only resolve the promise then. I added in some extra error handling/reporting as well, but it's all pretty straightforward.

What this means is that a socket is never added to the pool unless it's already connected, which makes the rest of the code work exactly as expected.
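
For what it's worth, here's a hypothetical, stripped-down version of the factory and how a caller might consume it; the pool bookkeeping and summary tracking from the real function are omitted, and the endpoint is a placeholder.

const net = require("net");

const config = { logstashHost: "localhost", logstashPort: 5678 };

// Cut-down version of the factory above: the promise only resolves
// once the socket has emitted "connect".
function createConnection() {
    return new Promise(function(resolve, reject) {
        const socket = new net.Socket();
        socket.on("connect", () => resolve(socket));
        socket.on("error", error => {
            socket.destroy();
            reject(error);
        });
        socket.connect(config.logstashPort, config.logstashHost);
    });
}

// Anything chained on the promise is therefore guaranteed to be talking to a
// live connection, so writes go straight out instead of sitting in a buffer.
createConnection()
    .then(socket => {
        socket.write(JSON.stringify({ message: "hello log stack" }) + "\n");
        socket.end();
    })
    .catch(error => console.error("Could not connect to Logstash", error));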

Conclusion

It was nice to return briefly to the ELB logs processor to witness its reuse, especially because the return actually led to the discovery and eventual fix of a bug that I missed the first time.

I’m still wrapping my head around the intrinsically asynchronous/callback nature of Node though, which was the root cause of the issue. To be honest, while I really like the way that promises work, I really really hate the callback hell that vanilla Node seems to encourage. It was so easy to create a socket and write to it, without the socket actually being valid, and the only way to ensure the socket was valid was via an asynchronous event handler.

To complicate matters, I'm sure it doesn't help that AWS Lambda can only run a relatively old version of Node (4.3.2). The latest version of Node is miles ahead of that, and it probably has hundreds of bug fixes that might be relevant.

Anyway, I’m just happy I got to steal more Half-Life chapter titles as headings.

Seriously, they fit so well.