
With all that talk about logging environments and their improvements out of the way, it's time to return to familiar territory yet again.

Our ELB logs processor has been chugging along for a while now, dutifully aggregating all of the ELB logs for one of our APIs. We have quite a few APIs though, and almost all of them have ELB logs to process, so there was no sense keeping the awesome sequestered in one place.

Time to spread the love.

Honestly, we always planned on this sort of reuse anyway, so the logs processor was built in such a way that it could be relatively easily used in another location.

The actual Javascript code is completely generic, not specific to a particular Lambda function/API combo. There are things that need to change between instances obviously, but they are all covered by variables, which are easy to change via Octopus Deploy. In order to set up the processor for a new API, all you need to do is add the Lambda function/permissions to the environment definition (i.e. the CloudFormation template), make a new Octopus Project and add a small piece of Powershell into the environment scripts to deploy that project.

What could possibly go wrong?

Surface Tension

Surprisingly, very little actually went wrong.

I mean, it didn’t work, and we had no log events inside the ELK stack for the new API, but at a glance, there were no catastrophic failures.

The Lambda function existed, it was receiving events from S3, downloading files, processing them and pumping them out via TCP. There were no errors (at the Lambda function level OR within the function itself) and all of the function logging showed everything working as expected.

After digging into it a little bit, it turned out that not only were the ELB logs for the new API missing, the CI and Staging logs for the older API were missing as well. Weirdly enough, the Production logs for the older API were working exactly as expected.

Mysteriously, the code was exactly the same in all 4 Lambda functions, with the only differences being in configuration (which was relatively simple).

Forget About Freeman!

To cut a relatively long story short, the core of the problem was related to the asynchronous nature of TCP connections (via sockets) in Node.

This was the problem code:

create: function() {
    return new Promise(function(resolve, reject) {
        const socket = net.createConnection(config.logstashPort, config.logstashHost);
        summary.connections.created += 1;
        resolve(socket);
    })
}

This snippet is the mechanism by which TCP connections are created and added to the pool in the Lambda function, prior to being used to actually send data to our log stack.

It turns out that a socket can be returned by net.createConnection before it has actually connected. Even worse, sockets that are still connecting will not throw errors when you attempt to write to them. I’m not sure what actually happens to the writes in this case, but giving Javascript/Node the benefit of the doubt, I assume they are probably queued in some way.
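
As it turns out, that assumption is right: Node queues data written to a socket that is still connecting and flushes it once the connection is established. A minimal standalone sketch of the behaviour (made-up port and host, not code from the actual function):

const net = require("net");

// createConnection returns a socket immediately; the TCP handshake happens asynchronously.
const socket = net.createConnection(5000, "localhost");

// This does not throw, even though the socket is not connected yet.
// Node queues the data internally and flushes it once the connection completes.
socket.write("queued until connected\n");

socket.on("connect", () => console.log("connected, queued writes have now been flushed"));
socket.on("error", (error) => console.error("connection failed", error));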

Unfortunately, the Lambda function was exiting before any sort of queued writes could actually be applied, but only when there was a small total number of ELB log lines being processed. This is why it was working fine in production (thousands of messages every execution) and failing miserably in CI/Staging and the new environment (tens of messages). The function was just too quick and didn't know that it needed to wait for the socket to be connected and for all pending writes to execute before it was allowed to finish. As far as the promise chain was concerned, everything was done and dusted.

The solution is actually pretty simple:

create: function() {
    return new Promise(function(resolve, reject) {
        const socket = new net.Socket();

        socket.on("connect", () => {
            summary.connections.created += 1;
            resolve(socket);
        });

        socket.on("error", error => {
            summary.failures.sockets.total += 1;
            if (summary.failures.sockets.lastFew.length >= 5) {
                summary.failures.sockets.lastFew.shift();
            }
            summary.failures.sockets.lastFew.push(error);
            socket.destroy(error);
            reject(error);
        })

        socket.connect(config.logstashPort, config.logstashHost);
    })
}

Instead of immediately returning the socket, hook a listener up to its connect event and resolve the promise there. I added in some extra error handling/reporting as well, but it's all pretty straightforward.

What this means is that a socket is never added to the pool unless it's already connected, which makes the rest of the code work exactly as expected.

Conclusion

It was nice to return briefly to the ELB logs processor to witness its reuse, especially because the return actually led to the discovery and eventual fix of a bug that I missed the first time.

I’m still wrapping my head around the intrinsically asynchronous/callback nature of Node though, which was the root cause of the issue. To be honest, while I really like the way that promises work, I really really hate the callback hell that vanilla Node seems to encourage. It was so easy to create a socket and write to it, without the socket actually being valid, and the only way to ensure the socket was valid was via an asynchronous event handler.

To complicate matters, I'm sure it doesn't help that AWS Lambda can only run a relatively old version of Node (4.3.2). The latest version of Node is miles ahead of that, and it probably has hundreds of bugfixes that might be relevant.

Anyway, I’m just happy I got to steal more Half-Life chapter titles as headings.

Seriously, they fit so well.


It's getting harder and harder to come up with relevant subtitles for this particular topic. I should probably just start using numbers.

Totally nailed this one though.

To the topic at hand though, you might be forgiven for thinking that we had our AWS Lambda ELB Logs Processor well under control from my last post. It was processing files happily all the way through and everything seemed to be arriving in our log stack in an acceptable timeframe.

To give a complete picture, you can see the source for the Lambda function here.

Inside the source, the #{something} notation is an Octopus Deploy variable substitution. Basically, when this code is deployed into AWS using Octopus Deploy, those strings are replaced with real values appropriate for the environment being deployed to.
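
As a concrete (and entirely made-up) example, a configuration block in the source might look like this before deployment; the variable names are hypothetical, but the config properties match the ones used later in this post:

var config = {
    logstashHost: "#{Environment.Logstash.Host}",
    logstashPort: "#{Environment.Logstash.Port}",
    connectionCountLimit: "#{Lambda.Connection.Limit}",
    connectionWaitMilliseconds: "#{Lambda.Connection.WaitMilliseconds}"
};

After Octopus Deploy has performed its substitution for a particular environment, those placeholders contain the real host, port and limits for that environment.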

It's long enough that I didn't just want to paste it into this post, but still short enough to fit comfortably in a single file. It streams a file from S3 line by line, parses each line into its constituent parts and then posts each line to a logstash TCP endpoint. It has some basic connection pooling (homegrown) and features some basic logging.
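
In outline, the flow looks something like this (a heavily simplified sketch with hypothetical helper names, not the actual source):

var aws = require("aws-sdk");
var readline = require("readline");

// Hypothetical helpers standing in for the real parsing/posting logic.
function parseElbLogLine(line) { return { raw: line }; }
function sendToLogstash(entry) { /* write the JSON payload to the Logstash TCP endpoint */ }

exports.handler = function(event, context) {
    var s3 = new aws.S3();
    var record = event.Records[0].s3;

    // Stream the ELB log file straight out of S3 rather than downloading the whole thing first.
    var fileStream = s3.getObject({ Bucket: record.bucket.name, Key: record.object.key }).createReadStream();

    var reader = readline.createInterface({ input: fileStream });
    reader.on("line", function(line) {
        sendToLogstash(parseElbLogLine(line));
    });
    reader.on("close", function() {
        // All lines have been read; this is where the wrap-up (summary logging, cleanup) happens.
        context.succeed();
    });
};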

Unfortunately, the logging turned out to be pretty painful to use in production. This was primarily because there were multiple places inside the function that write log messages for each line encountered. In our sample data this is fine, because there are tens of lines. In production, we get over 60000 lines every run of the function, which means the CloudWatch logs are basically unusable. Not only that, but the information that comes out in the logs is pretty useless at scale, stating things like “starting posting to logstash”, “processing file”, “posted to logstash” and so on.

The other issue was the connection pooling. It seemed to be working (in that it reduced the pace of the function such that it no longer ran out of TCP connections), but I wasn't entirely sure it was doing what we thought it was. To be honest, the code was pretty hard to understand, mostly as a result of the use of asynchronous callbacks. To me, it seemed like a connection pool should be something that other people have needed, so surely there was a package available that would meet our needs.

Taking these two points into consideration, I set about improving the function by:

  1. Finding and using a library for the connection pool, hopefully making it easier to read and reason about.
  2. Making the logging output actually useful when the function was executed in production.

Car Pool

Looking around there are a few packages that enable pooling of various resources available in NPM. As seems to be the pattern though, they are of varying quality and completeness. In the end I settled on a package called “generic-pool” (Github here), which seemed pretty good. A major point in its favour was that it had a version greater than 0, which seems rare for an NPM package.

Unfortunately for me, it was promise based, and I had no idea what promises were or how to use them. I am, however, familiar with the Task Parallel Library in C#, which seems to be similar conceptually, so I didn't have to learn a whole bunch of concepts entirely from scratch.

Using the library to set up a pool of TCP connections was relatively straightforward, as the snippet below shows.

const poolFactory = {
    create: function() {
        return new Promise(function(resolve, reject) {
            const socket = net.createConnection(config.logstashPort, config.logstashHost);
            summary.connections.created += 1;
            resolve(socket);
        })
    },
    destroy: function(socket) {
        return new Promise(function(resolve, reject) {
            socket.end();
            socket.destroy();
            summary.connections.destroyed += 1;
            resolve();
        })
    },
    validate: function(socket) {
        return new Promise(function(resolve, reject){
            resolve(!socket.destroyed);
        })
    }
};

var poolOptions = {
    max: config.connectionCountLimit,
    min: 0,
    acquireTimeoutMillis: config.connectionWaitMilliseconds,
    testOnBorrow: true
};

var pool = gPool.createPool(poolFactory, poolOptions);

With a pool in place, all that was left to do was to incorporate it into the actual line parsing/processing engine, which meant a shift away from asynchronous callbacks to promises.

summary.lines.parsed += 1;

var promise = pool.acquire()
    .then((socket) => { return post(socket, entry); })
    .then((socket) => { pool.release(socket); })
    .catch((error) => { 
        summary.failures.sending.total += 1;
        if (summary.failures.sending.lastFew.length >= 5) {
            summary.failures.sending.lastFew.shift();
        }
        summary.failures.sending.lastFew.push(error);
    });

promises.push(promise);

The promises collection/array is used to determine when the function is complete, which is the opportunity to do anything that needs to happen right at the end.

Like logging.

Log Ride

Speaking of logging, the first improvement was to remove all instances where the function wrote a message for each line. Easy. Now at least the production logs wouldn’t be an unreadable mess.

The next step was to add some metrics to the logging, so we could track what the function was doing. This needed to include some statistics like lines read, lines parsed and lines sent to logstash.

This wasn’t so bad (simply introduce a function scoped variable for the metrics, and then increment the various counters whenever the specified event occurred), but it did start to cause issues when I tried to refactor the function to break it down into smaller, more easily reasoned about components. I couldn’t easily move any of the inner functions around because then they wouldn’t have the metrics object in scope. I think I might have been able to solve this problem by adding the metrics object in as a parameter to each of the functions that needed to edit it, but this would have made using those functions in the various asynchronous callbacks much harder.
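
Based on the summary that gets logged at the end of the run (shown later in this post), the function scoped metrics object looks something like this (a sketch of the shape, not the exact declaration):

var summary = {
    lines: { encountered: 0, parsed: 0, sent: 0 },
    connections: { created: 0, destroyed: 0 },
    failures: {
        parsing: { total: 0, lastFew: [] },
        sending: { total: 0, lastFew: [] }
    }
};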

With the line based metrics sorted, it was a relatively easy matter to add metrics for connection creation and destruction via the pool construct outlined above, so I don’t need to go into too much detail about that.

Finally, I also wanted to include some information about errors that occur in the process, making sure that if every single line failed for some reason, the CloudWatch logs would still remain readable. The easiest way to do this was to accumulate a small queue of the most recent errors, pushing older ones out in favour of newer ones with the understanding that the last 5 errors is probably enough information to diagnose issues. You can see the code for doing this in the promise catch handler above.

The only thing left to do was to actually log the metrics object that contained all of this juicy information, and the only place where this was possible was in the handler for when the readline module had finished reading the entire file.

function handleReaderClose() {
    console.log('File reader for ELB log file is closing because all lines have been read. Waiting for all promises (for sending parsed lines to logstash) to resolve');
    Promise
        .all(promises)
        .then(() => { console.log("Cleaning up the connection pool, which has [%s/%s] (current/max) connections", pool.size, pool.max); return pool.drain(); })
        .then(() => pool.clear())
        .then(() => { 
            console.log("All processing complete. Summary follows"); 
            console.log("%s", JSON.stringify(summary, fixStringifyError, 4)); 
        });
}

There are a few other things in there that I'm not going into too much detail about, like cleaning up the pool and a custom function to stringify errors, but the important part for this discussion is the logging.
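
For the curious, the error-friendly stringification can be accomplished with a replacer function along these lines (a sketch of the idea, not necessarily the actual fixStringifyError implementation):

function fixStringifyError(key, value) {
    // The interesting properties of an Error (message, stack, etc.) are non-enumerable,
    // so JSON.stringify turns an Error into "{}" by default; copy the useful bits onto a plain object instead.
    if (value instanceof Error) {
        return { name: value.name, message: value.message, stack: value.stack };
    }
    return value;
}

// Usage, matching the call in the close handler above:
// console.log("%s", JSON.stringify(summary, fixStringifyError, 4));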

Conclusion

After doing the work above, I'm a bit happier with the current state of the AWS Lambda ELB Logs Processor. I feel like it's better structured and easier to reason about, mostly due to the introduction of promises instead of callbacks. In addition to the quality of the code itself, the improved log output makes it easier to see how the function is going on a day to day basis as it chews through thousands and thousands of lines every run.

In particular, the summary that gets output right at the end is extremely useful for getting a high level picture of the output of a single run without overloading the reader.

An example of the sort of output we see from a normal (successful) run is below:

START RequestId: c0e94f4d-051f-11e7-8fbb-2b8356f39769 Version: $LATEST
2017-03-09T23:25:59.785Z    c0e94f4d-051f-11e7-8fbb-2b8356f39769    Retrieving ELK log file from S3 bucket/key specified in the initiating event. Bucket: [REDACTED], Key: [REDACTED]
2017-03-09T23:27:31.344Z    c0e94f4d-051f-11e7-8fbb-2b8356f39769    File reader for ELB log file is closing because all lines have been read. Waiting for all promises (for sending parsed lines to logstash) to resolve
2017-03-09T23:27:34.924Z    c0e94f4d-051f-11e7-8fbb-2b8356f39769    Cleaning up the connection pool, which has [48/100] (current/max) connections
2017-03-09T23:27:34.967Z    c0e94f4d-051f-11e7-8fbb-2b8356f39769    All processing complete. Summary follows
2017-03-09T23:27:35.003Z    c0e94f4d-051f-11e7-8fbb-2b8356f39769
{
    "lines": {
        "encountered": 76464,
        "parsed": 76464,
        "sent": 76464
    },
    "connections": {
        "created": 48,
        "destroyed": 48
    },
    "failures": {
        "parsing": {
            "total": 0,
            "lastFew": []
        },
        "sending": {
            "total": 0,
            "lastFew": []
        }
    }
}
END RequestId: c0e94f4d-051f-11e7-8fbb-2b8356f39769
REPORT RequestId: c0e94f4d-051f-11e7-8fbb-2b8356f39769    Duration: 95227.56 ms    Billed Duration: 95300 ms Memory Size: 192 MB    Max Memory Used: 165 MB

There are still many improvements to be made (especially regarding the structure of the repo and its complete lack of automated tests), but it's definitely more betterer than it was.


As I mentioned briefly last week, our AWS Lambda ELB Logs Processor did not quite work when we pushed it to production. This was a shame, because we were pretty sure we got it right this time. Ah well, every failure is a learning experience.

In this post I’m going to briefly talk about how we deployed the new logs processor, how we identified the issue (and what it was) and finally how we fixed it.

Colour Blind

In order to maximise our ability to adapt quickly, we try to deploy as frequently as possible, especially when it comes to services that we fully control (like websites and API’s). The most important thing to get right when you’re deploying frequently is the ability to do so without disturbing the end-user. Historically, deployments dodged this problem by simply deploying during periods of low use, which for us means Sundays or late at night. I certainly don’t want to deal with that though, and I don’t want to delegate it to some poor sap, so instead we just make sure our change sets are small and easy to reason about, and our deployments happen in such a way that the service is never fully down while the deployment is occurring (easily accomplished in AWS with multiple deployment locations combined with rolling deployments).

For bigger changes though, we’ve started using blue/green deployments. Essentially this means having two completely separate environments active at the one time, with some top level mechanism for directing traffic to the appropriate one as necessary. For us this is a top level URL (like service.thing.com) which acts as a DNS alias for an environment specific URL (like prod-green-service.thing.com). We then use Route53 and Weighted DNS to direct traffic as we so desire.
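
As a rough illustration of the mechanism (hosted zone id and weights are made up, and this is just the AWS SDK equivalent of what can also be done via the console or CLI), shifting a share of traffic to the blue environment means upserting a weighted record along these lines:

var aws = require("aws-sdk");
var route53 = new aws.Route53();

route53.changeResourceRecordSets({
    HostedZoneId: "Z123EXAMPLE", // hypothetical hosted zone for thing.com
    ChangeBatch: {
        Changes: [{
            Action: "UPSERT",
            ResourceRecordSet: {
                Name: "service.thing.com",
                Type: "CNAME",
                SetIdentifier: "blue",
                Weight: 10, // relative to the green record's weight, e.g. 10 vs 90 for a 10% share
                TTL: 60,
                ResourceRecords: [{ Value: "prod-blue-service.thing.com" }]
            }
        }]
    }
}, function(error) {
    if (error) { console.error(error); }
});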

For websites, blue/green deployments are trivial, assuming the backend services still work the same (or have already been deployed to be compatible with both versions). For those titular backend services though, blue/green deployments can be challenging, especially when you have to take into account data persistence and continuity of service.

When it comes to whatever persistence layer is necessary, our newer services usually feature data storage in a separately maintained environment (like RDS instances, S3 buckets and so on), specifically to help us do blue/green deployments on parts of the architecture that can freely change without causing any issues with data. Some of our earlier services did not do this, and as such are problematic when it comes to blue/green deployments. In those cases we usually test the new system by using traffic replication and then resort to traditional downtime when it comes time to do the actual deployment.

Blue/green deployments have proved to be particularly useful for us because of the way we handle our environments as code.

To tie this all back in with the ELB logs processor, we used our environment package to create a blue environment for the service with the ELB logs processor in it (because our currently active one was green). Once the environment was up and running, we used weighted DNS inside Route53 to shunt a small amount of traffic to the blue environment to see if the ELB logs were being processed into our ELK stack correctly.

And they were, and the victory dance was done and everyone lived happily ever after.

Oh Yeah, Scale

Well, not quite.

The small amount of test traffic worked perfectly. All of the traffic was being represented inside our ELK stack as expected. Continuing on with the traditional blue/green deployment, we increased the amount of traffic hitting the new blue environment by increasing its weight in Route53.

Once we got to around 10% of the total traffic, things started to fall down. The new blue environment was handling the traffic perfectly well, but we were not seeing the expected number of log events from the ELB inside the ELK stack.

Looking at the CloudWatch logs, it looked like the Lambda function was simply taking longer to run than the default timeout (which is 3 seconds), which was a silly oversight on our part. Considering it was trying to process a few megabytes of raw log data, it's not surprising that it wasn't getting through the entire thing.

AWS Lambda is charged based on the actual execution time combined with the resources allocated to the function, so the timeout basically represents the maximum amount of money you can be charged for each function execution. If you have an unbounded number of function executions (i.e. they occur on some unreliable trigger), then this can be very useful for limiting your potential costs. For our usage, we know that the ELB logs are generated approximately once every 5 minutes, so we're pretty safe to set the timeout to the maximum (300 seconds) to give the function as much time as possible to process the log file.
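
Changing the timeout is a single API call as well. A hypothetical one-off snippet (in practice the value belongs in the CloudFormation template that defines the function, and the function name here just assumes an environment prefix convention):

var aws = require("aws-sdk");
var lambda = new aws.Lambda();

lambda.updateFunctionConfiguration({
    FunctionName: "prod-ELBLogProcessorFunction", // assumed name
    Timeout: 300 // up from the 3 second default
}, function(error, data) {
    if (error) { console.error(error); }
    else { console.log("Timeout is now " + data.Timeout + " seconds"); }
});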

With that small change in place more of the log file was being processed, but it still wasn't processing the entire thing. The good news was that it was no longer just timing out and terminating itself, but the bad news was that it was now just plain old crashing after processing some of the file.

No It's Fine, I Don't Need Those Connections Again

Javascript Lambda functions automatically write their output to CloudWatch, which is really useful from a debugging and tracing point of view. I mean, I don't know how it would work otherwise (because you literally cannot access the underlying operating system they run on), but it's still nice that it just works out of the box.

In our case, the error was as follows:

Error: connect EMFILE {ip}:{port} - Local (undefined:undefined)
at Object.exports._errnoException (util.js:870:11)
at exports._exceptionWithHostPort (util.js:893:20)
at connect (net.js:843:14)
at net.js:985:7
at GetAddrInfoReqWrap.asyncCallback [as callback] (dns.js:63:16)
at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:82:10)

A small amount of digging showed that this error occurs when a TCP connection cannot be established to the specified IP over the specified port. More specifically, EMFILE means the process had run out of file descriptors, i.e. it had too many things (like sockets) open at once.

If you look back at the code that actually processes the ELB log file, the only place where a TCP connection is being made is after a line in the file has been processed and transformed into a JSON payload, ready to be pushed to our ELK stack via Logstash. Being that the error only manifests after part of the file has already been processed successfully, it looked like the issue was one of resource exhaustion.

The obvious candidate was that the Lambda function was simply trying to open too many TCP connections at once. This made sense based on my understanding of Node.js at the time, so we implemented a connection pooling mechanism to prevent it from occurring (i.e. instead of simply trying to establish a connection, it would try to draw one from a global pool with a limit and if one was not available, wait for a few moments until it was).

Because each connection was only required for a few moments, the solution would essentially throttle the processing to whatever limit we impose, hopefully dodging the perceived problem with too many parallel connections.

function getConnection(callback) {
    if (connectionPool.length < connectionCountLimit) {
        console.log('creating new connection');
        const newSocket = net.createConnection(_logPort, _logHost);
        connectionPool.push(newSocket);
        return callback(newSocket);
    }

    const activeConnections = connectionPool.filter(function (socket) {
        return !socket.destroyed;
    });
    if (activeConnections.length != connectionCountLimit) {
        connectionPool = activeConnections;
    }

    setTimeout(function () {
        getConnection(callback);
    }, waitForConnectionDuration);
}

function postToLogstash(connection) {
    return function (entry) {
        console.log("INFO: Posting event to logstash... ");
        var message = JSON.stringify(entry) + "\n";
        message = message.replace("Timestamp", "@timestamp");
        connection.write(message, null, function () {
            console.log("INFO: Posting to logstash...done");
            connection.end();
        });
    }
}

Conclusion

I’d love to say that after we implemented the simple connection pooling, everything worked just fine and dandy and the victory dance was enjoyed by all.

And for once, I can!

With the simple connection pooling we implemented (which had a maximum connection count of like 50 in the first iteration), we managed to process an entire ELB log file from S3 without getting the error we were getting before. We still need to do some more investigation around whether or not we're actually getting all of the messages we expect to get, but it's looking good.

I’m still not entirely sure how the issue occurred though. Originally, because of my limited knowledge of Node.js, I thought that it was creating connections in parallel. It turns out, Node.js is not actually multi-threaded at all, unlike some other, real programming languages (cough, c#, cough), so it couldn’t possibly have been opening a bunch of connections all at the same time as I pictured.

What it might have been doing, however, is opening up a lot of connections very quickly. Node is really good at making use of the typical dead time in an application, like when executing HTTP requests and waiting for results, or when opening a file and waiting for IO to return. It's possible that each connection being opened was ceding control of the main execution pipeline for just long enough for another connection to be opened, so on and so forth until the underlying hardware ran out of resources.
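
To make that concrete, here is a contrived sketch (not the actual processing loop) of how a synchronous loop can start a huge number of connections before the event loop ever gets a chance to complete any of them:

var net = require("net");

for (var i = 0; i < 100000; i++) {
    // Each call returns immediately; none of these sockets can finish connecting
    // (or be cleaned up) until the loop ends, so open descriptors just pile up.
    var socket = net.createConnection(5000, "localhost");
    socket.on("error", function(error) { console.error(error.code); }); // eventually "EMFILE"
    socket.write("line " + i + "\n");
    socket.end();
}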

Another possibility is that there was actually a connection leak, and the simple connection pool alleviated it by reusing the connection objects instead of always creating new ones.

Regardless, it processes through large ELB log files without breaking, which is good enough for this post.


Over the last few weeks, I’ve been sharing the bits and pieces that went into our construction of an ELB logs processing pipeline using AWS Lambda.

As I said in the introduction, the body of work around the processor can be broken down into three pieces:

I’ve gone through each of these components in detail in the posts I’ve linked above, so this last post is really just to tie it all together and reflect on the process, as well as provide a place to mention a few things that didn’t neatly fit into any of the buckets above.

Also, I'll obviously be continuing the pattern of naming every sub-header after a chapter title in the original Half-Life, almost all of which have proven to be surprisingly apt for whatever topic is being discussed. I mean seriously, look at this next one. How is that not perfectly fitting for a conclusion/summary post?

Residue Processing

It took a fair amount of effort for us to get to the solution we have in place now, and a decent amount of time. The whole thing was put together over the course of a few weeks by one of the people I work with, with some guidance and feedback from other members of the team from time to time. This timeframe was to develop, test and then deploy the solution into a real production environment, by a person with little to no working knowledge of the AWS toolset, so I think it was a damn good effort.

The most time consuming part was the long turnaround on environment builds, because each build needs to run a suite of tests which involve creating and destroying at least one environment, sometimes more. In reality, this means a wait time of something like 30-60 minutes per build, which is so close to eternity as to be effectively indistinguishable from it. I'll definitely have to come up with some sort of way to tighten this feedback loop, but being that most of it is actually just waiting for AWS resources, I'm not really sure what I can do.

The hardest part of the whole process was probably just working with Lambda for the first time outside of the AWS management website.

As a team, we’d used Lambda before (back when I tried to make something to clone S3 buckets more quickly), but we’d never tried to manage the various Lambda bits and pieces through CloudFormation.

It turns out that the AWS website does a hell of a lot of things in order to make sure that your Lambda function runs, including dealing with profiles and permissions, network interfaces, listeners and so on. Having to do all of that explicitly through CloudFormation was something of a learning process.

Speaking of CloudFormation and Lambda, we ran into a nasty bug with Elastic Network Interfaces and VPC hosted Lambda functions created through CloudFormation, where the CloudFormation stack doesn't delete cleanly because the ENI is still in use. It looks like it's a known issue, so I assume it will be fixed at some point in the future, but as a result we had to include some additional cleanup in the Powershell that wraps our environment management to check the stack for Lambda functions and manually remove and delete the ENI before we try to delete the stack.

This isn’t the first time we’ve had to manually cleanup resources “managed” by CloudFormation. We do the same thing with S3 buckets because CloudFormation won’t delete a bucket with anything in it (and some of our buckets, like the ELB logs ones, are constantly being written to by other AWS services).

The only other difficult part of the whole thing I’ve already mentioned in the deployment post, which was figuring out how we could incorporate non-machine based Octopus deployments into our environments. For now they just happen after the actual AWS stack is created (as part of the Powershell scripts wrapping the entire process) and rely on having an Octopus tentacle registered in each environment on the Octopus Server machine, used as a script execution point.

Conclusion

Having put this whole system in place, the obvious question is “Was it worth it?”.

For me, the answer is “definitely”.

We've managed to retire a few hacky components (a Windows service running Powershell scripts via NSSM to download files from an S3 bucket, for example) and removed an entire machine from every environment that needs to process ELB logs. It's not often that you get to reduce both running and maintenance costs in one blow, so it was nice to get that accomplished.

Ignoring the reduced costs to the business for a second, we’ve also decreased the latency for receiving our ELB logs for analysis because rather than relying on a polling system, we’re now triggering the processing directly when the ELB writes the log file into S3.

Finally, we’ve gained some more experience with systems and services that we haven’t really had a chance to look into, allowing us to leverage that knowledge and tooling for other, potentially more valuable purposes.

All in all, I consider this exercise a resounding success, and I’m happy I was able to dedicate some time to improving an existing process, even though it was already “working”.

Improving existing engineering like this is incredibly valuable to the morale of a development team, which is an important and limited resource.


Another week, another piece of the puzzle.

This time around, I'll go through how we've set up the build and deployment pipeline for the Lambda function code that processes the ELB log files. It's not a particularly complex system, but it was something completely different compared to things we've done in the past.

In general terms, building software is relatively straightforward, as long as you have two things:

  • A mechanism of tracking changes to the source code, i.e. source control.
  • The ability to validate the code and create a versioned package, where the behaviour of the code can be reasoned about.

With at least those two pieces in place (preferably fully automated), your builds are taken care of. Granted, accomplishing those things is not as simple as two dot points would lead you to believe, but conceptually there is not a lot going on.

Once you have versioned packages that can be reasoned about, all you need to do is figure out how to deliver them to the appropriate places. Again, ideally the whole process is automated. You shouldn’t have to remote onto a machine and manually copy files around, as that hardly ever ends well. This can get quite complicated depending on exactly what it is that you are deploying: on-premises software can be notoriously painful to deploy without some supporting system, but deploying static page websites to S3 is ridiculously easy.

My team uses TeamCity, Nuget and Powershell to accomplish the build part and Octopus Deploy to deploy almost all of our code, and I don’t plan on changing any of that particularly soon.

Some people seem to think that because it's so easy to manage Lambda functions from the AWS management website, they don't need a build and deployment pipeline. After all, just paste the code into the website and you're good to go, right?

I disagree vehemently.

Power Up

Our ELB logs processor Lambda function follows our normal repository structure, just like any other piece of software we write.

The code for the Lambda function goes into the /src folder, along with a Nuspec file that describes how to construct the resulting versioned package at build time.

Inside a /scripts folder is a build script, written in Powershell, containing all of the logic necessary to build and verify a deployable package. It mostly just leverages a library of common functions (so our builds are consistent), and its goal is to facilitate all of the parts of the pipeline, like compilation (ha Javascript), versioning, testing, packaging and deployment.

Some build systems are completely encapsulated inside the software that does the build, like Jenkins or TeamCity. I don’t like this approach, because you can’t easily run/debug the build on a different machine for any reason. I much prefer the model where the repository has all of the knowledge necessary to do the entire build and deployment process, barring the bits that it needs to accomplish via an external system.

The build script for the ELB logs processor function is included below, but keep in mind, this is just the entry point, and a lot of the bits and pieces are inside the common functions that you can see referenced.

[CmdletBinding()]
param
(
    [switch]$deploy,
    [string]$octopusServerUrl,
    [string]$octopusApiKey,
    [string]$component,
    [string]$commaSeparatedDeploymentEnvironments,
    [string[]]$projects,
    [int]$buildNumber,
    [switch]$prerelease,
    [string]$prereleaseTag
)

$error.Clear()

$ErrorActionPreference = "Stop"

$here = Split-Path $script:MyInvocation.MyCommand.Path

. "$here\_Find-RootDirectory.ps1"

$rootDirectory = Find-RootDirectory $here
$rootDirectoryPath = $rootDirectory.FullName

. "$rootDirectoryPath\scripts\Invoke-Bootstrap.ps1"
. "$rootDirectoryPath\scripts\common\Functions-Build.ps1"

$arguments = @{}
$arguments.Add("Deploy", $deploy)
$arguments.Add("CommaSeparatedDeploymentEnvironments", $commaSeparatedDeploymentEnvironments)
$arguments.Add("OctopusServerUrl", $octopusServerUrl)
$arguments.Add("OctopusServerApiKey", $octopusApiKey)
$arguments.Add("Projects", $projects)
$arguments.Add("VersionStrategy", "SemVerWithPatchFilledAutomaticallyWithBuildNumber")
$arguments.Add("buildNumber", $buildNumber)
$arguments.Add("Prerelease", $prerelease)
$arguments.Add("PrereleaseTag", $prereleaseTag)
$arguments.Add("BuildEngineName", "nuget")

Build-DeployableComponent @arguments

I’m pretty sure I’ve talked about our build process and common scripts before, so I’m not going to go into any more detail.

Prior to deployment, the only interesting output is the versioned Nuget file, containing all of the dependencies necessary to run the Lambda function (which in our case is just the file in my previous post).

On A Rail

When it comes to deploying the Lambda function code, it's a little bit more complicated than our standard software deployments via Octopus Deploy.

In most cases, we create a versioned package containing all of the necessary logic to deploy the software, so in the case of an API it contains a deploy.ps1 script that gets run automatically during deployment, responsible for creating a website and configuring IIS on the local machine. The most important thing Octopus does for us is provide mechanisms to get this package to the place where it needs to be.

It usually does this via an Octopus Tentacle, a service that runs on the deployment target and allows for communication between the Octopus server and the machine in question.

This system kind of falls apart when you’re trying to deploy to an AWS Lambda function, which cannot have a tentacle installed on it.

Instead, we rely on the AWS API and what amounts to a worker machine sitting in each environment.

When we do a deployment of our Lambda function project, it gets copied to the worker machine (which is actually just the Octopus server) and it runs the deployment script baked into the package. This script then uses Octopus variables to package the code again (in a way that AWS likes, a simple zip file) and uses the AWS API to upload the changed code to the appropriate Lambda function (by following a naming convention).

The deployment script is pretty straightforward:

function Get-OctopusParameter
{
    [CmdletBinding()]
    param
    (
        [string]$key
    )

    if ($OctopusParameters -eq $null)
    {
        throw "No variable called OctopusParameters is available. This script should be executed as part of an Octopus deployment."
    }

    if (-not($OctopusParameters.ContainsKey($key)))
    {
        throw "The key [$key] could not be found in the set of OctopusParameters."
    }

    return $OctopusParameters[$key]
}

$VerbosePreference = "Continue"
$ErrorActionPreference = "Stop"

$here = Split-Path -Parent $MyInvocation.MyCommand.Path
. "$here\_Find-RootDirectory.ps1"


$rootDirectory = Find-RootDirectory $here
$rootDirectoryPath = $rootDirectory.FullName

$awsKey = $OctopusParameters["AWS.Deployment.Key"]
$awsSecret = $OctopusParameters["AWS.Deployment.Secret"]
$awsRegion = $OctopusParameters["AWS.Deployment.Region"]

$environment = $OctopusParameters["Octopus.Environment.Name"]
$version = $OctopusParameters["Octopus.Release.Number"]

. "$rootDirectoryPath\scripts\common\Functions-Aws.ps1"
$aws = Get-AwsCliExecutablePath

$env:AWS_ACCESS_KEY_ID = $awsKey
$env:AWS_SECRET_ACCESS_KEY = $awsSecret
$env:AWS_DEFAULT_REGION = $awsRegion

$functionPath = "$here\src\function"

Write-Verbose "Compressing lambda code file"
Add-Type -AssemblyName System.IO.Compression.FileSystem
[system.io.compression.zipfile]::CreateFromDirectory($functionPath, "index.zip")

Write-Verbose "Updating Log Processor lambda function to version [$environment/$version]"
(& $aws lambda update-function-code --function-name $environment-ELBLogProcessorFunction --zip-file fileb://index.zip) | Write-Verbose

Nothing fancy, just using the AWS CLI to deploy code to a known function.

Apprehension

AWS Lambda does provide some other mechanisms to deploy code, and we probably could have used them to accomplish the same sort of thing, but I like our patterns to stay consistent and I’m a big fan of the functionality that Octopus Deploy provides, so I didn’t want to give that up.

We had to make a few changes to our environment pattern to allow for non-machine based deployment, like:

  • Ensuring every automatically created environment has a machine registered in it representing the Octopus server (for running scripts that affect external systems)
    • This meant that our environment cleanup also needed to take this into account, deregistering the machine before trying to remove the Octopus environment
  • Providing a section at the end of the environment setup to deploy Octopus projects that aren’t related to specific machines
    • Previously all of our deployments happened via cfn-init on the EC2 instances in question

Once all of that was in place, it was pretty easy to deploy code to a Lambda function as part of setting up the environment, just like we would deploy code to an EC2 instance. It was one of those cases where I’m glad we wrap our usage of CloudFormation in Powershell, because if we were just using raw CloudFormation, I’m not sure how we would have integrated the usage of Octopus Deploy.

To Be Summarised

I’ve only got one more post left in this series, which will summarise the entire thing and link back to all the constituent parts.

Until then, I don’t really have anything else to say.