
With all that talk about logging environments and their improvements out of the way, it’s time to return to familiar territory yet again.

Our ELB logs processor has been chugging along for a while now, dutifully aggregating all of the ELB logs for one of our APIs. We have quite a few APIs though, and almost all of them have ELB logs to process, so there was no sense keeping the awesome sequestered in one place.

Time to spread the love.

Honestly, we always planned on this sort of reuse anyway, so the logs processor was built in such a way that it could be relatively easily used in another location.

The actual Javascript code is completely generic, not specific to a particular Lambda function/API combo. There are things that need to change between instances obviously, but they are covered by variables, which are easy to change via Octopus Deploy. In order to set up the processor for a new API, all you need to do is add the Lambda function/permissions to the environment definition (i.e. the CloudFormation template), make a new Octopus Project and add a small piece of Powershell into the environment scripts to deploy that project.

What could possibly go wrong?

Surface Tension

Surprisingly, very little actually went wrong.

I mean, it didn’t work, and we had no log events inside the ELK stack for the new API, but at a glance, there were no catastrophic failures.

The Lambda function existed, it was receiving events from S3, downloading files, processing them and pumping them out via TCP. There were no errors (at the Lambda function level OR within the function itself) and all of the function logging showed everything working as expected.

After digging into it a little bit, it turned out that not only were the ELB logs for the new API missing, the CI and Staging logs for the older API were missing as well. Weirdly enough, the Production logs for the older API were working exactly as expected.

Mysteriously, the code was exactly the same in all 4 Lambda functions, with the only differences being in configuration (which was relatively simple).

Forget About Freeman!

To cut a relatively long story short, the core of the problem was related to the asynchronous nature of TCP connections (via sockets) in Node.

This was the problem code:

create: function() {
    return new Promise(function(resolve, reject) {
        const socket = net.createConnection(config.logstashPort, config.logstashHost);
        summary.connections.created += 1;
        resolve(socket);
    })
}

This snippet is the mechanism by which TCP connections are created and added to the pool in the Lambda function, prior to being used to actually send data to our log stack.

It turns out that a socket can be returned by net.createConnection before it has actually connected. Even worse, sockets that are still connecting will not throw errors when you attempt to write to them. I wasn’t sure at the time what actually happens to those writes, but Node queues them internally and only flushes them once the connection is established.
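
To see that behaviour in isolation, here’s a minimal standalone sketch (not part of the processor, and it assumes something is listening on localhost:5000):

const net = require('net');

// createConnection returns the socket object immediately, before the TCP
// handshake has completed.
const socket = net.createConnection(5000, 'localhost');

// Writes issued at this point do not throw; Node buffers them internally.
socket.write('this line is queued, not sent\n');

socket.on('connect', () => {
    // Only once the 'connect' event fires are the buffered writes flushed.
    console.log('connected, queued writes flushed');
    socket.end();
});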

Unfortunately, the Lambda function was exiting before any sort of queued writes could actually be applied, but only when there was a small total number of ELB log lines being processed. This is why it was working fine in production (thousands of messages every execution) and failing miserably in CI/Staging and the new environment (tens of messages). The function was just too quick and didn’t know that it needed to wait for the socket to be connected and for all pending writes to execute before it was allowed to finish. As far as the promise chain was concerned, everything was done and dusted.

The solution is actually pretty simple:

create: function() {
    return new Promise(function(resolve, reject) {
        const socket = new net.Socket();

        socket.on("connect", () => {
            summary.connections.created += 1;
            resolve(socket);
        });

        socket.on("error", error => {
            summary.failures.sockets.total += 1;
            if (summary.failures.sockets.lastFew.length >= 5) {
                summary.failures.sockets.lastFew.shift();
            }
            summary.failures.sockets.lastFew.push(error);
            socket.destroy(error);
            reject(error);
        });

        socket.connect(config.logstashPort, config.logstashHost);
    })
}

Instead of immediately returning the socket, hook a listener up to its connect event and only resolve the promise at that point. I added in some extra error handling/reporting as well, but it’s all pretty straightforward.

What this means is that a socket is never added to the pool unless it’s already connected, which makes the rest of the code work exactly as expected.

Conclusion

It was nice to return briefly to the ELB logs processor to witness its reuse, especially because the return actually led to the discovery and eventual fix of a bug that I missed the first time.

I’m still wrapping my head around the intrinsically asynchronous/callback nature of Node though, which was the root cause of the issue. To be honest, while I really like the way that promises work, I really really hate the callback hell that vanilla Node seems to encourage. It was so easy to create a socket and write to it, without the socket actually being valid, and the only way to ensure the socket was valid was via an asynchronous event handler.

To complicate matters, I’m sure it doesn’t help that AWS Lambda can only run a relatively old version of Node (4.3.2). The latest version of Node is miles ahead of that, and it probably has hundreds of bugfixes that might be relevant.

Anyway, I’m just happy I got to steal more Half-Life chapter titles as headings.

Seriously, they fit so well.


It’s getting harder and harder to come up with relevant subtitles for this particular topic. I should probably just start using numbers.

Totally nailed this one though.

To the topic at hand though, you might be forgiven for thinking that we had our AWS Lambda ELB Logs Processor well under control from my last post. It was processing files happily all the way through and everything seemed to be arriving in our log stack in an acceptable timeframe.

To give a complete picture, you can see the source for the Lambda function here.

Inside the source, the #{something} notation is an Octopus Deploy variable substitution. Basically, when this code is deployed into AWS using Octopus Deploy, those strings are replaced with real values appropriate for the environment being deployed to.

It’s long enough that I didn’t just want to paste it into this post, but still short enough to fit comfortably in a single file. It streams a file from S3 line by line, parses each line into its constituent parts and then posts each line to a logstash TCP endpoint. It has some basic connection pooling (homegrown) and features some basic logging.

Unfortunately, the logging turned out to be pretty painful to use in production. This was primarily because there were multiple places inside the function that wrote a log message for each line encountered. In our sample data this is fine, because there are tens of lines. In production, we get over 60000 lines every run of the function, which means the CloudWatch logs are basically unusable. Not only that, but the information that comes out in the logs is pretty useless at scale, stating things like “starting posting to logstash”, “processing file”, “posted to logstash” and so on.

The other issue was the connection pooling. It seemed to be working (in that it reduced the pace of the function such that it no longer ran out of TCP connections), but I wasn’t entirely sure it was doing what we thought it was. To be honest, the code was pretty hard to understand, mostly as a result of the use of asynchronous callbacks. To me, it seemed like a connection pool should be something that other people have needed, so surely there was a package available that would meet our needs.

Taking these two points into consideration, I set about improving the function by:

  1. Finding and using a library for the connection pool, hopefully making it easier to read and reason about.
  2. Making the logging output actually useful when the function was executed in production.

Car Pool

Looking around, there are a few packages available on NPM that enable pooling of various resources. As seems to be the pattern though, they are of varying quality and completeness. In the end I settled on a package called “generic-pool” (Github here), which seemed pretty good. A major point in its favour was that it had a version greater than 0, which seems rare for an NPM package.

Unfortunately for me, it was promise based, and I had no idea what promises were or how to use them. I am, however, familiar with the Task Parallel Library in C#, which seems to be similar conceptually, so I didn’t have to learn a whole bunch of concepts entirely from scratch.
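
For anyone else coming at promises from the TPL, here is a contrived sketch of the basic pattern (not from the processor itself):

// A promise is roughly analogous to a Task: a value (or failure) that will
// turn up at some point in the future.
function delay(milliseconds, value) {
    return new Promise(function(resolve, reject) {
        setTimeout(function() { resolve(value); }, milliseconds);
    });
}

// .then chains continuations (like ContinueWith), .catch handles failures
// anywhere in the chain (like observing a faulted Task).
delay(100, 'first')
    .then(function(result) {
        console.log(result);          // 'first'
        return delay(100, 'second');  // returning a promise keeps the chain going
    })
    .then(function(result) { console.log(result); })
    .catch(function(error) { console.log('something failed', error); });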

Using the library to set up a pool of TCP connections was relatively straightforward, as the snippet below shows.

const poolFactory = {
    create: function() {
        return new Promise(function(resolve, reject) {
            const socket = net.createConnection(config.logstashPort, config.logstashHost);
            summary.connections.created += 1;
            resolve(socket);
        })
    },
    destroy: function(socket) {
        return new Promise(function(resolve, reject) {
            socket.end();
            socket.destroy();
            summary.connections.destroyed += 1;
            resolve();
        })
    },
    validate: function(socket) {
        return new Promise(function(resolve, reject){
            resolve(!socket.destroyed);
        })
    }
};

var poolOptions = {
    max: config.connectionCountLimit,
    min: 0,
    acquireTimeoutMillis: config.connectionWaitMilliseconds,
    testOnBorrow: true
};

var pool = gPool.createPool(poolFactory, poolOptions);

With a pool in place, all that was left to do was to incorporate it into the actual line parsing/processing engine, which meant a shift away from asynchronous callbacks to promises.

summary.lines.parsed += 1;

var promise = pool.acquire()
    .then((socket) => { return post(socket, entry); })
    .then((socket) => { pool.release(socket); })
    .catch((error) => { 
        summary.failures.sending.total += 1;
        if (summary.failures.sending.lastFew.length >= 5) {
            summary.failures.sending.lastFew.shift();
        }
        summary.failures.sending.lastFew.push(error);
    });

promises.push(promise);

The promises collection/array is used to determine when the function is complete, which is the opportunity to do anything that needs to happen right at the end.

Like logging.

Log Ride

Speaking of logging, the first improvement was to remove all instances where the function wrote a message for each line. Easy. Now at least the production logs wouldn’t be an unreadable mess.

The next step was to add some metrics to the logging, so we could track what the function was actually doing. This needed to include some statistics like lines read, lines parsed and lines sent to logstash.

This wasn’t so bad (simply introduce a function scoped variable for the metrics, and then increment the various counters whenever the specified event occurred), but it did start to cause issues when I tried to refactor the function to break it down into smaller, more easily reasoned about components. I couldn’t easily move any of the inner functions around because then they wouldn’t have the metrics object in scope. I think I might have been able to solve this problem by adding the metrics object in as a parameter to each of the functions that needed to edit it, but this would have made using those functions in the various asynchronous callbacks much harder.
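
For illustration, the function scoped metrics object ended up looking something like the sketch below (reconstructed from the summary output shown later in this post, so treat the exact shape as an approximation):

// Function scoped so that every handler and callback in the file can see it.
var summary = {
    lines: {
        encountered: 0,
        parsed: 0,
        sent: 0
    },
    connections: {
        created: 0,
        destroyed: 0
    },
    failures: {
        parsing: { total: 0, lastFew: [] },
        sending: { total: 0, lastFew: [] }
    }
};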

With the line based metrics sorted, it was a relatively easy matter to add metrics for connection creation and destruction via the pool construct outlined above, so I don’t need to go into too much detail about that.

Finally, I also wanted to include some information about errors that occur in the process, making sure that if every single line failed for some reason, the CloudWatch logs would still remain readable. The easiest way to do this was to accumulate a small queue of the most recent errors, pushing older ones out in favour of newer ones with the understanding that the last 5 errors is probably enough information to diagnose issues. You can see the code for doing this in the promise catch handler above.

The only thing left to do was to actually log the metrics object that contained all of this juicy information, and the only place where this was possible was in the handler for when the readline module had finished reading the entire file.

function handleReaderClose() {
    console.log('File reader for ELB log file is closing because all lines have been read. Waiting for all promises (for sending parsed lines to logstash) to resolve');
    Promise
        .all(promises)
        .then(() => { console.log("Cleaning up the connection pool, which has [%s/%s] (current/max) connections", pool.size, pool.max); return pool.drain(); })
        .then(() => pool.clear())
        .then(() => { 
            console.log("All processing complete. Summary follows"); 
            console.log("%s", JSON.stringify(summary, fixStringifyError, 4)); 
        });
}

There are a few other things in there that I’m not going into too much detail about, like cleaning up the pool and a custom function to stringify errors, but the important part for this discussion is the logging.

Conclusion

After doing the work above, I’m a bit happier with the current state of the AWS Lambda ELB Logs Processor. I feel like it’s better structured and easier to reason about, mostly due to the introduction of promises instead of callbacks. In addition to the quality of the code itself, the improved log output makes it easier to see how the function is going on a day to day basis as it chews through thousands and thousands of lines every run.

In particular, the summary that gets output right at the end is extremely useful for getting a high level picture of the output of a single run without overloading the reader.

An example of the sort of output we see from a normal (successful) run is below:

START RequestId: c0e94f4d-051f-11e7-8fbb-2b8356f39769 Version: $LATEST
2017-03-09T23:25:59.785Z    c0e94f4d-051f-11e7-8fbb-2b8356f39769    Retrieving ELK log file from S3 bucket/key specified in the initiating event. Bucket: [REDACTED], Key: [REDACTED]
2017-03-09T23:27:31.344Z    c0e94f4d-051f-11e7-8fbb-2b8356f39769    File reader for ELB log file is closing because all lines have been read. Waiting for all promises (for sending parsed lines to logstash) to resolve
2017-03-09T23:27:34.924Z    c0e94f4d-051f-11e7-8fbb-2b8356f39769    Cleaning up the connection pool, which has [48/100] (current/max) connections
2017-03-09T23:27:34.967Z    c0e94f4d-051f-11e7-8fbb-2b8356f39769    All processing complete. Summary follows
2017-03-09T23:27:35.003Z    c0e94f4d-051f-11e7-8fbb-2b8356f39769
{
    "lines": {
        "encountered": 76464,
        "parsed": 76464,
        "sent": 76464
    },
    "connections": {
        "created": 48,
        "destroyed": 48
    },
    "failures": {
        "parsing": {
            "total": 0,
            "lastFew": []
        },
        "sending": {
            "total": 0,
            "lastFew": []
        }
    }
}
END RequestId: c0e94f4d-051f-11e7-8fbb-2b8356f39769
REPORT RequestId: c0e94f4d-051f-11e7-8fbb-2b8356f39769    Duration: 95227.56 ms    Billed Duration: 95300 ms Memory Size: 192 MB    Max Memory Used: 165 MB

There are still many improvements to be made (especially regarding the structure of the repo and its complete lack of automated tests), but it’s definitely more betterer than it was.


As I mentioned briefly last week, our AWS Lambda ELB Logs Processor did not quite work when we pushed it to production. This was a shame, because we were pretty sure we got it right this time. Ah well, every failure is a learning experience.

In this post I’m going to briefly talk about how we deployed the new logs processor, how we identified the issue (and what it was) and finally how we fixed it.

Colour Blind

In order to maximise our ability to adapt quickly, we try to deploy as frequently as possible, especially when it comes to services that we fully control (like websites and APIs). The most important thing to get right when you’re deploying frequently is the ability to do so without disturbing the end-user. Historically, deployments dodged this problem by simply happening during periods of low use, which for us means Sundays or late at night. I certainly don’t want to deal with that though, and I don’t want to delegate it to some poor sap, so instead we just make sure our change sets are small and easy to reason about, and our deployments happen in such a way that the service is never fully down while the deployment is occurring (easily accomplished in AWS with multiple deployment locations combined with rolling deployments).

For bigger changes though, we’ve started using blue/green deployments. Essentially this means having two completely separate environments active at the one time, with some top level mechanism for directing traffic to the appropriate one as necessary. For us this is a top level URL (like service.thing.com) which acts as a DNS alias for an environment specific URL (like prod-green-service.thing.com). We then use Route53 and Weighted DNS to direct traffic as we so desire.
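
As a rough illustration of the mechanism, shifting traffic between the two environments comes down to adjusting the weights on a pair of records. Below is a hedged sketch of what that change looks like via the AWS SDK for Javascript; the hosted zone id, record names and weight values are all invented for the example:

'use strict';

let aws = require('aws-sdk');
let route53 = new aws.Route53();

// Upsert a weighted CNAME pointing the top level URL at the blue environment.
// A matching record for green with a different weight controls the split;
// traffic is distributed according to the ratio of the two weights.
route53.changeResourceRecordSets({
    HostedZoneId: 'ZEXAMPLE12345', // hypothetical hosted zone
    ChangeBatch: {
        Changes: [{
            Action: 'UPSERT',
            ResourceRecordSet: {
                Name: 'service.thing.com',
                Type: 'CNAME',
                SetIdentifier: 'blue',
                Weight: 10,
                TTL: 60,
                ResourceRecords: [{ Value: 'prod-blue-service.thing.com' }]
            }
        }]
    }
}, function(error, data) {
    if (error) { console.log('DNS change failed', error); }
});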

For websites, blue/green deployments are trivial, assuming the backend services still work the same (or have already been deployed to be compatible with both versions). For those titular backend services though, blue/green deployments can be challenging, especially when you have to take into account data persistence and continuity of service.

When it comes to whatever persistence layer is necessary, our newer services usually feature data storage in a separately maintained environment (like RDS instances, S3 buckets and so on), specifically to help us do blue/green deployments on parts of the architecture that can freely change without causing any issues with data. Some of our earlier services did not do this, and as such are problematic when it comes to blue/green deployments. In those cases we usually test the new system by using traffic replication and then resort to traditional downtime when it comes time to do the actual deployment.

Blue/green deployments have proved to be particularly useful for us because of the way we handle our environments as code.

To tie this all back in with the ELB logs processor, we used our environment package to create a blue environment for the service with the ELB logs processor in it (because our currently active one was green). Once the environment was up and running, we used weighted DNS inside Route53 to shunt a small amount of traffic to the blue environment to see if the ELB logs were being processed into our ELK stack correctly.

And they were, and the victory dance was done and everyone lived happily ever after.

Oh Yeah, Scale

Well, not quite.

The small amount of test traffic worked perfectly. All of the traffic was being represented inside our ELK stack as expected. Continuing on with the traditional blue/green deployment, we increased the amount of traffic hitting the new blue environment by increasing its weight in Route53.

Once we got to around 10% of the total traffic, things started to fall down. The new blue environment was handling the traffic perfectly well, but we were not seeing the expected number of log events from the ELB inside the ELK stack.

Looking at the CloudWatch logs for the Lambda function, it looked like it was simply taking longer to run than the default timeout provided (which is 3 seconds), which was a silly oversight on our part. Considering it was trying to process a few megabytes of raw log data, it’s not surprising that it wasn’t getting through the entire thing.

AWS Lambda is charged by calculating the actual execution time with the resources that were allocated for the function, so the timeout basically represents the maximum amount of money you will be charged for each function execution. If you have an unbounded number of function executions (i.e. they occur on some unreliable trigger), then this can be very useful to limit your potential costs. For our usage, we know that the ELB logs are generated approximately once every 5 minutes, so we’re pretty safe to set the timeout to the maximum (300 seconds) to give the function as much time as possible to process the log file.
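
For reference, bumping the timeout is a one line configuration change. A hedged sketch via the AWS SDK for Javascript is below (the function name is made up; the equivalent change can also be made through the Timeout property on the Lambda function definition in a CloudFormation template):

'use strict';

let aws = require('aws-sdk');
let lambda = new aws.Lambda();

// Raise the timeout from the 3 second default to the 300 second maximum,
// giving the function enough time to chew through an entire ELB log file.
lambda.updateFunctionConfiguration({
    FunctionName: 'elb-logs-processor', // hypothetical function name
    Timeout: 300
}, function(error, data) {
    if (error) { console.log('failed to update function configuration', error); }
});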

With that small change in place more of the log file was being processed, but it still wasn’t processing the entire thing. The good news was that it was no longer just timing out and terminating itself, but the bad news was that it was now just plain old crashing after processing some of the file.

No It’s Fine, I Don’t Need Those Connections Again

Javascript Lambda functions automatically write their output to CloudWatch, which is really useful from a debugging and tracing point of view. I mean, I don’t know how it would work otherwise (because you literally cannot access the underlying operating system they run on), but it’s still nice that it just works out of the box.

In our case, the error was as follows:

Error: connect EMFILE {ip}:{port} - Local (undefined:undefined)
at Object.exports._errnoException (util.js:870:11)
at exports._exceptionWithHostPort (util.js:893:20)
at connect (net.js:843:14)
at net.js:985:7
at GetAddrInfoReqWrap.asyncCallback [as callback] (dns.js:63:16)
at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:82:10)

A small amount of digging showed that EMFILE means the process has exhausted its available file descriptors, so new TCP connections to the specified IP and port simply cannot be established.

If you look back at the code that actually processes the ELB log file, the only place where a TCP connection is being made is after a line in the file has been processed and transformed into a JSON payload, ready to be pushed to our ELK stack via Logstash. Given that the error only manifested after part of the file had already been processed successfully, it looked like the issue was one of resource exhaustion.

The obvious candidate was that the Lambda function was simply trying to open too many TCP connections at once. This made sense based on my understanding of Node.js at the time, so we implemented a connection pooling mechanism to prevent it from occurring (i.e. instead of simply trying to establish a connection, it would try to draw one from a global pool with a limit and if one was not available, wait for a few moments until it was).

Because each connection was only required for a few moments, the solution would essentially throttle the processing to whatever limit we impose, hopefully dodging the perceived problem with too many parallel connections.

function getConnection(callback) {
    if (connectionPool.length < connectionCountLimit) {
        console.log('creating new connection');
        const newSocket = net.createConnection(_logPort, _logHost);
        connectionPool.push(newSocket);
        return callback(newSocket);
    }

    const activeConnections = connectionPool.filter(function (socket) {
        return !socket.destroyed;
    });
    if (activeConnections.length != connectionCountLimit) {
        connectionPool = activeConnections;
    }

    setTimeout(function () {
        getConnection(callback);
    }, waitForConnectionDuration);
}

function postToLogstash(connection) {
    return function (entry) {
        console.log("INFO: Posting event to logstash... ");
        var message = JSON.stringify(entry) + "\n";
        message = message.replace("Timestamp", "@timestamp");
        connection.write(message, null, function () {
            console.log("INFO: Posting to logstash...done");
            connection.end();
        });
    }
}

Conclusion

I’d love to say that after we implemented the simple connection pooling, everything worked just fine and dandy and the victory dance was enjoyed by all.

And for once, I can!

With the simple connection pooling we implemented (which had a maximum connection count of like 50 in the first iteration), we managed to process an entire ELB log file from S3 without getting the error we were getting before. We still need to do some more investigation around whether or not we’re actually getting all of the messages we expect to get, but it’s looking good.

I’m still not entirely sure how the issue occurred though. Originally, because of my limited knowledge of Node.js, I thought that it was creating connections in parallel. It turns out, Node.js is not actually multi-threaded at all, unlike some other, real programming languages (cough, c#, cough), so it couldn’t possibly have been opening a bunch of connections all at the same time as I pictured.

What it might have been doing, however, is opening up a lot of connections very quickly. Node is really good at making use of the typical dead time in an application, like when executing HTTP requests and waiting for results, or when opening a file and waiting for IO to return. It’s possible that each connection being opened was ceding control of the main execution pipeline for just long enough for another connection to be opened, and so on and so forth until the process ran out of resources (file descriptors, in this case).
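
A contrived sketch of that failure mode is below (not the actual processor code, and the host/port are invented):

const net = require('net');

// Each call returns immediately; the actual connect happens later, driven by
// the event loop. Nothing here blocks, so tens of thousands of sockets can be
// pending at once, which is when errors like EMFILE start to appear.
for (let i = 0; i < 60000; i++) {
    const socket = net.createConnection(5000, 'localhost');
    socket.on('error', (error) => console.log('socket error', error.code));
    socket.on('connect', () => socket.end());
}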

Another possibility is that there was actually a connection leak, and the simple connection pool alleviated it by reusing the connection objects instead of always creating new ones.

Regardless, it processes through large ELB log files without breaking, which is good enough for this post.


Now that I am feeling less like dying horribly from a terrible plague, it’s time to continue talking about processing ELB logs into an ELK stack via AWS Lambda.

Last time I talked about our current situation, our motivations for wanting to switch to something better and the general plan moving forward.

To summarise the plan, the new log processor system needs 3 main parts:

  • The Lambda function code, which is written in Javascript, and would need to parse an ELB log file one line at a time, break it up, convert it into a JSON structure and then push it to a Logstash TCP endpoint to be written into Elasticsearch
  • The configuration/creation of the Lambda function, which would need to be done via CloudFormation as part of our normal environment setup (i.e. spin up an environment, it comes with the Lambda function that will process any ELB logs encountered)
  • The deployment of the code to the Lambda function via Octopus (because that’s how we roll)

I’m basically going to do a short blog post on each of those pieces of work, and maybe one at the end to tie it all together.

With that in mind, let’s talk Javascript.

Unforeseen Consequences

When you’re writing Lambda functions, Javascript (via Node.js) is probably your best bet. Sure you can run Java or Python (and maybe one day C# using .NET Core), but Javascript is almost certainly going to be the easiest. It’s what we chose when we put together the faster S3 clone prototype, and while the fundamentally asynchronous nature of Node.js took some getting used to, it worked well enough.

When it comes to Javascript, I don’t personally write a lot of it. If I’m writing a server side component, I’m probably going to pick C# as my language of choice (with all its fancy features like “a compiler” and “type safety”) and I don’t find myself writing things like websites or small Javascript applications very often, if at all. My team definitely writes websites though, and we use React.js to do it, so its not like Javascript is an entirely foreign concept.

For the purposes of reading in and parsing an ELB log file via a Lambda function, we didn’t need a particularly complex piece of Javascript. Something that reads the specified file from S3 after the Lambda function triggers, something to process the contents of that file line by line, something to parse and format those lines in a way that a Logstash input will accept, and something to push that JSON payload to the Logstash listener over raw TCP.

Without further ado, I give you the completed script:

'use strict';

let aws = require('aws-sdk');
let s3 = new aws.S3({ apiVersion: '2006-03-01' });
let readline = require('readline');
let net = require('net');

const _type = 'logs';
const _sourceModuleName = 'ELB';
const _logHost = '#{LogHost}'
const _logPort = #{LogPort}
const _environment = '#{Octopus.Environment.Name}'
const _component = '#{Component}'
const _application = '#{Application}'

function postToLogstash(entry){
    console.log("INFO: Posting event to logstash...");

    var socket = net.createConnection(_logPort, _logHost);
    var message = JSON.stringify(entry) + "\n";
    socket.write(message);
    socket.end();

    console.log("INFO: Posting to logstash...done");
}

exports.handler = (event, context, callback) => {
    console.log('INFO: Retrieving log from S3 bucket...');

    const bucket = event.Records[0].s3.bucket.name;
    const key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, ' '));
    const params = {
        Bucket: bucket,
        Key: key
    };

    const reader = readline.createInterface({
        input: s3.getObject(params).createReadStream()
    });

    const expectedColumns = 12;

    reader.on('line', function(line) {
        console.log("INFO: Parsing S3 line entry...");

        const columns = line.split(/ (?=(?:(?:[^"]*"){2})*[^"]*$)/);

        if(columns.length >= expectedColumns){
            var entry = {
                EventReceivedTime: columns[0],
                LoadBalancerName: columns[1],
                PublicIpAndPort: columns[2],
                InternalIpAndPort: columns[3],
                TimeToForwardRequest: parseFloat(columns[4]) * 1000,
                TimeTaken: parseFloat(columns[5]) * 1000,
                TimeToForwardResponse: parseFloat(columns[6]) * 1000,
                Status: columns[7],
                BackendStatus: columns[8],
                BytesUploadedFromClient: parseInt(columns[9]),
                BytesDownloadedByClient: parseInt(columns[10]),
                FullRequest: columns[11],
                Component: _component,
                SourceModuleName: _sourceModuleName,
                Environment: _environment,
                Application: _application,
                Type: _type
            };
            postToLogstash(entry);
        } else {
            console.log("ERROR: Invalid record length was expecting " + expectedColumns.length + " but found " + columns.length);
            console.log('ERROR: -------');
            console.log(line);
            console.log('ERROR: -------');
        }
    });
};

Nothing too fancy.

In the interest of full disclosure, I did not write the script above. It was written by a few guys from my team initially as a proof of concept, then improved/hardened as a more controlled piece of work.

Office Complex

You might notice some strange variable names at the top of the script (i.e. #{Octopus.Environment.Name}).

We use Octopus Deploy for all of our deployments, and this includes the deployment of Lambda functions (which we package via Nuget and then deploy via the AWS Powershell Cmdlets/CLI inside Octopus). The #{NAME} notation is a way for Octopus to substitute variable values into files during deployment. This substitution is very useful, and can be scoped via a variety of things (like Environment, Machine, Role, etc), so by the time the script actually gets into AWS those variables are filled in with actual values.

Other than our use of Octopus variables, other things to note in this piece of Javascript are:

  • At no point does the function specify which credentials are being used to access the S3 bucket containing the ELB log files. This is because the Lambda function has been configured with an IAM role allowing it to access the required resources. The AWS Javascript library has built in support for running inside supported AWS contexts (like EC2 instances and Lambda functions), such that it can use the role applied to the context to get appropriate temporary credentials. This is a much better approach than using specific credentials in the script, which can result in unintended consequences if you aren’t eternally vigilant.
  • You need to make sure that your Lambda function is configured with an appropriate security group that allows it to use the expected outbound channel (i.e. make sure it can get to the Logstash host you’re trying to connect to). This was somewhat of an issue for us as our ELK stack is hosted inside another AWS account (our OPs account), so we had to make sure that all of the appropriate VPC peering was configured before it would work correctly. It can be a bit of a pain to set up the smallest possible surface area, but don’t be tempted to just configure the Lambda function to be able to do everything and go anywhere. Smaller surface area = more security, and the policy of least privilege is one you should always follow.
  • It’s important to ensure that if you’re doing TCP communication in a Lambda function, you close your socket when you’re done with it, or the Lambda function might not exit (see the sketch after this list). It might, but it also might not, and it can really throw you for a loop if you don’t know why.
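
Since that last point is easy to get wrong, here’s a hedged variant of the postToLogstash function from the script above, closing the socket only after the write has been handed off (it uses the same net/_logPort/_logHost names as the script; the rearrangement is just for illustration):

function postToLogstash(entry) {
    console.log("INFO: Posting event to logstash...");

    var socket = net.createConnection(_logPort, _logHost);
    var message = JSON.stringify(entry) + "\n";

    socket.write(message, null, function() {
        // An open socket keeps the event loop busy, and a Lambda function that
        // waits for the event loop to drain may hang around until it times out,
        // so close the connection as soon as the write has been flushed.
        socket.end();
        console.log("INFO: Posting to logstash...done");
    });
}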

To Be Continued

That’s it for this week. The Javascript I’ve included above is pretty generic (apart from the specific set of fields that we like to have in our log events) and will successfully process an ELB log file from S3 to a Logstash instance listening on a port of your choosing (probably 6379) when used in a Lambda function. Feel free to reuse it for your own purposes.

Next week I’ll continue this series of posts with information about how we use CloudFormation to setup our Lambda function as part of one of our environment definitions.

CloudFormation and Lambda aren’t the best of friends yet, so there is some interesting stuff that you have to be aware of.


I did a strange thing a few weeks ago. Something I honestly believed that I would never have to do.

I populated a Single Page Javascript Application with data extracted from a local database via a C# application installed using ClickOnce.

But why?

I maintain a small desktop application that acts as a companion to a service primarily offered through a website. Why do they need a companion application? Well, they participate in the Australian medical industry, and all the big Practice Management Software (PMS) Systems available are desktop applications, so if they want to integrate with anything, they need to have a presence on the local machine.

So that’s where I come in.

The pitch was that they need to offer a new feature (a questionnaire) that interacted with installed PMS, but wanted to implement the feature using web technologies, rather than putting it together using any of the desktop frameworks available (like WPF). The reasoning behind this desire was:

  1. The majority of the existing team had mostly web development experience (I’m pretty much the only one that works on the companion application), so building the whole thing using WPF or something similar wouldn’t be making the best use of available resources.
  2. They wanted to be able to reuse the implementation in other places, and not just have it available in the companion application.
  3. They wanted to be able to reskin/improve the feature without having to deploy a new companion application.

Completely fair and understandable points in favour of the web technology approach.

Turtles All The Way Down

When it comes to running a website inside a C# WPF desktop application, there are a few options available. You can try to use the native WebBrowser control that comes with .NET 3.5 SP1, but that seems to rely on the version of IE installed on the computer and is fraught with all sorts of peril.

You’re much better off going with a dedicated browser control library, of which the Chromium Embedded Framework is probably your best bet. For .NET that means CEFSharp (and for us, that means its WPF browser control, CEFSharp.WPF).

Installing CEFSharp and getting it to work in a development environment is pretty straightforward; just install the Nuget packages that you need and it’s all probably going to start working.

While functional, it’s not the greatest experience though.

  • If you’re using the CEFSharp WPF control, don’t try to use the designer. It will crash, slowly and painfully. It’s best to just do the entire thing in code behind and leave the browser control out of the XAML entirely.
  • The underlying C++ library (libcef.dll) is monstrous, weighing in at around 48MB. For us, this was about 4 times the size of the entire application, so it’s a bit of a change.
  • If you’re using ClickOnce to deploy your application, you’re in for a bit of an adventure.

The last point is the most relevant to this post.

Because of the way ClickOnce works, a good chunk of the dependencies that are required by CEFSharp will not be recognised as required files, and thus will not be included in the deployment package when it’s built. If you weren’t using ClickOnce, everything would be copied correctly into the output folder by build events and MSBuild targets, and you could then package up your application in a sane way, but once you pick a deployment technology, you’re pretty much stuck with it.

You’ll find a lot of guidance on the internet about how to make ClickOnce work with CEFSharp, but I had the most luck with the following:

  1. Add Nuget references to the packages that you require. For me this was just CEFSharp.WPF, specifically version 49.0.0 because my application is still tied to .NET 4.0. This will add references to CEFSharp.Common and cef.redist.x86 (and x64).
  2. Edit the csproj file to remove any imports of msbuild targets added by CEFSharp. This is how CEFSharp would normally copy itself to the output folder, but it doesn’t work properly with ClickOnce. Also remove any props imports, because this is how CEFSharp does library references.
  3. Settle on a processor architecture (i.e. x86 or x64). CEFSharp only works with one or the other, so you might as well remove the one you don’t need.
  4. Manually add references to the .NET DLLs (CefSharp.WPF.dll, CefSharp.Core.dll and CefSharp.dll). You can find these DLLs inside the appropriate folder in your packages directory.
  5. This last step deals entirely with the dependencies required to run CEFSharp and making sure they are present in the ClickOnce deployment. Manually edit the csproj file to include the following snippet:

    <ItemGroup>    
        <Content Include="..\packages\cef.redist.x86.3.2623.1396\CEF\*.*">
          <Link>%(Filename)%(Extension)</Link>
          <CopyToOutputDirectory>Always</CopyToOutputDirectory>
          <Visible>false</Visible>
        </Content>
        <Content Include="..\packages\cef.redist.x86.3.2623.1396\CEF\x86\*.*">
          <Link>%(Filename)%(Extension)</Link>
          <CopyToOutputDirectory>Always</CopyToOutputDirectory>
          <Visible>false</Visible>
        </Content>
        <Content Include="..\packages\CefSharp.Common.49.0.0\CefSharp\x86\CefSharp.BrowserSubprocess.*">
          <Link>%(Filename)%(Extension)</Link>
          <CopyToOutputDirectory>Always</CopyToOutputDirectory>
          <Visible>false</Visible>
        </Content>
        <Content Include="lib\*.*">
          <Link>%(Filename)%(Extension)</Link>
          <CopyToOutputDirectory>Always</CopyToOutputDirectory>
          <Visible>false</Visible>
        </Content>
        <Content Include="$(SolutionDir)packages\cef.redist.x86.3.2623.1396\CEF\locales\*">
          <Link>locales\%(Filename)%(Extension)</Link>
          <CopyToOutputDirectory>Always</CopyToOutputDirectory>
          <Visible>false</Visible>
        </Content>
    </ItemGroup>
    This particularly brutal chunk of XML adds all of the dependencies required for CEFSharp to work properly as content files into the project, and then hides them, so you don’t have to see their ugly faces whenever you open your solution.

After taking those steps, I was able to successfully deploy the application with a working browser control, free from the delights of constant crashes.

Injected With A Poison

Of course, opening a browser to a particular webpage is nothing without the second part; giving that webpage access to some arbitrary data from the local context.

CEFSharp is pretty good in this respect to be honest. Before you navigate to the address you want to go to, you just have to call RegisterJsObject, supplying a name for the Javascript variable and a reference to your .NET object. The variable is then made available to the Javascript running on the page.

The proof of concept that I put together used a very simple object (a class with a few string properties), so I haven’t tested the limits of this approach, but I’m pretty sure you can do most of the basic things like arrays and properties that are objects themselves (i.e. Foo.Bar.Thing).

The interesting part is that the variable is made available to every page that is navigated to in the Browser from that point forward, so if there is a link on your page that goes somewhere else, it will get the same context as the previous page did.

In my case, my page was completely trivial, echoing back the injected variable if it existed.

<html>
<body>
<p id="content">
Script not run yet
</p>
<script>
if (typeof(injectedData) == 'undefined')
{
    document.getElementById("content").innerHTML = "no data specified";
}
else 
{
    document.getElementById("content").innerHTML = "name: " + injectedData.name + ", company: " + injectedData.company;
}
</script>
</body>
</html>

Summary

It is more than possible to have good integration between a website (well, one featuring Javascript at least) and an installed C# application, even if you happen to have the misfortune of using ClickOnce to deploy it. Really, all of the congratulations should go to the CEFSharp guys for creating an incredibly useful and flexible browser component using Chromium, even if it does feel monstrous and unholy.

Now, whether or not any of the above is a good idea is a different question. I had to make some annoying compromises to get the whole thing working, especially with ClickOnce, so it’s not exactly in the best place moving forward (for example, upgrading the version of CEFSharp is going to be pretty painful due to those hard references that were manually added into the csproj file). I made sure to document everything I did in the repository readme, so it’s not the end of the world, but it’s definitely going to cause pain in the future to some poor bastard.

This is probably how Dr Frankenstein felt, if he were a real person.

Disgusted, but a little bit proud as well.