
That thing that I complained about with Logstash and HTTP outputs a few weeks ago? Turns out it was actually a bug after all!

Not only that, but it was totally a race condition relating to the way the success/failure handlers of the Manticore HTTP library were being used, which was what I guessed, so go me. I guess even a broken clock is right twice a day.

Long story short, don’t fire off an asynchronous request until you’ve got your asynchronous handlers in place to handle the result.
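To make that concrete, here's a minimal C# sketch using an entirely hypothetical FakeHttpClient (not Manticore or the Logstash plugin), where a completion event raised before anyone has subscribed is simply lost:

using System;
using System.Threading.Tasks;

// Entirely hypothetical client; it exists only to demonstrate the failure mode
// of firing a request before the handlers exist.
public class FakeHttpClient
{
    public event Action<string> OnSuccess;

    // Starts the "request" on a background thread and raises OnSuccess when it finishes.
    public void Send()
    {
        Task.Run(() => OnSuccess?.Invoke("200 OK"));
    }
}

public static class Program
{
    public static void Main()
    {
        // Racy: the request can complete before the handler below is attached,
        // so the completion is silently lost and anything waiting on it hangs.
        var racy = new FakeHttpClient();
        racy.Send();
        racy.OnSuccess += response => Console.WriteLine("Racy: " + response);

        // Safe: handlers first, then fire the request.
        var safe = new FakeHttpClient();
        safe.OnSuccess += response => Console.WriteLine("Safe: " + response);
        safe.Send();

        Console.ReadLine(); // keep the process alive long enough for the background work
    }
}

The TPL itself is forgiving about this sort of thing (a continuation attached to an already completed Task still runs), but event and callback based APIs frequently aren't.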

In their typical awesome style, the maintainers of the HTTP output plugin have already fixed the issue and released a new version (4.3.2) for the masses to consume.

Unfortunately, I don’t think this new version has been rolled into an official Logstash release yet, so if I want to use it I’ll have to make a custom redistributable package to use in Octopus Deploy. Not a huge issue, just a small pain. I’m sure it will be rolled into the official package soon.

Other than that piece of news, I don’t really have anything else, so I’m going to take this opportunity to muse on some of my experiences with asynchronous code.

Hellish Code

Depending on how it’s written, asynchronous code can be hard to reason about, and even harder to test.

Vanilla Node is a good example of bad asynchronous code. It’s incredibly easy to get into callback hell, where this function will call into this function, which has a handler that calls this function, which does this thing and so on. It’s even worse when you realise that the easiest way to accumulate state between so many asynchronous functions is to close over variables that are in scope when the function is created, which leaves you in a position where it’s hard to break things apart into nice self contained, easily testable components.

*shudder*

Encapsulating asynchronous operations into some sort of structure is a good counter to callback hell, and once we started using Promises in the ELB logs processor Javascript, I found the quality of the code increased rapidly. It was still Javascript, but at least you could reason about it.

Continuing on with a language that I actually like, C#, you’ve got a similar concept at play with tasks and the TPL (Task Parallel Library).

Before the TPL, you would probably just use raw threads to offload work (or maybe dedicated asynchronous functions or classes like BackgroundWorker), and you could easily get into the same sort of problems that you get into in vanilla Node. I mean, you had the benefit of working with a nice, structured language, but it was still difficult and prone to error.

The TPL introduced encapsulations around asynchronous operations called Tasks, which helped write asynchronous code that actually made sense.

It did something else that was incredibly helpful as well.

It exposed the Task schedulers.

All The Control

Encapsulating operations as Tasks is all well and good, and in most cases you’re probably looking to run those operations asynchronously on another thread (or a pool of threads perhaps). I mean, that’s literally exactly what the default TaskScheduler does in C# (the ThreadPoolTaskScheduler).

Of course, that makes testing hard, because now your main thread (which the test is running on) has to have some logic in it to wait for outstanding operations to complete before you can assert things. This gets even more difficult when working with ViewModels, which require that certain updates be performed on the main/UI thread in order to work properly.

This is particularly important for WPF and PropertyChanged events, which only work properly if you raise them on the UI thread.

The good news is that you can write your own TaskScheduler and then inject it for testing purposes, assuming you’ve written your code in such a way that your Tasks are created with a specified scheduler.

I’ve written a TaskScheduler that is not threaded at all (i.e. it just executes the task inline), one that accumulates operations to be executed at will (good for testing transitions) and another that allows you to run tasks on the WPF Dispatcher construct (i.e. the main/UI thread in WPF), and I use all of them frequently while developing applications that rely on Tasks to do things.

They aren’t even all that hard to write really, as the following NonThreadedTaskScheduler shows:

using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Runs every queued task inline on the calling thread, which makes asynchronous code deterministic in tests.
public sealed class NonThreadedTaskScheduler : TaskScheduler
{
    protected override void QueueTask(Task task)
    {
        TryExecuteTask(task);
    }

    protected override bool TryExecuteTaskInline(Task task, bool taskWasPreviouslyQueued)
    {
        return TryExecuteTask(task);
    }

    protected override IEnumerable<Task> GetScheduledTasks()
    {
        return Enumerable.Empty<Task>();
    }

    public override int MaximumConcurrencyLevel
    {
        get { return 1; }
    }
}
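The other half of the trick is injection. Here’s a minimal sketch of how a scheduler like that might be used, assuming a hypothetical Loader class that creates its Tasks against whatever scheduler it is given:

using System.Threading;
using System.Threading.Tasks;

// Hypothetical class under test; the important part is that the scheduler is injected.
public class Loader
{
    private readonly TaskScheduler _scheduler;

    public Loader(TaskScheduler scheduler)
    {
        _scheduler = scheduler;
    }

    public bool Loaded { get; private set; }

    public Task LoadAsync()
    {
        // The work is queued to the supplied scheduler, not the default ThreadPool one.
        return Task.Factory.StartNew(() => { Loaded = true; }, CancellationToken.None, TaskCreationOptions.None, _scheduler);
    }
}

// In a test, the non-threaded scheduler makes the whole thing synchronous:
//   var loader = new Loader(new NonThreadedTaskScheduler());
//   loader.LoadAsync();
//   Assert.True(loader.Loaded); // no sleeping or polling for completion required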

Without the ability to change how the tasks are executed, none of that would have been possible, and while they would still be useful, they would be much harder to actually use and test.

Coincidentally, this is one of the reasons why I’m a little bit leery of the async/await keywords in C#, because I don’t think they expose any of the scheduling functionality, instead assuming that you are happy with the magical way that they work. Honestly though, my knowledge of async/await is fairly limited, so that might not be entirely correct.

Conclusion

To conclude, asynchronous code is hard and its bugs are subtle and hard to detect.

It’s also one of the easiest and highest impact ways to wring more performance out of a piece of software. If you’re doing a lot of things that naturally involve waiting (disk writes, network calls, etc), you stand to gain an incredible amount of time by just doing other things during those times when you’re waiting on something external and out of your control.

Double edged sword really.

Like most software.


Alerting on potential problems before they become real problems is something that we have been historically bad at. We’ve got a wealth of information available both through AWS and in our ELK Stack, but we’ve never really put together anything to use that information to notify us when something interesting happens.

In an effort to alleviate that, we’ve recently started using CloudWatch alarms to do some basic alerting. I’m not exactly sure why I shied away from them to begin with to be honest, as they are easy to set up and maintain. It might have been that the old way that we managed our environments didn’t lend itself to easy, non-destructive updates, making tweaking and creating alarms a relatively onerous process.

Regardless of my initial hesitation, when I was polishing the ELK stack environment recently I made sure to include some CloudWatch alarms to let us know if certain pieces fell over.

The alerting setup isn’t anything special:

  • Dedicated SNS topics for each environment (i.e. CI, Staging, Production, Scratch)
  • Each environment has two topics, one for alarms and one for notifications
  • The entire team is signed up to emails from both Staging and Production, but CI/Scratch are optional

Email topic subscriptions are enough for us for now, but we have plans to use the topics to trigger messages in HipChat as well.
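For what it’s worth, an SNS topic with an email subscription is a tiny amount of CloudFormation. This is a sketch only (the logical name and email address are placeholders, and in our real stacks the topic ARNs get passed into other templates as parameters, like the AlarmsTopicARN referenced in the alarm below):

"AlarmsTopic" : {
  "Type" : "AWS::SNS::Topic",
  "Properties" : {
    "Subscription" : [
      { "Protocol" : "email", "Endpoint" : "team@example.com" }
    ]
  }
}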

I started with some basic alarms, like unhealthy instances > 0 for Elasticsearch/Logstash. Both of those components expose HTTP endpoints that can easily be pinged, so if they stop responding to traffic for any decent amount of time, we want to be notified. Other than some tweaks to the Logstash health check (I tuned it too tightly initially, so it was going off whenever a HTTP request took longer than 5 seconds), these alarms have served us well in picking up things like Logstash/Elasticsearch crashing or not starting properly.

As far as Elasticsearch is concerned, this is pretty much enough.

With Logstash being Logstash though, more work needed to be done.

Transport Failure

As a result of the issues I had with HTTP outputs in Logstash, we’re still using the Logstash TCP input/output combo.

The upside of this is that it works.

The downside is that sometimes the TCP input on the Broker side seems to crash and get into an unrecoverable state.

That wouldn’t be so terrible if it took Logstash with it, but it doesn’t. Instead, Logstash continues to accept HTTP requests, and just fails all of the TCP stuff. I’m not actually sure if the log events received via HTTP during this time are still processed through the pipeline correctly, but all of the incoming TCP traffic is definitely black holed.

As a result of the HTTP endpoint continuing to return 200 OK, the alarms I set up for unhealthy instances completely fail to pick up this particular issue.

In fact, because of the nature of TCP traffic through an ELB, and the relatively poor quality of TCP metrics, it can be very difficult to tell whether or not it’s working at a glance. Can’t use requests or latency, because they have no bearing on TCP traffic, and certainly can’t use status codes (obviously). Maybe network traffic, but that doesn’t seem like the best idea due to natural traffic fluctuations.

The only metric that I could find was “Backend Connection Errors”. This metric appears to be a measure of how many low level connection errors occurred between the ELB and the underlying EC2 instances, and seems like a good fit. Even better, when the Logstash TCP input falls over, it is this metric that actually changes, as all of the TCP traffic being forwarded through the ELB fails miserably.

One simple alarm initialization through CloudFormation later, and we were safe and secure in the knowledge that the next time the TCP input fell over, we wouldn’t find out about it 2 days later.

"BrokerLoadBalancerBackendConnectionErrorsAlarm": {
  "Type" : "AWS::CloudWatch::Alarm",
  "Properties" : {
    "AlarmName" : { "Fn::Join": [ "", [ "elk-broker-", { "Ref": "OctopusEnvironment" }, "-backend-errors" ] ] },
    "AlarmDescription" : "Alarm for when there is an increase in the backend connection errors on the Broker load balancer, typically indicating a problem with the Broker EC2 instances. Suggest restarting them",
    "AlarmActions" : [ { "Ref" : "AlarmsTopicARN" } ],
    "OKActions": [ { "Ref" : "AlarmsTopicARN" } ],
    "TreatMissingData": "notBreaching",
    "MetricName" : "BackendConnectionErrors",
    "Namespace" : "AWS/ELB",
    "Statistic" : "Maximum",
    "Period" : "60",
    "EvaluationPeriods" : "5",
    "Threshold" : "100",
    "ComparisonOperator" : "GreaterThanThreshold",
    "Dimensions" : [ {
      "Name" : "LoadBalancerName",
      "Value" : { "Ref" : "BrokerLoadBalancer" }
    } ]
  }
}

Mistakes Were Made

Of course, a few weeks later the TCP input crashed again and we didn’t find out for 2 days.

But why?

It turns out that the only statistic worth a damn for Backend Connection Errors when alerting is SUM, and I created the alarm on the MAXIMUM, assuming that it would act like requests and other metrics (which give you the maximum number of X that occurred during the time period). Graphing the maximum backend connection errors during a time period where the TCP input was broken gives a flat line at y = 1, which is definitely not greater than the threshold of 100 that I entered.

I switched the alarm to SUM and as far as I can see the next time the TCP input goes down, we should get a notification.
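In the template above, that switch is a single property change:

"Statistic" : "Sum"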

But I’ve been fooled before.

Conclusion

I’ll be honest, even though I did make a mistake with this particular CloudWatch alarm, I’m glad that we started using them.

Easy to implement (via our CloudFormation templates) and relatively easy to use, they provide an element of alerting on our infrastructure that was sorely missing. I doubt we will go about making thousands of alarms for all of the various things we want to be alerted on (like increases in 500s and latency problems), but we’ll definitely include a few alarms in each stack we create to yell at us when something simple goes wrong (like unhealthy instances).

To bring it back to the start, I think one of the reasons we’ve hesitated to use the CloudWatch alarms was because we were waiting for the all singing all dancing alerting solution based off the wealth of information in our log stack, but I don’t think that is going to happen in a hurry.

It’s been years already.


Almost a month ago, I made a bold claim that switching our Logstash instances to HTTP outputs for all log event traffic wouldn’t be all that difficult.

Oh god, how wrong I was.

In some weird display of prescience however, these were the exact words I used:

Conceptually, changing our Logstash instances to use a HTTP output instead of TCP should also be easy. Just need to change some configuration and deploy through Octopus. Keep in mind I haven’t actually done it yet, but it feels simple enough.

What Is Logstash?

Theoretically, I was entirely correct. It was as simple as changing the TCP output to HTTP. Same host, no fancy settings, no real complexity in the configuration.

For good measure, I upgraded the version of Logstash to 5.4.1 as well (from 5.2.0), which was a whole thing that I’m not going to get into right now. Suffice to say, I had a very recent version of Logstash installed when it all went to tatters.

We’ve got a pretty good deployment pipeline for distributing Logstash to our various API instances and whatnot, so after I made the changes, I checked in, builds kicked off, tests were run, deployments were made, and it was time to check whether or not CI was still working.

It was fine. For the first time, I could see the traffic caused by Logstash on the Broker using CloudWatch, which was nice.

Promote to Staging, same thing. Worked fine, traffic visible, all seemed well.

Promote to Prod, same thing. Worked fine, traffic visible, all is well.

And that’s the end of the story.

Baby Don’t Hurt Me

Ha ha ha ha ha ha no.

What I really should have done is let Staging run for a few days before promoting anything. I guess I was just a little too eager.

It wasn’t long after promoting to production that I noticed data disappearing from our dashboards and visualizations. At first glance, it all seemed to be isolated to one of our services in particular, so I started digging.

  • Log files were being appended to locally, so not the issue we encountered recently with our Nancy APIs. This problem had different symptoms anyway (no logs at all, instead of just no application logs).
  • Restarting the local instance of Logstash seemed to get a few events through into the ELK stack, so the communication channels were intact (i.e. traffic wasn’t being blocked at the network level), but only a few hundred.
  • Broker seemed to be healthy and handling traffic well, so it didn’t seem like a resource issue on the ELK stack side.

Based on the evidence, it looked like the problem was entirely the fault of the local Logstash instances, but if that was the case, why not more failures? At this point in the investigation, other Logstash instances appeared to be pushing their log events into the ELK stack without issue.

There didn’t even seem to be anything special about the logs going through the failing Logstash instances either. It was all pretty vanilla IIS, Performance Counter and API application logs. Nothing fancy.

Time to dig into Logstash itself.

The good news was that Logstash 5.4.1 features a really nice internal API, granting access to a bunch of information about the internal workings of the application.

Using a little bit of Powershell you can get statistics about the Logstash pipeline.

(Invoke-WebRequest http://localhost:9600/_node/stats/pipeline).Content

On one of the broken machines, our output looked something like this:

{ "host" : "a-hostname", "version" : "5.4.1", "http_address" : "127.0.0.1:9600", "id" : "a82b92de-d402-4b99-a569-a859c8e00d2b", "name" : "a-name", "pipeline" : { "events" : { "duration_in_millis" : 11293, "in" : 377, "filtered" : 375, "out" : 250, "queue_push_duration_in_millis" : 28027 }, "plugins" : { "inputs" : [ { "id" : "get-events-from-api-iis-logs", "events" : { "out" : 129, "queue_push_duration_in_millis" : 14075 }, "name" : "file" }, { "id" : "get-events-from-system-stats", "events" : { "out" : 248, "queue_push_duration_in_millis" : 13952 }, "name" : "file" }, { "id" : "get-events-from-api-application-logs", "events" : { "out" : 0, "queue_push_duration_in_millis" : 0 }, "name" : "file" } ], "filters" : [ { "id" : "break-iis-fields-down-via-csv", "events" : { "duration_in_millis" : 1135, "in" : 128, "out" : 128 }, "name" : "csv" }

# More filters here, cut for brevity

], "outputs" : [ { "id" : "send-events-to-elk-broker-over-http", "events" : { "duration_in_millis" : 5441, "in" : 375, "out" : 250 }, "name" : "http" } ] }, "reloads" : { "last_error" : null, "successes" : 0, "last_success_timestamp" : null, "last_failure_timestamp" : null, "failures" : 0 }, "queue" : { "type" : "memory" }, "id" : "main" } }

Super dense, but the interesting part is the HTTP output at the bottom.

See how it has 375 events in, but only 250 out?

It hangs like that forever.

Restarting Logstash causes it to start again and send a similar number of events, but eventually hang in exactly the same manner.

Don’t Hurt Me, No More

I’ll be honest, I tried a lot of things to get this to work, and I just couldn’t.

In fact, the problem actually got worse, because other Logstash instances started failing in the same way, and more and more data dropped off our dashboards and visualizations.

In the end, I had no choice but to roll back to the TCP output (which was flaky, but in a known way) and to log an issue on the Logstash Github describing my problem in the hope that someone would be able to look into it and help out.

Why not look myself?

Well I started, but Logstash is Ruby code running through Java (which is then running on Windows via Powershell) and getting into that stack from a development point of view looked pretty complicated and time-consuming.

I had a bit of a look at the Ruby code of the HTTP output and the only thing I could think of was a potential race condition in the construction of the HTTP requests.

# Create an async request
request = client.background.send(@http_method, url, :body => body, :headers => headers)
request.call # Actually invoke the request in the background

request.on_success do |response|
  begin
    # bunch of code dealing with success
  end
end

request.on_failure do |exception|
  begin 
    # bunch of code dealing with failures
  end
end

I don’t know how the HTTP library works (I think it’s Manticore?), but if it doesn’t support the registering (and correct execution) of event handlers AFTER a call has completed, then it’s possible for the request.call to finish and the success and failure handlers to never run.

Part of the success handler’s responsibilities is to inform the rest of the Logstash pipeline that the HTTP output is done with its events, so that sort of thing could easily lead to a deadlock.

Again, I don’t know Ruby at all, so that is complete conjecture on my part.
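To illustrate what I mean (and to be clear, this is my own sketch of the ordering, not the actual fix that went into 4.3.2), register both handlers before the request is allowed to start:

# Create the async request, but don't invoke it yet
request = client.background.send(@http_method, url, :body => body, :headers => headers)

request.on_success do |response|
  # bunch of code dealing with success
end

request.on_failure do |exception|
  # bunch of code dealing with failures
end

request.call # only now is the request actually fired, with both handlers in place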

Conclusion

This was another one of those issues that just came out of nowhere and stole more of my time than I would like to admit to. I would still like to switch all of our log event traffic to HTTP (for the same reasons I went through last time), but I have other things to do for now. At least it’s still working, even though the TCP input in Logstash sometimes just stops working. It doesn’t happen very often, and I have some monitoring around it, so it’s not the end of the world.

To be honest, a smarter man would have seen that it was not working with no clear solution, rolled back to the TCP output and moved on. Probably within a few hours.

I am not a smart man.

I am, however, ridiculously stubborn, so I wailed away at this problem until it stole my will to live, then I gave up.


A very quick post this week, because I’ve been busy rebuilding our ELB Logs Processor in .NET Core. I had some issues converting it to use HTTP instead of TCP for connecting to Logstash and I just got tired of dealing with Javascript.

I’m sure Javascript is a perfectly fine language capable of accomplishing many wondrous things. It’s not a language I will voluntarily choose to work with though, not when I can pick C# to accomplish the same thing.

On to the meat of this post though, which is a quick heads up for people who want to upgrade from Logstash 5.2.X to 5.4.0 (something I did recently for the Broker/Indexer layer inside our Log Aggregation Stack).

Make sure you configure a data directory and that that directory both exists and has appropriate permissions.

Queueing Is Very British

Logstash 5.4.0 marked the official release of the Persistent Queues feature (which had been in beta for a few versions). This is a pretty neat feature that allows you to skip the traditional queue/cache layer in your log aggregation stack. Basically, when enabled, it inserts a disk queue into your Logstash instance in between inputs and filters/outputs. It only works for inputs that have request/response models (so HTTP good, TCP bad), but it’s a pretty cool feature all round.
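For reference, turning it on is just a couple of logstash.yml settings. A minimal sketch (the path is illustrative; by default the queue lives under the data directory):

# logstash.yml - illustrative only; leaving queue.type at its default of "memory" keeps the feature off
queue.type: persisted
path.queue: /var/lib/logstash/queue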

I have plans to eventually use it to completely replace our Cache and Indexer layers in the log aggregation stack (a win for general complexity and number of moving parts), but when I upgraded to 5.4.0 I left it disabled because we already have Elasticache:Redis for that.

That didn’t stop it from causing problems though.

I Guess They Just Like Taking Orderly Turns

Upgrading the version of Logstash we use is relatively straightforward. We bake a known version of Logstash into a new AMI via Packer, update an environment parameter for the stack, kick off a build and let TeamCity/Octopus take care of the rest.

To actually bake the AMI, we just update the Packer template with the new information (in this case, the Logstash version that should be installed via yum) and then run it through TeamCity.

On the other side, in the environment itself, when we update the AMI in use, CloudFormation will slowly replace all of the EC2 instances inside the Auto Scaling Group with new ones, waiting for each one to come online before continuing. We use Octopus Deploy Triggers to automate the deployment of software to those machines when they come online.

This is where things started to fall down with Logstash 5.4.0.

The Octopus deployment of the Logstash Configuration was failing. Specifically, Logstash would simply never come online with the AMI that used 5.4.0 and the configuration that we were using successfully for 5.2.0.

The Logstash log files were full of errors like this:

[2017-05-24T04:42:02,021][FATAL][logstash.runner          ] An unexpected error occurred! 
{
    :error => #<ArgumentError: Path "/usr/share/logstash/data/queue" must be a writable directory. It is not writable.>,
    :backtrace => [
        "/usr/share/logstash/logstash-core/lib/logstash/settings.rb:433:in `validate'", 
        "/usr/share/logstash/logstash-core/lib/logstash/settings.rb:216:in `validate_value'", 
        "/usr/share/logstash/logstash-core/lib/logstash/settings.rb:132:in `validate_all'", 
        "org/jruby/RubyHash.java:1342:in `each'", 
        "/usr/share/logstash/logstash-core/lib/logstash/settings.rb:131:in `validate_all'", 
        "/usr/share/logstash/logstash-core/lib/logstash/runner.rb:217:in `execute'", 
        "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/clamp-0.6.5/lib/clamp/command.rb:67:in `run'", 
        "/usr/share/logstash/logstash-core/lib/logstash/runner.rb:185:in `run'", 
        "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/clamp-0.6.5/lib/clamp/command.rb:132:in `run'", 
        "/usr/share/logstash/lib/bootstrap/environment.rb:71:in `(root)'"
    ]
}

A bit weird considering I hadn’t changed anything in our config, but it makes sense that maybe Logstash itself can’t write to the directory it was installed into by yum, and the new version now needs to do just that.

Moving the data directory was simple enough. Add path.data to the logstash.yml inside our configuration package, making sure that the data directory exists and that the Logstash user/group has ownership and full control.

I still got the same error though, except the directory was different (it was the one I specified).

I Mean Who Doesn’t

I fought with this problem for a few hours to be honest. Trying various permutations of permissions, ACLs, ownership, groups, users, etc.

In the end, I just created the queue directory ahead of time (as part of the config deployment) and set the ownership of the data directory recursively to the Logstash user/group.
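In shell terms, the workaround amounted to something like this (the path and user/group are illustrative, not our exact values):

# Pre-create the queue directory and give Logstash ownership of the whole data directory
mkdir -p /var/lib/logstash/queue
chown -R logstash:logstash /var/lib/logstash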

This was enough to make Logstash stop complaining about the feature I didn’t want to use and get on with its life.

I still don’t understand what happened though, so I logged an issue in the Logstash repo in Github. Maybe someone will explain it one day. Weirdly it looks like Logstash created a directory that it was not allowed to access (the /queue directory under the specified data directory), which leads me towards something being wrong with my configuration (like ownership or something like that), but I couldn’t find anything that would point to that.

Conclusion

This one really came out of left field. I didn’t expect the upgrade to 5.4.0 to be completely painless (rarely is a software upgrade painless), but I didn’t expect to struggle with an issue caused by a feature that I didn’t even want to use.

What’s even weirder about the whole thing is Persistent Queues were available in the version of Logstash that I was upgrading from (5.2.0), at least in beta form, and I had no issues whatsoever.

Don’t get me wrong, Logstash is an amazing product, but it can also be incredibly frustrating.


Now that I am feeling less like dying horribly from a terrible plague, it’s time to continue to talk about processing ELB logs into an ELK stack via AWS Lambda.

Last time I talked about our current situation, our motivations for wanting to switch to something better and the general plan moving forward.

To summarise the plan, the new log processor system needs 3 main parts:

  • The Lambda function code, which is written in Javascript, and would need to parse an ELB log file one line at a time, break it up, convert it into a JSON structure and then push it to a Logstash TCP endpoint to be written into Elasticsearch
  • The configuration/creation of the Lambda function, which would need to be done via CloudFormation as part of our normal environment setup (i.e. spin up an environment, it comes with the Lambda function that will process any ELB logs encountered)
  • The deployment of the code to the Lambda function via Octopus (because that’s how we roll)

I’m basically going to do a short blog post on each of those pieces of work, and maybe one at the end to tie it all together.

With that in mind, lets talk Javascript.

Unforeseen Consequences

When you’re writing Lambda functions, Javascript (via Node.js) is probably your best bet. Sure you can run Java or Python (and maybe one day C# using .NET Core), but Javascript is almost certainly going to be the easiest. It’s what we chose when we put together the faster S3 clone prototype, and while the fundamentally asynchronous nature of Node.js took some getting used to, it worked well enough.

When it comes to Javascript, I don’t personally write a lot of it. If I’m writing a server side component, I’m probably going to pick C# as my language of choice (with all its fancy features like “a compiler” and “type safety”) and I don’t find myself writing things like websites or small Javascript applications very often, if at all. My team definitely writes websites though, and we use React.js to do it, so it’s not like Javascript is an entirely foreign concept.

For the purposes of reading in and parsing an ELB log file via a Lambda function, we didn’t need a particularly complex piece of Javascript. Something that reads the specified file from S3 after the Lambda function triggers, something to process the contents of that file line by line, something to parse and format those lines in a way that a Logstash input will accept, and something to push that JSON payload to the Logstash listener over raw TCP.

Without further ado, I give you the completed script:

'use strict';

let aws = require('aws-sdk');
let s3 = new aws.S3({ apiVersion: '2006-03-01' });
let readline = require('readline');
let net = require('net');

const _type = 'logs';
const _sourceModuleName = 'ELB';
const _logHost = '#{LogHost}'
const _logPort = #{LogPort}
const _environment = '#{Octopus.Environment.Name}'
const _component = '#{Component}'
const _application = '#{Application}'

function postToLogstash(entry){
    console.log("INFO: Posting event to logstash...");

    var socket = net.createConnection(_logPort, _logHost);
    var message = JSON.stringify(entry) + "\n";
    socket.write(message);
    socket.end();

    console.log("INFO: Posting to logstash...done");
}

exports.handler = (event, context, callback) => {
    console.log('INFO: Retrieving log from S3 bucket...');

    const bucket = event.Records[0].s3.bucket.name;
    const key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, ' '));
    const params = {
        Bucket: bucket,
        Key: key
    };

    const reader = readline.createInterface({
        input: s3.getObject(params).createReadStream()
    });

    const expectedColumns = 12;

    reader.on('line', function(line) {
        console.log("INFO: Parsing S3 line entry...");

        const columns = line.split(/ (?=(?:(?:[^"]*"){2})*[^"]*$)/);

        if(columns.length >= expectedColumns){
            var entry = {
                EventReceivedTime: columns[0],
                LoadBalancerName: columns[1],
                PublicIpAndPort: columns[2],
                InternalIpAndPort: columns[3],
                TimeToForwardRequest: parseFloat(columns[4]) * 1000,
                TimeTaken: parseFloat(columns[5]) * 1000,
                TimeToForwardResponse: parseFloat(columns[6]) * 1000,
                Status: columns[7],
                BackendStatus: columns[8],
                BytesUploadedFromClient: parseInt(columns[9]),
                BytesDownloadedByClient: parseInt(columns[10]),
                FullRequest: columns[11],
                Component: _component,
                SourceModuleName: _sourceModuleName,
                Environment: _environment,
                Application: _application,
                Type: _type
            };
            postToLogstash(entry);
        } else {
            console.log("ERROR: Invalid record length, was expecting " + expectedColumns + " but found " + columns.length);
            console.log('ERROR: -------');
            console.log(line);
            console.log('ERROR: -------');
        }
    });
};

Nothing too fancy.

In the interest of full disclosure, I did not write the script above. It was written by a few guys from my team initially as a proof of concept, then improved/hardened as a more controlled piece of work.

Office Complex

You might notice some strange variable names at the top of the script (i.e. #{Octopus.Environment.Name}).

We use Octopus Deploy for all of our deployments, and this includes the deployment of Lambda functions (which we package via Nuget and then deploy via the AWS Powershell Cmdlets/CLI inside Octopus). The #{NAME} notation is a way for Octopus to substitute variable values into files during deployment. This substitution is very useful, and can be scoped via a variety of things (like Environment, Machine, Role, etc), so by the time the script actually gets into AWS those variables are filled in with actual values.

Other than our use of Octopus variables, other things to note in this piece of Javascript are:

  • At no point does the function specify which credentials are being used to access the S3 bucket containing the ELB log files. This is because the Lambda function has been configured with an IAM role allowing it to access the required resources. The AWS Javascript library has built in support for running inside supported AWS contexts (like EC2 instances and Lambda functions), such that it can use the role applied to the context to get appropriate temporary credentials. This is a much better approach than using specific credentials in the script, which can result in unintended consequences if you aren’t eternally vigilant.
  • You need to make sure that your Lambda function is configured with an appropriate security group that allows it to use the expected outbound channel (i.e. make sure it can get to the Logstash host you’re trying to connect to). This was somewhat of an issue for us as our ELK stack is hosted inside another AWS account (our OPs account), so we had to make sure that all of the appropriate VPC peering was configured before it would work correctly. It can be a bit of a pain to set up the smallest possible surface area, but don’t be tempted to just configure the Lambda function to be able to do everything and go anywhere. Smaller surface area = more security and the policy of least privilege is one you should always follow.
  • It’s important to ensure that if you’re doing TCP communication in a Lambda function, you make sure to close your socket when you’re done with it, or the Lambda function might not exit. It might, but it also might not, and it can really throw you for a loop if you don’t know why.

To Be Continued

That’s it for this week. The Javascript I’ve included above is pretty generic (apart from the specific set of fields that we like to have in our log events) and will successfully process an ELB log file from S3 to a Logstash instance listening on a port of your choosing (probably 6379) when used in a Lambda function. Feel free to reuse it for your own purposes.

Next week I’ll continue this series of posts with information about how we use CloudFormation to setup our Lambda function as part of one of our environment definitions.

CloudFormation and Lambda aren’t the best of friends yet, so there is some interesting stuff that you have to be aware of.