And thus we come to the end of this particular saga, of rewriting our AWS Lambda ELB Logs Processor in C# using .NET Core.

To summarise the posts in this series:

  1. I went through some of the basics around how to use C# and .NET Core to implement an AWS Lambda function
  2. I explained how we constructed the automated build, test and deployment pipeline for the .NET Core ELB Logs Processor
  3. I outlined how we tested the new .NET Core Lambda function, with a specific focus around how we handled the AWS Credentials

In conjunction with those posts, I’ve also uploaded a copy of the entire source code (build and deployment scripts included) to Solavirum.Logging.ELB.Processor.Lambda, mostly as a master reference point. It’s a copy of our internal repository, quickly sanitized to remove anything specific to our organization, so I can’t guarantee that it works perfectly, but there should be enough there to demonstrate the entire process working end to end.

As with all the rest of our stuff, the build script is meant to run from inside TeamCity, and the deployment script is meant to run from inside Octopus Deploy, so if you don’t have either of those things, you will have to use your imagination.

Of course, everything I’ve gone through is purely academic unless it’s running successfully in Production…

Seize The Means Of Production

If you’ve read any of my previous posts about the Javascript/Node.js ELB Logs Processor, you’re probably aware of the fact that it tended to look like it was working fine right up to the point where I published it into production and the sheer weight of traffic caused it to have a meltdown.

Unsurprisingly, the difference between a sample file with a few hundred lines in it and a real file with hundreds of thousands of lines is significant, and that’s not even really taking into account the difference between my local development environment and AWS Lambda.

In an unprecedented result though, the .NET Core implementation worked just fine when I shipped it to our production environment.

Unfortunately it was around twenty times slower.

The old Javascript/Node.js implementation could process a file in around 9 seconds, which is pretty decent.

The .NET Core implementation processes a file in around 200 seconds, which is a hell of a lot slower.

Now, I can explain some of that away with the fact that the .NET Core implementation uses HTTP instead of TCP to communicate with Logstash (which was the primary reason I rewrote it in the first place), but it was still pretty jarring to see.

Interestingly enough, because executing Lambda functions is so cheap, the total cost increase was only around $US8 a month, even though it was twenty times slower, so who really cares.

In the end, while the increase in processing time was undesirable, I’d rather have slower C# and HTTP over faster Javascript and TCP.

Summary

Overall, I’m pretty happy with how this whole .NET Core rewrite went, for a number of reasons:

  • I got to explore and experiment with the newest .NET and C# framework, while delivering actual business value, which was great.
  • In the process of developing this particular solution, I managed to put together some more reusable scripts and components for building other .NET Core components
  • I finally switched the ELB Logs Processor to HTTP, which means that the overall feed of events into our ELK stack is more reliable (TCP was prone to failure)
  • I managed to get a handle on fully automated testing of components that require AWS Credentials, without accidentally checking the credentials into source control

In the end though, I’m mostly just happy because it’s not Javascript.

Last week I wrote a high level overview of the build, package, test and deploy process that we created in order to get .NET Core C# into AWS Lambda.

As part of that post, I briefly touched on the validation part of the process, which basically amounted to “we wrote some tests in C# using XUnit, then used the dotnet command line interface to run them as part of the build”.

Considering that the testing took up a decent chunk of the total time spent, that amount of coverage isn’t really appropriate, so this post will elaborate further, and explain some of the challenges we encountered and how we overcame them.

I’m A Little Testy

At a high level, testing a .NET Core AWS Lambda function written in C# is not really any more difficult than testing other C# code. Like the stuff I’ve been writing for years.

Unit test classes in isolation, integration test classes together, have a few end to end tests (if possible) and you’re probably in a good place.

When you write a lambda function, all you really need is a single class with a method on it (Handler::Handle or equivalent). You could, if you wanted, put literally all of your logic into this class/method combination and have a fully functional lambda function.

That would probably make it harder to test though.

A better idea is to break it down into its constituent parts, ideally taking the form of small classes with a single responsibility, working together to accomplish a greater goal.

For the ELB Logs Processor, there are three clear responsibilities:

  1. Read a file from S3, and expose the information therein as lines of text
  2. Take a line of text and transform it into some sort of structure, potentially breaking it down and adding meta information 
  3. Take a structure and output it to a Logstash endpoint

Technically there is also a fourth responsibility which is to tie the three things above together (i.e. orchestration), but we’ll get to that in a minute.
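To make the separation a little more concrete, here is a rough sketch of what those responsibilities might look like as interfaces. The names are hypothetical (they don’t necessarily match the real code), but the shape is what matters: three small pieces, each easy to unit test in isolation.

using System.Collections.Generic;

// Hypothetical interfaces only; names and signatures are illustrative, not the real code.
public interface ILogFileSource
{
    // Responsibility 1: read a file from S3 and expose its contents as lines of text.
    IEnumerable<string> GetLines(string bucket, string key);
}

public interface ILogLineParser
{
    // Responsibility 2: transform a raw line into a structured event, adding meta information.
    LogEvent Parse(string line);
}

public interface ILogstashOutput
{
    // Responsibility 3: push a structured event to the Logstash endpoint.
    void Send(LogEvent logEvent);
}

// A simple bag of data representing a parsed log event (also illustrative).
public class LogEvent
{
    public string Message { get; set; }
    public IDictionary<string, string> Fields { get; set; }
}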

Anyway, if the responsibilities are clearly separated, we’re in unit testing territory, which is familiar enough that I don’t really want to talk about it too much. Grab your testing framework of choice (XUnit is a good pick for .NET Core, NUnit didn’t seem to work too well in Visual Studio + Resharper), mock out your dependencies and test away.

You can apply the same sort of testing to the orchestrator (the Handler) by giving it its dependencies through its constructor. Make sure you give it a parameterless constructor with default values though, or you’ll have issues running it in AWS.

This is basically just a poor man’s dependency injection, but my favourite dependency injection library, Ninject, does not exist in .NET Core and I didn’t want to invest the time to learn one of the other ones (like AutoFac) for such a small project.
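As a rough sketch (reusing the hypothetical interfaces from above, rather than the real constructor signature, which takes a logger and a configuration dictionary), the poor man’s dependency injection looks something like this:

public class Handler
{
    private readonly ILogFileSource _source;
    private readonly ILogLineParser _parser;
    private readonly ILogstashOutput _output;

    // AWS Lambda needs to be able to instantiate the Handler itself, so a parameterless
    // constructor that wires up the production defaults must exist. The concrete types
    // used here are placeholders.
    public Handler() : this(new S3LogFileSource(), new ElbLogLineParser(), new HttpLogstashOutput())
    {
    }

    // Tests supply their own (usually mocked) dependencies through here.
    public Handler(ILogFileSource source, ILogLineParser parser, ILogstashOutput output)
    {
        _source = source;
        _parser = parser;
        _output = output;
    }

    public void Handle(Amazon.Lambda.S3Events.S3Event e)
    {
        // Orchestration: read the file described by the event, parse each line and
        // push the resulting events to Logstash (details elided).
    }
}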

Of course, while unit tests are valuable, they don’t really tell us whether or not everything will work when you put it all together.

That’s where integration tests come in.

Integration Functions

In my experience, I’ve consistently found there to be two distinct flavours of integration test.

The first is the test that validates whether or not everything works when all of your things are put together. This tends to end up being a test of your dependency injection configuration (if you’re using it), where you create some top level object using your container of choice, then execute a series of methods on it (and its children) echoing what would normally happen in your program execution. You might mock out the very edges of the system (like databases, API dependencies, disk-based configuration, etc), but in general you want to validate that your tens (hundreds? thousands?) of classes play nice together and actually accomplish their overarching goal.

The second type of test is the one that validates whether or not a class that relies on an external system works as expected. These are very close to unit tests in appearance (single class, some mocked out dependencies), but are much slower and more fragile (due to the reliance on something outside of the current code base). They do have a lot of value though, because beyond application failures, they inform you when your integration with the external dependency breaks for whatever reason.

The ELB Logs Processor features both types of integration test, which is nice, because it makes for a comprehensive blog post.

The simpler integration test is the one for the component that integrates with S3 to download a file and expose it as a series of lines. All you need to do is put a file in a real S3 bucket, then run the class with an appropriate set of parameters to find that file, download it and expose it in the desired format. You’re also free to simulate missing files (i.e. go get this file that doesn’t exist) and other error cases, which is good.
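A hedged sketch of what that test might look like is below. S3LogFileSource is a hypothetical name for the class under test, TestS3BucketManager is the dispose-delete helper that appears in the full Handler test further down, and the credential handling is glossed over entirely (that’s what the next section is about).

[Fact]
public void WhenAFileExistsInS3_ItIsExposedAsLinesOfText()
{
    var s3 = new AmazonS3Client(new AmazonS3Config { RegionEndpoint = RegionEndpoint.APSoutheast2 });
    var testBucketManager = new TestS3BucketManager(s3, "oth.console.elb.processor.tests.");
    using (var bucket = testBucketManager.Make())
    {
        // Put a known file into the temporary bucket.
        var content = "first line\nsecond line\nthird line";
        var stream = new MemoryStream(Encoding.UTF8.GetBytes(content));
        var putResult = s3.PutObjectAsync(new PutObjectRequest { BucketName = bucket.Name, InputStream = stream, Key = "test-file" }).Result;

        // Run the class under test against the real bucket and check the output.
        var source = new S3LogFileSource(s3);
        var lines = source.GetLines(bucket.Name, "test-file").ToList();

        lines.Should().HaveCount(3);
        lines.First().Should().Be("first line");
    }
}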

The other integration test involves the Handler, using its default configuration and a realistic input. The idea would be to run the Handler in a way that is as close to how it will be run in reality as possible, and validate that it does what it advertises. This means we need to put some sort of sample file in S3, fire an event at the Handler describing that file (i.e. the same event that it would receive during a normal trigger in AWS), validate that the Handler ran as expected (perhaps measuring log output or something) and then validate that the events from the sample file were available in the destination ELK stack. A complicated test for sure, but one with lots of value.

As is always the case, while both of those tests are conceptually fairly simple and easy to spell out, they both have to surmount the same challenge.

AWS Authentication and Authorization.

401, Who Are You?

The various libraries supplied by AWS (Javascript, Powershell, .NET, Python, etc) all handle credential management in a similar way.

They use a sort of priority system for determining the source of their credentials, and it looks something like this:

  1. If credentials have been supplied directly in the code (i.e. key/secret), use those
  2. Else, if a profile has been supplied, look for that and use it
  3. Else, if no profile has been supplied, use the default profile
  4. Else, if environment variables have been set, use those
  5. Else, use information extracted from the local machine (i.e. IAM profiles or EC2 meta information)

The specific flow differs from language to language (.NET specific rules can be found here for example), but the general flow is pretty consistent.
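To make that a little more concrete in C#, the different options mostly surface as different ways of constructing a client. The following is a rough sketch using S3 as the example (the exact constructor overloads vary a little between SDK versions):

// 1. Credentials supplied directly in code (avoid this for anything real).
var explicitClient = new AmazonS3Client("access-key-id", "secret-access-key", RegionEndpoint.APSoutheast2);

// 2/3. A named (or default) profile, resolved from the local credential store.
// StoredProfileAWSCredentials is what was available to us in .NET Core at the time,
// even though it is marked obsolete.
var profileClient = new AmazonS3Client(new StoredProfileAWSCredentials("elb-processor-s3-test"), RegionEndpoint.APSoutheast2);

// 4/5. No credentials supplied at all; the SDK falls back to environment variables,
// and then to the machine itself (IAM roles / EC2 meta information), which is what
// happens when the code runs inside AWS Lambda.
var ambientClient = new AmazonS3Client(RegionEndpoint.APSoutheast2);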

What does that mean for our tests?

Well, the actual Lambda function code doesn’t need credentials at all, because the easiest way to run it is with an IAM profile applied to the Lambda function itself in AWS. Assuming no other credentials are available, this kicks in at option 5 in the list above, and is by far the least painful (and most secure) option to use.

When running the tests during development, or running them on the build server, there is no ambient IAM profile available, so we need to find some other way to get credentials to the code.

Luckily, the .NET AWS library allows you to override the logic that I described above, which is pretty great (even though it does it with static state, which makes me sad).

With a little work we can nominate an AWS profile with a specific name to be our current “context”, and then use those credentials to both setup a temporary S3 bucket with a file in it, and to run the code being tested, like in the following Handler test.

/// <summary>
/// This test is reliant on an AWS profile being setup with the name "elb-processor-s3-test" that has permissions
/// to create and delete S3 buckets with prefixes oth.console.elb.processor.tests. and read items in those buckets
/// </summary>
[Fact]
public void WhenHandlingAValidS3Event_ConnectsToS3ToDownloadTheFile_AndOutputsEventsToLogstash()
{
    var inMemoryLogger = new InMemoryLogger();
    var xunitLogger = new XUnitLogger(_output);
    var logger = new CompositeLogger(inMemoryLogger, xunitLogger);

    var application = Guid.NewGuid().ToString();
    var config = new Dictionary<string, string>
    {
        { Configuration.Keys.Event_Environment, Guid.NewGuid().ToString() },
        { Configuration.Keys.Event_Application, application},
        { Configuration.Keys.Event_Component, Guid.NewGuid().ToString() },
        { Configuration.Keys.Logstash_Url, "{logstash-url}" }
    };
    using (AWSProfileContext.New("elb-processor-s3-test"))
    {
        logger.Log("Creating test bucket and uploading test file");
        var s3 = new AmazonS3Client(new AmazonS3Config { RegionEndpoint = RegionEndpoint.APSoutheast2 });
        var testBucketManager = new TestS3BucketManager(s3, "oth.console.elb.processor.tests.");
        using (var bucket = testBucketManager.Make())
        {
            var templateFile = Path.Combine(ApplicationEnvironment.ApplicationBasePath, @"Helpers\Data\large-sample.log");
            var altered = File.ReadAllText(templateFile).Replace("@@TIMESTAMP", DateTime.UtcNow.ToString("yyyy-MM-ddTHH:mm:ss.ffffffZ"));
            var stream = new MemoryStream(Encoding.UTF8.GetBytes(altered));
            var putResult = s3.PutObjectAsync(new PutObjectRequest { BucketName = bucket.Name, InputStream = stream, Key = "test-file" }).Result;

            var message = new S3EventBuilder().ForS3Object(bucket.Name, "test-file").Build();

            logger.Log("Executing ELB logs processor");
            var handler = new Handler(logger, config);
            handler.Handle(message);
        }
    }

    // Check that there are some events in Elasticsearch with our Application
    Func<long> query = () =>
    {
        var client = new HttpClient {BaseAddress = new Uri("{elasticsearch-url}")};
        var raw = client.GetStringAsync($"/logstash-*/_search?q=Application:{application}").Result;
        dynamic result = JsonConvert.DeserializeObject(raw);
        return (long) result.hits.total;
    };

    query.ResultShouldEventuallyBe(hits => hits.Should().BeGreaterThan(0, $"because there should be some documents in Elasticsearch with Application:{application}"));
}

The code for the AWSProfileContext class is relatively simple, in that while it is in effect it inserts the profile into the first position of the credential fallback process, so that it takes priority over everything else. When disposed, it removes that override, returning everything to the way it was. A similar dispose-delete pattern is used for the temporary S3 bucket.
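For reference, a minimal sketch of what a class like AWSProfileContext might look like is below. It assumes the SDK’s static FallbackCredentialsFactory generator list is the hook for overriding the credential fallback process, which is roughly how ours worked, but treat it as illustrative rather than the exact implementation.

using System;
using Amazon.Runtime;

public sealed class AWSProfileContext : IDisposable
{
    private readonly FallbackCredentialsFactory.CredentialsGenerator _override;

    private AWSProfileContext(string profileName)
    {
        // StoredProfileAWSCredentials is obsolete, but it was what worked for us in .NET Core.
        _override = () => new StoredProfileAWSCredentials(profileName);

        // Put the named profile at the front of the fallback chain so it wins over everything else.
        FallbackCredentialsFactory.CredentialsGenerators.Insert(0, _override);

        // Real code may also need to deal with any credentials the SDK has already cached.
    }

    public static AWSProfileContext New(string profileName)
    {
        return new AWSProfileContext(profileName);
    }

    public void Dispose()
    {
        // Remove the override, returning everything to the way it was.
        FallbackCredentialsFactory.CredentialsGenerators.Remove(_override);
    }
}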

With a profile override in place, you can run the tests in local development just by ensuring you have a profile available with the correct name, which allows for a nice smooth development experience once a small amount of manual setup has been completed.

To run the tests on the build servers, we put together a simple helper in Powershell that creates an AWS profile, runs a script block and then removes the profile. The sensitive parts of this process (i.e. the credentials themselves) are stored in TeamCity, and supplied when the script is run, so nothing dangerous needs to go into source control at all.

To Be Continued

The most complicated part of testing the ELB Logs Processor was getting the AWS credentials working consistently across a few different execution strategies, while not compromising the safety of the credentials themselves. The rest of the testing is pretty much just vanilla C# testing.

The only other thing worth mentioning is that the AWS libraries available for .NET Core aren’t quite the same as the normal .NET libraries, so some bits and pieces didn’t work as expected. The biggest issue I ran into was how a lot of the functionality for managing credentials is typically configured through the app.config, and that sort of thing doesn’t even seem to exist in .NET Core. Regardless, we still got it to work in the end, even though we did have to use the obsolete StoredProfileAWSCredentials class.

Next week will be the last post in this series, drawing everything together and focusing on actually running the new .NET Core ELB Logs Processor in a production environment.

So, last week was mostly an introduction to .NET Core and AWS Lambda. Not a lot of meat.

This week I’ll be writing about the build and deployment pipeline that we set up, and how it ties in with our normal process.

Hopefully it will be a bit more nutritious.

The Story So Far

Over the last few years we’ve put together a fairly generic build and deployment process using Powershell that focuses on ensuring the process can be run painlessly on a fresh checkout of the repository. The goal is to have everything either embedded or bootstrapped, and to have any and all logic encapsulated within the scripts.

What this means is that we have a relatively simple setup in TeamCity, which is basically just parameter management and a small bit of logic for kicking off the scripts that live inside the repositories.

The same approach is extended to deployment as well; we have Powershell (or an equivalent scripting language) that does the actual deployment, and then only use Octopus Deploy for orchestration and variable management.

Because we need to build, test and deploy a variety of things (like C# desktop applications, Javascript websites, the old Javascript ELB Logs Processor, CloudFormation environments), the common scripts feature pluggable processes in the form of Powershell script blocks, allowing the user to choose what sort of build is needed (simple Nuget package over some files? A full MSBuild execution that generates an .exe in a Nuget package?), how and when it is tested (Nunit over .dlls containing the phrase “Tests” before deployment? Selenium using Webdriver IO through npm after deployment?) and how it is deployed (pretty much just Octopus Deploy).

The API for the pluggable processes (which I’m going to call engines from this point forward) is relatively lax, in that the Powershell just calls the supplied script block with known parameters, and fails miserably if they don’t exist.

In retrospect, we probably should have just used something already available, like psake, but it just didn’t seem necessary at the time. We’re probably going to get hurt if we try to do anything that only works on Linux systems (no Powershell there), but it hasn’t happened yet, so I’m sure it will never happen.

To summarise, we have a framework that allows for configurable approaches to build, test and deployment, and a .NET Core AWS Lambda function should be able to slot right in.

Vrmm Vrmm Goes The Engine

In order to reliably get the lambda function code from the git repository in which it lives, to the actual AWS Lambda function where it will be executed, we needed to do three things:

  1. Create an artefact containing a versioned and validated copy of the code to be executed
  2. Validate that artefact (i.e. testing)
  3. Deploy said artefact to the AWS Lambda function

Coincidentally, these three things line up nicely with the pluggable components in our generic build and deployment process, so all we have to do is provide an engine for each, then run the overall process, and everything should be sweet.

Bob The Builder

The first thing to deal with is the actual creation of something that can be deployed, which for us means a Nuget package.

Referring back to the Javascript version of the ELB Logs Processor, there really wasn’t a lot to the creation of the artefact. Use npm to make sure the appropriate dependencies are present, a quick consolidation step with webpack, then wrap everything up in a Nuget package.

For the C# version, it’s a similar story, except instead of using node/npm, we can use the dotnet executable available in the SDK, followed by the Nuget command line tool to wrap it all up neatly.

Basically, it looks like this:

dotnet restore {solution-path}
dotnet build -c Release {solution-path}
dotnet publish -o {output-path} -c Release {solution-path}
nuget pack {nuspec-path} -OutputDirectory {output-directory} -Version {version}

The actual script block for the engine is a little bit more complicated than that (TeamCity integration, bootstrapping the .NET Core SDK, error handling/checking, etc), but it’s still relatively straightforward.

As I mentioned last week, the only thing to watch out for here is to make sure that you’re targeting the appropriate runtime with your solution configuration, and not to try and build a standalone executable. The code will actually be executed on a Linux machine behind the scenes, and AWS Lambda will make sure the .NET Core runtime is available, so all you have to worry about is your code. I’m pretty sure that if you try to supply a standalone, self-contained executable it will fail miserably.

Emotional Validation

The second thing to deal with is the validation of the artefact.

Similar to the build process, if we refer back to the Javascript ELB Logs Processor, we had no testing at all. Literally nothing. Our validation was “deploy it and see if the logs stop”.

Don’t get me wrong, there are tonnes of ways to write tests and automate them when using node/npm, we just never went to the effort of actually doing it.

I’m much more familiar with C# (even taking into account the differences of .NET Core), and we don’t write code without tests, so this was an ideal opportunity to put some stuff together for validation purposes.

I’m going to dedicate an entire post to just how we went about testing the rewrite (some interesting stuff there about managing AWS credentials in a secure way), but at a high level we followed our standard pattern of having dedicated test projects that sit in parallel to the code (i.e. Lambda.Tests.Unit, Lambda.Tests.Integration) with automatic execution through the generic build process via an engine.

This particular engine is applied pre-deployment (i.e. before the Nuget package is pushed to Octopus and a deployment project executed), and is pretty straightforward:

dotnet test {project-path} -c Release --no-build --logger {output-file-path}

The only caveat here was the general immaturity of .NET Core, as I had some problems using NUnit (which is the testing framework we typically use for C#), but XUnit worked perfectly fine. They aren’t that dissimilar, so it wasn’t a huge problem.

Tentacle Joke

The last thing to deal with is the actual deployment of the artefact to AWS.

Assuming step 1 and 2 have gone well (i.e. we have a versioned Nuget package that has been validated), all that’s left is to leverage Octopus Deploy.

For us this means a relatively simple project inside Octopus that just deploys the Nuget package to a worker tentacle, and a Powershell script embedded in the package that does the bulk of the work.

$awsKey = Get-OctopusParameter "AWS.Deployment.Key";
$awsSecret = Get-OctopusParameter "AWS.Deployment.Secret";
$awsRegion = Get-OctopusParameter "AWS.Deployment.Region";

$version = Get-OctopusParameter "Octopus.Release.Number";

$functionName = Get-OctopusParameter "AWS.Lambda.Function.Name";

$aws = Get-AwsCliExecutablePath

$env:AWS_ACCESS_KEY_ID = $awsKey
$env:AWS_SECRET_ACCESS_KEY = $awsSecret
$env:AWS_DEFAULT_REGION = $awsRegion

$functionPath = "$here\function"

Write-Verbose "Compressing lambda code file"
Add-Type -AssemblyName System.IO.Compression.FileSystem
[system.io.compression.zipfile]::CreateFromDirectory($functionPath, "index.zip")

Write-Verbose "Updating Log Processor lambda function [$functionName] to version [$version]"
(& $aws lambda update-function-configuration --function-name $functionName --runtime "dotnetcore1.0" --handler "Lambda::Handler::Handle") | Write-Verbose
(& $aws lambda update-function-code --function-name $functionName --zip-file fileb://index.zip) | Write-Verbose
(& $aws lambda update-function-configuration --function-name $functionName --description $version) | Write-Verbose

In summary, the script takes the contents of the Nuget package that was just deployed, zips up the relevant parts of it (in this case, everything under the “function” directory) and then uses the AWS CLI and some Octopus parameters to push the zip file to a known Lambda function.

Really the only interesting thing here is that we take the version from the Nuget package and apply it directly to the function in AWS for backwards traceability.

To Be Continued

The steps and processes that I’ve outlined above only describe the key parts of the build and deployment process. A lot of the detail is missing, and that is on purpose, because the entire thing is quite complicated and describing it comprehensively would make this blog post into more of a novel.

It’s already pretty long.

The key things to take away are the .NET Core build and publish process (using the dotnet cmdline application), the .NET Core testing process (again, through the cmdline application) and the simplest way that I’ve found to actually push code to a Lambda function in an automated fashion.

Next week I’ll elaborate on the testing of the lambda function, specifically around managing and using AWS credentials to provide an end to end test of the function logic.

Over two months ago I commented that I was going to rewrite our ELB Logs Processor in C#. You know, a language that I actually like and respect.

No, it didn’t take me two months to actually do the rewrite (what a poor investment that would have been), I finished it up a while ago. I just kept forgetting to write about it.

Thus, here we are.

HTTP Is The Way For Me

To set the stage, let’s go back a step.

The whole motivation for even touching the ELB Logs Processor again was to stop it pushing all of its log events to our Broker via TCP. The reasoning here was that TCP traffic is much harder to reason about, and Logstash has much better support around tracking, retrying and otherwise handling HTTP traffic than it does for TCP, so the switch makes sense as a way to improve the overall health of the system.

I managed to flip the Javascript ELB Logs Processor to HTTP using the Axios library (which is pretty sweet), but it all fell apart when I pushed it into production, and it failed miserably with what looked like memory issues.

Now, this is the point where a saner person might have just thrown enough memory at the Lambda function to get it to work and called it a day. Me? I’d had enough of Javascript at that point, so I started looking elsewhere.

At this point I was well aware that AWS Lambda offers execution of C# via .NET Core, as an alternative to Node.js.

Even taking into account my inexperience with doing anything at all using .NET Core, C# in Lambda was still infinitely preferable to Javascript. The assumption was that the ELB Logs Processor doesn’t actually do much (handle incoming S3 event, read file, send log events to Logstash), so rewriting it shouldn’t take too much effort.

Thus, here we are.

You Got .NET Core In My Lambda!

The easiest way to get started with using C# via .NET Core in AWS Lambda is to install the latest version of the AWS Toolkit and then just use Visual Studio. The toolkit provides a wealth of functionality for working with AWS in Visual Studio, including templates for creating Lambda functions as well as functionality for publishing your code to the place where it will actually be executed.

Actually setting up the Lambda function to run .NET Core is also relatively trivial. Assuming you’ve already set up a Lambda function with the appropriate listeners, it’s literally a combo box in the AWS Dashboard where you select the language/runtime version that your code requires.

Using the integrated toolkit and Visual Studio is a great way to get a handle on the general concepts in play and the components that you need to manage, but it’s not how you should do it in a serious development environment, especially when it comes to publishing.

A better way to do it is to go for something a bit more portable, like the actual .NET Core SDK, which is available for download as a simple compressed archive, with no installation required. Once you’ve got a handle on the command line options, it’s the best way to put something together that will also work on a build server of some description.

I’ll talk about the build and deployment process in more detail in the second post in this series, but for now, it’s enough to know that you can use the combination of the command line tools from the .NET Core SDK along with Visual Studio to create, develop, build, package and publish your .NET Core AWS Lambda function without too much trouble.

At that point you can choose to either create your Lambda function manually, or automate its creation through something like CloudFormation. It doesn’t really matter from the point of view of the function itself.

You Got Lambda In My .NET Core!

From the perspective of C#, all you really need to write is a class with a handler function.

namespace Lambda
{
    public class Handler
    {
        public void Handle(System.IO.Stream e)
        {
            // do some things here
        }
    }
}

A stream is the most basic input type, leaving the handling of the incoming data entirely up to you, using whatever knowledge you have about its format and structure.
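For example, a trivial sketch of a handler that just reads the raw payload as text (assuming nothing about the content beyond it being UTF-8) might look like this:

namespace Lambda
{
    public class Handler
    {
        public void Handle(System.IO.Stream e)
        {
            // With a raw stream, interpreting the payload is entirely up to you. Here it
            // is just read as text; in reality it would be a JSON document describing the
            // triggering event (e.g. an S3 event).
            using (var reader = new System.IO.StreamReader(e))
            {
                var raw = reader.ReadToEnd();
                // ... parse and act on the raw event here
            }
        }
    }
}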

Once you have the code, you can compile and publish it to a directory of your choosing (to get artefacts that you can move around), and then push it to the AWS Lambda function. As long as you’ve targeted the appropriate runtime as part of the publish (i.e. not a standalone, self-contained application; you actually need to target the same runtime that you specified in the AWS Lambda function configuration), the only other thing to be aware of is how to tell the function where to start execution.

Specifying the entry point for the function is done via the Handler property, which you can set via the command line (using the AWS CLI), via Powershell Cmdlets, or using the AWS dashboard. Regardless of how you set it, it’s in the format “{namespace}::{class}::{function}”, so for the example class above it would be “Lambda::Handler::Handle”. If you decide to arbitrarily change your namespaces or class names, you’ll have to keep your lambda function in sync or it will stop working the next time you publish your code.

At this point you should have enough pieces in place that you can trigger whatever event you’re listening for (like a new S3 file appearing in a bucket) and track the execution of the function via CloudWatch or something similar.

Helpful Abstractions

If you look back at the simple function handler above, you can probably conclude that working with raw streams isn’t the most efficient way of handling incoming data. A far more common approach is to get AWS Lambda to deserialize your incoming event for you. This allows you to use an actual structured object of your choosing, ranging from the AWS events that are supported out of the box (like an S3 event), through to a POCO of your own devising.

In order to get this working all you have to do is add a reference to the Amazon.Lambda.Serialization.Json nuget package, and then annotate your handler with the appropriate attribute, like so:

namespace Lambda
{
    public class Handler
    {
        [LambdaSerializer(typeof(Amazon.Lambda.Serialization.Json.JsonSerializer))]
        public void Handle(Amazon.Lambda.S3Events.S3Event e)
        {
            // Do stuff here
        }
    }
}

In the function above, I’ve also made use of the Amazon.Lambda.S3Events nuget package in order to strongly type my input object, because I know that the only listener configured will be for S3 events.

I’m actually not sure what you would have to do to create a function that handles multiple input types, so I assume the model is to just fall back to a more generic input type.

Or maybe to have multiple functions or something.

To Be Continued

I think that’s as good a place as any to stop and take a breath.

Next week I’ll continue on the same theme and talk about the build, package and publish process that we put together in order to actually get the code deployed.

If anybody actually reads this blog, it would be easy to see that my world has been all about log processing and aggregation for a while now. I mean, I do other things as well, but the questions and problems that end up here on this blog have been pretty focused around logging for as far back as I care to remember. The main reason is that I just like logging, and appreciate the powerful insights that it can give about your software. Another reason is that we don’t actually have anyone who sits in the traditional “operations” role (i.e. infrastructure/environment improvement and maintenance), so in order to help keep the rest of the team focused on our higher level deliverables, I end up doing most of that stuff.

Anyway, I don’t see that pattern changing any time soon, so on with the show.

While I was going about regaining control over our log stack, I noticed that it was extremely difficult to reason about TCP traffic when using an AWS ELB.

Why does this matter?

Well, the Broker layer in our ELK stack (i.e. the primary ingress point) uses a Logstash configuration with a TCP input, and as a result, all of the things that write to the Broker (our externally accessible Ingress API, other instances of Logstash, the ELB Logs Processor) use TCP. That’s a significant amount of traffic, and something that I’d really like to be able to monitor and understand.

Stealthy Traffic

As far as I understand it, when you make a TCP connection through an ELB, the ELB records the initial creation of the connection as a “request” (one of the metrics you can track with CloudWatch) and then pretty much nothing else after that. I mean, this makes sense, as it’s the ELB’s job to essentially pick an underlying machine to route traffic to, and most TCP connections created and used specifically as TCP connections tend to be long lived (as opposed to TCP connections created as part of HTTP requests and responses).

As far as our three primary contributors are concerned:

  • The Logging Ingress API is pretty oblivious. It just makes a new TCP connection for each incoming log event, so unless the .NET Framework is silently caching TCP connections for optimization purposes, it’s going to cause one ELB request per log event.
  • The ELB Logs Processor definitely caches TCP connections. We went through a whole ordeal with connection pooling and reuse before it would function in production, so its definitely pushing multiple log events through a single socket.
  • The Logstash instances that we have distributed across our various EC2 instances (local machine log aggregation, like IIS and application logs) are using the Logstash TCP output. I assume it uses one (or many) long lived connections to do its business, but I don’t really know. Logstash is very mysterious.

This sort of usage makes it very hard to tell just how many log events are coming through the system via CloudWatch, which is a useful metric, especially when things start to go wrong and you need to debug which part of the stack is actually causing the failure.

Unfortunately, the monitoring issue isn’t the only problem with using the Logstash TCP input/output. Both input and output have, at separate times, been…flakey. I’ve experienced both sides of the pipeline going down for no obvious reason, or simply not working after running uninterrupted for a long time.

The final nail in the coffin for TCP came recently, when Elastic.co released the Logstash Persistent Queue feature for Logstash 5.4.0, which does not work with TCP at all (it only supports inputs that use the request-response model). I want to use persistent queues to remove both the Cache and Indexer layers from our log stack, so it was time for TCP to die.

Socket Surgery

Adding a HTTP input to our Broker layer was easy enough. In fact, such an input was already present because the ELB uses a HTTP request to check whether or not the Broker EC2 instances are healthy.

Conceptually, changing our Logstash instances to use a HTTP output instead of TCP should also be easy. Just need to change some configuration and deploy through Octopus. Keep in mind I haven’t actually done it yet, but it feels simple enough.

In a similar vein, changing our Logging Ingress API to output through HTTP instead of TCP should also be easy. A small code change to use HttpClient or RestSharp or something, a new deployment and everything is puppies and roses. Again, I haven’t actually done it yet, so who knows what dragons lurk there.

Then we have the ELB Logs Processor, which is a whole different kettle of fish.

It took a significant amount of effort to get it working with TCP in the first place (connection pooling was the biggest problem), and due to the poor quality of the Javascript (entirely my fault), it’s pretty tightly coupled to that particular mechanism.

Regardless of the difficulty, TCP has to go, for the good of the log stack.

The first issue I ran into was “how do you even do HTTP requests in Node 4.3.2 anyway?”. There are many answers to this question, but the most obvious one is to use the HTTP API that comes with Node. Poking around this for a while showed that it wasn’t too bad, as long as I didn’t want to deal with a response payload, which I didn’t.

The biggest issue with the native Node HTTP API was that it was all callbacks, all the time. In my misadventures with the ELB Logs Processor I’d become very attached to promises and the effect they have on the readability of the resulting code, and didn’t really want to give that up so easily. I dutifully implemented a simple promise wrapper around our specific usage of the native Node HTTP API (which was just a POST of a JSON payload), and incorporated it into the Lambda function.

Unfortunately, this is where my memory gets a little bit fuzzy (it was a few weeks ago), and I don’t really remember how well it went. I don’t think it went well, because I decided to switch to a package called Axios which offered promise based HTTP requests out of the box.

Axios of Evil

Axios was pretty amazing. Well, I mean, Axios IS pretty amazing, but I suppose that sentence gave it away that the relationship didn’t end well.

The library did exactly what it said it did and with its native support for promises, was relatively easy to incorporate into the existing code, as you can see from the following excerpt:

// snip, whole bunch of setup, including summary object initialization

let axios = require('axios').create({
    baseURL: 'http://' + config.logstashHost,
    headers: {'Content-Type': 'application/json'}
});

// snip, more setup, other functions

function handleLine(line) {
    summary.lines.encountered += 1;

    var entry = {
        Component: config.component,
        SourceModuleName: config.sourceModuleName,
        Environment: config.environment,
        Application: config.application,
        message: line,
        type: config.type,
        Source: {
            S3: s3FileDetails
        }
    };
        
    var promise = axios
        .post("/", entry)
        .then((response) => {
            summary.lines.sent += 1;
        })
        .catch((error) => { 
            summary.failures.sending.total += 1;
            if (summary.failures.sending.lastFew.length >= 5) {
                summary.failures.sending.lastFew.shift();
            }
            summary.failures.sending.lastFew.push(error);
        });

    promises.push(promise);
}

// snip, main body of function (read S3, stream to line processor, wait for promises to finish)

Even though it took a lot of effort to write, it was nice to remove all of the code relating to TCP sockets and connection pooling, as it simplified the whole thing.

The (single, manual) test proved that it still did its core job (contents of file written into the ELK stack), it worked in CI and it worked in Staging, so I was pretty happy.

For about 15 minutes that is, until I deployed it into Production.

Crushing Disappointment

Just like last time, the implementation simply could not deal with the amount of traffic that was being thrown at it. Even worse, it wasn’t actually logging any errors or giving me any indication as to why it was failing. After a brief and frustrating investigation, it looked like it was simply running out of memory (the Lambda function was only configured to use 192 MB, which had been enough for the TCP approach) and falling over once it hit that limit. That was only a hypothesis, which I was never able to conclusively prove, but the function was definitely using all of the memory available to it each time it ran.

I could have just increased the available memory, but I wanted to understand where all the memory was going first.

Then I realised I would have to learn how to do memory analysis in Javascript, and I just gave up.

On Javascript that is.

Instead, I decided to rewrite the ELB Logs Processor in .NET Core, using a language that I actually like (C#).

Conclusion

This is one of those cases where looking back, with the benefits of hindsight, I probably should have just increased the memory until it worked and then walked away.

But I was just so tired of struggling with Javascript and Node that it was incredibly cathartic to just abandon it all in favour of something that actually made sense to me.

Of course, implementing the thing in C# via .NET Core wasn’t exactly painless, but that’s a topic for another time.

Probably next week.