
So, last week was mostly an introduction to .NET Core and AWS Lambda. Not a lot of meat.

This week I’ll be writing about the build and deployment pipeline that we set up, and how it ties in with our normal process.

Hopefully it will be a bit more nutritious.

The Story So Far

Over the last few years we’ve put together a fairly generic build and deployment process using Powershell that focuses on ensuring the process can be run painlessly on a fresh checkout of the repository. The goal is to have everything either embedded or bootstrapped, and to have any and all logic encapsulated within the scripts.

What this means is that we have a relatively simple setup in TeamCity, which is basically just parameter management and a small bit of logic for kicking off the scripts that live inside the repositories.

The same approach is extended to deployment as well; we have Powershell (or an equivalent scripting language) that does the actual deployment, and then only use Octopus Deploy for orchestration and variable management.

Because we need to build, test and deploy a variety of things (like C# desktop applications, Javascript websites, the old Javascript ELB Logs Processor, CloudFormation environments), the common scripts feature pluggable processes in the form of Powershell script blocks, allowing the user to choose what sort of build is needed (simple Nuget package over some files? A full MSBuild execution that generates an .exe in a Nuget package?), how and when it is tested (NUnit over .dlls containing the phrase “Tests” before deployment? Selenium using Webdriver IO through npm after deployment?) and how it is deployed (pretty much just Octopus Deploy).

The API for the pluggable processes (which I’m going to call engines from this point forward) is relatively lax, in that the Powershell just calls the supplied script block with known parameters, and fails miserably if they don’t exist.
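
To make that a little more concrete, here’s a heavily simplified sketch of what an engine might look like (the parameter and variable names are hypothetical, not the ones our scripts actually use):

$buildEngine = {
    param
    (
        [string]$repositoryRoot,
        [string]$version
    )

    # Do whatever this particular repository needs in order to produce a package
    Write-Verbose "Building [$repositoryRoot] at version [$version]"
}

# The common scripts just invoke whatever engine was supplied, passing the known parameters
& $buildEngine -repositoryRoot "C:\repo" -version "1.2.3"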

In retrospect, we probably should have just used something already available, like psake, but it just didn’t seem necessary at the time. We’re probably going to get hurt if we try to do anything that only works on Linux systems (no Powershell there), but it hasn’t happened yet, so I’m sure it will never happen.

To summarise, we have a framework that allows for configurable approaches to build, test and deployment, and a .NET Core AWS Lambda function should be able to slot right in.

Vrmm Vrmm Goes The Engine

In order to reliably get the lambda function code from the git repository in which it lives, to the actual AWS Lambda function where it will be executed, we needed to do three things:

  1. Create an artefact containing a versioned and validated copy of the code to be executed
  2. Implement the validation logic itself (i.e. testing)
  3. Deploy said artefact to the AWS Lambda function

Coincidentally, these three things line up nicely with the pluggable components in our generic build and deployment process, so all we have to do is provide an engine for each, then run the overall process, and everything should be sweet.

Bob The Builder

The first thing to deal with is the actual creation of something that can be deployed, which for us means a Nuget package.

Referring back to the Javascript version of the ELB Logs Processor, there really wasn’t a lot to the creation of the artefact. Use npm to make sure the appropriate dependencies are present, a quick consolidation step with webpack, then wrap everything up in a Nuget package.

For the C# version, it’s a similar story, except instead of using node/npm, we can use the dotnet executable available in the SDK, followed by the Nuget command line tool to wrap it all up neatly.

Basically, it looks like this:

dotnet restore {solution-path}
dotnet build -c Release {solution-path}
dotnet publish -o {output-path} -c Release {solution-path}
nuget pack {nuspec-path} -OutputDirectory {output-directory} -Version {version}

The actual script block for the engine is a little bit more complicated than that (TeamCity integration, bootstrapping the .NET Core SDK, error handling/checking, etc), but it’s still relatively straightforward.
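
As a rough illustration of the error checking (a sketch only, not our actual engine, and the variable names are illustrative), each external call gets its exit code checked before moving on:

& $dotnet restore $solutionPath
if ($LASTEXITCODE -ne 0) { throw "dotnet restore failed with exit code [$LASTEXITCODE]" }

& $dotnet publish -o $outputPath -c Release $solutionPath
if ($LASTEXITCODE -ne 0) { throw "dotnet publish failed with exit code [$LASTEXITCODE]" }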

As I mentioned last week, the only thing to watch out for here is to make sure that you’re targeting the appropriate runtime with your solution configuration, and not to try and build a standalone executable. The code will actually be executed on a Linux machine behind the scenes, and AWS Lambda will make sure the .NET Core runtime is available, so all you have to worry about is your code. I’m pretty sure if you try and supply a standalone, self-contained executable it will fail miserably.

Emotional Validation

The second thing to deal with is the validation of the artefact.

Similar to the build process, if we refer back to the Javascript ELB Logs Processor, we had no testing at all. Literally nothing. Our validation was “deploy it and see if the logs stop”.

Don’t get me wrong, there are tonnes of ways to write tests and automate them when using node/npm, we just never went to the effort of actually doing it.

I’m much more familiar with C# (even taking into account the differences of .NET Core), and we don’t write code without tests, so this was an ideal opportunity to put some stuff together for validation purposes.

I’m going to dedicate an entire post to just how we went about testing the rewrite (some interesting stuff there about managing AWS credentials in a secure way), but at a high level we followed our standard pattern of having dedicated test projects that sit in parallel to the code (i.e. Lambda.Tests.Unit, Lambda.Tests.Integration) with automatic execution through the generic build process via an engine.

This particular engine is applied pre-deployment (i.e. before the Nuget package is pushed to Octopus and an Octopus project executed), and is pretty straightforward:

dotnet test {project-path} -c Release --no-build --logger {output-file-path}

The only caveat here was the general immaturity of .NET Core, as I had some problems using NUnit (which is the testing framework we typically use for C#), but XUnit worked perfectly fine. They aren’t that dissimilar, so it wasn’t a huge problem.
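
For reference, a minimal XUnit test looks something like this (the test itself is purely illustrative):

using Xunit;

namespace Lambda.Tests.Unit
{
    public class HandlerTests
    {
        [Fact]
        public void Handler_CanBeConstructed()
        {
            var handler = new Handler();

            Assert.NotNull(handler);
        }
    }
}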

Tentacle Joke

The last thing to deal with is the actual deployment of the artefact to AWS.

Assuming step 1 and 2 have gone well (i.e. we have a versioned Nuget package that has been validated), all that’s left is to leverage Octopus Deploy.

For us this means a relatively simple project inside Octopus that just deploys the Nuget package to a worker tentacle, and a Powershell script embedded in the package that does the bulk of the work.

$awsKey = Get-OctopusParameter "AWS.Deployment.Key";
$awsSecret = Get-OctopusParameter "AWS.Deployment.Secret";
$awsRegion = Get-OctopusParameter "AWS.Deployment.Region";

$version = Get-OctopusParameter "Octopus.Release.Number";

$functionName = Get-OctopusParameter "AWS.Lambda.Function.Name";

$aws = Get-AwsCliExecutablePath

$env:AWS_ACCESS_KEY_ID = $awsKey
$env:AWS_SECRET_ACCESS_KEY = $awsSecret
$env:AWS_DEFAULT_REGION = $awsRegion

$functionPath = "$here\function"

Write-Verbose "Compressing lambda code file"
Add-Type -AssemblyName System.IO.Compression.FileSystem
[system.io.compression.zipfile]::CreateFromDirectory($functionPath, "index.zip")

Write-Verbose "Updating Log Processor lambda function [$functionName] to version [$version]"
(& $aws lambda update-function-configuration --function-name $functionName --runtime "dotnetcore1.0" --handler "Lambda::Handler::Handle") | Write-Verbose
(& $aws lambda update-function-code --function-name $functionName --zip-file fileb://index.zip) | Write-Verbose
(& $aws lambda update-function-configuration --function-name $functionName --description $version) | Write-Verbose

In summary, the script takes the contents of the Nuget package that was just deployed, zips up the relevant parts of it (in this case, everything under the “function” directory) and then uses the AWS CLI and some Octopus parameters to push the zip file to a known Lambda function.

Really the only interesting thing here is that we take the version from the Nuget package and apply it directly to the function in AWS for backwards traceability.

To Be Continued

The steps and processes that I’ve outlined above only describe the key parts of the build and deployment process. A lot of the detail is missing, and that is on purpose, because the entire thing is quite complicated and describing it comprehensively would make this blog post into more of a novel.

It’s already pretty long.

The key things to take away are the .NET Core build and publish process (using the dotnet cmdline application), the .NET Core testing process (again, through the cmdline application) and the simplest way that I’ve found to actually push code to a Lambda function in an automated fashion.

Next week I’ll expand on the testing of the lambda function, specifically around managing and using AWS credentials to provide an end to end test of function logic.


Over two months ago I commented that I was going to rewrite our ELB Logs Processor in C#. You know, a language that I actually like and respect.

No, it didn’t take me two months to actually do the rewrite (what a poor investment that would have been), I finished it up a while ago. I just kept forgetting to write about it.

Thus, here we are.

HTTP Is The Way For Me

To set the stage, let’s go back a step.

The whole motivation for even touching the ELB Logs Processor again was to stop it pushing all of its log events to our Broker via TCP. The reasoning here was that TCP traffic is much harder to reason about, and Logstash has much better support around tracking, retrying and otherwise handling HTTP traffic than it does for TCP, so the switch makes sense as a way to improve the overall health of the system.

I managed to flip the Javascript ELB Logs Processor to HTTP using the Axios library (which is pretty sweet), but it all fell apart when I pushed it into production, and it failed miserably with what looked like memory issues.

Now, this is the point where a saner person might have just thrown enough memory at the Lambda function to get it to work and called it a day. Me? I’d had enough of Javascript at that point, so I started looking elsewhere.

At this point I was well aware that AWS Lambda offers execution of C# via .NET Core, as an alternative to Node.js.

Even taking into account my inexperience with doing anything at all using .NET Core, C# in Lambda was still infinitely preferable to Javascript. The assumption was that the ELB Logs Processor doesn’t actually do much (handle incoming S3 event, read file, send log events to Logstash), so rewriting it shouldn’t take too much effort.

Thus, here we are.

You Got .NET Core In My Lambda!

The easiest way to get started with using C# via .NET Core in AWS Lambda is to install the latest version of the AWS Toolkit and then just use Visual Studio. The toolkit provides a wealth of functionality for working with AWS in Visual Studio, including templates for creating Lambda functions as well as functionality for publishing your code to the place where it will actually be executed.

Actually setting up the Lambda function to run .NET Core is also relatively trivial. Assuming you’ve already set up a Lambda function with the appropriate listeners, it’s literally a combo box in the AWS Dashboard where you select the language/runtime version that your code requires.

Using the integrated toolkit and Visual Studio is a great way to get a handle on the general concepts in play and the components that you need to manage, but it’s not how you should do it in a serious development environment, especially when it comes to publishing.

A better way to do it is to go for something a bit more portable, like the actual .NET Core SDK, which is available for download as a simple compressed archive, with no installation required. Once you’ve got a handle on the command line options, it’s the best way to put something together that will also work on a build server of some description.

I’ll talk about the build and deployment process in more detail in the second post in this series, but for now, it’s enough to know that you can use the combination of the command line tools from the .NET Core SDK along with Visual Studio to create, develop, build, package and publish your .NET Core AWS Lambda function without too much trouble.

At that point you can choose to either create your Lambda function manually, or automate its creation through something like CloudFormation. It doesn’t really matter from the point of view of the function itself.

You Got Lambda In My .NET Core!

From the perspective of C#, all you really need to write is a class with a handler function.

namespace Lambda
{
    public class Handler
    {
        public void Handle(System.IO.Stream e)
        {
            // do some things here
        }
    }
}

A stream is the most basic input type, leaving the handling of the incoming data entirely up to you, using whatever knowledge you have about its format and structure.

Once you have the code, you can compile and publish it to a directory of your choosing (to get artefacts that you can move around), and then push it to the AWS Lambda function. As long as you’ve targeted the appropriate runtime as part of the publish (i.e. a framework dependent deployment targeting the same runtime that you specified in the AWS Lambda function configuration, rather than a standalone, self-contained application), the only other thing to be aware of is how to tell the function where to start execution.

Specifying the entry point for the function is done via the Handler property, which you can set via the command line (using the AWS CLI), via Powershell Cmdlets, or using the AWS dashboard. Regardless of how you set it, it’s in the format “{namespace}::{class}::{function}”, so if you decide to arbitrarily change your namespaces or class names, you’ll have to keep your lambda function in sync or it will stop working the next time you publish your code.
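
For example, using the AWS CLI, setting the handler for the class above would look something like this:

aws lambda update-function-configuration --function-name {function-name} --handler "Lambda::Handler::Handle"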

At this point you should have enough pieces in place that you can trigger whatever event you’re listening for (like a new S3 file appearing in a bucket) and track the execution of the function via CloudWatch or something similar.

Helpful Abstractions

If you look back at the simple function handler above, you can probably conclude that working with raw streams isn’t the most efficient way of handling incoming data. A far more common approach is to get AWS Lambda to deserialize your incoming event for you. This allows you to use an actual structured object of your choosing, ranging from the AWS events that are supported out of the box (like an S3 event), through to a POCO of your own devising.

In order to get this working all you have to do is add a reference to the Amazon.Lambda.Serialization.Json nuget package, and then annotate your handler with the appropriate attribute, like so:

namespace Lambda
{
    public class Handler
    {
        [LambdaSerializer(typeof(Amazon.Lambda.Serialization.Json.JsonSerializer))]
        public void Handle(Amazon.Lambda.S3Events.S3Event e)
        {
            // Do stuff here
        }
    }
}

In the function above, I’ve also made use of the Amazon.Lambda.S3Events nuget package in order to strongly type my input object, because I know that the only listener configured will be for S3 events.

I’m actually not sure what you would have to do to create a function that handles multiple input types, so I assume the model is to just fall back to a more generic input type.

Or maybe to have multiple functions or something.

To Be Continued

I think that’s as good a place as any to stop and take a breath.

Next week I’ll continue on the same theme and talk about the build, package and publish process that we put together in order to actually get the code deployed.


That thing that I complained about with Logstash and HTTP outputs a few weeks ago? Turns out it was actually a bug after all!

Not only that, but it was totally a race condition relating to the way the success/failure handlers of the Manticore HTTP library were being used, which was what I guessed, so go me. I guess even a broken clock is right twice a day.

Long story short, don’t fire off an asynchronous request until you’ve got your asynchronous handlers in place to handle the result.

In their typical awesome style, the maintainers of the HTTP output plugin have already fixed the issue and released a new version (4.3.2) for the masses to consume.

Unfortunately, I don’t think this new version has been rolled into an official Logstash release yet, so if I want to use it I’ll have to make a custom redistributable package to use in Octopus Deploy. Not a huge issue, just a small pain. I’m sure it will be rolled into the official package soon.

Other than that piece of news, I don’t really have anything else, so I’m going to take this opportunity to muse on some of my experiences with asynchronous code.

Hellish Code

Depending on how it’s written, asynchronous code can be hard to reason about, and even harder to test.

Vanilla Node is a good example of how bad asynchronous code can get. It’s incredibly easy to get into callback hell, where this function calls into this function, which has a handler that calls this function, which does this thing and so on. It’s even worse when you realise that the easiest way to accumulate state between so many asynchronous functions is to close over variables that are in scope when each function is created, which leaves you in a position where it’s hard to break things apart into nice, self-contained, easily testable components.

*shudder*

Encapsulating asynchronous operations into some sort of structure is a good counter to callback hell, and once we started using Promises in the ELB logs processor Javascript, I found the quality of the code increased rapidly. It was still Javascript, but at least you could reason about it.

Continuing on with a language that I actually like, C#, you’ve got a similar concept at play with tasks and the TPL (Task Parallel Library).

Before the TPL, you would probably just use raw threads to offload work (or maybe dedicated asynchronous functions or classes like BackgroundWorker), and you could easily get into the same sort of problems that you get into in vanilla Node. I mean, you had the benefit of working with a nice, structured language, but it was still difficult and prone to error.

The TPL introduced encapsulations around asynchronous operations called Tasks, which helped write asynchronous code that actually made sense.

It did something else that was incredibly helpful as well.

It exposed the Task schedulers.

All The Control

Encapsulating operations as Tasks is all well and good, and in most cases you’re probably looking to run those operations asynchronously on another thread (or a pool of threads perhaps). I mean, that’s literally exactly what the default TaskScheduler does in C# (the ThreadPoolTaskScheduler).

Of course, that makes testing hard, because now your main thread (which the test is running on) has to have some logic in it to wait for outstanding operations to complete before you can assert things. This gets even more difficult when working with ViewModels, which require that certain updates be performed on the main/UI thread in order to work properly.

This is particularly important for WPF and PropertyChanged events, which only work properly if you raise them on the UI thread.

The good news is that you can write your own TaskScheduler and then inject it for testing purposes, assuming you’ve written your code in such a way that your Tasks are created with a specified scheduler.

I’ve written a TaskScheduler that is not threaded at all (i.e. it just executes the task inline), one that accumulates operations to be executed at will (good for testing transitions) and another that allows you to run tasks on the WPF Dispatcher construct (i.e. the main/UI thread in WPF) and I use all of them frequently while developing applications that rely on Tasks to do things.

They aren’t even all that hard to write really, as the following NonThreadedTaskScheduler shows:

using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public sealed class NonThreadedTaskScheduler : TaskScheduler
{
    protected override void QueueTask(Task task)
    {
        TryExecuteTask(task);
    }

    protected override bool TryExecuteTaskInline(Task task, bool taskWasPreviouslyQueued)
    {
        return TryExecuteTask(task);
    }

    protected override IEnumerable<Task> GetScheduledTasks()
    {
        return Enumerable.Empty<Task>();
    }

    public override int MaximumConcurrencyLevel
    {
        get { return 1; }
    }
}
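
To actually benefit from this, Tasks need to be created with an explicitly supplied scheduler, rather than whatever the ambient default happens to be. A minimal sketch of the sort of thing I mean (the Worker class here is purely illustrative):

using System;
using System.Threading;
using System.Threading.Tasks;

public class Worker
{
    private readonly TaskScheduler _scheduler;

    public Worker(TaskScheduler scheduler)
    {
        _scheduler = scheduler;
    }

    public Task DoWorkAsync(Action work)
    {
        // Because the scheduler is injected, a test can supply the NonThreadedTaskScheduler
        // above and the work will execute inline, with nothing to wait for afterwards
        return Task.Factory.StartNew(work, CancellationToken.None, TaskCreationOptions.None, _scheduler);
    }
}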

Without the ability to change how the tasks are executed, none of that would have been possible, and while they would still be useful, they would be much harder to actually use and test.

Coincidentally, this is one of the reasons why I’m a little bit leery of the async/await keywords in C#, because I don’t think they expose any of the scheduling functionality, instead assuming that you are happy with the magical way that they work. Honestly though, my knowledge of async/await is fairly limited, so that might not be entirely correct.

Conclusion

To conclude, asynchronous code is hard and its bugs are subtle and hard to detect.

It’s also one of the easiest and highest impact ways to wring more performance out of a piece of software. If you’re doing a lot of things that naturally involve waiting (disk writes, network calls, etc), you stand to gain an incredible amount of time by just doing other things during those times when you’re waiting on something external and out of your control.

Double edged sword really.

Like most software.


Almost a month ago, I made a bold claim that switching our Logstash instances to HTTP outputs for all log event traffic wouldn’t be all that difficult.

Oh god, how wrong I was.

In some weird display of prescience however, these were the exact words I used:

Conceptually, changing our Logstash instances to use a HTTP output instead of TCP should also be easy. Just need to change some configuration and deploy through Octopus. Keep in mind I haven’t actually done it yet, but it feels simple enough.

What Is Logstash?

Theoretically, I was entirely correct. It was as simple as changing the TCP output to HTTP. Same host, no fancy settings, no real complexity in the configuration.

For good measure, I upgraded the version of Logstash to 5.4.1 as well (from 5.2.0), which was a whole thing that I’m not going to get into right now. Suffice to say, I had a very recent version of Logstash installed when it all went to tatters.

We’ve got a pretty good deployment pipeline for distributing Logstash to our various API instances and whatnot, so after I made the changes, I checked in, builds kicked off, tests were run, deployments were made, and it was time to check whether or not CI was still working.

It was fine. For the first time, I could see the traffic caused by Logstash on the Broker using CloudWatch, which was nice.

Promote to Staging, same thing. Worked fine, traffic visible, all seemed well.

Promote to Prod, same thing. Worked fine, traffic visible, all is well.

And that’s the end of the story.

Baby Don’t Hurt Me

Ha ha ha ha ha ha no.

What I really should have done is left Staging run for a few days before promoting anything. I guess I was just a little too eager.

It wasn’t long after promoting to production that I noticed data disappearing from our dashboards and visualizations. At first glance, it all seemed to be isolated to one of our services in particular, so I started digging.

  • Log files were being appended to locally, so not the issue we encountered recently with our Nancy APIs. This problem had different symptoms anyway (no logs at all, instead of just no application logs).
  • Restarting the local instance of Logstash seemed to get a few events through into the ELK stack, so the communication channels were intact (i.e. traffic wasn’t being blocked at the network level), but only a few hundred.
  • Broker seemed to be healthy and handling traffic well, so it didn’t seem like a resource issue on the ELK stack side.

Based on the evidence, it looked like the problem was entirely the fault of the local Logstash instances, but if that was the case, why not more failures? At this point in the investigation, other Logstash instances appeared to be pushing their log events into the ELK stack without issue.

There didn’t even seem to be anything special about the logs going through the failing Logstash instances either. It was all pretty vanilla IIS, Performance Counter and API application logs. Nothing fancy.

Time to dig into Logstash itself.

The good news was that Logstash 5.4.1 features a really nice internal API, granting access to a bunch of information about the internal workings of the application.

Using a little bit of Powershell you can get statistics about the Logstash pipeline.

(Invoke-WebRequest http://localhost:9600/_node/stats/pipeline).Content

On one of the broken machines, our output looked something like this:

{ "host" : "a-hostname", "version" : "5.4.1", "http_address" : "127.0.0.1:9600", "id" : "a82b92de-d402-4b99-a569-a859c8e00d2b", "name" : "a-name", "pipeline" : { "events" : { "duration_in_millis" : 11293, "in" : 377, "filtered" : 375, "out" : 250, "queue_push_duration_in_millis" : 28027 }, "plugins" : { "inputs" : [ { "id" : "get-events-from-api-iis-logs", "events" : { "out" : 129, "queue_push_duration_in_millis" : 14075 }, "name" : "file" }, { "id" : "get-events-from-system-stats", "events" : { "out" : 248, "queue_push_duration_in_millis" : 13952 }, "name" : "file" }, { "id" : "get-events-from-api-application-logs", "events" : { "out" : 0, "queue_push_duration_in_millis" : 0 }, "name" : "file" } ], "filters" : [ { "id" : "break-iis-fields-down-via-csv", "events" : { "duration_in_millis" : 1135, "in" : 128, "out" : 128 }, "name" : "csv" }

# More filters here, cut for brevity

], "outputs" : [ { "id" : "send-events-to-elk-broker-over-http", "events" : { "duration_in_millis" : 5441, "in" : 375, "out" : 250 }, "name" : "http" } ] }, "reloads" : { "last_error" : null, "successes" : 0, "last_success_timestamp" : null, "last_failure_timestamp" : null, "failures" : 0 }, "queue" : { "type" : "memory" }, "id" : "main" } }

Super dense, but the interesting part is the HTTP output at the bottom.

See how it has 375 events in, but only 250 out?

It hangs like that forever.

Restarting Logstash causes it to start again and send a similar number of events, but eventually hang in exactly the same manner.

Don’t Hurt Me, No More

I’ll be honest, I tried a lot of things to get this to work, and I just couldn’t.

In fact, the problem actually got worse, because other Logstash instances started failing in the same way, and more and more data dropped off our dashboards and visualizations.

In the end, I had no choice but to roll back to the TCP output (which was flaky, but in a known way) and to log an issue on the Logstash Github describing my problem in the hope that someone would be able to look into it and help out.

Why not look myself?

Well I started, but Logstash is Ruby code running through Java (which is then running on Windows via Powershell) and getting into that stack from a development point of view looked pretty complicated and time-consuming.

I had a bit of a look at the Ruby code of the HTTP output and the only thing I could think of was a potential race condition in the construction of the HTTP requests.

# Create an async request
request = client.background.send(@http_method, url, :body => body, :headers => headers)
request.call # Actually invoke the request in the background

request.on_success do |response|
  begin
    # bunch of code dealing with success
  end
end

request.on_failure do |exception|
  begin 
    # bunch of code dealing with failures
  end
end

I don’t know how the HTTP library works (I think it’s Manticore?), but if it doesn’t support the registering (and correct execution) of event handlers AFTER a call has completed, then it’s possible for the request.call to finish and the success and failure handlers to never run.

Part of the success handler’s responsibilities is to inform the rest of the Logstash pipeline that the HTTP output is done with its events, so that sort of thing could easily lead to a deadlock.

Again, I don’t know Ruby at all, so that is complete conjecture on my part.

Conclusion

This was another one of those issues that just came out of nowhere and stole more of my time than I would like to admit to. I would still like to switch all of our log event traffic to HTTP (for the same reasons I went through last time), but I have other things to do for now. At least it’s still working, even though the TCP input in Logstash sometimes just stops working. It doesn’t happen very often, and I have some monitoring around it, so it’s not the end of the world.

To be honest, a smarter man would have seen that it was not working with no clear solution, rolled back to the TCP output and moved on. Probably within a few hours.

I am not a smart man.

I am, however, ridiculously stubborn, so I wailed away at this problem until it stole my will to live, then I gave up.


Nancy is a pretty great, lightweight, low-ceremony API framework for C#/.NET. We’ve used it for basically every API that we’ve created since I started working here, with the exception of our Auth API, which uses Web API because of a dependency on the ASP Identity Management package.

In all the time that I’ve been using Nancy, I’ve encountered very few issues that weren’t just my own stupid fault. Sure, we’ve customised pieces of its pipeline (like the bootstrapper, because we use Ninject for dependency injection) but all in all its default behaviour is pretty reliable.

In fact, I can only think of one piece of default behaviour that wasn’t great, and that was the way that errors were handled during initialization when running inside IIS.

The Setup

When you use Nancy through IIS/ASP.NET, you add a reference to a handler class in your web.config file, which tells IIS how it should forward incoming requests to your code.

After installing the Nancy.Hosting.AspNet package, your web.config will contain some stuff that looks like this:

<?xml version="1.0" encoding="utf-8"?>
<configuration>
    <system.web>
        <compilation debug="true" targetFramework="4.5" />
        <httpRuntime targetFramework="4.5" requestPathInvalidCharacters="&lt;,&gt;,%,&amp;,?"/>
        <httpHandlers>
            <add verb="*" type="Nancy.Hosting.Aspnet.NancyHttpRequestHandler" path="*" />
        </httpHandlers>
    </system.web>

    <system.webServer>
        <validation validateIntegratedModeConfiguration="false" />
        <httpErrors existingResponse="PassThrough" />
        <handlers>
            <add name="Nancy" verb="*" type="Nancy.Hosting.Aspnet.NancyHttpRequestHandler" path="*" />
        </handlers>
    </system.webServer>
</configuration>

There are two sections here (system.web and system.webServer) because different versions of IIS read their handler configuration from different places: the system.web/httpHandlers entry is used by older versions of IIS (and the classic pipeline), while the system.webServer/handlers entry is used by IIS 7+ in integrated mode.

What the configuration means is that for every incoming request (verb=*, path=* in the xml), IIS will simply forward the request to the NancyHttpRequestHandler, where it will do Nancy things.

This worked fine for us until we had an issue with our bootstrapper initialization. Specifically, our bootstrapper was throwing exceptions during creation (because it was trying to connect to a database which wasn’t available yet or something) and when that happened, it would stop the webservice from ever starting. In fact, it would be non-functional until we restarted the IIS application pool.

The root cause here was in the NancyHttpRequestHandler and the way it interacted with IIS. Basically, IIS would create one of these classes, which would trigger its static constructor. If that encountered an error, then the whole thing would be broken, never to recover.

The fix was relatively simple. Create a custom request handler (based off the default one) that had some logic in it to lazy load the Nancy bootstrapper/engine with appropriate error handling. The end result was that each request that failed to initialize would fail as expected (with a 500 or something), but the first one to succeed would cache the result for any subsequent requests.
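
Conceptually, the handler ends up looking something like the following sketch (this is not our actual implementation, and it elides the part where the request is actually handed over to the Nancy engine):

public class TolerantNancyHttpRequestHandler : System.Web.IHttpHandler
{
    private static Nancy.INancyEngine _engine;
    private static readonly object _lock = new object();

    public bool IsReusable { get { return true; } }

    public void ProcessRequest(System.Web.HttpContext context)
    {
        // If initialization throws, this request fails (and IIS returns a 500), but the
        // next request gets another chance, instead of the application pool being
        // permanently broken until someone restarts it
        var engine = GetOrCreateEngine();

        // ... hand the request over to the engine, as the default handler does ...
    }

    private static Nancy.INancyEngine GetOrCreateEngine()
    {
        if (_engine != null) return _engine;

        lock (_lock)
        {
            if (_engine == null)
            {
                // Bootstrapper here is the application's own Nancy bootstrapper
                var bootstrapper = new Bootstrapper();
                bootstrapper.Initialise();
                _engine = bootstrapper.GetEngine();
            }
        }

        return _engine;
    }
}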

With that solution in place our webservices became a little bit more reliable and tolerant of transient errors during startup.

The Symptom

So, in summary, Nancy good, we made a small customisation to make startup more reliable.

Obviously the story doesn’t end there.

Not too long ago we encountered a bug in one of our APIs where it was returning errors, but there were no associated error logs from the application in our ELK stack.

We could see the errors in our ELB logs (purely as HTTP response codes), but we didn’t get any application level logs showing what the root issue was (i.e. the exception). There were other error logs (some transient problems occurring during service startup), so we assumed that there was some problem with our application logging where it wasn’t correctly reporting on errors that occurred as a result of web requests.

Reproducing the bug locally, the log output showed the requests being logged correctly, and at the correct level.

It was all very mysterious.

The Source

Our first instinct was that the data ingress into the ELK stack was failing. I’d recently been working on the stack, so I naturally assumed that it was my fault, but when we investigated, we discovered that the ELK stack was working exactly as expected. There were no errors indicating that an event had been received and then rejected for some reason (Elasticsearch field mapping conflicts are the most common).

Digging deeper, we checked the local log files on the staging webservice machines and discovered that the log events were missing altogether, having never been logged at all. Even when we caused an error on purpose, nothing was being logged into the local log file.

The answer lay in the custom request handler we implemented. It had allowed the service to remain functional (where previously it would have crashed and burned), but had an unfortunate side effect.

The sequence of actions looked like this:

  1. IIS receives a request
  2. IIS forwards request to custom request handler
  3. Custom request handler initialises itself, creating a bootstrapper
  4. Bootstrapper creates logger configuration, including a custom Sink which opens a write stream to a file (named based on date)
  5. An error occurs (can’t initialize DB because the DNS record does not exist yet, or something equally transient)
  6. Bootstrapper initialization fails. Sink is disposable with a finalizer, so it will be cleaned up eventually, just not straightaway
  7. Request handler initialization fails, failing the request
  8. Another request is received
  9. IIS does the same thing
  10. New bootstrapper is created
  11. New Sink created, pointing at same file (which is already locked)
  12. Initialization succeeds, request returns successfully
  13. Sink now throws an error every time it is logged to, because of the file lock
  14. Serilog discovers errors with the sink, so it stops logging to preserve application health
  15. Original Sink disposes of itself and releases file lock
  16. Second Sink starts functioning again, but Serilog has stopped logging to it, so nothing happens

The interesting thing here is that the Sink does not lock the file during its constructor because its core behaviour is to roll files based on both date and size, so every time you call it, it dynamically determines where it should write to. This meant it was created successfully, but could not write any events.

Serilog, being the good logging framework that it is, was catching those errors and stopping them from breaking the application. Unfortunately, because that Sink was the only place where we had visible output, Serilog itself had no way to tell us that it was experiencing errors. During the investigation we actually enabled the Serilog self log, and it was reporting all sorts of useful things, which turned out to be critical in actually diagnosing the problem.
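
Turning the self log on is a one-liner; something like this (where you send the output is up to you, the important part is that it goes somewhere you can actually see):

// Route Serilog's internal error reporting somewhere visible
Serilog.Debugging.SelfLog.Enable(msg => System.Diagnostics.Trace.WriteLine(msg));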

Basically, we had a misbehaving sink and Serilog was protecting us from ourselves.

The Solution

We fixed the problem by moving the initialization of the logging earlier, enforcing that it only happens once per process by using a lazily evaluated static property on the request handler, which was how some of the Nancy properties were already being handled.
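
In sketch form, it looks something like this (using the standard rolling file sink as a stand-in for our custom one, and a made-up handler class name):

using System;
using Serilog;

public class CustomNancyHttpRequestHandler
{
    // Lazily evaluated, so the logger (and its file handle) is created at most once
    // per process, no matter how many times the bootstrapper gets constructed
    private static readonly Lazy<ILogger> _logger = new Lazy<ILogger>(
        () => new LoggerConfiguration()
            .WriteTo.RollingFile("logs\\application-{Date}.log")
            .CreateLogger());

    public static ILogger Logger
    {
        get { return _logger.Value; }
    }
}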

This fixed the problem, but looking back I think there were probably a few other ways in which we could have tackled it that would have been better:

  • We could have associated some sort of unique ID with the logger (via the bootstrapper) guaranteeing no file conflicts
  • We could have changed the Sink to handle errors that occur while accessing its desired log file, by generating and logging to a different file
  • We could have handled the fact that the Sink was disposable, and its lifetime should be tied to the bootstrapper as expected

I’ll probably implement at least the second option at some point in the future, just to make the Sink more robust in the face of unexpected circumstances.

Conclusion

The interesting thing about this whole problem (with the custom Sink and file locking) was that we had actually anticipated that it would be a problem when we initially implemented the Sink. IIS has a tendency to run two applications in parallel whenever it does a recycle, so we knew that there would be periods when two processes might be trying to write to the same location, so we implemented a process ID based prefix for every file. Unfortunately, that approach is remarkably ineffective when everything is happening within the same process.

The hardest part in this whole adventure was trying to determine the cause of a problem when logging was the problem.

Once you get out of the development environment, and into deployed software, logging is pretty much all you have. When that’s gone, diagnosing problems becomes exponentially more difficult.

Like picking up broken glass in the dark.

Sure, it’s possible, but you’ll probably get cut.