The Case Of The Missing Log Events

June 27. 2017 0 Comments

Nancy is a pretty great, low weight, low ceremony API framework for C#/.NET. We’ve used it for basically every API that we’ve created since I started working here, with the exception of our Auth API, which uses Web API because of a dependency on the ASP Identity Management package.

In all the time that I’ve been using Nancy, I’ve encountered very few issues that weren’t just my own stupid fault. Sure, we’ve customised pieces of its pipeline (like the bootstrapper, because we use Ninject for dependency injection) but all in all its default behaviour is pretty reliable.

In fact, I can only think of one piece of default behaviour that wasn’t great, and that was the way that errors were handled during initialization when running inside IIS.

The Setup

When you use Nancy through IIS/ASP.NET, you add a reference to a handler class in your web.config file, which tells IIS how it should forward incoming requests to your code.

After installing the Nancy.Hosting.AspNet package, your web.config will contain some stuff that looks like this:

<?xml version="1.0" encoding="utf-8"?>
<configuration>
    <system.web>
        <compilation debug="true" targetFramework="4.5" />
        <httpRuntime targetFramework="4.5" requestPathInvalidCharacters="&lt;,&gt;,%,&amp;,?"/>
        <httpHandlers>
            <add verb="*" type="Nancy.Hosting.Aspnet.NancyHttpRequestHandler" path="*" />
        </httpHandlers>
    </system.web>

    <system.webServer>
        <validation validateIntegratedModeConfiguration="false" />
        <httpErrors existingResponse="PassThrough" />
        <handlers>
            <add name="Nancy" verb="*" type="Nancy.Hosting.Aspnet.NancyHttpRequestHandler" path="*" />
        </handlers>
        <security>
    </system.webServer>
</configuration>

There are two sections here (system.web and system.webserver) for different versions of IIS (if I remember correctly, the top one is for the IISExpress web server in Visual Studio and the other one is for IIS7+? I’m not 100% certain).

What the configuration means is that for every incoming request (verb=*, path =* in the xml), IIS will simply forward the request to the NancyHttpRequestHandler, where it will do Nancy things.

This worked fine for us until we had an issue with our bootstrapper initialization. Specifically, our bootstrapper was throwing exceptions during creation (because it was trying to connect to a database which wasn’t available yet or something) and when that happened, it would stop the webservice from ever starting. In fact, it would be non-functional until we restarted the IIS application pool.

The root cause here was in the NancyHttpRequestHandler and the way it interacted with IIS. Basically, IIS would create one of these classes, which would trigger its static constructor. If that encountered an error, then the whole thing would be broken, never to recover.

The fix was relatively simple. Create a custom request handler (based off the default one) that had some logic in it to lazy load the Nancy bootstrapper/engine with appropriate error handling. The end result was that each request that failed to initialize would fail as expected (with a 500 or something), but the first one to succeed would cache the result for any subsequent requests.

With that solution in place our webservices became a little bit more reliable and tolerant of transitory errors during startup.

The Symptom

So, in summary, Nancy good, we made a small customisation to make startup more reliable.

Obviously the story doesn’t end there.

Not too long ago we encountered a bug in one of our APIs where it was returning errors, but there were no associated error logs from the application in our ELK stack.

We could see the errors in our ELB logs (purely as HTTP response codes), but we didn’t get any application level logs showing what the root issue was (i.e. the exception). There were other error logs (some transient problems occurring during service startup), so we assumed that there was some problem with our application logging where it wasn’t correctly reporting on errors that occurred as a result of web requests.

Reproducing the bug locally, the log output showed the requests being logged correctly, and at the correct level.

It was all very mysterious.

The Source

Our first instinct was that the data ingress into the ELK stack was failing. I’d recently been working on the stack, so I naturally assumed that it was my fault, but when we investigated, we discovered that the ELK stack was working exactly as expected. There were no errors indicating that an event had been received and then rejected for some reason (Elasticsearch field mapping conflicts are the most common).

Digger deeper, we checked the local log files on the staging webservice machines and discovered that the log events were missing altogether, having never been logged at all. Even when we caused an error on purpose, nothing was being logged into the local log file.

The answer lay in the custom request handler we implemented. It had allowed the service to remain functional (where previously it would have crashed and burned), but had an unfortunate side effect.

The sequence of actions looked like this:

IIS receives a request
IIS forwards request to custom request handler
Custom request handler initialises itself, creating a bootstrapper
Bootstrapper creates logger configuration, including a custom Sink which opens a write stream to a file (named based on date)
An error occurs (can’t initialize DB because the DNS record does not exist yet, or something equally transient)
Bootstrapper initialization fails. Sink is disposable with a finalizer, so it will be cleaned up eventually, just not straightaway
Request handler initialization fails, failing the request
Another request is received
IIS does the same thing
New bootstrapper is created
New Sink created, pointing at same file (which is already locked)
Initialization succeeds, request returns successfully
Sink now throws an error every time it is logged to, because of the file lock
Serilog discovers errors with sink, so stops logging to preserve application health
Original Sink disposes of itself and releases file lock
Second Sink starts functioning again, but Serilog has stopped logging to it, so nothing happens

The interesting thing here is that the Sink does not lock the file during its constructor because its core behaviour is to roll files based on both date and size, so every time you call it, it dynamically determines where it should write to. This meant it was created successfully, but could not write any events.

Serilog, being the good logging framework that it is, was catching those errors and stopping them from breaking the application. Unfortunately, because that Sink was the only place where we had visible output, Serilog itself could not say that it was experiencing errors. During the investigation we actually enabled the Serilog self log, and it was reporting all sorts of useful things, and was in critical in actually diagnosing the problem.

Basically, we had a misbehaving sink and Serilog was protecting us from ourselves.

The Solution

We fixed the problem by moving the initialization of the logging earlier, enforcing that it only happes once per process by using a lazily evaluated static property on the request handler, which was how some of the Nancy properties were already being handled.

This fixed the problem, but looking back I think there were probably a few other ways in which we could have tackled it that would have been better:

We could have associated some sort of unique ID with the logger (via the bootstrapper) guaranteeing no file conflicts
We could have changed the Sink to handle errors that occur while accessing its desired log file, by generating and logging to a different file
We could have handled the fact that the Sink was disposable, and its lifetime should be tied to the bootstrapper as expected

I’ll probably implement at least the second option at some point in the future, just to make the Sink more robust in the face of unexpected circumstances.

Conclusion

The interesting thing about this whole problem (with the custom Sink and file locking) was that we had actually anticipated that it would be a problem when we initially implemented the Sink. IIS has a tendency to run two applications in parallel whenever it does a recycle, so we we knew that there would be periods when two processes might be trying to write to the same location, so we implemented a process ID based prefix to every file. Unfortunately, that approach is remarkably ineffective when everything is happening within the same process.

The hardest part in this whole adventure was trying to determine the cause of a problem when logging was the problem.

Once you get out of the development environment, and into deployed software, logging is pretty much all you have. When that’s gone, diagnosing problems becomes exponentially more difficult.

Like picking up broken glass in the dark.

Sure, its possible, but you’ll probably get cut.