Target Acquired

October 25. 2017 0 Comments

Posted in:
c#
.NET
serilog
nuget

We have the unfortunate honour of supporting multiple versions of the .NET Framework across our entire software suite.

All of our client side software (i.e. the stuff that is installed on client machines, both desktop and server) tends to target .NET Framework 4.0. Historically, this was the last version available on Windows XP, and up until relatively recently, it was important that we continue to support that ancient and terrifying beast. You know how it works, old software, slow clients, we’ve all been there.

Meanwhile, all of our server side software tends to target .NET Framework 4.5. We really should be targeting the latest .NET Framework (as the deployment environments are fully under our control), but it takes effort to roll that change out (new environments mostly) and it just hasn’t been a priority. Realistically our next major move might be to just start using .NET Core inside Docker images instead, so it might never really happen.

One of the main challenges with supporting both .NET 4.0 and .NET 4.5 is trying to build libraries that work across both versions. Its entirely possible, but is somewhat complicated and unless you understand a bunch of fundamental things, can appear confusing and magical.

Manual Targeting Mode Engaged

The first time we had to support both .NET 4.0 and .NET 4.5 for a library was for Serilog.

Serilog by itself works perfectly fine for this sort of thing. As of version 1.5.9, it supported both of the .NET Framework versions that we needed in a relatively painless manner.

As is almost always the way, we quickly ran into a few situations where we needed some custom components (a new Sink, a custom JSON formatter, some custom destructuring logic), so we wrapped them up in a library for reuse. Targeting the lowest common denominator (.NET 4.0), we assumed that we would be able to easily incorporate the library wherever we needed.

The library itself compiled file, and all of its tests passed.

We incorporated it into one the pieces of software that targeted .NET 4.0 and it worked a treat. All of the custom components were available and worked exactly as advertised.

Then we installed it into a piece of software targeting .NET 4.5 and while it compiled perfectly fine, it crashed and burned at runtime.

What had actually happened was that Serilog changed a few of its public interfaces between 4.0 and 4.5, replacing some IDictionary<TKey, TValue> types with IReadOnlyDictionary<TKey, TValue>. When we created the library, and specifically targeted .NET 4.0, we thought that it would use that version of the framework all the way through, even if we installed it in something targeting .NET 4.5. Instead, when the Serilog Nuget package was installed into the destination via the dependency chain, it came through as .NET 4.5, while our poor library was still locked at 4.0.

At runtime the two disagreed and everything exploded.

Use The Death Dot

The easiest solution would have been to mandate the flip to .NET 4.5. We actually first encountered this issue quite a while ago, and couldn’t just drop support for .NET 4.0, especially as this was the first time we were putting together our common logging library. Building it such that it would never work on our older, still actively maintained components would have been a recipe for disaster.

The good news is that Nuget packages allow for the installation of different DLLs for different target frameworks. Unsurprisingly, this is exactly how Serilog does it, so it seemed like the sane path to engage on.

Targeting specific frameworks is pretty straightforward. All you have to do is make sure that your Nuget package is structured in a specific way:

net40

{any and all required DLLs for .NET 4.0}

net45

{any and all required DLLs for .NET 4.5}

In the example above I’ve only included the options relevant for this blog post, but there are a bunch of others.

When the package is installed, it will pick the appropriate directory for the target framework and add references as necessary. Its actually quite a neat little system.

For us, this meant we needed to add a few more projects to the common logging library, one for .NET 4.0 and another for .NET 4.5. The only thing inside these projects was the CustomJsonFormatter class, because it was the only thing we were interacting with that had a different signature depending on the target framework.

With the projects in place (and appropriate tests written), we constructed a custom nuspec file that would take the output from the core project (still targeting .NET 4.0), and the output from the ancillary projects and structure them in the way that Nuget wanted. We also had to add Nuget package dependencies into the nuspec file as well, because once you’re using a hand-rolled nuspec file, you don’t get any of the nice automatic things that you would if you were using the automatic package creation from a .csproj file.

Target Destroyed

With the new improved Nuget package in place, we could reference it from both .NET 4.0 and .NET 4.5 projects without any issue.

At least when the references were direct anyway. That is, when the Nuget package was referenced directly from a project and the classes therein only used in that context.

When the package was referenced as part of a dependency chain (i.e. project references package A, package A references logging package) we actually ran into the same problem again, where the disconnect between .NET 4.0 and .NET 4.5 caused irreconcilable differences in the intervening libraries.

It was kind of line async or generics, in that once you start doing it, you kind of have to keep doing it all the way down.

Interestingly enough, we had to use a separate approach to multi-targeting for one of the intermediary libraries, because it did not actually have any code that differed between .NET 4.0 and .NET 4.5. We created a shell project targeting .NET 4.5 and used csproj trickery to reference all of the files from the .NET 4.0 project such that both projects would compile to basically the same DLL (except with different targets).

I don’t recommend this approach to be honest, as it looks kind of like magic unless you know exactly what is going on, so is a bit of a trap for the uninitiated.

Regardless, with the intermediary packages sorted everything worked fine.

Summary

The biggest pain throughout this whole process was the fact that none of the issues were picked up at compilation time. Even when the issues did occur, the error messages are frustratingly vague, talking about missing methods and whatnot, which can be incredibly hard to diagnose if you don’t immediately know what it actually means.

More importantly, it wasn’t until the actual code was executed that the problem surfaced, so not only was the issue only visible at runtime, it would happen during execution rather than right at the start.

We were lucky that we had good tests for most of our functionality so we realised before everything exploded in the wild, but it was still somewhat surprising.

We’re slowly moving away from .NET 4.0 now (thank god), so at some stage I will make a decision to just ditch the whole .NET 4.0/4.5 multi-targeting and mark the new package as a breaking change.

Sometimes you just have to move on.

The Case Of The Missing Log Events

June 27. 2017 0 Comments

Nancy is a pretty great, low weight, low ceremony API framework for C#/.NET. We’ve used it for basically every API that we’ve created since I started working here, with the exception of our Auth API, which uses Web API because of a dependency on the ASP Identity Management package.

In all the time that I’ve been using Nancy, I’ve encountered very few issues that weren’t just my own stupid fault. Sure, we’ve customised pieces of its pipeline (like the bootstrapper, because we use Ninject for dependency injection) but all in all its default behaviour is pretty reliable.

In fact, I can only think of one piece of default behaviour that wasn’t great, and that was the way that errors were handled during initialization when running inside IIS.

The Setup

When you use Nancy through IIS/ASP.NET, you add a reference to a handler class in your web.config file, which tells IIS how it should forward incoming requests to your code.

After installing the Nancy.Hosting.AspNet package, your web.config will contain some stuff that looks like this:

<?xml version="1.0" encoding="utf-8"?>
<configuration>
    <system.web>
        <compilation debug="true" targetFramework="4.5" />
        <httpRuntime targetFramework="4.5" requestPathInvalidCharacters="&lt;,&gt;,%,&amp;,?"/>
        <httpHandlers>
            <add verb="*" type="Nancy.Hosting.Aspnet.NancyHttpRequestHandler" path="*" />
        </httpHandlers>
    </system.web>

    <system.webServer>
        <validation validateIntegratedModeConfiguration="false" />
        <httpErrors existingResponse="PassThrough" />
        <handlers>
            <add name="Nancy" verb="*" type="Nancy.Hosting.Aspnet.NancyHttpRequestHandler" path="*" />
        </handlers>
        <security>
    </system.webServer>
</configuration>

There are two sections here (system.web and system.webserver) for different versions of IIS (if I remember correctly, the top one is for the IISExpress web server in Visual Studio and the other one is for IIS7+? I’m not 100% certain).

What the configuration means is that for every incoming request (verb=*, path =* in the xml), IIS will simply forward the request to the NancyHttpRequestHandler, where it will do Nancy things.

This worked fine for us until we had an issue with our bootstrapper initialization. Specifically, our bootstrapper was throwing exceptions during creation (because it was trying to connect to a database which wasn’t available yet or something) and when that happened, it would stop the webservice from ever starting. In fact, it would be non-functional until we restarted the IIS application pool.

The root cause here was in the NancyHttpRequestHandler and the way it interacted with IIS. Basically, IIS would create one of these classes, which would trigger its static constructor. If that encountered an error, then the whole thing would be broken, never to recover.

The fix was relatively simple. Create a custom request handler (based off the default one) that had some logic in it to lazy load the Nancy bootstrapper/engine with appropriate error handling. The end result was that each request that failed to initialize would fail as expected (with a 500 or something), but the first one to succeed would cache the result for any subsequent requests.

With that solution in place our webservices became a little bit more reliable and tolerant of transitory errors during startup.

The Symptom

So, in summary, Nancy good, we made a small customisation to make startup more reliable.

Obviously the story doesn’t end there.

Not too long ago we encountered a bug in one of our APIs where it was returning errors, but there were no associated error logs from the application in our ELK stack.

We could see the errors in our ELB logs (purely as HTTP response codes), but we didn’t get any application level logs showing what the root issue was (i.e. the exception). There were other error logs (some transient problems occurring during service startup), so we assumed that there was some problem with our application logging where it wasn’t correctly reporting on errors that occurred as a result of web requests.

Reproducing the bug locally, the log output showed the requests being logged correctly, and at the correct level.

It was all very mysterious.

The Source

Our first instinct was that the data ingress into the ELK stack was failing. I’d recently been working on the stack, so I naturally assumed that it was my fault, but when we investigated, we discovered that the ELK stack was working exactly as expected. There were no errors indicating that an event had been received and then rejected for some reason (Elasticsearch field mapping conflicts are the most common).

Digger deeper, we checked the local log files on the staging webservice machines and discovered that the log events were missing altogether, having never been logged at all. Even when we caused an error on purpose, nothing was being logged into the local log file.

The answer lay in the custom request handler we implemented. It had allowed the service to remain functional (where previously it would have crashed and burned), but had an unfortunate side effect.

The sequence of actions looked like this:

IIS receives a request
IIS forwards request to custom request handler
Custom request handler initialises itself, creating a bootstrapper
Bootstrapper creates logger configuration, including a custom Sink which opens a write stream to a file (named based on date)
An error occurs (can’t initialize DB because the DNS record does not exist yet, or something equally transient)
Bootstrapper initialization fails. Sink is disposable with a finalizer, so it will be cleaned up eventually, just not straightaway
Request handler initialization fails, failing the request
Another request is received
IIS does the same thing
New bootstrapper is created
New Sink created, pointing at same file (which is already locked)
Initialization succeeds, request returns successfully
Sink now throws an error every time it is logged to, because of the file lock
Serilog discovers errors with sink, so stops logging to preserve application health
Original Sink disposes of itself and releases file lock
Second Sink starts functioning again, but Serilog has stopped logging to it, so nothing happens

The interesting thing here is that the Sink does not lock the file during its constructor because its core behaviour is to roll files based on both date and size, so every time you call it, it dynamically determines where it should write to. This meant it was created successfully, but could not write any events.

Serilog, being the good logging framework that it is, was catching those errors and stopping them from breaking the application. Unfortunately, because that Sink was the only place where we had visible output, Serilog itself could not say that it was experiencing errors. During the investigation we actually enabled the Serilog self log, and it was reporting all sorts of useful things, and was in critical in actually diagnosing the problem.

Basically, we had a misbehaving sink and Serilog was protecting us from ourselves.

The Solution

We fixed the problem by moving the initialization of the logging earlier, enforcing that it only happes once per process by using a lazily evaluated static property on the request handler, which was how some of the Nancy properties were already being handled.

This fixed the problem, but looking back I think there were probably a few other ways in which we could have tackled it that would have been better:

We could have associated some sort of unique ID with the logger (via the bootstrapper) guaranteeing no file conflicts
We could have changed the Sink to handle errors that occur while accessing its desired log file, by generating and logging to a different file
We could have handled the fact that the Sink was disposable, and its lifetime should be tied to the bootstrapper as expected

I’ll probably implement at least the second option at some point in the future, just to make the Sink more robust in the face of unexpected circumstances.

Conclusion

The interesting thing about this whole problem (with the custom Sink and file locking) was that we had actually anticipated that it would be a problem when we initially implemented the Sink. IIS has a tendency to run two applications in parallel whenever it does a recycle, so we we knew that there would be periods when two processes might be trying to write to the same location, so we implemented a process ID based prefix to every file. Unfortunately, that approach is remarkably ineffective when everything is happening within the same process.

The hardest part in this whole adventure was trying to determine the cause of a problem when logging was the problem.

Once you get out of the development environment, and into deployed software, logging is pretty much all you have. When that’s gone, diagnosing problems becomes exponentially more difficult.

Like picking up broken glass in the dark.

Sure, its possible, but you’ll probably get cut.