
Date and Time data structures are always so fun and easy to use, aren’t they?

Nope.

They are, in fact, the devil. No matter how hard I try (and believe me, I’ve tried pretty hard) or what decisions I make, I always find myself facing subtle issues relating to time that go unnoticed until the problem is way harder to fix than it should be. Like say, after your customers have created millions and millions of entities with important timestamps attached to them that aren’t timezone aware.

More recently, we were bitten when we tried to sync fields representing a whole day (i.e. 1 January 2017, no time) through our data synchronization algorithm.

To our surprise, a different day came out of the other end, which was less than ideal.

Dates Are Delicious

During the initial analysis of the tables earmarked to be synced from client databases, we located quite a few fields containing dates with time information. That is, usage of the actual DateTime data structure in SQL Server. As most of us had been bitten in the past by subtle conversion bugs when playing with raw DateTimes in .NET, we made the decision to convert all DateTimes to DateTimeOffsets (i.e. 2017-01-01 00:00+10) at the earliest possible stage in the sync process, using the timezone information on the client server that the sync process was running on.

What we didn’t know was that some of the DateTime fields actually represented whole dates, and they were just represented as that day at midnight because there was no better structure available when they were initially created.

Unfortunately, converting whole days stored as DateTimes into DateTimeOffsets isn’t actually the best idea, because an atomic day’s representation should not change when you move into different timezones. For example, 1 January 2017 in Brisbane does not magically turn into 31 December 2016 22:00 just because you’re in Western Australia. It’s still 1 January 2017.

This is one of the weird and frustrating things about the difference between whole Dates and DateTimes. Technically speaking, a Date as explained above probably should be location aware, especially as the timezone differences get more extreme. The difference between WA and QLD is pretty academic, but there’s a whole day between the US and Australia. If two users were to calculate something like rental arrears in two different places using whole dates, they would probably get two different numbers, which could lead to poor decisions. Of course, from a user’s point of view, the last thing they would expect is to have one day turn into another, or to add a time representation to something they entered as a day using a calendar selector, so all told, it’s confusing and terrible and I hate it.

If you want to get technical, the converted DateTime still represents the same instant in time, so as long as you know what the original offset was, you can use that to revert back to the original value (which is some day at midnight) without too much trouble, and then interpret it as necessary.
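
For example, something like this gets you back to the original value (a minimal sketch, assuming the original +10 offset was recorded or known somewhere):

// The value as it was converted during the sync: 1 Jan 2017 00:00 +10
var converted = new DateTimeOffset(2017, 1, 1, 0, 0, 0, TimeSpan.FromHours(10));

// Even after it has been normalised to UTC (31 Dec 2016 14:00 +0)...
var stored = converted.ToUniversalTime();

// ...knowing the original offset gets you back to the original wall clock value,
// which can then be interpreted as a whole day again.
var original = stored.ToOffset(TimeSpan.FromHours(10));
var wholeDay = original.Date; // 1 Jan 2017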

Of course, that’s when PostgreSQL decided to get involved.

Lost Time

A long time ago when we started the data synchronization project, we decided to use PostgreSQL as the remote store. We did this mostly because PostgreSQL was cheaper to run in AWS via RDS (the licensing costs for SQL Server in RDS were kind of nuts in comparison).

In hindsight, this was a terrible decision.

We might have saved raw money on a month to month basis, but we exposed ourselves to all sorts of problems inherent to the differences between the two database engines, not to mention the generally poor quality of the PostgreSQL tools, at least in comparison to SQL Server Management Studio.

Returning to the date and time discussion; we chose to use Entity Framework (via NPGSQL) as our interface to PostgreSQL and to be honest we pretty much just trusted it to get the database schema right. All of our DateTimeOffsets got mapped to the PostgreSQL data structure timestamp_with_timezone, which looks like it’s pretty much the same thing.

Except it’s not. It’s not the same at all. It actually loses data when storing a DateTimeOffset, and it does this by design.

In PostgreSQL terms, using a timestamp_with_timezone structure actually means “please automatically adjust the data I insert into this field using the given offset, so store it as UTC”. This makes sense, in a way, because strictly speaking, the data still represents the same instant in time, and can be freely adjusted to the user’s current offset as necessary (i.e. show me what the data looks like in +10).

Unfortunately, this approach means that the actual offset the data was inserted with is lost completely.

PostgreSQL has another data type called timestamp_without_timezone, but all it does is ignore the offset completely, while still stripping it out. Less than useful.

To summarise, here is the chain of events:

  • The user enters some data, representing a whole day: 1 Jan 2017
  • The system stores this data in SQL Server as a DateTime: 1 Jan 2017 00:00
  • Our sync process reads the data as a DateTimeOffset, using the local timezone: 1 Jan 2017 00:00 +10
  • The data is pushed to PostgreSQL and stored: 31 Dec 2016 14:00
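
In C# terms, that chain looks something like the following minimal sketch (with the +10 offset hard-coded for clarity):

// What the user entered, stored in SQL Server as a DateTime at midnight.
var entered = new DateTime(2017, 1, 1, 0, 0, 0);

// What the sync process created, using the client server's local timezone.
var synced = new DateTimeOffset(entered, TimeSpan.FromHours(10)); // 1 Jan 2017 00:00 +10

// What PostgreSQL keeps after normalising to UTC; the original offset is gone.
var remote = synced.UtcDateTime; // 31 Dec 2016 14:00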

Technically the data still represents the exact same point in time, but its actual meaning is now compromised. If anyone reads the remote data and assumes it’s still just a date, they are now a full day off, which is terrible.

Daylight savings doesn’t help either, because now the offset is inconsistent, so in NSW sometimes you will see the date as the previous day at 1400 and sometimes at 1300.

I mean honestly, daylight savings doesn’t really help anyone anyway, but that’s a different story.

That Approach Is Sooooo Dated

For whole dates, the only realistic solution is to treat them exactly as they should be treated, as dates with no time component.

Completely obvious in retrospect.

Both SQL Server and PostgreSQL have a Date data type which does exactly what it says on the box and leaves no room for misinterpretation or weird conversion errors.

Of course, .NET still only has the good old DateTime and DateTimeOffset data types, so there is room for shenanigans there, but at least the storage on either end would be correct.
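
As a very rough sketch of what the .NET side of that might look like with Entity Framework (the entity and property names here are purely illustrative):

using System;
using System.Data.Entity;

public class Agreement
{
    public int Id { get; set; }

    // Still a DateTime in .NET, but only the date component is meaningful.
    public DateTime StartDate { get; set; }
}

public class ExampleContext : DbContext
{
    public DbSet<Agreement> Agreements { get; set; }

    protected override void OnModelCreating(DbModelBuilder modelBuilder)
    {
        // Force the column to a pure date, so no time component (or offset)
        // can sneak in at the storage level.
        modelBuilder.Entity<Agreement>()
            .Property(a => a.StartDate)
            .HasColumnType("date");
    }
}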

For dates that actually do come with a time component (like an appointment), you have to really understand whether or not the time should be able to be interpreted in another timezone. Taking the appointment example, it might be reasonable to think that a company using your software product to track their appointments might exist in both NSW and QLD. The person executing the appointment would want to schedule it in their local time, but a remote administration person might want to know what time the appointment was in their own local time so that they know not to transfer calls.

SQL Server is fine in this respect, because a DateTimeOffset is perfect, freely able to be converted between one timezone and another with no room for misinterpretation.

In PostgreSQL, the timestamp_with_timezone data type might be good enough, assuming you don’t need to know with absolute certainty what the original offset was (and thus the time in the original creators context). If you do need to know that (maybe for business intelligence or analytics) you either need to know some information about the creator, or you should probably just use a timestamp data type, convert it to UTC yourself and store the original offset separately.
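
A minimal sketch of that last option (the entity and its properties are hypothetical, not our actual schema):

using System;

public class Appointment
{
    public int Id { get; set; }

    // Always stored as UTC, which maps cleanly onto a plain timestamp column.
    public DateTime ScheduledAtUtc { get; set; }

    // The offset the record was created with, stored separately.
    public short CreatedOffsetMinutes { get; set; }

    // Reconstruct the original wall clock value when you need the creator's context.
    public DateTimeOffset ScheduledAtOriginal =>
        new DateTimeOffset(DateTime.SpecifyKind(ScheduledAtUtc, DateTimeKind.Utc))
            .ToOffset(TimeSpan.FromMinutes(CreatedOffsetMinutes));
}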

Conclusion

Representing time is hard in software. I’ve heard some people say this is because the fundamentally explicit structures that we have to have in software are simply unable to represent such a fluid and mutable construct, and I’m not sure if I disagree.

For the case I outlined above, we made some bad decisions with good goals (let’s use DateTimeOffset, it’s impossible to misinterpret!) that built on top of other people’s bad decisions, but at least we understand the problem well enough now to potentially fix it.

Unfortunately, it’s unlikely that the concept of dates and times in software is just going to go away, as it seems pretty important.

The best we can hope for is probably just for existence to end.

Someone would probably want to know when existence is going to end taking daylight savings into account.


And we’re back

It’s time to ring in the new year with some more chatter about our data synchronization algorithm, which honestly, has turned into a pretty complex beast.

The last time I wrote about the algorithm, I explained how we flipped the original implementation of the differencing scan from bottom up to top down. This change represented a hearty improvement over the original implementation, because the data at the “top” of the table (i.e. when ordered by Row Version descending) is far more likely to change than the data at the “bottom”, at least for our clients anyway.

In that post though, I pointed out an inefficiency in the differencing scan, regardless of the direction:

There is still an inherent inefficiency in the design of the scan, such that it will continue to do extra work for no benefit. Consider the situation where the scan has started and has dealt with 4 chunks. It’s a bit down in the table, and the change it originally wanted to find was in chunk 5. In the meantime, the user has deleted data in chunk 2. The process will have to finish its scan of the entire table before it returns to the top and finds that particular change, which is just wasteful. Instead, it should be able to determine if there are any differences in the remainder of the data set (i.e. from the bottom of the current chunk down) and then use that information to decide to reset its position to the top again. We haven’t actually done this bit yet, because it came up after we deployed the first version to our users.

That inherent inefficiency is the topic for today.

Wait, Come Back, You’re Going In The Wrong Direction!

As mentioned above, the problem lies in the fact that once a difference is detected (i.e. one or more missing rows locally or remotely), the scan will just keep going until it finds and fixes said difference (or it hits the “end” of its scan and resets to the “top”). Of course, time does not stop while the scan is happening, and because it happens over a period of hours, with many relatively isolated executions of the algorithm, other changes are still being made to the underlying data.

Additions and modifications are handled by an entirely different process, so we don’t have to worry about them, but hard deletions can still occur at will.

When such a deletion occurs in the section that has already been scanned, but before the original difference that triggered the scan has been located and fixed, the algorithm can get stuck scanning the remainder of the data set for no real benefit. This actually isn’t a huge issue when the data set is small, or even medium sized, as you can just eat the inefficiency and wait. Eventually it will all work itself out.

Unfortunately, as the data set gets larger, the chances of a deletion occurring in the scanned section increases. Also, since changes are more likely to occur at the “top” of the data set, and the differencing scan works top-down, the sections that were most recently scanned are actually the most likely to contain differences. As a result, the algorithm spends longer and longer doing work for no benefit, so the inefficiency gets worse and worse.

Each chunk comparison requires the data to be present in memory in the remote database as well (it’s not an index-only query), so every pointless comparison actually results in more reads, which means more IOPS, which is why we started optimizing in the first place!

Long story short, we have to do something about it.

4665, 4666, 4667! There Are 4667 Rows Both Locally And Remotely, Ah Ah Ah

The good news is that this particular problem is not overly difficult to solve. It will still take some effort, but it’s not some fundamental flaw in the algorithm.

Every time a chunk is completed as part of the differencing scan, we can use a simple count to see whether or not the remaining differences (if there are any) are above or below the current location.

Locally this is trivial, just a query to the DB for count where < current range end.

Getting the same information from the remote requires the introduction of a new endpoint to the API though:

/v1/customers/{customerId}/databases/{databaseId}/tables/{tableName}/count{?aboveRowVersion=123&belowRowVersion=456}

Slightly generalised from our specific use case, but it basically lets you get a count of records:

  • above a certain row version
  • below a certain row version
  • in between two boundary row versions

For the differencing scan, we only really care about the “below a certain row version” use case, to get a count and compare it to the same count from the local data.

If the count is the same, we can safely exit early and flip back to the top of the data set, resetting the differencing scan so that it can pick up the more recent changes.

If the count is different, we just keep going down and repeat the process once we’ve actioned the next chunk.

Nice and easy.
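
In rough C# terms, the early exit check boils down to something like this (a sketch only; the two count delegates stand in for the local database query and the new remote count endpoint):

using System;
using System.Threading.Tasks;

public static class DifferencingScanSketch
{
    public static async Task<bool> ShouldResetToTopAsync(
        Func<long, Task<long>> countLocalBelow,
        Func<long, Task<long>> countRemoteBelow,
        long currentRangeEnd)
    {
        // Count of rows below the current position on each side.
        var localRemaining = await countLocalBelow(currentRangeEnd);
        var remoteRemaining = await countRemoteBelow(currentRangeEnd);

        // Matching counts means there is (probably) nothing left to fix below
        // the current position, so flip back to the top of the data set.
        return localRemaining == remoteRemaining;
    }
}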

Of course, there are still some edge cases. It’s possible (though not likely) to get into a situation where the counts are the same but the data is actually different (a particular combination of locally deleted data and data that was never uploaded successfully) which can throw the whole thing for a bit of a loop, but that could have happened regardless, so we’re still in a better place than we would otherwise be.

Conclusion

It’s been a bit of a hard slog trying to optimise the data synchronization algorithm, and we’re still not really there. Not only that, the algorithm itself has become more and more complex over time (obviously), and is getting pretty hard to reason about.

Annoyingly enough, we haven’t run across that one magical improvement that changes everything. It’s very much been a kind of “death by a thousand cuts” sort of thing, with tens of small optimizations that alleviate the issue slightly. The improvement in this post is a good example of that sort of thing, and pretty much boils down to “don’t do something you don’t have to do”, which isn’t exactly ground-breaking.

Don’t get me wrong, the process is much better than it’s ever been, but we’re still seeing very similar patterns in read IOPS, which is problematic.

It might be that the algorithm itself just doesn’t scale as well as we want it to, and that we might need to flip to a fundamentally different approach. Perhaps something that hooks deeper into the customer data and notifies us of creations/updates/deletions as they occur, rather than us having to poll for the same information.

Still, that sort of change is not something to be embarked on lightly, even disregarding the fallacy of sunk cost.


It’s been a little while since I wrote about our data synchronization algorithm, but it still gets a fair bit of space in my mind on a day to day basis.

Most recently, we put some effort into optimizing it in the face of rising IOPS usage at the database level, which worked well enough to stabilize the system in the face of what we wanted it to do at the time.

Then we told it to sync more data.

Specifically, one of the two biggest data sets in our software which covers the storage of repeatable tasks, like appointments, phone calls, inspections and so on.

Not only is this table one of the largest in raw size, it also features the most churn AND is one of the only tables to feature hard deletes, a challenge that we had to build an entire secondary sync process for.

Funnily enough, it actually worked pretty well at first, but everything started to fray at the edges as we started syncing more and more clients. Same sort of thing as before, pressure on IOPS at the database level, mostly as a result of reads.

With the memory of the last round of optimizations fresh in our minds, it was time to enter the breach once more.

Will To Survive

We prodded at the system for a bit, analysing the traffic patterns to try and determine what might be causing the high read IOPS this time. Being that we’d already investigated similar sorts of things recently, we knew our way around it pretty well, so it wasn’t long before we noticed something suspicious.

Let’s go back a step though.

The core data synchronization algorithm hasn’t changed all that much from the last time I wrote about it, even in the face of the optimizations.

Get Local Count/Version/Timestamp
Get Remote Count/Version/Timestamp
If Local Count/Version/Timestamp == Remote Count/Version/Timestamp
    Do Nothing and Exit
If Local Version/Timestamp == Remote Version/Timestamp BUT Local Count != Remote Count
    Calculate [BATCH SIZE] Using Historical Data
    Get Last Local Position
    Get Next [BATCH SIZE] Local Rows from last position
    Get Min & Max Version in Batch
    Query Remote for Manifest Between Min/Max Local Version
    Create Manifest from Local Batch
    Compare
        Find Remote Not in Local
            Delete from Remote
        Find Local Not in Remote
            Upload to Remote
If Local Version/Timestamp > Remote Version/Timestamp
    Calculate [BATCH SIZE] Using Historical Data
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload to Remote
        Record Result for [BATCH SIZE] Tuning
        If Failure & Minimum [BATCH SIZE], Skip Ahead
If Local Version/Timestamp < Remote Version/Timestamp
    Find Remote > Local Version
    Delete from Remote

We’ve tweaked the mechanisms by which each evaluation of the algorithm determines its local and remote versions (there’s some caching on both the server and client side now, and each request to put/remove data returns new meta information from the server for efficiency) but that’s about it.

This time when we analysed the traffic, we noticed that there were a lot of calls to get table chunk manifests, an important part of the diff checker component of the algorithm, which is primarily meant to remove rows that have been deleted locally from the remote (i.e. hard deletes rather than the much easier to handle soft deletes, which are really just updates).

The problem with there being so many manifest calls is that they are quite expensive, especially for large tables.

Each manifest call requires a database query that can only use the first two components of the clustered index (which is client segmentation data), but must then scan through the remaining data in order to get the “chunk” needed for analysis. This can mean a scan of millions of rows, which is quite intense on read IOPS, because all of that data needs to be brought into memory in order to be evaluated.

But why was the number of manifest calls a problem now?

Sneaky Corporation

It all ties back into the data that we were syncing this time. As I mentioned at the beginning of this article, not only is it a huge table, it also features hard-deletes as the norm, rather than as exceptions. These few things added together create a bit of a perfect storm with regards to diff checker activity, which is why the system was starting to strain as a result of the massive increase in manifest calls.

Conceptually the diff checker part of the algorithm is a batched scan. It starts on one end of the data set, and gets chunks of data sequentially from both locations (local and remote), analysing and reacting as appropriate until it either gets back in sync or it reaches the other end, when it then wraps back around to its start point again.

Here’s where we made a mistake that seems obvious in hindsight.

In almost all cases, changes (even hard deletes) are more likely to occur at the top end of the data set, rather than the bottom.

Our scan?

Starts at the bottom and works its way up.

Thus, any time something was deleted from close to the top of the data set, the diff checker would have to scan through the entire table before finding and fixing it. This means more reads (to get the manifest chunks) which means higher IOPS.

This One Goes To Eleven

It really didn’t take a huge amount of effort to flip the diff checker algorithm to go top down instead of bottom up.

In fact, because of the way in which we’d built the code in the first place (using the strategy pattern to implement the actual chunking algorithm) it was as simple as writing a new class and binding it up instead of the old one via our dependency injection container.

The top down logic is basically the same as it is for bottom up, just inverted.

Starting at the lower boundary of the last chunk (or the very top of the table if the state was reset), get the next [BATCH SIZE] chunk of rows from the local in a downwards direction. Using that chunk, find the extremities (top and bottom) and ask the remote for an equivalent chunk. Contrast and compare, react accordingly and then remember where you were up to for next time. If the algorithm reaches a point where the local and remote are identical, reset all state and do nothing.
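
Purely as an illustration, here is a toy, in-memory version of one execution of the top down scan, where rows are represented by nothing but their row versions (the real implementation obviously deals in full manifests and remote API calls, and lives behind a strategy interface):

using System;
using System.Collections.Generic;
using System.Linq;

public static class TopDownScanSketch
{
    // Returns the new lower boundary to remember for next time, or null if
    // there was nothing left to examine and the state should be reset.
    public static long? ScanNextChunk(
        List<long> localVersions,
        HashSet<long> remoteVersions,
        long? lastLowerBoundary,
        int batchSize,
        Action<long> deleteFromRemote,
        Action<long> uploadToRemote)
    {
        // Start just below the last chunk, or at the very top of the table.
        var chunk = localVersions
            .Where(v => lastLowerBoundary == null || v < lastLowerBoundary)
            .OrderByDescending(v => v)
            .Take(batchSize)
            .ToList();

        if (chunk.Count == 0)
        {
            return null; // nothing left locally; reset the state
        }

        // Ask the remote for the equivalent slice, bounded by the chunk's extremities.
        var remoteChunk = new HashSet<long>(
            remoteVersions.Where(v => v >= chunk.Min() && v <= chunk.Max()));

        // Remote rows missing locally have been hard deleted; remove them remotely.
        foreach (var version in remoteChunk.Except(chunk))
            deleteFromRemote(version);

        // Local rows missing remotely were never uploaded; push them up.
        foreach (var version in chunk.Except(remoteChunk))
            uploadToRemote(version);

        // Remember where we got up to for the next execution.
        return chunk.Min();
    }
}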

Nothing super fancy, and an extremely obvious performance optimization in retrospect considering how much more likely changes are to occur at the top end rather than the bottom.

Such Strange Things

There are two tricksy bits though, one specific to the top down approach and the other that was present in the bottom up approach, but is more obvious when working top down.

  1. When you get the local chunk, and you get less than the batch size, you know that you’re probably at the bottom of the table. If you follow the same rules for asking the remote at this point, you risk leaving data in the remote that is not in the local if it’s right at the bottom. Instead, when you reach this point, you have to infer a lower boundary of 0 to make sure you get everything.
  2. There is still an inherent inefficiency in the design of the scan, such that it will continue to do extra work for no benefit. Consider the situation where the scan has started and has dealt with 4 chunks. It’s a bit down in the table, and the change it originally wanted to find was in chunk 5. In the meantime, the user has deleted data in chunk 2. The process will have to finish its scan of the entire table before it returns to the top and finds that particular change, which is just wasteful. Instead, it should be able to determine if there are any differences in the remainder of the data set (i.e. from the bottom of the current chunk down) and then use that information to decide to reset its position to the top again. We haven’t actually done this bit yet, because it came up after we deployed the first version to our users.

The first issue is a bit of a deal breaker (because it means the process can result in an incomplete data synchronization) so we’re lucky that we noticed it during testing and fixed it right away.
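
Continuing the toy sketch from earlier, the fix is just a tweak to the lower boundary used when asking the remote for its slice (again, illustrative only):

// If we got less than a full batch locally we're at the bottom of the table,
// so infer a lower boundary of 0 rather than trusting the local minimum;
// otherwise anything in the remote below that minimum would never be examined.
var lowerBound = chunk.Count < batchSize ? 0 : chunk.Min();
var remoteChunk = new HashSet<long>(
    remoteVersions.Where(v => v >= lowerBound && v <= chunk.Max()));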

The second issue is not as bad because the process still does what it is supposed to do, just slower. Not great from an optimization point of view, so we’ll still fix it, but not exactly a critical issue.

Conclusion

Like I said a little bit earlier, doing the diff check scan from top to bottom is incredibly obvious in retrospect. All the changes are at the top, why would we start at the bottom of a potentially huge data set and waste time and effort analysing a bunch of data that is not likely to be relevant? Of course, that wasn’t actually obvious until we did some analysis on top of other optimizations, but it really does feel like a bit of a boneheaded mistake.

The good news is that when we implemented the changes as described (algorithm from bottom up to top down) we saw a dramatic decrease in traffic, because the diff checker was running fewer times before it found and fixed the changes.

The bad news is that due to point number two above (the diff checker doesn’t know that it should start from the top again when there are more recent changes) the IOPS usage didn’t really change all that much at all.

Still, the algorithm continues to get better and better and we continue to be able to synchronize more and more data without having to pay for additional infrastructure to support it.

Those are some good engineering feels.


We have the unfortunate honour of supporting multiple versions of the .NET Framework across our entire software suite.

All of our client side software (i.e. the stuff that is installed on client machines, both desktop and server) tends to target .NET Framework 4.0. Historically, this was the last version available on Windows XP, and up until relatively recently, it was important that we continue to support that ancient and terrifying beast. You know how it works, old software, slow clients, we’ve all been there.

Meanwhile, all of our server side software tends to target .NET Framework 4.5. We really should be targeting the latest .NET Framework (as the deployment environments are fully under our control), but it takes effort to roll that change out (new environments mostly) and it just hasn’t been a priority. Realistically our next major move might be to just start using .NET Core inside Docker images instead, so it might never really happen.

One of the main challenges with supporting both .NET 4.0 and .NET 4.5 is trying to build libraries that work across both versions. It’s entirely possible, but is somewhat complicated and unless you understand a bunch of fundamental things, can appear confusing and magical.

Manual Targeting Mode Engaged

The first time we had to support both .NET 4.0 and .NET 4.5 for a library was for Serilog.

Serilog by itself works perfectly fine for this sort of thing. As of version 1.5.9, it supported both of the .NET Framework versions that we needed in a relatively painless manner.

As is almost always the way, we quickly ran into a few situations where we needed some custom components (a new Sink, a custom JSON formatter, some custom destructuring logic), so we wrapped them up in a library for reuse. Targeting the lowest common denominator (.NET 4.0), we assumed that we would be able to easily incorporate the library wherever we needed.

The library itself compiled fine, and all of its tests passed.

We incorporated it into one of the pieces of software that targeted .NET 4.0 and it worked a treat. All of the custom components were available and worked exactly as advertised.

Then we installed it into a piece of software targeting .NET 4.5 and while it compiled perfectly fine, it crashed and burned at runtime.

What had actually happened was that Serilog changed a few of its public interfaces between 4.0 and 4.5, replacing some IDictionary<TKey, TValue> types with IReadOnlyDictionary<TKey, TValue>. When we created the library, and specifically targeted .NET 4.0, we thought that it would use that version of the framework all the way through, even if we installed it in something targeting .NET 4.5. Instead, when the Serilog Nuget package was installed into the destination via the dependency chain, it came through as .NET 4.5, while our poor library was still locked at 4.0.

At runtime the two disagreed and everything exploded.

Use The Death Dot

The easiest solution would have been to mandate the flip to .NET 4.5. We actually first encountered this issue quite a while ago, and couldn’t just drop support for .NET 4.0, especially as this was the first time we were putting together our common logging library. Building it such that it would never work on our older, still actively maintained components would have been a recipe for disaster.

The good news is that Nuget packages allow for the installation of different DLLs for different target frameworks. Unsurprisingly, this is exactly how Serilog does it, so it seemed like the sane path to engage on.

Targeting specific frameworks is pretty straightforward. All you have to do is make sure that your Nuget package is structured in a specific way:

  • lib
    • net40
      • {any and all required DLLs for .NET 4.0}
    • net45
      • {any and all required DLLs for .NET 4.5}

In the example above I’ve only included the options relevant for this blog post, but there are a bunch of others.

When the package is installed, it will pick the appropriate directory for the target framework and add references as necessary. It’s actually quite a neat little system.

For us, this meant we needed to add a few more projects to the common logging library, one for .NET 4.0 and another for .NET 4.5. The only thing inside these projects was the CustomJsonFormatter class, because it was the only thing we were interacting with that had a different signature depending on the target framework.

With the projects in place (and appropriate tests written), we constructed a custom nuspec file that would take the output from the core project (still targeting .NET 4.0), and the output from the ancillary projects and structure them in the way that Nuget wanted. We also had to add Nuget package dependencies into the nuspec file as well, because once you’re using a hand-rolled nuspec file, you don’t get any of the nice automatic things that you would if you were using the automatic package creation from a .csproj file.

Target Destroyed

With the new improved Nuget package in place, we could reference it from both .NET 4.0 and .NET 4.5 projects without any issue.

At least when the references were direct anyway. That is, when the Nuget package was referenced directly from a project and the classes therein only used in that context.

When the package was referenced as part of a dependency chain (i.e. project references package A, package A references logging package) we actually ran into the same problem again, where the disconnect between .NET 4.0 and .NET 4.5 caused irreconcilable differences in the intervening libraries.

It was kind of like async or generics, in that once you start doing it, you kind of have to keep doing it all the way down.

Interestingly enough, we had to use a separate approach to multi-targeting for one of the intermediary libraries, because it did not actually have any code that differed between .NET 4.0 and .NET 4.5.  We created a shell project targeting .NET 4.5 and used csproj trickery to reference all of the files from the .NET 4.0 project such that both projects would compile to basically the same DLL (except with different targets).

I don’t recommend this approach to be honest, as it looks kind of like magic unless you know exactly what is going on, so is a bit of a trap for the uninitiated.

Regardless, with the intermediary packages sorted everything worked fine.

Summary

The biggest pain throughout this whole process was the fact that none of the issues were picked up at compilation time. Even when the issues did occur, the error messages are frustratingly vague, talking about missing methods and whatnot, which can be incredibly hard to diagnose if you don’t immediately know what it actually means.

More importantly, it wasn’t until the actual code was executed that the problem surfaced, so not only was the issue only visible at runtime, it would happen during execution rather than right at the start.

We were lucky that we had good tests for most of our functionality so we realised before everything exploded in the wild, but it was still somewhat surprising.

We’re slowly moving away from .NET 4.0 now (thank god), so at some stage I will make a decision to just ditch the whole .NET 4.0/4.5 multi-targeting and mark the new package as a breaking change.

Sometimes you just have to move on.


Nancy is a pretty great, low weight, low ceremony API framework for C#/.NET. We’ve used it for basically every API that we’ve created since I started working here, with the exception of our Auth API, which uses Web API because of a dependency on the ASP Identity Management package.

In all the time that I’ve been using Nancy, I’ve encountered very few issues that weren’t just my own stupid fault. Sure, we’ve customised pieces of its pipeline (like the bootstrapper, because we use Ninject for dependency injection) but all in all its default behaviour is pretty reliable.

In fact, I can only think of one piece of default behaviour that wasn’t great, and that was the way that errors were handled during initialization when running inside IIS.

The Setup

When you use Nancy through IIS/ASP.NET, you add a reference to a handler class in your web.config file, which tells IIS how it should forward incoming requests to your code.

After installing the Nancy.Hosting.AspNet package, your web.config will contain some stuff that looks like this:

<?xml version="1.0" encoding="utf-8"?>
<configuration>
    <system.web>
        <compilation debug="true" targetFramework="4.5" />
        <httpRuntime targetFramework="4.5" requestPathInvalidCharacters="&lt;,&gt;,%,&amp;,?"/>
        <httpHandlers>
            <add verb="*" type="Nancy.Hosting.Aspnet.NancyHttpRequestHandler" path="*" />
        </httpHandlers>
    </system.web>

    <system.webServer>
        <validation validateIntegratedModeConfiguration="false" />
        <httpErrors existingResponse="PassThrough" />
        <handlers>
            <add name="Nancy" verb="*" type="Nancy.Hosting.Aspnet.NancyHttpRequestHandler" path="*" />
        </handlers>
    </system.webServer>
</configuration>

There are two sections here (system.web and system.webServer) for different versions of IIS (if I remember correctly, the top one is for the IISExpress web server in Visual Studio and the other one is for IIS7+? I’m not 100% certain).

What the configuration means is that for every incoming request (verb=*, path=* in the xml), IIS will simply forward the request to the NancyHttpRequestHandler, where it will do Nancy things.

This worked fine for us until we had an issue with our bootstrapper initialization. Specifically, our bootstrapper was throwing exceptions during creation (because it was trying to connect to a database which wasn’t available yet or something) and when that happened, it would stop the webservice from ever starting. In fact, it would be non-functional until we restarted the IIS application pool.

The root cause here was in the NancyHttpRequestHandler and the way it interacted with IIS. Basically, IIS would create one of these classes, which would trigger its static constructor. If that encountered an error, then the whole thing would be broken, never to recover.

The fix was relatively simple. Create a custom request handler (based off the default one) that had some logic in it to lazy load the Nancy bootstrapper/engine with appropriate error handling. The end result was that each request that failed to initialize would fail as expected (with a 500 or something), but the first one to succeed would cache the result for any subsequent requests.
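
Conceptually, the custom handler looked something like the sketch below. This is heavily simplified (the real one is based on Nancy’s default handler and actually hands the request off to the Nancy engine), but the important parts are the lazy initialization and the fact that a failure only poisons the current request:

using System;
using System.Threading;
using System.Web;
using Nancy.Bootstrapper;

public class TolerantNancyHttpRequestHandler : IHttpHandler
{
    // PublicationOnly means a thrown exception is not cached, so a later
    // request gets another chance to initialise the bootstrapper. The first
    // successful result is cached for all subsequent requests.
    private static readonly Lazy<INancyBootstrapper> Bootstrapper =
        new Lazy<INancyBootstrapper>(
            () =>
            {
                var bootstrapper = NancyBootstrapperLocator.Bootstrapper;
                bootstrapper.Initialise();
                return bootstrapper;
            },
            LazyThreadSafetyMode.PublicationOnly);

    public bool IsReusable => true;

    public void ProcessRequest(HttpContext context)
    {
        INancyBootstrapper bootstrapper;
        try
        {
            bootstrapper = Bootstrapper.Value;
        }
        catch (Exception)
        {
            // Initialization failed (e.g. the database isn't available yet),
            // so fail this request and let a later one try again.
            context.Response.StatusCode = 500;
            return;
        }

        HandleRequest(bootstrapper, context);
    }

    private static void HandleRequest(INancyBootstrapper bootstrapper, HttpContext context)
    {
        // In the real handler this is where the Nancy engine is resolved from
        // the bootstrapper and used to process the request.
    }
}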

With that solution in place our webservices became a little bit more reliable and tolerant of transitory errors during startup.

The Symptom

So, in summary, Nancy good, we made a small customisation to make startup more reliable.

Obviously the story doesn’t end there.

Not too long ago we encountered a bug in one of our APIs where it was returning errors, but there were no associated error logs from the application in our ELK stack.

We could see the errors in our ELB logs (purely as HTTP response codes), but we didn’t get any application level logs showing what the root issue was (i.e. the exception). There were other error logs (some transient problems occurring during service startup), so we assumed that there was some problem with our application logging where it wasn’t correctly reporting on errors that occurred as a result of web requests.

Reproducing the bug locally, the log output showed the requests being logged correctly, and at the correct level.

It was all very mysterious.

The Source

Our first instinct was that the data ingress into the ELK stack was failing. I’d recently been working on the stack, so I naturally assumed that it was my fault, but when we investigated, we discovered that the ELK stack was working exactly as expected. There were no errors indicating that an event had been received and then rejected for some reason (Elasticsearch field mapping conflicts are the most common).

Digging deeper, we checked the local log files on the staging webservice machines and discovered that the log events were missing altogether, having never been logged at all. Even when we caused an error on purpose, nothing was being logged into the local log file.

The answer lay in the custom request handler we implemented. It had allowed the service to remain functional (where previously it would have crashed and burned), but had an unfortunate side effect.

The sequence of actions looked like this:

  1. IIS receives a request
  2. IIS forwards request to custom request handler
  3. Custom request handler initialises itself, creating a bootstrapper
  4. Bootstrapper creates logger configuration, including a custom Sink which opens a write stream to a file (named based on date)
  5. An error occurs (can’t initialize DB because the DNS record does not exist yet, or something equally transient)
  6. Bootstrapper initialization fails. Sink is disposable with a finalizer, so it will be cleaned up eventually, just not straightaway
  7. Request handler initialization fails, failing the request
  8. Another request is received
  9. IIS does the same thing
  10. New bootstrapper is created
  11. New Sink created, pointing at same file (which is already locked)
  12. Initialization succeeds, request returns successfully
  13. Sink now throws an error every time it is logged to, because of the file lock
  14. Serilog discovers errors with sink, so stops logging to preserve application health
  15. Original Sink disposes of itself and releases file lock
  16. Second Sink starts functioning again, but Serilog has stopped logging to it, so nothing happens

The interesting thing here is that the Sink does not lock the file during its constructor because its core behaviour is to roll files based on both date and size, so every time you call it, it dynamically determines where it should write to. This meant it was created successfully, but could not write any events.

Serilog, being the good logging framework that it is, was catching those errors and stopping them from breaking the application. Unfortunately, because that Sink was the only place where we had visible output, Serilog itself could not say that it was experiencing errors. During the investigation we actually enabled the Serilog self log, and it was reporting all sorts of useful things, and was critical in actually diagnosing the problem.

Basically, we had a misbehaving sink and Serilog was protecting us from ourselves.

The Solution

We fixed the problem by moving the initialization of the logging earlier, enforcing that it only happens once per process by using a lazily evaluated static property on the request handler, which was how some of the Nancy properties were already being handled.
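
Conceptually the change looked something like this (a simplified sketch; the real configuration obviously wires up our custom sink and formatter rather than a bare rolling file):

using System;
using Serilog;

public static class LoggingBootstrap
{
    // Lazily evaluated and static, so the logger (and therefore the file sink)
    // is only ever created once per process, no matter how many times the
    // request handler or the Nancy bootstrapper gets constructed.
    private static readonly Lazy<ILogger> LazyLogger = new Lazy<ILogger>(() =>
        new LoggerConfiguration()
            .WriteTo.RollingFile("logs\\service-{Date}.log")
            .CreateLogger());

    public static ILogger Logger => LazyLogger.Value;
}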

This fixed the problem, but looking back I think there were probably a few other ways in which we could have tackled it that would have been better:

  • We could have associated some sort of unique ID with the logger (via the bootstrapper) guaranteeing no file conflicts
  • We could have changed the Sink to handle errors that occur while accessing its desired log file, by generating and logging to a different file
  • We could have handled the fact that the Sink was disposable, and its lifetime should be tied to the bootstrapper as expected

I’ll probably implement at least the second option at some point in the future, just to make the Sink more robust in the face of unexpected circumstances.

Conclusion

The interesting thing about this whole problem (with the custom Sink and file locking) was that we had actually anticipated that it would be a problem when we initially implemented the Sink. IIS has a tendency to run two applications in parallel whenever it does a recycle, so we knew that there would be periods when two processes might be trying to write to the same location, so we implemented a process ID based prefix to every file. Unfortunately, that approach is remarkably ineffective when everything is happening within the same process.

The hardest part in this whole adventure was trying to determine the cause of a problem when logging was the problem.

Once you get out of the development environment, and into deployed software, logging is pretty much all you have. When that’s gone, diagnosing problems becomes exponentially more difficult.

Like picking up broken glass in the dark.

Sure, it’s possible, but you’ll probably get cut.