And so I return from my break, with a fresh, reenergized hatred for terrible and unexpected technical issues.

Speaking of which…

Jacked Up And Good To Go

I’ve been sitting on a data export project for a while now. It’s a simple C# command line application that connects to one of our databases, correlates some entities together, pulls some data out and dumps it into an S3 bucket as CSV files. It has a wide variety of automated unit tests proving that the various components of the system function as expected, and a few integration and end-to-end tests that show that the entire export process works (i.e. given a set of command line arguments + an S3 bucket, does running the app result in data in the bucket).

From a purely technical point of view, the project has been “complete” for a while now, but I couldn’t turn it on until some other factors cleared up.

Well, while I was on holidays those other factors cleared up, so I thought to myself “I’ll just turn on the schedule in TeamCity and all will be well”.

Honestly, you’d think having written as much software as I have that I would know better.

That first run was something of a crapshow, and while some of the expected files ended up in the S3 bucket, a bunch of others were missing completely.

Even worse, the TeamCity task that we use for job execution thought everything completed successfully. The only indicator that it had failed was that the task description in TeamCity was not updated to show the summary of what it had done (i.e. it simply showed the generic “Success” message instead of the custom one we were supplying), which is suspicious, because I’ve seen TeamCity fail miserably like that before.

Not Enough Time In The World

Breaking it down, there were two issues afoot:

  1. Something failed, but TeamCity thought everything was cool
  2. Something failed

The first problem was easy to deal with; there was a bug in the way the process was reporting back to TeamCity when there was an unhandled exception in the data extraction process. With that sorted, the next run at least indicated a failure had occurred.
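
The fix itself was tiny; for illustration, here is a minimal sketch of the sort of top level handling involved, assuming TeamCity is watching both the exit code and its service messages (RunExport is a stand-in for the real entry point, and the escaping of the message content is omitted for brevity).

public static int Main(string[] args)
{
    try
    {
        var summary = RunExport(args);
        Console.WriteLine($"##teamcity[buildStatus text='{summary}']");
        return 0;
    }
    catch (Exception ex)
    {
        // Reporting the failure explicitly (and exiting non-zero) is what lets TeamCity
        // tell the difference between success and disaster.
        Console.WriteLine($"##teamcity[buildProblem description='{ex.Message}']");
        return 1;
    }
}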

With the process indicating correctly that a bad thing had happened, the cause of the second problem became obvious.

Timeouts.

Which explained perfectly why the tests had not picked up any issues, as while they run the full application (from as close to the command line as possible), they don’t run a full export, instead leveraging a limiter parameter to avoid the test taking too long.

Entirely my fault really, as I should have at least done a full export at some stage.

Normally I would look at a timeout with intense suspicion, as they typically indicate an inefficient query or operation of some sort. Simply raising the time allowed when timeouts start occurring is often a route to a terrible experience for the end-user, as operations take longer and longer to do what the user wants.

In this case though, it was reasonable that the data export would actually take a chunk of time greater than the paltry 60 seconds applied to command execution in Npgsql by default. Also, the exports that were failing were for the larger data sets (one of which had some joins onto other data sets) and being full exports, could not really make effective use of any indexes for optimisation.

So I upped the timeouts via the command line parameters and off it went.
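
Not complicated code either; a rough sketch of the change, assuming the commands are built somewhere central and commandTimeoutSeconds is the hypothetical command line parameter, looks something like this.

public void RunExportQuery(string connectionString, string exportQuery, int commandTimeoutSeconds)
{
    using (var connection = new NpgsqlConnection(connectionString))
    using (var command = new NpgsqlCommand(exportQuery, connection))
    {
        // CommandTimeout is expressed in seconds; zero means wait forever.
        command.CommandTimeout = commandTimeoutSeconds;
        connection.Open();

        using (var reader = command.ExecuteReader())
        {
            // ...stream each row out to the CSV destined for S3...
        }
    }
}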

Three hours later though, I was pretty sure something else was wrong.

Socket To Me

Running the same process with the same inputs from my development environment showed the same problem. The process just kept on chugging along, never finishing. Pausing the execution in the development environment showed that at least one thread was stuck waiting eternally for some database related thing to conclude.

The code makes use of parallelisation (each export is its own isolated operation), so my first instinct was that there was some sort of deadlock involving one or more exports.

With the hang appearing to be related to database connectivity, I thought that maybe it was happening in the usage of Npgsql connections, but each export creates its own connection, so that seemed unlikely. There was always the possibility that the problem was related to connection pooling though, which is built into the library and is pretty much static state, but I had disabled that via the connection string, so it shouldn’t have been a factor.

I ripped out all of the parallelisation and ran the process again and it still hung. On closer investigation, it was only one specific export that was hanging, which was weird, because they all use exactly the same code.

Turning towards the PostgreSQL end of the system, I ran the process again, except this time started paying attention to the active connections to the database, the queries they were running, state transitions (i.e. active –> idle) and execution timestamps. This is pretty easy to do using the query:

SELECT * FROM pg_stat_activity

I could clearly see the export that was failing execute its query on a connection, stay active for 15 minutes and then transition back to idle, indicating that the database was essentially done with that operation. On the process side though, it kept chugging along, waiting eternally for some data to come through the Socket that it was listening on that would apparently never come.

The 15 minute query time was oddly consistent too.

It turns out the query was actually being terminated server side because of replica latency (the export process queries a replica DB), which was set to max out at, wait for it, 15 minutes.

For some reason the stream returned by NpgsqlConnection.BeginTextExport(sql) just never ends when the underlying query is terminated on the server side.
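
For context, the reading side of the export looks roughly like the sketch below (simplified, with invented names); when the server kills the query, the ReadLine call just blocks on the socket forever instead of returning null or throwing.

public void StreamExportToCsv(string connectionString, string exportQuery, TextWriter destination)
{
    using (var connection = new NpgsqlConnection(connectionString))
    {
        connection.Open();

        // BeginTextExport wraps a COPY ... TO STDOUT operation in a TextReader.
        using (var reader = connection.BeginTextExport($"COPY ({exportQuery}) TO STDOUT WITH (FORMAT CSV)"))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                destination.WriteLine(line);
            }
        }
    }
}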

My plan is to put together some information and log an issue in the Npgsql GitHub repository, because I can’t imagine that the behaviour is intended.

Solve For X

With the problem identified, the only question remaining was what to do about it.

I don’t even like that our maximum replica latency is set to 15 minutes, so raising it was pretty much out of the question (and this process is intended to be automated and ongoing, so I would have to raise it permanently).

The only real remaining option is to break down the bigger query into a bunch of smaller queries and then aggregate myself.

So that’s exactly what I did.

Luckily for me, the data set had a field that made segmentation easy, though running a bunch of queries and streaming the results into a single CSV file meant that they had to be run sequentially, so no parallelisation bonus for me.
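
In rough terms the result looked something like the following sketch, where some_table, segment_field and the segments collection are placeholders for the real things; each smaller COPY finishes comfortably inside the replica latency window, and the rows just get appended to the one file.

public void StreamSegmentedExportToCsv(string connectionString, IEnumerable<string> segments, TextWriter destination)
{
    foreach (var segment in segments)
    {
        using (var connection = new NpgsqlConnection(connectionString))
        {
            connection.Open();

            // The segment values come from our own enumeration, not user input, so simple
            // string construction is good enough for the purposes of this sketch.
            var copy = $"COPY (SELECT * FROM some_table WHERE segment_field = '{segment}') TO STDOUT WITH (FORMAT CSV)";

            using (var reader = connection.BeginTextExport(copy))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    destination.WriteLine(line);
                }
            }
        }
    }
}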

Conclusion

This was one of those issues where I really should have had the foresight to see the first problem (timeouts when dealing with large data sets), but the existence of what looks to be a real bug made everything far more difficult than it could have been.

Still, it just goes to show that no matter how confident you are in a process, there is always the risk that when you execute it in reality that it all might go belly up.

Which just reinforces the idea that you should always be running it in reality as soon as you possibly can, and paying attention to what the results are.

I’ve been using MVVM as a pattern for UI work for a while now, mostly because of WPF. It’s a solid pattern and while I’ve not really delved into the publicly available frameworks (Prism, Caliburn.Micro, etc) I have put together a few reusable bits and pieces to make the journey easier.

One of those bits and pieces is the ability to perform work in the background, so that the UI remains responsive and usable while other important things are happening. This usually manifests as some sort of refresh or busy indicator on the screen after the user elects to do something complicated, but the important part is that the screen itself does not become unresponsive.

People get antsy when software “stops responding” and tend to murder it with extreme prejudice.

Now, the reusable components are by no means perfect, but they do get the job done.

Except when they don’t.

Right On Schedule

The framework itself is pretty bare bones stuff, with a few simple ICommand implementations and some view model base classes giving easy access to commonly desired functions.

The most complex part is the built-in support to easily do background work in a view model while leaving the user experience responsive and communicative. The core idea is to segregate the stuff happening in the background from the stuff happening in the foreground (which is where all the WPF rendering and user interaction lives) using Tasks and TaskSchedulers from the TPL (Task Parallel Library), while also helping to manage some state to communicate what is happening to the user (like busy indicators).

Each view model is responsible for executing some long running operation (probably started from a command), and then deciding what should happen when that operation succeeds, fails or is cancelled.

In order to support this segregation, the software takes a dependency on three separate task schedulers: one for the background (which is just a normal ThreadPoolTaskScheduler), one for the foreground (which is a DispatcherTaskScheduler or something similar) and one for tasks that need to be scheduled on a regular basis (another ThreadPoolTaskScheduler).

This dependency injection allows for those schedulers to be overridden for testing purposes, so that they execute completely synchronously or can be pumped at will as necessary in tests.
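
Boiled down, the programming model looks something like the following; a heavily simplified illustration rather than the real framework, with invented names, no property change notifications and the recurring work scheduler left out.

public abstract class BackgroundWorkViewModel
{
    private readonly TaskScheduler _background;
    private readonly TaskScheduler _foreground;

    protected BackgroundWorkViewModel(TaskScheduler background, TaskScheduler foreground)
    {
        _background = background;
        _foreground = foreground;
    }

    public bool IsBusy { get; private set; }

    protected void ExecuteInBackground(Action work, Action onSuccess, Action<Exception> onFailure)
    {
        IsBusy = true;

        // The work runs on the background scheduler, while the success/failure handling
        // is marshalled back to the foreground (dispatcher) scheduler for safe UI updates.
        Task.Factory
            .StartNew(work, CancellationToken.None, TaskCreationOptions.None, _background)
            .ContinueWith(
                t =>
                {
                    IsBusy = false;
                    if (t.IsFaulted) onFailure(t.Exception);
                    else onSuccess();
                },
                CancellationToken.None,
                TaskContinuationOptions.None,
                _foreground);
    }
}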

It all worked pretty great until we started really pushing it hard.

Schedule Conflict

Our newest component to use the framework did a huge amount of work in the background. Not only that, because of the way the interface was structured, it pretty much did all of the work at the same time (i.e. as soon as the screen was loaded), in order to give the user a better experience and minimise the total amount of time spent waiting.

From a technical standpoint, the component needed to hit both a local database (not a real problem) and a remote API (much much slower), both of which are prime candidates for background work due to their naturally slow nature. Not a lot of CPU intensive work though, mostly just DB and API calls.

With 6-10 different view models all doing work in the background, it quickly became apparent that we were getting some amount of contention for resources, as not all Tasks were being completed in a reasonable amount of time. It was surprisingly hard to measure, but it looked like the Tasks manually scheduled via the TaskSchedulers were quite expensive to run, and the ThreadPoolTaskSchedulers could only run so much at the same time due to the limits on parallelisation and the number of threads that they could have running at once.

So that sucked.

As a bonus annoyance, the framework did not lend itself to usage of async/await at all. It expected everything to be synchronous, where the “background” nature of the work was decided by virtue of where it was executed. Even the addition of one async function threw the whole thing into disarray, as it became harder to reason about where the work was actually being executed.

In the grand scheme of things, async/await is still relatively new (but not that new, it was made available back in 2012 after all), but it’s generally considered a better and less resource intensive way to ensure that blocking calls (like HTTP requests, database IO, file IO and so on) are not causing both the system and the user to wait unnecessarily. As a result, more and more libraries are being built with async functions, sometimes not even exposing a synchronous version at all. It’s somewhat difficult to make an async function synchronous too, especially if you want to avoid potential deadlocks.

With those limitations noted, we had to do something.

Why Not Both?

What we ended up doing was allowing for async functions to be used as part of the background work wrappers inside the base view models. This retained the managed “busy” indicator functionality and the general programming model that had been put into place (i.e. do work, do this on success, this on failure, etc).

Unfortunately what it also did was increase the overall complexity of the framework.

It was now much harder to reason about which context things were executing on, and while the usage of async functions was accounted for in the background work part of the framework, it was not accounted for in either the success or error paths.

This meant that it was all too easy to use an async function in the wrong context, causing a mixture of race conditions (where the overarching call wasn’t aware that part of itself was asynchronous) and bad error handling (where a developer had marked a function as async void to get around the compiler errors/warnings).

Don’t get me wrong, it all worked perfectly fine, assuming you knew to avoid all of the things that would make it break.

The tests got a lot more flaky though, because while it’s relatively easy to override TaskSchedulers with synchronous versions, it’s damn near impossible to force async functions to execute synchronously.

Sole Survivor

Here’s where it all gets pretty hypothetical, because the solution we actually have right now is the one that I just wrote about (the dual natured, overly complex abomination) and it’s causing problems on and off in a variety of ways.

A far better model is to incorporate async/await into the fabric of the framework, allowing for its direct usage and doing away entirely with the segmentation logic that I originally put together (with the TaskSchedulers and whatnot).

Stephen Cleary has some really good articles in MSDN magazine about this sort of stuff (async ViewModels and the supporting constructs), so I recommend reading them all if you’re interested.

At a high level, if we expose the fact that the background work is occurring asynchronously (via async commands and whatnot), then not only do we make it far easier to do work in the background (literally just use the standard async/await constructs), but it becomes far easier to handle errors in a reliable way, and the tests become easier too, because they can simply be async themselves (which all major unit testing frameworks support).
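
To make that concrete, an async command in that world is roughly the following; a bare bones sketch in the spirit of the implementations from those articles, minus cancellation, progress and parameterised variants.

public class AsyncCommand : ICommand
{
    private readonly Func<Task> _execute;
    private bool _isExecuting;

    public AsyncCommand(Func<Task> execute)
    {
        _execute = execute;
    }

    public event EventHandler CanExecuteChanged;

    public bool CanExecute(object parameter)
    {
        return !_isExecuting;
    }

    public async void Execute(object parameter)
    {
        // async void is tolerable here because Execute is effectively an event handler;
        // awaiting inside it means failures surface on the UI context instead of being lost.
        _isExecuting = true;
        RaiseCanExecuteChanged();

        try
        {
            await _execute();
        }
        finally
        {
            _isExecuting = false;
            RaiseCanExecuteChanged();
        }
    }

    private void RaiseCanExecuteChanged()
    {
        CanExecuteChanged?.Invoke(this, EventArgs.Empty);
    }
}

The view model then just supplies a plain async Task method to the command, and the tests can await that method directly.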

It does represent a significant refactor though, which is always a bit painful.

Conclusion

I’m honestly still not sure what the better approach is for this sort of thing.

Async/await is so easy to use at first glance, but has a bunch of complexity and tripwires for the unwary. It’s also something of an infection, where once you use it even a little bit, you kind of have to push it through everything in order for it to work properly end-to-end. This can be problematic for an existing system, where you want to introduce it a bit at a time.

On the other side, the raw TPL stuff that I put together is much more complex to use, but is relatively shallow. It’s much easier to reason about where work is actually happening and relatively trivial to completely change the nature of the application for testing purposes. Ironically enough, the ability to easily change from asynchronous background workers to a purely synchronous execution is actually detrimental in a way, because it means your tests aren’t really doing the same thing as your application will, which can mask issues.

My gut feel is to go with the newer thing, even though it feels a bit painful.

I think the pain is a natural response to something new though, so it’s likely to be a temporary thing.

Change is hard, you just have to push through it.

After that brief leadership interlude, it’s time to get back into the technical stuff with a weird and incredibly frustrating issue that I encountered recently when building a small command-line application using .NET 4.7.

More specifically, it failed to compile on our build server citing problems with a dependency that it shouldn’t have even required.

So without further ado, on with the show.

One Build Process To Rule Them All

One of the rules I like to follow for a good build process is that it should be executable outside of the build environment.

There are a number of reasons for this rule, but the two that are most relevant are:

  1. If something goes wrong during the build process you can try and run it yourself and see what’s happening, without having to involve the build server
  2. As a result of having to execute the process outside of the build environment, it’s likely that the build logic will be encapsulated in source control, alongside the code

With the way that a lot of software compilation works though, it can be hard to create build processes that automatically bootstrap the necessary components on a clean machine.

For example, there is no real way to compile a .NET Framework 4.7 codebase without using software that has to be installed. As far as I know you have to use either MSBuild, Visual Studio or some other component to do the dirty work. .NET Core is a lot better in this respect, because it’s all command line driven and doesn’t feature any components that must be installed on the machine before it will work. All you have to do is bootstrap the self contained SDK.

Thus while the dream is for the build process to be painless to execute on a clean git clone (with the intent that that is exactly what the build server does), sometimes dreams don’t come true, no matter how hard you try.

For us, our build server comes with a small number of components pre-installed, including MSBuild, and then our build scripts rely on those components existing in order to work correctly. There is a little bit of fudging involved though, so you don’t have to have exactly the same components installed locally, and the scripts will dynamically find MSBuild for you.

This was exactly how the command-line application build process was working before I touched it.

Then I touched it and it all went to hell.

Missing Without A Trace

Whenever you go back to a component that hasn’t been actively developed for a while, you always have to decide whether or not you should go to the effort of updating its dependencies that are now probably out of date.

Of course, some upgrades are a lot easier to action than others (i.e. a NuGet package update is generally a lot less painful than updating to a later version of the .NET Framework), but the general idea is to put some effort into making sure you’ve got a strong base to work from.

So that’s exactly what I did when I resurrected the command-line application used for metrics generation. I updated the build process, renamed the repository/namespaces (to be more appropriate), did a pass over the readme and updated the NuGet packages. No .NET version changes though, because that stuff can get hairy and it was already at 4.7, so it wasn’t too bad.

Everything compiled perfectly fine in Visual Studio and the self-contained build process continued to work on my local machine, so I pushed ahead and implemented the necessary changes.

Then I pushed my code and the automated build process on our build server failed consistently with a bunch of compilation errors like the following:

Framework\IntegrationTestKernel.cs(64,13): error CS0012: The type 'ValueType' is defined in an assembly that is not referenced. You must add a reference to assembly 'netstandard, Version=2.0.0.0, Culture=neutral, PublicKeyToken=cc7b13ffcd2ddd51'.

The most confusing part?

I had taken no dependency on netstandard as far as I knew.

More importantly, my understanding of netstandard was that it is basically a set of common interfaces to allow for interoperability between the .NET Framework and .NET Core. I had no idea why my code would fail to compile citing a dependency I didn’t even ask for.

Also, it worked perfectly on my machine, so clearly something was awry.

The Standard Response

The obvious first response is to add a reference to netstandard.

This is apparently possible via the NETStandard.Library NuGet package, so I added that, verified that it compiled locally and pushed again.

Same compilation errors.

My next hypothesis was that maybe something had gone weird with .NET Framework 4.7. There are a number of articles on the internet about similar looking topics and some of them read like later versions of .NET 4.7 (which are in-place upgrades for god only knows what reason) have changes relating to netstandard and .NET Framework integrations and compatibility. It was a shaky hypothesis though, because this application had always specifically targeted .NET 4.7.

Anyway, I flipped the projects to all target an earlier version of the .NET Framework (4.6.2) and then reinstalled all the NuGet packages (thank god for the Update-Package -reinstall command).

Still no luck.

The last thing I tried was removing all references to the C# 7 Value Tuple feature (super helpful when creating methods with complex return types), but that didn’t help either.
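
(For anyone unfamiliar, value tuples are the C# 7 feature that lets you write methods like the hypothetical example below without having to declare a dedicated result class.)

public (int Count, decimal Total) SummariseOrders(IEnumerable<decimal> orderValues)
{
    // The caller gets a lightweight, named pair back, e.g. var (count, total) = SummariseOrders(values);
    var values = orderValues.ToList();
    return (values.Count, values.Sum());
}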

I Compromised; In That I Did Exactly What It Wanted

In the end I accepted defeat and made the Visual Studio Build Tools 2017 available on our build server by installing them on our current build agent AMI, taking a new snapshot and then updating TeamCity to use that snapshot instead. In order to get everything to compile cleanly, I had to specifically install the .NET Core Build Tools, which made me sad, because .NET Core was actually pretty clean from a build standpoint. Now if someone puts together a .NET Core repository incorrectly, it will probably still continue to compile just fine on the build server, leaving a tripwire for the next time someone cleanly checks out the repo.

Ah well, can’t win them all.

Conclusion

I suspect that the root cause of the issue was updating to some of the NuGet packages, specifically the packages that are only installed in the test projects (like the Ninject.MockingKernel and its NSubstitute implementation) as the test projects were the only ones that were failing to compile.

I’m not entirely sure why a package update would cause compilation errors though, which is pretty frustrating. I’ve never experienced anything similar before, so perhaps those libraries were compiled to target a specific framework (netstandard 2.0) and those dependencies flowed through into the main projects they were installed into?

Anyway, our build agents are slightly less clean now as a result, which makes me sad, but I can live with it for now.

I really do hate system installed components though.

We have a lot of logs.

It’s mostly my fault, to be honest. It was only a few years ago that I learned about log aggregation, and once you have an ELK stack, everything looks like a structured log event formatted as JSON.

We aggregate a wealth of information into our log stack these days, including, but not limited to:

Now, if I had my way, we would keep everything forever. My dream would be to be able to ask the question “What did our aggregate API traffic look like over the last 12 months?”

Unfortunately, I can’t keep the raw data forever.

But I might be able to keep a part of it.

Storage Space

Storage space is pretty cheap these days, especially in AWS. In the Asia Pacific region, we pay $US 0.12 per GB per month for a stock standard, non-provisioned IOPS EBS volume.

Our ELK stack accumulates gigabytes of data every day though, and trying to store everything for all of eternity can add up pretty quickly. This is complicated even further by the nature of Elasticsearch, because it likes to keep replicas of things just in case a node explodes, so you actually need more storage space than you think in order to account for redundancy.

In the end we somewhat randomly decided to keep a bit more than a month’s worth of data (40 days), which gives us the capability to reliably support our products, and to have a decent window for viewing business intelligence and usage. We have a scheduled task in TeamCity that leverages Curator to remove data as appropriate.

Now, a little more than a month is a pretty long time.

But I want more.

In For The Long Haul

In any data set, you are likely to find patterns that emerge over a much longer period than a month.

A good example would be something like daily active users. This is the sort of trend that is likely to show itself over months or years, especially for a relatively stable product. Unless you’ve done something extreme of course, in which case you might get a meaningful trend over a much shorter period.

Ignoring the extremes, we have all the raw data required to calculate the metric, we’re just not keeping it. If we had some way of summarising it into a smaller data set though, we could keep it for a much longer period. Maybe some sort of mechanism to do some calculations and store the resulting derivation somewhere safe?

The simplest approach is some sort of script or application that runs on a schedule and uses the existing data in the ELK stack to create and store new documents, preferably back into the ELK stack. If we want to ensure those new documents don’t get deleted by Curator, all we have to do is put them into different indexes (as Curator is only cleaning up indexes prefixed with logstash).

Seems simple enough.

Generator X

For once it actually was simple enough.

At some point in the past we actually implemented a variation of this idea, where we calculated some metrics from a database (yup, that database) and stored them in an Elasticsearch instance for later use.

Architecturally, the metric generator was a small C# command line application scheduled for daily execution through TeamCity, so nothing particularly complicated.

We ended up decommissioning those particular metrics (because it turned out they were useless) and disabling the scheduled task, but the framework already existed to do at least half of what I wanted to do; the part relating to generating documents and storing them in Elasticsearch. All I had to do was extend it to query a different data source (Elasticsearch) and generate a different set of metrics documents for indexing.

So that’s exactly what I did.

The only complicated part was figuring out how to query Elasticsearch from .NET, which as you can see from the following metrics generation class, can be quite a journey.

public class ElasticsearchDailyDistinctUsersDbQuery : IDailyDistinctUsersDbQuery
{
    public ElasticsearchDailyDistinctUsersDbQuery
    (
        SourceElasticsearchUrlSetting sourceElasticsearch,
        IElasticClientFactory factory,
        IClock clock,
        IMetricEngineVersionResolver version
    )
    {
        _sourceElasticsearch = sourceElasticsearch;
        _clock = clock;
        _version = version;
        _client = factory.Create(sourceElasticsearch.Value);
    }

    private const string _indexPattern = "logstash-*";

    private readonly SourceElasticsearchUrlSetting _sourceElasticsearch;
    private readonly IClock _clock;
    private readonly IMetricEngineVersionResolver _version;

    private readonly IElasticClient _client;

    public IEnumerable<DailyDistinctUsersMetric> Run(DateTimeOffset parameters)
    {
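        // Normalise the supplied timestamp to the start of its day, so the query covers one full calendar day.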
        var start = parameters - parameters.TimeOfDay;
        var end = start.AddDays(1);

        var result = _client.Search<object>
        (
            s => s
                .Index(_indexPattern)
                .AllTypes()
                .Query
                (
                    q => q
                        .Bool
                        (
                            b => b
                                .Must(m => m.QueryString(a => a.Query("Application:GenericSoftwareName AND Event.Name:SomeLoginEvent").AnalyzeWildcard(true)))
                                .Must(m => m
                                    .DateRange
                                    (
                                        d => d
                                            .Field("@timestamp")
                                            .GreaterThanOrEquals(DateMath.Anchored(start.ToUniversalTime().DateTime))
                                            .LessThan(DateMath.Anchored(end.ToUniversalTime().DateTime))
                                    )
                                )
                        )
                )
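                // Count distinct users with a cardinality aggregation over the unique user identifier field.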
                .Aggregations(a => a
                    .Cardinality
                    (
                        "DistinctUsers", 
                        c => c.Field("SomeUniqueUserIdentifier")
                    )
                )
        );

        var agg = result.Aggs.Cardinality("DistinctUsers");

        return new[]
        {
            new DailyDistinctUsersMetric(start)
            {
                count = Convert.ToInt32(agg.Value),
                generated_at = _clock.UtcNow,
                source = $"{_sourceElasticsearch.Value}/{_indexPattern}",
                generator_version = _version.ResolveVersion().ToString()
            }
        };
    }
}

Conclusion

The concept of calculating some aggregated values from our logging data and keeping them separately has been in my own personal backlog for a while now, so it was nice to have a chance to dig into it in earnest.

It was even nicer to be able to build on top of an existing component, as it would have taken me far longer if I had to put everything together from scratch. I think it’s a testament to the quality of our development process that even this relatively unimportant component was originally built following solid software engineering practices, and has plenty of automated tests, dependency injection and so on. It made refactoring it and turning it towards a slightly different purpose much easier.

Now all I have to do is wait months while the longer term data slowly accumulates.

A very long time ago I wrote a post on this blog about interceptors.

The idea behind an interceptor is pretty straightforward: dynamically create a wrapper around some interface or class to either augment or alter its behaviour when used by the rest of the system, without actually having to implement the interface or override the class. For example, my original post was for an interceptor that slowed down requests heading to an API, simulating how that particular application would feel for a user with terrible internet.

I honestly haven’t touched the concept since, until recently that is.

I wanted to add some logging around usage of a third party API from our legacy application and conceptually the problem seemed like a perfect application for another interceptor. A quick implementation of a logging interceptor injected via Ninject and I’d have all the logging I could ever want, without having to mess around too much.

Reality had other ideas though.

Here’s Why I Came Here Tonight

Our legacy software is at that time in its life where it mostly just gets integrations. It’s pretty feature complete as far as core functionality goes, so until the day we finally grant it Finis Rerum and it can rest, we look to add value to our users by integrating with third party services.

The most recent foray in this space integrated a payment provider into the software, which is quite useful considering its core value proposition is trust account management. From a technical point of view, the payment provider has an API and we wrote a client to access that API, with structured request and response objects. Pretty standard stuff.

As part of our development, we included various log events that allowed us to track the behaviour of parts of the system, mostly so that we could more easily support the application and get accurate metrics and feedback from users in the wild relating to performance. This is all well and good, but those events generally cover off combined parts of the application logic; for example, an operation that queries the local DB and then augments that information by calling into the third party API to display a screen to the user.

This makes it relatively easy to see when any users are experiencing performance issues, but it makes it hard to see whether or not the local DB, the actual programming logic or the internet call was the root cause.

An improvement to this would be to also log any outgoing API requests and their responses, along with the execution time. With that information we would be able to either blame or absolve the client’s internet connection when it comes to performance questions.

Now, I’m an extremely lazy developer, so while we have a nice interface that I could implement to accomplish this (i.e. some sort of LoggingPaymentProviderClient), it’s got like twenty methods and I really don’t have the time, patience or motivation for that. Also, while it’s unlikely that the interface would change all that much over time, it’s still something of a maintenance nightmare.

Interceptors to the rescue!

But I Got The Feeling That Something Ain’t Right

As I explained in my original post all those years ago, the IInterceptor interface supplied by the Castle library allows you to implement logic via a proxy and slot it seamlessly into a call stack. Its usage is made easier by the presence of good dependency injection, but it’s definitely not required at all.

Thus enters the logging interceptor.

public class PaymentProviderClientMethodExecutionLoggingInterceptor : IInterceptor
{
    public PaymentProviderClientMethodExecutionLoggingInterceptor(ILogger logger)
    {
        _logger = logger;
    }

    private readonly ILogger _logger;

    public void Intercept(IInvocation invocation)
    {
        var stopwatch = Stopwatch.StartNew();
        try
        {
            invocation.Proceed();
            stopwatch.Stop();

            var log = new PaymentProviderMethodCallCompleted(invocation, stopwatch.Elapsed);
            _logger.Information(log);
        }
        catch (Exception ex)
        {
            stopwatch.Stop();

            var log = new PaymentProviderMethodCallFailed(invocation, stopwatch.Elapsed, ex);
            _logger.Warning(log);

            throw;
        }
    }
}

It’s not an overly complicated class, and while it’s written to be specific, it’s actually quite generic.

Given a proxied class, all methods called on that class will be logged via Serilog, with the method name, its parameters and its return value (the structured logging being provided by the dedicated event classes).
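
For completeness, wiring it up through Ninject ends up being a one liner along these lines, assuming the Ninject.Extensions.Interception package and an interface/implementation pair named purely for illustration.

Bind<IPaymentProviderClient>()
    .To<PaymentProviderClient>()
    .Intercept()
    .With<PaymentProviderClientMethodExecutionLoggingInterceptor>();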

Nothing ever works the first time though, and while I’m constantly being reminded of that, I’m always hopeful all the same. Denial is a powerful motivator.

The problem was that the IInterceptor interface is old enough that it doesn’t consider the existence of asynchronous methods. It does exactly what it says on the tin: starts a timer, proceeds with the method invocation and then, because the method is asynchronous, immediately logs an event with the wrong execution time and no return value.

It has no idea that it has to wait for the invocation to complete because it thinks everything is synchronous.

Clowns To The Left Of Me, Jokers To The Right

This is where everything is going to get a little bit fuzzier than I would like, because I wrote this blog post before I had a working solution.

From what I can tell, the situation is quite complex.

The simplest solution appears to be to leverage the existing interface and simply check for the presence of a Task (or Task<T>) return value. If detected, append a continuation to that Task to perform the desired functionality. For me this would be a continuation on both faulted and success (and maybe cancelled?) that would log the completion of the method. It seems like it would work, but I do have some concerns about the scheduling of the continuation and how that makes the code harder to reason about.
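
As a sketch, that continuation-based approach would look something like the following, building on the interceptor above; it only deals with plain Task return values (Task<T> results need some extra reflection gymnastics), and synchronous failures would still need the original try/catch.

public void Intercept(IInvocation invocation)
{
    var stopwatch = Stopwatch.StartNew();
    invocation.Proceed();

    if (invocation.ReturnValue is Task task)
    {
        // The method is asynchronous, so defer the logging until the returned Task actually finishes.
        task.ContinueWith(t =>
        {
            stopwatch.Stop();

            if (t.IsFaulted)
            {
                _logger.Warning(new PaymentProviderMethodCallFailed(invocation, stopwatch.Elapsed, t.Exception));
            }
            else
            {
                _logger.Information(new PaymentProviderMethodCallCompleted(invocation, stopwatch.Elapsed));
            }
        });
    }
    else
    {
        stopwatch.Stop();
        _logger.Information(new PaymentProviderMethodCallCompleted(invocation, stopwatch.Elapsed));
    }
}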

Luckily, someone has already put together a reusable library that allows for asynchronous interceptors via a slightly different interface.

This is attractive because it’s code that I don’t have to write (remember, I’m lazy), but it not being built into the core Castle library does make me question its legitimacy. Surely if it was that critical the maintainers would have updated Castle.Core?

Regardless, I explored using the library first, but in order to use it I had to go on an adventure to upgrade a bunch of our NuGet dependencies (because it relied on the latest version of Castle.Core), which meant updates to Castle, Ninject and Ninject’s extension libraries. This caused knock-on effects because the Ninject.MockingKernel.NSubstitute library was not available for .NET 4.5 (even though all the others were), so I had to temporarily copy that particular implementation into our codebase.

Once everything was compiling, a full test run showed some failing tests that weren’t failing before the library upgrades, so I kind of stopped there.

For now.

Conclusion

Unfortunately this is one of those blog posts that comes off feeling a little hollow. I didn’t actually get to my objective (seamless per method logging for a third-party dependency), but I did learn a lot on the way, so I think it was still useful to write about.

Probably should have waited a little longer though, I jumped the gun a bit.

It’s not the only time in recent memory that asynchronous behaviour has made things more difficult than they would have been otherwise. In an unrelated matter, some of our automated tests have been flaky recently, and the core of the issue seems to be asynchronous behaviour that is either a.) not being correctly overridden to be synchronous for testing purposes or b.) not correctly being waited for before tests proceed.

It’s not hard to write tests that are purely async of course (NUnit supports tests marked with the async keyword), but when you’re testing a view model and the commands are “synchronous” it gets a bit more challenging.

That’s probably a topic for another day though.