
After that brief leadership interlude, it’s time to get back into the technical stuff with a weird and incredibly frustrating issue that I encountered recently when building a small command-line application using .NET 4.7.

More specifically, it failed to compile on our build server citing problems with a dependency that it shouldn’t have even required.

So without further ado, on with the show.

One Build Process To Rule Them All

One of the rules I like to follow for a good build process is that it should be executable outside of the build environment.

There are a number of reasons for this rule, but the two that are most relevant are:

  1. If something goes wrong during the build process you can try and run it yourself and see what’s happening, without having to involve the build server
  2. As a result of having to execute the process outside of the build environment, it’s likely that the build logic will be encapsulated in source control, alongside the code

With the way that a lot of software compilation works though, it can be hard to create build processes that automatically bootstrap the necessary components on a clean machine.

For example, there is no real way to compile a .NET Framework 4.7 codebase without using software that has to be installed. As far as I know you have to use either MSBuild, Visual Studio or some other component to do the dirty work. .NET Core is a lot better in this respect, because it’s all command-line driven and doesn’t feature any components that must be installed on the machine before it will work. All you have to do is bootstrap the self-contained SDK.

Thus, while the dream is for the build process to be painless to execute on a clean git clone (with the intent that that’s exactly what the build server does), sometimes dreams don’t come true, no matter how hard you try.

For us, our build server comes with a small number of components pre-installed, including MSBuild, and our build scripts rely on those components existing in order to work correctly. There is a little bit of fudging involved though, so you don’t have to have exactly the same components installed locally; the build will dynamically find MSBuild for you.

This was exactly how the command-line application build process was working before I touched it.

Then I touched it and it all went to hell.

Missing Without A Trace

Whenever you go back to a component that hasn’t been actively developed for a while, you always have to decide whether or not you should go to the effort of updating its dependencies that are now probably out of date.

Of course, some upgrades are a lot easier to action than others (i.e. a NuGet package update is generally a lot less painful than updating to a later version of the .NET Framework), but the general idea is to put some effort into making sure you’ve got a strong base to work from.

So that’s exactly what I did when I resurrected the command-line application used for metrics generation. I updated the build process, renamed the repository/namespaces (to be more appropriate), did a pass over the readme and updated the NuGet packages. No .NET version changes though, because that stuff can get hairy and it was already at 4.7, so it wasn’t too bad.

Everything compiled perfectly fine in Visual Studio and the self-contained build process continued to work on my local machine, so I pushed ahead and implemented the necessary changes.

Then I pushed my code and the automated build process on our build server failed consistently with a bunch of compilation errors like the following:

Framework\IntegrationTestKernel.cs(64,13): error CS0012: The type 'ValueType' is defined in an assembly that is not referenced. You must add a reference to assembly 'netstandard, Version=2.0.0.0, Culture=neutral, PublicKeyToken=cc7b13ffcd2ddd51'.

The most confusing part?

I had taken no dependency on netstandard as far as I knew.

More importantly, my understanding of netstandard was that it is basically a set of common interfaces to allow for interoperability between the .NET Framework and .NET Core. I had no idea why my code would fail to compile citing a dependency I didn’t even ask for.

Also, it worked perfectly on my machine, so clearly something was awry.

The Standard Response

The obvious first response is to add a reference to netstandard.

This is apparently possible via the NETStandard.Library NuGet package, so I added that, verified that it compiled locally and pushed again.
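
For reference, the change itself is tiny. Installing the package just adds an entry to packages.config, and another commonly suggested variant of the same fix is to reference the netstandard facade assembly directly in the .csproj. A sketch of both (exact versions and project style will differ):

<!-- packages.config, after installing the NETStandard.Library package -->
<package id="NETStandard.Library" version="2.0.3" targetFramework="net47" />

<!-- Or, directly in an old-style .csproj: reference the netstandard facade -->
<ItemGroup>
  <Reference Include="netstandard" />
</ItemGroup>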

Same compilation errors.

My next hypothesis was that maybe something had gone weird with .NET Framework 4.7. There are a number of articles on the internet about similar-looking topics, and some of them read as though later versions of .NET 4.7 (which are in-place upgrades, for god only knows what reason) have changes relating to netstandard and .NET Framework integration and compatibility. It was a shaky hypothesis though, because this application had always specifically targeted .NET 4.7.

Anyway, I flipped the projects to all target an earlier version of the .NET Framework (4.6.2) and then reinstalled all the NuGet packages (thank god for the Update-Package -Reinstall command).

Still no luck.

The last thing I tried was removing all references to the C# 7 Value Tuple feature (super helpful when creating methods with complex return types), but that didn’t help either.
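
For context, this is the sort of thing I ripped out. Value tuples compile down to the System.ValueTuple types, which .NET 4.7 ships in-box but which older frameworks (like the 4.6.2 I had just flipped to) pull in via the System.ValueTuple NuGet package. The example below is illustrative, not the actual code:

public (bool Success, string ErrorMessage) TryGenerateMetrics(string source)
{
    // Illustrative only: the tuple return type is really System.ValueTuple<bool, string>
    // under the covers, which is where the framework/package difference comes in.
    if (string.IsNullOrEmpty(source))
    {
        return (false, "No source supplied");
    }

    return (true, null);
}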

I Compromised; In That I Did Exactly What It Wanted

In the end I accepted defeat and made the Visual Studio Build Tools 2017 available on our build server by installing them on our current build agent AMI, taking a new snapshot and then updating TeamCity to use that snapshot instead. In order to get everything to compile cleanly, I had to specifically install the .NET Core Build Tools, which made me sad, because .NET Core was actually pretty clean from a build standpoint. Now if someone puts together a .NET Core repository incorrectly, it will probably still continue to compile just fine on the build server, leaving a tripwire for the next time someone cleanly checks out the repo.

Ah well, can’t win them all.

Conclusion

I suspect that the root cause of the issue was updating some of the NuGet packages, specifically the packages that are only installed in the test projects (like the Ninject.MockingKernel and its NSubstitute implementation), as the test projects were the only ones that were failing to compile.

I’m not entirely sure why a package update would cause compilation errors though, which is pretty frustrating. I’ve never experienced anything similar before, so perhaps those libraries were compiled to target a specific framework (netstandard 2.0) and those dependencies flowed through into the main projects they were installed into?

Anyway, our build agents are slightly less clean now as a result, which makes me sad, but I can live with it for now.

I really do hate system installed components though.


Building new features and functionality on top of legacy software is a special sort of challenge, one that I’ve written about from time to time.

To be honest though, the current legacy application that I’m working with is not actually that bad. The prior technical lead had the great idea to implement a relatively generic way to execute modern .NET functionality from the legacy VB6 code thanks to the magic of COM, so you can still work with a language that doesn’t make you sad on a day-to-day basis. It’s a pretty basic eventing system (i.e. both sides can raise events that are handled by the other side), but it’s effective enough.

Everything gets a little bit tricksy when windowing and modal dialogs are involved though.

One Thing At A Time

The legacy application is basically a multiple document interface (MDI) experience, where the user is free to open a bunch of different entities and screens at the same time. Following this approach for new functionality adds a bunch of inherent complexity though, in that the user might edit an entity that is currently being displayed elsewhere (maybe in a list), requiring some sort of common, global channel for saying “hey, I’ve updated entity X, please react accordingly”.

This kind of works in the legacy code (VB6), because it just runs global refresh functions and changes form controls whenever it feels like it.

When the .NET code gets involved though, it gets very difficult to maintain both worlds in the same way, so we’ve taken to isolating all the new features from the legacy stuff, primarily through modal dialogs. That is, the user is unable to access the rest of the application while the .NET feature is running.

To be honest, I was pretty surprised that we could open up a modal form in WPF from an event handler started in VB6, but I think it worked because both VB6 and .NET shared a common UI thread, and the modality of a form is handled at a low level common to both technologies (i.e. win32 or something).
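
As a rough sketch of the shape of that bridge (the class and window names below are made up, not our actual interface), the .NET side ends up looking something like this:

using System.Runtime.InteropServices;

// Illustrative only: a COM-visible class that a VB6 event handler can call into.
[ComVisible(true)]
public class FeatureLauncher
{
    public void ShowFeature()
    {
        // FeatureWindow stands in for the real WPF window.
        var window = new FeatureWindow();

        // ShowDialog runs a nested message loop and disables the other windows
        // owned by the same thread, which is why the modality "just works"
        // across the VB6/.NET boundary when both sides share the one UI thread.
        window.ShowDialog();
    }
}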

We paid a price from a user experience point of view of course, but we mostly worked around it by making sure that the user had all of the information they needed to make a decision on any screen in the .NET functionality, so they never needed to refer back to the legacy stuff.

Then we did a new thing and it all came crashing down.

Unforgettable Legacy

Up until fairly recently, the communication channel between VB6 and .NET was mostly one way. VB6 would raise X event, .NET would handle it by opening up a modal window or by executing some code that then returned a response. If there was any communication that needed to go back to VB6 from the .NET, it always happened after the modal window was already closed.

This approach worked fine until we needed to execute some legacy functionality as part of a workflow in .NET, while still having the .NET window be displayed in a modal fashion.

The idea was simple enough.

  • Use the .NET functionality to identify the series of actions that needed to be completed
  • In the background, iterate through that list of actions and raise an event to be handled by the VB6 to do the actual work
  • This event would be synchronous, in that the .NET would wait for the VB6 to finish its work and respond before moving on to the next item
  • Once all actions are completed, present a summary to the user in .NET
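
In rough C# terms, the driving loop looked something like the sketch below (the types and names are made up; the actual event plumbing is the generic VB6/.NET eventing system described earlier):

using System;
using System.Collections.Generic;

// Illustrative only: carries a single action across the boundary and lets the
// VB6 handler report back on how it went.
public class LegacyActionEventArgs : EventArgs
{
    public string ActionId { get; set; }
    public bool Succeeded { get; set; }
}

public class ActionRunner
{
    // Handled on the VB6 side via the existing COM eventing bridge.
    public event EventHandler<LegacyActionEventArgs> LegacyActionRequested;

    public IList<LegacyActionEventArgs> ExecuteActions(IEnumerable<string> actionIds)
    {
        var results = new List<LegacyActionEventArgs>();

        foreach (var actionId in actionIds)
        {
            var args = new LegacyActionEventArgs { ActionId = actionId };

            // The event is raised synchronously, so the VB6 handler finishes
            // its work (and sets Succeeded) before control returns and the
            // next action starts.
            LegacyActionRequested?.Invoke(this, args);

            results.Add(args);
        }

        // The caller then presents this list as a summary to the user in .NET.
        return results;
    }
}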

We’d actually used a similar approach for a different feature in the past, and while we had to refactor some of the VB6 to make the functionality available in a way that made sense, it worked okay.

This time the legacy functionality we were interested in was already available as a function on a static class, so it was easily callable. I mean, it was a poorly written function dependent on some static state, so it wasn’t a complete walk in the park, but we didn’t need to do any high-risk refactoring or anything.

Once we wrote the functionality though, two problems became immediately obvious:

  1. The legacy functionality could pop up dialogs, asking the user questions relevant to the operation. This was actually kind of good; one of the main reasons we didn’t want to reimplement was that we couldn’t be sure we would capture all the edge cases, so using the existing functionality guaranteed that we would, because it was already handling them (and had been for years). These cases were rare, so while they were a little disconcerting, they were acceptable.
  2. Sometimes executing the legacy functionality would murder the modal-ness of the .NET window, which led to all sorts of crazy side effects. This seemed to happen mostly when the underlying VB6 context was changed in such a way by the operation that it would historically have required a refresh. When it happened, the .NET window would drop behind the main application window, and the main window would be fully interactable, including opening additional windows (which would explode if they too were modal). There did not seem to be a way to get the original .NET window back into focus either. I suspect that there were a number of Application.DoEvents calls secreted throughout the byzantine labyrinth of code that were forcing screen redraws, but we couldn’t easily prove it. It was definitely broken though.

The first problem wasn’t all that bad, even if it wasn’t great for a semi-automated process.

The second one was a deal-breaker.

Freedom! Horrible Terrifying Freedom!

We tried a few things to “fix” the whole modal window problem, including:

  • Trying to restore the modal-ness of the window once it had broken. This didn’t work at all, because the window was still modal somehow, and technically we’d lost the thread context from the initial .ShowDialog call (which may or may not have still been blocked, it was hard to tell). In fact, other windows in the application that required modality would explode if you tried to use them, with an error similar to “cannot display a modal dialog when there is already one in effect”.
  • Trying to locate and fix the root reason why the modal-ness was being removed. This was something of a fool’s errand as I mentioned above, as the code was ridiculously labyrinthian and it was impossible to tell what was actually causing the behaviour. Also, it would involve simultaneously debugging both VB6 and .NET, which is somewhat challenging.
  • Forcing the .NET window to be “always on top” while the operation was happening, to at least prevent it from disappearing. This somewhat worked, but required us to use raw Win32 windowing calls, and the window was still completely broken after the operation was finished. Also, it would be confusing to make the window always on top all the time, while leaving the ability to click on the parts of the parent window that were visible.

In the end, we went with just making the .NET window non-modal and dealing with the ramifications. With the structures we’d put into place, we were able to refresh the content of the .NET window whenever it gained focus (to prevent it displaying incorrect data due to changes in the underlying application), and our refreshes were quick due to performance optimization, so that wasn’t a major problem anymore.

It was still challenging though, as sitting a WPF Dispatcher on top of the main VB6 UI thread (well, the only VB6 thread) and expecting them both to work at the same time was just asking too much. We had to create a brand new thread just for the WPF functionality, and inject a TaskScheduler initialized on the VB6 thread for scheduling the events that get pushed back into VB6.
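
A minimal sketch of that arrangement (assuming a synchronization context is present on the VB6 interop thread, and with a made-up FeatureWindow standing in for the real WPF window) looks roughly like this:

using System.Threading;
using System.Threading.Tasks;
using System.Windows.Threading;

public static class FeatureHost
{
    // Must be called from the VB6 UI thread, so the captured scheduler
    // marshals work back onto the legacy thread.
    public static void Launch()
    {
        TaskScheduler vb6Scheduler = TaskScheduler.FromCurrentSynchronizationContext();

        // Dedicated STA thread to host the WPF Dispatcher and the new window.
        var wpfThread = new Thread(() =>
        {
            // The window takes the scheduler so events destined for VB6 can be
            // queued back onto the legacy thread.
            var window = new FeatureWindow(vb6Scheduler);
            window.Closed += (s, e) => Dispatcher.CurrentDispatcher.BeginInvokeShutdown(DispatcherPriority.Background);
            window.Show();

            // Pump messages for this thread until the dispatcher shuts down.
            Dispatcher.Run();
        });
        wpfThread.SetApartmentState(ApartmentState.STA);
        wpfThread.IsBackground = true;
        wpfThread.Start();
    }
}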

Conclusion

It’s challenging edge cases like this whole adventure that make working with legacy code time-consuming in weird and unexpected ways. If we had just stuck to pure .NET functionality, we wouldn’t have run into any of these problems, but we would have paid a different price in reimplementing functionality that already exists, both in terms of development time, and in terms of risk (in that we don’t fully understand all of the things the current functionality does).

I think we made the right decision, in that the actual program functionality is the same as it’s always been (doing whatever it does), and we instead paid a technical price in order to get it to work well, as opposed to forcing the user to accept a sub-par feature.

It’s still not immediately clear to me how the VB6 and .NET functionality actually works together at all (what with the application windowing, threading and various message pumps and loops), but it does work, so at least we have that.

I do look forward to the day when we can lay this application to rest though, giving it the peace it deserves after many years of hard service.

Yes, I’ve personified the application in order to empathise with it.


We have the unfortunate honour of supporting multiple versions of the .NET Framework across our entire software suite.

All of our client side software (i.e. the stuff that is installed on client machines, both desktop and server) tends to target .NET Framework 4.0. Historically, this was the last version available on Windows XP, and up until relatively recently, it was important that we continue to support that ancient and terrifying beast. You know how it works, old software, slow clients, we’ve all been there.

Meanwhile, all of our server side software tends to target .NET Framework 4.5. We really should be targeting the latest .NET Framework (as the deployment environments are fully under our control), but it takes effort to roll that change out (new environments mostly) and it just hasn’t been a priority. Realistically our next major move might be to just start using .NET Core inside Docker images instead, so it might never really happen.

One of the main challenges with supporting both .NET 4.0 and .NET 4.5 is trying to build libraries that work across both versions. It’s entirely possible, but it is somewhat complicated, and unless you understand a bunch of fundamental things, it can appear confusing and magical.

Manual Targeting Mode Engaged

The first time we had to support both .NET 4.0 and .NET 4.5 for a library was for Serilog.

Serilog by itself works perfectly fine for this sort of thing. As of version 1.5.9, it supported both of the .NET Framework versions that we needed in a relatively painless manner.

As is almost always the way, we quickly ran into a few situations where we needed some custom components (a new Sink, a custom JSON formatter, some custom destructuring logic), so we wrapped them up in a library for reuse. Targeting the lowest common denominator (.NET 4.0), we assumed that we would be able to easily incorporate the library wherever we needed.

The library itself compiled fine, and all of its tests passed.

We incorporated it into one of the pieces of software that targeted .NET 4.0 and it worked a treat. All of the custom components were available and worked exactly as advertised.

Then we installed it into a piece of software targeting .NET 4.5 and while it compiled perfectly fine, it crashed and burned at runtime.

What had actually happened was that Serilog changed a few of its public interfaces between its .NET 4.0 and .NET 4.5 builds, replacing some IDictionary<TKey, TValue> types with IReadOnlyDictionary<TKey, TValue>. When we created the library and specifically targeted .NET 4.0, we thought that it would use that version of the framework all the way through, even if we installed it in something targeting .NET 4.5. Instead, when the Serilog NuGet package was installed into the destination via the dependency chain, it came through as .NET 4.5, while our poor library was still locked at 4.0.

At runtime the two disagreed and everything exploded.
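
To make the failure mode concrete, here is a purely hypothetical illustration (not the actual Serilog surface) of the kind of signature drift involved:

using System.Collections.Generic;

// What the hypothetical dependency's .NET 4.0 build exposes, and what our
// library was compiled against:
public interface IPropertyFormatterNet40
{
    void Format(IDictionary<string, object> properties);
}

// What the same dependency's .NET 4.5 build exposes instead:
public interface IPropertyFormatterNet45
{
    void Format(IReadOnlyDictionary<string, object> properties);
}

// Our library's IL refers to the IDictionary member. When the .NET 4.5 build
// of the dependency is the one loaded at runtime, that member no longer
// exists, and the call fails with a missing method/member style error even
// though everything compiled cleanly.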

Use The Death Dot

The easiest solution would have been to mandate the flip to .NET 4.5. We actually first encountered this issue quite a while ago, and couldn’t just drop support for .NET 4.0, especially as this was the first time we were putting together our common logging library. Building it such that it would never work on our older, still actively maintained components would have been a recipe for disaster.

The good news is that NuGet packages allow for the installation of different DLLs for different target frameworks. Unsurprisingly, this is exactly how Serilog does it, so it seemed like the sane path to take.

Targeting specific frameworks is pretty straightforward. All you have to do is make sure that your NuGet package is structured in a specific way:

  • lib
    • net40
      • {any and all required DLLs for .NET 4.0}
    • net45
      • {any and all required DLLs for .NET 4.5}

In the example above I’ve only included the options relevant for this blog post, but there are a bunch of others.

When the package is installed, it will pick the appropriate directory for the target framework and add references as necessary. It’s actually quite a neat little system.

For us, this meant we needed to add a few more projects to the common logging library, one for .NET 4.0 and another for .NET 4.5. The only thing inside these projects was the CustomJsonFormatter class, because it was the only thing we were interacting with that had a different signature depending on the target framework.

With the projects in place (and appropriate tests written), we constructed a custom nuspec file that would take the output from the core project (still targeting .NET 4.0) and the output from the ancillary projects, and structure them in the way that NuGet wanted. We also had to add the NuGet package dependencies into the nuspec file, because once you’re using a hand-rolled nuspec file, you don’t get any of the nice things that automatic package creation from a .csproj file gives you.
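
The resulting nuspec ended up shaped roughly like the sketch below (with made-up ids, versions and paths, rather than our actual file):

<?xml version="1.0"?>
<package>
  <metadata>
    <id>Company.Common.Logging</id>
    <version>1.0.0</version>
    <authors>Us</authors>
    <description>Common logging components (custom sink, JSON formatter, destructuring).</description>
    <dependencies>
      <!-- Hand-rolled nuspec files do not pick up dependencies automatically,
           so they have to be declared explicitly. -->
      <dependency id="Serilog" version="1.5.9" />
    </dependencies>
  </metadata>
  <files>
    <!-- Core output (targets .NET 4.0) plus the framework specific formatter -->
    <file src="Company.Common.Logging\bin\Release\Company.Common.Logging.dll" target="lib\net40" />
    <file src="Company.Common.Logging.Net40\bin\Release\Company.Common.Logging.Net40.dll" target="lib\net40" />
    <file src="Company.Common.Logging\bin\Release\Company.Common.Logging.dll" target="lib\net45" />
    <file src="Company.Common.Logging.Net45\bin\Release\Company.Common.Logging.Net45.dll" target="lib\net45" />
  </files>
</package>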

Target Destroyed

With the new, improved NuGet package in place, we could reference it from both .NET 4.0 and .NET 4.5 projects without any issue.

At least when the references were direct anyway. That is, when the NuGet package was referenced directly from a project and the classes therein only used in that context.

When the package was referenced as part of a dependency chain (i.e. project references package A, package A references logging package) we actually ran into the same problem again, where the disconnect between .NET 4.0 and .NET 4.5 caused irreconcilable differences in the intervening libraries.

It was kind of like async or generics, in that once you start doing it, you kind of have to keep doing it all the way down.

Interestingly enough, we had to use a separate approach to multi-targeting for one of the intermediary libraries, because it did not actually have any code that differed between .NET 4.0 and .NET 4.5. We created a shell project targeting .NET 4.5 and used csproj trickery to reference all of the files from the .NET 4.0 project such that both projects would compile to basically the same DLL (except with different targets).
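
The trickery in question is just linked files: the .NET 4.5 shell project contains no code of its own and instead pulls the source files in from the .NET 4.0 project. A sketch of the relevant csproj chunk, with made-up project names:

<!-- Inside the .NET 4.5 shell project's .csproj (names are illustrative) -->
<ItemGroup>
  <Compile Include="..\Company.Intermediary.Net40\**\*.cs"
           Exclude="..\Company.Intermediary.Net40\Properties\AssemblyInfo.cs">
    <Link>%(RecursiveDir)%(Filename)%(Extension)</Link>
  </Compile>
</ItemGroup>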

I don’t recommend this approach to be honest, as it looks kind of like magic unless you know exactly what is going on, so it’s a bit of a trap for the uninitiated.

Regardless, with the intermediary packages sorted everything worked fine.

Summary

The biggest pain throughout this whole process was the fact that none of the issues were picked up at compilation time. Even when the issues did occur, the error messages were frustratingly vague, talking about missing methods and whatnot, which can be incredibly hard to diagnose if you don’t immediately know what they actually mean.

More importantly, it wasn’t until the offending code was actually executed that the problem surfaced, so not only was the issue only visible at runtime, it surfaced mid-execution (when the relevant code path was hit) rather than right at the start.

We were lucky that we had good tests for most of our functionality so we realised before everything exploded in the wild, but it was still somewhat surprising.

We’re slowly moving away from .NET 4.0 now (thank god), so at some stage I will make a decision to just ditch the whole .NET 4.0/4.5 multi-targeting and mark the new package as a breaking change.

Sometimes you just have to move on.


A number of people much smarter than me have stated that there are only two hard problems in Computer Science: naming things, cache invalidation and off-by-one errors. Based on my own experience, I agree completely with this sentiment, and it’s one of the reasons why I always hesitate to incorporate caching into a system until it’s absolutely necessary.

Unfortunately, absolutely necessary always comes much sooner than I would like.

Over the last few weeks we’ve been slowly deploying our data freeing functionality to our customers. The deployment occurs automatically without user involvement (we have a neat deployment pipeline), so that part has been pretty painless, but we’ve had some teething issues on the server side as our traffic ramped up. We’re obviously hosting everything in AWS, so it’s not like we can’t scale as necessary (and we have), but just throwing more power at something is a road we’ve been down before, and it never ends well. Plus, we’re software engineers, and when it looks like our code is slow or inefficient, it makes us sad.

The first bottleneck we ran into was our authentication service, and mitigating that particular problem is the focus of this post.

For background, we use a relatively simple token based approach to authentication. Consumers of our services are required to supply some credentials directly to our auth service to get one of these tokens, which can then be redeemed to authenticate against other services and to gain access to the resources the consumer is interested in. Each service knows that the auth service is the ultimate arbitrator for resolving those tokens, so they connect to the service as necessary for token validation and make authorization decisions based on the information returned.

A relatively naive approach to authentication, but it works for us.

It’s pretty spammy though, and therein lies the root cause of the bottleneck.

I Already Know The Answer

One of the biggest factors at play here is that every single authenticated request (which is most of them) coming into any of our services needs to hit another service to validate the token before it can do anything else. It can’t even keep doing things in the background while the token is being validated, because if the auth fails, that would be a waste of effort (and a potential security risk if an error occurs and something leaks out).

On the upside, once a service has resolved a token, it’s unlikely that the answer will change in the near future. We don’t even do token invalidation at this stage (no need), so the answer is actually going to be valid for the entire lifetime of the token.

You can probably already see where I’m going with this, but an easy optimization is to simply remember the resolution for each token in order to bypass calls to the auth service when that token is seen again. Why ask a question when you already know the answer? This works particularly well for us in the data synchronization case because a single token will be used for a flurry of calls.

Of course, now we’re in the caching space, and historically, implementing caching can increase the overall complexity of a system by a large amount. To alleviate some of the load on the auth service, we just want a simple in-memory cache. If a particular instance of the service has seen the token, use the cache, else validate the token and then cache the result. To keep it simple, and to deal with our specific usage pattern (the aforementioned flurry), we’re not even going to cache the token resolution for the entire lifetime of the token, just for the next hour.

We could write the cache ourselves, but that would be stupid. Caching IS a hard problem, especially once you get into invalidation. Given that caching has been a known problem for a while, there are a lot of different components available for you to use in your language of choice; it’s just a matter of picking one.

But why pick just one?

There Has To Be A Better Way

As we are using C#, CacheManager seems like a pretty good bet for avoiding the whole “pick a cache and stick with it” problem.

It’s a nice abstraction over the top of many different caching providers (like the in-memory System.Runtime.Caching and distributed options like Redis), so we can easily implement our cache using CacheManager and then change our caching to something more complicated later, without having to worry about breaking everything in between. It also simplifies the interfaces to those caches, and does a bunch of really smart work behind the scenes.

The entire cache is implemented in the following class:

using System;
using CacheManager.Core;

public class InMemoryNancyAuthenticationContextCache : INancyAuthenticationContextCache
{
    private readonly ICacheManager<AuthenticationContext> _manager;

    public InMemoryNancyAuthenticationContextCache()
        : this(TimeSpan.FromHours(1))
    {
    }

    public InMemoryNancyAuthenticationContextCache(TimeSpan cachedItemsValidFor)
    {
        // In-memory cache (System.Runtime.Caching) with absolute expiration, so each
        // cached resolution only lives for the supplied duration (one hour by default).
        _manager = CacheFactory.Build<AuthenticationContext>(a => a
            .WithSystemRuntimeCacheHandle()
            .WithExpiration(ExpirationMode.Absolute, cachedItemsValidFor));
    }

    public AuthenticationContext Get(string token)
    {
        // Returns null if the token has not been seen yet or its entry has expired.
        return _manager.Get(token);
    }

    public void Insert(string token, AuthenticationContext authContext)
    {
        _manager.Put(token, authContext);
    }
}

Our common authentication library now checks whatever cache was injected into it before going off to the real auth service to validate tokens. I made sure to write a few tests to validate the caching behaviour (like expiration and eviction, validating a decrease in the number of calls to the auth provider when flooded with the same token and so on), and everything seems to be good.
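
Conceptually, the check in the authentication library is nothing fancy. A simplified sketch of the flow, with made-up names for the surrounding pieces (like the _authProvider that does the real validation):

public AuthenticationContext Resolve(string token)
{
    // Fast path: we've resolved this token recently, so skip the auth service.
    var cached = _cache.Get(token);
    if (cached != null)
    {
        return cached;
    }

    // Slow path: ask the real auth service, then remember the answer for the
    // flurry of subsequent requests that use the same token.
    var context = _authProvider.ValidateToken(token);
    if (context != null)
    {
        _cache.Insert(token, context);
    }

    return context;
}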

The one downside of using CacheManager (and the System.Runtime.Caching libraries underneath it) is that I’m putting a lot of trust in everything to “just work”. Performance testing will prove out whether or not there are any issues, but I can guarantee that if anything weird does happen, it will be a massive pain to diagnose, just because our code is so many levels removed from the actual caching.

I sure do hope it does the right thing under stress.

Summary

Like I said at the start, I always hold off on implementing a cache for as long as I possibly can. Caching is a powerful tool to improve performance, but it can lead to some frustratingly hard-to-debug situations because of the transient nature of the information. No longer can you assume that you will be getting the most accurate information, which can make it harder to reason about execution paths and return values after something has actually happened. Of course, any decent system (databases, browsers, even disk access) usually has caching implemented to some degree, so I suppose you’re always dealing with the problem regardless of what your code actually does.

Good caching is completely invisible, feeding consumers appropriate information and mitigating performance problems without ever really showing its ugly side.

When it comes to implementing caching, it doesn’t make sense to write it yourself (which is true for a lot of things). Accept the fact that many people a lot smarter than you have probably already solved the problem and just use their work instead. The way CacheManager also loosely couples our code to any specific cache implementation was an unexpected bonus, which was nice.

Now if only someone could solve the whole naming things problem.