
A number of people much smarter than me have stated that there are only two hard problems in Computer Science: naming things, cache invalidation, and off-by-one errors. Based on my own experience, I agree completely with this sentiment, and it’s one of the reasons why I always hesitate to incorporate caching into a system until it’s absolutely necessary.

Unfortunately, “absolutely necessary” always comes much sooner than I would like.

Over the last few weeks we’ve been slowly deploying our data freeing functionality to our customers. The deployment occurs automatically without user involvement (we have a neat deployment pipeline) so that has been pretty painless, but we’ve had some teething issues on the server side as our traffic ramped up. We’re obviously hosting everything in AWS, so it’s not like we can’t scale as necessary (and we have), but just throwing more power at something is a road we’ve been down before, and it never ends well. Plus, we’re software engineers, and when it looks like our code is slow or inefficient, it makes us sad.

The first bottleneck we ran into was our authentication service, and mitigating that particular problem is the focus of this post.

For background, we use a relatively simple token based approach to authentication. Consumers of our services are required to supply some credentials directly to our auth service to get one of these tokens, which can then be redeemed to authenticate against other services and to gain access to the resources the consumer is interested in. Each service knows that the auth service is the ultimate arbiter for resolving those tokens, so they call it as necessary for token validation and make authorization decisions based on the information returned.

A relatively naive approach to authentication, but it works for us.

It’s pretty spammy though, and therein lies the root cause of the bottleneck.

I Already Know The Answer

One of the biggest factors at play here is that every single authenticated request (which is most of them) coming into any of our services needs to hit another service to validate the token before it can do anything else. It can’t even keep doing things in the background while the token is being validated, because if the auth fails, that would be a waste of effort (and a potential security risk if an error occurs and something leaks out).

On the upside, once a service has resolved a token, it’s unlikely that the answer will change in the near future. We don’t even do token invalidation at this stage (no need), so the answer is actually going to be valid for the entire lifetime of the token.

You can probably already see where I’m going with this, but an easy optimization is to simply remember the resolution for each token in order to bypass calls to the auth service when that token is seen again. Why ask a question when you already know the answer? This works particularly well for us in the data synchronization case because a single token will be used for a flurry of calls.

Of course, now we’re in the caching space, and historically, implementing caching can increase the overall complexity of a system by a large amount. To alleviate some of the load on the auth service, we just want a simple in-memory cache. If a particular instance of the service has seen the token, use the cache, else validate the token and then cache the result. To keep it simple, and to deal with our specific usage pattern (the aforementioned flurry), we’re not even going to cache the token resolution for the entire lifetime of the token, just for the next hour.
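Stripped of everything a library gives you, that pattern is plain cache-aside. Here’s a minimal sketch of the idea; the type and member names are invented for this post (not our actual auth library), and real code would also need expiration and thread safety:

```csharp
using System;
using System.Collections.Generic;

// A bare-bones cache-aside sketch. Hypothetical names, no expiration,
// no thread safety - just the shape of the optimization.
public class CachingTokenValidator
{
    private readonly Dictionary<string, string> _cache = new Dictionary<string, string>();
    private readonly Func<string, string> _validateWithAuthService;

    public CachingTokenValidator(Func<string, string> validateWithAuthService)
    {
        _validateWithAuthService = validateWithAuthService;
    }

    public string Resolve(string token)
    {
        // If we've seen this token before, skip the auth service entirely.
        string context;
        if (_cache.TryGetValue(token, out context))
        {
            return context;
        }

        // Otherwise pay the cost once and remember the answer.
        context = _validateWithAuthService(token);
        _cache[token] = context;
        return context;
    }
}
```

With a flurry of calls using the same token, only the first one pays the round trip to the auth service.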

We could write the cache ourselves, but that would be stupid. Caching IS a hard problem, especially once you get into invalidation. Caching has been a known problem for a long time, so there are a lot of different components available in your language of choice; it’s just a matter of picking one.

But why pick just one?

There Has To Be A Better Way

As we are using C#, CacheManager seems like a pretty good bet for avoiding the whole “pick a cache and stick with it” problem.

It’s a nice abstraction over the top of many different caching providers (like the in-memory System.Runtime.Caching and the distributed Redis), so we can easily implement our cache using CacheManager and then change our caching to something more complicated later, without having to worry about breaking everything in between. It also simplifies the interfaces to those caches, and does a bunch of really smart work behind the scenes.

The entire cache is implemented in the following class:

using System;
using CacheManager.Core;

public class InMemoryNancyAuthenticationContextCache : INancyAuthenticationContextCache
{
    private readonly ICacheManager&lt;AuthenticationContext&gt; _manager;

    // By default, cache resolved tokens for an hour (our "flurry" window).
    public InMemoryNancyAuthenticationContextCache()
        : this(TimeSpan.FromHours(1))
    {
    }

    public InMemoryNancyAuthenticationContextCache(TimeSpan cachedItemsValidFor)
    {
        // In-memory System.Runtime.Caching handle, with absolute expiration.
        _manager = CacheFactory.Build&lt;AuthenticationContext&gt;(a => a
            .WithSystemRuntimeCacheHandle()
            .WithExpiration(ExpirationMode.Absolute, cachedItemsValidFor));
    }

    public AuthenticationContext Get(string token)
    {
        // Returns null on a cache miss.
        return _manager.Get(token);
    }

    public void Insert(string token, AuthenticationContext authContext)
    {
        _manager.Put(token, authContext);
    }
}

Our common authentication library now checks whatever cache was injected into it before going off to the real auth service to validate tokens. I made sure to write a few tests to validate the caching behaviour (like expiration and eviction, validating a decrease in the number of calls to the auth provider when flooded with the same token and so on), and everything seems to be good.
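For the expiration tests, the trick is an injectable clock, so the test doesn’t have to actually wait an hour. A minimal sketch of that idea (the names here are hypothetical, and CacheManager handles the real expiry for us, so this is purely illustrative):

```csharp
using System;
using System.Collections.Generic;

// A toy cache with absolute expiration and an injectable clock, to show
// how expiration behaviour can be tested deterministically. Hypothetical
// names - not our production cache.
public class ExpiringCache
{
    private readonly TimeSpan _ttl;
    private readonly Func<DateTime> _clock;
    private readonly Dictionary<string, Tuple<DateTime, string>> _entries =
        new Dictionary<string, Tuple<DateTime, string>>();

    public ExpiringCache(TimeSpan ttl, Func<DateTime> clock)
    {
        _ttl = ttl;
        _clock = clock;
    }

    public void Insert(string key, string value)
    {
        // Record the absolute expiry time alongside the value.
        _entries[key] = Tuple.Create(_clock() + _ttl, value);
    }

    public string Get(string key)
    {
        Tuple<DateTime, string> entry;
        if (_entries.TryGetValue(key, out entry) && _clock() < entry.Item1)
        {
            return entry.Item2;
        }
        return null; // missing or expired
    }
}
```

A test can then advance the fake clock past the TTL and assert the entry is gone, with no real waiting involved.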

The one downside of using CacheManager (and the System.Runtime.Caching libraries) is that I’m putting a lot of trust in everything to “just work”. Performance testing will prove out whether or not there are any issues, but I can guarantee that if there are, they will be a massive pain to diagnose, just because our code is so many levels removed from the actual caching.

I sure do hope it does the right thing under stress.

Summary

Like I said at the start, I always hold off on implementing a cache for as long as I possibly can. Caching is a powerful tool to improve performance, but it can lead to some frustratingly hard to debug situations because of the transient nature of the information. No longer can you assume that you will be getting the most accurate information, which can make it harder to reason about execution paths and return values after the fact. Of course, any decent system (databases, browsers, even disk access) usually has caching implemented to some degree, so I suppose you’re always dealing with the problem regardless of what your code actually does.

Good caching is completely invisible, feeding consumers appropriate information and mitigating performance problems without ever really showing its ugly side.

When it comes to implementing caching, it doesn’t make sense to write it yourself (which is true for a lot of things). Accept the fact that many people a lot smarter than you have probably already solved the problem and just use their work instead. The existence of CacheManager, which loosely couples our code to any specific cache implementation, was an unexpected bonus, which was nice.

Now if only someone could solve the whole naming things problem.