
As part of the work we did recently to add search to one of our newer APIs, we leveraged the support that Npgsql has for constructing PostgreSQL Full Text Search queries via Linq. This was a vast improvement over our previous attempts to use PostgreSQL Full Text Search, because we could combine it with sorting and filtering, and have everything apply right at the database level for maximum efficiency.

Internally, we use a Retriever class for each entity that we offer access to via the API. This class in turn leverages a number of different dependencies to provide filtering, sorting and searching by modifying an IQueryable. For search, there is an IFullTextSearcher<TEntity> interface, which contains a single method, Search, that takes an IQueryable and the search query and returns a modified IQueryable.
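The interface itself isn't shown in this post; a minimal sketch of what it presumably looks like (the exact signature is my assumption based on the implementation below, which actually searches over a projection model rather than the entity itself):

public interface IFullTextSearcher<TEntity>
{
    // Applies a full text search predicate to the supplied queryable, returning
    // a new queryable (or the original, unchanged, if there is nothing to search for).
    IQueryable<TEntity> Search(IQueryable<TEntity> queryable, SearchQuery searchQuery);
}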

A searcher implementation for an entity called Entity looks like this:

public class EntityFullTextSearcher : IFullTextSearcher<Entity>
{
    private readonly IPostgresQueryBuilder _postgresQueryBuilder;

    public EntityFullTextSearcher(IPostgresQueryBuilder postgresQueryBuilder)
    {
        _postgresQueryBuilder = postgresQueryBuilder;
    }

    public IQueryable<EntityListProjectionModel> Search(IQueryable<EntityListProjectionModel> queryable, SearchQuery searchQuery)
    {
        var query = _postgresQueryBuilder.Build(searchQuery);
        if (string.IsNullOrEmpty(query)) return queryable;

        queryable = queryable.Where(b => NpgsqlTextFunctions.Match
        (
            NpgsqlTextFunctions.QueryOr(
                NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldA), NpgsqlWeightLabel.A),
                NpgsqlTextFunctions.QueryOr(
                    NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldB), NpgsqlWeightLabel.A),
                    NpgsqlTextFunctions.QueryOr(
                        NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldC), NpgsqlWeightLabel.A),
                        NpgsqlTextFunctions.QueryOr(
                            NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldD), NpgsqlWeightLabel.B),
                            NpgsqlTextFunctions.QueryOr(
                                NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldE), NpgsqlWeightLabel.C),
                                NpgsqlTextFunctions.QueryOr(
                                    NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldF), NpgsqlWeightLabel.C),
                                    NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldG), NpgsqlWeightLabel.C)
                                    )
                                )
                            )
                        )
                    )
                ),
            NpgsqlTextFunctions.ToTsQuery(query))
        );

        return queryable;
    }
}

Pretty straightforward, but there is a massive inline lambda function that is somewhat hard to read.

It's A Frowny Face

The essence of that massive chunk of lambda is the definition of the fields that will be indexed during the search, along with the weight at which they will be indexed and how they are combined, which in our case is just using OR. This Full Text Search definition is then combined with the Match operator. This is the way to get Full Text Search working via Linq, as I wrote about in my previous post on the subject.

If you look at the definition of these functions inside the Npgsql source code, you’ll see that their implementations all throw an InvalidOperationException. How do they even work?

The answer lies in the magic of expression trees and Entity Framework. The Where function on an IQueryable actually takes an Expression<Func<TEntity, bool>>, which this lambda function is supplying.
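The distinction is easy to miss: a Func<Entity, bool> is compiled code, while an Expression<Func<Entity, bool>> is a data structure describing that code, which a provider can inspect and translate. A quick illustration (reusing the Entity/FieldA names from the example above):

// A compiled delegate: executable, but opaque to a query provider.
Func<Entity, bool> compiled = e => e.FieldA == "green";

// An expression tree: not directly executable, but inspectable. An EF provider
// walks nodes like these and translates them into SQL.
Expression<Func<Entity, bool>> tree = e => e.FieldA == "green";
var body = (BinaryExpression)tree.Body;    // e.FieldA == "green"
var member = (MemberExpression)body.Left;  // e.FieldA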

At no point is this code intended to be executed directly. You could try (e.g. by running it on top of a List or an In Memory Database via Effort), but it would fail miserably, throwing an InvalidOperationException as expected.

However, when the IQueryable is passed through the Npgsql EF Provider, it is transformed into actual SQL, which is then executed and the results interpreted accordingly. To be honest, I've never had to dig into the EF provider concept too deeply, and I don't really want to. I'm happy enough that it works, and allows us to combine Full Text Search with arbitrary selection, filtering and sorting, leaving all of the hard work up to the database.

Back to the massive chunk of lambda though, at least it's encapsulated inside this tiny class. Right?

Well, it was up until we needed to add sorting by relevance.

Maybe More Of A Grimace?

Have a look at this class and see if anything looks familiar.

public class EntitySearchRelevanceSorter : ISearchRelevanceSorter<EntityListProjectionModel>
{
    private readonly IPostgresQueryBuilder _postgresQueryBuilder;

    public EntitySearchRelevanceSorter(IPostgresQueryBuilder postgresQueryBuilder)
    {
        _postgresQueryBuilder = postgresQueryBuilder;
    }

    public IQueryable<EntityListProjectionModel> Sort(IQueryable<EntityListProjectionModel> queryable, SearchQuery searchQuery)
    {
        if (searchQuery == null || !searchQuery.Terms.Any()) return queryable;
        var query = _postgresQueryBuilder.Build(searchQuery);
        if (string.IsNullOrEmpty(query)) return queryable;

        return queryable.OrderByDescending(b => NpgsqlTextFunctions.TsRank(
               NpgsqlTextFunctions.QueryOr(
                   NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldA), NpgsqlWeightLabel.A),
                   NpgsqlTextFunctions.QueryOr(
                       NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldB), NpgsqlWeightLabel.A),
                       NpgsqlTextFunctions.QueryOr(
                           NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldC), NpgsqlWeightLabel.A),
                           NpgsqlTextFunctions.QueryOr(
                               NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldD), NpgsqlWeightLabel.B),
                               NpgsqlTextFunctions.QueryOr(
                                   NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldE), NpgsqlWeightLabel.C),
                                   NpgsqlTextFunctions.QueryOr(
                                       NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldF), NpgsqlWeightLabel.C),
                                       NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldG), NpgsqlWeightLabel.C)
                                        )))))),
                NpgsqlTextFunctions.ToTsQuery(query))
               );
    }
}

The lambda expression being used as the input for the TsRank function is exactly the same as the function used inside the searcher class.

It should be the same as well, because the relevance sorting should return the results in an order appropriate for how they were found via the search.

Now we have a problem though, because if the search definition changes, the relevance sorting will no longer make sense. The same is somewhat true in reverse (changing the sorting definition without changing the search).

My first attempt at fixing this, so that there was only a single place where the Full Text Search definition existed, was to simply extract the lambda function into a class of its own, and take a dependency on that class in both the searcher and sorter.

The definition class then looked like this:

public class EntitySearchDefinition
{
    public Expression<Func<Entity, string>> Definition
    {
        get
        {
            return b => NpgsqlTextFunctions.QueryOr( 
                NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldA), NpgsqlWeightLabel.A), 
                NpgsqlTextFunctions.QueryOr( 
                    NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldB), NpgsqlWeightLabel.A), 
                    NpgsqlTextFunctions.QueryOr( 
                        NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldC), NpgsqlWeightLabel.A), 
                        NpgsqlTextFunctions.QueryOr( 
                            NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldD), NpgsqlWeightLabel.B), 
                            NpgsqlTextFunctions.QueryOr(
                                NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldE), NpgsqlWeightLabel.C), 
                                NpgsqlTextFunctions.QueryOr( 
                                    NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldF), NpgsqlWeightLabel.C), 
                                    NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldG), NpgsqlWeightLabel.C) 
                                )
                            )
                        )
                    )
                )
            );
        }
    }
}

And the usage within the searcher looked like this:

public class EntityFullTextSearcher : IFullTextSearcher<Entity>
{
    private readonly IPostgresQueryBuilder _postgresQueryBuilder;
    private readonly EntitySearchDefinition _def;

    public EntityFullTextSearcher(IPostgresQueryBuilder postgresQueryBuilder, EntitySearchDefinition def)
    {
        _postgresQueryBuilder = postgresQueryBuilder;
        _def = def;
    }

    public IQueryable<EntityListProjectionModel> Search(IQueryable<EntityListProjectionModel> queryable, SearchQuery searchQuery)
    {
        var query = _postgresQueryBuilder.Build(searchQuery);
        if (string.IsNullOrEmpty(query)) return queryable;

        queryable = queryable.Where(b => NpgsqlTextFunctions.Match
        (
            _def.Definition,
            NpgsqlTextFunctions.ToTsQuery(query))
        );

        return queryable;
    }
}

This won’t compile at all though.

Nope, Definitely Rage

The problem here is that the definition of NpgsqlTextFunctions.Match is bool Match(string, string), so from the compiler's point of view it doesn't take an expression like the one I extracted out to the other class. We can't just call it here either, because that would defeat the entire point of constructing the Where so that it can be turned into SQL and executed against the database. We have to put together an expression tree containing the logic we want to be pushed to the database.

How can we do that and maintain some common definition of the Full Text Search that will be used by both search and sorting components?

Long story short, my definition class ended up looking like this:

public class EntitySearchDefinition
{
    private Expression<Func<Entity, string>> Definition
    {
        get
        {
            return b => NpgsqlTextFunctions.QueryOr( 
                NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldA), NpgsqlWeightLabel.A), 
                NpgsqlTextFunctions.QueryOr( 
                    NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldB), NpgsqlWeightLabel.A), 
                    NpgsqlTextFunctions.QueryOr( 
                        NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldC), NpgsqlWeightLabel.A), 
                        NpgsqlTextFunctions.QueryOr( 
                            NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldD), NpgsqlWeightLabel.B), 
                            NpgsqlTextFunctions.QueryOr(
                                NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldE), NpgsqlWeightLabel.C), 
                                NpgsqlTextFunctions.QueryOr( 
                                    NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldF), NpgsqlWeightLabel.C), 
                                    NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldG), NpgsqlWeightLabel.C) 
                                )
                            )
                        )
                    )
                )
            );
        }
    }

    private static MethodInfo TsRank()
    {
        return MethodInfoFromExpression<Func<string, string, float>>((a, b) => NpgsqlTextFunctions.TsRank(a, b));
    }

    private static MethodInfo ToTsQuery()
    {
        return MethodInfoFromExpression<Func<string, string>>(a => NpgsqlTextFunctions.ToTsQuery(a));
    }

    private static MethodInfo Match()
    {
        return MethodInfoFromExpression<Func<string, string, bool>>((a, b) => NpgsqlTextFunctions.Match(a, b));
    }
    
    public Expression<Func<Entity, float>> CreateOrderBy(string query)
    {
        return MakeMethodCallExpression<float>(query, TsRank());
    }

    public Expression<Func<Entity, bool>> CreateWhere(string query)
    {
        return MakeMethodCallExpression<bool>(query, Match());
    }

    private Expression<Func<Entity, TOut>> MakeMethodCallExpression<TOut>(string query, MethodInfo methodInfo)
    {
        var p = Expression.Parameter(typeof(Entity));
        var queryExpression = CreateConstantExpressionForQuery(query);
        var definitionBody = Definition.Body;
        var replacer = new ExpressionReplacer(a => a.Type == typeof(Entity), a => p);
        var visited = replacer.Visit(definitionBody);
        var expression = Expression.Lambda<Func<Entity, TOut>>(Expression.Call(methodInfo, visited, queryExpression), new List<ParameterExpression> { p });
        return expression;
    }

    private static MethodCallExpression CreateConstantExpressionForQuery(string query)
    {
        var queryConstant = Expression.Constant(query);
        var queryExpression = Expression.Call(ToTsQuery(), queryConstant);
        return queryExpression;
    }
    
    private class ExpressionReplacer : ExpressionVisitor
    {
        private readonly Func<Expression, bool> _selector;
        private readonly Func<Expression, Expression> _replace;

        public ExpressionReplacer(Func<Expression, bool> selector, Func<Expression, Expression> replace)
        {
            _selector = selector;
            _replace = replace;
        }

        public override Expression Visit(Expression exp)
        {
            if (exp == null) return exp;
            if (_selector(exp)) exp = _replace(exp);
            return base.Visit(exp);
        }
    }
}

Yup, that's a lot of code, but a lot of it is just utility methods for interacting with expression trees (specifically, the parts that get MethodInfo references without resorting to magic strings, and the expression visitor that mutates the expression tree for a purpose I'll get to in a minute).
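One of those utility methods, MethodInfoFromExpression, isn't shown above. A minimal sketch of what it might look like, assuming it just pulls the MethodInfo out of the method call in the supplied expression's body:

private static MethodInfo MethodInfoFromExpression<TDelegate>(Expression<TDelegate> expression)
{
    // The body of an expression like (a, b) => NpgsqlTextFunctions.TsRank(a, b)
    // is a MethodCallExpression, which carries the MethodInfo we're after,
    // without needing magic strings like "TsRank".
    var call = (MethodCallExpression)expression.Body;
    return call.Method;
}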

Using this definition class the body of the search method in EntityFullTextSearcher now looks like this:

public IQueryable<EntityListProjectionModel> Search(IQueryable<EntityListProjectionModel> queryable, SearchQuery searchQuery)
{
    var query = _postgresQueryBuilder.Build(searchQuery);
    if (string.IsNullOrEmpty(query)) return queryable;

    queryable = queryable.Where(_def.CreateWhere(query));

    return queryable;
}

The body of the sort method looks very similar, calling CreateOrderBy instead.
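For completeness, my reconstruction of that sort method body (not the original code, but it should amount to this):

public IQueryable<EntityListProjectionModel> Sort(IQueryable<EntityListProjectionModel> queryable, SearchQuery searchQuery)
{
    if (searchQuery == null || !searchQuery.Terms.Any()) return queryable;
    var query = _postgresQueryBuilder.Build(searchQuery);
    if (string.IsNullOrEmpty(query)) return queryable;

    return queryable.OrderByDescending(_def.CreateOrderBy(query));
}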

But what does it all mean?

Expressing Myself

There are two important things to notice about the search definition class above.

The first is that it dynamically constructs the expression tree that would have previously been supplied by the inline lambda expression. You can see this construction inside the MakeMethodCallExpression method. It creates a ParameterExpression (for the incoming Entity from the IQueryable), and then constructs a Call node that has the definition as one parameter and the query as the other, just like how it was originally specified inline. This method is then used in the construction of the expression trees for Where and OrderBy.

The second important thing is the usage of the ExpressionReplacer class. The lambda expression that contains the definition for the Full Text Search (the one returned by the Definition property) contains a parameter of type Entity. The lambda expression constructed by the CreateWhere and CreateOrderBy methods also contains a parameter of type Entity.

They are not the same thing.

In order for the inner expression parameter to be correctly bound to the same value as the outer expression parameter, it was necessary to walk the expression tree, find the inner parameter expression and replace it with the outer one. Assuming the tree is unlikely to contain multiple parameter expressions of the same type that should be bound to different values, this approach works just fine to consolidate all of the parameter expressions onto the same eventual value.

The end result of the replacement is a new expression tree where all parameter expressions of type Entity are the same reference (there are only two, one inner and one outer).
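To see why the replacement is necessary at all, note that two ParameterExpression instances of the same type (and even the same name) are still distinct nodes, and expression compilation binds parameters by reference, not by name. A tiny hypothetical illustration:

var p1 = Expression.Parameter(typeof(Entity), "b");
var p2 = Expression.Parameter(typeof(Entity), "b");

Console.WriteLine(ReferenceEquals(p1, p2)); // False

// A lambda declared with p1 whose body references p2 will fail at translation
// or compilation time with a "variable referenced from scope, but it is not
// defined" style error, because p2 was never declared by the lambda.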

Conclusion

Looking back at the effort that went into this, I’m not entirely sure it was worth it. I did accomplish my original goal (only one place where the Full Text Search definition was defined), but I’m not sure if the code is more understandable.

It felt good to separate the concerns, which makes it clear that the searcher and sorter classes are dependent on the same thing, and all of the complicated and hard to understand expression tree logic is encapsulated inside a single class, but I worry that it will be almost impenetrable to anyone who comes after me who doesn’t understand expression trees. Keep in mind, at this point, even after this entire adventure, I barely understand them. Me 3 months from now will probably look at the code, get confused, google part of it, find this blog post and then relive the entire affair.

On the upside, this did generate a significant increase in our understanding of the way the magic of Entity Framework works, and sparked quite a few discussions within the team that led to some interesting insights for all involved. It even spawned some other changes, where we can now use our own custom methods inside the expressions being passed into Entity Framework and have them transformed appropriately into things that EF knows how to turn into SQL.

That’s a topic for another time though.


Search is one of those features that most people probably don't think about all that much. It's ubiquitous across every facet of the internet, and is a core part of what we do every day. It's kind of just… there, and I personally can't imagine using software without it. Well, I can imagine it, and it doesn't look good in my head.

Our latest development efforts have been focused around putting together a services platform that we can extend moving forward, providing a semblance of cloud connected functionality to an extremely valuable data set that is currently locked in a series of on-premises databases. The initial construction of this platform is being driven by a relatively simple website for showing a subset of the entities in the system. The intent is that this will be the first step in allowing that data to be accessed outside its current prison, letting us measure interest and use those findings to drive the direction of future development.

To tie everything back in with my first paragraph, we’ve hit a point where we need to provide the ability to search.

Where Are You?

Specifically, we need to provide a nice intuitive search that doesn't require people to have a fundamental understanding of the underlying data structures. The intent is that it will be used within a webpage initially, to help people narrow down the list of things that they are looking at. Type a few letters/partial words and have the list be automatically reduced to only those things that are relevant, ordered by how relevant they are (i.e. type in green and 28 Green St, Indooroopilly should come up first, with 39 Smith St, Greenslopes after it, and so on).

From a webpage point of view, search looks like a very small piece, at least as far as the total percentage of the presentation it occupies. It's just a small box that you type things into, how hard could it be?

From an API point of view, search can be a project unto itself, especially when you consider weighting, ranking, what fields are searchable, and so on. That’s not even taking into account cross entity searching.

At this point our API already has partial filtering and sorting built into it, using fairly standard query string parameters (filter={key}::{value}[|{key}::{value}]+ and sort=[{direction}]{key}[,[{direction}]{key}]+). This allowed us to support complex interactions with lists using GET requests (which are easier to cache due to HTTP semantics), without having to resort to complex POST bodies. It's also much easier to query from the command line, and it's very descriptive from a logging point of view when doing analysis on raw IIS logs.

You may be wondering what the difference is between searching and filtering. To me, it's a subtle difference. Both are used to winnow down a full data set to the bits that you are interested in. Filtering is all about directly using field names and doing comparisons (you know there is an Address.Suburb field, so you filter to only things in Forest Lake). Searching is more free form, and allows you to enter just about anything and have the service make a decision about what might be relevant. They don't necessarily need to be separate, but in this case I think the separation of concerns has value.

To keep to our pattern, we want to add a new query string parameter called search. For our purposes, it should be fairly simple (some text, no real language specification) and should be able to be combined with our existing sorting and filtering functionality.
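As a purely hypothetical example (the path, field names and values here are invented), a single request could then combine all three mechanisms:

GET /entities?filter=Address.Suburb::Forest%20Lake&sort=Name&search=green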

Simple enough conceptually.

Where In God's Name Are You!

Inside our API we leverage Entity Framework and PostgreSQL for querying. This has worked pretty well so far, as it was simple enough to use DynamicLinq to support filtering and sorting (based on keywords we control, not on fields in the data model being returned) and have everything execute at the database for maximum efficiency.

When it comes to search, PostgreSQL exposes a series of features that allow you to do Full Text Searching, which is pretty much exactly what we want. This deals with things like partial matching, case insensitivity, weighting and ranking, which all combine to make for a nice searching experience for the user.

Combining the Full Text Search functionality with the whole IQueryable/Entity Framework insanity though, that’s where things started to get complicated.

We have used PostgreSQL’s Full Text Search functionality in the past, in a different API. At the time, we were less confident in our ability to create a nice descriptive API following HTTP semantics, so we simply did a /search endpoint that accepted POST requests with a very custom body defining the search to perform.

Under the hood, because we didn’t have any other sorting or filtering, we just constructed the SQL required to do the Full Text Search and then executed it through Entity Framework. It wasn’t the best solution, but it met our immediate needs, at least for that project.

Unfortunately, this made testing search on an In Memory Database impossible, which was annoying, but we did manage to isolate the execution of the search into a series of Search Provider classes that allowed us to abstract out this dependency and test it independently.

When it came time to incorporate search into our latest API, we looked for a better way to do it. A way that didn’t involve constructing SQL ourselves.

A Wild Commercial Library Appears

After a small amount of research, one of my colleagues found a commercial library that appeared to offer the ability to construct Full Text Search queries within Linq statements (and have them be automatically turned into SQL, as you would expect). It was a glorious day, and early experiments seemed to show that it worked just as we expected. We could include normal Where and OrderBy statements along with the Full Text Search match statements, and everything would execute at the database level. Nice and efficient.

However, when it was time to move from prototype to actual implementation, it all fell apart. Replacing our existing PostgreSQL provider was fairly painless (they provided very similar functionality), but we had problems with our database migrations, and the documentation was terrible.

We use the Code First approach for our database, so migrations are a core part of how we manage our schema. Everything worked just fine when running on top of a database that already existed (which is what we were doing in the prototype), but trying to get the new library to create a database correctly from nothing (which we do all the time in our tests) was failing miserably.

We worked through this issue with the help of the vendor (whose solution was to give us two magical lines of code that referred to the deletion strategy for the database, on static classes no less), but the whole interaction had somewhat soured us on the library.

The deal breaker came when we discovered that the licencing for the library would have been a nightmare to incorporate into our build process. We're so used to using open source tools (or even just tools that are licenced intelligently, with licence files or keys) that we didn't even think of this at first. As we wanted to include the commercial library inside a Nuget package of our own, we would have needed to identify within the library all of the executables that would ever use it. The final nail in the coffin was that we would have had to install (install!) the library onto our build agents, which, to me, is a massively stupid move that just makes it harder to build software.

It Can’t Be That Hard

Investigating the way in which the library accomplished Full Text Search, we thought that maybe we could implement it ourselves. It didn’t look particularly difficult, just some methods that exist purely to be translated into SQL at a later date.

It turns out, it is actually quite hard.

Luckily, something else came to our rescue.

Old Faithful

It turned out that the library we were originally using for EF compatibility with PostgreSQL (which, by the way, is Npgsql, an amazing open source library) had very recently received a pull request that did exactly what we wanted: it added the Full Text Search functionality into EF 6.

Npgsql has actually offered the core Full Text Search functionality via code since version 3 (through the NpgsqlTsVector and NpgsqlTsQuery classes); it just wasn't compatible with the EF/Linq way of doing things.

Unfortunately, it wasn’t all good.

The pull request had been merged, but only into the branch for the next hotfix (3.0.6), which was not available through the normal Nuget channels yet. We searched around for an unstable release (on MyGet and similar sites), and found some things, but they were really unstable, so much so that we couldn’t get anything to work properly.

While we waited for the hotfix to be officially released, we downloaded and compiled the source ourselves. After a few hiccups with dependencies and the build process, we got everything working and manually included the Npgsql binaries into our library. Obviously this is a temporary solution while we wait for the official release, but it's enough to get us moving forward for now.

This is one of the great things about open source, if this were a commercial library we would have been at the mercy of that particular organisation, and it would have blocked us from making any progress at all.

Conclusion

In the end we accomplished what we originally set as our ideal. We have incorporated Full Text Searching (with weights and ranking) into our current querying pipeline, allowing us to intelligently combine searching, filtering and sorting together and have it all executed at the database level. There is still a significant amount of work to be done to make sure that what we’ve put together is performant once we get some real traffic on it, but I think it shows promise. I do have ideas about eventually leveraging Elasticsearch to do the search (and exposing the very familiar Lucene query syntax from the API), but that’s a much larger amount of work than just leveraging an existing piece of architecture.

This was one of those pieces of functionality where it felt like we spun our wheels for a while, struggling with technical issues. If we had compromised and put together a separate /search endpoint, we could probably have re-used our old solution (constructing the SQL ourselves using helper methods, or even using the Npgsql functions that we didn't realise existed at the time) and delivered something more quickly.

In the end though, I think it would have been a worse solution overall, compromising on the design and general cohesiveness of the API in favour of just shipping something.

That sort of thing feels good in the short term, but just builds potential pain into a profession that is already pretty painful.


And now, for the thrilling conclusion! The thrilling conclusion to what you might ask? Check back here for the first part in this saga.

I mentioned very briefly in the first part of this epic that we attempted to fix a strange threading bug with token generation via the ASP.NET Identity Framework by making sure that the Entity Framework DbContext was fully initialized (i.e. model created, connection established, etc) before it left the factory method. Initial tests were promising, but it turns out this did not fix the issue.

I mention this because I had absolutely no luck reproducing the connection leak when I was running locally (with or without a profiler attached). I could easily force timeouts when getting a connection from the pool (because it was too busy), but I couldn’t reproduce the apparent situation where there were connections established that could not be actively used.

When going through the combination of CloudWatch logs for RDS (to track connection usage) and our own ELK stack, I found a correlation between the errors that sometimes occurred when generating tokens and the increase in the usage of connections. This pattern was pretty consistent. Whenever there was a cluster of errors related to token generation, there was an increase in the total number of connections used by the service, which never went down again until the application pool was recycled at the default time of 29 hours from the last recycle.

Token Failure

We’ve been struggling with the root cause of the token generation failures for a while now. The most annoying part is that it doesn’t fail all the time. In fact, my initial load tests showed only around a 1% failure rate, which is pretty low in the scheme of things. The problem manifests itself in exceptions occurring when a part of the Identity Framework attempts to use the Entity Framework DbContext that it was given. It looks as though there is some sort of threading issue with Entity Framework, which makes sense conceptually. Generally EF DbContext objects are not thread safe, so you shouldn’t attempt to use them on two different threads at the same time.

The errors were many and varied, but all consistently came from our implementation of the OAuthAuthorizationServerProvider. A few examples are below:

System.Data.Entity.Core.EntityCommandExecutionException: An error occurred while executing the command definition. See the inner exception for details. ---> System.InvalidOperationException: Operation is not valid due to the current state of the object.
   at Npgsql.NpgsqlConnector.StartUserAction(ConnectorState newState)
   at Npgsql.NpgsqlCommand.ExecuteDbDataReaderInternal(CommandBehavior behavior)
   at Npgsql.NpgsqlCommand.ExecuteDbDataReader(CommandBehavior behavior)
   at System.Data.Entity.Infrastructure.Interception.InternalDispatcher`1.Dispatch[TTarget,TInterceptionContext,TResult](TTarget target, Func`3 operation, TInterceptionContext interceptionContext, Action`3 executing, Action`3 executed)
   at System.Data.Entity.Infrastructure.Interception.DbCommandDispatcher.Reader(DbCommand command, DbCommandInterceptionContext interceptionContext)
   at System.Data.Entity.Core.EntityClient.Internal.EntityCommandDefinition.ExecuteStoreCommands(EntityCommand entityCommand, CommandBehavior behavior)
   --- End of inner exception stack trace ---
   at System.Data.Entity.Core.EntityClient.Internal.EntityCommandDefinition.ExecuteStoreCommands(EntityCommand entityCommand, CommandBehavior behavior)
   at System.Data.Entity.Core.Objects.Internal.ObjectQueryExecutionPlan.Execute[TResultType](ObjectContext context, ObjectParameterCollection parameterValues)
   at System.Data.Entity.Core.Objects.ObjectContext.ExecuteInTransaction[T](Func`1 func, IDbExecutionStrategy executionStrategy, Boolean startLocalTransaction, Boolean releaseConnectionOnSuccess)
   at System.Data.Entity.Core.Objects.ObjectQuery`1.<>c__DisplayClass7.<GetResults>b__5()
   at System.Data.Entity.Core.Objects.ObjectQuery`1.GetResults(Nullable`1 forMergeOption)
   at System.Data.Entity.Core.Objects.ObjectQuery`1.<System.Collections.Generic.IEnumerable<T>.GetEnumerator>b__0()
   at System.Data.Entity.Internal.LazyEnumerator`1.MoveNext()
   at System.Linq.Enumerable.FirstOrDefault[TSource](IEnumerable`1 source)
   at System.Linq.Queryable.FirstOrDefault[TSource](IQueryable`1 source, Expression`1 predicate)
   at [OBFUSCATION!].Infrastructure.Repositories.AuthorizationServiceRepository.GetApplicationByKey(String appKey, String appSecret) in c:\[OBFUSCATION!]\Infrastructure\Repositories\AuthorizationServiceRepository.cs:line 412
   at [OBFUSCATION!].Infrastructure.Providers.AuthorizationServiceProvider.ValidateClientAuthentication(OAuthValidateClientAuthenticationContext context) in c:\[OBFUSCATION!]\Infrastructure\Providers\AuthorizationServiceProvider.cs:line 42
   
System.NullReferenceException: Object reference not set to an instance of an object.
   at Npgsql.NpgsqlConnector.StartUserAction(ConnectorState newState)
   at Npgsql.NpgsqlCommand.ExecuteDbDataReaderInternal(CommandBehavior behavior)
   at Npgsql.NpgsqlCommand.ExecuteDbDataReader(CommandBehavior behavior)
   at System.Data.Entity.Infrastructure.Interception.InternalDispatcher`1.Dispatch[TTarget,TInterceptionContext,TResult](TTarget target, Func`3 operation, TInterceptionContext interceptionContext, Action`3 executing, Action`3 executed)
   at System.Data.Entity.Infrastructure.Interception.DbCommandDispatcher.Reader(DbCommand command, DbCommandInterceptionContext interceptionContext)
   at System.Data.Entity.Core.EntityClient.Internal.EntityCommandDefinition.ExecuteStoreCommands(EntityCommand entityCommand, CommandBehavior behavior)
   at System.Data.Entity.Core.Objects.Internal.ObjectQueryExecutionPlan.Execute[TResultType](ObjectContext context, ObjectParameterCollection parameterValues)
   at System.Data.Entity.Core.Objects.ObjectContext.ExecuteInTransaction[T](Func`1 func, IDbExecutionStrategy executionStrategy, Boolean startLocalTransaction, Boolean releaseConnectionOnSuccess)
   at System.Data.Entity.Core.Objects.ObjectQuery`1.<>c__DisplayClass7.<GetResults>b__5()
   at System.Data.Entity.Core.Objects.ObjectQuery`1.GetResults(Nullable`1 forMergeOption)
   at System.Data.Entity.Core.Objects.ObjectQuery`1.<System.Collections.Generic.IEnumerable<T>.GetEnumerator>b__0()
   at System.Data.Entity.Internal.LazyEnumerator`1.MoveNext()
   at System.Linq.Enumerable.FirstOrDefault[TSource](IEnumerable`1 source)
   at System.Linq.Queryable.FirstOrDefault[TSource](IQueryable`1 source, Expression`1 predicate)
   at [OBFUSCATION!].Infrastructure.Repositories.AuthorizationServiceRepository.GetApplicationByKey(String appKey, String appSecret) in c:\[OBFUSCATION!]\Infrastructure\Repositories\AuthorizationServiceRepository.cs:line 412
   at [OBFUSCATION!].Infrastructure.Providers.AuthorizationServiceProvider.ValidateClientAuthentication(OAuthValidateClientAuthenticationContext context) in c:\[OBFUSCATION!]\Infrastructure\Providers\AuthorizationServiceProvider.cs:line 42

In the service, this doesn't make a huge amount of sense. There is one DbContext created per request (via Owin), and while the Owin middleware is asynchronous by nature (meaning that execution can jump around between threads), there is no parallelism. The DbContext should never be in use on multiple threads at once, but apparently it was.
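The per-request wiring is the standard Identity Framework pattern; something like this (a sketch using the usual Microsoft.AspNet.Identity.Owin helpers, with AuthDbContext standing in for our actual context type):

// At startup: register a factory so that each OwinContext lazily creates (and
// later disposes) its own DbContext, scoped to the lifetime of the request.
app.CreatePerOwinContext(() => new AuthDbContext());

// Later, inside middleware or the OAuth provider: retrieve the request-scoped instance.
var dbContext = context.OwinContext.Get<AuthDbContext>();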

So it was either a DbContext somehow being shared across threads, or something going seriously wrong in the connection pooling code for Npgsql.

Scope Increase

As I didn’t quite understand how the dependency injection/object lifetime management worked via the OwinContext, I had my suspicions that something was going awry there. Either the DbContext was not in fact generated once per request, or there was some strange race condition that allowed a DbContext to be reused on more than one thread.

I decided to rewrite the way in which dependencies are obtained in the service. Instead of generating a DbContext per request, I would supply a DbContextFactory to everything, and let it all generate its own, temporarily scoped DbContext that it is responsible for disposing.

In order to accomplish this I switched to an IoC container that I was more familiar with, Ninject. Not a small amount of work, and not without added complexity, but I felt that it made the code more consistent with the rest of our code bases and generally better.

In retrospect, I should have verified that I could reproduce the token generation errors at will first, but I didn’t. I wrote the test after I’d spent the better part of a day switching out the dependency injection mechanisms. This was a mistake.

Since the errors always occurred during the execution of a single endpoint, I wrote a test that uses 10 tasks to spam that particular endpoint. If none of the tasks fault within a time limit (i.e. no exceptions are thrown), then the test is considered a success. Basically a very small, focused, stress test to be run automatically as part of our functional test suite.

[Test]
[Category("functional")]
public void WhenAttemptingToGenerateMultipleTokensAtTheSameTime_NoRequestsFail()
{
    var authClientFactory = _resolver.Get<IAuthClientFactory>();
    var app = new ApplicationBuilder(authClientFactory.CreateSeeder())
        .WithRole("auth_token_generate")
        .WithRole("auth_customer_register")
        .WithRole("auth_database_register")
        .WithRole("auth_user_register")
        .WithRole("auth_query")
        .Build();

    var userBuilder = new UserBuilder(authClientFactory.CreateFromApplication(app.ApplicationDetails.Key, app.ApplicationSecret));
    userBuilder.Build();

    List<Task> tokenGenerationTasks = new List<Task>();
    var cancellation = new CancellationTokenSource();
    for (int i = 0; i < 10; i++)
    {
        var task = Task.Factory.StartNew
        (
            () =>
            {
                var client = authClientFactory.CreateFromApplication(app.ApplicationDetails.Key, app.ApplicationSecret);
                while (true)
                {
                    if (cancellation.Token.IsCancellationRequested) break;
                    var token = client.TokenGenerate(userBuilder.CustomerId + "/" + userBuilder.DatabaseId + "/" + userBuilder.UserCode, userBuilder.Password);
                }
            },
            cancellation.Token,
            TaskCreationOptions.LongRunning,
            TaskScheduler.Default
        );

        tokenGenerationTasks.Add(task);
    }

    // The idea here is that if any of the parallel token generation tasks throw an exception, it will come out here
    // during the wait.
    Task.WaitAny(tokenGenerationTasks.ToArray(), TimeSpan.FromSeconds(15));
    cancellation.Cancel();

    var firstFaulted = tokenGenerationTasks.FirstOrDefault(a => a.IsFaulted);
    if (firstFaulted != null) throw firstFaulted.Exception;
}

The first time I ran the test against a local service it passed successfully…

Now, I don’t know about anyone else, but when a test works the first time I am immediately suspicious.

I rolled my changes back and ran the test again, and it failed.

So my restructuring successfully fixed the issue, but why?

The Root Of The Problem

I hadn't actually understood the issue; all I did was make enough changes such that it seemed to no longer occur. Without that understanding, if it recurred, I would have to start all over again, possibly misdirecting myself with the changes that I made last time.

Using the test that guaranteed a reproduction, I investigated in more depth. Keeping all of my changes reverted, I was still getting a weird sampling of lots of different errors, but they were all consistently coming from one of our repositories (classes which wrap a DbContext and add extra functionality) whenever it was used within our OAuthAuthorizationServerProvider implementation.

Staring at the code for a while, the obviousness of the issue hit me.

At startup, a single OAuthAuthorizationServerProvider implementation is created and assigned to generate tokens for requests to the /auth/token endpoint.

This of course meant that all of the functions in that provider needed to be thread safe.

They were not.

Both of the functions in the class set and then used a class level variable, which in turn had a dependency on a DbContext.

This was the smoking gun. If two requests came in quickly enough, one would set the variable (using the DbContext for its request), the other would do the same (using a different DbContext), and then the first would attempt to use a different thread's DbContext (indirectly, through the variable). This would rightly cause an error (as multiple threads tried to use the same DbContext) and throw an exception, failing the token generation.
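In pseudo-form, the broken pattern looked something like this (the names are mine, not the actual code):

public class AuthorizationServiceProvider : OAuthAuthorizationServerProvider
{
    // DANGER: Owin creates ONE provider instance at startup and reuses it for
    // every request to /auth/token, so this field is shared across all threads.
    private AuthorizationServiceRepository _repository;

    public override Task ValidateClientAuthentication(OAuthValidateClientAuthenticationContext context)
    {
        // Thread A stores a repository wrapping ITS request-scoped DbContext...
        _repository = new AuthorizationServiceRepository(context.OwinContext.Get<AuthDbContext>());

        // ...but before the next line runs, thread B may have overwritten the
        // field with a repository wrapping B's DbContext. Thread A then uses
        // B's context while B is also using it: the threading violation.
        var application = _repository.GetApplicationByKey("key", "secret");

        return Task.FromResult(0);
    }
}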

I abandoned my changes (though I will probably make them again over time), removed the class variable and re-ran the test.

It was all good. No errors at all, even after running for a few hours.

But why did the error cause a resource leak at the database connection level?

Leaky Logic

In the end I didn't find out exactly why threading errors with Entity Framework (using Npgsql) were causing connection leaks. I plan to investigate in more depth in the future, and I'll probably blog about my findings, but for now I was just happy to have the problem solved.

With the bug fixed, profiling over a period of at least 24 hours showed no obvious connection leaks as a result of normal traffic. Previously this would have guaranteed at least 10 connections leaking, possibly more. So for now the problem is solved and I need to move on.

Summary

Chasing down resource leaks can be a real pain, especially when you don’t have a reliable reproduction.

If I had realised earlier that the token generation failures and connection leaks were related, I would have put more effort into reproducing the first in order to reproduce the second. It wasn't immediately obvious that they were linked though, so I spent a lot of time analysing the code trying to figure out what could possibly be leaking valuable resources. This was a time consuming and frustrating process, ultimately leading nowhere.

Once I finally connected the dots between the token failures and the connection leak, everything came together, even if I didn’t completely understand why the connections were leaking in error situations.

Ah well, can't win 'em all.


The core of the newest suite of components that we are building is an authentication and authorization service. Its goal is simple, provide a central location for identifying clients and the data sets and users that they own, for the purposes of using remote, cloud based resources. It forms part of a greater strategy of helping to get previously client side only data into a location where it can be used by mobile applications, websites and third party integrations.

This service also includes the management of application keys, allowing us to control who is able to access the cloud components, and uniquely identifying where requests are coming from. Typically we generate a unique application for each piece of software that wants to interact with the system, including third party software and our own.

Most of our services up to now have been built using the Nancy framework. The main reason for this is because its simple and easy to understand, and gives a lot of control over the entire pipeline, which is important for a number of reasons (one of which is logging).

Unfortunately, when it came time to build the authentication and authorization service, we were unable to use Nancy. Instead we had to use Web API. Now, don't get me wrong, Web API is a fine framework, but the majority of our experience lay with Nancy, so we might not have put together the best Web API implementation. We were forced to use Web API because the component we were going to use for the core of the authentication (the ASP.NET Identity Framework) integrated easily with Web API, but didn't seem to have any obvious way to integrate with Nancy (when hosted in ASP.NET).

Really, all of the above is just background to the problem that I’m currently struggling with.

Leaking connections to the persistence store.

A Nice Deep Pool

The backend of our authentication and authorization service is a PostgreSQL database hosted in AWS using RDS. We chose PostgreSQL because SQL Server is generally too expensive (even hosted through AWS) and because there was a PostgreSQL provider for Entity Framework that seemed solid (Npgsql). PostgreSQL is also an extremely mature database engine, and is just far enough out of our comfort zone to provide some new learning opportunities.

Because of the way the Identity Framework uses Entity Framework, all we had to do was create the DbContext using the appropriate connection/provider and EF would take care of the rest, seamlessly persisting our data into the appropriate place.

To be honest, it was actually pretty seamless. We didn’t have any issues (that I can remember anyway) and everything seemed to go pretty smoothly. The service wasn't being actively used in a production environment (because we were still developing the solution it was built for), but it was getting regular traffic and handling it admirably.

Then we did some load testing and it crashed and burnt.

Investigation showed that the service stopped working because it could no longer open new connections to the database, as the database already had too many connections currently in use. We were only using a db.t2.micro RDS instance, and it has a cap on the number of connections (approximately 26), so that made sense. As soon as there was a greater number of requests trying to access the database than the available number of connections, requests would fail.

A quick investigation showed that the database connections being used for EF did not have pooling enabled. The pooling supported in Npgsql is an entirely client side construct, allowing for efficient reuse of connections by avoiding the work that goes into establishing the pipeline. It also offers the ability to queue requests for connections once the configured pool limit has been reached. Obviously there is a timeout (to prevent deadlock), but this allowed us to set the connection pool limit close to the RDS limit, safe in the knowledge that subsequent requests would simply be slower, rather than failing altogether.
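Enabling and tuning the pool is just a connection string change; something like the following (values illustrative only, not our actual configuration):

Server=<rds-endpoint>;Port=5432;Database=auth;User Id=<user>;Password=<secret>;Pooling=true;MinPoolSize=1;MaxPoolSize=20;Timeout=30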

With pooling enabled, the load tests passed with flying colours. I knew that we would have to scale up the RDS instance based on real traffic, but that could happen later. The pooling allowed the service to make better use of limited resources.

For now, problem solved.

Right?

These Are My Connections! Mine!

After running happily for many weeks, the staging service went down hard. No load tests involved this time, just normal traffic. Shortly afterwards (within days), a newly provisioned production server did the same thing.

Looking at the logs the symptoms seemed to be similar, errors related to being unable to obtain a connection because the server did not have any available. There was no real sign that the service was doing anything particularly difficult that would require it to hold the connections for an extended amount of time, so we were a bit confused.

Being that the service had been running happily for a number of weeks, it must have been something we changed recently. Unfortunately, the only recent change made was a fix to a threading issue with Entity Framework, Owin and the Identity Framework (2 threads attempting to lazily initialize the same DbContext at the same time). It was a fairly trivial fix: instead of returning the DbContext from the factory straightaway, make sure that the model has been initialized first.
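That fix boiled down to forcing EF to initialize eagerly inside the factory, roughly like this (a sketch; the factory and context names are stand-ins for ours):

public AuthDbContext Create()
{
    var context = new AuthDbContext();

    // Force the model to be built and the context fully initialized NOW, on
    // this thread, instead of lazily on first use (where two threads could
    // race to initialize the same context).
    context.Database.Initialize(force: false);

    return context;
}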

It was time to dig into it in more depth.

My hypothesis was that if the application was leaking connections to the database (such that they were hanging around but no longer in a usable state), then I should be able to see them when analysing a memory dump. I started up the Auth API locally using the SciTech .NET Memory Profiler, took an initial snapshot before doing anything meaningful, and then started the stress tests.

Every 15 minutes I would take another snapshot of the memory space, specifically looking for any DbContext or NpgsqlConnection objects that were hanging around for longer than they should have been.

What I found was interesting.

Slowly but surely a collection of NpgsqlConnection objects were accumulating. They were disposed, but could not be garbage collected because they were being rooted by an event handler inside the Npgsql connection pooling implementation. I’m not actually sure if they were holding on to server side resources (being disposed, you would think that they wouldn’t be), but it didn’t look promising.

I turned off connection pooling and those objects stopped being accumulated. Obviously the service also stopped working properly once it reached a certain amount of traffic (when reaching the connection cap of the RDS instance), but it did at least confirm that the issue seemed to be within Npgsql itself and not in our code.

Luckily, a new version of Npgsql had been released (we were using 2.2.5, but version 3 was now available). I removed all of the references to the old package and installed the new one. After dealing with some differences (some very strange errors relating to attempting to use disposed DbContexts during Owin cleanup due to the unnecessary usage of SaveChangesAsync), everything seemed to be working just fine. All of the functional tests passed, and the stress tests didn’t show anything interesting unless I tuned them up to “crash the computer” level.

Pushed, built, deployed.

Job done.

Where Have All The Connections Gone

Nope.

Still leaking connections. Still fails when some critical mass of leaked connections is reached.

Come back next week for part 2, when I hopefully have an actual answer as to where the connections are going.