
Search is one of those features that most people probably don’t think about all that much. It’s ubiquitous across every facet of the internet, and is a core part of what we do every day. It’s kind of just…there, and I personally can’t imagine using software without it. Well, I can imagine it, and it doesn’t look good in my head.

Our latest development efforts have been focused on putting together a services platform that we can extend moving forward, providing a semblance of cloud-connected functionality to an extremely valuable data set that is currently locked in a series of on-premises databases. The initial construction of this platform is being driven by a relatively simple website for showing a subset of the entities in the system. The intent is that this will be the first step in allowing that data to be accessed outside its current prison, letting us measure interest and use those findings to drive the direction of future development.

To tie everything back in with my first paragraph, we’ve hit a point where we need to provide the ability to search.

Where Are You?

Specifically, we need to provide a nice intuitive search that doesn’t require people to have a fundamental understanding of the underlying data structures. The intent is that it will be used within a webpage initially, to help people narrow down the list of things that they are looking at. Type a few letters/partial words and have the list automatically reduced to only those things that are relevant, ordered by how relevant they are (i.e. type in “green” and 28 Green St, Indooroopilly should come up first, with 39 Smith St, Greenslopes after it, and so on).

From a webpage point of view, search looks like a very small piece, at least as far as the total percentage of the presentation it occupies. It’s just a small box that you type things into; how hard could it be?

From an API point of view, search can be a project unto itself, especially when you consider weighting, ranking, what fields are searchable, and so on. That’s not even taking into account cross-entity searching.

At this point our API already has partial filtering and sorting built into it, using fairly standard query string parameters (filter={key}::{value}[|{key}::{value}]+ and sort=[{direction}]{key}[,[{direction}]{key}]+). This allowed us to support complex interactions with lists using GET requests (which are easier to cache due to HTTP semantics), without having to resort to complex POST bodies. It’s also much easier to query from the command line, which is nice, and is very descriptive from a logging point of view when doing analysis on raw IIS logs.
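To make that concrete, a request combining the two might look like this (the resource and field names are invented for the example, and I’m assuming a leading - marks a descending sort):

    GET /properties?filter=Address.Suburb::Forest%20Lake|Bedrooms::3&sort=-Bedrooms,Address.Street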

You may be wondering what the difference is between searching and filtering. To me, it’s a subtle difference. Both are used to winnow a full data set down to the bits that you are interested in. Filtering is all about directly using field names and doing explicit comparisons (you know there is an Address.Suburb field, so you filter to only things in Forest Lake). Searching is more free form, and allows you to enter just about anything and have the service make a decision about what might be relevant. They don’t necessarily need to be separate, but in this case I think the separation of concerns has value.

To keep to our pattern, we want to add a new query string parameter called search. For our purposes, it should be fairly simple (some text, no real language specification) and should be able to be combined with our existing sorting and filtering functionality.
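So the same hypothetical request from earlier, with the new parameter in play, becomes:

    GET /properties?search=green&filter=Bedrooms::3&sort=-Bedrooms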

Simple enough conceptually.

Where In God’s Name Are You!

Inside our API we leverage Entity Framework and PostgreSQL for querying. This has worked pretty well so far, as it was simple enough to use DynamicLinq to support filtering and sorting (based on keywords we control, not on fields in the data model being returned) and have everything execute at the database for maximum efficiency.
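To give a feel for it, here is a minimal sketch of the DynamicLinq part, assuming a hypothetical EF context with a Properties DbSet and a field name that one of our filter keywords maps to:

    using System.Linq;
    using System.Linq.Dynamic; // the DynamicLinq NuGet package

    // The field comes from a keyword mapping we control, not from the caller.
    var field = "Address.Suburb";

    // Predicates and orderings are plain strings, so they are easy to build
    // from query string parameters. The whole thing stays an IQueryable and
    // is translated to SQL, so it executes at the database.
    var results = context.Properties
        .Where(field + " == @0", "Forest Lake")
        .OrderBy(field + " descending")
        .ToList();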

When it comes to search, PostgreSQL exposes a series of features that allow you to do Full Text Searching, which is pretty much exactly what we want. This deals with things like partial matching, case insensitivity, weighting and ranking, which all combine to make for a nice searching experience for the user.

Combining the Full Text Search functionality with the whole IQueryable/Entity Framework insanity, though, is where things started to get complicated.

We have used PostgreSQL’s Full Text Search functionality in the past, in a different API. At the time, we were less confident in our ability to create a nice descriptive API following HTTP semantics, so we simply did a /search endpoint that accepted POST requests with a very custom body defining the search to perform.

Under the hood, because we didn’t have any other sorting or filtering, we just constructed the SQL required to do the Full Text Search and then executed it through Entity Framework. It wasn’t the best solution, but it met our immediate needs, at least for that project.
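For context, that older approach looked roughly like this (the table, column and type names are invented for the example; the real thing was hidden behind helper methods):

    using Npgsql;

    // Hand-built Full Text Search: match against a tsvector, then order by
    // rank, executed through EF6's raw SQL support.
    const string sql = @"
        SELECT *
        FROM properties
        WHERE to_tsvector('english', address) @@ plainto_tsquery('english', @term)
        ORDER BY ts_rank(to_tsvector('english', address),
                         plainto_tsquery('english', @term)) DESC;";

    var results = context.Database
        .SqlQuery<Property>(sql, new NpgsqlParameter("term", searchTerm))
        .ToList();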

Unfortunately, this made testing search on an In Memory Database impossible, which was annoying, but we did manage to isolate the execution of the search into a series of Search Provider classes that allowed us to abstract out this dependency and test it independently.
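The seam itself was nothing fancy; something like this sketch (names invented for illustration):

    using System.Linq;

    // Isolates the database-specific search execution from the rest of the
    // query pipeline, so tests can substitute their own implementation.
    public interface ISearchProvider<T>
    {
        IQueryable<T> Search(IQueryable<T> source, string term);
    }

The production implementation executed the hand-built SQL; the test implementation just did simple in memory matching.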

When it came time to incorporate search into our latest API, we looked for a better way to do it. A way that didn’t involve constructing SQL ourselves.

A Wild Commercial Library Appears

After a small amount of research, one of my colleagues found a commercial library that appeared to offer the ability to construct Full Text Search queries within Linq statements (and have them be automatically turned into SQL, as you would expect). It was a glorious day, and early experiments seemed to show that it worked just as we expected. We could include normal Where and OrderBy statements along with the Full Text Search match statements, and everything would execute at the database level. Nice and efficient.

However, when it was time to move from prototype to actual implementation, it all fell apart. Replacing our existing PostgreSQL provider was fairly painless (they provided very similar functionality), but we had problems with our database migrations, and the documentation was terrible.

We use the Code First approach for our database, so migrations are a core part of how we manage our schema. Everything worked just fine when running on top of a database that already existed (which is what we were doing in the prototype), but trying to get the new library to create a database correctly from nothing (which we do all the time in our tests) was failing miserably.

We worked through this issue with the help of the vendor (whose solution was to give us two magical lines of code that referred to the deletion strategy for the database, on static classes no less), but the whole interaction had somewhat soured us on the library.

The deal breaker came when we discovered that the licensing for the library would have been a nightmare to incorporate into our build process. We’re so used to using open source tools (or even just tools that are licensed intelligently, with licence files or keys) that we didn’t even think of this at first. As we wanted to include the commercial library inside a Nuget package of our own, we would have needed to identify to the library all of the executables that would ever use it. The final nail in the coffin was that we would have had to install (install!) the library onto our build agents, which, to me, is a massively stupid move that just makes it harder to build software.

It Can’t Be That Hard

Investigating the way in which the library accomplished Full Text Search, we thought that maybe we could implement it ourselves. It didn’t look particularly difficult, just some methods that exist purely to be translated into SQL at a later date.
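The visible half of the trick really is easy; it’s just a marker method (this sketch is mine, not the library’s actual API):

    using System;

    public static class FullTextSearch
    {
        // Never meant to actually execute. The query provider is supposed
        // to spot this call in the expression tree and translate it into a
        // PostgreSQL @@ match instead.
        public static bool Matches(string text, string query)
        {
            throw new InvalidOperationException(
                "Matches is only valid inside a LINQ to Entities query.");
        }
    }

The invisible half is teaching the Entity Framework provider to actually perform that translation.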

It turns out, it is actually quite hard.

Luckily, something else came to our rescue.

Old Faithful

It turned out that the library we were originally using for EF compatibility with PostgreSQL (which, by the way, is Npgsql, an amazing open source library) had very recently received a pull request that did exactly what we wanted: it added the Full Text Search functionality into EF 6.

It turns out that Npgsql has offered the core Full Text Search functionality via code since version 3 (through the NpgsqlTsVector and NpgsqlTsQuery classes); it just wasn’t compatible with the EF/Linq way of doing things.
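With the pull request in place, usage ends up looking something like the following. I’m writing this from memory, so treat the exact function names as indicative and check the Npgsql documentation for the release you land on:

    using System.Linq;
    using Npgsql; // EntityFramework6.Npgsql provider

    // Match and rank at the database, composed with ordinary LINQ.
    var results = context.Properties
        .Where(p => NpgsqlTextFunctions.Match(
            NpgsqlTextFunctions.ToTsVector(p.Address),
            NpgsqlTextFunctions.PlainToTsQuery(term)))
        .OrderByDescending(p => NpgsqlTextFunctions.TsRank(
            NpgsqlTextFunctions.ToTsVector(p.Address),
            NpgsqlTextFunctions.PlainToTsQuery(term)))
        .ToList();

Because everything stays an IQueryable, our existing filter and sort clauses slot in around the search without any special handling.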

Unfortunately, it wasn’t all good.

The pull request had been merged, but only into the branch for the next hotfix (3.0.6), which was not yet available through the normal Nuget channels. We searched around for an unstable release (on MyGet and similar sites) and found some builds, but they were so unstable that we couldn’t get anything to work properly.

While we waited for the hotfix to be officially released, we downloaded and compiled the source ourselves. After a few hiccups with dependencies and the build process, we got everything working and manually included the Npgsql binaries in our library. Obviously this is a temporary solution while we wait for the official release, but it’s enough to get us moving forward for now.

This is one of the great things about open source: if this were a commercial library we would have been at the mercy of that particular organisation, and it would have blocked us from making any progress at all.

Conclusion

In the end we accomplished what we originally set out to achieve. We have incorporated Full Text Searching (with weights and ranking) into our current querying pipeline, allowing us to intelligently combine searching, filtering and sorting and have it all executed at the database level. There is still a significant amount of work to be done to make sure that what we’ve put together is performant once we get some real traffic on it, but I think it shows promise. I do have ideas about eventually leveraging Elasticsearch to do the search (and exposing the very familiar Lucene query syntax from the API), but that’s a much larger amount of work than just leveraging an existing piece of architecture.

This was one of those pieces of functionality where it felt like we spun our wheels for a while, struggling with technical issues. If we had compromised and put together a separate /search endpoint, we could probably have re-used our old solution (constructing the SQL ourselves using helper methods, or even using the Npgsql functions that we didn’t realise existed at the time) and delivered something more quickly.

In the end though, I think it would have been a worse solution overall, compromising on the design and general cohesiveness of the API in favour of just shipping something.

That sort of thing feels good in the short term, but just builds potential pain into a profession that is already pretty painful.