
A few weeks ago I uploaded a post describing the usage of a simple BNF grammar to describe and parse the configuration for a feature in the software my team maintains. As I mentioned then, that post didn’t cover any of the details around how the configuration worked, only how it was structured and used to construct an in-memory model that could be leveraged to push all of the necessary data to the external service.

This post will instead cover the other half of the story, the content of the configuration file beyond the structure.

Charging a Dynamo

As I said in the post I linked above, we recently integrated with an external service that allowed for the creation of various forms relevant to the Real Estate industry in Australia. Their model was to use an API to programmatically construct the form and to then use their website to confirm that the form was correct (and print it or email it or whatever other usage might be desired).

For the integration, we needed to supply the ability to pick the desired form from the available ones, and then supply all of the data that the form required, as sourced from our application.

The form selection was trivial, but filling in the necessary data was somewhat harder. Each form could be printed from various places within the application, so we had to put together a special intermediary domain model based on the current context. Once the model was constructed, we could easily extract the necessary fields/properties from it (depending on what was available) and then put them into a key/value store to be uploaded to the external API.

For a first version, the easiest way to approach the problem was to do it in C#. A domain model, a factory for creating it from different contexts and a simple extractor that built the key/value store.
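
A heavily simplified sketch of that first version might look something like this (all of the types and keys here are invented for illustration; the real model is considerably larger):

using System.Collections.Generic;

// Hypothetical types standing in for the real intermediary domain model.
public class Owner { public string Name { get; set; } }
public class Property { public string Address { get; set; } public Owner Owner { get; set; } }

public class FormModel
{
    public Owner Owner { get; set; }
    public Property Property { get; set; }
}

public static class FormModelFactory
{
    // One factory method per context the form can be printed from.
    public static FormModel FromProperty(Property property)
    {
        return new FormModel { Property = property, Owner = property.Owner };
    }
}

public static class KeyValueExtractor
{
    public static Dictionary<string, string> Extract(FormModel model)
    {
        // Every key mapping lives in code, so adding or fixing one means a new release.
        return new Dictionary<string, string>
        {
            { "Owner_Name", model.Owner.Name },
            { "Property_Address", model.Property.Address }
        };
    }
}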

The limitations of this approach are obvious, but the worst one is that it can’t be changed without releasing a new version of the software. We generally only do a few releases a year (for various reasons I don’t really want to get into), so this meant we would have a very limited ability to change in the face of the external service changing. New keys would be particularly brutal, because they would simply be lacking values until we did a new release, but fixing any errors in the released key mappings would also be difficult.

Our second version needed to be more flexible about how the key mappings were defined and updated, so we switched to a configuration model.

Dyno-mite!

Driving the key mappings from configuration added a whole bunch of complexity to the solution.

For one, we could no longer use the infinite flexibility of C# to extract what we needed from the intermediary domain model (and format it in the way we thought was best). Instead we needed to be able to provide some sort of expression evaluation that could be easily defined within text. The intent was that we would maintain the C# code that constructed the domain model based on the context of the operation, and then use a series of key mappings created from the configuration file to extract all of the necessary data to be pushed to the external API.

The second complication was that we would no longer be the only people defining key mappings (as we were when they were written in C#). An expected outcome of the improvements was that anyone with sufficient technical knowledge would be able to edit the mappings (or provide overrides to customers as part of our support process).

At first we thought that we might be able to publish a small exploration language which would allow for the definition of a simple expression to get values out of the intermediary domain model. Something similar to C# syntax (i.e. dot notation, like A.B.C). This would be relatively easy to define in text, and could be evaluated via Reflection.
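
A minimal sketch of that idea (plain dot notation only, with no indexers and no formatting) could be as simple as walking the path with Reflection:

using System;

public static class PropertyPathEvaluator
{
    // Walks a path like "A.B.C" one property at a time via Reflection.
    public static object Evaluate(object root, string path)
    {
        var current = root;
        foreach (var segment in path.Split('.'))
        {
            if (current == null)
            {
                return null;
            }

            var property = current.GetType().GetProperty(segment);
            if (property == null)
            {
                return null;
            }

            current = property.GetValue(current, null);
        }

        return current;
    }
}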

The more we looked at our original C# key mappings though, the more we realised that the simple property exploration approach would not be enough. We were selecting items from arrays, dynamically concatenating two or more properties together and formatting strings, none of which would be covered by the simple model. We were also somewhat apprehensive about using Reflection to evaluate the expressions. There were quite a few of them (a couple of hundred) and we knew from previous experience that Reflection could be slow. The last thing we wanted to do was make the feature slower.

Two of those problems could be solved by adding the ability to concatenate two or more expressions in the mapping definition (the <concatenated> term in the grammar defined in the previous post) and by offering a format string component for the expression (which just leverages String.Format).

Accessing items out of arrays/lists was something else entirely.

Dynamic Links

Rather than try to write some code to do the value extraction ourselves, we went looking for a library to do it for us.

We located two potential candidates, Flee and System.Linq.Dynamic.

Both of these libraries offered the ability to dynamically execute C# code obtained from a string, which would let us stick with C# syntax for the configuration file. Both were also relatively easy to use and integrate.

In the end, for reasons I can no longer remember, we went with System.Linq.Dynamic. I think this might have been because we were already using it for some functionality elsewhere (dynamic filtering and sorting, before we knew how to manipulate expression trees directly), so it made sense to reuse it.

A single line in the configuration file could now look like this:

Key_Name|DynamicLinq;Owner.Contacts[0].First;"{0} ";"unknown"

This line translates to “For the key named Key_Name, use the DynamicLinq engine, where the expression is to access the first name of the zeroth element of the contacts list associated with the current owner. This value should be formatted with a trailing space, and if any errors occur it should default to the string unknown”.

The beauty of this is that we don’t have to handle any of the actual expression evaluation, just the bits around the edges (formatting and error handling).

After the configuration is parsed into an Abstract Syntax Tree by Irony, that tree is then converted into a series of classes that can actually be used to obtain a key/value store from the intermediary domain model. An interface describes the commonality of the right side of the configuration line above (IEvaluationExpression) and there are implementations of this interface for each of the types of expression supported by the grammar, called DateExpression, LiteralExpression, ConcatenatedExpression and the one I actually want to talk about, DynamicLinqExpression.

This class is relatively simple. Its entire goal is to take a string that can be used to extract some value from an object and run it through the functionality supplied by System.Linq.Dynamic. It does some error handling and formatting as well, but its main purpose is to extract the value.

using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Dynamic;

public class DynamicLinqExpression : IEvaluationExpression
{
    private readonly string _expression;
    private readonly string _default;
    private readonly string _format;

    public DynamicLinqExpression(string expression, string defaultValue = null, string format = null)
    {
        _expression = expression;
        _default = defaultValue;
        _format = format;
    }

    public string Evaluate(Model model)
    {
        // Wrap the single model in a queryable so System.Linq.Dynamic can evaluate the expression.
        IEnumerable query;
        try
        {
            query = (new List<Model> { model }).AsQueryable<Model>().Select(_expression);
        }
        catch (Exception)
        {
            return _default;
        }

        // Materialise the result of the dynamic projection.
        List<dynamic> list;
        try
        {
            list = query.Cast<dynamic>().ToList();
        }
        catch (Exception)
        {
            return _default;
        }

        // Apply the format string, falling back to the default on any failure.
        try
        {
            return list.Single() == null ? _default : string.Format(_format, list.Single());
        }
        catch (Exception)
        {
            return _default;
        }
    }
}

I’ve stripped out all of the logging and some other non-interesting pieces, so you’ll have to excuse the code. In reality we have some detailed logging that occurs at the various failure levels, which is why everything is spread out the way it is.

The important piece is the part where the incoming Model is converted into a Queryable, and System.Linq.Dynamic is used to query that using the supplied expression. A value is then extracted from the resulting queried enumerable, which is then formatted and returned as necessary.
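
As a rough usage example, evaluating the configuration line from earlier against an instance of the intermediary domain model looks something like this:

// model is an instance of the intermediary domain model (Model) built for the current context.
var expression = new DynamicLinqExpression(
    expression: "Owner.Contacts[0].First",
    defaultValue: "unknown",
    format: "{0} ");

var value = expression.Evaluate(model); // e.g. "John ", or "unknown" if anything goes wrong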

Conclusion

Pushing the majority of the value extraction off to a library let us focus on a nice structure to support all of the things that the configuration file needed to do that were outside of the “just extract a value from this object” scope. It also let us put more effort into the other parts of managing a configuration based approach to a complex problem, like the definition and usage of a grammar (another case of code I would rather not own if I didn’t have to).

The only weird thing left over is the fact that the DynamicLinqExpression has to do a bunch of collection transformations in order to be run on a single object. This leaves a somewhat sour taste in my mouth, but performance testing showed that it was well within the bounds that we needed to accomplish for this particular feature.

In the end, I was mostly just happy that we didn’t have to maintain some convoluted (and likely slow) Reflection based code that extracted fields and properties from an object model in some sort of vastly reduced mockery of C#.


As I’ve mentioned a bunch of times, I tutor an Agile Project Management course at the Queensland University of Technology. It’s been useful to me on a number of fronts, from making me think more about what the concept of being agile actually means to me, to simply giving me more experience speaking in front of large groups of people. On the other side of the equation, I hope it’s been equally useful to the students.

A secondary benefit of tutoring is that it exposes me to new concepts and activities that I’ve never really seen before. I’d never heard of Scrum City until we did it with the students back in the first semester, and the topic of today’s blog post is a similar sort of thing.

Lean Coffee.

An Unpleasant Drink

Fortunately, Lean Coffee has absolutely nothing to do with coffee, well not anymore anyway.

Apparently the term was originally coined as a result of a desire to not have to organise speakers or deal with the logistics of organising a venue for a regular meeting. The participants simply met at a particular coffee shop and started engaging in the process which would eventually become known as Lean Coffee (one because it’s lightweight and two because it was at a coffee shop).

At a high level, I would describe Lean Coffee as a democratically driven discussion, used to facilitate conversation around a theme while maintaining interest and engagement.

It’s the sort of idea that aims to deal with the problem of mind-numbing meetings that attempt to deal with important subjects, but fail miserably because of the way they are run.

Who Needs a Boost Anyway?

It all starts with the selection of an overarching theme. This can be anything at all, but obviously it should be something that actually needs discussing and that the group engaging in the discussion has some stake in.

In the case of the tutorial, the theme was an upcoming piece of assessment (an Agile Inception Deck for a project of their choosing).

Each individual is then responsible for coming up with any number of topics or questions that fit into the theme. Each topic should be clear enough to be understood easily, should have some merit when applied to the greater group and should be noted down clearly on a post-it or equivalent.

This will take about 10 minutes, and ideally should happen in relative silence (as the topics are developed by the individual, and do not need additional discussion, at least not yet).

At the end, all topics should be affixed to the first column in a basic 3 column workflow board (To Discuss, Discussing and Discussed).

Don’t worry too much about the relevance of each topic, as the next stage will sort that out. Remember, you are just the facilitator, the actual discussion is owned by the group of people who are doing the discussing.

I’m High on Life

Spend a few minutes going through the topics, reading them out and getting clarifications as necessary.

Now, get the group to vote on the topics. Each person gets 3 votes, and they can apply them in any way they see fit (multiple votes to one topic, spread them out, only use some of their votes, it doesn’t matter). If you have a large number of people, a simple line is good enough to avoid the crush during voting, but it will take some time to get through everyone. Depending on how big your wall of topics is, it’s best to get more than 1 person voting at a time, and limit the amount of time each person has to vote to less than 30 seconds.

Conceptually, voting allows the topics that concern the greatest number of people to rise to the top, allowing you to prioritize them ahead of the topics that concern fewer people. This is the democratic part of the process and allows for some real engagement in the discussion by the participants, because they’ve had some input into what they think are the most important things to talk about.

That last point was particularly relevant for my tutorial. For some reason, when given the opportunity, the students were reluctant to ask questions about the assessment. I did have a few, but not nearly as many as I expected. This activity generated something like 20 topics though, of which around 15 were useful to the group as a whole, and really helped them to get a better handle on how to do well.

That’s What They Call Cocaine Now

After the voting is finished, rearrange the board to be organised by the number of votes (i.e. priority) and then it’s time to start the actual discussions.

Pick off the top topic, read it out and make sure everyone has a common understanding of what needs to be discussed. If a topic is not clear by this point (and it should be, because in order to vote you need the topic to be understandable) you may have to get the creator of the topic to speak up. Once everything is ready, start a timer for 5 minutes and then let the discussion begin. After the time runs out, try to summarise the discussion (and note down actions or other results as necessary). If there is more discussion to be had, start another timer for 2 minutes, and then let that play out.

Once the second timer runs out, regardless of whether everything is perfectly sorted out, move on to the next topic. Rinse and repeat until you run out of time (or topics obviously).

In my case, the topics being discussed were mostly one sided (i.e. me answering questions and offering clarifications about the piece of assessment), but running this activity in a normal business situation where no-one has all the answers should allow everyone to take part equally.

Conclusion

I found the concept of Lean Coffee to be extremely effective in facilitating a discussion while also maintaining a high level of engagement. It has been a long time since I’ve really felt like a group of people were interested in discussing a topic like they were when this process was used to facilitate the conversation.

This interests me at a fundamental level, because I’d actually tried to engage the students on the theme on an earlier occasion, thinking they would have a lot of questions about the assessment item. At that time I used the simplest approach, which was to canvass the group for questions and topics one at a time. I did have a few bites, but nowhere near the level of participation that I did with Lean Coffee.

The name is still stupid though.


Last year we built a feature that integrated with an external forms provider, extracting data from an internal data model and pushing it out to the supplied API, for use in a variety of ways (it’s the real estate industry, so there’s a lot of red tape and forms).

It was a fairly simple feature, because the requirements for the API were pretty simple. Select a form from a list, supply a key/value store representing the available data and then display the form and confirm it’s correct. In fact, the hardest part of the whole thing was finding a decent embedded browser to use to display the portal supplied alongside the API (which, annoyingly, is the only place where some of the functionality of the service is available).

At the time, we weren’t sure what the uptake would be for the feature (because the external service required its own subscription), so we built only exactly what was necessary, released it into the wild and waited to see what happened.

I wouldn’t say it was an amazing ground-breaking success, but reception was solid enough that the business decided to return to the feature and solidify it into something better.

Do You Speak the Language?

One of the issues with the external service is that it’s heavily dependent on the key/value store supplied to it for filling the forms. Unfortunately, it does not programmatically expose a list of the keys required for a form. This information is instead supplied out of band, mostly via emailing spreadsheets around.

Even if the service did expose a list of keys, it would really only be an optimisation (so we know which keys we needed to fill, rather than just all of them). We would still need to know what each key means to us, and where the value can be obtained from.

Our first version was hardcoded. We had a few classes that knew how to create an intermediate domain model specific for this feature based on where you were using it from, and they could then be easily turned into a key/value store. At the time, the limitations of this approach were understood (hard to change, must ship new code in order to deal with new fields), but it was good enough to get us started.

Now that we had clearance to return to the feature and improve it, one of the first things we needed to do was change the way that we obtain the key values. In order to support changing the way the keys were filled without having to do a new deployment we needed to move to a configuration based approach.

The best way to do that? Create a simple language to describe how the configuration should be put together and then use a parser for that language to turn the text into a data structure.

Grammar Nazi

Most of the language is pretty simple, so I’ll just write it down in EBNF notation here:

<config> ::= <line> | <line> <config>
<line> ::= <mapping> | <comment> | <blank>
<blank> ::= <EOL>
<comment> ::= "#" <text> <EOL>
<mapping> ::= <key> "|" <expression> <EOL>
<expression> ::= <dynamic> | <literal> | <date> | <concatenated>
<literal> ::= "Literal;" <text>
<date> ::= "Date;" <format_string>
<concatenated> ::= <expression> "+" <expression>
<dynamic> ::= "DynamicLinq;" <code> [ ";" <format_string> ] [";" <default> ]

Essentially, a config consists of multiple lines, where each line might be a mapping, a blank line or a comment. Each mapping must contain a key, followed by some sort of expression to get the value of the key at runtime. There are a number of different expressions available. In the above specification, any term with no expansion is free text.
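
To make that concrete, a small configuration file might look something like this (the keys and expressions are invented for illustration, not taken from the real file):

# Keys for a hypothetical sales agreement form
Agreement_Date|Date;"dd/MM/yyyy"
Office_State|Literal;"QLD"
Owner_Name|DynamicLinq;Owner.Name;"{0}";"unknown"
Owner_Address|DynamicLinq;Owner.Address.Street;"{0}, " + DynamicLinq;Owner.Address.Suburb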

With a defined grammar, all we needed was a library to parse it for us. While we could have parsed it ourselves manually, I’d much rather lean on a library that deals specifically with languages to do the work. I don’t want to own custom parsing code.

We chose Irony, which is a nice, neat little language parsing engine available for .NET.

When using Irony, you create a grammar (deriving from the Grammar class) and then fill it out with the appropriate rules. It ends up looking like this:

using System;
using Irony.Parsing;

public class ConfigurationGrammar : Grammar
{
    public ConfigurationGrammar() 
        : base(false)
    {
        var data = new NonTerminal("data");
        var line = new NonTerminal("Line");
        var key = new FreeTextLiteral("key", FreeTextOptions.ConsumeTerminator, "|");
        var concatExpression = new NonTerminal("concatExpression");
        var singleExpression = new NonTerminal("singleExpression");
        var dateExpression = new NonTerminal("dateExpression");
        var literalExpression = new NonTerminal("literalExpression");
        var linqExpression = new NonTerminal("linqExpression");
        var sourceField = new IdentifierTerminal("sourceField", "[].");
        var formatString = new NonTerminal("formatString");
        var defaultValue = new QuotedValueLiteral("defaultValue", "\"", TypeCode.String);
        var unquotedFormatString = new QuotedValueLiteral("unquotedFormatString", "{", "}", TypeCode.String);
        var quotedFormatString = new StringLiteral("quotedFormatString", "\"");


        formatString.Rule = unquotedFormatString | quotedFormatString;
        linqExpression.Rule = ToTerm("DynamicLinq") + ";" + sourceField + ";" + formatString + ";" + defaultValue |
                              ToTerm("DynamicLinq") + ";" + sourceField + ";" + formatString |
                              ToTerm("DynamicLinq") + ";" + sourceField;
        literalExpression.Rule = ToTerm("Literal") + ";" + defaultValue;
        dateExpression.Rule = ToTerm("Date") + ";" + formatString;
        singleExpression.Rule = dateExpression | literalExpression | linqExpression;
        concatExpression.Rule = MakePlusRule(concatExpression, ToTerm("+"), singleExpression);
        line.Rule = key + concatExpression;
        data.Rule = line + Eof;

        this.Root = data;
        this.LanguageFlags |= LanguageFlags.NewLineBeforeEOF;
        this.MarkPunctuation("|", ";");
    }
}

The output from using this grammar is an Abstract Syntax Tree, which can then be converted into an appropriate data structure that does the actual work. I’ll make another post about the details of how that data structure extracts data from our domain model in the future (because this one is already long enough just considering the stuff about the language/grammar). It’s relatively interesting though, because we had to move from the infinite flexibility of C# code to a more constrained set of functionality that can be represented with text (and would be usable by non-developers).
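
As a rough illustration of the parsing side (not the production code), feeding a single line through the grammar and inspecting the resulting tree looks something like this:

using System;
using Irony.Parsing;

public static class ConfigurationLineParser
{
    public static void Describe(string line)
    {
        var parser = new Parser(new ConfigurationGrammar());
        var tree = parser.Parse(line);

        if (tree.HasErrors())
        {
            // Each message includes the location of the failure within the line.
            foreach (var message in tree.ParserMessages)
            {
                Console.WriteLine("{0} at {1}", message.Message, message.Location);
            }
            return;
        }

        // The root is the "data" node; its first child is the "Line" non-terminal,
        // whose children correspond to the key and the expression(s).
        var lineNode = tree.Root.ChildNodes[0];
        foreach (var child in lineNode.ChildNodes)
        {
            Console.WriteLine("{0}: {1}", child.Term.Name, child.FindTokenAndGetText());
        }
    }
}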

Halt! Wir Müssen Reden

Observant readers might notice that the grammar above does not line up exactly with the EBNF specification.

Originally we wrote the grammar class to line up exactly with the specification. Unfortunately, when we started actually writing the configuration file and testing various failure conditions, we discovered that if anything in the file failed, the entire parse would fail. Sure it would tell you where it failed, but we needed a little bit more robustness than that (it’s not as important if one line is bad, but it is important that the rest continue to work as expected).

Our first attempt simply removed the problem line on failure and then parsed again, in a loop, until the configuration parsed correctly. This was both inefficient, and caused the line numbers reported in the error messages to be incorrect.

Finally, we decided to handle each line independently, so amended the grammar to only know about a valid line, and then ignored blank and comment lines with our own code.
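
A minimal sketch of that compromise: split the file ourselves, skip blanks and comments, and run every remaining line through the grammar independently so that the reported line numbers stay accurate:

using System;
using System.Collections.Generic;
using Irony.Parsing;

public static class ConfigurationParser
{
    public static IEnumerable<string> ParseErrors(string configuration)
    {
        var parser = new Parser(new ConfigurationGrammar());
        var lines = configuration.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);

        for (var i = 0; i < lines.Length; i++)
        {
            var line = lines[i].Trim();

            // Blank lines and comments are handled by our own code, not the grammar.
            if (line.Length == 0 || line.StartsWith("#"))
            {
                continue;
            }

            var tree = parser.Parse(line);
            if (tree.HasErrors())
            {
                // One bad line produces an error but does not stop the rest from being parsed.
                yield return string.Format("Line {0}: {1}", i + 1, tree.ParserMessages[0].Message);
            }
        }
    }
}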

I think it was a solid compromise. We’re still leveraging the grammar engine to do the heavy lifting, we just make sure it has good input to work with.

Summary

In the interests of full disclosure, I did not actually perform all of the work above. I guided the developer involved towards a language based solution, the selection of a library to do it for us and the review of the code, but that’s about it.

I think that in the end it made for a better, more maintainable solution, with the most important thing being that we don’t own the parsing code. All we own is the code that transforms the resulting syntax tree into appropriate objects. This is especially valuable when it comes to the handling of parsing errors, because parsing a perfect file is easy.

Detecting all the places where that file might be wrong, that’s the hard part, and is generally where home-baked parsing falls apart.

Like the heading puns in this post.


Search is one of those features that most people probably don’t think about all that much. It’s ubiquitous across every facet of the internet, and is a core part of what we do every day. It’s kind of just…there, and I personally can’t imagine using software without it. Well, I can imagine it, and it doesn’t look good in my head.

Our latest development efforts have been focused around putting together a services platform that we can extend moving forward, providing a semblance of cloud connected functionality to an extremely valuable data set that is currently locked in a series of on-premises databases. The initial construction of this platform is being driven by a relatively simple website for showing a subset of the entities in the system. The intent is that this will be the first step in allowing that data to be accessed outside its current prison, letting us measure interest and use those findings to drive the direction of future development.

To tie everything back in with my first paragraph, we’ve hit a point where we need to provide the ability to search.

Where Are You?

Specifically, we need to provide a nice intuitive search that doesn’t require people to have a fundamental understanding of the underlying data structures. The intent is that it will be used within a webpage initially, to help people narrow down the list of things that they are looking at. Type a few letters/partial words and have the list be automatically reduced to only those things that are relevant, ordered by how relevant they are (i.e. type in green and 28 Green St, Indooroopilly should come up first, with 39 Smith St, Greenslopes after it, and so on).

From a webpage point of view, search looks like a very small piece, at least as far as the total percentage of the presentation it occupies. It’s just a small box that you type things into, how hard could it be?

From an API point of view, search can be a project unto itself, especially when you consider weighting, ranking, what fields are searchable, and so on. That’s not even taking into account cross entity searching.

At this point our API already has partial filtering and sorting built into it, using fairly standard query string parameters (filter={key}::{value}[|{key}::{value}]+ and sort=[{direction}]{key}[,[{direction}]{key}]+). This allowed us to support complex interactions with lists using GET requests (which are easier to cache due to HTTP semantics), without having to resort to complex POST bodies. It’s also much easier to query from the command line, which is nice and is very descriptive from a logging point of view when doing analysis on pure IIS logs.

You may be wondering what the difference is between searching and filtering. To me, it’s a subtle difference. Both are used to winnow down a full data set to the bits that you are interested in. Filtering is all about directly using field names and doing comparisons like that (you know you have an Address.Suburb field, so you want to filter to only things in Forest Lake). Searching is more free form, and allows you to enter just about anything and have the service make a decision about what might be relevant. They don’t necessarily need to be separate, but in this case I think the separation of concerns has value.

To keep to our pattern, we want to add a new query string parameter called search. For our purposes, it should be fairly simple (some text, no real language specification) and should be able to be combined with our existing sorting and filtering functionality.
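
So a request might end up looking something like this (the field names are invented):

GET /properties?search=green&filter=Status::Active&sort=Address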

Simple enough conceptually.

Where In God’s Name Are You!

Inside our API we leverage Entity Framework and PostgreSQL for querying. This has worked pretty well so far, as it was simple enough to use DynamicLinq to support filtering and sorting (based on keywords we control, not on fields in the data model being returned) and have everything execute at the database for maximum efficiency.
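
As a simplified sketch, applying those parameters with DynamicLinq looks something like the following (the keyword-to-field mapping and the direction syntax are glossed over, and the entity shape is invented); the resulting expression is still translated and executed at the database by Entity Framework:

using System;
using System.Linq;
using System.Linq.Dynamic;

public class PropertySummary
{
    public string Suburb { get; set; }
    public DateTime CreatedOn { get; set; }
}

public static class ListParameters
{
    public static IQueryable<PropertySummary> Apply(IQueryable<PropertySummary> query)
    {
        // filter=Suburb::Forest Lake
        // sort={direction}CreatedOn (descending assumed here)
        return query
            .Where("Suburb == @0", "Forest Lake")
            .OrderBy("CreatedOn descending");
    }
}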

When it comes to search, PostgreSQL exposes a series of features that allow you to do Full Text Searching, which is pretty much exactly what we want. This deals with things like partial matching, case insensitivity, weighting and ranking, which all combine to make for a nice searching experience for the user.

Combining the Full Text Search functionality with the whole IQueryable/Entity Framework insanity though, that’s where things started to get complicated.

We have used PostgreSQL’s Full Text Search functionality in the past, in a different API. At the time, we were less confident in our ability to create a nice descriptive API following HTTP semantics, so we simply did a /search endpoint that accepted POST requests with a very custom body defining the search to perform.

Under the hood, because we didn’t have any other sorting or filtering, we just constructed the SQL required to do the Full Text Search and then executed it through Entity Framework. It wasn’t the best solution, but it met our immediate needs, at least for that project.
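
A rough sketch of what that hand-built SQL approach might have looked like, using PostgreSQL’s full text search functions and EF’s raw SQL support (the table, columns and result type are invented):

using System.Collections.Generic;
using System.Data.Entity;
using System.Linq;
using Npgsql;

public class PropertySearchResult
{
    public int Id { get; set; }
    public string Address { get; set; }
    public float Rank { get; set; }
}

public class SearchProvider
{
    private readonly DbContext _context;

    public SearchProvider(DbContext context)
    {
        _context = context;
    }

    public List<PropertySearchResult> Search(string term)
    {
        // to_tsvector/plainto_tsquery handle the partial matching and case insensitivity,
        // ts_rank provides the relevance ordering.
        const string sql =
            @"SELECT id AS ""Id"", address AS ""Address"",
                     ts_rank(to_tsvector('english', address), plainto_tsquery('english', @term)) AS ""Rank""
              FROM properties
              WHERE to_tsvector('english', address) @@ plainto_tsquery('english', @term)
              ORDER BY ""Rank"" DESC";

        return _context.Database
            .SqlQuery<PropertySearchResult>(sql, new NpgsqlParameter("term", term))
            .ToList();
    }
}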

Unfortunately, this made testing search on an In Memory Database impossible, which was annoying, but we did manage to isolate the execution of the search into a series of Search Provider classes that allowed us to abstract out this dependency and test it independently.

When it came time to incorporate search into our latest API, we looked for a better way to do it. A way that didn’t involve constructing SQL ourselves.

A Wild Commercial Library Appears

After a small amount of research, one of my colleagues found a commercial library that appeared to offer the ability to construct Full Text Search queries within Linq statements (and have them be automatically turned into SQL, as you would expect). It was a glorious day, and early experiments seemed to show that it worked just as we expected. We could include normal Where and OrderBy statements along with the Full Text Search match statements, and everything would execute at the database level. Nice and efficient.

However, when it was time to move from prototype to actual implementation, it all fell apart. Replacing our existing PostgreSQL provider was fairly painless (they provided very similar functionality), but we had problems with our database migrations, and the documentation was terrible.

We use the Code First approach for our database, so migrations are a core part of how we manage our schema. Everything worked just fine when running on top of a database that already existed (which is what we were doing in the prototype), but trying to get the new library to create a database correctly from nothing (which we do all the time in our tests) was failing miserably.

We worked through this issue with the help of the vendor (whose solution was to give us two magical lines of code that referred to the deletion strategy for the database, on static classes no less), but the whole interaction had somewhat soured us on the library.

The deal breaker came when we discovered that the licensing for the library would have been a nightmare to include into our build process. We’re so used to using open source tools (or even just tools that are licensed intelligently, with licence files or keys) that we didn’t even think of this at first. As we wanted to include the commercial library inside a Nuget package of our own, we would have needed to identify within the library all of the executables that would have ever used it. The final nail in the coffin was that we would have had to install (install!) the library onto our build agents, which, to me, is a massively stupid move that just makes it harder to build software.

It Can’t Be That Hard

Investigating the way in which the library accomplished Full Text Search, we thought that maybe we could implement it ourselves. It didn’t look particularly difficult, just some methods that exist purely to be translated into SQL at a later date.

It turns out, it is actually quite hard.

Luckily, something else came to our rescue.

Old Faithful

It turned out that the library we were originally using for EF compatibility with PostgreSQL (which by the way is Npgsql, an amazing open source library) had very recently received a pull request that did exactly what we wanted: it added the Full Text Search functionality into EF 6.

It turns out that Npgsql has offered the core Full Text Search functionality via code since version 3 (through the NpgsqlTsVector and NpgsqlTsQuery classes); it just wasn’t compatible with the EF/Linq way of doing things.

Unfortunately, it wasn’t all good.

The pull request had been merged, but only into the branch for the next hotfix (3.0.6), which was not available through the normal Nuget channels yet. We searched around for an unstable release (on MyGet and similar sites), and found some things, but they were really unstable, so much so that we couldn’t get anything to work properly.

While we waited for the hotfix to be officially released, we downloaded and compiled the source ourselves. After a few hiccups with dependencies and the build process, we got everything working and manually included the Npgsql binaries into our library. Obviously this is a temporary solution, while we wait for the official release, but it’s enough to get us moving forward for now.

This is one of the great things about open source, if this were a commercial library we would have been at the mercy of that particular organisation, and it would have blocked us from making any progress at all.

Conclusion

In the end we accomplished what we originally set as our ideal. We have incorporated Full Text Searching (with weights and ranking) into our current querying pipeline, allowing us to intelligently combine searching, filtering and sorting together and have it all executed at the database level. There is still a significant amount of work to be done to make sure that what we’ve put together is performant once we get some real traffic on it, but I think it shows promise. I do have ideas about eventually leveraging Elasticsearch to do the search (and exposing the very familiar Lucene query syntax from the API), but that’s a much larger amount of work than just leveraging an existing piece of architecture.

This was one of those pieces of functionality where it felt like we spun our wheels for a while, struggling with technical issues. If we had compromised and put together a separate /search endpoint we could have probably re-used our old solution (constructing the SQL ourselves using helper methods, or even using the Npgsql functions that we didn’t realise existed at the time) and delivered something more quickly.

In the end though, I think it would have been a worse solution overall, compromising on the design and general cohesiveness of the API in favour of just shipping something.

That sort of thing feels good in the short term, but just builds potential pain into a profession that is already pretty painful.


A little over 4 months ago, I wrote a post about trying to improve the speed of cloning a large S3 bucket. At the time, I tried to simply parallelise the execution of the AWS CLI sync command, which actually proved to be much slower than simply leaving the CLI alone to do its job. It was an unsurprising result in retrospect, but you never know unless you try.

Unwilling to let the idea die, I decided to make it my focus during our recent hack days.

If you are unfamiliar with the concept of a hack day (or Hackathon as they are sometimes known), have a look at this Wikipedia article. At my current company, we’re only just starting to include hack days on a regular basis, but it’s a good sign of a healthy development environment.

Continuing on with the original train of thought (parallelise via prefixes), I needed to find a way to farm out the work to something (whether it was a pool of our own workers or some other mechanism). In the end, I chose to use AWS Lambda.

Enter Node.js on Lambda.

At A High Level

AWS Lambda is a relatively new offering, allowing you to configure some code to automatically execute following a trigger from one of a number of different events, including an SNS Topic Notification, changes to an S3 bucket or an HTTP call. You can use Python, Java or Javascript (through Node.js) as code natively, but you can technically use anything you can compile into a Linux compatible executable and make accessible to the function via S3 or something similar.

Since Javascript seems to be everywhere now (even though it’s hard to call it a real language), it was a solid choice. No point being afraid of new things.

Realistically, I should have been at least a little afraid of new things.

Conceptually the idea can be explained as a simple divide and conquer strategy, managed by files in an S3 bucket (because S3 was the triggering mechanism I was most familiar with).

If something wants to trigger a clone, it writes a file into a known S3 bucket detailing the desired operation (source, destination, some sort of id) with a key of {id}-{source}-{destination}/clone-request.

In response, the Lambda function will trigger, segment the work and write a file for each segment with a key of {id}-{source}-{destination}/{prefix}-segment-request. When it has finished breaking down the work, it will write another file with the key {id}-{source}-{destination}/clone-response, containing a manifest of the breakdown, indicating that it is done with the division of work.

As each segment file is being written, another Lambda function will be triggered, doing the actual copy work and finally writing a file with the key {id}-{source}-{destination}/{prefix}-segment-response to indicate that its done.

File Formats Are Interesting

Each clone-request file looks like this:

{
    id: {id},
    source: {
        name: {source-bucket-name}
    },
    destination: {
        name: {destination-bucket-name}
    }
}

It’s a relatively simple file that would be easy to extend as necessary (for example, if you needed to specify the region, credentials to access the bucket, etc).

The clone-response file (the manifest), looks like this:

{
    id: {id},
    source: {
        name: {source-bucket-name}
    },
    destination: {
        name: {destination-bucket-name}
    },
    segments: {
        count: {number-of-segments},
        values: [
            {segment-key},
            {segment-key}
            ...
        ]
    }
}

Again, another relatively simple file. The only additional information is the segments that the task was broken into. These segments are used for tracking purposes, as the code that requests a clone needs some way to know when the clone is done.

Each segment-request file looks like this:

{
    id: {id},
    source: {
        name: {source-bucket-name},
        prefix: {prefix}
    },
    destination: {
        name: {destination-bucket-name}
    }
}

And finally, each segment-response file looks like this:

{
    id: {id},
    source: {
        name: {source-bucket-name},
        prefix: {prefix}
    },
    destination: {
        name: {destination-bucket-name}
    },    
    files: [        
        {key},
        {key},
        ...
    ]
}

Nothing fancy or special, just straight JSON files with all the information needed.

Breaking It All Down

First up, the segmentation function.

Each Javascript Lambda function already comes with access to the aws-sdk, which is super useful, because honestly if you’re using Lambda, you’re probably doing it because you need to talk to other AWS offerings.

The segmentation function has to read in the triggering file from S3, parse it (it’s Javascript and JSON so that’s trivial at least), iterate through the available prefixes (using a delimiter, and sticking with the default “/”), write out a file for each unique prefix and finally write out a file containing the manifest.

As I very quickly learned, using Node.js to accomplish the apparently simple task outlined above was anything but simple, thanks to its fundamentally asynchronous nature, and the fact that async calls don’t seem to return a traceable component (unlike in C#, where if you were using async tasks you would get a task object that could be used to track whether or not the task succeeded/failed).

To complicate this even further, the aws-sdk will only return a maximum of 1000 results when listing the prefixes in a bucket (or doing anything with a bucket really), which means you have to loop using the callbacks. This makes accumulating some sort of result set annoyingly difficult, especially if you want to know when you are done.

Anyway, the segmentation function is as follows:

console.log('Loading function');

var aws = require('aws-sdk');
var s3 = new aws.S3({ apiVersion: '2006-03-01' });

function putCallback(err, data)
{
    if (err)
    {
        console.log('Failed to Upload Clone Segment ', err);
    }
}

function generateCloneSegments(s3Source, command, commandBucket, marker, context, segments)
{
    var params = { Bucket: command.source.name, Marker: marker, Delimiter: '/' };
    console.log("Listing Prefixes: ", JSON.stringify(params));
    s3Source.listObjects(params, function(err, data) {
        if (err)
        {
            context.fail(err);
        }
        else
        {
            for (var i = 0; i < data.CommonPrefixes.length; i++)
            {
                var item = data.CommonPrefixes[i];
                var segmentRequest = {
                    id: command.id,
                    source : {
                        name: command.source.name,
                        prefix: item.Prefix
                    },
                    destination : {
                        name: command.destination.name
                    }
                };
                
                var segmentKey = command.id + '/' + item.Prefix.replace('/', '') + '-segment-request';
                segments.push(segmentKey);
                console.log("Uploading: ", segmentKey);
                var segmentUploadParams = { Bucket: commandBucket, Key: segmentKey, Body: JSON.stringify(segmentRequest), ContentType: 'application/json'};
                s3.putObject(segmentUploadParams, putCallback);
            }
            
            if(data.IsTruncated)
            {
                generateCloneSegments(s3Source, command, commandBucket, data.NextMarker, context, segments);
            }
            else
            {
                // Write a clone-response file to the commandBucket, stating the segments generated
                console.log('Total Segments: ', segments.length);
                
                var cloneResponse = {
                    segments: {
                        count: segments.length,
                        values: segments
                    }
                };
                
                var responseKey = command.id + '/' + 'clone-response';
                var cloneResponseUploadParams = { Bucket: commandBucket, Key: responseKey, Body: JSON.stringify(cloneResponse), ContentType: 'application/json'};
                
                console.log("Uploading: ", responseKey);
                s3.putObject(cloneResponseUploadParams, putCallback);
            }
        }
    });
}

exports.handler = function(event, context) {
    //console.log('Received event:', JSON.stringify(event, null, 2));
    
    var commandBucket = event.Records[0].s3.bucket.name;
    var key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, ' '));
    var params = {
        Bucket: commandBucket,
        Key: key
    };
    
    s3.getObject(params, function(err, data) 
    {
        if (err) 
        {
            context.fail(err);
        }
        else 
        {
            var command = JSON.parse(data.Body);
            var s3Source = new aws.S3({ apiVersion: '2006-03-01', region: 'ap-southeast-2' });
            
            var segments = [];
            generateCloneSegments(s3Source, command, commandBucket, '', context, segments);
        }
    });
};

I’m sure some improvements could be made to the Javascript (I’d love to find a way to automate tests on it), but it’s not bad for being written directly into the AWS console.

Hi Ho, Hi Ho, Its Off To Work We Go

The actual cloning function is remarkably similar to the segmenting function.

It still has to loop through items in the bucket, except it limits itself to items that match a certain prefix. It still has to do something for each item (execute a copy and add the key to its own result set) and it still has to write a file right at the end when everything is done.

console.log('Loading function');

var aws = require('aws-sdk');
var commandS3 = new aws.S3({ apiVersion: '2006-03-01' });

function copyCallback(err, data)
{
    if (err)
    {
        console.log('Failed to Copy ', err);
    }
}

function copyFiles(s3, command, commandBucket, marker, context, files)
{
    var params = { Bucket: command.source.name, Marker: marker, Prefix: command.source.prefix };
    s3.listObjects(params, function(err, data) {
        if (err)
        {
            context.fail(err);
        }
        else
        {
            for (var i = 0; i < data.Contents.length; i++)
            {
                var key = data.Contents[i].Key;
                files.push(key);
                console.log("Copying [", key, "] from [", command.source.name, "] to [", command.destination.name, "]");
                
                var copyParams = {
                    Bucket: command.destination.name,
                    CopySource: command.source.name + '/' + key,
                    Key: key
                };
                s3.copyObject(copyParams, copyCallback);
            }
            
            if(data.IsTruncated)
            {
                // NextMarker is only returned when a Delimiter is specified, so fall back to the last key processed.
                var nextMarker = data.NextMarker || data.Contents[data.Contents.length - 1].Key;
                copyFiles(s3, command, commandBucket, nextMarker, context, files);
            }
            else
            {
                // Write a segment-response file
                console.log('Total Files: ', files.length);
                
                var segmentResponse = {
                    id: command.id,
                    source: command.source,
                    destination : {
                        name: command.destination.name,
                        files: {
                            count: files.length,
                            files: files
                        }
                    }
                };
                
                var responseKey = command.id + '/' + command.source.prefix.replace('/', '') + '-segment-response';
                var segmentResponseUploadParams = { Bucket: commandBucket, Key: responseKey, Body: JSON.stringify(segmentResponse), ContentType: 'application/json'};
                
                console.log("Uploading: ", responseKey);
                commandS3.putObject(segmentResponseUploadParams, function(err, data) { });
            }
        }
    });
}

exports.handler = function(event, context) {
    //console.log('Received event:', JSON.stringify(event, null, 2));
    
    var commandBucket = event.Records[0].s3.bucket.name;
    var key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, ' '));
    var params = {
        Bucket: commandBucket,
        Key: key
    };
    
    commandS3.getObject(params, function(err, data) 
    {
        if (err) 
        {
            context.fail(err);
        }
        else 
        {
            var command = JSON.parse(data.Body);
            var s3 = new aws.S3({ apiVersion: '2006-03-01', region: 'ap-southeast-2' });
            
            var files = [];
            copyFiles(s3, command, commandBucket, '', context, files);
        }
    });
};

Tricksy Trickses

You may notice that there is no mention of credentials in the code above. That’s because the Lambda functions run under a role with a policy that gives them the ability to list, read and put into any bucket in our account. Roles are handy for accomplishing things in AWS, avoiding the need to supply credentials. When a role is applied to the resource and no credentials are supplied, the aws-sdk will automatically generate a short term token using the role, reducing the likelihood of leaked credentials.

As I mentioned above, the asynchronous nature of Node.js made everything a little bit more difficult than expected. It was hard to determine when anything was done (somewhat important for writing manifest files). Annoyingly enough, it was even hard to determine when the function itself was finished. I kept running into issues where the function execution had finished, and it looked like it had done all of the work I expected it to do, but AWS Lambda was reporting that it did not complete successfully.

In the initial version of Node.js I was using (v0.10.42), the AWS supplied context object had a number of methods on it to indicate completion (whether success or failure). If I called the Succeed method after I setup my callbacks, the function would terminate without doing anything, because it didn’t automatically wait for the callbacks to complete. If I didn’t call it, the function would be marked as “did not complete successfully”. Extremely annoying.

As is often the case with AWS though, on literally the second hack day, AWS released support for Node.js v4.3, which automatically waits for all pending callbacks to complete before completing the function, completely changing the interaction model for the better. I did upgrade to the latest version during the second hack day (after I had accepted that my function was going to error out in the control panel but actually do all the work it needed to), but it wasn’t until later that I realised that the upgrade had fixed my problem.

The last tripwire I ran into was related to AWS Lambda not being available in all regions yet. Specifically, it’s not in ap-southeast-2 (Sydney), which is where all of our infrastructure lives. S3 is weird in relation to regions, as buckets are globally unique and accessible, but they do actually have a home region. What does this have to do with Lambda? Well, the S3 bucket triggers I used as the impetus for the function execution only work if the S3 bucket is in the same region as the Lambda function (so us-west-1), even though once you get inside the Lambda function you can read/write to any bucket you like. Weird.

Conclusion

I’ve omitted the Powershell code responsible for executing the clone for brevity. It writes the request to the bucket, reads the response and then polls waiting for all of the segments to be completed, so it’s not particularly interesting, although the polling for segment completion was my first successful application of the Invoke-Parallel function from Script Center.

Profiling the AWS Lambda approach versus the original AWS CLI sync command approach over a test bucket (7500 objects, 195 distinct prefixes, 8000 MB of data) showed a decent improvement in performance. The sync approach took 142 seconds and the Lambda approach took 55 seconds, a little over a third of the time, which was good to see considering the last time I tried to parallelise the clone it actually decreased the performance. I think with some tweaking the Lambda approach could be improved further, with tighter polling tolerances and an increased number of parallel Lambda executions allowed.

Unfortunately, I have not had the chance to execute the AWS Lambda implementation on the huge bucket that is the entire reason it exists, but I suspect that it won’t work.

Lambda allows at maximum 5 minutes of execution time per function, and I suspect that the initial segmentation for a big enough bucket will probably take longer than that. It might be possible to chain lambda functions together (i.e. trigger one from the next one, perhaps per 1000 results returned from S3), but I’m not entirely sure how to do that yet (maybe using SNS notifications instead of S3?). Additionally, with a big enough bucket, the manifest file itself (detailing the segments) might become unwieldy. I think the problem bucket has something like 200K unique prefixes, so the size of the manifest file can add up quickly.

Regardless, the whole experience was definitely useful from a technical growth point of view. It’s always a good idea to remove yourself from your comfort zone and try some new things, and AWS Lambda + Node.js are definitely well outside my comfort zone.

A whole different continent in fact.