
Last year we built a feature that integrated with an external forms provider, extracting data from an internal data model and pushing it to the supplied API for use in a variety of ways (it’s the real estate industry, so there’s a lot of red tape and forms).

It was a fairly simple feature, because the requirements for the API were pretty simple: select a form from a list, supply a key/value store representing the available data, then display the form so the user can confirm it’s correct. In fact, the hardest part of the whole thing was finding a decent embedded browser to display the portal supplied alongside the API (which, annoyingly, is the only place where some of the functionality of the service is available).

At the time, we weren’t sure what the uptake would be for the feature (because the external service required its own subscription), so we built only exactly what was necessary, released it into the wild and waited to see what happened.

I wouldn’t say it was an amazing, ground-breaking success, but reception was solid enough that the business decided to return to the feature and develop it into something better.

Do You Speak the Language?

One of the issues with the external service is that it’s heavily dependent on the key/value store supplied to it for filling the forms. Unfortunately, it does not programmatically expose the list of keys required for a form. This information is instead supplied out of band, mostly via emailing spreadsheets around.

Even if the service did expose a list of keys, it would really only be an optimisation (so we would know exactly which keys needed filling, rather than just filling all of them). We would still need to know what each key means to us, and where its value can be obtained.

Our first version was hardcoded. We had a few classes that knew how to create an intermediate domain model specific to this feature based on where you were using it from, and that model could then be easily turned into a key/value store. At the time, the limitations of this approach were understood (hard to change, must ship new code in order to deal with new fields), but it was good enough to get us started.
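
To give a rough sense of what that looked like, here’s a minimal, hypothetical sketch (the domain type and key names are invented for illustration; the real classes were larger and there were several of them):

using System.Collections.Generic;

// Hypothetical domain type, standing in for a much larger model.
public class Contact
{
    public string FullName { get; set; }
    public string PhoneNumber { get; set; }
}

// One of the hardcoded classes: it knows which keys the form needs and
// exactly where in the domain model each value comes from.
public class ContactFormDataFactory
{
    public IDictionary<string, string> CreateKeyValueStore(Contact contact)
    {
        return new Dictionary<string, string>
        {
            { "CONTACT_NAME", contact.FullName },
            { "CONTACT_PHONE", contact.PhoneNumber },
            // Every new field means new code and a new deployment.
        };
    }
}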

Now that we had clearance to return to the feature and improve it, one of the first things we needed to do was change the way we obtained the key values. To support changing the way the keys were filled without doing a new deployment, we needed to move to a configuration-based approach.

The best way to do that? Create a simple language to describe how the configuration should be put together and then use a parser for that language to turn the text into a data structure.

Grammar Nazi

Most of the language is pretty simple, so I’ll just write it down in EBNF notation here:

<config> ::= <line> | <line> <config>
<line> ::= <mapping> | <comment> | <blank>
<blank> ::= <EOL>
<comment> ::= "#" <text> <EOL>
<mapping> ::= <key> "|" <expression> <EOL>
<expression> ::= <dynamic> | <literal> | <date> | <concatenated>
<literal> ::= "Literal;" <text>
<date> ::= "Date;" <format_string>
<concatenated> ::= <expression> "+" <expression>
<dynamic> ::= "DynamicLinq;" <code> [ ";" <format_string> ] [";" <default> ]

Essentially, a config consists of multiple lines, where each line might be a mapping, a blank line or a comment. Each mapping must contain a key, followed by some sort of expression to get the value of the key at runtime. There are a number of different expressions available. In the above specification, any term with no expansion is free text.
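
To make that concrete, here’s a small, hypothetical configuration that fits the grammar (the keys and property paths are invented for illustration):

# Comment lines start with a hash; blank lines are ignored.
AGENT_NAME|DynamicLinq;Agent.FullName
CONTRACT_DATE|Date;{dd/MM/yyyy}
OFFICE_PHONE|DynamicLinq;Office.PhoneNumber;"{0}";"Unknown"
FOOTER|Literal;"Generated automatically" + Date;{yyyy}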

With a defined grammar, all we needed was a library to parse it for us. While we could have written the parsing code ourselves, I’d much rather lean on a library that deals specifically with languages to do the work. I don’t want to own custom parsing code.

We chose Irony, which is a nice, neat little language parsing engine available for .NET.

When using Irony, you create a grammar (deriving from the Grammar class) and then fill it out with the appropriate rules. It ends up looking like this:

using System;
using Irony.Parsing;

public class ConfigurationGrammar : Grammar
{
    public ConfigurationGrammar()
        : base(false) // case insensitive
    {
        var data = new NonTerminal("data");
        var line = new NonTerminal("line");

        // The key is the free text before the first '|', which is consumed as a terminator.
        var key = new FreeTextLiteral("key", FreeTextOptions.ConsumeTerminator, "|");

        var concatExpression = new NonTerminal("concatExpression");
        var singleExpression = new NonTerminal("singleExpression");
        var dateExpression = new NonTerminal("dateExpression");
        var literalExpression = new NonTerminal("literalExpression");
        var linqExpression = new NonTerminal("linqExpression");

        // '[', ']' and '.' are allowed so a property path into the domain model
        // parses as a single identifier.
        var sourceField = new IdentifierTerminal("sourceField", "[].");

        var formatString = new NonTerminal("formatString");
        var defaultValue = new QuotedValueLiteral("defaultValue", "\"", TypeCode.String);
        var unquotedFormatString = new QuotedValueLiteral("unquotedFormatString", "{", "}", TypeCode.String);
        var quotedFormatString = new StringLiteral("quotedFormatString", "\"");

        // Format strings can be written as {unquoted} or "quoted".
        formatString.Rule = unquotedFormatString | quotedFormatString;

        // The format string and default value on a DynamicLinq expression are optional.
        linqExpression.Rule = ToTerm("DynamicLinq") + ";" + sourceField + ";" + formatString + ";" + defaultValue |
                              ToTerm("DynamicLinq") + ";" + sourceField + ";" + formatString |
                              ToTerm("DynamicLinq") + ";" + sourceField;
        literalExpression.Rule = ToTerm("Literal") + ";" + defaultValue;
        dateExpression.Rule = ToTerm("Date") + ";" + formatString;
        singleExpression.Rule = dateExpression | literalExpression | linqExpression;

        // A line's value is one or more expressions joined with '+'.
        concatExpression.Rule = MakePlusRule(concatExpression, ToTerm("+"), singleExpression);
        line.Rule = key + concatExpression;
        data.Rule = line + Eof;

        this.Root = data;
        this.LanguageFlags |= LanguageFlags.NewLineBeforeEOF;

        // '|' and ';' are structural only, so keep them out of the resulting tree.
        this.MarkPunctuation("|", ";");
    }
}

The output from using this grammar is an Abstract Syntax Tree, which can then be converted into an appropriate data structure that does the actual work. I’ll write another post about the details of how that data structure extracts data from our domain model (because this one is already long enough just covering the language/grammar). It’s relatively interesting though, because we had to move from the infinite flexibility of C# code to a more constrained set of functionality that can be represented with text (and would be usable by non-developers).
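
As a rough sketch of how the grammar gets used (Irony’s Parser and ParseTree types are real; the input and the tree walking here are simplified for illustration):

using System;
using Irony.Parsing;

public static class ConfigurationParsingExample
{
    public static void Main()
    {
        var parser = new Parser(new ConfigurationGrammar());
        var tree = parser.Parse("AGENT_NAME|DynamicLinq;Agent.FullName");

        if (tree.HasErrors())
        {
            foreach (var message in tree.ParserMessages)
            {
                // Irony reports zero-based locations.
                Console.WriteLine($"Line {message.Location.Line + 1}: {message.Message}");
            }
            return;
        }

        Print(tree.Root, 0);
    }

    // Dump the tree so you can see what the transformation code has to consume.
    private static void Print(ParseTreeNode node, int depth)
    {
        Console.WriteLine(new string(' ', depth * 2) + node.Term.Name);
        foreach (var child in node.ChildNodes)
        {
            Print(child, depth + 1);
        }
    }
}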

Halt! Wir Müssen Reden

Observant readers might notice that the grammar above does not line up exactly with the EBNF specification.

Originally we wrote the grammar class to line up exactly with the specification. Unfortunately, when we started actually writing the configuration file and testing various failure conditions, we discovered that if anything in the file failed, the entire parse would fail. Sure, it would tell you where it failed, but we needed a little bit more robustness than that (it’s not that important if one line is bad, but it is important that the rest continue to work as expected).

Our first attempt simply removed the problem line on failure and then parsed again, in a loop, until the configuration parsed correctly. This was inefficient, and it caused the line numbers reported in the error messages to be incorrect.

Finally, we decided to handle each line independently, so we amended the grammar to only know about a single valid line, and then ignored blank and comment lines with our own code.
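
A simplified sketch of that outer loop (our real implementation differs in the details, but this is the shape of it):

using System;
using System.Collections.Generic;
using Irony.Parsing;

public class ConfigurationFileParser
{
    private readonly Parser _parser = new Parser(new ConfigurationGrammar());

    public IEnumerable<ParseTree> ParseMappingLines(string configuration)
    {
        var lines = configuration.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);
        for (var i = 0; i < lines.Length; i++)
        {
            // Blank lines and comments are ours to deal with, not the grammar's.
            if (string.IsNullOrWhiteSpace(lines[i]) || lines[i].TrimStart().StartsWith("#"))
                continue;

            var tree = _parser.Parse(lines[i]);
            if (tree.HasErrors())
            {
                // A bad line no longer poisons the rest of the file, and because
                // we track the index ourselves, the reported line number is right.
                Console.WriteLine($"Line {i + 1} is invalid and was skipped.");
                continue;
            }

            yield return tree;
        }
    }
}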

I think it was a solid compromise. We’re still leveraging the grammar engine to do the heavy lifting, we just make sure it has good input to work with.

Summary

In the interests of full disclosure, I did not actually perform all of the work above. I guided the developer involved towards a language-based solution, the selection of a library to do the work for us, and the review of the code, but that’s about it.

I think that in the end it made for a better, more maintainable solution, with the most important thing being that we don’t own the parsing code. All we own is the code that transforms the resulting syntax tree into appropriate objects. This is especially valuable when it comes to the handling of parsing errors, because parsing a perfect file is easy.

Detecting all the places where that file might be wrong, that’s the hard part, and is generally where home-baked parsing falls apart.

Like the heading puns in this post.