Easy Does It

July 17. 2018 0 Comments

In my experience, you are far more likely to extend an existing system as a professional software developer than you are to build something from scratch. I mean, unless you’re amazing at engineering situations for yourself where you never have to build on top of existing code that is.

You tricky developer you.

Assuming that you’re not always as tricky as you want to be, when you build on top of an existing system, I have found the following quote to be quite helpful:

First, refactor to make the change easy (warning, this might be hard),
Then make the easy change

The internet says that this particular piece of wisdom originally came from Kent Beck in his book on TDD, which seems pretty likely.

Regardless of where it came from though, I’m going to use this post to explore a case study where I applied the quote directly while I was extending a piece of code that I didn’t originally write.

Minimum Viable Product

To set the stage, we have a small application that extracts data from a database and pushes it to an S3 bucket, where it is consumed by some third parties. It runs quietly in the background on a regular schedule thanks to TeamCity.

One of the consuming third parties would prefer that the data was delivered via SFTP though, to allow them to trigger some process automatically when the files arrive, rather than having to poll the S3 bucket for new files on their own schedule.

A completely understandable and relatable desire.

In the same vein, when we run the automated process through TeamCity, it would be nice if the files generated were attached to the execution as Build Artifacts, so that we could easily look at them later without having to go into the S3 bucket.

Obviously neither of these two requirements existing when the application was originally written, but now there needs to be multiple ways to export the data for each run. Perhaps we don’t always want to export to all the destinations either, maybe we want to do a test run that only exports to TeamCity for example.

In my current role I don’t typically write as much code as I used to, but everyone else was engaged in their own priorities, so it was time for me to step up, extend a system I did not write and flex my quickly atrophying technical muscles.

Structural Integrity Lacking

The application in question is not particularly complicated.

There are less than 20 classes total, and they are generally pretty focused. For example, there is class that offers the capability to write a file to S3 (abstracting away some of the complexity inherent in the AWS supplied interfaces), another for building queries, another for executing a query on top of a PostgreSQL database and streaming the results to a CSV file and finally a class that orchestrates the whole “get data, upload data” process for all of the queries that we’re interested in.

That last class (the aptly named CsvS3Export) is where we need to start refactoring.

public class CsvS3Export : IS3Export
{
    private readonly IClock _clock;
    private readonly IFileNamer _fileNamer;
    private readonly ITableToCsvExporter _tableExporter;
    private readonly IS3FileUploader _s3Uploader;
    private readonly ITableExportSpecificationResolver _tableExportSpecificationResolver;
    private readonly AmazonS3Client _s3Client;
    private readonly IReporter _reporter;

    public CsvS3Export(
        IClock clock, 
        IFileNamer fileNamer, 
        ITableToCsvExporter tableExporter,
        IS3FileUploader s3Uploader,
        ITableExportSpecificationResolver tableExportSpecificationResolver,
        AmazonS3Client s3Client, 
        IReporter reporter
    )
    {
        _clock = clock;
        _fileNamer = fileNamer;
        _tableExporter = tableExporter;
        _s3Uploader = s3Uploader;
        _tableExportSpecificationResolver = tableExportSpecificationResolver;
        _s3Client = s3Client;
        _reporter = reporter;
    }

    public S3ExportsSummary ExportTables(IList<string> tableNames, string workingDirectory, string s3BucketName)
    {
        try
        {
            var runTimestamp = _clock.UtcNow;
            var summaries = tableNames
                .AsParallel()
                .Select((tn, i) => ExportTable(workingDirectory, s3BucketName, tn, runTimestamp));

            return new S3ExportsSummary(summaries);
        }
        catch (Exception e)
        {
            _reporter.ReportInfo($"Error running export \n\n {e}");
            return new S3ExportsSummary(e, tableNames);
        }
    }

    private S3TableExportSummary ExportTable(string workingDirectory, string s3BucketName, string tn, DateTimeOffset runTimestamp)
    {
        try
        {
            var fileName = _fileNamer.GetFileName(runTimestamp, tn);
            var spec = _tableExportSpecificationResolver.Resolve(tn);
            var localOutputFile = new FileInfo(workingDirectory + "\\" + fileName);
            var exportResult = _tableExporter.Export(spec, localOutputFile);

            var s3Location = new S3FileLocation(s3BucketName, fileName.Replace("\\", "/"));
            var uploadResult = _s3Uploader.Upload(exportResult.FileLocation, s3Location);

            return new S3TableExportSummary(spec.TableName, localOutputFile, uploadResult);
        }
        catch (Exception e)
        {
            _reporter.ReportInfo($"Error running export for table {tn}\n\n {e}");
            return new S3TableExportSummary(tn, e);
        }
    }
}

As you can tell from the source above, the code is very focused around a specific type of file (a CSV) and a specific export destination (S3).

We don’t really care about the file generation part (we have no interest in generating non-CSV files for now), but if we want to start adding new destinations, we really should break this class apart into its constituent parts.

Remember, first make the change easy (refactor the existing code to allow for new destinations), then make the easy change (create the new destinations and slot them in).

Export Tariffs

Rather than coupling the class directly with the S3 functionality, all we have to do is extract a simple IExporter interface and then take a list of exporters as a constructor dependency.

The CsvS3Export then becomes more of an orchestrator, calling out to the class that does the actual data extraction with a target file and then iterating (safely) through the exporters, with the intent to report on the success or failure of any of the components involved.

public class DefaultExportOrchestrator : IExportOrchestrator
{
    public DefaultExportOrchestrator(IClock clock, ITableExportSpecificationResolver resolver, IFileNamer namer, ITableToCsvExtractor extractor, List<IExporter> exporters)
    {
        _clock = clock;
        _resolver = resolver;
        _namer = namer;
        _extractor = extractor;
        _exporters = exporters;
    }

    private readonly IClock _clock;
    private readonly ITableExportSpecificationResolver _resolver;
    private readonly IFileNamer _namer;
    private readonly ITableToCsvExtractor _extractor;
    private readonly List<IExporter> _exporters;

    public ExportOrchestrationResult Export(IList<string> tableNames, string workingDirectory)
    {
        if (!Directory.Exists(workingDirectory))
        {
            return new ExportOrchestrationResult(new DirectoryNotFoundException($"The supplied working directory [{workingDirectory}] did not exist"));
        }

        var now = _clock.UtcNow;
        var extracted = tableNames
            .AsParallel()
            .Select(a => ExtractTable(a, workingDirectory, now));

        return new ExportOrchestrationResult(extracted.ToList());
    }

    private TableExportOrchestrationResult ExtractTable(string table, string workingDirectory, DateTimeOffset runTimestamp)
    {
        var fileName = _namer.GetFileName(runTimestamp, table, "csv");
        var spec = _resolver.Resolve(table);
        var localOutputFile = new FileInfo(Path.Combine(workingDirectory, fileName.Combined));

        var result = _extractor.Extract(spec, localOutputFile);

        if (!result.Successfull)
        {
            return new TableExportOrchestrationResult(table, result);
        }

        var exported = _exporters.AsParallel().Select(a => SafeExport(table, result.FileLocation, a)).ToList();

        return new TableExportOrchestrationResult(table, result, exported);
    }

    private ExportResult SafeExport(string tableName, FileInfo file, IExporter exporter)
    {
        try
        {
            return exporter.Export(tableName, file);
        }
        catch (Exception ex)
        {
            return new ExportResult(exporter.Description, tableName, file, false, string.Empty, ex);
        }
    }
}

The important part of this process is that I did not add any new exporters until I was absolutely sure that the current functionality of the application was maintained (via tests and whatnot).

Only once the refactor was “complete” did I add the new functionality, and test it independently and in isolation, safe in the knowledge that I was extending a system designed to be extended.

From a Git point of view, the refactor itself was a single commit, and the addition of each new exporter was another commit.

Because a clean Git history is a beautiful thing.

Conclusion

Looking back, you might think that the initial structure of the code was wrong, and it should have been built this way in the first place.

I disagree with that, because it did exactly what it needed to do before we went and changed the requirements. It would be a fools errand to try and anticipate all future changes that might need to be made and accommodate them, not to mention incredibly expensive and wasteful, so I really can’t recommend going down that path.

The most important thing is to ensure that we consistently follow good engineering practices, so that later on when we need to make some changes, we don’t have to struggle through terribly structured code that doesn’t have any tests.

To tie it all back to the quote that started this post, spending the time to refactor the code to be more accepting of the pattern that I wanted to use made the additional of the additional export destinations much easier, while still following good software engineering practices.

I could probably have just jammed some naïve code right into that original CsvS3Export class that just exported to the other destinations, but that would have just made the code much more confusing to anyone coming by at a later date.

There’s already enough terrible code in the world, why add to it?

There Are 10 Kinds of Data In This World

May 15. 2018 0 Comments

There are many challenges inherent in creating an API. From the organization of the resources, to interaction models (CQRS and eventual consistency? Purely synchronous?) all the way through to the specific mechanisms for dealing with large collections of data (paging? filtering?) there is a lot of complexity in putting a service together.

One such area of complexity is binary data, and the handling thereof. Maybe its images associated with some entity, or arbitrary files, or maybe something else entirely, but the common thread is that the content is delivered differently to a normal payload (like JSON or XML).

I mean, you’re unlikely to to serve an image to a user as a base64 encoded chunk in the middle of a JSON document.

Or you might, I don’t know your business. Maybe it seemed like a really good idea at the time.

Regardless, if you need to deal with binary data (upload or download, doesn’t really matter), its worth thinking about the ramifications with regard to API throughput at the very least.

Use All The Resources

It really all comes down to resource utilization and limits.

Ideally, in an API, you want to minimise the time that every request takes, mostly so that you can maintain throughput. Every single request consumes resources of some sort, and if a particular type of request is consuming more than its fair share, the whole system can suffer.

Serving or receiving binary data is definitely one of those needytypes of requests, as the payloads tend to be larger (files, images and so on), so whatever resources are involved in dealing with the request are consumed for longer.

You can alleviate this sort of thing by making use of good asynchronous practices, ceding control over the appropriate resources so that they can be used to deal with other incoming requests. Ideally you never want to waste time waiting for slow things, like HTTP requests and IO, but there is always going to be an upper limit on the amount you can optimize by using this approach.

A different issue can arise when you consider the details of how you’re dealing with the binary content itself. Ideally, you want to deal with it as a stream, regardless of whether its coming in or going on. This sort of thing can be quite difficult to actually accomplish though, unless you really know what you’re doing and completely understand the entire pipeline of whatever platform you’re using.

Its far more common to accidentally read the entire body of the binary data into memory all at once, consuming way more resources that are strictly necessary than if you were pushing bytes through in a continuous stream. I’ve had this happen silently and unexpectedly when hosting web applications in IIS, and it can be quite a challenging situation to engineer around.

Of course if you’re only dealing with small amounts of binary data, or small numbers of requests, you’re probably going to be fine.

Assuming every relevant stakeholder understands the limitations that is, and no one sells the capability to easily upload and download thousands and thousands of photos taken from a modern camera phone.

But its not like that’s happened to me before at all.

Separation Of Concerns

If you can, a better model to invoke is one in which you don’t deal with the binary data directly at all, and instead make use of services that are specifically built for that purpose.

Why re-invent the wheel, only to do it worse and with more limitations?

A good example of this is to still have the API handle the entities representing the binary data, but when it comes to dealing with the hard bit (the payload), provide instructions on how to do that.

That is, instead of hitting GET /thing/{id}/images/{id}, and getting the binary content in the response payload, have that endpoint return a normal payload (JSON or similar) that contains a link for where to download the binary content from.

In more concrete terms, this sort of thing is relatively easily accomplished using Amazon Simple Storage Service (S3). You can have all of your binary content sitting securely in an S3 bucket somewhere, and create pre-signed links for both uploading and downloading with very little effort. No binary data will ever have to pass directly through your API, so assuming you can deal with the relatively simple requests to provide directions, you’re free and clear.

Its not like S3 is going to have difficulties dealing with your data.

Its somewhat beyond my area of expertise, but I assume you can accomplish the same sort of thing with equivalent services, like Azure Blobs.

Hell, you can probably even get it working with a simple Nginx box serving and storing content using its local disk, but honestly, that seems like a pretty terrible idea from a reliability point of view.

Conclusion

As with anything in software, it all comes down to the specific situation that you find yourself in.

Maybe you know for certain that the binary data you deal with is small and will remain that way. Fine, just serve it from the API. Why complicate a system if you don’t have to.

But if the system grows beyond your initial assumptions (and if its popular, it will), then you’ll almost certainly have problems scaling if you’re constantly handling binary data yourself, and you’ll have to pay a price to resolve the situation in the long run.

Easily the hardest part of being software engineer is trying to predict what direction the wind will blow, and adjust accordingly.

Really, its something of a fools errand, and the best idea is to be able to react quickly in the face of changing circumstances, but that’s even harder.

Comma, Comma, Comma Chameleon

April 24. 2018 0 Comments

In previous posts, I’ve briefly mentioned how one of the most valuable outcomes of the data synchronization algorithm was the capability to gather meaningful business intelligence about our users. This includes information like how often they use the software, what parts they are using and how many of its entities they are creating (and deleting) during their normal daily process.

When used responsibly, this sort of intelligence is invaluable in helping to create a better, more valuable experience for everyone involved. Literally every part of your standard organization (support, sales, development, marketing, finance) can make use of the data in some way.

Of course, when its all sequestered inside a PostgreSQL database, the barrier to entry can be quite high. Generally this just means that the people who are best positioned to make use of the data either don’t get access to it (a waste) or they end up constantly bugging the members of the small group of people with the capability to do the analysis.

When I’m a member of that small group, I get suddenly motivated to find an alternate way to give them what they need.

Re-education Campaign

There are really two kinds of analysis that tend to happen for business intelligence purposes.

The first is answering ad-hoc questions. Typically these sorts of questions have not been asked before, or they could be a slight variation on the same theme as a previous question.

Regardless, these generally require someone who actually knows the data and has the skills to wrangle it in order to get to the meat of the issue and provide the answer. This can be taught (with some effort), but not everyone has the time to dedicate to that sort of thing, so new analysis will probably always have to be performed by trained professionals.

The second type of analysis is re-answering a previously answered question, now that time has passed.

This tends to occur quite frequently, especially when it comes to metrics (like “how many customers have X, how many Y does each customer have, etc), and it is here that there is an opportunity to optimise.

In the long term, there should be an easy platform for people to get at the information they want, without having to understand how to run SQL scripts (or Powershell). This is likely going to come in the form of Metabase, but unfortunately this is probably still a little ways off in the future, and people are asking for data now.

So in the short term, some way to easily run pre-canned queries and get the results as a CSV is desirable.

Copy That

The good news is that PostgreSQL makes it pretty easy to run a query and get CSV output using the COPY command:

COPY ({query}) TO STDOUT WITH DELIMITER ',' CSV HEADER

Executing that command will result in a stream containing the data specified in the query (along with headers), in CSV format.

Of course, its kind of useless without a mechanism to easily run it, so if you’re using .NET, you can leverage Npgsql to actually execute the command using the BeginTextExport function on an NpgsqlConnection, like in the following class:

using System.IO;

namespace Solavirum.Exporting
{
    public class PostgresTableStreamer
    {
        private readonly INpgsqlConnectionFactory _npgsqlConnectionFactory;

        public PostgresTableStreamer(INpgsqlConnectionFactory npgsqlConnectionFactory)
        {
            _npgsqlConnectionFactory = npgsqlConnectionFactory;
        }

        public long Stream(string copySql, StreamWriter output)
        {
            var numberOfLinesInFile = 0;
            using (var conn = _npgsqlConnectionFactory.Create())
            {
                conn.Open();
                using (var reader = conn.BeginTextExport(copySql))
                {
                    var i = reader.Read();
                    var insideQuotedField = false;

                    while (i != -1)
                    {
                        var c = (char) i;
                        if (c == '"') insideQuotedField = !insideQuotedField;
                        switch (c)
                        {
                            case '\r':
                                i = reader.Read();
                                continue;
                            case '\n':
                                if (insideQuotedField) output.Write("\\n");
                                else
                                {
                                    output.Write(c);
                                    numberOfLinesInFile++;
                                }
                                break;
                            default:
                                output.Write(c);
                                break;
                        }
                        i = reader.Read();
                    }
                    output.Flush();
                }
                conn.Close();
            }
            return numberOfLinesInFile;
        }
    }
}

The only tricksy thing is escaping new line characters so that they don’t explode the formatting of the resulting file.

A Model View

Of course, a C# class is just as useless to a non-technical person as an SQL query, so now we have to think about delivery.

The easiest way to give people access to some piece of functionality is typically a website, so that seems like a good approach.

My skills are somewhat limited when it comes to putting together a website in a hurry. Our previous websites have been React.JS, but they need a bit of a pipeline to get up and running and I didn’t want to invest that much in this prototype.

Eventually, I settled on an ASP.NET CORE MVC website, deployed manually to a single Linux instance in AWS.

Nothing too complicated, just a single controller that lists the queries that are available (which are deployed with the code, no custom queries here), and then a way to run a single query and get the result as a file.

Getting ASP.NET CORE MVC up and running on a Linux box is pretty straightforward, assuming you use a local Nginx instance as a reverse proxy (which is easy).

As always, because we’re not stupid, a little touch of security is also necessary, so:

The box is behind a load balancer, to allow for easy HTTPS.
The load balancer is only accessible from the static IP at our main office. If you’re out of the office you can still use the service, but you need to be on the VPN, which is required for a bunch of our other things anyway.

If the website lives for long enough, I’ll probably augment it with Active Directory logins, so people can just use their domain credentials, but short term, its secure enough.

Conclusion

That’s basically it.

Start to finish, it all took about 4 hours, and while I wouldn’t say it was my best work, Its good enough to solve the immediate problem. At least until we get Metabase up and running and we throw the whole thing out anyway.

I’ll probably clean it up a bit (the manual deployment rankled me more than I wanted to admit) and then wash my hands of the whole thing.

More importantly, the amount of people asking me to “please run query X and email me the results” has dropped dramatically, so I’ll consider this a success.

Now I have more time for the really important things.

Like participating in the Mario Kart Tournament.

The Ramifications of Time Travel

April 17. 2018 0 Comments

Posted in:
programming

The pieces of software that are easiest to manage are often the most isolated, mostly because of a combination of two things:

You control the entire world that the software cares about (maybe its all in a database, maybe the file system, maybe written into the fabric of the universe, but you control it all)
You don’t really have to care about any other parties in the relationship except for your users, and you control their world, so that’s easy

Unfortunately, the most isolated software is also often the least valuable, because it does not participate in a wider community.

How is this relevant to me?

Well, my team is responsible for a piece of software that was originally very well isolated. It used to be that everything the user cared about (all of their data), was encapsulated nice and neat in an SQL Server database. Connect to a different database, and the user can easily change their entire context.

This database centric view naturally led to an excellent backup and restore process, and over time the users became familiar with using restores for a number of purposes, from resolving actual software bugs that caused data corruption to simply undoing actions that they didn’t mean to do.

All was well and the world was happy, but everything changed when the third party integrations attacked.

Back To The….Past?

From a software point of view, when you no longer control the entire universe, things get substantially more difficult. This is not new information, but from our point of view, the users much loved (and frequently executed) backup and restore operations became something of a liability.

Integrating with an entirely separate system is challenging at the best of times, but when you add into the mix the capability for one of the participants in the relationship to arbitrarily regress to an earlier point in its timeline, it gets positively frustrating.

We’ve done a lot of integrations in the time that I’ve been working on the software in question, but there are a few that stick out in my mind as being good examples of the variety of ways that we’ve had to use to deal with the issue:

A third party service that made use of the client data, to in turn create things that went back into the client data. This one wasn’t too bad, as the worst thing that could happen would be an entity coming back to the client based on data that was no longer relevant. In this case, we can discard the irrelevant information, usually confirming with the user first.
A data synchronization process that replicates client data to a remote location. The nice thing about this one is that the data is only flowing one way, so all we really needed to do was detect the database restore and react appropriately. The impact of the database restore on the overall complexity of the synchronization algorithm was relatively minimal.
A third party service that we integrated with directly (as opposed to integrating with us like in the first point). Third party entities are linked to local entities, but have separate lifetimes, so the biggest issue as a result of database restores was orphans. Unfortunately, this third party service was a payment provider, so an orphan entity still capable of taking real money is a pretty big problem. I’ll get into it more detail later on in this post, but another thing we had to consider here was detecting and avoiding duplicate operations, as a result of the user repeating actions that had been undone by a database restore.

Of those three integrations, the third one is the most recent, and has proven to be the most challenging for a variety of reasons.

Localised Area Of Effect

As a result of the restore capabilities inherent in the product, we’ve had to repeatedly take certain measures in order to ensure that our users don’t get themselves into difficult situations.

Where possible, we try to store as much state as we can in the local database, so if a database restore occurs, at least everything goes back to an earlier time in unison. This works remarkably well as long as the source of truth is the local data source, so if a third party integration differs from the local, trust the local and remove the offending information. Essentially, we are extending the scope of the database restore to the third party, as much as you can anyway.

Of course, just annihilating user data is not always possible (or desirable), even if that’s technically what they asked us to do by performing a restore.

Our second approach is to allow the data to live elsewhere, but force it to synchronize automatically to the local database as necessary. This technically splits the source of truth responsibility, so it can have unfortunate side effects, like synchronizing information that is no longer relevant to the restored state of the database, which isn’t great. A technical mechanism that can make this synchronization process easier is for each entity to have a unique, monotonically increasing numeric identifier, and to be able to query the entities by that identifier and its range (i.e. give me next 100 > X). We’ve used this approach a few times now, and assuming the remote entities don’t change over time, its fairly effective.

As you can imagine, both of these things involve a significant amount of communication overhead, both technically and cognitively and can result in some seriously confusing bugs and behaviour if it all goes pear shaped.

A Potent Concept

The approaches that I was just writing about are a bit more relevant when we are acting as our own third party (i.e. we’re writing both sides of the integration, which happens a fair bit).

Sometimes we have to integrate with services and systems provided by actual third parties.

In those cases, we have to get creative, or, unfortunately, put some of the responsibility on the users themselves. If we take orphan data as an example, when detected, we let the user know, and they are then required to deal with the problems and bring the system back into an acceptable state before they can move on. Obviously you have to be careful with how you identify the orphans, especially if you’re not the only integrator, but that is a solvable problem.

If you’re lucky, the third party service will allow you to do things in an idempotent manner, such that even if you accidentally do the same operation twice (or more), the ramifications are controlled. This is particularly relevant when integrating with services that deal with actual money, because a double (or triple) transaction is pretty catastrophic from a reputation point of view.

Of course, in order for the command or operation to be idempotent, you need to have some sort of consistent way to identify and classify it. Usually this means an ID of some sort that doesn’t change.

This gets challenging for us, because a database restore is a brutally efficient operation. Everything that was, ceases to exist, as if it never was to begin with. If the user then chooses to recreate some entity that was interacting with the third party service, there is no guarantee that its identity will be the same, even if it represents the same conceptual thing. For example, IDs generated at the database level might differ based on the order of operations, or might just be different altogether, as in the case when automatically generating GUIDs.

When this happens, we typically have to rely on the user to know what to do, which is a bit of a cop out, but compromises have to be made sometimes.x

Conclusion

Easily one of the most important lessons from supporting arbitrary database backups and restores is just how costly and complicated they can make integrations.

The older and more well established a piece of software gets, the more likely it will be to have revenue growth opportunities involving other parties, especially if it forms a community of sorts. I mean, its probably not cost effective to try and do everything yourself, no matter how amazing your organisation is.

Keep in mind, to the user, all of this should just work, because they don’t see the complexity inherent in the situation.

That means if you screw it up, the dissatisfaction level will be high, but if you get it right, at best they will be neutral.

To quote a great episode of Futurama:

When you do things right, people won’t be sure you’ve done anything at all

You’re Mocking Me Aren’t You

January 6. 2015 0 Comments

Object Relational Mappers (ORMs) are useful tools. If you don’t want to have to worry about writing the interactions to a persistence layer yourself, they are generally a good idea. Extremely powerful, they let you focus on describing the data that you want, rather than manually hacking out (and then maintaining!) the queries yourself, and help with change tracking, transactional constructs and other things.

Testing code that uses an ORM, however, is typically a pain. At least in my experience.

People typically respond to this pain by abstracting the usage of their ORM away, by introducing repositories or some other persistence strategy pattern. They use the ORM inside the repository, and then use the more easily mocked repository everywhere else, where it can be substituted with a much smaller amount of effort. There are other benefits to this approach, including the ability to model the domain more accurately (can abstract away the persistence structure) and the ability to switch out the ORM you use for some other persistence strategy without having to make lots of changes. Possibly.

The downside of creating an abstraction like the above is that you lose a lot of ORM specific functionality, which can be quite powerful. One of the most useful feature of ORMs in C# is to be able to write Linq queries directly against the persistence layer. Doing this allows for all sorts of great things, like only selecting the properties you want and farming out as much of the work as possible to the persistence layer (maybe an SQL database), rather than doing primitive queries and putting it all together in memory. If you do want to leverage that power, you are forced to either make your abstraction leaky, exposing bits of the ORM through it (which makes mocking it harder) or you have to write the needed functionality again yourself, except into your interface, which is duplicate work.

Both of those approaches (expose the ORM, write an abstraction layer) have their upsides and downsides so like everything in software it comes down to picking the best solution for your specific environment.

In the past, I’ve advocated creating an abstraction to help isolate the persistence strategy, usually using the Repository pattern. These days I’m not so sure that is the best way to go about it though, as I like to keep my code as simple as possible and the downsides of introducing another layer (with home grown functionality similar to but not quite the same as the ORM) have started to wear on me more and more.

EF You

I’ve recently started working on an application that uses Entity Framework 6, which is a new experience for me, as all of my prior experience with ORM’s was via NHibernate, and to be brutally honest, there wasn’t much of it.

Alas, this application does not have very many tests, which is something that I hate, so I have been trying to introduce tests into the codebase as I add functionality and fix bugs.

I’m going to assume at this point that everyone who has ever done and programming and writes tests has tried to add tests into a codebase after the fact. Its hard. Its really hard. You have to try and resist the urge to rebuild everything and try and find ways to add testability to an architecture that was never intended to be testable without making too many changes or people start to get uncomfortable.

I understand that discomfort. I mean that's one of the biggest reasons you have tests in the first place, so you can make changes without having to worry about breaking stuff. Without those tests, refactoring to introduce tests is viewed as a risky activity, especially when you first start doing it.

Anyway, I wanted to write an integration tests for a particular piece of new functionality, to verify that everything worked end to end. I’ve written about what I consider an integration test before, but in essence it is any test that involves multiple components working together. These sorts of tests are usually executed with as many things configured and setup the same as the actual application, with some difficult or slow components that sit right at the boundaries being substituted. Persistence layers (i.e. databases) are a good thing to substitute, as well as non-local services, because they are slow (compared to in memory) and usually hard to setup or configure.

In my case I needed to find a way to remove the dependency on an external database, as well as a number of services. The services would be easy, because its relatively trivial to introduce an interface to encapsulate the required behaviour from a service, and then provide an implementation just for testing.

The persistence layer though…

This particular application does NOT use the abstraction strategy that I mentioned earlier. It simply exposes the ability to get a DbContext whenever something needs to access to a persistent store.

A for Effort

Being that the application in question used EF6, I thought that it would be easy enough to leverage the Effort library.

Effort provides an in-memory provider for Entity Framework, allowing you to easily switch between whatever your normal provider is (probably SQL Server) for one that runs entirely in memory.

Notice that I said I thought that it would be easy to leverage Effort…

As is always the case with this sort of thing, the devil is truly in the details.

It was easy enough to introduce a factory to create the DbContext that the application used instead of using its constructor. This allowed me to supply a different factory for the tests, one that leveraged Effort’s in-memory provider. You accomplish this by making sure that there is a constructor for the DbContext that takes a DbConnection, and then use Effort to create one of its fancy in-memory connections.

On the first run of the test with the new in-memory provider, I got one of the least helpful errors I have ever encountered:

System.InvalidOperationException occurred: Sequence contains no matching element
  StackTrace:
       at System.Linq.Enumerable.Single[TSource](IEnumerable`1 source, Func`2 predicate)
       at System.Data.Entity.Utilities.DbProviderManifestExtensions.GetStoreTypeFromName(DbProviderManifest providerManifest, String name)     at System.Data.Entity.ModelConfiguration.Configuration.Properties.Primitive.PrimitivePropertyConfiguration.Configure(EdmProperty column, EntityType table, DbProviderManifest providerManifest, Boolean allowOverride, Boolean fillFromExistingConfiguration)
       at System.Data.Entity.ModelConfiguration.Configuration.Properties.Primitive.PrimitivePropertyConfiguration.<>c__DisplayClass1.<Configure>b__0(Tuple`2 pm)
       at System.Data.Entity.Utilities.IEnumerableExtensions.Each[T](IEnumerable`1 ts, Action`1 action)
       at System.Data.Entity.ModelConfiguration.Configuration.Properties.Primitive.PrimitivePropertyConfiguration.Configure(IEnumerable`1 propertyMappings, DbProviderManifest providerManifest, Boolean allowOverride, Boolean fillFromExistingConfiguration)
       at System.Data.Entity.ModelConfiguration.Configuration.Properties.Primitive.BinaryPropertyConfiguration.Configure(IEnumerable`1 propertyMappings, DbProviderManifest providerManifest, Boolean allowOverride, Boolean fillFromExistingConfiguration)
       at System.Data.Entity.ModelConfiguration.Configuration.Types.StructuralTypeConfiguration.ConfigurePropertyMappings(IList`1 propertyMappings, DbProviderManifest providerManifest, Boolean allowOverride)
       at System.Data.Entity.ModelConfiguration.Configuration.Types.EntityTypeConfiguration.ConfigurePropertyMappings(DbDatabaseMapping databaseMapping, EntityType entityType, DbProviderManifest providerManifest, Boolean allowOverride)
       at System.Data.Entity.ModelConfiguration.Configuration.Types.EntityTypeConfiguration.Configure(EntityType entityType, DbDatabaseMapping databaseMapping, DbProviderManifest providerManifest)
       at System.Data.Entity.ModelConfiguration.Configuration.ModelConfiguration.ConfigureEntityTypes(DbDatabaseMapping databaseMapping, DbProviderManifest providerManifest)
       at System.Data.Entity.ModelConfiguration.Configuration.ModelConfiguration.Configure(DbDatabaseMapping databaseMapping, DbProviderManifest providerManifest)
       at System.Data.Entity.DbModelBuilder.Build(DbProviderManifest providerManifest, DbProviderInfo providerInfo)
       at System.Data.Entity.DbModelBuilder.Build(DbConnection providerConnection)
       at System.Data.Entity.Internal.LazyInternalContext.CreateModel(LazyInternalContext internalContext)
       at System.Data.Entity.Internal.RetryLazy`2.GetValue(TInput input)

Keep in mind, I got this error when attempting to begin a transaction on the DbContext. So the context had successfully been constructed, but it was doing…something…during the begin transaction that was going wrong. Probably initialization.

After a significant amount of reading, I managed to find some references to the fact that Effort doesn’t support certain SQL Server specific column types. Makes sense in retrospect, although at the time I didn’t even know you could specify provider specific information like that. I assume it was all based around automatic translation between CLR types and the underlying types of the provider.

There is a lot of entities in this application and the all have a large amount of properties. I couldn’t read through all of the classes to find what the problem was, and I didn’t even know exactly what I was looking for. Annoyingly, the error message didn’t say anything about what the actual problem was, as you can see above. So, back to first principles. Take everything out, and start reintroducing things until it breaks.

I turns out quite a lot of the already existing entities were specifying the types (using strings!) in the Column attribute of their properties. The main offenders were the “timestamp” and “money” data types, which Effort did not seem to understand.

Weirdly enough, Effort had no problems with the Timestamp attribute when specified on a property. It was only when the type “timestamp” was specified as a string in the Column attribute that errors occurred.

The issue here was of course that the type was string based, so the only checking that occurred, occurred at run-time. Because I had introduced a completely different provider to the mix, and the code was written assuming SQL Server, it would get to the column type in the initialisation (which is lazy, because it doesn’t happen until you try to use the DbContext) and when there was no matching column type returned by the provider, it would throw the exception above.

Be Specific

Following some advice on the Effort discussion board, I found some code that moved the SQL Server specific column types into their own attributes. These attributes would then only be interrogated when the connection of the DbContext was actually an SQL Server connection. Not the best solution, but it left the current behaviour intact, while allowing me to use an in-memory database for testing purposes.

Here is the attribute, it just stores a column type as a string.

public class SqlColumnTypeAttribute : Attribute
{
    public SqlColumnTypeAttribute(string columnType = null)
    {
        ColumnType = columnType;
    }

    public string ColumnType { get; private set; }
}

Here is the attribute convention, which EF uses to define a rule that will interpret attributes and change the underlying configuration.

public class SqlColumnTypeAttributeConvention : PrimitivePropertyAttributeConfigurationConvention<SqlColumnTypeAttribute>
{
    public override void Apply(ConventionPrimitivePropertyConfiguration configuration, SqlColumnTypeAttribute attribute)
    {
        if (!string.IsNullOrWhiteSpace(attribute.ColumnType))
        {
            configuration.HasColumnType(attribute.ColumnType);
        }
    }
}

Here is a demo DbContext showing how I used the attribute convention. Note that the code only gets executed if the connection is an SqlConnection.

public partial class DemoDbContext : DbContext
{
    public DemoDbContext(DbConnection connection, bool contextOwnsConnection = true)
        : base(connection, contextOwnsConnection)
    {
    
    }
    
    protected override void OnModelCreating(DbModelBuilder modelBuilder)
    {
        if (Database.Connection is SqlConnection)
        {
            modelBuilder.Conventions.Add<SqlColumnTypeAttributeConvention>();
        }
    }
}

Finally, here is the attribute being used in an entity. Previously this entity would have simply had a [Column(TypeName = “timestamp”)] attribute on the RowVersion property, which causes issue with Effort.

public partial class Entity
{
    [Key]
    public int Id { get; set; }

    [SqlColumnType("timestamp")]
    [MaxLength(8)]
    [Timestamp]
    public byte[] RowVersion { get; set; }
}

Even though there was a lot of entities with a lot of properties, this was an easy change to make, as I could leverage a regular expression and find and replace.

Of course, it still didn’t work.

I was still getting the same error after making the changes above. I was incredibly confused for a while, until I did a search for “timestamp” and found an instance of the Column attribute where it supplied both the data type and the order. Of course, my regular expression wasn’t smart enough to pick this up, so I had to manually go through and split those two components (Type which Effort didn’t support and Order which it did) manually wherever they occurred. Luckily it was only about 20 places, so it was easy enough to fix.

And then it worked!

No more SQL Server dependency for the integration tests, which means they are now faster and more controlled, with less hard to manage dependencies.

Of course, the trade-off for this is that the integration tests are no longer testing as close to the application as they could be, but that’s why we have functional tests as well, which run through the instaled application, on top of a real SQL Server instance. You can still choose to run the integration tests with an SQL Server connection if you want, but now you can use the much faster and easier to manage in-memory database as well.

Conclusion

Effort is awesome. Apart from the problems caused by using SQL Server specific annotations on common entities, Effort was extremely easy to setup and configure.

I can’t really hold the usage of SQL Server specific types against the original developers though, as I can’t imagine they saw the code ever being run on a non-SQL Server provider. Granted, it would have been nice if they had of isolated the SQL Server specific stuff from the core functionality, but that would have been unnecessary for their needs at the time, so I understand.

The biggest problem I ran into was the incredibly unhelpful error message coming from EF6 with regards to the unsupported types. If the exception had stated what the type was that couldn’t be found and for which property in which class, I wouldn’t have had to go to so much trouble to find out what the actual problem was.

Its never good being confronted with an entirely useless exception message, and we have to always be careful to make sure that our exceptions fully communicate the problem so that they help future developers, instead of just getting in the way.

A little Effort goes a long way after all.