
Software development as a discipline puts a bunch of effort into trying to minimise the existence and creation of bugs, but the reality is that it’s an investment/return curve that flattens off pretty quickly.

Early discovery of issues is critical. Remember, the cost to the business of a bug is never lower than it is at development time. The longer it has to fester, the worse it’s going to get.

Of course, when a bug is discovered, there are decisions to make around whether or not to fix it. For me, every bug that exists in a piece of software that might cause an issue for a user is a mark against its good name, so my default policy is to fix. Maybe not in the same piece of work that it was found in, but in general, bugs should be fixed.

That is, unless you hit that awkward conjunction of high cost/low incidence.

Why waste a bunch of money fixing a bug that might never happen?

Split Personality

I’m sure you can guess from the mere existence of this blog post that this is exactly the situation we found ourselves in recently.

While evaluating a new component in our legacy application, we noticed that it was technically possible to break what was intended to be an entirely atomic operation consisting of multiple database writes.

Normally this wouldn’t even be worth talking about, as it’s basically the reason that database transactions exist. When used correctly, they guarantee an all-or-nothing outcome.

Unfortunately, one of the writes was in Visual Basic 6 code, and the other was in .NET.

I don’t know if you’ve ever tried to span a database transaction across a technology boundary like that, but it’s not exactly the easiest thing in the world.

When we looked into the actual likelihood of the issue occurring, we discovered that if the VB6 part failed, we could easily just rollback the .NET part. If the write failed in .NET though, we had no way to go back and undo the work that had already been done in VB6. Keep in mind, this entire section was essentially creating transactions in a trust accounting application, so non-atomic operations can get users into all sorts of terrible situations.

On deeper inspection, the only way we thought the .NET write could fail was a transitory database issue; that is, a connection or command timeout or a disconnect.

We implemented a relatively straightforward retry strategy to deal with those sorts of failures and then moved on. Sure, it wasn’t perfect, but it seemed like we’d covered our bases pretty well and mitigated the potential issue as best we could.
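For illustration, a retry strategy along those lines might look something like the sketch below. This is not the code from our application; the attempt count, the back-off and the particular SQL Server error numbers treated as transitory are all assumptions.

public static void ExecuteWithRetries(Action databaseWrite, int maxAttempts = 3)
{
    for (var attempt = 1; ; attempt++)
    {
        try
        {
            databaseWrite();
            return;
        }
        catch (Exception ex)
        {
            // Give up if the failure doesn't look transitory, or if we're out of attempts.
            if (!IsTransitory(ex) || attempt >= maxAttempts) throw;

            // Back off briefly before trying again.
            Thread.Sleep(TimeSpan.FromSeconds(attempt));
        }
    }
}

private static bool IsTransitory(Exception ex)
{
    // Command/connection timeouts and dropped connections are worth retrying. The specific
    // error numbers here are illustrative (-2 = timeout, 53 = network path not found).
    var sqlException = ex as SqlException;
    return ex is TimeoutException
        || (sqlException != null && (sqlException.Number == -2 || sqlException.Number == 53));
}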

I Did Not See That One Coming

Of course, the code failed in a way completely unrelated to temporary connectivity issues.

In our case, we were stupid and attempted to write an Entity Framework entity to the database whose string values exceeded the column size limits. Long story short, we were concatenating an information field from some other fields and didn’t take into account that the combined length of those other fields might exceed the maximum.

The write failure triggered exactly the situation that we were worried about; the actual trust account record had been created (VB6) but our record of it happening was missing (.NET).

I still don’t actually know why we bothered implementing column size limits. As far as I know, there is no difference between a column of VARCHAR(60) and VARCHAR(MAX) when it comes to performance. Sure, you could conceivably store a ridiculous amount of data in the MAX column at some point, but I feel like that is a lot less destructive than the write failures (and its knock-on effects) that we got.

Even worse, from the user’s point of view, the operation looked like it had worked. There were no error notifications visible to them, because we couldn’t write to the table that we used to indicate errors! When they returned to their original action list though, the item that failed was still present. They processed it again and the same thing happened (it looked like it worked, but it was still in the list afterwards), at which point they twigged that something was wrong and contacted our support team (thank god).

Once we found out about the issue, we figured out pretty quickly what the root cause was thanks to our logging and cursed our hubris.

Off With Their Head!

The fix for this particular problem was easy enough and involved two extension methods; one for truncating a string and another for scanning an object and automatically truncating string lengths as per data annotation attributes.

public static string Truncate(this string value, int maxLength, ILogger logger = null)
{
    if (string.IsNullOrEmpty(value))
    {
        return value;
    }

    if (maxLength < 0) throw new ArgumentException($"Truncate cannot be used with a negative max length (supplied {nameof(maxLength)} was [{maxLength}]). That doesn't even make sense, what would it even do?", nameof(maxLength));

    if (value.Length <= maxLength)
    {
        return value;
    }

    string truncated = maxLength <= 3 ? value.Substring(0, maxLength) : value.Substring(0, maxLength - 3) + "...";

    
    logger?.Debug("The string [{original}] was truncated because it was longer than the allowed length of [{length}]. The truncated value is [{truncated}]", value, maxLength, truncated);

    return truncated;
}

public static void TruncateAllStringPropertiesByTheirMaxLengthAttribute(this object target, ILogger logger = null)
{
    var props = target.GetType().GetProperties().Where(prop => Attribute.IsDefined(prop, typeof(MaxLengthAttribute)) && prop.CanWrite && prop.PropertyType == typeof(string));

    foreach (var prop in props)
    {
        var maxLength = prop.GetCustomAttribute(typeof(MaxLengthAttribute)) as MaxLengthAttribute;
        if (maxLength != null)
        {
            prop.SetValue(target, ((string)prop.GetValue(target)).Truncate(maxLength.Length, logger));
        }
    }
}

Basically, before we write the entity in question, just call the TruncateAllStringPropertiesByTheirMaxLengthAttribute method on it.
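Usage looks something like the sketch below; the entity, its properties and the logger field are hypothetical, but the important part is that the truncation happens immediately before the save.

public void SaveTransactionRecord(TrustDbContext db, string payee, string reference, string description)
{
    var record = new TrustTransactionRecord
    {
        // A concatenated information field is exactly the sort of thing that originally blew past the column limit.
        Information = string.Join(" ", payee, reference, description),
    };

    // Truncate anything longer than its annotated maximum before EF tries to write it.
    record.TruncateAllStringPropertiesByTheirMaxLengthAttribute(_logger);

    db.TrustTransactionRecords.Add(record);
    db.SaveChanges();
}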

With the immediate problem solved, we were still left with two outstanding issues though:

  • A failure occurred in the code and the user was not notified
  • An atomic operation was still being split across two completely different programming contexts

In this particular case we didn’t have time to alleviate the first issue, so we pushed forward with the fix for the issue that we knew could occur.

We still have absolutely no idea how to deal with the second issue though, and honestly, probably never will.

Conclusion

In retrospect, I don’t think we actually made the wrong decision. We identified an issue, analysed the potential occurrences, costed a fix and then implemented a smaller fix that should have covered our bases.

The retry strategy would likely have dealt with transitory failures handily, we just didn’t identify the other cases in which that section could fail.

As much as I would like to, it’s just not cost-effective to account for every single edge case when you’re developing software.

Well, unless you’re building like pacemaker software or something.

Then you probably should.


Working on a legacy system definitely has its challenges.

For example, it’s very common for there to be large amounts of important business logic encapsulated in the structures and features of the database. Usually this takes the form of things like stored procedures and functions, default values and triggers, and when you put them all together, they can provide a surprising amount of functionality for older applications.

While not ideal by today’s standards, this sort of approach is not necessarily terrible. If the pattern was followed consistently, at least all of the logic is in the same place, and legacy apps tend to have these magical global database connections anyway, so you can always get to the DB whenever you need to.

That is, until you start adding additional functionality in a more recent programming language, and you want to follow good development practices.

Like automated tests.

What The EF

If you’re using Entity Framework on top of an already existing database and you have stored procedures that you want (need?) to leverage, you have a few options.

The first is to simply include the stored procedure or function when you use the Entity Data Model Wizard in Visual Studio. This will create a function on the DbContext to call the stored procedure, and map the result set into some set of entities. If you need to change the entity return type, you can do that too, all you have to do is make sure the property names line up. This approach is useful when your stored procedures represent business logic, like calculations or projections.

If the stored procedures in the database represent custom insert/update/delete functionality, then you can simply map the entity in question to its stored procedures. The default mapping statement will attempt to line everything up using a few naming conventions, but you also have the ability to override that behaviour and specify procedures and functions as necessary.

If you don’t want to encapsulate the usage of the stored procedures, you can also just use the SqlQuery and ExecuteSqlCommandAsync functions available on the DbContext.Database property, but that requires you to repeat the usage of magic strings (the stored procedure and function names) whenever you want to execute the functionality, so I don’t recommend it.
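For completeness, the magic-string approach looks something like the sketch below; the procedure names, the CalculationResult type and the LegacyDbContext type are made up for illustration.

public async Task<List<CalculationResult>> RecalculateAndArchiveAsync(int contactId)
{
    using (var db = new LegacyDbContext())
    {
        // Query-style procedure: map the result set onto an arbitrary type whose property names line up.
        var results = db.Database
            .SqlQuery<CalculationResult>("EXEC dbo.CalculateSomething @ContactId", new SqlParameter("@ContactId", contactId))
            .ToList();

        // Command-style procedure: no result set, just side effects.
        await db.Database.ExecuteSqlCommandAsync("EXEC dbo.ArchiveContact @ContactId", new SqlParameter("@ContactId", contactId));

        return results;
    }
}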

So, in summary, it’s all very possible, and it will all work, up until you want to test your code using an in-memory database.

Which is something we do all the time. 

In Loving Memory

To prevent us from having to take a direct dependency on the DbContext, we lean towards using factories.

There are a few reasons for this, but the main one is that it makes it far easier to reason about DbContext scope (you make a context, you destroy a context) and to limit potential concurrency issues within the DbContext itself. Our general approach is to have one factory for connecting to a real database (i.e. ExistingLegacyDatabaseDbContextFactory) and then another for testing (like an InMemoryDbContextFactory, using Effort). They both share an interface (usually just the IDbContextFactory<TContext> interface), which is taken as a dependency as necessary, and the correct factory is injected whenever the object graph is resolved using our IoC container of choice.
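As a rough sketch (assuming a LegacyDbContext that exposes the standard DbContext constructors), the two factories might look like this:

public class ExistingLegacyDatabaseDbContextFactory : IDbContextFactory<LegacyDbContext>
{
    public LegacyDbContext Create()
    {
        // Connect to the real database via the application's connection string.
        return new LegacyDbContext("name=LegacyDatabase");
    }
}

public class InMemoryDbContextFactory : IDbContextFactory<LegacyDbContext>
{
    public LegacyDbContext Create()
    {
        // Effort supplies a transient in-memory connection that EF can use in place of a real one.
        var connection = Effort.DbConnectionFactory.CreateTransient();
        return new LegacyDbContext(connection, contextOwnsConnection: true);
    }
}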

Long story short, we’re still using the same DbContext, we just have different ways of creating it, giving us full control over the underlying provider at the dependency injection level.

When we want to use an in-memory database, Effort will create the appropriate structures for us using the entity mappings provided, but it can’t create the stored procedures because it doesn’t know anything about them (except maybe their names). Therefore, if we use any of the approaches I’ve outlined above, the in-memory database will be fundamentally broken depending on which bits you want to use.

This is one of the ways that Entity Framework and its database providers are something of a leaky abstraction, but that is a topic for another day.

This is pretty terrible for testing purposes, because sometimes the code will work, and sometimes it won’t.

But what else can we do?

Abstract Art

This is one of those nice cases where an abstraction actually comes to the rescue, instead of just making everything one level removed from what you care about and ten times harder to understand.

Each stored procedure and function can easily have an interface created for it, as they all take some set of parameters and return either nothing or some set of results.

We can then have two implementations, one which uses a database connection to execute the stored procedure/function directly, and another which replicates the same functionality through Linq or something similar (i.e. using the DbContext). We bind the interface to the first implementation when we’re running on top of a real database, and to the DbContext specific implementation when we’re not. If a function calls another function in the database, you can replicate the same approach by specifying the function as a dependency on the Linq implementation, which works rather nicely.
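As a sketch of what that looks like in practice (the interface, the procedure, the Invoices entity and the context type are all hypothetical), you end up with something like this:

public interface ICalculateOutstandingBalance
{
    decimal Execute(int contactId);
}

public class StoredProcedureCalculateOutstandingBalance : ICalculateOutstandingBalance
{
    private readonly IDbContextFactory<LegacyDbContext> _factory;

    public StoredProcedureCalculateOutstandingBalance(IDbContextFactory<LegacyDbContext> factory)
    {
        _factory = factory;
    }

    public decimal Execute(int contactId)
    {
        using (var db = _factory.Create())
        {
            // Call the legacy procedure directly; this implementation is bound when running on a real database.
            return db.Database
                .SqlQuery<decimal>("EXEC dbo.CalculateOutstandingBalance @ContactId", new SqlParameter("@ContactId", contactId))
                .Single();
        }
    }
}

public class LinqCalculateOutstandingBalance : ICalculateOutstandingBalance
{
    private readonly IDbContextFactory<LegacyDbContext> _factory;

    public LinqCalculateOutstandingBalance(IDbContextFactory<LegacyDbContext> factory)
    {
        _factory = factory;
    }

    public decimal Execute(int contactId)
    {
        using (var db = _factory.Create())
        {
            // Reproduce the procedure's logic against the mapped entities; this implementation is bound
            // when running on the in-memory provider (and is the candidate to eventually replace the procedure).
            return db.Invoices
                .Where(i => i.ContactId == contactId && !i.Paid)
                .Sum(i => (decimal?)i.Amount) ?? 0m;
        }
    }
}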

Of course, this whole song and dance still leaves us in a situation where the tests might do different things because there is no guarantee that the Linq based stored procedure implementation is the same as the one programmed into SQL Server.

So we write tests that compare the results returned from both for identical inputs, trusting the legacy implementation when differences are discovered.

Why bother at all though? I mean after everything is said and done, you now have two implementations to maintain instead of one, and more complexity to boot.

Other than the obvious case of “now we can write tests on an in-memory database that leverage stored procedures”, there are a few other factors in favour of this approach:

  • With a good abstraction in place, its more obvious what is taking a dependency on the stored procedures in the database
  • With a solid Linq based implementation of the stored procedure, we can think about retiring them altogether, putting the logic where it belongs (in the domain)
  • We gain large amounts of knowledge around the legacy stored procedures while building and testing the replacement, which makes them less mysterious and dangerous
  • We have established a strong pattern for how to get at some of the older functionality from our new and shiny code, leaving less room for sloppy implementations

So from my point of view, the benefits outweigh the costs.

Conclusion

When trying to leverage stored procedures and functions programmed into a database, I recommend creating interfaces to abstract their usage. You are then free to provide implementations of said interfaces based on the underlying database provider, which feels a lot more flexible than just lumping the function execution into whatever structures EF provides for that purpose. The approach does end up adding some additional complexity and effort, but the ability to ensure that tests can run without requiring a real database (which is slow and painful) is valuable enough, even if you ignore the other benefits.

Caveat: the approach probably wouldn’t work as well without a good dependency injection system in place, but the general concept is sound regardless.

To echo my opening statement, working with legacy code definitely has its own unique set of challenges. It’s nice in that way though, because solving those challenges can really make you think about how to provide a good solution within the boundaries and limitations that have already been established.

Like playing a game with a challenge mode enabled, except you get paid at the end.


It’s been at least a few years since I’ve actually worked with a dedicated DBA. I’m sure the role still exists, but it seems like a lot of organisations mostly just expect their developers to be able to do anything at least passably well, especially with the seemingly inexorable merging of development and operations responsibilities.

As a developer, I personally like the trend, because it moves me closer to the end-user and it gives me a greater sense of responsibility about the software I produce. If I’m going to be the one who has to support it and the one that gets woken up at 1 in the morning because production servers are melting down due to a poorly optimised data update algorithm, then you better believe I will be writing the most maintainable, supportable and reliable software that I can.

I mean, I would have done that anyway, but now I have even more of a personal stake in the situation.

Where I work now, we still have a dedicated operations department, but they are mostly responsible for managing the IT operations of the business as a whole, rather than the deployment specifics of any software that my team writes. They do still participate in our process in an advisory role (something for which I am perpetually grateful), but other than that, we’re pretty much on our own.

Anyway, the point that this rambling preface is slowly trying to arrive at is that we’ve had some teething issues with one of our more recent releases that I thought might be worth talking about.

Specifically, massive, unexpected read load on our relational databases during periods of high write activity.

Everything In a Box To The Write

The service at the core of the data freeing functionality that I’ve mentioned previously is our synchronization service.

Its only purpose is to facilitate the continual synchronization of data from on-premises databases to a central cloud repository, with the help of a locally installed application. I won’t go into too much detail about the actual algorithm in play, but its mostly just batched updates to the cloud database at regular intervals through a HTTP API, where the batched data is obtained by leveraging the SQL Server row version concept.
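To give a rough idea of the row version approach (the table, columns and CustomerRecord DTO here are illustrative, not our actual schema), the locally installed application asks for everything past the last row version it successfully uploaded:

public List<CustomerRecord> GetNextBatch(SqlConnection connection, byte[] lastUploadedRowVersion, int batchSize)
{
    const string sql = @"
        SELECT TOP (@batchSize) Id, Name, RowVersion
        FROM dbo.Customers
        WHERE RowVersion > @lastRowVersion
        ORDER BY RowVersion ASC;";

    using (var command = new SqlCommand(sql, connection))
    {
        command.Parameters.AddWithValue("@batchSize", batchSize);
        command.Parameters.Add("@lastRowVersion", SqlDbType.Timestamp).Value = lastUploadedRowVersion;

        var batch = new List<CustomerRecord>();
        using (var reader = command.ExecuteReader())
        {
            while (reader.Read())
            {
                batch.Add(new CustomerRecord
                {
                    Id = reader.GetInt32(0),
                    Name = reader.GetString(1),
                    RowVersion = (byte[])reader[2],
                });
            }
        }

        // The highest RowVersion in the batch becomes the new high water mark once the upload succeeds.
        return batch;
    }
}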

In the last few weeks, we’ve been deploying the on-premises application to customers in order to kick off the synchronization process, a chunk of customers at a time.

It went pretty well, releasing to each group and watching their data upload, until we hit some sort of breaking point and the Read IOPS of our backend RDS service rose to an unsustainable level.

We were seeing numbers in excess of 1500 for read IOPS (and 200ish for write), which is somewhat problematic, because the database is relatively small (150GB), which means it only has around 450 baseline IOPS split between reads and writes (general purpose SSD volumes get 3 IOPS per provisioned GB). Considering the way that AWS volumes work (guaranteed baseline, spike up to 3000 by consuming IO credits), the consumption rate we were seeing would leave us high and dry within an hour or two. We scaled a number of things to deal with the traffic (which eventually subsided as the initial flurry of data gave way to more incremental updates).

But where did the massive amount of read load come from?

To understand that, I’m going to have to outline a little bit more about the service itself.

The API for the service exposes an endpoint for each table we’re synchronizing, differentiated by customer identity. Something like the following:

https://{url}/{customer-identity}/tables/{table-name}

A GET to this endpoint returns a manifest of sorts, stating information about what data is currently available in the service for that customer-table combination, which is then used to decide what to upload. A POST allows for inserts and updates and a DELETE allows for either the removal of all data or the removal of a subset of data defined by the body.

Inside the service, the POST body is handled by iterating through the rows contained therein, and executing an add or update for each one using Entity Framework, saving as we go.

Profiling the resulting queries from the execution of a POST, we discovered that Entity Framework will always do a SELECT first on each row in question, in order to determine whether to run an INSERT or an UPDATE. Given that the number of these SELECT operations dwarfed the number of reads resulting from GET requests, we assumed that that was where the majority of the read load was coming from.
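A simplified version of the offending write path looks something like the sketch below (entity and context names are illustrative, not the actual service code). The Find call is the hidden SELECT that was generating most of the read load.

public void Upsert(SyncDbContext db, IEnumerable<CustomerRecord> postedRows)
{
    foreach (var row in postedRows)
    {
        // The hidden SELECT: EF fetches the existing row to decide between an INSERT and an UPDATE.
        var existing = db.CustomerRecords.Find(row.Id);

        if (existing == null)
        {
            db.CustomerRecords.Add(row);
        }
        else
        {
            db.Entry(existing).CurrentValues.SetValues(row);
        }

        // Saving as we go means at least one round trip per row.
        db.SaveChanges();
    }
}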

Now we just had to find out how to optimise it.

The EF Tradeoff

One of the really nice things when working with Entity Framework is that you don’t need to give too much thought to the resulting queries that it executes on the underlying data store. I mean, the whole point of the library is to reduce the visibility of those sorts of things, adding a nice abstraction layer over the top so you don’t have to deal with it.

When it comes to performance though, that is also one of the places where EF can hurt you the most.

For our purposes, EF increased the speed at which we could deliver this service (by removing some of the nastiness around talking to databases), but by being so far away from the proverbial metal, when it came time to try and deal with performance issues, there didn’t seem to be much that we could do.

We tried a few things with EF, including:

  • Just adding the entities, instead of add/update, which forced EF to insert and fail on primary key violations. This was faster, but a lot of our load is actually updates as opposed to pure inserts, so applied to our production environment it would have simply caused other performance issues (as a result of the constant exceptions).
  • Turning off the various entity tracking features in EF, as we didn’t need them for the relatively simple inserts we were doing. This helped a little bit, but the same pattern of SELECT, INSERT/UPDATE was still present afterwards, so it wouldn’t have resolved our read problem.
  • Updating to the very latest version of EF6 and Npgsql (just in case there had been some optimisations that we’d missed), but it was pretty much the same. We were not that far behind the curve anyway, so it wasn’t too surprising.
  • Batching our data updates under a single SaveChanges. This helped a fair amount, but would require us to change the way we handle update errors to remove the bad rows and try the update again rather than simply moving on. We don’t have any real statistics as to how many errors we actually get during updates, so this one was plausible, but would need more investigation.

Then we tried something completely different.

No EF.

Just pure ADO.NET insert calls, catching primary key violations and doing updates.

This approach was so much faster it wasn’t funny, but it left a bad taste in our mouths at multiple levels. For starters, we would have to write and maintain the SQL ourselves, which is something we’d rather not do (for various reasons, maintainability being one and database specific code being another). Even getting past that, we were still handling exceptions as part of normal application flow, which didn’t feel right, and would almost certainly lead to performance problems of its own, even if it was dodging the extra reads.

It was here that PostgreSQL came to our rescue.

Version 9.5 introduced an UPSERT command in the form of INSERT … ON CONFLICT … DO UPDATE. Now we could leave the whole thing up to the database and move on with our lives. We’d still have to maintain the SQL ourselves (or write a generator of some sort), but that would be a small price to pay for a massive performance increase.
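The raw ADO.NET version we’re working towards looks something like this (table and column names are illustrative):

public void Upsert(NpgsqlConnection connection, CustomerRecord row)
{
    const string upsert = @"
        INSERT INTO customer_records (id, name, updated_at)
        VALUES (@id, @name, @updated_at)
        ON CONFLICT (id)
        DO UPDATE SET name = EXCLUDED.name, updated_at = EXCLUDED.updated_at;";

    using (var command = new NpgsqlCommand(upsert, connection))
    {
        command.Parameters.AddWithValue("@id", row.Id);
        command.Parameters.AddWithValue("@name", row.Name);
        command.Parameters.AddWithValue("@updated_at", row.UpdatedAt);

        // One statement per row, no SELECT first and no exceptions as flow control.
        command.ExecuteNonQuery();
    }
}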

Unfortunately this post sort of ends with an anti-climax, because we’re still in the process of implementing the raw ADO.NET, PostgreSQL specific UPSERT as an optimisation. Once I have more information I’ll follow up with another post filling in the end of the story.

In the meantime, we threw money at it by scaling up the RDS instance so it could keep more things in memory at once, which alleviated the issue, at least until we hit the next breaking point.

Conclusion

It’s true that no plan survives contact with the enemy, even when the enemy isn’t particularly hostile. All of our data synchronization was a silent precursor to shipping some new cloud features, so no-one noticed when it exploded a few times, but it was still pretty depressing to see it be completely unable to handle the load we were throwing at it. I’m not surprised that Entity Framework doesn’t lead to optimal performance, even though it’s not pleasant having that particular fact slap you in the face. We can still use EF for a lot of what we need to do, and benefit from all of its goodies, like database migrations and easy in-memory testing, but we definitely need to put some things in place to bypass EF when the situation calls for it.

When it comes to this particular issue; even though throwing money at a problem doesn’t really work for us in the long term (for obvious reasons), it was nice to have that option available until we had something better in place. We’re still not sure if what we’re planning on doing is going to make a massive difference, but I remain hopeful.

Of course, software development and hope is never a good combination.

Look where it gets us when it comes to estimates.


We have the following simple entities, representing a fairly basic property management system.

public class Contact
{
    public int Id { get; set; }
}

public class ContactLink
{
    public int Id { get; set; }
    public int ContactId { get; set; }
    public string Type { get; set; }
    public int LinkedId { get; set; }
}

public class Owner
{
    public int Id { get; set; }
}

public class Tenant
{
    public int Id { get; set; }
    public int? PropertyId { get; set; }
}

public class Property
{
    public int Id { get; set; }
    public int? OwnerId { get; set; }
}

Each Property has zero or one Owners, but an Owner can be linked to multiple Contacts.

A Property can be Tenanted, and each Tenant can be linked to multiple Contacts.

The ContactLink class represents multiple arbitrary links to Contacts, and is how both the Owner and Tenant to Contact relationships are specified.

This model was not created with Entity Framework in mind, which unfortunately is something that you have to deal with when working on a legacy system. The EF model was put in place after many years of the database being the closest thing the software had to a domain model. The models that I’ve listed above are obviously not complete. They contain many more fields than those displayed, and have fields for segmentation because the database is multi-tenant. Anything not specified here is not useful for the purposes of illustrating the point of this post.

We need to answer a relatively simple question.

“Show me the Properties that are related to a Contact”

The goal is to answer that question while leaving as much of the work as possible up to the database. So ideally one call and minimal post processing.

Give Me All The Relationships

First things first, a test to ensure that our query works the way we expect it to. Make a database, seed it, run the query, validate the right number of results. Very straightforward.

public void WhenAContactIsRelatedToPropertiesThroughBothOwnerAndTenantRelationships_TheQueryReturnsTheCorrectNumberOfRows()
{
    using (var db = GetDbContext())
    {
        var contactId = RandomId();

        // Seed some data into the database. We use Builder objects and extension methods on DataSets to do
        // seeding, allowing us to fill a previously empty database with all the information we need for a test.
        // Specifically for this test, seed:
        // 1.) 1 Contact with the specified Id.
        // 2.) 1 Owner.
        // 3.) 2 Properties. One owned by the Owner that was just created.
        // 4.) 1 Tenant, for the property with no Owner.
        // 5.) 2 ContactLinks, linking the Contact to the Owner and Tenant.

        var target = new QueryEncapsulation();

        var results = target.Execute(db, contactId);

        results.Count().Should().Be(2);
    }
}

With the test written, all that’s left is to implement the query itself.

public class QueryEncapsulation
{
    public IEnumerable<LinkedProperty> Execute(DbContext db, int specifiedId)
    {
        return (from c in db.Contacts
            .Where(c => c.Id == specifiedId)
        from cl in db.ContactLinks
            .Where(cl => cl.ContactId == c.Id)
        from o in db.Owners
            .Where(o => o.Id == cl.LinkedId && cl.Type == "owner")
            .DefaultIfEmpty()
        from t in db.Tenants
            .Where(t => t.Id == cl.LinkedId && cl.Type == "tenant")
            .DefaultIfEmpty()
        from p in db.Properties
            .Where(p => p.OwnerId == o.Id || t.PropertyId == p.Id)
        select new LinkedProperty(o, t, p))
        .ToArray();
    }
}

Not super complicated, in fact it looks fairly similar to the query that you might put together in SQL. Join a bunch of tables together and then filter them as appropriate so only the valid links are included.

The usage of the DefaultIfEmpty method is the mechanism by which you accomplish Left Outer Joins in Linq. This is important because even though each Owner and Tenant map to a single property, as a result of the way the relationships are structured, and the direction we are starting from (the Contact), each “row” returned by this query will only contain either an Owner or a Tenant, never both. Without the DefaultIfEmpty, the query wouldn’t return anything.

Of course, the test fails.

Failure Is Just A Stepping Stone To Greatness

Instead of getting 2 results (one Property linked via the Owner and one Property linked via the Tenant) the query returns 3 results.

The Property linked to the Tenant is coming back twice. Once for the correct link (to the Tenant) and then again for the Owner, which is nonsense.

As far as I understand, this combination is still included because a completely unspecified Owner can be linked to a null Owner Id on the Property.

Intuitively, p.OwnerId == o.Id feels like it should throw a NullReferenceException when o is null. I’m still not exactly sure I understand why this is a valid result, but I know if I make a similar query in SQL using Left Outer Joins, that it will work the same way. Essentially I’m not being specific enough in my exclusion criteria and null is indeed equal to null.

The fix?

Add a check on the Property condition that checks for Owner being null.

The final query we’re left with is:

public class QueryEncapsulation
{
    public IEnumerable<LinkedProperty> Execute(DbContext db, int specifiedId)
    {
        return (from c in db.Contacts
            .Where(c => c.Id == specifiedId)
        from cl in db.ContactLinks
            .Where(cl => cl.ContactId == c.Id)
        from o in db.Owners
            .Where(o => o.Id == cl.LinkedId && cl.Type == "owner")
            .DefaultIfEmpty()
        from t in db.Tenants
            .Where(t => t.Id == cl.LinkedId && cl.Type == "tenant")
            .DefaultIfEmpty()
        from p in db.Properties
            .Where(p => (o != null && p.OwnerId == o.Id) || t.PropertyId == p.Id)
        select new LinkedProperty(o, t, p))
        .ToArray();
    }
}

Summary

Honestly, I didn’t expect this sort of behaviour from EF, but in retrospect it isn’t that surprising. EF is an incredibly complicated beast and I’m barely familiar with the shallow end.

To try and understand the results, I had to build up the set by hand and account for all of the potential combinations, combined with whether or not it made sense for them to be included in the final results. Doing this helped to identify the fact that fields on what should be a null object created via DefaultIfEmpty seem to be nullable even though the underlying type on the model is not.

An interesting finding at least, and a good example of how sometimes it’s important to understand your queries conceptually.

Especially if they are giving weird results.


And now, for the thrilling conclusion! The thrilling conclusion to what you might ask? Check back here for the first part in this saga.

I mentioned very briefly in the first part of this epic that we attempted to fix a strange threading bug with token generation via the ASP.NET Identity Framework by making sure that the Entity Framework DbContext was fully initialized (i.e. model created, connection established, etc) before it left the factory method. Initial tests were promising, but it turns out this did not fix the issue.

I mention this because I had absolutely no luck reproducing the connection leak when I was running locally (with or without a profiler attached). I could easily force timeouts when getting a connection from the pool (because it was too busy), but I couldn’t reproduce the apparent situation where there were connections established that could not be actively used.

When going through the combination of CloudWatch logs for RDS (to track connection usage) and our own ELK stack I found a correlation between the errors that sometimes occurred when generating tokens and the increase in the usage of connections. This pattern was pretty consistent. Whenever there was a cluster of errors related to token generation, there was an increase in the total number of connections used by the service, which never went down again until the application pool was recycled at the default time of 29 hours from the last recycle.

Token Failure

We’ve been struggling with the root cause of the token generation failures for a while now. The most annoying part is that it doesn’t fail all the time. In fact, my initial load tests showed only around a 1% failure rate, which is pretty low in the scheme of things. The problem manifests itself in exceptions occurring when a part of the Identity Framework attempts to use the Entity Framework DbContext that it was given. It looks as though there is some sort of threading issue with Entity Framework, which makes sense conceptually. Generally EF DbContext objects are not thread safe, so you shouldn’t attempt to use them on two different threads at the same time.

The errors were many and varied, but they all consistently came from our implementation of the OAuthAuthorizationServerProvider. A few examples are below:

System.Data.Entity.Core.EntityCommandExecutionException: An error occurred while executing the command definition. See the inner exception for details. ---> System.InvalidOperationException: Operation is not valid due to the current state of the object.
   at Npgsql.NpgsqlConnector.StartUserAction(ConnectorState newState)
   at Npgsql.NpgsqlCommand.ExecuteDbDataReaderInternal(CommandBehavior behavior)
   at Npgsql.NpgsqlCommand.ExecuteDbDataReader(CommandBehavior behavior)
   at System.Data.Entity.Infrastructure.Interception.InternalDispatcher`1.Dispatch[TTarget,TInterceptionContext,TResult](TTarget target, Func`3 operation, TInterceptionContext interceptionContext, Action`3 executing, Action`3 executed)
   at System.Data.Entity.Infrastructure.Interception.DbCommandDispatcher.Reader(DbCommand command, DbCommandInterceptionContext interceptionContext)
   at System.Data.Entity.Core.EntityClient.Internal.EntityCommandDefinition.ExecuteStoreCommands(EntityCommand entityCommand, CommandBehavior behavior)
   --- End of inner exception stack trace ---
   at System.Data.Entity.Core.EntityClient.Internal.EntityCommandDefinition.ExecuteStoreCommands(EntityCommand entityCommand, CommandBehavior behavior)
   at System.Data.Entity.Core.Objects.Internal.ObjectQueryExecutionPlan.Execute[TResultType](ObjectContext context, ObjectParameterCollection parameterValues)
   at System.Data.Entity.Core.Objects.ObjectContext.ExecuteInTransaction[T](Func`1 func, IDbExecutionStrategy executionStrategy, Boolean startLocalTransaction, Boolean releaseConnectionOnSuccess)
   at System.Data.Entity.Core.Objects.ObjectQuery`1.<>c__DisplayClass7.<GetResults>b__5()
   at System.Data.Entity.Core.Objects.ObjectQuery`1.GetResults(Nullable`1 forMergeOption)
   at System.Data.Entity.Core.Objects.ObjectQuery`1.<System.Collections.Generic.IEnumerable<T>.GetEnumerator>b__0()
   at System.Data.Entity.Internal.LazyEnumerator`1.MoveNext()
   at System.Linq.Enumerable.FirstOrDefault[TSource](IEnumerable`1 source)
   at System.Linq.Queryable.FirstOrDefault[TSource](IQueryable`1 source, Expression`1 predicate)
   at [OBFUSCATION!].Infrastructure.Repositories.AuthorizationServiceRepository.GetApplicationByKey(String appKey, String appSecret) in c:\[OBFUSCATION!]\Infrastructure\Repositories\AuthorizationServiceRepository.cs:line 412
   at [OBFUSCATION!].Infrastructure.Providers.AuthorizationServiceProvider.ValidateClientAuthentication(OAuthValidateClientAuthenticationContext context) in c:\[OBFUSCATION!]\Infrastructure\Providers\AuthorizationServiceProvider.cs:line 42
   
System.NullReferenceException: Object reference not set to an instance of an object.
   at Npgsql.NpgsqlConnector.StartUserAction(ConnectorState newState)
   at Npgsql.NpgsqlCommand.ExecuteDbDataReaderInternal(CommandBehavior behavior)
   at Npgsql.NpgsqlCommand.ExecuteDbDataReader(CommandBehavior behavior)
   at System.Data.Entity.Infrastructure.Interception.InternalDispatcher`1.Dispatch[TTarget,TInterceptionContext,TResult](TTarget target, Func`3 operation, TInterceptionContext interceptionContext, Action`3 executing, Action`3 executed)
   at System.Data.Entity.Infrastructure.Interception.DbCommandDispatcher.Reader(DbCommand command, DbCommandInterceptionContext interceptionContext)
   at System.Data.Entity.Core.EntityClient.Internal.EntityCommandDefinition.ExecuteStoreCommands(EntityCommand entityCommand, CommandBehavior behavior)
   at System.Data.Entity.Core.Objects.Internal.ObjectQueryExecutionPlan.Execute[TResultType](ObjectContext context, ObjectParameterCollection parameterValues)
   at System.Data.Entity.Core.Objects.ObjectContext.ExecuteInTransaction[T](Func`1 func, IDbExecutionStrategy executionStrategy, Boolean startLocalTransaction, Boolean releaseConnectionOnSuccess)
   at System.Data.Entity.Core.Objects.ObjectQuery`1.<>c__DisplayClass7.<GetResults>b__5()
   at System.Data.Entity.Core.Objects.ObjectQuery`1.GetResults(Nullable`1 forMergeOption)
   at System.Data.Entity.Core.Objects.ObjectQuery`1.<System.Collections.Generic.IEnumerable<T>.GetEnumerator>b__0()
   at System.Data.Entity.Internal.LazyEnumerator`1.MoveNext()
   at System.Linq.Enumerable.FirstOrDefault[TSource](IEnumerable`1 source)
   at System.Linq.Queryable.FirstOrDefault[TSource](IQueryable`1 source, Expression`1 predicate)
   at [OBFUSCATION!].Infrastructure.Repositories.AuthorizationServiceRepository.GetApplicationByKey(String appKey, String appSecret) in c:\[OBFUSCATION!]\Infrastructure\Repositories\AuthorizationServiceRepository.cs:line 412
   at [OBFUSCATION!].Infrastructure.Providers.AuthorizationServiceProvider.ValidateClientAuthentication(OAuthValidateClientAuthenticationContext context) in c:\[OBFUSCATION!]\Infrastructure\Providers\AuthorizationServiceProvider.cs:line 42

In the service, this doesn’t make a huge amount of sense. There is one DbContext created per request (via Owin), and while the Owin middleware is asynchronous by nature (meaning that execution can jump around between threads) there is no parallelism. The DbContext should not be being used on multiple threads at one time, but apparently it was.

It was either that, or something was going seriously wrong in the connection pooling code for Npgsql.
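For context, the intent of the existing setup was one context per request, wired up via the standard Owin extensions, something like the sketch below (the AuthDbContext type is illustrative):

public class Startup
{
    public void Configuration(IAppBuilder app)
    {
        // One AuthDbContext is created per request and disposed when the request ends.
        app.CreatePerOwinContext(() => new AuthDbContext());

        // ... the rest of the pipeline, including the OAuth server options, goes here ...
    }
}

// Anywhere downstream (e.g. inside the OAuthAuthorizationServerProvider implementation),
// the request-scoped instance comes back out of the OwinContext:
// var db = context.OwinContext.Get<AuthDbContext>();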

Scope Increase

As I didn’t quite understand how the dependency injection/object lifetime management worked via the OwinContext, I had my suspicions that something was going awry there. Either the DbContext was not in fact generated once per request, or there was some strange race condition that allowed a DbContext to be reused on more than one thread.

I decided to rewrite the way in which dependencies are obtained in the service. Instead of generating a DbContext per request, I would supply a DbContextFactory to everything, and let each consumer generate its own, temporarily scoped DbContext that it is responsible for disposing.

In order to accomplish this I switched to an IoC container that I was more familiar with, Ninject. Not a small amount of work, and not without added complexity, but I felt that it made the code more consistent with the rest of our code bases and generally better.

In retrospect, I should have verified that I could reproduce the token generation errors at will first, but I didn’t. I wrote the test after I’d spent the better part of a day switching out the dependency injection mechanisms. This was a mistake.

Since the errors always occurred during the execution of a single endpoint, I wrote a test that uses 10 tasks to spam that particular endpoint. If none of the tasks fault within a time limit (i.e. no exceptions are thrown), then the test is considered a success. Basically a very small, focused, stress test to be run automatically as part of our functional test suite.

[Test]
[Category("functional")]
public void WhenAttemptingToGenerateMultipleTokensAtTheSameTime_NoRequestsFail()
{
    var authClientFactory = _resolver.Get<IAuthClientFactory>();
    var app = new ApplicationBuilder(authClientFactory.CreateSeeder())
        .WithRole("auth_token_generate")
        .WithRole("auth_customer_register")
        .WithRole("auth_database_register")
        .WithRole("auth_user_register")
        .WithRole("auth_query")
        .Build();

    var userBuilder = new UserBuilder(authClientFactory.CreateFromApplication(app.ApplicationDetails.Key, app.ApplicationSecret));
    userBuilder.Build();

    List<Task> tokenGenerationTasks = new List<Task>();
    var cancellation = new CancellationTokenSource();
    for (int i = 0; i < 10; i++)
    {
        var task = Task.Factory.StartNew
        (
            () =>
            {
                var client = authClientFactory.CreateFromApplication(app.ApplicationDetails.Key, app.ApplicationSecret);
                while (true)
                {
                    if (cancellation.Token.IsCancellationRequested) break;
                    var token = client.TokenGenerate(userBuilder.CustomerId + "/" + userBuilder.DatabaseId + "/" + userBuilder.UserCode, userBuilder.Password);
                }
            },
            cancellation.Token,
            TaskCreationOptions.LongRunning,
            TaskScheduler.Default
        );

        tokenGenerationTasks.Add(task);
    }

    // The idea here is that if any of the parallel token generation tasks throw an exception, it will come out here
    // during the wait.
    Task.WaitAny(tokenGenerationTasks.ToArray(), TimeSpan.FromSeconds(15));
    cancellation.Cancel();

    var firstFaulted = tokenGenerationTasks.FirstOrDefault(a => a.IsFaulted);
    if (firstFaulted != null) throw firstFaulted.Exception;
}

The first time I ran the test against a local service it passed successfully…

Now, I don’t know about anyone else, but when a test works the first time I am immediately suspicious.

I rolled my changes back and ran the test again, and it failed.

So my restructuring successfully fixed the issue, but why?

The Root Of The Problem

I hadn’t actually understood the issue; all I did was make enough changes such that it seemed to no longer occur. Without that understanding, if it recurred, I would have to start all over again, possibly misdirecting myself with the changes that I made last time.

Using the test that guaranteed a reproduction, I investigated in more depth. Keeping all of my changes reverted, I was still getting a weird sampling of lots of different errors, but they were all consistently coming from one of our repositories (classes which wrap a DbContext and add extra functionality) whenever it was used within our OAuthAuthorizationServerProvider implementation.

Staring at the code for a while, the obviousness of the issue hit me.

At startup, a single OAuthAuthorizationServerProvider implementation is created and assigned to generate tokens for requests to the /auth/token endpoint.

This of course meant that all of the functions in that provider needed to be thread safe.

They were not.

Of the two functions in the class, both set and then used a class level variable, which in turn had a dependency on a DbContext.

This was the smoking gun. If two requests came in quickly enough, one would set the variable (using the DbContext for its request), the other would do the same (using a different DbContext), and then the first would attempt to use a different thread’s DbContext (indirectly through the variable). This would rightly cause an error (as multiple threads tried to use the same DbContext) and throw an exception, failing the token generation.
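Boiled down, the problem looked something like the sketch below (repository and context types are illustrative, and this is not the actual provider code):

public class AuthorizationServiceProvider : OAuthAuthorizationServerProvider
{
    // BAD: one provider instance handles every request to /auth/token, so this field is
    // shared between concurrent requests. The fix was to remove it and use a local
    // variable inside each overridden method instead.
    private AuthorizationServiceRepository _repository;

    public override Task ValidateClientAuthentication(OAuthValidateClientAuthenticationContext context)
    {
        // Request A sets the field, request B overwrites it a moment later, and request A
        // then uses request B's DbContext from a different thread.
        _repository = new AuthorizationServiceRepository(context.OwinContext.Get<AuthDbContext>());

        // ... client validation using _repository ...

        return Task.FromResult(0);
    }
}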

I abandoned my changes (though I will probably make them again over time), removed the class variable and re-ran the test.

It was all good. No errors at all, even after running for a few hours.

But why did the error cause a resource leak at the database connection level?

Leaky Logic

In the end I didn’t find out exactly why threading errors with Entity Framework (using Npgsql) were causing connection leaks. I plan to investigate in more depth in the future, and I’ll probably blog about my findings, but for now I was just happy to have the problem solved.

With the bug fixed, profiling over a period of at least 24 hours showed no obvious connection leaks as a result of normal traffic. Previously this would have guaranteed at least 10 connections leaking, possibly more. So for now the problem is solved, and I need to move on.

Summary

Chasing down resource leaks can be a real pain, especially when you don’t have a reliable reproduction.

If I had realised earlier that the token generation failures and connection leaks were related, I would have put more effort into reproducing the first in order to reproduce the second. It wasn’t immediately obvious that they were linked though, so I spent a lot of time analysing the code trying to figure out what could possibly be leaking valuable resources. This was a time consuming and frustrating process, ultimately leading nowhere.

Once I finally connected the dots between the token failures and the connection leak, everything came together, even if I didn’t completely understand why the connections were leaking in error situations.

Ah well, can’t win em all.