Software Development as a discipline puts a bunch of effort into trying to minimise the existence and creation of bugs, but the reality is that it's an investment/return curve that flattens off pretty quickly.

Early discovery of issues is critical. Remember, the cost to the business of a bug existing is never lower than it is at development time. The longer it has to fester, the worse it's going to get.

Of course, when a bug is discovered, there are decisions to make around whether or not to fix it. For me, every bug that exists in a piece of software that might cause an issue for a user is a mark against its good name, so my default policy is to fix. Maybe not in the same piece of work that it was found in, but in general, bugs should be fixed.

That is, unless you hit that awkward conjunction of high cost/low incidence.

Why waste a bunch of money fixing a bug that might never happen?

Split Personality

I’m sure you can guess from the mere existence of this blog post that this is exactly the situation we found ourselves in recently.

While evaluating a new component in our legacy application, we noticed that it was technically possible to break what was intended to be an entirely atomic operation consisting of multiple database writes.

Normally this wouldn't even be worth talking about, as it's basically the reason that database transactions exist. Used correctly, they guarantee an all-or-nothing outcome.
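Within a single .NET context, that guarantee is cheap to get. A minimal sketch using TransactionScope (with the actual database work elided) looks something like this:

```csharp
using System.Transactions;

// Both writes commit together or not at all, provided their connections are
// opened inside the scope and so enlist in the ambient transaction.
using (var scope = new TransactionScope())
{
    // first write (e.g. ExecuteNonQuery on an enlisted connection)
    // second write
    scope.Complete(); // if an exception escapes before this line, everything rolls back
}
```

The problem, as described below, is that this only works when both writes live in the same place.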

Unfortunately, one of the writes was in Visual Basic 6 code, and the other was in .NET.

I don't know if you've ever tried to span a database transaction across a technology boundary like that, but it's not exactly the easiest thing in the world.

When we looked into the actual likelihood of the issue occurring, we discovered that if the VB6 part failed, we could easily just roll back the .NET part. If the write failed in .NET though, we had no way to go back and undo the work that had already been done in VB6. Keep in mind, this entire section was essentially creating transactions in a trust accounting application, so non-atomic operations can get users into all sorts of terrible situations.

On deeper inspection, we thought the only way the .NET write could fail was via transitory database issues: connection or command timeouts, or disconnects.

We implemented a relatively straightforward retry strategy to deal with those sorts of failures and then moved on. Sure, it wasn’t perfect, but it seemed like we’d covered our bases pretty well and mitigated the potential issue as best we could.
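The strategy itself was nothing fancy. A minimal sketch (the names, attempt counts and delays here are illustrative, not our actual implementation) looks something like this:

```csharp
using System;
using System.Threading;

public static class Retry
{
    // Illustrative retry helper: attempt the action a few times with a short,
    // growing delay, on the assumption that the failure is transitory.
    public static void WithRetries(Action action, int maxAttempts = 3, int baseDelayMs = 500)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                action();
                return;
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Swallow the failure and wait a little longer before trying again.
                Thread.Sleep(baseDelayMs * attempt);
            }
        }
    }
}
```

In practice the catch filter should be narrowed to the transitory failures you actually expect (timeouts, disconnects), because a deterministic failure will just burn through every attempt and fail anyway.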

I Did Not See That One Coming

Of course, the code failed in a way completely unrelated to temporary connectivity issues.

In our case, we were stupid and attempted to write an Entity Framework entity whose string values exceeded the column size limits. Long story short, we were concatenating an information field from some other fields and didn't take into account that the sum of those other fields might exceed the maximum.

The write failure triggered exactly the situation that we were worried about; the actual trust account record had been created (VB6) but our record of it happening was missing (.NET).

I still don’t actually know why we bothered implementing column size limits. As far as I know, there is no difference between a column of VARCHAR(60) and VARCHAR(MAX) when it comes to performance. Sure, you could conceivably store a ridiculous amount of data in the MAX column at some point, but I feel like that is a lot less destructive than the write failures (and its knock-on effects) that we got.

Even worse, from the user's point of view, the operation looked like it had worked. There were no error notifications visible to them, because we couldn't write to the table that we used to indicate errors! When they returned to their original action list though, the item that failed was still present. They then processed it again and the same thing happened (it looked like it worked but it was still in the list afterwards), at which point they twigged that something was unusual and contacted our support team (thank god).

Once we found out about the issue, we figured out pretty quickly what the root cause was thanks to our logging and cursed our hubris.

Off With Their Head!

The fix for this particular problem was easy enough and involved two extension methods: one for truncating a string, and another for scanning an object and automatically truncating its string properties according to their data annotation attributes.

public static string Truncate(this string value, int maxLength, ILogger logger = null)
{
    if (maxLength < 0)
    {
        throw new ArgumentException($"Truncate cannot be used with a negative max length (supplied {nameof(maxLength)} was [{maxLength}]). That doesn't even make sense, what would it even do?", nameof(maxLength));
    }

    if (string.IsNullOrEmpty(value))
    {
        return value;
    }

    if (value.Length <= maxLength)
    {
        return value;
    }

    // Reserve three characters for the ellipsis, unless the limit is too small to fit one.
    var truncated = maxLength <= 3 ? value.Substring(0, maxLength) : value.Substring(0, maxLength - 3) + "...";

    logger?.Debug("The string [{original}] was truncated because it was longer than the allowed length of [{length}]. The truncated value is [{truncated}]", value, maxLength, truncated);

    return truncated;
}

public static void TruncateAllStringPropertiesByTheirMaxLengthAttribute(this object target, ILogger logger = null)
{
    var props = target
        .GetType()
        .GetProperties()
        .Where(prop => prop.CanWrite && prop.PropertyType == typeof(string) && Attribute.IsDefined(prop, typeof(MaxLengthAttribute)));

    foreach (var prop in props)
    {
        var maxLength = prop.GetCustomAttribute<MaxLengthAttribute>();
        prop.SetValue(target, ((string)prop.GetValue(target)).Truncate(maxLength.Length, logger));
    }
}

Basically, before we write the entity in question, we just call the TruncateAllStringPropertiesByTheirMaxLengthAttribute method on it.
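For example (the entity and field names here are hypothetical, and the extension methods above are assumed to be in scope):

```csharp
using System.ComponentModel.DataAnnotations;

// Hypothetical entity; the real one has more fields than this.
public class TrustTransactionAudit
{
    [MaxLength(60)]
    public string Description { get; set; }
}

// Just before the write, clamp anything that would blow a column limit.
var audit = new TrustTransactionAudit
{
    // the sort of concatenation that bit us
    Description = sourceAccount + " - " + destinationAccount + " - " + notes,
};
audit.TruncateAllStringPropertiesByTheirMaxLengthAttribute(logger);
// audit.Description is now at most 60 characters, ending in "..." if it was cut
```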

With the immediate problem solved, we were still left with two outstanding issues though:

  • A failure occurred in the code and the user was not notified
  • An atomic operation was still being split across two completely different programming contexts

In this particular case we didn’t have time to alleviate the first issue, so we pushed forward with the fix for the issue that we knew could occur.

We still have absolutely no idea how to deal with the second issue though, and honestly, probably never will.

Conclusion

In retrospect, I don't think we actually made the wrong decision. We identified an issue, analysed the potential occurrences, costed a full fix and then implemented a smaller one that should have covered our bases.

The retry strategy would likely have dealt with transitory failures handily, we just didn’t identify the other cases in which that section could fail.

As much as I would like to, it's just not cost-effective to account for every single edge case when you're developing software.

Well, unless you're building something like pacemaker software.

Then you probably should.
