
And so I return from my break, with a fresh, reenergized hatred for terrible and unexpected technical issues.

Speaking of which…

Jacked Up And Good To Go

I’ve been sitting on a data export project for a while now. It’s a simple C# command line application that connects to one of our databases, correlates some entities together, pulls some data out and dumps it into an S3 bucket as CSV files. It has a wide variety of automated unit tests proving that the various components of the system function as expected, and a few integration and end-to-end tests that show that the entire export process works (i.e. given a set of command line arguments + an S3 bucket, does running the app result in data in the bucket).

From a purely technical point of view, the project has been “complete” for a while now, but I couldn’t turn it on until some other factors cleared up.

Well, while I was on holidays those other factors cleared up, so I thought to myself “I’ll just turn on the schedule in TeamCity and all will be well”.

Honestly, you’d think that having written as much software as I have, I would know better.

That first run was something of a crapshow, and while some of the expected files ended up in the S3 bucket, a bunch of others were missing completely.

Even worse, the TeamCity task that we use for job execution thought everything completed successfully. The only indicator that it had failed was that the task description in TeamCity was not updated to show the summary of what it had done (i.e. it simply showed the generic “Success” message instead of the custom one we were supplying), which is suspicious, because I’ve seen TeamCity fail miserably like that before.

Not Enough Time In The World

Breaking it down, there were two issues afoot:

  1. Something failed, but TeamCity thought everything was cool
  2. Something failed

The first problem was easy to deal with; there was a bug in the way the process was reporting back to TeamCity when there was an unhandled exception in the data extraction process. With that sorted, the next run at least indicated a failure had occurred.
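
For context, the fix was little more than a top-level guard around the export. The sketch below is illustrative rather than the actual code (RunExport and the messages are placeholders), but the TeamCity service messages and the non-zero exit code are the parts that matter.

using System;

// A minimal sketch only; RunExport stands in for the real export entry point.
public static class Program
{
    public static int Main(string[] args)
    {
        try
        {
            var summary = RunExport(args);
            // TeamCity picks service messages up from standard output.
            Console.WriteLine($"##teamcity[buildStatus text='{Escape(summary)}']");
            return 0;
        }
        catch (Exception ex)
        {
            // Report the failure as a build problem and exit non-zero so the
            // build cannot be mistaken for a success.
            Console.WriteLine($"##teamcity[buildProblem description='{Escape(ex.Message)}']");
            return 1;
        }
    }

    private static string RunExport(string[] args)
    {
        // Placeholder for the actual export process.
        return "exported 0 files";
    }

    // TeamCity service messages need pipes, apostrophes and brackets escaped.
    private static string Escape(string value) =>
        value.Replace("|", "||").Replace("'", "|'").Replace("[", "|[").Replace("]", "|]");
}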

With the process indicating correctly that a bad thing had happened, the cause of the second problem became obvious.

Timeouts.

Which explained perfectly why the tests had not picked up any issues, as while they run the full application (from as close to the command line as possible), they don’t run a full export, instead leveraging a limiter parameter to avoid the tests taking too long.

Entirely my fault really, as I should have at least done a full export at some stage.

Normally I would look at a timeout with intense suspicion, as timeouts typically indicate an inefficient query or operation of some sort. Simply raising the time allowed when timeouts start occurring is often a route to a terrible experience for the end-user, as operations take longer and longer to do what the user wants them to.

In this case though, it was reasonable that the data export would actually take a chunk of time greater than the paltry 60 seconds applied to command execution in Npgsql by default. Also, the exports that were failing were for the larger data sets (one of which had some joins onto other data sets) and, being full exports, could not really make effective use of any indexes for optimisation.

So I upped the timeouts via the command line parameters and off it went.
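
For reference, “upping the timeouts” in Npgsql boils down to something like the following. The 1800 second figure and the way the value is plumbed through are illustrative; in the real process it comes in via a command line parameter.

using Npgsql;

public static class TimeoutExample
{
    // Illustrative only; the timeout and connection details would really come
    // from command line parameters.
    public static void RunWithLongerTimeout(string connectionString, string exportQuery)
    {
        var builder = new NpgsqlConnectionStringBuilder(connectionString)
        {
            CommandTimeout = 1800 // seconds; Npgsql applies this per command execution
        };

        using (var connection = new NpgsqlConnection(builder.ConnectionString))
        using (var command = new NpgsqlCommand(exportQuery, connection))
        {
            // Can also be overridden on the individual command for just the slow queries.
            command.CommandTimeout = 1800;
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                // ... stream the results out to CSV ...
            }
        }
    }
}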

Three hours later though, I was pretty sure something else was wrong.

Socket To Me

Running the same process with the same inputs from my development environment showed the same problem. The process just kept on chugging along, never finishing. Pausing the execution in the development environment showed that at least one thread was stuck waiting eternally for some database related thing to conclude.

The code makes use of parallelisation (each export is its own isolated operation), so my first instinct was that there was some sort of deadlock involving one or more exports.

With the hang appearing to be related to database connectivity, I thought that maybe it was happening in the usage of Npgsql connections, but each export creates its own connection, so that seemed unlikely. There was always the possibility that the problem was related to connection pooling though, which is built into the library and is pretty much static state, but I had disabled that via the connection string, so it shouldn’t have been a factor.
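
For completeness, ruling pooling out is just a connection string change, along the lines of the sketch below (illustrative, not the real configuration).

using Npgsql;

public static class PoolingExample
{
    // Illustrative only; with Pooling=false each export really does get its own
    // physical connection, so shared pool state can be ruled out as a factor.
    public static NpgsqlConnection CreateDedicatedConnection(string connectionString)
    {
        var builder = new NpgsqlConnectionStringBuilder(connectionString)
        {
            Pooling = false
        };
        return new NpgsqlConnection(builder.ConnectionString);
    }
}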

I ripped out all of the parallelisation and ran the process again and it still hung. On closer investigation, it was only one specific export that was hanging, which was weird, because they all use exactly the same code.

Turning towards the PostgreSQL end of the system, I ran the process again, except this time started paying attention to the active connections to the database, the queries they were running, state transitions (i.e. active –> idle) and execution timestamps. This is pretty easy to do using the query:

SELECT * FROM pg_stat_activity

I could clearly see the export that was failing execute its query on a connection, stay active for 15 minutes and then transition back to idle, indicating that the database was essentially done with that operation. On the process side though, it kept chugging along, waiting eternally for some data to come through the Socket that it was listening on that would apparently never come.

The 15 minute query time was oddly consistent too.

It turns out the query was actually being terminated server side because of replica latency (the export process queries a replica DB), which was set to max out at, wait for it, 15 minutes.

For some reason the stream returned by NpgsqlConnection.BeginTextExport(sql) just never ends when the underlying query is terminated on the server side.
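
To make the shape of the problem concrete, the export path boils down to something like the sketch below (the COPY command and the file handling are placeholders, not the real code). The ReadLine loop is the part that waits forever once the server has already torn the query down.

using System.IO;
using Npgsql;

public static class TextExportExample
{
    // Illustrative only; the COPY command and destination are placeholders.
    public static void ExportToCsv(string connectionString, string destinationPath)
    {
        using (var connection = new NpgsqlConnection(connectionString))
        {
            connection.Open();
            using (var reader = connection.BeginTextExport(
                "COPY (SELECT * FROM some_large_table) TO STDOUT (FORMAT CSV, HEADER)"))
            using (var output = new StreamWriter(destinationPath))
            {
                string line;
                // When the query was cancelled server side, this loop never saw
                // the end of the stream and just blocked waiting for more data.
                while ((line = reader.ReadLine()) != null)
                {
                    output.WriteLine(line);
                }
            }
        }
    }
}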

My plan is to put together some information and log an issue in the Npgsql GitHub repository, because I can’t imagine that the behaviour is intended.

Solve For X

With the problem identified, the only question remaining was what to do about it.

I don’t even like that our maximum replica latency is set to 15 minutes, so raising it was pretty much out of the question (and this process is intended to be automated and ongoing, so I would have to raise it permanently).

The only real remaining option is to break down the bigger query into a bunch of smaller queries and then aggregate the results myself.

So that’s exactly what I did.

Luckily for me, the data set had a field that made segmentation easy, though running a bunch of queries and streaming the results into a single CSV file meant that they had to be run sequentially, so no parallelisation bonus for me.
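
The end result looks roughly like the sketch below. The segment_key column and the way the segments are supplied are hypothetical; the important part is that each segment is a separate, much shorter query, streamed sequentially into the one file.

using System.Collections.Generic;
using System.IO;
using Npgsql;

public static class SegmentedExportExample
{
    // Illustrative only; segment_key and the segment list are stand-ins for
    // whatever field actually makes the data easy to partition. The segment
    // values are assumed to be trusted here, so they are simply concatenated in.
    public static void Export(string connectionString, IEnumerable<string> segments, string destinationPath)
    {
        using (var connection = new NpgsqlConnection(connectionString))
        using (var output = new StreamWriter(destinationPath))
        {
            connection.Open();
            var first = true;
            foreach (var segment in segments)
            {
                // Only the first segment writes the CSV header.
                var copy = "COPY (SELECT * FROM some_large_table WHERE segment_key = '" + segment + "') " +
                           "TO STDOUT (FORMAT CSV, HEADER " + (first ? "true" : "false") + ")";
                using (var reader = connection.BeginTextExport(copy))
                {
                    string line;
                    while ((line = reader.ReadLine()) != null)
                    {
                        output.WriteLine(line);
                    }
                }
                first = false;
            }
        }
    }
}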

Conclusion

This was one of those issues where I really should have had the foresight to see the first problem (timeouts when dealing with large data sets), but the existence of what looks to be a real bug made everything far more difficult than it could have been.

Still, it just goes to show that no matter how confident you are in a process, there is always the risk that when you execute it in reality that it all might go belly up.

Which just reinforces the idea that you should always be running it in reality as soon as you possibly can, and paying attention to what the results are.


Once you put a piece of software out there in the market, it’s only a matter of time until a real paying customer finds an issue of some sort.

Obviously it would be fantastic if they didn’t find any issues at all, and the software was perfect in every way, but let’s be honest, most software is unlikely to ever be constructed with the amount of rigour and quality checks required to ensure it never has a single issue.

It’s just not cost-effective.

So, the reality is that paying customers will probably find some issues.

How you handle those interactions can make a huge difference to how your customers feel about them.

A Loop Has To Start Somewhere, Right?

The best you can hope for when a customer discovers an issue is that they let you know.

This sort of behaviour should be heavily encouraged; that customer cared enough about the situation to let you know, and you want your customers to care. That’s a good emotion.

The first thing you need to do is acknowledge their contribution in some way; perhaps an email or a phone call to thank them for their time and effort. Anything will do really, but if it’s got a personal touch, all the better. If the issue was discovered during a support case then you’re probably already covered, but it can still help to send out some sort of summary stating that the root cause of their call was an issue in the software and that you really appreciate that they brought it to your attention.

Once an issue has been acknowledged it probably goes into your backlog.

This is an incredibly dangerous place, because it’s easy for things to disappear, and for the customer to lose sight of them. It’s here that there is a real danger that the customer can become disengaged; they spent some amount of time and effort on the issue and they are probably personally invested in it, so if it disappears for some amount of time without any updates, they are going to feel disheartened.

So, even when an issue is sitting in your backlog it’s worthwhile to still keep customers informed. Granted, it’s pretty difficult to do this, especially if you’re not actively working on their issue (limited time and energy, hard decisions have to be made), but I prefer an honest and open discussion to just going dark.

Speaking of going dark…

Or Maybe It Doesn’t

Historically, for our legacy product, we haven’t had much of a system, or perhaps whatever system we had was completely invisible to the development team, I’m not sure.

As far as I know, customers would lodge support cases and those cases would result in a stream of bugs being logged into the system. The prioritization process was pretty ad-hoc, unless a particular customer made a bunch of noise, or something happened repeatedly and we twigged onto a greater pattern.

That’s not to say that we didn’t care, because we did. We just didn’t have a particularly good process that kept our customers in the loop.

In the last few months, along with the formation of a dedicated maintenance team for that particular piece of software, we’ve taken the process that we did have and improved it. It’s early days, and we’re still tweaking the improvements, but we’re hopeful that it will make a difference.

We now have a much better triage process for support cases to classify them as the result of an existing bug or a new bug (or something else, like user error). This happens mostly as a result of a daily standup we put in place connecting the support team with the maintenance team. Additionally, whenever we identify that a case was related to a bug, we link it in our tracking systems and we also contact the customer directly and thank them for helping us identify a problem.

This ties in nicely with what I think is the most important improvement that we’ve made, which is that whenever we make a release (monthly), we identify any customers that were affected by bugs we just fixed and contact them directly, letting them know that their issue has been fixed and thanking them again for helping us identify and resolve the problem.

Some nice improvements to be sure, but it’s all a bit manual right now, and the engineer in me is uncomfortable with that sort of overhead.

I think we can do even better.

Time Travel Is Hard Like That

I want to expose our internal bug tracking system to our customers, ideally completely publicly. Failing that, an equivalent system that feeds back automatically into the internal one; I don’t really mind how it’s done.

Mostly, I just want to reduce the overhead in communicating issues to customers and to give them a nice view into not only their issue, but all of the issues that we are aware of and how they are organised. For me, it’s all part of being completely open and honest about the whole thing.

It’s not entirely altruistic though, as I do have something of an ulterior motive.

One of the hardest problems that we face is really understanding how many customers are being negatively affected by an issue. Sure, we have a bunch of business intelligence that we can use, but nothing is quite as good as a customer themselves identifying that X is hurting their ability to do their job. Our support system and the cases that it creates help with this, but it’s not enough for me.

I want our customers to be able to vote for their issues themselves, and then use that information to help make decisions.

Of course, I’m assuming customers care enough to participate, which might be a bad assumption, but I won’t know for sure until I try.

There are a bunch of things to be careful about, obviously, like exposing customer information. We often attach customer reports and other pieces of information to bugs in order to help diagnose them, so that could lead us to a world of hurt if we don’t handle it properly. There’s also the slightly lesser risk of comments or wording in a ticket not being “customer friendly”, which is a nice way of saying we would have to be constantly and intensely aware of how we present ourselves, even in an apparently internal system.

We’re pretty good, but let’s be honest, sometimes people can be very frustrating.

Conclusion

The root point here, ignoring the potentially insane dream of exposing all of our internal bug tracking to our customers directly, is to ensure that your customers know what is going on.

This shouldn’t be a hard position to take, but it can be difficult and time consuming to actually accomplish. As always, a legacy product containing a bunch of known and unknown issues makes everything harder than it otherwise should be, but it’s still a tenable situation all the same.

At the very least the one thing you should keep in mind is that there is nothing more dangerous than complete radio silence.

That silence is very easily filled up with all sorts of terrible things, so you might as well fill it up with facts instead, even if they might not be the facts that your customers want.

Like that the only thing that you know is that their issue might be fixed sometime between now and the heat death of the universe.


I’ve been using MVVM as a pattern for UI work for a while now, mostly because of WPF. It’s a solid pattern and while I’ve not really delved into the publicly available frameworks (Prism, Caliburn.Micro, etc) I have put together a few reusable bits and pieces to make the journey easier.

One of those bits and pieces is the ability to perform work in the background, so that the UI remains responsive and usable while other important things are happening. This usually manifests as some sort of refresh or busy indicator on the screen after the user elects to do something complicated, but the important part is that the screen itself does not become unresponsive.

People get antsy when software “stops responding” and tend to murder it with extreme prejudice.

Now, the reusable components are by no means perfect, but they do get the job done.

Except when they don’t.

Right On Schedule

The framework itself is pretty bare bones stuff, with a few simple ICommand implementations and some view model base classes giving easy access to commonly desired functions.

The most complex part is the built-in support to easily do background work in a view model while leaving the user experience responsive and communicative. The core idea is to segregate the stuff happening in the background from the stuff happening in the foreground (which is where all the WPF rendering and user interaction lives) using Tasks and TaskSchedulers from the TPL (Task Parallel Library), while also helping to manage some state to communicate what was happening to the user (like busy indicators).

Each view model is responsible for executing some long running operation (probably started from a command), and then deciding what should happen when that operation succeeds, fails or is cancelled.

In order to support this segregation, the software takes a dependency on three separate task schedulers; one for the background (which is just a normal ThreadPoolTaskScheduler), one for the foreground (which is a DispatcherTaskScheduler or something similar) and one for tasks that need to be scheduled on a regular basis (another ThreadPoolTaskScheduler).

This dependency injection allows for those schedulers to be overridden for testing purposes, so that they execute completely synchronously or can be pumped at will as necessary in tests.
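
Stripped right back, the background work support looks something like the sketch below. It’s not the actual framework code (property change notification and the recurring-work scheduler are omitted), but it shows the general shape and why swapping the schedulers out in tests is trivial.

using System;
using System.Threading;
using System.Threading.Tasks;

// Not the real framework, just the shape of it.
public abstract class BackgroundWorkViewModel
{
    private readonly TaskScheduler _background;
    private readonly TaskScheduler _foreground;

    protected BackgroundWorkViewModel(TaskScheduler background, TaskScheduler foreground)
    {
        _background = background;
        _foreground = foreground;
    }

    public bool IsBusy { get; private set; }

    protected void ExecuteInBackground<T>(Func<T> work, Action<T> onSuccess, Action<Exception> onFailure)
    {
        IsBusy = true;
        Task.Factory
            .StartNew(work, CancellationToken.None, TaskCreationOptions.None, _background)
            .ContinueWith(completed =>
            {
                // The continuation runs on the foreground scheduler, so touching
                // bound properties here is safe.
                IsBusy = false;
                if (completed.IsFaulted) onFailure(completed.Exception);
                else onSuccess(completed.Result);
            }, CancellationToken.None, TaskContinuationOptions.None, _foreground);
    }
}

In tests, both schedulers can be replaced with implementations that run everything inline on the calling thread, which is what made the original framework so easy to exercise synchronously.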

It all worked pretty great until we started really pushing it hard.

Schedule Conflict

Our newest component to use the framework did a huge amount of work in the background. Not only that, because of the way the interface was structured, it pretty much did all of the work at the same time (i.e. as soon as the screen was loaded), in order to give the user a better experience and minimise the total amount of time spent waiting.

From a technical standpoint, the component needed to hit both a local database (not a real problem) and a remote API (much much slower), both of which are prime candidates for background work due to their naturally slow nature. Not a lot of CPU intensive work though, mostly just DB and API calls.

With 6-10 different view models all doing work in the background, it quickly became apparent that we were getting some amount of contention for resources, as not all Tasks were being completed in a reasonable amount of time. It was surprisingly hard to measure, but it looked like the Tasks manually scheduled via the TaskSchedulers were quite expensive to run, and the ThreadPoolTaskSchedulers could only run so much at the same time due to the limits on parallelisation and the number of threads that they could have running at once.

So that sucked.

As a bonus annoyance, the framework did not lend itself to usage of async/await at all. It expected everything to be synchronous, where the “background” nature of the work was decided by virtue of where it was executed. Even the addition of one async function threw the whole thing into disarray, as it became harder to reason about where the work was actually being executed.

In the grand scheme of things, async/await is still relatively new (but not that new, it was made available in 2012 after all), but it’s generally considered a better and less resource intensive way to ensure that blocking calls (like HTTP requests, database IO, file IO and so on) are not causing both the system and the user to wait unnecessarily. As a result, more and more libraries are being built with async functions, sometimes not even exposing a synchronous version at all. It’s somewhat difficult to make an async function synchronous too, especially if you want to avoid potential deadlocks.
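
The sync-over-async trap in particular is worth spelling out, because it’s exactly what a synchronous programming model invites. Something like the following, where the API client and types are obviously stand-ins:

using System.Threading.Tasks;

// Illustrative only: ICustomerApi and Customer are stand-ins for whatever the
// real remote API client looks like.
public interface ICustomerApi { Task<Customer> GetCustomerAsync(int id); }
public class Customer { public int Id { get; set; } }

public class CustomerLoader
{
    private readonly ICustomerApi _api;
    public CustomerLoader(ICustomerApi api) { _api = api; }

    // The classic trap: blocking on an async call from a thread that owns a
    // SynchronizationContext (like the WPF dispatcher thread) can deadlock,
    // because the awaited continuation needs the very thread that is blocked.
    public Customer LoadCustomer(int id)
    {
        return _api.GetCustomerAsync(id).Result;
    }

    // The common workaround pushes the work to the thread pool first, which
    // avoids the deadlock but still burns a thread on waiting.
    public Customer LoadCustomerSafely(int id)
    {
        return Task.Run(() => _api.GetCustomerAsync(id)).GetAwaiter().GetResult();
    }
}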

With those limitations noted, we had to do something.

Why Not Both?

What we ended up doing was allowing for async functions to be used as part of the background work wrappers inside the base view models. This retained the managed “busy” indicator functionality and the general programming model that had been put into place (i.e. do work, do this on success, this on failure, etc).

Unfortunately what it also did was increase the overall complexity of the framework.

It was now much harder to reason about which context things were executing on, and while the usage of async functions was accounted for in the background work part of the framework, it was not accounted for in either the success or error paths.

This meant that it was all too easy to use an async function in the wrong context, causing a mixture of race conditions (where the overarching call wasn’t aware that part of itself was asynchronous) and bad error handling (where a developer had marked a function as async void to get around the compiler errors/warnings).
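
The async void part is the nastier of the two, because it silently breaks the error path. A contrived illustration (not framework code):

using System;
using System.Threading.Tasks;

public class ExampleViewModel
{
    // Exceptions thrown here cannot be observed by the caller or by the
    // framework's error path; they go straight to the unhandled exception
    // machinery (or vanish, depending on context).
    public async void RefreshFireAndForget()
    {
        await Task.Delay(100);
        throw new InvalidOperationException("nobody will catch this");
    }

    // An async Task signature keeps the exception on the returned task, where
    // a success/failure continuation can actually see it.
    public async Task RefreshAsync()
    {
        await Task.Delay(100);
        throw new InvalidOperationException("this one is observable");
    }
}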

Don’t get me wrong, it all worked perfectly fine, assuming you knew to avoid all of the things that would make it break.

The tests got a lot more flaky though, because while it’s relatively easy to override TaskSchedulers with synchronous versions, it’s damn near impossible to force async functions to execute synchronously.

Sole Survivor

Here’s where it all gets pretty hypothetical, because the solution we actually have right now is the one that I just wrote about (the dual-natured, overly complex abomination) and it’s causing problems on and off in a variety of ways.

A far better model is to incorporate async/await into the fabric of the framework, allowing for its direct usage and doing away entirely with the segmentation logic that I originally put together (with the TaskSchedulers and whatnot).

Stephen Cleary has some really good articles in MSDN magazine about this sort of stuff (being async ViewModels and supporting constructs), so I recommend reading them all if you’re interested.

At a high level, if we expose the fact that the background work is occurring asynchronously (via async commands and whatnot), then not only do we make it far easier to do work in the background (literally just use the standard async/await constructs), but it becomes far easier to handle errors in a reliable way, and the tests become easier too, because they can simply be async themselves (which all major unit testing frameworks support).
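
As a taste of what that direction looks like, an async command along these lines (my own sketch, loosely in the spirit of those articles, not code lifted from them) replaces most of the scheduler machinery:

using System;
using System.Threading.Tasks;
using System.Windows.Input;

public class AsyncCommand : ICommand
{
    private readonly Func<Task> _execute;
    private bool _isExecuting;

    public AsyncCommand(Func<Task> execute) { _execute = execute; }

    public event EventHandler CanExecuteChanged;
    public Exception LastError { get; private set; }

    public bool CanExecute(object parameter) => !_isExecuting;

    // async void is acceptable here because this is a top-level, event-style
    // entry point; everything underneath it stays async Task.
    public async void Execute(object parameter)
    {
        _isExecuting = true;
        CanExecuteChanged?.Invoke(this, EventArgs.Empty);
        try
        {
            await _execute();
        }
        catch (Exception ex)
        {
            // A real implementation would surface this to the view model (an
            // error property, a notification, etc) rather than just stashing it.
            LastError = ex;
        }
        finally
        {
            _isExecuting = false;
            CanExecuteChanged?.Invoke(this, EventArgs.Empty);
        }
    }
}

The view models then just expose async Task methods and await them, and the tests can await those same methods directly.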

It does represent a significant refactor though, which is always a bit painful.

Conclusion

I’m honestly still not sure what the better approach is for this sort of thing.

Async/await is so easy to use at first glance, but has a bunch of complexity and tripwires for the unwary. It’s also something of an infection, where once you use it even a little bit, you kind of have to push it through everything in order for it to work properly end-to-end. This can be problematic for an existing system, where you want to introduce it a bit at a time.

On the other side, the raw TPL stuff that I put together is much more complex to use, but is relatively shallow. It’s much easier to reason about where work is actually happening and relatively trivial to completely change the nature of the application for testing purposes. Ironically enough, the ability to easily change from asynchronous background workers to a purely synchronous execution is actually detrimental in a way, because it means your tests aren’t really doing the same thing as your application will, which can mask issues.

My gut feel is to go with the newer thing, even though it feels a bit painful.

I think the pain is a natural response to something new though, so it’s likely to be a temporary thing.

Change is hard, you just have to push through it.


As is typical when I get sick, I have not written a blog post, and I did not think to write one ahead of time to avoid this particular situation.

Maybe next time I will be better prepared.