
Building new features and functionality on top of legacy software is a special sort of challenge, one that I’ve written about from time to time.

To be honest though, the current legacy application that I’m working with is not actually that bad. The prior technical lead had the great idea to implement a relatively generic way to execute modern .NET functionality from the legacy VB6 code thanks to the magic of COM, so you can still work with a language that doesn’t make you sad on a day-to-day basis. It’s a pretty basic eventing system (i.e. both sides can raise events that are handled by the other side), but it’s effective enough.
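For illustration, the .NET side of a bridge like that might look something like the following sketch. All of the names and interfaces here are invented; the real bridge in the application is considerably more involved.

using System;
using System.Runtime.InteropServices;

// Sketch of a COM-visible bridge (hypothetical names). VB6 creates this class
// via COM, handles its events WithEvents-style, and calls back into it.
[ComVisible(true)]
[InterfaceType(ComInterfaceType.InterfaceIsIDispatch)]
public interface IBridgeEvents
{
    // Raised by the .NET side, handled on the VB6 side.
    void EventForLegacy(string name, string payload);
}

[ComVisible(true)]
[ClassInterface(ClassInterfaceType.None)]
[ComSourceInterfaces(typeof(IBridgeEvents))]
public class Bridge
{
    public delegate void EventForLegacyHandler(string name, string payload);

    // The event name must match the method on the source interface above.
    public event EventForLegacyHandler EventForLegacy;

    // Called by VB6 to push an event into the .NET side and get a response back.
    public string RaiseEventForDotNet(string name, string payload)
    {
        // ... dispatch to whichever .NET handler is registered for this event ...
        return "handled";
    }

    // Called by the .NET side to push an event back into VB6.
    public void RaiseEventForLegacy(string name, string payload)
    {
        EventForLegacy?.Invoke(name, payload);
    }
}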

Everything gets a little bit tricksy when windowing and modal dialogs are involved though.

One Thing At A Time

The legacy application is basically a multiple document interface (MDI) experience, where the user is free to open a bunch of different entities and screens at the same time. Following this approach for new functionality adds a bunch of inherent complexity though, in that the user might edit an entity that is currently being displayed elsewhere (maybe in a list), requiring some sort of common, global channel for saying “hey, I’ve updated entity X, please react accordingly”.

This kind of works in the legacy code (VB6), because it just runs global refresh functions and changes form controls whenever it feels like it.

When the .NET code gets involved though, it becomes very difficult to maintain both worlds in the same way, so we’ve taken to isolating all the new features from the legacy stuff, primarily through modal dialogs. That is, the user is unable to access the rest of the application while the .NET feature is running.

To be honest, I was pretty surprised that we could open up a modal form in WPF from an event handler started in VB6, but I think it worked because both VB6 and .NET shared a common UI thread, and the modality of a form is handled at a low level common to both technologies (i.e. win32 or something).
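Conceptually, the .NET side of one of those handlers doesn’t need to be much more than the sketch below. The launcher class and owner-handle plumbing are illustrative rather than the actual code, and in reality the window would be the feature’s own WPF window rather than a bare Window.

using System;
using System.Windows;
using System.Windows.Interop;

// Illustrative only: invoked via the COM bridge, i.e. on the shared VB6/.NET UI thread.
public class ModalFeatureLauncher
{
    public void HandleLegacyEvent(int ownerHwnd)
    {
        var window = new Window { Title = "New .NET feature" };

        // Parent the WPF window to the owning VB6 form so modality behaves sensibly.
        new WindowInteropHelper(window).Owner = new IntPtr(ownerHwnd);

        // Blocks the shared UI thread until the window is closed; the modality
        // itself is enforced down at the Win32 level, which is why it works
        // across both technologies.
        window.ShowDialog();
    }
}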

We paid a price from a user experience point of view of course, but we mostly worked around it by making sure that the user had all of the information they needed to make a decision on any screen in the .NET functionality, so they never needed to refer back to the legacy stuff.

Then we did a new thing and it all came crashing down.

Unforgettable Legacy

Up until fairly recently, the communication channel between VB6 and .NET was mostly one way. VB6 would raise X event, and .NET would handle it by opening up a modal window or by executing some code that then returned a response. If any communication needed to go back to VB6 from the .NET side, it always happened after the modal window had already closed.

This approach worked fine until we needed to execute some legacy functionality as part of a workflow in .NET, while still having the .NET window be displayed in a modal fashion.

The idea was simple enough.

  • Use the .NET functionality to identify the series of actions that needed to be completed
  • In the background, iterate through that list of actions and raise an event to be handled by the VB6 side to do the actual work
  • This event would be synchronous, in that the .NET side would wait for the VB6 side to finish its work and respond before moving on to the next item
  • Once all actions are completed, present a summary to the user in .NET

We’d actually used a similar approach for a different feature in the past, and while we had to refactor some of the VB6 to make the functionality available in a way that made sense, it worked okay.
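As a rough sketch (with hypothetical names standing in for the real bridge and UI code), the background iteration looked conceptually like this:

using System.Collections.Generic;

// Hypothetical fragment: LegacyAction, _bridge and ShowSummary stand in for the
// real types. The important part is the blocking, one-at-a-time round trip into VB6.
public class ActionRunner
{
    private readonly Bridge _bridge = new Bridge();

    private void ExecuteLegacyActions(IEnumerable<LegacyAction> actions)
    {
        var summary = new List<string>();

        foreach (var action in actions)
        {
            // The event is synchronous: this call does not return until the
            // VB6 side has finished doing the actual work for this action.
            var result = _bridge.RaiseEventForLegacyAndWait("execute-action", action.Id);
            summary.Add(action.Id + ": " + result);
        }

        // Once every action has been processed, show the user what happened.
        ShowSummary(summary);
    }
}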

This time the legacy functionality we were interested in was already available as a function on a static class, so it was easily callable. I mean, it was a poorly written function dependent on some static state, so it wasn’t a complete walk in the park, but we didn’t need to do any high-risk refactoring or anything.

Once we wrote the functionality though, two problems became immediately obvious:

  1. The legacy functionality could pop up dialogs, asking the user questions relevant to the operation. This was actually kind of good: one of the main reasons we didn’t want to reimplement was that we couldn’t be sure we would capture all the edge cases, whereas the existing functionality guaranteed that we would, because it was already doing it (and had been for years). These cases were rare, so while they were a little disconcerting, they were acceptable.
  2. Sometimes executing the legacy functionality would murder the modal-ness of the .NET window, which led to all sorts of crazy side effects. This seemed to happen mostly when the operation changed the underlying VB6 context in a way that would historically have required a refresh. When it happened, the .NET window would drop behind the main application window, and the main window would be fully interactable, including opening additional windows (which would explode if they too were modal). There did not seem to be a way to get the original .NET window back into focus either. I suspect that there were a number of Application.DoEvents calls secreted throughout the byzantine labyrinth of code that were forcing screen redraws, but we couldn’t easily prove it. It was definitely broken though.

The first problem wasn’t all that bad, even if it wasn’t great for a semi-automated process.

The second one was a deal-breaker.

Freedom! Horrible Terrifying Freedom!

We tried a few things to “fix” the whole modal window problem, including:

  • Trying to restore the modal-ness of the window once it had broken. This didn’t work at all, because the window was still modal somehow, and technically we’d lost the thread context from the initial .ShowDialog call (which may or may not have still been blocked, it was hard to tell). In fact, other windows in the application that required modality would explode if you tried to use them, with an error similar to “cannot display a modal dialog when there is already one in effect”.
  • Trying to locate and fix the root reason why the modal-ness was being removed. This was something of a fool’s errand as I mentioned above, as the code was ridiculously labyrinthine and it was impossible to tell what was actually causing the behaviour. Also, it would involve simultaneously debugging both VB6 and .NET, which is somewhat challenging.
  • Forcing the .NET window to be “always on top” while the operation was happening, to at least prevent it from disappearing. This somewhat worked, but required us to use raw Win32 windowing calls, and the window was still completely broken after the operation was finished. Also, making the window always on top all the time would be confusing, since the visible parts of the parent window could still be clicked.
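For reference, the “raw Win32” part of that third option is presumably something along the lines of SetWindowPos with HWND_TOPMOST; a minimal sketch of that call (error handling omitted) looks like this:

using System;
using System.Runtime.InteropServices;

internal static class TopMost
{
    private static readonly IntPtr HWND_TOPMOST = new IntPtr(-1);
    private const uint SWP_NOMOVE = 0x0002;
    private const uint SWP_NOSIZE = 0x0001;

    [DllImport("user32.dll", SetLastError = true)]
    private static extern bool SetWindowPos(IntPtr hWnd, IntPtr hWndInsertAfter,
        int x, int y, int cx, int cy, uint uFlags);

    // Keep the window above everything else without moving or resizing it.
    public static void MakeTopMost(IntPtr hwnd)
    {
        SetWindowPos(hwnd, HWND_TOPMOST, 0, 0, 0, 0, SWP_NOMOVE | SWP_NOSIZE);
    }
}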

In the end, we went with just making the .NET window non-modal and dealing with the ramifications. With the structures we’d put into place, we were able to refresh the content of the .NET window whenever it gained focus (to prevent it displaying incorrect data due to changes in the underlying application), and our refreshes were quick due to performance optimization, so that wasn’t a major problem anymore.

It was still challenging though, as sitting a WPF Dispatcher on top of the main VB6 UI thread (well, the only VB6 thread) and expecting them both to work at the same time was just asking too much. We had to create a brand new thread just for the WPF functionality, and inject a TaskScheduler initialized on the VB6 thread for scheduling the events that get pushed back into VB6.
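A simplified sketch of that arrangement is below. The names are hypothetical and the real plumbing is messier; it also assumes a synchronization context can be established on the VB6 thread (here by piggybacking a Dispatcher on its existing message pump), which is what makes the captured TaskScheduler useful.

using System;
using System.Threading;
using System.Threading.Tasks;
using System.Windows.Threading;

public static class WpfHost
{
    public static TaskScheduler LegacyScheduler { get; private set; }

    // Called from the COM entry point, i.e. while still on the VB6 UI thread.
    public static void Initialise()
    {
        SynchronizationContext.SetSynchronizationContext(
            new DispatcherSynchronizationContext(Dispatcher.CurrentDispatcher));
        LegacyScheduler = TaskScheduler.FromCurrentSynchronizationContext();
    }

    // Run the WPF functionality on its own thread with its own Dispatcher.
    public static void StartWpfThread(Action buildUi)
    {
        var thread = new Thread(() =>
        {
            buildUi();        // create the (now non-modal) WPF window here
            Dispatcher.Run(); // pump messages for the WPF side
        });

        thread.SetApartmentState(ApartmentState.STA); // WPF requires STA
        thread.IsBackground = true;
        thread.Start();
    }

    // Schedule work (e.g. events destined for VB6) back onto the VB6 thread.
    public static Task RunOnLegacyThread(Action work)
    {
        return Task.Factory.StartNew(work, CancellationToken.None,
            TaskCreationOptions.None, LegacyScheduler);
    }
}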

Conclusion

It’s challenging edge cases like this whole adventure that make working with legacy code time consuming in weird and unexpected ways. If we had just stuck to pure .NET functionality, we wouldn’t have run into any of these problems, but we would have paid a different price in reimplementing functionality that already exists, both in terms of development time, and in terms of risk (in that we don’t fully understand all of the things the current functionality does).

I think we made the right decision, in that the actual program functionality is the same as it’s always been (doing whatever it does), and we instead paid a technical price in order to get it to work well, as opposed to forcing the user to accept a sub-par feature.

It’s still not immediately clear to me how the VB6 and .NET functionality actually works together at all (with the application windowing, threading and various message pumps and loops), but it does work, so at least we have that.

I do look forward to the day when we can lay this application to rest though, giving it the peace it deserves after many years of hard service.

Yes, I’ve personified the application in order to empathise with it.


That thing that I complained about with Logstash and HTTP outputs a few weeks ago? Turns out it was actually a bug after all!

Not only that, but it was totally a race condition relating to the way the success/failure handlers of the Manticore HTTP library were being used, which was what I guessed, so go me. I guess even a broken clock is right once a day.

Long story short, don’t fire off an asynchronous request until you’ve got your asynchronous handlers in place to handle the result.
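The same principle shows up in any language. As a contrived C# analogue (nothing to do with the actual Ruby/Manticore code), an operation that signals completion via an event can finish before anyone has subscribed to that event:

using System;
using System.Threading.Tasks;

// Contrived example: a client that signals completion via an event.
public class EventBasedClient
{
    public event EventHandler<string> Completed;

    public void Send(string payload)
    {
        Task.Run(() =>
        {
            // ... pretend the actual work happens here ...
            Completed?.Invoke(this, payload); // anyone not yet subscribed misses this
        });
    }
}

public static class Example
{
    public static void Main()
    {
        // Racy: the work may finish before the handler is attached, so this
        // handler may never fire for this particular send.
        var racy = new EventBasedClient();
        racy.Send("event");
        racy.Completed += (s, e) => Console.WriteLine("done (racy): " + e);

        // Safe: attach the handlers first, then start the operation.
        var safe = new EventBasedClient();
        safe.Completed += (s, e) => Console.WriteLine("done (safe): " + e);
        safe.Send("event");

        Console.ReadLine(); // give the background work a chance to finish
    }
}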

In their typical awesome style, the maintainers of the HTTP output plugin have already fixed the issue and released a new version (4.3.2) for the masses to consume.

Unfortunately, I don’t think this new version has been rolled into an official Logstash release yet, so if I want to use it I’ll have to make a custom redistributable package to use in Octopus deploy. Not a huge issue, just a small pain. I’m sure it will be rolled into the official package soon.

Other than that piece of news, I don’t really have anything else, so I’m going to take this opportunity to muse on some of my experiences with asynchronous code.

Hellish Code

Depending on how it’s written, asynchronous code can be hard to reason about, and even harder to test.

Vanilla Node is a good example of bad asynchronous code. It’s incredibly easy to get into callback hell, where this function will call into this function, which has a handler that calls this function, which does this thing and so on. It’s even worse when you realise that the easiest way to accumulate state between so many asynchronous functions is to close over variables that are in scope when the function is created, which leaves you in a position where it’s hard to break things apart into nice self-contained, easily testable components.

*shudder*

Encapsulating asynchronous operations into some sort of structure is a good counter to callback hell, and once we started using Promises in the ELB logs processor Javascript, I found the quality of the code increased rapidly. It was still Javascript, but at least you could reason about it.

Continuing on with a language that I actually like, C#, you’ve got a similar concept at play with tasks and the TPL (Task Parallel Library).

Before the TPL, you would probably just use raw threads to offload work (or maybe dedicated asynchronous functions or classes like BackgroundWorker), and you could easily get into the same sort of problems that you get into in vanilla Node. I mean, you had the benefit of working with a nice, structured language, but it was still difficult and prone to error.

The TPL introduced encapsulations around asynchronous operations called Tasks, which helped write asynchronous code that actually made sense.
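As a trivial illustration (not tied to any particular codebase), the work, its continuation and its error handling all become ordinary objects that you can compose and pass around:

using System;
using System.Threading.Tasks;

// A tiny pipeline of tasks instead of nested callbacks.
var pipeline = Task.Run(() => 1 + 1)
    .ContinueWith(t => t.Result * 10)
    .ContinueWith(t => Console.WriteLine(t.IsFaulted ? "failed" : "result: " + t.Result));

pipeline.Wait(); // prints "result: 20"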

It did something else that was incredibly helpful as well.

It exposed the Task schedulers.

All The Control

Encapsulating operations as Tasks is all well and good, and in most cases you’re probably looking to run those operations asynchronously on another thread (or a pool of threads perhaps). I mean, that’s literally exactly what the default TaskScheduler does in C# (the ThreadPoolTaskScheduler).

Of course, that makes testing hard, because now your main thread (which the test is running on) has to have some logic in it to wait for outstanding operations to complete before you can assert things. This gets even more difficult when working with ViewModels, which require that certain updates be performed on the main/UI thread in order to work properly.

This is particularly important for WPF and PropertyChanged events, which only work properly if you raise them on the UI thread.

The good news is that you can write your own TaskScheduler and then inject it for testing purposes, assuming you’ve written your code in such a way that your Tasks are created with a specified scheduler.

I’ve written a TaskScheduler that is not threaded at all (i.e. it just executes the task inline), one that accumulates operations to be executed at will (good for testing transitions) and another that allows you to run tasks on the WPF Dispatcher construct (i.e. the main/UI thread in WPF), and I use all of them frequently while developing applications that rely on Tasks to do things.

They aren’t even all that hard to write really, as the following NonThreadedTaskScheduler shows:

using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public sealed class NonThreadedTaskScheduler : TaskScheduler
{
    // Tasks are never queued; they execute immediately on the calling thread.
    protected override void QueueTask(Task task)
    {
        TryExecuteTask(task);
    }

    // Inlining is always fine, for the same reason.
    protected override bool TryExecuteTaskInline(Task task, bool taskWasPreviouslyQueued)
    {
        return TryExecuteTask(task);
    }

    // Nothing is ever queued, so there is never anything to report.
    protected override IEnumerable<Task> GetScheduledTasks()
    {
        return Enumerable.Empty<Task>();
    }

    // Everything runs synchronously, so only one task ever executes at a time.
    public override int MaximumConcurrencyLevel
    {
        get { return 1; }
    }
}
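For completeness, this is roughly how such a scheduler gets injected and used in a test. The class under test here is made up; the important part is that it creates its Tasks via the overload of Task.Factory.StartNew that accepts a scheduler, instead of assuming the default one.

using System.Threading;
using System.Threading.Tasks;

// Hypothetical class under test: its Tasks run on whatever scheduler it is given.
public class Loader
{
    private readonly TaskScheduler _scheduler;

    public Loader(TaskScheduler scheduler)
    {
        _scheduler = scheduler;
    }

    public Task<string> LoadAsync()
    {
        return Task.Factory.StartNew(() => "loaded", CancellationToken.None,
            TaskCreationOptions.None, _scheduler);
    }
}

// In a test, everything executes inline on the test thread, so the task is
// already complete by the time you assert on it:
//
// var loader = new Loader(new NonThreadedTaskScheduler());
// var result = loader.LoadAsync().Result; // "loaded"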

Without the ability to change how the tasks are executed, none of that would have been possible, and while they would still be useful, they would be much harder to actually use and test.

Coincidentally, this is one of the reasons why I’m a little bit leery of the async/await keywords in C#, because I don’t think they expose any of the scheduling functionality, instead assuming that you are happy with the magical way that they work. Honestly though, my knowledge of async/await is fairly limited, so that might not be entirely correct.

Conclusion

To conclude, asynchronous code is hard and its bugs are subtle and hard to detect.

It’s also one of the easiest and highest impact ways to wring more performance out of a piece of software. If you’re doing a lot of things that naturally involve waiting (disk writes, network calls, etc), you stand to gain an incredible amount of time by just doing other things during those periods when you’re waiting on something external and out of your control.

Double edged sword really.

Like most software.


Almost a month ago, I made a bold claim that switching our Logstash instances to HTTP outputs for all log event traffic wouldn’t be all that difficult.

Oh god, how wrong I was.

In some weird display of prescience however, these were the exact words I used:

Conceptually, changing our Logstash instances to use a HTTP output instead of TCP should also be easy. Just need to change some configuration and deploy through Octopus. Keep in mind I haven’t actually done it yet, but it feels simple enough.

What Is Logstash?

Theoretically, I was entirely correct. It was as simple as changing the TCP output to HTTP. Same host, no fancy settings, no real complexity in the configuration.

For good measure, I upgraded the version of Logstash to 5.4.1 as well (from 5.2.0), which was a whole thing that I’m not going to get into right now. Suffice to say, I had a very recent version of Logstash installed when it all went to tatters.

We’ve got a pretty good deployment pipeline for distributing Logstash to our various API instances and whatnot, so after I made the changes, I checked in, builds kicked off, tests were run, deployments were made, and it was time to check whether or not CI was still working.

It was fine. For the first time, I could see the traffic caused by Logstash on the Broker using CloudWatch, which was nice.

Promote to Staging, same thing. Worked fine, traffic visible, all seemed well.

Promote to Prod, same thing. Worked fine, traffic visible, all is well.

And that’s the end of the story.

Baby Don’t Hurt Me

Ha ha ha ha ha ha no.

What I really should have done is let Staging run for a few days before promoting anything. I guess I was just a little too eager.

It wasn’t long after promoting to production that I noticed data disappearing from our dashboards and visualizations. At first glance, it all seemed to be isolated to one of our services in particular, so I started digging.

  • Log files were being appended to locally, so not the issue we encountered recently with our Nancy APIs. This problem had different symptoms anyway (no logs at all, instead of just no application logs).
  • Restarting the local instance of Logstash seemed to get a few events through into the ELK stack, so the communication channels were intact (i.e. traffic wasn’t being blocked at the network level), but only a few hundred.
  • Broker seemed to be healthy and handling traffic well, so it didn’t seem like a resource issue on the ELK stack side.

Based on the evidence, it looked like the problem was entirely the fault of the local Logstash instances, but if that was the case, why weren’t there more failures? At this point in the investigation, other Logstash instances appeared to be pushing their log events into the ELK stack without issue.

There didn’t even seem to be anything special about the logs going through the failing Logstash instances either. It was all pretty vanilla IIS, Performance Counter and API application logs. Nothing fancy.

Time to dig into Logstash itself.

The good news was that Logstash 5.4.1 features a really nice internal API, granting access to a bunch of information about the internal workings of the application.

Using a little bit of PowerShell, you can get statistics about the Logstash pipeline.

(Invoke-WebRequest http://localhost:9600/_node/stats/pipeline).Content

On one of the broken machines, our output looked something like this:

{ "host" : "a-hostname", "version" : "5.4.1", "http_address" : "127.0.0.1:9600", "id" : "a82b92de-d402-4b99-a569-a859c8e00d2b", "name" : "a-name", "pipeline" : { "events" : { "duration_in_millis" : 11293, "in" : 377, "filtered" : 375, "out" : 250, "queue_push_duration_in_millis" : 28027 }, "plugins" : { "inputs" : [ { "id" : "get-events-from-api-iis-logs", "events" : { "out" : 129, "queue_push_duration_in_millis" : 14075 }, "name" : "file" }, { "id" : "get-events-from-system-stats", "events" : { "out" : 248, "queue_push_duration_in_millis" : 13952 }, "name" : "file" }, { "id" : "get-events-from-api-application-logs", "events" : { "out" : 0, "queue_push_duration_in_millis" : 0 }, "name" : "file" } ], "filters" : [ { "id" : "break-iis-fields-down-via-csv", "events" : { "duration_in_millis" : 1135, "in" : 128, "out" : 128 }, "name" : "csv" }

# More filters here, cut for brevity

], "outputs" : [ { "id" : "send-events-to-elk-broker-over-http", "events" : { "duration_in_millis" : 5441, "in" : 375, "out" : 250 }, "name" : "http" } ] }, "reloads" : { "last_error" : null, "successes" : 0, "last_success_timestamp" : null, "last_failure_timestamp" : null, "failures" : 0 }, "queue" : { "type" : "memory" }, "id" : "main" } }

Super dense, but the interesting part is the HTTP output at the bottom.

See how it has 375 events in, but only 250 out?

It hangs like that forever.

Restarting Logstash causes it to start again and send a similar number of events, but eventually hang in exactly the same manner.

Don’t Hurt Me, No More

I’ll be honest, I tried a lot of things to get this to work, and I just couldn’t.

In fact, the problem actually got worse, because other Logstash instances started failing in the same way, and more and more data dropped off our dashboards and visualizations.

In the end, I had no choice but to roll back to the TCP output (which was flaky, but in a known way) and to log an issue on the Logstash Github describing my problem in the hope that someone would be able to look into it and help out.

Why not look myself?

Well, I started, but Logstash is Ruby code running through Java (which is then running on Windows via PowerShell), and getting into that stack from a development point of view looked pretty complicated and time-consuming.

I had a bit of a look at the Ruby code of the HTTP output and the only thing I could think of was a potential race condition in the construction of the HTTP requests.

# Create an async request
request = client.background.send(@http_method, url, :body => body, :headers => headers)
request.call # Actually invoke the request in the background
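# NOTE: the handlers below are registered only after the call has already been
# fired off; if the request completes before they are attached, they may never
# run, which is the race I suspect below.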

request.on_success do |response|
  begin
    # bunch of code dealing with success
  end
end

request.on_failure do |exception|
  begin 
    # bunch of code dealing with failures
  end
end

I don’t know how the HTTP library works (I think it’s Manticore?), but if it doesn’t support the registering (and correct execution) of event handlers AFTER a call has completed, then it’s possible for the request.call to finish and the success and failure handlers to never run.

Part of the success handler’s responsibilities is to inform the rest of the Logstash pipeline that the HTTP output is done with its events, so that sort of thing could easily lead to a deadlock.

Again, I don’t know Ruby at all, so that is complete conjecture on my part.

Conclusion

This was another one of those issues that just came out of nowhere and stole more of my time than I would like to admit to. I would still like to switch all of our log event traffic to HTTP (for the same reasons I went through last time), but I have other things to do for now. At least it’s still working, even though the TCP input in Logstash sometimes just stops working. It doesn’t happen very often, and I have some monitoring around it, so it’s not the end of the world.

To be honest, a smarter man would have seen that it was not working with no clear solution, rolled back to the TCP output and moved on. Probably within a few hours.

I am not a smart man.

I am, however, ridiculously stubborn, so I wailed away at this problem until it stole my will to live, then I gave up.


Back in the day (pre-Windows Vista, so we’re talking about Windows XP and Windows Server 2003), it was possible for a user and system services to share the same logical space. Depending on what your login settings were, the first user to log in to a system was likely to be automatically assigned to session 0, which is where services and other system processes would run.

This was both a blessing and a curse.

It was a blessing because now the user would be able to see if a service decided to display a dialog or some other incredibly stupid mechanism to try and communicate.

It was a curse because now the user was running in the same space as system critical services, enabling a particularly dangerous attack vector for viruses and malicious code.

In Windows Vista this was changed, with the advent of Session 0 isolation. In short, services would now run in their own session (session 0) and whenever a user logged in they would automatically be assigned to sessions greater than 0.

The entire post should be considered with a caveat. I am certainly no expert in Windows session management, so while I’ve tried my best to understand the concepts at play, I cannot guarantee their technical correctness. I do know that the solution I outline later does allow for a workaround for the application I was working with, so I hope that will prove useful at least.

Vista was many years ago though, so you might be asking what relevance any of this has now?

Well, applications have a tendency to live long past when anyone expects them to, and old applications in particular have a tendency to accumulate cruft over the years.

I was working with one such application recently.

Denied

Most of the time, my preference is to work with virtual machines, especially when investigating or demonstrating software. It’s just much easier to work in a virtual environment that can easily be reset to a known state.

I mostly use VirtualBox, but that’s just because it’s the virtualisation tool I am most familiar with. VirtualBox is all well and good, but it does make it very hard to collaborate with other people, especially considering the size of the virtual machines themselves (Windows is much worse than Linux). It’s hard to pass the virtual machine around, and it’s beyond most people to expose a virtual machine to the greater internet so someone in a different part of the world can access it.

As a result, I’ve gravitated towards AWS for demonstration machines.

AWS is not perfect (it’s hard to get older OS versions set up, for example, which limits its usage for testing things that require a certain OS), but for centralizing demonstration machines it’s a godsend.

How does all of this relate to session 0 and old applications?

Well, I recently setup an EC2 instance in AWS to demonstrate to stakeholders some work that we’d been doing. In order to demonstrate some new functionality in our product, I needed to configure a third-party product in a particular way. I had done this a number of times before on local virtual machines, so imagine my surprise when I was confronted with an error message stating that I was not allowed to configure this particular setting when not logged in as a console user.

Console?

To most users, I would imagine that that error message is incredibly unhelpful.

Well, this is where everything ties back into session 0, because in the past, if you wanted to remote into a machine, and be sure that you were seeing the same thing each time you logged in, you would use the following command:

mstsc /console

This would put you into session 0, which is usually the same session as you would see when physically accessing the server, i.e. it was as if you were viewing the monitor/keyboard/mouse physically connected to the box. More importantly, it also let you interact with services that insisted on trying to communicate with the user through dialogs or SendMessage.

Consistent use of the console switch prevented issues like Bob logging in and starting an application server, then Mary also logging in and doing the same. Without the /console switch, both would log into their own sessions, even if they were using the same user, and start duplicate copies of the application.

Being familiar with the concept (sometimes experience has its uses), I recognised the real meaning of the “you are not logged in as the console” message. It meant that the application had detected that I was not in session 0 and it needed to do something that requires communication with a service via outdated mechanisms. Disappointing, but the application has been around for a while, so I can’t be too mad.

Unfortunately, the console switch does not give access to session 0 anymore. At least not since the introduction of session 0 isolation in Vista. There is an /admin switch, but it has slightly different behaviour (it’s really only for getting access to the physical keyboard/screen, so not relevant in this situation).

Good Old Sysinternals

After scouring the internet for a while, I discovered a few things that were new to me.

The first was that when Microsoft introduced session 0 isolation they did not just screw over older applications. Microsoft is (mostly) good like this.

In the case of services that rely on interacting with the user through GUI components (dialogs, SendMessage, etc), you can enable the Interactive Services Detection Service (ui0detect). Once this service is enabled and running, whenever a service attempts to show a dialog or similar, a prompt will show up for the logged in user, allowing them to switch to the application.

The second was that you can actually run any old application you like in session 0, assuming you have administrator access to the machine.

This is where Sysinternals comes to the rescue yet again (seriously, these tools have saved me so many times, the authors may very well be literal angels for all I know).

Using psexec, you can start an application inside session 0.

psexec -s -i 0 {path to application}

You’ll need to be running as an Administrator (obviously), but assuming you have the Interactive Services Detection Service running, you should immediately receive a prompt that says something like “an application is trying to communicate with you”, which you can then use to switch to the GUI of the application running in session 0.

With this new power it was a fairly simple matter to start the application within session 0, fooling whatever check it had, which allowed me to change the setting and demonstrate our software as needed.

Conclusion

As I mentioned earlier, software has an unfortunate tendency to live for far longer than you think it should.

I doubt the person who wrote the console/session 0 check inside the application expected someone to be installing and running it inside a virtual machine hosted purely remotely in AWS. In fact, when the check was written, I doubt AWS was even a glimmer in Chris Pinkham’s eye. I’m sure the developer had a very good reason for the check (it prevented a bug or it allowed a solution that cost 1/4 as much to implement), and they couldn’t have possibly anticipated the way technology would change in the future.

Sometimes I worry that for all the thought we do put into software, and all the effort we put into making sure that it will do what it needs to do for as long as it needs to, it’s all somewhat pointless. We cannot possibly anticipate shifts in technology or users, so really the only reasonable approach is to try and make sure we can change anything with confidence.

Honestly, I’m surprised most software works at all, let alone works mostly as expected decades later.