Well Hung

July 4. 2017 0 Comments

Almost a month ago, I made a bold claim that switching our Logstash instances to HTTP outputs for all log event traffic wouldn’t be all that difficult.

Oh god, how wrong I was.

In some weird display of prescience however, these were the exact words I used:

Conceptually, changing our Logstash instances to use a HTTP output instead of TCP should also be easy. Just need to change some configuration and deploy through Octopus. Keep in mind I haven’t actually done it yet, but it feels simple enough.

What Is Logstash?

Theoretically, I was entirely correct. It was as simple as changing the TCP output to HTTP. Same host, no fancy settings, no real complexity in the configuration.

For good measure, I upgraded the version of Logstash to 5.4.1 as well (from 5.2.0), which was a whole thing that I’m not going to get into right now. Suffice to say, I had a very recent version of Logstash installed when it all went to tatter.

We’ve got a pretty good deployment pipeline for distributing Logstash to our various API instances and whatnot, so after I made the changes, I checked in, builds kicked off, tests were run, deployments were made, and it was time to check whether or not CI was still working.

It was fine. For the first time, I could see the traffic caused by Logstash on the Broker using CloudWatch, which was nice.

Promote to Staging, same thing. Worked fine, traffic visible, all seemed well.

Promote to Prod, same thing. Worked fine, traffic visible, all is well.

And that’s the end of the story.

Baby Don’t Hurt Me

Ha ha ha ha ha ha no.

What I really should have done is left Staging run for a few days before promoting anything. I guess I was just a little too eager.

It wasn’t long after promoting to production that I noticed data disappearing from our dashboards and visualizations. At first glance, it all seemed to be isolated to one of our services in particular, so I started digging.

Log files were being appended to locally, so not the issue we encountered recently with our Nancy APIs. This problem had different symptoms anyway (no logs at all, instead of just no application logs).
Restarting the local instance of Logstash seemed to get a few events through into the ELK stack, so the communication channels were intact (i.e. traffic wasn’t being blocked at the network level), but only a few hundred.
Broker seemed to be healthy and handling traffic well, so it didn’t seem like a resource issue on the ELK stack side.

Based on the evidence, it looked like the problem was entirely the fault of the local Logstash instances, but if that was the case, why not more failures? At this point in the investigation, other Logstash instances appeared to be pushing their log events into the ELK stack without issue.

There didn’t even seem to be anything special about the logs going through the failing Logstash instances either. It was all pretty vanilla IIS, Performance Counter and API application logs. Nothing fancy.

Time to dig into Logstash itself.

The good news was that Logstash 5.4.1 features a really nice internal API, granting access to a bunch of information about the internal workings of the application.

Using a little bit of Powershell you can get statistics about the Logstash pipeline.

(Invoke-WebRequest http://localhost:9600/_node/stats/pipeline).Content

On one of the broken machines, our output looked something like this:

{ "host" : "a-hostname", "version" : "5.4.1", "http_address" : "127.0.0.1:9600", "id" : "a82b92de-d402-4b99-a569-a859c8e00d2b", "name" : "a-name", "pipeline" : { "events" : { "duration_in_millis" : 11293, "in" : 377, "filtered" : 375, "out" : 250, "queue_push_duration_in_millis" : 28027 }, "plugins" : { "inputs" : [ { "id" : "get-events-from-api-iis-logs", "events" : { "out" : 129, "queue_push_duration_in_millis" : 14075 }, "name" : "file" }, { "id" : "get-events-from-system-stats", "events" : { "out" : 248, "queue_push_duration_in_millis" : 13952 }, "name" : "file" }, { "id" : "get-events-from-api-application-logs", "events" : { "out" : 0, "queue_push_duration_in_millis" : 0 }, "name" : "file" } ], "filters" : [ { "id" : "break-iis-fields-down-via-csv", "events" : { "duration_in_millis" : 1135, "in" : 128, "out" : 128 }, "name" : "csv" }

# More filters here, cut for brevity

], "outputs" : [ { "id" : "send-events-to-elk-broker-over-http", "events" : { "duration_in_millis" : 5441, "in" : 375, "out" : 250 }, "name" : "http" } ] }, "reloads" : { "last_error" : null, "successes" : 0, "last_success_timestamp" : null, "last_failure_timestamp" : null, "failures" : 0 }, "queue" : { "type" : "memory" }, "id" : "main" } }

Super dense, but the interesting part is the HTTP output at the bottom.

See how it has 375 events in, but only 250 out?

It hangs like that forever.

Restarting Logstash causes it to start again and send a similar number of events, but eventually hang in exactly the same manner.

Don’t Hurt Me, No More

I’ll be honest, I tried a lot of things to get this to work, and I just couldn’t.

In fact, the problem actually got worse, because other Logstash instances started failing in the same way, and more and more data dropped off our dashboards and visualizations.

In the end, I had no choice but to roll back to the TCP output (which was flaky, but in a known way) and to log an issue on the Logstash Github describing my problem in the hope that someone would be able to look into it and help out.

Why not look myself?

Well I started, but Logstash is Ruby code running through Java (which is then running on Windows via Powershell) and getting into that stack from a development point of view looked pretty complicated and time-consuming.

I had a bit of a look at the Ruby code of the HTTP output and the only thing I could think of was a potential race condition in the construction of the HTTP requests.

# Create an async request
request = client.background.send(@http_method, url, :body => body, :headers => headers)
request.call # Actually invoke the request in the background

request.on_success do |response|
  begin
    # bunch of code dealing with success
  end
end

request.on_failure do |exception|
  begin 
    # bunch of code dealing with failures
  end
end

I don’t know how the HTTP library works (I think its Manticore?), but if it doesn’t support the registering (and correct execution) of event handlers AFTER a call has completed, then its possible for the request.call to finish and the success and failure handlers to never run.

Part of the success handler’s responsibilities is to inform the rest of the Logstash pipeline that the the HTTP output is done with its events, so that sort of thing could easily lead to a deadlock.

Again, I don’t know Ruby at all, so that is complete conjecture on my part.

Conclusion

This was another one of those issues that just came out of nowhere and stole more of my time than I would like to admit to. I would still like to switch all of our log event traffic to HTTP (for the same reasons I went through last time), but I have other things to do for now. At least its still working, even though the TCP input in Logstash sometimes just stop working. It doesn’t happen very often, and I have some monitoring around it, so its not the end of the world.

To be honest, a smarter man would have seen that it was not working with no clear solution, rolled back to the TCP output and moved on. Probably within a few hours.

I am not a smart man.

I am, however, ridiculously stubborn, so I wailed away at this problem until it stole my will to live, then I gave up.