And so I return from my break, with a fresh, reenergized hatred for terrible and unexpected technical issues.

Speaking of which…

Jacked Up And Good To Go

I’ve been sitting on a data export project for a while now. It’s a simple C# command line application that connects to one of our databases, correlates some entities together, pulls some data out and dumps it into an S3 bucket as CSV files. It has a wide variety of automated unit tests proving that the various components of the system function as expected, and a few integration and end-to-end tests that show that the entire export process works (i.e. given a set of command line arguments + an S3 bucket, does running the app result in data in the bucket).
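
The core of an export like this is not particularly complicated. The sketch below shows the general shape only; the names, the naive CSV handling and the S3 key layout are my own illustrative assumptions, not the actual implementation.

using System;
using System.IO;
using System.Threading.Tasks;
using Amazon.S3;
using Amazon.S3.Transfer;
using Npgsql;

public class Exporter
{
    public async Task ExportAsync(string connectionString, string sql, string bucket, string key)
    {
        var temp = Path.GetTempFileName();

        // Pull the data out of PostgreSQL and write it to a local file.
        // A real implementation would quote and escape the CSV fields properly.
        using (var connection = new NpgsqlConnection(connectionString))
        using (var writer = new StreamWriter(temp))
        {
            connection.Open();
            using (var command = new NpgsqlCommand(sql, connection))
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    var fields = new string[reader.FieldCount];
                    for (var i = 0; i < reader.FieldCount; i++)
                    {
                        fields[i] = Convert.ToString(reader.GetValue(i));
                    }
                    writer.WriteLine(string.Join(",", fields));
                }
            }
        }

        // Push the resulting file into the S3 bucket.
        using (var s3 = new AmazonS3Client())
        {
            await new TransferUtility(s3).UploadAsync(temp, bucket, key);
        }
    }
}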

From a purely technical point of view, the project has been “complete” for a while now, but I couldn’t turn it on until some other factors cleared up.

Well, while I was on holidays those other factors cleared up, so I thought to myself “I’ll just turn on the schedule in TeamCity and all will be well”.

Honestly, you’d think having written as much software as I have that I would know better.

That first run was something of a crapshow, and while some of the expected files ended up in the S3 bucket, a bunch of others were missing completely.

Even worse, the TeamCity task that we use for job execution thought everything had completed successfully. The only indicator that it had failed was that the task description in TeamCity was not updated to show the summary of what it had done (i.e. it simply showed the generic “Success” message instead of the custom one we were supplying), which was suspicious, because I’ve seen TeamCity fail miserably like that all by itself before.

Not Enough Time In The World

Breaking it down, there were two issues afoot:

  1. Something failed, but TeamCity thought everything was cool
  2. Something failed

The first problem was easy to deal with; there was a bug in the way the process was reporting back to TeamCity when there was an unhandled exception in the data extraction process. With that sorted, the next run at least indicated a failure had occurred.
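
I won’t reproduce the actual fix here, but the general shape is simple enough: make sure that anything escaping the top level surfaces as both a TeamCity build problem and a non-zero exit code. The sketch below is illustrative only; the service message text and the RunExport placeholder are my own.

using System;

public static class Program
{
    public static int Main(string[] args)
    {
        try
        {
            RunExport(args);

            // Custom summary shown against the build in TeamCity.
            Console.WriteLine("##teamcity[buildStatus text='Export completed successfully']");
            return 0;
        }
        catch (Exception ex)
        {
            // TeamCity service messages require |, ', [ and ] (and newlines) to be escaped.
            var description = Escape(ex.Message);
            Console.WriteLine($"##teamcity[buildProblem description='{description}']");
            return 1;
        }
    }

    private static void RunExport(string[] args)
    {
        // The real data extraction would happen here.
    }

    private static string Escape(string value)
    {
        return value
            .Replace("|", "||")
            .Replace("'", "|'")
            .Replace("[", "|[")
            .Replace("]", "|]")
            .Replace("\n", "|n")
            .Replace("\r", "|r");
    }
}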

With the process indicating correctly that a bad thing had happened, the cause of the second problem became obvious.

Timeouts.

Which explained perfectly why the tests had not picked up any issues, as while they run the full application (from as close to the command line as possible), they don’t run a full export, instead leveraging a limiter parameter to avoid the test taking too long.

Entirely my fault really, as I should have at least done a full export at some stage.

Normally I would look at a timeout with intense suspicion, as timeouts typically indicate an inefficient query or operation of some sort. Simply raising the time allowed when timeouts start occurring is often a route to a terrible experience for the end-user, as operations take longer and longer to do what the user wants them to do.

In this case though, it was reasonable that the data export would actually take a chunk of time greater than the paltry 60 seconds applied to command execution in Npgsql by default. Also, the exports that were failing were for the larger data sets (one of which had some joins onto other data sets) and, being full exports, they could not really make effective use of any indexes for optimisation.

So I upped the timeouts via the command line parameters and off it went.
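
For reference, Npgsql exposes the command timeout in (at least) two places, both measured in seconds: the CommandTimeout keyword in the connection string, which applies to every command created from that connection, and the CommandTimeout property on an individual NpgsqlCommand. Something like the following sketch, where the parameter plumbing is my own assumption:

using Npgsql;

public static class ConnectionFactory
{
    // commandTimeoutSeconds would come straight from a command line parameter.
    public static NpgsqlConnection Create(string host, string database, int commandTimeoutSeconds)
    {
        var builder = new NpgsqlConnectionStringBuilder
        {
            Host = host,
            Database = database,
            // Default timeout for every command created from this connection.
            CommandTimeout = commandTimeoutSeconds
        };

        return new NpgsqlConnection(builder.ConnectionString);
    }
}

// Alternatively, per command, for the handful of known-slow export queries:
// command.CommandTimeout = commandTimeoutSeconds;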

Three hours later though, I was pretty sure something else was wrong.

Socket To Me

Running the same process with the same inputs from my development environment showed the same problem. The process just kept on chugging along, never finishing. Pausing the execution in the development environment showed that at least one thread was stuck waiting eternally for some database related thing to conclude.

The code makes use of parallelisation (each export is its own isolated operation), so my first instinct was that there was some sort of deadlock involving one or more exports.

With the hang appearing to be related to database connectivity, I thought that maybe it was happening in the usage of Npgsql connections, but each export creates its own connection, so that seemed unlikely. There was always the possibility that the problem was related to connection pooling though, which is built into the library and is pretty much static state, but I had disabled that via the connection string, so it shouldn’t have been a factor.
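
To give an idea of the structure (the names here are hypothetical, not the real code): each export runs in parallel with the others, opens its own connection and has pooling explicitly switched off.

using System.Collections.Generic;
using System.Threading.Tasks;
using Npgsql;

public static class ExportRunner
{
    public static void RunAll(IEnumerable<string> exportQueries, string baseConnectionString)
    {
        Parallel.ForEach(exportQueries, sql =>
        {
            var builder = new NpgsqlConnectionStringBuilder(baseConnectionString)
            {
                // No shared connection pool, so no static state shared between exports.
                Pooling = false
            };

            using (var connection = new NpgsqlConnection(builder.ConnectionString))
            using (var command = new NpgsqlCommand(sql, connection))
            {
                connection.Open();
                // Execute the export and stream the results out to CSV/S3 here.
            }
        });
    }
}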

I ripped out all of the parallelisation and ran the process again and it still hung. On closer investigation, it was only one specific export that was hanging, which was weird, because they all use exactly the same code.

Turning towards the PostgreSQL end of the system, I ran the process again, except this time I started paying attention to the active connections to the database, the queries they were running, state transitions (i.e. active -> idle) and execution timestamps. This is pretty easy to do using the query:

-- Watch each connection's current query, state and timing information.
SELECT pid, state, query, query_start, state_change FROM pg_stat_activity;

I could clearly see the export that was failing execute its query on a connection, stay active for 15 minutes and then transition back to idle, indicating that the database was essentially done with that operation. On the process side though, it kept chugging along, waiting eternally on the socket it was listening to for data that would apparently never come.

The 15 minute query time was oddly consistent too.

It turns out the query was actually being terminated server side because of replica latency (the export process queries a replica DB), which was set to max out at, wait for it, 15 minutes.

For some reason, the stream returned by NpgsqlConnection.BeginTextExport(sql) just never ends when the underlying query is terminated on the server side.
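
For the curious, the export reads its data roughly like the sketch below (a hedged reconstruction, not the actual code). BeginTextExport takes a COPY ... TO STDOUT command and hands back a TextReader; when the server killed the query, ReadLine simply blocked forever instead of throwing.

using System.IO;
using Npgsql;

public static class TextExport
{
    public static void Run(NpgsqlConnection connection, string copyCommand, TextWriter output)
    {
        // copyCommand is something like "COPY (SELECT ...) TO STDOUT (FORMAT CSV)".
        using (var reader = connection.BeginTextExport(copyCommand))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                output.WriteLine(line);
            }
        }
    }
}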

My plan is to put together some information and log an issue in the Npgsql GitHub repository, because I can’t imagine that the behaviour is intended.

Solve For X

With the problem identified, the only question remaining was what to do about it.

I don’t even like that our maximum replica latency is set to 15 minutes, so raising it was pretty much out of the question (and this process is intended to be automated and ongoing, so I would have to raise it permanently).

The only real remaining option was to break the bigger query down into a bunch of smaller queries and then aggregate the results myself.

So that’s exactly what I did.

Luckily for me, the data set had a field that made segmentation easy, though running a bunch of queries and streaming the results into a single CSV file meant that they had to be run sequentially, so no parallelisation bonus for me.
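
The approach ends up looking something like the sketch below (the segment column, table name and the idea of one COPY per segment are illustrative assumptions): run one smaller query per segment, one after the other, appending each result to the same CSV file so that no single query runs long enough to be killed by the replica latency limit.

using System.Collections.Generic;
using System.IO;
using Npgsql;

public static class SegmentedExport
{
    public static void Run(NpgsqlConnection connection, IEnumerable<string> segments, string outputPath)
    {
        using (var output = new StreamWriter(outputPath))
        {
            foreach (var segment in segments)
            {
                // One smaller COPY per segment; segment_key is a hypothetical column.
                // Real code would whitelist or escape the segment values rather than
                // splicing them straight into the command text.
                var copyCommand =
                    "COPY (SELECT * FROM big_table WHERE segment_key = '" + segment + "') TO STDOUT (FORMAT CSV)";

                using (var reader = connection.BeginTextExport(copyCommand))
                {
                    string line;
                    while ((line = reader.ReadLine()) != null)
                    {
                        output.WriteLine(line);
                    }
                }
            }
        }
    }
}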

Conclusion

This was one of those issues where I really should have had the foresight to see the first problem (timeouts when dealing with large data sets), but the existence of what looks to be a real bug made everything far more difficult than it could have been.

Still, it just goes to show that no matter how confident you are in a process, there is always the risk that when you execute it in reality, it all might go belly up.

Which just reinforces the idea that you should always be running it in reality as soon as you possibly can, and paying attention to what the results are.


Generally, I’ve been pretty happy with the way that we do log aggregation, using Elasticsearch, Logstash and Kibana (ELK). That’s not to say it couldn’t use improvements in almost every area, but for now, it does what it needs to do. It’s definitely better than what we had before, and people are actually using it to answer questions, which I would consider the mark of a successful implementation.

Of all the components in the triumvirate though, Logstash has easily been the worst. Kibana has been basically problem free (somewhat to be expected for a visualisation framework that delegates the majority of its work to the data store) and Elasticsearch has been complicated and somewhat temperamental (mostly because we haven’t had the time nor experience to tune it properly), but Logstash has been a pain.

Version 1.4 flat out did not work properly on Windows with file based inputs, as it didn’t properly remember its position in the files it was reading. Version 1.5.0 fixed that problem, but we had to suffer through a number of beta issues because we were forced to use the beta version. Even after the 1.5 release we’ve had other issues, relating to TCP outputs consuming massive amounts of CPU (kind of annoying on CPU credit limited instances), using the S3 input from behind a proxy, and recovering reliably after its TCP output endpoint was down temporarily.

It’s become a bit of a meme between myself and one of our operations people that whenever something weird happens with logs (even if it’s unrelated to the actual process of log aggregation), it’s probably Logstash’s fault.

When one of our AWS instances responsible for filling API requests became unresponsive a few weeks ago, we joked that it was probably Logstash’s fault.

It turned out that it actually was.

What’s That Thing You Use to Dig Food?

Our APIs all have redundancy built into them via the magic of auto scaling groups and multiple instances. It’s the kind of thing you have to have when you’re running anything in AWS, as there is no guarantee that your instances will be long lived. It’s likely that they will be fine and everything will go swimmingly, but there is no guarantee. Honestly, that’s no different from normal hosting, the information is just available upfront, instead of after a few months when something explodes. Plan for failure, because it will happen.

Anyway, a few weeks ago one of our API instances stopped doing the thing it was supposed to be doing, i.e. filling API requests, so it was quietly removed from the load balancer. The alarms we have set up for the number of healthy instances kicked in, so we put the failing instance into standby mode (which spun up a new one to take its place) and started investigating.

Looking at the logs in Kibana, the instance started returning 500s sporadically early in the morning and then got progressively worse until it was removed from the load balancer a little after 0800. The root cause of the 500s? OutOfMemoryExceptions; the API process literally could not allocate a piece of memory large enough to do the thing it wanted to do.

My first response was to make that deep guttural noise that only comes from software developers when they know what’s about to happen. A mixture between a groan and a resigned sigh, tinged with frustration.

I’ve dealt with memory leaks before.

They are terrible, and I hate them.

I assumed that we had a memory leak in our deployed .NET application. Although .NET is a managed language, it is surprisingly easy to leak memory, with various ways to root vast amounts of information, stopping the garbage collector from recovering it.
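
A classic (entirely hypothetical) example of that sort of rooting: subscribe an object to a static event and never unsubscribe it, and that object, plus everything it references, stays reachable for the life of the process.

using System;

public static class SomeStaticPublisher
{
    public static event EventHandler SomethingHappened;

    public static void Raise() => SomethingHappened?.Invoke(null, EventArgs.Empty);
}

public class RequestHandler
{
    private readonly byte[] _buffer = new byte[1024 * 1024];

    public RequestHandler()
    {
        // The static event now holds a reference to this instance (and its 1MB buffer)
        // forever, unless something explicitly unsubscribes.
        SomeStaticPublisher.SomethingHappened += OnSomethingHappened;
    }

    private void OnSomethingHappened(object sender, EventArgs e)
    {
        // Handle the event.
    }
}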

The best tool that I’ve found for visualizing memory leaks in .NET is the SciTech .NET Memory Profiler. Honestly though, other than Visual Studio itself, it’s basically the only one I’ve really used, so I can’t strictly say that it’s better than anything else. I can say that it is good at what it does, because it’s helped me locate and fix quite a few memory leaks.

Being that the API had been running uninterrupted for a few weeks, the memory leak would be slow. It might even be related to specific requests that only happen rarely, or even worse, it might be time based, rather than event based. I knew I would need to either reproduce the problem locally (which might be quite hard) or find some way to get the memory profiler tool into the environment where the problem occurred and then wait until it happened again.

Like I said, I hate memory leaks.

Surprise! It Wasn’t Us

Remoting on to the AWS instance with the memory issue showed an entirely unexpected situation.

Our .NET process wasn’t using much memory at all.

The Java process, however, was using 1.5GB of physical memory and a huge chunk of virtual memory. The only piece of Java code running? Logstash.

Now I had a different problem. At least with a memory leak within our software I could trace it, find it and fix it. I know .NET, I’ve done memory leak analysis before, and while it’s terrible, it’s a known problem. It might take me a while to get to the bottom of it, but I would get there.

Logstash runs Ruby via JRuby, which runs on the JVM. It’s open source, but I wouldn’t even know where to begin with that mess. How could I find out why it was using so much memory for the apparently simple task of processing some text files? None of our Logstash configs are fancy; they just process text files and shunt the results to another instance of Logstash, which puts them into Elasticsearch.

Clearly, The Problem Is Solved

At this particular point in time, I do not have the luxury of getting to the root cause of problems with supplementary components. This makes me sad inside, but that’s how reality rolls sometimes.

The easiest solution right now? Restart the Logstash service every day.

Easy enough to implement with a Scheduled Task that executes PowerShell.

I did have one incredibly weird issue with the task. The task name that I generated for the service restart had square brackets in it, and as a result, so did the folder where the PowerShell task puts its transcribed output. The Start-Transcript command does not like square brackets in its output file path for some reason, and it fails with “Start-Transcript: The host is not currently transcribing”, which is incredibly helpful. This particular quirk took far too long to get to the bottom of.

Long term we need to look into the issue in depth. Maybe it’s a problem in the particular version of Logstash we are using (1.5.2), maybe it’s our configuration, or maybe it’s the environment. I don’t know, and that particular investigation could be a gigantic time sink, so we’ll have to dedicate some time to it later on.

Another solution would be to not use Logstash directly on the machines generating the logs, and to instead use the Logstash Forwarder, which is a lightweight replacement for Logstash that only forwards events to a remote location. The remote location is typically an actual Logstash instance which has all the logic to parse the event into a meaningful format. To me this just moves the issue around, but it still might be an option.

Conclusion

This is actually the first case that I can remember where a memory leak wasn’t the fault of the software I was actively working on. I call it a memory leak because I can’t conceive how an apparently simple process like Logstash could use so much memory over time, but it might just be what it does. I mean, I wouldn’t consider SQL Server using massive amounts of memory a leak per se (because I understand that what it does is hard, and it consumes more memory based on the size of the data); it’s just really weird that a process which should be entirely transient (read, transform, write, discard) could keep using more and more memory over time without there being some sort of bug.

To be honest, I’m extremely unhappy with installing a scheduled task to restart the Logstash service. It’s one of those band-aid fixes that you just know will end up biting you hard at some point in the future. It will probably cause an entirely worse issue, like massive, unexpected data loss or something.

For now, at least I know that the Logstash process will not stop our APIs from working, which I guess is a victory.

A pyrrhic victory.