
A long time ago, I wrote a post about fixing up our log management scripts to actually delete logs properly. Logs were being aggregated into our ELK stack and were then being deleted after reaching a certain age. It fixed everything perfectly and it was all fine, forever. The End

Or not.

The log management script itself was very simple (even though it was wrapped in a bunch of Powershell to schedule it via a Windows Scheduled Task). It looked at a known directory (C:\logs), found all the files matching *.log, narrowed the list down to only those files that had not been written to in the last 7 days and then deleted them. Not exactly rocket science.

As we got more and more usage on the services in question though, the log file generation started to outstrip the ability of the script to clean up.

They Can Fill You Up Inside

The problem was twofold: the log management script was hardcoded to only delete things that hadn’t been touched in the last 7 days, and the installation of the scheduled task that ran the script was done during machine initialisation (as part of the cfn-init configuration). Changing the script would require refreshing the environment, which requires a migration (which is time consuming and still not perfect). Not only that, but changing the script for one service (i.e. to delete all logs older than a day) might not be the best thing for other services.

The solution was to not be stupid and instead deploy log management the same way we deploy everything else: via Octopus Deploy.

With Octopus, we could build a package containing all of the logic for how to deploy a log management solution, and use that package in any number of projects with customised parameters, like retention period, directory, filter, whatever.

More importantly, it would give us the ability to update existing log management deployments without having to alter their infrastructure configuration, which is important for minimising downtime and just generally staying sane.

They Are Stronger Than Your Drive

It was easy enough to create a Nuget package containing the logic for installing the log management scheduled task. All I had to do was create a new repository, add a reference to our common scripts and then create a deploy.ps1 file describing how the existing scheduled task setup script should be called during deployment.

In fact, the only difference from calling the script directly like we were previously was that I grabbed variable values from the available Octopus parameters and slotted them into the script.
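
The resulting deploy.ps1 is really just glue. A minimal sketch of the shape of it (the variable keys and the Install-LogManagementScheduledTask function are illustrative, not our actual names):

# deploy.ps1, executed by the Octopus Tentacle during deployment.
$here = Split-Path $script:MyInvocation.MyCommand.Path

# Octopus injects project variables via the $OctopusParameters dictionary.
$logsDirectory = $OctopusParameters["LogManagement.LogsDirectory"]
$filter = $OctopusParameters["LogManagement.Filter"]
$retentionDays = [int]$OctopusParameters["LogManagement.RetentionDays"]

# Delegate to the existing scheduled task setup script from our common scripts.
. "$here\scripts\common\Functions-LogManagement.ps1"
Install-LogManagementScheduledTask -Directory $logsDirectory -Filter $filter -RetentionDays $retentionDays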

With a nice self contained Nuget package that would install a scheduled task for log management, the only thing left to do was create a new Octopus project to deploy the package.

I mostly use a 1-1 Nuget package to project strategy when using Octopus. I’ve found that it’s much easier to get a handle on the versions of components being deployed when each project only has one thing in it. In this case, the Nuget package was generic, and was then referenced from at least two projects: one to manage the logs on some API instances and the other to manage logs on a log shipper component (which gets and processes ELB logs).

In order to support a fully automated build, test and deployment cycle, I also created a TeamCity Build Configuration. Like all Build Configurations, it watches a repository for changes, runs tests, packages and then deploys to the appropriate environments. This one was slightly different though, in that it deployed multiple Octopus projects, rather than a single one like almost all of our other Build Configurations. I’m not sure how I feel about this yet (because it’s different from what we usually do), but it’s definitely easier to manage, especially since all the projects just use the same package anyway.

Conclusion

This post might not look like much, and to be honest it isn’t. Last week’s post on Expression Trees was a bit exhausting, so I went for something much smaller this week.

That is not to say that the concepts covered herein are not important.

One thing that I’ve taken away from this is that if you ever think that installing something just once on machine initialization is enough, you’re probably wrong. Baking the installation of log management into our cfn-init steps caused us a whole bunch of problems once we realised we needed to update them because they didn’t quite do the job. It was much harder to change them on the fly without paying a ridiculous penalty in terms of downtime. It also felt really strange to have to migrate an entire environment just because the log files weren’t being cleaned up correctly.

When it comes down to it, the real lesson is to always pick the path that lets you react to change and deploy with the least amount of effort.


The L in ELK stands for Logstash. This is not news.

When we put together the system we use for log aggregation, we needed to find a way to deal with Logstash such that it would fit into our relatively new (and still maturing) model for automated build and deployment. We put together a system that stores a deployable version of Logstash in Bitbucket, a build/package script written in Powershell that puts everything together into a Nuget package, TeamCity to listen for changes and then Octopus Deploy for managing the deployment.

The difficult part in this was that each place that required Logstash required a different configuration (because it was processing different files or transforming the data in different ways). There was a lot of commonality, but it was mostly in the actual installation and management of Logstash, rather than the configuration Logstash happened to be running at any particular point in time.

We have one git repository that contains everything necessary to run a copy of Logstash on a machine. This includes Logstash itself, the Java Runtime Environment (JRE) and a set of Powershell scripts that allow us to install Logstash as a Windows Service and choose which configuration it should be running. This meant that all of the configuration files for each of our Logstash deployments all lived in the same repository.

This worked okay for a little while, until we started adding more configuration files.

One Change Makes Many Work?

Suddenly, we had a bunch of Build Configurations in TeamCity triggering off changes to the one repository. They only triggered off changes to their own Logstash configurations at least, but they all triggered whenever we made changes to Logstash itself or to the scripts surrounding it. Technically, they were all building (and deploying) the same package (X.Logging.Logstash.nupkg), but each would generate a different version and deploy a different Octopus Deploy project. Luckily our versions are based off the current time, so it wasn’t like the version kept going up and down (because of the difference in the number of builds), but there was the occasional conflict when two TeamCity tasks just happened to build on two different Build Agents within a few seconds of each other (which would generate identical packages).

The bigger issue was that each package was over 100MB! Logstash is a good 80MB by itself and the JRE is another 40MB, so once you add in a MB or two for everything else, your package is huge.

Yes, technically we could deal with this issue by making sure Java and Logstash are installed on the target machines ahead of time, but I don’t like this from an isolation/self-reliance point of view. I want to be able to push a package with the minimal dependencies already in existence on the target machine, ideally just an Octopus tentacle and Powershell (for Windows boxes anyway). Anything else that is required should be contained within the package itself (or, in extreme cases, bootstrapped from somewhere after deployment (but that just moves the problem around slightly)).

Suddenly a checkin to the repository would gum up our CI build agents with a set of relatively unimportant tasks, stopping other builds from progressing and interrupting peoples work.

The easiest (but somewhat fiddly) solution was to split the concept of an installed version of Logstash from the configuration it happened to be running. With this approach we could deploy Logstash to a target machine once and then not have to pay the price of shifting that much data over the network every single time we wanted to alter the config. When we did want to upgrade Logstash, we could simply build a new version and have it deployed in the same way.

The plan was relatively simple. Create one repository for a deployable version of Logstash by itself (making sure to generalise it enough such that you could easily point it at any configuration you wanted) and then split out each configuration into a repository of its own. Whenever Logstash changes, it would be built and published to Octopus, but nothing would be deployed. Each configuration repository would be able to choose to upgrade to the new version (by changing the dependency in source) and then TeamCity would pick up the changes and run the normal build/package/deploy cycle for that configuration.

Executions Are Tiresome

As is almost always the case, coming up with the idea and its execution plan was a lot more fun than actually doing it.

The Logstash repository that we had already was pretty tightly coupled to the way it handled configuration. It actually used the current Octopus Project name during the installation to determine the configuration that it should be running, and each configuration really only consisted of a single Logstash conf file.

The first task was to generalise the Logstash installation, so that we could deploy it separately and then have the configuration projects use it via a known interface. Nothing particularly interesting here from a design standpoint, just a few Powershell functions: Execute-Logstash, Install-LogstashService (and its counterpart, Remove-LogstashService) and some helpers for generating configuration files based on templates (because sometimes during deployment you need to be able to substitute some deployment specific values into your configuration, like AWS Instance Id).
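
The template helpers are really just token replacement. Something along these lines (the function name and the token format are illustrative):

function New-ConfigurationFromTemplate
{
    param
    (
        [string]$templatePath,
        [string]$destinationPath,
        [hashtable]$substitutions # e.g. @{ "@@INSTANCE_ID@@" = $instanceId }
    )

    # Read the template, replace each token with its deployment specific value
    # and write the result out as the configuration Logstash will actually run.
    $content = Get-Content -Path $templatePath -Raw
    foreach ($token in $substitutions.Keys)
    {
        $content = $content.Replace($token, [string]$substitutions[$token])
    }
    Set-Content -Path $destinationPath -Value $content
}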

The next task was taking one of the current configuration files and converting it into the new pattern, a repository of its own. This repository would need to contain everything necessary for the configuration of a Logstash instance, plus some tests to verify that the config file works as expected when given a set of known inputs.

It’s not overly easy to test a Logstash configuration, especially when it has never been tested before. Like all code, you need to be able to substitute certain values (like the location of the log files to read) and then figure out a way to measure the outcome, without changing the actual configuration too much. The approach I settled on was to parameterise the log locations like I mentioned above and to add an additional output during testing that wrote everything to a file. That way I could read the file in and check to make sure that it was outputting as many lines as I expected.
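
In practice the test ends up being a fairly simple Pester test, along the lines of the sketch below (the sample file, the New-TestConfiguration helper and the parameters on Execute-Logstash are all illustrative):

Describe "Logstash configuration" {
    It "outputs one event per line of known input" {
        # Put a known sample log file where the (parameterised) file input will look.
        $inputDirectory = "$env:TEMP\logstash-test\input"
        $outputFile = "$env:TEMP\logstash-test\output.txt"
        New-Item -Type Directory -Path $inputDirectory -Force | Out-Null
        Copy-Item ".\tests\samples\sample.log" "$inputDirectory\sample.log"

        # Generate a test version of the configuration, pointing the input at the
        # sample and adding the extra file output so the results can be inspected.
        $configPath = New-TestConfiguration -InputDirectory $inputDirectory -OutputFile $outputFile

        # Run Logstash over the test configuration for long enough to process everything.
        Execute-Logstash -ConfigurationPath $configPath -TimeoutSeconds 120

        # The sample file contains 100 lines, so we expect 100 events in the output.
        @(Get-Content $outputFile).Count | Should Be 100
    }
}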

The last task was to rebuild the Octopus Deploy project for the configuration to deploy both the Logstash deployable component and the configuration and verify that it installed and ran correctly on deployment. The most difficult part here was that different versions of each component were required, so we had to extend our Octopus scripts to handle this properly (i.e. step 1 which deploys Logstash needs to know that it should deploy version 1.5.2, but step 2 needs to deploy version 1.0.15235.12789 of the actual configuration).
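
One way to express this, if releases are being created via octo.exe (which is what the TeamCity Octopus steps use under the covers), is to pin a package version per step when the release is created. A rough sketch, where the step names, project name and server details are all illustrative:

# Create a release where each step gets an explicitly pinned package version.
& .\octo.exe create-release `
    --project "Logstash - ELB Logs Processor" `
    --package "Deploy Logstash:1.5.2" `
    --package "Deploy Configuration:1.0.15235.12789" `
    --server "http://octopus.example.com" `
    --apiKey $octopusApiKey `
    --deployto "CI"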

Conclusion

I really should have structured our Logstash deployments in this way from the start. It’s almost always better to separate configuration from application code, especially when you don’t control the application. Often you will find that configuration changes a lot more than the code does, and when the code itself is quite large (as is the case with Logstash and its dependencies) it can get quite painful shifting all of those bytes around for no real reason.

But, alas, you can never see into the future with any sort of clarity, and you need to be willing to change things that you’ve done in the past when problems are identified or a better solution comes along.

Can’t get sentimental about these things.


So, we have all these logs now, which is awesome. Centralised and easily searchable, containing lots of information relating to application health, feature usage, infrastructure utilization along with many other things I’m not going to bother to list here.

But who is actually looking at them?

The answer to that question is something we’ve been struggling with for a little while. Sure, we go into Kibana and do arbitrary searches, we use dashboards, we do our best to keep an eye on things like CPU usage, free memory, number of errors and so on, but we often have other things to do. Nobody has a full time job of just watching this mountain of data for interesting patterns and problems.

We’ve missed things:

  • We had an outage recently that was the result of a disk filling up with log files because an old version of the log management script had a bug in it. The free disk space was clearly trending down when you looked at the Kibana dashboard, but it was happening so gradually that it was never really brought up as a top priority.
  • We had a different outage recently where we had a gradual memory leak in the non-paged memory pool on some of our API instances. Similar to above, we were recording free memory and it was clearly dropping over time, but no-one noticed.

There have been other instances (like an increase in the total number of 500s being returned from an API, indicating a bug), but I won’t go into too much more detail about the fact that we miss things. We’re human, we have other things to do, it happens.

Instead, let’s attack the root of the issue: the human element.

We can’t reasonably expect anyone to keep an eye on all of the data hurtling towards us. It’s too much. On the other hand, all of the above problems could have easily been detected by a computer; all we need is something that can do the analysis for us, and then let us know when there is something to action. It doesn’t have to be incredibly fancy (no learning algorithms…yet), all it has to do is be able to compare a few points in time and alert off a trend in the wrong direction.

One of my colleagues was investigating solutions to this problem, and they settled on Sensu.

Latin: In The Sense Of

I won’t go into too much detail about Sensu here, because I think the documentation will do a much better job than I will.

My understanding of it, however, is that it is a very generic, messaging based check/handle system, where a check can be almost anything (run an Elasticsearch query, go get some current system stats, respond to some incoming event) and a handler is an arbitrary reaction (send an email, restart a server, launch the missiles).

Sensu has a number of components, including servers (wiring logic, check -> handler), clients (things that get checks executed on them) and an API (separate from the server). All communication happens through RabbitMQ and there is some usage of Redis for storage (which I’m not fully across yet).

I am by no means any sort of expert in Sensu, as I did not implement our current proof of concept. I am, however, hopefully going to use it to deal with some of the alerting problems that I outlined above.

The first check/handler to implement?

Alert us via email/SMS when the available memory on an API instance is below a certain threshold.

Alas I have not actually done this yet. This post is more going to outline the conceptual approach, and I will return later with more information about how it actually worked (or didn’t work).

Throw Out Broken Things

One of the things that you need to come to terms with early when using AWS is that everything will break. It might be your fault, it might not be, but you should accept right from the beginning that at some point, your things will break. This is good in a way, because it forces you to not have any single points of failure (unless you are willing to accept the risk that they might go down and you will have an outage, which is a business decision).

I mention this because the problem with the memory in our API instances that I mentioned above is pretty mysterious. It’s not being taken by any active process (regardless of user), so it looks like a driver problem. It could be one of those weird AWS things (there are a lot), and it goes away if you reboot, so the easiest solution is to just recycle the bad API instance and move on. It’s already in an auto-scaling group for redundancy, and there is always more than one, so it’s better to just murder it, relax, and let the ASG do its work.

Until we’re comfortable automating that sort of recycling, we’ll settle for an alert that someone can use to make a decision and execute the recycle themselves.

By installing the Sensu client on the machines in question (incorporating it into the environment setup itself), we can create a check that allows us to remotely query the available free memory and compare it against some configured value that we deem too low (let’s say 100MB). We can then configure two handlers for the check result, one that emails a set of known addresses and another that does the same via SMS.
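
As I understand it, a Sensu check on the client side is just a script or command that writes a line of output and uses Nagios style exit codes (0 for OK, 1 for warning, 2 for critical), so the check itself can be a tiny bit of Powershell. A sketch of what I have in mind, where the thresholds and the output format are assumptions:

param
(
    [int]$warningThresholdMB = 200,
    [int]$criticalThresholdMB = 100
)

# Available MBytes is the physical memory immediately available for allocation,
# which is the number we actually care about for this alert.
$freeMB = (Get-Counter '\Memory\Available MBytes').CounterSamples[0].CookedValue

if ($freeMB -lt $criticalThresholdMB)
{
    Write-Host "CheckMemory CRITICAL: $freeMB MB available"
    exit 2
}
elseif ($freeMB -lt $warningThresholdMB)
{
    Write-Host "CheckMemory WARNING: $freeMB MB available"
    exit 1
}

Write-Host "CheckMemory OK: $freeMB MB available"
exit 0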

Seems simple enough. I wonder if it will actually be that simple in practice.

Summary

Alerting on your aggregated information (logs, stats, etc.) is a pretty fundamental ability that you need to have.

AWS does provide some alerting in the form of CloudWatch alarms, but we decided to investigate a different (more generic) route instead, mostly because of the wealth of information that we already had available inside our ELK stack (and our burning desire to use it for something other than pretty graphs).

As I said earlier, this post is more of an outline of how we plan to attack the situation using Sensu, so it’s a bit light on details I’m afraid.

I’m sure the followup will be amazing though.

Right?


In my recent adventures with trying to create and maintain Windows Scheduled Tasks that execute Powershell, I’ve discovered a need for logging exactly what happens in the scripts. Mostly this is due to the fact that getting the output from a scheduled task (beyond the exit code) is a massive pain, but I’ve wanted a good Powershell logging system/framework for a while now. I mean, it’s basically code, and I would never write code without good logging, so this shouldn’t be an exception.

Luckily, Powershell offers a mechanism for logging, usable via the X-Transcript commands.

Unluckily, it’s a bit crap.

Transcribing

Powershell offers two cmdlets for dealing with transcription. Start-Transcript and Stop-Transcript.

When you start a transcript, you can optionally supply a filepath, and the transcription will take place there. If not, it will create a file for you and tell you where it is.

As for what is actually transcribed, you get all of the Output of the script plus errors and other exceptions. You also get some meta information, including times and the script being executed.

All of this sounds pretty perfect for logging on the surface. Just start a transcript at the beginning of your script and stop it at the end. So I tried it.
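
The basic usage is about as simple as it gets (the path here is just an example):

# At the start of the script. The path is optional; without it Powershell
# picks a location for you and tells you where it is.
Start-Transcript -Path "C:\logs\my-scheduled-task.log" -Append

Write-Output "This ends up in the transcript."

# At the end of the script (ideally in a finally block so it always runs).
Stop-Transcript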

It was…okay. Once I figured out that it doesn’t automatically create missing directories in the path supplied (it doesn’t fail, it just does everything really really slowly), that permissions can really ruin its day (same sort of thing, no failures, just super slow while it fails all of its writes) and that if you have square braces in your path it fails miserably with the least helpful error message ever, it at least created a record of what was happening. Sort of.

Okay, It was terrible and it made me sad.

Apart from the weird behaviour with directories/permissions/braces, transcribing does not record the Verbose or Debug streams. This is kind of annoying, because that’s where I put the majority of my log statements (Output is bad because it messes up stream processing and return values, and Host is just generally not advised). I believe later versions of transcribing can record additional streams, but I didn’t look into it any further, because I was already annoyed with the feature.

I also hated the output it created. If I’m looking at a log file, at minimum I want timestamps, severity and messages. For a language/framework with a mature logging tool (like .NET with log4net or Serilog) I expect a lot more, but for a simple log file, those three things I listed are enough.

Time to try an alternate approach. I hesitate to say better, but honestly, I was pretty unimpressed with transcription, so it seemed very possible.

Redirection

Powershell offers the ability to redirect the various available streams using the standard Windows command line redirection operators, which seems like it would be the easiest way to create a log file. Simply redirect the streams you’re interested in to a file (pretty much all of them) and Bob’s your uncle.
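
Within Powershell itself, that looks something like the sketch below (the script name and paths are illustrative; the numbered streams are error (2), warning (3), verbose (4) and debug (5)):

# Merge the error, warning, verbose and debug streams onto the success stream,
# then send the lot to a file.
.\Do-Work.ps1 -Verbose 2>&1 3>&1 4>&1 5>&1 > "C:\logs\do-work.log"

# Or, in theory, redirect everything in one go.
.\Do-Work.ps1 -Verbose *> "C:\logs\do-work.log"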

In my case, I had a lot of issues getting this to work. Because I was running Powershell from a Windows Scheduled Task, using redirection didn’t seem to work. It turns out that one of my mistakes was assuming you could redirect the output from Powershell.exe, which you don’t seem to be able to do. Instead the redirection appears to be primarily for commands within Powershell.

I tried wrapping my Powershell in more Powershell (which actually did the redirection, as well as setting up the file name via the current timestamp and ensuring the appropriate directories existed), and while it seemed to work, it still wasn’t good enough. No timestamps on the lines in the file left me with no real information as to the time that elapsed during the script. Additionally, the syntax for getting the redirection to work properly was a bit strange, with the redirect-everything options seemingly not working as expected (9>&1 didn’t seem to work, and neither did *>&1).

It was still better than transcribing though.

It was time to go deeper.

Interception

Powershell is an interesting language. Coming from a heavy .NET/C# background, I was initially confused by Powershell and its untyped variables and extremely loose, interpreted syntax. I still definitely prefer C#, but Powershell has its place in my environment, though only when it’s supported by tests using Pester.

One of the wonderful things Powershell lets you do is function overriding.

You can create an implementation of a function, and have that implementation be used within a limited scope, instead of the default implementation.
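
As a trivial illustration of the idea (the path is made up, and this is not the final implementation, which is further below):

function Write-Warning
{
    param([string]$Message)

    # Inside the scope that defines this function, any call to Write-Warning
    # runs this implementation instead of the built-in cmdlet.
    Add-Content -Path "C:\logs\example.log" -Value "WARNING: $Message"
}

Write-Warning "This goes to the file, not the warning stream."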

You might be able to see where I’m going with this.

All of my logging was already in place via a combination of Write-Verbose, Write-Debug and Write-Warning (and the other write functions as well, just more rarely). All I needed to do was intercept those calls and write the messages to some file instead of whatever the default did.

It was easy enough to implement a file containing implementations of all of the Write-X cmdlets, each of which redirects the output to a file. I could easily add timestamps and any other information I wanted, and could control the output format, allowing it to be easily parsed by something else (like Logstash). To use the feature you simply set up a global variable containing the directory to put the log files into and then dot-source the file.
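
Usage ends up being a couple of lines at the top of the script (the file and directory names here are illustrative):

# Tell the logging functions where to put their output, then dot-source the
# file containing the Write-X overrides so they take effect for this script.
$logsDirectoryPath = "C:\logs\scheduled-tasks\my-task"
. ".\scripts\common\Functions-Logging.ps1"

# From this point on, these calls are also written to the daily log file.
Write-Verbose "Starting work"
Write-Debug "Some debug detail"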

The only downside to this was that any function I overwrote in this way no longer worked normally (which is kind of obvious), so during my testing I broke my IDE a number of times, which made development a bit harder than expected. Luckily, I ran across this StackOverflow answer from someone much smarter than me, which showed how to wrap the existing implementations instead of simply replacing them.

function Get-TimestampForLogFileName
{
    return [DateTime]::Now.ToString('yyyyMMdd');
}

function Get-CurrentLogFile
{
    # Works out the path of today's log file, creating it if it doesn't exist yet.
    if ([string]::IsNullOrEmpty($logsDirectoryPath)) { throw "LogsDirectoryPath script scoped variable not set. That's bad, it's used to create log files for Powershell output." }

    $timestamp = Get-TimestampForLogFileName

    $logFilePath = "$logsDirectoryPath\$timestamp.log";
    if (-not(Test-Path $logFilePath))
    {
        $logFile = New-Item -Type File -Path $logFilePath -Force;
        $max = 5
        $current = 0
        while (-not ($logFile.Exists))
        {
            Sleep -Seconds 1
            $current++
            if ($current -gt $max) { break }
        }
    }

    return $logFilePath
}

function Get-TimestampForLogContent
{
    return [DateTime]::Now.ToString('yyyyMMddHHmmss');
}

# Proxy function that shadows the built-in Write-Debug cmdlet. It writes each
# message to the current log file and then steps the original cmdlet so the
# normal behaviour is preserved.
function Write-Debug 
{
    [CmdletBinding()]
    param
    (
       [Parameter(Mandatory=$true, Position=0, ValueFromPipeline=$true)]
       [AllowEmptyString()]
       [System.String]${Message}
    )

    begin 
    {
       try 
       {
           $outBuffer = $null
           if ($PSBoundParameters.TryGetValue('OutBuffer', [ref]$outBuffer))
           {
               $PSBoundParameters['OutBuffer'] = 1
           }
           # Grab the real Write-Debug cmdlet and wrap it in a steppable pipeline
           # so the original behaviour still happens after our logging.
           $wrappedCmd = $ExecutionContext.InvokeCommand.GetCommand('Write-Debug', [System.Management.Automation.CommandTypes]::Cmdlet)
           $scriptCmd = {& $wrappedCmd @PSBoundParameters }
           $steppablePipeline = $scriptCmd.GetSteppablePipeline($myInvocation.CommandOrigin)
           $steppablePipeline.Begin($PSCmdlet)
       } 
       catch 
       {
           throw
       }
    }

    process 
    {
       try 
       {
            # Write the message to our log file first, then let the wrapped
            # Write-Debug do whatever it would normally do with it.
            $logFilePath = Get-CurrentLogFile;
            Add-Content -Path $logFilePath "$(Get-TimestampForLogContent):DEBUG:$message" -Force;
            $steppablePipeline.Process($_)
       } 
       catch 
       {
           throw
       }
    }

    end 
    {
       try 
       {
           $steppablePipeline.End()
       } catch 
       {
           throw
       }
    }
}

I’ve only included the Write-Debug implementation, but all of the other ones look the same.

Finally I had a logging implementation that didn’t make me sad. It wrote reliably to a file, and I could easily use it from within a Windows Scheduled Task (after I started creating an actual .ps1 file for the scheduled task to execute that is).

Conclusion

The logging solution that I ended up putting into place isn’t perfect, but it seems to be a hell of a lot better than the other two approaches that I tried.

It outputs in a format that people are used to seeing, and is easily parsable by both machines and people. If I wanted to, I could import the log information into ELK using Logstash, which would let me keep track of the execution of the scheduled tasks remotely. Honestly that’s probably something I’ll put into place eventually, just not right now.

It was nice to be able to wrap functionality provided by something else without having to have written the code explicitly for that in the first place. Coming from C#, I typically code to interfaces, so I’m used to being able to replace things as necessary, but that requires effort up front. Here I simply replaced some system provided cmdlets and made them do something slightly different based on my requirements. I can’t count the number of times I’ve wanted the ability to do that sort of thing with System.Diagnostics.Trace or Console or DateTime (usually for testing purposes).

Having that sort of power is both amazing and dangerous (amazingly dangerous?). It’s very easy to completely break other, unrelated things when you are essentially replacing the implementation of a global static function.

Ah well, with great power comes great responsibility, I suppose.

Wouldn’t want Uncle Bob to be killed by some rogue code in a tragically ironic accident.


Generally, I’ve been pretty happy with the way that we do log aggregation, using Elasticsearch, Logstash and Kibana (ELK). That’s not to say it couldn’t use improvements in almost every area, but for now, it does what it needs to do. It’s definitely better than what we had before, and people are actually using it to answer questions, which I would consider the mark of a successful implementation.

Of all the components in the triumvirate though, Logstash has easily been the worst. Kibana has been basically problem free (somewhat to be expected for a visualisation framework that delegates the majority of its work to the data store) and Elasticsearch has been complicated and somewhat temperamental (mostly because we haven’t had the time nor experience to tune it properly), but Logstash has been a pain.

Version 1.4 flat out did not work properly on Windows with file based inputs, as it didn’t properly remember its position in the files it was reading. Version 1.5.0 fixed that problem, but we had to suffer through a number of beta issues because we were forced to use the beta version. Even after the 1.5 release we’ve had other issues relating to TCP outputs consuming massive amounts of CPU (kind of annoying on CPU credit limited instances), using the S3 input from behind a proxy and recovering reliably after its TCP output endpoint was down temporarily.

It’s become a bit of a meme between myself and one of our operations people that whenever something weird happens with logs (even if it’s unrelated to the actual process of log aggregation), it’s probably Logstash’s fault.

When one of our AWS instances responsible for filling API requests became unresponsive a few weeks ago, we joked that it was probably Logstash’s fault.

It turned out that it actually was.

What’s That Thing You Use to Dig Food?

Our APIs all have redundancy built into them via the magic of auto scaling groups and multiple instances. It’s the kind of thing you have to have when you’re running anything in AWS, as there is no guarantee that your instances will be long lived. It’s likely that they will be fine and everything will go swimmingly, but there is no guarantee. Honestly, that’s no different from normal hosting, the information is just available upfront, instead of after a few months when something explodes. Plan for failure, because it will happen.

Anyway, a few weeks ago one of our API instances stopped doing the thing it was supposed to be doing, i.e. filling API requests, so it was quietly removed from the load balancer. The alarms we have setup for number of healthy instances kicked in and we put the failing instance into standby mode (which spun up a new one to take its place) and started investigating.

Looking at the logs in Kibana, the instance started returning 500’s sporadically early in the morning and then got progressively worse until it was removed from the load balancer a little after 0800. The root cause of the 500’s? OutOfMemoryExceptions, the API process literally could not allocate a piece of memory large enough to do the thing it wanted to do.

My first response was to make that deep guttural noise that only comes from software developers when they know what’s about to happen. A mixture between a groan and a resigned sigh, tinged with frustration.

I’ve dealt with memory leaks before.

They are terrible, and I hate them.

I assumed that we had a memory leak in our deployed .NET application. Although .NET is a managed language, it is surprisingly easy to leak memory, with various ways to root vast amounts of information, stopping the garbage collector from recovering it.

The best tool that I’ve found for visualizing memory leaks in .NET is the SciTech .NET Memory Profiler. Honestly though, other than Visual Studio itself, it’s basically the only one I’ve really used, so I can’t strictly say that it’s better than anything else. I can say that it is good at what it does, because it’s helped me locate and fix quite a few memory leaks.

Being that the API had been running uninterrupted for a few weeks, the memory leak would be slow. It might even be related to specific requests that only happen rarely, or even worse, it might be time based, rather than event based. I knew I would need to either reproduce the problem locally (which might be quite hard) or find some way to get the memory profiler tool into the environment where the problem occurred and then wait until it happened again.

Like I said, I hate memory leaks.

Surprise! It Wasn’t Us

Remoting on to the AWS instance with the memory issue showed an entirely unexpected situation.

Our .NET process wasn’t using much memory at all.

The Java process, however, was using 1.5GB of physical memory and a huge chunk of virtual memory. The only piece of Java code running? Logstash.

Now I had a different problem. At least with a memory leak within our software I could trace it, find it and fix it. I know .NET, I’ve done memory leak analysis before, and while it’s terrible, it’s a known problem. It might take me a while to get to the bottom of it, but I would get there.

Logstash runs Ruby via JRuby, which runs inside Java. It’s open source, but I wouldn’t even know where to begin with that mess. How could I find out why it was using so much memory for the apparently simple task of processing some text files? None of our Logstash configs are fancy; they just process text files and shunt the results to another instance of Logstash, which puts them into Elasticsearch.

Clearly, The Problem Is Solved

At this particular point in time, I do not have the luxury of getting to the root cause of problems with supplementary components. This makes me sad inside, but that’s how reality rolls sometimes.

The easiest solution right now? Restart the Logstash service every day.

Easy enough to implement with a Scheduled Task that executes Powershell.
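
The guts of it is a one line service restart; registering it as a daily task is only a few more lines on top of that. A rough sketch (the service name, task name, script path and time are assumptions, and Register-ScheduledTask needs Server 2012 or later):

# Restart-Logstash.ps1 - the script the scheduled task actually runs.
Restart-Service -Name "logstash" -Force

# Registering the daily restart (run once, during environment setup/deployment).
$action = New-ScheduledTaskAction -Execute "Powershell.exe" -Argument "-ExecutionPolicy Bypass -File C:\scripts\Restart-Logstash.ps1"
$trigger = New-ScheduledTaskTrigger -Daily -At 3am
Register-ScheduledTask -TaskName "Restart Logstash" -Action $action -Trigger $trigger -User "NT AUTHORITY\SYSTEM"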

I did have one incredibly weird issue with the task. The task name that I generated for the service restart had square braces in it. As a result, so did the folder where the Powershell task puts its transcribed output. The Start-Transcript command does not like square braces in its output file path for some reason, and it fails with 'Start-Transcript: The host is not currently transcribing', which is incredibly helpful. This particular quirk took far too long to get to the bottom of.

Long term we need to look into the issue in depth. Maybe it’s a problem in the particular version of Logstash we are using (1.5.2), maybe it’s our configuration or maybe it’s the environment. I don’t know, and that particular investigation could be a gigantic time sink, so we’ll have to dedicate some time to it later on.

Another solution would be to not use Logstash directly on the machines generating the logs, and to instead use the Logstash Forwarder, which is a lightweight replacement for Logstash that only forwards events to a remote location. The remote location is typically an actual Logstash instance which has all the logic to parse the event into a meaningful format. To me this just moves the issue around, but it still might be an option.

Conclusion

This is actually the first case that I can remember where a memory leak wasn’t the fault of the software I was actively working on. I call it a memory leak because I can’t conceive how an apparently simple process like Logstash could use so much memory over time, but it might just be what it does. I mean, I wouldn’t consider SQL Server using massive amounts of memory a leak per se (because I understand what it does is hard, and it consumes more memory based on the size of the data), it’s just really weird that a process which should be entirely transient (read, transform, write, discard) could keep using more and more memory over time without there being some sort of bug.

To be honest, I’m extremely unhappy with installing a scheduled task to restart the Logstash service. It’s one of those bandaid fixes that you just know will end up biting you hard at some point in the future. It will probably cause an entirely worse issue, like massive, unexpected data loss or something.

For now, at least I know that the Logstash process will not stop our APIs from working, which I guess is a victory.

A Pyrrhic victory.