As I’ve said repeatedly before, logging is easily one of the most important things that come out of a piece of software. Good logging can be used for all sorts of purposes, but most importantly, it makes maintaining and supporting a piece of software much, much easier. I won’t go into any more detail about how important I think logging is right now, as I’m going to use this post to talk about dealing with the bits and pieces that logging often leaves behind.

Mostly that means log files.

In our environment setup, we have drifted towards a pattern where all log files are written into a common logs directory, with subdirectories for each component.
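
Purely as an illustration (the component names here are made up, but this is the general shape), the layout ends up looking something like this:

C:\logs
    iis\
    performance-counters\
    proxy\
    web-service\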

If we want to find anything out about our components, we can just go to the logs directory. Nice and easy, no need to waste time hunting through the file system for the logs that matter. Of course, it’s only used as a backup, as all of our logs are aggregated into ELK, but it’s still useful.

At this point in time, most of our components simply create log files on disk and write everything there. It’s then up to an external component (Logstash in most cases, NXLog in very rare cases) to take those files, do something to them and write the events to some other location for aggregation. We have log files being created by Squid proxies, performance counters, web services, IIS, Windows services and Powershell scripts, and probably many other things that aren’t occurring to me right now.

Of course, the downside of this extensive logging is that you are constantly consuming more and more disk space. Some logging frameworks and mechanisms manage that for you (either with a maximum directory size or some maximum number of days of log files), but not all.

A great side effect of having all logs go to the same directory is that it’s easy to write something to clean up that directory based on some retention policy (not written to in the last 14 days, for example).
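
A minimal sketch of the idea is a single pipeline (the full script, and the trouble it caused, is below):

Get-ChildItem C:\logs\* -Include *.log -Recurse | Where-Object { $_.LastWriteTime -lt (Get-Date).AddDays(-14) } | Remove-Item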

So we did just that.

Scheduled Annihilation

Every environment that we create has two Powershell scripts executed on all machines by default.

The first sets up the logs directory at the common location, ensuring that it is writable by all appropriate users, and that all subdirectories and files are also appropriately writable.

We implemented this because when we set up a Performance Counter to write a log file containing system stats, it created the directory with an owner of System, and everything else just failed miserably from that point forward, completely breaking the common log location right off the bat. It was, for some reason, stupidly difficult to claim the folder back and fix its permissions as well, so we just implemented a script to configure the folder ahead of time.

function InitialiseCommonLogsDirectory
{
    $directoryPath = "C:\logs"
    if (-not (Test-Path $directoryPath))
    {
        Write-Verbose "Creating common logs directory at [$directoryPath]"
        New-Item $directoryPath -Type Directory 
    }

    Write-Verbose "Making common logs directory [$directoryPath] writeable by everyone."
    $Acl = Get-Acl $directoryPath

    # Grant Everyone full control, inherited by all subdirectories and files created underneath.
    $Ar = New-Object System.Security.AccessControl.FileSystemAccessRule("Everyone", "FullControl", "ContainerInherit,ObjectInherit", "None", "Allow")

    $Acl.SetAccessRule($Ar)
    Set-Acl $directoryPath $Acl
}

The second script installs a Scheduled Task to run a fairly simple piece of Powershell that just deletes all files that haven’t been written to in the last 14 days. The assumption was that the physical files on disk were relatively unimportant as long as the log aggregation was working correctly, but we keep a small window just in case.

Annoyingly, the commands for setting up a scheduled task via Powershell are a bit arcane, mostly because you have to resort to the Task Scheduler COM API (Schedule.Service) to get the job done. The script to set up the scheduled task looked like this:

function CreateLogsClearingTask
{
    $directoryPath = "C:\logs"

    $taskName = "Clear Old Logs"

    $service = new-object -ComObject("Schedule.Service")
    $service.Connect()
    $rootFolder = $service.GetFolder("\")
    try
    {
        $rootFolder.DeleteTask($taskName, 0)
    }
    catch
    {
        if (-not ($_.Exception.Message -like "*The system cannot find the file specified*"))
        {
            throw $_.Exception
        }
    }

    $daysOld = 14
    $taskDescription = "Clears Logs that are older than [$daysOld] days"
    $taskCommand = "powershell"
    $taskArguments = "-Command `"&{ Get-ChildItem $directoryPath\logs\* -Include *.log -Recurse | Where LastWriteTime -lt (Get-Date).AddDays(-$daysOld) | Remove-Item }`" -ExecutionPolicy Bypass"
 
    write-verbose "Creating Scheduled Task to automatically execute [$taskCommand] with arguments [$taskArguments] on a regular basis."

    $TaskStartTime = [DateTime]::Now.Date.AddHours(1)
 
    $TaskDefinition = $service.NewTask(0) 
    $TaskDefinition.RegistrationInfo.Description = "$taskDescription"
    $TaskDefinition.Settings.Enabled = $true
    $TaskDefinition.Settings.AllowDemandStart = $true
 
    $triggers = $TaskDefinition.Triggers
    #http://msdn.microsoft.com/en-us/library/windows/desktop/aa383915(v=vs.85).aspx
    $trigger = $triggers.Create(2) # 2 = TASK_TRIGGER_DAILY
    $trigger.StartBoundary = $TaskStartTime.ToString("yyyy-MM-dd'T'HH:mm:ss")
    $trigger.Enabled = $true
    $trigger.DaysInterval = 1
 
    # http://msdn.microsoft.com/en-us/library/windows/desktop/aa381841(v=vs.85).aspx
    $Action = $TaskDefinition.Actions.Create(0) # 0 = TASK_ACTION_EXEC
    $action.Path = "$taskCommand"
    $action.Arguments = "$taskArguments"
 
    #http://msdn.microsoft.com/en-us/library/windows/desktop/aa381365(v=vs.85).aspx
    $rootFolder.RegisterTaskDefinition("$TaskName",$TaskDefinition,6,"System",$null,5) # 6 = TASK_CREATE_OR_UPDATE, 5 = TASK_LOGON_SERVICE_ACCOUNT

    $task = $rootFolder.GetTask($taskName)

    Write-Verbose "Executing Scheduled Task."
    $task.Run($null)
}

Not that friendly.

You may be thinking: why not just use the awesome scheduled task cmdlets available in later versions of Windows (New-ScheduledTaskTrigger, Get-ScheduledTask, Register-ScheduledTask and friends)? These commands are extremely useful and easy to understand, but unfortunately, in our particular case, we needed to target operating systems where those features are not available. Specifically, our build server (which runs the automated tests for the Powershell scripts) is running an older version of Windows Server, where the cmdlets are unavailable. Sometimes you just have to target the lowest common denominator.
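
For comparison, on an operating system where the ScheduledTasks module is available, the same daily task is only a few lines. A rough sketch (the script path here is hypothetical):

# Hypothetical example: these cmdlets only exist on newer versions of Windows/Windows Server.
$action = New-ScheduledTaskAction -Execute "powershell.exe" -Argument "-ExecutionPolicy Bypass -File C:\scripts\Clear-OldLogs.ps1"
$trigger = New-ScheduledTaskTrigger -Daily -At 1am
Register-ScheduledTask -TaskName "Clear Old Logs" -Action $action -Trigger $trigger -User "SYSTEM" -Description "Clears logs that are older than 14 days"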

Straightforward, right? I mean, it looks like it does the job.

It did not.

Logs! Logs Everywhere!

We didn’t notice for a pretty long time, because the growth was tiny on our non-production environments, but once we started to get some real traffic, the log files for IIS and the Web Service started to add up quickly. Luckily, we have dashboards that track the disk usage of our various server components, and someone happened to notice the percentage of free space on those machines dropping steadily over time.

A closer look at the script above will show you that it was failing because it was trying to clean up C:\logs\logs, which of course doesn’t exist. The directory variable had C:\logs already in it, and the Powershell script was appending another \logs to it before it was being run.

It turns out that the task had been failing continuously, pretty much since we implemented it, and because the leading indicator of failure (a failing task) was missing, no one noticed. The task reported success because Powershell returns an exit code of 0 even when an exception is thrown.

I cleaned up the script, wrote a test to verify that it was actually deleting old files (without error), incorporated the fixed scripts into our environment setup and deployed the fix to the live machines.

Once deployed, the task said it succeeded, but looking at the log files, they hadn’t actually been cleaned up at all. It had failed silently again.

Powershell and Failure: Close Friends

The root cause of the task not indicating failure correctly was that the Powershell executable, for some ungodly reason, returns an exit code of 0 even when an exception is thrown. If you want it to act like a normal application (non-zero exit code = bad, zero = good), you have to wrap your script in a try…catch at the root and exit with a non-zero code yourself.
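
The pattern itself is tiny; a sketch of the kind of command line the task ends up executing (the script path is hypothetical):

# The important part is the try/catch at the root and the explicit non-zero exit code on failure.
powershell.exe -Command "& { try { & 'C:\scripts\Clear-OldLogs.ps1' } catch { exit 1 } }"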

A little bit more refactoring around the Powershell scripts that create the scheduled task, and I had it automatically wrapping the supplied script in a try…catch and exiting with a failure code if an exception occurred. I’ve included the script that creates the task below, but it’s only part of a bigger system (which you can see in the Solavirum.Scripts.Common repository, when I finish updating it).

function New-DailyPowershellScheduledTask
{
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$taskName,
        [string]$taskDescription,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$script,
        [TimeSpan]$TimeOfDay="0:00",
        [switch]$ExecuteImmediately=$true
    )

    $service = new-object -ComObject("Schedule.Service")
    $service.Connect()
    $rootFolder = $service.GetFolder("\")

    try
    {
        $command = "powershell.exe"
        $arguments = _ConvertPowershellScriptIntoArgumentsForScheduledTask -script $script -taskName $taskName

        $TaskStartTime = [DateTime]::Now.Date.Add($TimeOfDay)
 
        $TaskDefinition = $service.NewTask(0) 
        $TaskDefinition.RegistrationInfo.Description = "$taskDescription"
        $TaskDefinition.Settings.Enabled = $true
        $TaskDefinition.Settings.AllowDemandStart = $true
 
        $triggers = $TaskDefinition.Triggers
        #http://msdn.microsoft.com/en-us/library/windows/desktop/aa383915(v=vs.85).aspx
        $trigger = $triggers.Create(2) # 2 = TASK_TRIGGER_DAILY
        $trigger.StartBoundary = $TaskStartTime.ToString("yyyy-MM-dd'T'HH:mm:ss")
        $trigger.Enabled = $true
        $trigger.DaysInterval = 1
 
        # http://msdn.microsoft.com/en-us/library/windows/desktop/aa381841(v=vs.85).aspx
        $Action = $TaskDefinition.Actions.Create(0) # 0 = TASK_ACTION_EXEC
        $action.Path = "$command"
        $action.Arguments = "$arguments"
 
        #http://msdn.microsoft.com/en-us/library/windows/desktop/aa381365(v=vs.85).aspx
        $rootFolder.RegisterTaskDefinition("$TaskName",$TaskDefinition,6,"System",$null,5) # 6 = TASK_CREATE_OR_UPDATE, 5 = TASK_LOGON_SERVICE_ACCOUNT

        $task = $rootFolder.GetTask($taskName)

        if ($ExecuteImmediately)
        {
            Write-Verbose "An immediate execution of the just created scheduled task [$taskName] was requested. Executing now."
            $task.Run($null)
        }
    }
    finally
    {
        [System.Runtime.Interopservices.Marshal]::ReleaseComObject($service)
    }
}

function _ConvertPowershellScriptIntoArgumentsForScheduledTask
{
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$script,
        [string]$taskName
    )

    $taskLogsDirectoryName = "$($taskName.Replace(' ', '_'))"
    $rootLogsDirectoryPath = "C:\logs\scheduled-tasks"
    $logsDirectoryPath = "$rootLogsDirectoryPath\$taskLogsDirectoryName"
    $logDirectory = New-Item -ItemType Directory -Path $logsDirectoryPath -Force
    # Note: -ExecutionPolicy has to come before -command, because everything after -command is treated as part of the command itself.
    return "-ExecutionPolicy Bypass -command `"& { try { `$timestamp = [DateTime]::Now.ToString('yyyyMMdd'); Start-Transcript -Path $logsDirectoryPath\`$timestamp.log -Append; $script } catch { Stop-Transcript; exit 1; } }`""
}

Now that I had fixed the issue with it failing to fail in an obvious way, it was time to add some logging, so I could see why it wasn’t deleting the logs.

Using the code that wraps the incoming script in a try…catch, I added a call to Start-Transcript at the start of the script, and a call to Stop-Transcript at the end. Start-Transcript is a useful feature built into Powershell that records everything interesting that happens during the transcript block to a file. It’s not perfect (it doesn’t seem to record Verbose statements), but it does record the output stream and any errors and warnings, which is enough.

Be careful: if you supply Start-Transcript with a file path inside a directory that does not exist, it appears to just fail silently. I’m pretty sure I tried the -Force flag (which usually deals with these things), but it didn’t seem to help. Make sure your destination directory exists before you try to start the transcript.
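
A sketch of the create-the-directory-first approach (the paths here are just examples):

# The directory and file names are illustrative; the point is to create the directory before starting the transcript.
$transcriptDirectoryPath = "C:\logs\scheduled-tasks\Clear_Old_Logs"
New-Item -ItemType Directory -Path $transcriptDirectoryPath -Force | Out-Null
Start-Transcript -Path "$transcriptDirectoryPath\$([DateTime]::Now.ToString('yyyyMMdd')).log" -Append
try
{
    # The actual cleanup work would go here.
    Write-Output "doing the work"
}
finally
{
    Stop-Transcript
}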

I made sure that the log files coming out of the scheduled task were appropriately date stamped (so they, too, could in turn be cleaned up) and deployed the updated scheduled task again.

The Root of My Problems

The transcript file showed that I was getting an Access Denied exception when the script was running as a task. Obviously this did NOT happen in my tests, so it was something specific to the deployed environments.

Fair enough, that can happen with scheduled tasks, because they default to running as the System account, which doesn’t necessarily have normal permissions.

The interesting thing was that the root logs directory specifically granted full control to all appropriate users, and that included the System account. Checking the permissions on the directory with Get-Acl proved that to be true. System clearly had the ability to do whatever it liked to the logs directory and its contents.
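
If you want to check for yourself, the permissions are easy enough to dump:

(Get-Acl C:\logs).Access | Format-Table IdentityReference, FileSystemRights, AccessControlType, IsInherited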

So where was the Access Denied error coming from?

Time to go back to debugging basics: Process Monitor (procmon). Process Monitor is one of those awesome Sysinternals tools supplied by Microsoft (well, someone at Microsoft). It lets you monitor every single event happening within Windows, including interactions with the registry, the file system and all sorts of other things. More importantly for the maintenance of sanity, it also lets you filter those events.

Firing up procmon and running the scheduled task showed that it was clearly deleting the very first file it found that was old enough to be deleted, and then stopping. There were no obvious errors coming from the file system; the process just seemed to stop trying. If I ran the task again, the first file had indeed been deleted, and it would delete the next one (which was now the first) and then stop again.

I still have no idea why it was doing this. I suspect it might have had something to do with deleting files during a streaming operation in Powershell (i.e. Get|Where|Delete), but if that were the case, I would think it would happen during my test as well.

I fixed the issue by wrapping the Get|Where in brackets, to make sure it was fully evaluated before anything was streamed to Remove-Item, kind of like using ToArray or ToList in C# when enumerating through a collection you intend to remove values from.
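
In other words, a simplified sketch of the two forms:

# Before: Remove-Item starts receiving (and deleting) files while Get-ChildItem is still enumerating the directory.
Get-ChildItem C:\logs\* -Include *.log -Recurse | Where LastWriteTime -lt (Get-Date).AddDays(-14) | Remove-Item

# After: the brackets force the Get|Where portion to be fully evaluated first, so the deletion
# operates on a complete list of files rather than a live enumeration.
(Get-ChildItem C:\logs\* -Include *.log -Recurse | Where LastWriteTime -lt (Get-Date).AddDays(-14)) | Remove-Item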

Once I made that change everything worked. The files were deleted, the task indicated success and all was well in the world.

Conclusion

There are a number of lessons to come out of this, but the most important is to make sure that your failure indicators actually indicate failure when a failure occurs, which is really just a specific application of “test all the things”.

It’s not enough to just test it yourself either (in a unit test or similar); you must test in the environment that it will actually be deployed to, because the differences will catch you out. In addition to testing, you need to have indicators or events that you monitor in order to make sure everything is acting as you expect. In our case, the monitoring that we had on the machines allowed us to notice the issue, but if no one had been paying attention, we would probably have had issues when the disk finally filled up.

It would have been better to have alerts that warned us when something was trending the wrong way. We probably glossed over that sort of thing because we thought it was a solved problem (after all, we had a script), so I suppose the moral is to question and check everything, even if you think it’s fine.

Because it’s probably not.