The Trouble With Logs Is

May 31. 2016 0 Comments

A long time ago, I wrote a post about fixing up our log management scripts to actually delete logs properly. Logs were being aggregated into our ELK stack and were then being deleted after reaching a certain age. It fixed everything perfectly and it was all fine, forever. The End

Or not.

The log management script itself was very simple (even though it was wrapped in a bunch of Powershell to schedule it via Windows Scheduled Task. It looked at a known directory (C:\logs), found all the files matching *.log, narrowed it down to only those files that had not been written to in the last 7 days and then deleted them. Not exactly rocket science.

As we got more and more usage on the services in question though, the log file generation started to outstrip the ability of the script to clean up.

They Can Fill You Up Inside

The problem was twofold, the log management script was hardcoded to only delete things that hadn’t been touched in the last 7 days and the installation of the scheduled task that ran the script was done during machine initialisation (as part of the cfn-init configuration). Changing the script would require refreshing the environment, which requires a migration (which is time consuming and still not perfect). Not only that, but changing the script for one service (i.e. to delete all logs older than a day), might not be the best thing for other services.

The solution was to not be stupid and deploy log management in the same way we deploy everything, Octopus Deploy.

With Octopus, we could build a package containing the all of the logic for how to deploy a log management solution, and use that package in any number of projects with customised parameters, like retention period, directory, filter, whatever.

More importantly, it would give us the ability to update existing log management deployments without having to alter their infrastructure configuration, which is important for minimising downtime and just generally staying sane.

They Are Stronger Than Your Drive

It was easy enough to create a Nuget package containing the logic for installing the log management scheduled task. All I had to do was create a new repository, add a reference to our common scripts and then create a deploy.ps1 file describing how the existing scheduled task setup script should be called during deployment.

In fact, the only difference from calling the script directly like we were previously, was that I grabbed variable values from the available Octopus parameters, and then slotted them into the script.

With a nice self contained Nuget package that would install a scheduled task for log management, the only thing left to do was create a new Octopus project to deploy the package.

I mostly use a 1-1 Nuget Package to Project strategy when using Octopus. I’ve found that its much easier to get a handle on the versions of components being deployed when each project only has 1 thing in it. In this case, the Nuget package was generic, and was then referenced from at least two projects, one to manage the logs on some API instances and the other to manage logs on a log shipper component (which gets and processes ELB logs).

In order to support a fully automated build, test and deployment cycle, I also created a TeamCity Build Configuration. Like all Build Configurations, it watches a repository for changes, runs tests, packages and then deploys to the appropriate environments. This one was slightly different though, in that it deployed multiple Octopus projects, rather than a single one like almost all of our other Build Configurations. I’m not sure how I feel about this yet (because its different from what we usually do), but its definitely easier to manage, especially since all the projects just use the same package anyway.

Conclusion

This post might not look like much, and to be honest it isn’t. Last weeks post on Expression Trees was a bit exhausting so I went for something much smaller this week.

That is not to say that the concepts covered herein are not important.

One thing that I’ve taken away from this is that if you ever think that installing something just once on machine initialization is enough, you’re probably wrong. Baking the installation of log management into our cfn-init steps caused us a whole bunch of problems once we realised we needed to update them because they didn’t quite do the job. It was much harder to change them on the fly without paying a ridiculous penalty in terms of downtime. It also felt really strange to have to migrate an entire environment just because the log files weren’t being cleaned up correctly.

When it comes down to it, the real lesson is to always pick the path that lets you react to change and deploy with the least amount of effort.