
In my quest to automate all the things, I’ve failed a lot.

A lot.

Typically the failures are configuration related (forgot to do X, didn’t line up the names of Y and Z properly, accidentally used an uninitialized variable (Powershell is pretty “flexible”)), but sometimes they are just transient. The cloud can be like that, and you need to make sure you build with fault tolerance in mind. For example, sometimes an AWS EC2 instance will just take ages to complete its initialization, and your environment will fail to create with a timeout.

In CloudFormation, you can use WaitConditions and WaitHandles to keep track of the setup of the more complicated components of your environment (like EC2 instances). Generally you take the exit code from your initialization on the EC2 instance (for us that is typically cfn-init) and push it through cfn-signal, pointed at a particular WaitHandle. If the exit code is non-zero, CloudFormation considers it a failure and it bubbles up and fails the stack (or whatever your error handling strategy is). If you need to, you can supply a reason as well (a short string), which can help in identifying the root cause of the failure.
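
To make that concrete, here is a minimal sketch of the pattern as it might appear in an instance initialization script (the stack, resource and region values are placeholders, and the cfn-bootstrap install path can vary between AMIs):

# Run cfn-init for this instance's configuration and capture its exit code.
& "C:\Program Files\Amazon\cfn-bootstrap\cfn-init.exe" -v --stack "MyEnvironment" --resource "ApiInstance" --region "ap-southeast-2"
$initExitCode = $LASTEXITCODE

# Signal the WaitHandle. A non-zero exit code marks the WaitCondition (and therefore the stack)
# as failed, and the reason string shows up in the stack events.
$waitHandleUrl = "REPLACE-WITH-THE-WAITHANDLE-URL-PASSED-IN-VIA-THE-TEMPLATE"
& "C:\Program Files\Amazon\cfn-bootstrap\cfn-signal.exe" -e $initExitCode --reason "cfn-init exited with code [$initExitCode]" $waitHandleUrl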

Up until recently, when my environment setups failed, I would report the failure (i.e. this stack failed to initialize) and then automatically clean it up. My scripts would wait to see if the stack status changed to CREATE_COMPLETE within some timeout, and in the case of a failure, simply reported that the stack had failed. If I was executing from a test, typically I would break on the line that does the cleanup, and investigate the root cause of the failure before I let the stack be torn down.
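
The wait itself is nothing fancy. A simplified version of that polling loop looks something like this (the timeout and poll interval are illustrative, and the real scripts wrap this sort of thing in common functions):

$timeout = [DateTime]::UtcNow.AddMinutes(30)
do
{
    Start-Sleep -Seconds 30
    $stack = Get-CFNStack -StackName $stackId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
    $status = $stack.StackStatus.Value
    Write-Verbose "Stack [$stackId] is currently in status [$status]"
}
while ($status -eq "CREATE_IN_PROGRESS" -and [DateTime]::UtcNow -lt $timeout)

if ($status -ne "CREATE_COMPLETE")
{
    throw "Stack [$stackId] did not reach CREATE_COMPLETE within the timeout. Last status was [$status]."
}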

Now that I’m automatically executing environment creation/destruction through TeamCity, that kind of approach doesn’t really cut it.

If I arrive at work in the morning and see that the CI environment for service A failed to automatically re-create I need to know why. The stack will have already been cleaned up, so unless the scripts report as much information as possible, my only real recourse is to run it again locally and see what happened. This only works if the failure is consistent. It might run perfectly the second time, which is incredibly annoying when you’re trying to track down a failure.

It was time to think about failing a little bit harder.

Investigating Failures

When a stack fails I typically do 3 things.

The first is to check to see if the stack failed on creation (i.e. on the actual call to New-CfnStack). These failures are usually related to template issues, so the environment setup scripts already handle this. When New-CfnStack throws an exception it bubbles up to the top and is logged. All you have to do is read the message and it will (hopefully) point you at the root cause.

An interesting fact. If your template file is long-ish (it happened to me when mine was about 1000 lines long, but I’m pretty sure it’s dependent on characters, not lines) then the call to New-CfnStack can and will fail with an incredibly useless error about incorrectly formatted XML. This only happens if you are uploading the content as part of the call, instead of uploading to S3 first. I’m not entirely sure why this happens, but I have a sneaking suspicion that the Powershell/.NET AWS components have some hard limit on the length of the request that isn’t being checked properly. Just upload your template to S3 (I upload to the same place I put my dependencies) and it will work just fine, regardless of the size of the template.
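
The workaround looks something like this (the bucket, key and region are placeholders): upload the template first, then point New-CfnStack at the resulting S3 URL instead of supplying the body directly.

# Upload the template to S3 and create the stack from the URL rather than the raw body,
# which sidesteps the length-related XML error described above.
Write-S3Object -BucketName "my-deployment-bucket" -Key "templates/environment.template" -File ".\environment.template" -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
$templateUrl = "https://s3-ap-southeast-2.amazonaws.com/my-deployment-bucket/templates/environment.template"
$stackId = New-CFNStack -StackName "MyEnvironment" -TemplateURL $templateUrl -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion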

The second is to check the Events attached to the stack for failures, specifically the first one chronologically, as failures typically have a knock on effect. These events are available in the AWS dashboard, under the CloudFormation section. Sometimes this is all it takes, as most of the failure events are pretty descriptive. If you fail to have permissions to do something, this is where it normally comes up, along with some errors that the template verification system won’t find (like an RDS instance using a snapshot but also supplying a DBName).

The third is to use the events to determine which component failed (usually an EC2 instance), remote onto the machine and look at some log files. For Windows EC2 instances there are two log files that are useful, cfn-init and EC2Config.

You can typically find the cfn-init log file at C:\cfn\log\cfn-init.log. It contains the full log trace from everything that happened as a result of the call to cfn-init on that machine. As far as log files go, it’s pretty useful, and I’ve plumbed its depths many times trying to figure out why my initialization was failing.

Second interesting fact, also related to length. If the output from one of your commands in cfn-init is too long, it will fail with an incredibly unhelpful error.

You can typically find the EC2Config log at C:\Program Files\Amazon\EC2Config\logs\EC2Config.log. This one is useful if your cfn-init log file doesn’t have anything in it, which can happen if you’re failing so hard that your EC2 instance can’t even start executing cfn-init. It happens.

The 3 step process I outlined above usually identifies the root cause of the failure, which can then be rectified. Following the process isn’t difficult, but it is annoying, and manual. The last step is especially egregious, requiring me to find the machine/s in question, get their private IPs, be connected to our AWS VPN, remote on, locate the log files, read them, etc.

If I have to leave the confines of my environment script in order to go check why it failed, then I’m losing valuable time and context. It would be much better if the environment setup script detected failures and went and did those things for me.

The Right Kind of Lazy

Essentially, a good lazy programmer is one who doesn’t have the stomach for constantly doing a manual task, and so automates it instead. This frees them up to do more interesting things, and the automated task can then be run with minimal human involvement, hopefully acting as a multiplier for the effectiveness of the team.

In my case, I needed to automate the last two steps above. The first step is already taken care of, because the exceptions thrown from the various calls to the AWS API are logged in TeamCity.

For the second step (interrogating the Events of the Stack), AWS offers APIs for pretty much everything, which is awesome (Octopus is similar, which is equally awesome, maybe more so because they are Windows native). All you need is a call to Get-CfnStackEvent, and then you can filter the resulting events to those that failed, which helps to cut down on the amount of output. The following piece of Powershell demonstrates just that.

function _TryGetFailingStackEvents
{
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$stackId,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsRegion
    )

    try
    {
        if ($rootDirectory -eq $null) { throw "RootDirectory script scoped variable not set. That's bad, it's used to find dependencies." }
        $rootDirectoryDirectoryPath = $rootDirectory.FullName
        $commonScriptsDirectoryPath = "$rootDirectoryDirectoryPath\scripts\common"
        
        . "$commonScriptsDirectoryPath\Functions-Enumerables.ps1"

        $events = Get-CFNStackEvent -StackName $stackId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
        $failingEvents = @($events | Where { $_.ResourceStatus.Value -match "FAILED" })
        if ($failingEvents | Any -Predicate { $true })
        {
            return $failingEvents
        }
        else
        {
            return @()
        }
    }
    catch
    {
        Write-Warning "Could not get events for stack [$stackId]."
        Write-Warning $_
    }
}

Automating the third step is a little more difficult.

As a general rule, all of our EC2 instances can have Powershell remotely executed on them from our secure AWS VPN CIDR. This is already part of our general EC2 setup (security group configuration to allow the traffic, Windows firewall exception to unblock the Powershell remote execution port, etc).

Whenever a failure event occurs that contains a valid EC2 instance ID (identified by a regular expression), the script creates a remote session to the machine using its private IP address and reads the last couple of hundred lines from the log files I mentioned above. You can see the full script below. It actually feeds off the failure events identified in the previous step in order to find the ID of the instance that it needs to connect to.

function _TryExtractLogFilesFromInstanceFailingViaCfnInit
{
    param
    (
        [Parameter(Mandatory=$true)]
        $failingStackEvents,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsRegion,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$adminPassword
    )

    if ($failingStackEvents -eq $null) { return "No events were supplied, could not determine if anything failed as a result of CFN-INIT failure" }

    try
    {
        if ($rootDirectory -eq $null) { throw "RootDirectory script scoped variable not set. That's bad, it's used to find dependencies." }
        $rootDirectoryDirectoryPath = $rootDirectory.FullName
        $commonScriptsDirectoryPath = "$rootDirectoryDirectoryPath\scripts\common"

        $cfnInitFailedIndicator = "CFN-INIT-FAILED"
        Write-Verbose "Attempting to identify and extract information from failure events containing the string [$cfnInitFailedIndicator]"
        $instanceIdRegexExtractor = "(i\-[0-9a-zA-Z]+)"
        $cfnFailureEvent = $failingStackEvents | 
            Where {$_.ResourceStatusReason -match $cfnInitFailedIndicator} | 
            Select -First 1

        if ($cfnFailureEvent.ResourceStatusReason -match $instanceIdRegexExtractor)
        {
            $instanceId = $matches[0];
            Write-Verbose "Found a failure event for instance [$instanceId]"
            Write-Verbose "Attempting to extract some information from the logs from that machine"

            . "$commonScriptsDirectoryPath\Functions-Aws-Ec2.ps1"

            $instance = Get-AwsEc2Instance -InstanceId $instanceId -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion

            $ipAddress = $instance.PrivateIpAddress

            $remoteUser = "Administrator"
            $remotePassword = $adminPassword
            $securePassword = ConvertTo-SecureString $remotePassword -AsPlainText -Force
            $cred = New-Object System.Management.Automation.PSCredential($remoteUser, $securePassword)
            $session = New-PSSession -ComputerName $ipAddress -Credential $cred
    
            $remoteScript = {
                $lines = 200
                $cfnInitLogPath = "C:\cfn\log\cfn-init.log"
                Write-Output "------------------------------------------------------"
                Write-Output "Last [$lines] lines from [$cfnInitLogPath]"
                Get-Content $cfnInitLogPath -Tail $lines
                Write-Output "------------------------------------------------------"
                #Get-Content C:\Program Files
            }
            $remotelyExtractedData = Invoke-Command -Session $session -ScriptBlock $remoteScript
            # If you don't do this, then when you do a JSON convert later it spits out a whole bunch of useless
            # information about the machine the line was extracted from, files, etc.
            $remotelyExtractedData = $remotelyExtractedData | foreach { $_.ToString() }
            
            return $remotelyExtractedData
        }
        else
        {
            Write-Verbose "Could not find a failure event about CFN-INIT failing"
            return "No events failed with a reason containing the string [$cfnInitFailedIndicator]"
        }
    }
    catch
    {
        Write-Warning "An error occurred while attempting to gather more information about an environment setup failure"
        Write-Warning $_       
    }
}

I execute both of these functions in the catch block of the main environment setup function. Basically, if anything goes wrong during the environment creation, it tries to gather some additional information to print out to the Warning channel. Sometimes failures occur before you actually manage to create the stack, so it needs to be robust in the face of not having a stack ID. As it is inside the catch block, it also needs to be wrapped in a try…catch itself, otherwise you’re likely to lose the root cause of the error if an unexpected error happens during the information gathering.

if ($stackId -ne $null)
{
    Write-Warning "Environment creation failed. Attempting to gather additional information."
    $failingStackEvents = _TryGetFailingStackEvents -StackId $stackId -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion
    $cfnInitFailureLogs = _TryExtractLogFilesFromInstanceFailingViaCfnInit -failingStackEvents $failingStackEvents -adminPassword $adminPassword -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion
    $customError = @{
        "StackId"=$stackId;
        "Stack"=$stack;
        "FailingStackEvents"=$failingStackEvents;
        "CfnInitFailingInstanceLogs"=$cfnInitFailureLogs;
    }

    $failingDetails = new-object PSObject $customError
    Write-Warning (ConvertTo-Json $failingDetails)
}

With everything above working, you get a nice dump of information whenever an environment fails to create, without having to do anything else, which is nice.

The full source that the snippets above were extracted from will be available at Solavirum.Scripts.Common at some point. The content available there now is out of date, but I need to check through the scripts carefully before I upload them to Github. We all remember what happened last time.

Room for Improvement

At some point in the future I would prefer if the output from CloudFormation (its events) and the various log files from the EC2 instances being provisioned were automatically pumped into our Log Aggregation server (ELK) so that we could analyse them in more detail.

The hard part about this is that the log aggregation would require the installation of something on the machine, and if we’re in a failure state during environment setup, those components are probably not installed. The whole point of the environment setup is to create an environment and deploy some software components into it; needing other components to already be in place just to report failures is probably not going to work.

I’ll keep thinking about it and make a post if I find a good solution.

Conclusion

I hesitated to do this automated investigation for a long time, because I knew it would be finicky and complex.

I shouldn’t have.

Every time an environment setup fails I now get detailed information right in the same window that I was watching for completion messages. It is beyond useful. It also means that whenever an environment refresh fails in the morning before I get to work, I can still investigate to see what happened, even though the stack was already cleaned up.

I should have followed my instincts and automated this kind of pointless busywork earlier.


I mentioned last week that I was struggling with an issue with my environment setup scripts via CloudFormation.

Well, I finally got to the bottom of it, and I thought it would be a good thing to share, because I don’t want anyone else to have to struggle with finding the root cause behind the incredibly unhelpful error that I was getting. Hopefully, my pain becomes your knowledge.

To give a little bit of background, I use a combination of CloudFormation templates, Powershell and Octopus Deploy to setup environments for our products and services in AWS. This approach has proven to be extremely useful for a number of reasons, including codification of what exactly goes into the environment, reducing the amount of time to setup an environment (in comparison to using typical infrastructure requests) and reducing the reliance on one particular person or team who knows how to put it all together.

A little while ago I noticed that the environment setup that I use for my JMeter Load Test Workers was failing to run to completion. Every time I tried to set up the environment, it would fail with a CREATE_FAILED status, stating that one or more of the instances failed their configuration.

Looking into the issue, I found the following error in the cfn-init.log file (available at C:\cfn\log\cfn-init.log).

In case you are unfamiliar with CloudFormation, cfn-init is a program supplied by Amazon that works with CloudFormation, and allows you to configure a series of steps to occur during the startup of an EC2 instance. Typically you use the UserData for an instance to execute cfn-init with the appropriate parameters; it then connects to the meta data service, downloads the configuration described in the CloudFormation (cfn) template and executes it on the local machine. It’s awesome, and I use it a lot.

The relevant excerpt from the log file:

2015-06-12 03:06:02,674 [DEBUG] Running command a-deploy-loadtest
2015-06-12 03:06:02,674 [DEBUG] No test for command a-deploy-loadtest
2015-06-12 03:06:59,846 [INFO] Command a-deploy-loadtest succeeded
2015-06-12 03:06:59,846 [ERROR] -----------------------BUILD FAILED!------------------------
2015-06-12 03:06:59,846 [ERROR] Unhandled exception during build: 'utf8' codec can't decode byte 0xf2 in position 644: invalid continuation byte
Traceback (most recent call last):
    File "cfn-init", line 123, in <module>
    File "cfnbootstrap\construction.pyc", line 117, in build
    File "cfnbootstrap\construction.pyc", line 502, in build
    File "cfnbootstrap\construction.pyc", line 513, in run_config
    File "cfnbootstrap\construction.pyc", line 125, in run_commands
    File "cfnbootstrap\command_tool.pyc", line 116, in apply
    File "encodings\utf_8.pyc", line 16, in decode
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf2 in position 644: invalid continuation byte

Of course, that doesn’t make any sense, as I hadn’t changed anything to do with the encoding of the output. Even more confusingly, when I ran cfn-init using the exact same mechanism (the execution of the UserData batch file), on the exact same machine, it would work just fine. It only failed during instance startup.

I looked into it for a little while, but couldn’t find an obvious root cause. I assumed that I was doing something slightly different from the norm or that something had changed upstream to cause the error. This made sense, because all of the other places where I was using the exact same code were working just fine, and it seemed to be localised to just my load test worker environment.

I altered the environment setup to not require instances to setup successfully (by removing the call to cfn-signal after cfn-init, and removing the WaitCondition from the template) and manually initialized each instance after it was created before I ran my load tests.

Not great, but an acceptable work around until it mattered more. I had bigger fish to fry.

The Virus Spreads

Then it happened somewhere else.

The new cloud based service we are working on needed an environment refresh to take advantage of scalable persistence in the form of RDS. We tore down the old CI environment, and then attempted to create a new one.

The mysterious “invalid UTF8 character” error reared its ugly head again.

I couldn’t brush it away this time, as this environment was much more important than my load test workers, so it was time to dig in and see what was really going on.

Taking the error at face value, I assumed that something had changed in the pipeline (Octopus, AWS, etc) and that there really was an invalid character. It was fairly obvious that cfn-init didn’t fail until after the first deployment had actually completed, so something seemed to be off in the output of that particular step.

"a-deploy-system-stats-logger": {
    "command": {
        "Fn::Join": [
            "",
            [
                "powershell.exe -ExecutionPolicy Bypass -Command c:\\cfn\\dependencies\\scripts\\environment\\Invoke-DeployLatestProjectForEnvironment.ps1 ",
                " -Verbose ",
                " -OctopusServerApiKey ", { "Ref": "OctopusAPIKey" },
                " -OctopusServerUrl ", { "Ref": "OctopusServerURL" },
                " -OctopusEnvironment ", { "Ref": "OctopusEnvironment" },
                " -ProjectName 'System Stats Logger'"
            ]
        ]
    },
    "waitAfterCompletion": "0"
}

I ran the script manually through C# and captured the raw output stream. I parsed the output as UTF8, and it came out just fine, so clearly the output itself wasn’t the problem.
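
If you want to reproduce that sort of check without dropping into C#, a rough Powershell equivalent looks like the following. It runs the command (cut down from the template snippet above, parameters omitted), captures the raw bytes of standard output and decodes them with a strict UTF8 decoder that throws on invalid byte sequences.

$psi = New-Object System.Diagnostics.ProcessStartInfo
$psi.FileName = "powershell.exe"
$psi.Arguments = "-ExecutionPolicy Bypass -File .\Invoke-DeployLatestProjectForEnvironment.ps1 -Verbose"
$psi.RedirectStandardOutput = $true
$psi.UseShellExecute = $false

$process = [System.Diagnostics.Process]::Start($psi)
$outputBytes = New-Object System.IO.MemoryStream
$process.StandardOutput.BaseStream.CopyTo($outputBytes)
$process.WaitForExit()

# The second constructor argument makes the decoder throw on invalid byte sequences instead of
# silently substituting replacement characters, which is exactly the check we care about here.
$strictUtf8 = New-Object System.Text.UTF8Encoding($false, $true)
$decoded = $strictUtf8.GetString($outputBytes.ToArray())
Write-Output "Output decoded cleanly as UTF8 ([$($decoded.Length)] characters)"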

At this point I took a slightly different approach. I knew that an old version of the environment setup for our new cloud service had worked just fine a few days ago, so I branched from that point in Git and executed the setup script.

It worked.

Now I at least had two copies of similar codebases that I could compare for differences to find what was breaking the deployments, and I knew that it wasn’t an environmental issue related to either Octopus Deploy or AWS components.

I used an application called WinMerge to diff the two repositories (after extracting them to different directories on disk), and found a number of differences, but nothing that was obviously the root cause. I thought that perhaps the problem was caused by upgrading to the latest version of the AWS Powershell package, but I rolled that back and still had the same issue.

Heads up WinMerge people, if you are on the internet at all. The statuses for “File Identical” and “File Different” in your application look incredibly similar at a glance. I thought all my files were identical initially, at least until I looked a little closer. Colours would be nice.

The Root Cause

While looking at the diff of the two directories, I noticed that I had added additional Write-Verbose statements to some of my generic Powershell enumerable functions (like Any, Single, All, etc). In fact, in all of the environment setup scripts that were working, this additional logging was missing (because they were using an older version of the common scripts). The logging was only present in those environments that were broken.

But how could additional logging statements cause a UTF8 error in cfn-init, and only during startup?

It took me a fair bit of time to realise that it was an issue with the length of the output. My deployment scripts all had -Verbose turned on by default. I did this on purpose, to make sure that I received the maximum amount of information possible from the environment setup.

Apparently cfn-init has a limit on the amount of output it can process from a command, and I was pushing that limit.

As to why it was failing during startup and not when manually executed? The code that was run at startup did some archive extraction which didn’t happen in future calls (because it was already on disk) and the items being extracted were logged. This was evidently enough to push the output over the invisible, undocumented limit.

I removed the -Verbose flag from the calls to the deployment scripts and everything was fine.

Problem solved.

Why?

Of course, the question remains. Why?

Why does cfn-init give a completely unrelated error when the output from a single command is too long? Why not an “output has been truncated” or “output too long” error of some description?

I have no idea.

I suspect the reason for the UTF8 error is that the output was truncated between two bytes or something, and when the parser underneath cfn-init tried to parse it, it encountered what appeared to be an invalid UTF8 character. I still have no idea why it would do arbitrary truncation though, especially without telling the user.

Ah well.

Conclusion

This is the sort of thing that you just have to deal with sometimes. I do wonder how many other people have run into this issue though. Considering that when I did a Google search for the error, I found almost nothing, I’m not sure if anyone else has. If they have, they certainly didn’t post online about it. Luckily, for everyone else in the future, I did, both here and on the AWS forums, so maybe that will help.

On the upside, running into these sorts of things makes you think more about your own code, and how it reports problems to the user when assumptions that you’ve made break down in reality. It takes a lot of effort to write robust code that is communicative when it fails, but I think it’s definitely worth it.

Remember, your code will be used and read a hell of a lot more than it’s written.


It’s time to fix that whole shared infrastructure issue.

I need to farm my load tests out to AWS, and I need to do it in a way that won’t accidentally murder our current production servers. In order to do that, I need to forge the replacement for our manually created and configured proxy box. A nice, codified, auto-scaling proxy environment.

Back into CloudFormation I go.

I contemplated simply using a NAT box instead of a proxy, but decided against it because:

  • We already use a proxy, so assuming my template works as expected it should be easy enough to slot in,
  • I don’t have any experience with NAT boxes (I’m pretty weak on networking in general actually),
  • Proxies scale better in the long run, so I might as well sort that out now.

Our current lone proxy machine is a Linux instance with Squid manually installed on it. It was set up some time before I started, by someone who no longer works at the company. An excellent combination: I’m already a bit crap at Linux, and now I can’t even ask anyone how it was put together and what sort of tweaks were made to it over time as failures were encountered. Time to start from scratch. The proxy itself is sound enough, and I have some experience with Squid, so I’ll stick with it. As for the OS, while I know that Linux will likely be faster with less overhead, I’m far more comfortable with Windows, so to hell with Linux for now.

Here's the plan. Create a CloudFormation template for the actual environment (Load Balancer, Auto Scaling Group, Instance Configuration, DNS Record) and also create a NuGet package that installs and configures the proxy to be deployed via Octopus.

I’ve always liked the idea of never installing software manually, but its only been recently that I’ve had access to the tools to accomplish that. Octopus, NuGet and Powershell form a very powerful combination for managing deployments on Windows. I have no idea what the equivalent is for Linux, but I’m sure there is something. At some point in the future Octopus is going to offer the ability to do SSH deploys which will allow me to include more Linux infrastructure (or manage existing Linux infrastructure even better, I’m looking at you ELK stack).

Save the Environment

The environment is pretty simple. A Load Balancer hooked up to an Auto Scaling Group, whose instances are configured to do some simple setup, (including using Octopus to deploy some software) and a DNS record so that I can refer to the load balancer in a nice way.

I’ve done enough of these simple sorts of environments now that I didn’t really run into any interesting issues. Don’t get me wrong, they aren’t trivial, but I wasn’t stuck smashing my head against a desk for a few days while I sorted out some arcane problem that ended up being related to case sensitivity or something ridiculous like that.

One thing that I have learned is to set up the Octopus project that will be deployed during environment setup ahead of time. Give it some trivial content, like running a Powershell script, and then make sure it deploys correctly during the startup of the instances in the Auto Scaling Group. If you try to sort out the package and its deployment at the same time as the environment, you’ll probably run into issues where the environment setup technically succeeded, but because the deployment of the package failed, the whole thing failed and you have to wait another 20 minutes to fix it. It really saves a lot of time to create the environment in such a way that you can extend it with deployments later.

Technically you could also make it so failing deployments don’t fail an environment setup, but I like my environments to work when they are “finished”, so I’m not really comfortable with that in the long run.

The only tricky things about the proxy environment are making sure that you set up your security groups appropriately so that the proxy port can be accessed, and making sure that you use the correct health check for the load balancer (for Squid, a TCP check on port 3128, the default Squid port, works well).
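
As a quick sanity check, a plain TCP connection to port 3128 from a machine inside the VPC exercises the same path the load balancer health check does. A tiny sketch (the IP address is a placeholder for one of the proxy instances):

# The ELB health check target for Squid is just "TCP:3128", so a successful raw TCP connection
# from inside the VPC is a good sign that both the security groups and Squid itself are healthy.
$client = New-Object System.Net.Sockets.TcpClient
$client.Connect("10.0.1.50", 3128)
if ($client.Connected)
{
    Write-Output "Squid is accepting connections on port 3128"
}
$client.Close()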

That's a Nice Package

With the environment out of the way, it’s time to set up the package that will be used to deploy Squid.

Squid is available on Windows via Diladele. Since 99% of our systems are 64 bit, I just downloaded the 64 bit MSI. Using the same structure that I used for the Nxlog package, I packaged up the MSI and some supporting scripts, making sure to version the package appropriately. Consistent versioning is important, so I use the same versioning strategy that I use for our software components: include a SharedAssemblyInfo file and then mutate that file via some common versioning Powershell functions.
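
The version mutation itself is little more than a regex replace over the shared file. A rough sketch of the idea (the file path and version source are assumptions; the actual common versioning functions do a bit more than this):

# Rewrite the assembly version attributes in the shared file before packing, so the NuGet
# package and the binaries inside it carry the same version.
$version = "1.2.$buildNumber.0"
$filePath = ".\src\SharedAssemblyInfo.cs"
$content = Get-Content $filePath | Out-String
$content = $content -replace 'AssemblyVersion\("[^"]*"\)', "AssemblyVersion(`"$version`")"
$content = $content -replace 'AssemblyFileVersion\("[^"]*"\)', "AssemblyFileVersion(`"$version`")"
Set-Content -Path $filePath -Value $content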

Apart from installation of Squid itself, I also included the ability to deploy a custom configuration file. The main reason I did this was so that I could replicate our current Squid proxy config exactly, because I’m sure it does things that have been built up over the last few years that I don’t understand. I did this in a similar way to how I did config deployment for Nxlog and Logstash. Essentially a set of configuration files are included in the NuGet package and then the correct one is chosen at deployment time based on some configuration within Octopus.
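
Something like the following is the shape of it (the variable names, config file layout and Squid install path are assumptions, so treat it as a sketch rather than the exact deployment script):

# Pick the squid.conf that matches the current Octopus environment, falling back to a default,
# and drop it over the top of the one the installer created.
$environment = $OctopusParameters["Octopus.Environment.Name"]
$configSource = ".\configs\squid.$environment.conf"
if (-not (Test-Path $configSource))
{
    $configSource = ".\configs\squid.default.conf"
}
Copy-Item -Path $configSource -Destination "C:\Squid\etc\squid\squid.conf" -Force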

I honestly don’t remember if I had any issues with creating the Squid proxy package, but I’m sure that if I had, they would be fresh in my mind. MSIs are easy to install silently with MSIEXEC once you know the arguments, and the Squid installer for Windows is pretty reliable. I really do think it was straightforward, especially considering that I was following the same pattern that I’d used to install an MSI via Octopus previously.
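
For reference, the silent install boils down to something like this (the MSI file name and log path are illustrative):

# Install the Squid MSI silently; /qn suppresses the UI, /norestart stops it rebooting the
# instance mid-deploy, and /l*v writes a verbose log that is invaluable when the install fails.
$msiPath = Resolve-Path ".\squid.msi"
$logPath = "C:\logs\squid-install.log"
$arguments = "/i `"$msiPath`" /qn /norestart /l*v `"$logPath`""

$process = Start-Process -FilePath "msiexec.exe" -ArgumentList $arguments -Wait -PassThru
if ($process.ExitCode -ne 0)
{
    throw "msiexec failed with exit code [$($process.ExitCode)]. See [$logPath] for details."
}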

Delivering the Package

This is standard Octopus territory. Create a project to represent the deployable component, target it appropriately at machines in roles and then deploy to environments. Part of the build script that is responsible for putting together the package above can also automatically deploy it via Octopus to an environment of your choice.
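
The build-script side of that is usually just a call to the Octopus command line client. A sketch, with the server URL and names as placeholders (the real scripts wrap this in common functions and supply the version of the package that was just built):

# Create a release for the project and deploy it straight to the target environment,
# waiting for the deployment to finish so the build fails if the deployment does.
& ".\tools\Octo.exe" create-release `
    --project "Squid Proxy" `
    --deployto "CI" `
    --waitfordeployment `
    --server "https://octopus.example.com" `
    --apiKey $octopusApiKey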

In TeamCity, we typically do an automatic deploy to CI on every checkin (gated by passing tests), but for this project I had to hold off. We’re actually running low on Build Configurations right now (I’ve already put in a request for more, but the wheels of bureaucracy move pretty slow), so I skipped out on setting one up for the Squid proxy. Once we get some more available configurations I’ll rectify that situation.

Who Could Forget Logs?

The final step in deploying and maintaining anything is to make sure that the logs from the component are being aggregated correctly, so that you don’t have to go to the machine/s in question to see what’s going on, and so that you have a nice pile of data to do analysis on later. Space is cheap after all, so you might as well store everything, all the time (except media).

Squid features a nice Access log with a well known format, which is perfect for this sort of log processing and aggregation.

Again, using the same sort of approach that I’ve used for other components, I quickly knocked up a logstash config for parsing the log file and deployed it (and Logstash) to the same machines as the Squid proxy installations. I’ll include that config here, because it lives in a different repository to the rest of the Squid stuff (it would live in the Solavirum.Logging.Logstash repo, if I updated it).

input {
    file {
        path => "@@SQUID_LOGS_DIRECTORY/access.log"
        type => "squid"
        start_position => "beginning"
        sincedb_path => "@@SQUID_LOGS_DIRECTORY/.sincedb"
    }
}

filter {
    if [type] == "squid" {
        grok {
            match => [ "message", "%{NUMBER:timestamp}\s+%{NUMBER:TimeTaken:int} %{IPORHOST:source_ip} %{WORD:squid_code}/%{NUMBER:Status} %{NUMBER:response_bytes:int} %{WORD:Verb} %{GREEDYDATA:url} %{USERNAME:user} %{WORD:squid_peerstatus}/(%{IPORHOST:destination_ip}|-) %{GREEDYDATA:content_type}" ]
        }
        date {
            match => [ "timestamp", "UNIX" ]
            remove_field => [ "timestamp" ]
        }
    }
     
    mutate {
        add_field => { "SourceModuleName" => "%{type}" }
        add_field => { "Environment" => "@@ENVIRONMENT" }
        add_field => { "Application" => "SquidProxy" }
        convert => [ "Status", "string" ]
    }
    
    # This last common mutate deals with the situation where Logstash was creating a custom type (and thus different mappings) in Elasticsearch
    # for every type that came through. The default "type" is logs, so we mutate to that, and the actual type is stored in SourceModuleName.
    # This is a separate step because if you try to do it with the SourceModuleName add_field it will contain the value of "logs" which is wrong.
    mutate {
        update => [ "type", "logs" ]
    }
}

output {
    tcp {
        codec => json_lines
        host => "@@LOG_SERVER_ADDRESS"
        port => 6379
    }
    
    #stdout {
    #    codec => rubydebug
    #}
}

I Can Never Think of a Good Title for the Summary

For reference purposes I’ve included the entire Squid package/environment setup code in this repository. Use as you see fit.

As far as environment setups go, this one was pretty much by the numbers. No major blockers or time wasters. It wasn’t trivial, and it still took me a few days of concentrated effort, but the issues I did have were pretty much just me making mistakes (like setting up security group rules wrong, or failing to tag instances correctly in Octopus, or failing to test the Squid install script locally before I deployed it). The slowest part is definitely waiting for the environment creation to either succeed or fail, because it can take 20+ minutes for the thing to run start to finish. I should look into making that faster somehow, as I get distracted during that 20 minutes.

Really the only reason for the lack of issues was that I’d done all of this sort of stuff before, and I tend to make my stuff reusable. It was a simple matter to plug everything together in the configuration that I needed, no need to reinvent the wheel.

Though sometimes you do need to smooth the wheel a bit when you go to use it again.