
Now that I am feeling less like dying horribly from a terrible plague, it's time to continue talking about processing ELB logs into an ELK stack via AWS Lambda.

Last time I talked about our current situation, our motivations for wanting to switch to something better and the general plan moving forward.

To summarise the plan, the new log processor system needs 3 main parts:

  • The Lambda function code, which is written in Javascript, and would need to parse an ELB log file one line at a time, break it up, convert it into a JSON structure and then push it to a Logstash TCP endpoint to be written into Elasticsearch
  • The configuration/creation of the Lambda function, which would need to be done via CloudFormation as part of our normal environment setup (i.e. spin up an environment, it comes with the Lambda function that will process any ELB logs encountered)
  • The deployment of the code to the Lambda function via Octopus (because that’s how we roll)

I’m basically going to do a short blog post on each of those pieces of work, and maybe one at the end to tie it all together.

With that in mind, let's talk Javascript.

Unforeseen Consequences

When you’re writing Lambda functions, Javascript (via Node.js) is probably your best bet. Sure, you can run Java or Python (and maybe one day C# using .NET Core), but Javascript is almost certainly going to be the easiest. It's what we chose when we put together the faster S3 clone prototype, and while the fundamentally asynchronous nature of Node.js took some getting used to, it worked well enough.

When it comes to Javascript, I don’t personally write a lot of it. If I’m writing a server side component, I’m probably going to pick C# as my language of choice (with all its fancy features like “a compiler” and “type safety”), and I don’t find myself writing things like websites or small Javascript applications very often, if at all. My team definitely writes websites though, and we use React.js to do it, so it's not like Javascript is an entirely foreign concept.

For the purposes of reading in and parsing an ELB log file via a Lambda function, we didn’t need a particularly complex piece of Javascript. Something that reads the specified file from S3 after the Lambda function triggers, something to process the contents of that file line by line, something to parse and format those lines in a way that a Logstash input will accept, and something to push that JSON payload to the Logstash listener over raw TCP.

Without further ado, I give you the completed script:

'use strict';

let aws = require('aws-sdk');
let s3 = new aws.S3({ apiVersion: '2006-03-01' });
let readline = require('readline');
let net = require('net');

const _type = 'logs';
const _sourceModuleName = 'ELB';
const _logHost = '#{LogHost}';
const _logPort = #{LogPort};
const _environment = '#{Octopus.Environment.Name}';
const _component = '#{Component}';
const _application = '#{Application}';

function postToLogstash(entry){
    console.log("INFO: Posting event to logstash...");

    var socket = net.createConnection(_logPort, _logHost);
    var message = JSON.stringify(entry) + "\n";
    socket.write(message);
    socket.end();

    console.log("INFO: Posting to logstash...done");
}

exports.handler = (event, context, callback) => {
    console.log('INFO: Retrieving log from S3 bucket...');

    const bucket = event.Records[0].s3.bucket.name;
    const key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, ' '));
    const params = {
        Bucket: bucket,
        Key: key
    };

    const reader = readline.createInterface({
        input: s3.getObject(params).createReadStream()
    });

    const expectedColumns = 12;

    reader.on('line', function(line) {
        console.log("INFO: Parsing S3 line entry...");

        const columns = line.split(/ (?=(?:(?:[^"]*"){2})*[^"]*$)/);

        if(columns.length >= expectedColumns){
            var entry = {
                EventReceivedTime: columns[0],
                LoadBalancerName: columns[1],
                PublicIpAndPort: columns[2],
                InternalIpAndPort: columns[3],
                TimeToForwardRequest: parseFloat(columns[4]) * 1000,
                TimeTaken: parseFloat(columns[5]) * 1000,
                TimeToForwardResponse: parseFloat(columns[6]) * 1000,
                Status: columns[7],
                BackendStatus: columns[8],
                BytesUploadedFromClient: parseInt(columns[9]),
                BytesDownloadedByClient: parseInt(columns[10]),
                FullRequest: columns[11],
                Component: _component,
                SourceModuleName: _sourceModuleName,
                Environment: _environment,
                Application: _application,
                Type: _type
            };
            postToLogstash(entry);
        } else {
            console.log("ERROR: Invalid record length; was expecting " + expectedColumns + " but found " + columns.length);
            console.log('ERROR: -------');
            console.log(line);
            console.log('ERROR: -------');
        }
    });
};

Nothing too fancy.

In the interest of full disclosure, I did not write the script above. It was written by a few guys from my team initially as a proof of concept, then improved/hardened as a more controlled piece of work.
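To make the parsing step concrete, here's the quote-aware split from the handler run against a sample line in the classic ELB access log format (the line itself is illustrative, not taken from our logs):

```javascript
// A sample line in the classic ELB access log format (illustrative values).
const line = '2016-01-01T23:39:43.945958Z my-loadbalancer 192.168.131.39:2817 10.0.0.1:80 0.000073 0.001048 0.000057 200 200 0 29 "GET http://example.com:80/ HTTP/1.1"';

// Split on spaces, but only those followed by an even number of remaining
// quotes, so the spaces inside the quoted request are left alone.
const columns = line.split(/ (?=(?:(?:[^"]*"){2})*[^"]*$)/);

// columns has 12 entries, with the full quoted request as the last one.
```

The lookahead is what saves the quoted request field: a space inside the quotes has an odd number of quotes after it, so it never matches.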

Office Complex

You might notice some strange variable names at the top of the script (i.e. #{Octopus.Environment.Name}).

We use Octopus Deploy for all of our deployments, and this includes the deployment of Lambda functions (which we package via NuGet and then deploy via the AWS Powershell Cmdlets/CLI inside Octopus). The #{NAME} notation is a way for Octopus to substitute variable values into files during deployment. This substitution is very useful, and can be scoped via a variety of things (like Environment, Machine, Role, etc), so by the time the script actually gets into AWS those variables are filled in with actual values.
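Conceptually (and this is just a sketch, not Octopus' actual implementation), the substitution is a simple token replace over the file contents:

```javascript
// Replace every #{Name} token with the matching value from a variable map,
// leaving unknown tokens untouched. A sketch only; Octopus does the real thing.
function substituteVariables(content, variables) {
    return content.replace(/#\{([^}]+)\}/g, (match, name) =>
        name in variables ? variables[name] : match);
}
```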

Beyond our use of Octopus variables, other things to note in this piece of Javascript are:

  • At no point does the function specify which credentials are being used to access the S3 bucket containing the ELB log files. This is because the Lambda function has been configured with an IAM role allowing it to access the required resources. The AWS Javascript library has built in support for running inside supported AWS contexts (like EC2 instances and Lambda functions), such that it can use the role applied to the context to get appropriate temporary credentials. This is a much better approach than using specific credentials in the script, which can result in unintended consequences if you aren’t eternally vigilant.
  • You need to make sure that your Lambda function is configured with an appropriate security group that allows it to use the expected outbound channel (i.e. make sure it can get to the Logstash host you’re trying to connect to). This was somewhat of an issue for us as our ELK stack is hosted inside another AWS account (our Ops account), so we had to make sure that all of the appropriate VPC peering was configured before it would work correctly. It can be a bit of a pain to set up the smallest possible surface area, but don’t be tempted to just configure the Lambda function to be able to do everything and go anywhere. A smaller surface area means more security, and the principle of least privilege is one you should always follow.
  • It's important that if you’re doing TCP communication in a Lambda function, you close your socket when you’re done with it, or the Lambda function might not exit. It might, but it also might not, and it can really throw you for a loop if you don’t know why.

To Be Continued

That’s it for this week. The Javascript I’ve included above is pretty generic (apart from the specific set of fields that we like to have in our log events) and will successfully process an ELB log file from S3 to a Logstash instance listening on a port of your choosing (probably 6379) when used in a Lambda function. Feel free to reuse it for your own purposes.

Next week I’ll continue this series of posts with information about how we use CloudFormation to setup our Lambda function as part of one of our environment definitions.

CloudFormation and Lambda aren’t the best of friends yet, so there is some interesting stuff that you have to be aware of.


We’ve come a long way in our log aggregation journey. Don’t get me wrong, we still have a long way to go, but bit by bit, we’re getting better at it.

A good example of getting better is the way in which we process our Elastic Load Balancer (ELB) logs. Over a year ago I put together a system for processing these logs into our ELK stack with way too many moving parts. It used Logstash (for processing) and Powershell (for downloading files from S3) hosted on an EC2 instance to aggregate ELB logs from S3 into our ELK stack. Somewhat complicated in practice, but it worked, even if I was never particularly happy with it.

As is the way with these things though, because it did work, we’ve had no reason to revisit it, and we’ve successfully applied the same approach to at least 3 other environments we’ve set up since.

It wasn’t without its share of problems though:

  • The EC2 instances hosting the solution had a tendency to cap themselves at 100% CPU for long periods. They were initially t2.mediums, but they kept expending all of their CPU credits, so we had to upgrade them to m3.mediums, which was a 50% increase in cost ($US 115/month). Never did figure out exactly what needed all that CPU, but the leading theory was Logstash.
  • For a while, the logs simply stopped processing after a period of time (days/weeks). This turned out to be an issue with accumulating memory dumps from Java as a result of Logstash crashing and NSSM automatically restarting it.
  • These were the machines most vulnerable to the memory leak in Logstash that causes its TCP driver to accumulate non-paged memory on Windows AWS instances due to some driver problem.

Good old Logstash.

To turn the discussion back to getting better, we had the opportunity to revisit this process when building some new environments, using all of the knowledge and experience that we’d gained in the intervening period. I think we came up with a much more efficient and understandable solution, but it wasn’t without its share of difficulties, which makes for a good post.

Anomalous Materials

One of the primary weaknesses in the previous approach for processing ELB logs was that it required an entire EC2 instance all to itself, for each environment that we spun up. We did this in order to keep each log processor isolated from the others and to allow us to spin up an entirely self-contained environment without having to worry about some common machine that processed all of the logs in a bunch of different buckets.

Another weakness in the process that bothered me was that it had way too many moving parts. Sometimes you have to have a lot of moving parts working together in order to accomplish a goal, but you should always strive for simplicity, both from an operational point of view and from a maintenance point of view. Less is almost always better in software development.

AWS has come a long way since we jammed the initial solution together, so we decided to use this opportunity to simplify the process and experiment with some AWS tools that we don’t frequently use.

After some discussion, we formed an idea of what we would like the new log processor to look like. We wanted to use Lambda to process the ELB logs as they were created, pushing them to the same Logstash ingress endpoint that we’ve been using consistently for the last year or so. The benefits we were expecting were a reduction in complexity (no need to have 4 different things working together), a reduction in running cost (mostly due to the removal of the EC2 instance) and a reduction in latency (the Lambda function would trigger whenever a file was written to the S3 bucket by the ELB, which meant no more polling for changes).

For those of you unfamiliar with Lambda, it's a service offered by AWS that lets you configure code to run in response to a variety of events. I’ve used it before to create a quicker S3 bucket clone, so if you want some more information, feel free to read up on that adventure.

In order to accomplish our goal, we would need to deal with 3 things:

  • The Lambda function code itself, to parse the ELB log files and push them to Logstash
  • The configuration/creation of the Lambda function via CloudFormation, as part of our normal environment setup
  • The deployment of the function code via Octopus

Nothing particularly insane there, but definitely a few things that we’d never done before.

To Be Continued

In order to avoid creating a single monstrous post with more words than a small novel, I’m going to break it here.

Next week I’ll continue, explaining the Javascript code that we put together to process the log files (it's not particularly complicated) and how we configured the Lambda function by incorporating its setup into our environment setup.

Until then, may all your Lambda functions execute quickly and your S3 buckets not turn into ghosts.


A while back (god, almost a full year ago), I posted about the way in which we handle environment migrations, and to be honest, it hasn’t changed all that much. We have made some improvements to the way we handle our environments (for example, we’ve improved our newest environments to be built into tested, versioned packages, rather than running directly from source), which is good, but the general migration process of clone temp, tear down old, clone back to active, tear down temp hasn’t really changed all that much.

Over time, we’ve come to realise that there are a number of weaknesses in that strategy though. It's slow (double clone!), it's not overly clean, and in rare cases it can lead to all of the data for the environment under migration being destroyed.

Yes, destroyed, i.e. lost forever.

This post is about that last weakness (the others will have to continue existing…for now).

Explosions!

In the original cloning scripts, there was an ominous comment, which simply said “# compare environment data here?”, which was a pretty big red flag in retrospect. You can’t always do everything though, and the various pressures applied to the development team meant that that step became somewhat manual.

That was a mistake.

After running a number of migrations across a few different environments (using basically the same concepts), we finally triggered that particular tripwire.

An otherwise uninteresting environment upgrade for one of our production services completely annihilated the underlying database (an EC2 instance running RavenDB), but the script gave no indication that anything went wrong.

Luckily, this particular service was more of a temporary waystation, acting as a holding area facilitating the connection of two applications through a common web interface. This meant that while the loss of the data was bad (very bad), it wasn’t a problem for all of our customers. Only those people who had items sitting in the holding area waiting to be picked up were affected.

Obviously, the affected customers were quite unhappy, and rightfully so.

To this day I actually have no idea what went wrong with the actual migration. I had literally run the exact same scripts on a staging environment earlier that day, and verified that the same data was present before and after. After extensive investigation, we agreed that we would probably not get to the root of the issue in a timely fashion and that it might have just been an AWS thing (for a platform based on computers, sometimes AWS is amazingly non-deterministic). Instead, we agreed to attack the code that made it possible for the data loss to occur at all.

The migration scripts themselves.

Give Me More Statistics…Stat!

Returning to that ominous comment in the migration scripts, we realised that we needed an easy way to compare the data in two environments, at least at a high level. Using a basic comparison like that would enable us to make a decision about whether to proceed with the migration (specifically the part that destroys the old environment).

The solution is to implement a statistics endpoint.

The idea is pretty simple. We provide a set of information from the endpoint that summarises the content of the service (at least as best we can summarise it). Things like how many of a certain type of entity are present are basically all we have to deal with for now (simple services), but the concept could easily be extended to include information about any piece of data in the environment.

Something as simple as the example below fills our needs:

{
    "data": {
        "customers": {
            "count": 57
        },
        "databases": {
            "count": 129
        }
    }
}

A side effect of having an endpoint like this is that we can easily (at least using the http_poller input in Logstash) extract this information on a regular basis and put it into our log aggregation so that we can chart its change over time.

Making It Work

With the statistics endpoint written and deployed (after all it must be present in the environment being migrated before we can use it), all that’s left to do is incorporate it into the migration script.

I won’t rewrite the entirety of the migration script here, but I’ve included a skeleton below to provide an idea of how we use the comparison to make sure we haven’t lost anything important on the way through.

function Migrate
{
    param
    (
        # bunch of params here, mostly relating to credentials
    )

    try
    {
        # make current environment unavailable to normal traffic

        # clone current to temporary

        if (-not (Compare-Environments $current $temp))
        {
            # delete the temporary environment and exit with an error
        }

        # delete current environment
        # clone temporary environment into the place where the current environment used to be

        if (-not (Compare-Environments $current $temp))
        {
            # delete the new environment
            # keep the temporary environment because it's the only one with the data
        }
    }
    catch
    {
        # if the current environment still exists, delete the temporary environment
        # if the current environment still exists, restore its availability
    }
}

function Compare-Environments
{
    param
    (
        $a,
        $b
    )

    $aEndpoint = "some logic for URL creation based off environment"
    $bEndpoint = "some logic for URL creation based off environment"

    $aStatistics = Invoke-RestMethod $aEndpoint # credentials, accept header, methods etc
    $bStatistics = Invoke-RestMethod $bEndpoint # credentials, accept header, methods etc

    if ((ConvertTo-Json $aStatistics.data) -eq (ConvertTo-Json $bStatistics.data))
    {
        return $true
    }

    return $false
}
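The same trick translated to Javascript for illustration (a sketch, with the obvious caveat that serialise-and-compare is only reliable if both endpoints emit their properties in the same order):

```javascript
// Compare two statistics payloads by serialising their data sections.
// Property order matters to JSON.stringify, so this assumes both
// services build the payload the same way.
function environmentsMatch(aStatistics, bStatistics) {
    return JSON.stringify(aStatistics.data) === JSON.stringify(bStatistics.data);
}
```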

Summary

The unfortunate truth of this whole saga, is that the person who originally implemented the migration scripts (I’m pretty sure it was me, so I take responsibility) was aware of the fact that the migration could potentially lead to loss of data. At the time, the protection against that was to ensure that we never deleted the old environment until we were absolutely sure that the new environment had been successfully created, making the assumption that the data had come over okay.

In the end, that assumption proved to be our undoing, because while everything appeared peachy, it actually failed spectacularly.

The introduction of a statistics endpoint (almost an environment data hash) is an elegant solution to the problem of potential data loss, which also has some nice side effects for tracking metrics that might not have been easily accessible outside of direct database access.

A double victory is a rare occurrence, so I think I’ll try to savour this one for a little while, even if I was the root cause of the problem.


I’ve been doing a lot of work with AWS recently.

For the last service component that we developed, we put together a CloudFormation template and a series of Powershell scripts to set up, tear down and migrate environments (like CI, Staging, Production, etc). It was extremely effective, barring some issues that we still haven’t quite solved with data migration between environment versions and updating machine configuration settings.

In the first case, an environment is obviously not stateless once you start using it, and you need a good story about maintaining user data between environment versions, at the very least for Production.

In the second case, tearing down an entire environment just to update a configuration setting is obviously sub-optimal. We try to make sure that most of our settings are encapsulated within components that we deploy, but not everything can be done this way. CloudFormation does have update mechanisms, I just haven’t had a chance to investigate them yet.

But I digress; let's switch to an entirely different topic for this post: how to give secure access to objects in an S3 bucket during initialization of EC2 instances while executing a CloudFormation template.

That was a mouthful.

Don’t Do What Donny Don’t Does

My first CloudFormation template/environment setup system had a fairly simple rule. Minimise dependencies.

There were so many example templates on the internet that just downloaded arbitrary scripts or files from GitHub or S3, and to me that’s the last thing you want. When I run my environment setup (ideally from within a controlled environment, like TeamCity) I want it to use the versions of the resources that are present in the location I’m running the script from. It should be self contained.

Based on that rule, I put together a fairly simple process where the Git Repository housing my environment setup contained all the necessary components required by the resources in the CloudFormation template, and the script was responsible for collecting and then uploading those components to some location that the resources could access.

At the time, I was not very experienced with S3, so I struggled a lot with getting the right permissions.

Eventually I solved the issue by handing off the AWS Key/Secret to the CloudFormation template, and then using those credentials in the AWS::CloudFormation::Authentication block inside the resource (LaunchConfig/Instance). The URL of the dependencies archive was then supplied to the source element of the first initialization step in the AWS::CloudFormation::Init block, which used the supplied credentials to download the file and extract its contents (via cfn-init) to a location on disk, ready to be executed by subsequent components.

This worked, but it left a bad taste in my mouth once I learnt about IAM roles.

IAM roles give you the ability to essentially organise sets of permissions that can be applied to resources, like EC2 instances. For example, we have a logs bucket per environment that is used to capture ELB logs. Those logs are then processed by Logstash (indirectly, because I can’t get the goddamn S3 input to work with a proxy, but I digress) on a dedicated logs processing instance. I could have gone about this in two ways. The first would have been to supply the credentials to the instance, like I had in the past. This exposes those credentials on the instance though, which can be dangerous. The second option is to apply a role to the instance that says “you are allowed to access this S3 bucket, and you can do these things to it”.

I went with the second option, and it worked swimmingly (once I got it all configured).

Looking back at the way I had done the dependency distribution, I realised that using IAM roles would be a more secure option, closer to best practice. Now I just needed a justifiable opportunity to implement it.

New Environment, Time to Improve

We’ve started work on a new service, which means new environment setup. This is a good opportunity to take what you’ve done previously and reuse it, improving it along the way. For me, this was the perfect chance to try and use IAM roles for the dependency distribution, removing all of those nasty “credentials in the clear” situations.

I followed the same process that I had for the logs processing. Set up a role describing the required policy (readonly access to the S3 bucket that contains the dependencies), then link that role to a profile. Finally, apply the profile to the instances in question.

"ReadOnlyDependenciesBucketRole": {
    "Type": "AWS::IAM::Role",
    "Properties": {
        "AssumeRolePolicyDocument": {
            "Version" : "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": { "Service": [ "ec2.amazonaws.com" ] },
                    "Action": [ "sts:AssumeRole" ]
                }
            ]
        },
        "Path": "/",
        "Policies" : [
            {
                "PolicyName": "ReadOnlyDependenciesBucketPolicy",
                "PolicyDocument": {
                    "Version" : "2012-10-17",
                    "Statement": [
                        {
                            "Effect": "Allow",
                            "Action": [ "s3:GetObject", "s3:GetObjectVersion" ],
                            "Resource": { "Fn::Join": [ "", [ "arn:aws:s3:::", { "Ref": "DependenciesS3Bucket" }, "/*" ] ] }
                        },
                        {
                            "Effect": "Allow",
                            "Action": [ "s3:ListBucket" ],
                            "Resource": { "Fn::Join": [ "", [ "arn:aws:s3:::", { "Ref": "DependenciesS3Bucket" } ] ] }
                        }
                    ]
                }
            }
        ]
    }
},
"ReadOnlyDependenciesBucketInstanceProfile": {
    "Type": "AWS::IAM::InstanceProfile",
    "Properties": {
        "Path": "/",
        "Roles": [ { "Ref": "ReadOnlyDependenciesBucketRole" }, { "Ref": "FullControlLogsBucketRole" } ]
    }
},
"InstanceLaunchConfig": {
    "Type": "AWS::AutoScaling::LaunchConfiguration",
    "Metadata": {
        * snip *
    },
    "Properties": {
        "KeyName": { "Ref": "KeyName" },
        "ImageId": { "Ref": "AmiId" },
        "SecurityGroups": [ { "Ref": "InstanceSecurityGroup" } ],
        "InstanceType": { "Ref": "InstanceType" },
        "IamInstanceProfile": { "Ref": "ReadOnlyDependenciesBucketInstanceProfile" },
        "UserData": {
            * snip *
        }
    }
}

It worked before, so it should work again, right? I’m sure you can probably guess that that was not the case.

The first mistake I made was attempting to specify multiple roles in a single profile. I wanted to do this because the logs processor needed to maintain its permissions to the logs bucket, but it needed the new permissions to the dependencies bucket as well. Even though the roles element is defined as an array, it can only accept a single element. I now hate whoever designed that, even though I’m sure they probably had a good reason.

At least that was an easy fix, flip the relationship between roles and policies. I split the inline policies out of the roles, then linked the roles to the policies instead. Each profile only had 1 role, so everything should have been fine.

"ReadOnlyDependenciesBucketPolicy": {
    "Type":"AWS::IAM::Policy",
    "Properties": {
        "PolicyName": "ReadOnlyDependenciesBucketPolicy",
        "PolicyDocument": {
            "Version" : "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Action": [ "s3:GetObject", "s3:GetObjectVersion" ],
                    "Resource": { "Fn::Join": [ "", [ "arn:aws:s3:::", { "Ref": "DependenciesS3Bucket" }, "/*" ] ] }
                },
                {
                    "Effect": "Allow",
                    "Action": [ "s3:ListBucket" ],
                    "Resource": { "Fn::Join": [ "", [ "arn:aws:s3:::", { "Ref": "DependenciesS3Bucket" } ] ] }
                }
            ]
        },
        "Roles": [
            { "Ref" : "InstanceRole" },
            { "Ref" : "OtherInstanceRole" }
        ]
    }
},
"InstanceRole": {
    "Type": "AWS::IAM::Role",
    "Properties": {
        "AssumeRolePolicyDocument": {
            "Version" : "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": { "Service": [ "ec2.amazonaws.com" ] },
                    "Action": [ "sts:AssumeRole" ]
                }
            ]
        },
        "Path": "/"
    }
},
"InstanceProfile": {
    "Type": "AWS::IAM::InstanceProfile",
    "Properties": { "Path": "/", "Roles": [ { "Ref": "InstanceRole" } ] }
}

Ha ha ha ha ha, no.

The cfn-init logs showed that the process was getting 403s when trying to access the S3 object URL. I had incorrectly assumed that because the instance was running with the appropriate role (and it was, if I remoted onto the instance and attempted to download the object from S3 via the AWS Powershell Cmdlets, it worked just fine) that cfn-init would use that role.

It does not.

You still need to specify the AWS::CloudFormation::Authentication element, naming the role and the bucket that it will be used for. This feels a little crap, to be honest. Surely the cfn-init application is using the same AWS components, so why doesn’t it just pick up the credentials from the instance profile like everything else does?

Anyway, I added the Authentication element with appropriate values, like so.

"InstanceLaunchConfig": {
    "Type": "AWS::AutoScaling::LaunchConfiguration",
    "Metadata": {
        "Comment": "Set up instance",
        "AWS::CloudFormation::Init": {
            * snip *
        },
        "AWS::CloudFormation::Authentication": {
          "S3AccessCreds": {
            "type": "S3",
            "roleName": { "Ref" : "InstanceRole" },
            "buckets" : [ { "Ref" : "DependenciesS3Bucket" } ]
          }
        }
    },
    "Properties": {
        "KeyName": { "Ref": "KeyName" },
        "ImageId": { "Ref": "AmiId" },
        "SecurityGroups": [ { "Ref": "InstanceSecurityGroup" } ],
        "InstanceType": { "Ref": "ApiInstanceType" },
        "IamInstanceProfile": { "Ref": "InstanceProfile" },
        "UserData": {
            * snip *
        }
    }
}

Then I started getting different errors. You may think this is a bad thing, but I disagree. Different errors means progress. I’d switched from getting 403 responses (access denied) to getting 404s (not found).

Like I said, progress!

The Dependencies Archive is a Lie

It was at this point that I gave up trying to use the IAM roles. I could not for the life of me figure out why it was returning a 404 for a file that clearly existed. I checked and double checked the path, and even used the same path to download the file via the AWS Powershell Cmdlets on the machines that were having the issues. It all worked fine.

Assuming the issue was with my IAM role implementation, I rolled back to the solution that I knew worked: specifying the Access Key and Secret in the AWS::CloudFormation::Authentication element of the LaunchConfig, and removing the new IAM role resources (for readonly access to the dependencies archive).

"InstanceLaunchConfig": {
    "Type": "AWS::AutoScaling::LaunchConfiguration",
    "Metadata": {
        "Comment": "Set up instance",
        "AWS::CloudFormation::Init": {
            * snip *
        },
        "AWS::CloudFormation::Authentication": {
            "S3AccessCreds": {
                "type": "S3",
                "accessKeyId" : { "Ref" : "DependenciesS3BucketAccessKey" },
                "secretKey" : { "Ref": "DependenciesS3BucketSecretKey" },
                "buckets" : [ { "Ref":"DependenciesS3Bucket" } ]
            }
        }
    },
    "Properties": {
        "KeyName": { "Ref": "KeyName" },
        "ImageId": { "Ref": "AmiId" },
        "SecurityGroups": [ { "Ref": "InstanceSecurityGroup" } ],
        "InstanceType": { "Ref": "ApiInstanceType" },
        "IamInstanceProfile": { "Ref": "InstanceProfile" },
        "UserData": {
            * snip *
        }
    }
}

Imagine my surprise when it also didn’t work, throwing back the same response, 404 not found.

I tried quite a few things over the next few hours, and there was much gnashing and wailing of teeth. I’ve seen some weird crap with S3 and bucket names (too long and you get errors, weird characters in your key and you get errors, etc) but as far as I could tell, everything was kosher. Yet it just wouldn’t work.

After doing a line by line diff against the template/scripts that were working (the other environment setup) and my new template/scripts I realised my error.

While working on the IAM role stuff, trying to get it to work, I had attempted to remove case sensitivity from the picture by calling ToLowerInvariant on the dependencies archive URL that I was passing to my template. The old script/template combo didn’t do that.

When I took that out, it worked fine.

The issue was that the key of the file being uploaded was not being turned into lower case, only the URL used to retrieve it was, and S3 keys are case sensitive.
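The mismatch is easy to reproduce without touching AWS at all. A minimal sketch (the key below is hypothetical, just to illustrate the shape of the bug):

```python
# The file was uploaded under its original, mixed-case key, but the URL
# passed to the CloudFormation template had been run through ToLowerInvariant.
uploaded_key = "Environments/MyService/Dependencies.zip"
requested_key = uploaded_key.lower()  # "environments/myservice/dependencies.zip"

# S3 compares keys byte-for-byte, so these refer to two different objects,
# and requesting the lowercased one gets you a 404 Not Found.
print(uploaded_key == requested_key)  # False
```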

Goddamn it.

Summary

I lost basically an entire day to case sensitivity. It's not even the first time this has happened to me (though I think it's the first time it's happened in S3). I come from a heavy Windows background, where I don't even consider case sensitivity to be a thing. I can understand why it's a thing (technically different characters and all), but it just isn't on Windows, so it's not even on my radar most of the time. I assume the case sensitivity in S3 is a result of the AWS backend being Unix/Linux based, but it's still a shock to find a case sensitive URL.

It turns out that my IAM stuff had started working just fine, and I was getting 404s for an entirely different reason. I had assumed that I was still doing something wrong with my permissions and that the API was just giving a crappy response (i.e. not really a 404, but some sort of permission-based can't-find-file error masquerading as one).

At the very least I didn't make the silliest mistake you can make in software (assuming the platform is broken); I just assumed I had configured it wrong somehow. That's generally a fairly safe assumption when you're using a widely distributed system. Sometimes you do find a feature that is broken, but it is far more likely that you are just doing it wrong. In my case, the error message was completely accurate and was telling me exactly the right thing; I just didn't realise why.

Somewhat ironically, the root cause of my 404 issue was my attempt to remove case sensitivity from the picture when I was working on getting the IAM stuff up and running. I just didn’t apply the case insensitivity consistently.

Ah well.


As I’ve already stated, I’ve spent the last few weeks working on putting together log aggregation so that we know what our new service looks like in real time.

I’ve incorporated IIS logs, the application logs, machine statistics (memory, CPU, etc) and Windows Event logs into the log aggregator, and successfully used those events for analysis during load testing.

There was one piece missing though, which meant there was a hole in our ability to monitor how our service was actually operating in the face of actual usage.

The Elastic Load Balancer, or ELB, that sits in front of the publicly accessible web service.

During load testing, I noticed that sometimes JMeter would record an error (specifically a 504, Gateway Timeout) but our dashboard in Kibana would show nothing. No errors, everything seemed fine.

It turned out that there was a default timeout on the ELB of 60 seconds, and at that point in the load testing, some requests were taking longer than that without causing any traffic over the connection. The ELB would terminate the connection, return a 504 to the client, but the request would still complete successfully (eventually) in the backend.
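The 60 seconds is just the default, for what it's worth; if your backend legitimately needs longer, the idle timeout can be raised in the same CloudFormation template. A hedged sketch (resource name hypothetical), assuming the ConnectionSettings property is available on the load balancer resource:

```json
"PublicLoadBalancer": {
    "Type": "AWS::ElasticLoadBalancing::LoadBalancer",
    "Properties": {
        "ConnectionSettings": { "IdleTimeout": 120 },
        "..." : "* snip *"
    }
}
```

Of course, raising the timeout just hides slow requests rather than fixing them, so visibility into the ELB was still the priority.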

I needed to get eyes on the ELB.

Its Log!

Turning logging on for an ELB is fairly easy.

Just give it the S3 bucket you want it to log to, a prefix to use for entries made into the bucket and a time interval, and off it goes. All of this can be done through the CloudFormation template, which fits well into our strategy for environment setup (no manual tasks, automate all the things).
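In CloudFormation terms it's a single property on the load balancer resource. A hedged sketch (the resource and bucket names are hypothetical; EmitInterval is in minutes, and only 5 and 60 are valid):

```json
"PublicLoadBalancer": {
    "Type": "AWS::ElasticLoadBalancing::LoadBalancer",
    "Properties": {
        "AccessLoggingPolicy": {
            "Enabled": true,
            "S3BucketName": { "Ref": "ElbLogsBucket" },
            "S3BucketPrefix": "elb",
            "EmitInterval": 5
        },
        "..." : "* snip *"
    }
}
```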

The only complex bit is setting up a bucket policy with the correct permissions to allow the ELB to write to the bucket, which is all pretty well documented. There is a well-known ARN for what I assume is all load balancers in a region, and you set up a simple Put/Get/List policy to allow it to do its thing.
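The policy ends up looking something like the sketch below. The bucket name, prefix and account ID are hypothetical; the principal is the documented ELB account for your region (shown here for us-east-1):

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": { "AWS": "arn:aws:iam::127311923021:root" },
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::my-elb-logs-bucket/elb/AWSLogs/111122223333/*"
        }
    ]
}
```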

The only gotcha I ran into was when I included an underscore (_) in the prefix configuration setting for the ELB. The prefix setting is intended to make sure that the keys for files written into the bucket start with a common value. When I included an underscore, I got nothing but Access Denied errors. This was at the same time as I was setting up the bucket policy, so I assumed I had done that incorrectly. Turns out my bucket policy was flawless, and it was a completely unrelated (and unexpected) issue causing the Access Denied errors.

Very frustrating.

With that fixed though, the logs started flowing.

Content Rich

The ELB logs contain things like the ELB IP and port, where the request was forwarded to (IP and port again), the time to forward, process and respond to requests (3 separate entries, process is the time it takes for your server to do its thing), response codes, bytes transferred and other things. Very similar to IIS really, which is not unexpected.
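For reference, a single entry looks something like this (adapted from the AWS documentation's example, with placeholder addresses):

```
2015-05-13T23:39:43.945958Z my-loadbalancer 192.168.131.39:2817 10.0.0.1:80 0.000073 0.001048 0.000057 200 200 0 29 "GET http://www.example.com:80/ HTTP/1.1"
```

The three decimal values are the request, backend and response processing times (in seconds), followed by the ELB and backend status codes and the bytes received/sent.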

Now all I had to do was get the information into our Log Aggregator.

Stashing Those Logs

I had been using Nxlog as my log processor. It was responsible for picking up files, processing them as necessary, enriching them with various pieces of information (hostname, component, application) and then shipping the results off via TCP to our log aggregator where Logstash was listening.

Nxlog is a fine product, but its scripting language is hard to get a handle on, and the documentation is a bit sparse. It also has no concept of decimal numbers, which meant that I had to convert some numbers to integers (like decimal seconds to milliseconds) via regular expressions. Altogether it got the job done, but I wasn't particularly happy with it.

I thought that since I needed to do something a little bit more complicated (get files from S3 and process them), I would use Logstash this time. Logstash as a log processor is a lot easier to distribute, configure and debug, which is nice. Its configuration uses a simple, readable DSL, and it has lots of components to accomplish various tasks, like getting files from S3, parsing CSV lines, mutating fields to the correct type, etc. It even has a mutator (Logstash calls them filters) that allows you to execute arbitrary Ruby code for those times when you have to do something unusual.

Even better, Logstash is what's listening on the other end of the pipeline, so they play well together, and you only need to know 1 piece of software, instead of 2.

I built a distributable project similar to the one I had built for Nxlog: it creates a NuGet package that Octopus can deploy to get a copy of Logstash up and running on the target machine as a Windows Service. I won't go into this in too much detail, but it was essentially the same thing that I did for Nxlog, except with different dependencies (JRE, Logstash, NSSM for service installation/configuration).

I added a small EC2 instance to our environment setup to act as a Log Processor, with the intent that it would immediately be used to process the ELB logs, but might also be used in the future to process other logs that don't necessarily fit onto a specific machine (S3 access logs are the only ones that come to mind, but I'm sure there are more). The Logs Processor had an IAM role allowing it full control over the logs bucket that the ELB was using (which was also created as part of the environment). Nice and clean, and no credentials stored anywhere.

I created a Logstash configuration to grab files from S3 and process them, and then deployed it to the Logs Processor.
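The configuration ended up looking something like the sketch below. The bucket, host and port are hypothetical, and the grok pattern is abbreviated (the full ELB pattern has to cover every field in the log line):

```
input {
    s3 {
        bucket => "my-elb-logs-bucket"
        prefix => "elb/"
        region => "us-east-1"
    }
}

filter {
    grok {
        # Abbreviated; the real pattern continues through all the ELB fields
        match => [ "message", "%{TIMESTAMP_ISO8601:timestamp} %{NOTSPACE:elb} %{IP:client_ip}:%{INT:client_port} ..." ]
    }
    mutate {
        # Unlike Nxlog, Logstash handles decimal seconds directly
        convert => [ "backend_processing_time", "float" ]
    }
}

output {
    tcp {
        host => "our-log-aggregator"
        port => 6379
        codec => "json_lines"
    }
}
```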

Access Denied.

Permission to Rage Requested

The current release version of Logstash (1.4.2) does not support the usage of IAM roles for the S3 input. If I wanted to use that input, I would have to enter the credentials manually into the config file. I could do this easily enough at deployment time (storing the credentials in Octopus, which is much better than in source control), but I would need to actually have a user set up that could access the bucket. As the bucket is created during environment creation, this would mean that the credentials would change every time the environment was recreated. We create temporary environments all the time, so this would mean a manual step editing Octopus every time you wanted to get something to work.

That's unacceptable.

I contemplated using a small script during deployment time to grab some credentials from the IAM role on the machine and enter them into the config file, but those credentials expire and Logstash was running as a service, so at some stage it would just stop working and someone would have to do something to make it work again.

Again, unacceptable.

Luckily for me, the wonderful people behind Logstash (and specifically the S3 input plugin) have developed a new version that allows the usage of IAM roles, and it was already in beta. It's still a little unstable (Release Candidate 2), but it was good enough for my purposes.

While doing some reading about Logstash and the new version I discovered that the file input was basically completely broken on Windows. The component that it leverages to get the unique identifier for files (in order to record the position it was up to in each file) does not work in 1.4.2 and below, so you end up missing huge chunks of data when processing multiple files. This actually explained why I was having so much difficulty using the earlier version to process a large amount of IIS logs from a disconnected machine, and why there were holes in my data. Long story short, if you're using the file input in Logstash and you're on Windows, get the latest release candidate.

I incorporated the 1.5 RC2 release into my deployment, but I still couldn’t get the S3 input to work.

Why Is It Always A Proxy

I hate proxies.

Not because of what they are. I think they actually do some useful things, like caching, obfuscating where requests are coming from when accessing the internet from within a network and preventing access to bad websites.

No I hate proxies because the support for them is always a pain in the ass. Every application seems to support proxies differently, if they support them at all. Some automatically read from the Internet Explorer registry setting for the proxy, some use the HTTP_PROXY environment variable, some have their own personal settings. This means that every time you want to use a piece of software in an environment that uses a proxy, you have to fight with it to get it to work.

Such was the case with the S3 input. The underlying Ruby-based aws-sdk has support for proxies, as does the .NET one (which the AWS PowerShell tools expose via the Set-AwsProxy cmdlet).

I could not, for the life of me, figure out how to configure Logstash with a proxy for the AWS component though.

So, I was stuck. I had all the configuration in place to process the ELB logs, but I didn't have the logs themselves.

In the end I created a small Powershell script that uses the AWS Powershell Component to move all files from an S3 bucket to a local directory on a timer. I then installed that script as a Windows Service using NSSM. Finally I edited my Logstash configuration to process the local files instead. After tweaking my config to process the files correctly, everything started coming through into the Log Aggregator as expected, and I added the missing piece to our intelligence about the service.
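The script itself is quite simple; a hedged sketch (the bucket, prefix and destination folder are hypothetical) using cmdlets from the AWS PowerShell tools, which pick up credentials from the instance's IAM role automatically:

```powershell
# Periodically move ELB log files from S3 to a local folder for Logstash to pick up.
while ($true)
{
    # Download everything under the prefix to the local folder.
    Read-S3Object -BucketName "my-elb-logs-bucket" -KeyPrefix "elb" -Folder "C:\logs\elb" | Out-Null

    # Delete the downloaded objects so they aren't processed twice.
    Get-S3Object -BucketName "my-elb-logs-bucket" -KeyPrefix "elb" |
        ForEach-Object { Remove-S3Object -BucketName "my-elb-logs-bucket" -Key $_.Key -Force }

    Start-Sleep -Seconds 60
}
```

Installed as a service via NSSM, this runs forever and keeps the local folder fed; note the download-then-delete approach is racy in theory, but ELB only writes new keys, so it was good enough here.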

I don’t like this solution, because it adds more moving parts than I think is strictly necessary, but sometimes you have to make compromises.

Summary

I’ve uploaded a repository with my deployable build for Logstash here, so hopefully someone else can benefit from the effort that I put into making it re-usable.

Setting up a deployment pipeline for this component saved me a lot of time throughout the development process, making redeploying my changes when I made a mistake or needed to change a dependency (like upgrading to Logstash 1.5 RC2) a breeze. I highly recommend spending that initial bit of effort in setting things up at the start so you can move quickly later.

In regards to the actual ELB logs, they don't provide any groundbreaking information that IIS didn't already give us, except for the case where connections are terminated at the ELB due to inactivity. At least to my knowledge anyway; I suppose they will also track when the underlying instances go offline, which will be good. The ELB entries come in a bit slower than the IIS ones (due to the delay before the log files are published from the ELB, plus the delay added by my own S3 downloader and Logstash file processor pair), but there's not really much I can do about that.

I still hate proxies. Only because it's easier to hate one thing than every application that doesn't support them.