IAM Legend

May 19. 2015 0 Comments

I’ve been doing a lot of work with AWS recently.

For the last service component that we developed, we put together a CloudFormation template and a series of Powershell scripts to setup, tear down and migrate environments (like CI, Staging, Production, etc). It was extremely effective, baring some issues that we still haven’t quite solved with data migration between environment versions and updating machine configuration settings.

In the first case, an environment is obviously not stateless once you start using it, and you need a good story about maintaining user data between environment versions, at the very least for Production.

In the second case tearing down an entire environment just to update a configuration setting is obviously sub-optimal. We try to make sure that most of our settings are encapsulated within components that we deploy, but not everything can be done this way. CloudFormation does have update mechanisms, I just haven’t had a chance to investigate them yet.

But I digress, lets switch to an entirely different topic for this post How to give secure access to objects in an S3 bucket during initialization of EC2 instances while executing a CloudFormation template.

That was a mouthful.

Don’t Do What Donny Don’t Does

My first CloudFormation template/environment setup system had a fairly simple rule. Minimise dependencies.

There were so many example templates on the internet that just downloaded arbitrary scripts or files from GitHub or S3, and to me that’s the last thing you want. When I run my environment setup (ideally from within a controlled environment, like TeamCity) I want it to use the versions of the resources that are present in the location I’m running the script from. It should be self contained.

Based on that rule, I put together a fairly simple process where the Git Repository housing my environment setup contained all the necessary components required by the resources in the CloudFormation template, and the script was responsible for collecting and then uploading those components to some location that the resources could access.

At the time, I was not very experienced with S3, so I struggled a lot with getting the right permissions.

Eventually I solved the issue by handing off the AWS Key/Secret to the CloudFormation template, and then using those credentials in the AWS::CloudFormation::Authentication block inside the resource (LaunchConfig/Instance). The URL of the dependencies archive was then supplied to the source element of the first initialization step in the AWS::CloudFormation::Init block, which used the supplied credentials to download the file and extract its contents (via cfn-init) to a location on disk, ready to be executed by subsequent components.

This worked, but it left a bad taste in my mouth once I learnt about IAM roles.

IAM roles give you the ability to essentially organise sets of permissions that can be applied to resources, like EC2 instances. For example, we have a logs bucket per environment that is used to capture ELB logs. Those logs are then processed by Logstash (indirectly, because I can’t get the goddamn S3 input to work with a proxy, but I digress) on a dedicated logs processing instance. I could have gone about this in two ways. The first would have been to supply the credentials to the instance, like I had in the past. This exposes those credentials on the instance though, which can be dangerous. The second option is to apply a role to the instance that says “you are allowed to access this S3 bucket, and you can do these things to it”.

I went with the second option, and it worked swimmingly (once I got it all configured).

Looking back at the way I had done the dependency distribution, I realised that using IAM roles would be a more secure option, closer to best practice. Now I just needed a justifiable opportunity to implement it.

New Environment, Time to Improve

We’ve started work on a new service, which means new environment setup. This is a good opportunity to take what you’ve done previously and reuse it, improving it along the way. For me, this was the perfect chance to try and use IAM roles for the dependency distribution, removing all of those nasty “credentials in the clear” situations.

I followed the same process that I had for the logs processing. Setup a role describing the required policy (readonly access to the S3 bucket that contains the dependencies) and then link that role to a profile. Finally, apply the profile to the instances in question.

"ReadOnlyAccessToDependenciesBucketRole": {
    "Type": "AWS::IAM::Role",
    "Properties": {
        "AssumeRolePolicyDocument": {
            "Version" : "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": { "Service": [ "ec2.amazonaws.com" ] },
                    "Action": [ "sts:AssumeRole" ]
                }
            ]
        },
        "Path": "/",
        "Policies" : [
            {
                "Version" : "2012-10-17",
                "Statement": [
                    {
                        "Effect": "Allow",
                        "Action": [ "s3:GetObject", "s3:GetObjectVersion" ],
                        "Resource": { "Fn::Join": [ "", [ "arn:aws:s3:::", { "Ref": "DependenciesS3Bucket" }, "/*" ] ] }
                    },
                    {
                        "Effect": "Allow",
                        "Action": [ "s3:ListBucket" ],
                        "Resource": { "Fn::Join": [ "", [ "arn:aws:s3:::", { "Ref": "DependenciesS3Bucket" } ] ] }
                    }
                ]
            }
        ]
    }
},
"ReadOnlyDependenciesBucketInstanceProfile": {    
    "Type": "AWS::IAM::InstanceProfile",    
    "Properties": { 
        "Path": "/", 
        "Roles": [ { "Ref": "ReadOnlyDependenciesBucketRole" }, { "Ref": "FullControlLogsBucketRole" } ] 
    }
},
"InstanceLaunchConfig": {    
    "Type": "AWS::AutoScaling::LaunchConfiguration",
    "Metadata": {        
        * snip *    
    },    
    "Properties": {        
        "KeyName": { "Ref": "KeyName" },        
        "ImageId": { "Ref": "AmiId" },        
        "SecurityGroups": [ { "Ref": "InstanceSecurityGroup" } ],        
        "InstanceType": { "Ref": "InstanceType" },        
        "IamInstanceProfile": { "Ref": "ReadOnlyDependenciesBucketInstanceProfile" },        
        "UserData": {            
            * snip *        
        }    
    }
}

It worked before, so it should work again, right? I’m sure you can probably guess that that was not the case.

The first mistake I made was attempting to specify multiple roles in a single profile. I wanted to do this because the logs processor needed to maintain its permissions to the logs bucket, but it needed the new permissions to the dependencies bucket as well. Even though the roles element is defined as an array, it can only accept a single element. I now hate whoever designed that, even though I’m sure they probably had a good reason.

At least that was an easy fix, flip the relationship between roles and policies. I split the inline policies out of the roles, then linked the roles to the policies instead. Each profile only had 1 role, so everything should have been fine.

"ReadOnlyDependenciesBucketPolicy": {
    "Type":"AWS::IAM::Policy",
    "Properties": {
        "PolicyName": "ReadOnlyDependenciesBucketPolicy",
        "PolicyDocument": {
            "Version" : "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Action": [ "s3:GetObject", "s3:GetObjectVersion" ],
                    "Resource": { "Fn::Join": [ "", [ "arn:aws:s3:::", { "Ref": "DependenciesS3Bucket" }, "/*" ] ] }
                },
                {
                    "Effect": "Allow",
                    "Action": [ "s3:ListBucket" ],
                    "Resource": { "Fn::Join": [ "", [ "arn:aws:s3:::", { "Ref": "DependenciesS3Bucket" } ] ] }
                }
            ]
        },
        "Roles": [
            { "Ref" : "InstanceRole" },
            { "Ref" : "OtherInstanceRole" }
        ]
    }
},
"InstanceRole": {
    "Type": "AWS::IAM::Role",
    "Properties": {
        "AssumeRolePolicyDocument": {
            "Version" : "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": { "Service": [ "ec2.amazonaws.com" ] },
                    "Action": [ "sts:AssumeRole" ]
                }
            ]
        },
        "Path": "/"
    }
},
"InstanceProfile": {
    "Type": "AWS::IAM::InstanceProfile",
    "Properties": { "Path": "/", "Roles": [ { "Ref": "InstanceRole" } ] }
}

Ha ha ha ha ha, no.

The cfn-init logs showed that the process was getting 403s when trying to access the S3 object URL. I had incorrectly assumed that because the instance was running with the appropriate role (and it was, if I remoted onto the instance and attempted to download the object from S3 via the AWS Powershell Cmdlets, it worked just fine) that cfn-init would use that role.

It does not.

You still need to specify the AWS::CloudFormation::Authentication element, naming the role and the bucket that it will be used for. This feel s a little crap to be honest. Surely the cfn-init application is using the same AWS components, so why doesn’t it just pickup the credentials from the instance profile like everything else does?

Anyway, I added the Authentication element with appropriate values, like so.

"InstanceLaunchConfig": {
    "Type": "AWS::AutoScaling::LaunchConfiguration",
    "Metadata": {
        "Comment": "Set up instance",
        "AWS::CloudFormation::Init": {
            * snip *
        },
        "AWS::CloudFormation::Authentication": {
          "S3AccessCreds": {
            "type": "S3",
            "roleName": { "Ref" : "InstanceRole" },
            "buckets" : [ { "Ref" : "DependenciesS3Bucket" } ]
          }
        }
    },
    "Properties": {
        "KeyName": { "Ref": "KeyName" },
        "ImageId": { "Ref": "AmiId" },
        "SecurityGroups": [ { "Ref": "InstanceSecurityGroup" } ],
        "InstanceType": { "Ref": "ApiInstanceType" },
        "IamInstanceProfile": { "Ref": "InstanceProfile" },
        "UserData": {
            * snip *
        }
    }
}

Then I started getting different errors. You may think this is a bad thing, but I disagree. Different errors means progress. I’d switched from getting 403 responses (access denied) to getting 404s (not found).

Like I said, progress!

The Dependencies Archive is a Lie

It was at this point that I gave up trying to use the IAM roles. I could not for the life of me figure out why it was returning a 404 for a file that clearly existed. I checked and double checked the path, and even used the same path to download the file via the AWS Powershell Cmdlets on the machines that were having the issues. It all worked fine.

Assuming the issue was with my IAM role implementation, I rolled back to the solution that I knew worked. Specifying the Access Key and Secret in the AWS::CloudFormation::Authentication element of the LaunchConfig and removed the new IAM roles resources (for readonly access to the dependencies archive).

"InstanceLaunchConfig": {
    "Type": "AWS::AutoScaling::LaunchConfiguration",
    "Metadata": {
        "Comment": "Set up instance",
        "AWS::CloudFormation::Init": {
            * snip *
        },
        "AWS::CloudFormation::Authentication": {
            "S3AccessCreds": {
                "type": "S3",
                "accessKeyId" : { "Ref" : "DependenciesS3BucketAccessKey" },
                "secretKey" : { "Ref": "DependenciesS3BucketSecretKey" },
                "buckets" : [ { "Ref":"DependenciesS3Bucket" } ]
            }
        }
    },
    "Properties": {
        "KeyName": { "Ref": "KeyName" },
        "ImageId": { "Ref": "AmiId" },
        "SecurityGroups": [ { "Ref": "InstanceSecurityGroup" } ],
        "InstanceType": { "Ref": "ApiInstanceType" },
        "IamInstanceProfile": { "Ref": "InstanceProfile" },
        "UserData": {
            * snip *
        }
    }
}

Imagine my surprise when it also didn’t work, throwing back the same response, 404 not found.

I tried quite a few things over the next few hours, and there was much gnashing and wailing of teeth. I’ve seen some weird crap with S3 and bucket names (too long and you get errors, weird characters in your key and you get errors, etc) but as far as I could tell, everything was kosher. Yet it just wouldn’t work.

After doing a line by line diff against the template/scripts that were working (the other environment setup) and my new template/scripts I realised my error.

While working on the IAM role stuff, trying to get it to work, I had attempted to remove case sensitivity from the picture by calling ToLowerInvariant on the dependencies archive URL that I was passing to my template. The old script/template combo didn’t do that.

When I took that out, it worked fine.

The issue was that the key of the file being uploaded was not being turned into lower case, only the URL of the resulting file was, and AWS keys are case sensitive.

…

Goddamn it.

Summary

I lost basically an entire day to case sensitivity. Its not even the first time this has happened to me (well, its the first time its happened in S3 I think). I come from a heavy Windows background. I don’t even consider case sensitivity to be a thing. I can understand why its a thing (technically different characters and all), but its just not on windows, so its not even on my radar most of the time. I assume the case sensitivity in S3 is a result of the AWS backend being Unix/Linux based, but its still a shock to find a case sensitive URL.

I turns out that my IAM stuff had started working just fine and I was getting 404s because of an entirely different reason. I had assumed that I was still doing something wrong with my permissions and the API was just giving a crappy response (i.e. not really a 404, some sort of permission based can’t find file error masquerading as a 404).

At the very least I didn’t make the silliest mistake you can make in software (assuming the platform is broken), I just assumed I had configured it wrong somehow. That’s generally a fairly safe assumption when you’re using a widely distributed system. Sometimes you do find a feature that is broken, but it is far more likely that you are just doing it wrong. In my case, the error message was completely accurate, and was telling me exactly the right thing, I just didn’t realise why.

Somewhat ironically, the root cause of my 404 issue was my attempt to remove case sensitivity from the picture when I was working on getting the IAM stuff up and running. I just didn’t apply the case insensitivity consistently.

Ah well.