
Over the last few weeks, I’ve been sharing the bits and pieces that went into our construction of an ELB logs processing pipeline using AWS Lambda.

As I said in the introduction, the body of work around the processor can be broken down into three pieces: the Lambda function itself (and the Javascript inside it), the incorporation of that function into our environment management via CloudFormation, and the packaging and deployment of the function code through Octopus.

I’ve gone through each of these components in detail in the posts I’ve linked above, so this last post is really just to tie it all together and reflect on the process, as well as provide a place to mention a few things that didn’t neatly fit into any of the buckets above.

Also, I’ll obviously be continuing the pattern of naming every sub-header after a chapter title in the original Half-Life, almost all of which have proven to be surprisingly apt for whatever topic is being discussed. I mean seriously, look at the next one. How is that not perfectly fitting for a conclusion/summary post?

Residue Processing

It took a fair amount of effort (and a decent amount of time) for us to get to the solution we have in place now. The whole thing was put together over the course of a few weeks by one of the people I work with, with some guidance and feedback from other members of the team from time to time. That timeframe covered developing, testing and then deploying the solution into a real production environment, by a person with little to no prior working knowledge of the AWS toolset, so I think it was a damn good effort.

The most time consuming part was the long turnaround on environment builds, because each build needs to run a suite of tests which involve creating and destroying at least one environment, sometimes more. In reality, this means a wait time of something like 30-60 minutes per build, which is so close to eternity as to be effectively indistinguishable from it. I’ll definitely have to come up with some sort of way to tighten this feedback loop, but since most of it is actually just waiting for AWS resources, I’m not really sure what I can do.

The hardest part of the whole process was probably just working with Lambda for the first time outside of the AWS management website.

As a team, we’d used Lambda before (back when I tried to make something to clone S3 buckets more quickly), but we’d never tried to manage the various Lambda bits and pieces through CloudFormation.

It turns out that the AWS website does a hell of a lot of things in order to make sure that your Lambda function runs, including dealing with profiles and permissions, network interfaces, listeners and so on. Having to do all of that explicitly through CloudFormation was something of a learning process.

Speaking of CloudFormation and Lambda, we ran into a nasty bug with Elastic Network Interfaces and VPC hosted Lambda functions created through CloudFormation, where the CloudFormation stack doesn’t delete cleanly because the ENI is still in use. It looks like it’s a known issue, so I assume it will be fixed at some point in the future, but as a result we had to include some additional cleanup in the Powershell that wraps our environment management, to check the stack for Lambda functions and manually detach and delete their ENIs before we try to delete the stack.
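In case it helps anyone hitting the same problem, the cleanup looks roughly like the sketch below. The names and the exact filtering are illustrative rather than our code verbatim, but the shape is right: find the security groups created for the stack, find any network interfaces still referencing them, then detach and delete those interfaces before kicking off the stack delete.

# Sketch only: locate ENIs left behind by VPC hosted Lambda functions in a stack and
# remove them so that the subsequent stack delete can complete cleanly. In practice the
# detach can take a little while, so our real script also waits and retries.
$resources = Get-CFNStackResources -StackName $stackName -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
$securityGroups = $resources | Where-Object { $_.ResourceType -eq "AWS::EC2::SecurityGroup" }

foreach ($securityGroup in $securityGroups)
{
    $filter = New-Object Amazon.EC2.Model.Filter
    $filter.Name = "group-id"
    $filter.Values = @($securityGroup.PhysicalResourceId)

    $enis = Get-EC2NetworkInterface -Filter $filter -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

    foreach ($eni in $enis)
    {
        if ($eni.Attachment -ne $null)
        {
            Dismount-EC2NetworkInterface -AttachmentId $eni.Attachment.AttachmentId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
        }

        Remove-EC2NetworkInterface -NetworkInterfaceId $eni.NetworkInterfaceId -Force -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
    }
}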

This isn’t the first time we’ve had to manually clean up resources “managed” by CloudFormation. We do the same thing with S3 buckets, because CloudFormation won’t delete a bucket with anything in it (and some of our buckets, like the ELB logs ones, are constantly being written to by other AWS services).
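The bucket cleanup is simpler. A hedged sketch (our real script also deals with paging through very large buckets, which I’ve omitted here):

# Sketch only: empty the bucket so CloudFormation can delete the (now empty) bucket resource.
Get-S3Object -BucketName $logsBucketName -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion |
    ForEach-Object { Remove-S3Object -BucketName $logsBucketName -Key $_.Key -Force -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion }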

The only other difficult part of the whole thing is one I’ve already mentioned in the deployment post: figuring out how we could incorporate non-machine based Octopus deployments into our environments. For now they just happen after the actual AWS stack is created (as part of the Powershell scripts wrapping the entire process) and rely on an Octopus Tentacle registered in each environment on the Octopus Server machine itself, used as a script execution point.
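Conceptually, the trigger for those deployments is nothing more exotic than the wrapping scripts shelling out to octo.exe once the stack is up, something like the sketch below (the project name, server URL and API key are placeholders, not our real details):

# Illustrative only: create and deploy a release for the environment that was just created.
$octoArguments = @(
    "create-release",
    "--project", "ELB Log Processor",
    "--deployto", $octopusEnvironment,
    "--waitfordeployment",
    "--server", $octopusServerUrl,
    "--apiKey", $octopusApiKey
)

& $octoExecutablePath @octoArguments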

Conclusion

Having put this whole system in place, the obvious question is “Was it worth it?”.

For me, the answer is “definitely”.

We’ve managed to retire a few hacky components (a Windows service running Powershell scripts via NSSM to download files from an S3 bucket, for example) and removed an entire machine from every environment that needs to process ELB logs. It’s not often that you get to reduce both running and maintenance costs in one blow, so it was nice to get that accomplished.

Ignoring the reduced costs to the business for a second, we’ve also decreased the latency for receiving our ELB logs for analysis because rather than relying on a polling system, we’re now triggering the processing directly when the ELB writes the log file into S3.

Finally, we’ve gained some more experience with systems and services that we haven’t really had a chance to look into, allowing us to leverage that knowledge and tooling for other, potentially more valuable purposes.

All in all, I consider this exercise a resounding success, and I’m happy I was able to dedicate some time to improving an existing process, even though it was already “working”.

Improving existing engineering like this is incredibly valuable to the morale of a development team, and morale is an important and limited resource.


It’s that time again, kids: time to continue the series of posts about how we improved the processing of our ELB logs into our ELK stack using AWS Lambda.

You can find the introduction to this whole adventure here, but last time I wrote about the Javascript content of the Lambda function that does the bulk of the work.

This time I’m going to write about how we incorporated the creation of that Lambda function into our environment management strategy and some of the tricks and traps therein.

On a completely unrelated note, it would be funny if this blog post turned up in search results for Half-Life 3.

We’ve Got Hostiles!

I’ve pushed hard to codify our environment setup where I work. The main reason for this is reproducibility, but the desire comes from a long history of interacting with manually setup environments that are lorded over by one guy who just happened to know the guy who originally set them up and where everyone is terrified of changing or otherwise touching said environments.

It’s a nightmare.

As far as environment management goes, I’ve written a couple of times about environment related things on this blog, one of the most recent being the way in which we version our environments. To give some more context for this post, I recommend you go and read at least the versioning post in order to get a better understanding of how we do environment management. Our strategy is still a work in progress, but it’s getting better all the time.

Regardless of whether or not you followed my suggestion, we use a combination of versioned Nuget packages, Powershell, CloudFormation and Octopus Deploy to create an environment, where an environment is a self contained chunk of infrastructure and code that performs some sort of role, the most common of which is acting as an API. We work primarily with EC2 instances (in Auto Scaling Groups behind Elastic Load Balancers), and historically, we’ve deployed Logstash to each instance alongside the code to provide log aggregation (IIS, Application, System Stats, etc). When it comes to capturing and aggregating ELB logs, we include a standalone EC2 instance in the environment, also running Logstash. This standalone instance is the part of the system that we are aiming to replace with the Lambda function.

Because we make extensive use of CloudFormation, incorporating the creation of a Lambda function into an environment that needs to have ELB logs processed is a relatively simple affair.

Simple in that it fits nicely with our current approach. Getting it all to work as expected was still a massive pain.

Blast Pit

Below is a fragment of a completed CloudFormation template for reference purposes.

In the interests of full disclosure, I did not write most of the following fragment, another member of my team was responsible. I just helped.

{
    "Description": "This template is a fragment of a larger template that creates an environment. This fragment in particular contains all of the necessary bits and pieces for a Lambda function that processes ELB logs from S3.",
    "Parameters": {
        "ComponentName": {
            "Description": "The name of the component that this stack makes up. This is already part of the stack name, but is here so it can be used for naming/tagging purposes.",
            "Type": "String"
        },
        "OctopusEnvironment": {
            "Description": "Octopus Environment",
            "Type": "String"
        },
        "PrivateSubnets": {
            "Type": "List<AWS::EC2::Subnet::Id>",
            "Description": "Private subnets (i.e. ones that are not automatically assigned public IP addresses) spread across availability zones, intended to contain the Lambda function and other internal components.",
            "ConstraintDescription": "must be a list of existing subnets in the selected Virtual Private Cloud."
        },
        "LogsS3BucketName": {
            "Description": "The name of the bucket where log files for the ELB and other things will be placed.",
            "Type": "String"
        },
        "VpcId": {
            "Description": "The VPC in which the Lambda function's security group will be created.",
            "Type": "AWS::EC2::VPC::Id"
        }
    },
    "Resources": {
        "LogsBucket" : {
            "Type" : "AWS::S3::Bucket",
            "Properties" : {
                "BucketName" : { "Ref": "LogsS3BucketName" },
                "LifecycleConfiguration": {
                    "Rules": [
                        {
                            "Id": "1",
                            "ExpirationInDays": 7,
                            "Status": "Enabled"
                        }
                    ]
                },
                "Tags" : [
                    {
                        "Key": "function",
                        "Value": "log-storage"
                    }
                ],
                "NotificationConfiguration" : {
                  "LambdaConfigurations": [
                    {
                      "Event" : "s3:ObjectCreated:*",
                      "Function" : { "Fn::GetAtt" : [ "ELBLogProcessorFunction", "Arn" ] }
                    }
                  ]
                }
            }
        },
        "ELBLogProcessorFunctionPermission": {
            "Type" : "AWS::Lambda::Permission",
            "Properties" : {
                "Action":"lambda:invokeFunction",
                "FunctionName": { "Fn::GetAtt": [ "ELBLogProcessorFunction", "Arn" ]},
                "Principal": "s3.amazonaws.com",
                "SourceAccount": {"Ref" : "AWS::AccountId" },
                "SourceArn": {
                    "Fn::Join": [":", [ "arn","aws","s3","", "" ,{"Ref" : "LogsS3BucketName"}]]
                }
            }
        },
        "LambdaSecurityGroup": {
            "Type": "AWS::EC2::SecurityGroup",
            "Properties": {
                "GroupDescription": "Enabling all outbound communications",
                "VpcId": {
                    "Ref": "VpcId"
                },
                "SecurityGroupEgress": [
                    {
                        "IpProtocol": "tcp",
                        "FromPort": "0",
                        "ToPort": "65535",
                        "CidrIp": "0.0.0.0/0"
                    }
                ]
            }
        },
        "ELBLogProcessorFunction": {
          "Type": "AWS::Lambda::Function",
          "Properties": {
            "FunctionName": { "Fn::Join": [ "", [ { "Ref" : "ComponentName" }, "-", { "Ref" : "OctopusEnvironment" }, "-ELBLogProcessorFunction"  ] ] },
            "Description": "ELB Log Processor",
            "Handler": "index.handler",
            "Runtime": "nodejs4.3",
            "Code": {
              "ZipFile": "console.log('placeholder for lambda code')"
            },
            "Role": { "Fn::GetAtt" : ["LogsBucketAccessorRole", "Arn"]},
            "VpcConfig": {
              "SecurityGroupIds": [{"Fn::GetAtt": ["LambdaSecurityGroup", "GroupId"]}],
              "SubnetIds": { "Ref": "PrivateSubnets" }
            }
          }
        },
        "LogsBucketAccessorRole": {
          "Type": "AWS::IAM::Role",
          "Properties": {
            "AssumeRolePolicyDocument": {
                "Version": "2012-10-17",
                "Statement": [
                    {
                        "Effect": "Allow",
                        "Principal": { "Service" : ["lambda.amazonaws.com"]},
                        "Action": [
                            "sts:AssumeRole"
                        ]
                    }
                ]
            },
            "Path": "/",
            "Policies": [{ 
              "PolicyName": "access-s3-read",
              "PolicyDocument": {
                "Version": "2012-10-17",
                "Statement": [
                    {
                        "Effect": "Allow",
                        "Action": [
                            "s3:GetObject"
                        ],
                        "Resource": {
                            "Fn::Join": ["", [ "arn:aws:s3:::", {"Ref" : "LogsS3BucketName"}, "/*"]]
                        }
                    }
                ]
              }
            },
            {
              "PolicyName": "access-logs-write",
              "PolicyDocument": {
                "Version": "2012-10-17",
                "Statement": [
                    {
                        "Effect": "Allow",
                        "Action": [
                            "logs:CreateLogGroup",
                            "logs:CreateLogStream",
                            "logs:PutLogEvents",
                            "logs:DescribeLogStreams"
                        ],
                        "Resource": {
                            "Fn::Join": [":", [ "arn","aws","logs", { "Ref": "AWS::Region" }, {"Ref": "AWS::AccountId"}, "log-group", "/aws/lambda/*", "*"]]
                        }
                    }
                ]
              }
            },
            {
              "PolicyName": "access-ec2",
              "PolicyDocument": {
                "Version": "2012-10-17",
                "Statement": [
                    {
                        "Effect": "Allow",
                        "Action": [
                          "ec2:*"
                        ],
                        "Resource": "arn:aws:ec2:::*"
                    }
                ]
              }
            },
            {
              "PolicyName": "access-ec2-networkinterface",
              "PolicyDocument": {
                  "Version": "2012-10-17",
                  "Statement": [
                    {
                      "Effect": "Allow",
                      "Action": [
                        "ec2:DescribeInstances",
                        "ec2:CreateNetworkInterface",
                        "ec2:AttachNetworkInterface",
                        "ec2:DescribeNetworkInterfaces",
                        "ec2:DeleteNetworkInterface",
                        "ec2:DetachNetworkInterface",
                        "ec2:ModifyNetworkInterfaceAttribute",
                        "ec2:ResetNetworkInterfaceAttribute",
                        "autoscaling:CompleteLifecycleAction"
                      ],
                      "Resource": "*"
                    }
                  ]
                }
              }
            ]
          }
        }
    }
}

The most important part of the template above is the ELBLogProcessorFunction. This is where the actual Lambda function is specified, although you might notice that it does not actually have the code from the previous post attached to it in any way. The reason for this is that we create the Lambda function with placeholder code, and then use Octopus Deploy afterwards to deploy a versioned package containing the actual code to the Lambda function, like we do for everything else. Packaging and deploying the Lambda function code is a topic for another blog post though (the next one, hopefully).

Other things to note in the template fragment:

  • Lambda functions require a surprising amount of permissions to do what they do. When creating a function using the AWS website, most of this complexity is dealt with for you. When using CloudFormation however, you have to be aware of what the Lambda function needs and give it the appropriate permissions. You could just give the Lambda function as many permissions as possible, but that would be stupid in the face of “least privilege”, and would represent a significant security risk (compromised Lambda code being able to do all sorts of crazy things for example).
    • Logging is a particularly important example of the permissions a Lambda function needs. Without the capability to create log groups and streams, your function is going to be damn near impossible to debug.
  • If you’re using S3 as the trigger for your Lambda function, you need to make sure that S3 has permissions to execute the function. This is the ELBLogProcessorFunctionPermission logical resource in the template fragment. Without this, your Lambda function will never trigger, even if you have setup the NotificationConfiguration on the bucket itself.
  • If your Lambda function needs to access external resources (like S3) you will likely have to use Private Subnets + a NAT Gateway to give it that ability. Technically you could also use a proxy, but god why would you do that to yourself. If you put your Lambda function into a Public Subnet, I’m pretty sure it doesn’t automatically get access to the greater internet like an EC2 instance does, and you will be intensely confused as to why all your external calls timeout.
  • Make sure you apply an appropriate Security Group to your Lambda function so that it can communicate externally, or you’ll get mysterious timeouts that look exactly the same as the ones you get when you haven’t set up general internet access correctly.

To Be Continued

So that’s how we set up the Lambda function as part of any environment that needs to process ELB logs. Remember, the template fragment above is incomplete, and is missing Auto Scaling Groups, Launch Configurations, Load Balancers, Host Records and a multitude of other things that make up an actual environment. What I’ve shown above is enough to get the pipeline up and running, where any object introduced into the LogsBucket will trigger an execution of the Lambda function, so it’s enough to illustrate our approach.

Of course, the function doesn’t do anything yet, which ties in with the next post.

How we get the code into Lambda.

Until then, may all your Lambda function executions be swift and free of errors.


For anyone following the saga of my adventures with RavenDB, good news! It turns out it’s much easier to run a RavenDB server on reasonable hardware when you just put less data in it. I cleaned out the massive chunk of abandoned data over the last few weeks and everything is running much better now.

That’s not what this blog post is about though, which I’m sure is disappointing at a fundamental level.

This post is a quick one about some of the fun times that we had setting up access to customer data for the business to use for analysis purposes. What data? Well, let’s go back a few steps to set the stage.

Our long term strategy has been to free customer data stuck in on-premises databases so that the customer can easily use it remotely, in mobile applications and webpages, without having to have physical access to their database server (i.e. be in the office, or connected to their office network). This sort of strategy benefits both parties, because the customer gains access to new, useful services for when they are on the move (very important for real estate agents) and we get to develop new and exciting tools and services that leverage their data. Win-win.

Of course, everything we do involving this particular product is a stopgap until we migrate those customers to a completely cloud based offering, but that’s a while away yet, and we need to stay competitive in the meantime.

As part of the creation of one of our most recent services, we consolidated the customer data location in the Cloud, and so now we have a multi tenant database in AWS that contains a subset of all of the data produced by our customers. We built the database to act as the backend for a system that allows the customer to easily view their data remotely (read only access), but the wealth of information available in that repository piqued the interest of the rest of the business, mostly around using it to calculate statistics and comparison points across our entire customer base.

Now, as a rule of thumb, I’m not going to give anyone access to a production database in order to perform arbitrary, ad-hoc queries, no matter how hard they yell at me. There are a number of concerns that lead towards this mindset, but the most important one is that the database has been optimized to work best for the applications that run on it. It is not written with generic, ad-hoc intelligence queries in mind, and any such queries could potentially have an impact on the operation of the database for its primary purpose. The last thing I want is for someone to decide they want to calculate some heavy statistics over all of the data present, tying up resources that are necessary to answer queries that customers are actually asking. Maintaining quality of service is critically important.

However, the business desire is reasonable and real value could be delivered to the customer with any intelligence gathered.

So what were we to do?

Stop Copying Me

The good thing about working with AWS is that someone, somewhere has probably already tried to do what you’re trying to do, and if you’re really lucky, Amazon has already built in features to make doing the thing easy.

Such was the case with us.

An RDS read-replica neatly resolves all of my concerns. The data will be asynchronously copied from the master to the replica, allowing business intelligence queries to be performed with wild abandon without having to be concerned with affecting the customer experience. You do have to be aware of the eventually consistent nature of the replica, but that’s not as important when the queries being done aren’t likely to be time critical. Read-replicas can even be made publicly accessible (without affecting the master), allowing you to provision access to them without requiring a VPN connection or something similarly complicated.

Of course, if it was that easy, I wouldn’t have written a blog post about it.

Actually creating a read-replica is easy. We use CloudFormation to initialise our AWS resources, so it’s a fairly simple matter to extend our existing template with another resource describing the replica. You can easily specify different security groups for the replica, so we can lock it down to be publicly resolvable but only accessible from approved IP addresses without too much trouble (you’ll have to provision a security group with the appropriate rules to allow traffic from your authorised IP addresses, either as part of the template, or as a parameter injected into the template).
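To give a rough idea of how little is involved, the replica resource itself is not much more than the fragment below. The resource names (and the instance class) are placeholders rather than a lift from our actual template:

"DatabaseReplica": {
    "Type": "AWS::RDS::DBInstance",
    "Properties": {
        "SourceDBInstanceIdentifier": { "Ref": "Database" },
        "DBInstanceClass": "db.t2.medium",
        "PubliclyAccessible": "true",
        "VPCSecurityGroups": [ { "Ref": "ReplicaSecurityGroup" } ]
    }
}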

There are some tricks and traps though.

If you want to mark a replica as publicly accessible (i.e. it gets a public IP address) you need to make sure you have DNS Resolution and DNS Hostnames enabled on the host VPC. Not a big deal to be honest, but I think DNS Hostnames default to Off, so it’s something to watch out for. CloudFormation gives a nice error message in this case, so it’s easy to tell what to do.
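For reference, if you need to flip those switches on an existing VPC from Powershell (rather than at VPC creation time), something like the following should do it. Note the two separate calls; as far as I can tell the underlying API only accepts one attribute per request:

# Hedged sketch: enable DNS support and DNS hostnames on an existing VPC, one attribute per call.
Edit-EC2VpcAttribute -VpcId $vpcId -EnableDnsSupport $true -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
Edit-EC2VpcAttribute -VpcId $vpcId -EnableDnsHostnames $true -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion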

What’s not so easy is that if you have the standard public/private split of subnets (where a public subnet routes all traffic through the internet gateway and a private subnet either specifies nothing or routes through a NAT), you must make sure to put your replica in the public subnets. I think this applies to any instance that is going to be given a public IP address. If you don’t do this, no traffic will be able to escape from the replica, because the route table will try to push it through the NAT on the way out. This complicates things with the master RDS instance as well, because both replica and master must share the same subnet group, so the master must be placed in the public subnets too.

With all the CloudFormation/AWS/RDS chicanery out of the way, you still need to manage access to the replica using the standard PostgreSQL mechanisms though.

The Devil Is In The Details

The good thing about PostgreSQL read replicas is that they don’t allow any changes at all, even if using the root account. They are fundamentally readonly, which is fantastic.

There was no way that I was going to publicise the master password for the production RDS instance though, so I wanted to create a special user just for the rest of the business to access the replica at will, with as few permissions as possible.

Because of the aforementioned readonly-ness of the replica, you have to create the user inside the master instance, which will then propagate it across to the replica in time. When it comes to actually managing permissions for users in the PostgreSQL database though, it’s a little bit different to the RDBMS that I’m most familiar with, SQL Server. I don’t think it’s better or worse, it’s just different.

A PostgreSQL server hosts many databases, and each database hosts many schemas. Users, however, appear to exist at the server level, so in order to manage access, you need to grant the user access to the databases, schemas, and then the tables (and sequences) inside those schemas that you want them to be able to use.

At the time when our RDS instance is initialised, there are no databases, so we had to do this after the fact. We could provision the user and give it login/list database rights, but it couldn’t select anything from tables until we gave it access to those tables using the master user.
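For completeness, provisioning the user itself (run against the master instance, as the master user) looks something like the following; the names are placeholders:

CREATE ROLE {username} WITH LOGIN PASSWORD '{password}';

GRANT CONNECT ON DATABASE {database} TO {username};

The schema and table level grants, again executed as the master user, are then: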

GRANT USAGE ON SCHEMA {schema} TO {username};

GRANT SELECT ON ALL TABLES IN SCHEMA {schema} TO {username};

GRANT SELECT ON ALL SEQUENCES IN SCHEMA {schema} TO {username};

Granting access once is not enough though, because any additional tables created after the statement is executed will not be accessible. To fix that you have to alter the default privileges of the schema, granting the appropriate permissions for the user you are interested in.

ALTER DEFAULT PRIVILEGES IN SCHEMA {schema}
    GRANT SELECT ON TABLES TO {username};

With all of that out of the way, we had our replica.

Conclusion

Thanks to AWS, creating and managing a read-replica is a relatively painless procedure. There are some tricks and traps along the way, but they are very much surmountable. It’s nice to be able to separate our concerns cleanly, and to have support for doing that at the infrastructure level.

I shudder to think how complicated something like this would have been to setup manually.

I really do hope AWS never goes full evil and decides to just triple or quadruple their prices though, because it would take months to years to replicate some of the things we’re doing in AWS now.

We’d probably just be screwed.


A while back I made a post about how cfn-init was failing with an error that didn’t seem to make any sense (utf8 codec can’t decode, invalid continuation byte). At the time, I came to the conclusion that the reason why it was failing was the length of the output.

That was actually incorrect.

The reason I came to that conclusion was because the action that seemed to fix the issue was whether or not the -Verbose flag was present on the Powershell script being called. If I had the flag on, I would get the error. Off, and everything was fine.

At first I trusted the error. I assumed that somewhere in my output there was actually an invalid set of bytes, at least as far as utf8 is concerned. It seemed entirely plausible, considering it was Ruby, parsing the output from a Powershell script, which was running an executable directly. So many layers, so many different ways in which it could break, the least of which would be output stream encoding incompatibilities.

My initial investigation into the output stream didn’t seem to show any invalid bytes, so I assumed that cfn-init was doing something stupid, and truncating the output stream because it was too long. If it truncated the stream, it was feasible that it was shearing a UTF8 byte pair in half, hence the invalid continuation byte error. It made sense to me, because:

  1. I had only recently added a large amount of extra verbose logging to the scripts, and that seemed to be the primary difference between the version that worked and the version that failed.
  2. I’ve had issues with AWS components giving arcane errors when you exceed some arbitrary length before. The Powershell cmdlet New-CFNStack will fail if you try to upload a template directly and the template is too long. It also gives incredibly unhelpful errors about malformed XML, when in reality it’s a length issue.

I accepted that I could not use verbose logging for the deployment and moved on, but it always bugged me.

I don’t like losing Verbose logging. It’s extremely useful for when things go bad, which they do. All the time.

Round Two (or Three?)

I got a chance to go back to the deployment scripts recently, because some deployments were failing and I wasn’t sure why. I needed the verbose logging back, so I re-added the -Verbose flag and tried to get to the bottom of why it was failing.

My first attempt simply commented out the lines that I knew could be particularly spammy (search functions testing a predicate against a collection of objects).

The error still occurred.

I ran the script by itself (in a clean environment) and the output wasn’t even that long. I realised that I had made an incorrect conclusion from my initial investigation. It definitely wasn’t a length issue.

It was time to do a manual binary search.

I knew which script was failing, so I suppressed verbose output in half the script and tested it again to see if it failed. It’s easy enough to temporarily suppress verbose output in Powershell, if a bit messy.

$pref = $VerbosePreference
$VerbosePreference = "SilentlyContinue"

... operations here ...

$VerbosePreference = $pref

What followed was a few hours of manually searching for the piece of script that broke cfn-init. I couldn’t just run it locally, because everything worked just fine there; I had to instantiate an environment and get it to initialise itself as it usually would to see whether the script would fail or not. It had to be a fresh environment too, because if I ran cfn-init again on an environment that had already failed, it would work just fine.

An environment takes at least 10 minutes to reach the point of failure.

It was the slowest binary search ever.

Something About Rooting Issues?

Eventually I got to the root of the problem. I was piping the output from nuget.exe (used for package installation) to the Verbose stream. Somewhere in the output from nuget, there was actually (maybe?) an invalid UTF8 character, according to the code used by cfn-init anyway. The reason it didn’t fail if you ran it a second time, was because that component was already successfully installed, so it didn’t try to install it again.
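To illustrate the difference (the variable names below are made up, but the shape is accurate), the offending pattern and the eventual workaround looked something like this:

# The problematic pattern: forwarding everything nuget.exe writes straight into the Verbose
# stream, which is ultimately what cfn-init chokes on (names here are illustrative).
& $nugetExecutable install $packageId -Version $packageVersion -OutputDirectory $packagesDirectory |
    ForEach-Object { Write-Verbose $_ }

# The workaround: keep verbose logging everywhere else, but swallow the output of this one call.
& $nugetExecutable install $packageId -Version $packageVersion -OutputDirectory $packagesDirectory | Out-Null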

I could leave every Write-Verbose statement in place except for that one, and it would all work fine. This meant that I could finally get the verbose output from the Octopus deployment back, which was the main thing I wanted. Sure I could see it through Octopus, but I like all of my information to be in one place, because it just makes everything easier.

To complicate matters further, it wasn’t all nuget output that triggered the issue. For example, the 7Zip command line tools are one of the first components to be installed, in order to make sure the AWS cmdlets are available (they are distributed to each machine as a 7zip file). That particular component would install fine. It seemed to only be the Octopus Client package (the one that contains the .NET libraries) that caused the issue. I’m still not 100% sure of that to be honest, I was mostly just happy to get my verbose deployment errors back in the cfn-init logs, and I had to move on to something else.

To be honest, the fact that I don’t know for sure why the nuget installation output breaks the script is another application of the same mistake I made the first time, and I’m almost certainly going to have to revisit it at some point in the future, where I will no doubt discover something else entirely is actually the root cause. This solution is still better than the original one though, which is enough for now.

Conclusion

Sometimes in software you think you know the answer, but it turns out you’ve made some poor assumptions or conclusions, and just completely missed the mark. In my case, the ramifications were not particularly bad, I was just missing out on informational output, but that’s not always the case. Sometimes the results of a poor assumption/conclusions can be much worse.

The important thing to note is that I stopped investigating once I found something that made the bug go away, instead of investigating in more depth as to why the action made the bug go away. Sometimes this is a compromise (usually as a result of time constraints/pressure), but I’ve found it’s almost always worthwhile to spend the time to understand the why of an issue before accepting a solution. Obviously at some stage you do need to just accept that you don’t understand the problem fully, so like everything in software development, it’s a balancing act.

At the very least if you investigate for a little bit longer, you’ll understand the problem better, which can only be a good thing.


Now that we’ve (somewhat) successfully codified our environment setup and are executing it automatically every day with TeamCity, we have a new challenge. Our setup scripts create an environment that has some set of features/bugs associated with it. Not officially (we’re not really tracking environment versions like that), but definitely in spirit. As a result, we need to update environments to the latest version of the “code” whenever we fix a bug or add a feature. Just like deploying a piece of software.

To be honest, I haven’t fully nailed the whole codified environment thing just yet, but I am getting closer. Giving it some thought, I think I will probably move towards a model where the environment is built and tested (just like a piece of software) and then packaged and versioned, ready to be executed. Each environment package should consist of installation and uninstallation logic, along with any other supporting actions, in order to make them as self contained as possible.

That might be the future. For now, we simply have a repository with scripts in it for each of our environments, supported by a set of common scripts.

The way I see it, environments fall into two categories.

  1. Environments created for a particular task, like load testing or some sort of experimental development.
  2. Environments that take part in your deployment pipeline.

The fact that we have entirely codified our environment setup gives us the power to create an environment for either of the above. The first point is not particularly interesting, but the second one is.

We have 3 standard environments, which are probably familiar to just about anyone (though maybe under different names). They are CI, Staging and Production.

CI is the environment that is recreated every morning through TeamCity. It is used for continuous integration/deployment, and is typically not used directly for manual testing/demonstration/anything else. It forms an important part of the pipeline, as after deployment, automated functional tests are run on it, and if successful that component is (usually) automatically propagated to Staging.

Staging is, for all intents and purposes, a Production level environment. It is stable (only components that fully pass all of their tests are deployed here) and is used primarily for manual testing and feature validation, with a secondary focus on early integration within a trusted group of people (which may include external parties and exceptional customers).

Production is, of course, production. It’s the environment that the greater world uses for any and all executions of the software component (or components) in question. It is strictly controlled and protected, to make sure that we don’t accidentally break it, inconveniencing our customers and making them unhappy.

The problem is, how do you get changes to the underlying environment (i.e. a new version of it) into Staging/Production, without losing any state held within the system? You can’t just recreate the environment (like we do each morning for CI), because the environment contains the state, and that destroys it.

You need another process.

Migration.

Birds Fly South for the Winter

Migration, for being such a short word, is actually incredibly difficult.

Most approaches that I’ve seen in the past involved some sort of manual migration strategy (usually written down and approved by 2+ people), which is executed by some long-suffering operations person at midnight, when hopefully no-one is trying to use the environment for its intended purpose.

A key component to any migration strategy: What happens if it goes wrong? Otherwise known as a rollback procedure.

This is, incidentally, where everything gets pretty hard.

With our environments being entirely codified in a mixture of Powershell and CloudFormation, I wanted to create something that would automatically update an environment to the latest version, without losing any of the data currently stored in the environment, and in a safe way.

CloudFormation offers the ability to update a stack after it has been created. This way you can change the template to include a new resource (or to change existing resources) and then execute the update and have AWS handle all of the updating. This probably works fine for most people, but I was uneasy at the prospect. Our environments are already completely self contained and I didn’t understand how CloudFormation updates would handle rollbacks, or how updates would work for all components involved. I will go back and investigate it in more depth at some point in the future, but for now I wanted a more generic solution that targeted the environment itself.

My idea was fairly simple.

What if I could clone an environment? I could make a clone of the environment I wanted to migrate, test the clone to make sure all the data came through okay and its behaviour was still the same, delete the old environment and then clone the temporary environment again, into the original environment’s name. At any point up to the deletion of the old environment I could just stop, and everything would be the same as it was before. No need for messy rollbacks that might only do a partial job.

Of course, the idea is not actually all that simple in practice.

A Perfect Clone

In order to clone an environment, you need to identify the parts of the environment that contain persistent data (and would not automatically be created by the environment setup).  Databases and file storage (S3, disk, etc) are examples of persistent data. Log files are another example of persistent data, except they don’t really matter from a migration point of view, mostly because all of our log entries are aggregated into an ELK stack. Even if they weren’t aggregated, they probably still wouldn’t be worth spending time on.

In the case of the specific environment I’m working on for the migration this time, there is an RDS instance (the database) and at least one S3 bucket containing user data. Everything else about the environment is transient, and I won’t need to worry about it.

Luckily for me, cloning an RDS instance and an S3 bucket is relatively easy.

With RDS you can simply take a snapshot and then use that snapshot as an input into the RDS instance creation on the new environment. Fairly straightforward.

function _WaitRdsSnapshotAvailable
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [string]$snapshotId,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsRegion,
        [int]$timeoutSeconds=3000
    )

    write-verbose "Waiting for the RDS Snapshot with Id [$snapshotId] to be [available]."
    $incrementSeconds = 15
    $totalWaitTime = 0
    while ($true)
    {
        $a = Get-RDSDBSnapshot -DBSnapshotIdentifier $snapshotId -Region $awsRegion -AccessKey $awsKey -SecretKey $awsSecret
        $status = $a.Status

        if ($status -eq "available")
        {
            write-verbose "The RDS Snapshot with Id [$snapshotId] entered [$status] after [$totalWaitTime] seconds."
            return $a
        }

        write-verbose "Current status of RDS Snapshot with Id [$snapshotId] is [$status]. Waiting [$incrementSeconds] seconds and checking again for change."

        Sleep -Seconds $incrementSeconds
        $totalWaitTime = $totalWaitTime + $incrementSeconds
        if ($totalWaitTime -gt $timeoutSeconds)
        {
            throw "The RDS Snapshot with Id [$snapshotId] was not [available] within [$timeoutSeconds] seconds."
        }
    }
}

... snip some scripts getting CFN stacks ...

$resources = Get-CFNStackResources -StackName $sourceStack.StackId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

$rds = $resources |
    Single -Predicate { $_.ResourceType -eq "AWS::RDS::DBInstance" }

$timestamp = [DateTime]::UtcNow.ToString("yyyyddMMHHmmss")
$snapshotId = "$sourceEnvironment-for-clone-to-$destinationEnvironment-$timestamp"
$snapshot = New-RDSDBSnapshot -DBInstanceIdentifier $rds.PhysicalResourceId -DBSnapshotIdentifier $snapshotId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

_WaitRdsSnapshotAvailable -SnapshotId $snapshotId -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion

With S3, you can just copy the bucket contents. I say just, but in reality there is no support for a “sync” command in the AWS Powershell cmdlets. There is a sync command in the AWS CLI though, so I wrote a wrapper around the CLI and execute the sync command there. It works pretty nicely. Essentially it’s broken into two parts: the part that deals with actually locating and extracting the AWS CLI to a known location, and the part that actually does the clone. The only difficult bit was that you don’t seem to be able to just supply credentials to the AWS CLI executable, at least in a way that I would expect (i.e. as parameters). Instead you have to use a profile, or use environment variables.

function Get-AwsCliExecutablePath
{
    if ($rootDirectory -eq $null) { throw "rootDirectory script scoped variable not set. That's bad, it's used to find dependencies." }
    $rootDirectoryPath = $rootDirectory.FullName

    $commonScriptsDirectoryPath = "$rootDirectoryPath\scripts\common"

    . "$commonScriptsDirectoryPath\Functions-Compression.ps1"

    $toolsDirectoryPath = "$rootDirectoryPath\tools"
    $nugetPackagesDirectoryPath = "$toolsDirectoryPath\packages"

    $packageId = "AWSCLI64"
    $packageVersion = "1.7.41"

    $expectedDirectory = "$nugetPackagesDirectoryPath\$packageId.$packageVersion"
    if (-not (Test-Path $expectedDirectory))
    {
        $extractedDir = 7Zip-Unzip "$toolsDirectoryPath\dist\$packageId.$packageVersion.7z" "$toolsDirectoryPath\packages"
    }

    $executable = "$expectedDirectory\aws.exe"

    return $executable
}

function Clone-S3Bucket
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [string]$sourceBucketName,
        [Parameter(Mandatory=$true)]
        [string]$destinationBucketName,
        [Parameter(Mandatory=$true)]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [string]$awsRegion
    )

    if ($rootDirectory -eq $null) { throw "RootDirectory script scoped variable not set. That's bad, its used to find dependencies." }
    $rootDirectoryPath = $rootDirectory.FullName

    . "$rootDirectoryPath\scripts\common\Functions-Aws.ps1"

    $awsCliExecutablePath = Get-AwsCliExecutablePath

    $previousAWSKey = $env:AWS_ACCESS_KEY_ID
    $previousAWSSecret = $env:AWS_SECRET_ACCESS_KEY
    $previousAWSRegion = $env:AWS_DEFAULT_REGION

    $env:AWS_ACCESS_KEY_ID = $awsKey
    $env:AWS_SECRET_ACCESS_KEY = $awsSecret
    $env:AWS_DEFAULT_REGION = $awsRegion

    & $awsCliExecutablePath s3 sync s3://$sourceBucketName s3://$destinationBucketName

    $env:AWS_ACCESS_KEY_ID = $previousAWSKey
    $env:AWS_SECRET_ACCESS_KEY = $previousAWSSecret
    $env:AWS_DEFAULT_REGION = $previousAWSRegion
}

I do have some concerns that as the bucket gets bigger, the clone will take longer and longer. I’ll cross that bridge when I come to it.

Using the identified areas of persistence above, the only change I need to make is to alter the new environment script to take them as optional inputs (specifically the RDS snapshot). If they are supplied, it will use them; if they are not, it will default to normal creation.
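To make that a little more concrete (parameter and template names below are illustrative, not our exact code), the change amounts to an optional parameter that flows through to the CloudFormation template:

# An empty snapshot identifier means the template creates a brand new, empty database as it
# always has; a supplied one is handed to the RDS resource (via a condition in the template)
# as its DBSnapshotIdentifier, restoring the clone's data into the new environment.
param
(
    [Parameter(Mandatory=$true)]
    [ValidateNotNullOrEmpty()]
    [string]$environment,
    [string]$rdsSnapshotIdentifier = ""
)

$parameters = @(
    @{ ParameterKey = "OctopusEnvironment"; ParameterValue = $environment },
    @{ ParameterKey = "RdsSnapshotIdentifier"; ParameterValue = $rdsSnapshotIdentifier }
)

# ... the rest of the script passes $parameters to New-CFNStack exactly as it did before ...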

Job done, right?

A Clean Snapshot

The clone approach works well enough, but in order to perform a migration on a system that is actively being used, you need to make sure that the content does not change while you are doing it. If you don’t do this, you can potentially lose data during a migration. The most common example would be if you clone the environment, but after the clone some requests occur and the data changes. If you then delete the original and migrate back, you’ve lost that data. There are other variations as well.

This means that you need the ability to put an environment into standby mode, where it is still running, and everything is working, but it is no longer accepting user requests.

Most of our environments are fairly simple and are based around web services. They have a number of instances behind a load balancer, managed by an auto scaling group. Behind those instances are backend services, like databases and other forms of persistence/scheduled task management.

AWS Auto Scaling Groups allow you to set instances into Standby mode, which removes them from the load balancer (meaning they will no longer have requests forwarded to them) but does not delete or otherwise terminate them. More importantly, instances in Standby can count towards the desired number of instances in the Auto Scaling Group, meaning it won’t go and create X more instances to service user requests, which obviously would muck the whole plan up.

This is exactly what we need to set our environment into a Standby mode (at least until we have scheduled tasks that deal with underlying data anyway). I took the ability to shift instances into Standby mode and wrapped it into a function for setting the availability of the environment (because that’s the concept that I’m interested in, the Standby mode instances are just a mechanism to accomplish that).

function _ChangeEnvironmentAvailability
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$environment,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsRegion,
        [Parameter(Mandatory=$true)]
        [ValidateSet("Available", "Unavailable")]
        [string]$availability,
        [switch]$wait
    )

    if ($rootDirectory -eq $null) { throw "RootDirectory script scoped variable not set. That's bad, its used to find dependencies." }
    $rootDirectoryPath = $rootDirectory.FullName

    . "$rootDirectoryPath\scripts\common\Functions-Enumerables.ps1"

    . "$rootDirectoryPath\scripts\common\Functions-Aws.ps1"
    Ensure-AwsPowershellFunctionsAvailable

    $sourceStackName = Get-StackName -Environment $environment -UniqueComponentIdentifier (_GetUniqueComponentIdentifier)
    $sourceStack = Get-CFNStack -StackName $sourceStackName -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

    $resources = Get-CFNStackResources -StackName $sourceStack.StackId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

    Write-Verbose "_ChangeEnvironmentAvailability([$availability]): Locating auto scaling group in [$environment]"
    $asg = $resources |
        Single -Predicate { $_.ResourceType -eq "AWS::AutoScaling::AutoScalingGroup" }

    $asg = Get-ASAutoScalingGroup -AutoScalingGroupName $asg.PhysicalResourceId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

    $instanceIds = @()
    $standbyActivities = @()
    if ($availability -eq "Unavailable")
    {
        Write-Verbose "_ChangeEnvironmentAvailability([$availability]): Switching all instances in [$($asg.AutoScalingGroupName)] to Standby"
        $instanceIds = $asg.Instances | Where { $_.LifecycleState -eq [Amazon.AutoScaling.LifecycleState]::InService } | Select -ExpandProperty InstanceId
        if ($instanceIds | Any)
        {
            $standbyActivities = Enter-ASStandby -AutoScalingGroupName $asg.AutoScalingGroupName -InstanceId $instanceIds -ShouldDecrementDesiredCapacity $true -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
        }
    }
    elseif ($availability -eq "Available")
    {
        Write-Verbose "_ChangeEnvironmentAvailability([$availability]): Switching all instances in [$($asg.AutoScalingGroupName)] to Inservice"
        $instanceIds = $asg.Instances | Where { $_.LifecycleState -eq [Amazon.AutoScaling.LifecycleState]::Standby } | Select -ExpandProperty InstanceId
        if ($instanceIds | Any)
        {
            $standbyActivities = Exit-ASStandby -AutoScalingGroupName $asg.AutoScalingGroupName -InstanceId $instanceIds -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
        }
    }

    $anyStandbyActivities = $standbyActivities | Any
    if ($wait -and $anyStandbyActivities)
    {
        Write-Verbose "_ChangeEnvironmentAvailability([$availability]): Waiting for all scaling activities to complete"
        _WaitAutoScalingGroupActivitiesComplete -Activities $standbyActivities -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion
    }
}

With the only mechanism to affect the state of persisted data disabled, we can have some more confidence that the clone is a robust and clean copy.

To the Future!

I don’t think that the double clone solution is the best one; it’s just the best that I could come up with without having to make a lot of changes to the way we manage our environments.

Another approach would be to maintain 2 environments during migration (A and B), but only have one of those environments be active during normal operations. So to do a migration, you would spin up Prod A if Prod B already existed. At the entry point, you have a single (or multiple) DNS record that points to either A or B based on your needs. This one still involves cloning and downtime though, so for a high availability service, it won’t really work (our services can have some amount of downtime, as long as it is organized and communicated ahead of time).

Speaking of downtime, there is another approach that you can follow in order to do zero downtime migrations. I haven’t actually done it, but if you had a mechanism to replicate incoming requests to both environments, you could conceivably bring up the new environment, let it deal with the same requests as the old environment for long enough to synchronize and to validate that it works (without responding to the user, just processing the requests) and then perform the top level switch so that the new environment becomes the active one. At some point in the future you can destroy the old environment, when you are confident that it works as expected.

This is a lot more complicated, and involves some external component managing the requests (and storing a record of all requests ever made, at least back to the last backup point) as well as each environment knowing which request it last processed. It’s certainly not impossible, but if you can tolerate downtime, it’s probably not worth the effort.

Summary

Managing your environments is not a simple task, especially when you have actual users (and if you don’t have users, why are you bothering at all?). It’s very difficult to make sure that your production (or other live environment) does not stagnate in the face of changes, and is always kept up to date.

What I’ve outlined in this blog post is a single approach that I’ve been working on over the last week or so, to deal with our specific environments. It’s not something that will work for everyone, but I thought it was worthwhile to write it down anyway, to show some of the thought processes that I needed to go through in order to accomplish the migration in a safe, robust fashion.

There is at least one nice side effect from my approach, in that we will now be able to clone any environment I want (without doing a migration, just the clone) and then use it for experiments or analysis.

I’m sure that I will run into issues eventually, but I’m pretty happy with the way the migration process is happening. It was weighing on my mind for a while, so it’s good to get it done.