
We’ve come a long way in our log aggregation journey. Don’t get me wrong, we still have a long way to go, but bit by bit, we’re getting better at it.

A good example of getting better is the way in which we process our Elastic Load Balancer (ELB) logs. Over a year ago I put together a system for processing these logs into our ELK stack with way too many moving parts. It used Logstash (for processing) and Powershell (for downloading files from S3), hosted on an EC2 instance, to aggregate ELB logs from S3 into our ELK stack. Somewhat complicated in practice, but it worked, even if I was never particularly happy with it.

As is the way with these things though, because it did work, we’ve had no reason to revisit it, and we’ve successfully applied the same approach to at least 3 other environments we’ve set up since.

It wasn’t without its share of problems though:

  • The EC2 instances hosting the solution had a tendency to cap themselves at 100% CPU for long periods. They were initially t2.mediums, but they kept expending all of their CPU credits, so we had to upgrade them to m3.mediums, which was a 50% increase in cost ($US 115/month). Never did figure out exactly what needed all that CPU, but the leading theory was Logstash.
  • For a while, the logs simply stopped processing after a period of time (days/weeks). This turned out to be an issue with accumulating memory dumps from Java as a result of Logstash crashing and NSSM automatically restarting it.
  • These were the machines most vulnerable to the memory leak in Logstash’s TCP output that causes non-paged memory to accumulate on our Windows AWS instances, apparently as a result of some underlying driver problem.

Good old Logstash.

To turn the discussion back to getting better, we had the opportunity to revisit this process when building some new environments, using all of the knowledge and experience that we’d gained in the intervening period. I think we came up with a much more efficient and understandable solution, but it wasn’t without its share of difficulties, which makes for a good post.

Anomalous Materials

One of the primary weaknesses in the previous approach for processing ELB logs was that it required an entire EC2 instance all to itself, for each environment that we spun up. We did this in order to keep each log processor isolated from the others and to allow us to spin up an entirely self-contained environment without having to worry about some common machine that processed all of the logs in a bunch of different buckets.

Another weakness in the process that bothered me was that it had way too many moving parts. Sometimes you have to have a lot of moving parts working together in order to accomplish a goal, but you should always strive for simplicity, both from an operational point of view and from a maintenance point of view. Less is almost always better in software development.

AWS has come a long way since we jammed the initial solution together, so we decided to use this opportunity to simplify the process and experiment with some AWS tools that we don’t frequently use.

After some discussion, we formed an idea of what we would like the new log processor to look like. We wanted to use Lambda to process the ELB logs as they were created, pushing them to the same Logstash ingress endpoint that we’ve been using consistently for the last year or so. The benefits we were expecting were a reduction in complexity (no need to have 4 different things working together), a reduction in running cost (mostly due to the removal of the EC2 instance) and a reduction in latency (the Lambda function would trigger whenever a file was written to the S3 bucket by the ELB, which meant no more polling for changes).

For those of you unfamiliar with Lambda, it’s a service offered by AWS that lets you configure code to run whenever one of a variety of events occurs. I’ve used it before to create a quicker S3 bucket clone, so if you want some more information, feel free to read up on that adventure.

In order to accomplish our goal, we would need to deal with 3 things:

  • The Javascript code for the Lambda function itself, which would parse the ELB log files and push the events through to our Logstash ingress endpoint.
  • The configuration that triggers the Lambda function whenever the ELB writes a new log file to its S3 bucket.
  • The incorporation of the Lambda function setup into our environment setup.

Nothing particularly insane there, but definitely a few things that we’d never done before.
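To give a sense of the shape we were aiming for, here is a rough sketch of what a handler like that might look like. To be clear, this is not the code we ended up with (that’s the subject of the follow-up post), and the Logstash endpoint details are just placeholders.

var aws = require('aws-sdk');
var net = require('net');
var s3 = new aws.S3({ apiVersion: '2006-03-01' });

// Placeholders only; the real ingress endpoint would come from configuration.
var logstashHost = 'logstash.internal.example';
var logstashPort = 6379;

exports.handler = function(event, context) {
    // The S3 event tells us which ELB log file was just written.
    var bucket = event.Records[0].s3.bucket.name;
    var key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, ' '));

    s3.getObject({ Bucket: bucket, Key: key }, function(err, data) {
        if (err) { context.fail(err); return; }

        var socket = net.connect(logstashPort, logstashHost, function() {
            // ELB logs are one entry per line; forward them as-is and let Logstash do the parsing.
            data.Body.toString().split('\n').forEach(function(line) {
                if (line.length > 0) { socket.write(line + '\n'); }
            });
            socket.end();
        });

        socket.on('close', function() { context.succeed(); });
        socket.on('error', function(e) { context.fail(e); });
    });
};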

To Be Continued

In order to avoid creating a single monstrous post with more words than a small novel, I’m going to break it here.

Next week I’ll continue, explaining the Javascript code that we put together to process the log files (it’s not particularly complicated) and how we configured the Lambda function by incorporating its setup into our environment setup.

Until then, may all your Lambda functions execute quickly and your S3 buckets not turn into ghosts.


A little over 4 months ago, I wrote a post about trying to improve the speed of cloning a large S3 bucket. At the time, I tried to simply parallelise the execution of the AWS CLI sync command, which actually proved to be much slower than simply leaving the CLI alone to do its job. It was an unsurprising result in retrospect, but you never know unless you try.

Unwilling to let the idea die, I decided to make it my focus during our recent hack days.

If you are unfamiliar with the concept of a hack day (or Hackathon as they are sometimes known), have a look at this Wikipedia article. At my current company, we’re only just starting to include hack days on a regular basis, but it’s a good sign of a healthy development environment.

Continuing on with the original train of thought (parallelise via prefixes), I needed to find a way to farm out the work to something, whether that was a pool of our own workers or some other mechanism. In the end, I chose to use AWS Lambda.

Enter Node.js on Lambda.

At A High Level

AWS Lambda is a relatively new offering, allowing you to configure some code to automatically execute following a trigger from one of a number of different events, including an SNS Topic Notification, changes to an S3 bucket or an HTTP call. You can write your code natively in Python, Java or Javascript (through Node.js), but you can technically use anything you can compile into a Linux compatible executable and make accessible to the function via S3 or something similar.

Since Javascript seems to be everywhere now (even though it’s hard to call it a real language), it was a solid choice. No point being afraid of new things.

Realistically, I should have been at least a little afraid of new things.

Conceptually the idea can be explained as a simple divide and conquer strategy, managed by files in an S3 bucket (because S3 was the triggering mechanism I was most familiar with).

If something wants to trigger a clone, it writes a file into a known S3 bucket detailing the desired operation (source, destination, some sort of id) with a key of {id}-{source}-{destination}/clone-request.

In response, the Lambda function will trigger, segment the work and write a file for each segment with a key of {id}-{source}-{destination}/{prefix}-segment-request. When it has finished breaking down the work, it will write another file with the key {id}-{source}-{destination}/clone-response, containing a manifest of the breakdown, indicating that it is done with the division of work.

As each segment file is written, another Lambda function will be triggered, doing the actual copy work and finally writing a file with the key {id}-{source}-{destination}/{prefix}-segment-response to indicate that it’s done.
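As a concrete illustration of the kick-off step, requesting a clone is nothing more than a putObject with the right key. This is just a sketch though; the command bucket name and id below are made up, and the code that actually drives our clones is the Powershell mentioned towards the end of this post.

var aws = require('aws-sdk');
var s3 = new aws.S3({ apiVersion: '2006-03-01' });

var id = '20160314-120000';
var source = 'source-bucket';
var destination = 'destination-bucket';

// The request file content and key follow the conventions described above.
var request = { id: id, source: { name: source }, destination: { name: destination } };
var key = id + '-' + source + '-' + destination + '/clone-request';

s3.putObject({
    Bucket: 'clone-command-bucket',
    Key: key,
    Body: JSON.stringify(request),
    ContentType: 'application/json'
}, function(err) {
    if (err) { console.log('Failed to write clone request ', err); }
    else { console.log('Clone requested: ', key); }
});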

File Formats Are Interesting

Each clone-request file looks like this:

{
    id: {id},
    source: {
        name: {source-bucket-name}
    },
    destination: {
        name: {destination-bucket-name}
    }
}

It’s a relatively simple file that would be easy to extend as necessary (for example, if you needed to specify the region, credentials to access the bucket, etc).

The clone-response file (the manifest), looks like this:

{
    id: {id},
    source: {
        name: {source-bucket-name}
    },
    destination: {
        name: {destination-bucket-name}
    },
    segments: {
        count: {number-of-segments},
        values: [
            {segment-key},
            {segment-key}
            ...
        ]
    }
}

Again, another relatively simple file. The only additional information is the segments that the task was broken into. These segments are used for tracking purposes, as the code that requests a clone needs some way to know when the clone is done.
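A minimal sketch of that tracking, assuming the requester knows the prefix it used for the request (every name below is made up), is to read the manifest and then count the segment-response files until they match:

var aws = require('aws-sdk');
var s3 = new aws.S3({ apiVersion: '2006-03-01' });

var commandBucket = 'clone-command-bucket';
var prefix = '20160314-120000-source-bucket-destination-bucket/';

function checkForCompletion() {
    s3.getObject({ Bucket: commandBucket, Key: prefix + 'clone-response' }, function(err, data) {
        // If the manifest isn't there yet, the segmentation function hasn't finished dividing the work.
        if (err) { return setTimeout(checkForCompletion, 10000); }

        var manifest = JSON.parse(data.Body);
        // Ignoring the 1000 result pagination issue (discussed below) for brevity.
        s3.listObjects({ Bucket: commandBucket, Prefix: prefix }, function(err, listing) {
            if (err) { return setTimeout(checkForCompletion, 10000); }

            var responses = listing.Contents.filter(function(item) {
                return item.Key.indexOf('-segment-response') !== -1;
            });

            if (responses.length >= manifest.segments.count) {
                console.log('Clone complete');
            } else {
                console.log('Waiting: ', responses.length, '/', manifest.segments.count);
                setTimeout(checkForCompletion, 10000);
            }
        });
    });
}

checkForCompletion();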

Each segment-request file looks like this:

{
    id: {id},
    source: {
        name: {source-bucket-name},
        prefix: {prefix}
    },
    destination: {
        name: {destination-bucket-name}
    }
}

And finally, each segment-response file looks like this:

{
    id: {id},
    source: {
        name: {source-bucket-name},
        prefix: {prefix}
    },
    destination: {
        name: {destination-bucket-name}
    },    
    files: [        
        {key},
        {key},
        ...
    ]
}

Nothing fancy or special, just straight JSON files with all the information needed.

Breaking It All Down

First up, the segmentation function.

Each Javascript Lambda function already comes with access to the aws-sdk, which is super useful, because honestly if you’re using Lambda, you’re probably doing it because you need to talk to other AWS offerings.

The segmentation function has to read in the triggering file from S3, parse it (it’s Javascript and JSON so that’s trivial at least), iterate through the available prefixes (using a delimiter, and sticking with the default “/”), write out a file for each unique prefix and finally write out a file containing the manifest.

As I very quickly learned, using Node.js to accomplish the apparently simple task outlined above was not simple at all, thanks to its fundamentally asynchronous nature and the fact that async calls don’t return any sort of traceable handle (unlike in C#, where an async task gives you a task object that can be used to track whether it succeeded or failed).

To complicate this even further, the aws-sdk will only return a maximum of 1000 results when listing the prefixes in a bucket (or doing anything with a bucket really), which means you have to loop using the callbacks. This makes accumulating some sort of result set annoyingly difficult, especially if you want to know when you are done.

Anyway, the segmentation function is as follows:

console.log('Loading function');

var aws = require('aws-sdk');
var s3 = new aws.S3({ apiVersion: '2006-03-01' });

function putCallback(err, data)
{
    if (err)
    {
        console.log('Failed to Upload Clone Segment ', err);
    }
}

function generateCloneSegments(s3Source, command, commandBucket, marker, context, segments)
{
    var params = { Bucket: command.source.name, Marker: marker, Delimiter: '/' };
    console.log("Listing Prefixes: ", JSON.stringify(params));
    s3Source.listObjects(params, function(err, data) {
        if (err)
        {
            context.fail(err);
        }
        else
        {
            for (var i = 0; i < data.CommonPrefixes.length; i++)
            {
                var item = data.CommonPrefixes[i];
                var segmentRequest = {
                    id: command.id,
                    source : {
                        name: command.source.name,
                        prefix: item.Prefix
                    },
                    destination : {
                        name: command.destination.name
                    }
                };
                
                var segmentKey = command.id + '/' + item.Prefix.replace('/', '') + '-segment-request';
                segments.push(segmentKey);
                console.log("Uploading: ", segmentKey);
                var segmentUploadParams = { Bucket: commandBucket, Key: segmentKey, Body: JSON.stringify(segmentRequest), ContentType: 'application/json'};
                s3.putObject(segmentUploadParams, putCallback);
            }
            
            if(data.IsTruncated)
            {
                generateCloneSegments(s3Source, command, commandBucket, data.NextMarker, context, segments);
            }
            else
            {
                // Write a clone-response file to the commandBucket, stating the segments generated
                console.log('Total Segments: ', segments.length);
                
                var cloneResponse = {
                    segments: {
                        count: segments.length,
                        values: segments
                    }
                };
                
                var responseKey = command.id + '/' + 'clone-response';
                var cloneResponseUploadParams = { Bucket: commandBucket, Key: responseKey, Body: JSON.stringify(cloneResponse), ContentType: 'application/json'};
                
                console.log("Uploading: ", responseKey);
                s3.putObject(cloneResponseUploadParams, putCallback);
            }
        }
    });
}

exports.handler = function(event, context) {
    //console.log('Received event:', JSON.stringify(event, null, 2));
    
    var commandBucket = event.Records[0].s3.bucket.name;
    var key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, ' '));
    var params = {
        Bucket: commandBucket,
        Key: key
    };
    
    s3.getObject(params, function(err, data) 
    {
        if (err) 
        {
            context.fail(err);
        }
        else 
        {
            var command = JSON.parse(data.Body);
            var s3Source = new aws.S3({ apiVersion: '2006-03-01', region: 'ap-southeast-2' });
            
            var segments = [];
            generateCloneSegments(s3Source, command, commandBucket, '', context, segments);
        }
    });
};

I’m sure some improvements could be made to the Javascript (I’d love to find a way to automate tests on it), but it’s not bad for being written directly into the AWS console.
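On the testing front, one option (not something we actually have in place, just a sketch) would be to move the function out of the console and into a file, then poke it locally with a fake S3 event and context, which at least opens the door to automated checks against a test bucket. The file name and bucket/key values here are hypothetical.

// Assumes the segmentation function above has been saved as segment.js and that
// local AWS credentials with access to a test command bucket are configured.
var handler = require('./segment').handler;

var fakeEvent = {
    Records: [{
        s3: {
            bucket: { name: 'test-command-bucket' },
            object: { key: 'some-id/clone-request' }
        }
    }]
};

var fakeContext = {
    succeed: function(result) { console.log('Succeeded: ', result); },
    fail: function(err) { console.log('Failed: ', err); }
};

handler(fakeEvent, fakeContext);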

Hi Ho, Hi Ho, It’s Off To Work We Go

The actual cloning function is remarkably similar to the segmenting function.

It still has to loop through items in the bucket, except it limits itself to items that match a certain prefix. It still has to do something for each item (execute a copy and add the key to its own result set) and it still has to write a file right at the end when everything is done.

console.log('Loading function');

var aws = require('aws-sdk');
var commandS3 = new aws.S3({ apiVersion: '2006-03-01' });

function copyCallback(err, data)
{
    if (err)
    {
        console.log('Failed to Copy ', err);
    }
}

function copyFiles(s3, command, commandBucket, marker, context, files)
{
    var params = { Bucket: command.source.name, Marker: marker, Prefix: command.source.prefix };
    s3.listObjects(params, function(err, data) {
        if (err)
        {
            context.fail(err);
        }
        else
        {
            for (var i = 0; i < data.Contents.length; i++)
            {
                var key = data.Contents[i].Key;
                files.push(key);
                console.log("Copying [", key, "] from [", command.source.name, "] to [", command.destination.name, "]");
                
                var copyParams = {
                    Bucket: command.destination.name,
                    CopySource: command.source.name + '/' + key,
                    Key: key
                };
                s3.copyObject(copyParams, copyCallback);
            }
            
            if(data.IsTruncated)
            {
                // NextMarker is only returned when a delimiter is specified, so fall back to the last key in this batch.
                var nextMarker = data.NextMarker || data.Contents[data.Contents.length - 1].Key;
                copyFiles(s3, command, commandBucket, nextMarker, context, files);
            }
            else
            {
                // Write a segment-response file
                console.log('Total Files: ', files.length);
                
                var segmentResponse = {
                    id: command.id,
                    source: command.source,
                    destination : {
                        name: command.destination.name,
                        files: {
                            count: files.length,
                            files: files
                        }
                    }
                };
                
                var responseKey = command.id + '/' + command.source.prefix.replace('/', '') + '-segment-response';
                var segmentResponseUploadParams = { Bucket: commandBucket, Key: responseKey, Body: JSON.stringify(segmentResponse), ContentType: 'application/json'};
                
                console.log("Uploading: ", responseKey);
                commandS3.putObject(segmentResponseUploadParams, function(err, data) { });
            }
        }
    });
}

exports.handler = function(event, context) {
    //console.log('Received event:', JSON.stringify(event, null, 2));
    
    var commandBucket = event.Records[0].s3.bucket.name;
    var key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, ' '));
    var params = {
        Bucket: commandBucket,
        Key: key
    };
    
    commandS3.getObject(params, function(err, data) 
    {
        if (err) 
        {
            context.fail(err);
        }
        else 
        {
            var command = JSON.parse(data.Body);
            var s3 = new aws.S3({ apiVersion: '2006-03-01', region: 'ap-southeast-2' });
            
            var files = [];
            copyFiles(s3, command, commandBucket, '', context, files);
        }
    });
};

Tricksy Trickses

You may notice that there is no mention of credentials in the code above. That’s because the Lambda functions run under a role with a policy that gives them the ability to list, read and put into any bucket in our account. Roles are handy for accomplishing things in AWS, avoiding the need to supply credentials. When a role is applied to the resource and no credentials are supplied, the aws-sdk will automatically generate a short term token using the role, reducing the likelihood of leaked credentials.

As I mentioned above, the asynchronous nature of Node.js made everything a little bit more difficult than expected. It was hard to determine when anything was done (somewhat important for writing manifest files). Annoyingly enough, it was even hard to determine when the function itself was finished. I kept running into issues where the function execution had finished, and it looked like it had done all of the work I expected it to do, but AWS Lambda was reporting that it did not complete successfully.

In the initial version of Node.js I was using (v0.10.42), the AWS supplied context object had a number of methods on it to indicate completion (whether success or failure). If I called context.succeed() after I set up my callbacks, the function would terminate without doing anything, because it didn’t automatically wait for the callbacks to complete. If I didn’t call it, the function would be marked as “did not complete successfully”. Extremely annoying.

As is often the case with AWS though, on literally the second hack day, AWS released support for Node.js v4.3, which automatically waits for all pending callbacks to complete before completing the function, completely changing the interaction model for the better. I did upgrade to the latest version during the second hack day (after I had accepted that my function was going to error out in the control panel but actually do all the work it needed to), but it wasn’t until later that I realised that the upgrade had fixed my problem.
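For reference, the v4.3 runtime adds a third callback parameter to the handler, and by default (via context.callbackWaitsForEmptyEventLoop) the invocation isn’t considered complete until the event loop is empty, so in-flight S3 calls get to finish. A trivial sketch of the new shape, with a placeholder bucket name:

var aws = require('aws-sdk');
var s3 = new aws.S3({ apiVersion: '2006-03-01' });

exports.handler = function(event, context, callback) {
    // With the 4.3 runtime the invocation waits for the event loop to drain before completing.
    s3.listObjects({ Bucket: 'example-bucket' }, function(err, data) {
        if (err) { callback(err); return; }
        callback(null, 'Listed ' + data.Contents.length + ' objects');
    });
};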

The last tripwire I ran into was related to AWS Lambda not being available in all regions yet. Specifically, it’s not in ap-southeast-2 (Sydney), which is where all of our infrastructure lives. S3 is weird in relation to regions, as buckets are globally unique and accessible, but they do actually have a home region. What does this have to do with Lambda? Well, the S3 bucket triggers I used as the impetus for the function execution only work if the S3 bucket is in the same region as the Lambda function (so us-west-1), even though once you get inside the Lambda function you can read/write to any bucket you like. Weird.

Conclusion

I’ve omitted the Powershell code responsible for executing the clone for brevity. It writes the request to the bucket, reads the response and then polls waiting for all of the segments to be completed, so it’s not particularly interesting, although the polling for segment completion was my first successful application of the Invoke-Parallel function from Script Center.

Profiling the AWS Lambda approach versus the original AWS CLI sync command approach over a test bucket (7500 objects, 195 distinct prefixes, 8000 MB of data) showed a decent improvement in performance. The sync approach took 142 seconds and the Lambda approach took 55 seconds, approximately a third of the time, which was good to see considering the last time I tried to parallelise the clone it actually decreased the performance. I think with some tweaking the Lambda approach could be improved further, with tighter polling tolerances and an increased number of parallel Lambda executions allowed.

Unfortunately, I have not had the chance to execute the AWS Lambda implementation on the huge bucket that is the entire reason it exists, but I suspect that it won’t work.

Lambda allows at maximum 5 minutes of execution time per function, and I suspect that the initial segmentation for a big enough bucket will probably take longer than that. It might be possible to chain Lambda functions together (i.e. trigger one from the next, perhaps per 1000 results returned from S3), but I’m not entirely sure how to do that yet (maybe using SNS notifications instead of S3?). Additionally, with a big enough bucket, the manifest file itself (detailing the segments) might become unwieldy. I think the problem bucket has something like 200K unique prefixes, so the size of the manifest file can add up quickly.

Regardless, the whole experience was definitely useful from a technical growth point of view. It’s always a good idea to remove yourself from your comfort zone and try some new things, and AWS Lambda + Node.js are definitely well outside my comfort zone.

A whole different continent in fact.


Something is eating all of the memory on some of our production API instances. I say something because it’s non-trivial to diagnose exactly what is eating the memory.

How is that even possible? Well, it’s eating the memory in such a way that the monitoring tools available (i.e. task manager, performance counters, etc) are completely unable to say which process is the culprit. The processes don’t have the memory, at least not in any of their working sets, and the only way to notice that it is missing is that the amount of total available memory is dropping over time. It seems that the memory is accumulating in the non-paged memory pool.

Ugh, non-paged pool memory leak. Not fun. Probably a device driver or something else equally low level.

As is usually the case with this sort of thing, I blame Logstash, hence the tag on this post, but I can’t really back that up.

Unfortunately, we have not yet identified the root cause. Instead, this post will talk about some things we did to run away screaming from the problem until we have time to investigate in depth. Sometimes you just have to make it work so that everyone can stop panicking long enough to form coherent thoughts.

Immediate Mitigation

First step, scheduled reboot for the affected boxes before they die. That maintains the level of service while we attempt to get to the bottom of the issue.

Easiest way to accomplish this? Calendar reminder for a few specific people in the organisation. Odds are at least one of those people will action the item, and everything will continue to work as expected from an external point of view.

The risks here are many and varied. What if everyone on the list expects that someone else on the list will do the thing? What if everyone is on holidays (Christmas is a particularly bad time for this), or sick? If the scheduled task lasts long enough, you also have to consider what will happen as people leave the organisation.

It’s a pretty bad sign if your immediate, manual mitigation step lasts long enough for the people involved to leave the organisation. Either you are bad at prioritising or you have some serious churn problems.

Engineers and Manual Tasks

The easiest way to get something automated is to assign a regular, manual task to an engineer, or group of engineers. There is nothing an engineer hates more than repeatedly doing the same thing on some schedule. The response? Automation.

In our case, we originally thought that the best way to automate this particular restart was using a tag based system, like we do for managing start and stop times for EC2 instances. The problem was, we didn’t want to restart all of the API instances inside the auto scaling group, just the oldest one (because it was the most likely to be closest to experiencing the problem). We didn’t want to get into a situation where we brought down the service because everything restarted at once.

Our next thought was to target the auto scaling group instead of the API instances themselves. On some regular interval, we could scale up to N + 1, then after everything was good, scale down to N again. This would automatically terminate the oldest instance (because our termination policy was set to oldest first). Seems simple enough.

Luckily, before we went too far down the “let’s write our own script” path, one of our operations team remembered that this functionality (scheduled scaling policies) was actually already a feature in AWS. Alas, it’s not exposed via the AWS management console (i.e. the website), but you can definitely create and manage the policies from the command line using the AWS CLI.

I’m not sure if you can manage them with the equivalent AWS client libraries (like the Powershell cmdlets), but it’s definitely available in the CLI.

We created two policies: scale up to N + 1 at midnight, and then scale down to N at 0100. This acts as a recycle for the instances we are having problems with, and leverages no custom code or scripting. It’s just AWS functionality.

To create a schedule, assuming you have already configured the CLI, you can use the following snippet:

aws autoscaling put-scheduled-update-group-action --scheduled-action-name ScaleUp --auto-scaling-group-name <ASG Name> --recurrence "0 0 * * 1" --desired-capacity 3

This will create a scheduled action to set the desired capacity to 3 for the specified Auto Scaling Group at midnight UTC on every Monday of the year (standard Cron format, the only thing to remember is that it will execute based on UTC time).
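The matching scale-down action looks much the same; something along these lines drops the group back to its normal capacity an hour later (the capacity of 2 and the 0100 recurrence are just illustrative values to match the example above):

aws autoscaling put-scheduled-update-group-action --scheduled-action-name ScaleDown --auto-scaling-group-name <ASG Name> --recurrence "0 1 * * 1" --desired-capacity 2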

Summary

I’m pretty disappointed that we still haven’t actually had a chance to really dig into what the root cause of the issue is. In all seriousness, I do actually blame Logstash, specifically its TCP output that we use to write to another Logstash endpoint as part of our log aggregation. We’ve had some issues with that plugin before, and it wouldn’t surprise me if there was some issue where it was not properly disposing of sockets or some other sort of low level object as part of its normal operation.

I worry that the automated solution that we put into place (to work around the issue by recycling) will probably remain in place for far longer than anyone wants it to. From a business point of view, what is the motivation to identify and solve the root cause when everything is working well enough, at least from an outside perspective?

Still, it’s better than having to manually recycle the instances ourselves.


As part of my efforts in evaluating Raven 3 (as a replacement for Raven 2.5), I had to clone our production environment. The intent was that if I’m going to test whether the upgrade will work, I should definitely do it on the same data that is actually in Production. Hopefully it will be better, but you should verify your assumptions regardless.

What I really wanted to do was clone the environment, somehow shunt a copy of all of the current traffic through to the clone (ideally with no impact to the real environment) and then contrast and compare the results. Of course, that’s not simple to accomplish (unless you plan for it from the start), so I had to compromise and just take a copy of the existing data, which acts as the baseline for our load tests. I really do want to get that replicated traffic concept going, but it’s going to take a while.

On the upside, cloning one of our environments is a completely hands-free affair. Everything is automated, from the shutting down of the existing environment (can’t snapshot a volume without shutting down the machine that’s using it) through to the creation of the new environment, all the way to the clone of the S3 bucket that we use to store binary data.

4 hours later, I had my clone.

That’s a hell of a long time. For those 4 hours, the actual production service was down (because it needs to be non-changing for the clone to be accurate). I mean, it was a scheduled downtime, so it happened at like midnight, and our service is really only used during business hours, but it’s still pretty bad.

Where did all the time go?

The S3 clone.

Cloning Myself is a Terrible Idea

Well, it wasn’t all S3 to be honest. At least 30 minutes of the clone was taken up by snapshotting the existing data volume and bringing up the new environment. AWS is great, but it still takes time for everything to initialize.

The remaining 3.5 hours was all S3 though.

Our binary data bucket is approximately 100GB with a little over a million files (mostly images). I know this now thanks to the new CloudWatch metrics that AWS provides for S3 buckets (which I’m pretty sure didn’t exist a few months ago).

I’m not doing anything fancy for the bucket clone, just using the AWS CLI and the s3 sync command, doing a bucket to bucket copy. I’m definitely not downloading and then reuploading the files or anything crazy like that, so maybe it just takes that long to copy that much data through S3?

There’s got to be a better way!

They Would Fight

When you have what looks like a task that is slow because it’s just one thing doing it, the typical approach is to try and make multiple things do it, all at the same time, i.e. parallelise it.

So that’s where I started. All of our environment setup/clone is written in Powershell (using either the AWS Powershell Cmdlets or the AWS CLI), so my first thought was “How can I parallelize in Powershell?”

Unsurprisingly, I’m not the only one who thought that, so in the tradition of good software developers everywhere, I used someone else's code.

At that Github link you can find a function called Invoke-Parallel, which pretty much does exactly what I wanted. It creates a worker pool that pulls from a list of work up to some maximum number of concurrent operations. What was the pool of work though? Bucket prefixes.

Our binary data bucket works a lot like most S3 buckets; it uses keys that look a lot like file paths (even though that’s very much not how S3 works), with “/” as the path delimiter. It’s simple enough to get a list of prefixes in a bucket up to the first delimiter, so our body of work becomes that set. All you need to do then is write a script to copy over the bucket contents based on a given prefix, then supply that script to the Invoke-Parallel function.

function Clone-S3Bucket
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [string]$sourceBucketName,
        [Parameter(Mandatory=$true)]
        [string]$destinationBucketName,
        [Parameter(Mandatory=$true)]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [string]$awsRegion,
        [switch]$parallelised=$false
    )

    if ($rootDirectory -eq $null) { throw "RootDirectory script scoped variable not set. That's bad, its used to find dependencies." }
    $rootDirectoryPath = $rootDirectory.FullName

    . "$rootDirectoryPath\scripts\common\Functions-Aws.ps1"

    $awsCliExecutablePath = Get-AwsCliExecutablePath

    try
    {
        $old = Set-AwsCliCredentials $awsKey $awsSecret $awsRegion

        Write-Verbose "Cloning bucket [$sourceBucketName] to bucket [$destinationBucketName]"

        if ($parallelised)
        {
            # This is the only delimiter that will work properly with s3 cp due to the way it does recursion
            $delimiter = "/"
            $parallelisationThrottle = 10

            Write-Verbose "Querying bucket [$sourceBucketName] for prefixes to allow for parallelisation"
            $listResponseRaw = [string]::Join("", (& $awsCliExecutablePath s3api list-objects --bucket $sourceBucketName --output json --delimiter $delimiter))
            $listResponseObject = ConvertFrom-Json $listResponseRaw
            $prefixes = @($listResponseObject.CommonPrefixes | Select -ExpandProperty Prefix)

            . "$rootDirectoryPath\scripts\common\Functions-Parallelisation.ps1"

            if ($prefixes -ne $null)
            {
                Write-Verbose "Parallelising clone over [$($prefixes.Length)] prefixes"
                $copyRecursivelyScript = { 
                    Write-Verbose "S3 Copy by prefix [$_]";
                    $source = "s3://$sourceBucketName/$_"
                    $destination = "s3://$destinationBucketName/$_"
                    & $awsCliExecutablePath s3 cp $source $destination --recursive | Write-Debug 
                }

                $parallelOutput = Invoke-Parallel -InputObject $prefixes -ImportVariables -ScriptBlock $copyRecursivelyScript -Throttle $parallelisationThrottle -Quiet
            }
            else
            {
                Write-Verbose "No prefixes were found using delimiter [$delimiter]"
            }

            $keys = $listResponseObject.Contents | Select -ExpandProperty Key

            if ($keys -ne $null)
            {
                Write-Verbose "Parallelising clone over [$($keys.Length)] prefixes"
                $singleCopyScript = { 
                    Write-Verbose "S3 Copy by key [$_]";

                    $copyArguments = @()
                    $copyArguments += "s3"
                    $copyArguments += "cp"
                    $copyArguments += "s3://$sourceBucketName/$_"
                    $copyArguments += "s3://$destinationBucketName/$_"
                    & $awsCliExecutablePath @copyArguments | Write-Debug
                }

                $parallelOutput = Invoke-Parallel -InputObject $keys -ImportVariables -ScriptBlock $singleCopyScript -Throttle $parallelisationThrottle -Quiet
            }
        }
        else
        {
            (& $awsCliExecutablePath s3 sync s3://$sourceBucketName s3://$destinationBucketName) | Write-Debug
        }
    }
    finally
    {
        $old = Set-AwsCliCredentials $old.Key $old.Secret $old.Region
    }
}

There Can Be Only One

Now, like any developer knows, obviously my own implementation is going to be better than the one supplied by a team of unknown size who worked on it for some unspecified length of time, but the key fact to learn would be just how much better it was going to be.

I already had a Powershell test for my bucket clone (from when I first wrote it to use the AWS CLI directly), so I tuned it up a little bit to seed a few hundred files (400 to be exact), evenly distributed between prefixed and non-prefixed keys. These files were then uploaded into a randomly generated bucket, and both my old code and the newer parallelised code were executed to clone that bucket into a new bucket.

Describe "Functions-AWS-S3.Clone-S3Bucket" -Tags @("RequiresCredentials") {
    Context "When supplied with two buckets that already exist, with some content in the source bucket" {
        It "Ensures that the content of the source bucket is available in the destination bucket" {
            $workingDirectoryPath = Get-UniqueTestWorkingDirectory
            $creds = Get-AwsCredentials
            $numberOfGeneratedFiles = 400
            $delimiter = "/"

            $sourceBucketName = "$bucketPrefix$([DateTime]::Now.ToString("yyyyMMdd.HHmmss"))"
            (New-S3Bucket -BucketName $sourceBucketName -AccessKey $creds.AwsKey -SecretKey $creds.AwsSecret -Region $creds.AwsRegion) | Write-Verbose
            
            . "$rootDirectoryPath\scripts\common\Functions-Parallelisation.ps1"

            $aws = Get-AwsCliExecutablePath

            $old = Set-AwsCliCredentials $creds.AwsKey $creds.AwsSecret $creds.AwsRegion

            $fileCreation = {
                $i = $_
                $testFile = New-Item "$workingDirectoryPath\TestFile_$i.txt" -ItemType File -Force
                Set-Content $testFile "Some content with a value dependent on the loop iterator [$i]"
                $key = $testFile.Name
                if ($i % 2 -eq 0)
                {
                    $key = "sub" + $delimiter + $key
                }

                if ($i % 4 -eq 0)
                {
                    $key = (Get-Random -Maximum 5).ToString() + $delimiter + $key
                }

                & $aws s3 cp $testFile.FullName s3://$sourceBucketName/$key
            }

            Set-AwsCliCredentials $old.Key $old.Secret $old.Region

            1..$numberOfGeneratedFiles | Invoke-Parallel -ScriptBlock $fileCreation -ImportVariables -Throttle 10 -Quiet

            $destinationBucketName = "$bucketPrefix$([DateTime]::Now.ToString("yyyyMMdd.HHmmss"))"
            $destinationBucket = (New-S3Bucket -BucketName $destinationBucketName -AccessKey $creds.AwsKey -SecretKey $creds.AwsSecret -Region $creds.AwsRegion) | Write-Verbose

            try
            {
                $time = Measure-Command { Clone-S3Bucket -SourceBucketName $sourceBucketName -DestinationBucketName $destinationBucketName -AwsKey $creds.AwsKey -AwsSecret $creds.AwsSecret -AwsRegion $creds.AwsRegion -Parallelised }

                $contents = @(Get-S3Object -BucketName $destinationBucketName -AccessKey $creds.AwsKey -SecretKey $creds.AwsSecret -Region $creds.AwsRegion)

                $contents.Length | Should Be $numberOfGeneratedFiles
            }
            finally
            {
                try
                {
                    (Remove-S3Bucket -BucketName $sourceBucketName -AccessKey $creds.AwsKey -SecretKey $creds.AwsSecret -Region $creds.AwsRegion -DeleteObjects -Force) | Write-Verbose
                }
                catch 
                {
                    Write-Warning "An error occurred while attempting to delete the bucket [$sourceBucketName]."
                    Write-Warning $_
                }

                try
                {
                    (Remove-S3Bucket -BucketName $destinationBucketName -AccessKey $creds.AwsKey -SecretKey $creds.AwsSecret -Region $creds.AwsRegion -DeleteObjects -Force) | Write-Verbose
                }
                catch
                {
                    Write-Warning "An error occurred while attempting to delete the bucket [$destinationBucketName]."
                    Write-Warning $_
                }
            }
        }
    }
}

The old code took 5 seconds. That’s forever!

The new code took 50 seconds!

Yup, 10 times slower.

A disheartening result, but not all that unexpected when I think about it.

The key point here, which I was unaware of, is that the AWS CLI sync is already multithreaded, running a number of requests in parallel to deal with exactly this issue. Trying to multitask within the same process gains very little, and in reality is actually worse, because the CLI is almost certainly much more highly optimised than my own Powershell based parallelisation code.
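If raw sync speed is the goal, a better lever is probably the CLI’s own S3 transfer configuration, which lets you raise the number of parallel requests it uses. I haven’t profiled this myself, and the value below is just an example:

aws configure set default.s3.max_concurrent_requests 20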

Conclusion

Unfortunately I don’t yet have an amazing solution for cloning large S3 buckets. I’ll get back to it in the future, but for now I just have to accept that a clone of our production environment takes hours.

I think that if I were to use a series of workers (probably in AWS) that I could feed work to via a message queue (RabbitMQ, SQS, whatever) I could probably improve the clone speed, but that’s a hell of a lot of effort, so I’ll need to give it some more thought.

Another important takeaway from this experiment is that you should always measure the solutions you’ve implemented. There is no guarantee that your apparently awesome code is any better than something else, no matter how attached to it you might be.

Prove its awesomeness with numbers, and then, if it’s bad, let it die.


A few months ago we released a new service allowing our users to complete some of their work through a mobile application. For an application that is primarily locked to a computer within an office, it was a pretty big improvement. It’s not the first time we’ve tried to add this functionality, but it was one of the better attempts.

That is, until people really started hammering it. Then it went downhill kind of quickly.

Before I started working here, an architectural decision was made to use a document database to store the data for this mobile application. The idea was that the database would be a temporary store, almost like a waystation, that would allow two way synchronization of the data between two applications (the mobile application and the client’s server). Conceptually, it’s not a bad design. It’s not necessarily the design I would have suggested, but it has merit.

The document database selected was RavenDB, specifically version 2.5.

The people who made that particular architectural design are no longer with the company for various reasons, so it was up to my team and me to complete the work and actually release something. We did our best, and after a fairly lengthy development period followed by an equally lengthy beta period, we released into the wild. As I mentioned above, it started well, but didn’t seem to scale to the number of users we started seeing. I’m not talking hundreds of thousands of users either, just a few hundred, so it definitely wasn’t one of those problems where you are cursed by your own success.

The root cause for the performance problems? It appeared to be RavenDB.

An Unkindness

I always make the assumption that if a commercial component looks like it’s not holding up its end of the bargain, it’s probably not the component’s fault. It’s almost certainly the developers’ fault, because they either configured it wrong or generally did not understand it well enough to know that they were using it in entirely the wrong way.

I think this is true for our problems with RavenDB, but I still don’t know exactly where we went wrong.

I’ll start at the beginning.

The first architectural design had two RavenDB instances in EC2 hidden behind a load balancer. They were configured to replicate to each other. This pair was reduced down to a single instance when we discovered that that particular structure was causing issues in the system (using my future knowledge, I now know that’s not how RavenDB does redundancy). The intention was that if load testing showed that we had issues with only one instance, we would revisit.

Our load testing picked up a bunch of problems with various things, but at no point was the RavenDB instance the bottleneck, so we assumed we would be okay.

Unfortunately, the load tests were flawed somehow, because once the system started to be used in anger, the RavenDB instance was definitely the bottleneck.

When we released, the database was initially hosted on a t2.medium. These instances are burstable (meaning their CPU can spike), but are limited by CPU credits. It became obvious very quickly that the database was consuming far more CPU credits than we expected (its CPU usage was averaging something like 80%), so we quickly shifted it to an m3.medium (which does not use CPU credits). This worked for a little while, but eventually we started experiencing performance issues again as usage increased. Another shift of the underlying hardware to an m4.large improved the performance somewhat, but not as much as we expected.

When we looked into the issue, we discovered a direct correlation between the latency of requests to the service and the disk latency of the data disk that RavenDB was using for storage. What followed was a series of adjustments to the data disk, mostly related to switching to provisioned IOPS and then slowly scaling it up until the latency of the disk seemed to no longer be the issue.

But we still had performance problems, and at this point the business was (rightly) getting a bit antsy, because users were starting to notice.

After investigation, the new cause of the performance problems seemed to be paging. Specifically the RavenDB process was consuming more memory than was available and was paging to the very slow system drive. Scaling the underlying instance up to an m4.xlarge (for more memory and compute) alleviated this particular problem.

We had a number of other issues as well:

  • Because we host RavenDB in IIS, the default application pool recycle that occurs every 29 hours eventually started happening during our peak times, which didn’t end well. We now schedule the restart for early in the morning. This was made somewhat more difficult by the fact that RavenDB can’t handle overlapping processes (which IIS uses to avoid downtime during restarts).
  • We’ve had the RavenDB process crash from time to time. IIS handles this (by automatically restarting the process), but there is still a period of downtime while the whole thing heats up again.

That brings us to the present. The service is running well enough, and is pretty performant, but it really does feel like we’ve thrown way too much power at it for what it accomplishes.

Where to Now?

Raven 2.5 is old. Very old.

Our next step is to upgrade to Raven 3, and then directly contrast and compare the performance of the two versions under similar load to see exactly what we’re getting ourselves into.

The logic behind the upgrade is that the newer version is far more likely to have better performance, and we’re far more likely to be able to easily get support for it.

Initial investigations show that the upgrade itself is relatively painless. The older Raven 2.5 client is completely compatible with the new server, so we don’t even need to upgrade the API components yet. All of the data appears to migrate perfectly fine (and seamlessly), so all we need to do is put some effort into comparing performance and then we should be sweet.

Secondarily, we’re going to be setting up at least one other Raven instance, primarily as a backup, but maybe also as a load balancing measure. I’ll have to look into it more before we figure out exactly what it’s capable of, but at the very least we need a replicated backup.

Summary

This post was more of an introduction into some of the issues that we’ve been having with the persistence of our service, but it does bring to light some interesting points.

Needing to support something in production is very different from just deciding to use it during development. There is a whole other set of tools and services that are required before you can successfully use something in production, and a completely different set of expertise and understanding required. Because the development process was so smooth (from the persistence point of view), we never really had to dig into the guts of Raven and really figure out what it was doing, so we were completely unprepared when everything went to hell during actual usage.

Trial by fire indeed.