
As part of my efforts in evaluating Raven 3 (as a replacement for Raven 2.5), I had to clone our production environment. The intent was that if I’m going to test whether the upgrade will work, I should definitely do it on the same data that is actually in Production. Hopefully it will be better, but you should verify your assumptions regardless.

What I really wanted to do was clone the environment, somehow shunt a copy of all of the current traffic through to the clone (ideally with no impact to the real environment) and then compare and contrast the results. Of course, that’s not simple to accomplish (unless you plan for it from the start), so I had to compromise and just take a copy of the existing data, which acts as the baseline for our load tests. I really do want to get that replicated traffic concept going, but it’s going to take a while.

On the upside, cloning one of our environments is a completely hands-free affair. Everything is automated, from the shutting down of the existing environment (can’t snapshot a volume without shutting down the machine that’s using it) through to the creation of the new environment, all the way to the clone of the S3 bucket that we use to store binary data.

4 hours later, I had my clone.

That’s a hell of a long time. For those 4 hours, the actual production service was down (because it needs to be non-changing for the clone to be accurate). I mean, it was a scheduled downtime, so it happened at around midnight, and our service is really only used during business hours, but it’s still pretty bad.

Where did all the time go?

The S3 clone.

Cloning Myself is a Terrible Idea

Well, it wasn’t all S3 to be honest. At least 30 minutes of the clone was taken up by snapshotting the existing data volume and bringing up the new environment. AWS is great, but it still takes time for everything to initialize.

The remaining 3.5 hours was all S3 though.

Our binary data bucket is approximately 100GB with a little over a million files (mostly images). I know this now thanks to the new CloudWatch metrics that AWS provides for S3 buckets (which I’m pretty sure didn’t exist a few months ago).
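As an aside, those numbers can be pulled straight from CloudWatch. Something like the following should do it (bucket name and dates are placeholders; BucketSizeBytes works the same way with a StorageType of StandardStorage):

# Daily object count for a bucket, from the S3 storage metrics in CloudWatch.
# Bucket name and dates below are placeholders.
aws cloudwatch get-metric-statistics `
    --namespace AWS/S3 `
    --metric-name NumberOfObjects `
    --dimensions "Name=BucketName,Value=my-binary-data-bucket" "Name=StorageType,Value=AllStorageTypes" `
    --start-time 2015-08-01T00:00:00Z `
    --end-time 2015-08-02T00:00:00Z `
    --period 86400 `
    --statistics Average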

I’m not doing anything fancy for the bucket clone, just using the AWS CLI and the s3 sync command, doing a bucket to bucket copy. I’m definitely not downloading and then reuploading the files or anything crazy like that, so maybe it just takes that long to copy that much data through S3?
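For context, the simple version boils down to a single command, something like this (bucket names are placeholders; it’s the same call that appears in the full function further down):

# Bucket to bucket copy via the AWS CLI; objects are copied within S3,
# nothing is downloaded to the machine running the command.
aws s3 sync s3://source-bucket-name s3://destination-bucket-name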

There’s got to be a better way!

They Would Fight

When you have what looks like a task that is slow because it’s just one thing doing it, the typical approach is to try and make multiple things do it, all at the same time, i.e. parallelise it.

So that’s where I started. All of our environment setup/clone is written in Powershell (using either the AWS Powershell Cmdlets or the AWS CLI), so my first thought was “How can I parallelise in Powershell?”

Unsurprisingly, I’m not the only one who thought that, so in the tradition of good software developers everywhere, I used someone else's code.

At that Github link you can find a function called Invoke-Parallel, which does pretty much exactly what I wanted. It creates a pool of workers that pull items from a list of work, up to some maximum number of concurrent operations. What was the pool of work though? Bucket prefixes.
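To give a feel for the shape of it, a trivial (entirely made up) usage looks something like this, using the same parameters that turn up in the clone function below:

# Toy example only: each pipeline item becomes $_ inside the script block,
# and at most -Throttle script blocks run at the same time.
$work = @("prefix-a/", "prefix-b/", "prefix-c/")
$work | Invoke-Parallel -Throttle 2 -ImportVariables -ScriptBlock {
    Write-Verbose "Processing [$_]"
}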

Our binary data bucket works a lot like most S3 buckets: it uses keys that look a lot like file paths (even though that’s very much not how S3 works), with “/” as the path delimiter. It’s simple enough to get a list of prefixes in a bucket up to the first delimiter, so that set becomes our body of work. All you need to do then is write a script block that copies the bucket contents for a given prefix, then supply that script block to the Invoke-Parallel function.

function Clone-S3Bucket
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [string]$sourceBucketName,
        [Parameter(Mandatory=$true)]
        [string]$destinationBucketName,
        [Parameter(Mandatory=$true)]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [string]$awsRegion,
        [switch]$parallelised
    )

    if ($rootDirectory -eq $null) { throw "RootDirectory script scoped variable not set. That's bad; it's used to find dependencies." }
    $rootDirectoryPath = $rootDirectory.FullName

    . "$rootDirectoryPath\scripts\common\Functions-Aws.ps1"

    $awsCliExecutablePath = Get-AwsCliExecutablePath

    try
    {
        $old = Set-AwsCliCredentials $awsKey $awsSecret $awsRegion

        Write-Verbose "Cloning bucket [$sourceBucketName] to bucket [$destinationBucketName]"

        if ($parallelised)
        {
            # This is the only delimiter that will work properly with s3 cp due to the way it does recursion
            $delimiter = "/"
            $parallelisationThrottle = 10

            Write-Verbose "Querying bucket [$sourceBucketName] for prefixes to allow for parallelisation"
            $listResponseRaw = [string]::Join("", (& $awsCliExecutablePath s3api list-objects --bucket $sourceBucketName --output json --delimiter $delimiter))
            $listResponseObject = ConvertFrom-Json $listResponseRaw
            $prefixes = @($listResponseObject.CommonPrefixes | Select -ExpandProperty Prefix)

            . "$rootDirectoryPath\scripts\common\Functions-Parallelisation.ps1"

            if ($prefixes -ne $null)
            {
                Write-Verbose "Parallelising clone over [$($prefixes.Length)] prefixes"
                $copyRecursivelyScript = { 
                    Write-Verbose "S3 Copy by prefix [$_]";
                    $source = "s3://$sourceBucketName/$_"
                    $destination = "s3://$destinationBucketName/$_"
                    & $awsCliExecutablePath s3 cp $source $destination --recursive | Write-Debug 
                }

                $parallelOutput = Invoke-Parallel -InputObject $prefixes -ImportVariables -ScriptBlock $copyRecursivelyScript -Throttle $parallelisationThrottle -Quiet
            }
            else
            {
                Write-Verbose "No prefixes were found using delimiter [$delimiter]"
            }

            # Keys that don't contain the delimiter aren't rolled up into CommonPrefixes,
            # so they come back in Contents and have to be copied individually.
            $keys = $listResponseObject.Contents | Select -ExpandProperty Key

            if ($keys -ne $null)
            {
                Write-Verbose "Parallelising clone over [$($keys.Length)] keys"
                $singleCopyScript = { 
                    Write-Verbose "S3 Copy by key [$_]";

                    $copyArguments = @()
                    $copyArguments += "s3"
                    $copyArguments += "cp"
                    $copyArguments += "s3://$sourceBucketName/$_"
                    $copyArguments += "s3://$destinationBucketName/$_"
                    & $awsCliExecutablePath @copyArguments | Write-Debug
                }

                $parallelOutput = Invoke-Parallel -InputObject $keys -ImportVariables -ScriptBlock $singleCopyScript -Throttle $parallelisationThrottle -Quiet
            }
        }
        else
        {
            (& $awsCliExecutablePath s3 sync s3://$sourceBucketName s3://$destinationBucketName) | Write-Debug
        }
    }
    finally
    {
        $old = Set-AwsCliCredentials $old.Key $old.Secret $old.Region
    }
}
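For reference, a hypothetical invocation looks something like this (bucket names and region are made up, and the script scoped $rootDirectory variable is assumed to have already been set up by the calling script):

# Hypothetical invocation; credential variables are assumed to have been
# populated by the surrounding environment scripts.
Clone-S3Bucket -SourceBucketName "production-binary-data" `
    -DestinationBucketName "production-binary-data-clone" `
    -AwsKey $awsKey `
    -AwsSecret $awsSecret `
    -AwsRegion "ap-southeast-2" `
    -Parallelised `
    -Verbose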

There Can Be Only One

Now, as any developer knows, obviously my own implementation was going to be better than the one supplied by a team of unknown size who had worked on it for some unspecified length of time; the only question was just how much better it was going to be.

I already had a Powershell test for my bucket clone (from when I first wrote it to use the AWS CLI directly), so I tuned it up a little to seed a few hundred files (400 to be exact), evenly distributed between prefixed and non-prefixed keys. These files were then uploaded into a randomly generated bucket, and both my old code and the newer parallelised code were executed to clone that bucket into a new bucket.

Describe "Functions-AWS-S3.Clone-S3Bucket" -Tags @("RequiresCredentials") {
    Context "When supplied with two buckets that already exist, with some content in the source bucket" {
        It "Ensures that the content of the source bucket is available in the destination bucket" {
            $workingDirectoryPath = Get-UniqueTestWorkingDirectory
            $creds = Get-AwsCredentials
            $numberOfGeneratedFiles = 400
            $delimiter = "/"

            $sourceBucketName = "$bucketPrefix$([DateTime]::Now.ToString("yyyyMMdd.HHmmss"))"
            (New-S3Bucket -BucketName $sourceBucketName -AccessKey $creds.AwsKey -SecretKey $creds.AwsSecret -Region $creds.AwsRegion) | Write-Verbose
            
            . "$rootDirectoryPath\scripts\common\Functions-Parallelisation.ps1"

            $aws = Get-AwsCliExecutablePath

            $old = Set-AwsCliCredentials $creds.AwsKey $creds.AwsSecret $creds.AwsRegion

            $fileCreation = {
                $i = $_
                $testFile = New-Item "$workingDirectoryPath\TestFile_$i.txt" -ItemType File -Force
                Set-Content $testFile "Some content with a value dependent on the loop iterator [$i]"
                $key = $testFile.Name
                if ($i % 2 -eq 0)
                {
                    $key = "sub" + $delimiter + $key
                }

                if ($i % 4 -eq 0)
                {
                    $key = (Get-Random -Maximum 5).ToString() + $delimiter + $key
                }

                & $aws s3 cp $testFile.FullName s3://$sourceBucketName/$key
            }

            Set-AwsCliCredentials $old.Key $old.Secret $old.Region

            1..$numberOfGeneratedFiles | Invoke-Parallel -ScriptBlock $fileCreation -ImportVariables -Throttle 10 -Quiet

            $destinationBucketName = "$bucketPrefix$([DateTime]::Now.ToString("yyyyMMdd.HHmmss"))"
            (New-S3Bucket -BucketName $destinationBucketName -AccessKey $creds.AwsKey -SecretKey $creds.AwsSecret -Region $creds.AwsRegion) | Write-Verbose

            try
            {
                $time = Measure-Command { Clone-S3Bucket -SourceBucketName $sourceBucketName -DestinationBucketName $destinationBucketName -AwsKey $creds.AwsKey -AwsSecret $creds.AwsSecret -AwsRegion $creds.AwsRegion -Parallelised }

                $contents = @(Get-S3Object -BucketName $destinationBucketName -AccessKey $creds.AwsKey -SecretKey $creds.AwsSecret -Region $creds.AwsRegion)

                $contents.Length | Should Be $numberOfGeneratedFiles
            }
            finally
            {
                try
                {
                    (Remove-S3Bucket -BucketName $sourceBucketName -AccessKey $creds.AwsKey -SecretKey $creds.AwsSecret -Region $creds.AwsRegion -DeleteObjects -Force) | Write-Verbose
                }
                catch 
                {
                    Write-Warning "An error occurred while attempting to delete the bucket [$sourceBucketName]."
                    Write-Warning $_
                }

                try
                {
                    (Remove-S3Bucket -BucketName $destinationBucketName -AccessKey $creds.AwsKey -SecretKey $creds.AwsSecret -Region $creds.AwsRegion -DeleteObjects -Force) | Write-Verbose
                }
                catch
                {
                    Write-Warning "An error occurred while attempting to delete the bucket [$destinationBucketName]."
                    Write-Warning $_
                }
            }
        }
    }
}
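The test is plain Pester, so it can be run in isolation via its tag (assuming Pester is installed and the credentials helper can find something valid to use):

# Run only the tests that need real AWS credentials.
Invoke-Pester -Tag "RequiresCredentials"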

The old code took 5 seconds. That’s forever!

The new code took 50 seconds!

Yup, 10 times slower.

A disheartening result, but not all that unexpected when I think about it.

The key point here, which I was unaware of, is that the AWS CLI’s sync is already multithreaded, running a number of requests in parallel to deal with exactly this issue. Layering my own parallelisation on top gains very little, and in reality is actually worse: every per-key copy pays the cost of starting a whole new CLI process just to make a single request, and the CLI’s own sync is almost certainly far more highly optimised than my Powershell based parallelisation code.
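On that note, the CLI’s own parallelism is configurable, so before building anything custom it’s probably worth just turning that dial up and measuring again. Something like this (the 50 is an arbitrary value for illustration, not a recommendation):

# Raise the number of concurrent S3 requests that s3 sync/cp will use for the
# default profile (the out of the box value is 10).
aws configure set default.s3.max_concurrent_requests 50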

Conclusion

Unfortunately I don’t yet have an amazing solution for cloning large S3 buckets. I’ll get back to it in the future, but for now I just have to accept that a clone of our production environment takes hours.

I think that if I were to use a series of workers (probably in AWS) that I could feed work to via a message queue (RabbitMQ, SQS, whatever) I could probably improve the clone speed, but that’s a hell of a lot of effort, so I’ll need to give it some more thought.
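As a very rough sketch of the orchestrator side (queue URL is a placeholder and the workers aren’t shown), it would really just be pushing one message per prefix into a queue and letting the workers do the actual copies:

# Rough sketch only: one SQS message per prefix, to be picked up by a pool of
# workers that each run their own recursive copy for that prefix.
$queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/s3-clone-work"
foreach ($prefix in $prefixes)
{
    aws sqs send-message --queue-url $queueUrl --message-body $prefix | Write-Debug
}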

Another important takeaway from this experiment is that you should always measure the solutions you’ve implemented. There is no guarantee that your apparently awesome code is any better than something else, no matter how attached to it you might be.

Prove its awesomeness with numbers, and then, if it’s bad, let it die.