I definitely would not say that I am an expert at load testing web services. At best, I appreciate how valuable it is for validating your architecture and implementation, helping you get a handle on weak or slow areas and fix them before they become a problem.

One thing I have definitely learned in the last 12 months, however, is just how important it is to make sure that your load profile (i.e. your simulation of how you think your system will be loaded) is as close to reality as possible. If you get this wrong, not only will you not be testing your system properly, you will give yourself a false sense of confidence in how it performs when people are using it. This can lead to some pretty serious disasters when you actually do go live and everything explodes (literally or figuratively, it doesn’t matter).

Putting together a good load profile is a difficult and time consuming task. You need to make assumptions about expected usage patterns, number of users, quality (and quantity) of data and all sorts of other things. While you’re building this profile, it will feel like you aren’t contributing directly to the software being written (there is code to write!), but believe me, a good load profile is worth it when it comes to validating all sorts of things later on. Like a good test suite, it keeps paying dividends in all sorts of unexpected places.

Such a Tool

It would be remiss of me to talk about load tests and load profiles without mentioning at least one of the tools you can use to accomplish them, as there are quite a few out there. In our organisation we use JMeter, mostly because that’s the first one that we really looked into in any depth, but it helps that it seems to be pretty well accepted in the industry, as there is a lot of information already out there to help you when you’re stuck. Extremely flexible, extendable and deployable, it’s an excellent tool (though it does have a fairly steep learning curve, and it’s written in Java, so for a .NET person it can feel a little foreign).

Back to the meat of this post though.

As part of the first major piece of work done shortly after I started, my team completed the implementation of a service for supporting the remote access and editing of data that was previously locked to client sites. I made sure that we had some load tests to validate the behaviour of the service when it was actually being used, as opposed to when it was just kind of sitting there, doing nothing. I think it might have been the first time that our part of the organisation had ever designed and implemented load tests for validating performance, so it wasn’t necessarily the most…perfect, of solutions.

The load tests showed a bunch of issues which we dutifully fixed.

When we went into production though, there were so many more issues than we anticipated, especially related to the underlying persistence store (RavenDB, which I have talked about at length recently).

Of course, the question on everyone’s lips at that point was, why didn’t we see those issues ahead of time? Surely that was what the load tests were meant to catch?

The Missing Pieces

There were a number of reasons why our load tests didn’t catch any of the problems that started occurring in production.

The first was that we were still unfamiliar with JMeter when we wrote those tests. This mostly just limited our ability to simulate complex operations (of which there are a few), and made our profile a bit messier than it should have been. It didn’t necessarily cause the weak load tests, but it certainly didn’t help.

The second reason was that the data model used in the service is not overly easy to use. When I say easy to use, I mean that the objects involved are complex (100+KB of JSON) and thus are difficult to create realistic looking random data for. As a result, we took a number of samples and then used those repeatedly in the load tests, substituting values as appropriate to differentiate users from each other. I would say that the inability to easily create realistic looking fake data was definitely high up there on the list as to why the load tests were ineffective in finding the issues we encountered in production.
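Generating believable fake data for big, complex documents is tedious, but even a crude generator beats replaying the same few samples. A rough Python sketch of the idea (the field names and sizes below are invented for illustration; they are not our actual schema):

```python
import json
import random
import string

def random_string(length):
    """Generate a random alphanumeric string of the given length."""
    return "".join(random.choices(string.ascii_letters + string.digits, k=length))

def generate_entity(customer_id):
    """Build a single randomised entity. The fields here are purely
    illustrative, standing in for whatever the real schema contains."""
    return {
        "customerId": customer_id,
        "name": random_string(20),
        "notes": random_string(random.randint(100, 2000)),
        "items": [
            {"id": random_string(8), "value": random.randint(0, 10000)}
            for _ in range(random.randint(1, 50))
        ],
    }

def generate_payload(customer_id, target_size_kb=100):
    """Keep adding entities until the serialised JSON reaches roughly the
    target size, to mimic the ~100KB documents mentioned above."""
    payload = {"customerId": customer_id, "entities": []}
    while len(json.dumps(payload)) < target_size_kb * 1024:
        payload["entities"].append(generate_entity(customer_id))
    return json.dumps(payload)
```

The useful property is that every simulated user gets structurally valid but unique data, so caches, indexes and storage get exercised much more like they would in production.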

The third reason why our load tests didn’t do the job, was the actual load profile itself. The simulation for what sort of calls we expected a single user (where user describes more than just one actual person using the system) to make was just not detailed enough. It did not cover enough of the functionality of the service and definitely did not accurately represent reality. This was unfortunate and unexpected, because we spent a significant amount of time attempting to come up with a profile, and we got agreement from a number of different parties that this profile would be good enough for the purposes of testing. The root cause of this one was simply unfamiliarity with the intended usage of the system.

Finally, and what I think is probably the biggest contributor to the ineffectiveness of the load tests, we simply did not run them for long enough. Each load test we did only went for around 48 hours (at the high end) and was focused around finding immediate and obvious performance problems. A lot of the issues that we had in production did not manifest themselves until we’d been live for a week or two. If we had implemented the load tests sooner, and then started and kept them running on our staging environment for weeks at a time, I imagine that we would have found a lot of the issues that ended up plaguing us.


Of course, there is no point thinking about these sorts of things unless you actually make changes the next time you go to do the same sort of task.

So, what did we learn?

  1. Start thinking about the load tests and simulating realistic looking data early. We came into the service I’ve been talking about above pretty late (to clean up someone else’s mess) and we didn’t really get a chance to spend any time on creating realistic looking data. This hurt us when it came time to simulate users.
  2. Think very very hard about your actual load profile. What is a user? What does a user do? Do they do it sequentially or in parallel? Are there other processes going on that might impact performance? Are there things that happen irregularly that you should include in the profile at random? How big is the data? How much does it vary? All of those sorts of questions can be very useful for building a better load profile. Make sure you spend the time to build it properly in whatever tool you are using, such that you can tweak it easily when you go to run it.
  3. Run our load tests early and then for as much time as possible. To us, this means we should run them in an infinite loop on top of our staging environment pretty much as soon as we have them, forever (well, until we’re no longer actively developing that component anyway).
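One thing that helps with the second lesson is expressing the load profile itself as data, rather than burying it in tool configuration, so it can be argued about and tweaked in one place. A minimal Python sketch of the idea (the actions and weights are invented for illustration):

```python
import random

# A load profile expressed as data: each action a simulated user might
# perform, with a relative weight approximating how often it happens.
# These actions and weights are illustrative, not our real profile.
PROFILE = [
    ("list_entities", 50),
    ("get_entity", 30),
    ("update_entity", 10),
    ("upload_image", 8),
    ("delete_entity", 2),
]

def simulate_user_session(actions_per_session=20):
    """Pick a weighted-random sequence of actions for one simulated user."""
    names = [name for name, _ in PROFILE]
    weights = [weight for _, weight in PROFILE]
    return random.choices(names, weights=weights, k=actions_per_session)
```

In JMeter the equivalent is modelled with Thread Groups and controllers, but having the profile written down somewhere reviewable makes the “what is a user, and what do they actually do” conversation much easier to have with other parties.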

The good thing to come out of the above is that the service we completed did not flop hard enough that we don’t get a second chance. We’re just now developing some other services (to meet similar needs) and we’ve taken all of the lessons above to heart. Our load test profiles are much better and we’ve started incorporating soak tests to pick up issues that only manifest over long periods of time.

At least when it breaks we’ll know sooner, rather than when there are real, paying, customers trying to use it.

I imagine though, that we will probably have to go through this process a few times before we really get a good handle on it.


The service that I’ve mentioned previously (and the iOS app it supports) has been in beta now for a few weeks. People seem relatively happy with it, both from a performance standpoint and due to the fact that it doesn’t just arbitrarily lose their information, unlike the previous version, so we’ve got that going for us, which is nice.

We did a fair amount of load testing on it before it went out to beta, but only for small numbers of concurrent users (< 100), to make sure that our beta experience would be acceptable. That load testing picked up a few issues, including one where the service would happily (accidentally of course) delete other people’s data. It wasn’t a permissions issue, it was due to the way in which we were keying our image storage. More importantly, the load testing found issues with the way in which we were storing images (we were using Raven 2.5 attachments) and how it just wasn’t working from a performance point of view. We switched to storing the files in S3, and it was much better.

I believe the newer version of Raven has a new file storage mechanism that is much better. I don’t even think Ayende recommends that you use the attachments built into Raven 2.5 for any decent amount of file storage.

Before we went live, we knew that we needed to find the breaking point of the service. That is, the number of concurrent users at which its performance degraded to the point where it was unusable (at least for the configuration that we were planning on going live with). If that number was too low, we knew we would need to make some additional changes, either in terms of infrastructure (beefier AWS instances, more instances in the Auto Scaling Group) or in terms of code.

We tried to simply run a huge number of users through our load tests locally (which is how we did the first batch of load testing, locally using JMeter) but we capped out our available upload bandwidth pretty quickly, well below the level of traffic that the service could handle.

It was time to farm the work out to somewhere else, somewhere with a huge amount of easily accessible computing resources.

Where else but Amazon Web Services?

I’ve Always Wanted to be a Farmer

The concept was fairly straightforward. We had a JMeter configuration file that contained all of our load tests. It was parameterised by the number of users, so conceptually the path would be to spin up some worker instances in EC2, push JMeter, its dependencies and our config to them, then execute the tests. This way we could tune the number of users per instance along with the total number of worker instances, and we would be able to easily put enough pressure on the service to find its breaking point.

JMeter gives you the ability to set the value of variables via the command line. Be careful though, as the variable names are case sensitive. That one screwed me over for a while, as I couldn’t figure out why the value of my variables was still the default on every machine I started the tests on. For the variable that defined the maximum number of users it wasn’t so bad, if a bit confusing. The other variable that defined the seed for the user identity was more of an issue when it wasn’t working, because it meant the same user was doing similar things from multiple machines. Still a valid test, but not the one I was aiming to do, as the service isn’t designed for concurrent access like that.
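For reference, the mechanism here is JMeter properties, set with the -J switch and read back inside the test plan with the __P function. Something like the following (the property names mirror the script below, the file name is illustrative):

```shell
# -J sets a JMeter property from the command line. Property names are case
# sensitive, so -JtotalNumberOfUsers and -Jtotalnumberofusers are different.
jmeter -n -t load-test.jmx -JtotalNumberOfUsers=100 -JstartingCustomerNumber=1
```

Inside the test plan, the value is read back (with a default if it wasn’t supplied) via ${__P(totalNumberOfUsers,10)}, which is exactly where a silently-wrong property name leaves you stuck on the default.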

We wouldn’t want to put all of that load on the service all at once though, so we needed to stagger when each instance started its tests.

Leveraging the work I’d done previously for setting up environments, I created a Cloud Formation template containing an Auto Scaling Group with a variable number of worker instances. Each instance would have the JMeter config file and all of its dependencies (Java, JMeter, any supporting scripts) installed during setup, and then be available for remote execution via Powershell.

The plan was to hook into that environment (or setup a new one if one could not be found), find the worker instances and then iterate through them, starting the load tests on each one, making sure to stagger the time between starts to some reasonable amount. The Powershell script for doing exactly that is below:


param
(
    [string]$environmentName,
    [string]$awsKey,
    [string]$awsSecret,
    [string]$awsRegion
)

$ErrorActionPreference = "Stop"

$currentDirectoryPath = Split-Path $script:MyInvocation.MyCommand.Path
write-verbose "Script is located at [$currentDirectoryPath]."

. "$currentDirectoryPath\_Find-RepositoryRoot.ps1"

$repositoryRoot = Find-RepositoryRoot $currentDirectoryPath

$repositoryRootDirectoryPath = $repositoryRoot.FullName
$commonScriptsDirectoryPath = "$repositoryRootDirectoryPath\scripts\common"

. "$repositoryRootDirectoryPath\scripts\environment\Functions-Environment.ps1"

. "$commonScriptsDirectoryPath\Functions-Aws.ps1"

$stack = $null
try
{
    $stack = Get-Environment -EnvironmentName $environmentName -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion
}
catch
{
    Write-Warning $_
}

if ($stack -eq $null)
{
    $update = ($stack -ne $null)

    $stack = New-Environment -EnvironmentName $environmentName -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion -UpdateExisting:$update -Wait -disableCleanupOnFailure
}

$autoScalingGroupName = $stack.AutoScalingGroupName

$asg = Get-ASAutoScalingGroup -AutoScalingGroupNames $autoScalingGroupName -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
$instances = $asg.Instances

. "$commonScriptsDirectoryPath\Functions-Aws-Ec2.ps1"

$remoteUser = "Administrator"
$remotePassword = "ObviouslyInsecurePasswordsAreTricksyMonkeys"
$securePassword = ConvertTo-SecureString $remotePassword -AsPlainText -Force
$cred = New-Object System.Management.Automation.PSCredential($remoteUser, $securePassword)

$usersPerMachine = 100
$nextAvailableCustomerNumber = 1
$jobs = @()
foreach ($instance in $instances)
{
    # Get the full details of the instance (the ASG only supplies a summary)
    $instance = Get-AwsEc2Instance -InstanceId $instance.InstanceId -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion

    $ipAddress = $instance.PrivateIpAddress
    $session = New-PSSession -ComputerName $ipAddress -Credential $cred

    $remoteScript = {
        param($totalNumberOfUsers, $startingCustomerNumber)
        Set-ExecutionPolicy -ExecutionPolicy Bypass
        & "C:\cfn\dependencies\scripts\jmeter\execute-load-test-no-gui.ps1" -totalNumberOfUsers $totalNumberOfUsers -startingCustomerNumber $startingCustomerNumber -AllocatedMemory 512
    }
    $job = Invoke-Command -Session $session -ScriptBlock $remoteScript -ArgumentList $usersPerMachine,$nextAvailableCustomerNumber -AsJob
    $jobs += $job
    $nextAvailableCustomerNumber += $usersPerMachine

    # Stagger the starts so the load ramps up gradually rather than all at once
    #Sleep -Seconds ([TimeSpan]::FromHours(2).TotalSeconds)
    Sleep -Seconds 300
}

# Can use Get-Job or record list of jobs and then terminate them. I suppose we could also wait on all of them to be complete. Might be good to get some feedback from
# the remote process somehow, to indicate whether or not it is still running/what it is doing.
Additionally, I’ve recreated and reuploaded the repository from my first JMeter post, containing the environment template and scripts for executing the template, as well as the script above. You can find it here.

The last time I uploaded this repository I accidentally compromised our AWS deployment credentials, so I tore it down again very quickly. Not my brightest moment, but you can rest assured I’m not making the same mistake twice. If you look at the repository, you’ll notice that I implemented the mechanism for asking for credentials for tests so I never feel tempted to put credentials in a file ever again.

We could watch the load tests kick into gear via Kibana, and keep an eye on when errors start to occur and why.

Obviously we didn’t want to run the load tests on any of the existing environments (which are in use for various reasons), so we spun up a brand new environment for the service, fired up the script to farm out the load tests (with a 2 hour delay between instance starts) and went home for the night.

15 minutes later, Production (the environment actively being used for the external beta) went down hard, and so did all of the others, including the new load test environment.

Separately Dependent

We had gone to great lengths to make sure that our environments were independent. That was the entire point behind codifying the environment setup, so that we could spin up all resources necessary for the environment, and keep it isolated from all of the other ones.

It turns out they weren’t quite as isolated as we would have liked.

Like a lot of AWS setups, we have an internet gateway, allowing resources internal to our VPC (like EC2 instances) access to the internet. By default, only resources with an external IP can access the internet through the gateway. Other resources have to use some other mechanism for accessing the internet. In our case, the other mechanism is a SQUID proxy.

It was this proxy that was the bottleneck. Both the service under test and the load test workers themselves were slamming it, the service in order to talk to S3 and the load test workers in order to hit the service (through its external URL).

We recently increased the specs on the proxy machine (because of a similar problem discovered during load testing with fewer users) and we thought that maybe it would be powerful enough to handle the incoming requests. It probably would have been if it wasn’t for the double load (i.e. if the load test requests had been coming from an external party and the only traffic going through the proxy was to S3 from the service).

In the end the load tests did exactly what they were supposed to do, even if they did it in an unexpected way. They pushed the system to breaking point, allowing us to identify where it broke and schedule improvements to prevent the situation from occurring again.

Actions Speak Louder Than Words

What are we going to do about it? There are a number of things I have in mind.

The first is to not have a single proxy instance and instead have an auto scaling group that scales as necessary based on load. I like this idea and I will probably be implementing it at some stage in the future. To be honest, as a shared piece of infrastructure, this is how it should have been implemented in the first place. I understand that the single instance (configured lovingly by hand) was probably quicker and easier initially, but for such a critical piece of infrastructure, you really do need to spend the time to do it properly.

The second is to have environment specific proxies, probably as auto scaling groups anyway. This would give me more confidence that we won’t accidentally murder production services when doing internal things, just from an isolation point of view. Essentially, we should treat the proxy just like we treat any other service, and be able to spin them up and down as necessary for whatever purposes.

The third is to isolate our production services entirely, either with another VPC just for production, or even another AWS account just for production. I like this one a lot, because as long as we have shared environments, I’m always terrified I’ll screw up a script and accidentally delete everything. If production wasn’t located in the same account, that would literally be impossible. I’ll be trying to make this happen over the coming months, but I’ll need to move quickly, as the more stuff we have in production, the harder it will be to move.

The last optimisation is to use the new VPC endpoint feature in AWS to avoid having to go to the internet in order to access S3, which I have already done. This really just delays the root issue (shared single point of failure), but it certainly solves the immediate problem and should also provide a nice performance boost, as it removes the proxy from the picture entirely for interactions with S3, which is nice.
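For completeness, a VPC endpoint for S3 is just another resource in the CloudFormation template. A minimal sketch (the region and the referenced resource names are placeholders for whatever your template actually defines):

```json
"S3VpcEndpoint": {
  "Type": "AWS::EC2::VPCEndpoint",
  "Properties": {
    "ServiceName": "com.amazonaws.ap-southeast-2.s3",
    "VpcId": { "Ref": "Vpc" },
    "RouteTableIds": [ { "Ref": "PrivateRouteTable" } ]
  }
}
```

Once the endpoint is associated with the route tables, S3 traffic from those subnets flows over it automatically, with no proxy configuration required on the instances.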


To me, this entire event proved just how valuable load testing is. As I stated previously, it did exactly what I expected it to do: find where the service breaks. It broke in an entirely unexpected way (and broke other things as well), but honestly this is probably the best outcome, because that would have happened at some point in the future anyway (whenever we hit the saturation point for the proxy) and I’d prefer it to happen now, when we’re in beta and managing communications with every user closely, than later, when everybody and their dog are using the service.

Of course, now we have a whole lot more infrastructure work to complete before we can go live, but honestly, the work is never really done anyway.

I still hate proxies.


(Get it, like the law of Demeter, except completely different and totally unrelated).

My focus at work recently has been on performance and load testing the service that lies at the heart of our newest piece of functionality. The service is fairly straightforward, acting as a temporary data store and synchronization point between two applications, one installed at the client side and one mobile. It’s multi-tenant, so all of the clients share the same service, and there can be multiple mobile devices per client, split by user (where each device is assumed to belong to a specific user, or at least is authenticated as one).

I’ve done some performance testing before, but mostly on desktop applications, checking to see whether or not the app was fast and how much memory it consumed. Nothing formal, just using the performance analysis tools built into Visual Studio to identify slow areas and tune them.

Load testing a service hosted in Amazon is an entirely different beast, and requires different tools.

Enter JMeter.

Yay! Java!

JMeter is well and truly a Java app. You can smell it. I tried to not hold that against it though, and found it to be very functional and easy enough to understand once you get past that initial (steep) learning curve.

At a high level, there are solution-esque containers, containing one or more Thread Groups. Each Thread Group can model one or more users (threads) that run through a series of configured actions that make up the content of the test. I’ve barely scratched the surface of the application, but you can at least configure loops and counters, specify variables and of course, most importantly, HTTP requests. You can weave all of these things (and more) together to create whatever load test you desire.

It turns out, that’s actually the hard bit: deciding what the test should, well, test. Ideally you want something that approximates average expected usage, so you need to do some legwork to work out what that could be. Then, you use the expected usage example and replicate it repeatedly, up to the number of concurrent users you want to evaluate.

This would probably be difficult and time consuming, but luckily JMeter has an extremely useful feature that helps somewhat. A recording proxy.

Configure the proxy on your endpoints (in my case, an installed application on a virtual machine and a mobile device) and start it, and all requests to the internet at either of those places will be helpfully recorded in JMeter, ready to be played back. Thus, rather than trying to manually concoct the set of actions of a typical user, you can just use the software as normal, approximate some real usage and then use those recorded requests to form the baseline of your load test.

This is obviously useful when you have a mostly completed application and someone has said “now load test it” just before release. I don’t actually recommend this, as load and performance testing should be done throughout the development process, with performance metrics set early and measured often. Getting to the end only to discover that your software works great for 1 person, but fails miserably for 10 is an extremely bad situation to be in. Not only that, but throwing all the load and performance testing infrastructure together at the end doesn’t give it enough time to mature, leading to oversights and other nasty side effects. Test early and test often applies to performance as much as functional correctness.

Helpfully, if you have a variable setup (say for the base URL) and you’re recording, any instances of the value in the variable will be replaced by a reference to the variable itself. This saves a huge amount of time going through the recorded requests and changing them to allow for configurability.

The variable substitution is a bit of a double edged sword as I found out though. I had a variable set at the global scope with a value of 10 (it was a loop counter or something) and JMeter dutifully replaced all instances of 10 with references to that variable. Needless to say, that recording session had to be thrown out, and I moved the variable to the appropriate Thread Group level scope shortly after.

Spitting Images

All was not puppies and roses though, as the recording proxy seemed to choke on image uploads. The service deals in images that are attached to other pieces of data, so naturally image uploads are part of its API.

Specifically, when the proxy wasn’t in place, everything worked fine. As soon as I configured the proxy on either endpoint, the images failed to upload. Looking at the request, all of the content that would normally be attached to an image upload was being stripped out. By the time the request got to the service, after having passed through the proxy, the request looked like there should be an image attached, but there was none available.

With the image uploads failing, I couldn’t record a complete picture of interactions with the service, so I had to make some stuff up.

It wasn’t until much later, when I started incorporating image uploads into the test plan manually that I discovered JMeter does not like file uploads as part of a PUT request. It’s fine with POST, but the multi-part support does not seem to work with PUT. I wasn’t necessarily in a position to change either of our applications to force them to use POST just so I could record a load test though, and I think PUT better encapsulates what we are doing (placing content at a known ID), so I just had to live with my artificially constructed image uploads that approximated usage.

Custom Rims

Now that I had a collection of requests defining some nice expected usage I only had one more problem to solve, kind of specific to our API, but I’ll mention it here because the general approach is probably applicable.

Our API uses an Auth Header. Not particularly uncommon, but our Auth Header isn’t as simple as a token obtained from sending the username/password to an auth endpoint, or anything similarly sane. Our Auth Header is an encrypted package containing a set of customer information, which is decrypted on the server side and then validated. It contains user identifying information plus a time stamp, for a validity period.

It needs to be freshly calculated on almost every outgoing request.

My recorded requests stored the Auth Header that they were made with, but of course, the token can’t be reused as time goes on, because it will time out. So I needed to create a fresh Auth Header from the information I had available. Obviously, this being a very custom function (involving AES encryption and some encoding), JMeter can’t help out of the box.
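Our real implementation ended up as a custom Java function, but the shape of the token is easy to sketch. Here is a rough Python equivalent, using HMAC signing as a stand-in for the actual AES encryption (the key and the field names are placeholders, not our real scheme):

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"not-the-real-key"  # placeholder; the real scheme uses AES encryption

def build_auth_header(customer_id, user_id):
    """Build a fresh auth token containing identity plus a timestamp.
    HMAC stands in for the real encryption here; the important part is
    that the token embeds a validity window, so it has to be regenerated
    as requests are made rather than replayed from a recording."""
    payload = json.dumps({
        "customerId": customer_id,
        "userId": user_id,
        "timestamp": int(time.time()),
    }).encode("utf-8")
    signature = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return (base64.b64encode(payload).decode("ascii")
            + "."
            + base64.b64encode(signature).decode("ascii"))
```

The embedded timestamp is exactly why the recorded headers stopped working: once the validity period lapses, only a freshly calculated token will pass server-side validation.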

So it was time to extend.

I was impressed with JMeter in this respect. It has a number of ways of entering custom scripts/code when the in-built functionality of JMeter doesn’t allow you to do what you need to do. Originally I was going to use a BeanShell script, but after doing some reading, I discovered that BeanShell can be slow when executed many times (like for example on every single request), so I went with a custom function written in Java.

It’s been a long time since I’ve written Java. I wouldn’t say it’s bad, but I definitely like (and am far more experienced with) C#. Java generics are weird.

Anyway, once I managed to implement the interface that JMeter supplies for custom functions, it was a simple matter to compile it into a JAR file and include the location of the JAR in the search_path when starting JMeter. JMeter will automatically load all of the custom functions it finds, and you can freely call them just like you would a variable (using the ${} syntax, functions are typically named with __ to distinguish them from variables).
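Concretely, the wiring looks something like this (the JAR path and function name are made up for illustration):

```shell
# Add the JAR containing the custom function to JMeter's search path;
# JMeter scans it for function implementations on startup.
jmeter -Jsearch_paths=C:\load-tests\custom-functions.jar

# The function can then be called anywhere a variable could be used,
# e.g. in an HTTP Header Manager:
#   Authorization: ${__CustomAuthHeader(${customerId},${userId})}
```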

Redistributable Fun

All of the load testing goodness above is encapsulated in a Git repository, as is normal for sane practices.

I like my repositories to stand alone when possible (with the notable exception of taking an external dependency on NuGet.org or a private NuGet feed for dependencies), so I wanted to make sure that someone could pull this repository, and just run the load tests with no issues. It should just work.

To that end, I’ve wrapped the running of JMeter with and without a GUI into 2 Powershell scripts, dealing with things like including the appropriate directory containing custom functions, setting memory usage and locating the JRE and JMeter itself.

In the interests of “it should just work”, I also included the Java 1.8 JRE and the latest version of JMeter in the repository, archived. They are dynamically extracted as necessary whenever the scripts that run JMeter are executed.

In the past I’ve shied away from including binaries in my repositories, because it tends to bloat them and make them take longer to pull down. Typically I would use a NuGet package for a dependency, or if one does not exist, create one. I considered doing this for the Java and JMeter dependencies, but it wasn’t really worth the effort at this stage.

You can find a cut down version of the repository (with a very lightweight jmx file of little to no substance) on GitHub, for your perusal.


Once I got past the initial distaste of a Java UI and then subsequently got past the fairly steep learning curve, I was impressed with what JMeter could accomplish with regards to load testing. It can take a bit of reading to really grok the underlying concepts, but once you do, you can use it to do almost anything. I don’t believe I have the greatest understanding of the software, but I was definitely able to use it to build a good load test that I felt put our service under some serious strain.

Of course, now that I had a working load test, I would need a way to interpret and analyse the results. Also, you can only do so much load testing on a single machine before you hit some sort of physical limit, so additional machines need to get involved to really push the service to its breaking point.

Guess what the topic of my next couple of blog posts will be?