
Logstash is a harsh mistress. Master. Whatever gender you want to apply to it, it’s the sort of application that is so incredibly useful that you’re usually willing to overlook its various flaws and problems, as long as it just keeps shipping those sweet, sweet log events into your aggregator. It doesn’t help that the only other alternative on Windows that I know of is Nxlog.

Don’t get me wrong, Nxlog is a fine, stable, established product. It does what it needs to do and it does it reliably. It’s just that it’s a complete nightmare to configure. Logstash is just so easy to configure, with its nice JSON configuration files and its plugins. I can do just about anything in Logstash, but I struggled to do anything more than the basics with Nxlog.

I’ve blogged a couple of times about Logstash (the most recent being a post about a mysterious memory leak occurring on machines with Logstash installed and running), but the problem I’m about to describe is only really partially the fault of Logstash.

More specifically, it’s Java’s fault.

I’ve Got 99 Issues

Recently, we’ve noticed an annoying trend on the machines acting as log shippers (using Logstash, aggregating mostly ELB logs from S3 buckets). After a certain amount of time, they would just stop shipping logs. When we investigated, we discovered that the disks were full, which pretty quickly puts a stop to any sort of useful activities on Windows.

Our initial suspicions were that the log files themselves (of which there can be quite a lot) were simply not being cleaned up. It wouldn’t be the first time that we’d had a bug in our log file cleanup script and honestly, it probably won’t be the last. An execution of the following simple Powershell script was usually enough to get them up and running again, so we noted it as an issue and moved on to more important things.

Get-ChildItem -Path C:\logs -Recurse -Filter *.log | Where-Object { $_.LastWriteTime -lt (Get-Date).AddDays(-7) } | Remove-Item -Force

But it kept happening.

Eventually we realised that something was pretty seriously wrong (and I got tired of manually running Powershell to restore functionality), so we put some more effort into finding the root cause.

The system disk was definitely filling up, but with what? Because we use headless Windows instances (who needs a UI?), it was a little more difficult than normal to find exactly what was taking up all of the disk (a combination of using the Sysinternals DU tool and using notepad to browse folders with its handy little Windows explorer equivalent), but in the end we found a lot of very large (500MB+) memory dump files, in the root directory of our Logstash installation.

It looked like Logstash was crashing (which it does) and on crash it was dumping all of its memory to a file. Now, we wrap our Logstash installations inside NSSM to turn them into Windows services (for exactly this reason), so NSSM would detect that the process had crashed and restart it. It would keep running for a while, and then crash again, repeating the cycle (a “while” in this context was usually a few days).
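As an aside, the NSSM wrapping itself is dead simple. A rough sketch of what it looks like (the service name, install path and arguments here are illustrative, not our actual values):

# Hypothetical example: wrap the Logstash batch file in a Windows service via NSSM
& nssm install logstash "C:\logstash\bin\logstash.bat" "agent -f C:\logstash\logstash.conf"
& nssm start logstash

NSSM then takes care of restarting the process whenever it exits unexpectedly, which is exactly what we rely on.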

Over time, this meant that the memory dump files were just accumulating on the disk, eventually causing the machine to choke and die.

Fixed! Ship It?

Logstash is Java based, so I did a little digging and found that the most likely cause of .mdmp files was simply the JVM doing its thing when it crashed. In particular, there seemed to be one option which controlled whether or not a memory dump was made whenever the process crashed. Using the options supplied to the JVM, it seemed like it would be relatively easy to turn this option off (assuming that it was on by default). All you need to do is add -XX:-CreateMinidumpOnCrash to the process and everything should be fine.

The process in question was Logstash, and I had already added JVM options before (for proxy settings and memory limits), so I added the new option to the JAVA_OPTS environment variable, which was where I had been putting the other ones. Doing a little reading, I discovered that the documentation suggested using the LS_JAVA_OPTS environment variable instead (because it was additive with the default options), so I switched to that and ran a test Logstash instance locally to check that everything seemed okay.
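For reference, the switch itself is trivial. The sketch below is roughly what it amounts to (setting the variable at machine level is just one way to do it, and the service name is illustrative):

# Make LS_JAVA_OPTS available to the Logstash process
[Environment]::SetEnvironmentVariable("LS_JAVA_OPTS", "-XX:-CreateMinidumpOnCrash", "Machine")

# Restart the wrapped service (illustrative name) so it picks up the new variable
Restart-Service logstash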

Using Sysinternals Process Explorer (procexp) I viewed the process tree of Logstash, showing that Powershell started a batch file, which started JRuby which started Java and so on.

None of my options had been applied to the Java process though…

Dear God Why

On Windows, Logstash runs via a batch file. This batch file in turn calls into another batch file, called setup. Setup is responsible for setting up various things (obviously), one of which is the JVM. Over time, I imagine that many people have experienced various issues with Logstash in a Windows environment, and Logstash being open source, they’ve added intelligent default settings to the JVM so that other people don’t have to feel the same pain.

As I mentioned above, you are supposed to be able to append additional JVM settings to Logstash using the LS_JAVA_OPTS environment variable and completely override the options that Logstash uses by setting the JAVA_OPTS environment variable. Originally I didn’t realise that the JAVA_OPTS variable was supposed to be a complete override (which I didn’t want) so it was good that I had chosen to switch to LS_JAVA_OPTS.

The problem was, the usage of JAVA_OPTS and LS_JAVA_OPTS is not how the Windows version of Logstash works at all.

For starters, the Windows version does nothing with the LS_JAVA_OPTS environment variable. Nothing at all.

Secondly, regardless of what you set JAVA_OPTS to, it will add a bunch of its own JAVA_OPTS (including duplicates of yours, which I assume results in only the latest one being used by the JVM) before it runs Logstash.

By switching to the LS_JAVA_OPTS environment variable, I had completely broken my custom JVM options.

I forked the elastic/logstash repository on GitHub, fixed the setup.bat file (if JAVA_OPTS is set, don’t add any defaults; if LS_JAVA_OPTS is set, add those to the defaults) and created a pull request. Unfortunately, though Onthehouse already had a contributor agreement, it didn’t seem to recognise me, so that particular pull request is sitting in some weird purgatory and I need to do something about it.

For my purposes though, because we wrap Logstash to make a package deployable through Octopus, I just edited the setup.bat that we use and pushed.

Once it was deployed, I checked the settings applied to the JVM and it was correctly showing my setting to not dump memory on a crash. Yay!

Of course, only time will tell whether or not it actually stops dumping memory to disk when it crashes.

Conclusion

The problem that I outlined above is one of those cases where you really do need to understand how something works before you use it in anger.

I had assumed that the behaviour of Logstash between Windows and Linux would be consistent, when in reality that was definitely not the case. It wasn’t until I really started digging into how Logstash starts, specifically on my OS (rather than just using it) that I realised the truth.

It wasn’t even difficult to fix; the functionality was simply not there, probably as a result of someone changing the Linux implementation at some distant point in the past without equivalent changes being made to the Windows scripts. I’m surprised no-one noticed until now (or they might have noticed and just not fixed it in the elastic/logstash repository I suppose), but maybe that just means I’m the only person insane enough to run Logstash on Windows.

I hope not.

There is also the matter of Logstash constantly crashing, which is somewhat concerning, but that’s a task for another day.


As you may (or may not) have noticed, I’ve published exactly zero blog posts over the last 3 weeks.

I was on holidays, and it was glorious.

Well, actually it was filled with more work than I would like (both from the job I was actually on holidays from as well as some other contract work I do for a company called MEDrefer), but it was still nice to be the master of my own destiny for a little while.

Anyway, I’m back now and everything is happening all at once, as these things sort of do.

Three things are going on right now: tutoring at QUT, progress on the RavenDB issue I blogged about, and some work I’m doing towards replacing RavenDB altogether (just in case). I’ll give each of those a brief explanation below. I’ve also been doing some work on running Webdriver IO tests from TeamCity via Powershell (and including the results), as well as fixing an issue with Logstash on Windows where you can’t easily configure it to not do a full memory dump whenever it crashes (and it crashes a lot!).

Without further ado, on with the show!

How Can I Reach These Kids?

It’s that time of the year when I start up my fairly regular Agile Project Management Tutoring gig at QUT (they’ve changed the course code to IAB304 for some ungodly reason this semester, but it’s basically the same thing), so I’ve got that to look forward to. Unfortunately they are still using the DSDM material, but at least it’s changed somewhat to be more closely aligned to Scrum than to some old school project management/agile hybrid.

QUT is also offering sessional academics workshops on how to be a better teacher/tutor, which I plan on attending. There are 4 different workshops being run over the next few months, so I might follow each one with a blog post outlining anything interesting that was covered.

I enjoy tutoring at QUT on multiple levels, even if the bureaucracy there drives me nuts. It gives me an opportunity to really think about what it means to be Agile, which is always a useful thought experiment. Meeting and interacting with people from many diverse backgrounds is also extremely useful for expanding my worldview, and I enjoy helping them understand the concepts and principles in play, and how they benefit both the practitioner and whatever business they are trying to serve.

The Birds is the Word

The guys at Hibernating Rhinos have been really helpful assisting me with getting to the bottom of the most recent RavenDB issue that I was having (a resource consumption issue that was preventing me from upgrading the production servers to RavenDB 3). Usually I would make a full post about the subject, but in this particular case it was mostly them investigating the issue, and me supplying a large number of memory dumps, exported settings, statuses, metrics and various other bits and pieces.

It turns out the issue was in an optimization in RavenDB 3 that caused problems for our particular document/workload profile. I’ve done a better explanation of the issue on the topic I made in the RavenDB Google Group, and Michael Yarichuk (one of the Hibernating Rhinos guys I was working with) has followed that up with even more detail.

I learned quite a few things relating to debugging and otherwise inspecting a running copy of RavenDB, as well as how to properly use the Sysinternals Procdump tool to take memory dumps.

A short summary:

  • RavenDB has stats endpoints which can be hit via a simple HTTP call. {url}/stats and {url}/admin/stats give all sorts of great information, including memory usage and index statistics (there’s a small sketch of polling them after this list).
    • I’ve incorporated a regular poll of these endpoints into my logstash configuration for monitoring our RavenDB instance. It doesn’t exactly insert cleanly into Elasticsearch (too many arrays), but it’s still useful, and allows us to chart various RavenDB statistics through Kibana.
  • RavenDB has config endpoints that show what settings are currently in effect (useful for checking available options and to see if your own setting customizations were applied correctly). The main endpoint is available at {url}/debug/config but there are apparently config endpoints for specific databases as well. We only use the default, system database, and there doesn’t seem to be an endpoint specific to that one.
  • The Sysinternals tool procdump can be configured to take a full memory dump if your process exceeds a certain amount of usage. procdump -ma -m 4000 w3wp.exe C:\temp\IIS.dmp will take a full memory dump (i.e. not just handles) when the w3wp process exceeds 4GB of memory for at least 10 seconds, and put it in the C:\temp directory. It can be configured to take multiple dumps as well, in case you want to track memory growth over time.
    • If you’re trying to get a memory dump of the w3wp process, make sure you turn off pinging for the appropriate application pool, or IIS will detect that it’s frozen and restart it. You can turn off pinging by running the Powershell command Set-ItemProperty "IIS:\AppPools\{application pool}" -name processmodel.pingingEnabled -Value False. Don’t forget to turn it back on when you’re done.
  • Google Drive is probably the easiest way to give specific people over the internet access to large (multiple gigabyte) files. Of course there is also S3 (which is annoying to permission) and FTP/HTTP (which require setting up other stuff), but I definitely found Google Drive the easiest. OneDrive and DropBox would also probably be similarly easy.
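As promised above, here’s a small sketch of what polling those stats endpoints looks like, using Invoke-RestMethod directly rather than our actual logstash configuration (the URL is a placeholder):

# Illustrative only: poll the RavenDB stats endpoints and have a look at what comes back
$ravenUrl = "http://localhost:8080"

$stats = Invoke-RestMethod -Uri "$ravenUrl/stats"
$adminStats = Invoke-RestMethod -Uri "$ravenUrl/admin/stats"

# The exact property names vary between RavenDB versions, so just dump everything to start with
$stats | ConvertTo-Json -Depth 5
$adminStats | ConvertTo-Json -Depth 5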

Once Hibernating Rhinos provides a stable release containing the fix, it means that we are no longer blocked in upgrading our troubled production instance to the latest version of RavenDB, which will hopefully alleviate some of its performance issues.

More to come on this topic as it unfolds.

Quoth The Raven, Nevermore

Finally, I’ve been putting some thought into how we can move away from RavenDB  (or at least experiment with moving away from RavenDB), mostly so that we have a backup plan if the latest version does not in fact fix the performance problems that we’ve been having.

We’ve had a lot of difficulty in simulating the same level and variety of traffic that we see in our production environment (which was one of the reasons why we didn’t pick up any of the issues during our long and involved load testing), so I thought, why not just deploy any experimental persistence providers directly into production and watch how they behave?

It’s not as crazy as it sounds, at least in our case.

Our API instances are hardly utilised at all, so we have plenty of spare CPU to play with in order to explore new solutions.

Our persistence layer is abstracted behind some very basic repository interfaces, so all we would have to do is provide a composite implementation of each repository interface that calls both persistence providers but only returns the response from the one that is not experimental. As long as we log lots of information about the requests being made and how long they took, we can perform all sorts of interesting analysis without ever actually affecting the user experience.
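To make that a little more concrete, here’s a rough sketch of the composite idea. Our actual repositories live in the .NET codebase, so treat this Powershell class as an illustration of the pattern only (all of the names are made up):

# Illustrative sketch of a composite repository that measures an experimental provider
# without ever returning its results to the caller
class CompositeEntityRepository
{
    [object]$Trusted         # the current persistence provider
    [object]$Experimental    # the provider being evaluated

    CompositeEntityRepository([object]$trusted, [object]$experimental)
    {
        $this.Trusted = $trusted
        $this.Experimental = $experimental
    }

    [object] GetById([Guid]$id)
    {
        # Call the experimental provider purely so we can log timings and failures
        try { $null = $this.Experimental.GetById($id) }
        catch { Write-Warning "Experimental provider failed: $_" }

        # Only the trusted provider's response ever goes back to the caller
        return $this.Trusted.GetById($id)
    }
}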

Well, that’s the idea anyway. Whether or not it actually works is a whole different question.

I’ll likely make a followup post when I finish exploring the idea properly.

Summary

As good as my kinda-holidays were, it feels nice to be back in the thick of things, smiting problems and creating value.

I’m particularly looking forward to exploring a replacement for RavenDB in our troublesome service, because while I’m confident that the product itself is solid, it’s not something we’re very familiar with, so we’ll always be struggling to make the most of it. We don’t use it anywhere else (and are not planning on using it again), so it’s stuck in this weird place where we aren’t good at it and we have low desire to get better in the long run.

It was definitely good to finally get to the bottom of why the new and shiny version of RavenDB was misbehaving so badly though, because most of the time when I have a problem with a product like that, I always assume it’s the way I’m using it, not the product itself.

Plus, as a general rule of thumb, I don’t like it when mysteries remain unsolved. It bugs me.

Like why Firefly was cancelled.

Who does that?


I’ve talked at length previously about the usefulness of ensuring that your environments are able to be easily spun up and down. Typically this means that they need to be represented as code and that code should be stored in some sort of Source Control (Git is my personal preference). Obviously this is much easier with AWS (or other cloud providers) than it is with traditionally provisioned infrastructure, but you can at least control configurations and other things when you are close to the iron.

We’ve come a long way on our journey to represent our environments as code, but there has been one hole that’s been nagging me for some time.

Versioning.

Our current environment pattern looks something like this:

  • A repository called X.Environment, where X describes the component the environment is for.
  • A series of Powershell scripts and CloudFormation templates that describe how to construct the environment.
  • A series of TeamCity Build Configurations that allow anyone to Create and Delete named versions of the environment (sometimes there are also Clone and Migrate scripts to allow for copying and updating).

When an environment is created via a TeamCity Build Configuration, the appropriate commit in the repository is tagged with something to give some traceability as to where the environment configuration came from. Unfortunately, the environment itself (typically represented as a CloudFormation stack), is not tagged for the reverse. There is currently no easy way for us to look at an environment and determine exactly the code that created it and, more importantly, how many changes have been made to the underlying description since it was created.
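The tagging itself is nothing fancy, just a named tag pushed back to the origin. Conceptually it amounts to something like this (the tag format is illustrative, not our actual convention):

# Illustrative only: tag the commit an environment was created from, for traceability
$tag = "Environment-$environmentName-$(Get-Date -Format yyyyMMdd-HHmmss)"
& git tag $tag
& git push origin $tag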

Granted, this information is technically available using timestamps and other pieces of data, but piecing it together is a difficult, time-consuming, manual task, so it’s unlikely to be done with any regularity.

All of the TeamCity Build Configurations that I mentioned simply use the HEAD of the repository when they run. There is no concept of using an old Delete script or being able to (easily) spin up an old version of an environment for testing purposes.

The Best Version

The key to solving some of the problems above is to really immerse ourselves in the concept of treating the environment blueprint as code.

When dealing with code, you would never publish raw from a repository, so why would we do that for the environment?

Instead, you compile (if you need to), you test and then you package, creating a compact artefact that represents a validated copy of the code that can be used for whatever purpose you need to use it for (typically deployment). This artefact has some version associated with it (whatever versioning strategy you might use) which is traceable both ways (look at the repo, see the version, find artefact, look at the artefact, see the version, go to repository).

Obviously, for a set of Powershell scripts and CloudFormation templates, there is no real compilation step. There is a testing step though (Powershell tests written using Pester) and there can easily be a packaging step, so we have all of the bits and pieces that we need in order to provide a versioned package, and then use that package whenever we need to perform environment operations.

Versioning Details

As a general rule, I prefer to not encapsulate complicated build and test logic into TeamCity itself. Instead, I much prefer to have a self contained script within the repository, that is then used both within TeamCity and whenever you need to build locally. This typically takes the form of a build.ps1 script file with a number of common inputs, and leverages a number of common tools that I’m not going to go into any depth about. The output of the script is a versioned Nupkg file and some test results (so that TeamCity knows whether or not the build failed).

Adapting our environment repository pattern to build a NuGet package is fairly straightforward (similar to the way in which we handle Logstash, just package up all the files necessary to execute the scripts using a nuspec file). Voila, a self-contained package that can be used at a later date to spin up that particular version of the environment.
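For illustration, the guts of the build script end up looking something like this, greatly simplified and with made up names (the real build.ps1 leans on our common build tooling):

# Simplified sketch of the test-then-package part of a build.ps1 (names and paths are illustrative)
$version = "0.1.$($env:BUILD_NUMBER)"

# Run the Pester tests and fail the build if any of them failed
$testResults = Invoke-Pester -Path .\tests -PassThru
if ($testResults.FailedCount -gt 0) { throw "$($testResults.FailedCount) Pester tests failed" }

# Package everything described in the nuspec into a versioned artefact
& nuget pack .\environment.nuspec -Version $version -OutputDirectory .\build-output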

The only difficult part here was the actual versioning of the environment itself.

Prior to this, when an environment was created it did not have any versioning information attached to it.

The easiest way to attach that information? Introduce a new common CloudFormation template parameter called EnvironmentVersion and make sure that it is populated when an environment is created. The CloudFormation stack is also tagged with the version, for easy lookup.
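In practice that just means the version gets passed through (and applied as a tag) when the stack is created. Using the AWS Powershell cmdlets directly it looks roughly like the following; our New-Environment wrapper obviously does a great deal more, and the variables are assumed to come from earlier in the script:

# Illustrative only: supply the EnvironmentVersion parameter and tag when creating the stack
$parameter = New-Object Amazon.CloudFormation.Model.Parameter
$parameter.ParameterKey = "EnvironmentVersion"
$parameter.ParameterValue = $environmentVersion

$tag = New-Object Amazon.CloudFormation.Model.Tag
$tag.Key = "EnvironmentVersion"
$tag.Value = $environmentVersion

New-CFNStack -StackName $environmentName -TemplateBody $templateContent -Parameter $parameter -Tag $tag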

For backwards compatibility, I made the environment version optional when you execute the New-Environment Powershell cmdlet (which is our wrapper around the AWS CFN tools). If not specified it will default to something that looks like 0.0.YYDDDD.SSSSS, making it very obvious that the version was not specified correctly.

For the proper versioning inside an environment’s source code, I simply reused some code we already had for dealing with AssemblyInfo files. It might not be the best approach, but including an AssemblyInfo file (along with the appropriate Assembly attributes) inside the repository and then reading from that file during environment creation is easy enough and consistency often beats optimal.
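If you were doing it from scratch, pulling the version out of an AssemblyInfo style file is about as simple as this (a rough sketch; our shared code is a little more thorough):

# Rough sketch: read the version out of a SharedAssemblyInfo.cs style file (the path is illustrative)
$assemblyInfo = Get-Content -Path .\SharedAssemblyInfo.cs -Raw
if ($assemblyInfo -match 'AssemblyVersion\("(?<version>[0-9\.]+)"\)')
{
    $environmentVersion = $Matches["version"]
}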

Improving Versioning

What I’ve described above is really a step in part of a larger plan.

I would vastly prefer if the mechanism for controlling what versions of an environment are present and where was delegated to Octopus Deploy, just like with the rest of our deployable components.

With a little bit of extra effort, we should be able to create a release for an appropriately named Octopus project and then push to that project whenever a new version of the environment is available.

This would give excellent visibility into what versions of the environment are where, and also allow us to leverage something I have planned for helping us see just how different the version in environment X is from the version in environment Y.

Ad-hoc environments will still need to be managed via TeamCity, but known environments (like CI, Staging and Production) should be able to be handled within Octopus.

Summary

I much prefer the versioned and packaged approach to environment management that I’ve outlined above. It seems much neater and allows for a lot of traceability and repeatability, something that was lacking when environments were being managed directly from HEAD.

It helps that it looks very similar to the way that we manage our code (both libraries and deployable components) so the pattern is already familiar and understandable.

You can see an example of what a versioned, packageable environment repository would look like here. Keep in mind that the common scripts inside that repository are not usually included directly like that. They are typically downloaded and installed via a bootstrapping process (using a NuGet package), but for this example I had to include them directly so that I didn’t have to bring along the rest of our build pipeline.

Speaking of the common scripts, unfortunately they are a constant reminder of a lack of knowledge about how to create reusable Powershell components. I’m hoping to restructure them into a number of separate modules with greater cohesion, but until then they are a bit unwieldy (just a massive chunk of scripts that are dot-sourced wherever they are needed).

That would probably make a good blog post actually.

How to unpick a mess of Powershell that past you made.

Sometimes I hate past me.


When building a series of services to allow clients to access their own (previously office locked) data over the greater internet, there are a number of considerations to be made.

The old way was simple. There is a database. Stuff is in the database. When you want stuff, access the database. As long as the database in one office was powerful enough for the users in that office, you would be fine.

Moving all of that information into the cloud though…

Now everyone needs to access all their stuff at the same time. Now efficiency and isolation matters.

Well technically it mattered before as well, just not as much to the people who came before me.

I’m going to be talking about two things briefly in this post.

The first is isolating our upload and synchronization process from the actual service that needs to be queried.

The second is isolating binary data from all other requests.

Data Coming Right Up

In order to grant remote access to previously on-premises locked data we need to get that data out somehow. Unfortunately, for this system, the source of truth needs to stay on-premises for a number of different reasons that I won’t go into in too much detail. What we’re focusing on is allowing authenticated read-only access to the data from external systems.

The simplest solution to this is to have a replica of the data available in the cloud, and use that data for all incoming remote requests. Obviously this isn’t perfect (it’s an eventually consistent model), but because it’s read-only and we have some allowance for data latency (i.e. it’s okay if a mobile application doesn’t see exactly what is in the on-premises data the moment that it’s changed), it’s more than good enough for our purposes.

Of course, all of this data constantly being uploaded can cause a considerable amount of strain on the system as a whole, so we need to make sure that if there is a surge in the quantity of synchronization requests that the service responding to queries (get me the last 100 X entities) is not negatively impacted.

Easiest solution? Simply separate the two services, and share the data via a common store of some sort (our initial implementation will have a database).

With this model we gain some protection from load on one side impacting the other.

It’s not perfect mind you, but the early separation gives us a lot of power moving forward if we need to change. For example, we could queue all synchronization requests to the sync service fairly easily, or split the shared database into a master and a number of read replicas. We don’t know if we’ll have a problem or what the solution to that problem will be; the important part is that we’ve isolated the potential danger, allowing for future change without as much effort.

10 Types of People

The system that we are constructing involves a moderate amount of binary data. I say moderate, but in reality, for most people who have large databases on premises, a good percentage of that data is binary data in various forms. Mostly images, but there are a lot of documents of various types as well (ranging from small and efficient PDF files to monstrous Word document abominations with embedded Excel spreadsheets).

Binary data is relatively problematic for a web service.

If you grant access to the binary data from a service, every request ties up one of your possible request handlers (whether it be threads, pseudo-threads or various other mechanisms of parallelism). This leaves less resources available for your other requests (data queries), which can make things difficult in the long run as the total number of binary data requests in flight at any particular moment slowly rises.

If you host the data outside the main service, you have to deal with the complexity of owning something else and making sure that it is secure (raw S3 would be ideal here, but then securing it is a pain).

In our case, our plan is to go with another service, purely for binary data. This allows us to leverage our existing authentication framework (so at least everything is secure), and allows us to use our existing logging tools to track access.

The benefit of isolating access to binary data like this is that if there is a sudden influx of requests for images or documents or something, only that part of the system will be impacted (assuming there are no other shared components). Queries to get normal data will still complete in a timely fashion, and assuming we have written our integration well, some retry strategy will ensure that the binary data is delivered appropriately once the service resumes normal operation.
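The retry part doesn’t need to be complicated either. Something along these lines would be enough (a minimal sketch only; the real logic belongs in the client integration, and all of the names and values here are illustrative):

# Minimal sketch of a retry around a binary data request
function Get-BinaryDataWithRetry
{
    param
    (
        [string]$Uri,
        [string]$OutFile,
        [int]$MaxAttempts = 5
    )

    for ($attempt = 1; $attempt -le $MaxAttempts; $attempt++)
    {
        try
        {
            Invoke-WebRequest -Uri $Uri -OutFile $OutFile
            return
        }
        catch
        {
            if ($attempt -eq $MaxAttempts) { throw }
            # Simple exponential backoff before trying again
            Start-Sleep -Seconds ([math]::Pow(2, $attempt))
        }
    }
}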

Summary

It is important to consider isolation concerns like I have outlined above when you are designing the architecture of a system.

You don’t necessarily have to implement all of your considerations straight away, but you at least need to know where your flex areas are and where you can make changes without having to rewrite the entire thing. Understand how and when your architecture could adapt to potential changes, but don’t build it until you need it.

In our case, we also have a gateway/router sitting in front of everything, so we can remap URLs as we see fit moving into the future. In the case of the designs I’ve outlined above, they come from past (painful) experience. We’ve already encountered those issues in the past, while implementing similar systems, so we decided to go straight to the design that caters for them, rather than implement something we know would have problems down the track.

It’s this sort of learning from your prior experiences that really makes a difference to the viability of an architecture in the long run.


I definitely would not say that I am an expert at load testing web services. At best, I realise how valuable it is for validating your architecture and implementation, helping you get a handle on weak or slow areas and fix them before they can become a problem.

One thing I have definitely learned in the last 12 months however, is just how important it is to make sure that your load profile (i.e. your simulation for how you think your system will be loaded) is as close to reality as possible. If you get this wrong, not only will you not be testing your system properly, you will give yourself a false sense of confidence in how it performs when people are using it. This can lead to some pretty serious disasters when you actually do go live and everything explodes (literally or figuratively, it doesn’t matter).

Putting together a good load profile is a difficult and time consuming task. You need to make assumptions about expected usage patterns, number of users, quality (and quantity) of data and all sorts of other things. While you’re building this profile, it will feel like you aren’t contributing directly to the software being written (there is code to write!), but believe me, a good load profile is worth it when it comes to validating all sorts of things later on. Like a good test suite, it keeps paying dividends in all sorts of unexpected places.

Such a Tool

It would be remiss of me to talk about load tests and load profiles without mentioning at least one of the tools you can use to accomplish them, as there are quite a few out there. In our organisation we use JMeter, mostly because that’s the first one that we really looked into in any depth, but it helps that it seems to be pretty well accepted in the industry, as there is a lot of information already out there to help you when you’re stuck. Extremely flexible, extendable and deployable, it’s an excellent tool (though it does have a fairly steep learning curve, and it’s written in Java, so for a .NET person it can feel a little foreign).
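As an aside, while the GUI is where you build a test plan, the actual load tests are best run from the command line in non-GUI mode (which also makes it easy to kick them off from TeamCity). Assuming the JMeter bin directory is on your path, and with placeholder file names:

# Run a JMeter test plan in non-GUI mode and write the results out for later analysis
& jmeter -n -t .\load-profile.jmx -l .\results.jtl -j .\jmeter.log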

Back to the meat of this post though.

As part of the first major piece of work done shortly after I started, my team completed the implementation of a service for supporting the remote access and editing of data that was previously locked to client sites. I made sure that we had some load tests to validate the behaviour of the service when it was actually being used, as opposed to when it was just kind of sitting there, doing nothing. I think it might have been the first time that our part of the organisation had ever designed and implemented load tests for validating performance, so it wasn’t necessarily the most…perfect, of solutions.

The load tests showed a bunch of issues which we dutifully fixed.

When we went into production though, there were so many more issues than we anticipated, especially related to the underlying persistence store (RavenDB, which I have talked about at length recently).

Of course, the question on everyone’s lips at that point was, why didn’t we see those issues ahead of time? Surely that was what the load tests were meant to catch?

The Missing Pieces

There were a number of reasons why our load tests didn’t catch any of the problems that started occurring in production.

The first was that we were still unfamiliar with JMeter when we wrote those tests. This mostly just limited our ability to simulate complex operations (of which there are a few), and made our profile a bit messier than it should have been. It didn’t necessarily cause the weak load tests, but it certainly didn’t help.

The second reason was that the data model used in the service is not overly easy to use. When I say easy to use, I mean that the objects involved are complex (100+KB of JSON) and thus are difficult to create realistic looking random data for. As a result, we took a number of samples and then used those repeatedly in the load tests, substituting values as appropriate to differentiate users from each other. I would say that the inability to easily create realistic looking fake data was definitely high up there on the list as to why the load tests were ineffective in finding the issues we encountered in production.

The third reason why our load tests didn’t do the job, was the actual load profile itself. The simulation for what sort of calls we expected a single user (where user describes more than just one actual person using the system) to make was just not detailed enough. It did not cover enough of the functionality of the server and definitely did not accurately represent reality. This was unfortunate and unexpected, because we spent a significant amount of time attempting to come up with a profile, and we got agreement from a number of different parties that this profile would be good enough for the purposes of testing. The root cause of this one was simply unfamiliarity with the intended usage of the system.

Finally, and what I think is probably the biggest contributor to the ineffectiveness of the load tests, we simply did not run them for long enough. Each load test we did only went for around 48 hours (at the high end) and was focused around finding immediate and obvious performance problems. A lot of the issues that we had in production did not manifest themselves until we’d been live for a week or two. If we had implemented the load tests sooner, and then started and kept them running on our staging environment for weeks at a time, I imagine that we would have found a lot of the issues that ended up plaguing us.

Conclusion

Of course, there is no point thinking about these sort of things unless you actually make changes the next time you go to do the same sort of task.

So, what did we learn?

  1. Start thinking about the load tests and simulating realistic looking data early. We came into the service I’ve been talking about above pretty late (to clean up someone else’s mess) and we didn’t really get a chance to spend any time on creating realistic looking data. This hurt us when it came time to simulate users.
  2. Think very very hard about your actual load profile. What is a user? What does a user do? Do they do it sequentially or in parallel? Are there other processes going on that might impact performance? Are there things that happen irregularly that you should include in the profile at random? How big is the data? How much does it vary? All of those sorts of questions can be very useful for building a better load profile. Make sure you spend the time to build it properly in whatever tool you are using, such that you can tweak it easily when you go to run it.
  3. Run our load tests early and then for as much time as possible. To us, this means we should run them in an infinite loop on top of our staging environment pretty much as soon as we have them, forever (well, until we’re no longer actively developing that component anyway).

The good thing to come out of the above is that the service we completed did not flop hard enough that we don’t get a second chance. We’re just now developing some other services (to meet similar needs) and we’ve taken all of the lessons above to heart. Our load test profiles are much better and we’ve started incorporating soak tests to pick up issues that only manifest over long periods of time.

At least when it breaks we’ll know sooner, rather than when there are real, paying, customers trying to use it.

I imagine though, that we will probably have to go through this process a few times before we really get a good handle on it.