As I’ve already mentioned, I’ve been doing load testing.
Obviously, the follow-up to executing load tests is to analyse the data that comes out of them.
With our new service, we now log many disparate sources of information into a single location, using Elasticsearch, Logstash and Kibana. As the load tests are executed, streams of data come into our log aggregator, ready to be sliced in various ways to see exactly how the service is operating and to find out where the bottlenecks are.
The service leverages Nxlog to ship all IIS and application logs to the log aggregator. This gave me the ability to track average/max request times, error responses and errors occurring inside the application, in real time, as the load test was being run. A side effect of this is that it will also do the same once the service is made publically available.
IIS logs gave me access to fields like sc-status (the returned response code, i.e. 200, 401 or 500), time-taken (the total elapsed time for the request to be completed), request path (i.e. entity/images/1) and bytes uploaded and downloaded (sc-bytes and cs-bytes), along with many others.
Our application logs mostly just gave me access to whether or not unhandled exceptions had occurred (we use Nancy, and our OnError pipeline simply logs the error out while returning a 500) as well as some information about what the service is doing internally with authentication/authorisation.
For the first run of the load test I chose 50 users with no ramp-up, i.e. they would all start working through the test plan all at once. Each user would execute a series of actions that approximated what usage on the service would be like over a day, compressed into a much smaller time period.
I used Kibana to construct a chart that showed the average and maximum request latency (using the time-taken field from IIS) across all requests over time, and another chart that showed the top 5 status codes, also over time.
I had some initial difficulties with using time-taken for numerical aggregations in Kibana. When I originally set up the Nxlog config that shipped the log entries to our log aggregator, I had neglected to type the time-taken field as an integer (its measured in whole milliseconds), so Elasticsearch had inferred the type of the field as string. As you can imagine, strings can’t participate in numerical aggregations. I still don’t quite understand ES mappings, but I had to change the type of time-taken in Nxlog (by casting/parsing to integer in the Exec blog of my input), delete the index in ES and then it correctly inferred the type as a number. When I looked at the JSON being output by Nxlog, the TimeTaken field values were enclosed in quotes, so its likely that’s why ES inferred them as strings.
My first set of results were informative, but they didn’t exactly point me towards the area that needed improvement.
As you can see from the chart above, latency was fairly terrible, even with only 50 users. Looking at the timeline of the test, latency was okay during the periods that didn’t involve images, rose when images were being uploaded and then rose to unacceptable levels when images were being downloaded.
I had standardised all of the images so that they were approximately 1.2MB, so a service that scales well with traffic should show an increase in latency (to account for the time to physically upload and download the images) but it should be consistent with the size of the images.
Looking at the response codes, everything was mostly fine (if slow) until images started to be downloaded, and then 500’s would start to appear. Investigating the 500’s showed that they were primarily caused by timeouts, indicating that part of the system was under so much load that it could no longer service requests within the timeout period. This lined up with the increasing latency of the service.
The timeouts were coming from our underlying database, RavenDB.
Cogs in the Machine
From the IIS and application logs I knew how the service was acting from the point of view of an external party (mostly). What I didn’t know was how the underlying machines were acting, and I didn’t have any eyes on the database underlying the service at all. My first implementation had only aggregated logs from the API instances.
It was easy enough to customize the Nxlog deployment that I built to ship different configurations to difference machines, so it didn’t take very long to start getting logs from the database machine (IIS).
I also needed machine statistics, to identify why the underlying database was unable to service requests in a reasonable amount of time. Ideally these statistics would be available in the same place as the other events, so that I could make correlations between the data sets.
As far as I can tell, on Windows machines if you want to log statistics about the system itself, you use performance counters. You can track just about anything, but for now I limited myself to CPU Utilization, Free Memory and Free Disk Space. I assumed that the database machine was being sitting at 100% CPU or running out of memory and resorting to virtual memory or something.
I created a project similar to the Nxlog one, except focusing on installing a performance counter based on a configuration file, again, deployed though Octopus. It leveraged the command line tool logman to configure and start the performance counter, and simply focused on outputting system statistics to a known log file. I then altered the Nxlog configuration file for both the API and Database instances to read those statistics in and push them out to ES via Logstash, just like the IIS and application logs.
With system statistics now being available, I improved the Kibana dashboard to show more information. I added charts of CPU utilization, free memory and free disk, and slightly altered my request latency charts. Most of the time the requests with the highest latency were those which involved images. The main reason for this is that it takes time to move data around the internet, and all of my test images were 1.2MB, so the time to transfer the raw data was not trivial. In order to better understand the latency of various requests, I needed to differentiate image requests from non-image requests. At the same time I wanted to distinguish different resources from one another, so I could see which resources were slow. I added some custom code to the Nxlog config to parse out resource information from the path and stored the resource identifier as a field along with all of the other data. I used the same approach to identify the requests that dealt with images, and tagged them as such.
I then used this information to mutate my latency chart into two charts. One shows the latency (average, max) of image requests, broken down by Verb (PUT, POST, GET, DELETE) and the other shows latency for each resource in the system (one chart per resource).
At this point it was helpful to define and save some queries in Kibana that immediately narrowed down the total available data to the appropriate subset for charting. We tag our log messages with an application id (something uniquely identifying the system), component id (like API or DB) and source (like IIS or application). I defined some queries for Application + Component:API + Source:IIS + Images/Not Images, which I then used as the inputs for the latency chart visualizations I mentioned above. Its hard to filter inside the visualizations themselves, so I recommend creating and saving queries for that purpose.
My new monitoring dashboard looked like this:
After running another load test, I could clearly see that the images were the slowest part (as expected), and that when images were being downloaded, the performance of every other resource suffered. Alas, all of the machine statistics looked fine at the time of the high latency. None of the machines were showing a bottleneck in any of the areas that I was monitoring. CPU and Free Memory were fine, and yet the requests were taking upwards of 100 seconds (on average) to complete when image downloads were occurring.
As I could see the timeouts coming from the underlying database in the API logs, I knew that that machine was the issue, I just didn’t know how.
I suspected that the issue was probably disk based (because it was the one thing I wasn’t monitoring) so I added a greater selection of statistics (average disk latency, disk queue length) and re-ran the load tests.
The result was clear. During the period where the load test was downloading images, the latency of the data disk was reaching 800ms per transfer. That's insane! Normal disk latency is sub 30ms.
Finally I had found the obvious bottleneck, but now I had to fix it.
Birds are Weird
Long story short, I tried to improve the performance by making the data disk more awesome (thanks to AWS, this was fairly easy as you can configure the desired amount of IOPS), and making the database machine itself more awesome (again, thanks AWS and its big chunky instance types) but nothing seemed to make much of a difference.
Finally we had to make the hard call and stop storing images in our underlying database. It just didn’t do it well.
We were still using an old version of RavenDB, and were using its attachment feature to store the images. I have to assume we were doing something wrong, because RavenDB is a pretty good product as far as I can see, and I don’t see why it wouldn’t work. I’m honestly not that familiar with RavenDB and the decision to use it was made before I started, so I couldn’t really determine why it wasn’t performing as we expected.
Moving our images to a different storage mechanism was fairly easy, our external API didn’t need to change, and we just moved the image storage into S3.
The next run of the load tests showed average response times under load going down to about 7 seconds, which is about the time it takes to move 1.2MB on our internet connection + some overhead.
The charts still look basically the same, but the numbers are a hell of a lot lower now, so we’ve got that going for us, which is nice.
Always load test. Always.
We had already guessed before we started load testing that the underlying database would probably be a bottleneck, as there was only one of them, and we had issues adding more in parallel during development. We had assumed that any performance issues with the database could be solved by simply making it more awesome in whatever way it was bottlenecked. Unfortunately, that turned out to be a false assumption, and we had to change the internals of the system in order to scale our performance when it came to image storage.
Without load testing and the analysis of its results via Kibana and its lovely charts, we wouldn’t have been able to identify the scaling issue and resolve it, with real numbers to back up our improvements.
I cannot recommend using a log aggregator enough. Literally you should aggregate all the things. Once you have all of that information at your fingertips you can do all sorts of amazing analysis, which lead to greater insights into how your software is actually running (as opposed to how you think its running).