Metaphorically Basing Our Assumptions On Things

May 22. 2018 0 Comments

Its been almost a month since I posted about my terrible ASP.NET CORE MVC website for executing pre-canned business intelligence queries, and thanks to the hard work of other people, its time to throw it all away.

Just as planned.

Honestly, there is nothing more satisfying than throwing away a prototype. I mean, its always hard to let go of something you’ve built, but every time you knock one together there is always that nagging feeling in the corner of your mind about whether it will magically turn into production code without you noticing. It feels good to let it go before that happens.

Anyway, the database with all the juicy customer data in it is available inside Metabase now, and its pretty much what I built, but infinitely better.

That’s So Meta

Assuming you didn’t immediately go and read the link, Metabase is a tool for writing, visualizing and sharing business intelligence.

It literally solves exactly the sort of problems I described in my original post, as:

People who are good at the analysis part can easily get into the system, create new questions, visualize the results (with an assortment of charts) and save the configuration for reuse
People who really want the analysis but are bad at at, can use those questions for their own nefarious purposes, without having the bug the original writer

Interestingly enough, the tool doesn’t even require you to write actual SQL, though it does offer that capability. You can create simple questions by using a fairly straightforward GUI, so there’s room for just about anyone to get in and do some analysis, if they want.

To help with this sort of thing, the tool extracts meta information from the connected databases (like table names, field names, types, etc), and supplies it whenever you’re drafting a question, even in the pure SQL mode. Its not perfect (it pretty much just has a list of keywords that it tries to match when you type things, so it doesn’t look like its contextual), but its still pretty useful.

Obviously it offers all the normal things that you would expect from an intelligence platform as well, including user management (with a solid invite system), permissions, categorization and so on.

And its all open source, so the only cost is setting it up (which, as with all things, is not to be underestimated).

Snapchat

I had very little to do with the initial setup of Metabase at our organization. It was put in place to offer intelligence on the raw data of our cloud product, mostly because people kept querying the production database replica directly.

When I saw it though, I realized how useful it could be sitting on top of the massive wealth of information that we were syncing up from our customers from the legacy product.

Thanks to the excellent efforts of one of my colleagues, it all became a beautiful reality, and I can now start on the long long process of educating people about how to get their own data without bugging me.

To extrapolate on the actual Metabase setup:

Every day we take an RDS snapshot of the legacy cloud database
These snapshots are shared automatically with an operational AWS account (keeping a good production/other stuff separation)
In the mornings, we restore the latest snapshot to a static location, which Metabase has been configured to connect to
During this restore process, we mutate the database a little bit in order to obfuscate user data and do some other tricksy things

The whole thing is executed using Ansible Playbooks, via Jenkins in a fully automated fashion.

The only real limitation is that the data is guaranteed to be up to a day old, but that’s generally not a problem for business intelligence, which is more interested in long term patterns rather than real-time information.

Plus we have our ELK stack for real time stuff anyway.

Optimus Prime

I’ve been analysing our customers data for a while now, and sometimes it can be pretty brutal. Mostly because of the scale of information, but also because the database schema was never intended for intelligence, it was built to serve an application, and its arranged to optimise that path first and foremost.

The good thing is that incorporating the database into Metabase is a great opportunity to pre-calculate some derivations of the data, allowing them to be used to augment other data sets without having to wait for a two minute aggregation to complete.

A good example is entity counts by customer. One particular entity in the database has somewhere in the realm of three million rows. Not a huge number for a database, but if you want to aggregate those entities by the customer (perhaps a sum, maybe sum of sub-classifications as well) then you’re still looking at a decent amount of computation to get it done.

This sort of pre-calculation is a great opportunity to use a materialized view, which is basically just a fancy way to create a table from a query, along with the capability to “refresh” the content of the table at a later date by re-executing the same query. I suppose you could just use a table if you want, and insert rows into it yourself, but honestly, this is exactly the sort of thing materialized views are intended for.

Its easy enough to slot the creation of the view into the process that restores the database snapshot, so there’s really no reason not to do it, especially because it lets you do awesome things like include summary statistics by customer on your dashboard, with a query that loads in milliseconds rather than minutes.

Conclusion

I’m extremely happy to have access to Metabase, both for my own usage (its a pretty great tool, even if I am more comfortable in a raw query editor) and to give everyone else in the organization access to some really useful data.

Sure, we could always do the analysis ad-hoc, using whatever visualization tools we wanted (lets be frank, its probably Excel), but there is something very nice about having a centralized location for this stuff as I don’t have to worry about someone still making business decisions using a query I wrote months ago, which has all these bugs and bad assumptions in it. I can just update the root question and everyone is using the best information available.

At least until I find more bad assumptions of course.

More importantly, because its simplified the whole sharing and re-use process, its motivated me to actually work on delivering some analysis that I’ve been meaning to do for ages, which can only result in a better informed organization in general.

In a weird twist, this is the one place where I’m an eternal optimist, as I don’t want to see any downsides to everyone having access to some subset of our customers data.

Real Meaty Analysis

April 21. 2015 0 Comments

As I’ve already mentioned, I’ve been doing load testing.

Obviously, the follow-up to executing load tests is to analyse the data that comes out of them.

With our new service, we now log many disparate sources of information into a single location, using Elasticsearch, Logstash and Kibana. As the load tests are executed, streams of data come into our log aggregator, ready to be sliced in various ways to see exactly how the service is operating and to find out where the bottlenecks are.

Requesting Assistance

The service leverages Nxlog to ship all IIS and application logs to the log aggregator. This gave me the ability to track average/max request times, error responses and errors occurring inside the application, in real time, as the load test was being run. A side effect of this is that it will also do the same once the service is made publically available.

IIS logs gave me access to fields like sc-status (the returned response code, i.e. 200, 401 or 500), time-taken (the total elapsed time for the request to be completed), request path (i.e. entity/images/1) and bytes uploaded and downloaded (sc-bytes and cs-bytes), along with many others.

Our application logs mostly just gave me access to whether or not unhandled exceptions had occurred (we use Nancy, and our OnError pipeline simply logs the error out while returning a 500) as well as some information about what the service is doing internally with authentication/authorisation.

For the first run of the load test I chose 50 users with no ramp-up, i.e. they would all start working through the test plan all at once. Each user would execute a series of actions that approximated what usage on the service would be like over a day, compressed into a much smaller time period.

I used Kibana to construct a chart that showed the average and maximum request latency (using the time-taken field from IIS) across all requests over time, and another chart that showed the top 5 status codes, also over time.

I had some initial difficulties with using time-taken for numerical aggregations in Kibana. When I originally set up the Nxlog config that shipped the log entries to our log aggregator, I had neglected to type the time-taken field as an integer (its measured in whole milliseconds), so Elasticsearch had inferred the type of the field as string. As you can imagine, strings can’t participate in numerical aggregations. I still don’t quite understand ES mappings, but I had to change the type of time-taken in Nxlog (by casting/parsing to integer in the Exec blog of my input), delete the index in ES and then it correctly inferred the type as a number. When I looked at the JSON being output by Nxlog, the TimeTaken field values were enclosed in quotes, so its likely that’s why ES inferred them as strings.

My first set of results were informative, but they didn’t exactly point me towards the area that needed improvement.

As you can see from the chart above, latency was fairly terrible, even with only 50 users. Looking at the timeline of the test, latency was okay during the periods that didn’t involve images, rose when images were being uploaded and then rose to unacceptable levels when images were being downloaded.

I had standardised all of the images so that they were approximately 1.2MB, so a service that scales well with traffic should show an increase in latency (to account for the time to physically upload and download the images) but it should be consistent with the size of the images.

Looking at the response codes, everything was mostly fine (if slow) until images started to be downloaded, and then 500’s would start to appear. Investigating the 500’s showed that they were primarily caused by timeouts, indicating that part of the system was under so much load that it could no longer service requests within the timeout period. This lined up with the increasing latency of the service.

The timeouts were coming from our underlying database, RavenDB.

Cogs in the Machine

From the IIS and application logs I knew how the service was acting from the point of view of an external party (mostly). What I didn’t know was how the underlying machines were acting, and I didn’t have any eyes on the database underlying the service at all. My first implementation had only aggregated logs from the API instances.

It was easy enough to customize the Nxlog deployment that I built to ship different configurations to difference machines, so it didn’t take very long to start getting logs from the database machine (IIS).

I also needed machine statistics, to identify why the underlying database was unable to service requests in a reasonable amount of time. Ideally these statistics would be available in the same place as the other events, so that I could make correlations between the data sets.

As far as I can tell, on Windows machines if you want to log statistics about the system itself, you use performance counters. You can track just about anything, but for now I limited myself to CPU Utilization, Free Memory and Free Disk Space. I assumed that the database machine was being sitting at 100% CPU or running out of memory and resorting to virtual memory or something.

I created a project similar to the Nxlog one, except focusing on installing a performance counter based on a configuration file, again, deployed though Octopus. It leveraged the command line tool logman to configure and start the performance counter, and simply focused on outputting system statistics to a known log file. I then altered the Nxlog configuration file for both the API and Database instances to read those statistics in and push them out to ES via Logstash, just like the IIS and application logs.

With system statistics now being available, I improved the Kibana dashboard to show more information. I added charts of CPU utilization, free memory and free disk, and slightly altered my request latency charts. Most of the time the requests with the highest latency were those which involved images. The main reason for this is that it takes time to move data around the internet, and all of my test images were 1.2MB, so the time to transfer the raw data was not trivial. In order to better understand the latency of various requests, I needed to differentiate image requests from non-image requests. At the same time I wanted to distinguish different resources from one another, so I could see which resources were slow. I added some custom code to the Nxlog config to parse out resource information from the path and stored the resource identifier as a field along with all of the other data. I used the same approach to identify the requests that dealt with images, and tagged them as such.

I then used this information to mutate my latency chart into two charts. One shows the latency (average, max) of image requests, broken down by Verb (PUT, POST, GET, DELETE) and the other shows latency for each resource in the system (one chart per resource).

At this point it was helpful to define and save some queries in Kibana that immediately narrowed down the total available data to the appropriate subset for charting. We tag our log messages with an application id (something uniquely identifying the system), component id (like API or DB) and source (like IIS or application). I defined some queries for Application + Component:API + Source:IIS + Images/Not Images, which I then used as the inputs for the latency chart visualizations I mentioned above. Its hard to filter inside the visualizations themselves, so I recommend creating and saving queries for that purpose.

My new monitoring dashboard looked like this:

After running another load test, I could clearly see that the images were the slowest part (as expected), and that when images were being downloaded, the performance of every other resource suffered. Alas, all of the machine statistics looked fine at the time of the high latency. None of the machines were showing a bottleneck in any of the areas that I was monitoring. CPU and Free Memory were fine, and yet the requests were taking upwards of 100 seconds (on average) to complete when image downloads were occurring.

Moar Stats

As I could see the timeouts coming from the underlying database in the API logs, I knew that that machine was the issue, I just didn’t know how.

I suspected that the issue was probably disk based (because it was the one thing I wasn’t monitoring) so I added a greater selection of statistics (average disk latency, disk queue length) and re-ran the load tests.

The result was clear. During the period where the load test was downloading images, the latency of the data disk was reaching 800ms per transfer. That's insane! Normal disk latency is sub 30ms.

Finally I had found the obvious bottleneck, but now I had to fix it.

Birds are Weird

Long story short, I tried to improve the performance by making the data disk more awesome (thanks to AWS, this was fairly easy as you can configure the desired amount of IOPS), and making the database machine itself more awesome (again, thanks AWS and its big chunky instance types) but nothing seemed to make much of a difference.

Finally we had to make the hard call and stop storing images in our underlying database. It just didn’t do it well.

We were still using an old version of RavenDB, and were using its attachment feature to store the images. I have to assume we were doing something wrong, because RavenDB is a pretty good product as far as I can see, and I don’t see why it wouldn’t work. I’m honestly not that familiar with RavenDB and the decision to use it was made before I started, so I couldn’t really determine why it wasn’t performing as we expected.

Moving our images to a different storage mechanism was fairly easy, our external API didn’t need to change, and we just moved the image storage into S3.

The next run of the load tests showed average response times under load going down to about 7 seconds, which is about the time it takes to move 1.2MB on our internet connection + some overhead.

The charts still look basically the same, but the numbers are a hell of a lot lower now, so we’ve got that going for us, which is nice.

Summary

Always load test. Always.

We had already guessed before we started load testing that the underlying database would probably be a bottleneck, as there was only one of them, and we had issues adding more in parallel during development. We had assumed that any performance issues with the database could be solved by simply making it more awesome in whatever way it was bottlenecked. Unfortunately, that turned out to be a false assumption, and we had to change the internals of the system in order to scale our performance when it came to image storage.

Without load testing and the analysis of its results via Kibana and its lovely charts, we wouldn’t have been able to identify the scaling issue and resolve it, with real numbers to back up our improvements.

I cannot recommend using a log aggregator enough. Literally you should aggregate all the things. Once you have all of that information at your fingertips you can do all sorts of amazing analysis, which lead to greater insights into how your software is actually running (as opposed to how you think its running).