
It's been almost a month since I posted about my terrible ASP.NET Core MVC website for executing pre-canned business intelligence queries, and thanks to the hard work of other people, it's time to throw it all away.

Just as planned.

Honestly, there is nothing more satisfying than throwing away a prototype. I mean, it's always hard to let go of something you've built, but every time you knock one together there is always that nagging feeling in the corner of your mind about whether it will magically turn into production code without you noticing. It feels good to let it go before that happens.

Anyway, the database with all the juicy customer data in it is available inside Metabase now, and it's pretty much what I built, but infinitely better.

That’s So Meta

Assuming you didn’t immediately go and read the link, Metabase is a tool for writing, visualizing and sharing business intelligence.

It literally solves exactly the sort of problems I described in my original post, as:

  1. People who are good at the analysis part can easily get into the system, create new questions, visualize the results (with an assortment of charts) and save the configuration for reuse
  2. People who really want the analysis but are bad at it can use those questions for their own nefarious purposes, without having to bug the original writer

Interestingly enough, the tool doesn’t even require you to write actual SQL, though it does offer that capability. You can create simple questions by using a fairly straightforward GUI, so there’s room for just about anyone to get in and do some analysis, if they want.

To help with this sort of thing, the tool extracts meta information from the connected databases (like table names, field names, types, etc.) and supplies it whenever you're drafting a question, even in the pure SQL mode. It's not perfect (it pretty much just has a list of keywords that it tries to match as you type, so it doesn't look like it's contextual), but it's still pretty useful.

Obviously it offers all the normal things that you would expect from an intelligence platform as well, including user management (with a solid invite system), permissions, categorization and so on.

And it's all open source, so the only cost is setting it up (which, as with all things, is not to be underestimated).

Snapchat

I had very little to do with the initial setup of Metabase at our organization. It was put in place to offer intelligence on the raw data of our cloud product, mostly because people kept querying the production database replica directly.

When I saw it though, I realized how useful it could be sitting on top of the massive wealth of information that we were syncing up from our customers from the legacy product.

Thanks to the excellent efforts of one of my colleagues, it all became a beautiful reality, and I can now start on the long, long process of educating people about how to get their own data without bugging me.

To elaborate on the actual Metabase setup:

  • Every day we take an RDS snapshot of the legacy cloud database
  • These snapshots are shared automatically with an operational AWS account (keeping a clean separation between production and everything else)
  • In the mornings, we restore the latest snapshot to a static location, which Metabase has been configured to connect to
  • During this restore process, we mutate the database a little bit in order to obfuscate user data and do some other tricksy things

The whole thing is executed using Ansible Playbooks, via Jenkins in a fully automated fashion.
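
To make the obfuscation step a little more concrete, it might look something like the following. This is only a sketch assuming a PostgreSQL database behind the RDS snapshot; the table and column names are invented for illustration and are not the actual schema.

    -- Hypothetical example only: anonymise personally identifiable fields in the
    -- freshly restored copy, before Metabase is pointed at it.
    -- Table and column names are invented, not the real schema.
    UPDATE users
    SET    email      = 'user_' || id || '@example.com',
           first_name = 'Redacted',
           last_name  = 'Redacted',
           phone      = NULL;

Because the whole database is rebuilt from the snapshot every day, this kind of mutation only ever touches the disposable copy, never the production replica.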

The only real limitation is that the data is guaranteed to be up to a day old, but that’s generally not a problem for business intelligence, which is more interested in long term patterns rather than real-time information.

Plus we have our ELK stack for real time stuff anyway.

Optimus Prime

I've been analysing our customers' data for a while now, and sometimes it can be pretty brutal. Mostly because of the scale of information, but also because the database schema was never intended for intelligence; it was built to serve an application, and it's arranged to optimise that path first and foremost.

The good thing is that incorporating the database into Metabase is a great opportunity to pre-calculate some derivations of the data, allowing them to be used to augment other data sets without having to wait for a two-minute aggregation to complete.

A good example is entity counts by customer. One particular entity in the database has somewhere in the realm of three million rows. Not a huge number for a database, but if you want to aggregate those entities by customer (perhaps a straight count, maybe counts broken down by sub-classification as well) then you're still looking at a decent amount of computation to get it done.

This sort of pre-calculation is a great opportunity to use a materialized view, which is basically just a fancy way to create a table from a query, along with the capability to “refresh” the content of the table at a later date by re-executing the same query. I suppose you could just use a table if you want, and insert rows into it yourself, but honestly, this is exactly the sort of thing materialized views are intended for.
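
As a rough sketch of what that pre-calculation could look like (assuming PostgreSQL materialized view syntax, with invented table and column names rather than the real schema):

    -- Hypothetical pre-calculation: entity counts per customer, broken down by a
    -- sub-classification, so dashboards don't re-aggregate millions of rows.
    CREATE MATERIALIZED VIEW customer_entity_counts AS
    SELECT customer_id,
           entity_type,
           COUNT(*) AS entity_count
    FROM   entities
    GROUP  BY customer_id, entity_type;

    -- If the underlying data changes, re-run the stored query to update the view.
    REFRESH MATERIALIZED VIEW customer_entity_counts;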

It's easy enough to slot the creation of the view into the process that restores the database snapshot, so there's really no reason not to do it, especially because it lets you do awesome things like include summary statistics by customer on your dashboard, with a query that loads in milliseconds rather than minutes.

Conclusion

I'm extremely happy to have access to Metabase, both for my own usage (it's a pretty great tool, even if I am more comfortable in a raw query editor) and to give everyone else in the organization access to some really useful data.

Sure, we could always do the analysis ad-hoc, using whatever visualization tools we wanted (let's be frank, it's probably Excel), but there is something very nice about having a centralized location for this stuff: I don't have to worry about someone still making business decisions using a query I wrote months ago, with all its bugs and bad assumptions. I can just update the root question and everyone is using the best information available.

At least until I find more bad assumptions, of course.

More importantly, because it's simplified the whole sharing and re-use process, it's motivated me to actually work on delivering some analysis that I've been meaning to do for ages, which can only result in a better-informed organization in general.

In a weird twist, this is the one place where I'm an eternal optimist, as I don't want to see any downsides to everyone having access to some subset of our customers' data.