
In previous posts, I’ve briefly mentioned that one of the most valuable outcomes of the data synchronization algorithm was the capability to gather meaningful business intelligence about our users. This includes information like how often they use the software, which parts they use and how many of its entities they create (and delete) during their normal daily process.

When used responsibly, this sort of intelligence is invaluable in helping to create a better, more valuable experience for everyone involved. Literally every part of your standard organization (support, sales, development, marketing, finance) can make use of the data in some way.

Of course, when it’s all sequestered inside a PostgreSQL database, the barrier to entry can be quite high. Generally this just means that the people who are best positioned to make use of the data either don’t get access to it (a waste), or they end up constantly bugging the small group of people with the capability to do the analysis.

When I’m a member of that small group, I suddenly get motivated to find an alternate way to give them what they need.

Re-education Campaign

There are really two kinds of analysis that tend to happen for business intelligence purposes.

The first is answering ad-hoc questions. Typically these sorts of questions have not been asked before, or they are a slight variation on the same theme as a previous question.

Regardless, these generally require someone who actually knows the data and has the skills to wrangle it in order to get to the meat of the issue and provide the answer. This can be taught (with some effort), but not everyone has the time to dedicate to that sort of thing, so new analysis will probably always have to be performed by trained professionals.

The second type of analysis is re-answering a previously answered question, now that time has passed.

This tends to occur quite frequently, especially when it comes to metrics (like “how many customers have X”, “how many Y does each customer have”, etc.), and it is here that there is an opportunity to optimise.

In the long term, there should be an easy platform for people to get at the information they want, without having to understand how to run SQL scripts (or Powershell). This is likely going to come in the form of Metabase, but unfortunately this is probably still a little ways off in the future, and people are asking for data now.

So in the short term, some way to easily run pre-canned queries and get the results as a CSV is desirable.

Copy That

The good news is that PostgreSQL makes it pretty easy to run a query and get CSV output using the COPY command:

COPY ({query}) TO STDOUT WITH DELIMITER ',' CSV HEADER

Executing that command will result in a stream containing the data specified in the query (along with headers), in CSV format.
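
For example, assuming a hypothetical customers table (the table and columns here are made up purely for illustration), the full command would look something like this:

COPY (SELECT id, name, created_at FROM customers) TO STDOUT WITH DELIMITER ',' CSV HEADER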

Of course, it’s kind of useless without a mechanism to easily run it, so if you’re using .NET, you can leverage Npgsql to execute the command using the BeginTextExport function on an NpgsqlConnection, like in the following class:

using System.IO;

namespace Solavirum.Exporting
{
    public class PostgresTableStreamer
    {
        private readonly INpgsqlConnectionFactory _npgsqlConnectionFactory;

        public PostgresTableStreamer(INpgsqlConnectionFactory npgsqlConnectionFactory)
        {
            _npgsqlConnectionFactory = npgsqlConnectionFactory;
        }

        public long Stream(string copySql, StreamWriter output)
        {
            var numberOfLinesInFile = 0;
            using (var conn = _npgsqlConnectionFactory.Create())
            {
                conn.Open();
                using (var reader = conn.BeginTextExport(copySql))
                {
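                    // Read the export one character at a time so that newlines inside quoted fields can be escaped as they are encountered.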
                    var i = reader.Read();
                    var insideQuotedField = false;

                    while (i != -1)
                    {
                        var c = (char) i;
                        if (c == '"') insideQuotedField = !insideQuotedField;
                        switch (c)
                        {
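                            // Carriage returns are dropped entirely, so the output only ever uses \n line endings.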
                            case '\r':
                                i = reader.Read();
                                continue;
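                            // Newlines inside a quoted field are data, so escape them; otherwise write them out and count the row.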
                            case '\n':
                                if (insideQuotedField) output.Write("\\n");
                                else
                                {
                                    output.Write(c);
                                    numberOfLinesInFile++;
                                }
                                break;
                            default:
                                output.Write(c);
                                break;
                        }
                        i = reader.Read();
                    }
                    output.Flush();
                }
                conn.Close();
            }
            return numberOfLinesInFile;
        }
    }
}

The only tricksy thing is escaping new line characters so that they don’t explode the formatting of the resulting file.
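
To give some context, here is a rough sketch of how the class might be used. The post doesn’t include INpgsqlConnectionFactory, so the factory below is an assumption (a thin wrapper around a connection string), and the connection details and query are made up:

using System.IO;
using Npgsql;

namespace Solavirum.Exporting
{
    // Assumed shape of the connection factory; the real implementation is not shown in this post.
    public interface INpgsqlConnectionFactory
    {
        NpgsqlConnection Create();
    }

    public class NpgsqlConnectionFactory : INpgsqlConnectionFactory
    {
        private readonly string _connectionString;

        public NpgsqlConnectionFactory(string connectionString)
        {
            _connectionString = connectionString;
        }

        public NpgsqlConnection Create()
        {
            return new NpgsqlConnection(_connectionString);
        }
    }

    public class ExportExample
    {
        public static void Run()
        {
            var factory = new NpgsqlConnectionFactory("Host=localhost;Database=example;Username=example;Password=example");
            var streamer = new PostgresTableStreamer(factory);

            // Stream the results of an illustrative query straight into a local CSV file.
            using (var output = new StreamWriter(File.Open("customers.csv", FileMode.Create)))
            {
                streamer.Stream("COPY (SELECT id, name, created_at FROM customers) TO STDOUT WITH DELIMITER ',' CSV HEADER", output);
            }
        }
    }
}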

A Model View

Of course, a C# class is just as useless to a non-technical person as an SQL query, so now we have to think about delivery.

The easiest way to give people access to some piece of functionality is typically a website, so that seems like a good approach.

My skills are somewhat limited when it comes to putting together a website in a hurry. Our previous websites have been built with React.js, but they need a bit of a pipeline to get up and running, and I didn’t want to invest that much effort in this prototype.

Eventually, I settled on an ASP.NET Core MVC website, deployed manually to a single Linux instance in AWS.

Nothing too complicated, just a single controller that lists the queries that are available (which are deployed with the code, no custom queries here), and then a way to run a single query and get the result as a file.
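
To give a feel for the shape of it, here is a minimal sketch of that controller. The actual implementation isn’t included in this post, so the controller name, and the idea of the pre-canned queries being injected as a simple dictionary of name to SQL, are assumptions:

using System.Collections.Generic;
using System.IO;
using Microsoft.AspNetCore.Mvc;

namespace Solavirum.Exporting.Web.Controllers
{
    public class QueriesController : Controller
    {
        // Pre-canned queries keyed by a human readable name, deployed alongside the code.
        private readonly IReadOnlyDictionary<string, string> _queries;
        private readonly PostgresTableStreamer _streamer;

        public QueriesController(IReadOnlyDictionary<string, string> queries, PostgresTableStreamer streamer)
        {
            _queries = queries;
            _streamer = streamer;
        }

        // Lists the names of the available queries.
        public IActionResult Index()
        {
            return View(_queries.Keys);
        }

        // Runs a single named query and returns the result as a CSV file download.
        public IActionResult Run(string name)
        {
            if (!_queries.TryGetValue(name, out var query)) return NotFound();

            var copySql = $"COPY ({query}) TO STDOUT WITH DELIMITER ',' CSV HEADER";

            // Buffer the CSV in memory; fine for a prototype, not ideal for huge exports.
            var buffer = new MemoryStream();
            var writer = new StreamWriter(buffer);
            _streamer.Stream(copySql, writer);
            buffer.Position = 0;

            return File(buffer, "text/csv", $"{name}.csv");
        }
    }
}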

Getting ASP.NET Core MVC up and running on a Linux box is pretty straightforward, assuming you use a local Nginx instance as a reverse proxy (which is easy).

As always, because we’re not stupid, a little touch of security is also necessary, so:

  • The box is behind a load balancer, to allow for easy HTTPS.
  • The load balancer is only accessible from the static IP at our main office. If you’re out of the office you can still use the service, but you need to be on the VPN, which is required for a bunch of our other things anyway.

If the website lives for long enough, I’ll probably augment it with Active Directory logins, so people can just use their domain credentials, but short term, it’s secure enough.

Conclusion

That’s basically it.

Start to finish, it all took about 4 hours, and while I wouldn’t say it was my best work, it’s good enough to solve the immediate problem. At least until we get Metabase up and running and we throw the whole thing out anyway.

I’ll probably clean it up a bit (the manual deployment rankled me more than I wanted to admit) and then wash my hands of the whole thing.

More importantly, the number of people asking me to “please run query X and email me the results” has dropped dramatically, so I’ll consider this a success.

Now I have more time for the really important things.

Like participating in the Mario Kart Tournament.