For anyone following the saga of my adventures with RavenDB, good news! It turns out it’s much easier to run a RavenDB server on reasonable hardware when you just put less data in it. I cleaned out the massive chunk of abandoned data over the last few weeks and everything is running much better now.

That’s not what this blog post is about though, which I’m sure is disappointing at a fundamental level.

This post is a quick one about some of the fun times that we had setting up access to customer data for the business to use for analysis purposes. What data? Well, let’s go back a few steps to set the stage.

Our long-term strategy has been to free customer data stuck in on-premises databases so that the customer can easily use it remotely, in mobile applications and webpages, without having to have physical access to their database server (i.e. be in the office, or connected to their office network). This sort of strategy benefits both parties, because the customer gains access to new, useful services for when they are on the move (very important for real estate agents) and we get to develop new and exciting tools and services that leverage their data. Win-win.

Of course, everything we do involving this particular product is a stopgap until we migrate those customers to a completely cloud-based offering, but that’s a while away yet, and we need to stay competitive in the meantime.

As part of the creation of one of our most recent services, we consolidated the customer data into a single location in the cloud, and so now we have a multi-tenant database in AWS that contains a subset of all of the data produced by our customers. We built the database to act as the backend for a system that allows the customer to easily view their data remotely (read-only access), but the wealth of information available in that repository piqued the interest of the rest of the business, mostly around using it to calculate statistics and comparison points across our entire customer base.

Now, as a rule of thumb, I’m not going to give anyone access to a production database in order to perform arbitrary, ad-hoc queries, no matter how hard they yell at me. There are a number of concerns that lead towards this mindset, but the most important one is that the database has been optimized to work best for the applications that run on it. It is not written with generic, ad-hoc intelligence queries in mind, and any such queries could potentially have an impact on the operation of the database for its primary purpose. The last thing I want is for someone to decide they want to calculate some heavy statistics over all of the data present, tying up resources that are necessary to answer queries that customers are actually asking. Maintaining quality of service is critically important.

However, the business desire is reasonable and real value could be delivered to the customer with any intelligence gathered.

So what were we to do?

Stop Copying Me

The good thing about working with AWS is that someone, somewhere has probably already tried to do what you’re trying to do, and if you’re really lucky, Amazon has already built in features to make doing the thing easy.

Such was the case with us.

An RDS read-replica neatly resolves all of my concerns. The data will be asynchronously copied from the master to the replica, allowing business intelligence queries to be performed with wild abandon without having to be concerned about affecting the customer experience. You do have to be aware of the eventually consistent nature of the replica, but that’s not as important when the queries being run aren’t likely to be time-critical. Read-replicas can even be made publicly accessible (without affecting the master), allowing you to provision access to them without requiring a VPN connection or something similarly complicated.

Of course, if it was that easy, I wouldn’t have written a blog post about it.

Actually creating a read-replica is easy. We use CloudFormation to initialise our AWS resources, so it’s a fairly simple matter to extend our existing template with another resource describing the replica. You can easily specify different security groups for the replica, so we can lock it down to be publicly resolvable but only accessible from approved IP addresses without too much trouble (you’ll have to provision a security group with the appropriate rules to allow traffic from your authorised IP addresses, either as part of the template, or as a parameter injected into the template).
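To give a sense of the shape of it, the replica resource looks something like the following sketch (in the YAML flavour of CloudFormation; the logical names, instance class and security group are illustrative, not lifted from our actual template):

ReadReplica:
  Type: AWS::RDS::DBInstance
  Properties:
    # Pointing at the master is what makes this instance a read-replica
    SourceDBInstanceIdentifier: !Ref MasterDatabase
    DBInstanceClass: db.m3.medium
    # Give the replica a public IP address so it can be reached from outside the VPC
    PubliclyAccessible: true
    VPCSecurityGroups:
      # Security group that only allows traffic from approved IP addresses
      - !Ref AnalysisSecurityGroup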

There are some tricks and traps though.

If you want to mark a replica as publicly accessible (i.e. it gets a public IP address) you need to make sure you have DNS Resolution and DNS Hostnames enabled on the host VPC. Not a big deal to be honest, but I think DNS Hostnames default to Off, so it’s something to watch out for. CloudFormation gives a nice error message in this case, so it’s easy to tell what to do.
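For reference, those settings live on the VPC resource itself, something along these lines (again, just a sketch with illustrative names):

Vpc:
  Type: AWS::EC2::VPC
  Properties:
    CidrBlock: 10.0.0.0/16
    # DNS Resolution
    EnableDnsSupport: true
    # DNS Hostnames (I believe this one defaults to off for custom VPCs)
    EnableDnsHostnames: true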

What’s not so easy is that if you have the standard public/private split of subnets (where a public subnet routes all outbound traffic through the internet gateway and a private subnet either routes through a NAT or not at all) you must make sure to put your replica in the public subnets. I think this applies to any instance that is going to be given a public IP address. If you don’t do this, no traffic will be able to escape from the replica, because the route table will try to push it through the NAT on the way out. This complicates things with the master RDS instance as well, because both replica and master must share the same subnet group, so the master must be placed in the public subnets too.
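In CloudFormation terms, that means the DB subnet group shared by the master and the replica has to reference the public subnets, roughly like this (subnet names are illustrative):

DatabaseSubnetGroup:
  Type: AWS::RDS::DBSubnetGroup
  Properties:
    DBSubnetGroupDescription: Public subnets shared by the master and its read-replica
    SubnetIds:
      # Both subnets route outbound traffic through the internet gateway
      - !Ref PublicSubnetA
      - !Ref PublicSubnetB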

With all the CloudFormation/AWS/RDS chicanery out of the way, you still need to manage access to the replica using the standard PostgreSQL mechanisms.

The Devil Is In The Details

The good thing about PostgreSQL read-replicas is that they don’t allow any changes at all, even if you connect with the master account. They are fundamentally read-only, which is fantastic.

There was no way that I was going to publicise the master password for the production RDS instance though, so I wanted to create a special user just for the rest of the business to access the replica at will, with as few permissions as possible.

Because of the aforementioned read-only nature of the replica, you have to create the user inside the master instance, which will then propagate it across to the replica in time. When it comes to actually managing permissions for users in the PostgreSQL database though, it’s a little bit different to the RDBMS that I’m most familiar with, SQL Server. I don’t think it’s better or worse, it’s just different.

A PostgreSQL server hosts many databases, and each database hosts many schemas. Users, however, exist at the server level, so in order to manage access, you need to grant the user access to the databases, the schemas, and then the tables (and sequences) inside those schemas that you want them to be able to use.
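Connected as the master user, provisioning the user itself looks something like this (placeholders instead of real names, obviously, and the GRANT CONNECT is arguably redundant because PUBLIC can connect to databases by default, but I prefer being explicit):

CREATE USER {username} WITH PASSWORD '{password}';

GRANT CONNECT ON DATABASE {database} TO {username};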

At the time our RDS instance is initialised there are no databases, so we had to do this after the fact. We could provision the user and give it login rights and the ability to list databases, but it couldn’t select anything from tables until we granted it access to those tables using the master user.

GRANT USAGE ON SCHEMA {schema} TO {username};

GRANT SELECT ON ALL TABLES IN SCHEMA {schema} TO {username};

GRANT SELECT ON ALL SEQUENCES IN SCHEMA {schema} TO {username};

Granting access once is not enough though, because any additional tables created after the statement is executed will not be accessible. To fix that you have to alter the default privileges of the schema, granting the appropriate permissions for the user you are interested in.

ALTER DEFAULT PRIVILEGES IN SCHEMA {schema}
    GRANT SELECT ON TABLES TO {username};
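One caveat worth noting: as far as I can tell from the PostgreSQL documentation, ALTER DEFAULT PRIVILEGES only affects objects created by the role that runs the statement (unless you use the FOR ROLE clause), so it needs to be executed by whatever role actually creates the tables. There is an equivalent statement for sequences as well, if you need it:

ALTER DEFAULT PRIVILEGES IN SCHEMA {schema}
    GRANT SELECT ON SEQUENCES TO {username};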

With all of that out of the way, we had our replica.

Conclusion

Thanks to AWS, creating and managing a read-replica is a relatively painless procedure. There are some tricks and traps along the way, but they are very much surmountable. It’s nice to be able to separate our concerns cleanly, and to have support for doing that at the infrastructure level.

I shudder to think how complicated something like this would have been to setup manually.

I really do hope AWS never goes full evil and decides to just triple or quadruple their prices though, because it would take months to years to replicate some of the things we’re doing in AWS now.

We’d probably just be screwed.