
Last week I described the recurrence of an issue from late last year with our data synchronization process.

Basically, read IOPS on the underlying database were much higher than expected, causing performance issues in AWS when the database volume ran out of burst balance (or IO credits, as they are sometimes called).

After identifying that the most recent problem had been caused by the addition of tables, we disabled those tables, restoring performance and stability.

Obviously we couldn’t just leave them turned off though, and we couldn’t throw money at the problem this time, like we did last time. It was time to dig in and fix the problem properly.

But first we’d need to reproduce it.

But The Read Performance Issue In My Store

The good thing about ensuring that your environment is codified is that you can always spin up another one that looks just like the existing one.

Well, theoretically anyway.

In the case of the sync API environment everything was fine. One script execution later and we had a new environment called “sync performance” with the same number of identically sized API instances running the exact same code as production.

The database was a little trickier unfortunately.

You see, the database environment was from the time before I improved our environment deployment process. This meant that it was easy to make one, but hard to update an existing one.

Unfortunately, it was hard enough to update an existing one that the most efficient course of action had been to simply tweak the live environment directly each time a change was needed, so it had diverged from the source code.

First step? Get those two back in sync.

Second step, spin up a new one that looks just like prod, which needed to include a copy of the prod data. Luckily, RDS makes that easy with snapshots.

With a fully functional, prod-like data synchronization environment up and running, all we needed was traffic.

Good thing Gor exists. We still had a deployable Gor component from last time I wanted to replicate traffic, so all we needed to do was make a new Octopus project, configure it appropriately and deploy it to our production API.

Now we had two (mostly) identical environments processing the same traffic, behaving pretty much the same. Because we’d turned off multiple tables in order to stop the high read IOPS, it was a simple matter to turn one back on, causing a less severe version of the issue to reoccur in both environments (higher than normal read IOPS, but not enough to eat burst balance).

With that in place we were free to investigate and experiment.

Is Still Abhorred

I’m going to cut ahead in the timeline here, but we analysed the behaviour of the test environment for a few days, trying to get a handle on what was going on.

Leveraging some of the inbuilt query statistics in PostgreSQL, it looked like the most frequent and costly type of query was related to getting a “version” of the remote table for synchronization orchestration. The second most costly type of query was related to getting a “manifest” of a subset of the table being synchronized for the differencing engine.
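
The post doesn’t show the exact statistics we pulled, but pg_stat_statements is the usual tool for this sort of analysis. A minimal sketch, assuming that extension is enabled on the instance (column names here are the pre-PostgreSQL 13 ones, where total_time is in milliseconds):

-- A sketch only, assuming the pg_stat_statements extension is enabled.
SELECT query,
       calls,
       total_time,
       total_time / calls AS mean_ms,
       shared_blks_read   -- blocks read from disk; a rough proxy for read IOPS
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;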

Disabling those parts of the API (but leaving the uploads alone) dropped the IOPS significantly, surprising exactly zero people. This did disagree with our hypothesis from last time though, so that was interesting.

Of course, the API is pretty useless without the ability to inform the synchronization process, so it was optimization time.

  • We could try to reduce the total number of calls, reducing the frequency that those queries are executed. We’d already done some work recently to dramatically reduce the total number of calls to the API from each synchronization process though, so it was unlikely we would be able to get any wins here
  • We could implement a cache in front of the API, but this just complicates things and all it will really result in is doing work repeatedly for no benefit (if the process syncs data then asks the API for a status, and gets the cached response, it will just sync the data again)
  • We could reduce the frequency of syncing. Since we already did the work I mentioned above to reduce overall calls, the potential gains here were small
  • We could try to make the queries more efficient. The problem here was that the queries were already using the primary keys of the tables in question, so I’m not entirely sure that any database level optimizations on those tables would have helped
  • We could make getting an answer to the question “give me the remote version of this table” more efficient by using a dedicated data structure to service those requests, basically a fancy database level cache

We prototyped the last option (basically a special cache within the database that contained table versions in a much easier to query format) and it had a positive effect on the overall read IOPS.
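
The post doesn’t include the prototype itself, but the idea amounts to maintaining a tiny, trigger-updated table keyed by customer and table name, so that answering “what version is this table at” becomes a single primary key lookup instead of an aggregate over the synchronized table. A rough sketch, with entirely hypothetical table and column names (ON CONFLICT needs PostgreSQL 9.5 or later):

-- A rough sketch only; table and column names are hypothetical.
CREATE TABLE table_versions (
    customer_id integer     NOT NULL,
    table_name  text        NOT NULL,
    version     bigint      NOT NULL,
    updated_at  timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (customer_id, table_name)
);

CREATE OR REPLACE FUNCTION bump_table_version() RETURNS trigger AS $$
BEGIN
    INSERT INTO table_versions (customer_id, table_name, version)
    VALUES (NEW.customer_id, TG_TABLE_NAME, 1)
    ON CONFLICT (customer_id, table_name)
    DO UPDATE SET version = table_versions.version + 1, updated_at = now();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Attached per synchronized table (deletes would need similar handling using OLD).
CREATE TRIGGER bump_version_on_change
AFTER INSERT OR UPDATE ON some_synchronized_table
FOR EACH ROW EXECUTE PROCEDURE bump_table_version();

-- The orchestration query then becomes a primary key lookup:
SELECT version FROM table_versions WHERE customer_id = 123 AND table_name = 'some_synchronized_table';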

But it didn’t get rid of it entirely.

Within The Sound Of Synchronization

Looking into our traffic, we discovered that our baseline traffic had crept up since we’d implemented the skipping strategy in the sync process. Most of that baseline traffic appeared to be requests relating to the differencing engine (i.e. scanning the table to get primary key manifests for comparison purposes), which was one of the expensive types of query that we identified above.

We’d made some changes to the algorithm to incorporate the ability to put a cap on the number of skips we did (for safety, to avoid de-sync edge cases) and to introduce forced skips for tables whose changes we were happy to only sync a few times a day.

A side effect of these changes was that whenever we decided NOT to skip using the local comparison, the most common result of the subsequent local vs remote comparison was choosing to execute the differencing engine. The algorithm should have been choosing to do nothing when the local and remote versions were identical, but that branch was never being reached because of the way the skip resolution had been implemented.

Fixing the bug and deploying it caused the read IOPS on our normal production server to drop a little bit, which was good.

The different pattern of traffic plus our prototype table version cache caused a much more dramatic drop in read IOPS in our test environment though. The two things acting together reduced the demands on the database enough to prevent it from having to read so much all the time.

Conclusion

We’re still working on a production quality cached table version implementation, but I am cautiously optimistic. There are some tricky bits regarding the cache (like invalidation vs updates, and where that sort of decision is made), so we’ve got a bit of work ahead of us though.

At this point I’m pretty thankful that we were easily able to both spin up an entirely separate and self contained environment for testing purposes, and that we were able to replicate traffic from one environment to the other without a lot of fuss. Without the capability to reproduce the problem disconnected from our clients and experiment, I don’t think we would have been able to tackle the problem as efficiently as we did.

I’m a little disappointed that a bug in our sync process managed to slip through our quality assurance processes, but I can understand how it happened. It wasn’t strictly a bug with the process itself, as the actions it was performing were still valid, just not optimal. Software with many interconnected dependent components can be a very difficult thing to reason about, and this problem was relatively subtle unless you were looking for it specifically. We might have been able to prevent it from occurring with additional tests, but it’s always possible that we actually had those tests, and during one of the subsequent changes a test failed and was fixed in an incorrect way. If that was the case, then we need to be more careful about “fixing” broken tests.

Regardless, we’re starting to head into challenging territory with our sync process now, as it is a very complicated beast. So complicated, in fact, that it’s getting difficult to keep the entire thing in your head at the same time.

Which is scary.


In late 2016 we deployed the first version of our data synchronization API. It was a momentous occasion, and it all worked perfectly…at least until we turned on the software that actually used it. Actual traffic has a habit of doing that.

Long story short, we experienced a bunch of growing pains as we scaled our traffic up during those first few weeks, and this post is a follow-up to one of those issues in particular.

Specifically, the issue of particularly high read IOPS on the RDS instance that backed the sync API.

At the time, we investigated, didn’t find the root cause of the problem, threw money at it, and ran away to do things that the business considered more important.

Why does it matter now?

Well, it happened again, and this time we can’t just throw money at it.

Sound familiar?

I’ve Come to Read From You Again

As I mentioned above, the main symptom was overly high read IOPS on our RDS server. Much higher than we expected for what we were doing.

This was an issue because of the way that data volumes work in AWS. You get some guaranteed IOPS (read and write combined) based on the size of the volume (1GB = 3 IOPS, so a 100GB volume gets a baseline of 300 IOPS), while at the same time you get the capability to burst up to 3000 IOPS for brief periods of time.

The brief period of time is the important part, because bursting is actually modelled as a “balance” of IOPS credits that you consume when you go above your baseline and that gets replenished when you’re below it. Very similar to the way that CPU credits work on EC2 instances. Back in late 2016, what exactly the balance was at any particular point in time was a bit of a mystery though, because you couldn’t graph or alarm on it through CloudWatch. You only found out when it ran out and the performance of your database tanked, because it was dependent on using more IOPS than its baseline.

Anyway, you can chart and alarm on the balance now, which is nice.

We tried a few things at the time, including scaling the underlying data volume (to give us more baseline), but it wasn’t enough.

We then increased the size of the RDS instance, and the high read IOPS vanished. I mean, that was obviously the best outcome, clearly. A problem we didn’t understand reacting in a completely unexpected way when we threw more power at it.

At the time, we thought that it might be related to Entity Framework (EF) querying the database on writes (to choose between an insert or an update), but we never proved that hypothesis conclusively.

No-one in the business complained too loudly about the cost and everything was working swimmingly, so after an initial period of discomfort (mostly the engineers inside us sobbing quietly), we promptly forgot about it altogether.

Because Some Usage Softly Creeping

A few weeks back, our data synchronization API started behaving strangely.

The first we knew about the problem was when our external monitoring service (Pingdom) reported that the API was down. Then up. Then down again. Then up again. So on and so forth.

Looking at the RDS metrics there were thousands of connections active to the database, causing us to hit connection limits. My first thought was that someone internal to the organisation had somehow got the credentials to the database and was running insane queries for business intelligence. I mean, thinking about that conclusion now, it’s completely bogus, because:

  1. No-one has the credentials to the master database but a few members of my team
  2. No-one even knows what the URL to access the master database is
  3. We created a read replica specifically for the organisation to use for business intelligence, so why would someone use the master?
  4. You can’t even connect to the master database without being on our internal AWS VPN, which most people in the organisation don’t have access to

It didn’t help that when PostgreSQL runs out of connections, you can’t even connect to it with pgAdmin or something similar to see what the connections are doing.

Anyway, after falsely bombarding some innocent people with polite requests to “please stop doing things to our master database, it’s killing our API”, I shut down our data synchronization API and all the connections disappeared.

Because it was our API.

The calls were coming from inside the house.

Wrecked Everything While I Was Sleeping

Luckily, the data synchronization API is probably the one API we can take down for a few hours and not have anybody really notice. As long as the database is available, the other API that actually allows access to the data can continue to do what it normally does, just with increasingly stale information. It’s an eventually consistent model anyway, so some staleness is inherent.

We still only realistically had a few hours to dig into the issue before it became a real problem for actual users though.

Our only lead was the massive number of connections that were being established to the RDS instance. I managed to get a sample of what the connections were doing just as I shut the API down, and it wasn’t anything interesting. Just normal queries that I would expect from the API. Traffic hadn’t changed either, so we weren’t dealing with some weird DDOS attack or anything like that.
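
The post doesn’t include the query used to grab that sample, but pg_stat_activity is the standard way to take one. A sketch (column names as of PostgreSQL 9.2 and later):

-- A sketch only; shows what each non-idle connection is currently running.
SELECT pid,
       usename,
       client_addr,
       state,
       now() - query_start AS running_for,
       query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY query_start;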

Well, to skip ahead, it turns out that the rise in the number of connections was a symptom, not the cause. The root issue was that we had burnt through our IO burst balance over the last few hours due to an increase in read IOPS. We’d almost done the same thing the day before, but it bottomed out at like 5% remaining because the day’s traffic started dropping off, so we didn’t notice. The lack of IOPS was causing each request to execute slower than normal, which just meant that there were more requests being executed in parallel.

Each API instance had a maximum number of connections that it could use, but the sum total across all of the instances was higher than the total number available to PostgreSQL (whoops).
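
A quick sanity check for that mismatch (not from the original incident, just a suggestion) is to compare the server-side limit against what is actually connected, broken down by application:

-- Compares the server limit with actual usage per user/application.
SHOW max_connections;

SELECT usename, application_name, count(*) AS connections
FROM pg_stat_activity
GROUP BY usename, application_name
ORDER BY connections DESC;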

But why were we eating all of our delicious burst balance all of a sudden?

Well, a few days prior we started syncing a new table, and even though it was a small one, it increased the drain on the infrastructure.

Interestingly enough, it wasn’t a case of read IOPS going from close to nothing to a massive number with that new table. Our read IOPS had actually increased to a ridiculous baseline a few weeks before that when we shipped a completely different table, and we just hadn’t noticed. The table we shipped recently had just been the straw that broke the camel’s back.

We turned both of the new tables off, restored the API and all was well.

Conclusion

Honestly, we really should have had some alarms. Any alarms. Literally any alarms on the things we were already suspicious of at the database level (connections, read IOPS, etc) would have given us some sort of warning that bad things were happening, weeks before they actually happened.

We have those alarms now, but it was still a pretty big oversight.

With the new tables turned off and stability and functionality restored, we looked to fix the problem.

Throwing money at it wasn’t really feasible, as we would have basically doubled the infrastructure costs, and it was already costing us a few thousand $US each month. No, this time we would have to understand the problem and fix it properly.

But that’s a topic for next time, because this post is already large.


For anyone following the saga of my adventures with RavenDB, good news! It turns out it’s much easier to run a RavenDB server on reasonable hardware when you just put less data in it. I cleaned out the massive chunk of abandoned data over the last few weeks and everything is running much better now.

That’s not what this blog post is about though, which I’m sure is disappointing at a fundamental level.

This post is a quick one about some of the fun times that we had setting up access to customer data for the business to use for analysis purposes. What data? Well, let’s go back a few steps to set the stage.

Our long term strategy has been to free customer data stuck in on-premises databases so that the customer can easily use it remotely, in mobile applications and webpages, without having to have physical access to their database server (i.e. be in the office, or connected to their office network). This sort of strategy benefits both parties, because the customer gains access to new, useful services for when they are on the move (very important for real estate agents) and we get to develop new and exciting tools and services that leverage their data. Win-win.

Of course, everything we do involving this particular product is a stopgap until we migrate those customers to a completely cloud based offering, but that’s a while away yet, and we need to stay competitive in the meantime.

As part of the creation of one of our most recent services, we consolidated the customer data location in the Cloud, and so now we have a multi tenant database in AWS that contains a subset of all of the data produced by our customers. We built the database to act as the backend for a system that allows the customer to easily view their data remotely (read only access), but the wealth of information available in that repository piqued the interest of the rest of the business, mostly around using it to calculate statistics and comparison points across our entire customer base.

Now, as a rule of thumb, I’m not going to give anyone access to a production database in order to perform arbitrary, ad-hoc queries, no matter how hard they yell at me. There are a number of concerns that lead towards this mindset, but the most important one is that the database has been optimized to work best for the applications that run on it. It is not written with generic, ad-hoc intelligence queries in mind, and any such queries could potentially have an impact on the operation of the database for its primary purpose. The last thing I want is for someone to decide they want to calculate some heavy statistics over all of the data present, tying up resources that are necessary to answer queries that customers are actually asking. Maintaining quality of service is critically important.

However, the business desire is reasonable and real value could be delivered to the customer with any intelligence gathered.

So what were we to do?

Stop Copying Me

The good thing about working with AWS is that someone, somewhere has probably already tried to do what you’re trying to do, and if you’re really lucky, Amazon has already built in features to make doing the thing easy.

Such was the case with us.

An RDS read-replica neatly resolves all of my concerns. The data will be asynchronously copied from the master to the replica, allowing business intelligence queries to be performed with wild abandon without having to be concerned with affecting the customer experience. You do have to be aware of the eventually consistent  nature of the replica, but that’s not as important when the queries being done aren’t likely to be time critical. Read-replicas can even be made publicly accessible (without affecting the master), allowing you to provision access to them without requiring a VPN connection or something similarly complicated.

Of course, if it was that easy, I wouldn’t have written a blog post about it.

Actually creating a read-replica is easy. We use CloudFormation to initialise our AWS resources, so it’s a fairly simple matter to extend our existing template with another resource describing the replica. You can easily specify different security groups for the replica, so we can lock it down to be publicly resolvable but only accessible from approved IP addresses without too much trouble (you’ll have to provision a security group with the appropriate rules to allow traffic from your authorised IP addresses, either as part of the template, or as a parameter injected into the template).

There are some tricks and traps though.

If you want to mark a replica as publicly accessible (i.e. it gets a public IP address) you need to make sure you have DNS Resolution and DNS Hostnames enabled on the host VPC. Not a big deal to be honest, but I think DNS Hostnames default to Off, so that’s something to watch out for. CloudFormation gives a nice error message in this case, so it’s easy to tell what to do.

What’s not so easy is that if you have the standard public/private split of subnets (where a public subnet routes all traffic through the internet gateway and a private subnet either specifies nothing or a NAT) you must make sure to put your replica in the public subnets. I think this applies to any instance that is going to be given a public IP address. If you don’t do this, no traffic will be able to escape from the replica, because the route table will try to push it through the NAT on the way out. This complicates things with the master RDS instance as well, because both replica and master must share the same subnet group, so the master must be placed in the public subnets as well.

With all the CloudFormation/AWS/RDS chicanery out of the way, you still need to manage access to the replica using the standard PostgreSQL mechanisms though.

The Devil Is In The Details

The good thing about PostgreSQL read replicas is that they don’t allow any changes at all, even if using the root account. They are fundamentally readonly, which is fantastic.

There was no way that I was going to publicise the master password for the production RDS instance though, so I wanted to create a special user just for the rest of the business to access the replica at will, with as few permissions as possible.

Because of the aforementioned readonly-ness of the replica, you have to create the user inside the master instance, which will then propagate it across to the replica in time. When it comes to actually managing permissions for users in the PostgreSQL database though, it’s a little bit different to the RDBMS that I’m most familiar with, SQL Server. I don’t think it’s better or worse, it’s just different.

A PostgreSQL server hosts many databases, and each database hosts many schemas. Users, however, exist at the server level, so in order to manage access, you need to grant the user access to the databases, schemas, and then the tables (and sequences) inside each schema that you want them to be able to use.

At the time our RDS instance is initialised there are no databases, so we had to do this after the fact. We could provision the user and give it login/list database rights, but it couldn’t select anything from tables until we gave it access to those tables using the master user.
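
The post doesn’t show the provisioning step itself, but it presumably looked something like the following (the role, database name and password are placeholders):

CREATE ROLE bi_readonly WITH LOGIN PASSWORD 'use-something-strong-here';

GRANT CONNECT ON DATABASE customer_data TO bi_readonly;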

GRANT USAGE ON SCHEMA {schema} TO {username}

GRANT SELECT ON ALL TABLES IN SCHEMA {schema} TO {username}

GRANT SELECT ON ALL SEQUENCES IN SCHEMA {schema} TO {username}

Granting access once is not enough though, because any additional tables created after the statement is executed will not be accessible. To fix that you have to alter the default privileges of the schema, granting the appropriate permissions for the user you are interested in.

ALTER DEFAULT PRIVILEGES IN SCHEMA {schema}
    GRANT SELECT ON TABLES TO {username}

With all of that out of the way, we had our replica.

Conclusion

Thanks to AWS, creating and managing a read-replica is a relatively painless procedure. There are some tricks and traps along the way, but they are very much surmountable. It’s nice to be able to separate our concerns cleanly, and to have support for doing that at the infrastructure level.

I shudder to think how complicated something like this would have been to setup manually.

I really do hope AWS never goes full evil and decides to just triple or quadruple their prices though, because it would take months to years to replicate some of the things we’re doing in AWS now.

We’d probably just be screwed.


Welp, holidays finished, so back into software delivery I go.

We just finished a 3 day Hackathon, which was pretty fun. The team I was in worked on a proof of concept for providing business intelligence on our customers to internal users by connecting and correlating our many disparate data sources. It was a fairly simple React.js website backed  by a C# API, which connected to a number of different systems including Microsoft Dynamics CRM, Elasticsearch (specifically, our log stack), some crazy financial database called Helpy (which I don’t even seem to be able to Google) and finally a PostgreSQL database containing customer information.

It was cool to see it all come together over just a few days, but I doubt I’ll give it its own blog post like I did for the faster S3 clone that I did for the last Hackathon. Funnily enough, the hardest part was getting the damn thing deployed to an instance of IIS on a manually provisioned server inside our office. I’ve spent so much time using automated scripts that I wrote months ago in AWS to set up websites and APIs that I’ve completely forgotten how to just hack a webserver together.

Why didn’t I use the scripts you might ask? Well, I tried, and they really only work inside AWS on top of the AMIs we’ve built up specifically for that purpose. A bit disappointing, but I got over it.

Hackathon fun aside, I had to deal with something else when I got back from my holidays, identifying and resolving a massive unexpected increase in Read IOPS usage for one of our production databases.

And for once, it had nothing to do with RavenDB.

Zero Hour

While I was on holidays, one of our production APIs started exhibiting some pretty degraded performance. It happened suddenly too: it was running fine one minute, and then all of a sudden its latency spiked hard. Unfortunately I was not there for the initial investigation, so the first half of this post is going to be a second-hand account, with all the inherent inaccuracy that entails.

Looking into the issue, all queries to the underlying database (a PostgreSQL database hosted in RDS) were taking much longer than expected, but it wasn’t immediately obvious as to why. There had been no new deployments (so nothing had changed), and it didn’t look like the amount of traffic being received had drastically increased.

The RDS monitoring statistics showed the root cause of the performance problems. There had been a massive increase in the amount of Read IOPS being consumed by the database instance, from 300ish to 700ish. This had been happening for at least a few days before the performance went downhill, so the operative question was why it had become an issue all of a sudden.

Apparently, IO credits are a thing, like CPU credits.

With the increase in Read IOPS being sustained over many hours, the IO credits available to the instance were slowly being consumed, until they were all gone and performance tanked.

The way IO credits work, to my current understanding, is that when you provision an EBS volume, the size of the volume determines the minimum guaranteed IO throughput, at a rate of 3 IOPS per GB provisioned (unless you’ve specified provisioned IOPS, which is a whole different ballgame). The volume is capable of bursting to 3000 IOPS, but operating above the minimum will consume IO credits. Running out of IO credits means no more bursting, so if the performance of your service is reliant on pulling more than the minimum IO available, it will tank. Like ours did.

Apart from the unfortunate fact that unlike CPU credits, there is no way to monitor IO credits, the problem was relatively easy to solve. Either increase the volume size of the RDS instance (to increase the minimum IO performance) or leave the size the same, but switch to using provisioned IOPS. We opted for the cheaper option (bigger disk = more performance) and the performance stabilised.

This left us in a precarious position though, as we didn’t know what had caused the dramatic increase in Read IOPS.

Detective Agency

PostgreSQL has some pretty good logs, which is a relief, considering no equivalent tool to SQL Server Profiler exists to my knowledge.

Getting the logs for a PostgreSQL instance hosted in RDS isn’t particularly complicated. Either edit the existing parameter group associated with the instance, or create a new one with logging enabled and then associate it with the instance. The logging options we were interested in were log_statement and log_min_duration_statement, which represent what activities to log and how long a query needs to be running before it should be logged, respectively. More information about working with PostgreSQL logs in RDS can be found at the appropriate AWS webpage. Getting to the logs manually is pretty easy. They are available for download when looking at the RDS instance in the AWS console/dashboard.
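
The post doesn’t list the values we actually used, but for reference, a common starting point looks something like the following (set through the RDS parameter group rather than postgresql.conf), with SHOW being a handy way to confirm what the instance is actually running with:

-- Illustrative parameter group values only, not the ones from this incident:
--   log_statement = 'none'                (rely on duration-based logging)
--   log_min_duration_statement = 500      (log anything running longer than 500 ms)

-- Confirming what the instance is actually running with:
SHOW log_statement;
SHOW log_min_duration_statement;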

The logs showed that there was a class of query that was constantly being executed to determine the most recent piece of information in the database for a subset of the data (i.e. a select top x from y where a, b). The only piece of software that can access the database is a C# API, and it uses Entity Framework 6 for all database communication, so the queries were monstrous, but once our attention had been drawn to the appropriate place, it was pretty easy to see some issues:

  • The data type on the first segmentation column was a string instead of an integer (only integer values were ever present in the column, and conceptually only integer values would ever be present)
  • The data type on the second segmentation column was a string instead of a guid (similar to the integer comment above)
  • The primary key on the table was a simple auto ID, whereas the actual primary key should have been a compound key of columns a and b above, along with a third column
  • There did not appear to be any appropriate indexes on the segmentation columns

What all of this meant was that the most frequent query being executed on the database was incredibly sub-optimal. It was likely that because the most frequent query needed to do a table scan in order to get an answer, and because the RDS instance was only specified as a t2.micro, PostgreSQL was constantly churning memory, which would account for the high Read IOPS. The issue likely only occurred recently because we finally hit some threshold in traffic or database size where the churning manifested.

Fixing the problem was simple: edit the EF models to change the types and specify appropriate indexes, then add a new EF migration to be executed on the next deployment. The migration was a little more complex than usual because it had to upgrade data already in the database (converting the wrong column types), but all in all, the fix didn’t take a huge amount of effort.
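
The actual change went in through an EF migration, but at the database level it amounts to something like this (table and column names are placeholders, not the real schema):

-- What the migration amounts to at the database level; names are placeholders.
ALTER TABLE things ALTER COLUMN segment_a TYPE integer USING segment_a::integer;
ALTER TABLE things ALTER COLUMN segment_b TYPE uuid USING segment_b::uuid;

-- A compound index over the segmentation columns, so the
-- "most recent row for this subset" query no longer needs a table scan.
CREATE INDEX ix_things_segmentation ON things (segment_a, segment_b, created_at DESC);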

Like good software developers though, we had to make sure that it actually fixed the problem, without deploying it to production to find out.

Validate Me!

Thanks to previous efforts in traffic replication, I made the assumption that it would be relatively easy to replicate our production traffic into an updated version of the service.

I was wrong.

Our previous traffic replication had been accomplished using Gor, replicating traffic from a single EC2 instance (a machine hosting RavenDB) to another similar instance. Pure HTTP traffic, no URLs, just IP addresses, and Gor had been manually installed and configured. I had started work on replicating traffic from our API instances to a URL of my choosing, but I hadn’t quite finished it before I went on holidays.

I needed to replicate traffic from a variable number of instances to a publicly accessible URL, and I really didn’t want to have to manually install it on each machine (and then have to do it again next time I wanted to replicate traffic).

Creating a deployable NuGet package containing Gor and some helper PowerShell scripts wasn’t overly difficult; I just used a similar pattern to the one we have for our Logstash deployment package. After deploying it though, it wouldn’t work. When I say it wouldn’t work, I mean I didn’t get any feedback at all when running it, no matter how I ran it. The issue was that Gor running on Windows is dependent on WinPcap being present in order to function at all. This is easy enough to accomplish when you’re doing a manual install (just go grab WinPcap and install it), but unfortunately much harder when you want to be able to deploy it to any old machine in a completely unattended fashion. WinPcap doesn’t supply a silent installer (why god why), so I was forced to install Nmap silently on the machines as part of the Gor deployment. Why Nmap? It also installs WinPcap silently, because they’ve customised it.

Not my greatest victory, but I’ll take it.

With Gor working, I still had an issue where even though Gor seemed to be aware of the traffic occurring on the machine (I could log it to stdout for example), no matter what I did I could not get the traffic to replicate to another service via a standard URL available to the greater internet. Long story short, the issue was that Gor (as of version 0.15.1) does not and cannot use a proxy. Being that these instances did not have public IP addresses (no need, they were hidden behind a load balancer), they had to go through a proxy to get anything meaningful done on the internet.

We’re switching to using NAT Gateways for all internet traffic from private instances, so in the long term, this won’t be an issue. For now, I have to live with the fact that proxies are a thing, so I was forced to create an internal load balancer pointing at the same EC2 instances as the public URL I was going to use.

With that in place, I finally managed to replicate the production traffic to a test service with the changes applied.

And it worked amazingly. With the changes in place, our Read IOPS dropped from 700ish to like 10. Much more efficient. The only cost was that the service had to be made unavailable to apply the database updates (because they involved the transformation of the column contents). That only took like 5 minutes though, so I just did it at night when nobody noticed or cared.

Conclusion

I suppose the conclusion of this entire journey is to never underestimate the impact that early design decisions can have on the long term health of a service, and that the value of detailed monitoring should never be undervalued.

Honestly, we should have picked up this sort of thing much earlier than we did (probably through load testing), but it was good that we had the systems in place to react quickly when it did happen unexpectedly.

Another benefit of the whole thing is that it proved the value of our ability to spin up an entirely self contained environment as a mechanism for testing changes under relatively real circumstances. With the ability to replicate traffic at will now (thanks to the working Gor deployment), we’re in a much better position to try things out before we go live with them.