Are you monitoring a production environment with a system that raises alerts when something strange happens?
Do you also have some sort of pre-production or staging environment where you smoke test changes before pushing them live?
Does the pre-production environment have exactly the same monitoring as the production environment?
It really should.
Silence Definitely Doesn’t Equal Consent
A few months back we were revisiting one of our older API’s.
We’d just made a release into the pre-production environment and verified that the API was doing what it needed to do with a quick smoke test. Everything seemed fine.
Promote to production and a few moments later a whole bunch of alarms went off as they detected a swathe of errors occurring in the backend. I honestly can’t remember what the nature of the errors were, but they were real problems that were causing subtle failures in the migration process.
Of course, we were surprised, because we didn’t receive any such indication from the pre-production environment.
When we dug into it a bit deeper though, exactly the same things were happening in pre-prod, the environment was just missing equivalent alarms.
I’m honestly not sure how we got ourselves into this particular situation, where there was a clear difference in behaviour between two environments that should be basically identical. Perhaps the production environment was manually tweaked to include different alarms? I’m not as familiar with the process used for this API as I am for others (Jenkins and Ansible vs TeamCity and Octopus Deploy), but regardless of the technology involved, its easy to accidentally fall into the trap of “I’ll just manually create this alarm here” when you’re in the belly of the beast during a production incident.
Thar be dragons that way though.
Ideally you should be treating your infrastructure as code and deploying it similarly to how you deploy your applications. Of course, this assumes you have an equivalent, painless deployment pipeline for your infrastructure, which can be incredibly difficult to put together.
We’ve had some good wins in the past with this approach (like our log stack rebuild), where the entire environment is encapsulated within a single Nuget package (using AWS CloudFormation), and then deployed using Octopus Deploy.
Following such a process strictly can definitely slow you down when you need to get something done fast (because you have to add it, test it, review it and then deploy it through the chain of CI, Staging, Production), but it does prevent situations like this from arising.
Our Differences Make Us Special
As always, there is at least one caveat.
Sometimes you don’t WANT your production and pre-production systems to have the same alarms.
For example, imagine if you had an alarm that fired when the traffic dropped below a certain threshold. Production is always getting enough traffic, such that a drop indicates a serious problem.
Your pre-production environment might not be getting as much traffic, and the alarm might always be firing.
An alarm that is always firing is a great way to get people to ignore all of the alarms, even when they matter.
By that logic, there might be good reasons to have some differences between the two environments, but they are almost certainly going to be exceptions rather than the rule.
To be honest, not having the alarms go off in the pre-production environment wasn’t exactly the worst thing that’s ever happened to us, but it was annoying.
Short term it was easy enough to just remember to check the logs before pushing to production, but the manual and unreliable nature of that check is why we have alarms altogether.
Machines are really good at doing repetitive things after all.