Don’t blindly roll back
Rolling back when a deployment or configuration change causes errors to spike is a common strategy for quickly stabilizing a system by reverting to a known-good state.
This generally works when the deployment or configuration change is itself at fault, but sometimes the error spike is a symptom of a deeper problem, and the change is merely the trigger or catalyst that sets it off.
In such cases, rolling back can make things worse rather than better. One example is an incident that occurred at Slack, described in The Case of the Recursive Resolvers:
After nearly an hour, there were no signs of improvement and error rates remained stable, trending neither upwards nor downwards.
Our Traffic team was confident that reverting the DS record published in MarkMonitor was sufficient to eventually resolve DNS resolution problems. As things were not getting any better, and given the severity of the incident, we decided to rollback DNSSEC signing in our slack.com authoritative and delegated zones, wanting to recover our DNS configuration to the last previous healthy state, allowing us to completely rule out DNSSEC as a problem.
As soon as the rollback was pushed out things got much worse; our Traffic team was paged due to DNS resolution issues for slack.com from multiple resolvers across the world.
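The failure mode here follows from how DNSSEC works: validating resolvers that had already cached the DS record for the zone continued to require signed responses, so once signing was rolled back, unsigned answers failed validation until the cached DS expired. A toy model can sketch this dynamic (hypothetical and heavily simplified; the function and cache shape are illustrative, not Slack's or any resolver's actual implementation):

```python
# Toy model (illustrative only) of why rolling back DNSSEC signing can
# make an outage worse: a validating resolver that has cached a DS record
# for a zone keeps demanding signed responses until the DS TTL expires,
# so unsigned answers fail validation (SERVFAIL).

def resolve(resolver_cache: dict, zone_signed: bool) -> str:
    """Return the outcome a validating resolver sees for the zone."""
    has_cached_ds = resolver_cache.get("ds_record", False)
    if has_cached_ds and not zone_signed:
        # The cached DS says "expect signatures", but the zone no longer
        # serves RRSIGs, so validation fails.
        return "SERVFAIL"
    return "NOERROR"

# Before the rollback: the zone is signed, so resolvers with the cached
# DS record validate successfully.
assert resolve({"ds_record": True}, zone_signed=True) == "NOERROR"

# After the rollback: signing is removed, but the DS record is still
# cached by resolvers around the world, so resolution breaks for them.
assert resolve({"ds_record": True}, zone_signed=False) == "SERVFAIL"

# Resolvers that never cached the DS record are unaffected either way.
assert resolve({}, zone_signed=False) == "NOERROR"
```

The asymmetry is the point: the rollback restored the pre-change configuration on the authoritative side, but the world's resolvers still held state from the change, so "reverting to the last healthy state" did not revert the system as a whole.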