One of the core principles in Site Reliability Engineering is a focus on automation to reduce¹ toil. Toil, in essence, is manual, repetitive work that requires little to no human judgment and could readily be automated away.

Unlike humans, who need to be trained, automation can be duplicated at will at virtually no extra cost. That makes reducing toil an especially worthwhile investment as a service grows: after all, the automation can scale alongside the service with little additional effort or cost.
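To make the scaling argument concrete, here is a rough back-of-the-envelope sketch. Every number in it (request volumes, minutes per request, hourly rates, build and maintenance hours) is a made-up assumption for illustration, as are the function names; plug in your own figures.

```python
def manual_cost(requests_per_month, months, minutes_per_request, agent_hourly_rate):
    """Cost of handling every request by hand; grows linearly with volume."""
    hours = requests_per_month * months * minutes_per_request / 60
    return hours * agent_hourly_rate


def automation_cost(build_hours, maintenance_hours_per_month, months, engineer_hourly_rate):
    """Cost of building and maintaining the automation; independent of volume."""
    return (build_hours + maintenance_hours_per_month * months) * engineer_hourly_rate


# All figures below are hypothetical and only serve the illustration.
MONTHS = 12
automated = automation_cost(
    build_hours=80, maintenance_hours_per_month=4,
    months=MONTHS, engineer_hourly_rate=100,
)

for volume in (50, 200):
    manual = manual_cost(
        requests_per_month=volume, months=MONTHS,
        minutes_per_request=15, agent_hourly_rate=40,
    )
    verdict = "automate" if manual > automated else "keep manual for now"
    print(f"{volume} requests/month: manual={manual:.0f}, automated={automated:.0f} -> {verdict}")
```

Under these assumed numbers, the manual cost quadruples when volume quadruples while the automation cost stays flat, so the same task that is not worth automating at low volume becomes worth it as the service grows.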

Creating quality software is hard. It’s easy to build prototypes, but designing and building robust products that fully address people’s needs is incredibly complex. This is one of the major reasons why Customer Support teams exist - they can field all the questions and problems customers invariably run into when the product you built comes up short in reality.

Support requests are a healthy sign in an organization. They indicate that users are using your product and care strongly enough to reach out for help when they get stuck, as opposed to switching to a competing product.

But how many of those support requests make your CS staff do the same thing over and over again, as opposed to addressing truly novel cases? How often are these agents answering the same questions, fixing the same problems, or applying the same account changes that users cannot yet perform themselves?

These are all symptoms of technical debt and feature debt.

Toil isn’t necessarily bad, in the same way that technical debt isn’t necessarily bad. By strategically taking on debt you can release sooner, which lets you get earlier user feedback and product validation, or undercut the competition. But debt, like toil, will need to be paid off at some point. And the faster you grow, the more of a drag it becomes when you have too much of it.

How much of what your support staff does qualifies as toil? Are you tracking this metric in a way that’s visible to people outside of the support structure? I don’t know what a healthy ratio looks like, and chances are it differs across businesses and industries. But I do believe that your CS staff should spend the majority of its time assisting customers with new, never-seen-before problems rather than doing the same thing over and over again.
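One simple way to start tracking this could look like the sketch below. It assumes agents tag each ticket with a category and treats any category that recurs at least a few times as toil. The `Ticket` type, the category labels, and the `repeat_threshold` default are all hypothetical; a real definition of toil would need more judgment than a raw repeat count.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    ticket_id: int
    category: str  # hypothetical label assigned by the support agent


def toil_ratio(tickets, repeat_threshold=3):
    """Share of tickets whose category occurs at least repeat_threshold times.

    Assumption: a frequently recurring category is a proxy for toil.
    """
    if not tickets:
        return 0.0
    counts = Counter(t.category for t in tickets)
    toil = sum(1 for t in tickets if counts[t.category] >= repeat_threshold)
    return toil / len(tickets)


# Made-up sample data for illustration.
tickets = [
    Ticket(1, "password-reset"),
    Ticket(2, "password-reset"),
    Ticket(3, "password-reset"),
    Ticket(4, "change-billing-email"),
    Ticket(5, "change-billing-email"),
    Ticket(6, "data-export-corrupted"),
    Ticket(7, "password-reset"),
    Ticket(8, "sso-misconfiguration"),
]

ratio = toil_ratio(tickets)
print(f"Toil ratio: {ratio:.0%}")  # prints "Toil ratio: 50%"
```

Publishing a number like this on a shared dashboard, per week or per month, is one way to make it visible outside the support organization.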

Do your CS teams possess the necessary skills and technologies to automate those tasks? Are there agreements in place with the product and engineering teams to prioritize support tooling and user-experience/workflow improvements over new feature development when support toil becomes too high?

  1. I deliberately say reduce rather than eliminate here. Similar to how extreme reliability can become too much reliability, there are diminishing returns to the automation of manual tasks. Past a certain point, the return even turns negative, once the automation itself becomes more expensive to build or maintain than doing the work by hand. ↩︎