SRE Lessons: Continuously Optimize to Reduce Toil

"Can't we just automate away all of our toil?"

Reducing toil is often mistaken for a linear project. The misconception is that you start at one end and refactor or automate your way to the other. In reality, reducing toil is an ongoing process.

Rather than framing it as a project or special effort, reducing toil requires a continuous improvement approach.

A tremendous advantage in both efficiency and agility goes to the Operations and SRE organizations that best adopt a continuous improvement approach to reducing toil.


Toil is a naturally occurring force

Toil is a byproduct of the ongoing development and running of complex systems. It is a product of limited time and the impossibility of foreseeing all issues and conditions.

In an active business, nothing stands still. Deployments of new versions, new features, and new configurations at all levels of the technology stack are ongoing.

Nothing stands still in an enterprise.


The constant struggle to automate

As our systems keep evolving, so does the need for all sorts of manual intervention. Sometimes the need is for tasks we knew about but didn't have time to automate. Sometimes the need is unexpected.

Google's first SRE book gives a great example of the evolution of automation:

  1. No automation - Database master is failed over manually between locations.
  2. Externally maintained system-specific automation - An SRE has a failover script in his or her home directory.
  3. Externally maintained generic automation - The SRE adds database support to a "generic failover" script that everyone uses.
  4. Internally maintained system-specific automation - The database ships with its own failover script.
  5. Systems that don't need any automation - The database notices problems and automatically fails over without human intervention.

You could argue that the engineers who built this system should have known that failover was going to be needed at some point. But who hasn't had to drop these sorts of requirements due to time pressures? Or perhaps stability was overestimated, so it was deprioritized?

In any case, the system has to go through a maturation process. With each increase in maturity, the number of toil occurrences likely goes down.

Adapted from General Evolution of Automation (from chapter 7 of "Site Reliability Engineering")


However, as we said previously, nothing stands still in an enterprise. As new versions and new systems come online, the maturity regresses, and the need for new automation arises - from app configuration to database tasks to infrastructure configuration to user adds and more. It is an ongoing cycle.


Tech debt is relentless

The same combination of limited time and limited foresight gives rise to the ubiquitous technical debt that we all face. As technical debt grows, so does the chance of outages and other situations that need toil-heavy, manual intervention. Paying down technical debt also generally requires refactoring or moving to new versions, which rolls the system back down the aforementioned automation hierarchy.



Continuous optimization to reduce toil

It's important to remember what is at stake. Toil is not only a productivity killer; excessive toil also crowds out the ability to do engineering work (i.e., work that adds enduring value and uses our human potential).

If we crowd out the engineering work, we will not be able to improve the business or reduce the toil. To reuse the technical debt metaphor, this puts you in the all-too-familiar "engineering bankruptcy" position, where everything feels reactive and you are "running in place."

The upside of developing an organization's ability to reduce toil is significantly increased capacity and the ability to increase the amount of strategic work getting done. The downside of not developing an organization's ability to reduce toil is a downward spiral of diminishing capacity, even as headcount grows.

How to continuously optimize an organization to reduce toil:

  1. A formal continuous improvement program
    Adopt an explicit improvement program within your company that both tracks toil and sets aside capacity for teams to work on reducing toil. The Toyota Kata system is an excellent example of a formal improvement program that works at the team or organizational level.

  2. "Shift left" and make it a shared responsibility
    Shared responsibility is a core concept in SRE. Dev and SRE share responsibility for maintaining SLOs and operational outcomes. If performance standards aren't met, they both have to swarm to the problem, even if that means stopping new project work. This same shared responsibility model should carry over to reducing toil. What toil can be eliminated at design time or through development effort? How can standard platforms be built up to minimize variation and manual tasks?


  3. Use Rundeck to capture procedures and respond rapidly
    Whether it is due to time constraints or unforeseen circumstances, toil arises and needs to be dealt with efficiently and quickly. Use Rundeck to quickly and easily create automated runbooks that help with both. The automated runbooks can provide self-service capabilities to a broader team (keeping toil from piling up). For the toil you can't offload, the runbooks can be used to get it done as efficiently and effectively as possible.
    Another popular use of Rundeck is to create automated procedures to minimize the effects of repetitive issues before an official fix is in place.
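As a rough illustration of the "tracks toil" half of item 1 above, here is a minimal Python sketch that computes each team's share of logged hours spent on toil and flags teams over a budget. Everything in it is an assumption made for illustration: the WorkEntry record, its field names, the sample numbers, and the 50% default budget are hypothetical and not taken from any particular tool or program.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkEntry:
    """One logged chunk of work (hypothetical record shape)."""
    team: str
    hours: float
    is_toil: bool  # True for manual, repetitive, operational work


def toil_share_by_team(entries):
    """Return {team: fraction of that team's logged hours spent on toil}."""
    toil = defaultdict(float)
    total = defaultdict(float)
    for entry in entries:
        total[entry.team] += entry.hours
        if entry.is_toil:
            toil[entry.team] += entry.hours
    # Skip teams with no logged hours to avoid dividing by zero.
    return {team: toil[team] / total[team] for team in total if total[team] > 0}


def teams_over_budget(entries, budget=0.5):
    """List teams whose toil share exceeds the budget (0.5 is an assumed default)."""
    shares = toil_share_by_team(entries)
    return sorted(team for team, share in shares.items() if share > budget)


# Made-up sample data for two hypothetical teams.
entries = [
    WorkEntry("db", 30, True),
    WorkEntry("db", 10, False),
    WorkEntry("web", 8, True),
    WorkEntry("web", 32, False),
]

print(toil_share_by_team(entries))
print(teams_over_budget(entries))
```

Reviewing a number like this on a regular cadence, rather than once, is what makes it part of a continuous improvement loop: a team that drifts over budget after a new release is a signal to set aside capacity for automation.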



Let me know on Twitter or in the comments what other strategies you use to continuously reduce toil.