Our industry has always had localized expressions for work that was necessary but didn’t move the company forward. "Busy work." "Monkey work." "Muck work." "Chores." Now, thanks to the SRE movement, there is a word we can all use. That word is “toil.”
The concept of toil is a unifying force because it provides an impartial framework for identifying — then containing — the work that takes up our time, blocks people from fulfilling their engineering potential, and doesn't move the company forward.
Why Toil Matters
Unfortunately, "not enough time and too much to do" describes the default working conditions inside IT Operations. There is an unlimited supply of planned and unplanned work: new things to roll out, incidents to respond to, support requests to answer, technical debt to pay down, and the list goes on.
With only so many hours in the day, how do we make sure what we are working on actually makes a difference?
How do you make sure your team, and your broader organization, are maximizing the kinds of work that add value and finding ways to eliminate work that doesn’t? After all, organization and team decisions dictate much of your work.
To maximize both the value of your operations organization and the human potential of your colleagues, you need an objective framework to identify and contain the "wrong" kind of work and maximize the "right" kind of work. Understanding what toil is, and keeping the amount of toil contained, provides economic benefits to your company and improves the work-lives of your fellow engineers. That is a textbook win-win situation.
What is the Definition of Toil?
Google first popularized the term "toil," and the SRE movement has been pushing it into the general lexicon of IT operations.
For those who don't know, SRE stands for either Site Reliability Engineering (a set of principles and practices) or Site Reliability Engineer (a role). In a nutshell, SRE is about injecting software engineering practices, and a new mindset, into IT operations to create highly reliable and highly scalable systems. Interest in SRE has skyrocketed since Google published its seminal Site Reliability Engineering book.
In the book, Vivek Rau articulates an excellent definition: “Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”
The more of these attributes a task has, the more confidence you can have in classifying the work as "toil." However, just because work is classified as toil doesn’t mean that a task is frivolous or unnecessary. On the contrary, most organizations would grind to a halt if the toil didn’t get done.
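The "more attributes, more confidence" idea can be sketched as a simple checklist score. This is only an illustration: the class and function names are hypothetical, and weighting all six attributes equally is an assumption, not something the definition prescribes.

```python
from dataclasses import dataclass, fields


@dataclass
class TaskTraits:
    """Which of the six attributes from the definition a task shows."""
    manual: bool = False
    repetitive: bool = False
    automatable: bool = False
    tactical: bool = False
    no_enduring_value: bool = False
    scales_linearly: bool = False


def toil_score(traits: TaskTraits) -> float:
    """Fraction of the six attributes present, from 0.0 to 1.0.

    Equal weighting is an assumption made for this sketch.
    """
    flags = [getattr(traits, f.name) for f in fields(traits)]
    return sum(flags) / len(flags)


# Hypothetical examples: a manual service restart versus a design review.
restart = TaskTraits(manual=True, repetitive=True, automatable=True,
                     tactical=True, no_enduring_value=True,
                     scales_linearly=True)
design_review = TaskTraits(manual=True)

print(toil_score(restart))        # 1.0, all six attributes present
print(toil_score(design_review))  # one attribute out of six
```

A high score says the task is a strong candidate for the "toil" label. It says nothing about whether the task is necessary.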
The concept of value is often a point of confusion when toil is introduced into a traditional enterprise operations culture. After all, to some, manual intervention in the running of services is their job description. Are they not valuable? The short answer is "yes, they are valuable, but they could be a whole lot more valuable."
A goal of "no toil" sounds nice in theory. However, in reality, a "no toil" goal is not attainable in an ongoing business. Technology organizations are always in flux, and new developments (expected or unexpected) will almost always cause toil.
Just because a task is necessary to deliver value to a customer does not mean that it is always value-adding work. For people who are familiar with Lean manufacturing principles, this is not dissimilar to Type 1 Muda (necessary, non-value-adding tasks). Toil may be necessary at times, but it doesn’t add enduring value (i.e., a change in the perception of value by customers). Long-term, we should want to eliminate the need for the toil.
The best we can hope for is to be effective at reducing toil and keeping toil at a manageable level across the organization. Toil will come from sources you already know about but just haven't had the time or budget to automate (e.g., semi-manual deployments, schema updates/rollbacks, changing storage quotas, network changes, user adds, adding capacity, DNS changes, and service failover). Toil will also come from any number of unforeseen conditions that can cause incidents requiring manual intervention (e.g., restarts, diagnostics, performance checks, and changing config settings).
What Should People Be Doing Instead of Toil?
Instead of your people spending their time on non-value-adding toil, you want them spending as much of their time as possible on value-adding engineering work.
Also drawing on Vivek Rau’s definitions, we can define engineering work as the creative and innovative work that requires human judgment, has enduring value, and can be leveraged by others.
Working in an organization with a high ratio of engineering work to toil feels like, metaphorically speaking, everyone is swimming towards a goal. Working in an organization with a low ratio of engineering work to toil feels more like you are treading water, at best, or sinking, at worst.
High Levels of Toil Are Toxic
Toil may seem innocuous in small amounts. Concern over individual incidents of toil is often dismissed with a response like “nothing wrong with a little busy work.” However, when left unchecked, toil can quickly accumulate to levels that are toxic to both the individual and the organization.
For the individual, high levels of toil lead to:
- Discontent and a lack of feeling of accomplishment
- More errors, leading to time-consuming rework to fix
- No time to learn new skills
- Career stagnation (hurt by a lack of opportunity to deliver value-adding projects)
For the organization, high levels of toil lead to:
- Constant shortages of team capacity
- Excessive operational support costs
- Inability to make progress on strategic initiatives (the “everybody is busy, but nothing is getting done” syndrome)
- Inability to retain top talent (and to attract top talent once word gets out about how the organization functions)
One of the most dangerous aspects of toil is that it requires engineering work to eliminate it. Think about the last deluge of manual, repetitive tasks you experienced. Doing those tasks doesn’t prevent the next batch from appearing.
Reducing toil requires engineering time, either to build automation that removes the need for manual intervention or to enhance the system so that the intervention is not needed in the first place.
Engineering work needed to reduce toil will typically be a choice of creating external automation (i.e., scripts and automation tools outside of the service), creating internal automation (i.e., automation delivered as part of the service), or enhancing the service to not require maintenance intervention.
Toil eats up the time needed to do the engineering work that will prevent future toil. If you aren't careful, the level of toil in an organization can increase to a point where the organization won’t have the capacity needed to stop it. If we use the Technical Debt metaphor, this would be “engineering bankruptcy.”
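The feedback loop described here can be illustrated with a toy model. Everything in it is an assumption made for the sake of the sketch: the function name, the constant weekly growth in toil, and the idea that each engineering hour removes a fixed amount of recurring toil. Real teams will not behave this neatly.

```python
def simulate_toil(capacity, toil, growth, payoff, weeks):
    """Return the toil hours per week under a toy model.

    capacity: team hours available per week
    toil:     starting toil hours per week
    growth:   new toil hours arriving each week (hypothetical constant)
    payoff:   weekly toil hours removed per engineering hour invested
    """
    history = []
    for _ in range(weeks):
        history.append(toil)
        # Engineering time is whatever toil leaves over, never negative.
        engineering = max(0.0, capacity - toil)
        # New toil arrives; engineering work removes some of it.
        toil = max(0.0, toil + growth - payoff * engineering)
    return history


# Same hypothetical team and parameters, different starting toil.
healthy = simulate_toil(capacity=100, toil=40, growth=5, payoff=0.1, weeks=52)
buried = simulate_toil(capacity=100, toil=60, growth=5, payoff=0.1, weeks=52)

print(healthy[-1] < healthy[0])  # True: toil shrinks over the year
print(buried[-1] > 100)          # True: toil now exceeds total capacity
```

With these made-up numbers the tipping point sits at 50 hours of toil per week. Start below it and engineering work outpaces new toil; start above it and toil grows until no engineering capacity is left, which is the "engineering bankruptcy" described above.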
The SRE model of working, and all of the benefits that come with it, depends on teams having ample capacity for engineering work. This capacity requirement is why toil is such a central concept for SRE. If toil eats up the capacity to do engineering work, the SRE model doesn’t work. An SRE perpetually buried under toil isn’t an SRE; they are just a traditional long-suffering sysadmin with a new title.
Why Rundeck Cares About Toil
First, our mission here at Rundeck is to improve the work-lives of operations professionals. Reducing toil and maximizing engineering time does just that.
Second, our users who are SREs have shown us how they use Rundeck in their efforts to reduce toil.
- Rundeck helps them standardize procedures (i.e., reducing variation and reducing errors to reduce toil)
- Rundeck helps them automate away tasks that previously required a lot of toil (i.e., easier to do engineering work that reduces toil)
- Rundeck enables self-service so others can do operations tasks themselves (i.e., stop one team from creating toil for another team)
You’ll be hearing a lot more from us on the topic of toil. Stay tuned.
Tom Limoncelli provided some insight on Twitter about where he first heard the term "toil." If you know of any other early history, please leave it in the comments below!
“I credit Todd Underwood @ Google with coining this term, but maybe I just heard it first from him.” (Thomas Limoncelli 5k, @yesthattom, March 13, 2018)