If you've worked in Operations, you know how maddening it can be to deal with repeated "known issues." No matter how routine the response is, it still chews up your time with interruptions and toil.
Dealing with repetitive problems is frustrating enough. That frustration kicks into high gear when you submit a bug report on the issue, only for Dev to close it quickly. The reason? Usually something about not having the budget or the priority. And besides, "Ops has a workaround."
Of course, all businesses have to make tradeoffs. But is the full cost/benefit analysis being done? Or is there an implicit bias that dev time spent building business features is more valuable than ops time?
SRE defines toil and keeps it in check
The SRE movement has helped move our industry forward by naming "toil" and raising awareness of it. The term covers the repetitive, non-value-adding work that piles up on Operations and crowds out the engineering work needed to make tomorrow better than today.
In addition to defining toil, SRE makes it clear that it must be an organizational priority to keep toil in check. Like most of SRE, this is accomplished through a shared responsibility model. It isn't just the responsibility of Operations/SRE to reduce their toil. The whole organization plays a part in keeping toil down to a manageable level.
Now, even with SRE's shared responsibility model, bugs with known operational workarounds will likely still exist, and their fixes will occasionally be deprioritized in favor of other issues. However, there is a greater likelihood that the impact of the issue on all roles will be factored into the prioritization.
Prioritize the elimination of toil
The first (and, yes, most obvious) way to deal with a known repetitive issue is to understand the impact it has on the org and make the fix. One of the side benefits of the "ops with developer skills" profile that SRE encourages is that SREs can participate in code fixes to help eliminate some of these issues.
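To make "understand the impact" concrete, here is a rough sketch of how the ops cost of a workaround could be weighed against the dev cost of a permanent fix. Everything in it is an assumption for illustration: the KnownIssue fields, the idea of adding an interruption cost per occurrence, and the sample numbers are hypothetical, not figures from any real team.

```python
from dataclasses import dataclass


@dataclass
class KnownIssue:
    """Hypothetical record for a bug that Ops handles with a workaround."""
    name: str
    occurrences_per_month: float
    minutes_per_occurrence: float  # hands-on time to run the workaround
    interruption_minutes: float    # assumed context-switch cost per occurrence
    fix_hours: float               # estimated dev time for a permanent fix


def monthly_toil_hours(issue: KnownIssue) -> float:
    """Ops hours per month consumed by the workaround, interruptions included."""
    per_occurrence = issue.minutes_per_occurrence + issue.interruption_minutes
    return issue.occurrences_per_month * per_occurrence / 60.0


def break_even_months(issue: KnownIssue) -> float:
    """Months until the permanent fix has paid for itself in saved ops time."""
    toil = monthly_toil_hours(issue)
    if toil == 0:
        return float("inf")  # no toil, so the fix never pays back in ops time
    return issue.fix_hours / toil


# Illustrative numbers only.
issue = KnownIssue(
    name="stuck queue worker",
    occurrences_per_month=20,
    minutes_per_occurrence=15,
    interruption_minutes=15,
    fix_hours=40,
)
print(monthly_toil_hours(issue))  # 10.0 hours of toil per month
print(break_even_months(issue))   # 4.0 months to break even
```

Even a crude estimate like this puts ops time and dev time in the same units, which makes it harder to close a bug report on "Ops has a workaround" alone.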
Use Self-Service to distribute load before the permanent fix
The next option is to turn the known workaround into a standard operating procedure that others can use to remediate the issue as it arises. This use of self-service doesn't always eliminate the toil, but by moving it around the organization, you can mitigate its effects. Sometimes this means putting more control into the hands of delivery teams. Other times it means giving more control to the NOC or L1 support teams. In still other situations, you can set up auto-remediation that runs before anyone is alerted.
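The auto-remediation idea can be sketched as a small dispatcher that tries the known workaround first and only pages a human when it fails or when no workaround is registered. This is a minimal sketch under stated assumptions: the runbook dictionary, the handle_event function, and the page callback are hypothetical names, not the API of any particular alerting or automation tool.

```python
from typing import Callable, Dict, List

# A remediation is the known workaround wrapped as a function.
# It returns True if the workaround succeeded.
Remediation = Callable[[], bool]


def handle_event(
    issue_id: str,
    runbook: Dict[str, Remediation],
    page: Callable[[str], None],
    max_attempts: int = 2,
) -> str:
    """Try the registered workaround before alerting anyone.

    Returns "remediated" if the workaround succeeded, otherwise "paged".
    """
    remediation = runbook.get(issue_id)
    if remediation is None:
        page(f"{issue_id}: no automated workaround registered")
        return "paged"

    for _ in range(max_attempts):
        try:
            if remediation():
                return "remediated"
        except Exception as exc:
            # A crashing workaround is itself worth a human's attention.
            page(f"{issue_id}: workaround raised {exc!r}")
            return "paged"

    page(f"{issue_id}: workaround failed after {max_attempts} attempts")
    return "paged"


# Example wiring with stand-in functions (assumptions for illustration).
pages: List[str] = []
runbook: Dict[str, Remediation] = {
    "worker_stuck": lambda: True,  # pretend the restart worked
    "disk_full": lambda: False,    # pretend the cleanup did not help
}

print(handle_event("worker_stuck", runbook, pages.append))
print(handle_event("disk_full", runbook, pages.append))
print(pages)
```

The same runbook entries could also be exposed to delivery teams or L1 support as self-service actions, so the workaround is written once and reused wherever the load is moved.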