In this edition of SRE Anti-Patterns, I'm highlighting a problem that gets in the way of people collaboratively responding to incidents.
Multi-party conference bridges or chat rooms are a routine part of major (and often minor) incidents. The goal is to assemble all of the right knowledge and skills to solve a problem quickly. What can get in the way of that goal are the collisions, interference, and miscommunication that are an unfortunate byproduct of multiple eager people all interacting with the same systems at the same time.
I've heard this referred to, amusingly, as "the dogpile." It is a cousin to the "too many cooks in the kitchen" concept. A familiar example: "Someone says they think it might be a problem with server A. So you log into server A, run the 'top' command, and the first ten entries are your other colleagues also running 'top.'"
Other dogpile examples can be destructive. Some are caused by people miscommunicating, others by multiple parties making conflicting changes.
An obvious way to counteract the dogpile effect is to put strict rules into place (e.g., only one person interacts with the system at a time). However, the success of such rules depends on the understanding, agreement, and discipline of all involved, which is no small feat under the pressure of an outage.
Smartly applied self-service operations tools can help enforce the discipline around how groups respond to incidents. First, make it so anyone involved can see the output of any diagnostic actions. Then, make sure that all exploratory or corrective actions are also automatically coordinated and visible (anyone can see the details of what is run and what the output is).
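To make that idea concrete, here is a minimal sketch of what "coordinated and visible" could look like in code. It is not Rundeck's API or any real tool's. `IncidentRoom`, `diagnose`, and `change` are hypothetical names, the log lives in memory, and the "commands" are plain Python callables standing in for whatever your tooling would actually run. The assumption is simply that read-only checks are always allowed and shared, while corrective actions are limited to one responder per target at a time.

```python
import threading
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Entry:
    who: str
    target: str
    action: str
    output: str
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class IncidentRoom:
    """Hypothetical shared record of everything run during one incident."""

    def __init__(self):
        self._guard = threading.Lock()
        self._holders = {}  # target -> responder currently changing it
        self.entries = []   # readable by everyone on the bridge

    def _record(self, who, target, action, output):
        entry = Entry(who, target, action, output)
        with self._guard:
            self.entries.append(entry)
        return entry

    def diagnose(self, who, target, action, run):
        """Read-only check: always allowed, and the output is shared."""
        return self._record(who, target, action, run())

    def change(self, who, target, action, run):
        """Corrective action: only one responder per target at a time."""
        with self._guard:
            holder = self._holders.get(target)
            if holder is None:
                self._holders[target] = who
        if holder is not None:
            # Refusals are logged too, so the collision itself is visible.
            msg = f"REFUSED: {holder} is already changing {target}"
            return self._record(who, target, action, msg)
        try:
            return self._record(who, target, action, run())
        finally:
            with self._guard:
                del self._holders[target]


room = IncidentRoom()
room.diagnose("alice", "server-a", "check load", lambda: "load average 12.4")


def restart():
    # While alice's restart is in flight, bob attempts a conflicting change.
    room.change("bob", "server-a", "roll back config", lambda: "rolled back")
    return "service restarted"


room.change("alice", "server-a", "restart service", restart)

for e in room.entries:
    print(f"{e.who:6} {e.target:9} {e.action:17} {e.output}")
```

In this sketch, bob's rollback never runs. His attempt is refused and logged, and everyone can see both what alice did and what bob tried to do. A real tool would also need access control, persistence, and a way to break stale claims, none of which is shown here.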
The dogpile may be human nature, but with the right policies and tooling in place, you can keep it at bay.
If you want to discuss how Rundeck and the Operations as a Service design pattern can help you get rid of your "dogpile" roadblocks, don't hesitate to contact us.