SRE Challenges: Taming the Tool Sprawl

Tool sprawl is a fact of life in any enterprise. Over the years, tools and scripts have accumulated as you’ve solved one challenge after another. It's not uncommon to find modern tools like Ansible, Puppet, or Terraform mixed with legacy tools and a wide variety of scripts (Bash, Python, PowerShell, Perl, and more).

Of course, let's not forget that many of these tools or languages probably have what I would call... well uh... "passionate supporters" who aren't interested in learning a new tool/language and want you to use the one they already use.

And then what about the "orphaned but useful" tools or scripts? We all know cases where the person who created critical pieces of our company's tooling has moved on (elsewhere or just to another role). Sure, you could probably redo it, but that takes up extra time you don't have. And besides, why mess with what works?

All of this variety can make the idea of creating meaningful self-service daunting. It's like you have islands of tools, and whenever you need something done, you have to swim from one to another.


Anti-pattern: One tool/language to rule them all

Many efforts to tame the tool sprawl and implement self-service start with the logical thought that reducing the number of tools and languages down to one is the way to go. "If we could just standardize everything on one tool" is a comment we've heard a lot. But in reality, this proves impractical, and often impossible.

First, time is always Operations' most significant constraint. The business isn't slowing down, and you need to keep moving forward. The list of technical debt is perpetually growing, and "fixing" tools or scripts that already work will hover near the bottom of the list.

Second, forcing teams to standardize on one language or automation framework just isn’t realistic given the heterogeneous, high-velocity nature of modern enterprises.

Teams need to be able to use the automation languages and tools that they want. The choices weren't made for frivolous reasons. These teams made their decisions based on the best choice, at that particular time and in their specific context, for the work they needed to accomplish.

Premature standardization has been shown to be detrimental to team performance. And late standardization will rarely happen given the time constraints of modern operations.

Embrace heterogeneity, for the win

Now that we've established that heterogeneity is inescapable (and likely desirable), let's discuss how to embrace it and make the most of it.

1. Select a Self-Service Operations Platform that doesn't force different teams to change their automation tool or learn a new language/DSL.
While "works with your existing tools" is a claim made by many tools, few actually make it easy or do so without requiring you to learn their DSL or configuration language first. There is a time to think about standardization, but don't let it get in the way of the benefits of Self-Service Operations.

2. Allow everyone to start with what they currently use.
If one group has PowerShell scripts and another has Ansible playbooks, move those into your Self-Service Operations Platform as quickly as possible. Where things are still manual steps (such as "here are the commands you run to make sure this procedure worked"), default to closing those gaps with the lowest common denominator. In most organizations that lowest common denominator will be some form of shell script. I have learned the hard way that collaboration by the widest group possible is generally more important than "the best" automation tool. You can always come back and "clean things up," but often you won't. Most people will quickly discover that those scripts are doing just fine, and they can instead focus on more valuable tasks.

3. Pick a tool that makes the difficult things easy.
Using scripts and tools to move the software bits from point A to point B or call some APIs? That is the easy part. Almost everyone in an Operations organization should be able to script those foundational tasks. Where things get difficult and time-consuming is building self-service workflows across those tasks. You'll want a tool that makes it easy to configure workflow execution, remote dispatching, error handling, user input passing, notifications, UI (Web GUI, API, CLI), access control, and logging.
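To make the "difficult part" a little more concrete, here is a minimal sketch in Python of what the thinnest possible workflow layer over existing scripts might look like. It is only an illustration, not how Rundeck or any other platform works internally. The names Step, StepResult, and run_workflow are hypothetical, and the demo steps are stand-in Python one-liners so the example runs anywhere; in practice each command could be whatever a team already uses, such as a Bash or PowerShell script or an Ansible playbook. The sketch covers only sequential execution, basic error handling, and logging. Remote dispatch, user input, notifications, and access control are left out, which is exactly the work a platform would need to take off your hands.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Step:
    """One unit of work: any existing script or tool, described by its argv."""
    name: str      # human-readable label used in the log output
    command: list  # e.g. ["bash", "restart.sh"]; the script name is hypothetical


@dataclass
class StepResult:
    name: str
    returncode: int  # -1 means the step could not be started or timed out
    output: str      # combined stdout and stderr, kept for later inspection


def run_workflow(steps, timeout=60.0):
    """Run steps in order, log each outcome, and stop at the first failure."""
    results = []
    for step in steps:
        try:
            proc = subprocess.run(
                step.command, capture_output=True, text=True, timeout=timeout
            )
            result = StepResult(step.name, proc.returncode, proc.stdout + proc.stderr)
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Missing executable, permission problem, or a step that hung.
            result = StepResult(step.name, -1, str(exc))
        results.append(result)
        status = "ok" if result.returncode == 0 else "FAILED"
        print(f"[{status}] {step.name}")
        if result.returncode != 0:
            break  # later steps are skipped once one step fails
    return results


if __name__ == "__main__":
    # Stand-in steps; each one could just as well call a different team's tool.
    workflow = [
        Step("health check", [sys.executable, "-c", "print('service healthy')"]),
        Step("deploy", [sys.executable, "-c", "import sys; sys.exit(3)"]),
        Step("notify", [sys.executable, "-c", "print('done')"]),
    ]
    run_workflow(workflow)
```

Even this toy version shows the design choice behind the advice above: the workflow layer treats every step as an opaque command, so nobody has to rewrite a working script or learn a new language before it can take part in a workflow.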


At Rundeck, we've built a Self-Service Operations Platform designed to embrace the heterogeneous nature of today's fast-moving enterprises. We've made sure that there is no DSL to learn and that you can drop in scripts in any language or create workflows out of any other tools. Rundeck is designed to help you get the most out of the skills and assets you already have. The quicker you can enable Self-Service Operations, the quicker you can relieve the interruptions and the waiting that have been slowing everyone down.


