DevOps and Digital Transformations are driving an unprecedented increase in the pace and volume of daily change. While this might be great news to Development and Product teams, their counterparts in Operations are alarmed at the problems and challenges that come with this increase in pace and volume of work.
Operations organizations are finding themselves squeezed between two unrelenting forces. On one side there are the business-driven demands of DevOps and Digital Transformation (“Go faster! Open things up!”). On the other side there are the business-driven demands to maximize security and stability (“Don't be the next hack! Don't be the next outage! Lock things down!”). And there, in the middle, is an already over-burdened Operations organization doing its best to avoid being squeezed beyond the breaking point.
Operations is reaching an inflection point. To deliver what the business demands, Operations has been challenged to find a way to provide unprecedented levels of organizational responsiveness and throughput — all while “locking things down” to sufficiently meet today’s risk profiles.
A lot is riding on how Operations responds to this challenge. A failure here is not just a localized IT failure. A failure will undermine a business’s ability to operate. Failing to solve this IT challenge will turn into a competitive disadvantage for the business.
On the flip side, this challenge also presents a great opportunity. Operations can take this business mandate and use it to reimagine how both planned and unplanned work is handled. This is a chance to improve how Operations both serves the broader business and improves the day-to-day lives of Operations professionals.
Operations as a Service is a key design pattern that allows an Operations organization to move faster, be more flexible, and lock things down. Operations as a Service is also a critical technique for breaking down the organizational barriers that prevent enterprises from achieving DevOps and Digital transformations.
On the surface, Operations as a Service is a straightforward concept: turn your operations tasks into automated services that can be consumed on-demand (via GUI, command line, or API) by anyone who needs those operations tasks performed.
However, when you look deeper, you'll see that a lot goes into making this straightforward vision a reality. This book introduces the why, the what, and the how of Operations as a Service.
Operations has always carried the mandate of meeting the business needs through designing, running, and fixing complex systems. However, what that operations work looks like and what those systems are comprised of has changed over time.
In the past few decades, we’ve seen the focus of typical enterprise operations organizations steadily expand from networking, to server platforms, to service management, to API integration. If we look at what happens today, operations encompasses the full stack of a very complex, software-defined, API-enabled system running on infrastructure that the organization may or may not own.
Part of this expansion can be attributed to the dramatic evolution of the underlying enterprise compute technologies. From open platforms to virtual machines to cloud to containers to “Cloud Native” — there has been one major shift after another.
Of course, the systems built under each new technology paradigm never fully replace the systems built under the old paradigms. It's not uncommon for an enterprise to have an accumulation of systems built over 10-15 years and have no budget, risk appetite, or even viable way to replace them all.
With each shift, who bears the brunt of the responsibility for making sure the old and the new hang together? Operations, of course. With each new advance, Operations juggles more complexity and more layers of legacy technologies than ever before.
One belief that has remained consistent throughout most of this evolution is the idea that operations work may only be executed by a distinct and separate Operations team. For decades it was almost considered heresy to suggest dismantling the strong, wall-like division of roles and responsibilities between Development teams and Operations teams.
Today, the DevOps and Cloud Native movements are strongly challenging the old “Ops in a silo” orthodoxy. These movements have been taking an end-to-end look at improving the entire IT lifecycle. In doing so, they are challenging many old assumptions.
DevOps and Cloud Native share a ruthless focus on improving time-to-market while simultaneously improving quality and lowering costs. These practitioner-led movements have repeatedly shown that the strongly-siloed traditional way of operating is why so much of IT was delivered late, cost too much, and didn’t deliver on its promises to the business.
To understand why Operations as a Service is so important, we need to understand why the old way of working in silos and passing work through the resulting request queues can be so destructive.
The term “silo” is a bit of DevOps jargon used to describe a specific condition that occurs naturally anywhere a significant number of people gather to do work. Following human nature, organizations tend to divide up their work and their people by functional specialization. As these divisions occur, human nature further encourages the people within these divisions to focus inward and optimize their work for their specialization. This is where silos start to form and the problems begin.
A team is said to be “working in a silo” when that team is working in a different context from other teams, their work comes from a different source (i.e. a different backlog), and they are working under different priorities or incentives. While specialization and focus are inherently a good thing for skills development, we have to be vigilant to avoid the unintended formation of organizational silos. It should be noted that silos are not meant to be an indictment of having organizational structure or division. While structure can contribute to the formation of silos, it is really how people work that makes it a silo. Silos may tend to form along team boundaries, but teams do not necessarily have to turn into silos.
One of the first signs of silo formation is that handoffs between teams become difficult or error-prone. The most common culprits of these handoff problems are:
Another sign of silo formation is that request queues appear at the boundaries of the silos and grow increasingly longer. The stronger the silo effects, the longer the request queues become. As the silo effects take hold, those fulfilling the requests end up working increasingly disconnected from the requestors and, due to the issues listed above, that leads to additional errors and rework.
On the surface, request queues seem like an orderly and efficient way to manage work at organizational divides. However, if you look under the surface you will find that request queues are a major source of economic waste in any business.
Why are request queues a major source of economic waste? Let us look at the list created by noted author and product development expert, Donald G. Reinertsen.
Reinertsen also points out that the higher the capacity utilization of a team, the longer the request queues will be, with queue length growing ever faster as utilization rises (which directly increases all of the economic wastes previously listed). Given that most IT operations teams operate at near full capacity, this effect needs to be understood.
As a team approaches 100% utilization, the request queues grow at a sharply accelerating rate. As we move from 60% to 80% utilization, the queue doubles. As we move from 80% to 90%, the queue doubles again. As we move from 90% to 95%, the queue doubles again. You can see that if a team operates in a high-utilization environment and uses request queues to manage its work, queuing theory dictates that it is nearly impossible to keep request queues small, and the organization will continue to suffer from the negative economic effects of large queues.
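The doubling pattern described above can be approximated with a simple queuing model. The sketch below assumes an M/M/1 queue (a single server with random arrivals and service times). That model is an illustrative assumption, not necessarily the exact one behind Reinertsen's figures, and the function name is hypothetical.

```python
def mean_queue_length(utilization: float) -> float:
    """Average number of requests waiting in an M/M/1 queue: rho^2 / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in the range [0, 1)")
    return utilization ** 2 / (1 - utilization)


if __name__ == "__main__":
    # Print the expected backlog at the utilization levels discussed in the text.
    for rho in (0.60, 0.80, 0.90, 0.95):
        print(f"{rho:.0%} utilization: {mean_queue_length(rho):.1f} requests waiting")
```

Under this assumed model the average backlog at least doubles at each of the steps named in the text, which is why a team running near full capacity struggles to keep its queue small.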
While the waiting, bottlenecks, errors, and rework that come with queues can cause operations to be far more expensive than it needs to be, it’s the cost of delay that can have a profound effect on a company’s fortunes. For every delay introduced into a company’s processes, there is a corresponding reduction in how quickly the business can react to the market and how quickly the business can deliver. While the individual effect of each delay can be almost imperceptible, in an organization full of request queues they add up (and compound) quite quickly.
Delay has a cost, and that cost can be quite high. As Reinertsen says, “If you only quantify one thing, quantify the cost of delay”.
Request queues have a fundamental negative effect on the economics of any business that depends on IT to build or maintain business advantage. Despite this, what is the most common method of managing work inside IT operations organizations? Request queues in the form of ticket systems!
For decades, ticket-driven request queues fulfilled by manual or semi-manual processes have been the dominant style of working in enterprise IT Operations organizations. The typical operating model was to have lots of specialist teams divided by functional expertise and use ticket systems to govern the flow, approval, and order of work between the different specialist teams. It was also typical to rely heavily on project management coordination to push work through this system of silos.
If you examined how any of the teams worked under this operating model, you would likely find the very definition of a silo. You would also find most, if not all, of the negative effects of queues, and you could add up their cost to the business (in both increased cost of operations and cost of delay).
Even though ticket systems have been rebranded as ITSM tools and new features and processes for queue management have been introduced, these changes still do not address the fundamental problems that come with functional silos reinforced by ticket-driven request queues fulfilled by manual or semi-manual processes.
How can enterprise operations organizations solve these silo problems? Two key strategies have emerged from the DevOps community.
The first strategy is perhaps the most obvious: get rid of as many silos as you can.
Forward-thinking organizations are transforming from a traditional “vertical” structure aligned by function to a “horizontal” structure aligned by value stream.
This generally means creating cross-functional teams that can handle as much of the lifecycle as possible without needing to hand work off to other teams. It is very difficult for silos to form if you don’t have handoffs or breaks in context and everyone is working from a single backlog with common priorities.
The second strategy is the primary focus of this paper: Operations as a Service. Wherever silos cannot be avoided, you must apply techniques and tooling to mitigate the negative impact of those silos. Operations as a Service does just that by enabling both the definition and execution of operations activity to be delegated throughout a broader organization and across traditional organizational boundaries. Wait time is eliminated, feedback loops are shortened, breaks in context are avoided, tooling is aligned, and labor capacity is improved.
Operations as a Service turns your operations tasks into services that can be consumed on-demand (via GUI, command line, or API) by anyone who needs those operations tasks performed. Let’s look at how this works and how to calculate the ROI.
Operations as a Service is built on a fundamental shift in thinking about automated operations procedures.
Traditionally, automated procedures have been viewed as monolithic things that are created and live within Operations (and often within the same Operations team). In reality, an automated procedure has three distinct parts: the definition of the automated procedure, the ability to execute that automated procedure, and the security or management policies governing that automated procedure.
Many of the beliefs around what we can and can’t do with regard to automating operations procedures stem from this monolithic view. If we can instead break out our thinking about automated procedures into these three distinct parts, the possibilities for changing an organization start to open up.
We can start to think about who should be responsible for each of the elements:
The goal is to be able to move the individual elements of each automated procedure to the part of the organization where the move improves the flow of work and best utilizes your labor.
For example, we can enable a scalable organization where developers collaborate on defining operations procedures (e.g. “This automation smartly restarts this application”), the operations group vets and improves on those procedures (e.g. “Is this safe? Will this do what we want? Does it play nice with other systems?”), and then the security organization can control where the procedure can be run and who can run it (e.g., traditional operations engineers, developers, a dedicated IT Ops support team, etc.).
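To make the three-part split concrete, here is a minimal sketch of how one procedure could record a separate owner for its definition, its policy, and its execution rights. The class, field names, and role names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AutomatedProcedure:
    """One automated procedure, split into its three separately owned parts."""
    name: str
    definition_owner: str   # who writes and maintains the steps
    steps: list[str]        # the definition itself (placeholder step names)
    policy_owner: str       # who governs the security and management policy
    allowed_roles: set[str] = field(default_factory=set)  # who may execute it

    def can_execute(self, role: str) -> bool:
        """Execution is delegated to any role the policy owner has allowed."""
        return role in self.allowed_roles


# Developers define the procedure, Security decides who may run it.
restart_app = AutomatedProcedure(
    name="restart-app",
    definition_owner="development",
    steps=["drain connections", "stop service", "start service", "check health"],
    policy_owner="security",
    allowed_roles={"ops-engineer", "developer", "support"},
)
```

Keeping the owners as separate fields is the point of the sketch: each part can move to a different team without the other two changing.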
Let’s take a look at what is possible when we move the responsibility for the different elements of an automated operations procedure.
In the traditional siloed way of working, all operations activity is executed by a centralized operations team. All three elements are the sole responsibility of that centralized operations team. Do you need an operations task completed? Submit tickets and wait in request queues. Still waiting? Try to get the ticket escalated.
History has shown us that this model causes the most organizational pain. Operations bottlenecks and labor shortages are common. Handoffs to teams outside of operations (and sometimes even within operations) are long and error-prone. When faced with today's high-velocity software lifecycles and dynamic infrastructures, operations organizations struggle under this model. The recent work of the DevOps community has repeatedly advised against this way of operating.
Rigid Self-Service is typically an organization’s first step toward Self-Service Operations. Operations will set up a "button" (e.g. GUI, command line, API) for people to "push" to execute a specific repetitive task. We call this type of self-service “rigid” because the process is completely defined and managed by Operations. The people who consume the self-service procedure aren’t given the ability to modify it (though they may be presented with a few options). Enabling Developers to deploy a .war file via an automated CI/CD loop or an on-demand button is an example of a common DevOps-inspired first step that generally falls into this rigid self-service model.
Due to the rigid nature of this way of working, the amount of self-service that is actually possible is limited. Operations still ends up performing most of the operations activity themselves (including the creation of most of the automation). Rigid Self-Service is a good first step as it does help to improve the flow of work and lighten some of the load on Operations -- it's just not a game-changing level of improvement.
In most cases, the systems currently running in your environments were created or assembled by a team outside of operations. Therefore, it is a reasonable assumption that the deepest knowledge of these systems and how to fix them resides outside of Operations. As those teams sprint forward, Operations will always be playing catch-up.
To help Operations get up to speed, the teams outside of Operations often perform a knowledge handoff to Operations in the form of a documentation dump (readme files, "do-this-then-do-that" Word documents on SharePoint servers, ticket comments, etc.) and perhaps some "it worked in my environment" scripts. At the handoff, Operations is expected to quickly come up to speed and build the procedures and the automation necessary for the management of systems in the higher pre-production environments (e.g. UAT, STAGE, etc.) and production environments.
In addition to the error-prone nature of relying on human-to-human knowledge transfer, these handoffs are time consuming and resource intensive, often requiring the full attention of your most skilled engineers. As lifecycles speed up and environments become more dynamic and complex, these handoffs quickly become a major source of bottlenecks and their inevitable errors become a weak link for quality and security.
A key step in achieving a self-service style of operations is delegating the responsibility for defining and creating automated operations procedures to the creators of the systems and software running in your environment. A simple example would be a mandate that developers write and maintain all automated operations procedures for all software that they create.
This means that the automated procedures have to be seamlessly reusable by both the development team and the Operations team. Automation for one component must also work seamlessly with automation for other components. One might think that this implies that a single automation tool would have to be used across all teams. But, that is not actually the case.
Forcing teams to standardize on one language or automation framework just isn't realistic given the heterogeneous nature of modern enterprises. Teams need to be able to use the automation languages and tools that they want, while other tools orchestrate procedures across those underlying frameworks and languages. If you can achieve this, you get high-velocity handoffs that improve the flow of work, improve quality, and relieve Operations of significant pressure.
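As a rough illustration of orchestrating across team-chosen tools, the sketch below treats each step as an arbitrary command line and runs the steps in order. The step names and placeholder commands are assumptions made so the example runs anywhere Python is installed; in practice each command would be whatever script or tool the owning team prefers.

```python
import subprocess
import sys

# Each step is just a command line, so the orchestrator does not care
# which language or tool sits behind it. These commands are placeholders.
STEPS = [
    {"name": "check-status", "command": [sys.executable, "-c", "print('status: ok')"]},
    {"name": "report", "command": [sys.executable, "-c", "print('report written')"]},
]


def run_steps(steps):
    """Run steps in order, stopping at the first failure.

    Returns a list of (step name, return code, captured stdout) tuples.
    """
    results = []
    for step in steps:
        completed = subprocess.run(step["command"], capture_output=True, text=True)
        results.append((step["name"], completed.returncode, completed.stdout.strip()))
        if completed.returncode != 0:
            break  # do not run later steps after a failure
    return results
```

The design choice here is that the orchestration layer only standardizes how steps are invoked and how results are reported, not how each step is written.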
Once you combine self-service capabilities for both defining and executing automated procedures, you have enabled Operations as a Service. Like any modern, on-demand "_aaS", you are putting as much control as possible into the hands of the requesting party. This lets the requester complete their tasks, when they need to, and keep feedback loops as tight as possible.
Security and management controls are significant requirements for any Operations as a Service solution. This way of working can only exist in an enterprise setting if Operations can maintain full security controls, enforce compliance, and have management oversight.
If done correctly in a low-friction manner, everybody wins. Operations gives autonomy to teams who need operations tasks performed while simultaneously locking down critical information to a greater degree than is even possible today.
As an organization advances with Operations as a Service, it is a logical next step to push some governance capabilities to other teams. This is popular with organizations that have multiple lines of business with different risk profiles. While Operations must retain ultimate control, there are some access control and compliance decisions that can be pushed closer to the owners of the risk within those lines of business.
There are multiple business drivers for moving to an Operations as a Service model. Calculating a return on investment will always depend on a company's unique environment. However, here are some ways to build your company-specific ROI calculations:
ROI benefits for teams doing operations work:
ROI benefits for teams outside of Operations:
ROI benefits measured at the business level:
Increased visibility for audit and compliance
The benefits accrued by each constituent group (Operations and the groups served by Operations) reinforce each other and compound over time. The net effect is that the ROI to the business should grow exponentially as more Operations as a Service capabilities are rolled out to additional parts of the company. If you are getting started and want the simplest ROI measure that everyone can rally behind, use lead time. There are lead times all over your organization and each one has a cost associated with it.
There are lead times for feature delivery, for resolving an incident, and for waiting on an environment change. Calculate how much each of those lead times costs. When talking to engineers you can express lead time in how long they are waiting or how long it takes them to do repetitive tasks for others. When talking to the business you can add up those lead times and quantify the cost of delay. You can then make the financial argument that you should apply the Operations as a Service pattern to reduce as many of those lead times as possible.
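The lead-time arithmetic described above can be sketched in a few lines. Every number below is an invented placeholder, and the field names and the simple "hours times occurrences times hourly cost" formula are assumptions made for illustration; substitute your own measurements and cost model.

```python
from dataclasses import dataclass


@dataclass
class LeadTime:
    name: str
    hours_waiting: float        # average wait per occurrence
    occurrences_per_month: int  # how often the wait happens
    cost_per_hour: float        # assumed cost of delay per hour of waiting


def monthly_cost(lead_time: LeadTime) -> float:
    """Monthly cost of one lead time under the simple model above."""
    return lead_time.hours_waiting * lead_time.occurrences_per_month * lead_time.cost_per_hour


# Illustrative numbers only.
lead_times = [
    LeadTime("environment change request", hours_waiting=8, occurrences_per_month=20, cost_per_hour=100),
    LeadTime("incident escalation", hours_waiting=2, occurrences_per_month=10, cost_per_hour=500),
]

total = sum(monthly_cost(lt) for lt in lead_times)
```

Summing per-queue costs this way gives a single figure to present to the business, while the per-item breakdown shows engineers where the waiting is concentrated.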
Now that we’ve covered the “Why” and the “What” of Operations as a Service, let’s look at how companies go about building their Operations as a Service capabilities. We’ve noticed a general pattern that companies follow. The process can be broken down into the following four steps.
The first step is to focus on creating a central, secure hub that serves three primary functions:
Heterogeneity is a fact of life in the enterprise. Multiple generations of different platforms and tools will need to co-exist. The idea that an enterprise of significant size can move exclusively to one platform and one automation language/framework just isn’t reality. In addition to the logistical, financial, and technical debt barriers to completely retooling, it's natural for different teams to want to reuse their skills and select tools that best fit their specific needs.
Plan for this heterogeneity by focusing on the orchestration and scheduling of those underlying platforms and supporting tooling. Allow teams to create the component automation using the scripting languages or tools that they want to use. The focus of the hub created for Operations as a Service should be to provide a general orchestration and scheduling capability that can leverage and standardize automation written in any language or built in any legacy tool (as long as you can get API or CLI access to those legacy tools).
Role-based access control is a critical part of the Operations as a Service hub. All enterprises have the need to enforce strict security and governance requirements. Operations needs the capability to delegate access to other teams who don’t traditionally participate in operations activity. The ability to grant read, write, and execute permissions based on roles is essential.
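A minimal sketch of role-based read, write, and execute checks follows. The role names and grants are hypothetical, and a real hub would load its policy from a managed, audited source rather than a hard-coded dictionary.

```python
from enum import Flag, auto


class Permission(Flag):
    READ = auto()
    WRITE = auto()
    EXECUTE = auto()


# Hypothetical roles and their grants.
ROLE_GRANTS = {
    "ops-engineer": Permission.READ | Permission.WRITE | Permission.EXECUTE,
    "developer": Permission.READ | Permission.EXECUTE,
    "auditor": Permission.READ,
}


def is_allowed(role: str, needed: Permission) -> bool:
    """True only if the role holds every permission in `needed`."""
    granted = ROLE_GRANTS.get(role, Permission(0))  # unknown roles get nothing
    return (granted & needed) == needed
```

Defaulting unknown roles to no permissions keeps the check deny-by-default, which matches the lock-things-down requirement described above.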
The person executing operations procedures will generally need to understand the context within which they are executing those procedures. The most important parts of that context are generally the current status and current configuration of the system on which they are working. Current status can generally be found in your existing monitoring and metrics tools. Current configuration can be more difficult to determine if a live-updated CMDB doesn’t exist. However, where more modern practices are followed, configuration templates and configuration management tools often hold this information (and newer systems may already expose this data to inspection via API).
When people are experienced with the services and systems being managed, it is not difficult for them to work from multiple “screens”, looking to different places to get monitoring or configuration data and then going to a different tool to take action. However, when people are less experienced with a particular service or environment (which is bound to happen as you expand self-service), there is benefit to bringing those views together. That is, a person would see, all in one place, the monitoring and configuration context as well as the actions they can take.
Either as a prerequisite or a parallel step, work with your peers to define a basic set of standard procedures and a shared “vocabulary” around actions that can be taken on each environment and system. An example of this would be to say that everything needs a basic set of actions — start, stop, status, configure, update, reset connections, etc. The convention you establish will provide slots that the responsible parties can fill in, and expand upon, with their favorite scripts and tools. This creates a consistent baseline of expectations for both those who created the automated procedures and those who will be executing the automated procedures.
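One way to picture such a convention is as a shared interface whose "slots" each team fills in. The sketch below covers only three of the actions named above, and both class names are hypothetical; a real implementation would call the team's own scripts or tools inside each method.

```python
from abc import ABC, abstractmethod


class ManagedService(ABC):
    """Baseline actions every system is expected to expose."""

    @abstractmethod
    def start(self) -> str: ...

    @abstractmethod
    def stop(self) -> str: ...

    @abstractmethod
    def status(self) -> str: ...


class ExampleWebApp(ManagedService):
    """Hypothetical service that tracks its state in memory."""

    def __init__(self) -> None:
        self.running = False

    def start(self) -> str:
        self.running = True
        return "started"

    def stop(self) -> str:
        self.running = False
        return "stopped"

    def status(self) -> str:
        return "running" if self.running else "stopped"
```

Because the abstract base class refuses to instantiate until every action is implemented, a missing slot is caught when the procedure is defined rather than when someone tries to run it.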
One of the great advances of the DevOps movement is bringing Software Development Lifecycle (SDLC) discipline to operations work. Since everything above the hardware is software, why not treat it as such and use all of the established software management best practices that we can? These include using versioned source control, having an automated “build” process, using a well-defined promotion process to move code from one environment to the next (hopefully as immutable artifacts), and regular code reviews.
After teams establish their Operations as a Service platform, the next step is to set up an SDLC that Development and Operations teams can collaborate through to define standard operating procedures. Developers will define the procedures to manage the systems they create and Operations and Security will vet those procedures and perform code reviews. Approved procedures can be tested in lower environments then promoted to production where they can be used by anyone to whom Operations and Security have delegated access.
The primary benefit is that you can leverage the knowledge of Development teams. They are closest to creating the system and are most qualified to determine if it is healthy and how to operate it. By getting them involved early, they can create and test the automated procedures when they are easiest and cheapest to create. This also speeds any handoffs from Development to Operations since Operations can focus on vetting and approving the procedures rather than creating them.
It should also be pointed out that in most enterprises, the total number of developers will be a lot larger than the total number of operations engineers. The only hope of relieving the capacity squeeze on Operations is to leverage all of that labor that can be found outside of Operations.
Operations teams who engage developers with this SDLC-driven approach see a rapid increase in both the usage of their Operations as a Service hub and the overall delivery throughput of the organization.
Now that the Operations as a Service hub is up and running and the SDLC around the automated procedures is in use, the next step is to integrate with other enterprise management tools to make your Operations as a Service capabilities even more intelligent.
For example, ITSM tools have a wealth of ticket and incident information and are key regulators of operations communication and work permission. You can have your procedures update tickets, check if tickets exist, open tickets on failure, and more.
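The "open tickets on failure" idea can be sketched as a small wrapper around a procedure. The in-memory ticket class below is a stand-in invented for this example; a real integration would call the API of whichever ITSM tool you use.

```python
class InMemoryTicketSystem:
    """Stand-in for an ITSM tool; stores ticket summaries in a list."""

    def __init__(self) -> None:
        self.tickets = []

    def open_ticket(self, summary: str) -> int:
        self.tickets.append(summary)
        return len(self.tickets)  # ticket number, starting at 1


def run_with_ticketing(procedure, name: str, tickets: InMemoryTicketSystem) -> bool:
    """Run a procedure; on failure, open a ticket describing what went wrong."""
    try:
        procedure()
        return True
    except Exception as error:
        tickets.open_ticket(f"{name} failed: {error}")
        return False
```

Wrapping the call this way keeps ticket handling out of the procedure itself, so the same procedure can run with or without an ITSM integration.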
Software artifact and container repositories will generally know which versions are available and marked approved. Through integration, you can have your automated procedures present users with options based on data from these repositories.
Chat systems are central to the new ChatOps style preferred by engineering and operations teams. Integrate with your ChatOps engines to call your automated procedures and see output within your ChatOps sessions.
The more integrations you can create, the more context your users will have without leaving the Operations as a Service hub.
While the idea of opening up operations activity to a broader team may be initially concerning to traditional auditors, you can prove that your Operations as a Service platform actually improves your compliance posture. With the platform in place you will have audit evidence automatically generated from start to finish. You can prove: who created the procedure, who reviewed it, what the approval trail was before it was run, who ran it, when it ran, where it ran, and exactly what ran. When you combine that with a platform that logs all events and an audit trail for access control policies -- your Operations as a Service platform should be your auditor’s best friend.
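The audit evidence listed above maps naturally onto a structured, immutable record. The sketch below is one possible shape; the field names, example values, and JSON log format are assumptions for illustration and do not describe any particular product's audit log.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: a record cannot be altered after creation
class AuditRecord:
    procedure: str
    created_by: str
    reviewed_by: str
    approved_by: str
    run_by: str
    ran_at: str    # ISO 8601 timestamp of the run
    target: str    # where it ran
    command: str   # exactly what ran


def to_log_line(record: AuditRecord) -> str:
    """Serialize a record as one JSON line with a stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


# Example values are invented.
record = AuditRecord(
    procedure="restart-app",
    created_by="dev-alice",
    reviewed_by="ops-bob",
    approved_by="sec-carol",
    run_by="support-dan",
    ran_at=datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc).isoformat(),
    target="prod-web-01",
    command="restart-app --graceful",
)
```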
Operations as a Service Platform Capabilities
A common trait that we see among high-performing companies is that they roll out their Operations as a Service capabilities using a continuous improvement approach rather than a “big-bang” approach.
These high-performers will start with a limited set of procedures in a particular slice of their business before expanding. During the subsequent expansion, they add more procedures, expand to other parts of the business, and improve their capabilities along the way. They do this on a rolling basis to ensure that they catch any design flaws or user concerns as quickly as possible. The result is that they deliver a long series of quick wins that make everyone happy.
The continuous improvement mindset is also important given the platform nature of Operations as a Service. A healthy platform is one that all parties can adapt quickly to their needs. Recognizing that change is a constant in Operations, we are reminded to make tooling and design choices that favor ease of use, low learning curves, and a rapid delivery lifecycle.
Rundeck is ready to be the core of your Operations as a Service platform. From orchestration and scheduling capabilities to access control management — Rundeck checks off the boxes of the capabilities you will need to get started.
Rundeck has been built under the motto, “Go fast. Be flexible. Lock things down.” Let’s look more closely at how the Rundeck platform addresses each of these ideals.
Current business and technology trends (e.g. Digital, DevOps) are causing a sharp increase in the pace and flexibility demanded of IT. However, the complexity found within most companies gets in the way of these desires. Rundeck is designed to work equally well across the “old” and the “new”. Rundeck lets you connect users, tools, scripts, and APIs across any generation of technology. Whether it’s traditional siloed operations of legacy systems or fast and decentralized operations of new Cloud Native systems, Rundeck allows your teams to work the way they need to work. Rundeck lets you move faster as an organization by helping you improve the effectiveness of your operations work and also by enabling self-service operations wherever possible.
In order to respond quickly as an organization, the ability to safely delegate control to those closest to the need is vital. Whether the need is to respond to a production incident or manage environments during a delivery lifecycle, Rundeck enables companies to be more flexible and empower as many people as needed to do operational tasks. The more operational control that can be delegated, the more flexible and nimble an organization can be.
Lock things down
“Be secure! Don’t be the next data breach! Don’t be the next major outage!” is the strongest business mandate, and one of the greatest challenges, for today’s enterprise IT organizations. If the mandate is to lock things down, how can you meet the other business mandate of going faster? Rundeck approaches this challenge from two different vectors. First, Rundeck enables Ops to implement granular access and compliance controls. Rundeck provides a control layer across all of your operational activity. Second, as the hub of IT operations, Rundeck greatly increases visibility into all operations activity. All actions are logged and create a permanent record.