Keeping a digital business running has never been an easy job, especially over the last year. 2020 forced many businesses to accelerate their digital transformation initiatives more than anyone imagined. Customers are demanding more capacity and reliability, the business is releasing new services faster than ever before, and companies are learning to use new remote working models, all of which strains systems and people.
Complexity is the New Normal
In operations, there has always been a mix of legacy and new applications. But the level of system complexity has increased with the rise of the public cloud, containers, and microservices, even for mid-sized SaaS companies.
Visual representation of services for a mid-size SaaS company
Operations teams are used to dealing with failures. However, with the rising scale and complexity of today’s services, problems and failures happen more often and can be much more difficult to solve. On top of all of that, there's also the pressure to open things up so the organization can move faster, but also to lock things down and remain compliant.
Needless to say, staying ahead is no easy feat. How does a business go faster, but at the same time avoid risk? Enter the concept of real-time operations.
Why Real-Time Operations?
Everyone agrees that speed is a competitive advantage. So how does a business move faster? It's almost impossible if Ops is in a reactive state. Unfortunately, that’s where a lot of businesses are today. We call this reactive state ticket-time operations.
Life in Operations has always been a mix of planned and unplanned work. Ops teams are frequently interrupted by someone who needs them to do something or they are interrupting someone with a request.
It's an endless stream of requests in the form of tickets - often asking to do the same task over and over again. For example, the development teams may need the network team to make a change to a firewall rule every time there’s a new release. The network team has to drop what they are doing to make the change... but that change also needs to be approved by the security team before it goes live. Now the network team is interrupting the security team and waiting for them to help. Meanwhile everyone is juggling their own work.
The industry has gotten used to this way of working, and the results aren’t great. Engineers feel frustrated, overworked, and under-utilized, and business owners feel like everything takes too long, costs too much, and breaks too often.
So, here we are today. The demands on IT Operations are pushing things to the breaking point. It’s no longer sustainable to operate under the slow, high-friction, and high-cost burden of the ticket-time operating model. Instead, Operations needs to shift to what we call real-time operations.
What do we mean by “real time”? Real-time is the ability to make decisions and take action at the speed of the business. It means instant communication and decision making. Instead of keeping information and control inside silos, it means distributing control out to the organization and letting people work at their own pace with end-to-end control.
Three Ways to Enable Real-Time Operations
1. Monitoring, Observability, and AIOps
Monitoring is an age-old practice that has traditionally been the domain of the Operations side of the house. Monitoring is about looking for patterns or events that are similar to those seen previously and alerting the appropriate folks when those conditions are triggered.
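To make that definition concrete, here is a minimal sketch of a monitoring check in Python. It is not taken from any particular monitoring product: the `AlertRule` class, the thresholds, and the notification targets are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AlertRule:
    name: str
    condition: Callable[[float], bool]  # a known-bad pattern to watch for
    notify: str                         # hypothetical team or on-call target


def evaluate(rules: List[AlertRule], metric_value: float) -> List[str]:
    """Return one alert message for every rule whose condition is triggered."""
    return [f"{r.name} -> notify {r.notify}" for r in rules if r.condition(metric_value)]


# Illustrative thresholds on an error-rate metric (assumed values).
rules = [
    AlertRule("high_error_rate", lambda v: v > 0.05, "ops-oncall"),
    AlertRule("critical_error_rate", lambda v: v > 0.20, "incident-commander"),
]

alerts = evaluate(rules, 0.08)
print(alerts)  # ['high_error_rate -> notify ops-oncall']
```

The point of the sketch is that monitoring only catches conditions someone has already described in a rule, which is where observability comes in.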
The ‘new’ kid on the block is observability, which measures how well you can understand a system’s internal states from its external outputs. Observability tools and methods help us interrogate our services to figure out what is really going on.
It's built on:
- Logs: Is this discrete event something that has happened before?
- Metrics: Looking at those events and asking - are things getting better or worse?
- Distributed tracing: Looking within the new distributed infrastructures and understanding how these events cross through each component.
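As a rough illustration of how these signals relate, the sketch below handles one request and records a log event, a metric, and a set of trace spans that share a trace ID. It uses only the Python standard library. The component names and the in-memory `logs`, `metrics`, and `spans` stores are assumptions made for the example, not a real observability API.

```python
import time
import uuid
from collections import Counter

logs = []            # discrete events
metrics = Counter()  # aggregated numbers that show a trend over time
spans = []           # per-component timings tied together by a trace ID


def handle_request(path: str) -> str:
    """Simulate one request crossing three hypothetical components."""
    trace_id = uuid.uuid4().hex
    for component in ("gateway", "checkout", "database"):
        start = time.perf_counter()
        # ... the component's real work would happen here ...
        spans.append({
            "trace_id": trace_id,
            "component": component,
            "duration_s": time.perf_counter() - start,
        })
    logs.append({"trace_id": trace_id, "event": "request_handled", "path": path})
    metrics["requests_total"] += 1
    return trace_id


trace_id = handle_request("/cart")
# Follow the single request through every component it touched.
print([s["component"] for s in spans if s["trace_id"] == trace_id])
```

Because every record carries the same trace ID, a responder can start from a log line and see which components the request passed through.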
Although monitoring is traditionally owned by the operations side, we are seeing observability also driven by developers. Together, monitoring and observability help achieve real-time operations by creating deeper visibility between teams and helping us learn how systems work day-to-day.
Last but not least, there’s AIOps. AIOps is about combining tool capabilities to understand what’s happening in real time. AIOps provides solutions similar to existing event management solutions, but includes added capabilities required for today’s complex, modern environments such as machine learning, automation, flexible data collection and ingestion, powerful visualizations, and more. It's about taking all the information and signals from all the infrastructure, aggregating metrics, reducing noise, improving correlation and understanding, and spotting patterns. Learn how to use AIOps for better Incident Management.
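The noise-reduction and correlation idea can be sketched in a few lines. The example below groups raw events from the same service into one incident when they arrive close together. A fixed time window stands in here for the machine-learning correlation an AIOps product would apply, and the event format and the 60-second window are assumptions for illustration.

```python
from typing import Dict, List, Tuple

Event = Tuple[int, str, str]  # (timestamp_seconds, service, message)


def correlate(events: List[Event], window_seconds: int = 60) -> List[dict]:
    """Group events per service when they fall within a window of the group's first event."""
    incidents: List[dict] = []
    open_incident: Dict[str, dict] = {}  # service -> most recent incident for that service
    for ts, service, message in sorted(events):
        current = open_incident.get(service)
        if current is not None and ts - current["start"] <= window_seconds:
            current["events"].append(message)  # same burst: fold into the existing incident
        else:
            current = {"service": service, "start": ts, "events": [message]}
            open_incident[service] = current
            incidents.append(current)
    return incidents


events = [
    (0, "checkout", "latency high"),
    (10, "checkout", "latency high"),
    (20, "checkout", "5xx rate rising"),
    (25, "search", "node down"),
    (300, "checkout", "latency high"),
]

incidents = correlate(events)
print(len(events), "events ->", len(incidents), "incidents")  # 5 events -> 3 incidents
```

Responders then see three incidents instead of five separate alerts, with the related signals already attached to each one.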
2. Service Ownership
In an increasingly complex digital world, the notion of service ownership becomes more and more important. Organizations need to know:
- What happens when something goes wrong?
- What are the dependencies?
- And who is the person responsible?
The service ownership practice helps build a map that answers these questions and helps businesses understand how their teams and technical systems interact.
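A minimal sketch of such a map, assuming a small hypothetical service catalog, might look like the following. The service names, team names, and the `impacted_by` helper are invented for illustration. The helper answers "what else is affected when this service fails?" by walking the dependency links.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Service:
    name: str
    owner: str                                        # hypothetical owning team
    depends_on: List[str] = field(default_factory=list)


catalog: Dict[str, Service] = {
    "web": Service("web", "team-frontend", ["checkout", "search"]),
    "checkout": Service("checkout", "team-payments", ["database"]),
    "search": Service("search", "team-discovery", ["database"]),
    "database": Service("database", "team-data"),
}


def impacted_by(failed: str, catalog: Dict[str, Service]) -> Set[str]:
    """Return every service that depends on `failed`, directly or transitively."""
    impacted: Set[str] = set()
    frontier = [failed]
    while frontier:
        current = frontier.pop()
        for svc in catalog.values():
            if current in svc.depends_on and svc.name not in impacted:
                impacted.add(svc.name)
                frontier.append(svc.name)
    return impacted


print(catalog["database"].owner)                  # team-data
print(sorted(impacted_by("database", catalog)))   # ['checkout', 'search', 'web']
```

With the owner and the dependency list stored next to each service, the three questions above become simple lookups rather than something discovered mid-incident.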
Services will fail; it’s a fact of operating. How a company responds when there’s a failure can make all the difference between keeping or losing customers.
Full-service ownership helps streamline the incident response lifecycle by empowering engineers to own their services in production, which reduces the number of handoffs and can significantly reduce MTTR when incidents occur. Placing subject matter experts, with direct knowledge of the systems they support, in the role of first responders helps decrease the inevitable chaos and panic that arise from uncertainty.
3. Self-Service Operations
For organizations trying to move from a reactive ticket-driven approach to a proactive approach, the self-service operations model is a key real-time operations enabler.
What does “real time” mean when it comes to self-service? Rather than having information and control be inside functional silos, self-service delegates control to the right people in the organization.
Part of self-service is communicating intelligence, like sharing system context, visibility, service ownership, the right runbooks, and decision support. The other part is freeing up subject matter experts to do work that adds value to the business - rather than continually getting interrupted by requests.
In an incident management scenario, this means first responders have the information and the control they need to be able to take action or to have AI take action on their behalf. This results in faster resolution and fewer disruptive escalations!
Self-Service with Runbook Automation
You can create self-service with runbook automation. Runbook automation allows the subject matter experts to define workflows that span different tools, scripts, APIs, permissions, credentials, and command line procedures and delegate that process to the people who need it.
Runbook automation enables the right people to safely complete tasks that previously only subject matter experts could do. It also allows your subject matter experts to take their best practices and turn them into common practices used by everyone.
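The sketch below shows the delegation idea in its simplest form: an expert defines the steps once and lists which roles may run them, and anyone in those roles can then run the workflow without filing a ticket. The `Runbook` class, the role names, and the step functions are hypothetical and do not represent the API of any specific runbook automation product.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Runbook:
    name: str
    steps: List[Callable[[], str]]  # ordered steps defined by the subject matter expert
    allowed_roles: Set[str]         # who the expert has delegated this workflow to

    def run(self, user_role: str) -> List[str]:
        """Run every step in order if the caller's role is allowed."""
        if user_role not in self.allowed_roles:
            raise PermissionError(f"role {user_role!r} may not run {self.name!r}")
        return [step() for step in self.steps]


# Placeholder steps; real ones would call scripts, APIs, or CLI tools.
def check_disk() -> str:
    return "disk usage checked"


def restart_service() -> str:
    return "service restarted"


restart_runbook = Runbook(
    name="restart-web",
    steps=[check_disk, restart_service],
    allowed_roles={"sre", "first-responder"},
)

results = restart_runbook.run("first-responder")
print(results)  # ['disk usage checked', 'service restarted']
```

The access check is what makes the delegation safe: the expert's procedure runs the same way every time, and only for the people it was shared with.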
Runbook automation can be used across the full life cycle. For incident response, responders can diagnose an issue and have at their fingertips the automated actions that they would normally have to escalate to the experts. This works for normal day-to-day service requests too. For provisioning, change, and maintenance tasks, instead of constantly waiting for someone to do something for you, runbook automation enables folks to complete the task for themselves. Learn more about self-service operations.
The opportunity to transform how operations work gets done spans the entire operations lifecycle. Applying a real-time operations focus to these other Ops tasks can make a big difference in improving business velocity!