DevOps team at large utility company

Improving incident response with Rundeck.

Summary

An endless cycle of service stops and restarts was wearing down the small DevOps team at a large utility company. Zabbix did a good job of automating notifications about outages, but the rest of the process was repetitive and manual. One persistent DevOps team member, Kosy Anyanwu, knew there had to be a better way. A little research led her to Rundeck and soon the incident response loop was completely automated, letting the team reclaim time for strategic projects.

The Challenge

Kosy was a member of a six-person DevOps team at a large utility. One of the team's main tasks was restarting services that went down. The team implemented Zabbix, an open source enterprise monitoring tool, to trigger notifications of these outages. That worked well, speeding response times. However, as anyone in Ops knows, just as you extinguish one fire, another starts. The cycle of notification, restart, and documentation was repetitive and manual. In addition, the team had no preserved record of the output from remote executions.

The lack of automation frustrated the team. Kosy searched for a solution to make the Zabbix notifications more effective. Could Zabbix not only notify the team, but also kick off the remediation steps, and close the loop?

After scouring the Web, Kosy discovered a presentation, created by a German consulting firm, that outlined how to integrate Rundeck and Zabbix. The key was to use Rundeck’s scheduling capabilities to automate the remediation steps and create an audit trail.

"Rundeck reduced interruptions from alerts and saved us time. It is automation for automation."

 

Kosy Anyanwu

Optimizing Incident Response

The project's goals were straightforward. When a service stops running, Zabbix fires a trigger and calls middleware, which executes a Rundeck job to restart the service and sends an acknowledgement back to Zabbix. The result is a closed-loop process in which Ops is no longer a roadblock, and the Ops team reclaims time to focus on strategic projects.

How they did it:

  • Map Zabbix hosts to Rundeck nodes

  • Map Zabbix trigger names to Rundeck job names

  • On trigger event, pass related host and trigger from Zabbix to Rundeck

  • Return job execution status from Rundeck as an event acknowledgement in Zabbix
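The steps above can be sketched roughly as follows. This is an illustration only, not the team's actual middleware: the mapping tables, host names, job ID, server URL, and API version are all hypothetical placeholders. The sketch only builds the two requests (a Rundeck "run job" call and a Zabbix `event.acknowledge` JSON-RPC call) and leaves sending them, authenticating to Zabbix, and polling for the job result to the reader.

```python
import json

# Hypothetical mappings; a real deployment would load these from configuration.
HOST_TO_NODE = {"web01.example.internal": "web01"}       # Zabbix host -> Rundeck node
TRIGGER_TO_JOB = {"nginx is down": "restart-nginx-job-id"}  # trigger name -> job ID

RUNDECK_URL = "https://rundeck.example.internal"  # assumed server address
RUNDECK_API_VERSION = 41  # assumed; use whichever version your server supports


def build_rundeck_run_request(host, trigger, token):
    """Describe the HTTP request that asks Rundeck to run the mapped job
    on the mapped node. Nothing is sent here; the caller does that."""
    job_id = TRIGGER_TO_JOB[trigger]
    node = HOST_TO_NODE[host]
    return {
        "method": "POST",
        "url": f"{RUNDECK_URL}/api/{RUNDECK_API_VERSION}/job/{job_id}/run",
        "headers": {
            "X-Rundeck-Auth-Token": token,
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        # Restrict the execution to the node that raised the trigger.
        "body": json.dumps({"filter": f"name: {node}"}),
    }


def build_zabbix_ack(event_id, job_status, request_id=1):
    """Build the JSON-RPC payload that acknowledges the Zabbix event and
    records the Rundeck job status as the acknowledgement message.
    Authentication is omitted because it differs between Zabbix versions."""
    return {
        "jsonrpc": "2.0",
        "method": "event.acknowledge",
        "params": {
            "eventids": [event_id],
            "action": 6,  # bitmask: 2 (acknowledge) + 4 (add message)
            "message": f"Rundeck job finished with status: {job_status}",
        },
        "id": request_id,
    }
```

A middleware script would call `build_rundeck_run_request` with the host and trigger name passed in by the Zabbix action, send the request, wait for the execution to finish, and then post the result of `build_zabbix_ack` to the Zabbix API endpoint.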

The Results: Standard operating procedures and increased visibility

The ROI of the Rundeck implementation was quickly realized and readily apparent to the team. They were able to replace ad hoc, repetitive tasks with standard operating procedures across the team, improve efficiency, and create an audit trail for tracking all remote actions. Benefits include:

  • Eased overall maintenance of actions, with no duplicated sets of commands.

  • Reduced errors in definitions of remote commands. 

  • Created central repository for access credentials and keys.

  • Simplified testing of Zabbix remote commands. 

  • Preserved output from executed remote actions. 

  • Delivered job execution status from Rundeck as an event acknowledgement in Zabbix.

Kosy created an excellent deep-dive write-up of her integration of Rundeck and Zabbix.