As operators, our job is to ensure that infrastructure, systems and applications are running correctly, waking up humans to fix only the things that machines can’t. With IT growing more complex and systems proliferating, it’s vital that we automate as much of our job as possible.
That’s why Rundeck, a PagerDuty company, and Sensu have partnered to provide plugins and integrations combining Rundeck’s runbook automation with Sensu’s monitoring-as-code technology. These integrations give you a monitoring system that delivers automated remediation via Rundeck automated playbooks, shortening incident response times and reducing alert fatigue for you and your colleagues.
This blog post covers common uses of the Rundeck-Sensu integration and describes how Sensu handlers and Rundeck plugins work together to trigger automated remediations. Find detailed product demos in a recent webinar here: Automated remediation with Rundeck and Sensu.
Monitoring as Code
Some of you reading this may not be familiar with Sensu. We are the creators of an observability pipeline that delivers monitoring as code for any cloud, and complete visibility of your systems from bare metal to Kubernetes. Sensu gives DevOps and SRE teams a flexible automation platform that integrates with best-of-breed data platforms, while letting you use your current monitoring and observability tools — for example, Nagios, Telegraf, StatsD or Prometheus.
Sensu’s flexibility and extensive integrations — such as the collaboration with Rundeck — are key to its value. Monitoring as code is far more than just automated installation and configuration of agents, plugins and exporters. It encompasses the entire observability lifecycle, including automated diagnosis, alerting and incident management, and of course automated remediation. To understand more about monitoring as code (and how it differs from infrastructure as code for monitoring), read our post Monitoring as code: what it is and why you need it.
Example Use Case 1: A Simple Remediation
As I mentioned earlier, the constant proliferation of systems and increasing complexity of IT infrastructure make it impossible for humans to keep vital business systems running with manual methods. Employing automation to do our tasks saves our energies for the things only humans can do — for example, coming up with new ways for IT to help achieve organizational goals.
Automated remediation also helps avert alert fatigue, which is a real thing that can mess with our judgement. Remediating via automation can also be far more accurate, though of course we do need fallbacks, such as alerts that go out when automated remediations fail.
In this first use case, we show how Sensu can detect when a service has failed on a node and how it fires a webhook through to Rundeck, which then automatically restarts the service without notifying a human.
We start with a Sensu environment and a Rundeck environment, with three nodes running NGINX as an example service. Sensu is actively monitoring the nodes (cloud compute instances or Kubernetes Pods) and the NGINX services running on them. The Sensu NGINX check contains an annotation with remediation instructions in case an NGINX service instance should fail, including a schedule (e.g. how many consecutive failures Sensu should wait for before taking some action) and the corresponding Rundeck Job information. Sensu is also configured with a Rundeck handler: the instructions the Sensu Rundeck integration uses to communicate with Rundeck and invoke Rundeck Jobs through the Rundeck Webhook API.
Each time Sensu detects a failure on a given instance of the NGINX service, it evaluates the event (including the above-mentioned check annotations) to determine whether any action should be taken.
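To make that evaluation step concrete, here is a minimal Python sketch of the kind of decision involved. It is an illustration only, not the integration's actual code: the annotation key `example.rundeck/remediation` and the `after_failures` and `webhook_url` settings are hypothetical names, and the real integration may store and interpret its configuration differently. The sketch assumes the usual Sensu Go event layout, where the check's status, occurrence count and annotations live under `check`.

```python
import json

# Hypothetical annotation key, used for illustration only.
ANNOTATION_KEY = "example.rundeck/remediation"


def remediation_target(event):
    """Return the webhook URL to call for this event, or None if no action is due."""
    check = event.get("check", {})
    annotations = check.get("metadata", {}).get("annotations") or {}
    raw_config = annotations.get(ANNOTATION_KEY)
    if raw_config is None:
        return None  # This check carries no remediation instructions.

    config = json.loads(raw_config)
    if check.get("status", 0) == 0:
        return None  # Status 0 means the check is passing.

    # Act exactly once, on the Nth consecutive failure.
    if check.get("occurrences", 0) != config.get("after_failures", 1):
        return None
    return config.get("webhook_url")
```

Matching on an exact failure count, rather than "at least N", means the job is requested once per incident instead of on every subsequent failing check result.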
When Sensu processes a remediation action using Rundeck, it sends a POST request to the corresponding Rundeck Webhook URL to invoke a Rundeck Job that fixes the underlying failure detected by Sensu. In our demo, this results in restarting NGINX.
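The webhook call itself is an ordinary HTTP POST. The following Python sketch shows roughly what such a request could look like, using only the standard library. The payload fields (`entity`, `check`, `status`) are an assumption made for this example, and the webhook URL is whatever Rundeck generated for your Job; the Sensu Rundeck handler does all of this for you and may send a different body.

```python
import json
import urllib.request


def build_webhook_request(webhook_url, event):
    """Build a POST request carrying a small summary of the Sensu event."""
    # Assumed payload shape: the Rundeck Job decides how to use these fields.
    payload = {
        "entity": event["entity"]["metadata"]["name"],
        "check": event["check"]["metadata"]["name"],
        "status": event["check"]["status"],
    }
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def trigger_remediation(webhook_url, event, timeout=10):
    """Send the request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses, which a
    caller can catch to escalate to a human.
    """
    request = build_webhook_request(webhook_url, event)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping request construction separate from sending makes the payload easy to inspect or test without a running Rundeck server.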
Users can also execute checks manually by clicking Execute at the top of the check view in Sensu if necessary.
Now, when you go to the Activity tab, you can see the timestamp when the Sensu fix job was run, and that it was successfully executed against the node. Then you can look at the Sensu window again and see that the agent is now healthy — the NGINX service was restarted.
Automated remediations can sometimes fail, but you have multiple avenues for surfacing such failures. Within Rundeck, you can set a notifier for when a Rundeck job fails. In Sensu, you can configure a handler so that an alert gets sent to PagerDuty if an automated remediation sent to Rundeck fails. The PagerDuty alert makes sure someone on the team knows they need to intervene manually and get the service running again.
Example Use Case 2: Silencing a node for planned maintenance
Ideally, automation and observability work in concert. In some cases, however, automation makes changes to systems that result in false-positive alerts. For example, scheduled and unscheduled maintenance could inadvertently stop a service that is being actively monitored.
The Rundeck Sensu integration provides new Workflow steps to temporarily silence events on nodes where Rundeck Jobs are running, eliminating false positive alerts and improving the overall reliability of the system.
Take a planned NGINX restart as an example. Before stopping the NGINX service, you can use Rundeck to create a silence on that particular node so that when NGINX goes down, the Sensu agent won’t send out any alerts and NGINX won’t be automatically restarted on that node. During the maintenance you also don’t want the agent on that node to notify for any other reason or trigger any handlers. Once the NGINX service is restarted on the node, the silence is removed automatically!
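The Rundeck workflow steps create and clear the silence for you. For readers curious about what a silence looks like on the Sensu side, here is a hedged Python sketch that builds a silencing entry for one node. It follows the Sensu Go convention of naming a silence `<subscription>:<check>` and targeting a single entity through its `entity:<name>` subscription; the function name, default values and node name are made up for this example, and the exact fields the Rundeck plugin sends are an assumption.

```python
def build_silence(entity_name, check_name=None, namespace="default",
                  expire_seconds=3600, reason="planned maintenance"):
    """Build a Sensu-style silencing entry for a single entity.

    With check_name=None the silence covers every check on the entity,
    which matches the "no notifications at all during maintenance" case.
    """
    subscription = f"entity:{entity_name}"
    silence = {
        "metadata": {
            "name": f"{subscription}:{check_name or '*'}",
            "namespace": namespace,
        },
        "subscription": subscription,
        # Safety net: the silence expires even if cleanup never runs.
        "expire": expire_seconds,
        "expire_on_resolve": False,
        "reason": reason,
    }
    if check_name:
        silence["check"] = check_name
    return silence


# Example: silence everything on a hypothetical node for 30 minutes.
maintenance_silence = build_silence("nginx-node-1", expire_seconds=1800)
```

Setting an expiry is a deliberate choice here: if the maintenance job fails before it can remove the silence, monitoring resumes on its own rather than staying muted indefinitely.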
See full demos of the use cases outlined above in the on-demand webinar!
The Sensu plugin is available for Rundeck Enterprise customers on Rundeck version 3.3.6 and newer. Find more information about the Sensu plugin and other Rundeck plugins here. To get started with Sensu, please visit https://sensu.io#get-started