In high-scale environments, manual intervention for routine service failures is a primary source of operational toil. My dissertation explored a closed-loop automation system designed to detect, diagnose, and resolve service outages without human interaction.
The architecture leverages event-driven automation to bridge the gap between monitoring signals and administrative actions.
When New Relic identifies that a service process has terminated or is unresponsive, it triggers a Critical Alert. This alert is configured to send a POST request via Webhook to the StackStorm API endpoint.
Inside StackStorm, a custom rule parses the incoming JSON. If the alert matches the "Service Down" pattern, an Orquesta workflow is initiated to execute a secure SSH or Kubernetes-native command to restart the specific service.
Reduced MTTR from minutes (human response) to sub-10 seconds (automated response).
Completely eliminated manual tickets for known L1 service failures.
Every automated action provides a full audit trail directly in MS Teams channels.
Ensured 99.9% availability for critical internal microservices.