Auto-Remediation Architecture via StackStorm

In high-scale environments, manual intervention for routine service failures is a primary source of operational toil. My dissertation explored a closed-loop automation system designed to detect, diagnose, and resolve service outages with no human in the loop.

The Workflow

The architecture leverages event-driven automation to bridge the gap between monitoring signals and administrative actions.

1. New Relic: Service Down Detection
2. Webhook Payload: JSON Event Data
3. StackStorm Engine: Rule Matching & Logic
4. Auto-Remediation: Service Restart Command
5. MS Teams: Outcome Notification

Technical Deep Dive

When New Relic detects that a service process has terminated or become unresponsive, it raises a Critical alert. The alert is configured to send a webhook, an HTTP POST request carrying the event data as JSON, to the StackStorm API endpoint.
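To make the webhook step concrete, here is a minimal sketch of the kind of request that arrives at StackStorm. It only builds the request and does not send it. The host name, the hook name "newrelic", the API key value, and every field in the payload are hypothetical placeholders; the actual payload shape depends on how the New Relic webhook template is configured.

```python
import json
import urllib.request


def build_webhook_request(st2_host, hook_name, api_key, payload):
    """Build (but do not send) a POST request to a StackStorm webhook."""
    # StackStorm exposes webhooks under /api/v1/webhooks/<name>.
    url = f"https://{st2_host}/api/v1/webhooks/{hook_name}"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # API-key authentication header accepted by StackStorm.
            "St2-Api-Key": api_key,
        },
    )


# Hypothetical alert payload; all field names are assumptions.
example_payload = {
    "condition_name": "Service Down",
    "severity": "CRITICAL",
    "state": "open",
    "service": "billing-api",
    "host": "app01.example.internal",
}

request = build_webhook_request(
    "st2.example.internal", "newrelic", "EXAMPLE_KEY", example_payload
)
# Passing `request` to urllib.request.urlopen() would actually send it.
```

In a real deployment New Relic sends this request itself; the sketch is mainly useful for replaying a sample alert against a test StackStorm instance.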

Inside StackStorm, a custom rule parses the incoming JSON. If the alert matches the "Service Down" pattern, the rule starts an Orquesta workflow that restarts the affected service, either over SSH or with a Kubernetes-native command.
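The matching and dispatch logic can be sketched in plain Python. This is an illustration only: in StackStorm the matching would normally be expressed as rule criteria in YAML and the restart as an Orquesta workflow task. The payload field names ("severity", "state", "condition_name", "platform", "service", "namespace", "host") are assumptions, as is the choice of `kubectl rollout restart` and `systemctl restart` as the restart commands.

```python
import re

# Conservative allowlist for names taken from an untrusted payload,
# so they cannot smuggle extra shell arguments into the command.
SAFE_NAME = re.compile(r"[a-z0-9][a-z0-9.-]{0,62}")


def is_service_down(payload):
    """Return True if the alert matches the 'Service Down' pattern."""
    return (
        payload.get("severity") == "CRITICAL"
        and payload.get("state") == "open"
        and payload.get("condition_name") == "Service Down"
    )


def _checked(value, field):
    """Validate a payload value against the allowlist or raise."""
    if not isinstance(value, str) or not SAFE_NAME.fullmatch(value):
        raise ValueError(f"unsafe or missing value for {field!r}")
    return value


def build_restart_command(payload):
    """Return the restart command as an argument list (never a shell string)."""
    service = _checked(payload.get("service"), "service")
    if payload.get("platform") == "kubernetes":
        namespace = _checked(payload.get("namespace", "default"), "namespace")
        return ["kubectl", "rollout", "restart",
                f"deployment/{service}", "-n", namespace]
    host = _checked(payload.get("host"), "host")
    return ["ssh", host, "sudo", "systemctl", "restart", service]


def handle_alert(payload):
    """Return a restart command for matching alerts, otherwise None."""
    if not is_service_down(payload):
        return None
    return build_restart_command(payload)
```

Returning an argument list rather than a formatted string keeps the command safe to pass to a process runner without shell interpolation, which matters because the service name originates from an external alert.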

Impact & Results

Rapid Recovery

Reduced mean time to recovery (MTTR) from minutes (human response) to under 10 seconds (automated response).

Zero Toil

Completely eliminated manual tickets for known L1 service failures.

Visibility

Every automated action is reported to an MS Teams channel, leaving a full audit trail.

Stability

Ensured 99.9% availability for critical internal microservices.