incident-responder

by Unknown v1.0.0

This skill configures an agent to act as an expert Site Reliability Engineering (SRE) incident responder, specializing in rapid problem resolution, modern observability practices, and incident management. It covers incident command, blameless post-mortems, error budget management, and system reliability patterns. It is meant for handling critical outages, planning incident communication, and improving incident response processes over time.
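To make "error budget management" concrete, here is a minimal sketch of the usual request-based calculation. It is an illustration only, not part of the skill itself: the function name `error_budget_remaining` and its parameters are hypothetical, and it assumes an availability SLO measured as the share of successful requests over a fixed window.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget for one SLO window.

    Hypothetical helper for illustration. 1.0 means the budget is untouched,
    0.0 means it is fully spent, and a negative value means the SLO is breached.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if failed_requests < 0:
        raise ValueError("failed_requests must not be negative")

    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # A 99.9% SLO over 1,000,000 requests tolerates about 1,000 failures.
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"Error budget remaining: {remaining:.1%}")
```

A responder might use a figure like this to judge how urgently to act: a nearly exhausted budget argues for rolling back quickly over debugging in place.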

Using modern observability tools and SRE principles, the agent assesses the severity and impact of an incident, establishes incident command, and applies immediate stabilization measures. It then gives guidance on communication strategy, resolution and recovery, and post-incident processes. It also emphasizes blameless post-mortems and ongoing system improvements, both to build more resilient systems and to strengthen the organization's incident response.
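The severity assessment described above can be sketched as a small rule table. Everything here is an assumption made for illustration: the `Impact` fields, the `SEV1` to `SEV4` labels, and the percentage thresholds are hypothetical and are not defined by the skill. Real severity matrices vary by organization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Impact:
    """Hypothetical summary of what an incident is doing to users."""
    users_affected_pct: float  # share of users affected, 0 to 100
    core_flow_down: bool       # a critical user journey is unavailable
    data_loss: bool            # data is being lost or corrupted


def classify_severity(impact: Impact) -> str:
    """Map an impact summary to a severity label using assumed thresholds."""
    if impact.data_loss or (impact.core_flow_down and impact.users_affected_pct >= 50):
        return "SEV1"
    if impact.core_flow_down or impact.users_affected_pct >= 10:
        return "SEV2"
    if impact.users_affected_pct > 0:
        return "SEV3"
    return "SEV4"


if __name__ == "__main__":
    print(classify_severity(Impact(users_affected_pct=60, core_flow_down=True, data_loss=False)))
```

Encoding the rules this way keeps the first triage decision fast and consistent, which matters more at the start of an incident than getting the label exactly right.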

Use this skill during a live production incident or when improving SRE practices. It provides actionable steps, best practices, and verification methods for both.

What It Does

Provides expertise in incident response, leveraging SRE principles and modern observability to quickly resolve production incidents, manage communication, and drive continuous improvement.

When To Use

- Working on incident response tasks or workflows.
- Needing guidance, best practices, or checklists for incident response.

Installation

Copy SKILL.md to your skills directory

