SRE Incident Response | O’Reilly

English | Size: 612.33 MB
Genre: eLearning

Incidents are costly. When your system goes down, you must work quickly and effectively to get things back up. The gold-standard process is the incident management system (IMS), developed by American firefighters in the 1970s. IMS is now used by militaries, emergency personnel, and, in the domain of site reliability engineering (SRE), companies like Google. A fast, well-coordinated response can make the difference between meeting your service-level objectives (SLOs) and blowing right past them, which is why effective incident response is a core pillar of SRE.

Just as important are the preparation done beforehand and the analysis that occurs afterward. During nonincident times, organizations should be safely testing how services may fail (such as with game days), planning who responds when things break, and crafting playbooks for common actions and responses. Postincident, measuring and evaluating incident response is crucial to determine what works and what doesn’t.

Incident Labs’ Emil Stolarsky and Jaime Woo show you how to create a successful incident response strategy, from preparation and training to running IMS during the incident to evaluating the response and sharing lessons learned throughout your organization. Our services will never be perfect, and they’ll all break eventually. What makes us SREs is how we prepare for those days when things break, how we respond, and what we learn.



