The primary goals of writing an Incident Review are to ensure that the incident is documented, that all contributing root causes are well understood, and, especially, that effective preventive actions are put in place to reduce the likelihood and/or impact of recurrence.[1]
Not every incident requires a review. However, if an incident matches any of the following criteria, an incident review must be completed:
Incident reviews (of S1/S2 incidents) have two steps:
The first step in the Incident Review process is the synchronous review of the incident by representatives of the teams involved in resolving it. This step is conducted as close to the incident date as possible and does not require a complete Incident Review write-up. The outcome of this first step should be a published Incident Review, per defined timelines.
The second step of the Incident Review process is engaging with the customer through a point of contact such as a TAM. This should always involve sharing the findings from the first step in an async form. If a customer requires a sync call to discuss the findings, Infrastructure management will organise the discussion with the important stakeholders of this process, per defined timelines.
Incident resolution = date incident was resolved
Incident Reviews are conducted in production issues, except in extenuating circumstances when Infrastructure or Engineering management determines that a synchronous video call should be held. The issues should have the ~IncidentReview label attached.
~Corrective Action. Linking already existing issues for corrective action is appropriate if the incident was similar to a prior event and corrective actions overlap.
~Corrective Action must have an assigned priority label; it is the responsibility of the DRI to ensure that the priorities are set.
~Corrective Action issues, a due date should be set on the issue to ensure that expectations are set for resolving them.
~Corrective Action issues have been linked, the issue can be closed.
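The checks described above for corrective action issues (label, priority, due date) can be sketched as a small validation helper. This is only an illustration under stated assumptions: the `Issue` class, the function names, and the `priority::` label prefix are hypothetical and not taken from this page or from any specific issue tracker API.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

CORRECTIVE_ACTION_LABEL = "Corrective Action"
# Assumption: priority labels share a common prefix; adjust to the real scheme.
PRIORITY_PREFIX = "priority::"


@dataclass
class Issue:
    """Hypothetical, minimal stand-in for a linked follow-up issue."""
    title: str
    labels: set[str] = field(default_factory=set)
    due_date: Optional[date] = None


def corrective_action_problems(issue: Issue) -> list[str]:
    """Return the reasons an issue fails the checks (empty list if none)."""
    problems = []
    if CORRECTIVE_ACTION_LABEL not in issue.labels:
        problems.append("missing ~Corrective Action label")
    if not any(label.startswith(PRIORITY_PREFIX) for label in issue.labels):
        problems.append("missing priority label")
    if issue.due_date is None:
        problems.append("missing due date")
    return problems


def review_can_be_closed(linked: list[Issue]) -> bool:
    """Treat the review as closable once every linked issue passes the checks."""
    return all(not corrective_action_problems(issue) for issue in linked)
```

In this sketch a review with no linked issues also counts as closable; whether that matches the actual process is an assumption that the DRI would need to confirm.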
The infrastructure team keeps track of Corrective Actions on a dedicated board. The prioritization and assignment of issues are collectively handled by the Reliability Engineering managers.
[1] Google SRE, Chapter 15: Postmortem Culture: Learning from Failure