Incident Management and the (literal) history of putting out fires
We’ve all heard of Incident Management in one form or another, but not many know how it came to be.
Incident management is the process of responding to and mitigating computer system incidents so that the end users of the affected service experience as little disruption as possible.
The earliest version of an Incident Management System is the Incident Command System (ICS). Work on ICS began after a devastating series of California wildfires in 1970, the best known of which is the Laguna fire. At the time, the Laguna fire was the third largest fire in California’s history. During 13 days in 1970, 16 lives were lost, 700 structures were destroyed and over one-half million acres burned. The overall cost and loss associated with these fires totaled $18 million per day.
This fire was so damaging because, while there was a process for handling incidents locally, it crossed multiple jurisdictions, and there was confusion and uncertainty over who led efforts such as flying the planes carrying the dousing materials. Firefighters from roughly 500 different agencies couldn’t always communicate because they didn’t use the same radio frequencies. There was no centralized source where crews could get up-to-date information on where and when they were needed, so they weren’t deployed efficiently. Even the terminology they used to describe their equipment and tactics differed. This lack of collaboration and communication ended up costing human lives. After the fire, fire chiefs from across California met to figure out how to collaborate better in the future and prevent similar issues. ICS came out of this initiative, as a way to improve their techniques and to coordinate how agencies manage and respond to wildfires.
ICS defined an organizational hierarchy and the responsibilities that the various agencies have during incident response, allowing them to collaborate and communicate more effectively. It introduced a chain of command in which each person has only one supervisor, leading up to the “Incident Commander”, the single leader in charge of the entire incident. The position can be rotated, but at any given time it is clear who the lead responder is and who is responsible for incident mitigation and delegation.
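To make the "one supervisor per person" idea concrete, here is a minimal sketch in Python. It is not taken from any ICS documentation: the `Responder` class, the `chain_of_command` helper and the example names are all hypothetical, chosen only to show how a single-supervisor rule always leads back to one Incident Commander.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Responder:
    name: str
    # Each person reports to exactly one supervisor.
    # Only the Incident Commander has no supervisor.
    supervisor: Optional["Responder"] = None


def chain_of_command(person: Responder) -> list[str]:
    """Walk upward from a responder until we reach the top of the chain."""
    chain = [person.name]
    while person.supervisor is not None:
        person = person.supervisor
        chain.append(person.name)
    return chain


# Hypothetical names, for illustration only.
commander = Responder("Incident Commander")
ops_lead = Responder("Operations Lead", supervisor=commander)
engine_crew = Responder("Engine Crew", supervisor=ops_lead)

print(" -> ".join(chain_of_command(engine_crew)))
```

Because nobody can have two supervisors, there is never any ambiguity about whose instructions to follow, which is exactly the confusion that was so costly in 1970.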
ICS is also a part of the National Incident Management System (NIMS), which guides organizations in both the public and private sectors on how to work together to prevent, protect against, mitigate, respond to and recover from incidents. NIMS was created after 9/11 (it was first published in 2004), tying together various federal emergency preparedness efforts under the Federal Emergency Management Agency (FEMA). NIMS has three main components: Resource Management, Command and Coordination (including ICS), and Communications and Information Management. It also provides the necessary training across different levels of government on using NIMS and ICS to manage different types of emergency response. Other countries around the world have their own versions of NIMS and use some type of ICS.
As computer systems became larger and end users came to expect higher availability across multiple time zones, Systems Engineers and Site Reliability Engineers had to come up with ways to manage their incident response effectively. While availability and reliability are the top features required of a system, it’s impossible to have no downtime at all, especially with global demand and with system outages and required maintenance always a possibility.
Without good incident response, communication breaks down when issues happen and it’s hard to know who is working on what, which causes unnecessary duplication of effort and creates opportunities for worse outcomes if efforts clash.
ICS has since been adapted by many companies (such as PagerDuty and Google) for their own incident management systems. From a high level, the process and lifecycle of incident response generally look the same. The goals are the same as well, adapted from ICS and NIMS:
- Maintain a clear line of command
- Designate clearly defined roles
- Keep a working record of debugging and mitigation as you go
- Declare incidents early and often
An incident usually gets reported either by an alert firing from a monitoring/observability dashboard or by a user report. Once an incident is declared, an Incident Commander is chosen (usually the person who declared the incident) and the existing incident management process kicks in. From there, the ICS-style command structure organizes the work of resolving the incident.
In the ICS command structure as adapted by tech companies, there are usually one or a few people in charge of “Communications”, while the people resolving the incident through investigation and mitigation are in charge of “Operations/Execution”. The names change from company to company, and so do the splits. For example, at PagerDuty the “Subject Matter Experts” are the Operations folks on incident response, and there are two distinct Communication roles, “Internal Liaison” and “Customer Liaison”, alongside the Incident Commander. Google has a “Communications Lead” and an “Operations Lead”, with additional people reporting to them as needed. In comparison, here at Datadog we have “Responders”, who are the folks working on the incident, plus a “Customer Liaison” and an “Incident Executive Lead”, who are in charge of external and internal communication respectively. All of this is so that the Incident Commander can be the main source of truth while the people working on the incident can focus on resolving it without interruption.
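One way to see how similar these structures are is to line the role names up against the same generic functions. The sketch below does that in Python. The role titles come from the paragraph above, but the three function keys (`operations`, `internal_comms`, `external_comms`) and the `role_for` helper are assumptions made for this illustration, not anything these companies publish. In particular, mapping Google’s single “Communications Lead” to both communication functions is a simplification.

```python
# Every structure described above keeps a single Incident Commander on top.
INCIDENT_COMMANDER = "Incident Commander"

# Role titles as described in the text; the generic function keys are
# an assumption of this sketch, not an official taxonomy.
ROLE_NAMES = {
    "PagerDuty": {
        "operations": "Subject Matter Experts",
        "internal_comms": "Internal Liaison",
        "external_comms": "Customer Liaison",
    },
    "Google": {
        "operations": "Operations Lead",
        # One communications role covers both audiences in this sketch.
        "internal_comms": "Communications Lead",
        "external_comms": "Communications Lead",
    },
    "Datadog": {
        "operations": "Responders",
        "internal_comms": "Incident Executive Lead",
        "external_comms": "Customer Liaison",
    },
}


def role_for(company: str, function: str) -> str:
    """Return the company-specific title for a generic incident function."""
    return ROLE_NAMES[company][function]


for company in ROLE_NAMES:
    print(
        f"{company}: {INCIDENT_COMMANDER} leads, "
        f"{role_for(company, 'operations')} handle operations, "
        f"{role_for(company, 'external_comms')} talks to customers"
    )
```

Whatever the titles, the shape stays the same: one commander, a group doing the hands-on work, and dedicated people keeping everyone else informed.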
These people work together to resolve the incident and then stabilize the service to wrap up the incident. Afterwards there is usually a postmortem or incident retrospective to review what happened. It’s helpful to have important details in writing, such as the incident timeline, what went well, what went wrong and where luck was involved. These artifacts are helpful for future incidents and for folks training on how to best respond to incidents, and they also include tasks aimed at preventing similar incidents in the future. Postmortems cover two parts of the NIMS fundamentals: “Preparedness” and “Ongoing Management and Maintenance”. The document supports future review and training, and the action items assigned after the incident and recorded in the postmortem are usually meant to prevent the incident from happening again, which contributes to the maintenance of the system.
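As a rough illustration of what such a document captures, here is a minimal Python sketch of a postmortem record. The field names mirror the details listed above (timeline, what went well, what went wrong, where luck was involved, action items), but the `Postmortem` and `ActionItem` classes and the sample incident are hypothetical and do not reflect any particular company’s template.

```python
from dataclasses import dataclass, field


@dataclass
class ActionItem:
    description: str
    owner: str
    done: bool = False


@dataclass
class Postmortem:
    title: str
    timeline: list[str] = field(default_factory=list)
    went_well: list[str] = field(default_factory=list)
    went_wrong: list[str] = field(default_factory=list)
    got_lucky: list[str] = field(default_factory=list)
    action_items: list[ActionItem] = field(default_factory=list)

    def open_action_items(self) -> list[ActionItem]:
        """Follow-up work that has not been completed yet."""
        return [item for item in self.action_items if not item.done]


# Hypothetical incident, for illustration only.
pm = Postmortem(
    title="Checkout latency spike",
    timeline=["10:02 alert fired", "10:05 incident declared", "10:40 mitigated"],
    went_well=["Incident was declared early"],
    went_wrong=["Runbook was out of date"],
    got_lucky=["Traffic was low at the time"],
    action_items=[
        ActionItem("Update the runbook", owner="alice", done=True),
        ActionItem("Add an alert on queue depth", owner="bob"),
    ],
)

for item in pm.open_action_items():
    print(f"OPEN: {item.description} (owner: {item.owner})")
```

Keeping action items as structured data rather than free text makes it easier to track whether the preventive work actually gets done.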