Incident Response: Best Practices

Kumar Abhishek
Published in Fyipe
7 min read · Jul 31, 2018


Incident response is a top agenda item for every business, enterprise or startup, that wants to deliver an exceptional customer experience across every aspect of its offering. As websites and applications grow more complex, businesses find it challenging to maintain performance standards while offering best-in-class availability and reliability to their customers.

In the blog “How to achieve 99.99% uptime”, we saw the paradoxical relationship between increasing complexity and availability. As complexity increases in the system, it becomes more and more difficult to achieve high availability.

It is important to have high availability goals, but it is equally important to have a backup strategy in place in case something goes down or your customers report an incident.

When incidents come flooding in, not having a strategy in place is a bad idea. Here are the strategies that will help you create a rock-solid incident response plan.

Create a well defined process

A well-defined process with clearly defined roles removes ambiguity and confusion when you receive an incident report. This saves precious time that would otherwise be wasted on working out a course of action every time something goes wrong.

The incident response team must have well-defined roles assigned to its members. Check out our blog page to learn more about the roles we recommend and the responsibilities that must be assigned to each role.

The incident response process for every company is as unique as the company itself, but creating the plan with the following guidelines in mind helps define a plan that is both efficient and effective.

  1. Define clear terms, roles, hierarchy and responsibility: The whole incident response team must be clear about the terms and lingo used in the process. Every team member must know their role and the limits on the decisions they can make. The team must be aware of the chain of command and must have a fixed format for communicating with the various stakeholders.
  2. SLAs must be defined internally as well: Businesses must have a well-defined matrix that helps prioritize incidents based on their “urgency” and “impact”. The priority of an incident is decided using this matrix, with Priority 1 or P1 (highest urgency and highest impact) being the top priority and P5 the lowest. The matrix can look something like this.

Grouping incidents by priority helps define the SLA for each group, and the support team is expected to follow that SLA while resolving incidents. This not only increases the level of commitment but also helps resolve incidents on time.

  3. Ask for objections, not conformity: While defining the process, address the objections that any stakeholder might have. Sometimes an incident is so unique that the existing process requires some modification. In such cases, focusing on resolving objections saves a lot of time while deciding the best course of action.
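One way the urgency and impact matrix from guideline 2 could be expressed is sketched below. The three-level scale, the `Level` and `priority` names, and the additive rule are assumptions made for illustration; the only constraints taken from the text are that highest urgency with highest impact gives P1 and that P5 is the lowest priority. The SLA hours are placeholders, not recommendations.

```python
from enum import IntEnum


class Level(IntEnum):
    """Assumed three-level scale for both urgency and impact."""
    HIGH = 1
    MEDIUM = 2
    LOW = 3


def priority(urgency: Level, impact: Level) -> str:
    """Map urgency and impact to a label from P1 (top) to P5 (lowest)."""
    # HIGH + HIGH gives 1 + 1 - 1 = P1; LOW + LOW gives 3 + 3 - 1 = P5.
    return f"P{urgency + impact - 1}"


# Placeholder resolution targets in hours; each team sets its own.
SLA_HOURS = {"P1": 1, "P2": 4, "P3": 8, "P4": 24, "P5": 72}

if __name__ == "__main__":
    # Print the matrix: one row per urgency, one column per impact.
    for urgency in Level:
        row = " ".join(priority(urgency, impact) for impact in Level)
        print(urgency.name.ljust(8), row)
```

Keeping the rule in a single function means the support team and any tooling agree on how a given incident is classified.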

Create a communication strategy

In the blog “How to handle downtime ….” we learned how to create an effective communication plan as well as the importance of having that strategy in place. It is important to have a well-defined communication plan for both internal communication, between the teams working on the issue, and external communication, which involves customers, stakeholders and business users.

When a major incident happens, the response team can turn into a madhouse, so communication must follow fixed protocols and channels to prevent confusion at hour zero. It is important to have a person or team acting as liaison between teams, executives and customers, providing information on a need-to-know basis. This shields the team from unnecessary pressure from stakeholders and executives who want the incident resolved “within the next 10 minutes”. Such pressure demotivates a team that is already working at peak efficiency and slows down the response.

In such scenarios, having a statuspage powered by Fyipe goes a long way toward improving the time to resolution (TTR) of incidents by removing the need for unnecessary internal and external communication. Fyipe’s statuspage helps keep everyone on the same page.

Automate processes as much as possible

Automating repetitive processes that consume a lot of time when done manually goes a long way toward reducing the Time to Resolution (TTR) of incidents. But what should you automate?

The best way to reduce incident response time is to be aware of the problem before you receive an incident report from a user. You can do so by using tools that monitor the availability of the different components of your website or application. You also need a system in place to alert the IT team and the relevant stakeholders when something goes wrong.
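As a rough illustration of what such monitoring involves, here is a minimal sketch of an availability check built on Python's standard library. It is not how Fyipe or any particular tool works; the `is_up` and `should_alert` names and the rule of alerting after three consecutive failures are assumptions chosen for the example.

```python
import urllib.request
from typing import List


def is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx or 3xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError):
        # URLError, HTTPError and socket timeouts are OSError subclasses;
        # ValueError covers malformed URLs.
        return False


def should_alert(results: List[bool], threshold: int = 3) -> bool:
    """Alert only once the last `threshold` checks have all failed."""
    recent = results[-threshold:]
    return len(recent) == threshold and not any(recent)
```

A scheduler would call `is_up` at a fixed interval, append each result to a list, and notify the team once `should_alert` returns True. Requiring several consecutive failures is one way to avoid paging people for a single dropped request.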

Tools like Fyipe monitor your website, applications, servers and a lot more. The on-call management feature helps you choose who receives the alert based on their availability. Based on the schedule, it alerts the available person via call, SMS, email and other integrations such as Slack so that you can start working on the incident resolution as fast as possible.
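The idea behind schedule-based alerting can be sketched in a few lines. The following is a hypothetical model, not Fyipe's API: the `Shift` class, the `on_call` function, the people and the channel names are all invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Shift:
    person: str
    start: time  # inclusive
    end: time    # exclusive
    channels: Tuple[str, ...] = ("email",)

    def covers(self, moment: time) -> bool:
        if self.start <= self.end:
            return self.start <= moment < self.end
        # Overnight shift, e.g. 21:00 to 09:00, wraps past midnight.
        return moment >= self.start or moment < self.end


def on_call(schedule: List[Shift], at: datetime) -> Optional[Shift]:
    """Return the first shift covering `at`, or None if nobody is scheduled."""
    return next((s for s in schedule if s.covers(at.time())), None)


# Hypothetical two-person rota.
SCHEDULE = [
    Shift("Asha", time(9, 0), time(21, 0), ("slack", "email")),
    Shift("Ben", time(21, 0), time(9, 0), ("call", "sms")),
]
```

With this rota, an alert raised at 22:30 would go to Ben by call and SMS, while one raised at 10:00 would reach Asha on Slack and email. A gap in the schedule shows up as `None`, which is worth treating as an error rather than silently dropping the alert.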

A Fyipe-powered statuspage helps you keep the whole organization on the same page. It also keeps your customers aware of the issue and assures them that it is being taken care of.

A postmortem report must be part of the incident response plan

Even though a postmortem report might seem unnecessary, it is a vital part of the incident response plan. Incident response isn’t just about finding a fix and getting the system back to a normal state. It is a process that describes how you react to an incident to bring your systems back to their original state, and this state also includes intangible factors such as brand image, customer trust, reliability and customer experience.

While your IT team brings the tech infrastructure back to normal, the postmortem report helps you regain customer trust and improve the overall customer experience. Hence, it is important that you get the postmortem report of incidents right.

A postmortem report should essentially contain these three things:

1. A personal apology for the inconvenience caused to your customers/users.

2. A root cause analysis showing that you know why the issue occurred and were capable of fixing it.

3. An assurance that such incidents won’t happen in the future and that you have a solid plan for similar situations.

The whole idea of this report is to apologize and win back your customers’ trust. It is thus extremely important to maintain transparency and to keep the document convincing for your customers.

The report varies a bit depending on whether you have a tech product with technical customers or your customers belong to the non-technical end of the spectrum. Whichever the case, the only thing that matters is that your customers understand what you are trying to say.

Mock drills and training

One way to ensure that incident response goes as planned is to conduct mock drills, creating artificial failures and incidents to train the team to handle every kind of situation.

Repeated training hardwires the process into your team members, so they are better equipped to handle these situations in real life. To learn from each drill, make sure you create performance reports that answer questions such as:

  1. Was the team able to follow the documented process?
  2. What are the bottlenecks in the current process and how can they be removed?
  3. Was the team able to respond to the incident on time?
  4. What changes need to be made to the response plan to improve the efficiency of the team?

Make sure you implement the changes in the next drill and compare the results with the previous one to see if the changes resulted in any improvement in how the team handles incidents.
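Comparing drills is easier when each report records the same fields. The sketch below assumes two timings per drill, minutes to acknowledge and minutes to resolve; the `DrillReport` fields and the `compare` helper are hypothetical and should be adapted to whatever your team actually measures.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class DrillReport:
    name: str
    followed_process: bool
    minutes_to_acknowledge: float
    minutes_to_resolve: float


def compare(previous: DrillReport, current: DrillReport) -> Dict[str, float]:
    """Return the change in minutes per timing; negative means faster."""
    return {
        "acknowledge": current.minutes_to_acknowledge
        - previous.minutes_to_acknowledge,
        "resolve": current.minutes_to_resolve - previous.minutes_to_resolve,
    }
```

A negative value for both keys suggests the changes made between drills helped; a positive value points to a step in the process that needs another look.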

How does Fyipe help?

Fyipe monitors your website, apps, servers and much more, and alerts you and your team via call, SMS, email and other integrations such as Slack when something goes wrong with your system. The on-call management feature allows you to decide who in your team receives the alert according to their schedule and availability so that your team can start working on the issue right away.

Fyipe comes with a free trial so that you and your team can test it out and see if it fits your purpose. You can check it out here.
