Incident Handling

[Figure: incident handling diagram (draw.io)]

Sources of Incident Alerts

  • Monitoring Alerts ⇒ Acknowledge the OpsGenie or Icinga alert!

  • Customer Ticket via VSHN Portal control.vshn.net

  • Customer Ticket via Email

  • Customer Calls our Technical Hotline +41 44 545 53 53

Jira Ticket

  • Check if an Incident Ticket already exists for the issue at hand, otherwise create one.

    • Minor incidents that can be fixed very quickly don’t need a dedicated ticket.

  • If more than one customer could be affected, create a ticket in VSHNOPS and link it to the tickets of each affected customer, so that data security stays intact.

  • Assess the severity level of the incident and set the ticket priority accordingly.

  • Note how you got alerted (who called, which monitoring alert).

  • Link the OpsGenie alert if you have one.

  • Write down how to reproduce the error or check the status (link the alert, etc.).

  • If it’s about a security vulnerability, then check out Vulnerability Process for VSHNeers

Communication

Internal Communication:

  1. Ensure a communication channel is established

    1. vshn.chat reachable:

      1. For high severity incidents, create a new public channel for this incident, called #incident-YYYY-MM-DD (for example #incident-2020-09-01), and announce it in the Threema group "VSHN Tech" and in the #general channel on vshn.chat.

      2. Lower severity incidents can be handled in #operations channel

    2. vshn.chat unreachable: Announce the incident in the Threema group "VSHN Tech" and include the information that vshn.chat is unreachable.

  2. Join the VSHN Company Zoom room, the same one used for company meetings. This keeps the paths of communication short and gives the response an initial point of coordination.

  3. Announce the Incident to the vshn.chat #operations channel.
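The channel choice in step 1 can be sketched as a small helper. This is only an illustration and not part of any existing VSHN tooling: the function name `incident_channel` and its parameters are hypothetical, and only the channel names and the date pattern are taken from the steps above.

```python
from datetime import date

# Fallback when vshn.chat is unreachable (name taken from the procedure above).
THREEMA_GROUP = 'Threema group "VSHN Tech"'


def incident_channel(day: date, high_severity: bool, chat_reachable: bool = True) -> str:
    """Return where the incident should be handled (hypothetical helper)."""
    if not chat_reachable:
        # vshn.chat is down: fall back to the Threema group.
        return THREEMA_GROUP
    if high_severity:
        # Dedicated public channel, named #incident-YYYY-MM-DD.
        return f"#incident-{day.isoformat()}"
    # Lower severity incidents are handled in the shared channel.
    return "#operations"


print(incident_channel(date(2020, 9, 1), high_severity=True))
```

Using `date.isoformat()` keeps the channel name in the YYYY-MM-DD form regardless of locale settings.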

External Communication:

  1. Check the "00-OnCall" wiki page in the customer wiki space for any customer-specific contact instructions, procedures, etc.

  2. Establish communication with the customer

    1. Define contact channels with the customer (Ticket + Chat, Zoom or phone)

    2. Define the next contact time and prioritize problems with the customer

  3. If needed, regularly update status.vshn.net or status.appuio.cloud while the solution is being worked on (get help from someone who is not involved in fixing the problem).

    1. if more than one customer is affected

    2. or the incident has a major impact and downtime is more than 10-15 minutes

    3. or for other reasons that might call for it
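The criteria for updating the status page can be read as a simple "any of these" check. The sketch below is a hypothetical helper, not an existing tool; it assumes the "10-15" figure means minutes and uses the lower bound (10) as an adjustable default.

```python
def should_update_status_page(
    affected_customers: int,
    major_impact: bool,
    downtime_minutes: float,
    other_reason: bool = False,
    downtime_limit_minutes: float = 10,  # assumption: lower bound of "10-15"
) -> bool:
    """Return True if any of the listed criteria applies (hypothetical helper)."""
    if affected_customers > 1:
        return True
    if major_impact and downtime_minutes > downtime_limit_minutes:
        return True
    # Catch-all: any other reason the engineer judges relevant.
    return other_reason


print(should_update_status_page(affected_customers=1, major_impact=True, downtime_minutes=20))
```

The last criterion stays a human judgment call, which is why it is passed in as a plain flag rather than computed.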

Get Help and Coordinate

Make sure to get help if…

  • … it is useful or needed.

  • … it is clear that one person alone cannot handle it quickly.

  • … it’s not solved (system running again) within 30-60 min.

  • … more than one customer is affected and coordination and communication need to be handled with more focus than one person can spare.
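As a rough sketch, the criteria above can be collected into a list of reasons, which is also handy to paste into the ticket when asking for help. The function name, parameters and the 30-minute default (the lower bound of "30-60 min") are assumptions made for illustration.

```python
from typing import List


def reasons_to_get_help(
    minutes_unresolved: float,
    affected_customers: int,
    one_person_suffices: bool = True,
    help_useful: bool = False,
    unresolved_limit_minutes: float = 30,  # assumption: lower bound of "30-60 min"
) -> List[str]:
    """Return every criterion that currently applies (hypothetical helper)."""
    reasons = []
    if help_useful:
        reasons.append("help is useful or needed")
    if not one_person_suffices:
        reasons.append("one person alone cannot handle it quickly")
    if minutes_unresolved >= unresolved_limit_minutes:
        reasons.append(f"not solved within {unresolved_limit_minutes:g} minutes")
    if affected_customers > 1:
        # Simplification: the procedure also asks whether coordination and
        # communication need more focus than one person can spare.
        reasons.append("more than one customer is affected")
    return reasons


print(reasons_to_get_help(minutes_unresolved=45, affected_customers=1))
```

An empty list means none of the criteria apply yet; any non-empty result is a signal to pull in more people.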

Distribute Responsibilities

One person can cover all responsibilities, or they can be split up as the engineers involved see fit.

  • Technical lead - Owns the technical problem-solving

  • Technical help - Contributes towards the technical problem-solving

  • Organizational lead (aka Incident Commander) - Handles documentation, information and feedback

    • Documents the incident timeline

    • Procures needed information

    • Requests feedback along the way where useful or needed

    • Mandatory: request confirmation once the technical lead is confident the service(s) are back online. ⇒ Confirmation can come from the customer if they were involved, or internally if the customer was not involved.

  • Communication Manager - Ensures communication with the customer(s), who must be kept updated at regular intervals about:

    • Results of the problem analysis

    • Progress in problem-solving

    • Foreseeable duration of the interruption

    • The solution or parts of it, as soon as it is available, even if the solution is only a temporary one.

Close Ticket

As soon as the incident is resolved, inform the customer of the resolution and close the incident ticket.

  • Document the cause, the timeline and the solution(s)

  • Document if there is anything left to do:

    • What is left to do?

    • Link follow-up ticket.

Next Steps and Follow up

These are very situational, depending on the severity, the nature of what caused the incident, whether the solution is just a temporary one, whether there is more left to do, etc.

  • If a customer was affected in any way, create a post-mortem according to the post-mortem template.

  • Inform Key Account Management and notify them about the post-mortem.

  • Check if one or more follow-up problem tickets are needed to work on the underlying root cause. Problem tickets may evolve into change tickets, once an actionable solution for the underlying cause has been found.