Incident Handling

1. Incident Handling Process

This section describes the incident handling process. It is meant to be a helpful tool for an engineer in a stressful situation. Incidents come in many different shapes. At VSHN we handle a lot of operations and a lot of small incidents. Most of them follow a similar process, but more implicitly. This section documents how we handle incidents. An engineer should be able to rely on this description when they start working on an incident.

1.1. Graphical Process

1.2. Process Description

Start of incident

VSHN gets notified about incidents via these sources:
1. Monitoring Alerts.
2. Customer Ticket via VSHN Portal control.vshn.net.
3. Customer Ticket via Email.
4. Customer Calls our Technical Hotline +41 44 545 53 53.
If it is not an incident, the ticket is triaged according to the Service Desk Workflow and the process ends.
If it is outside office hours, follow the On-Call Guide in our wiki.
If it is about a security vulnerability, check out the Vulnerability Process for VSHNeers.
Engineer acknowledges the alert in OpsGenie / Icinga and creates a ticket.
Engineer starts working.

During Incident

Create a ticket (see supporting info in Jira Ticket) to document the working process, if not already present.
Check out Communication below.
After 30 minutes, update the ticket with the status:
1. Which service is affected.
2. What happened.
3. When did it start.
4. What is the current state of the incident handling.
  1. Investigating
  2. Mitigating
  3. Observing
5. Who to contact with questions.
The engineer should decide if they need help; see Get Help and Coordinate for guidance. If no further help is needed, continue working for 30 more minutes and update the ticket again.
If the engineer sees that they cannot handle the incident alone, or that it could turn into a bigger incident, they must start to distribute responsibilities according to Roles in Incident Handling.
If it is an information security incident, report it according to Report Security Incident if not done already. If unsure whether it classifies as a security incident, ask the CISO or people from the ISM Governance Role.
If the incident takes longer than 2h or has occurred for the third time in 48h, inform Account Management.
If it is a big incident where multiple services are down for a long time (for example a complete outage of a cloud site), the Business Continuity Plan must be triggered.
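
The Account Management rule above (longer than 2h, or third occurrence in 48h) can be sketched as a small check. This is only an illustration: the function name, its parameters, and the idea of keeping a list of occurrence start times are assumptions and not part of any VSHN tooling.

```python
from datetime import datetime, timedelta

# Thresholds taken from the process description above.
ACCOUNT_MANAGEMENT_AFTER = timedelta(hours=2)
RECURRENCE_WINDOW = timedelta(hours=48)
RECURRENCE_COUNT = 3


def must_inform_account_management(started_at, now, occurrences):
    """Return True if Account Management has to be informed.

    occurrences holds the start time of every occurrence of this
    incident, including the current one (hypothetical bookkeeping).
    """
    # Rule 1: the current incident takes longer than 2h.
    if now - started_at > ACCOUNT_MANAGEMENT_AFTER:
        return True
    # Rule 2: the incident occurred at least three times in the last 48h.
    recent = [t for t in occurrences
              if timedelta(0) <= now - t <= RECURRENCE_WINDOW]
    return len(recent) >= RECURRENCE_COUNT
```

For example, an incident that started 30 minutes ago but is the third occurrence within the last 48 hours already requires informing Account Management.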

After Incident

If the incident took less than 4h, close the incident ticket according to Close Ticket.
If the incident took more than 4h:
1. Account Management decides if a Post Mortem should be done. Account Management leads the post-mortem process.
2. Create a ticket if there is more work to do or an underlying problem has to be addressed.
3. Close the incident ticket with a well-formulated summary.
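
As a rough illustration, the branching on the 4h threshold could be expressed as follows. The function and parameter names are hypothetical. The text does not say what happens at exactly 4h, so this sketch assumes the longer branch applies in that case.

```python
from datetime import timedelta

# Threshold taken from the steps above.
POST_MORTEM_THRESHOLD = timedelta(hours=4)


def after_incident_steps(duration, follow_up_needed):
    """Return the closing steps for an incident of the given duration."""
    if duration < POST_MORTEM_THRESHOLD:
        return ["Close the incident ticket according to Close Ticket"]
    # Assumption: an incident of exactly 4h is treated like a longer one.
    steps = ["Account Management decides if a Post Mortem should be done"]
    if follow_up_needed:
        steps.append("Create a ticket for the remaining work or underlying problem")
    steps.append("Close the incident ticket with a well-formulated summary")
    return steps
```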

2. Supporting Information

Some supporting information you may need for incident handling.

2.1. Jira Ticket

Check if an Incident Ticket already exists for the issue at hand; otherwise create one.
- Small incidents which can be fixed very quickly don’t need a dedicated ticket.
If more than one customer could be affected, create a ticket in VSHNOPS or another VSHN internal Jira project and link the tickets of all affected customers, so as to keep data security intact.
Assess the Severity Level of the incident and set the ticket priority accordingly.
Note down how you got alerted (who called, which monitoring alert).
Link the OpsGenie alert if you have one.
Write down how to reproduce the error or check the status (link alert, etc.).

A good ticket should contain this information:

Which service is affected
What happened
When did it start
What is the current state of the incident handling
- Investigating
- Mitigating
- Observing
Who to contact with questions
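
The five items above can be captured in a small structure so that every status update has the same shape. This is a minimal sketch; the class, field names, and output format are assumptions for illustration and not an existing Jira template.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class HandlingState(Enum):
    """The three handling states named in the list above."""
    INVESTIGATING = "Investigating"
    MITIGATING = "Mitigating"
    OBSERVING = "Observing"


@dataclass
class StatusUpdate:
    service: str            # which service is affected
    what_happened: str      # short description of the problem
    started_at: datetime    # when it started
    state: HandlingState    # current state of the incident handling
    contact: str            # who to contact with questions

    def render(self) -> str:
        """Format the update as plain text for a ticket comment."""
        return "\n".join([
            f"Affected service: {self.service}",
            f"What happened: {self.what_happened}",
            f"Started: {self.started_at:%Y-%m-%d %H:%M}",
            f"Current state: {self.state.value}",
            f"Contact: {self.contact}",
        ])
```

Calling `render()` on a filled-in `StatusUpdate` yields five lines, one per item, which can be pasted into the ticket at each 30-minute update.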

2.2. Severity Levels

The four severity levels are related to the severity levels in the ISMS Security Incident Management Process.

Trivial/Event

An issue that is quick to resolve with minimal effort, often handled immediately upon detection or reporting.
Resolution Time: Resolved within 15 minutes.

Minor

The incident is noticeable
The incident can be pinpointed to individual systems
Resolution Time: longer than 15 minutes and resolved under 2 hours

Major Incident

The incident causes significant business impact for customers or VSHN
Multiple systems are affected
Resolution Time: longer than 2 hours

Incident endangering the business continuity

A serious incident that could endanger the existence of VSHN.
The Business Continuity Plan must be triggered.
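
The resolution-time boundaries above can be sketched as a simple mapping. This only covers the time criterion: the impact criteria (noticeable, individual versus multiple systems, business impact) still need human judgment. The names are hypothetical, and treating exactly 2 hours as Major is an assumption because the text leaves that boundary open.

```python
from datetime import timedelta
from enum import Enum


class Severity(Enum):
    TRIVIAL = "Trivial/Event"
    MINOR = "Minor"
    MAJOR = "Major Incident"
    BUSINESS_CONTINUITY = "Incident endangering the business continuity"


def severity_for(resolution_time, endangers_business_continuity=False):
    """Map a resolution time to a severity level (time criterion only)."""
    # The highest level is not time-based; it is decided by people.
    if endangers_business_continuity:
        return Severity.BUSINESS_CONTINUITY
    if resolution_time <= timedelta(minutes=15):
        return Severity.TRIVIAL
    if resolution_time < timedelta(hours=2):
        return Severity.MINOR
    # Assumption: exactly 2 hours counts as Major.
    return Severity.MAJOR
```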

2.3. Communication

Internal Communication

Ensure a communication channel is established
1. vshn.chat reachable:
  1. For high-severity incidents, create a new public channel for this incident, called #incident-YYYY-MM-DD (for example #incident-2020-09-01), and announce it in the Threema group "VSHN Tech" and in the #general vshn.chat channel.
  2. Lower-severity incidents can be handled in the #operations channel.
2. vshn.chat unreachable: Announce the incident in the Threema group "VSHN Tech" and include the information that vshn.chat is unreachable.
Join the VSHN Company Zoom room (password in Passbolt). This keeps the paths of communication short and provides a first level of organization.
Announce the incident in the vshn.chat #operations channel.
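
The channel naming convention for high-severity incidents can be generated rather than typed by hand, which avoids date format mistakes. A minimal sketch, with a hypothetical function name:

```python
from datetime import date


def incident_channel_name(day: date) -> str:
    """Build the channel name in the #incident-YYYY-MM-DD convention."""
    # date.isoformat() always yields zero-padded YYYY-MM-DD.
    return f"#incident-{day.isoformat()}"
```

For the example date used above, `incident_channel_name(date(2020, 9, 1))` returns `#incident-2020-09-01`.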

External Communication

Check the "00-OnCall" wiki page in the customer wiki space for any customer-specific contact instructions, procedures, etc.
Establish communication with the customer:
1. Define contact channels with the customer (ticket + chat, Zoom or phone).
2. Define the next contact time and prioritize problems with the customer. According to the Incident Handling Process, we should update the customer every 30 minutes.
If needed, regularly update status.vshn.net or status.appuio.cloud while working on a solution (get help from someone who is not involved in fixing the problem). An update is needed:
1. if more than one customer is affected
2. or the incident has a major impact and the downtime is more than 10-15 minutes
3. or for other reasons that might call for it

2.4. Get Help and Coordinate

Make sure to get help if…

… it is useful or needed.
… it is clear that one person alone cannot handle it quickly.
… it is not solved (system running again) within 30-60 min.
… more than one customer is affected and coordination and communication need to be handled with more focus than one person can spare.

2.5. Roles in Incident Handling

One person can cover all responsibilities, or they can be split up as the engineers involved see fit.

Technical Lead

Owns the technical problem-solving. This person is responsible for escalating so that the other roles can be established.

Technical Help

Contributes towards the technical problem-solving.

Organizational Lead (aka Incident Commander)

Handles documentation, information, and feedback; has the back of the Technical Lead so they can focus on technical work.

Documents the incident timeline
Gathers needed information
Requests feedback along the way where useful or needed
Mandatory: requests confirmation once the Technical Lead is confident the services are back online. ⇒ Confirmation can come from the customer if they were involved, or from within VSHN if the customer was not involved.
Ensures that documentation for the post-mortem is started

Communication Manager

Ensures communication with the customers, who must be kept updated at regular intervals about:

Results of the problem analysis
Progress in problem-solving
Foreseeable duration of the interruption
The solution or parts of it, as soon as it is available, even if the solution is only a temporary one.

2.6. Close Ticket

As soon as the incident is resolved, inform the customer of the resolution and close the incident ticket.

Document the cause, the timeline and the solutions
Document if there is anything left to do:
- What is left to do?
- Link follow-up ticket.

2.7. Next Steps and Follow up

These are very situational, depending on the severity, the nature of what caused the incident, whether the solution is just a temporary one, if there is more left to do, etc.

If a customer was affected in any way, and the incident was not small, create a post-mortem according to the post mortem template.
Ensure Key Account Management is notified about the incident (this should already have happened during the incident).
Key Account Management must ensure a Post Mortem is created.
Check if one or more follow-up problem tickets are needed to work on the underlying root cause. Problem tickets may evolve into change tickets, once an actionable solution for the underlying cause has been found.