Incident Handling
1. Incident Handling Process
This section describes the incident handling process. It is meant as a helpful tool for an engineer in a stressful situation. Incidents come in many different shapes. At VSHN we handle a lot of operations and many small incidents. Most of them follow a similar process, but more implicitly. This section documents how we handle incidents. An engineer should be able to rely on this description when starting to work on an incident.
1.2. Process Description
- Start of incident
-
VSHN gets notified about incidents via these sources:
-
Monitoring Alerts.
-
Customer Ticket via VSHN Portal control.vshn.net.
-
Customer Ticket via Email.
-
Customer Calls our Technical Hotline +41 44 545 53 53.
-
If it is not an incident, the ticket is triaged according to the Service Desk Workflow and the process ends.
-
If it is outside office hours, follow the On-Call Guide in our wiki.
-
If it is about a security vulnerability, check out the Vulnerability Process for VSHNeers.
-
Engineer acknowledges the alert in OpsGenie / Icinga and creates a ticket.
-
Engineer starts working.
- During Incident
-
Create a ticket (see supporting info in Jira Ticket) to document the working process, if one does not already exist.
-
Check out Communication below.
-
After 30 minutes, update the ticket with the status:
-
Which service is affected.
-
What happened.
-
When did it start.
-
What is the current state of the incident handling.
-
Investigating
-
Mitigating
-
Observing
-
Who to contact with questions.
-
The engineer should decide whether they need help; see Get Help and Coordinate for guidance. If no further help is needed, continue working for 30 more minutes and update the ticket again.
-
If the engineer sees that they cannot handle the incident alone, or that it could grow into something bigger, they must start to distribute responsibilities according to Roles in Incident Handling.
-
If it is an information security incident, report it according to Report Security Incident, if not done already. If unsure whether it classifies as a security incident, ask the CISO or people from the ISM Governance Role.
-
If the incident takes longer than 2h, inform Account Management.
-
If it is a big incident in which multiple services are down for a long time (for example, a complete outage of a cloud site), the Business Continuity Plan must be triggered.
- After Incident
-
If the incident took less than 4h, close the incident ticket according to Close Ticket.
-
If the incident took more than 4h:
-
Account Management decides if a Post Mortem should be done and leads the post-mortem process.
-
Create a ticket if there is more work to do or an underlying problem has to be addressed.
-
Close incident ticket with a well formulated summary.
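The time thresholds in the process above (a ticket update every 30 minutes, informing Account Management after 2h, and the 4h split after the incident) can be sketched as a few helper functions. This is only an illustration: the function and constant names are hypothetical, and the handling of an incident that lasted exactly 4h is an assumption, because the process only says "less than 4h" and "more than 4h".

```python
from datetime import timedelta

# Thresholds taken from the process description.
STATUS_UPDATE_INTERVAL = timedelta(minutes=30)
ACCOUNT_MANAGEMENT_THRESHOLD = timedelta(hours=2)
POST_MORTEM_THRESHOLD = timedelta(hours=4)


def status_updates_due(elapsed: timedelta) -> int:
    """Number of 30-minute ticket updates that should exist by now."""
    return elapsed // STATUS_UPDATE_INTERVAL


def inform_account_management(elapsed: timedelta) -> bool:
    """True once the incident has taken longer than 2 hours."""
    return elapsed > ACCOUNT_MANAGEMENT_THRESHOLD


def closing_step(duration: timedelta) -> str:
    """Pick the After Incident branch for a resolved incident.

    Assumption: exactly 4 hours is treated like "more than 4h".
    """
    if duration < POST_MORTEM_THRESHOLD:
        return "close ticket"
    return "account management decides on post mortem"
```

For example, 95 minutes into an incident three ticket updates should already exist, and Account Management does not need to be informed yet.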
2. Supporting Information
Supporting information you may need for incident handling.
2.1. Jira Ticket
-
Check if an Incident Ticket already exists for the issue at hand, otherwise create one.
-
Small incidents which can be very quickly fixed don’t need a dedicated ticket.
-
If more than one customer could be affected, create a ticket in VSHNOPS or another VSHN internal Jira project and link the tickets of all affected customers, so as to keep data security intact.
-
Assess the Severity Level of the incident. Set the ticket priority accordingly.
-
Note down how you got alerted (who called, which monitoring alert).
-
Link the OpsGenie alert if you have one.
-
Write down how to reproduce the error or check the status (link the alert, etc.).
A good ticket should consist of this information:
-
Which service is affected
-
What happened
-
When did it start
-
What is the current state of the incident handling
-
Investigating
-
Mitigating
-
Observing
-
Who to contact with questions
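The five pieces of information listed above could be captured in a small structure that renders a consistent ticket comment. The following is a minimal sketch, not an existing VSHN tool: the class names, field names, and the output layout are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class HandlingState(Enum):
    """The three handling states named in the ticket checklist."""
    INVESTIGATING = "Investigating"
    MITIGATING = "Mitigating"
    OBSERVING = "Observing"


@dataclass
class StatusUpdate:
    """One status update for an incident ticket (hypothetical structure)."""
    service: str
    what_happened: str
    started_at: datetime
    state: HandlingState
    contact: str

    def render(self) -> str:
        """Render the update as plain text, one checklist item per line."""
        return "\n".join([
            f"Affected service: {self.service}",
            f"What happened: {self.what_happened}",
            f"Started: {self.started_at:%Y-%m-%d %H:%M}",
            f"Current state: {self.state.value}",
            f"Contact: {self.contact}",
        ])
```

Using a fixed set of states keeps updates comparable across tickets, and a missing field is noticed when the update is created rather than when someone reads the ticket.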
2.2. Severity Levels
The four severity levels are related to the severity levels in the ISMS Security Incident Management Process.
- Trivial/Event
-
An issue that is quick to resolve with minimal effort, often handled immediately upon detection or reporting.
-
Resolution Time: Resolved within 15 minutes.
- Minor
-
The incident is noticeable
-
The incident can be pinpointed to individual systems
-
Resolution Time: longer than 15 minutes, but under 2 hours
- Major Incident
-
The incident causes significant business impact for customers or VSHN
-
Multiple systems are affected
-
Resolution Time: longer than 2 hours
- Incident endangering the business continuity
-
A serious incident that could endanger the existence of VSHN.
-
Business Continuity Plan must be triggered
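The resolution-time part of these severity levels can be sketched as a simple lookup. This covers only the time dimension: the qualitative criteria (business impact, number of affected systems) still need human judgment. The function name and the `endangers_business` flag are hypothetical, and treating a resolution time of exactly 2 hours as Major is an assumption, since the levels above leave that boundary open.

```python
from datetime import timedelta
from enum import Enum


class Severity(Enum):
    TRIVIAL = "Trivial/Event"
    MINOR = "Minor"
    MAJOR = "Major Incident"
    BUSINESS_CONTINUITY = "Incident endangering the business continuity"


def severity_from_resolution_time(
    resolution_time: timedelta, endangers_business: bool = False
) -> Severity:
    """Map a resolution time to a severity level.

    The highest level is not time-based, so it is passed in as a flag.
    """
    if endangers_business:
        return Severity.BUSINESS_CONTINUITY
    if resolution_time <= timedelta(minutes=15):
        return Severity.TRIVIAL
    # Assumption: exactly 2 hours counts as Major.
    if resolution_time < timedelta(hours=2):
        return Severity.MINOR
    return Severity.MAJOR
```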
2.3. Communication
- Internal Communication
-
Ensure a communication channel is established
-
vshn.chat reachable:
-
For high severity incidents, create a new public channel for this incident, called #incident-YYYY-MM-DD (for example #incident-2020-09-01), and announce it in the Threema group "VSHN Tech" and in the #general vshn.chat channel.
-
Lower severity incidents can be handled in the #operations channel.
-
vshn.chat unreachable: Announce the incident in the Threema group "VSHN Tech" and include the information that vshn.chat is unreachable.
-
Join the VSHN Company Zoom room (password in Passbolt). This keeps the paths of communication short and provides an initial organization.
-
Announce the incident in the vshn.chat #operations channel.
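The channel naming convention for high severity incidents can be generated from the incident date, which avoids typos under stress. This is a small sketch with a hypothetical helper name; it only builds the name and does not talk to vshn.chat.

```python
from datetime import date


def incident_channel_name(day: date) -> str:
    """Build the channel name in the form #incident-YYYY-MM-DD."""
    # date.isoformat() always yields zero-padded YYYY-MM-DD.
    return f"#incident-{day.isoformat()}"
```

Calling `incident_channel_name(date(2020, 9, 1))` yields `#incident-2020-09-01`.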
- External Communication
-
Check the "00-OnCall" wiki page in the customer wiki space for any customer-specific contact instructions, procedures, etc.
-
Establish communication with customer
-
Define contact channels with customer (Ticket + Chat, Zoom or phone)
-
Define the next contact time and prioritize problems with the customer. According to the Incident Handling Process, we should update the customer every 30 minutes.
-
If needed, regularly update status.vshn.net or status.appuio.cloud during solution finding (get help from someone who is not involved in problem fixing).
-
if more than one customer is affected
-
or the incident has a major impact and downtime is more than 10-15
-
or for other reasons that might call for it
2.4. Get Help and Coordinate
Make sure to get help if…
-
… it is useful or needed.
-
… it is clear that one person alone cannot handle it quickly.
-
… it is not solved (system running again) within 30-60 min.
-
… more than one customer is affected and coordination and communication need more focus than one person can spare.
2.5. Roles in Incident Handling
One person can cover all responsibilities, or they can be split up, as the engineers involved see fit.
- Technical Lead
-
Owns the technical problem-solving. This person is responsible for escalating so that other roles can be established.
- Technical Help
-
Contributes towards the technical problem-solving
- Organizational Lead (aka Incident Commander)
-
Handles documentation, information, and feedback; has the back of the Technical Lead so they can focus on technical work
-
Documents the incident timeline
-
Gathers needed information
-
Requests feedback along the way where useful or needed
-
Mandatory: request confirmation once the Technical Lead is confident the service(s) are back online. ⇒ Confirmation can come from the customer if they were involved, or internally if the customer was not involved.
-
Ensures that documentation for the post mortem is started
- Communication Manager
-
Ensures communication with the customer(s), who must be kept updated at regular intervals about:
-
Results of the problem analysis
-
Progress in problem-solving
-
Foreseeable duration of the interruption
-
The solution or parts of it, as soon as it is available, even if the solution is only a temporary one.
2.6. Close Ticket
As soon as the incident is resolved, inform the customer of the resolution and close the incident ticket.
-
Document the cause, the timeline and the solution(s)
-
Document if there is anything left to do:
-
What is left to do?
-
Link follow-up ticket.
2.7. Next Steps and Follow up
These are very situational, depending on the severity, the nature of what caused the incident, whether the solution is just a temporary one, if there is more left to do, etc.
-
If a customer was affected in any way, and the incident was not small, create a post-mortem according to the post mortem template.
-
Ensure Key Account Management is notified (this should already have happened during the incident).
-
Key Account Management must ensure a Post Mortem is created.
-
Check if one or more follow-up problem tickets are needed to work on the underlying root cause. Problem tickets may evolve into change tickets, once an actionable solution for the underlying cause has been found.