Responsible Ops

In each tech team, one person is responsible for ensuring that Triage and Monitoring Ops are working. This person also ensures that these tasks are covered during business hours (09:00–18:00 Zurich time).

This doesn’t mean that this person has to work 09:00–18:00 (Zurich time) every day. Handing over some or all duties to another person is possible anytime, if done in a coordinated way.

Even after a handover, the overall responsibility remains yours.
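The business-hours window mentioned above can be checked programmatically, for example in a reminder or rota script. The following is a minimal sketch, not an existing VSHN tool: the function name `is_business_hours` is hypothetical, the end of the window is assumed to be exclusive, and weekends and public holidays are deliberately not modelled because the text does not specify them.

```python
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo

ZURICH = ZoneInfo("Europe/Zurich")
BUSINESS_START = time(9, 0)   # 09:00 Zurich time, as stated in the text
BUSINESS_END = time(18, 0)    # 18:00 Zurich time, treated as exclusive (assumption)


def is_business_hours(moment: datetime) -> bool:
    """Return True if `moment` falls within 09:00-18:00 Zurich time.

    `moment` must be timezone-aware. Weekends and public holidays are
    not considered here.
    """
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    local = moment.astimezone(ZURICH)
    return BUSINESS_START <= local.time() < BUSINESS_END
```

Converting to `Europe/Zurich` before comparing means daylight saving time is handled by the time zone database rather than by a fixed UTC offset.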

Assignment of this role

As this role is only about being responsible (not necessarily doing the work yourself), any VSHNeer from the team can take it. Each tech team rotates this role according to a (usually predefined) schedule.

Duties

Make sure that at least one person of your team is doing the following in this priority order:

  1. Monitoring Operations (high priority Incidents)
    Sometimes it’s not easy to assess the priority: you have to be aware of the impact (get another opinion from the team)

  2. Ticket 1st stage Triage

  3. Ensure urgent Incidents are worked on in time!

  4. Ticket 2nd stage Triage

  5. Monitoring Operations (low priority Incidents)

  6. Monitoring Operations (Problems, for example WARNINGs)

  7. Only when there is nothing left in 1–6 → Work on tickets while constantly checking Triage filters and the monitoring alerts
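The priority order above can be expressed as a simple ordered enumeration. This is an illustrative sketch only: the names `Duty` and `next_duty` and the idea of counting open items per duty are assumptions made for the example, not part of any existing tooling.

```python
from enum import IntEnum


class Duty(IntEnum):
    """Duties in the priority order listed above (lower value = more urgent)."""
    MONITORING_HIGH = 1       # Monitoring Operations (high priority Incidents)
    TRIAGE_FIRST_STAGE = 2    # Ticket 1st stage Triage
    URGENT_INCIDENTS = 3      # Ensure urgent Incidents are worked on in time
    TRIAGE_SECOND_STAGE = 4   # Ticket 2nd stage Triage
    MONITORING_LOW = 5        # Monitoring Operations (low priority Incidents)
    MONITORING_PROBLEMS = 6   # Monitoring Operations (Problems, e.g. WARNINGs)
    TICKET_WORK = 7           # Only when nothing is left in 1-6


def next_duty(open_items: dict[Duty, int]) -> Duty:
    """Pick the most urgent duty that currently has open items.

    `open_items` maps a duty to the number of items waiting for it
    (a hypothetical input). Falls back to regular ticket work when
    duties 1-6 have nothing pending.
    """
    pending = [
        duty
        for duty, count in open_items.items()
        if count > 0 and duty is not Duty.TICKET_WORK
    ]
    return min(pending, default=Duty.TICKET_WORK)
```

For example, `next_duty({Duty.MONITORING_LOW: 2, Duty.TRIAGE_FIRST_STAGE: 1})` selects 1st stage Triage, because it ranks above low priority monitoring.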

Communicate

  • With others being Responsible Ops

  • With the other people of your team

Monitoring Operations

At least one VSHNeer from every tech team is responsible for handling Monitoring Alerts (ensured by Responsible Ops).

Goals

  • Efficiently handle small things that pop up

  • As soon as anything is changed on customer systems or we have to contact the customer, track the situation in a (new) ticket

  • Keep the monitoring green

    • No unhandled CRITICAL / DOWN in Monitoring older than 15min. Even if CRITICALs are actually not that critical, they still pop up on the dashboard, can generate notifications to customers, and confuse other team members when they check the Monitoring / Dashboard.

  • Handle mails in the Maintenance Mailbox.
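The 15-minute rule from the goals above can be sketched as a filter over a list of alerts. The `Alert` structure and its field names are hypothetical, and treating "handled" as "acknowledged" is an assumption made for this example; the actual monitoring system may expose this differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

MAX_UNHANDLED_AGE = timedelta(minutes=15)  # threshold from the goal above


@dataclass
class Alert:
    name: str
    state: str           # e.g. "CRITICAL", "DOWN", "WARNING"
    since: datetime      # when the alert entered this state (timezone-aware)
    acknowledged: bool   # assumption: an acknowledged alert counts as handled


def overdue_alerts(alerts: list[Alert], now: datetime) -> list[Alert]:
    """Return unhandled CRITICAL / DOWN alerts older than 15 minutes."""
    return [
        alert
        for alert in alerts
        if alert.state in {"CRITICAL", "DOWN"}
        and not alert.acknowledged
        and now - alert.since > MAX_UNHANDLED_AGE
    ]
```

Passing `now` explicitly, rather than calling `datetime.now()` inside the function, keeps the check easy to test.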

How?

Rules
  • It’s okay to fix small things (<15min effort) directly without a ticket
    Log work to the Monitoring Ops Chore in Jira.

  • As soon as you change anything (configuration, resizing, scaling, etc.) on a production system (especially customer systems) or need customer feedback, create a ticket in the customer space in Jira

  • Tickets created from Monitoring Operations already have 1st stage Triage quality

    • When you start working on it directly, make sure you bring it to 2nd stage Triage quality first or as soon as there is time for it
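The rules above amount to a small decision: log the work to the Chore, or create a ticket. The sketch below is one possible reading, not an official policy implementation. The names `Tracking` and `choose_tracking` are hypothetical, and two points are interpretations: the ticket rule is treated as taking precedence over the 15-minute rule, and work of 15 minutes or more is assumed to need a ticket.

```python
from enum import Enum


class Tracking(Enum):
    CHORE_WORKLOG = "log work to the Monitoring Ops Chore in Jira"
    TICKET = "create a ticket in the customer space in Jira"


def choose_tracking(
    effort_minutes: int,
    changes_production: bool,
    needs_customer_feedback: bool,
) -> Tracking:
    """Decide how to track work coming out of Monitoring Operations."""
    # Changes on production systems or customer contact always get a ticket
    # (interpretation: this rule wins over the small-fix rule).
    if changes_production or needs_customer_feedback:
        return Tracking.TICKET
    # Small things (<15min effort) may be fixed directly without a ticket.
    if effort_minutes < 15:
        return Tracking.CHORE_WORKLOG
    # Assumption: anything bigger is tracked in a ticket.
    return Tracking.TICKET
```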

Don’t over-refine tickets. The priority is to resolve Incidents in time and fix small things directly to prevent incidents.

Per issue (Alert) you’re handling
  1. Check if there is already a ticket for the issue; if not, create one. ACK the alert with the ticket.

  2. We try to reproduce the issue and document in the ticket what happened and all information we can find within reasonable time (<15min):

    • log in to the server, check the logs of the service, describe the failed pod, etc.

    • check if there are other similar problems in monitoring (on same host, same customer, same service for other customers, etc. - correlate things)

    • We assess the impact and describe it in the ticket. When unsure, we check with Service Managers or Product Owners (or at least within the team).

  3. When urgent, we fix the issue directly or get someone else from the team to start working on the Incident

  4. Work on bigger issues as soon as all other issues are handled (and, if you’re also on Triage duty, there is nothing to do there)

  5. If the Incident is likely to happen again or is already known, we create a Problem ticket, see Creating tickets for other teams.
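Step 1 above (find an existing ticket or create one, then ACK the alert with it) can be sketched as follows. Everything here is a hypothetical in-memory stand-in: `TicketStore`, `Alert`, `ack_with_ticket`, and the `DEMO-` key prefix are invented for illustration and do not correspond to the Jira or monitoring APIs.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Alert:
    alert_id: str
    acked_with: Optional[str] = None   # ticket key the alert was ACKed with


@dataclass
class TicketStore:
    """Hypothetical in-memory stand-in for the ticket system."""
    tickets: dict[str, str] = field(default_factory=dict)  # alert_id -> ticket key

    def find_or_create(self, alert_id: str) -> str:
        # Reuse the existing ticket for this issue if there is one.
        if alert_id not in self.tickets:
            self.tickets[alert_id] = f"DEMO-{len(self.tickets) + 1}"
        return self.tickets[alert_id]


def ack_with_ticket(alert: Alert, store: TicketStore) -> str:
    """Make sure a ticket exists for the alert and ACK the alert with it."""
    key = store.find_or_create(alert.alert_id)
    alert.acked_with = key
    return key
```

Looking up the ticket before creating one avoids duplicate tickets when the same issue alerts repeatedly.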

Ticket Triage

When customers, partners or VSHNeers create new tickets, they pop up in Triage filters we use in each tech team. The responsibility is to triage these tickets and refine them into a usable state.

VSHNeers create tickets in Jira directly and ensure they are of good enough quality to make triage easy or obsolete. Customers can create tickets either via the VSHN Portal or by e-mail to support@vshn.ch.

1st stage Triage

In the first stage of the triage process we find out whether the ticket is further triaged and refined by your team or not.

You find the link to the Jira Dashboard with all necessary filters on the wiki homepage of your team.

How we do 1st stage Triage is documented in the wiki, as this procedure is still changing a lot.

Your team handles the following interrupt-based types of work:

| Classification | Description | Action |
| --- | --- | --- |
| Incident | Unplanned interruption to or quality reduction of a service | Refine task to usable state and resolve incident. |
| Incident Prevention | Prevent an Incident proactively. Usually with the help of monitoring / graphs, for example a disk filling up, etc. | Resolve the situation to prevent potential Incident. Create Problem task for the Home-Squad to solve the underlying issue. |
| Request | Request from a customer (or internally) for information, advice, an easy and standard change, or access to a service | Clarify requirements (with customer), assess (authorization, feasibility, security, effort, impact, etc.), work on and complete task. Transform into a Change if needed. |

The following tasks are usually not considered interrupt-based work and can be planned by your team:

| Classification | Description | Action |
| --- | --- | --- |
| Change | The addition, modification or removal of anything that could have an impact on a running service. | Handover to the Service Manager or the Technical Service Manager for clarification and further refinement. Assign to Home-Squad if clear which Squad. |
| Problem | Root-cause analysis and potential resolution planning for one or more current, potential and past Incidents | Handover to the Service Manager, the Technical Service Manager or the Product Owner for clarification and further planning. Assign to Home-Squad if clear which Squad. |
| Project Task | A task belonging to a running project. | Handover to the Project Manager of the project for further refinement and planning. |
| Everything else | Tasks not fitting into this classification table, for example: sales, backoffice, finance and organizational development tasks, research tasks and similar. | Check with the creator of the ticket how to handle it, so that it no longer shows up in Triage filters. |
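The two classification tables above can be captured as a small lookup, for example to drive a triage helper. This is a sketch under assumptions: the names `Classification`, `INTERRUPT_BASED`, `HANDOVER_TO`, and `handover_contacts` are hypothetical, and the role names are copied from the Action column rather than from any system of record.

```python
from enum import Enum


class Classification(Enum):
    INCIDENT = "Incident"
    INCIDENT_PREVENTION = "Incident Prevention"
    REQUEST = "Request"
    CHANGE = "Change"
    PROBLEM = "Problem"
    PROJECT_TASK = "Project Task"
    EVERYTHING_ELSE = "Everything else"


# Interrupt-based work that the triaging team handles itself (first table).
INTERRUPT_BASED = frozenset({
    Classification.INCIDENT,
    Classification.INCIDENT_PREVENTION,
    Classification.REQUEST,
})

# Who to hand over to, or check with, for plannable work (second table).
HANDOVER_TO = {
    Classification.CHANGE: ("Service Manager", "Technical Service Manager"),
    Classification.PROBLEM: (
        "Service Manager", "Technical Service Manager", "Product Owner",
    ),
    Classification.PROJECT_TASK: ("Project Manager",),
    Classification.EVERYTHING_ELSE: ("Ticket creator",),
}


def handover_contacts(classification: Classification) -> tuple[str, ...]:
    """Return the roles to involve, or an empty tuple for interrupt-based work."""
    if classification in INTERRUPT_BASED:
        return ()
    return HANDOVER_TO[classification]
```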

2nd stage Triage

Done by the Ticket Triage VSHNeer, or by the VSHNeer starting to work on an urgent ticket, for all tickets that come from 1st stage Triage.
Depending on the urgency, resolving the Incident is usually more important than refining the ticket.

Goals
  • Ensure ticket quality

    • Make it possible to review tickets (task deliverables, reproduce issue, etc.)

    • Eliminate misunderstandings between customer / VSHN

    • Make tickets look the same, where useful. No one likes to start working on ugly tickets

  • Have estimates on bigger tasks

    • Make it possible to escalate when the estimate is about to be reached

  • Actually Select the ticket, so that it appears on the Kanban board

How we do 2nd stage Triage is documented in the wiki, as this procedure is still changing a lot.

How we work with tickets

Once tickets have gone through Ticket Triage, they either appear on our Kanban board or as New tickets in your team. It’s up to your team how you actually plan and work on your tasks.

Creating tickets for other teams

  • Create the new ticket correctly (squad, blocks, summary, template, etc.) - basically the ticket has 2nd stage Triage quality.

  • Add a follows link to the original task - which was the reason for creating this task

  • Assign the ticket to the other team

  • Leave the ticket in the New state

    • Tickets in "New" state must not have any time logged on them. You must log time on the previous ticket instead.

    • When unsure whether the ticket is seen and picked up by the other team, inform the team about the new ticket

    • It’s up to the teams how they see and handle such tickets. Usually they have a daily e-mail notification for New tickets and go over all New tickets in the weekly planning.

Leaving tickets in the New state is crucial. Only then do tickets pop up in the filters of other teams and trigger automatic e-mail notifications. Also, it’s the team’s job to decide what they do with such a ticket.
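The handover rules in this section can be turned into a small pre-flight check. The sketch below is illustrative only: the `Ticket` structure, its field names, and `handover_problems` are hypothetical and do not mirror the actual Jira data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Ticket:
    key: str
    state: str                 # e.g. "New"
    team: str                  # team the ticket is assigned to
    follows: Optional[str]     # key of the original task, if linked
    logged_minutes: int = 0    # time logged on this ticket


def handover_problems(ticket: Ticket, own_team: str) -> list[str]:
    """List rule violations for a ticket created for another team."""
    problems = []
    if ticket.state != "New":
        problems.append("ticket must stay in the New state")
    if ticket.logged_minutes > 0:
        problems.append("log time on the previous ticket, not on a New ticket")
    if ticket.follows is None:
        problems.append("add a follows link to the original task")
    if ticket.team == own_team:
        problems.append("assign the ticket to the other team")
    return problems
```

An empty result means the ticket satisfies the rules that this sketch checks; it does not verify 2nd stage Triage quality, which still needs a human review.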