Responsible Ops

Each tech team needs to cover Triage, Monitoring Operations and Phone Duty. The Responsible Ops role ensures that these tasks are covered during business hours (Mon-Fri 09:00–18:00 Zurich time).
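
As an illustration of the business-hours window, here is a minimal sketch of a check that tooling could use. It assumes "Zurich time" maps to the IANA zone Europe/Zurich, treats 18:00 as the exclusive end of the window, and ignores public holidays. The function name in_business_hours is hypothetical and not part of any VSHN tooling.

```python
from datetime import datetime, time
from zoneinfo import ZoneInfo

# Assumption: "Zurich time" corresponds to the IANA zone Europe/Zurich.
ZURICH = ZoneInfo("Europe/Zurich")
START = time(9, 0)
END = time(18, 0)  # assumption: the window ends at 18:00, exclusive


def in_business_hours(moment: datetime) -> bool:
    """Return True if `moment` falls within Mon-Fri 09:00-18:00 Zurich time.

    Public holidays are not considered in this sketch.
    """
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    local = moment.astimezone(ZURICH)
    # weekday(): Monday is 0, Sunday is 6
    return local.weekday() < 5 and START <= local.time() < END
```

A timestamp in any other zone is converted first, so a UTC time of 08:30 on a winter Monday counts as 09:30 in Zurich and is inside the window.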

Often the Responsible Ops person does these tasks themselves, but handing over some or all duties to another person is possible at any time, as long as it is done in a coordinated way.

Assignment of this role

This role can be taken over by any of the VSHNeers of the team. Each tech team rotates this role according to their own schedule.

Duties

Make sure that at least one person of your team is doing the following in this priority order:

  1. Monitoring Operations (high priority Incidents)
    Sometimes it’s not easy to assess the priority: you have to be aware of the impact (get another opinion from the team)

  2. Ticket 1st stage Triage

  3. Ensure urgent Incidents are worked on in time!

  4. Ticket 2nd stage Triage

  5. Monitoring Operations (low priority Incidents)

  6. Monitoring Operations (Problems, for example WARNINGs)
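
The priority order above can be sketched as an ordered enumeration, where a lower value means the duty is covered first. This is only an illustration; the names Duty and next_duty are hypothetical.

```python
from enum import IntEnum


class Duty(IntEnum):
    """Duties in the priority order listed above (1 = most important)."""
    MONITORING_HIGH_PRIORITY_INCIDENTS = 1
    TRIAGE_FIRST_STAGE = 2
    URGENT_INCIDENTS_WORKED_ON_IN_TIME = 3
    TRIAGE_SECOND_STAGE = 4
    MONITORING_LOW_PRIORITY_INCIDENTS = 5
    MONITORING_PROBLEMS = 6


def next_duty(pending):
    """Return the most important duty that is still uncovered, or None."""
    return min(pending, default=None)
```

For example, if both 2nd stage Triage and high priority Incidents are uncovered, next_duty picks the high priority Incidents first.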

Communicate:

Monitoring Operations

At least one VSHNeer from every tech team is responsible for handling Monitoring Alerts. Responsible Ops ensures that this is covered.

Goals

  • Efficiently handle small things that pop up

  • As soon as anything is changed on customer systems or we have to contact the customer, track the situation in a ticket

  • Keep the monitoring green

  • Respond quickly to issues affecting the customer to provide good customer service

    • CRITICALs shouldn’t go unnoticed for more than 15 minutes, thus the monitoring must be checked regularly

  • Maintain customer SLAs even if several issues come up at once

  • All problems must either be solved directly or tracked for future handling

  • Handle mails in the Maintenance Mailbox.

How?

Rules
  • It’s okay to fix small things (<15min effort) directly without a ticket

  • As soon as you change anything (configuration, resizing, scaling, etc.) on a production system (especially customer systems) or need customer feedback, create a ticket in the customer space in Jira

  • Tickets created from Monitoring Operations already have 1st stage Triage quality

    • When you start working on it directly, make sure you bring it to 2nd stage Triage quality first or as soon as there is time for it

Don’t over-refine tickets. The priority is to resolve Incidents in time and fix small things directly to prevent incidents.

Per issue/alert
  1. Check if there is an existing ticket for this issue. If there is, update the ticket if necessary and ACK the alert with a link to the ticket.

  2. If there isn’t, do a quick initial investigation (try to spend no more than 15 minutes) and document your findings in a new ticket:

    • Log in to the server, check the logs of the service, describe the failed pod, etc.

    • Check if there are or have been other similar problems (on same host, same customer, same service for other customers, etc. - correlate things)

    • Assess and describe the impact. If unsure, check with Service Managers or Product Owners.

    • ACK the alert with a link to the ticket

  3. Immediately fix urgent issues yourself or get someone else from the team to work on them.
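
The per-alert steps above can be sketched as a small planning function that returns the actions to take for one alert. This is a rough illustration only: the Alert class, the plan_alert_handling function and the ticket key format are hypothetical and do not reflect the actual monitoring or Jira integration.

```python
from dataclasses import dataclass
from typing import List, Optional

INVESTIGATION_BUDGET_MIN = 15  # "try to spend no more than 15 minutes"


@dataclass
class Alert:
    name: str
    urgent: bool = False


def plan_alert_handling(alert: Alert, existing_ticket: Optional[str]) -> List[str]:
    """Return the ordered actions for one alert, following steps 1 to 3 above."""
    steps: List[str] = []
    if existing_ticket is not None:
        # Step 1: a ticket already exists for this issue
        steps.append(f"update {existing_ticket} if necessary")
        steps.append(f"ACK the alert with a link to {existing_ticket}")
    else:
        # Step 2: quick initial investigation, documented in a new ticket
        steps.append(f"investigate for at most {INVESTIGATION_BUDGET_MIN} minutes")
        steps.append("check for similar problems and assess the impact")
        steps.append("document the findings in a new ticket")
        steps.append("ACK the alert with a link to the new ticket")
    if alert.urgent:
        # Step 3: urgent issues are fixed right away
        steps.append("fix immediately or get a teammate to work on it")
    return steps
```

Calling plan_alert_handling(Alert("disk almost full"), None) yields the four investigation steps, and an urgent alert adds the immediate-fix step at the end.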

Ticket Triage

When customers, partners or VSHNeers create new tickets, they pop up in the Triage filters we use in each tech team. The responsibility is to triage these tickets and refine them into a usable state.

VSHNeers create tickets in Jira directly and ensure they are of good enough quality to make triage easy or obsolete. Customers can create tickets either via the VSHN Portal or by e-mail to support@vshn.ch.

1st stage Triage

In the first stage of the triage process we decide whether your team triages and refines the ticket further or not.

You find the link to the Jira Dashboard with all necessary filters on the wiki homepage of your team.

How we do 1st stage Triage is documented in the wiki, as this procedure is still changing a lot.

Your team handles the following interrupt-based types of work:

| Classification | Description | Action |
|---|---|---|
| Incident | Unplanned interruption to or quality reduction of a service | Refine task to usable state and resolve incident. |
| Incident Prevention | Prevent an Incident proactively. Usually with the help of monitoring / graphs, for example a disk filling up, etc. | Resolve the situation to prevent potential Incident. Create Problem task for the Home-Team to solve the underlying issue. |
| Request | Request from a customer (or internally) for information, advice, an easy and standard change, or access to a service | Clarify requirements (with customer), assess (authorization, feasibility, security, effort, impact, etc.), work on and complete task. Transform into a Change if needed. |

The following tasks are usually not considered interrupt-based work and can be planned by your team:

| Classification | Description | Action |
|---|---|---|
| Change | The addition, modification or removal of anything that could have an impact on a running service. | Hand over to the Service Manager or the Technical Service Manager for clarification and further refinement. Assign to Home-Team if clear which Team. |
| Problem | Root-cause analysis and potential resolution planning for one or more current, potential and past Incidents | Hand over to the Service Manager, the Technical Service Manager or the Product Owner for clarification and further planning. Assign to Home-Team if clear which Team. |
| Project Task | A task belonging to a running project. | Hand over to the Project Manager of the project for further refinement and planning. |
| Everything else | Tasks not fitting into this classification table, for example: sales, backoffice, finance and organizational development tasks, research tasks and similar. | Check with the creator of the ticket and handle the ticket to the point where it no longer shows up in Triage filters. |
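
The classification above can be read as a routing rule for 1st stage Triage. The sketch below is one possible encoding under that reading; the function name first_stage_route and the returned strings are hypothetical, and the actual procedure is the one documented in the wiki.

```python
# Classifications your team handles itself as interrupt-based work
INTERRUPT_BASED = {"Incident", "Incident Prevention", "Request"}

# Classifications that can be planned, with the roles they are handed over to
HANDOVER = {
    "Change": ["Service Manager", "Technical Service Manager"],
    "Problem": ["Service Manager", "Technical Service Manager", "Product Owner"],
    "Project Task": ["Project Manager"],
}


def first_stage_route(classification: str) -> str:
    """Return a short description of where a ticket goes after 1st stage Triage."""
    if classification in INTERRUPT_BASED:
        return "handle in your team"
    if classification in HANDOVER:
        return "hand over to " + " or ".join(HANDOVER[classification])
    # "Everything else": sales, backoffice, finance, research tasks and similar
    return "check with the creator of the ticket"
```

Anything that is not listed in either mapping falls through to the "Everything else" row.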

2nd stage Triage

Done by the Ticket Triage VSHNeer, or by the VSHNeer starting to work on an urgent ticket, for all tickets that come from 1st stage Triage.

Depending on the urgency, resolving the Incident is usually more important than refining the ticket.

Goals
  • Ensure ticket quality

    • Make it possible to review tickets (task deliverables, reproduce issue, etc.)

    • Eliminate misunderstandings between customer / VSHN

    • Make tickets look the same, where useful. No one likes to start working on ugly tickets

  • Have estimates on bigger tasks

    • Make it possible to escalate when the estimate is about to be reached

  • Actually Select the ticket so that it appears on the Kanban board

How we do 2nd stage Triage is documented in the wiki, as this procedure is still changing a lot.

How we work with tickets

Once tickets have gone through Ticket Triage, they appear either on our Kanban board or as New tickets in your team. It’s up to your team how you actually plan and work on your tasks.

Creating tickets for other teams

  • Create the new ticket correctly (team, blocks, summary, template, etc.) - basically the ticket has 2nd stage Triage quality.

  • Add a follows link to the original task - which was the reason for creating this task

  • Assign the ticket to the other team

  • Leave the ticket in the New state

    • Tickets in "New" state must not have any time logged on them. You must log time on the previous ticket instead.

    • It’s up to the teams how they see and handle such tickets. Usually they have daily e-mail notifications for New tickets and go over all New tickets in the weekly planning.

Leaving tickets in the New state is crucial. Only then do tickets pop up in the filters of other teams and trigger automatic e-mail notifications. It’s also the receiving team’s job to decide what to do with such a ticket.
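
The checklist for tickets created for other teams can be sketched as a small validation helper. This is only an illustration: the Ticket fields and the handover_problems function are hypothetical and do not correspond to real Jira field names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Ticket:
    team: str                 # team the ticket is assigned to
    status: str               # workflow state, for example "New"
    follows: Optional[str]    # key of the original task, if linked
    logged_minutes: int = 0   # time logged on this ticket


def handover_problems(ticket: Ticket, target_team: str) -> List[str]:
    """List the checklist items a cross-team ticket still violates."""
    problems: List[str] = []
    if ticket.team != target_team:
        problems.append("not assigned to the other team")
    if ticket.follows is None:
        problems.append("missing follows link to the original task")
    if ticket.status != "New":
        problems.append("ticket must stay in the New state")
    elif ticket.logged_minutes > 0:
        problems.append("time logged on a New ticket; log it on the previous ticket")
    return problems
```

An empty result means the ticket satisfies the points that can be checked mechanically; whether the summary and template are filled in well still needs a human look.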

Phone Duty

VSHN provides a phone number to customers for technical inquiries. These inquiries are handled directly by the tech teams.

To receive these calls, the Zoom desktop client needs to be installed and logged in with your personal Zoom account. Whether you receive calls is configured via the personal menu in the top right corner → Receive Queue Calls, which must be enabled during Phone Duty.

Inquiries by customers via phone should usually be documented in a ticket in order to track issues, decisions and agreements, and possibly to bill the time spent.

Organizing help for other teams

Due to the history of our teams and with the move to fixed customer ownership of our Solution Teams, there are known and unknown knowledge gaps. One team may urgently need help from another team because they don’t know enough about the customer’s setup or the technology they’re using.

We agreed that Responsible Ops is also the entry point for this:

  • Check #Operations to see if another team could use your help.

  • If you don’t have the expertise or time to help, ask your team if someone else could help.

  • Offer or coordinate help to the other team.

This isn’t about taking over tasks from the other team. The ownership should stay with the team responsible for the customer.