Responsible Ops

Each tech team needs to do Triage, Monitoring Operations and Phone Duty. The Responsible Ops role ensures that these tasks are covered during business hours (Mo-Fr 09:00–18:00 Zurich time).

Often the Responsible Ops person does these tasks themselves, but some or all duties can be handed over to another person at any time, as long as the handover is coordinated.
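As a rough illustration, the coverage window mentioned above could be checked in code as follows. This is a minimal Python sketch: the name `in_business_hours` is hypothetical, and treating 18:00 as exclusive and ignoring public holidays are assumptions that this page does not state.

```python
from datetime import datetime, time
from zoneinfo import ZoneInfo

ZURICH = ZoneInfo("Europe/Zurich")
START = time(9, 0)   # 09:00 Zurich time
END = time(18, 0)    # 18:00 Zurich time, treated as exclusive (assumption)


def in_business_hours(moment: datetime) -> bool:
    """Return True if a timezone-aware `moment` is within Mo-Fr 09:00-18:00 Zurich time.

    Public holidays are not considered (assumption).
    """
    local = moment.astimezone(ZURICH)
    is_weekday = local.weekday() < 5  # Monday is 0, Friday is 4
    return is_weekday and START <= local.time() < END
```

Passing a timezone-aware timestamp (for example `datetime.now(ZURICH)`) avoids surprises when the calling machine runs in a different time zone.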

Assignment of this role

This role can be taken over by any of the VSHNeers of the team. Each tech team rotates this role according to their own schedule.

Duties

Make sure that at least one person on your team is doing the following, in this order of priority:

  1. Monitoring Operations (high priority Incidents)
    Sometimes it’s not easy to assess the priority: you have to be aware of the impact (get another opinion from the team)

  2. Ticket 1st stage Triage

  3. Ensure urgent Incidents are worked on in time!

  4. Ticket 2nd stage Triage

  5. Monitoring Operations (low priority Incidents)

  6. Monitoring Operations (Problems, for example WARNINGs)
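The priority order above can be written down as a simple ordering, for example to decide which open duty someone should pick up next. This is a minimal Python sketch; the names `Duty` and `next_duty` are made up for illustration and do not refer to any existing tooling.

```python
from enum import IntEnum


class Duty(IntEnum):
    """Duties in priority order; a lower value means higher priority."""
    MONITORING_HIGH = 1           # Monitoring Operations (high priority Incidents)
    TRIAGE_STAGE_1 = 2            # Ticket 1st stage Triage
    URGENT_INCIDENTS_ON_TIME = 3  # Ensure urgent Incidents are worked on in time
    TRIAGE_STAGE_2 = 4            # Ticket 2nd stage Triage
    MONITORING_LOW = 5            # Monitoring Operations (low priority Incidents)
    MONITORING_PROBLEMS = 6       # Monitoring Operations (Problems, e.g. WARNINGs)


def next_duty(pending):
    """Return the most important pending duty, or None if nothing is pending."""
    pending = list(pending)
    return min(pending) if pending else None
```

For instance, with both low priority monitoring and 1st stage Triage pending, `next_duty` picks 1st stage Triage.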

Communicate:

Monitoring Operations

At least one VSHNeer from every tech team is responsible for handling monitoring alerts. The Responsible Ops person ensures this.

Goals

  • Efficiently handle small things that pop up

  • As soon as anything is changed on customer systems or we have to contact the customer, track the situation in a ticket

  • Respond quickly to issues affecting the customer to provide good customer service

    • Urgent alerts shouldn’t go unnoticed for more than 30 minutes

  • Maintain SLAs even if several issues come up at once

  • All problems must either be solved directly or tracked for future handling

  • Handle mails in the Maintenance Mailbox
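The 30-minute goal for urgent alerts could be checked along these lines. This is only a sketch under assumptions: the `Alert` record and its fields are hypothetical and do not reflect the data model of any particular monitoring system.

```python
from datetime import datetime, timedelta, timezone
from typing import NamedTuple


class Alert(NamedTuple):
    """Hypothetical alert record, for illustration only."""
    name: str
    raised_at: datetime
    urgent: bool
    acknowledged: bool


MAX_UNNOTICED = timedelta(minutes=30)


def unnoticed_too_long(alerts, now):
    """Return names of urgent, unacknowledged alerts older than 30 minutes."""
    return [
        a.name
        for a in alerts
        if a.urgent and not a.acknowledged and now - a.raised_at > MAX_UNNOTICED
    ]
```

An empty result means that no urgent alert has exceeded the 30-minute goal.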

How?

Rules
  • It’s okay to fix small things (<15min effort) directly without a ticket

  • As soon as you change anything (configuration, resizing, scaling, etc.) on a production system (especially customer systems) or need customer feedback, create a ticket in the customer space in Jira

  • Tickets created from Monitoring Operations already have 1st stage Triage quality

    • If you start working on such a ticket directly, bring it to 2nd stage Triage quality first, or as soon as there is time for it

Don’t over-refine tickets. The priority is to resolve Incidents in time and fix small things directly to prevent incidents.
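The rules above can be condensed into a small decision helper. This is a minimal Python sketch, not an official policy check: the function name and parameters are hypothetical, and reading "small" as strictly less than 15 minutes of effort (so that anything larger gets a ticket) is an interpretation of the first rule.

```python
SMALL_FIX_LIMIT_MINUTES = 15


def needs_ticket(effort_minutes, changes_production, needs_customer_feedback):
    """Decide whether work on an alert should be tracked in a ticket.

    A ticket is needed as soon as a production system is changed or the
    customer has to be contacted. Otherwise, only fixes below the
    15-minute limit may be done directly without a ticket.
    """
    if changes_production or needs_customer_feedback:
        return True
    return effort_minutes >= SMALL_FIX_LIMIT_MINUTES
```

Under this reading, a five-minute cleanup that touches nothing in production needs no ticket, while a five-minute configuration change on a customer system does.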
Per issue/alert
  1. Check if there is an existing ticket for this issue. If there is, update the ticket if necessary and ACK the alert with a link to the ticket.

  2. If there isn’t, do a quick initial investigation (try to spend no more than 15 minutes) and document your findings in a new ticket:

    • Log in to the server, check the logs of the service, describe the failed pod, etc.

    • Check if there are or have been other similar problems (on the same host, for the same customer, on the same service for other customers, etc.) and correlate them

    • Assess and describe the impact. If unsure, check with Service Managers or Product Owners.

    • ACK the alert with a link to the ticket

  3. Immediately fix urgent issues yourself or get someone else from the team to work on them.
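The per-alert steps above can be sketched as a single function. This is an illustrative Python sketch only: `IncomingAlert`, `Ticket`, `handle_alert`, the in-memory `tickets` dictionary, and the generated `TICKET-n` links are all hypothetical stand-ins and do not use the Jira or monitoring APIs.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Ticket:
    link: str
    findings: list = field(default_factory=list)


@dataclass
class IncomingAlert:
    key: str                           # hypothetical identifier, e.g. "host/check"
    urgent: bool = False
    acked_with: Optional[str] = None   # ticket link used to ACK the alert


def handle_alert(alert, tickets, investigate):
    """Apply steps 1 to 3 to one alert and return the follow-up action."""
    # Step 1: look for an existing ticket for this issue.
    ticket = tickets.get(alert.key)
    if ticket is None:
        # Step 2: quick initial investigation, documented in a new ticket.
        ticket = Ticket(link=f"TICKET-{len(tickets) + 1}",
                        findings=investigate(alert))
        tickets[alert.key] = ticket
    # ACK the alert with a link to the ticket in both cases.
    alert.acked_with = ticket.link
    # Step 3: urgent issues are fixed immediately, others stay tracked.
    return "fix-now" if alert.urgent else "tracked"
```

The `investigate` callback stands for the manual checks (logs, failed pods, correlation, impact) and is only called when no ticket exists yet.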