Altair Squad

Altair isn’t a usual Squad at VSHN. VSHNeers from the tech squads Capella, Polaris and Sirius rotate in and out to form this dynamic Squad.

Mission

We’re the VSHNeers at the front taking ownership of daily customer business and incident resolution.

The Altair Squad is responsible for, but not limited to, daily customer business and incident resolution. To do this, Altair triages new tickets, reacts to monitoring alerts, answers customer Requests and resolves Incidents.

Altair exists to remove interrupt-based work patterns from other Squads and to serve customers independently of the other squads that build and maintain services for a customer.

Rotation from Home-Squads

The term "Home-Squad" refers to a Squad sending VSHNeers to Altair.

The Home-Squads send VSHNeers to Altair for a fixed duration, following these rules:

  • The rotation planning is the sole responsibility of the Home-Squad

  • All VSHNeers from the tech squads, including the Squad Master, participate, with shifts distributed as equally as possible.

  • All squads send at least 2 VSHNeers to Altair.

  • The VSHNeers from a Home-Squad cover the required skills for Ticket Triage, Monitoring Operations and resolving the usual Incidents in addition to answering common customer Requests.

  • Rotations should overlap at least 1 week. This prevents workflow interrupts in Altair and makes it easier for new VSHNeers to join for the first time.

  • Rotation length is 1 to 4 weeks. The Squad decides on this, taking the necessary skill sets and personal preferences into account as far as possible.
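
The numeric rules above can be checked mechanically during rotation planning. The following is only a minimal sketch: the `Rotation` data model and all function names are hypothetical, and it assumes that "overlap" means the incoming VSHNeer starts at least one week before the outgoing one leaves.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Rotation:
    """One VSHNeer's stay in Altair (hypothetical data model)."""
    person: str
    home_squad: str
    start: date  # first day in Altair
    end: date    # first day back in the Home-Squad


def check_rotation(rotation: Rotation) -> list[str]:
    """Return a list of problems; the rotation must last 1 to 4 weeks."""
    length = rotation.end - rotation.start
    if not timedelta(weeks=1) <= length <= timedelta(weeks=4):
        return [f"{rotation.person}: rotation length must be 1 to 4 weeks"]
    return []


def check_handover(outgoing: Rotation, incoming: Rotation) -> list[str]:
    """Return a list of problems; both rotations must overlap by 1 week."""
    overlap = outgoing.end - incoming.start
    if overlap < timedelta(weeks=1):
        return [f"{outgoing.person} -> {incoming.person}: overlap below 1 week"]
    return []


def understaffed_squads(rotations: list[Rotation], day: date,
                        squads=("Capella", "Polaris", "Sirius")) -> list[str]:
    """Return the squads that have fewer than 2 VSHNeers in Altair on `day`."""
    counts = {squad: 0 for squad in squads}
    for rotation in rotations:
        if rotation.start <= day < rotation.end and rotation.home_squad in counts:
            counts[rotation.home_squad] += 1
    return [squad for squad, count in counts.items() if count < 2]
```

Since the planning itself is the sole responsibility of the Home-Squad, a check like this would only flag problems, not decide who rotates when.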

Read Rotating in and out of Altair to learn about how to join and leave Altair.

Squad Culture

The stakeholders of the squad are the Home-Squads represented by their Squad Masters. The Home-Squads do retros that also cover the mission, work and culture of Altair. Action points and feedback regarding Altair should always be discussed between the Squads and forwarded to the Squad Master of Altair.

The Squad Master of Altair is a fixed person and no different from those in other squads: a servant leader and moderator only. The Squad Master isn’t responsible for planning shifts, ensuring quality, taking decisions on triage problems or fixing badly refined tickets.

Daily Stand-ups

The squad does daily stand-up meetings like most of the other squads do. The stand-ups are mainly there to share knowledge, offer or request help and solve duty coverage problems.

Altair sync meetings

At least twice per week there is a longer sync meeting with all current members of Altair. As interrupt-based work can’t be planned, this is more about assigning and discussing less urgent tickets, making sure that nothing gets lost, etc.

Stay connected to your Home-Squad

One of the biggest goals of Altair is to split planned from interrupt-based work patterns. On the other hand, a total disconnect from the Home-Squad can result in various problems. Therefore you’re allowed to, and should:

  • Participate in your Home-Squad’s Chat channel.

  • Attend the Home-Squad’s sync meeting(s), but try to leave early when they dive too deep into project work topics

  • Stay up to date on the maintenance work and planned changes of the Home-Squad, as these could have an impact (fallout) on your Altair Incident handling job

Responsible Ops

In Altair, one person from each Home-Squad is responsible for making sure that Triage and Monitoring Ops work for the customers and technology related to their Home-Squad. The responsible person also ensures that these tasks are covered during business hours (09:00–18:00 Zurich time).

This doesn’t mean that this person has to work 09:00–18:00 (Zurich time) every day. Handing over some or all duties to another person is possible anytime, if done in a coordinated way.

Even when handed over, it’s your overall responsibility.

Assignment of this role

As this role is only about being responsible (not necessarily doing the work yourself), it can be held by any of the VSHNeers from a Home-Squad. For that reason we follow a fixed rule for who this person is:

The person that has been in Altair the longest during the current rotation cycle is Responsible Ops.
Exception: When all VSHNeers of a Home-Squad rotate at the same time, it’s the person defined by the Home-Squad during rotation planning.
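
The rule and its exception can be expressed as a small selection function. This is a sketch under assumptions, not existing tooling: the `Member` type, the `planned` parameter and the function name are hypothetical, and a tie on the earliest join date is treated as the "rotate at the same time" case.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Member:
    """A VSHNeer currently in Altair (hypothetical data model)."""
    name: str
    home_squad: str
    joined_altair: date  # start of the current stay in Altair


def responsible_ops(members: list[Member], home_squad: str,
                    planned: Optional[str] = None) -> Optional[str]:
    """Return the name of the Responsible Ops person for one Home-Squad.

    `planned` is the person the Home-Squad named during rotation planning.
    It is only used when the longest stay is not unique.
    """
    squad = [m for m in members if m.home_squad == home_squad]
    if not squad:
        return None
    earliest = min(m.joined_altair for m in squad)
    longest = [m for m in squad if m.joined_altair == earliest]
    if len(longest) == 1:
        return longest[0].name
    # Exception: several people rotated in on the same day, so the
    # choice made by the Home-Squad during rotation planning decides.
    return planned
```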

The current role owner is always documented (manually) on the Altair Home in the Wiki.

Handing over the role

If discussed in an Altair sync meeting, the Responsible Ops role can be handed over to another person of the same Home-Squad.

Duties

Make sure that at least one person of your home-squad is doing the following in this priority order:

  1. Monitoring Operations (high priority Incidents)
    Sometimes it’s not easy to assess the priority: you have to be aware of the impact (get another opinion from the team).

  2. Ticket 1st stage Triage

  3. Ensure urgent Incidents are worked on in time!

  4. Ticket 2nd stage Triage

  5. Monitoring Operations (low priority Incidents)

  6. Monitoring Operations (Problems, for example WARNINGs)

  7. Only when there is nothing left in 1–6 → Work on tickets while constantly checking Triage filters and the monitoring alerts
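
Read as an algorithm, the list above means "always pick the first duty that has open work, and only fall back to ticket work when duties 1 to 6 are empty". A minimal sketch, with hypothetical names (`Duty`, `next_duty`) and assuming the open work per duty is available as a simple count:

```python
from enum import IntEnum


class Duty(IntEnum):
    """Duties in priority order; the value is the position in the list."""
    MONITORING_HIGH = 1      # Monitoring Operations (high priority Incidents)
    TRIAGE_STAGE_1 = 2       # Ticket 1st stage Triage
    URGENT_INCIDENTS = 3     # Ensure urgent Incidents are worked on in time
    TRIAGE_STAGE_2 = 4       # Ticket 2nd stage Triage
    MONITORING_LOW = 5       # Monitoring Operations (low priority Incidents)
    MONITORING_PROBLEMS = 6  # Monitoring Operations (Problems, e.g. WARNINGs)
    TICKET_WORK = 7          # Work on tickets


def next_duty(open_items: dict[Duty, int]) -> Duty:
    """Return the duty to attend to next, given open item counts per duty."""
    for duty in Duty:  # iterates in definition order, which is priority order
        if duty is not Duty.TICKET_WORK and open_items.get(duty, 0) > 0:
            return duty
    # Nothing left in 1 to 6: work on tickets, but keep checking the
    # Triage filters and monitoring alerts by calling this again.
    return Duty.TICKET_WORK
```

For example, `next_duty({Duty.MONITORING_LOW: 2, Duty.TRIAGE_STAGE_1: 1})` returns `Duty.TRIAGE_STAGE_1`, because 1st stage Triage ranks above low priority monitoring work.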

Communicate

  • With others being Responsible Ops

  • With the other people of your home-squad in Altair

Monitoring Operations

2–3 VSHNeers are responsible (ensured by Responsible Ops) for handling Monitoring Alerts. They can split the workload according to home-squad skills to be more efficient (this duty can be shared with Ticket Triage).

Goals

  • Efficiently handle small things that pop up

  • As soon as anything is changed on customer systems or we have to contact the customer, track the situation in a (new) ticket

  • Keep the monitoring green

    • No unhandled CRITICAL / DOWN in Monitoring older than 15min. Even if CRITICALs are actually not that critical, they still pop up on the dashboard, can generate notifications to customers, and confuse other team members when they check the Monitoring / Dashboard.

  • Handle mails in the Maintenance Mailbox.
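
The "keep the monitoring green" goal boils down to a filter over the current alerts. The sketch below is illustrative only: the `Alert` type and its fields are hypothetical and not tied to any monitoring system's API, and "unhandled" is assumed to mean "not yet acknowledged".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

MAX_UNHANDLED_AGE = timedelta(minutes=15)
BLOCKING_STATES = {"CRITICAL", "DOWN"}


@dataclass(frozen=True)
class Alert:
    """A monitoring alert (hypothetical data model)."""
    name: str
    state: str          # for example "CRITICAL", "DOWN" or "WARNING"
    since: datetime     # when the alert entered this state
    acknowledged: bool  # True once someone has ACKed the alert


def overdue_alerts(alerts: list[Alert], now: datetime) -> list[Alert]:
    """Return unhandled CRITICAL / DOWN alerts that are older than 15 minutes."""
    return [
        alert for alert in alerts
        if alert.state in BLOCKING_STATES
        and not alert.acknowledged
        and now - alert.since > MAX_UNHANDLED_AGE
    ]
```

An empty result means the goal is currently met. WARNINGs are left out on purpose, as they are handled with lower priority.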

How?

Rules
  • It’s okay to fix small things (<15min effort) directly without a ticket
    Log work to the Monitoring Ops Chore in Jira.

  • As soon as you change anything (configuration, resizing, scaling, etc.) on a production system (especially customer systems) or need customer feedback, create a ticket in the customer space in Jira

  • Tickets created from Monitoring Operations already have 1st stage Triage quality

    • When you start working on it directly, make sure you bring it to 2nd stage Triage quality first or as soon as there is time for it

Don’t over-refine tickets. The priority is to resolve Incidents in time and fix small things directly to prevent incidents.
Per issue (Alert) you’re handling
  1. Check if there is already a ticket for the issue, if not create one. ACK the alert with the ticket
    See the filters on this wiki page or search in Jira.

  2. We try to reproduce the issue and document in the ticket what happened and all information we can find within reasonable time (<15min):

    • log in to the server, checking the logs of the service, describe the failed pod, etc.

    • check if there are other similar problems in monitoring (on same host, same customer, same service for other customers, etc. - correlate things)

    • We assess the impact and describe it in the ticket. When unsure, we check with Service Managers or Product Owners (or at least within Altair).

  3. When urgent, we fix the issue directly or get someone else from Altair to start working on the Incident

  4. Work on bigger issues as soon as all other issues are handled (and, if you’re also on Triage duty, only when there is nothing to do there)

  5. If the Incident is likely to happen again or is already known, we create a Problem ticket for the other tech squads, see Creating tickets for other squads.

Ticket Triage

When customers, partners or VSHNeers create new tickets, they pop up in Triage filters we use in Altair. The responsibility of Altair is to triage these tickets and refine them into a usable state.

VSHNeers create tickets in Jira directly and ensure they’re of good enough quality to make triage easy or obsolete. Customers can create tickets either via the VSHN Portal or by e-mail to support@vshn.ch.

1st stage Triage

In the first stage of the triage process we find out whether the ticket is further triaged and refined by Altair or not.

You find the link to the Jira Dashboard with all necessary filters on the Altair Wiki Homepage.

How we do 1st stage Triage is documented in the wiki, as this procedure is still changing a lot.

The following tasks are further handled and / or resolved by Altair:

| Classification | Description | Altair Action |
|---|---|---|
| Incident | Unplanned interruption to or quality reduction of a service | Refine task to usable state and resolve incident. |
| Incident Prevention | Prevent an Incident proactively. Usually with the help of monitoring / graphs, for example a disk filling up, etc. | Resolve the situation to prevent potential Incident. Create Problem task for the Home-Squad to solve the underlying issue. |
| Request | Request from a customer (or internally) for information, advice, an easy and standard change, or access to a service | Clarify requirements (with customer), assess (authorization, feasibility, security, effort, impact, etc.), work on and complete task. Transform into a Change if needed. |

The following tasks are usually not further handled and / or resolved by Altair:

| Classification | Description | Altair Action |
|---|---|---|
| Change | The addition, modification or removal of anything that could have an impact on a running service. | Handover to the Service Manager or the Technical Service Manager for clarification and further refinement. Assign to Home-Squad if clear which Squad. |
| Problem | Root-cause analysis and potential resolution planning for one or more current, potential and past Incidents | Handover to the Service Manager, the Technical Service Manager or the Product Owner for clarification and further planning. Assign to Home-Squad if clear which Squad. |
| Project Task | A task belonging to a running project. | Handover to the Project Manager of the project for further refinement and planning. |
| Everything else | Tasks not fitting into this classification table, for example: Sales, backoffice, finance and organizational development tasks, Research tasks and similar. | Check with the creator of the ticket and handle the ticket to the point that it no longer shows up in Triage filters. |
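
The two classification tables above can be read as a lookup from classification to the party that takes the ticket further. A minimal sketch with hypothetical names (`Classification`, `first_stage_route`). The actual 1st stage Triage procedure lives in the wiki and may differ:

```python
from enum import Enum


class Classification(Enum):
    """Ticket classifications from the two tables."""
    INCIDENT = "Incident"
    INCIDENT_PREVENTION = "Incident Prevention"
    REQUEST = "Request"
    CHANGE = "Change"
    PROBLEM = "Problem"
    PROJECT_TASK = "Project Task"
    OTHER = "Everything else"


# Classifications that Altair triages further and resolves itself.
HANDLED_BY_ALTAIR = frozenset({
    Classification.INCIDENT,
    Classification.INCIDENT_PREVENTION,
    Classification.REQUEST,
})

# Who receives the ticket when Altair usually does not handle it.
HANDOVER_TO = {
    Classification.CHANGE: "Service Manager or Technical Service Manager",
    Classification.PROBLEM:
        "Service Manager, Technical Service Manager or Product Owner",
    Classification.PROJECT_TASK: "Project Manager of the project",
    Classification.OTHER: "Creator of the ticket",
}


def first_stage_route(classification: Classification) -> str:
    """Return who takes the ticket further after 1st stage Triage."""
    if classification in HANDLED_BY_ALTAIR:
        return "Altair"
    return HANDOVER_TO[classification]
```

The word "usually" in the second table matters: the lookup gives the default route, and a person on Triage duty can still decide otherwise.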

2nd stage Triage

Done by the Ticket Triage VSHNeer, or by the VSHNeer starting to work on an urgent ticket, for all tickets that come from 1st stage Triage.
Depending on the urgency, resolving the Incident is usually more important than refining the ticket.

Goals
  • Ensure ticket quality

    • Make it possible to review tickets (task deliverables, reproduce issue, etc.)

    • Eliminate misunderstandings between customer / VSHN

    • Make tickets look the same, where useful. No one likes to start working on ugly tickets

  • Have estimates on bigger tasks

    • Make it possible to escalate when the estimate is about to be reached

  • Actually Select the ticket, so that it appears on the Kanban board

How we do 2nd stage Triage is documented in the wiki, as this procedure is still changing a lot.

How we work with tickets

Once tickets have gone through Ticket Triage, they appear either on our Kanban board or as New tickets in other squads.

Altair tickets

You find the link to our Kanban board on the Altair Wiki Homepage.

Usually we try to follow Stop starting, start finishing in VSHN, but in Altair we have to react to interrupts like new Requests and Incidents (from tickets or Monitoring Alerts). A new ticket can always have a higher priority, which means that you stop working on a ticket and concentrate on another, more urgent ticket.

If something is urgent, it usually can’t wait to be pulled from the Kanban ToDo list. In that case the VSHNeer on Triage duty will ping you and send you the ticket link.

Besides the very urgent Incidents, we pull the work from the Kanban ToDo column, starting with the highest priority (the priority field) Incident, then everything else, using the set priority.

Every time before you start working on another ticket, check the Kanban board to make sure that there is nothing more important lying around.
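
The pull order for the ToDo column (Incidents first, by priority, then everything else, by priority) is a two-level sort. A minimal sketch with a hypothetical `Ticket` type. It assumes a numeric priority where 1 is the highest, which is an illustration and not the actual Jira field:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    """A ticket in the Kanban ToDo column (hypothetical data model)."""
    key: str
    is_incident: bool
    priority: int  # assumed numbering: 1 is the highest priority


def pull_order(todo: list[Ticket]) -> list[Ticket]:
    """Return the ToDo tickets in the order they should be pulled."""
    # False sorts before True, so Incidents (not is_incident == False)
    # come first. Within each group the priority field decides.
    return sorted(todo, key=lambda t: (not t.is_incident, t.priority))
```

The first element of the result is the ticket to pick up next. Very urgent Incidents bypass this order, because the VSHNeer on Triage duty hands them over directly.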

Tickets for other squads

There are two types of tickets Altair touches but doesn’t work on. Still, to some extent, it’s our responsibility to make sure these tickets are picked up by the other squads.

  • New tickets assigned to other squads during 1st stage Triage

  • Altair creating new tickets for other squads

Once triaged to Altair and worked on, a ticket must never leave Altair again. This is important for clear ownership: it prevents "ping-pong" between squads and gives the customer the single-point-of-contact experience.

Creating tickets for other squads

  • Create the new ticket correctly (squad, blocks, summary, template, etc.) - basically the ticket has 2nd stage Triage quality.

  • Add a follows link to the original Altair ticket - which was the reason for creating this ticket

  • Assign the ticket to the Home-Squad

  • Leave the ticket in the New state

    • Tickets in "New" state must not have any time logged on them. You must log time on the previous Altair ticket instead.

    • When unsure whether the ticket is seen and picked up by the other squad, inform the Squad Master about the new ticket

    • It’s up to the squads how they see and handle such tickets. Usually they have a daily e-mail notification for New tickets and go over all New tickets in the weekly planning.

Leaving tickets in the New state is crucial. Only then do tickets pop up in the filters of the Home-Squads and trigger automatic e-mail notifications. Also, it’s the Home-Squad’s job to decide what they do with such a ticket.
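
The checklist for tickets created for other squads can be turned into a quick self-check before handing a ticket over. The following is a sketch only: `HandoverTicket` and its fields are hypothetical and do not correspond to real Jira field names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HandoverTicket:
    """A ticket created by Altair for another squad (hypothetical model)."""
    key: str
    squad: Optional[str]     # Home-Squad the ticket is assigned to
    state: str               # workflow state, expected to be "New"
    follows: Optional[str]   # key of the original Altair ticket
    logged_seconds: int = 0  # time logged on this ticket


def handover_problems(ticket: HandoverTicket) -> list[str]:
    """Return the checklist items that the ticket does not fulfil yet."""
    problems = []
    if not ticket.squad:
        problems.append("not assigned to a Home-Squad")
    if ticket.state != "New":
        problems.append("not in the New state")
    if not ticket.follows:
        problems.append("no follows link to the original Altair ticket")
    if ticket.logged_seconds > 0:
        problems.append("time logged here instead of on the Altair ticket")
    return problems
```

An empty list means the formal rules are met. Whether the ticket content has 2nd stage Triage quality still needs a human look.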