Scalability:Observability Team
Observability encompasses the technical elements responsible for metrics, logging, and tracing, along with the tools and processes that leverage these components.
Mission
Our mission is to deliver and maintain a world-class observability offering and frictionless operational experience for team members at GitLab.
Common Links
Workflow | Team workflow |
GitLab.com | @gitlab-org/scalability/observability |
Issue Trackers | Scalability Tamland |
Team Slack Channels | #g_scalability-observability (Team channel), #scalability_social (Group social channel) |
Project Slack Channels | #scalability-tamland (Tamland development) |
Information Slack Channels | #infrastructure-lounge (Infrastructure Group Channel), #incident-management (Incident Management), #alerts-general (SLO alerting), #mech_symp_alerts (Mechanical Sympathy Alerts) |
Team Members
The following people are members of the Scalability:Observability team:
The team is located all over the world in different timezones.
Responsibilities and topics
This overview lists the topics we cover. It helps us reflect on and learn about our areas of ownership, duties, products, and services since the team was created at the end of 2023 by merging Scalability:Projections and Reliability:Observability.
- Monitoring
  - Metrics stack
  - Logging stack
- Error budgets
  - Ownership of concept and implementation
  - Delivery of monthly error budget report
- Capacity planning
  - Triage rotation for .com
  - Operational aspects for GitLab Dedicated capacity planning
  - Developing Tamland, the forecasting tool
  - Capacity reporting for GitLab Dedicated
- Service Maturity model which covers GitLab.com’s production services.
- GitLab.com availability: Provide underlying data and aggregate numbers
- SRE oncall rotation
Indicators
The group is an owner of several performance indicators that roll up to the Infrastructure department indicators:
- The Service Maturity model, which covers GitLab.com’s production services.
- The forecasting project named Tamland, which generates capacity warnings to prevent incidents.
These are combined to enable us to better prioritize team projects.
An overly simplified example of how these indicators might be used, in no particular order:
- Service Maturity - shows how trustworthy the data we receive from the observability stack is for a given service; the lower the level, the more we need to focus on improving that service’s observability
- Tamland reports - provide a forecast for a specific service
Between these different signals, we have a relatively (im)precise view into the past, present and future to help us prioritise scaling needs for GitLab.com.
How we work
We default to working in line with the GitLab values and following the processes of the wider SaaS Platforms section and Scalability group. In addition, the processes listed below are specific, or particularly important, to how we work in Scalability:Observability.
Issue management
While we mainly operate from the scalability issue tracker, team members also work on other projects under the gl-infra group.
Hence, we strive to use group-level labels and boards to get the entire picture.
Labels
All issues pertaining to our team have the ~"team::Scalability-Observability" label.
All issues that are within scope of current work have a ~board::build or ~board::planning label.
This cuts through noise on the tracker and gives us a view of what’s currently important to us.
See Boards below for how these labels are used.
All issues require either a Service label or the team-tasks, discussion, or capacity planning labels.
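The labeling rules above can be sketched as a small check. This is only an illustration, not a tool the team is known to use: the function name `label_problems`, the `in_scope` flag, the exact spelling of the `team-tasks`, `discussion`, and `capacity planning` labels, and the assumption that Service labels start with `Service::` are all hypothetical.

```python
# Label names taken from the text above.
TEAM_LABEL = "team::Scalability-Observability"
BOARD_LABELS = {"board::build", "board::planning"}

# Assumptions for illustration: exact label spellings and the Service prefix.
CATEGORY_LABELS = {"team-tasks", "discussion", "capacity planning"}
SERVICE_PREFIX = "Service::"


def label_problems(labels, in_scope=True):
    """Return a list of human-readable problems with an issue's labels.

    `in_scope` marks whether the issue is within scope of current work,
    in which case it needs one of the board labels.
    """
    labels = set(labels)
    problems = []

    if TEAM_LABEL not in labels:
        problems.append(f"missing team label {TEAM_LABEL!r}")

    if in_scope and not (labels & BOARD_LABELS):
        problems.append("in-scope issue needs board::build or board::planning")

    has_service = any(label.startswith(SERVICE_PREFIX) for label in labels)
    if not has_service and not (labels & CATEGORY_LABELS):
        expected = ", ".join(sorted(CATEGORY_LABELS))
        problems.append(f"needs a Service label or one of: {expected}")

    return problems


if __name__ == "__main__":
    example = ["team::Scalability-Observability", "board::build"]
    for problem in label_problems(example):
        print(problem)
```

An empty list means the issue satisfies all three rules; each missing label group adds one entry, so the result can be posted back to the issue as a checklist.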
Assignees
We use issue assignments to signal who is the DRI for the issue. We expect issue assignees to regularly update their issues with the status, and to be as explicit as possible about what has been done and what still needs to be done. We expect the assignee of an issue to drive the issue to completion. The assignee status typically expresses that the assigned team member is actively working on the issue or planning to come back to it relatively soon. We unassign ourselves from issues we are not actively working on or planning to revisit in a few days.
Boards
The Scalability:Observability team’s issue boards track the progress of planned and ongoing work. Refer to the Scalability group issue boards section for more details.
Planning | Building |
---|---|
Planning Board | Build Board |
Issues where we are investigating the work to be done. | Issues that will be built next, or are actively in development. |
Group call
We hold a weekly 30-minute group call at alternating times to facilitate a synchronous conversation across members of the group. Attendance is optional, but we encourage you to join the call if you can and otherwise catch up on the recording.
The purpose of the call is to have a space and time for the group to:
- discuss team-level concerns,
- facilitate organisation of work across team members,
- chat about any impediments to resolve those quicker,
- and generally have a space and time to hang out as a team and socialize.
While we emphasize async collaboration, we embrace the opportunity for synchronous conversation.
However, the call is not meant to be used to
- provide regular status updates (as those are expected to be given async),
- make decisions without async collaboration.
The non-social part of the group call will be recorded and uploaded to Google Drive automatically.
The agenda of the call can be found in this Google Doc (internal link). As usual, the agenda can be used to collaborate async and in advance of any call.
The timing of the call follows the time of the Scalability demo call, which happens at three different times across three weeks. The group call is scheduled to start 30 minutes before the demo call.
Updates in Slack
To stay informed about everyone’s immediate topics, we post regular status updates in our Slack channel.
These updates cover whatever the team member is currently working on and dealing with. For example, consider including your current focus area, general work items, blockers, in-flight changes, learnings, side tracks, upcoming time off, and other relevant information.
There is no strict frequency for posting updates, although we strive to make updates at least once per week.
When posting updates, consider providing enough context (e.g. through links) so that interested team members are able to dive in on their own (low context).
Error Budgets
Tamland: Development