Using GitLab, you automatically get broad and deep insight into the health of your deployment.
We provide a robust monitoring solution that gives GitLab users insight into the performance and availability of their deployments and alerts them to problems as soon as they arise. We provide data that is easy to digest and to relate to other features in GitLab. With every piece of the DevOps lifecycle integrated into GitLab, we have a unique opportunity to closely tie our monitoring features to all of the other pieces of the DevOps flow.
We work collaboratively and transparently and we will contribute as much of our work as possible back to the open source community.
The monitoring team is responsible for:
This stage consists of the following groups:
These groups map to the Monitor Stage product category.
Team members who are successful in this stage typically demonstrate a stakeholder mentality. There are many ways to demonstrate this, but examples include:
This stage is only successful when each team member collaborates to make one another successful.
Since GitLab releases on a monthly basis, we have supporting activities that also take place on monthly rhythms. In addition, since our releases take place on the 22nd of each month, each monthly cadence does not map to the actual months of the Gregorian calendar. These are presented as an ordered list for ease of reference.
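As a quick illustration of the cadence described above, a milestone window runs from the 22nd of one month to the 22nd of the next, so it never lines up with a calendar month. This is a minimal sketch; the `milestone_window` helper name is our own, and only the 22nd-of-the-month release date comes from this page.

```python
from datetime import date

def milestone_window(release: date) -> tuple[date, date]:
    """Return the (start, end) dates of the milestone ending on `release`.

    A milestone runs from the 22nd of the previous month up to the
    release day itself (the 22nd), so it spans parts of two calendar months.
    """
    if release.month == 1:
        start = date(release.year - 1, 12, 22)
    else:
        start = date(release.year, release.month - 1, 22)
    return start, release

# A release shipping on 2020-03-22 covers work from 2020-02-22 onward.
start, end = milestone_window(date(2020, 3, 22))
```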
`workflow::verification` will be moved to the next milestone
Meetings are not required, but attending them (or reviewing the recordings of the important ones) will generally help team members be successful. These are listed in order of importance and are all stored in the Monitor Stage Calendar (viewable by all GitLab team members).
Groups in this stage also participate in async daily standups. The purpose is to give every team member insight into what others are working on, so that we can identify ways to collaborate and unblock one another, as well as foster relationships within the team. We use the Geekbot Slack plugin to automate our async standup, following the guidelines outlined in the Geekbot commands guide.
Our questions change depending on the day of the week. Participation is optional but encouraged.
| Question | Why we ask it |
| --- | --- |
| Do you need help from anyone to unblock you this week? | One of our main goals with our standups is to ensure that we are unblocking one another as a top priority. We ask this first because we think it's the question that other team members can most readily act on. |
| What do you plan on working on this week? | We want to understand how our daily actions drive us toward our weekly goals. This question provides broader context for our daily work, and also helps us hold ourselves accountable to maintaining proper scopes for our tasks, issues, merge requests, etc. This answer may stay the same for a week, which would mean things are progressing on schedule. Alternatively, seeing this answer change throughout the week is also okay: maybe we got sidetracked helping someone get unblocked, or new blockers came up. The intention is not to justify our actions, but to keep a running record of how our work is progressing or evolving. |
| Any personal tidbits you'd like to share? | This question is intentionally open-ended. You might want to share how you feel, a personal anecdote, a funny joke, or simply let the team know that you will have limited availability that afternoon. All of these answers are welcome. |
| Question | Why we ask it |
| --- | --- |
| Are you facing any blockers requiring action from others? | Same reason as Monday's first question. |
| Are you on track with your plan for the week? | We want to understand how each team member is doing on achieving our weekly goal(s). It is meant to highlight progress while also identifying whether anything is getting in the way. It can also be used to update the plan for the week as things change. |
| What will be your primary focus for today? | This question is aimed at the most impactful task for the day. We aren't trying to account for the entire day's worth of work. Highlighting only a primary task keeps our answers concise and provides insight into each team member's most important priority. This doesn't necessarily mean sharing the task that will take the most time; we focus on results over input. Typically this means highlighting the task that is most impactful in closing the gap between today and our end-of-week goal(s). |
| Any personal tidbits you'd like to share? | Same reason as Monday's last question. |
| Question | Why we ask it |
| --- | --- |
| What went well this week? What did you enjoy? | The end of the week is a good time to reflect on our goals, and this question is meant to be a short retrospective of the week, focusing on the things that went well. |
| What didn't go so well? What caused you to slow down? | Like the previous question, this one is a way to review our week, surfacing things that did not go well or that got in the way of meeting our weekly goal(s). |
| What have you learned? | This could be something about work or something personal. We hope that by sharing things we have learned, others can also learn from us. |
| Any plans for the weekend you'd like to share? | Like the "personal tidbit" question we ask on other days of the week, this one is very open-ended. You can share as much or as little as you want, and all answers are welcome. |
Spikes are time-boxed investigations commonly performed in agile software development. Groups in the Monitor stage typically create Spike issues when there is uncertainty about how to proceed on a feature from a technical perspective before the feature is developed.
`deliverable` to ensure clear ownership from engineers
`workflow::ready for development`
`workflow::verification` and close the issue
Engineer(s) assigned to the Spike issue will be responsible for the following tasks:
With the support of GitLab's SRE team, we implemented the SRE shadow program as a means of improving the team's understanding of our ideal user personas so that we can build a better product.
In this program, engineers are expected to devote 1 entire week to shadowing SREs. There is no expectation for the engineer to complete their assigned issues during this time. Engineers are added to PagerDuty and follow the existing SRE shadow format for interns (scaled down to a shorter duration of 1 week). Although SREs typically go on-call for multiple days at a time, shadows are only expected to shadow during their regular business hours. This can be set as a preference in PagerDuty.
Engineers interested in the program should notify their respective frontend/backend engineering managers. Managers should collaborate to determine an optimal schedule in the Slack channel
#monitor-sre-shadow and create an access request for PagerDuty. Assign the access request to the SRE manager (this is a departure from established processes). We are currently limited to a maximum of 2 shadows per release so that we do not overload the SRE team. If you are shadowing during the same release as another engineer, coordinate to create a combined access request for the duration of the release.
Before starting your rotation, coordinate with the SRE(s) who will be on-call to determine which areas it makes sense for you to shadow (incidents, other on-call tasks, SRE daily tasks, etc.). Typically, shadowing an SRE involves activities such as paying attention to SRE-relevant Slack channels (#production, #incident-management), reading through incident issues posted there, and jumping into 'The Situation Room' linked at the top of #incident-management whenever the on-call SRE joins that room for an active incident.
You can either check PagerDuty schedules or coordinate with the SRE manager to figure out who you'll be shadowing.
Alumni of the program are encouraged to add themselves to this list and document/link to the observations/outcomes they were able to share with the wider team.
| Alumni | Outcome |
| --- | --- |
| Laura Montemayor | Shadowing a Site Reliability Engineer |
| Tristan Read | My week shadowing a GitLab Site Reliability Engineer |
| Sarah Yasonik | Created 4 issues for the team to consider adding to the product |
To make it more efficient to verify changes and demonstrate our product features to customers and other stakeholders, the engineers in this stage maintain a few demo environments.
| Purpose | Environment |
| --- | --- |
| Customer simulation environment | tanuki-inc |
| Verifying features in Staging | monitor-sandbox (Staging) |
| Verifying features in Production | monitor-sandbox (Production) |
To be able to test logging features in both the Elastic Stack-enabled and Kubernetes-only cases, the following clusters and environments exist in production and staging:
- Elastic Stack ON
- Elastic Stack OFF