Infrastructure

The Infrastructure Department is responsible for the availability, reliability, performance, and scalability of GitLab.com and other supporting services.

Mission

The Infrastructure Department enables GitLab (the company) to deliver a single DevOps application, and GitLab SaaS users to focus on generating value for their own businesses by ensuring that we operate an enterprise-grade SaaS platform.

The Infrastructure Department does this by focusing on availability, reliability, performance, and scalability efforts. Cost efficiency is an additional driving force behind these responsibilities, reinforced by properly prioritized dogfooding efforts.

Many other teams also contribute to the success of the SaaS platform because GitLab.com is not a role. However, it is the responsibility of the Infrastructure Department to drive the ongoing evolution of the SaaS platform, enabled by platform observability data.

Getting Assistance

If you’re a GitLab team member and are looking to alert the Infrastructure teams about an availability issue with GitLab.com, please find quick instructions to report an incident here: Reporting an Incident.

Vision

The Infrastructure Department operates a fast, secure, and reliable SaaS platform to which (and with which) everyone can contribute.

Integral parts of this vision are to:

  1. Build a highly performant team of engineers, combining operational and software development experience to influence the best in reliable infrastructure.
  2. Work publicly in accordance with our transparency value.
  3. Use our own product to prepare, build, deliver work, and support the company strategy.
  4. Align our strategy with the industry trends, company direction, and end customer needs.

Direction

The direction is accomplished by using Objectives and Key Results (OKRs).

Other strategic initiatives to achieve this vision are driven by the needs of enterprise customers looking to adopt GitLab.com. The GitLab.com strategy catalogs top customer requests for the SaaS offering and outlines strategic initiatives across both Infrastructure and Stage Groups needed to address these gaps.

We are also Product Development

Unlike typical companies, part of the mandate of our Security, Infrastructure, and Support Departments is to contribute to the development of the GitLab Product. This follows from these concepts, many of which are also behaviors attached to our core values:

As such, everyone in the department should be familiar with, and be acting upon, the following statements:

  • We should all feel comfortable contributing to the GitLab open source project
  • If we need something, our first instinct should be to get it into the open source project so it can be given back to the community
  • Try to get it in the open source project first, rather than later, even if it’s 2x harder
  • We should be using the whole product to do our jobs
  • We are all familiar with our Dogfooding process and follow it
  • We should not expect new team members to join the company with these instincts, so we should be willing to teach them
  • It is part of managers’ responsibility to teach these values and behaviors

Organization structure

(click the boxes for more details)

flowchart LR
    I[Infrastructure]
    click I "/handbook/engineering/infrastructure/"

    I --> TPM[Technical Program Management]
    click TPM "/handbook/engineering/infrastructure/technical-program-management/"

    I --> EP[Engineering Productivity]
    click EP "/handbook/engineering/infrastructure/engineering-productivity/"
    I --> C[Core Platform]
    click C "/handbook/engineering/infrastructure/core-platform/"
    I --> EA[Engineering Analytics]
    click EA "/handbook/engineering/quality/engineering-analytics/"
    I --> TP[Test Platform]
    click TP "/handbook/engineering/infrastructure/test-platform/"
    I --> SP[SaaS Platforms]
    click SP "/handbook/engineering/infrastructure/platforms/"

    C --> SS[Systems Stage]
    click SS "/handbook/engineering/infrastructure/core-platform/systems/"

    SS --> GC[Gitaly::Cluster]
    click GC "/handbook/engineering/infrastructure/core-platform/systems/gitaly/"
    SS --> GG[Gitaly::Git]
    click GG "/handbook/engineering/infrastructure/core-platform/systems/gitaly/"
    SS --> Geo
    click Geo "/handbook/engineering/infrastructure/core-platform/systems/geo/"
    SS --> DB[Distribution::Build]
    click DB "/handbook/engineering/infrastructure/core-platform/systems/distribution/"
    SS --> DD[Distribution::Deploy]
    click DD "/handbook/engineering/infrastructure/core-platform/systems/distribution/"
    SS --> CC[Cloud Connector]
    click CC "/handbook/engineering/infrastructure/core-platform/systems/cloud-connector/"

    C --> DS[Data Stores Stage]
    click DS "/handbook/engineering/infrastructure/core-platform/data_stores/"
    DS --> TS[Tenant Scale]
    click TS "/handbook/engineering/infrastructure/core-platform/data_stores/tenant-scale/"
    DS --> Database
    click Database "/handbook/engineering/infrastructure/core-platform/data_stores/database/"
    DS --> GS[Global Search]
    click GS "/handbook/engineering/infrastructure/core-platform/data_stores/search/"

    SP --> DE[Delivery]
    click DE "/handbook/engineering/infrastructure/team/delivery/"
    DE --> Deployments
    DE --> Releases
    SP --> Ops
    click Ops "/handbook/engineering/infrastructure/team/ops/"
    SP --> Foundations
    click Foundations "/handbook/engineering/infrastructure/team/foundations/"
    SP --> Scalability
    click Scalability "/handbook/engineering/infrastructure/team/scalability/"
    Scalability --> Observability
    Scalability --> Practices

    SP --> D[Dedicated]
    click D "/handbook/engineering/infrastructure/team/gitlab-dedicated/"
    D --> E[Environment Automation]
    click E "/handbook/engineering/infrastructure/team/gitlab-dedicated/"
    D --> PSS[Public Sector Services]
    click PSS "/handbook/engineering/infrastructure/team/gitlab-dedicated/us-public-sector-services/"
    D --> Switchboard
    click Switchboard "/handbook/engineering/infrastructure/team/gitlab-dedicated/switchboard/"

    TP --> SMP[Self-Managed Platform]
    click SMP "/handbook/engineering/infrastructure/test-platform/self-managed-platform-team/"
    TP --> TE[Test Engineering]
    click TE "/handbook/engineering/infrastructure/test-platform/test-engineering-team/"
    TP --> TTI[Test and Tools Infrastructure]
    click TTI "/handbook/engineering/infrastructure/test-platform/test-and-tools-infrastructure-team/"

Design

The Infrastructure Library contains documents that outline our thinking about the problems we are solving and represent the current state of any topic. These documents play a significant role in how we produce technical solutions to meet the challenges we face.

Dogfooding

The Infrastructure department uses GitLab and GitLab features extensively as the main tool for operating many environments, including GitLab.com.

We follow the same dogfooding process as part of the Engineering function, while keeping the department mission statement as the primary prioritization driver. The prioritization process is aligned with the Engineering function level prioritization process, which defines where the priority of dogfooding lies with regard to other technical decisions the Infrastructure department makes.

When we consider building tools to help us operate GitLab.com, we follow the 5x rule to determine whether to build the tool as a feature in GitLab or outside of GitLab. To track Infrastructure’s contributions back into the GitLab product, we tag those issues with the appropriate Dogfooding label.

Handbook use at the Infrastructure department

At GitLab, we have a handbook first policy. It is how we communicate process changes, and how we build up a single source of truth for work that is being delivered every day.

The handbook usage guide lists a number of general tips. The ones encountered most frequently in the Infrastructure department are:

  1. The wider community can benefit from training materials, architectural diagrams, technical documentation, and how-to documentation. A good place for this detailed information is in the related project documentation. A handbook page can contain a high level overview, and link to more in-depth information placed in the project documentation.
  2. Think about the audience consuming the material in the handbook. A detailed run through of a GitLab.com operational runbook in the handbook might provide information that is not applicable to self-managed users, potentially causing confusion. Additionally, the handbook is not a go-to place for operational information, and grouping operational information together in a single place while explaining the general context with links as a reference will increase visibility.
  3. Ensure that the handbook pages are easy to consume. Checklists, onboarding steps, and repeatable tasks should be either automated or created as a template that can be linked from the handbook.
  4. The handbook is the process. The handbook describes our principles, and our epics and issues are our principles put into practice.

Projects

Classification of the Infrastructure department projects is described on the infrastructure department projects page.

The infrastructure issue tracker is the backlog and a catch-all project for the infrastructure teams. It tracks the work our teams are doing that is unrelated to an ongoing change or incident.

In addition to tracking the backlog, Infrastructure Department projects are captured in our Infrastructure Department Epic as well as in our Quarterly Objectives & Key Results.

Supporting Product Features

We have a model that we use to help us support product features. This model provides details on how we collaborate to ship new features to Production.

Ownership

The Infrastructure team maintains responsibility for the underlying infrastructure on which customer-facing services run. Specific ownership details are in the GitLab Service Ownership Policy.

Stable Counterparts

Infrastructure SREs may be aligned with stage groups as stable counterparts.

Stable Counterparts are used as a framework for managing reliable services at GitLab. The framework provides guidelines for collaboration between Stage Groups and Infrastructure Teams.

Interviewing

The Infrastructure department hires for a number of different technical specialisms and positions across its teams. This Infrastructure Interviewing Guide offers more detail on some of our regular openings, our interview process, and other useful information related to applying for jobs with us. More information on our current openings can be found on the careers page.

General Issue Trackers:

  • Infrastructure issue queue
  • Production incidents, and changes
  • Delivery
  • Scalability

General Slack Channels:

  • #production
  • #infrastructure-lounge
  • #incident-management
  • #announcements
  • #feed_alerts-general

Team Slack Channels:

  • #g_delivery
  • #g_scalability

Resources:

  • Production Architecture
  • Operational Runbooks
  • Environments
  • Monitoring
  • Readiness Reviews
  • Infrastructure Standards

Other Pages


Capacity Planning for GitLab Infrastructure
Introduction: To scale GitLab infrastructure at the right time and to prevent incidents, we employ a capacity planning process, for example for GitLab.com and GitLab Dedicated. In part, this process is predictive and takes input from a forecasting tool to predict future needs. This aims to give infrastructure teams an earlier and less obtrusive warning before components reach their individual saturation levels. The forecasting tool generates capacity warnings, which are converted to issues, and these issues are raised in various status meetings.
Career Development in the Infrastructure Department
Career Development Discovery & Planning: There are a number of tools we use to plot and manage career development:
  • Role descriptions, which outline role responsibilities, requirements and nice-to-haves for each level in the role
  • Big Picture Career Conversations
  • Quarterly checkpoints
  • 1:1s
  • 360 Feedback
  • Talent Assessment
Maintaining current role descriptions, which establish expectations for hiring and ongoing performance, is an important supporting function for effective Career Development planning. The rest of the tools are for active engagement by the Team Member along with their Manager.
Change Management
Purpose: Change Management has traditionally referred to the processes, procedures, tools and techniques applied in IT environments to carefully manage changes in an operational environment: change tickets and plans, approvals, change review meetings, scheduling, and other red tape. In our context, Change Management refers to the guidelines we apply to manage changes in the operational environment with the aim of doing so (in order of highest to lowest priority) safely, effectively and efficiently.
Core Platform Sub-department
Vision: Offer enterprise-grade operational experience of GitLab products from streamlined deployment and maintenance, disaster recovery, secure search and discoverability, to high availability, scalability, and performance.
Mission: Core Platform focuses on improving our capabilities and metrics in the following areas:
  • Database
  • Database Reliability
  • Distribution:Build
  • Distribution:Deploy
  • Geo
  • Gitaly
  • Cloud Connector
  • Global Search
  • Tenant Scale
All Team Members: The following people are permanent members of teams that belong to the Core Platform Sub-department:
Cost Management
GitLab Cost Management
Database
Database Reliability at GitLab: Database Reliability Engineers (DBREs) are part of the Reliability Engineering teams that run GitLab.com. We care most about the database reliability aspects of the infrastructure and of GitLab as a product. We strive to approach database reliability from a data-driven perspective as much as we can. As such, we start by defining Service Level Objectives below and document what service levels we currently aim to maintain for GitLab.
Emergency Change Processes for GitLab SaaS
The Infrastructure Department, responsible for managing the GitLab SaaS environment, has a number of processes that include an implicit emergency component as part of a regular workflow. This page serves as a high level overview of the most important components of those processes, with links to pages describing said processes in more depth.
Workflow: An integral part of any irregular situation occurring on GitLab SaaS is the incident management process.
Engineering Productivity team
The Engineering Productivity team increases productivity of GitLab team members and contributors by shortening feedback loops and improving workflow efficiency for GitLab projects.
Incident Management
If you’re a GitLab team member and are looking to alert Reliability Engineering about an availability issue with GitLab.com, please find quick instructions to report an incident here: Reporting an Incident. If you’re a GitLab team member looking for who is currently the Engineer On Call (EOC), please see the Who is the Current EOC? section. If you’re a GitLab team member looking for the status of a recent incident, please see the incident board.
Incident Review
The primary goals of writing an Incident Review are to ensure that the incident is documented, that all contributing root cause(s) are well understood, and, especially, that effective preventive actions are put in place to reduce the likelihood and/or impact of recurrence.
Incident Review Issue Creation and Ownership: Incident reviews are conducted as close to the incident date as possible. Every Incident Review Issue must be assigned a DRI. The DRI will usually be the assignee of the associated incident, but it may be someone else, such as the service owner.
Infrastructure Department Frequently Asked Questions
GitLab.com Backups
Q: How often is GitLab.com backed up?
A: See our summary of our backup strategy
Q: Are GitLab.com backups encrypted?
A: Yes. We use GCP Persistent Storage volumes underneath all of our filesystems, and that is implicitly encrypted. So the live filesystems, their snapshot-based backups, database replicas, and logical backups are all fully encrypted at the block device layer. Additionally, GCP encrypts and encapsulates traffic between our nodes within our VPCs, so data in motion is also protected from eavesdropping and tampering.
Infrastructure Department Performance Indicators
Executive Summary (KPI, health, and status):
  • GitLab.com Availability SLO (Okay): February 2024 Availability 99.82%; January 2024 Availability 100.00%; December 2023 Availability 99.99%
  • Mean Time To Production (MTTP) (Okay): Work towards MTTP epic 280.
  • Corrective Action SLO (Okay): Corrective Action SLO are back below 0
  • Master Pipeline Stability (Okay): Current month improved to 93%; key issues have been internal Gitaly performance; dependency upgrade issue that has also been resolved
  • Merge request pipeline duration (Okay): Reduced to 42 minutes for this month; two previous months below target due to increased retries and lack of parallelization; implemented a timeout for jobs such that we can capture artifacts and resolve issues
  • S1 Open Customer Bug Age (OCBA) (Attention): Promoted to KPI in FY24Q2; slight uptick in last 3 months due to the triaging of all untriaged customer bugs; all S1 bugs are scheduled for current milestone
  • S2 Open Customer Bug Age (OCBA) (Attention): Promoted to KPI in FY24Q2; above target, significant reduction will require a focus on older customer impacting S2
  • Quality Team Member Retention (Confidential): Confidential metric, see notes in Key Review agenda; will be merged into a combined department retention metric
  • Infrastructure Team Member Retention (Confidential): Confidential metric, see notes in Key Review agenda; will be merged into a combined department retention metric
Key Performance Indicators GitLab.
Infrastructure Department Projects
“GitLab’s approach to the types, data classifications, canonical locations, ownership, workflow and organization of infrastructure department projects”
Infrastructure Environments
Environments: Terraform control for the environments can be found on ops.
Future Iteration with Infrastructure Standards: We have a WIP initiative to iterate on our company-wide infrastructure standards. You can learn more about this on the infrastructure standards handbook page. This page will be refactored incrementally as the standards are documented, implemented, and changes to environments take place.
Development:
  Name         URL      Purpose      Deploy   Database  Terminal access
  Development  various  Development  on save  Fixture   individual dev
Development happens on a local machine.
Infrastructure Feature Support
How the Infrastructure Department supports shipping features to Production.
Infrastructure OKRs
Infrastructure OKRs
Infrastructure Product Management
Responsibilities: The responsibilities of the Infrastructure Product Manager are documented in the job-families page.
Engagement Model, Inbound Requests: The Infra PM can help triage and prioritize inbound requests to Infrastructure from internal teams and GitLab.com customers. Types of requests:
  • Dogfooding requests, e.g. Runbooks
  • Security and Compliance Requests
  • GitLab.com customer requests in remit of the Infrastructure department: GitLab.com customers, especially enterprises, may often have requests related to operational capabilities or non-functional requirements of GitLab.
Library
Overview: The Infrastructure Library moved to https://gitlab.com/gitlab-com/gl-infra/readiness/-/tree/master/library.
Network Security Management Procedure
Purpose: GitLab architects a defense-in-depth methodology that enforces the concept of “least functionality” by restricting network access to systems, applications and services, and ensures sufficient security and privacy controls are executed to protect the confidentiality, integrity, availability and safety of the organization’s network infrastructure, as well as to provide situational awareness of activity on GitLab’s networks.
Scope: GitLab’s network architecture is available to both internal and external users and hosts our DNS with Cloudflare including gitlab.
Production
If you’re a GitLab team member and are looking to alert Reliability Engineering about an availability issue with GitLab.com, please find quick instructions to report an incident here: Reporting an Incident. If you’re a GitLab team member looking for help with a security problem, please see the Engaging the Security On-Call section.
The Production Environment: The GitLab.com production environment is comprised of services that operate, or support the operation of, gitlab.com. For a complete list of production services, see the service catalog.
Release Tools
Guide to GitLab's tools for new releases
Service Maturity Model
Introduction: This page shows the output of our service maturity model for each service in our metrics catalog. The model itself is part of the metrics catalog, and uses information from the metrics catalog and the service catalog to score each service. To achieve a particular level in the maturity model, a service must meet all the criteria for that level and all previous levels. Some criteria do not apply to all services (for instance, services like PgBouncer do not need development documentation).
Service Ownership
GitLab Service Ownership Policy
Purpose: This policy establishes service ownership within the engineering organization for customer-facing services, outlining responsibilities and ownership structure.
Scope: This policy applies specifically to customer-facing services and the underlying infrastructure services that support them.
Service Ownership, Customer Facing Services, Reliability::General: Contains all customer-facing services tied to the monolith architecture. Responsibilities include design, development, deployment, and operational stability. Ensuring alignment with organizational standards and meeting service level objectives (SLOs) for customer-facing services.
Team
See the SaaS Platforms Organizational Structure for teams in Infrastructure.
Technical Program Management Team
The Technical Program Management Team drives the planning, execution, and delivery of complex infrastructure projects across Engineering and Product.
Test Platform Sub-Department
Test Platform Sub-Department enables successful development and deployment of high quality GitLab software applications by providing innovative build automated solutions, reliable tooling, refined test efficiency, and fostering an environment where Quality is Everyone's responsibility.
The Infrastructure Platforms Section
Mission: The Infrastructure Platforms section enables GitLab Engineering to build and deliver safe, scalable and efficient features for multi-tenant and single-tenant GitLab SaaS platforms (GitLab.com and GitLab Dedicated).
Vision: To deliver on the mission, we are in the process of formalising the building blocks we need to work on.
Direction: In FY25, teams in the Platforms Section of the Infrastructure Department have collaborated on the “North Star”, which is then used to set the SaaS Platforms Strategy.
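
The Service Maturity Model summary above states that a service reaches a level only if it meets all the criteria for that level and all previous levels, and that some criteria do not apply to every service. A minimal sketch of that rule follows. The data shape, the function name `achieved_level`, and the criterion names are assumptions made up for illustration; they are not taken from the metrics catalog or the service catalog.

```python
from typing import Dict, List, Optional

# Hypothetical data shape: one dict per level, ordered from level 1 upward.
# Each value is True (met), False (not met), or None (criterion not applicable).
Level = Dict[str, Optional[bool]]


def achieved_level(levels: List[Level]) -> int:
    """Return the highest level for which this and all earlier levels pass.

    Returns 0 when even the first level is not met. Criteria marked None
    are skipped, so a level with no applicable criteria counts as passed.
    """
    achieved = 0
    for number, criteria in enumerate(levels, start=1):
        applicable = [met for met in criteria.values() if met is not None]
        if not all(applicable):
            break  # a failed level blocks every level after it
        achieved = number
    return achieved


# Illustrative criteria only; the names are invented for this example.
example = [
    {"has_slo": True, "has_dashboard": True},
    {"has_runbook": True, "has_dev_docs": None},  # dev docs not applicable
    {"has_capacity_forecast": False},
    {"has_extra_checks": True},  # met, but level 3 already failed
]

print(achieved_level(example))  # → 2
```

In this sketch the fourth level is not credited even though its criterion is met, because the third level fails; that is the "all previous levels" part of the rule.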