This document only covers our shared and GitLab shared runners, which are available for GitLab.com users and managed by the Infrastructure teams.
Our CI infrastructure is hosted on Google Compute Engine (GCE) with a backup
environment on Digital Ocean (DO). In GCE we use the
us-east1-d regions. In the event of GCE downtime, we can unpause the Digital
Ocean runners, which run in NYC1. All of them are configured via Chef. These
machines are created manually and added to Chef; they do NOT use Terraform at this time.
In each region we have a few types of machines:
private-runners-manager-X are re-used. However, machines created by
shared-runners-manager-X are removed immediately after the job is finished.
Runner managers connect to the GitLab.com and dev.gitlab.org API in order to fetch jobs that need to be run. The autoscaled machines clone the relevant project via HTTPS.
The runners are connected as follows:
gitlab-org. They provide services to
gitlab-ee projects and all their forks. They are the generic shared runners on dev.gitlab.org. Privileged mode is off.
prometheus.gprd.gitlab.net scrapes each runner host with the job
ci-node. Prometheus also scrapes specific Prometheus nodes within the runners' regions using Prometheus federation.
The chef.gitlab.com server is accessed by all hosts inside the Cloud Provider Region, excluding autoscaled machines.
runners-manager-X hosts are connected to one or more GitLab instances and constantly poll the API for new jobs to execute. After a job is started, Runner also updates the job’s trace and status by sending updates to the GitLab instance. This communication uses Runner’s API from GitLab APIv4.
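The polling described above can be sketched as follows. This is a minimal illustration, not the Runner implementation: it assumes the job-request endpoint is POST /api/v4/jobs/request with the runner token in a JSON body, and that a 204 response means no job is queued. The function names and the example host are hypothetical.

```python
import json
import urllib.request


def build_job_request(base_url, runner_token):
    """Build the HTTP request a runner manager would send to ask for a job.

    Assumption: the endpoint path and the {"token": ...} payload shape.
    """
    return urllib.request.Request(
        url=f"{base_url}/api/v4/jobs/request",
        data=json.dumps({"token": runner_token}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_job_response(status, body):
    """Return the job payload as a dict, or None when no job is available.

    Assumption: HTTP 204 signals an empty queue.
    """
    if status == 204:
        return None
    return json.loads(body)


def poll_once(base_url, runner_token):
    """Perform one poll against a GitLab instance (does real network I/O)."""
    request = build_job_request(base_url, runner_token)
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_job_response(response.status, response.read())
```

Splitting request construction and response parsing from the network call keeps the two pure functions easy to check without a GitLab instance; a real manager would call poll_once in a loop with a delay between attempts.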
autoscaled-machine-N hosts first access GitLab with the git+http(s)
protocol to receive project sources with git pull or git fetch operations,
depending on configuration. This operation uses the general git+http(s)
protocol and a specific type of authentication (the gitlab-ci-token feature).
The job may also access the project’s submodules through GitLab with the same
protocol as for the project. These hosts may also upload artifacts to and
download artifacts from GitLab. The gitlab-runner-helper binary is used for
this purpose; it uses Runner’s API from GitLab APIv4.
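To illustrate the gitlab-ci-token authentication mentioned above, here is a minimal sketch of how an authenticated HTTPS clone URL can be assembled. It assumes the job token is passed as the password for the user name gitlab-ci-token; the function name, project path, and token value are hypothetical.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def ci_clone_url(project_url, job_token):
    """Embed gitlab-ci-token credentials into an http(s) project URL.

    Assumption: the job token is used as the password for the fixed
    user name "gitlab-ci-token".
    """
    parts = urlsplit(project_url)
    # Percent-encode the token so special characters cannot break the URL.
    netloc = f"gitlab-ci-token:{quote(job_token, safe='')}@{parts.hostname}"
    if parts.port:
        netloc += f":{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, "", ""))


if __name__ == "__main__":
    # Hypothetical project and token, for illustration only.
    print(ci_clone_url("https://gitlab.com/gitlab-org/example.git", "abc123"))
```

The resulting URL can be handed to git as the remote for the fetch, and the same credentials apply to submodules hosted on the same GitLab instance.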
runners-manager-X to autoscaled-machine-N - Runner starts jobs on autoscaled machines using the Docker Engine API. After the machine is created, Runner receives IP:PORT information about where the Docker Engine API endpoint is available. In GCE this uses the internal IP address, but DO uses the external address. Using the Docker Engine API, Runner first schedules the different containers used for the job. It then starts the job’s scripts and receives the commands’ output. This output is then sent upstream to GitLab as described above.
prometheus-X to autoscaled-machine-N - the Prometheus server requests exported metrics from the autoscaled machines. It uses the HTTP(S) protocol for this.
runners-manager-X to Cloud Provider API Gateway - Runner, using Docker Machine, manages the autoscaled machines used for executing jobs. It uses the Cloud Provider’s API to schedule machine creation and removal. It also uses the same API to gather information about existing machines.
After creating the machine, Runner uses the received IP:PORT to schedule containers and execute the job’s scripts there.
autoscaled-machine-N to external Docker Registry - Docker Engine, using the Docker Registry API, pulls Docker images from external registries. This could be Docker Hub, GitLab Registry, or any other Docker-compatible registry.
autoscaled-machine-N to-and-from runners-cache-X - the job uses the
gitlab-runner-helper binary, which can access the Minio S3 service to
download and/or archive the cache. This operation is done using the S3 protocol.
Docker Engine, when started by Docker Machine on behalf of Runner, may also be configured to use the Registry Mirror service on the cache server. In that case, Docker Engine will try to access Registry Mirror using the Docker Registry API in order to download required layers from the Mirror instead of the external Registry. (DO ONLY)
autoscaled-machine-N to-and-from Consul Cluster - autoscaled machines have the Consul Agent installed on them. The Agent uses Consul’s gossip protocol to register the machine with the Cluster and unregister it again. During registration, machines provide information about the services available on the machine. Currently we export only information about metrics exporters for Prometheus. (DO ONLY)
prometheus-X to Consul Cluster - the Prometheus server also runs the Consul Agent. This agent uses Consul’s gossip protocol to register/unregister information about services (also restricted to metrics exporters). Additionally, Prometheus uses its Consul Service Discovery mechanism to gather information about all targets that should be scraped. (DO ONLY)
prometheus-X to consul-X - the Prometheus server requests exported metrics from each of the Consul servers. It uses the HTTP(S) protocol for this. (DO ONLY)
runners-cache-X to external Docker Registry - if Registry Mirror is used, it pulls requested images from external Registries using the Docker Registry API. It pulls only layers that are not available locally already.
The Runner and its configuration are handled with Chef and defined on chef.gitlab.com. The detailed upgrade process is described in the associated runbook.
sudo chef-client on nodes where needed.
sudo /root/runner_upgrade.sh on nodes where needed.
Why the difference?
When we’re updating Runner, the process needs to be stopped. If this is done
during a job’s execution, it will break the job. That’s why we use Runner’s
feature named graceful shutdown. By sending the SIGQUIT signal to Runner, we
cause it to stop requesting new jobs while still waiting for existing ones to
finish. If this were done from inside a chef-client run, it could fail in
unexpected ways. With the /root/runner_upgrade.sh script we first stop
Runner gracefully (with a timeout of 7200 minutes) and then update the version.
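The graceful-shutdown step can be sketched like this. It is an illustration under stated assumptions, not the contents of /root/runner_upgrade.sh: the function names are hypothetical, the caller is assumed to know Runner’s PID, and the timeout is passed in seconds by the caller.

```python
import os
import signal
import time


def process_alive(pid):
    """Return True while a process with this PID still exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def graceful_stop(pid, timeout_s, poll_s=1.0, send=os.kill, is_alive=process_alive):
    """Send SIGQUIT, then wait up to timeout_s seconds for the process to exit.

    Returns True if the process exited in time, False if it is still running.
    The send and is_alive hooks exist so the logic can be exercised without
    a real Runner process.
    """
    send(pid, signal.SIGQUIT)  # ask Runner to finish current jobs and exit
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not is_alive(pid):
            return True
        time.sleep(poll_s)
    return not is_alive(pid)
```

Only when graceful_stop returns True is it safe to upgrade the package; on False the caller has to decide whether to keep waiting or to stop the process forcefully and accept broken jobs.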
For a Runner configuration update there is no need to stop or restart Runner’s
process, and since we’re not changing Runner’s version, chef-client does not
upgrade the package (which could trigger a stop of Runner’s process). In that
case we can simply run sudo chef-client. This will update the config.toml file
and Runner will automatically reload most of the configuration.
Some of the general configuration parameters can’t be refreshed without restarting the process. In that case we need to use the same script as for the Runner Upgrade.
We also have a few processes that are configured on some of the runner-manager-X machines and are not included in the graphs above:
digitalocean-exporter - currently started manually as a Docker container
(with the --restart always policy) on
private-runners-manager-2 hosts. This process contacts the DO API and fetches
information about the droplets in use.
hanging-droplets-cleaner - currently started manually as a Docker container
(with the --restart always policy) on all DO Runners
(private-runners-manager-2). This process cross-checks the machines
available locally to Docker Machine against the Droplets existing in DO. If it
finds a Droplet that doesn’t have a corresponding machine on the host, it removes
the Droplet. This process also exports Prometheus metrics about the number of
droplets found and removed and about errors that occurred during these operations.
droplet-zero-machines-cleaner - currently started manually as a Docker
container (with the --restart always policy) on all DO Runners
(private-runners-manager-2). This process looks
for locally defined Docker Machines whose configuration has a
DropletId=0 value. If such a machine is found and is older than a defined
limit, it’s removed. The process also exports Prometheus metrics about
the number of machines found and removed and about errors that occurred during such
operations.
All of the above metrics can be tracked on our CI Dashboard.
The hanging-droplets-cleaner and droplet-zero-machines-cleaner processes are specific to the DigitalOcean integration. We’ve discovered problems specific to this cloud provider, and the tools were developed to handle the cleanup automatically.
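The selection logic of the two cleaners can be sketched as below. This is a simplified illustration, not the tools’ actual code: it assumes Droplets and local machines can be matched by name, and the dictionary keys (name, droplet_id, created_at) are hypothetical stand-ins for whatever the Docker Machine configuration stores.

```python
from datetime import datetime, timedelta, timezone


def hanging_droplets(droplet_names, local_machine_names):
    """Droplets that exist in DO but have no matching local Docker Machine.

    Assumption: a Droplet and its machine share the same name.
    """
    return sorted(set(droplet_names) - set(local_machine_names))


def stale_zero_droplet_machines(machines, max_age, now=None):
    """Local machines recorded with droplet ID 0 that are older than max_age.

    Each machine is a dict with the hypothetical keys "name",
    "droplet_id", and "created_at" (a timezone-aware datetime).
    """
    now = now or datetime.now(timezone.utc)
    return [
        m["name"]
        for m in machines
        if m["droplet_id"] == 0 and now - m["created_at"] > max_age
    ]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    machines = [
        {"name": "runner-a", "droplet_id": 0, "created_at": now - timedelta(hours=2)},
        {"name": "runner-b", "droplet_id": 0, "created_at": now - timedelta(minutes=5)},
        {"name": "runner-c", "droplet_id": 123, "created_at": now - timedelta(hours=5)},
    ]
    print(hanging_droplets(["runner-c", "runner-x"], ["runner-c"]))
    print(stale_zero_droplet_machines(machines, timedelta(hours=1), now))
```

The age limit matters in the second check because a machine that is still being created may legitimately have no droplet ID yet; only entries that stay in that state past the limit are treated as broken.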
prometheus-app.gprd.gitlab.net - for metrics scraped from GitLab via the unicorn exporter and the GitLab Monitor project.
prometheus.gprd.gitlab.net - for Runner internal metrics, node metrics of Runner Manager machines, metrics about our cloud providers, and metrics of autoscaled machines gathered with federation from the CI Prometheus servers (Ben is currently working on enabling Thanos there, so Grafana will access CI Prometheus servers directly).
prometheus-01.us-east1-d.gce.gitlab-runners.gitlab.net - for scraping metrics from exporters installed on autoscaled machines, currently node exporter only.
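As an illustration of the federation mentioned above, the sketch below builds the URL one Prometheus server would scrape on another. It relies on Prometheus’s /federate endpoint and its match[] query parameter; the port (9090, the Prometheus default), the plain-HTTP scheme, and the job="node" selector are assumptions, not values taken from our configuration.

```python
from urllib.parse import urlencode


def federate_url(base_url, selectors):
    """Build a /federate URL that requests the series matching each selector."""
    # Each selector becomes its own match[] parameter, percent-encoded.
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url}/federate?{query}"


if __name__ == "__main__":
    # Assumed port, scheme, and selector, for illustration only.
    base = "http://prometheus-01.us-east1-d.gce.gitlab-runners.gitlab.net:9090"
    print(federate_url(base, ['{job="node"}']))
```

In practice the selectors are kept narrow so that the central server pulls only the series it needs from the CI Prometheus servers.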