Published on December 16, 2019
On-call through the eyes of a software engineer. Read Tristan's week shadowing a GitLab Site Reliability Engineer
First off, I'll introduce myself - I'm @tristan.read, a Frontend Engineer in GitLab's Monitor::Health Group. Last week I had the privilege of shadowing one of GitLab's on-call SREs. The purpose was to observe day-to-day incident response activities and gain some real-life experience with the job. We'd like all of our engineers to better understand and develop empathy for users of Monitor::Health features.
The idea for the week was to follow everything the SRE did. This meant I attended handover meetings, watched the same alert channels, and joined incident response calls if and when they occurred.
Two incidents occurred during the week while I was shadowing.
On Wednesday a jump in GitLab Runner usage was detected on GitLab.com - this was caused by a user attempting to use runner minutes to mine cryptocurrency. The SRE dealt with it using an in-house abuse mitigation tool, which stops the runner jobs and removes the associated project and account.
Had this event not been spotted, it would have been caught by an automated tool, but in this case the SRE spotted it first. An incident issue was raised for this, but it remains private.
The second incident was triggered by slowdowns and increased error rates appearing on GitLab.com's canary and main web applications. Several Application Performance Index (Apdex) Service Level Objectives (SLOs) were violated.
Public incident issue: https://gitlab.com/gitlab-com/gl-infra/production/issues/1442
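For readers who haven't met Apdex before, here is a minimal sketch of how a score is computed and compared against an SLO. It uses the standard Apdex formula (satisfied requests plus half of the tolerating requests, divided by the total). The latency samples, the 1-second threshold and the 0.95 target are made-up values for illustration, not GitLab's real configuration.

```python
def apdex(latencies_s, threshold_s):
    """Standard Apdex score: (satisfied + tolerating / 2) / total.

    satisfied:  latency <= T
    tolerating: T < latency <= 4T
    frustrated: latency > 4T (counts for nothing)
    """
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    satisfied = sum(1 for t in latencies_s if t <= threshold_s)
    tolerating = sum(1 for t in latencies_s if threshold_s < t <= 4 * threshold_s)
    return (satisfied + tolerating / 2) / len(latencies_s)


def violates_slo(score, slo_target):
    """An Apdex SLO is violated when the score drops below the target."""
    return score < slo_target


# Hypothetical request latencies in seconds (illustration only).
samples = [0.2, 0.4, 0.9, 1.5, 3.0, 6.0]
score = apdex(samples, threshold_s=1.0)
print(f"apdex={score:.2f} violated={violates_slo(score, slo_target=0.95)}")
```

With these samples, three requests are satisfied, two are tolerating and one is frustrated, so the score falls well below the assumed target. That is the kind of condition that would fire an alert.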
These are some things that I learned during the week on-call.
Alerts can be split into several types:
In general, types 2 and 3 are more useful for on-call SREs, as they reveal that something out of the ordinary is occurring.
SREs deal with a constant stream of alerts. Many of these aren't super time-critical.
Why don't they limit the alerts to only things that are critical? This approach might cause early symptoms to be missed until they snowball into a higher impact issue.
It's the job of the on-call SRE to decide which alerts indicate something serious and when to escalate or investigate further. I suspect the volume of alerts is also partly caused by inflexibility in how alerts are configured - it could be better to have more alert levels, or 'smarter' ways of setting up alerts as described above.
Feature proposal: https://gitlab.com/gitlab-org/gitlab/issues/42633
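To make the idea of "more alert levels" a bit more concrete, here is a small sketch of grading a single metric into several severities, where only the highest one pages the on-call SRE. The level names, the error-rate thresholds and the paging rule are all hypothetical, and this is not how GitLab's alerting is actually configured.

```python
from enum import IntEnum


class Severity(IntEnum):
    OK = 0
    INFO = 1
    WARNING = 2
    CRITICAL = 3


# Hypothetical thresholds for an error-rate metric (fraction of failed
# requests), ordered from most to least severe.
THRESHOLDS = [
    (0.05, Severity.CRITICAL),
    (0.01, Severity.WARNING),
    (0.001, Severity.INFO),
]


def classify(error_rate):
    """Return the highest severity whose threshold the error rate reaches."""
    for limit, severity in THRESHOLDS:
        if error_rate >= limit:
            return severity
    return Severity.OK


def should_page(severity):
    """Assumed policy: only CRITICAL pages; lower levels go to a channel."""
    return severity >= Severity.CRITICAL


for rate in (0.0, 0.002, 0.02, 0.2):
    level = classify(rate)
    print(f"error_rate={rate} -> {level.name}, page={should_page(level)}")
```

Lower levels still surface early symptoms in a channel, but they don't demand the same immediate attention as a page.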
Internal:
External:
And many, many more.
If GitLab.com has a major outage, we don't want that affecting our ability to resolve the problem. This could be mitigated by running a second GitLab instance for operating GitLab.com. In fact, we already have this with https://ops.gitlab.net/.
SREs have a tough job with many complexities. It would be great to see more GitLab products in use solving these problems. We're already working on some additions to the product that will help with the above workflows. See the Ops Section Product Vision for more details.
In 2020 we'll be expanding the team so that we can build all of these exciting features. Please see our vacancies if you're interested, and feel free to get in touch with one of our team members if you have questions about the role.
Cover image by Chris Liverani on Unsplash