This talk introduces Site Reliability Engineering (SRE) at Google, explaining its purpose and describing the challenges it addresses. SRE teams manage Google’s many services and websites from our offices in Pittsburgh, New York, London, Sydney, Zurich, Los Angeles, Dublin, Mountain View, … They draw upon the Linux-based computing resources that are distributed in data centers around the world.
Operating distributed systems at scale requires an unusual set of skills—problem solving, programming, system design, networking, and OS internals—which are difficult to find in one person. At Google, we’ve found ways to hire Site Reliability Engineers who blend software and systems skills, and to keep a high standard for new SREs across our many teams and sites: we standardize the format of our interviews, and we follow the unusual practice of making hiring decisions by committee. Adopting similar practices can help your SRE or DevOps team grow by consistently hiring excellent coworkers.
In order to run the company’s numerous services as efficiently and reliably as possible, Google’s Site Reliability Engineering (SRE) organization leverages the expertise of two main disciplines: Software Engineering and Systems Engineering. The roles of Software Engineer (SWE) and Systems Engineer (SE) lie at the two poles of the SRE continuum of skills and interests. While Site Reliability Engineers tend to be assigned to one of these two buckets, the two roles overlap considerably, and knowledge exchange between them is rather fluid.
Updating production software is a process that may require dozens, if not hundreds, of steps. These include creating and testing new code, building new binaries and packages, associating the packages with a versioned release, updating the jobs in production datacenters, possibly modifying database schemata, and testing and verifying the results. There are boxes to check and approvals to seek, and the more automated the process, the easier it becomes. When releases can be made faster, it is possible to release more often, and, organizationally, one becomes less afraid to “release early, release often” [6, 7]. That is what we describe in this article: making rollouts as easy and as automated as possible, so that when a “green” condition is detected, a new rollout can begin quickly. Humans are still needed somewhere in the loop, but we strive to reduce the purely mechanical toil they must perform.
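The “push on green” idea above can be sketched as a simple gate: a rollout advances to its next stage only when every health check reports green. This is an illustrative sketch, not Google’s actual pipeline; the check names and thresholds are invented for the example.

```python
# Sketch of a "push on green" rollout gate. The metrics and thresholds
# below are hypothetical examples, chosen only to illustrate the idea.
from dataclasses import dataclass


@dataclass
class HealthCheck:
    name: str
    value: float       # observed metric value
    threshold: float   # maximum acceptable value

    def is_green(self) -> bool:
        return self.value <= self.threshold


def may_proceed(checks: list[HealthCheck]) -> bool:
    """A rollout step is allowed only if every check is green."""
    return all(c.is_green() for c in checks)


checks = [
    HealthCheck("error_rate", value=0.002, threshold=0.01),
    HealthCheck("p99_latency_ms", value=180, threshold=250),
]
print(may_proceed(checks))  # True: all checks green, rollout may advance
```

In a real pipeline the same gate would be evaluated between each stage (canary, one datacenter, all datacenters), with a human able to override or halt at any point.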
Being on-call is a critical duty that many operations and engineering teams must undertake in order to keep their services reliable and available. However, there are several pitfalls in the organization of on-call rotations and responsibilities that, if not avoided, can lead to serious consequences for the services and for the teams. We provide the primary tenets of the approach to on-call that Google’s Site Reliability Engineers have developed over years, and explain how that approach has led to reliable services and sustainable workload over time.
Google is constantly changing our software to implement new, useful features for our users. Unfortunately, making changes is inherently risky: Google services are quite complex, and any new feature might accidentally cause problems for users. In fact, most outages of Google services are the result of deploying a change. There is thus an inherent tension between the desire to innovate quickly and the need to keep the site reliable. Google manages this tension by using a metrics-based approach called an unreliability budget, which provides an objective metric to guide decisions that trade off innovation against reliability.
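The budget arithmetic is straightforward and can be made concrete. The sketch below (my own illustration, not Google’s implementation) computes how much of an unreliability budget remains: a 99.9% availability target permits 0.1% of requests to fail, and each failure spends part of that allowance.

```python
# Hedged sketch: how much of an unreliability budget is left, given an
# availability target (SLO). Function name and numbers are illustrative.

def remaining_budget(slo: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the unreliability budget.

    slo: target availability, e.g. 0.999 allows 0.1% of requests to fail.
    """
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures <= 0:
        return 0.0  # a 100% SLO leaves no budget at all
    return max(0.0, 1.0 - failed_requests / allowed_failures)


# A 99.9% SLO over 1,000,000 requests allows 1,000 failures.
# After 250 failed requests, about 75% of the budget remains.
print(remaining_budget(0.999, 1_000_000, 250))  # ≈ 0.75
```

While budget remains, teams are free to launch; once it is exhausted, releases pause in favor of reliability work, which turns the innovation-versus-reliability argument into an objective measurement.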
Janus is a system for partitioning the flash storage tier between workloads in a cloud-scale distributed file system with two tiers, flash storage and disk. The file system stores newly created files in the flash tier and moves them to the disk tier using either a First-In-First-Out (FIFO) policy or a Least-Recently-Used (LRU) policy, subject to per-workload allocations. Janus constructs compact metrics of the cacheability of the different workloads, using sampled distributed traces because of the large scale of the system. From these metrics, we formulate and solve an optimization problem to determine the flash allocation to workloads that maximizes the total reads sent to the flash tier, subject to operator-set priorities and bounds on flash write rates. Using measurements from production workloads in multiple data centers that use these recommendations, as well as traces of other production workloads, we show that the resulting allocation improves the flash hit rate by 47–76% compared to a unified tier shared by all workloads. Based on these results and an analysis of several thousand production workloads, we conclude that flash storage is a cost-effective complement to disks in data centers.
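The core allocation idea can be sketched in a few lines: given an estimated reads-per-unit-of-flash curve for each workload (its “cacheability”), assign flash greedily to whichever workload gains the most reads from the next unit. This is an illustrative simplification of the optimization described above, not the Janus implementation; it ignores operator priorities and write-rate bounds, and the curves are invented. Because such curves exhibit diminishing returns (are concave), the greedy choice maximizes total reads served from flash.

```python
# Hedged sketch of greedy flash partitioning across workloads.
# curves[w][k] = reads/sec served from flash if workload w holds k+1 units.
# Curves are assumed concave (diminishing marginal reads per extra unit).

def allocate_flash(curves: dict[str, list[float]], total_units: int) -> dict[str, int]:
    alloc = {w: 0 for w in curves}
    for _ in range(total_units):
        # Pick the workload whose next unit of flash yields the most extra reads.
        best = max(
            (w for w in curves if alloc[w] < len(curves[w])),
            key=lambda w: curves[w][alloc[w]]
            - (curves[w][alloc[w] - 1] if alloc[w] else 0.0),
            default=None,
        )
        if best is None:
            break  # every workload's curve is exhausted
        alloc[best] += 1
    return alloc


# Workload A keeps benefiting from more flash; B saturates quickly.
curves = {"A": [100.0, 190.0, 270.0], "B": [80.0, 110.0, 120.0]}
print(allocate_flash(curves, 4))  # → {'A': 3, 'B': 1}
```

The real system additionally weights workloads by operator-set priorities and caps allocations by flash write-rate bounds, but the marginal-benefit structure is the same.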