This talk introduces Site Reliability Engineering (SRE) at Google, explaining its purpose and describing the challenges it addresses. SRE teams manage Google's many services and websites from our offices in Pittsburgh, New York, London, Sydney, Zurich, Los Angeles, Dublin, Mountain View, … They draw upon the Linux-based computing resources distributed in data centers around the world.
Operating distributed systems at scale requires an unusual set of skills—problem solving, programming, system design, networking, and OS internals—which are difficult to find in one person. At Google, we've found ways to hire Site Reliability Engineers who blend software and systems skills, and to keep a high standard for new SREs across our many teams and sites. These include standardizing the format of our interviews and the unusual practice of making hiring decisions by committee. Adopting similar practices can help your SRE or DevOps team grow by consistently hiring excellent coworkers.
In order to run the company's numerous services as efficiently and reliably as possible, Google's Site Reliability Engineering (SRE) organization leverages the expertise of two main disciplines: Software Engineering and Systems Engineering. The roles of Software Engineer (SWE) and Systems Engineer (SE) lie at the two poles of the SRE continuum of skills and interests. While Site Reliability Engineers tend to be assigned to one of these two buckets, there is much overlap between the two roles, and knowledge exchange between them is fluid.
Updating production software is a process that may require dozens, if not hundreds, of steps. These include creating and testing new code, building new binaries and packages, associating the packages with a versioned release, updating the jobs in production datacenters, possibly modifying database schemata, and testing and verifying the results. There are boxes to check and approvals to seek, and the more automated the process, the easier it becomes. When releases can be made faster, it is possible to release more often, and, organizationally, one becomes less afraid to “release early, release often” [6, 7]. And that’s what we describe in this article—making rollouts as easy and as automated as possible. When a “green” condition is detected, we can more quickly perform a new rollout. Humans are still needed somewhere in the loop, but we strive to reduce the purely mechanical toil they need to perform.
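The gating logic described above—advancing a rollout only while the service looks "green"—can be sketched as follows. This is a minimal illustration, not Google's actual release tooling; the stage names, error-rate threshold, and check function are assumptions.

```python
# Hypothetical sketch of an automated rollout loop gated on a "green"
# health signal. Stage names and the 0.1% error-rate gate are
# illustrative assumptions.

def is_green(error_rate: float, max_error_rate: float = 0.001) -> bool:
    """A rollout may proceed only while the error rate stays under the gate."""
    return error_rate <= max_error_rate

def rollout(stages, fetch_error_rate):
    """Push a release through successive stages (canary -> regions -> global),
    halting at the first stage whose post-push error rate is not green."""
    completed = []
    for stage in stages:
        completed.append(stage)          # deploy to this stage (stubbed out)
        if not is_green(fetch_error_rate(stage)):
            return completed, False      # halt; a human investigates
    return completed, True

stages = ["canary", "region-1", "region-2", "global"]
rates = {"canary": 0.0002, "region-1": 0.0004, "region-2": 0.005, "global": 0.0001}
done, ok = rollout(stages, rates.get)
# Halts after "region-2" because its error rate (0.5%) exceeds the 0.1% gate.
```

The point of automating the check itself is that the loop can run on every release, so humans are consulted only when a stage fails.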
Being on-call is a critical duty that many operations and engineering teams must undertake in order to keep their services reliable and available. However, there are several pitfalls in the organization of on-call rotations and responsibilities that, if not avoided, can lead to serious consequences for the services and for the teams. We provide the primary tenets of the approach to on-call that Google's Site Reliability Engineers have developed over years, and explain how that approach has led to reliable services and sustainable workload over time.
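One concrete staffing consequence of sustainable on-call can be sketched with back-of-the-envelope arithmetic. The calculation below assumes the commonly cited guideline that engineers spend at most 25% of their time on-call; the coverage parameters are illustrative, not a statement of Google's actual rules.

```python
import math

# Back-of-the-envelope staffing arithmetic for a sustainable on-call
# rotation. The 25% cap on time spent on-call is a commonly cited
# guideline; the tier counts below are illustrative assumptions.

def min_rotation_size(oncall_fraction_cap: float, tiers: int = 1) -> int:
    """Minimum engineers needed for continuous 24/7 coverage.

    tiers: number of simultaneously staffed responders
    (e.g. 2 for a primary plus a secondary).
    """
    return math.ceil(tiers / oncall_fraction_cap)

# Single primary, each engineer on-call at most 25% of the time:
print(min_rotation_size(0.25))           # 4
# Primary and secondary both staffed around the clock:
print(min_rotation_size(0.25, tiers=2))  # 8
```

Rotations smaller than this either break the time cap or leave gaps in coverage, which is one of the pitfalls the abstract alludes to.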
Google is constantly changing our software to implement new, useful features for our users. Unfortunately, making changes is inherently risky. Google services are quite complex, and any new feature might accidentally cause problems for users. In fact, most outages of Google services are the result of deploying a change. As a consequence, there is an inherent tension between innovating quickly and keeping the site reliable. Google manages this tension by using a metrics-based approach called an unreliability budget, which provides an objective metric to guide decisions involving tradeoffs between innovation and reliability.
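The budget arithmetic behind such a metrics-based approach is simple enough to sketch. The 99.9% objective and 30-day window below are illustrative assumptions, not a specific Google target.

```python
# Sketch of the arithmetic behind an unreliability budget. The 99.9%
# availability objective and 30-day window are illustrative assumptions.

def unreliability_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO."""
    return (1.0 - slo) * window_days * 24 * 60

budget = unreliability_budget_minutes(0.999)   # 43.2 minutes per 30 days
spent = 12.5                                   # minutes of outage so far
remaining = budget - spent                     # 30.7 minutes left
# While `remaining` is positive, the team can keep launching changes;
# once the budget is exhausted, releases pause until the window rolls over.
```

The value of the budget as an objective metric is exactly this: "can we launch?" becomes a comparison against a number rather than a negotiation.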
Janus is a system for partitioning the flash storage tier between workloads in a cloud-scale distributed file system with two tiers, flash storage and disk. The file system stores newly created files in the flash tier and moves them to the disk tier using either a First-In-First-Out (FIFO) policy or a Least-Recently-Used (LRU) policy, subject to per-workload allocations. Janus constructs compact metrics of the cacheability of the different workloads, using sampled distributed traces because of the large scale of the system. From these metrics, we formulate and solve an optimization problem to determine the flash allocation to workloads that maximizes the total reads sent to the flash tier, subject to operator-set priorities and bounds on flash write rates. Using measurements from production workloads in multiple data centers that use these recommendations, as well as traces of other production workloads, we show that the resulting allocation improves the flash hit rate by 47–76% compared to a unified tier shared by all workloads. Based on these results and an analysis of several thousand production workloads, we conclude that flash storage is a cost-effective complement to disks in data centers.
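The shape of the allocation problem—divide a fixed flash budget across workloads to maximize reads served from flash—can be illustrated with a greedy marginal-benefit sketch. The concave per-workload "reads vs. flash" curves and the greedy loop are assumptions for illustration, not Janus's actual formulation (which also handles priorities and write-rate bounds).

```python
import heapq

# Illustrative sketch of a flash-partitioning problem: give each unit of
# flash to whichever workload currently gains the most reads/sec from it.
# With non-increasing marginal gains (concave cacheability curves), this
# greedy loop finds an optimal integer allocation.

def allocate(curves, total_units):
    """curves[w][i] is the extra reads/sec workload w gains from its
    (i+1)-th unit of flash; each list must be non-increasing."""
    alloc = {w: 0 for w in curves}
    # Max-heap of (negated marginal gain, workload name).
    heap = [(-gains[0], w) for w, gains in curves.items() if gains]
    heapq.heapify(heap)
    for _ in range(total_units):
        if not heap:
            break
        _neg_gain, w = heapq.heappop(heap)
        alloc[w] += 1
        if alloc[w] < len(curves[w]):          # workload can still benefit
            heapq.heappush(heap, (-curves[w][alloc[w]], w))
    return alloc

curves = {
    "logs":  [50, 20, 5],    # diminishing returns per unit of flash
    "web":   [90, 60, 30],
    "batch": [10, 8, 6],
}
print(allocate(curves, 4))   # {'logs': 1, 'web': 3, 'batch': 0}
```

The sketch makes the intuition behind per-workload allocations concrete: a shared tier implicitly gives flash to whoever writes fastest, whereas an explicit allocation directs it to the workloads whose reads are most cacheable.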
Debugging high-throughput, low-latency C/C++ systems in production is hard. At Google we developed XRay, a function call tracing system that allows Google engineers to get accurate function call traces with negligible overhead when off and moderate overhead when on, suitable for services deployed in production. XRay enables efficient function call entry/exit logging with high-accuracy timestamps, and can be dynamically enabled and disabled. This white paper describes the XRay tracing system and its implementation. It also describes future plans for open-sourcing XRay and engaging open source communities.
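The core idea—timestamped entry/exit records around function calls, with tracing toggled at runtime so the off state costs almost nothing—can be sketched in a few lines. This is a Python illustration of the concept only, not the real XRay, which patches instrumentation points into compiled C/C++ binaries.

```python
import functools
import time

# Conceptual sketch of dynamic function entry/exit tracing: record
# high-resolution timestamps around calls, and keep the disabled path
# as cheap as a single flag check. Not the real XRay implementation.

TRACING_ENABLED = False   # flipped at runtime, like XRay's dynamic enable
TRACE_LOG = []            # (event, function name, nanosecond timestamp)

def traced(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        if not TRACING_ENABLED:       # cheap check when tracing is off
            return fn(*args, **kwargs)
        TRACE_LOG.append(("enter", fn.__name__, time.monotonic_ns()))
        try:
            return fn(*args, **kwargs)
        finally:
            TRACE_LOG.append(("exit", fn.__name__, time.monotonic_ns()))
    return wrapper

@traced
def handle_request():
    return "ok"

handle_request()          # tracing off: no events recorded
TRACING_ENABLED = True
handle_request()          # tracing on: one enter and one exit event
```

In a compiled system the same toggle is achieved by rewriting no-op instruction sleds at the instrumentation points, which is what keeps the disabled overhead negligible.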
In the late 1980s and early 1990s, object-oriented programming revolutionized software development, popularizing the approach of building applications as collections of modular components. Today we are seeing a similar revolution in distributed system development, with the increasing popularity of microservice architectures built from containerized software components. Containers are particularly well-suited as the fundamental "object" in distributed systems by virtue of the walls they erect at the container boundary. As this architectural style matures, we are seeing the emergence of design patterns, much as we did for object-oriented programs, and for the same reason: thinking in terms of objects (or containers) abstracts away the low-level details of code, eventually revealing higher-level patterns that are common to a variety of applications and systems.
This paper describes three types of design patterns that we have observed emerging in container-based distributed systems: single-container patterns for container management, single-node patterns of closely cooperating containers, and multi-node patterns for distributed algorithms. Like object-oriented patterns before them, these patterns for distributed computation encode best practices, simplify development, and make the systems where they are used more reliable.
This article focuses on the real-life challenges of managing deep, complex data processing pipelines. It considers the frequency continuum from periodic pipelines that run very infrequently to continuous pipelines that never stop running, and discusses the discontinuities that can produce significant operational problems. A fresh take on the master-slave model is presented as a more reliable and better-scaling alternative to the periodic pipeline for processing Big Data.