XRay: A Function Call Tracing System

Debugging high-throughput, low-latency C/C++ systems in production is hard. At Google we developed XRay, a function call tracing system that allows Google engineers to get accurate function call traces with negligible overhead when off and moderate overhead when on, suitable for services deployed in production. XRay enables efficient function call entry/exit logging with high-accuracy timestamps, and can be dynamically enabled and disabled. This white paper describes the XRay tracing system and its implementation. It also describes future plans for open sourcing XRay and engaging open source communities.

Source: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45287.pdf


Design patterns for container-based distributed systems

In the late 1980s and early 1990s, object-oriented programming revolutionized software development, popularizing the approach of building applications as collections of modular components. Today we are seeing a similar revolution in distributed system development, with the increasing popularity of microservice architectures built from containerized software components. Containers [15] [22] [1] [2] are particularly well-suited as the fundamental “object” in distributed systems by virtue of the walls they erect at the container boundary. As this architectural style matures, we are seeing the emergence of design patterns, much as we did for object-oriented programs, and for the same reason – thinking in terms of objects (or containers) abstracts away the low-level details of code, eventually revealing higher-level patterns that are common to a variety of applications and algorithms.

This paper describes three types of design patterns that we have observed emerging in container-based distributed systems: single-container patterns for container management, single-node patterns of closely cooperating containers, and multi-node patterns for distributed algorithms. Like object-oriented patterns before them, these patterns for distributed computation encode best practices, simplify development, and make the systems where they are used more reliable.

Source: https://www.usenix.org/system/files/conference/hotcloud16/hotcloud16_burns.pdf

Continuous Pipelines at Google

This article focuses on the real-life challenges of managing data processing pipelines of depth and complexity. It considers the frequency continuum from periodic pipelines that run very infrequently to continuous pipelines that never stop running, and discusses the discontinuities that can produce significant operational problems. A fresh take on the master-slave model is presented as a more reliable and better-scaling alternative to the periodic pipeline for processing Big Data.

Source: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43790.pdf

My Philosophy on Alerting – Google Docs

Former SRE on optimal alerting processes.


When you are auditing or writing alerting rules, consider these things to keep your oncall rotation happier:


  • Pages should be urgent, important, actionable, and real.
  • They should represent either ongoing or imminent problems with your service.
  • Err on the side of removing noisy alerts – over-monitoring is a harder problem to solve than under-monitoring.
  • You should almost always be able to classify the problem into one of: availability & basic functionality; latency; correctness (completeness, freshness and durability of data); and feature-specific problems.
  • Symptoms are a better way to capture more problems more comprehensively and robustly with less effort.
  • Include cause-based information in symptom-based pages or on dashboards, but avoid alerting directly on causes.
  • The further up your serving stack you go, the more distinct problems a single rule can catch.  But don’t go so far that you can’t sufficiently distinguish what’s going on.
  • If you want a quiet oncall rotation, it’s imperative to have a system for dealing with things that need timely response, but are not imminently critical.

Source: My Philosophy on Alerting – Google Docs


10 Years of Crashing Google | USENIX

Excellent talk about apocalypse-level disaster testing. Video in the link.

Google has long had a culture of intentionally causing failures in its systems to find and fix them before they occur in an uncontrolled manner. Along the way, we built up several supporting capabilities that need to be addressed: failure automation, incident response, learning from postmortems, and failure prevention. This talk pulls together learnings (and war stories) from the entire lifecycle.

Source: 10 Years of Crashing Google | USENIX