Canary Analysis Service – ACM Queue

In 1913, Scottish physiologist John Scott Haldane proposed the idea of bringing a caged canary into a mine to detect dangerous gases. More than 100 years later, Haldane’s canary-in-the-coal-mine approach is also applied in software testing.

In this article, the term canarying refers to a partial and time-limited deployment of a change in a service, followed by an evaluation of whether the service change is safe. The production change process may then roll forward, roll back, alert a human, or do something else. Effective canarying involves many decisions—for example, how to deploy the partial service change or choose meaningful metrics—and deserves a separate discussion.

Google has deployed a shared centralized service called CAS (Canary Analysis Service) that offers automatic (and often autoconfigured) analysis of key metrics during a production change. CAS is used to analyze new versions of binaries, configuration changes, data-set changes, and other production changes. CAS evaluates hundreds of thousands of production changes every day at Google.



Application Layer Transport Security (LOAS)

Production systems at Google consist of a constellation of microservices1 that collectively issue O(1010) Remote Procedure Calls (RPCs) per second. When a Google engineer schedules a production workload2, any RPCs issued or received by that workload are protected with ALTS by default. This automatic, zero-configuration protection is provided by Google’s Application Layer Transport Security (ALTS). In addition to the automatic protections conferred on RPC’s, ALTS also facilitates easy service replication, load balancing, and rescheduling across production machines. This paper describes ALTS and explores its deployment over Google’s production infrastructure.


Fleet management at scale

Google’s employees are spread across the globe, and with job functions
ranging from software engineers to financial analysts, they require a broad
spectrum of technology to get their jobs done. As a result, we manage a fleet
of nearly a quarter-million computers (workstations and laptops) across four
operating systems (macOS, Windows, Linux, and Chrome OS).
Our colleagues often ask how we’re able to manage such a diverse fleet. Do
we have access to unlimited resources? Impose draconian security policies
on users? Shift the maintenance burden to our support staff?
The truth is that the bigger we get, the more we look for ways to increase
efficiency without sacrificing security or user productivity. We scale our
engineering teams by relying on reviewable, repeatable, and automated
backend processes and minimizing GUI-based configuration tools. Using and
developing open-source software saves money and provides us with a level
of flexibility that’s often missing from proprietary software and closed
systems. And we strike a careful balance between user uptime and security
by giving users freedom to get their work done while preventing them from
doing harm, like installing malware or exposing Google data.
This paper describes some of the tools and systems that we use to image,
manage, and secure our varied inventory of workstations and laptops . Some
tools were built by third parties—sometimes with our own modifications to
make them work for us. We also created several tools to meet our own
enterprise needs, often open sourcing them later for wider use. By sharing
this information, we hope to help others navigate some of the challenges
we’ve faced—and ultimately overcame—throughout our enterprise fleet
management journey.


Taking the Edge off with Espresso: Scale, Reliability and Programmability for Global Internet Peering

We present the design of Espresso, Google’s SDN-based Internet
peering edge routing infrastructure. This architecture grew out of a
need to exponentially scale the Internet edge cost-effectively and to
enable application-aware routing at Internet-peering scale. Espresso
utilizes commodity switches and host-based routing/packet process-
ing to implement a novel fine-grained traffic engineering capability.
Overall, Espresso provides Google a scalable peering edge that is
programmable, reliable, and integrated with global traffic systems.
Espresso also greatly accelerated deployment of new networking
features at our peering edge. Espresso has been in production for
two years and serves over 22% of Google’s total traffic to the Inter-


Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions

We analyze how modern distributed storage systems behave
in the presence of file-system faults such as data
corruption and read and write errors. We characterize
eight popular distributed storage systems and uncover
numerous bugs related to file-system fault tolerance. We
find that modern distributed systems do not consistently
use redundancy to recover from file-system faults: a
single file-system fault can cause catastrophic outcomes
such as data loss, corruption, and unavailability. Our results
have implications for the design of next generation
fault-tolerant distributed and cloud storage systems.


An Inside Look at Google BigQuery

This white paper introduces Google BigQuery, a fully-managed and cloud based interactive query service for massive datasets. BigQuery is the external implementation of one of the company’s core technologies whose code name is Dremel. This paper discusses the uniqueness of the technology as a cloudenabled massively parallel query engine, the differences between BigQuery and Dremel, and how BigQuery compares with other technologies such as MapReduce/Hadoop and existing data warehouse solutions

Note: This is from 2012, so doesn’t include a lot of recent innovations like upsert and standard sql support.


Understanding Synthetic Gradients and Decoupled Neural Interfaces

When training neural networks, the use of Synthetic
Gradients (SG) allows layers or modules
to be trained without update locking – without
waiting for a true error gradient to be backpropagated
– resulting in Decoupled Neural Interfaces
(DNIs). This unlocked ability of being
able to update parts of a neural network asynchronously
and with only local information was
demonstrated to work empirically in Jaderberg
et al. (2016). However, there has been very little
demonstration of what changes DNIs and SGs
impose from a functional, representational, and
learning dynamics point of view. In this paper,
we study DNIs through the use of synthetic gradients
on feed-forward networks to better understand
their behaviour and elucidate their effect
on optimisation. We show that the incorporation
of SGs does not affect the representational
strength of the learning system for a neural network,
and prove the convergence of the learning
system for linear and deep linear models. On
practical problems we investigate the mechanism
by which synthetic gradient estimators approximate
the true loss, and, surprisingly, how that
leads to drastically different layer-wise representations.
Finally, we also expose the relationship
of using synthetic gradients to other error approximation
techniques and find a unifying language
for discussion and comparison.