Determining whether online users are authorized to access
digital objects is central to preserving privacy. This paper presents the design, implementation, and deployment
of Zanzibar, a global system for storing and evaluating access control lists. Zanzibar provides a uniform data model
and configuration language for expressing a wide range of
access control policies from hundreds of client services at
Google, including Calendar, Cloud, Drive, Maps, Photos,
and YouTube. Its authorization decisions respect causal ordering of user actions and thus provide external consistency
amid changes to access control lists and object contents.
Zanzibar scales to trillions of access control lists and millions
of authorization requests per second to support services used
by billions of people. It has maintained 95th-percentile latency of less than 10 milliseconds and availability of greater
than 99.999% over 3 years of production use
It is very difficult to find accurate information about the correctness and isolation levels offered by modern distributed databases, and the operational conditions required to achieve them. Developers use different terms for the same thing, the meaning of terms varies or is ambiguous, and sometimes vendors themselves do not actually know.
At Fauna, we care a lot about accurately describing which guarantees different systems actually provide. This is our effort to centralize a description of which database does what, based on publicly available information (documentation, source code, third-party analyses, and developers’ comments). For consistency’s sake, we will use the terminology from Kyle Kingsbury’s explanation on the Jepsen site. The chart is ranked by the maximum multi-partition isolation level offered.
The data is based on statements about isolation levels from vendor documentation, white papers, and developer commentary, exclusive of aspirational marketing statements. We have tried to be neutral in the characterization of the various systems’ architectural properties. Whether the system implementations uphold these guarantees is addressed elsewhere. If you haven’t already, please see FaunaDB’s own Jepsen results for confirmation that FaunaDB upholds its guarantees.
Today’s graphs used in domains such as machine learning or
social network analysis may contain hundreds of billions of
edges. Yet, they are not necessarily stored efficiently, and standard
graph representations such as adjacency lists waste a
significant number of bits while graph compression schemes
such as WebGraph often require time-consuming decompression.
To address this, we propose Log(Graph): a graph representation
that combines high compression ratios with very
low-overhead decompression to enable cheaper and faster
graph processing. The key idea is to encode a graph so that
the parts of the representation approach or match the respective
storage lower bounds. We call our approach “graph
logarithmization” because these bounds are usually logarithmic.
Our high-performance Log(Graph) implementation
based on modern bitwise operations and state-of-the-art succinct
data structures achieves high compression ratios as well
as performance. For example, compared to the tuned Graph
Algorithm Processing Benchmark Suite (GAPBS), it reduces
graph sizes by 20-35% while matching GAPBS’ performance
or even delivering speedups due to reducing amounts of
transferred data. It approaches the compression ratio of the
established WebGraph compression library while enabling
speedups of up to more than 2×. Log(Graph) can improve
the design of various graph processing engines or libraries on
single NUMA nodes as well as distributed-memory systems
Any security capability is inherently only as secure as the other systems
it trusts. The BeyondCorp project helped Google clearly define
and make access decisions around the platforms we trust, shifting
our security strategy from protecting services to protecting trusted platforms.
Previous BeyondCorp articles discussed the tooling Google uses to
confidently ascertain the provenance of a device, but we have not yet covered
the mechanics behind how we trust these devices.
Private WANs are increasingly important to the operation of
enterprises, telecoms, and cloud providers. For example, B4,
Google’s private software-defined WAN, is larger and growing
faster than our connectivity to the public Internet. In this
paper, we present the five-year evolution of B4. We describe
the techniques we employed to incrementally move from
offering best-effort content-copy services to carrier-grade
availability, while concurrently scaling B4 to accommodate
100x more traffic. Our key challenge is balancing the tension
introduced by hierarchy required for scalability, the partitioning
required for availability, and the capacity asymmetry
inherent to the construction and operation of any large-scale
network. We discuss our approach to managing this tension:
i) we design a custom hierarchical network topology for both
horizontal and vertical software scaling, ii) we manage inherent
capacity asymmetry in hierarchical topologies using
a novel traffic engineering algorithm without packet encapsulation,
and iii) we re-architect switch forwarding rules
via two-stage matching/hashing to deal with asymmetric
network failures at scale.
What is this, and who’s it for?
§ Lessons learned from the trenches building distributed systems for 8+ years at Cloudera and in
open source communities.
The large-scale monitoring of computer users’ software
activities has become commonplace, e.g., for application
telemetry, error reporting, or demographic profiling. This
paper describes a principled systems architecture—Encode,
Shuffle, Analyze (ESA)—for performing such monitoring
with high utility while also protecting user privacy. The ESA
design, and its PROCHLO implementation, are informed by
our practical experiences with an existing, large deployment
of privacy-preserving software monitoring.
With ESA, the privacy of monitored users’ data is guaranteed
by its processing in a three-step pipeline. First, the data
is encoded to control scope, granularity, and randomness.
Second, the encoded data is collected in batches subject to
a randomized threshold, and blindly shuffled, to break linkability
and to ensure that individual data items get “lost in the
crowd” of the batch. Third, the anonymous, shuffled data is
analyzed by a specific analysis engine that further prevents
statistical inference attacks on analysis results.
ESA extends existing best-practice methods for sensitivedata
analytics, by using cryptography and statistical techniques
to make explicit how data is elided and reduced in
precision, how only common-enough, anonymous data is analyzed,
and how this is done for only specific, permitted purposes.
As a result, ESA remains compatible with the established
workflows of traditional database analysis.
Strong privacy guarantees, including differential privacy,
can be established at each processing step to defend
against malice or compromise at one or more of those steps.
PROCHLO develops new techniques to harden those steps,
including the Stash Shuffle, a novel scalable and efficient
oblivious-shuffling algorithm based on Intel’s SGX, and new
applications of cryptographic secret sharing and blinding.
We describe ESA and PROCHLO, as well as experiments
that validate their ability to balance utility and privacy.
Abstract. Most of today’s Internet applications generate vast amounts
of data (typically, in the form of event logs) that needs to be processed
and analyzed for detailed reporting, enhancing user experience and increasing
monetization. In this paper, we describe the architecture of
Ubiq, a geographically distributed framework for processing continuously
growing log files in real time with high scalability, high availability and
low latency. The Ubiq framework fully tolerates infrastructure degradation
and data center-level outages without any manual intervention. It
also guarantees exactly-once semantics for application pipelines to process
logs as a collection of multiple events. Ubiq has been in production
for Google’s advertising system for many years and has served as a critical
log processing framework for several dozen pipelines. Our production
deployment demonstrates linear scalability with machine resources, extremely
high availability even with underlying infrastructure failures, and
an end-to-end latency of under a minute.
In 1913, Scottish physiologist John Scott Haldane proposed the idea of bringing a caged canary into a mine to detect dangerous gases. More than 100 years later, Haldane’s canary-in-the-coal-mine approach is also applied in software testing.
In this article, the term canarying refers to a partial and time-limited deployment of a change in a service, followed by an evaluation of whether the service change is safe. The production change process may then roll forward, roll back, alert a human, or do something else. Effective canarying involves many decisions—for example, how to deploy the partial service change or choose meaningful metrics—and deserves a separate discussion.
Google has deployed a shared centralized service called CAS (Canary Analysis Service) that offers automatic (and often autoconfigured) analysis of key metrics during a production change. CAS is used to analyze new versions of binaries, configuration changes, data-set changes, and other production changes. CAS evaluates hundreds of thousands of production changes every day at Google.
Production systems at Google consist of a constellation of microservices1 that collectively issue O(1010) Remote Procedure Calls (RPCs) per second. When a Google engineer schedules a production workload2, any RPCs issued or received by that workload are protected with ALTS by default. This automatic, zero-configuration protection is provided by Google’s Application Layer Transport Security (ALTS). In addition to the automatic protections conferred on RPC’s, ALTS also facilitates easy service replication, load balancing, and rescheduling across production machines. This paper describes ALTS and explores its deployment over Google’s production infrastructure.