It is very difficult to find accurate information about the correctness and isolation levels offered by modern distributed databases, and the operational conditions required to achieve them. Developers use different terms for the same thing, the meaning of terms varies or is ambiguous, and sometimes vendors themselves do not actually know.
At Fauna, we care a lot about accurately describing which guarantees different systems actually provide. This is our effort to centralize a description of which database does what, based on publicly available information (documentation, source code, third-party analyses, and developers’ comments). For consistency’s sake, we will use the terminology from Kyle Kingsbury’s explanation on the Jepsen site. The chart is ranked by the maximum multi-partition isolation level offered.
The data is based on statements about isolation levels from vendor documentation, white papers, and developer commentary, exclusive of aspirational marketing statements. We have tried to be neutral in the characterization of the various systems’ architectural properties. Whether the system implementations uphold these guarantees is addressed elsewhere. If you haven’t already, please see FaunaDB’s own Jepsen results for confirmation that FaunaDB upholds its guarantees.
Concurrency is hard. How do you test your system when it’s spread across three services and four languages? Unit testing and type systems only take us so far. At some point we need new tools.
Enter TLA+. TLA+ is a specification language that describes your system and the properties you want. This makes it a fantastic complement to testing: not only can you check your code, you can check your design, too! TLA+ is especially effective for testing concurrency problems, like stalling, race conditions, and dropped messages.
This talk will introduce the ideas behind TLA+ and how it works, with a focus on practical examples. We’ll also show how it caught complex bugs in our systems, as well as how you can start applying it to your own work.
(This is the exact same abstract as the last TLA+ talk, despite the two having just one slide in common. Writing abstracts is not my strong suite.)
What is this, and who’s it for?
§ Lessons learned from the trenches building distributed systems for 8+ years at Cloudera and in
open source communities.
The CAP Theorem is a fundamental theorem in distributed systems that states any distributed system can have at most two of the following three properties.
This guide will summarize Gilbert and Lynch’s specification and proof of the CAP Theorem with pictures!
Abstract. Most of today’s Internet applications generate vast amounts
of data (typically, in the form of event logs) that needs to be processed
and analyzed for detailed reporting, enhancing user experience and increasing
monetization. In this paper, we describe the architecture of
Ubiq, a geographically distributed framework for processing continuously
growing log files in real time with high scalability, high availability and
low latency. The Ubiq framework fully tolerates infrastructure degradation
and data center-level outages without any manual intervention. It
also guarantees exactly-once semantics for application pipelines to process
logs as a collection of multiple events. Ubiq has been in production
for Google’s advertising system for many years and has served as a critical
log processing framework for several dozen pipelines. Our production
deployment demonstrates linear scalability with machine resources, extremely
high availability even with underlying infrastructure failures, and
an end-to-end latency of under a minute.
In the late 1980s and early 1990s, object-oriented pro-
gramming revolutionized software development, popu-
larizing the approach of building of applications as col-
lections of modular components. Today we are seeing
a similar revolution in distributed system development,
with the increasing popularity of microservice archi-
tectures built from containerized software components.
Containers     are particularly well-suited
as the fundamental “object” in distributed systems by
virtue of the walls they erect at the container bound-
ary. As this architectural style matures, we are seeing the
emergence of design patterns, much as we did for object-
oriented programs, and for the same reason – thinking in
terms of objects (or containers) abstracts away the low-
level details of code, eventually revealing higher-level
patterns that are common to a variety of applications and
This paper describes three types of design patterns
that we have observed emerging in container-based dis-
tributed systems: single-container patterns for container
management, single-node patterns of closely cooperat-
ing containers, and multi-node patterns for distributed
algorithms. Like object-oriented patterns before them,
these patterns for distributed computation encode best
practices, simplify development, and make the systems
where they are used more reliable.
We analyze how modern distributed storage systems behave
in the presence of file-system faults such as data
corruption and read and write errors. We characterize
eight popular distributed storage systems and uncover
numerous bugs related to file-system fault tolerance. We
find that modern distributed systems do not consistently
use redundancy to recover from file-system faults: a
single file-system fault can cause catastrophic outcomes
such as data loss, corruption, and unavailability. Our results
have implications for the design of next generation
fault-tolerant distributed and cloud storage systems.