Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions

We analyze how modern distributed storage systems behave in the presence of file-system faults such as data corruption and read and write errors. We characterize eight popular distributed storage systems and uncover numerous bugs related to file-system fault tolerance. We find that modern distributed systems do not consistently use redundancy to recover from file-system faults: a single file-system fault can cause catastrophic outcomes such as data loss, corruption, and unavailability. Our results have implications for the design of next-generation fault-tolerant distributed and cloud storage systems.
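
As a rough, hypothetical illustration of the single-fault experiments the paper describes (the paper itself injects faults through a file-system layer rather than editing files on disk), the Python sketch below flips one byte in a replica's data file to emulate a silent block corruption; the file path is made up.

# Hypothetical data file belonging to one replica of a distributed store.
REPLICA_FILE = "/var/lib/examplestore/replica1/block_0001.dat"

def corrupt_one_byte(path, offset):
    """Flip a single byte at `offset` to emulate silent data corruption."""
    with open(path, "r+b") as f:
        f.seek(offset)
        original = f.read(1)
        if not original:
            raise ValueError("offset is past end of file")
        f.seek(offset)
        f.write(bytes([original[0] ^ 0xFF]))  # invert every bit in that byte

if __name__ == "__main__":
    corrupt_one_byte(REPLICA_FILE, offset=128)
    # Then read the affected data back through the storage system's client API
    # and check whether the other replicas are used to mask the corruption.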

Source: https://www.usenix.org/system/files/conference/fast17/fast17-ganesan.pdf

An Inside Look at Google BigQuery

This white paper introduces Google BigQuery, a fully managed, cloud-based interactive query service for massive datasets. BigQuery is the external implementation of one of the company’s core technologies, code-named Dremel. This paper discusses the uniqueness of the technology as a cloud-enabled massively parallel query engine, the differences between BigQuery and Dremel, and how BigQuery compares with other technologies such as MapReduce/Hadoop and existing data warehouse solutions.

Note: This is from 2012, so it doesn’t include more recent additions such as upserts and standard SQL support.
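
For readers who haven’t used the service, here is a minimal sketch of running a standard SQL query (one of those post-2012 additions) through the BigQuery Python client; it assumes application default credentials and a project are configured, and the public sample table is only illustrative.

from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# Standard SQL, the dialect added after this white paper was written.
sql = """
    SELECT corpus, SUM(word_count) AS total_words
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY corpus
    ORDER BY total_words DESC
    LIMIT 5
"""

for row in client.query(sql).result():  # result() blocks until the job finishes
    print(row.corpus, row.total_words)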

Source: https://cloud.google.com/files/BigQueryTechnicalWP.pdf

Understanding Synthetic Gradients and Decoupled Neural Interfaces

When training neural networks, the use of Synthetic Gradients (SG) allows layers or modules to be trained without update locking – without waiting for a true error gradient to be backpropagated – resulting in Decoupled Neural Interfaces (DNIs). This unlocked ability to update parts of a neural network asynchronously and with only local information was demonstrated to work empirically in Jaderberg et al. (2016). However, there has been very little demonstration of what changes DNIs and SGs impose from a functional, representational, and learning dynamics point of view. In this paper, we study DNIs through the use of synthetic gradients on feed-forward networks to better understand their behaviour and elucidate their effect on optimisation. We show that the incorporation of SGs does not affect the representational strength of the learning system for a neural network, and prove the convergence of the learning system for linear and deep linear models. On practical problems we investigate the mechanism by which synthetic gradient estimators approximate the true loss, and, surprisingly, how that leads to drastically different layer-wise representations. Finally, we also expose the relationship of using synthetic gradients to other error approximation techniques and find a unifying language for discussion and comparison.
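
To make the mechanism concrete, here is a minimal numpy sketch of our own (not code from the paper): a two-layer regression network in which a linear synthetic-gradient module predicts the gradient at the layer boundary from the hidden activations, so the first layer updates without waiting for backpropagation, while the module itself is regressed toward the true gradient. In a real DNI the true gradient would arrive asynchronously; here it is computed in the same step purely as the regression target.

import numpy as np

rng = np.random.default_rng(0)

# Toy regression task: learn t = x @ A with a two-layer network.
d_in, d_hid, d_out, n = 8, 16, 4, 256
X = rng.normal(size=(n, d_in))
T = X @ rng.normal(size=(d_in, d_out))

W1 = rng.normal(scale=0.1, size=(d_in, d_hid))
W2 = rng.normal(scale=0.1, size=(d_hid, d_out))
M = np.zeros((d_hid, d_hid))   # linear SG module: g_hat = h @ M

lr, lr_sg, batch = 0.01, 0.001, 32
for step in range(2001):
    idx = rng.integers(0, n, size=batch)
    x, t = X[idx], T[idx]

    h = np.tanh(x @ W1)        # layer 1 activations
    y = h @ W2                 # layer 2 output
    err = y - t                # dL/dy for L = 0.5 * ||y - t||^2

    # Layer 1 updates immediately with the *synthetic* gradient of L w.r.t. h,
    # i.e. without update locking on the true backpropagated gradient.
    g_hat = h @ M
    W1 -= lr * x.T @ (g_hat * (1.0 - h ** 2)) / batch

    # True gradient w.r.t. h; in this sketch it is used only as the SG target.
    g_true = err @ W2.T

    # Layer 2 uses its ordinary local gradient.
    W2 -= lr * h.T @ err / batch

    # Regress the SG module toward the true gradient.
    M -= lr_sg * h.T @ (g_hat - g_true) / batch

    if step % 500 == 0:
        print(step, float(np.mean(err ** 2)))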

Source: https://arxiv.org/pdf/1703.00522.pdf

Spotify’s Event Delivery – The Road to the Cloud (Part III)

In the first post in this series, we talked about how our old event system worked and some of the lessons we learned from operating it. In the second post, we covered the design of our new event delivery system, and why we chose Cloud Pub/Sub as the transport mechanism for all events. In this third and final post, we will explain how we intend to consume all the published events with Dataflow, and what we have discovered about the performance of this approach so far.
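
As a hedged sketch of what the consumption side can look like (not Spotify's actual pipeline), here is a minimal Apache Beam streaming job in Python that reads events from a Cloud Pub/Sub subscription, windows them, and counts them per window; the subscription name is made up, and the post's real pipeline writes hourly buckets to Cloud Storage rather than printing counts.

import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms.combiners import CountCombineFn

# Illustrative name only, not Spotify's real project or subscription.
SUBSCRIPTION = "projects/example-project/subscriptions/example-events"

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True  # Pub/Sub reads require streaming mode

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(subscription=SUBSCRIPTION)
        | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
        | "Window" >> beam.WindowInto(window.FixedWindows(60))  # 1-minute windows for the sketch
        | "Count" >> beam.CombineGlobally(CountCombineFn()).without_defaults()
        | "Print" >> beam.Map(print)
    )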

Source: https://labs.spotify.com/2016/03/10/spotifys-event-delivery-the-road-to-the-cloud-part-iii/

Spotify’s Event Delivery – The Road to the Cloud (Part II)

Whenever a user performs an action in the Spotify client (such as listening to a song or searching for an artist), a small piece of information, an event, is sent to our servers. Event delivery, the process of making sure that all events get transported safely from clients all over the world to our central processing system, is an interesting problem. In this series of blog posts, we are going to look at some of the work we have done in this area. More specifically, we are going to look at the architecture of our new event delivery system, and tell you why we chose to base our new system on Google Cloud managed services.

In the first post in this series, we talked about how our old event system worked and some of the lessons we learned from operating it. In this second post, we’ll cover the design of our new event delivery system, and why we chose Cloud Pub/Sub as the transport mechanism for all events.
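
For context on what using Cloud Pub/Sub as the event transport looks like from the producer side, here is a minimal, hypothetical Python sketch; the project, topic, and event payload are made up and are not Spotify's.

import json
from google.cloud import pubsub_v1  # pip install google-cloud-pubsub

# Illustrative identifiers only.
PROJECT_ID = "example-project"
TOPIC_ID = "song-played-events"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

# A made-up event, roughly "a user listened to a song".
event = {"event_type": "song_played", "track_id": "abc123", "duration_ms": 215000}

future = publisher.publish(topic_path, json.dumps(event).encode("utf-8"))
print("published message id:", future.result())  # blocks until the broker acknowledges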

Source: https://labs.spotify.com/2016/03/03/spotifys-event-delivery-the-road-to-the-cloud-part-ii/

Lessons learned from B4, Google’s SDN WAN

Google’s B4 wide area network was first revealed several years ago. The outside observer might have thought, “Google’s B4 is finished. I wonder what they’re going to do next.” Turns out, once any network is in production @scale, there’s a continued need to make it better. Subhasree Mandal covered the reality of how Google iterated multiple times on different parts of B4 to improve its performance, availability, and scalability. Several of the challenges and solutions that Subhasree detailed were definitely at the intersection of networking and distributed systems. B4 was covered in a SIGCOMM 2013 paper from Google.