We analyze how modern distributed storage systems behave
in the presence of file-system faults such as data
corruption and read and write errors. We characterize
eight popular distributed storage systems and uncover
numerous bugs related to file-system fault tolerance. We
find that modern distributed systems do not consistently
use redundancy to recover from file-system faults: a
single file-system fault can cause catastrophic outcomes
such as data loss, corruption, and unavailability. Our results
have implications for the design of next-generation
fault-tolerant distributed and cloud storage systems.
Abstract— We show that the performance of existing
fault localization algorithms differs markedly for different
networks, and no algorithm simultaneously provides
high localization accuracy and low computational overhead.
We develop a framework to explain these behaviors
by anatomizing the algorithms with respect to six
important characteristics of real networks, such as uncertain
dependencies, noise, and covering relationships. We
use this analysis to develop Gestalt, a new algorithm that
combines the best elements of existing ones and includes
a new technique to explore the space of fault hypotheses.
We run experiments on three real, diverse networks. For
each, Gestalt has either significantly higher localization
accuracy or an order of magnitude lower running time.
For example, when applied to the Lync messaging system
that is used widely within corporations, Gestalt localizes
faults with the same accuracy as Sherlock, while
reducing fault localization time from days to 23 seconds.
Systems like Gmail and Picasa keep massive amounts of data in the cloud, all of which has to be constantly backed up to prepare for the inevitable. Typical backup and recovery techniques don’t scale, so Google has devised new methods for securing unprecedented volumes of data against every type of failure.
There are many unique challenges, both obvious and subtle, in delivering storage systems at this scale; we’ll discuss these and their solutions as well as some alternatives that didn’t make the grade.
About the speaker: Raymond Blum leads a team of Site Reliability Engineers charged with keeping Google’s and its users’ data safe and durable. Prior to coming to Google, he was the IT director for a hedge fund, after spending a few lifetimes developing systems at HBO and on Wall Street. In his meager spare time he indulges his interests in robotics and home automation and reads too much science fiction.
It is estimated that over 90% of all new information produced in the world is being stored on magnetic media, most of it on hard disk drives. Despite their importance, there is relatively little published work on the failure patterns of disk drives and the key factors that affect their lifetime. Most available data are based either on extrapolation from accelerated aging experiments or on relatively modest-sized field studies. Moreover, larger population studies rarely have the infrastructure in place to collect health signals from components in operation, which is critical information for detailed failure analysis.
We present data collected from detailed observations of a large disk drive population in a production Internet services deployment. The population observed is many times larger than that of previous studies. In addition to presenting failure statistics, we analyze the correlation between failures and several parameters generally believed to impact longevity.
Our analysis identifies several parameters from the drive’s self-monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.
Maglev is Google’s network load balancer. It is a
large distributed software system that runs on commodity
Linux servers. Unlike traditional hardware network load
balancers, it does not require a specialized physical rack
deployment, and its capacity can be easily adjusted by
adding or removing servers. Network routers distribute
packets evenly to the Maglev machines via Equal Cost
Multipath (ECMP); each Maglev machine then matches
the packets to their corresponding services and spreads
them evenly to the service endpoints. To accommodate
high and ever-increasing traffic, Maglev is specifically
optimized for packet processing performance. A single
Maglev machine is able to saturate a 10Gbps link with
small packets. Maglev is also equipped with consistent
hashing and connection tracking features, to minimize
the negative impact of unexpected faults and failures on
connection-oriented protocols. Maglev has been serving
Google’s traffic since 2008. It has sustained the rapid
global growth of Google services, and it also provides
network load balancing for Google Cloud Platform.
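
To illustrate why consistent hashing and connection tracking limit the impact of backend changes, here is a minimal Python sketch. It uses a generic hash ring with virtual nodes, which is an assumption made for illustration and not Maglev's actual hashing scheme; the names `HashRing`, `Balancer`, `pick`, and the `be-N` backend labels are all hypothetical.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Generic consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, backends, vnodes=100):
        backends = list(backends)
        if not backends:
            raise ValueError("at least one backend is required")
        # Each backend owns `vnodes` points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{b}#{i}"), b) for b in backends for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def lookup(self, flow: str) -> str:
        # The first ring point clockwise from the flow's hash owns the flow.
        idx = bisect.bisect_right(self._points, _hash(flow)) % len(self._ring)
        return self._ring[idx][1]


class Balancer:
    """Hypothetical balancer: connection table first, hash ring as fallback."""

    def __init__(self, backends):
        self.backends = set(backends)
        self.ring = HashRing(self.backends)
        self.conn_table = {}  # flow identifier -> chosen backend

    def remove_backend(self, backend):
        self.backends.discard(backend)
        self.ring = HashRing(self.backends)

    def pick(self, flow: str) -> str:
        backend = self.conn_table.get(flow)
        if backend not in self.backends:
            # Unknown flow, or its backend is gone: fall back to the ring.
            backend = self.ring.lookup(flow)
            self.conn_table[flow] = backend
        return backend
```

In this sketch, removing a backend remaps only the flows that the removed backend owned; every other flow keeps its backend, both through the ring and through the connection table. This is the property that matters for connection-oriented protocols.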
Summary. Google’s Ads Data Infrastructure systems run the multibillion-dollar ads business at Google. High availability and strong consistency are critical for these systems. While most distributed systems handle machine-level failures well, handling datacenter-level failures is less common. In our experience, handling datacenter-level failures is critical for running truly high-availability systems. Most of our systems (e.g., Photon, F1, Mesa) now support multi-homing as a fundamental design property. Multi-homed systems run live in multiple datacenters all the time, adaptively moving load between datacenters, with the ability to handle outages of any scale completely transparently. This paper focuses primarily on stream processing systems. It describes our general approaches for building high-availability multi-homed systems, discusses common challenges and solutions, and shares what we have learned in building and running these large-scale systems for over ten years.
Google has long had a culture of intentionally causing failures in its systems in order to find weaknesses and fix them before they surface in an uncontrolled manner. Along the way, we built up several supporting components that had to be addressed as well: failure automation, incident response, learning from postmortems, and failure prevention. This talk pulls together lessons (and war stories) from the entire lifecycle.