We analyze how modern distributed storage systems behave
in the presence of file-system faults such as data
corruption and read and write errors. We characterize
eight popular distributed storage systems and uncover
numerous bugs related to file-system fault tolerance. We
find that modern distributed systems do not consistently
use redundancy to recover from file-system faults: a
single file-system fault can cause catastrophic outcomes
such as data loss, corruption, and unavailability. Our results
have implications for the design of next generation
fault-tolerant distributed and cloud storage systems.
Dr. Remzi Arpaci-Dusseau from the University of Wisconsin-Madison showed us how the simplest of things, like updating a file on a disk using a file system, is subtly complex and highly error-prone. File systems differ in semantics and the handling of edge cases. Through automated tools, he and his students have uncovered hidden assumptions in highly deployed systems, like Git, on specific file system behaviors that are not always correct. Assumptions can be about atomicity, failure handling, crash consistency guarantees, and more. This talk took us through the nuts and bolts of the lowest levels of storage systems.
It is estimated that over 90% of all new information produced in the world is being stored on magnetic media, most of it on hard disk drives. Despite their importance, there is relatively little published work on the failure patterns of disk drives, and the key factors that affect their lifetime. Most available data are either based on extrapolation from accelerated aging experiments or from relatively modest sized field studies. Moreover, larger population studies rarely have the infrastructure in place to collect health signals from components in operation, which is critical information for detailed failure analysis.
We present data collected from detailed observations of a large disk drive population in a production Internet services deployment. The population observed is many times larger than that of previous studies. In addition to presenting failure statistics, we analyze the correlation between failures and several parameters generally believed to impact longevity.
Our analysis identifies several parameters from the drive’s self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.
Disks form the central element of Cloud-based storage, whose demand far outpaces the considerable rate of innovation in disks. Exponential growth in demand, already in progress for 15+ years, implies that most future disks will be in data centers and thus part of a large collection of disks. We describe the “collection view” of disks and how it and the focus on tail latency, driven by live services, place new and different requirements on disks. Beyond defining key metrics for data-center disks, we explore a range of new physical design options and changes to firmware that could improve these metrics.
We hope this is the beginning of a new era of “data center” disks and a new broad and open discussion about how to evolve disks for data centers. The ideas presented here provide some guidance and some options, but we believe the best solutions will come from the combined efforts of industry, academia and other large customers.