Today’s graphs used in domains such as machine learning or
social network analysis may contain hundreds of billions of
edges. Yet, they are not necessarily stored efficiently, and standard
graph representations such as adjacency lists waste a
significant number of bits while graph compression schemes
such as WebGraph often require time-consuming decompression.
To address this, we propose Log(Graph): a graph representation
that combines high compression ratios with very
low-overhead decompression to enable cheaper and faster
graph processing. The key idea is to encode a graph so that
the parts of the representation approach or match the respective
storage lower bounds. We call our approach “graph
logarithmization” because these bounds are usually logarithmic.
Our high-performance Log(Graph) implementation
based on modern bitwise operations and state-of-the-art succinct
data structures achieves high compression ratios as well
as performance. For example, compared to the tuned Graph
Algorithm Processing Benchmark Suite (GAPBS), it reduces
graph sizes by 20-35% while matching GAPBS’ performance
or even delivering speedups due to reducing amounts of
transferred data. It approaches the compression ratio of the
established WebGraph compression library while enabling
speedups of up to more than 2×. Log(Graph) can improve
the design of various graph processing engines or libraries on
single NUMA nodes as well as distributed-memory systems
Dr. Remzi Arpaci-Dusseau from the University of Wisconsin-Madison showed us how the simplest of things, like updating a file on a disk using a file system, is subtly complex and highly error-prone. File systems differ in semantics and the handling of edge cases. Through automated tools, he and his students have uncovered hidden assumptions in highly deployed systems, like Git, on specific file system behaviors that are not always correct. Assumptions can be about atomicity, failure handling, crash consistency guarantees, and more. This talk took us through the nuts and bolts of the lowest levels of storage systems.
Spanner is Google’s highly available global SQL database [CDE+12]. It manages replicated data at great
scale, both in terms of size of data and volume of transactions. It assigns globally consistent real-time
timestamps to every datum written to it, and clients can do globally consistent reads across the entire
database without locking.