Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams

Web-based enterprises process events generated by millions
of users interacting with their websites. Rich statistical
data distilled from combining such interactions in near real-time
generates enormous business value. In this paper, we
describe the architecture of Photon, a geographically distributed
system for joining multiple continuously flowing
streams of data in real-time with high scalability and low
latency, where the streams may be unordered or delayed.
The system fully tolerates infrastructure degradation and
datacenter-level outages without any manual intervention.
Photon guarantees that there will be no duplicates in the
joined output (at-most-once semantics) at any point in time,
that most joinable events will be present in the output in
real-time (near-exact semantics), and that exactly-once semantics
hold eventually.
Photon is deployed within Google Advertising System to
join data streams such as web search queries and user clicks
on advertisements. It produces joined logs that are used to
derive key business metrics, including billing for advertisers.
Our production deployment processes millions of events per
minute at peak with an average end-to-end latency of less
than 10 seconds. We also present challenges and solutions
in maintaining large persistent state across geographically
distant locations, and highlight the design principles that
emerged from our experience.

Source: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41318.pdf
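
As a rough illustration of the at-most-once guarantee described in the abstract, the sketch below (a toy under stated assumptions, not Photon's actual implementation) joins a click against a previously logged query and emits a record only if the click id can be committed to a registry of already-joined ids; Photon's real registry is a geo-replicated, Paxos-backed store rather than an in-memory set, and unjoined clicks are retried until the delayed query logs arrive, which is where the near-exact and eventual exactly-once behavior comes from.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy joiner: a click is emitted only if its id can be committed to a
// registry of already-joined ids (at-most-once). A single in-memory set
// stands in for Photon's geo-replicated, Paxos-backed IdRegistry.
final class ToyClickQueryJoiner {
  private final Set<String> joinedClickIds = ConcurrentHashMap.newKeySet();
  private final Map<String, String> queryLog = new ConcurrentHashMap<>();

  void recordQuery(String queryId, String queryPayload) {
    queryLog.put(queryId, queryPayload);
  }

  // Returns the joined record, or null if the click cannot be joined yet
  // (query log delayed: retry later) or has already been joined (duplicate).
  String joinClick(String clickId, String queryId, String clickPayload) {
    String query = queryLog.get(queryId);
    if (query == null) {
      return null;                      // query not yet available: retry, do not drop
    }
    if (!joinedClickIds.add(clickId)) {
      return null;                      // already committed elsewhere: keep at-most-once
    }
    return clickId + "\t" + query + "\t" + clickPayload;
  }
}
```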

HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm

Cardinality estimation has a wide range of applications and
is of particular importance in database systems. Various
algorithms have been proposed in the past, and the HyperLogLog
algorithm is one of them. In this paper, we
present a series of improvements to this algorithm that reduce
its memory requirements and significantly increase its
accuracy for an important range of cardinalities. We have
implemented our proposed algorithm for a system at Google
and evaluated it empirically, comparing it to the original
HyperLogLog algorithm. Like HyperLogLog, our improved
algorithm parallelizes perfectly and computes the
cardinality estimate in a single pass.

Source: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40671.pdf
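
For context, a minimal sketch of the HyperLogLog estimator: m = 2^p registers each track the longest run of leading zeros seen among the hashed values routed to them, and a bias-corrected harmonic mean of the register values yields the cardinality estimate. Using a 64-bit hash, as below, is itself one of the paper's improvements; its sparse representation and empirical bias correction are omitted here.

```java
// Minimal HyperLogLog sketch with a 64-bit hash (not the paper's full improved variant).
final class HyperLogLogSketch {
  private final int p;            // precision: number of index bits
  private final int m;            // number of registers, m = 2^p
  private final byte[] registers; // each holds the max rank seen for its substream

  HyperLogLogSketch(int p) {
    this.p = p;
    this.m = 1 << p;
    this.registers = new byte[m];
  }

  void add(long item) {
    long x = hash64(item);
    int index = (int) (x >>> (64 - p));      // first p bits choose a register
    long rest = x << p;                      // remaining bits determine the rank
    int rank = Math.min(64 - p, Long.numberOfLeadingZeros(rest)) + 1;
    if (rank > registers[index]) {
      registers[index] = (byte) rank;
    }
  }

  double estimate() {
    double sum = 0.0;
    int zeroRegisters = 0;
    for (byte r : registers) {
      sum += Math.pow(2.0, -r);
      if (r == 0) zeroRegisters++;
    }
    double alpha = 0.7213 / (1.0 + 1.079 / m);   // bias-correction constant
    double e = alpha * m * (double) m / sum;
    if (e <= 2.5 * m && zeroRegisters > 0) {     // small-range correction (linear counting)
      e = m * Math.log((double) m / zeroRegisters);
    }
    return e;
  }

  // splitmix64 finalizer used as a stand-in 64-bit hash for long-valued items.
  private static long hash64(long z) {
    z += 0x9e3779b97f4a7c15L;
    z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
    z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
    return z ^ (z >>> 31);
  }
}
```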

Large-Scale Automated Refactoring Using ClangMR

Maintaining large codebases can be a challenging
endeavour. As new libraries, APIs and standards are introduced,
old code is migrated to use them. To provide as clean and
succinct an interface as possible for developers, old APIs are
ideally removed as new ones are introduced. In practice, this
becomes difficult as automatically finding and transforming code
in a semantically correct way can be challenging, particularly as
the size of a codebase increases.
In this paper, we present a real-world implementation of a
system to refactor large C++ codebases efficiently. A combination
of the Clang compiler framework and the MapReduce parallel
processor, ClangMR enables code maintainers to easily and
correctly transform large collections of code. We describe the
motivation behind such a tool and its implementation, and then
present our experiences using it in a recent API update to
Google’s C++ codebase.

Source: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41342.pdf

Scalable, Example-Based Refactorings with Refaster

We discuss Refaster, a tool that uses normal, compilable
before-and-after examples of Java code to specify a Java
refactoring. Refaster has been used successfully by the Java
Core Libraries Team at Google to perform a wide variety
of refactorings across Google’s massive Java codebase. Our
main contribution is that a large class of useful refactorings
can be expressed in pure Java, without a specialized DSL,
while keeping the tool easily accessible to average Java
developers.

Source: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41876.pdf
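
A minimal before-and-after template in the style the abstract describes, assuming the open-source Refaster annotations shipped with Error Prone (com.google.errorprone.refaster.annotation); the specific rewrite shown is a hypothetical example, not one taken from the paper.

```java
import com.google.errorprone.refaster.annotation.AfterTemplate;
import com.google.errorprone.refaster.annotation.BeforeTemplate;

// Hypothetical template: rewrite any occurrence of s.length() == 0 into s.isEmpty().
public class StringLengthZero {
  @BeforeTemplate
  boolean lengthIsZero(String s) {
    return s.length() == 0;
  }

  @AfterTemplate
  boolean isEmpty(String s) {
    return s.isEmpty();
  }
}
```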

Verified Boot on Chrome OS and How to do it yourself

Chrome OS uses a first-stage read-only firmware and a second-stage updatable firmware. The updatable firmware is signed and contains kernel keys and a dm-verity hash, so that the firmware, Linux kernel and root filesystem are all protected against corruption and attack. This system is described and discussed. As part of Google’s upstream efforts in U-Boot, a generalized secure boot system has been developed and released with U-Boot 2013.07. This implementation uses the FIT format, which collects together images such as kernels, device trees and RAM disks. Support is provided for TPMs (Trusted Platform Modules), RSA-based signing and verification, and hashing with hardware acceleration. This system is also described and discussed, along with the specific steps needed to implement it in your designs.

Source: https://research.google.com/pubs/pub42038.html
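
A toy sketch of the chain-of-trust idea described above, using standard java.security APIs: the read-only stage verifies the signed updatable firmware, the firmware's embedded key verifies the kernel, and a trusted root hash is checked against the filesystem data. How keys and hashes are actually packaged in Chrome OS firmware or in FIT images is not modeled, and the flat hash check only stands in for dm-verity's hash tree.

```java
import java.security.MessageDigest;
import java.security.PublicKey;
import java.security.Signature;
import java.util.Arrays;

// Toy chain of trust: each stage verifies the next before it is allowed to run.
final class ToyVerifiedBoot {

  static boolean verifySignature(PublicKey key, byte[] data, byte[] sig) throws Exception {
    Signature verifier = Signature.getInstance("SHA256withRSA");
    verifier.initVerify(key);
    verifier.update(data);
    return verifier.verify(sig);
  }

  static boolean verifyHash(byte[] data, byte[] trustedSha256) throws Exception {
    byte[] actual = MessageDigest.getInstance("SHA-256").digest(data);
    return Arrays.equals(actual, trustedSha256);
  }

  static boolean bootAllowed(PublicKey readOnlyRootKey,
                             byte[] updatableFirmware, byte[] firmwareSig,
                             PublicKey kernelKeyFromFirmware,
                             byte[] kernel, byte[] kernelSig,
                             byte[] rootfsData, byte[] rootfsHashFromKernel) throws Exception {
    return verifySignature(readOnlyRootKey, updatableFirmware, firmwareSig)  // stage 1 trusts stage 2
        && verifySignature(kernelKeyFromFirmware, kernel, kernelSig)         // stage 2 trusts kernel
        && verifyHash(rootfsData, rootfsHashFromKernel);                     // kernel trusts rootfs
  }
}
```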

Let’s Parse to Prevent Pwnage

Software that processes rich content suffers from endemic security vulnerabilities. Frequently, these bugs are due to data confusion: discrepancies in how content data is parsed, composed, and otherwise processed by different applications, frameworks, and language runtimes. Data confusion often enables code injection attacks, such as cross-site scripting or SQL injection, by leading to incorrect assumptions about the encodings and checks applied to rich content of uncertain provenance. However, even for well-structured, value-only content, data confusion can critically impact security, e.g., as shown by XML signature vulnerabilities [12]. This paper advocates the position that data confusion can be effectively prevented through the use of simple mechanisms—based on parsing—that eliminate ambiguities by fully resolving content data to normalized, clearly-understood forms. Using code injection on the Web as our motivation, we make the case that automatic defense mechanisms should be integrated with programming languages, application frameworks, and runtime libraries, and applied with little, or no, developer intervention. We outline a scalable, sustainable approach for developing and maintaining those mechanisms. The resulting tools can offer comprehensive protection against data confusion, even when multiple types of rich content data are processed and composed in complex ways.

Source: https://research.google.com/pubs/pub38080.html
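
A toy illustration of the kind of data confusion the paper targets: untrusted text spliced into HTML without first being resolved to an unambiguous, escaped form is parsed by the browser as markup. Real frameworks apply parser-driven, context-aware autoescaping rather than a single escaping function; the escaper below only sketches the idea for the HTML text context.

```java
// Demonstrates data confusion in HTML composition and its resolution via escaping.
final class HtmlEscapeDemo {
  // Minimal escaper for the HTML text context (a stand-in for context-aware autoescaping).
  static String escapeHtml(String s) {
    StringBuilder out = new StringBuilder(s.length());
    for (char c : s.toCharArray()) {
      switch (c) {
        case '&': out.append("&amp;"); break;
        case '<': out.append("&lt;"); break;
        case '>': out.append("&gt;"); break;
        case '"': out.append("&quot;"); break;
        case '\'': out.append("&#39;"); break;
        default: out.append(c);
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    String untrusted = "<script>alert('pwned')</script>";
    // Data confusion: the producer assumes plain text, the browser parses markup.
    String vulnerable = "<p>" + untrusted + "</p>";
    // Resolving the data to a normalized, unambiguous form before composition.
    String safe = "<p>" + escapeHtml(untrusted) + "</p>";
    System.out.println(vulnerable);
    System.out.println(safe);
  }
}
```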