The best best practices guide to writing python, from the guy who wrote the best python library, requests.
Web-based enterprises process events generated by millions
of users interacting with their websites. Rich statistical
data distilled from combining such interactions in near realtime
generates enormous business value. In this paper, we
describe the architecture of Photon, a geographically distributed
system for joining multiple continuously flowing
streams of data in real-time with high scalability and low
latency, where the streams may be unordered or delayed.
The system fully tolerates infrastructure degradation and
datacenter-level outages without any manual intervention.
Photon guarantees that there will be no duplicates in the
joined output (at-most-once semantics) at any point in time,
that most joinable events will be present in the output in
real-time (near-exact semantics), and exactly-once semantics
Photon is deployed within Google Advertising System to
join data streams such as web search queries and user clicks
on advertisements. It produces joined logs that are used to
derive key business metrics, including billing for advertisers.
Our production deployment processes millions of events per
minute at peak with an average end-to-end latency of less
than 10 seconds. We also present challenges and solutions
in maintaining large persistent state across geographically
distant locations, and highlight the design principles that
emerged from our experience.
Cardinality estimation has a wide range of applications and
is of particular importance in database systems. Various
algorithms have been proposed in the past, and the HyperLogLog
algorithm is one of them. In this paper, we
present a series of improvements to this algorithm that reduce
its memory requirements and significantly increase its
accuracy for an important range of cardinalities. We have
implemented our proposed algorithm for a system at Google
and evaluated it empirically, comparing it to the original
HyperLogLog algorithm. Like HyperLogLog, our improved
algorithm parallelizes perfectly and computes the
cardinality estimate in a single pass.
Maintaining large codebases can be a challenging
endeavour. As new libraries, APIs and standards are introduced,
old code is migrated to use them. To provide as clean and
succinct an interface as possible for developers, old APIs are
ideally removed as new ones are introduced. In practice, this
becomes difficult as automatically finding and transforming code
in a semantically correct way can be challenging, particularly as
the size of a codebase increases.
In this paper, we present a real-world implementation of a
system to refactor large C++ codebases efficiently. A combination
of the Clang compiler framework and the MapReduce parallel
processor, ClangMR enables code maintainers to easily and
correctly transform large collections of code. We describe the
motivation behind such a tool, its implementation and then
present our experiences using it in a recent API update with
Google’s C++ codebase.