Web-based enterprises process events generated by millions
of users interacting with their websites. Rich statistical
data distilled from combining such interactions in near realtime
generates enormous business value. In this paper, we
describe the architecture of Photon, a geographically distributed
system for joining multiple continuously flowing
streams of data in real-time with high scalability and low
latency, where the streams may be unordered or delayed.
The system fully tolerates infrastructure degradation and
datacenter-level outages without any manual intervention.
Photon guarantees that there will be no duplicates in the
joined output (at-most-once semantics) at any point in time,
that most joinable events will be present in the output in
real-time (near-exact semantics), and exactly-once semantics
Photon is deployed within Google Advertising System to
join data streams such as web search queries and user clicks
on advertisements. It produces joined logs that are used to
derive key business metrics, including billing for advertisers.
Our production deployment processes millions of events per
minute at peak with an average end-to-end latency of less
than 10 seconds. We also present challenges and solutions
in maintaining large persistent state across geographically
distant locations, and highlight the design principles that
emerged from our experience.