PROCHLO: Strong Privacy for Analytics in the Crowd

The large-scale monitoring of computer users’ software
activities has become commonplace, e.g., for application
telemetry, error reporting, or demographic profiling. This
paper describes a principled systems architecture—Encode,
Shuffle, Analyze (ESA)—for performing such monitoring
with high utility while also protecting user privacy. The ESA
design, and its PROCHLO implementation, are informed by
our practical experiences with an existing, large deployment
of privacy-preserving software monitoring.
With ESA, the privacy of monitored users’ data is guaranteed
by its processing in a three-step pipeline. First, the data
is encoded to control scope, granularity, and randomness.
Second, the encoded data is collected in batches subject to
a randomized threshold, and blindly shuffled, to break linkability
and to ensure that individual data items get “lost in the
crowd” of the batch. Third, the anonymous, shuffled data is
analyzed by a specific analysis engine that further prevents
statistical inference attacks on analysis results.
ESA extends existing best-practice methods for sensitivedata
analytics, by using cryptography and statistical techniques
to make explicit how data is elided and reduced in
precision, how only common-enough, anonymous data is analyzed,
and how this is done for only specific, permitted purposes.
As a result, ESA remains compatible with the established
workflows of traditional database analysis.
Strong privacy guarantees, including differential privacy,
can be established at each processing step to defend
against malice or compromise at one or more of those steps.
PROCHLO develops new techniques to harden those steps,
including the Stash Shuffle, a novel scalable and efficient
oblivious-shuffling algorithm based on Intel’s SGX, and new
applications of cryptographic secret sharing and blinding.
We describe ESA and PROCHLO, as well as experiments
that validate their ability to balance utility and privacy.


Research Blog: Federated Learning: Collaborative Machine Learning without Centralized Training Data

Standard machine learning approaches require centralizing the training data on one machine or in a datacenter. And Google has built one of the most secure and robust cloud infrastructures for processing this data to make our services better. Now for models trained from user interaction with mobile devices, we’re introducing an additional approach: Federated Learning.

Federated Learning enables mobile phones to collaboratively learn a shared prediction model while keeping all the training data on device, decoupling the ability to do machine learning from the need to store the data in the cloud. This goes beyond the use of local models that make predictions on mobile devices (like the Mobile Vision API and On-Device Smart Reply) by bringing model training to the device as well.


CONIKS: Bringing Key Transparency to End Users

We present CONIKS, an end-user key verification service capable of integration in end-to-end encrypted communication systems. CONIKS builds on transparency log proposals for web server certificates but solves several new challenges specific to key verification for end users. CONIKS obviates the need for global third-party monitors and enables users to efficiently monitor their own key bindings for consistency, downloading less than 20 kB per day to do so even for a provider with billions of users. CONIKS users and providers can collectively audit providers for non-equivocation, and this requires downloading a constant 2.5 kB per provider per day. Additionally, CONIKS preserves the level of privacy offered by today’s major communication services, hiding the list of usernames present and even allowing providers to conceal the total number of users in the system.


Building a RAPPOR with the Unknown: Privacy-Preserving Learning of Associations and Data Dictionaries

Techniques based on randomized response enable the collection of potentially sensitive data from clients in a privacy-preserving manner with strong local differential privacy guarantees. A recent such technology, RAPPOR, enables estimation of the marginal frequencies of a set of strings via privacy-preserving crowdsourcing. However, this original estimation process relies on a known dictionary of possible strings; in practice, this dictionary can be extremely large and/or unknown. In this paper, we propose a novel decoding algorithm for the RAPPOR mechanism that enables the estimation of “unknown unknowns,” i.e., strings we do not know we should be estimating. To enable learning without explicit dictionary knowledge, we develop methodology for estimating the joint distribution of multiple variables collected with RAPPOR. Our contributions are not RAPPOR-specific, and can be generalized to other local differential privacy mechanisms for learning distributions of string-valued random variables.


Privacy-Preserving Network Flow Recording

Network flow recording is an important tool with applications that range from legal compliance and security auditing
to network forensics, troubleshooting, and marketing. Unfortunately, current network flow recording technologies
do not allow network operators to enforce a privacy policy on the data that is recorded, in particular how this
data is stored and used within the organization. Challenges to building such a technology include the public
key infrastructure, scalability, and gathering statistics about the data while still preserving privacy.

We present a network flow recording technology that addresses these challenges by using Identity Based Encryption in combination with privacy-preserving semantics for on-the-fly statistics. We argue that our implementation
supports a wide range of policies that cover many current applications of network flow recording. We also
characterize the performance and scalability of our implementation and find that the encryption and statistics scale
well and can easily keep up with the rate at which commodity systems can capture traffic, with a couple of interesting
caveats about the size of the subnet that data is being recorded for and how statistics generation is affected
by implementation details. We conclude that privacy preserving network flow recording is possible at 10 gigabit
rates for subnets as large as a /20 (4096 hosts).

Because network flow recording is one of the most serious threats to web privacy today, we believe that developing
technology to enforce a privacy policy on the recorded data is an important first step before policy makers
can make decisions about how network operators can and should store and use network flow data. Our goal in
this paper is to explore the tradeoffs of performance and scalability vs. privacy, and the usefulness of the recorded
data in forensics vs. privacy.


Differential Privacy for Collaborative Security

Fighting global security threats with only a local view is inherently difficult. Internet network operators need to fight global phenomena
such as botnets, but they are hampered by the fact that operators can observe only the traffic in their local domains. We propose a
collaborative approach to this problem, in which operators share aggregate information about the traffic in their respective domains
through an automated query mechanism. We argue that existing work on differential privacy and type systems can be leveraged to
build a programmable query mechanism that can express a wide range of queries while limiting what can be learned about individual
customers. We report on our progress towards building such a mechanism, and we discuss opportunities and challenges of the
collaborative security approach.


Replacing Sawzall — a case study in domain-specific language migration

In a previous post, we described how data scientists at Google used Sawzall to perform powerful, scalable analysis. However, over the last three years we’ve eliminated almost all our Sawzall code, and now the niche that Sawzall occupied in our software ecosystem is mostly filled by Go. In this post, we’ll describe Sawzall’s role in Google’s analysis ecosystem, explain some of the problems we encountered as Sawzall use increased which motivated our migration, and detail the techniques we applied to achieve language-agnostic analysis while maintaining strong access controls and the ability to write fast, scalable analyses.