XRay: A Function Call Tracing System

Debugging high-throughput, low-latency C/C++ systems in production is hard. At Google we developed XRay, a function call tracing system that allows Google engineers to get accurate function call traces with negligible overhead when off and moderate overhead when on, making it suitable for services deployed in production. XRay enables efficient function call entry/exit logging with high-accuracy timestamps, and can be dynamically enabled and disabled. This white paper describes the XRay tracing system and its implementation. It also describes future plans for open-sourcing XRay and engaging open-source communities.
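XRay itself works by patching instrumentation sleds into compiled C/C++ binaries; as a loose language-level analogy only (a sketch, not XRay's mechanism), Python's profiling hook can log function entry and exit with high-resolution timestamps and be switched on and off at runtime:

```python
import sys
import time

trace_log = []  # in-memory sink; XRay instead writes to efficient per-thread buffers

def _on_event(frame, event, arg):
    # Record function entry/exit with a nanosecond-resolution timestamp.
    if event == "call":
        trace_log.append(("enter", frame.f_code.co_name, time.perf_counter_ns()))
    elif event == "return":
        trace_log.append(("exit", frame.f_code.co_name, time.perf_counter_ns()))

def enable_tracing():
    sys.setprofile(_on_event)   # analogous to patching XRay's sleds in

def disable_tracing():
    sys.setprofile(None)        # analogous to patching them back out

def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

enable_tracing()
fib(3)
disable_tracing()
fib(3)  # leaves no trace: the hook is removed, so the second run is not recorded
```

Unlike this sketch, XRay's disabled state costs only a short run of no-op instructions per instrumented function, which is what makes the "negligible overhead when off" property achievable in production binaries.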

Source: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45287.pdf

TensorFlow: A system for large-scale machine learning

TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow graph across many machines in a cluster, and within a machine across multiple computational devices, including multicore CPUs, general-purpose GPUs, and custom-designed ASICs known as Tensor Processing Units (TPUs). This architecture gives flexibility to the application developer: whereas in previous “parameter server” designs the management of shared state is built into the system, TensorFlow enables developers to experiment with novel optimizations and training algorithms. TensorFlow supports a variety of applications, with particularly strong support for training and inference on deep neural networks. Several Google services use TensorFlow in production; we have released it as an open-source project, and it has become widely used for machine learning research. In this paper, we describe the TensorFlow dataflow model in contrast to existing systems, and demonstrate the compelling performance that TensorFlow achieves for several real-world applications.
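As a hedged illustration of the dataflow model this abstract describes (a toy sketch, not TensorFlow's actual API), computation can be represented as a graph of operation nodes, with shared state held in a mutable variable that an assign-style node updates, as in a training step:

```python
# Toy dataflow graph: nodes are operations, edges carry values, and a
# Variable holds mutable shared state, echoing how TensorFlow separates
# stateless ops from the ops that mutate state.
class Variable:
    def __init__(self, value):
        self.value = value

class Node:
    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs

    def run(self):
        # Evaluate inputs (recursively for upstream nodes), then apply the op.
        return self.op(*(i.run() if isinstance(i, Node) else i
                         for i in self.inputs))

w = Variable(1.0)  # shared state: a model weight

def sgd_step(g):
    # State-mutating op: apply a gradient update to the shared variable.
    w.value -= g
    return w.value

read_w   = Node(lambda: w.value)            # stateless read of shared state
grad     = Node(lambda x: 0.2 * x, read_w)  # stand-in for a computed gradient
assign_w = Node(sgd_step, grad)             # mutates w when executed

print(assign_w.run())  # prints 0.8
```

The point of the structure, per the abstract, is that state management lives in ordinary graph nodes rather than being baked into the runtime, so developers can swap in novel update rules by rewiring the graph.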

Source: https://arxiv.org/pdf/1605.08695v2.pdf

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

TensorFlow [1] is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November 2015 and are available at http://www.tensorflow.org

Source: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45166.pdf

Flayer: Exposing Application Internals

Flayer is a tool for dynamically exposing application innards for security testing and analysis. It is implemented on the dynamic binary instrumentation framework Valgrind [17] and its memory error detection plugin, Memcheck [21]. This paper focuses on the implementation of Flayer, its supporting libraries, and their application to software security.

Flayer provides tainted, or marked, data flow analysis and instrumentation mechanisms for arbitrarily altering that flow. Flayer improves upon prior taint tracing tools with bit-precision. Taint propagation calculations are performed for each value-creating memory or register operation. These calculations are embedded in the target application’s running code using dynamic instrumentation. The same technique has been employed to allow the user to control the outcome of conditional jumps and step over function calls.

Flayer’s functionality provides a robust foundation for the implementation of security tools and techniques. In particular, this paper presents an effective fault injection testing technique and an automation library, LibFlayer. Alongside these contributions, it explores techniques for vulnerability patch analysis and guided source code auditing.

Flayer finds errors in real software. In the past year, its use has yielded the expedient discovery of flaws in security-critical software including OpenSSH and OpenSSL.
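The bit-precision taint propagation described above can be sketched as follows (a toy model, not Flayer's Valgrind-based implementation): each value carries a bitmask marking which of its individual bits derive from tainted input, and each operation computes the result's mask alongside the result's value.

```python
# Toy bit-precision taint tracking: taint is a per-bit mask, so masking a
# tainted byte with an untainted constant narrows the taint to the bits
# that can still carry attacker influence.
class Tainted:
    def __init__(self, value, taint=0):
        self.value = value
        self.taint = taint  # 1-bits in this mask are tainted

    @staticmethod
    def _wrap(x):
        return x if isinstance(x, Tainted) else Tainted(x)

    def __and__(self, other):
        o = Tainted._wrap(other)
        # An untainted 0 in either operand forces the result bit to 0,
        # clearing its taint; otherwise taint flows through.
        taint = (self.taint & (o.value | o.taint)) | \
                (o.taint & (self.value | self.taint))
        return Tainted(self.value & o.value, taint)

    def __or__(self, other):
        o = Tainted._wrap(other)
        # Symmetrically, an untainted 1 in either operand forces the
        # result bit to 1, clearing its taint.
        taint = (self.taint & ~(o.value & ~o.taint)) | \
                (o.taint & ~(self.value & ~self.taint))
        return Tainted(self.value | o.value, taint)

    def __lshift__(self, n):
        # Shifts move taint bits along with value bits.
        return Tainted(self.value << n, self.taint << n)

x = Tainted(0xAB, taint=0xFF)  # fully tainted input byte
y = x & 0x0F                   # AND with an untainted constant mask
assert y.taint == 0x0F         # only the low nibble remains tainted
```

Byte- or value-granularity trackers would report `y` as wholly tainted here; tracking per bit, as the paper claims for Flayer, avoids that over-approximation.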

Source: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/33253.pdf

Why Google Stores Billions of Lines of Code in a Single Repository

Early Google employees decided to work with a shared codebase managed through a centralized source control system. This approach has served Google well for more than 16 years, and today the vast majority of Google’s software assets continues to be stored in a single, shared repository. Meanwhile, the number of Google software developers has steadily increased, and the size of the Google codebase has grown exponentially (see Figure 1). As a result, the technology used to host the codebase has also evolved significantly.

Source: http://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext

How Developers Search for Code: A Case Study

With the advent of large code repositories and sophisticated search capabilities, code search is increasingly becoming a key software development activity. In this work we shed some light on how developers search for code through a case study performed at Google, using a combination of survey and log-analysis methodologies. Our study provides insights into what developers are doing and trying to learn when performing a search, search scope, query properties, and what a search session under different contexts usually entails. Our results indicate that programmers search for code very frequently, conducting an average of five search sessions with 12 total queries each workday. The search queries are often targeted at a particular code location, and programmers are typically looking for code with which they are somewhat familiar. Further, programmers are generally seeking answers to questions about how to use an API, what code does, why something is failing, or where code is located.

Source: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43835.pdf

Tricorder: Building a Program Analysis Ecosystem

Static analysis tools help developers find bugs, improve code readability, and ensure consistent style across a project. However, these tools can be difficult to integrate smoothly with each other and into the developer workflow, particularly when scaling to large codebases. We present TRICORDER, a program analysis platform aimed at building a data-driven ecosystem around program analysis. We present a set of guiding principles for our program analysis tools and a scalable architecture for an analysis platform implementing these principles. We include an empirical, in-situ evaluation of the tool as it is used by developers across Google that shows the usefulness and impact of the platform.

Source: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43322.pdf

Programmers’ Build Errors: A Case Study (at Google)

Building is an integral part of the software development process. However, little is known about the compiler errors that occur in this process. In this paper, we present an empirical study of 26.6 million builds produced during a period of nine months by thousands of developers. We describe the workflow through which those builds are generated, and we analyze failure frequency, compiler error types, and resolution efforts to fix those compiler errors. The results provide insights into how a large organization’s build process works, and pinpoint errors for which further developer support would be most effective.

Source: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42184.pdf