We introduce TensorFlow Agents, an efficient infrastructure paradigm for
building parallel reinforcement learning algorithms in TensorFlow. We
simulate multiple environments in parallel and group them to perform the
neural network computation on batches rather than on individual observations.
This allows the TensorFlow execution engine to parallelize computation without
the need for manual synchronization. Environments are stepped in separate
Python processes to progress them in parallel without interference from the
global interpreter lock. As part of this project, we introduce BatchPPO, an
efficient implementation of the proximal policy optimization algorithm. By
open sourcing TensorFlow Agents, we hope to provide a flexible starting point
that accelerates future research in the field.
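The batching scheme described above can be sketched as a wrapper that fans actions out to a set of environments and stacks the results into batched arrays. The `CounterEnv` toy environment and the class names below are illustrative, not the library's API; TensorFlow Agents additionally runs each environment in its own OS process to sidestep the global interpreter lock, which this in-process sketch only notes in comments.

```python
import numpy as np

class CounterEnv:
    """Toy environment (illustrative only): the observation counts steps taken."""
    def reset(self):
        self._t = 0
        return np.array([0.0])

    def step(self, action):
        self._t += 1
        return np.array([float(self._t)]), float(action), self._t >= 10

class BatchEnv:
    """Groups environments so the policy network sees one batch of observations.

    TensorFlow Agents steps each environment in a separate Python process to
    avoid the global interpreter lock; this in-process version only shows the
    batching interface that the neural network computation relies on."""
    def __init__(self, envs):
        self._envs = envs

    def reset(self):
        return np.stack([env.reset() for env in self._envs])

    def step(self, actions):
        # Fan the batch of actions out, then stack the per-environment results.
        results = [env.step(a) for env, a in zip(self._envs, actions)]
        observations, rewards, dones = zip(*results)
        return np.stack(observations), np.array(rewards), np.array(dones)
```

Because the policy sees a `(num_envs, observation_dim)` batch, a single forward pass serves every environment at once, letting the execution engine parallelize the network computation.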
Standard machine learning approaches require centralizing the training data on one machine or in a datacenter. And Google has built one of the most secure and robust cloud infrastructures for processing this data to make our services better. Now for models trained from user interaction with mobile devices, we’re introducing an additional approach: Federated Learning.
Federated Learning enables mobile phones to collaboratively learn a shared prediction model while keeping all the training data on device, decoupling the ability to do machine learning from the need to store the data in the cloud. This goes beyond the use of local models that make predictions on mobile devices (like the Mobile Vision API and On-Device Smart Reply) by bringing model training to the device as well.
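The aggregation idea behind this is federated averaging: each device trains on its own data and only model updates reach the server, which combines them weighted by local dataset size. A minimal NumPy sketch on a toy linear model follows; the function names and the linear model are illustrative assumptions, not Google's production system.

```python
import numpy as np

def local_update(weights, features, labels, lr=0.1, epochs=5):
    """One client's on-device step: a few epochs of gradient descent on a
    linear model, touching only that client's local data."""
    w = weights.copy()
    for _ in range(epochs):
        grad = features.T @ (features @ w - labels) / len(labels)
        w -= lr * grad
    return w

def federated_round(weights, clients):
    """One round of federated averaging: the server combines client updates
    weighted by local dataset size; the raw data never leaves the devices."""
    total = sum(len(labels) for _, labels in clients)
    return sum(local_update(weights, x, y) * (len(y) / total)
               for x, y in clients)
```

Only the weight vectors cross the network; the `(features, labels)` pairs stay with their clients, which is the decoupling the post describes.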
Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X-30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X-80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
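The quoted 92 TOPS peak follows directly from the size of the MAC array: 65,536 MACs (a 256x256 systolic array), each performing one multiply and one add per cycle, at the TPU's 700 MHz clock rate.

```python
# Peak throughput of the TPU's matrix unit, from the numbers in the abstract:
macs = 256 * 256            # 65,536 8-bit multiply-accumulate units
ops_per_mac_cycle = 2       # one multiply plus one add per cycle
clock_hz = 700e6            # 700 MHz TPU clock rate
peak_tops = macs * ops_per_mac_cycle * clock_hz / 1e12
print(round(peak_tops))     # → 92 TeraOps/second
```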
Take a look at the image below. It’s a collection of bugs and creepy-crawlies of different shapes and sizes. Take a moment to categorize them by similarity into a number of groups.
This isn’t a trick question. Start by grouping the spiders together.
Done? While there’s not necessarily a “correct” answer here, it’s most likely you split the bugs into four clusters: the spiders in one, the pair of snails in another, the butterflies and moth in a third, and the trio of wasps and bees in one more.
That wasn’t too bad, was it? You could probably do the same with twice as many bugs, right? If you had a bit of time to spare — or a passion for entomology — you could probably even do the same with a hundred bugs.
For a machine though, grouping ten objects into however many meaningful clusters is no small task, thanks to a mind-bending branch of maths called combinatorics, which tells us that there are 115,975 different possible ways you could have grouped those ten insects together. Had there been twenty bugs, there would have been over fifty trillion possible ways of clustering them.
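The 115,975 figure is the tenth Bell number, the count of ways to partition a set of distinct items into any number of non-empty groups. It can be checked with the Bell-triangle recurrence:

```python
def bell(n):
    """Count the ways to partition n distinct items into non-empty groups,
    using the Bell triangle: each row starts with the previous row's last
    entry, and each entry adds its left neighbor to the entry above it."""
    row = [1]
    for _ in range(n - 1):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]

print(bell(10))  # → 115975 ways to group the ten bugs
print(bell(20))  # over fifty trillion ways for twenty
```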
Recent results indicate that the generic descriptors extracted
from the convolutional neural networks are very
powerful. This paper adds to the mounting evidence that
this is indeed the case. We report on a series of experiments
conducted for different recognition tasks using the
publicly available code and model of the OverFeat network
which was trained to perform object classification on
ILSVRC13. We use features extracted from the OverFeat
network as a generic image representation to tackle the diverse
range of recognition tasks of object image classification,
scene recognition, fine-grained recognition, attribute
detection and image retrieval applied to a diverse set of
datasets. We selected these tasks and datasets as they gradually
move further away from the original task and data the
OverFeat network was trained to solve. Astonishingly,
we report consistently superior results compared to the highly
tuned state-of-the-art systems in all the visual classification
tasks on various datasets. For instance retrieval, it consistently
outperforms low-memory-footprint methods except on the
sculptures dataset. The results are achieved using a linear
SVM classifier (or L2 distance in case of retrieval) applied
to a feature representation of size 4096 extracted from a
layer in the net. The representations are further modified
using simple augmentation techniques, e.g. jittering. The
results strongly suggest that features obtained from deep
learning with convolutional nets should be the primary candidate
in most visual recognition tasks.
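The pipeline this abstract describes, a fixed 4096-dimensional feature vector per image fed to a linear SVM, can be sketched with a minimal Pegasos-style solver. This is a stand-in for the paper's tuned training setup, and any features passed in would be real CNN activations rather than the toy data used here.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=50):
    """Minimal Pegasos-style linear SVM (hinge loss plus L2 penalty), trained
    one-vs-rest: one weight vector per class over the fixed feature vectors."""
    n = X.shape[0]
    classes = np.unique(y)
    weights = []
    for c in classes:
        target = np.where(y == c, 1.0, -1.0)
        w = np.zeros(X.shape[1])
        t = 0
        for _ in range(epochs):
            for i in range(n):
                t += 1
                lr = 1.0 / (lam * t)
                w *= 1.0 - lr * lam              # L2 shrinkage
                if target[i] * (X[i] @ w) < 1.0:
                    w += lr * target[i] * X[i]   # hinge subgradient step
        weights.append(w)
    return classes, np.stack(weights)

def predict(classes, W, X):
    # Pick the class whose one-vs-rest scorer gives the largest margin.
    return classes[np.argmax(X @ W.T, axis=1)]
```

The point of the paper is that the heavy lifting happens in the 4096-d representation; the classifier on top stays deliberately simple.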
We evaluate whether features extracted from
the activation of a deep convolutional network
trained in a fully supervised fashion on a large,
fixed set of object recognition tasks can be repurposed
to novel generic tasks. Our generic
tasks may differ significantly from the originally
trained tasks and there may be insufficient labeled
or unlabeled data to conventionally train or
adapt a deep architecture to the new tasks. We investigate
and visualize the semantic clustering of
deep convolutional features with respect to a variety
of such tasks, including scene recognition,
domain adaptation, and fine-grained recognition
challenges. We compare the efficacy of relying
on various network levels to define a fixed feature,
and report novel results that significantly
outperform the state-of-the-art on several important
vision challenges. We are releasing DeCAF,
an open-source implementation of these deep
convolutional activation features, along with all
associated network parameters to enable vision
researchers to conduct experiments with deep
representations across a range of visual concept
learning paradigms.
Large Convolutional Network models have
recently demonstrated impressive classification
performance on the ImageNet benchmark
(Krizhevsky et al., 2012). However,
there is no clear understanding of why they
perform so well, or how they might be improved.
In this paper we address both issues.
We introduce a novel visualization technique
that gives insight into the function of intermediate
feature layers and the operation of
the classifier. Used in a diagnostic role, these
visualizations allow us to find model architectures
that outperform Krizhevsky et al. on
the ImageNet classification benchmark. We
also perform an ablation study to discover
the performance contribution from different
model layers. We show our ImageNet model
generalizes well to other datasets: when the
softmax classifier is retrained, it convincingly
beats the current state-of-the-art results on
Caltech-101 and Caltech-256 datasets.