Federated Learning: Collaborative Machine Learning without Centralized Training Data

Standard machine learning approaches require centralizing the training data on one machine or in a datacenter. And Google has built one of the most secure and robust cloud infrastructures for processing this data to make our services better. Now for models trained from user interaction with mobile devices, we’re introducing an additional approach: Federated Learning.

Federated Learning enables mobile phones to collaboratively learn a shared prediction model while keeping all the training data on device, decoupling the ability to do machine learning from the need to store the data in the cloud. This goes beyond the use of local models that make predictions on mobile devices (like the Mobile Vision API and On-Device Smart Reply) by bringing model training to the device as well.

Source: https://research.googleblog.com/2017/04/federated-learning-collaborative.html
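
The training loop the post describes is commonly implemented with Federated Averaging: each phone improves the current model on its own data, and the server combines the resulting updates into a new shared model. A minimal sketch of the aggregation step, with on-device training left as a stub and all names illustrative:

    import numpy as np

    def local_update(weights, client_data, lr=0.1, epochs=1):
        """Stub: each client trains the shared model on its own data
        and returns updated weights plus its example count."""
        # ... on-device SGD over client_data would go here ...
        return weights, len(client_data)

    def federated_averaging(global_weights, clients):
        """One communication round: collect locally trained weights and
        average them, weighted by how many examples each client holds."""
        updates, counts = [], []
        for data in clients:
            w, n = local_update(global_weights, data)
            updates.append(w)
            counts.append(n)
        total = sum(counts)
        # Weighted average: clients with more data pull the model harder.
        return sum((n / total) * w for w, n in zip(updates, counts))

    # Example round with a 2-parameter model and three simulated clients.
    w0 = np.zeros(2)
    clients = [np.random.randn(50, 2), np.random.randn(20, 2), np.random.randn(30, 2)]
    w1 = federated_averaging(w0, clients)

The raw data never leaves the clients; only model weights travel to the server, which is the decoupling the post emphasizes.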

In-Datacenter Performance Analysis of a Tensor Processing Unit

Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS), and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X-30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X-80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.

Source: https://drive.google.com/file/d/0Bx4hafXDDq2EMzRNcy1vSUxtcEk/view
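
As a sanity check on the headline figure: the paper reports a 700 MHz clock, and each MAC performs a multiply and an add per cycle, so peak throughput is 65,536 MACs × 2 ops × 700 MHz ≈ 91.8 × 10^12 ops/s, which rounds to the quoted 92 TOPS.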

How Machines Make Sense of Big Data: an Introduction to Clustering Algorithms

Take a look at the image in the post linked below. It's a collection of bugs and creepy-crawlies of different shapes and sizes. Take a moment to categorize them by similarity into a number of groups.
This isn’t a trick question. Start with grouping the spiders together.
Done? While there’s not necessarily a “correct” answer here, it’s most likely you split the bugs into four clusters. The spiders in one cluster, the pair of snails in another, the butterflies and moth into one, and the trio of wasps and bees into one more.
That wasn’t too bad, was it? You could probably do the same with twice as many bugs, right? If you had a bit of time to spare — or a passion for entomology — you could probably even do the same with a hundred bugs.
For a machine though, grouping ten objects into however many meaningful clusters is no small task, thanks to a mind-bending branch of maths called combinatorics, which tells us that there are 115,975 different possible ways you could have grouped those ten insects together. Had there been twenty bugs, there would have been over fifty trillion possible ways of clustering them.

Source: https://medium.freecodecamp.com/how-machines-make-sense-of-big-data-an-introduction-to-clustering-algorithms-4bd97d4fbaba
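
Those counts are the Bell numbers: B(n) is the number of ways to partition n items into any number of non-empty groups. A quick sketch computing them via the Bell triangle, nothing assumed beyond the recurrence:

    def bell(n):
        """Bell number B(n): ways to partition n items into any number
        of non-empty groups, via the Bell triangle recurrence."""
        row = [1]
        for _ in range(n - 1):
            # Each new row starts with the previous row's last entry;
            # each subsequent entry adds its left neighbour to the value above.
            new_row = [row[-1]]
            for value in row:
                new_row.append(new_row[-1] + value)
            row = new_row
        return row[-1]

    print(bell(10))  # 115975, the figure quoted above
    print(bell(20))  # 51724158235372, a little over fifty trillion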

CNN Features off-the-shelf: an Astounding Baseline for Recognition

Recent results indicate that the generic descriptors extracted from convolutional neural networks are very powerful. This paper adds to the mounting evidence that this is indeed the case. We report on a series of experiments conducted for different recognition tasks using the publicly available code and model of the OverFeat network, which was trained to perform object classification on ILSVRC13. We use features extracted from the OverFeat network as a generic image representation to tackle a diverse range of recognition tasks: object image classification, scene recognition, fine-grained recognition, attribute detection, and image retrieval, applied to a diverse set of datasets. We selected these tasks and datasets as they gradually move further away from the original task and data the OverFeat network was trained to solve. Astonishingly, we report consistently superior results compared to the highly tuned state-of-the-art systems in all the visual classification tasks on various datasets. For instance retrieval, it consistently outperforms low-memory-footprint methods except on the sculptures dataset. The results are achieved using a linear SVM classifier (or L2 distance in the case of retrieval) applied to a feature representation of size 4096 extracted from a layer in the net. The representations are further modified using simple augmentation techniques, e.g. jittering. The results strongly suggest that features obtained from deep learning with convolutional nets should be the primary candidate in most visual recognition tasks.

Source: http://www.cv-foundation.org//openaccess/content_cvpr_workshops_2014/W15/papers/Razavian_CNN_Features_Off-the-Shelf_2014_CVPR_paper.pdf
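
The recipe in the abstract is easy to reproduce with any pretrained convnet. A minimal sketch using PyTorch/torchvision and scikit-learn, with AlexNet's 4096-d penultimate activations standing in for OverFeat (which shipped as standalone code, not a Python library) and the dataset stubbed with random tensors:

    import torch
    import torchvision.models as models
    from sklearn.svm import LinearSVC

    # Pretrained ImageNet model; drop the final classification layer so the
    # forward pass ends at the 4096-d penultimate activation.
    net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
    net.classifier = torch.nn.Sequential(*list(net.classifier.children())[:-1])
    net.eval()

    @torch.no_grad()
    def extract(images):
        """images: float tensor of shape (N, 3, 224, 224), already normalized."""
        return net(images).numpy()

    # Stub: replace with a real target dataset (scenes, birds, attributes, ...).
    train_images, train_labels = torch.randn(32, 3, 224, 224), [0, 1] * 16
    test_images = torch.randn(8, 3, 224, 224)

    # Off-the-shelf features + linear SVM: the shape of the paper's pipeline.
    clf = LinearSVC(C=1.0).fit(extract(train_images), train_labels)
    predictions = clf.predict(extract(test_images))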

DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition

We evaluate whether features extracted from the activation of a deep convolutional network trained in a fully supervised fashion on a large, fixed set of object recognition tasks can be repurposed to novel generic tasks. Our generic tasks may differ significantly from the originally trained tasks, and there may be insufficient labeled or unlabeled data to conventionally train or adapt a deep architecture to the new tasks. We investigate and visualize the semantic clustering of deep convolutional features with respect to a variety of such tasks, including scene recognition, domain adaptation, and fine-grained recognition challenges. We compare the efficacy of relying on various network levels to define a fixed feature, and report novel results that significantly outperform the state-of-the-art on several important vision challenges. We are releasing DeCAF, an open-source implementation of these deep convolutional activation features, along with all associated network parameters, to enable vision researchers to conduct experiments with deep representations across a range of visual concept learning paradigms.

Source: https://arxiv.org/pdf/1310.1531.pdf
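
Comparing "various network levels" as fixed features is straightforward with forward hooks, which record each chosen layer's activation during a single pass. A hedged sketch in PyTorch, with a torchvision AlexNet standing in for the paper's original network and the layer indices purely illustrative:

    import torch
    import torchvision.models as models

    net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()

    # Record the output of several depths in one forward pass; DeCAF compared
    # fixed features taken from different levels of the network like this.
    activations = {}

    def save_to(name):
        def hook(module, inputs, output):
            activations[name] = output.flatten(start_dim=1).detach()
        return hook

    net.features[5].register_forward_hook(save_to("mid_conv"))  # mid-level conv output
    net.classifier[1].register_forward_hook(save_to("fc6"))     # first fully connected layer
    net.classifier[4].register_forward_hook(save_to("fc7"))     # second fully connected layer

    with torch.no_grad():
        net(torch.randn(4, 3, 224, 224))  # dummy batch; use real images in practice

    for name, feat in activations.items():
        print(name, feat.shape)  # each of these can feed a simple classifier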

Visualizing and Understanding Convolutional Networks

Large Convolutional Network models have recently demonstrated impressive classification performance on the ImageNet benchmark (Krizhevsky et al., 2012). However, there is no clear understanding of why they perform so well, or how they might be improved. In this paper we address both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. Used in a diagnostic role, these visualizations allow us to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark. We also perform an ablation study to discover the performance contribution from different model layers. We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on the Caltech-101 and Caltech-256 datasets.

Source: https://arxiv.org/pdf/1311.2901.pdf
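
"Retraining the softmax classifier" means freezing the learned feature layers and fitting only a fresh final layer on the target dataset. A minimal sketch of that transfer step in PyTorch; the AlexNet backbone and the 101-class target are stand-ins for the paper's setup:

    import torch
    import torch.nn as nn
    import torchvision.models as models

    net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)

    # Freeze every pretrained parameter, then swap in a new softmax layer
    # sized for the target dataset (e.g. 101 classes for Caltech-101).
    for p in net.parameters():
        p.requires_grad = False
    net.classifier[6] = nn.Linear(4096, 101)  # only this layer will train

    optimizer = torch.optim.SGD(net.classifier[6].parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    # One illustrative training step on a dummy batch; loop over a real
    # DataLoader in practice.
    images, labels = torch.randn(8, 3, 224, 224), torch.randint(0, 101, (8,))
    optimizer.zero_grad()
    loss = loss_fn(net(images), labels)
    loss.backward()
    optimizer.step()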

Understanding Synthetic Gradients and Decoupled Neural Interfaces

When training neural networks, the use of Synthetic Gradients (SG) allows layers or modules to be trained without update locking (without waiting for a true error gradient to be backpropagated), resulting in Decoupled Neural Interfaces (DNIs). This ability to update parts of a neural network asynchronously and with only local information was demonstrated to work empirically in Jaderberg et al. (2016). However, there has been very little demonstration of what changes DNIs and SGs impose from a functional, representational, and learning dynamics point of view. In this paper, we study DNIs through the use of synthetic gradients on feed-forward networks to better understand their behaviour and elucidate their effect on optimisation. We show that the incorporation of SGs does not affect the representational strength of the learning system for a neural network, and prove the convergence of the learning system for linear and deep linear models. On practical problems we investigate the mechanism by which synthetic gradient estimators approximate the true loss, and, surprisingly, how that leads to drastically different layer-wise representations. Finally, we also expose the relationship of using synthetic gradients to other error approximation techniques and find a unifying language for discussion and comparison.

Source: https://arxiv.org/pdf/1703.00522.pdf
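
The core mechanism is small: a side model at a layer boundary predicts what the backpropagated gradient would be, so the layer below can update immediately. A hedged sketch of one decoupled update in PyTorch, using a single linear synthetic-gradient module conditioned only on the activation (variants in the original work also condition on the labels):

    import torch
    import torch.nn as nn

    # Two halves of a network, decoupled at the boundary between them, plus a
    # synthetic-gradient module that predicts the gradient arriving at that
    # boundary from the activation alone.
    layer1 = nn.Linear(10, 20)
    layer2 = nn.Linear(20, 2)
    sg = nn.Linear(20, 20)

    opt1 = torch.optim.SGD(layer1.parameters(), lr=0.01)
    opt2 = torch.optim.SGD(layer2.parameters(), lr=0.01)
    opt_sg = torch.optim.SGD(sg.parameters(), lr=0.001)
    loss_fn = nn.CrossEntropyLoss()

    x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))

    # Decoupled update: layer1 trains from the *predicted* gradient,
    # without waiting for the loss to be computed downstream.
    h = layer1(x)
    g_hat = sg(h.detach())
    opt1.zero_grad()
    h.backward(gradient=g_hat.detach())  # update locking removed
    opt1.step()

    # Downstream pass: compute the true gradient at the boundary...
    h_in = h.detach().requires_grad_()
    loss = loss_fn(layer2(h_in), y)
    opt2.zero_grad()
    loss.backward()
    opt2.step()

    # ...and regress the synthetic-gradient module toward it.
    opt_sg.zero_grad()
    sg_loss = nn.functional.mse_loss(sg(h_in.detach()), h_in.grad.detach())
    sg_loss.backward()
    opt_sg.step()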