Distilling the Knowledge in a Neural Network

A very simple way to improve the performance of almost any machine learning
algorithm is to train many different models on the same data and then to average
their predictions [3]. Unfortunately, making predictions using a whole ensemble
of models is cumbersome and may be too computationally expensive to allow deployment
to a large number of users, especially if the individual models are large
neural nets. Caruana and his collaborators [1] have shown that it is possible to
compress the knowledge in an ensemble into a single model which is much easier
to deploy, and we develop this approach further using a different compression
technique. We achieve some surprising results on MNIST, and we show that we
can significantly improve the acoustic model of a heavily used commercial system
by distilling the knowledge in an ensemble of models into a single model. We also
introduce a new type of ensemble composed of one or more full models and many
specialist models which learn to distinguish fine-grained classes that the full models
confuse. Unlike a mixture of experts, these specialist models can be trained
rapidly and in parallel.
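
A minimal sketch of the idea, assuming the ensemble's averaged class probabilities serve as "soft targets" for the single model. The temperature parameter comes from the full paper rather than the abstract above, and all function names and logits here are hypothetical illustrations, not the authors' code.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over a list of logits; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_soft_targets(ensemble_logits, temperature=1.0):
    """Average the per-model class probabilities of an ensemble (arithmetic mean)."""
    probs = [softmax(logits, temperature) for logits in ensemble_logits]
    n_models = len(probs)
    n_classes = len(probs[0])
    return [sum(p[k] for p in probs) / n_models for k in range(n_classes)]

def distillation_loss(student_logits, soft_targets, temperature=1.0):
    """Cross-entropy between the soft targets and the student's softened output."""
    q = softmax(student_logits, temperature)
    return -sum(t * math.log(qk) for t, qk in zip(soft_targets, q))

# Hypothetical logits from a two-model ensemble on one three-class example.
teacher_logits = [[5.0, 2.0, 0.1], [4.0, 3.0, 0.2]]
targets = ensemble_soft_targets(teacher_logits, temperature=2.0)

# A student that agrees with the ensemble incurs a lower loss than one that does not.
loss_close = distillation_loss([4.5, 2.5, 0.15], targets, temperature=2.0)
loss_far = distillation_loss([0.1, 2.5, 4.5], targets, temperature=2.0)
```

In practice the student would be trained by gradient descent on this loss over a whole transfer set; the sketch only evaluates the objective for one example.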

Source: https://arxiv.org/pdf/1503.02531.pdf


Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks

Convolutional neural networks (CNN) have recently shown
outstanding image classification performance in the large-scale
visual recognition challenge (ILSVRC2012). The success
of CNNs is attributed to their ability to learn rich mid-level
image representations as opposed to hand-designed
low-level features used in other image classification methods.
Learning CNNs, however, amounts to estimating millions
of parameters and requires a very large number of
annotated image samples. This property currently prevents
application of CNNs to problems with limited training data.
In this work we show how image representations learned
with CNNs on large-scale annotated datasets can be efficiently
transferred to other visual recognition tasks with a
limited amount of training data. We design a method to
reuse layers trained on the ImageNet dataset to compute
mid-level image representation for images in the PASCAL
VOC dataset. We show that despite differences in image
statistics and tasks in the two datasets, the transferred representation
leads to significantly improved results for object
and action classification, outperforming the current state of
the art on PASCAL VOC 2007 and 2012 datasets. We also
show promising results for object and action localization.

Source: http://www.cv-foundation.org/openaccess/content_cvpr_2014/papers/Oquab_Learning_and_Transferring_2014_CVPR_paper.pdf

How transferable are features in deep neural networks?

Many deep neural networks trained on natural images exhibit a curious phenomenon
in common: on the first layer they learn features similar to Gabor filters
and color blobs. Such first-layer features appear not to be specific to a particular
dataset or task, but general in that they are applicable to many datasets and tasks.
Features must eventually transition from general to specific by the last layer of
the network, but this transition has not been studied extensively. In this paper we
experimentally quantify the generality versus specificity of neurons in each layer
of a deep convolutional neural network and report a few surprising results. Transferability
is negatively affected by two distinct issues: (1) the specialization of
higher layer neurons to their original task at the expense of performance on the
target task, which was expected, and (2) optimization difficulties related to splitting
networks between co-adapted neurons, which was not expected. In an example
network trained on ImageNet, we demonstrate that either of these two issues
may dominate, depending on whether features are transferred from the bottom,
middle, or top of the network. We also document that the transferability of features
decreases as the distance between the base task and target task increases, but
that transferring features even from distant tasks can be better than using random
features. A final surprising result is that initializing a network with transferred
features from almost any number of layers can produce a boost to generalization
that lingers even after fine-tuning to the target dataset.
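
A toy sketch of the experimental setup described above, assuming a network is just a list of layers: the first n layers are copied from a base network, then either frozen or left trainable for fine-tuning, while the remaining layers keep their random initialization. The data structure and function names are hypothetical and no training is performed here.

```python
import copy
import random

def make_network(layer_sizes, seed):
    """Build a toy network: a list of layers, each with weights and a trainable flag."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        layers.append({"weights": weights, "trainable": True})
    return layers

def transfer_first_layers(base, target_sizes, n, fine_tune, seed):
    """Copy the first n layers of `base` into a freshly initialized target network.

    Copied layers are frozen unless fine_tune is True; later layers stay random.
    Assumes the first n layer shapes of `base` match those in target_sizes.
    """
    target = make_network(target_sizes, seed)
    for i in range(n):
        target[i]["weights"] = copy.deepcopy(base[i]["weights"])
        target[i]["trainable"] = fine_tune
    return target

sizes = [8, 6, 4, 3]
base_net = make_network(sizes, seed=0)

# Transfer the bottom two layers, once frozen and once left open to fine-tuning.
frozen_net = transfer_first_layers(base_net, sizes, n=2, fine_tune=False, seed=1)
tuned_net = transfer_first_layers(base_net, sizes, n=2, fine_tune=True, seed=1)
```

Varying n from the bottom to the top of the network is what lets the two effects named in the abstract (specialization and co-adaptation) be compared layer by layer.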

Source: http://papers.nips.cc/paper/5347-how-transferable-are-features-in-deep-neural-networks.pdf

CNN Features off-the-shelf: an Astounding Baseline for Recognition

Recent results indicate that the generic descriptors extracted
from convolutional neural networks are very
powerful. This paper adds to the mounting evidence that
this is indeed the case. We report on a series of experiments
conducted for different recognition tasks using the
publicly available code and model of the OverFeat network
which was trained to perform object classification on
ILSVRC13. We use features extracted from the OverFeat
network as a generic image representation to tackle the diverse
range of recognition tasks of object image classification,
scene recognition, fine-grained recognition, attribute
detection and image retrieval applied to a diverse set of
datasets. We selected these tasks and datasets as they gradually
move further away from the original task and data the
OverFeat network was trained to solve. Astonishingly,
we report consistently superior results compared to highly
tuned state-of-the-art systems in all the visual classification
tasks on various datasets. On instance retrieval, it consistently
outperforms low-memory-footprint methods except on the
sculptures dataset. The results are achieved using a linear
SVM classifier (or L2 distance in case of retrieval) applied
to a feature representation of size 4096 extracted from a
layer in the net. The representations are further modified
using simple augmentation techniques, e.g., jittering. The
results strongly suggest that features obtained from deep
learning with convolutional nets should be the primary candidate
in most visual recognition tasks.
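
The retrieval setting mentioned above (L2 distance on fixed feature vectors) can be sketched as follows. The short 4-dimensional vectors are hypothetical stand-ins for the 4096-dimensional network features, and the function names are illustrative only.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve(query, database, top_k=1):
    """Return the indices of the top_k database features closest to the query."""
    dists = [(l2_distance(query, feat), i) for i, feat in enumerate(database)]
    return [i for _, i in sorted(dists)[:top_k]]

# Hypothetical features; real ones would be extracted from a layer of the network.
database = [
    [1.0, 0.0, 0.0, 0.2],
    [0.0, 1.0, 0.1, 0.0],
    [0.9, 0.1, 0.0, 0.3],
]
query = [0.8, 0.0, 0.0, 0.3]

ranking = retrieve(query, database, top_k=3)  # closest first
```

The classification experiments would instead train a linear SVM on the same kind of fixed features; that step is omitted here.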

Source: http://www.cv-foundation.org//openaccess/content_cvpr_workshops_2014/W15/papers/Razavian_CNN_Features_Off-the-Shelf_2014_CVPR_paper.pdf

DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition

We evaluate whether features extracted from
the activation of a deep convolutional network
trained in a fully supervised fashion on a large,
fixed set of object recognition tasks can be repurposed
to novel generic tasks. Our generic
tasks may differ significantly from the originally
trained tasks and there may be insufficient labeled
or unlabeled data to conventionally train or
adapt a deep architecture to the new tasks. We investigate
and visualize the semantic clustering of
deep convolutional features with respect to a variety
of such tasks, including scene recognition,
domain adaptation, and fine-grained recognition
challenges. We compare the efficacy of relying
on various network levels to define a fixed feature,
and report novel results that significantly
outperform the state-of-the-art on several important
vision challenges. We are releasing DeCAF,
an open-source implementation of these deep
convolutional activation features, along with all
associated network parameters, to enable vision
researchers to conduct experiments
with deep representations across a range of
visual concept learning paradigms.

Source: https://arxiv.org/pdf/1310.1531.pdf

Visualizing and Understanding Convolutional Networks

Large Convolutional Network models have
recently demonstrated impressive classification
performance on the ImageNet benchmark
(Krizhevsky et al., 2012). However,
there is no clear understanding of why they
perform so well, or how they might be improved.
In this paper we address both issues.
We introduce a novel visualization technique
that gives insight into the function of intermediate
feature layers and the operation of
the classifier. Used in a diagnostic role, these
visualizations allow us to find model architectures
that outperform Krizhevsky et al. on
the ImageNet classification benchmark. We
also perform an ablation study to discover
the performance contribution from different
model layers. We show our ImageNet model
generalizes well to other datasets: when the
softmax classifier is retrained, it convincingly
beats the current state-of-the-art results on
Caltech-101 and Caltech-256 datasets.
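
Retraining only the softmax classifier on top of fixed features can be sketched as plain softmax regression. The 2-dimensional "features" below are hypothetical placeholders for activations from a pretrained network; the learning rate and step count are arbitrary assumptions.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def predict_proba(W, b, x):
    logits = [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]
    return softmax(logits)

def mean_loss(W, b, X, y):
    """Average cross-entropy of the classifier on features X with labels y."""
    return -sum(math.log(predict_proba(W, b, x)[t]) for x, t in zip(X, y)) / len(X)

def train_softmax(X, y, n_classes, lr=0.5, steps=200):
    """Full-batch gradient descent on the classifier only; the features stay fixed."""
    n_features = len(X[0])
    W = [[0.0] * n_features for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(steps):
        gW = [[0.0] * n_features for _ in range(n_classes)]
        gb = [0.0] * n_classes
        for x, t in zip(X, y):
            p = predict_proba(W, b, x)
            for k in range(n_classes):
                err = p[k] - (1.0 if k == t else 0.0)
                gb[k] += err
                for j in range(n_features):
                    gW[k][j] += err * x[j]
        for k in range(n_classes):
            b[k] -= lr * gb[k] / len(X)
            for j in range(n_features):
                W[k][j] -= lr * gW[k][j] / len(X)
    return W, b

# Hypothetical fixed features for four images from two classes.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]

W, b = train_softmax(features, labels, n_classes=2)
predictions = [max(range(2), key=lambda k: predict_proba(W, b, x)[k]) for x in features]
final_loss = mean_loss(W, b, features, labels)
```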

Source: https://arxiv.org/pdf/1311.2901.pdf

Understanding Synthetic Gradients and Decoupled Neural Interfaces

When training neural networks, the use of Synthetic
Gradients (SG) allows layers or modules
to be trained without update locking – without
waiting for a true error gradient to be backpropagated
– resulting in Decoupled Neural Interfaces
(DNIs). This unlocked ability to
update parts of a neural network asynchronously
and with only local information was
demonstrated to work empirically in Jaderberg
et al. (2016). However, there has been very little
demonstration of what changes DNIs and SGs
impose from a functional, representational, and
learning dynamics point of view. In this paper,
we study DNIs through the use of synthetic gradients
on feed-forward networks to better understand
their behaviour and elucidate their effect
on optimisation. We show that the incorporation
of SGs does not affect the representational
strength of the learning system for a neural network,
and prove the convergence of the learning
system for linear and deep linear models. On
practical problems we investigate the mechanism
by which synthetic gradient estimators approximate
the true loss, and, surprisingly, how that
leads to drastically different layer-wise representations.
Finally, we also expose the relationship
of using synthetic gradients to other error approximation
techniques and find a unifying language
for discussion and comparison.

Source: https://arxiv.org/pdf/1703.00522.pdf