Distilling the Knowledge in a Neural Network

A very simple way to improve the performance of almost any machine learning
algorithm is to train many different models on the same data and then to average
their predictions [3]. Unfortunately, making predictions using a whole ensemble
of models is cumbersome and may be too computationally expensive to allow deployment
to a large number of users, especially if the individual models are large
neural nets. Caruana and his collaborators [1] have shown that it is possible to
compress the knowledge in an ensemble into a single model which is much easier
to deploy and we develop this approach further using a different compression
technique. We achieve some surprising results on MNIST and we show that we
can significantly improve the acoustic model of a heavily used commercial system
by distilling the knowledge in an ensemble of models into a single model. We also
introduce a new type of ensemble composed of one or more full models and many
specialist models which learn to distinguish fine-grained classes that the full models
confuse. Unlike a mixture of experts, these specialist models can be trained
rapidly and in parallel.

Source: https://arxiv.org/pdf/1503.02531.pdf
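The "different compression technique" the abstract refers to is training the small model to match the large model's temperature-softened output distribution. As a rough illustration (not the paper's code; the function names, temperature T=2.0, and weighting alpha=0.5 here are illustrative choices), the combined soft-target/hard-target objective can be sketched in numpy:

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: higher T yields a softer distribution,
    # exposing the teacher's relative probabilities on wrong classes.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_labels, T=2.0, alpha=0.5):
    # Soft-target term: cross-entropy between the teacher's and student's
    # temperature-softened distributions. The T**2 factor keeps its gradient
    # magnitude comparable to the hard-label term when T changes.
    soft_teacher = softmax(teacher_logits, T)
    soft_student = softmax(student_logits, T)
    soft_loss = -np.sum(soft_teacher * np.log(soft_student), axis=-1).mean() * T**2
    # Hard-target term: ordinary cross-entropy against the true labels (T = 1).
    probs = softmax(student_logits)
    hard_loss = -np.log(probs[np.arange(len(hard_labels)), hard_labels]).mean()
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

A student whose logits already match the teacher's incurs a lower loss than one that disagrees, which is what drives the transfer of the ensemble's "dark knowledge" into the single model.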

