Papperoni

2010

Why Does Unsupervised Pre-Training Help Deep Learning?

Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre Antoine Manzagol, Pascal Vincent, Samy Bengio

Cite Score: 69

AI summary

This paper explores the role of unsupervised pre-training in deep learning, suggesting it acts as a regularizer, guiding learning toward better generalization and minimizing variance, validated through experiments on MNIST, InfiniteMNIST, and Shapeset.

Main Contributions

Demonstrates that unsupervised pre-training acts as a regularizer in deep learning.
Shows that pre-training guides learning towards basins of attraction with better generalization.
Empirically validates the influence of pre-training on architecture depth, model capacity, and number of training examples.
Introduces experiments on MNIST, InfiniteMNIST and Shapeset datasets.
Finds that pre-training improves generalization and robustness to initialization, but can hurt performance with smaller layers.

Abstract

Much recent research has been devoted to learning algorithms for deep architectures such as Deep Belief Networks and stacks of auto-encoder variants, with impressive results obtained in several areas, mostly on vision and language data sets. The best results obtained on supervised learning tasks involve an unsupervised learning component, usually in an unsupervised pre-training phase. Even though these new algorithms have enabled training deep models, many questions remain as to the nature of this difficult learning problem. The main question investigated here is the following: how does unsupervised pre-training work? Answering this question is important if learning in deep architectures is to be further improved. We propose several explanatory hypotheses and test them through extensive simulations. We empirically show the influence of pre-training with respect to architecture depth, model capacity, and number of training examples. The experiments confirm and clarify the advantage of unsupervised pre-training. The results suggest that unsupervised pre-training guides the learning towards basins of attraction of minima that support better generalization from the training data set; the evidence from these results supports a regularization explanation for the effect of pre-training.

References [51]

[1]Gradient-Based Learning Applied to Document Recognition

Yann Lecun, Leon Bottou, Yoshua Bengio, Patrick Haffner - 1998

62 papers in library cite

I absolutely hated this paper. It has ~50 pages but feels like 200. It takes too long to explain some things and often just repeats itself. It also doesn't seem to add much on top of LeNet-5, and it focuses a lot on GTNs, which really didn't stick.

[2]Visualizing Data Using t-SNE

Laurens van der Maaten, Geoffrey Hinton - 2008

7 papers in library cite

Amazing. Simple. Impactful. Easy to understand. Masterpiece.

[3]Reducing the Dimensionality of Data With Neural Networks

Geoffrey Hinton, Ruslan Salakhutdinov - 2006

37 papers in library cite

I didn't like the way this is written, very hard to understand without a ton of background knowledge. But hey, it's the first deep learning model!

[4]A Fast Learning Algorithm for Deep Belief Nets

Geoffrey E. Hinton, S. Osindero, Y. Teh - 2006

43 papers in library cite

The paper does not explain anything. It just throws the idea and a bunch of math, but doesn't really care to explain the concepts.

[5]Learning Deep Architectures for AI

Yoshua Bengio - 2009

25 papers in library cite

It's a nice overview. Some sections get very theoretical, but the first half is very good and I feel that it does a waaaay better job of explaining RBMs and DBNs than other papers. This feels like Bengio is taking your hand and saying "if you don't know what's going on, here you go, everything you need to know to jump into the deep nets train"

[6]Extracting and Composing Robust Features With Denoising Autoencoders

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, Pierre Antoine Manzagol - 2008

25 papers in library cite

I am *so* glad we found an alternative to DBNs. Also, introduced the idea of denoising which is nice.

[7]A Unified Architecture for Natural Language Processing: Deep Neural Networks With Multitask Learning

Ronan Collobert, Jason Weston - 2008

32 papers in library cite

Really did not add much to the game. I think this was more of a small perf. improvement over other existing things and set a few methodological standards. Maybe main contribution is Multitask Learning + Deep learning

[8]Greedy Layer-Wise Training of Deep Networks

Yoshua Bengio, P. Lamblin, D. Popovici, Hugo Larochelle - 2006

33 papers in library cite

Bengio is perfect. This is everything that Hinton's paper hoped to be. Very well explained, and also tying back to real use cases (not just "hey, the math works and it reduced the score")

[9]Training Products of Experts by Minimizing Contrastive Divergence

Geoffrey Hinton - 2002

23 papers in library cite

Good read, but I think I need to revisit it after I understand RBMs better.

[10]Scaling Learning Algorithms Towards AI

Yoshua Bengio, Yann Lecun - 2007

15 papers in library cite

I should have read this sooner! Such a good explanation of why deep learning > other stuff! Also, better than Bengio's 2009 Learning Deep Archs for AI

[11]Visualizing Higher-Layer Features of a Deep Network

Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pascal Vincent - 2009

4 papers in library cite

Very nice the way that they tackled it as an optimization problem using gradient descent. I think this is a similar approach to adversarial examples (not sure if this is what inspired them, I don't remember)

[12]Efficient Learning of Sparse Representations With an Energy-Based Model

Marc'aurelio Ranzato, C. Poultney, S. Chopra, Yann Lecun - 2006

20 papers in library cite

It's ok. Not really good, but alright.

[13]An Empirical Evaluation of Deep Architectures on Problems With Many Factors of Variation

Hugo Larochelle, Dumitru Erhan, Aaron Courville, James Bergstra, Yoshua Bengio - 2007

13 papers in library cite

Good paper showing promising results for Deep Learning. Nothing amazing but good nonetheless

[14]Deep Learning via Semi-Supervised Embedding

Jason Weston, F. Ratle, Ronan Collobert - 2008

10 papers in library cite

It's a good paper and a nice idea, but it seems overly complicated and I don't think it's widely used... (PS: this was republished in 2012)

[15]Measuring Invariances in Deep Networks

I. Goodfellow, Quoc Le, A. Saxe, A. Ng - 2009

7 papers in library cite

Very nice concept and methodology, but the results in the end are underwhelming

[16]To Recognize Shapes, First Learn to Generate Images

Geoffrey Hinton - 2006

5 papers in library cite

Maybe the best explanation of deep belief nets and RBMs by Hinton.

[17]Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations

Honglak Lee, R. Grosse, R. Ranganath, Andrew Y. Ng - 2009

12 papers in library cite

[18]Sparse Feature Learning for Deep Belief Networks

Marc'aurelio Ranzato, Y. Boureau, Yann Lecun - 2008

12 papers in library cite

[19]Sparse Deep Belief Net Model for Visual Area V2

Honglak Lee, C. Ekanadham, A. Ng - 2008

10 papers in library cite

[20]Modèles Connexionnistes De l'Apprentissage

Yann Lecun - 1987

9 papers in library cite

[21]Exponential Family Harmoniums With an Application to Information Retrieval

M. Welling, M. Rosen-Zvi, Geoffrey Hinton - 2005

8 papers in library cite

[22]A Global Geometric Framework for Nonlinear Dimensionality Reduction

J. Tenenbaum, V. de Silva, John Langford - 2000

7 papers in library cite

[23]Exploring Strategies for Training Deep Neural Networks

Hugo Larochelle, Yoshua Bengio, J. Louradour, P. Lamblin - 2009

7 papers in library cite

40 pages

[24]On the Power of Small-Depth Threshold Circuits

J. Hastad, M. Goldmann - 1991

7 papers in library cite

[25]The Curse of Highly Variable Functions for Local Kernel Machines

Yoshua Bengio, O. Delalleau, N. Le Roux - 2006

7 papers in library cite

[26]Justifying and Generalizing Contrastive Divergence

Yoshua Bengio, O. Delalleau - 2007

5 papers in library cite

[27]Learning Continuous Attractors in Recurrent Networks

S. H. Seung - 1998

5 papers in library cite

[28]Restricted Boltzmann Machines for Collaborative Filtering

Ruslan Salakhutdinov, A. Mnih, Geoffrey E. Hinton - 2007

5 papers in library cite

[29]Semantic Hashing

Ruslan Salakhutdinov, Geoffrey Hinton - 2007

5 papers in library cite

[30]Semi-Supervised Learning

O. Chapelle, B. Scholkopf, A. Zien - 2006

5 papers in library cite

[31]Almost Optimal Lower Bounds for Small Depth Circuits

J. Hastad - 1986

4 papers in library cite

[32]Maximum Mutual Information Estimation of Hidden Markov Model Parameters for Speech Recognition

L. Bahl, P. Brown, P. de Souza, R. Mercer - 1986

4 papers in library cite

[33]Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes

Ruslan Salakhutdinov, Geoffrey E. Hinton - 2008

4 papers in library cite

[34]Deep Learning From Temporal Coherence in Video

H. Mobahi, Ronan Collobert, Jason Weston - 2009

3 papers in library cite

[35]Learning Long-Range Vision for Autonomous Off-Road Driving

Raia Hadsell, P. Sermanet, M. Scoffier, A. Erkan, K. Kavukcuoglu, U. Muller, Yann Lecun - 2009

3 papers in library cite

[36]Memoires Associatives Distribuees

P. Gallinari, Yann Lecun, S. Thiria, F. Fogelman-Soulié - 1987

3 papers in library cite

[37]Modeling Image Patches With a Directed Hierarchy of Markov Random Fields

S. Osindero, Geoffrey E. Hinton - 2008

3 papers in library cite

[38]Classification Using Discriminative Restricted Boltzmann Machines

Hugo Larochelle, Yoshua Bengio - 2008

2 papers in library cite

[39]Cluster Kernels for Semi-Supervised Learning

O. Chapelle, Jason Weston, B. Scholkopf - 2003

2 papers in library cite

[40]Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering

M. Belkin, P. Niyogi - 2002

2 papers in library cite

[41]Minimum Phone Error and I-Smoothing for Improved Discriminative Training

D. Povey, P. Woodland - 2002

2 papers in library cite

[42]Separating the Polynomial-Time Hierarchy by Oracles

A. Yao - 1985

2 papers in library cite

[43]Training Invariant Support Vector Machines Using Selective Sampling

G. Loosli, S. Canu, Leon Bottou - 2007

2 papers in library cite

[44]Unsupervised Learning of Probabilistic Grammar-Markov Models for Object Categories

L. Zhu, Yanru Chen, A. Yuille - 2009

2 papers in library cite

[45]Asymptotic Statistical Theory of Overtraining and Cross-Validation

S. I. Amari, N. Murata, Klaus Robert Muller, M. Finke, H. H. Yang - 1997

1 paper in library cites

[46]Complexity Regularization With Application to Artificial Neural Networks

A. R. Barron - 1991

1 paper in library cites

[47]Generating Facial Expressions With Deep Belief Nets

J. M. Susskind, Geoffrey E. Hinton, J. R. Movellan, A. K. Anderson - 2008

1 paper in library cites

[48]On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes

Andrew Y. Ng, Michael I. Jordan - 2002

1 paper in library cites

[49]Overtraining, Regularization and Searching for a Minimum, With Application to Neural Networks

J. Sjoberg, L. Ljung - 1995

1 paper in library cites

[50]Principled Hybrids of Generative and Discriminative Models

J. A. Lasserre, C. M. Bishop, T. P. Minka - 2006

1 paper in library cites

[51]Sensitive Periods in Development : Interdisciplinary Perspectives

M. H. Bornstein - 1987

1 paper in library cites

Cited by 12 papers in your library

Cites 18 papers in your library

Read on October 14, 2025

Good paper, easy to follow, and it sheds some light on the pre-training stuff (layer-by-layer). I just wish it wasn't so long. It's a chore.
