Papperoni

2010

Understanding the Difficulty of Training Deep Feedforward Neural Networks

Xavier Glorot, Yoshua Bengio

Cite Score: 94

AI summary

This paper analyzes why deep feedforward neural networks are hard to train with standard gradient descent from random initialization. It reports experiments on the Shapeset-3 × 2, MNIST, CIFAR-10 and Small-ImageNet datasets. It proposes a new initialization scheme that brings substantially faster convergence, and finds that the logistic sigmoid activation is unsuited to deep networks with random initialization.

Main Contributions

Identified the saturation problem in deep networks with sigmoid activations.
Showed that saturated units can move out of saturation by themselves, albeit slowly.
Proposed a new initialization scheme that brings substantially faster convergence.
Demonstrated the importance of appropriate activation functions and initialization schemes for training deep networks.
Analyzed how activations and gradients vary across layers and during training.
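The initialization scheme is not spelled out on this page. As a minimal sketch, assuming it refers to the paper's "normalized initialization" (commonly called Xavier or Glorot initialization), each weight is drawn uniformly from ±sqrt(6 / (fan_in + fan_out)), which gives a weight variance of 2 / (fan_in + fan_out). The function name `glorot_uniform` and the list-of-rows layout are illustrative choices, not taken from the paper.

```python
import math
import random


def glorot_uniform(fan_in, fan_out, rng=None):
    """Return a fan_in x fan_out weight matrix as a list of rows.

    Assumption: weights are sampled from U[-limit, limit] with
    limit = sqrt(6 / (fan_in + fan_out)), so that
    Var(W) = limit**2 / 3 = 2 / (fan_in + fan_out).
    """
    rng = rng or random.Random()
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [
        [rng.uniform(-limit, limit) for _ in range(fan_out)]
        for _ in range(fan_in)
    ]
```

Calling `glorot_uniform(100, 50, random.Random(0))` would give a reproducible 100 x 50 matrix whose entries all lie within ±0.2, since sqrt(6 / 150) = 0.2.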

Abstract

Whereas before 2006 it appears that deep multilayer neural networks were not successfully trained, since then several algorithms have been shown to successfully train them, with experimental results showing the superiority of deeper vs less deep architectures. All these experimental results were obtained with new initialization or training mechanisms. Our objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future. We first observe the influence of the non-linear activation functions. We find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation. Surprisingly, we find that saturated units can move out of saturation by themselves, albeit slowly, which explains the plateaus sometimes seen when training neural networks. We find that a new non-linearity that saturates less can often be beneficial. Finally, we study how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1. Based on these considerations, we propose a new initialization scheme that brings substantially faster convergence.
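Two of the abstract's claims about activations can be checked numerically. The sketch below is an illustration, not the paper's experimental setup: it assumes standard-normal pre-activations, and it assumes the "new non-linearity that saturates less" is softsign, x / (1 + |x|). Under those assumptions, sigmoid outputs are centered near 0.5 rather than 0, and the softsign gradient shrinks much more slowly than the tanh gradient for large inputs.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softsign(x):
    # Assumed form of the less-saturating non-linearity.
    return x / (1.0 + abs(x))


def tanh_grad(x):
    # d/dx tanh(x); decays exponentially as |x| grows.
    return 1.0 - math.tanh(x) ** 2


def softsign_grad(x):
    # d/dx softsign(x); decays only quadratically as |x| grows.
    return 1.0 / (1.0 + abs(x)) ** 2


def mean_activation(fn, n=10000, seed=0):
    """Average fn(z) over n draws of z ~ N(0, 1) (an assumed input distribution)."""
    rng = random.Random(seed)
    return sum(fn(rng.gauss(0.0, 1.0)) for _ in range(n)) / n
```

Comparing `mean_activation(sigmoid)` with `mean_activation(math.tanh)` shows the offset in the sigmoid's mean, and comparing `softsign_grad(5.0)` with `tanh_grad(5.0)` shows how much gradient each function still passes when a unit is far from zero.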

References [20]

[1]Gradient-Based Learning Applied to Document Recognition

Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner - 1998

62 papers in library cite

I absolutely hated this paper. It has ~50 pages but reads like 200. It takes too long to explain some things and keeps repeating itself. It also doesn't seem to add much on top of LeNet-5, and it focuses a lot on GTNs, which really didn't stick.

[2]Learning Representations by Back-Propagating Errors

D. E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams - 1986

34 papers in library cite

Introduced backprop. Short and simple.

[3]Learning Multiple Layers of Features From Tiny Images

Alex Krizhevsky - 2009

27 papers in library cite

It's alright. It mainly focuses on RBMs and their features and the actual part that describes the dataset is like 1 page. However, it's maybe the best intuitive description of an RBM I have seen. Other than that, it reads very much like an undergraduate thesis.

[4]A Fast Learning Algorithm for Deep Belief Nets

Geoffrey E. Hinton, S. Osindero, Y. Teh - 2006

43 papers in library cite

The paper does not explain anything. It just throws out the idea and a bunch of math, but doesn't really bother to explain the concepts.

[5]Learning Long-Term Dependencies With Gradient Descent Is Difficult

Yoshua Bengio, Patrice Simard, Paolo Frasconi - 1994

31 papers in library cite

The first ones to notice that gradient descent has trouble learning long-term dependencies, but way too mathy for me.

[6]Learning Deep Architectures for AI

Yoshua Bengio - 2009

25 papers in library cite

It's a nice overview. Some sections get very theoretical, but the first half is very good and I feel that it does a waaaay better job of explaining RBMs and DBNs than other papers. This feels like Bengio is taking your hand and saying "if you don't know what's going on, here you go, everything you need to know to jump into the deep nets train"

[7]Extracting and Composing Robust Features With Denoising Autoencoders

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, Pierre-Antoine Manzagol - 2008

25 papers in library cite

I am *so* glad we found an alternative to DBNs. Also, introduced the idea of denoising which is nice.

[8]A Unified Architecture for Natural Language Processing: Deep Neural Networks With Multitask Learning

Ronan Collobert, Jason Weston - 2008

32 papers in library cite

Really did not add much to the game. I think this was more of a small performance improvement over existing approaches, plus a few methodological standards. Maybe the main contribution is combining multitask learning with deep learning.

[9]Efficient Backprop

Yann LeCun, Léon Bottou, G. B. Orr, Klaus-Robert Müller - 1998

20 papers in library cite

The first half is very very good. The remainder is very boring.

[10]Greedy Layer-Wise Training of Deep Networks

Yoshua Bengio, P. Lamblin, D. Popovici, Hugo Larochelle - 2006

33 papers in library cite

Bengio is perfect. This is everything that Hinton's paper hoped to be. Very well explained, and also tying back to real use cases (not just "hey, the math works and it reduced the score")

[11]Efficient Learning of Sparse Representations With an Energy-Based Model

Marc'Aurelio Ranzato, C. Poultney, S. Chopra, Yann LeCun - 2006

20 papers in library cite

It's ok. Not really good, but alright.

[12]An Empirical Evaluation of Deep Architectures on Problems With Many Factors of Variation

Hugo Larochelle, Dumitru Erhan, Aaron Courville, James Bergstra, Yoshua Bengio - 2007

13 papers in library cite

Good paper showing promising results for Deep Learning. Nothing amazing but good nonetheless

[13]Deep Learning via Semi-Supervised Embedding

Jason Weston, F. Ratle, Ronan Collobert - 2008

10 papers in library cite

It's a good paper and a nice idea, but it seems overly complicated and I don't think it's widely used... (PS: this was republished in 2012)

[14]A Scalable Hierarchical Distributed Language Model

A. Mnih, Geoffrey E. Hinton - 2009

16 papers in library cite

Good paper that introduces hierarchical trees as an alternative to the expensive softmax output. I don't think this is really relevant anymore, but it's a good read.

[15]The Difficulty of Training Deep Architectures and the Effect of Unsupervised Pre-Training

Pascal Vincent - 2009

5 papers in library cite

Very nice analysis of why unsupervised pretraining works!

[16]Exploring Strategies for Training Deep Neural Networks

Hugo Larochelle, Yoshua Bengio, J. Louradour, P. Lamblin - 2009

7 papers in library cite

40 pages

[17]Accelerated Learning in Layered Neural Networks

Sara A. Solla, E. Levin, M. Fleisher - 1988

2 papers in library cite

[18]Unsupervised Learning of Probabilistic Grammar-Markov Models for Object Categories

L. Zhu, Yanru Chen, A. Yuille - 2009

2 papers in library cite

[19]Learning in Modular Systems

D. Bradley - 2009

1 paper in library cites

[20]Quadratic Polynomials Learn Better Image Features

James Bergstra, G. Desjardins, P. Lamblin, Yoshua Bengio - 2009

1 paper in library cites

Cited by 20 papers in your library

Cites 15 papers in your library

Read on July 21, 2025

Nice but underwhelming results (they still underperform vs. pretraining). I also didn't really like the way it's written. It's not bad, it's just a bit clunky. Worth the read though.
