2010

Understanding the Difficulty of Training Deep Feedforward Neural Networks

Xavier Glorot, Yoshua Bengio

Cite Score

94

AI summary

This paper analyzes why standard gradient descent from random initialization performs poorly when training deep feedforward neural networks. Its experiments use the Shapeset-3 x 2, MNIST, CIFAR-10 and Small-ImageNet datasets. It proposes a new initialization scheme that brings substantially faster convergence, and finds that the logistic sigmoid activation is unsuited for deep networks with random initialization.

Main Contributions

  • Identified the saturation problem in deep networks with sigmoid activations.
  • Showed that saturated units can move out of saturation by themselves, albeit slowly.
  • Proposed a new initialization scheme that brings substantially faster convergence.
  • Demonstrated the importance of appropriate activation functions and initialization schemes for training deep networks.
  • Analyzed how activations and gradients vary across layers and during training.
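The saturation point in the list above can be illustrated numerically. The following is a minimal sketch, not code from the paper: the helper names `sigmoid`, `sigmoid_grad` and `tanh_grad` are hypothetical, and the sample inputs are arbitrary. It shows two things: the logistic sigmoid outputs 0.5 at zero input (a non-zero mean value), whereas tanh is centered at 0, and the sigmoid's derivative shrinks toward zero as the pre-activation grows, which is what makes saturated units slow to learn.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x)); peaks at 0.25 for x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2; peaks at 1.0 for x = 0."""
    return 1.0 - math.tanh(x) ** 2

# Illustrative pre-activation values (arbitrary, chosen for this sketch).
for x in (0.0, 2.0, 5.0, 10.0):
    print(
        f"x={x:5.1f}  sigmoid={sigmoid(x):.4f}  "
        f"sigmoid'={sigmoid_grad(x):.6f}  tanh'={tanh_grad(x):.6f}"
    )
```

Both derivatives vanish for large inputs, so the sketch only illustrates what saturation means; it does not reproduce the paper's layer-by-layer analysis.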

Abstract

Whereas before 2006 it appears that deep multilayer neural networks were not successfully trained, since then several algorithms have been shown to successfully train them, with experimental results showing the superiority of deeper vs less deep architectures. All these experimental results were obtained with new initialization or training mechanisms. Our objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future. We first observe the influence of the non-linear activation functions. We find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation. Surprisingly, we find that saturated units can move out of saturation by themselves, albeit slowly, which explains the plateaus sometimes seen when training neural networks. We find that a new non-linearity that saturates less can often be beneficial. Finally, we study how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1. Based on these considerations, we propose a new initialization scheme that brings substantially faster convergence.
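The abstract does not spell out the initialization scheme. The form usually attributed to this paper, often called normalized or Xavier/Glorot initialization, draws each weight uniformly from [-sqrt(6 / (n_in + n_out)), +sqrt(6 / (n_in + n_out))], which gives a weight variance of 2 / (n_in + n_out). The sketch below assumes that form. The function names `glorot_uniform` and `empirical_variance`, the layer sizes, and the seed are hypothetical choices made for illustration.

```python
import math
import random

def glorot_uniform(n_in, n_out, rng):
    """Return an n_in x n_out weight matrix (list of rows).

    Assumed form: W ~ U[-limit, +limit] with limit = sqrt(6 / (n_in + n_out)),
    so that Var[W] = limit^2 / 3 = 2 / (n_in + n_out).
    """
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_out)] for _ in range(n_in)]

def empirical_variance(matrix):
    """Population variance over all entries of a list-of-lists matrix."""
    values = [w for row in matrix for w in row]
    mean = sum(values) / len(values)
    return sum((w - mean) ** 2 for w in values) / len(values)

# Hypothetical layer sizes and seed, chosen only for this sketch.
n_in, n_out = 200, 100
rng = random.Random(0)
W = glorot_uniform(n_in, n_out, rng)

print("empirical variance:", empirical_variance(W))
print("target variance:   ", 2.0 / (n_in + n_out))
```

The variance target is a compromise between keeping activation variance stable in the forward pass (which depends on n_in) and keeping gradient variance stable in the backward pass (which depends on n_out). That matches the abstract's concern with layer Jacobians whose singular values are far from 1.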
