2010
AI summary
This paper analyzes why standard gradient descent from random initialization performs poorly when training deep feedforward neural networks. It reports experiments on the Shapeset-3 x 2, MNIST, CIFAR-10 and Small-ImageNet datasets. It proposes a new initialization scheme that brings substantially faster convergence, and finds that the logistic sigmoid activation is unsuited for deep networks with random initialization.
Abstract
Whereas before 2006 it appears that deep multilayer neural networks were not successfully trained, since then several algorithms have been shown to successfully train them, with experimental results showing the superiority of deeper vs less deep architectures. All these experimental results were obtained with new initialization or training mechanisms. Our objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future. We first observe the influence of the non-linear activation functions. We find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation. Surprisingly, we find that saturated units can move out of saturation by themselves, albeit slowly, which explains the plateaus sometimes seen when training neural networks. We find that a new non-linearity that saturates less can often be beneficial. Finally, we study how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1. Based on these considerations, we propose a new initialization scheme that brings substantially faster convergence.
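The abstract does not spell out the initialization scheme. As a minimal sketch, assuming the form in which it is commonly cited (the "normalized" or Glorot/Xavier uniform initialization), each weight is drawn from U[-limit, limit] with limit = sqrt(6 / (fan_in + fan_out)), which gives a weight variance of 2 / (fan_in + fan_out). The function names and layer sizes below are hypothetical and chosen only for illustration.

```python
import math
import random


def normalized_uniform_init(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out weight matrix from U[-limit, limit].

    Assumption: limit = sqrt(6 / (fan_in + fan_out)), the commonly cited
    form of the normalized initialization. Not taken from the abstract.
    """
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


def empirical_variance(matrix):
    """Population variance over all entries of a nested-list matrix."""
    values = [w for row in matrix for w in row]
    mean = sum(values) / len(values)
    return sum((w - mean) ** 2 for w in values) / len(values)


# Hypothetical layer sizes, chosen only for illustration.
rng = random.Random(0)
weights = normalized_uniform_init(fan_in=300, fan_out=200, rng=rng)

# A uniform distribution on [-a, a] has variance a^2 / 3, so the target
# variance here is (6 / (fan_in + fan_out)) / 3 = 2 / (fan_in + fan_out).
target_variance = 2.0 / (300 + 200)
print(empirical_variance(weights), target_variance)
```

The printed empirical variance should land close to the target. Scaling by both fan-in and fan-out is a compromise between keeping activation variance stable on the forward pass and keeping gradient variance stable on the backward pass, which connects to the abstract's concern with layer Jacobians whose singular values are far from 1.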