Papperoni

2013

On the Importance of Initialization and Momentum in Deep Learning

Ilya Sutskever, James Martens, G. Dahl, Geoffrey Hinton

Cite Score: 78

AI summary

This paper demonstrates that stochastic gradient descent with momentum, combined with a well-designed random initialization and a specific schedule for the momentum parameter, can train DNNs and RNNs to performance levels previously only achievable with Hessian-Free optimization, achieving state-of-the-art results.

Main Contributions

Demonstrates the effectiveness of stochastic gradient descent with momentum for training deep and recurrent neural networks.
Shows that careful initialization and a specific schedule for the momentum parameter are crucial for achieving high performance.
Achieves results comparable to Hessian-Free optimization on deep autoencoder training tasks.
Demonstrates that momentum-accelerated SGD can successfully train RNNs on artificial datasets with long-range temporal dependencies.
Proposes a hybrid HF-momentum algorithm that combines the benefits of both Hessian-Free optimization and momentum-based methods.
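The "careful initialization" named in the contributions above can be illustrated with a sparse random scheme. This is a minimal sketch, not the authors' code: the function name `sparse_init`, the default of 15 nonzero incoming weights per unit, the unit-Gaussian draw, and the zero biases are all assumptions made for illustration, since this page does not spell out the scheme.

```python
import random


def sparse_init(n_in, n_out, n_connections=15, scale=1.0, seed=0):
    """Return (weights, biases) for one layer with sparse random weights.

    Assumption: each output unit gets `n_connections` nonzero incoming
    weights drawn from a Gaussian with standard deviation `scale`;
    every other weight and every bias starts at zero.
    """
    rng = random.Random(seed)
    k = min(n_connections, n_in)  # cannot connect to more inputs than exist
    weights = [[0.0] * n_in for _ in range(n_out)]
    for row in weights:
        # Pick k distinct input units for this output unit.
        for j in rng.sample(range(n_in), k):
            row[j] = rng.gauss(0.0, scale)
    biases = [0.0] * n_out
    return weights, biases


weights, biases = sparse_init(n_in=100, n_out=10)
nonzero_per_unit = [sum(1 for w in row if w != 0.0) for row in weights]
print(nonzero_per_unit)
```

Keeping the number of nonzero inputs fixed means the total input to a unit does not grow with layer width, which is one plausible reason such a scheme keeps units from saturating at the start of training.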

Abstract

Deep and recurrent neural networks (DNNs and RNNs respectively) are powerful models that were considered to be almost impossible to train using stochastic gradient descent with momentum. In this paper, we show that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs (on datasets with long-term dependencies) to levels of performance that were previously achievable only with Hessian-Free optimization. We find that both the initialization and the momentum are crucial since poorly initialized networks cannot be trained with momentum and well-initialized networks perform markedly worse when the momentum is absent or poorly tuned. Our success training these models suggests that previous attempts to train deep and recurrent neural networks from random initializations have likely failed due to poor initialization schemes. Furthermore, carefully tuned momentum methods suffice for dealing with the curvature issues in deep and recurrent network training objectives without the need for sophisticated second-order methods.
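The abstract's claim that momentum helps with curvature can be seen on a toy problem. The sketch below runs classical momentum on a two-dimensional quadratic whose curvatures differ by a factor of 100, with a momentum coefficient that rises in steps toward a cap. Everything here is an assumption for illustration: the schedule shape and its constants (`period`, `mu_max`), the learning rate, the use of exact rather than stochastic gradients, and the toy objective are not taken from this page, which only says the schedule is "slowly increasing".

```python
def momentum_schedule(step, period=20, mu_max=0.9):
    """Hypothetical schedule: 0.5, 0.75, 0.833, ... capped at mu_max."""
    k = step // period
    return min(1.0 - 1.0 / (2.0 * (k + 1)), mu_max)


def grad(x, curvatures):
    # Gradient of f(x) = 0.5 * sum(h_i * x_i^2).
    return [h * xi for h, xi in zip(curvatures, x)]


def loss(x, curvatures):
    return 0.5 * sum(h * xi * xi for h, xi in zip(curvatures, x))


def run(steps, lr, curvatures, use_momentum):
    x = [1.0] * len(curvatures)  # parameters
    v = [0.0] * len(curvatures)  # velocity
    for t in range(steps):
        mu = momentum_schedule(t) if use_momentum else 0.0
        g = grad(x, curvatures)
        # Classical momentum: v <- mu * v - lr * g, then x <- x + v.
        v = [mu * vi - lr * gi for vi, gi in zip(v, g)]
        x = [xi + vi for xi, vi in zip(x, v)]
    return loss(x, curvatures)


curvatures = [1.0, 100.0]  # one flat direction, one steep direction
plain = run(200, 0.01, curvatures, use_momentum=False)
with_momentum = run(200, 0.01, curvatures, use_momentum=True)
print("plain gradient descent:", plain)
print("with momentum schedule:", with_momentum)
```

The learning rate is limited by the steep direction, so plain gradient descent crawls along the flat one. The velocity term accumulates progress in that flat direction, which is why the momentum run ends at a lower loss after the same number of steps.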

References [30]

[1]ImageNet Classification With Deep Convolutional Neural Networks

Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton - 2012

71 papers in library cite

I'm giving this a 5 just because of the impact, but this is VEEERY derivative of earlier work. Kudos to them for putting it all together, but really there's nothing revolutionary here.

[2]Long Short-Term Memory

Sepp Hochreiter, Jürgen Schmidhuber - 1997

94 papers in library cite

LSTMs FTW!

[3]Understanding the Difficulty of Training Deep Feedforward Neural Networks

Xavier Glorot, Yoshua Bengio - 2010

20 papers in library cite

Nice but underwhelming results (they still underperform vs. pretraining). I also didn't really like the way it's written. It's not bad, it's just a bit clunky. Worth the read though.

[4]Reducing the Dimensionality of Data With Neural Networks

Geoffrey Hinton, Ruslan Salakhutdinov - 2006

37 papers in library cite

I didn't like the way this is written, very hard to understand without a ton of background knowledge. But hey, it's the first deep learning model!

[5]A Fast Learning Algorithm for Deep Belief Nets

Geoffrey E. Hinton, S. Osindero, Y. Teh - 2006

43 papers in library cite

The paper does not explain anything. It just throws the idea and a bunch of math, but doesn't really care to explain the concepts.

[6]Deep Neural Networks for Acoustic Modeling in Speech Recognition

Geoffrey Hinton - 2012

21 papers in library cite

The core of the paper itself is a bit boring and doesn't introduce anything new (just RBMs and DBNs again) but I am giving this a 4 because it's probably the best explanation of RBMs and DBNs I've read so far.

[7]Learning Long-Term Dependencies With Gradient Descent Is Difficult

Yoshua Bengio, Patrice Simard, Paolo Frasconi - 1994

31 papers in library cite

The first ones to notice that there is a problem with gradient descent, but way too mathy for me.

[8]Efficient Backprop

Yann Lecun, Leon Bottou, G. B. Orr, Klaus Robert Muller - 1998

20 papers in library cite

The first half is very very good. The remainder is very boring.

[9]Greedy Layer-Wise Training of Deep Networks

Yoshua Bengio, P. Lamblin, D. Popovici, Hugo Larochelle - 2006

33 papers in library cite

Bengio is perfect. This is everything that Hinton's paper hoped to be. Very well explained, and also tying back to real use cases (not just "hey, the math works and it reduced the score")

[10]Harnessing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication

Herbert Jaeger, Harald Haas - 2004

4 papers in library cite

Bad read. Doesn't explain anything of how things work. It's boring and I didn't like the figures. Maybe the main paper about ESNs is better, but frankly I don't want to read it.

[11]Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition

G. Dahl, D. Yu, L. Deng, Alex Acero - 2012

19 papers in library cite

Good paper, very well written and probably the best explanation of RBMs and DBNs I've seen. However, I don't see a lot of impact and seems very derivative from other works.

[12]Sequence Transduction With Recurrent Neural Networks

Alex Graves - 2012

7 papers in library cite

Good contribution. Discusses transducing (converting one sequence to the other) without pre-defined alignment. I didn't really like it as it is too mathy and a bit hard to understand, and I think it was not too impactful.

[13]Generating Text With Recurrent Neural Networks

Ilya Sutskever, James Martens, Geoffrey E. Hinton - 2011

13 papers in library cite

Pleasant paper but results are underwhelming. They use RNNs for character-level modeling, which is different. They also use the hessian-free method proposed by Martens, but don't go too deep into how it works, which is nice because otherwise it would be very mathy. Other papers cite this more as an example of usage rather than an actual milestone.

[14]Deep Learning Made Easier by Linear Transformations in Perceptrons

Tapani Raiko, Harri Valpola, Yann Lecun - 2012

7 papers in library cite

Kudos for introducing shortcut connections (which would become important in the future), but to me it seems a bit mid.

[15]Deep Learning via Hessian-Free Optimization

James Martens - 2010

12 papers in library cite

This paper is surprisingly good! When I first read the Hessian-Free optimization part, I thought "ugh, this is going to be full of math", but in the end it was very very enjoyable. I think I just wouldn't give it a 5 because it doesn't seem to have had that much impact.

[16]Learning Recurrent Neural Networks With Hessian-Free Optimization

James Martens, Ilya Sutskever - 2011

13 papers in library cite

Meh, very very mathy and seems like a minor improvement on LSTMs. They give very impressive results but I don't think it's too much of a fair comparison (vs. a 1999 architecture of LSTMs). I think they did well in showing that exploding/vanishing gradients can be overcome, but their method is not "it". As said by Bengio, they are partially responsible for the revival of RNNs (alongside Mikolov)

[18]Subword Language Modeling With Neural Networks

Tomas Mikolov, Ilya Sutskever, A. Deoras, H. S. Le, S. Kombrink, Jan Cernocky - 2012

7 papers in library cite

Doesn't add a ton of contribution other than saying that subword models can perform better than char-level models. Probably the most important thing here is early use of subwords.

[19]Acoustic Modeling Using Deep Belief Networks

A. Mohamed, G. Dahl, Geoffrey Hinton - 2012

12 papers in library cite

[20]A Method of Solving a Convex Programming Problem With Convergence Rate O(1/k^2)

Y. Nesterov - 1983

3 papers in library cite

[21]An Optimal Method for Stochastic Composite Optimization

G. Lan - 2010

2 papers in library cite

[22]Dynamics and Algorithms for Stochastic Search

G. B. Orr - 1996

2 papers in library cite

[23]Improved Preconditioner for Hessian Free Optimization

O. Chapelle, Dumitru Erhan - 2011

2 papers in library cite

[24]Introductory Lectures on Convex Optimization: A Basic Course

Y. Nesterov - 2013

2 papers in library cite

[25]Stochastic Dynamics of Learning With Momentum in Neural Networks

W. Wiegerinck, A. Komoda, T. Heskes - 1994

2 papers in library cite

[26]Better Mini-Batch Algorithms via Accelerated Gradient Methods

A. Cotter, O. Shamir, N. Srebro, K. Sridharan - 2011

1 paper in library cites

[27]Large Scale Online Learning

Leon Bottou, Yann Lecun - 2004

1 paper in library cites

[28]Personal Communication

Herbert Jaeger - 2012

1 paper in library cites

[29]Some Methods of Speeding Up the Convergence of Iteration Methods

B. T. Polyak - 1964

1 paper in library cites

[30]Towards Faster Stochastic Gradient Search

C. Darken, J. Moody - 1993

1 paper in library cites

Cited by 13 papers in your library

Cites 19 papers in your library

Read on August 18, 2025

They give very good context and it's easy to understand that they are doing this as a counterpoint to HF. Surprising results as well. I just think it was made obsolete by ReLU.
