2013

On the Importance of Initialization and Momentum in Deep Learning

Ilya Sutskever, James Martens, G. Dahl, Geoffrey Hinton

AI summary

This paper demonstrates that stochastic gradient descent with momentum, combined with a well-designed random initialization and a slowly increasing schedule for the momentum parameter, can train deep and recurrent neural networks (DNNs and RNNs) to performance levels previously achievable only with Hessian-Free optimization.

Main Contributions

  • Demonstrates the effectiveness of stochastic gradient descent with momentum for training deep and recurrent neural networks.
  • Shows that careful initialization and a specific schedule for the momentum parameter are crucial for achieving high performance.
  • Achieves results comparable to Hessian-Free optimization on deep autoencoder training tasks.
  • Demonstrates that momentum-accelerated SGD can successfully train RNNs on artificial datasets with long-range temporal dependencies.
  • Proposes a hybrid HF-momentum algorithm that combines the benefits of both Hessian-Free optimization and momentum-based methods.
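
The summary above does not spell out what a "careful initialization" looks like. As a rough illustration only, the sketch below shows one sparse random initialization in which each output unit receives a small, fixed number of nonzero incoming weights. The function name `sparse_init`, the cap of 15 nonzero weights per unit, and the unit-variance Gaussian are assumptions made for this example, not values taken from the text above.

```python
import random


def sparse_init(n_in, n_out, n_nonzero=15, scale=1.0, seed=0):
    """Return an n_out x n_in weight matrix (list of rows).

    Each row (one output unit) gets at most `n_nonzero` nonzero entries,
    drawn from a Gaussian with standard deviation `scale`; every other
    entry is zero. All parameter defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    k = min(n_nonzero, n_in)  # cannot pick more inputs than exist
    weights = [[0.0] * n_in for _ in range(n_out)]
    for row in weights:
        # Choose k distinct input indices for this unit.
        for j in rng.sample(range(n_in), k):
            row[j] = rng.gauss(0.0, scale)
    return weights


W = sparse_init(n_in=100, n_out=8)
nonzeros_per_unit = [sum(1 for w in row if w != 0.0) for row in W]
```

Capping the number of nonzero inputs keeps the total input to each unit from growing with layer width, which is one plausible reason a scheme of this kind helps units avoid saturating at the start of training.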

Abstract

Deep and recurrent neural networks (DNNs and RNNs respectively) are powerful models that were considered to be almost impossible to train using stochastic gradient descent with momentum. In this paper, we show that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs (on datasets with long-term dependencies) to levels of performance that were previously achievable only with Hessian-Free optimization. We find that both the initialization and the momentum are crucial since poorly initialized networks cannot be trained with momentum and well-initialized networks perform markedly worse when the momentum is absent or poorly tuned. Our success in training these models suggests that previous attempts to train deep and recurrent neural networks from random initializations have likely failed due to poor initialization schemes. Furthermore, carefully tuned momentum methods suffice for dealing with the curvature issues in deep and recurrent network training objectives without the need for sophisticated second-order methods.
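
To make the idea of a slowly increasing momentum schedule concrete, here is a minimal sketch of gradient descent with classical momentum on a toy two-dimensional quadratic. The linear ramp in `momentum_schedule`, the learning rate, and the toy objective are all assumptions chosen for illustration; they are not the schedule or settings used in the paper, and exact gradients stand in for the minibatch gradients that SGD would use.

```python
def momentum_schedule(step, mu_start=0.5, mu_max=0.99, ramp_steps=200):
    """Hypothetical schedule: ramp momentum linearly, then hold at mu_max."""
    frac = min(step / ramp_steps, 1.0)
    return mu_start + frac * (mu_max - mu_start)


def loss(theta):
    """Toy ill-conditioned objective f(x, y) = 0.5 * (x^2 + 100 * y^2)."""
    x, y = theta
    return 0.5 * (x * x + 100.0 * y * y)


def grad(theta):
    """Exact gradient of the toy objective."""
    x, y = theta
    return [x, 100.0 * y]


def train_with_momentum(theta, lr=0.005, steps=500):
    """Classical momentum: v <- mu * v - lr * g, then theta <- theta + v."""
    velocity = [0.0 for _ in theta]
    for step in range(steps):
        mu = momentum_schedule(step)
        g = grad(theta)
        velocity = [mu * v - lr * gi for v, gi in zip(velocity, g)]
        theta = [t + v for t, v in zip(theta, velocity)]
    return theta


start = [1.0, 1.0]
final = train_with_momentum(start)
```

Starting with a small momentum value and raising it over time lets early steps behave like plain gradient descent, while later steps accumulate velocity along low-curvature directions such as the `x` axis of this toy problem.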

