Cite Score: 83
AI summary
This paper introduces ADADELTA, a novel per-dimension learning rate method for gradient descent that dynamically adapts over time using only first-order information, with minimal overhead. The method shows promising results on MNIST and a large-scale voice dataset in a distributed cluster environment.
Abstract
We present a novel per-dimension learning rate method for gradient descent called ADADELTA. The method dynamically adapts over time using only first order information and has minimal computational overhead beyond vanilla stochastic gradient descent. The method requires no manual tuning of a learning rate and appears robust to noisy gradient information, different model architecture choices, various data modalities and selection of hyperparameters. We show promising results compared to other methods on the MNIST digit classification task using a single machine and on a large scale voice dataset in a distributed cluster environment.
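The abstract describes a per-dimension step size that adapts using only first-order information. A minimal sketch of the update rule as it is commonly stated for ADADELTA follows: a decaying average of squared gradients, a decaying average of squared updates, and a step formed from the ratio of their square roots. The function name `adadelta_step`, the toy quadratic objective, and the values `rho=0.95` and `eps=1e-6` are assumptions chosen for illustration and are not taken from this page.

```python
import math


def adadelta_step(x, grad, acc_grad, acc_delta, rho=0.95, eps=1e-6):
    """Apply one ADADELTA-style update independently to each dimension.

    x, grad, acc_grad and acc_delta are lists of floats of equal length.
    rho (decay rate) and eps (stabilising constant) are illustrative values.
    Returns the new parameters and the two updated accumulators.
    """
    new_x, new_acc_grad, new_acc_delta = [], [], []
    for xi, gi, eg, ed in zip(x, grad, acc_grad, acc_delta):
        # Decaying average of squared gradients.
        eg = rho * eg + (1.0 - rho) * gi * gi
        # Step: ratio of RMS of past updates to RMS of gradients, times gradient.
        delta = -math.sqrt(ed + eps) / math.sqrt(eg + eps) * gi
        # Decaying average of squared updates (uses the step just computed).
        ed = rho * ed + (1.0 - rho) * delta * delta
        new_x.append(xi + delta)
        new_acc_grad.append(eg)
        new_acc_delta.append(ed)
    return new_x, new_acc_grad, new_acc_delta


if __name__ == "__main__":
    # Hypothetical toy objective: f(x) = 0.5 * sum(c_i * x_i**2), badly scaled.
    scales = [1.0, 100.0]
    x = [1.0, 1.0]
    acc_grad = [0.0, 0.0]
    acc_delta = [0.0, 0.0]
    for _ in range(10):
        grad = [c * xi for c, xi in zip(scales, x)]
        x, acc_grad, acc_delta = adadelta_step(x, grad, acc_grad, acc_delta)
    print(x)
```

Note that no global learning rate appears in the step. In this sketch the first update is almost identical in both dimensions even though the gradients differ by a factor of 100, which illustrates the per-dimension scaling the abstract refers to.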
Citation Graph
References [7]
D. E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams - 1986
34 papers in library cite
John Duchi, Elad Hazan, Yoram Singer - 2011
19 papers in library cite
Herbert Robbins, Sutton Monro - 1951
3 papers in library cite
Jeffrey Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, A. Senior, P. Tucker, K. Yang, Andrew Y. Ng - 2012
16 papers in library cite
S. Becker, Yann LeCun - 1988
9 papers in library cite
Navdeep Jaitly, P. Nguyen, A. Senior, Vincent Vanhoucke - 2012
6 papers in library cite
T. Schaul, S. Zhang, Yann LeCun - 2012
2 papers in library cite
Cited by: 13 papers in your library
Cites: 6 papers in your library
Read on June 20, 2025