2012

Adadelta: An Adaptive Learning Rate Method

Matthew D. Zeiler

AI summary

This paper introduces ADADELTA, a novel per-dimension learning rate method for gradient descent that dynamically adapts over time using only first-order information, with minimal overhead. The method shows promising results on MNIST and a large-scale voice dataset in a distributed cluster environment.

Main Contributions

Introduces a new per-dimension learning rate method for gradient descent called ADADELTA.
ADADELTA dynamically adapts over time using only first order information and has minimal computational overhead.
The method requires no manual tuning of a learning rate and appears robust to noisy gradient information.
Shows promising results compared to other methods on the MNIST digit classification task.
Shows promising results compared to other methods on a large scale voice dataset in a distributed cluster environment.

Abstract

We present a novel per-dimension learning rate method for gradient descent called ADADELTA. The method dynamically adapts over time using only first order information and has minimal computational overhead beyond vanilla stochastic gradient descent. The method requires no manual tuning of a learning rate and appears robust to noisy gradient information, different model architecture choices, various data modalities and selection of hyperparameters. We show promising results compared to other methods on the MNIST digit classification task using a single machine and on a large scale voice dataset in a distributed cluster environment.
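As a rough illustration of what "per-dimension" and "no manual tuning of a learning rate" mean in practice, here is a minimal pure-Python sketch of the ADADELTA update rule as it is commonly stated. The function names, the `state` dictionary layout, and the toy gradient values are assumptions made for this example and do not come from this page; `rho=0.95` and `eps=1e-6` are typical choices, not prescribed ones.

```python
import math


def init_state(n):
    """Per-dimension running averages, both starting at zero (illustrative layout)."""
    return {"eg2": [0.0] * n, "edx2": [0.0] * n}


def adadelta_step(x, grad, state, rho=0.95, eps=1e-6):
    """Apply one ADADELTA update to the parameter list x and return the new list.

    Note that there is no global learning rate argument: the step size for
    each dimension is the ratio of two running RMS values.
    """
    new_x = []
    for i, (xi, gi) in enumerate(zip(x, grad)):
        # Decaying average of squared gradients for this dimension.
        state["eg2"][i] = rho * state["eg2"][i] + (1 - rho) * gi * gi
        # Step = -(RMS of past updates / RMS of gradients) * gradient.
        rms_dx = math.sqrt(state["edx2"][i] + eps)
        rms_g = math.sqrt(state["eg2"][i] + eps)
        dx = -(rms_dx / rms_g) * gi
        # Decaying average of squared updates, used by the next step.
        state["edx2"][i] = rho * state["edx2"][i] + (1 - rho) * dx * dx
        new_x.append(xi + dx)
    return new_x


if __name__ == "__main__":
    # Two dimensions whose gradients differ by a factor of 100.
    state = init_state(2)
    x = adadelta_step([1.0, 1.0], [2.0, 200.0], state)
    print(x)
```

In this sketch the first step comes out nearly the same size in both dimensions even though the gradients differ by a factor of 100, which is the per-dimension scaling at work.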

References [7]

[1]Learning Representations by Back-Propagating Errors

D. E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams - 1986

34 papers in library cite

Introduced backprop. Short and simple.

[2]Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

John Duchi, Elad Hazan, Yoram Singer - 2011

19 papers in library cite

I actually skimmed through most of this. It's not a bad paper, but it's a math paper, not AI.

[3]A Stochastic Approximation Method

Herbert Robbins, Sutton Monro - 1951

3 papers in library cite

It's math. But it actually does a somewhat good job at explaining (but I don't think they tried too hard). It gets way better near the end.

[4]Large Scale Distributed Deep Networks

Jeffrey Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, A. Senior, P. Tucker, K. Yang, Andrew Y. Ng - 2012

16 papers in library cite

Good paper, nice algorithm. Nothing too crazy, but I understand the impact. I think the work to create the system was larger than the algorithm itself.

[5]Improving the Convergence of Back-Propagation Learning With Second-Order Methods

S. Becker, Yann LeCun - 1988

9 papers in library cite

Surprisingly good. I thought this would be very math-heavy, but in the end it's very easy to understand. Sadly, the results are underwhelming.

[6]Application of Pretrained Deep Neural Networks to Large Vocabulary Speech Recognition

Navdeep Jaitly, P. Nguyen, A. Senior, Vincent Vanhoucke - 2012

6 papers in library cite

It's not bad, it's just nothing really new. They take existing methods and apply them to very large datasets. I see the contribution, but it's a boring read: just experiment methodology and results.

[7]No More Pesky Learning Rates

T. Schaul, S. Zhang, Yann LeCun - 2012

2 papers in library cite

Read on June 20, 2025

Cool contribution. Simple and straightforward, and not too mathy. Based on ADAGRAD.
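For context on the ADAGRAD connection, here is a minimal sketch of an AdaGrad-style step, written from the commonly stated form of that rule. The function name, the `lr` and `eps` defaults, and the constant-gradient demo are assumptions for illustration only. The point to notice is that the accumulator sums every past squared gradient and never shrinks, so the effective step size can only go down over time, and a global learning rate `lr` still has to be chosen by hand.

```python
import math


def adagrad_step(x, grad, accum, lr=0.01, eps=1e-8):
    """Apply one AdaGrad-style update; accum is modified in place."""
    new_x = []
    for i, (xi, gi) in enumerate(zip(x, grad)):
        # Sum of ALL past squared gradients for this dimension (never decays).
        accum[i] += gi * gi
        # Per-dimension step: global rate divided by the root of that sum.
        new_x.append(xi - lr / (math.sqrt(accum[i]) + eps) * gi)
    return new_x


if __name__ == "__main__":
    # With a constant gradient, each step is smaller than the previous one.
    accum = [0.0]
    x = [0.0]
    for _ in range(3):
        prev = x[0]
        x = adagrad_step(x, [1.0], accum)
        print(abs(x[0] - prev))
```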