2001

Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies

Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, Jürgen Schmidhuber

citations

Cite Score

62

AI summary

This paper analyzes the problem of vanishing gradients in recurrent neural networks (RNNs), showing that gradients either blow up or vanish exponentially, hindering the learning of long-term dependencies. It theoretically proves that RNNs struggle to robustly store past input information, and briefly reviews alternative optimization methods and architectures.

Main Contributions

  • Provides a theoretical analysis of vanishing gradients in RNNs, focusing on the asymptotic behavior of error gradients as a function of time lags.
  • Shows that RNN gradients either blow up or vanish exponentially, depending on the size of the weights.
  • Demonstrates that a sufficient condition for gradient decay is also a necessary condition for the system to robustly store discrete state information over the long term.
  • Discusses the decomposition of the state-space of hidden units into regions where gradients decay and regions where robust latching is not possible.
  • Reviews several proposals to cope with the problem of long-term dependencies, including alternative search algorithms and alternative architectures.
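
The "blow up or vanish" claim in the contributions above can be made concrete with a short worked bound. This is a sketch under assumed notation, not the paper's own derivation: the recurrence $h_t = f(W h_{t-1} + U x_t)$, the pre-activation $a_t$, the lag $q$, and the constant $\gamma$ are symbols introduced here for illustration.

```latex
% Assumed setup: h_t = f(a_t), with a_t = W h_{t-1} + U x_t,
% and gamma = sup_a |f'(a)| (gamma = 1 for tanh, 1/4 for the logistic sigmoid).
\begin{align}
  \frac{\partial h_t}{\partial h_{t-1}}
    &= \operatorname{diag}\bigl(f'(a_t)\bigr)\, W, \\
  \frac{\partial h_t}{\partial h_{t-q}}
    &= \prod_{k=0}^{q-1} \operatorname{diag}\bigl(f'(a_{t-k})\bigr)\, W, \\
  \left\lVert \frac{\partial h_t}{\partial h_{t-q}} \right\rVert
    &\le \bigl(\gamma \,\lVert W \rVert\bigr)^{q}.
\end{align}
```

If $\gamma \lVert W \rVert < 1$, the bound shrinks exponentially in the time lag $q$, so the error signal reaching step $t-q$ vanishes. The converse is weaker: $\gamma \lVert W \rVert > 1$ is necessary for the gradient to blow up, but it does not guarantee it, because saturated units make $f'(a_t)$ small.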

Abstract

Recurrent networks (crossreference Chapter 12) can, in principle, use their feedback connections to store representations of recent input events in the form of activations. The most widely used algorithms for learning what to put in short-term memory, however, take too much time to be feasible or do not work well at all, especially when minimal time lags between inputs and corresponding teacher signals are long. Although theoretically fascinating, they do not provide clear practical advantages over, say, backprop in feedforward networks with limited time windows (see crossreference Chapters 11 and 12). With conventional “algorithms based on the computation of the complete gradient”, such as “Back-Propagation Through Time” (BPTT, e.g., [23, 28, 27]) or “Real-Time Recurrent Learning” (RTRL, e.g., [22]), error signals “flowing backwards in time” tend to either (1) blow up or (2) vanish: the temporal evolution of the backpropagated error depends exponentially on the size of the weights [12, 6]. Case (1) may lead to oscillating weights, while in case (2) learning to bridge long time lags takes a prohibitive amount of time, or does not work at all. In what follows, we give a theoretical analysis of this problem by studying the asymptotic behavior of error gradients as a function of time lags. In Section 2, we consider the case of standard RNNs and derive the main result using the approach first proposed in [12]. In Section 3, we consider the more general case of adaptive dynamical systems, which include, besides standard RNNs, other recurrent architectures based on different connectivities and choices of the activation function (e.g., RBF or second-order connections). Using the analysis reported in [6], we show that one of the following two undesirable situations necessarily arises: either the system is unable to robustly store past information about its inputs, or gradients vanish exponentially. Finally, in Section 4 we briefly review alternative optimization methods and architectures that have been suggested to improve learning in the presence of long-term dependencies.
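
The exponential dependence of the backpropagated error on weight size, described in the abstract, can be checked numerically. The following is a minimal sketch, not the authors' experiment: the function name `backprop_error_norms`, the scaled orthogonal weight matrix, the input noise level, and the choice of a linear network for the growing case are all assumptions made for illustration.

```python
import numpy as np


def backprop_error_norms(weight_scale, steps=50, size=20, linear=False, seed=0):
    """Norms of an error vector propagated back through `steps` time steps.

    Hypothetical toy setup: h_t = f(W h_{t-1} + x_t), with f = tanh
    (or the identity when `linear` is True) and small random inputs x_t.
    """
    rng = np.random.default_rng(seed)
    # Scaled orthogonal matrix: every singular value of W equals weight_scale.
    q, _ = np.linalg.qr(rng.standard_normal((size, size)))
    W = weight_scale * q

    # Forward pass: record the activation derivative f'(a_t) at each step.
    h = np.zeros(size)
    derivs = []
    for _ in range(steps):
        pre = W @ h + 0.1 * rng.standard_normal(size)
        h = pre if linear else np.tanh(pre)
        derivs.append(np.ones(size) if linear else 1.0 - h**2)

    # Backward pass: delta_{t-1} = W^T (f'(a_t) * delta_t), unit-norm start.
    delta = np.ones(size) / np.sqrt(size)
    norms = []
    for d in reversed(derivs):
        delta = W.T @ (d * delta)
        norms.append(np.linalg.norm(delta))
    return np.array(norms)


if __name__ == "__main__":
    for scale, linear in [(0.5, False), (1.5, True)]:
        norms = backprop_error_norms(scale, linear=linear)
        kind = "linear" if linear else "tanh"
        print(f"scale={scale} ({kind}): norm after 50 steps = {norms[-1]:.3e}")
```

With a scale of 0.5 the error norm is at most halved at every step, so after 50 steps it is far below floating-point significance for learning. The growing case uses a linear network on purpose: with tanh units, large weights drive the units into saturation, which shrinks the derivative term and can mask the growth.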


References [28]


Sepp Hochreiter, Jürgen Schmidhuber - 1997

94 papers in library cite

D. E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams - 1986

46 papers in library cite

Yoshua Bengio, Patrice Simard, Paolo Frasconi - 1994

31 papers in library cite

Felix A. Gers, Jürgen Schmidhuber, Fred Cummins - 2000

13 papers in library cite

Paul J. Werbos - 1988

11 papers in library cite

Sepp Hochreiter, Jürgen Schmidhuber - 1997

5 papers in library cite

Ronald J. Williams, David Zipser - 1992

8 papers in library cite

Sepp Hochreiter - 1991

18 papers in library cite

A. J. Robinson, F. Fallside - 1987

10 papers in library cite

Jürgen Schmidhuber - 1992

8 papers in library cite

S. El Hihi, Yoshua Bengio - 1996

6 papers in library cite

M. C. Mozer - 1992

5 papers in library cite

Yoshua Bengio, Paolo Frasconi - 1994

4 papers in library cite

Fernando J. Pineda - 1988

4 papers in library cite

B. de Vries, J. C. Principe - 1991

2 papers in library cite

K. Lang, A. Waibel, Geoffrey E. Hinton - 1990

2 papers in library cite

K. Doya - 1992

2 papers in library cite

P. Baldi, F. Pineda - 1991

2 papers in library cite

J. Ortega, W. Rheinboldt - 1970

2 papers in library cite

T. Lin, B. Horne, P. Tino, C. Giles - 1996

2 papers in library cite

M. B. Ring - 1993

2 papers in library cite

Yoshua Bengio - 1999

2 papers in library cite

Jürgen Schmidhuber - 1993

2 papers in library cite

G. Sun, H. Chen, Y. Lee - 1993

2 papers in library cite

P. Angeline, G. Saunders, J. Pollack - 1994

1 paper in library cites

Yoshua Bengio, Paolo Frasconi - 1995

1 paper in library cites

T. Lin, B. Horne, C. Giles - 1998

1 paper in library cites

Fred Cummins, F. Gers, Jürgen Schmidhuber - 1999

1 paper in library cites

Cited by

16

papers in your library

Cites

7

papers in your library

Read

on August 18, 2025
