2001
Cite Score
62
AI summary
This paper analyzes the problem of vanishing gradients in recurrent neural networks (RNNs), showing that gradients either blow up or vanish exponentially, hindering the learning of long-term dependencies. It theoretically proves that RNNs struggle to robustly store past input information, and briefly reviews alternative optimization methods and architectures.
Abstract
Recurrent networks (crossreference Chapter 12) can, in principle, use their feedback connections to store representations of recent input events in the form of activations. The most widely used algorithms for learning what to put in short-term memory, however, take too much time to be feasible or do not work well at all, especially when minimal time lags between inputs and corresponding teacher signals are long. Although theoretically fascinating, they do not provide clear practical advantages over, say, backprop in feedforward networks with limited time windows (see crossreference Chapters 11 and 12). With conventional “algorithms based on the computation of the complete gradient”, such as “Back-Propagation Through Time” (BPTT, e.g., [23, 28, 27]) or “Real-Time Recurrent Learning” (RTRL, e.g., [22]), error signals “flowing backwards in time” tend to either (1) blow up or (2) vanish: the temporal evolution of the backpropagated error depends exponentially on the size of the weights [12, 6]. Case (1) may lead to oscillating weights, while in case (2) learning to bridge long time lags takes a prohibitive amount of time, or does not work at all. In what follows, we give a theoretical analysis of this problem by studying the asymptotic behavior of error gradients as a function of time lags. In Section 2, we consider the case of standard RNNs and derive the main result using the approach first proposed in [12]. In Section 3, we consider the more general case of adaptive dynamical systems, which include, besides standard RNNs, other recurrent architectures based on different connectivities and choices of the activation function (e.g., RBF or second-order connections). Using the analysis reported in [6], we show that one of the following two undesirable situations necessarily arises: either the system is unable to robustly store past information about its inputs, or gradients vanish exponentially.
Finally, in Section 4 we briefly review alternative optimization methods and architectures that have been suggested to improve learning in the presence of long-term dependencies.
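As a rough illustration of the exponential dependence described in the abstract, the sketch below tracks how an error signal is scaled when propagated back through a single tanh unit with one self-recurrent weight. The scalar setup, the function name `error_scaling`, and its parameters are assumptions made for illustration; they are not the paper's formulation, which treats general networks and dynamical systems.

```python
import math


def error_scaling(w, lag, inputs=None):
    """Return |d h_lag / d h_0| for the scalar recurrence
    h_t = tanh(w * h_{t-1} + x_t), starting from h_0 = 0.

    Hypothetical helper: each step back in time multiplies the error
    by w * tanh'(a_t), so the total scaling is a product of `lag` factors.
    """
    if inputs is None:
        inputs = [0.0] * lag  # zero input keeps the unit at h = 0
    h = 0.0
    scale = 1.0
    for x in inputs[:lag]:
        h = math.tanh(w * h + x)
        slope = 1.0 - h * h  # derivative of tanh at the pre-activation
        scale *= w * slope   # one-step backpropagation factor
    return abs(scale)


if __name__ == "__main__":
    for w in (0.5, 1.0, 1.5):
        print(w, error_scaling(w, lag=50))
```

With zero input the unit stays at the origin, where the tanh slope is 1, so the scaling reduces to `w ** lag`: it shrinks toward zero for `|w| < 1` and grows without bound for `|w| > 1`. Nonzero inputs push the unit toward saturation, which makes each factor smaller still.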
Citation Graph
References (28)
Sepp Hochreiter, Jürgen Schmidhuber - 1997
94 papers in library cite
D. E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams - 1986
46 papers in library cite
Yoshua Bengio, Patrice Simard, Paolo Frasconi - 1994
31 papers in library cite
Felix A. Gers, Jürgen Schmidhuber, Fred Cummins - 2000
13 papers in library cite
Paul J. Werbos - 1988
11 papers in library cite
Sepp Hochreiter, Jürgen Schmidhuber - 1997
5 papers in library cite
Ronald J. Williams, David Zipser - 1992
8 papers in library cite
Sepp Hochreiter - 1991
18 papers in library cite
A. J. Robinson, F. Fallside - 1987
10 papers in library cite
Jürgen Schmidhuber - 1992
8 papers in library cite
S. El Hihi, Yoshua Bengio - 1996
6 papers in library cite
M. C. Mozer - 1992
5 papers in library cite
Yoshua Bengio, Paolo Frasconi - 1994
4 papers in library cite
Fernando J. Pineda - 1988
4 papers in library cite
B. de Vries, J. C. Principe - 1991
2 papers in library cite
K. Lang, A. Waibel, Geoffrey E. Hinton - 1990
2 papers in library cite
K. Doya - 1992
2 papers in library cite
P. Baldi, F. Pineda - 1991
2 papers in library cite
J. Ortega, W. Rheinboldt - 1970
2 papers in library cite
T. Lin, B. Horne, P. Tino, C. Giles - 1996
2 papers in library cite
M. B. Ring - 1993
2 papers in library cite
Yoshua Bengio - 1999
2 papers in library cite
Jürgen Schmidhuber - 1993
2 papers in library cite
G. Sun, H. Chen, Y. Lee - 1993
2 papers in library cite
P. Angeline, G. Saunders, J. Pollack - 1994
1 paper in library cites
Yoshua Bengio, Paolo Frasconi - 1995
1 paper in library cites
T. Lin, B. Horne, C. Giles - 1998
1 paper in library cites
Fred Cummins, F. Gers, Jürgen Schmidhuber - 1999
1 paper in library cites
Cited by: 16 papers in your library
Cites: 7 papers in your library
Read
on August 18, 2025