2001

Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies

Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, Jürgen Schmidhuber

AI summary

This paper analyzes the vanishing gradient problem in recurrent neural networks (RNNs). It shows that backpropagated error gradients either blow up or vanish exponentially with the time lag, which hinders the learning of long-term dependencies. It further argues that a system must either fail to robustly store past input information or suffer exponentially vanishing gradients, and it briefly reviews alternative optimization methods and architectures.

Main Contributions

Provides a theoretical analysis of vanishing gradients in RNNs, focusing on the asymptotic behavior of error gradients as a function of time lags.
Shows that RNN gradients either blow up or vanish exponentially, depending on the size of the weights.
Demonstrates that a sufficient condition for gradient decay is also a necessary condition for the system to robustly store discrete state information over the long term.
Discusses the decomposition of the state-space of hidden units into regions where gradients decay and regions where robust latching is not possible.
Reviews several proposals to cope with the problem of long-term dependencies, including alternative search algorithms and alternative architectures.
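
As a rough illustration of the second contribution, the exponential dependence on weight size can be sketched for a standard RNN. The notation below (hidden state h_t, pre-activation a_t, recurrent weights W, input weights U, activation f) is an assumption made for this sketch and is not taken from this page; the paper's own derivation may differ in detail.

```latex
% Assumed notation: h_t hidden state, a_t pre-activation, W recurrent weights,
% U input weights, f elementwise activation, \lVert\cdot\rVert an induced matrix norm.
\begin{align}
h_t &= f(a_t), \qquad a_t = W h_{t-1} + U x_t, \\
\frac{\partial h_t}{\partial h_{t-q}}
  &= \prod_{m=0}^{q-1} \operatorname{diag}\bigl(f'(a_{t-m})\bigr)\, W, \\
\left\lVert \frac{\partial h_t}{\partial h_{t-q}} \right\rVert
  &\le \bigl(\gamma \,\lVert W \rVert\bigr)^{q},
  \qquad \gamma = \sup_a \lvert f'(a) \rvert .
\end{align}
```

Under these assumptions, the bound decays exponentially in the time lag q whenever \gamma \lVert W \rVert < 1 (for the logistic sigmoid \gamma = 1/4, for tanh \gamma = 1). The bound is only a sufficient condition for vanishing; when \gamma \lVert W \rVert > 1 the gradient may grow, but the bound alone does not guarantee it.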

Abstract

Recurrent networks (see Chapter 12) can, in principle, use their feedback connections to store representations of recent input events in the form of activations. The most widely used algorithms for learning what to put in short-term memory, however, take too much time to be feasible or do not work well at all, especially when minimal time lags between inputs and corresponding teacher signals are long. Although theoretically fascinating, they do not provide clear practical advantages over, say, backprop in feedforward networks with limited time windows (see Chapters 11 and 12). With conventional “algorithms based on the computation of the complete gradient”, such as “Back-Propagation Through Time” (BPTT, e.g., [23, 28, 27]) or “Real-Time Recurrent Learning” (RTRL, e.g., [22]), error signals “flowing backwards in time” tend to either (1) blow up or (2) vanish: the temporal evolution of the backpropagated error depends exponentially on the size of the weights [12, 6]. Case (1) may lead to oscillating weights, while in case (2) learning to bridge long time lags takes a prohibitive amount of time, or does not work at all.

In what follows, we give a theoretical analysis of this problem by studying the asymptotic behavior of error gradients as a function of time lags. In Section 2, we consider the case of standard RNNs and derive the main result using the approach first proposed in [12]. In Section 3, we consider the more general case of adaptive dynamical systems, which include, besides standard RNNs, other recurrent architectures based on different connectivities and choices of the activation function (e.g., RBF or second-order connections). Using the analysis reported in [6], we show that one of the following two undesirable situations necessarily arises: either the system is unable to robustly store past information about its inputs, or gradients vanish exponentially. Finally, in Section 4 we briefly review alternative optimization methods and architectures that have been suggested to improve learning in the presence of long-term dependencies.
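
The exponential dependence of the backpropagated error on weight size can be checked numerically on a toy case. The following is a minimal sketch, assuming a single-unit recurrence h_t = f(w * h_{t-1}) with no external input; the function name `backprop_factor` and its parameters are hypothetical and chosen only for illustration, not taken from the paper.

```python
import math


def backprop_factor(w, steps, h0=0.5, activation="tanh"):
    """Return d h_T / d h_0 for the scalar recurrence h_t = f(w * h_{t-1}).

    Hypothetical helper for illustration: by the chain rule, the result is
    the product of the local derivatives f'(a_t) * w over all time steps.
    """
    h = h0
    grad = 1.0
    for _ in range(steps):
        a = w * h
        if activation == "tanh":
            h = math.tanh(a)
            local = 1.0 - h * h  # derivative of tanh at a
        elif activation == "identity":
            h = a
            local = 1.0  # derivative of the identity map
        else:
            raise ValueError(f"unknown activation: {activation}")
        grad *= local * w  # accumulate one backpropagation step
    return grad


if __name__ == "__main__":
    for w in (0.5, 2.0):
        for act in ("identity", "tanh"):
            print(w, act, backprop_factor(w, steps=50, activation=act))
```

With the identity activation the factor is exactly w raised to the number of steps, so it shrinks for |w| < 1 and grows for |w| > 1. With tanh and w = 2, the state settles near a stable fixed point where the local slope is small, so the factor still decays; this toy behavior is consistent with the abstract's claim that robustly stored information goes together with vanishing gradients, but it is not a substitute for the paper's analysis.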

References [28]

[1] Long Short-Term Memory

Sepp Hochreiter, Jürgen Schmidhuber - 1997