Papperoni

2011

Generating Text With Recurrent Neural Networks

Ilya Sutskever, James Martens, Geoffrey E. Hinton

Cite Score: 56

AI summary

This paper introduces the Multiplicative Recurrent Neural Network (MRNN), an architecture for character-level language modeling trained with Hessian-Free optimization. It surpasses the best previous single method for the task, a hierarchical non-parametric sequence model, and generates text with a considerable amount of high-level linguistic structure.

Main Contributions

Introduces a new RNN variant: Multiplicative RNN (MRNN) that uses multiplicative (or 'gated') connections.
Demonstrates the power of RNNs trained with Hessian-Free optimization for character-level language modeling tasks.
Achieves state-of-the-art results surpassing the performance of the best previous single method for character-level language modeling: a hierarchical non-parametric sequence model.
To the authors' knowledge, the largest recurrent neural network application at the time of publication (2011).
The text generated by the MRNNs exhibited a significant amount of interesting and high-level linguistic structure, featuring a large vocabulary, a considerable amount of grammatical structure, and a wide variety of highly plausible proper names that were not in the training set.

Abstract

Recurrent Neural Networks (RNNs) are very powerful sequence models that do not enjoy widespread use because it is extremely difficult to train them properly. Fortunately, recent advances in Hessian-free optimization have been able to overcome the difficulties associated with training RNNs, making it possible to apply them successfully to challenging sequence problems. In this paper we demonstrate the power of RNNs trained with the new Hessian-Free optimizer (HF) by applying them to character-level language modeling tasks. The standard RNN architecture, while effective, is not ideally suited for such tasks, so we introduce a new RNN variant that uses multiplicative (or "gated") connections which allow the current input character to determine the transition matrix from one hidden state vector to the next. After training the multiplicative RNN with the HF optimizer for five days on 8 high-end Graphics Processing Units, we were able to surpass the performance of the best previous single method for character-level language modeling: a hierarchical non-parametric sequence model. To our knowledge this represents the largest recurrent neural network application to date.
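The multiplicative connection described in the abstract can be sketched in a few lines. The following is a minimal pure-Python illustration, assuming a factored parameterization in which the current character gates a set of factors that sit between one hidden state and the next, so the effective transition matrix is `W_hf * diag(W_fx x) * W_fh`. The names `init_mrnn`, `mrnn_step`, and `factor_size`, the toy vocabulary, the random weights, and the omission of bias terms are all assumptions made for illustration; this is not the authors' implementation.

```python
import math
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small Gaussian random matrix (illustrative initialization only)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def init_mrnn(vocab_size, hidden_size, factor_size, seed=0):
    """Hypothetical parameter container; biases are omitted for brevity."""
    rng = random.Random(seed)
    return {
        "W_fx": rand_matrix(factor_size, vocab_size, rng),   # input -> factor gates
        "W_fh": rand_matrix(factor_size, hidden_size, rng),  # hidden -> factors
        "W_hf": rand_matrix(hidden_size, factor_size, rng),  # factors -> hidden
        "W_hx": rand_matrix(hidden_size, vocab_size, rng),   # input -> hidden
        "W_oh": rand_matrix(vocab_size, hidden_size, rng),   # hidden -> output
    }


def mrnn_step(params, h_prev, char_index):
    """One multiplicative-RNN step for a single one-hot input character."""
    vocab_size = len(params["W_hx"][0])
    x = [0.0] * vocab_size
    x[char_index] = 1.0

    # The input character produces one gate per factor ...
    gate = matvec(params["W_fx"], x)
    # ... which rescales the projection of the previous hidden state.
    proj = matvec(params["W_fh"], h_prev)
    f = [g * p for g, p in zip(gate, proj)]

    # New hidden state: gated recurrent term plus a direct input term.
    recurrent = matvec(params["W_hf"], f)
    direct = matvec(params["W_hx"], x)
    h = [math.tanh(r + d) for r, d in zip(recurrent, direct)]

    # Distribution over the next character.
    probs = softmax(matvec(params["W_oh"], h))
    return h, probs


# Usage: feed a short string through an untrained network.
vocab = "abc "
params = init_mrnn(len(vocab), hidden_size=8, factor_size=6)
h = [0.0] * 8
for ch in "ab c":
    h, probs = mrnn_step(params, h, vocab.index(ch))
```

Because the gate depends on the input character, two different characters applied to the same previous hidden state go through different effective transition matrices, which is the property the abstract highlights. With untrained weights the output distribution is close to uniform, so the sketch shows only the data flow, not the reported results.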

References [26]

[1]Long Short-Term Memory

Sepp Hochreiter, Jürgen Schmidhuber - 1997

94 papers in library cite

LSTMs FTW!

[2]Learning Representations by Back-Propagating Errors

D. E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams - 1986

34 papers in library cite

Introduced backprop. Short and simple.

[3]Learning Long-Term Dependencies With Gradient Descent Is Difficult

Yoshua Bengio, Patrice Simard, Paolo Frasconi - 1994

31 papers in library cite

The first ones to notice that there is a problem with gradient descent, but way too mathy for me.

[4]Recurrent Neural Network Based Language Model

Tomas Mikolov, M. Karafiat, Lukas Burget, Jan Cernocky, Sanjeev Khudanpur - 2010

36 papers in library cite

The comeback of RNNs for language modeling. Not too exciting but impactful and a short read.

[5]Backpropagation Through Time: What It Does and How to Do It

P. Werbos - 1990

9 papers in library cite

Amazing tutorial! Very pragmatic. Explains very basic concepts and focuses on implementation.

[6]Harnessing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication

Herbert Jaeger, Harald Haas - 2004

4 papers in library cite

Bad read. Doesn't explain anything about how things work. It's boring and I didn't like the figures. Maybe the main paper about ESNs is better, but frankly I don't want to read it.

[7]Offline Handwriting Recognition With Multidimensional Recurrent Neural Networks

Alex Graves, Jürgen Schmidhuber - 2009

5 papers in library cite

It's okay. The method is nice but a bit too convoluted.

[8]Deep Learning via Hessian-Free Optimization

James Martens - 2010

12 papers in library cite

This paper is surprisingly good! When I first read the Hessian-Free optimization part, I thought "ugh, this is going to be full of math", but in the end it was very very enjoyable. I think I just wouldn't give it a 5 because it doesn't seem to have had that much impact.

[9]A Scalable Hierarchical Distributed Language Model

A. Mnih, Geoffrey E. Hinton - 2009

16 papers in library cite

Good paper that introduces hierarchical trees as an alternative to the expensive softmax output. I think this is not really relevant anymore, but it's a good read.

[10]Learning Recurrent Neural Networks With Hessian-Free Optimization

James Martens, Ilya Sutskever - 2011

13 papers in library cite

Meh, very very mathy and seems like a minor improvement on LSTMs. They give very impressive results, but I don't think it's much of a fair comparison (vs. a 1999 architecture of LSTMs). I think they did well in showing that exploding/vanishing gradients can be overcome, but their method is not "it". As said by Bengio, they are partially responsible for the revival of RNNs (alongside Mikolov).

[11]An Application of Recurrent Nets to Phone Probability Estimation

A. Robinson - 1994

9 papers in library cite

Very early work. Seems like a good overview of how to apply NNs to speech recog, but TBH there's nothing impressive here.

[12]Untersuchungen zu dynamischen neuronalen Netzen

Sepp Hochreiter - 1991

18 papers in library cite

[13]CUDAMat: A CUDA-based Matrix Class for Python

V. Mnih - 2009

5 papers in library cite

[14]The Human Knowledge Compression Contest

M. Hutter - 2012

4 papers in library cite

[15]Factored Conditional Restricted Boltzmann Machines for Modeling Motion Style

Graham W. Taylor, Geoffrey E. Hinton - 2009

3 papers in library cite

[16]The New York Times Annotated Corpus

E. Sandhaus - 2008

3 papers in library cite

[17]Adaptive Weighing of Context Models for Lossless Data Compression

M. Mahoney - 2005

2 papers in library cite

[18]Dynamic Bayesian Networks: Representation, Inference and Learning

Kevin P. Murphy - 2002

2 papers in library cite

[19]The BellKor Solution to the Netflix Prize

R. M. Bell, Y. Koren, C. Volinsky - 2007

2 papers in library cite

[20]A Stochastic Memoizer for Sequence Data

F. Wood, C. Archambeau, J. Gasthaus, L. James, Yee Whye Teh - 2009

1 paper in library cites

[21]Arithmetic Coding

J. Rissanen, G. G. Langdon - 1979

1 paper in library cites

[22]Dasher: A Data Entry Interface Using Continuous Gestures and Language Models

D. J. Ward, A. F. Blackwell, D. J. C. Mackay - 2000

1 paper in library cites

[23]Gnumpy: An Easy Way to Use GPU Boards in Python

T. Tieleman - 2010

1 paper in library cites

[24]Improving the Prediction of Protein Secondary Structure in Three and Eight Classes Using Recurrent Neural Networks and Profiles

G. Pollastri, D. Przybylski, B. Rost, P. Baldi - 2002

1 paper in library cites

[25]Lossless Compression Based on the Sequence Memoizer

J. Gasthaus, F. Wood, Yee Whye Teh - 2010

1 paper in library cites

[26]Observable Operator Models for Discrete Stochastic Time Series

Herbert Jaeger - 2000

1 paper in library cites

Cited by 13 papers in your library

Cites 11 papers in your library

Read on June 21, 2025

Pleasant paper, but the results are underwhelming. They use RNNs for character-level modeling, which is different from the usual word-level setup. They also use the Hessian-free method proposed by Martens, but don't go too deep into how it works, which is nice because otherwise it would be very mathy. Other papers cite this more as an example of usage than as an actual milestone.
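For readers who want a rough feel for the Hessian-free idea without the math, here is a toy sketch. It assumes the general recipe of solving for an update direction with conjugate gradient while only ever asking for curvature-times-vector products, so the curvature matrix is never built. Everything specific here is an assumption for illustration: the 2D quadratic (`A`, `b`), the finite-difference `hessian_vector_product`, and the helper names. The actual optimizer by Martens involves more machinery than this (for example damping heuristics and a different curvature matrix), so this is not a reproduction of it.

```python
# Toy objective: f(w) = 0.5 * w^T A w - b^T w, with A symmetric positive definite.
A = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]


def toy_grad(w):
    """Gradient of the toy quadratic: A w - b."""
    return [sum(a * x for a, x in zip(row, w)) - bi for row, bi in zip(A, b)]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def hessian_vector_product(grad_fn, w, v, eps=1e-6):
    """Approximate H v by a finite difference of gradients (no explicit H)."""
    g0 = grad_fn(w)
    g1 = grad_fn([wi + eps * vi for wi, vi in zip(w, v)])
    return [(x - y) / eps for x, y in zip(g1, g0)]


def conjugate_gradient(apply_matrix, rhs, iters=50, tol=1e-10):
    """Solve M x = rhs using only the matrix-vector product apply_matrix."""
    x = [0.0] * len(rhs)
    r = list(rhs)          # residual rhs - M x, with x = 0
    p = list(r)            # search direction
    rs = dot(r, r)
    for _ in range(iters):
        if rs < tol:
            break
        Mp = apply_matrix(p)
        alpha = rs / dot(p, Mp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * mi for ri, mi in zip(r, Mp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def hf_step(grad_fn, w, damping=0.0):
    """One update: solve (H + damping * I) d = -g with CG, then move by d."""
    g = grad_fn(w)

    def curvature(v):
        hv = hessian_vector_product(grad_fn, w, v)
        return [h + damping * vi for h, vi in zip(hv, v)]

    d = conjugate_gradient(curvature, [-gi for gi in g])
    return [wi + di for wi, di in zip(w, d)]


# Usage: on a quadratic, one undamped step lands (approximately) on the minimizer.
w = hf_step(toy_grad, [0.0, 0.0])
```

On this toy problem the minimizer solves `A w = b`, so the single step ends near `[0.2, 0.4]`. On a real network the loss is not quadratic, which is why practical variants repeat the step many times and add damping.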
