Papperoni

2014

Sequence to Sequence Learning With Neural Networks

Ilya Sutskever, Oriol Vinyals, Quoc V. Le

citations

Cite Score

94

AI summary

This paper introduces a sequence-to-sequence learning approach using multilayered LSTMs for machine translation, achieving a BLEU score of 34.8 on the WMT'14 English to French translation task. Reversing the order of words in source sentences improves performance, and the LSTM model also learns meaningful sentence representations.

Main Contributions

Introduces a general end-to-end approach to sequence learning using LSTMs.
Achieves a BLEU score of 34.8 on the WMT'14 English to French translation task.
Demonstrates that reversing the order of words in source sentences improves LSTM performance.
Shows that LSTMs can learn sensible phrase and sentence representations.
Finds that deep LSTMs outperform shallow LSTMs.

Abstract

Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT'14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which is close to the previous best result on this task. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
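The source-reversal trick described in the abstract can be illustrated with a minimal preprocessing sketch. This is not the authors' code: the function name `reverse_source` and the toy token lists are hypothetical, and sentences are assumed to be already tokenized. Only the source side of each pair is reversed, so the first source words end up close to the first target words.

```python
from typing import List, Tuple

Tokens = List[str]


def reverse_source(pairs: List[Tuple[Tokens, Tokens]]) -> List[Tuple[Tokens, Tokens]]:
    """Reverse the token order of each source sentence; leave targets untouched.

    Hypothetical helper for illustration only.
    """
    return [(src[::-1], tgt) for src, tgt in pairs]


# Toy example: source "a b c" paired with target "alpha beta gamma".
pairs = [(["a", "b", "c"], ["alpha", "beta", "gamma"])]
print(reverse_source(pairs))
# → [(['c', 'b', 'a'], ['alpha', 'beta', 'gamma'])]
```

After reversal, "a" sits right next to the start of the target sequence, which is the kind of short-term dependency the abstract credits with making optimization easier.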

References [31]

[1]ImageNet Classification With Deep Convolutional Neural Networks

Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton - 2012

71 papers in library cite

I'm giving this a 5 just because of the impact, but this is VEEERY derivative of earlier work. Kudos to them for putting it all together, but really there's nothing revolutionary here.

[2]Long Short-Term Memory

Sepp Hochreiter, Jürgen Schmidhuber - 1997

94 papers in library cite

LSTMs FTW!

[3]Gradient-Based Learning Applied to Document Recognition

Yann Lecun, Leon Bottou, Yoshua Bengio, Patrick Haffner - 1998

62 papers in library cite

I absolutely hated this paper. It has ~50 pages but feels like 200. It takes too long to explain some things and often just repeats itself. It also doesn't seem to add much on top of LeNet-5, and it focuses a lot on GTNs, which really didn't stick.

[4]Learning Representations by Back-Propagating Errors

D. E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams - 1986

34 papers in library cite

Introduced backprop. Short and simple.

[5]Neural Machine Translation by Jointly Learning to Align and Translate

D. Bahdanau, Kyunghyun Cho, Yoshua Bengio - 2014

59 papers in library cite

Introduces the attention mechanism - amazing overall

[6]Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation

Kyunghyun Cho, B. V. Merrienboer, C. G. Gulcehre, D. Bahdanau, F. Bougares, Holger Schwenk, Yoshua Bengio - 2014

38 papers in library cite

Introduces RNN encoder-decoder. I love it :)

[7]BLEU: A Method for Automatic Evaluation of Machine Translation

K. Papineni, S. Roukos, T. Ward, Wei-Jing Zhu - 2002

19 papers in library cite

Very cool idea. Simple yet very impactful!

[8]Deep Neural Networks for Acoustic Modeling in Speech Recognition

Geoffrey Hinton - 2012

21 papers in library cite

The core of the paper itself is a bit boring and doesn't introduce anything new (just RBMs and DBNs again) but I am giving this a 4 because it's probably the best explanation of RBMs and DBNs I've read so far.

[9]Learning Long-Term Dependencies With Gradient Descent Is Difficult

Yoshua Bengio, Patrice Simard, Paolo Frasconi - 1994

31 papers in library cite

The first to notice that gradient descent struggles to learn long-term dependencies, but way too mathy for me.

[10]A Neural Probabilistic Language Model

Yoshua Bengio, R. Ducharme, Pascal Vincent - 2001

62 papers in library cite

What started it all. Very simple and elegant.

[11]On the Difficulty of Training Recurrent Neural Networks

Razvan Pascanu, Tomas Mikolov, Yoshua Bengio - 2013

21 papers in library cite

It starts very mathy but in the end there are some very nice contributions! You don't actually need to understand the math to know what's going on in the end.

[12]Recurrent Neural Network Based Language Model

Tomas Mikolov, M. Karafiat, Lukas Burget, Jan Cernocky, Sanjeev Khudanpur - 2010

36 papers in library cite

The comeback of RNNs for language modeling. Not too exciting but impactful and a short read.

[13]Connectionist Temporal Classification: Labelling Unsegmented Sequence Data With Recurrent Neural Networks

Alex Graves, Santiago Fernandez, Faustino Gomez, Jürgen Schmidhuber - 2006

7 papers in library cite

It's a bit lukewarm. Nice idea but execution was a bit meh. I also think the prefix search was unnecessarily complex and loses to beam search (as they admit later on)

[14]Backpropagation Through Time: What It Does and How to Do It

P. Werbos - 1990

9 papers in library cite

Amazing tutorial! Very pragmatic. Explains very basic concepts and focuses on implementation.

[15]Multi-Column Deep Neural Networks for Image Classification

Dan C. Ciresan, Ueli Meier, Jürgen Schmidhuber - 2012

11 papers in library cite

Very nice paper! And I am impressed they used CNNs before Hinton's paper. It's a shame there are so few citations. They also propose max-pooling and actually give a good explanation about it.

[16]Generating Sequences With Recurrent Neural Networks

Alex Graves - 2013

27 papers in library cite

Very cool, and it is the first to actually propose the attention mechanism! It gets a bit mathy but nothing too crazy. It also has the first examples of good machine-generated writing I've seen in these papers, so very nice results.

[17]Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition

G. Dahl, D. Yu, L. Deng, Alex Acero - 2012

19 papers in library cite

Good paper, very well written and probably the best explanation of RBMs and DBNs I've seen. However, I don't see a lot of impact, and it seems very derivative of other work.

[18]Building High-Level Features Using Large Scale Unsupervised Learning

Quoc V. Le, M. A. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, Jeffrey Dean, Andrew Y. Ng - 2012

10 papers in library cite

Very nice and very early work - seems very simple but very insightful to use an autoencoder to detect objects. Also, very similar to the neocognitron :)

[19]LSTM Neural Networks for Language Modeling

M. Sundermeyer, R. Schluter, Hermann Ney - 2010

7 papers in library cite

It reads like an undergrad project - doesn't add anything new.

[20]Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies

Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, Jürgen Schmidhuber - 2001

16 papers in library cite

Wow, this is so much better than the other paper - I should have read it sooner. It's concise and not too abstract, and also gives very good context on RNN problems and how to solve them.

[21]Recurrent Continuous Translation Models

N. Kalchbrenner, Phil Blunsom - 2013

27 papers in library cite

Good paper, probably the first that used an encoder-decoder. But they used a conv. NN instead of a traditional decoder, which I don't really like.

[22]LSTM Can Solve Hard Long Time Lag Problems

Sepp Hochreiter, Jürgen Schmidhuber - 1997

5 papers in library cite

Doesn't add much on top of what we already know. Good paper nonetheless.

[23]Fast and Robust Neural Network Joint Models for Statistical Machine Translation

Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, John Makhoul - 2014

9 papers in library cite

Simple and very nice that it showed that NNs can stand on their own for NMT.

[24]Multilingual Distributed Representations Without Word Alignment

K. M. Hermann, Phil Blunsom - 2014

3 papers in library cite

They embed the sentence, but can't get a translation directly

[25]Untersuchungen zu dynamischen neuronalen Netzen

Sepp Hochreiter - 1991

18 papers in library cite

[26]Statistical Language Models Based on Neural Networks

Tomas Mikolov - 2012

17 papers in library cite

Mikolov's Thesis

[27]Edinburgh's Phrase-Based Machine Translation Systems for WMT-14

N. Durrani, B. Haddow, P. Koehn, K. Heafield - 2014

6 papers in library cite

[28]Joint Language and Translation Modeling With Recurrent Neural Networks

Michael Auli, M. Galley, C. Quirk, Geoffrey Zweig - 2013

3 papers in library cite

[29]Overcoming the Curse of Sentence Length for Neural Machine Translation Using Automatic Segmentation

J. P. Abadie, D. Bahdanau, B. V. Merrienboer, Kyunghyun Cho, Yoshua Bengio - 2014

2 papers in library cite

[30]University le mans

Holger Schwenk - 2014

2 papers in library cite

[31]On Small Depth Threshold Circuits

A. Razborov - 1992

1 paper in library cites

Cited by 58 papers in your library

Cites 24 papers in your library

Read on June 20, 2025

Good paper, but I think it only got famous because they set a strong new baseline for NNs in MT. TBH, their main contribution was reversing the source sentence.
