Papperoni

2017

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin

Cite Score: 100

AI summary

This paper introduces the Transformer, a novel sequence transduction model relying solely on attention mechanisms and achieving state-of-the-art results on WMT 2014 English-to-German and English-to-French translation tasks while being more parallelizable and requiring less training time.

Main Contributions

Introduces the Transformer architecture, which relies entirely on attention mechanisms, dispensing with recurrence and convolutions.
Achieves state-of-the-art BLEU scores of 28.4 on the WMT 2014 English-to-German translation task and 41.8 on the WMT 2014 English-to-French translation task.
Demonstrates that the Transformer is more parallelizable and requires significantly less time to train compared to existing models.
Shows that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing.
Introduces multi-head attention, allowing the model to jointly attend to information from different representation subspaces at different positions.
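
The multi-head attention idea in the last contribution can be sketched in plain Python. This is a minimal illustration, not the paper's reference implementation: the helper names (`softmax`, `matmul`, `transpose`, `attention`, `multi_head_attention`) and the layout of one projection matrix per head are assumptions made here, and masking, dropout, and batching are left out.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Returns the attended values and the attention weights.
    """
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


def multi_head_attention(x, w_q, w_k, w_v, w_o):
    """Self-attention with several heads (hypothetical weight layout).

    x:             seq_len x d_model input
    w_q, w_k, w_v: lists with one d_model x d_head matrix per head
    w_o:           (num_heads * d_head) x d_model output projection
    """
    heads = []
    for wq, wk, wv in zip(w_q, w_k, w_v):
        # Each head projects the input into its own subspace and attends there.
        out, _ = attention(matmul(x, wq), matmul(x, wk), matmul(x, wv))
        heads.append(out)
    # Concatenate the head outputs position by position, then mix them.
    concat = [sum((h[i] for h in heads), []) for i in range(len(x))]
    return matmul(concat, w_o)
```

Calling `multi_head_attention(x, w_q, w_k, w_v, w_o)` with a `seq_len x d_model` input returns a matrix of the same shape. Because each head has its own projections, each one can weight the positions differently, which is the "different representation subspaces" point made above.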

Abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

References [40]

[1]Deep Residual Learning for Image Recognition

K. He, X. Zhang, S. Ren, Jian Sun - 2016

20 papers in library cite

This is simply amazing. Very very simple idea, totally revolutionary. No maths, just "it works!". Amazing.

[2]Adam: A Method for Stochastic Optimization

D. P. Kingma, Jimmy Lei Ba - 2014

49 papers in library cite

Amazing paper! Very well explained and huge impact. I am amazed that they made something so simple even though it requires a lot of background mathematical knowledge.

[3]Long Short-Term Memory

Sepp Hochreiter, Jürgen Schmidhuber - 1997

94 papers in library cite

LSTMs FTW!

[4]Dropout: A Simple Way to Prevent Neural Networks From Overfitting

N. Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan R. Salakhutdinov - 2014

20 papers in library cite

Good paper, but it's mostly a review of the method described in the other paper with more results. It's longer as well, so I would suggest just reading the other one.

[5]Rethinking the Inception Architecture for Computer Vision

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna - 2015

5 papers in library cite

It's nice to see all of the performance optimizations they do, but it's very derivative

[6]Neural Machine Translation by Jointly Learning to Align and Translate

D. Bahdanau, Kyunghyun Cho, Yoshua Bengio - 2014

59 papers in library cite

Introduces the attention mechanism - amazing overall

[7]Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation

Kyunghyun Cho, B. V. Merrienboer, C. G. Gulcehre, D. Bahdanau, F. Bougares, Holger Schwenk, Yoshua Bengio - 2014

38 papers in library cite

Introduces RNN encoder-decoder. I love it :)

[8]Sequence to Sequence Learning With Neural Networks

Ilya Sutskever, Oriol Vinyals, Quoc V. Le - 2014

58 papers in library cite

Good paper, but I think it only got famous because they set a new good baseline for NNs in MT. Their main contribution was reversing the source sentence TBH.

[9]Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

J. Chung, C. G. Gulcehre, Kyunghyun Cho, Yoshua Bengio - 2014

11 papers in library cite

It's a good paper but results are veeeery underwhelming.

[10]Layer Normalization

Jimmy Lei Ba, R. Kiros, Geoffrey E. Hinton - 2016

14 papers in library cite

Very nice! At first I had a little bit of prejudice because it seemed way too mathy, but actually the math is easy to follow and the results are very nice.

[11]Effective Approaches to Attention-Based Neural Machine Translation

T. Luong, H. Pham, Christopher D. Manning - 2015

15 papers in library cite

Good paper, but very derivative. Attention methods start getting very complicated... I understand why Transformers took over TBH

[12]Building a Large Annotated Corpus of English: The Penn Treebank

M. P. Marcus, B. Santorini, Mary Ann Marcinkiewicz - 1993

22 papers in library cite

Well, not really interesting, but very cool to see how the Penn Treebank was made.

[13]Neural Machine Translation of Rare Words with Subword Units

R. Sennrich, B. Haddow, Alexandra Birch - 2016

22 papers in library cite

Very good! Simple, explains quite a lot and good results. Forms the basis for a lot of stuff now!

[14]Google's Neural Machine Translation System: Bridging the Gap Between Human and Machine Translation

Yonghui Wu, M. Schuster, Zhifeng Chen, Quoc V. Le, M. Norouzi, W. Macherey, M. Krikun, Yuan Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, Xiaobing Liu, Lukasz Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, Wei Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, Oriol Vinyals, G. S. Corrado, M. Hughes, Jeffrey Dean - 2016

15 papers in library cite

It's a very good paper, but TBH it doesn't bring anything new other than joining a bunch of existing stuff. I think it ended up being foundational because it's Google and several people used it as a base for future research. Good contribution then :)

[15]Generating Sequences With Recurrent Neural Networks

Alex Graves - 2013

27 papers in library cite

Very cool, and it's the first to actually propose the attention mechanism! It gets a bit mathy but nothing too crazy. Also has the first examples of good machine-generated writing I've seen in these papers, so very nice results.

[16]Convolutional Sequence to Sequence Learning

J. Gehring, Michael Auli, D. Grangier, D. Yarats, Yann Dauphin - 2017

3 papers in library cite

It's a good paper. I liked that it is very similar to the Transformer methodology (although the Transformer doesn't use convolutions). I need to go back and read the Transformer paper to draw parallels.

[17]Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, K. Maziarz, A. Davis, Quoc Le, Geoffrey Hinton, Jeffrey Dean - 2017

9 papers in library cite

It's nice, but there's an important section in the middle about batch sizes that I don't quite understand. Not sure if I am missing some background knowledge or if they explain it poorly, and it seems foundational to their main method... Either way, I did understand the methodology of the paper, and they have nice results :)

[18]End-to-End Memory Networks

S. Sukhbaatar, A. Szlam, Jason Weston, Rob Fergus - 2015

18 papers in library cite

This was so surprising! This is very similar to transformers and RAG. Who knew?!

[19]A Structured Self-Attentive Sentence Embedding

Zhouhan Lin, M. Feng, C. D. Santos, M. Yu, Bing Xiang, B. Zhou, Yoshua Bengio - 2017

2 papers in library cite

It's a bit poorly written and I don't like their 2D embeddings, but the visualizations are nice :)

[20]Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies

Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, Jürgen Schmidhuber - 2001

16 papers in library cite

Wow, this is so much better than the other paper - I should have read it sooner. It's concise and not too abstract, and also gives very good context on RNN problems and how to solve them.

[21]A Deep Reinforced Model for Abstractive Summarization

R. Paulus, Caiming Xiong, Richard Socher - 2017

7 papers in library cite

It's nice that they introduce intra-attention and RL, but at this point I think a lot of the work in attention is very derivative.

[22]Long Short-Term Memory-Networks for Machine Reading

Jianpeng Cheng, Li Dong, Mirella Lapata - 2016

8 papers in library cite

I read this more as an example of intra-attention, but this is not the main focus of the paper. I think visualization/explanation is a bit bad, and it doesn't seem too impactful. I kept thinking that this is starting to get too complicated, and indeed it was surpassed by transformers right after that.

[23]Exploring the Limits of Language Modeling

R. Jozefowicz, Oriol Vinyals, M. Schuster, Noam Shazeer, Yonghui Wu - 2016

20 papers in library cite

It's funny because at first I did not like it, but then it clicked and I really liked it - they are trying to get around the large dictionary and the rare word problem. In the end it's SotA, but I think it's too convoluted and was replaced by Transformers.

[24]Grammar as a Foreign Language

Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, Geoffrey Hinton - 2015

9 papers in library cite

It's a nice paper showing that attention can be used for parsing. However, parsing is boring and is very derivative. Good paper nonetheless.

[25]Multi-Task Sequence to Sequence Learning

M. T. Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, Lukasz Kaiser - 2015

4 papers in library cite

Very nice paper, but everything I read about S2S sounds very derivative at this point. Nonetheless, it's nice to see MTL again. It still seems like somewhat uncharted territory, despite being very promising.

[26]Using the Output Embedding to Improve Language Models

O. Press, Lior Wolf - 2017

7 papers in library cite

I did not like this paper at all. The paper is not bad, it's just that I expected *way* more. Good results but uninteresting.

[27]Massive Exploration of Neural Machine Translation Architectures

D. Britz, Anna Goldie, M. Luong, Quoc Le - 2017

1 paper in library cites

Very good review - these sorts of papers are needed! Good to see how HPs impact results

[28]Neural GPUs Learn Algorithms

Lukasz Kaiser, Ilya Sutskever - 2016

5 papers in library cite

Results and architecture are nice but TBH it is very poorly written... Doesn't seem like a lot of impact either.

[29]Can Active Memory Replace Attention?

Lukasz Kaiser, Samy Bengio - 2016

2 papers in library cite

So nice to see an alternative that works as well as attention!

[30]Xception: Deep Learning With Depthwise Separable Convolutions

Francois Chollet - 2016

2 papers in library cite

Seems important

[31]Neural Machine Translation in Linear Time

N. Kalchbrenner, L. Espeholt, K. Simonyan, A. V. D. Oord, Alex Graves, Koray Kavukcuoglu - 2016

5 papers in library cite

ByteNet - Also "linear time" caught my attention

[32]Deep Recurrent Models With Fast-Forward Connections for Neural Machine Translation

Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, Wei Xu - 2016

5 papers in library cite

[33]Effective Self-Training for Parsing

D. McClosky, E. Charniak, M. Johnson - 2006

4 papers in library cite

[34]Learning Accurate, Compact, and Interpretable Tree Annotation

Slav Petrov, L. Barrett, R. Thibaux, Dan Klein - 2006

4 papers in library cite

[35]Factorization Tricks for LSTM Networks

O. Kuchaiev, B. Ginsburg - 2017

2 papers in library cite

[36]Fast and Accurate Shift-Reduce Constituent Parsing

Muhua Zhu, Yue Zhang, Wenliang Chen, Min Zhang, Jingbo Zhu - 2013

2 papers in library cite

[37]Self-Training PCFG Grammars With Latent Annotations Across Languages

Zhongqiang Huang, M. Harper - 2009

2 papers in library cite

[38]A Decomposable Attention Model

A. P. Parikh, O. Tackstrom, Dipanjan Das, Jakob Uszkoreit - 2016

1 paper in library cites

[39]Recurrent Neural Network Grammars

C. Dyer, A. Kuncoro, M. Ballesteros, N. Smith - 2016

1 paper in library cites

[40]Structured Attention Networks

Yoon Kim, C. Denton, L. Hoang, A. Rush - 2017

1 paper in library cites

Cited by 47 papers in your library

Cites 31 papers in your library

Read on April 22, 2025

I mean... it introduced Transformers!

Tags

Attention, Transformers, Machine Translation
