2017

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin

AI summary

This paper introduces the Transformer, a sequence transduction model that relies solely on attention mechanisms. It achieves state-of-the-art results on the WMT 2014 English-to-German and English-to-French translation tasks while being more parallelizable and requiring less training time than recurrent or convolutional models.

Main Contributions

  • Introduces the Transformer architecture, which relies entirely on attention mechanisms, dispensing with recurrence and convolutions.
  • Achieves state-of-the-art BLEU scores of 28.4 on the WMT 2014 English-to-German translation task and 41.8 on the WMT 2014 English-to-French translation task.
  • Demonstrates that the Transformer is more parallelizable and requires significantly less time to train than recurrent or convolutional models: the English-to-French model trained for 3.5 days on eight GPUs.
  • Shows that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing.
  • Introduces multi-head attention, allowing the model to jointly attend to information from different representation subspaces at different positions.
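To make the last contribution concrete, the following is a minimal sketch of multi-head attention in plain Python. It assumes the standard scaled dot-product formulation, softmax(QK^T / sqrt(d_k))V, which this page does not spell out. The helper names (`attention`, `multi_head_attention`, `random_matrix`) are hypothetical, and the projection weights are random placeholders rather than trained parameters, so it only illustrates the data flow and is not the authors' implementation.

```python
import math
import random

def softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    # a is (n x k), b is (k x m), both as lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def attention(q, k, v):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

def random_matrix(rows, cols, rng):
    # Placeholder weights (assumption): a real model learns these.
    scale = 1.0 / math.sqrt(rows)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(x, num_heads, rng):
    # Self-attention: queries, keys, and values all come from x.
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Each head projects x into its own lower-dimensional subspace.
        w_q = random_matrix(d_model, d_head, rng)
        w_k = random_matrix(d_model, d_head, rng)
        w_v = random_matrix(d_model, d_head, rng)
        out, _ = attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        heads.append(out)
    # Concatenate the heads position by position, then mix them with W^O.
    concat = [sum((h[i] for h in heads), []) for i in range(len(x))]
    w_o = random_matrix(d_model, d_model, rng)
    return matmul(concat, w_o)

rng = random.Random(0)
tokens = random_matrix(4, 8, rng)  # 4 positions, d_model = 8 (toy sizes)
output = multi_head_attention(tokens, num_heads=2, rng=rng)
print(len(output), len(output[0]))
```

Each head sees the full sequence but works in its own projected subspace, which is what lets different heads attend to different positions at once. Every position's output depends only on matrix products over the whole sequence, with no step-by-step recurrence.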

Abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

Tags

Attention, Transformers, Machine Translation
