2019

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Z. Dai, Zhilin Yang, Yiming Yang, W. Cohen, J. Carbonell, Quoc Le, Ruslan Salakhutdinov


AI summary

This paper introduces Transformer-XL, a neural architecture for language modeling that combines a segment-level recurrence mechanism with a new positional encoding scheme to capture longer-term dependencies. It achieves state-of-the-art results on the WikiText-103, enwik8, text8, One Billion Word, and Penn Treebank datasets.
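The segment-level recurrence idea can be illustrated with a minimal NumPy sketch. It assumes a single attention layer with one head and no positional encoding, and the names `attend_with_memory`, `run_segments`, `seg_len`, and `mem_len` are hypothetical, chosen only for this illustration. It is not the authors' implementation: it only shows how hidden states cached from earlier segments can be prepended to the keys and values of the current segment.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend_with_memory(h, mem, Wq, Wk, Wv):
    """Single-head causal attention where keys/values span [mem ; h].

    h:   (seg_len, d) hidden states of the current segment
    mem: (mem_len, d) cached hidden states from earlier segments
    """
    seg_len, d = h.shape
    mem_len = mem.shape[0]
    ctx = np.concatenate([mem, h], axis=0)      # extended context
    q = h @ Wq                                  # queries: current segment only
    k = ctx @ Wk                                # keys: memory + current segment
    v = ctx @ Wv
    scores = q @ k.T / np.sqrt(d)               # (seg_len, mem_len + seg_len)
    # Query i sits at position mem_len + i and must not see later keys.
    i = np.arange(seg_len)[:, None]
    j = np.arange(mem_len + seg_len)[None, :]
    scores = np.where(j > mem_len + i, -np.inf, scores)
    return softmax(scores) @ v

def run_segments(x, seg_len, mem_len, Wq, Wk, Wv):
    """Process x segment by segment, carrying a cache of up to mem_len states."""
    d = x.shape[1]
    mem = np.zeros((0, d))
    outputs = []
    for start in range(0, len(x), seg_len):
        h = x[start:start + seg_len]
        outputs.append(attend_with_memory(h, mem, Wq, Wk, Wv))
        # Cache this layer's inputs for the next segment. In an autodiff
        # framework the cache would be treated as a constant (no gradient).
        if mem_len > 0:
            mem = np.concatenate([mem, h], axis=0)[-mem_len:]
        else:
            mem = np.zeros((0, d))
    return np.concatenate(outputs, axis=0)
```

With `mem_len=0` each segment is processed in isolation, so the first token of a segment can attend only to itself. With a large enough `mem_len`, this toy model produces the same outputs as causal attention over the whole sequence.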

Main Contributions

  • Introduces a new architecture called Transformer-XL that leverages recurrence in a purely self-attentive model.
  • Derives a novel positional encoding scheme, enabling the model to reuse hidden states without temporal confusion.
  • Achieves significantly better results than RNNs on both character-level and word-level language modeling tasks.
  • Demonstrates a substantial speedup during evaluation compared to vanilla Transformers.
  • Shows that Transformer-XL can generate coherent text articles with thousands of tokens, even when trained on a medium-sized dataset.
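The "temporal confusion" mentioned in the second contribution can be made concrete with a small index-bookkeeping sketch. It assumes each segment restarts its position count at 0 and that the cache is no longer than one segment; the function names are hypothetical. It covers only the position indices, not the paper's full encoding scheme.

```python
import numpy as np

def absolute_positions(seg_len, mem_len):
    """Position index of each key in [memory ; current segment] when every
    segment counts from 0. Assumes mem_len <= seg_len."""
    mem_pos = np.arange(seg_len)[seg_len - mem_len:]
    return np.concatenate([mem_pos, np.arange(seg_len)])

def relative_distances(seg_len, mem_len):
    """dist[i, j] = number of steps key j lies behind query i
    (negative values are future keys, which a causal mask would hide)."""
    q_pos = mem_len + np.arange(seg_len)    # queries come after the memory
    k_pos = np.arange(mem_len + seg_len)
    return q_pos[:, None] - k_pos[None, :]
```

For `seg_len=4, mem_len=4`, `absolute_positions` gives `[0, 1, 2, 3, 0, 1, 2, 3]`: a cached token and a current token share the same index, so the model cannot tell which came first. The first row of `relative_distances` is `[4, 3, 2, 1, 0, -1, -2, -3]`, where every key has a distinct offset from the query regardless of which segment it came from.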

Abstract

Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers during evaluation. Notably, we improve the state-of-the-art results of bpc/perplexity to 0.99 on enwik8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably coherent, novel text articles with thousands of tokens. Our code, pretrained models, and hyperparameters are available in both TensorFlow and PyTorch.
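The abstract reports two kinds of numbers: bits per character (bpc) for the character-level datasets (enwik8, text8) and perplexity for the word-level ones (WikiText-103, One Billion Word, Penn Treebank). Both are rescalings of the average negative log-likelihood per predicted symbol. The sketch below shows the standard conversions, assuming the loss is measured in nats; the function names are hypothetical.

```python
import math

def bits_per_character(nll_nats):
    """Average per-character loss in nats -> bits per character."""
    return nll_nats / math.log(2)

def perplexity(nll_nats):
    """Average per-token loss in nats -> perplexity."""
    return math.exp(nll_nats)

def nll_from_bpc(bpc):
    """Inverse of bits_per_character."""
    return bpc * math.log(2)

def nll_from_perplexity(ppl):
    """Inverse of perplexity."""
    return math.log(ppl)
```

For example, `nll_from_perplexity(18.3)` gives the average per-token loss in nats that corresponds to the reported WikiText-103 perplexity. The two metrics are not directly comparable across datasets, since one is per character and the other per word.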

