Papperoni

2019

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Z. Dai, Zhilin Yang, Yiming Yang, W. Cohen, J. Carbonell, Quoc Le, Ruslan Salakhutdinov

Cite Score: 75

AI summary

This paper introduces Transformer-XL, a novel neural architecture for language modeling that uses a segment-level recurrence mechanism and a novel positional encoding scheme to capture longer-term dependencies, achieving state-of-the-art results on WikiText-103, enwik8, text8, One Billion Word, and Penn Treebank datasets.

Main Contributions

Introduces a new architecture called Transformer-XL that leverages recurrence in a purely self-attentive model.
Derives a novel positional encoding scheme, enabling the model to reuse hidden states without temporal confusion.
Achieves significantly better results than RNNs on both character-level and word-level language modeling tasks.
Demonstrates a substantial speedup during evaluation compared to vanilla Transformers.
Shows that Transformer-XL can generate coherent text articles with thousands of tokens, even when trained on a medium-sized dataset.
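
The segment-level recurrence idea in the contributions above can be sketched roughly as follows. This is a minimal single-layer, single-head NumPy illustration, not the authors' implementation: the function names (`attend_with_memory`, `run_segments`) and the weight matrices are hypothetical, the relative positional encoding is left out entirely, and in a real multi-layer model the cached states would come from the layer below and be excluded from backpropagation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend_with_memory(h, mem, Wq, Wk, Wv):
    """Causal attention over [cached memory ; current segment].

    h:   (seg_len, d) states of the current segment
    mem: (mem_len, d) states cached from earlier segments (treated as constants)
    """
    seg_len, mem_len = h.shape[0], mem.shape[0]
    context = np.concatenate([mem, h], axis=0)   # keys/values also see the memory
    q = h @ Wq                                   # queries come from the current segment only
    k = context @ Wk
    v = context @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (seg_len, mem_len + seg_len)
    # Position i may attend to all of the memory and to current positions <= i.
    future = np.triu(np.ones((seg_len, seg_len), dtype=bool), k=1)
    mask = np.concatenate([np.zeros((seg_len, mem_len), dtype=bool), future], axis=1)
    scores = np.where(mask, -np.inf, scores)
    return softmax(scores) @ v

def run_segments(x, seg_len, mem_len, Wq, Wk, Wv):
    """Process x segment by segment, carrying the last mem_len states forward."""
    mem = np.zeros((0, x.shape[1]))
    outputs = []
    for start in range(0, x.shape[0], seg_len):
        h = x[start:start + seg_len]
        outputs.append(attend_with_memory(h, mem, Wq, Wk, Wv))
        combined = np.concatenate([mem, h], axis=0)
        keep = min(mem_len, combined.shape[0])
        mem = combined[combined.shape[0] - keep:]  # keep only the newest states
    return np.concatenate(outputs, axis=0)
```

With `mem_len=0` this reduces to the fixed-length setting, where each segment is processed in isolation. With a non-zero memory, tokens in a later segment can draw on states computed for earlier segments without recomputing them.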

Abstract

Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers during evaluation. Notably, we improve the state-of-the-art results of bpc/perplexity to 0.99 on enwik8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably coherent, novel text articles with thousands of tokens. Our code, pretrained models, and hyperparameters are available in both TensorFlow and PyTorch.
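
The abstract reports results as bpc (bits per character, for the character-level datasets) and perplexity (for the word-level datasets). As a rough reminder of how the two metrics relate to a model's average negative log-likelihood, here is a small sketch. The helper names are hypothetical, and it assumes the per-token losses are given in nats.

```python
import math

def perplexity(nll_nats):
    """Perplexity: exp of the mean negative log-likelihood (in nats) per token."""
    return math.exp(sum(nll_nats) / len(nll_nats))

def bits_per_character(nll_nats):
    """bpc: mean negative log-likelihood per character, converted from nats to bits."""
    return sum(nll_nats) / len(nll_nats) / math.log(2)

# A model that is uniformly unsure among 4 options at every step
# has a perplexity of about 4 and a cost of about 2 bits per step.
uniform_over_4 = [math.log(4)] * 10
ppl = perplexity(uniform_over_4)
bpc = bits_per_character(uniform_over_4)
```

Lower is better for both metrics. Since bpc is a base-2 log loss, 2 raised to the bpc gives the equivalent per-character perplexity.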

References [63]

[1]Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin - 2017

47 papers in library cite

I mean... it introduced Transformers!

[2]BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, M. W. Chang, K. Lee, Kristina Toutanova - 2018

39 papers in library cite

Simply amazing. It's very impressive how they make a leap vs. existing stuff (you can see from the references, pretty much no one is doing what they are doing, other than GPT)

[3]Long Short-Term Memory

Sepp Hochreiter, Jürgen Schmidhuber - 1997

94 papers in library cite

LSTMs FTW!

[4]Neural Machine Translation by Jointly Learning to Align and Translate

D. Bahdanau, Kyunghyun Cho, Yoshua Bengio - 2014

59 papers in library cite

Introduces the attention mechanism - amazing overall

[5]Deep Contextualized Word Representations

M. E. Peters, M. Neumann, M. Iyyer, Matt Gardner, C. Clark, K. Lee, L. S. Zettlemoyer - 2018

27 papers in library cite

I didn't really like the approach. Seems a bit derivative TBH. BERT seems more elegant.

[6]Improving Language Understanding by Generative Pre-Training

Alec Radford, K. Narasimhan, T. Salimans, Ilya Sutskever - 2018

23 papers in library cite

Very simple and very nice! Easy to understand and revolutionary maybe?

[7]A Neural Probabilistic Language Model

Yoshua Bengio, R. Ducharme, Pascal Vincent - 2001

62 papers in library cite

What started it all. Very simple and elegant.

[8]Recurrent Neural Network Based Language Model

Tomas Mikolov, M. Karafiat, Lukas Burget, Jan Cernocky, Sanjeev Khudanpur - 2010

36 papers in library cite

The comeback of RNNs for language modeling. Not too exciting but impactful and a short read.

[9]Generating Sequences With Recurrent Neural Networks

Alex Graves - 2013

27 papers in library cite

Very cool, and the first to actually propose the attention mechanism! It gets a bit mathy but nothing too crazy. It also has the first examples of good machine-generated writing I've seen in these papers, so very nice results.

[10]Recurrent Neural Network Regularization

Wojciech Zaremba, Ilya Sutskever, Oriol Vinyals - 2014

22 papers in library cite

It's a very simple idea and TBH it's nothing different from dropout. It's good that it's a very short paper and very straightforward, but could be a paragraph long.

[11]Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, K. Maziarz, A. Davis, Quoc Le, Geoffrey Hinton, Jeffrey Dean - 2017

9 papers in library cite

It's nice, but there's an important section in the middle about batch sizes that I don't quite understand. Not sure if I am missing some background knowledge or if they explain it poorly, and seems foundational to their main method... Either way, I did understand the methodology of the paper, and they have nice results :)

[12]Pointer Sentinel Mixture Models

S. Merity, Caiming Xiong, J. Bradbury, Richard Socher - 2017

12 papers in library cite

I really liked the methodology, but I had to read it a few times to understand it intuitively - I think they should have done a better job at explaining it.

[13]Neural Turing Machines

Alex Graves, G. Wayne, Ivo Danihelka - 2014

18 papers in library cite

This paper is amazing. If someone told me that NNs could use and address memory by position I wouldn't believe it worked. Very nice, but it's a shame that it's just a toy example.

[14]Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies

Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, Jürgen Schmidhuber - 2001

16 papers in library cite

Wow, this is so much better than the other paper - I should have read it sooner. It's concise and not too abstract, and also gives very good context on RNN problems and how to solve them.

[15]Memory Networks

Jason Weston, S. Chopra, Antoine Bordes - 2015

18 papers in library cite

The first half of the paper (where they discuss the concept in a very abstract way) is amazing. However, the actual methodology was very convoluted - I did not like it. I thought that Neural Turing Machines were inspired by this, but actually the two are contemporaneous... So anyway, the concept is nice, the execution is not.

[16]A Theoretically Grounded Application of Dropout in Recurrent Neural Networks

Yarin Gal, Zoubin Ghahramani - 2015

9 papers in library cite

The methodology is very simple and effective, but maybe it was too simple for a paper, so they put A TON of math on top of it... Unnecessary really.

[17]Semi-Supervised Sequence Learning

A. M. Dai, Quoc V. Le - 2015

27 papers in library cite

Very good paper that was probably the first to introduce pre-training in NLP!

[18]Exploring the Limits of Language Modeling

R. Jozefowicz, Oriol Vinyals, M. Schuster, Noam Shazeer, Yonghui Wu - 2016

20 papers in library cite

It's funny because at first I did not like it, but then it clicked and I really liked it - they are trying to get around the large-vocabulary and rare-word problems. In the end it's SotA, but I think it's too convoluted and was replaced by Transformers.

[19]Hierarchical Probabilistic Neural Network Language Model

F. Morin, Yoshua Bengio - 2005

19 papers in library cite

Nice paper overall. Seems very impactful despite not being as relevant right now.

[20]One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling

C. Chelba, Tomas Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, Tony Robinson - 2013

13 papers in library cite

It's somewhat shallow, but I can see the importance of this paper.

[21]Using the Output Embedding to Improve Language Models

O. Press, Lior Wolf - 2017

7 papers in library cite

I did not like this paper at all - the paper is not bad, it's just that I expected *way* more. Good results but uninteresting.

[22]Context Dependent Recurrent Neural Network Language Model

Tomas Mikolov, Geoffrey Zweig - 2012

12 papers in library cite

Nothing too interesting, just using the context of the RNN.

[23]Character-Level Language Modeling With Deeper Self-Attention

R. Al-Rfou, D. Choe, Noah Constant, M. Guo, Llion Jones - 2018

6 papers in library cite

It's good, and almost a 4, but I think it is a bit boring. Plus character LMs are not meta.

[24]Adaptive Input Representations for Neural Language Modeling

A. Baevski, Michael Auli - 2018

3 papers in library cite

I like the idea and it's a nice paper. Maybe it is a bit outdated now that everything is wordpiece, but I like that they reduce the number of parameters.

[25]Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context

U. Khandelwal, He He, P. Qi, Dan Jurafsky - 2018

2 papers in library cite

Very nice paper and an analysis that makes sense! People thought that LSTMs had "infinite memory", but this paper shows that it is not so!

[26]Improving Neural Language Models With a Continuous Cache

E. Grave, Armand Joulin, Nicolas Usunier - 2016

7 papers in library cite

This was a surprise to me - I expected this to suck. However, they can provide a simple and intuitive way of improving LMs - nice

[27]Efficient Softmax Approximation for GPUs

E. Grave, Armand Joulin, M. Cisse, D. Grangier, Hervé Jégou - 2017

4 papers in library cite

I loved the methodology and the idea behind it. It's good to see some practical improvements (rather than methodological). I just think it's a bit tough to read. I had to ask AI to help explain some stuff.

[28]Language Modeling With Gated Convolutional Networks

Yann N. Dauphin, A. Fan, Michael Auli, D. Grangier - 2016

8 papers in library cite

[29]Self-Attention With Relative Position Representations

P. Shaw, Jakob Uszkoreit, Ashish Vaswani - 2018

1 paper in library cites

[30]HyperNetworks

D. Ha, Andrew Dai, Quoc V. Le - 2016

3 papers in library cite

Uses one network to generate the weights for another network.

[31]Regularizing and Optimizing LSTM Language Models

S. Merity, Nitish Shirish Keskar, Richard Socher - 2017

6 papers in library cite

SotA LM in 2017

[32]An Improved Relative Self-Attention Mechanism for Transformer With Application to Music Generation

C. Z. A. Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, C. Hawthorne, A. M. Dai, M. D. Hoffman, D. Eck - 2018

1 paper in library cites

[33]A Simple Way to Initialize Recurrent Networks of Rectified Linear Units

Quoc V. Le, Navdeep Jaitly, Geoffrey E. Hinton - 2015

2 papers in library cite

[34]A Clockwork RNN

J. Koutnik, K. Greff, Faustino Gomez, Jürgen Schmidhuber - 2014

4 papers in library cite

Interesting title, not well cited... I added it because it's Schmidhuber.

[35]Neural Machine Translation in Linear Time

N. Kalchbrenner, L. Espeholt, K. Simonyan, A. V. D. Oord, Alex Graves, Koray Kavukcuoglu - 2016

5 papers in library cite

Bytenet - Also "linear time" caught my attention

[36]Recurrent Highway Networks

J. G. Zilly, R. K. Srivastava, J. Koutnik, Jürgen Schmidhuber - 2016

6 papers in library cite

Recurrent highway networks - maybe read after the original highway network

[37]Recurrent Batch Normalization

T. Cooijmans, Nicolas Ballas, C. Laurent, Aaron Courville - 2016

3 papers in library cite

Method for implementing BN for RNNs

[38]Mesh-Tensorflow: Deep Learning for Supercomputers

Noam Shazeer, Y. Cheng, Niki Parmar, D. Tran, Ashish Vaswani, P. Koanantakool, P. Hawkins, Honglak Lee, M. Hong, C. Young - 2018

4 papers in library cite

[39]Learning Longer Memory in Recurrent Neural Networks

Tomas Mikolov, Armand Joulin, S. Chopra, M. Mathieu, Marc'aurelio Ranzato - 2015

8 papers in library cite

RNNs + longer memory

[40]Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling

H. Inan, K. Khosravi, Richard Socher - 2017

6 papers in library cite

[41]Breaking the Softmax Bottleneck: A High-Rank RNN Language Model

Zhilin Yang, Z. Dai, Ruslan Salakhutdinov, W. W. Cohen - 2017

4 papers in library cite

[42]Larger-Context Language Modelling

Tian Wang, Kyunghyun Cho - 2015

4 papers in library cite

[43]Document Context Language Models

Yangfeng Ji, T. Cohn, L. Kong, C. Dyer, J. Eisenstein - 2015

3 papers in library cite

[44]Multiplicative LSTM for Sequence Modelling

B. Krause, L. Lu, I. Murray, S. Renals - 2016

3 papers in library cite

[45]TopicRNN: A Recurrent Neural Network With Long-Range Semantic Dependency

A. B. Dieng, Chong Wang, Jianfeng Gao, John Paisley - 2016

3 papers in library cite

[46]Understanding the Exploding Gradient Problem

Razvan Pascanu, Tomas Mikolov, Yoshua Bengio - 2012

3 papers in library cite

[47]An Analysis of Neural Language Modeling at Multiple Scales

S. Merity, Nitish Shirish Keskar, Richard Socher - 2018

2 papers in library cite

[48]Factorization Tricks for LSTM Networks

O. Kuchaiev, B. Ginsburg - 2017

2 papers in library cite

[49]Fast Parametric Learning With Activation Memorization

J. W. Rae, C. Dyer, Peter Dayan, T. P. Lillicrap - 2018

2 papers in library cite

[50]Fast-Slow Recurrent Neural Networks

A. Mujika, F. Meier, A. Steger - 2017

2 papers in library cite

[51]Hierarchical Multiscale Recurrent Neural Networks

J. Chung, S. Ahn, Yoshua Bengio - 2016

2 papers in library cite

[52]Independently Recurrent Neural Network (IndRNN): Building a Longer and Deeper RNN

Shuai Li, Wanqing Li, C. Cook, C. Zhu, Y. Gao - 2018

2 papers in library cite

[53]Neural Architecture Search With Reinforcement Learning

Barret Zoph, Quoc V. Le - 2017

2 papers in library cite

[54]An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling

Shaojie Bai, J. Zico Kolter, V. Koltun - 2018

1 paper in library cites

[55]DARTS: Differentiable Architecture Search

Hanxiao Liu, K. Simonyan, Yiming Yang - 2018

1 paper in library cites

[56]Efficient Neural Architecture Search via Parameter Sharing

H. Pham, M. Y. Guan, Barret Zoph, Quoc V. Le, Jeffrey Dean - 2018

1 paper in library cites

[57]Pushing the Bounds of Dropout

Gábor Melis, C. Blundell, T. Kocisky, K. M. Hermann, C. Dyer, Phil Blunsom - 2018

1 paper in library cites

[58]Learning Longer-Term Dependencies in RNNs With Auxiliary Losses

T. H. Trinh, A. M. Dai, T. Luong, Quoc V. Le - 2018

1 paper in library cites

[59]On Multiplicative Integration With Recurrent Neural Networks

Yuhuai Wu, S. Zhang, Y. Zhang, Yoshua Bengio, Ruslan R. Salakhutdinov - 2016

1 paper in library cites

[60]Sigsoftmax: Reanalysis of the Softmax Bottleneck

S. Kanai, Y. Fujiwara, Y. Yamanaka, S. Adachi - 2018

1 paper in library cites

[61]Skip-Gram Language Modeling Using Sparse Non-Negative Matrix Probability Estimation

Noam Shazeer, J. Pelemans, C. Chelba - 2014

1 paper in library cites

[62]Sparse Attentive Backtracking: Temporal Credit Assignment Through Reminding

N. R. Ke, A. G. A. P. Goyal, O. Bilaniuk, J. Binas, M. C. Mozer, C. Pal, Yoshua Bengio - 2018

1 paper in library cites

[63]Topic Compositional Neural Language Model

Wenlin Wang, Z. Gan, Wenqi Wang, D. Shen, J. Huang, W. Ping, S. Satheesh, L. Carin - 2017

1 paper in library cites

Cited by 9 papers in your library

Cites 39 papers in your library

Read on November 14, 2025

It's so cool to see context expansion without the need to actually expand the context! Such a simple concept and so effective!
