2018

Character-Level Language Modeling With Deeper Self-Attention

R. Al-Rfou, D. Choe, Noah Constant, M. Guo, Llion Jones

AI summary

This paper introduces a character-level language model built on a deep (64-layer) transformer. It achieves state-of-the-art results on text8 and enwik8 (1.13 and 1.06 bits per character, respectively) by adding auxiliary losses at intermediate layers and sequence positions, showing that deep transformers are effective for character-level language modeling.

Main Contributions

  • Demonstrates that a deep transformer model can outperform RNN variants in character-level language modeling.
  • Achieves state-of-the-art results on text8 and enwik8 benchmarks with 1.13 and 1.06 bits per character, respectively.
  • Shows the importance of adding auxiliary losses at intermediate network layers and sequence positions for training deep transformer models.
  • Introduces a 64-layer transformer architecture, significantly deeper than previous transformer networks.
  • Replaces the sinusoidal timing signal with a learned per-layer positional embedding to improve performance in deep networks.
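
To make the auxiliary-loss idea in the list above concrete, here is a minimal sketch in plain Python. It assumes each layer exposes next-character logits at every sequence position and that the training loss is a weighted sum of per-layer losses, each averaged over positions. The function names and the `layer_weights` values are hypothetical; this summary does not give the paper's actual weighting or schedule.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy in nats for a single prediction (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def auxiliary_loss(layer_logits, targets, layer_weights):
    """Weighted sum over layers of the position-averaged cross-entropy.

    layer_logits[l][t] is the list of vocabulary logits produced by layer l
    at sequence position t; targets[t] is the index of the true next character.
    layer_weights[l] is a hypothetical per-layer weight (an assumption here).
    """
    total = 0.0
    for logits_per_position, weight in zip(layer_logits, layer_weights):
        # Loss at every sequence position, not only the final one.
        per_position = [
            cross_entropy(logits, y)
            for logits, y in zip(logits_per_position, targets)
        ]
        total += weight * sum(per_position) / len(per_position)
    return total


# Toy usage: 2 layers, 3 positions, vocabulary of size 2, uniform logits.
toy_logits = [[[0.0, 0.0] for _ in range(3)] for _ in range(2)]
toy_loss = auxiliary_loss(toy_logits, targets=[0, 1, 0], layer_weights=[0.5, 1.0])
```

In this sketch the intermediate layer is down-weighted relative to the final layer; that choice is illustrative only.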

Abstract

LSTMs and other RNN variants have shown strong performance on character-level language modeling. These models are typically trained using truncated backpropagation through time, and it is common to assume that their success stems from their ability to remember long-term contexts. In this paper, we show that a deep (64-layer) transformer model (Vaswani et al. 2017) with fixed context outperforms RNN variants by a large margin, achieving state of the art on two popular benchmarks: 1.13 bits per character on text8 and 1.06 on enwik8. To get good results at this depth, we show that it is important to add auxiliary losses, both at intermediate network layers and intermediate sequence positions.
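
The bits-per-character metric quoted above is the average negative log-likelihood per character expressed in base 2. The sketch below shows the conversion from natural-log probabilities; the function name `bits_per_character` is an illustrative assumption, not something taken from the paper.

```python
import math


def bits_per_character(log_probs_nats):
    """Average negative log-likelihood per character, converted from nats to bits.

    log_probs_nats holds the model's natural-log probability of each
    true character in the evaluated text.
    """
    if not log_probs_nats:
        raise ValueError("need at least one character")
    total_nll_nats = -sum(log_probs_nats)
    return total_nll_nats / (len(log_probs_nats) * math.log(2))


# A model that assigns probability 0.5 to every true character scores 1 bit.
example_bpc = bits_per_character([math.log(0.5)] * 4)
```

Lower is better: a model that guesses uniformly over 256 byte values would score 8 bits per character.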

References [46]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin - 2017

47 papers in library cite

Sepp Hochreiter, Jürgen Schmidhuber - 1997

94 papers in library cite

S. Ioffe, Christian Szegedy - 2015

18 papers in library cite

Kyunghyun Cho, B. V. Merrienboer, C. G. Gulcehre, D. Bahdanau, F. Bougares, Holger Schwenk, Yoshua Bengio - 2014

38 papers in library cite

Jimmy Lei Ba, R. Kiros, Geoffrey E. Hinton - 2016

14 papers in library cite

Yoshua Bengio, R. Ducharme, Pascal Vincent - 2001

62 papers in library cite

Tomas Mikolov, M. Karafiat, Lukas Burget, Jan Cernocky, Sanjeev Khudanpur - 2010

36 papers in library cite

P. Werbos - 1990

9 papers in library cite

Wojciech Zaremba, Ilya Sutskever, Oriol Vinyals - 2014

22 papers in library cite

Noam Shazeer, Azalia Mirhoseini, K. Maziarz, A. Davis, Quoc Le, Geoffrey Hinton, Jeffrey Dean - 2017

9 papers in library cite

S. Sukhbaatar, A. Szlam, Jason Weston, Rob Fergus - 2015

18 papers in library cite

M. Sundermeyer, R. Schluter, Hermann Ney - 2010

7 papers in library cite

Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, Jürgen Schmidhuber - 2001

16 papers in library cite

Jason Weston, S. Chopra, Antoine Bordes - 2015

18 papers in library cite

Yarin Gal - 2015

9 papers in library cite

Tomas Mikolov, S. Kombrink, Lukas Burget, Jan Cernocky, Sanjeev Khudanpur - 2011

16 papers in library cite

R. Jozefowicz, Oriol Vinyals, M. Schuster, Noam Shazeer, Yonghui Wu - 2016

20 papers in library cite

C. Chelba, Tomas Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, Tony Robinson - 2013

13 papers in library cite

Alec Radford, R. Jozefowicz, Ilya Sutskever - 2017

8 papers in library cite

U. Khandelwal, He He, P. Qi, Dan Jurafsky - 2018

2 papers in library cite

E. Grave, Armand Joulin, Nicolas Usunier - 2016

7 papers in library cite

Tomas Mikolov, Ilya Sutskever, A. Deoras, H. S. Le, S. Kombrink, Jan Cernocky - 2012

7 papers in library cite

X. Zhang, J. Zhao, Yann Lecun - 2015

7 papers in library cite

Yann N. Dauphin, A. Fan, Michael Auli, D. Grangier - 2016

8 papers in library cite

S. Merity, Nitish Shirish Keskar, Richard Socher - 2017

6 papers in library cite

J. Chung, C. G. Gulcehre, Kyunghyun Cho, Yoshua Bengio - 2015

3 papers in library cite

N. Kalchbrenner, L. Espeholt, K. Simonyan, A. V. D. Oord, Alex Graves, Koray Kavukcuoglu - 2016

5 papers in library cite

J. G. Zilly, R. K. Srivastava, J. Koutnik, Jürgen Schmidhuber - 2016

6 papers in library cite

T. Cooijmans, Nicolas Ballas, C. Laurent, Aaron Courville - 2016

3 papers in library cite

David Krueger, T. Maharaj, J. Kramar, M. Pezeshki, Nicolas Ballas, N. R. Ke, A. G. A. P. Goyal, Yoshua Bengio, Hugo Larochelle, Aaron Courville - 2016

3 papers in library cite

B. Krause, E. Kahembwe, I. Murray, S. Renals - 2017

3 papers in library cite

B. Krause, L. Lu, I. Murray, S. Renals - 2016

3 papers in library cite

A. Mujika, F. Meier, A. Steger - 2017

2 papers in library cite

J. Chung, S. Ahn, Yoshua Bengio - 2016

2 papers in library cite

Shanda Li, Wentao Li, C. Cook, C. Zhu, Y. Gao - 2018

2 papers in library cite

S. Zhang, Yonghui Wu, T. Che, Zongyu Lin, R. Memisevic, Ruslan R. Salakhutdinov, Yoshua Bengio - 2016

1 paper in library cites

T. Kenter, Llion Jones, D. Hewlett - 2018

1 paper in library cites

B. Knol - 2017 (cmix)

1 paper in library cites

M. Daniluk, Tim Rocktaschel, J. Welbl, Sebastian Riedel - 2017

1 paper in library cites

T. Salimans, Haowei Zhang, Alec Radford, D. N. Metaxas - 2018

1 paper in library cites

M. Mahoney - 2009

1 paper in library cites

N. R. Ke, A. G. A. P. Goyal, O. Bilaniuk, J. Binas, L. Charlin, C. Pal, Yoshua Bengio - 2017

1 paper in library cites

K. M. Rocki - 2016

1 paper in library cites

C. Tallec, Y. Ollivier - 2017

1 paper in library cites

M. Arjovsky, A. Shah, Yoshua Bengio - 2015

1 paper in library cites

Alexis Conneau, Holger Schwenk, L. Barrault, Yann Lecun - 2016

1 paper in library cites
