Papperoni

2018

Adaptive Input Representations for Neural Language Modeling

A. Baevski, Michael Auli

citations

Cite Score

22

AI summary

This paper introduces adaptive input embeddings, which extend adaptive softmax to the input layer of neural language models. Evaluated on the WIKITEXT-103 and BILLION WORD benchmarks, they achieve state-of-the-art perplexity and train more than twice as fast as character-input CNNs.

Main Contributions

Introduces adaptive input embeddings that extend adaptive softmax to input word representations.
Demonstrates that adaptive input embeddings reduce overfitting to rare words by assigning more capacity to frequent words and less to infrequent ones.
Shows that models with adaptive word representations outperform strong character-based models while training more than twice as fast.
Achieves a perplexity of 18.7 on the WIKITEXT-103 benchmark.
Achieves a perplexity of 23.02 on the BILLION WORD benchmark.
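
The capacity argument in the contributions above can be made concrete with a small parameter count. This is a minimal sketch, not the authors' implementation: the names `Band`, `make_bands`, `band_of`, `adaptive_params`, and `fixed_params` are hypothetical, and the vocabulary size, cutoffs, model width, and reduction factor are illustrative assumptions. It also assumes every band, including the first, gets a bias-free linear projection up to the model width.

```python
from dataclasses import dataclass


@dataclass
class Band:
    start: int  # first word id in the band (ids sorted by descending frequency)
    end: int    # one past the last word id in the band
    dim: int    # embedding width used for words in this band


def make_bands(vocab_size, cutoffs, d_model, factor):
    """Split a frequency-sorted vocabulary into bands.

    Band i uses an embedding width of d_model // factor**i, so frequent
    words get wide embeddings and rare words get narrow ones.
    """
    edges = [0, *cutoffs, vocab_size]
    return [
        Band(lo, hi, d_model // factor**i)
        for i, (lo, hi) in enumerate(zip(edges, edges[1:]))
    ]


def band_of(word_id, bands):
    """Return the index of the band that contains word_id."""
    for i, band in enumerate(bands):
        if band.start <= word_id < band.end:
            return i
    raise ValueError(f"word id {word_id} is outside the vocabulary")


def adaptive_params(bands, d_model):
    """Embedding tables plus one (dim x d_model) projection per band (assumed, no bias)."""
    return sum((b.end - b.start) * b.dim + b.dim * d_model for b in bands)


def fixed_params(vocab_size, d_model):
    """A single embedding table with the full width for every word."""
    return vocab_size * d_model


# Illustrative numbers only (assumptions, not taken from this page).
VOCAB, D_MODEL, FACTOR = 260_000, 1024, 4
bands = make_bands(VOCAB, cutoffs=[20_000, 60_000], d_model=D_MODEL, factor=FACTOR)

for i, b in enumerate(bands):
    print(f"band {i}: {b.end - b.start} words, width {b.dim}")
print("adaptive:", adaptive_params(bands, D_MODEL))
print("fixed:   ", fixed_params(VOCAB, D_MODEL))
```

Under these assumed settings the rare-word band holds most of the vocabulary but only a small share of the parameters, which is the mechanism the notes credit for both the lower parameter count and the reduced overfitting on rare words.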

Abstract

We introduce adaptive input representations for neural language modeling which extend the adaptive softmax of Grave et al. (2017) to input representations of variable capacity. There are several choices on how to factorize the input and output layers, and whether to model words, characters or sub-word units. We perform a systematic comparison of popular choices for a self-attentional architecture. Our experiments show that models equipped with adaptive embeddings are more than twice as fast to train than the popular character input CNN while having a lower number of parameters. On the WIKITEXT-103 benchmark we achieve 18.7 perplexity, an improvement of 10.5 perplexity compared to the previously best published result and on the BILLION WORD benchmark, we achieve 23.02 perplexity.
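
For reference, the perplexity figures in the abstract follow the standard definition: the exponentiated average negative log-likelihood per token. The symbols N, w_t, and p below are notation chosen for this note, not taken from the paper. The second line only unpacks the abstract's arithmetic: an improvement of 10.5 that lands at 18.7 implies a previous best of about 29.2 on WIKITEXT-103.

```latex
\mathrm{PPL} = \exp\!\left( -\frac{1}{N} \sum_{t=1}^{N} \log p(w_t \mid w_{<t}) \right)

18.7 + 10.5 = 29.2
```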

References [34]

[1]Deep Residual Learning for Image Recognition

K. He, X. Zhang, S. Ren, Jian Sun - 2016

20 papers in library cite

This is simply amazing. Very very simple idea, totally revolutionary. No maths, just "it works!". Amazing.

[2]Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin - 2017

47 papers in library cite

I mean... it introduced Transformers!

[3]A Neural Probabilistic Language Model

Yoshua Bengio, R. Ducharme, Pascal Vincent - 2001

62 papers in library cite

What started it all. Very simple and elegant.

[4]SGDR: Stochastic Gradient Descent With Warm Restarts

Ilya Loshchilov, Frank Hutter - 2017

4 papers in library cite

Very simple, intuitive, and effective. I just wish they had provided a better motivation for the restarts than "it exists in other literature".

[5]Neural Machine Translation of Rare Words with Subword Units

R. Sennrich, B. Haddow, Alexandra Birch - 2016

22 papers in library cite

Very good! Simple, explains quite a lot and good results. Forms the basis for a lot of stuff now!

[6]On the Difficulty of Training Recurrent Neural Networks

Razvan Pascanu, Tomas Mikolov, Yoshua Bengio - 2013

21 papers in library cite

It starts very mathy but in the end there are some very nice contributions! You don't actually need to understand the math to know what's going on in the end.

[7]Recurrent Neural Network Based Language Model

Tomas Mikolov, M. Karafiat, Lukas Burget, Jan Cernocky, Sanjeev Khudanpur - 2010

36 papers in library cite

The comeback of RNNs for language modeling. Not too exciting but impactful and a short read.

[8]On the Importance of Initialization and Momentum in Deep Learning

Ilya Sutskever, James Martens, G. Dahl, Geoffrey Hinton - 2013

13 papers in library cite

They give very good context, and it's easy to see that they are doing this as a counterpoint to HF. Surprising results as well. I just think it was made obsolete by ReLU.

[9]Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, K. Maziarz, A. Davis, Quoc Le, Geoffrey Hinton, Jeffrey Dean - 2017

9 papers in library cite

It's nice, but there's an important section in the middle about batch sizes that I don't quite understand. I'm not sure if I am missing some background knowledge or if they explain it poorly, and it seems foundational to their main method. Either way, I did understand the methodology of the paper, and they have nice results :)

[10]Pointer Sentinel Mixture Models

S. Merity, Caiming Xiong, J. Bradbury, Richard Socher - 2017

12 papers in library cite

I really liked the methodology, but I had to read it a few times to understand it intuitively - I think they should have done a better job of explaining it.

[11]Extensions of Recurrent Neural Network Language Model

Tomas Mikolov, S. Kombrink, Lukas Burget, Jan Cernocky, Sanjeev Khudanpur - 2011

16 papers in library cite

Doesn't add much.

[12]Exploring the Limits of Language Modeling

R. Jozefowicz, Oriol Vinyals, M. Schuster, Noam Shazeer, Yonghui Wu - 2016

20 papers in library cite

It's funny because at first I did not like it, but then it clicked and I really liked it - they are trying to get around the large dictionary and the rare word problem. In the end it's SotA, but I think it's too convoluted and was replaced by Transformers.

[13]Hierarchical Probabilistic Neural Network Language Model

F. Morin, Yoshua Bengio - 2005

19 papers in library cite

Nice paper overall. Seems very impactful despite not being as relevant right now.

[14]One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling

C. Chelba, Tomas Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, Tony Robinson - 2013

13 papers in library cite

It's somewhat shallow, but I can see the importance of this paper.

[15]Using the Output Embedding to Improve Language Models

O. Press, Lior Wolf - 2017

7 papers in library cite

I did not like this paper at all - The paper is not bad, it's just that I expected *way* more. Good results but uninteresting

[16]Character-Level Language Modeling With Deeper Self-Attention

R. Al-Rfou, D. Choe, Noah Constant, M. Guo, Llion Jones - 2018

6 papers in library cite

It's good, and almost a 4, but I think it is a bit boring. Plus character LMs are not meta.

[17]Improving Neural Language Models With a Continuous Cache

E. Grave, Armand Joulin, Nicolas Usunier - 2016

7 papers in library cite

This was a surprise to me - I expected this to suck. However, they can provide a simple and intuitive way of improving LMs - nice

[18]Efficient Softmax Approximation for GPUs

E. Grave, Armand Joulin, M. Cisse, D. Grangier, Hervé Jégou - 2017

4 papers in library cite

I loved the methodology and the idea behind it. It's good to see some practical improvements (rather than methodological). I just think it's a bit tough to read. I had to ask AI to help explain some stuff.

[19]Language Modeling With Gated Convolutional Networks

Yann N. Dauphin, A. Fan, Michael Auli, D. Grangier - 2016

8 papers in library cite

[20]Large, Pruned or Continuous Space Language Models on a GPU for Statistical Machine Translation

Holger Schwenk, A. Rousseau, M. Attik - 2012

5 papers in library cite

[21]Character-Aware Neural Language Models

Yoon Kim, Yacine Jernite, D. Sontag, Alexander M. Rush - 2016

7 papers in library cite

[22]Classes for Fast Maximum Entropy Training

J. T. Goodman - 2001

7 papers in library cite

[23]Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling

H. Inan, K. Khosravi, Richard Socher - 2017

6 papers in library cite

[24]Decoding With Large-Scale Neural Language Models Improves Translation

Ashish Vaswani, Y. Zhao, V. Fossum, D. Chiang - 2013

5 papers in library cite

[25]Scaling Neural Machine Translation

M. Ott, S. Edunov, D. Grangier, Michael Auli - 2018

3 papers in library cite

[26]Sparse Non-Negative Matrix Language Modeling for Skip-Grams

Noam Shazeer, J. Pelemans, C. Chelba - 2015

3 papers in library cite

[27]An Analysis of Neural Language Modeling at Multiple Scales

S. Merity, Nitish Shirish Keskar, Richard Socher - 2018

2 papers in library cite

[28]Deep Neural Network Language Models

E. Arisoy, T. N. Sainath, Brian Kingsbury, Bhuvana Ramabhadran - 2012

2 papers in library cite

[29]Fast Parametric Learning With Activation Memorization

J. W. Rae, C. Dyer, Peter Dayan, T. P. Lillicrap - 2018

2 papers in library cite

[30]Strategies for Training Large Vocabulary Neural Language Models

Weizhu Chen, D. Grangier, Michael Auli - 2015

2 papers in library cite

[31]Analyzing Uncertainty in Neural Machine Translation

M. Ott, Michael Auli, D. Grangier, Marc'aurelio Ranzato - 2018

1 paper in library cites

[32]Neural Lattice Language Models

J. Buckman, Graham Neubig - 2018

1 paper in library cites

[33]Pragmatic Neural Language Modelling in Machine Translation

P. Baltescu, Phil Blunsom - 2015

1 paper in library cites

[34]Spell Once, Summon Anywhere: A Two-Level Open-Vocabulary Language Model

S. J. Mielke, J. Eisner - 2018

1 paper in library cites

Cited by 3 papers in your library

Cites 19 papers in your library

Read on November 15, 2025

I like the idea and it's a nice paper. Maybe it is a bit outdated now that everything is wordpiece, but I like that they reduce the number of parameters.
