Papperoni

2013

One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling

C. Chelba, Tomas Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, Tony Robinson

citations

Cite Score

42

AI summary

The paper introduces a benchmark corpus of almost one billion words for statistical language modeling and reports baseline results for n-gram and recurrent neural network language models. A combination of techniques achieves a 35% reduction in perplexity over the unpruned Kneser-Ney 5-gram baseline (perplexity 67.6).

Main Contributions

Introduces a new benchmark dataset with almost one billion words of training data for statistical language modeling research.
Provides scripts to rebuild the training/held-out data, and releases per-word log-probability values on each of ten held-out data sets for each baseline n-gram model.
Presents baseline results for various language modeling techniques, including n-gram and recurrent neural network-based language models.
Achieves a 35% reduction in perplexity (10% in cross-entropy) over the baseline Kneser-Ney 5-gram model through a combination of techniques.
Trains the largest recurrent neural network language model ever reported.
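
The per-word log-probability values mentioned above are the quantities perplexity is computed from. A minimal sketch, assuming natural-log probabilities by default; the function name `perplexity`, the `base` parameter, and the toy values are illustrative and are not taken from the benchmark's scripts or file formats:

```python
import math
from typing import Iterable


def perplexity(log_probs: Iterable[float], base: float = math.e) -> float:
    """Perplexity from per-word log-probabilities given in the stated base."""
    values = list(log_probs)
    if not values:
        raise ValueError("need at least one log-probability")
    # Average negative log-probability per word (cross-entropy in `base` units).
    avg_neg_log_prob = -sum(values) / len(values)
    # Exponentiating in the same base turns cross-entropy into perplexity.
    return base ** avg_neg_log_prob


# Toy data (hypothetical): four words, each predicted with probability 1/4.
toy = [math.log(0.25)] * 4
print(perplexity(toy))  # close to 4: a uniform choice among four words
```

If the released values use a different logarithm base, only the `base` argument would need to change.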

Abstract

We propose a new benchmark corpus to be used for measuring progress in statistical language modeling. With almost one billion words of training data, we hope this benchmark will be useful to quickly evaluate novel language modeling techniques, and to compare their contribution when combined with other advanced techniques. We show performance of several well-known types of language models, with the best results achieved with a recurrent neural network based language model. The baseline unpruned Kneser-Ney 5-gram model achieves perplexity 67.6. A combination of techniques leads to 35% reduction in perplexity, or 10% reduction in cross-entropy (bits), over that baseline. The benchmark is available as a code.google.com project; besides the scripts needed to rebuild the training/held-out data, it also makes available log-probability values for each word in each of ten held-out data sets, for each of the baseline n-gram models.
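
The abstract's two figures, 35% in perplexity and 10% in cross-entropy, are consistent under the standard relation perplexity = 2^H, where H is the cross-entropy in bits per word. A short check, assuming the 35% reduction is applied exactly to the stated baseline of 67.6 (the abstract's figures are rounded, and the helper name is illustrative):

```python
import math


def cross_entropy_bits(ppl: float) -> float:
    """Cross-entropy in bits per word for a given perplexity (ppl = 2 ** H)."""
    return math.log2(ppl)


baseline_ppl = 67.6                      # unpruned Kneser-Ney 5-gram, from the abstract
combined_ppl = baseline_ppl * (1 - 0.35)  # assumed: exactly 35% lower perplexity

baseline_bits = cross_entropy_bits(baseline_ppl)
combined_bits = cross_entropy_bits(combined_ppl)

# Relative reduction measured in bits rather than in perplexity.
relative_bits_reduction = (baseline_bits - combined_bits) / baseline_bits

print(f"baseline: {baseline_bits:.2f} bits, combined: {combined_bits:.2f} bits")
print(f"relative cross-entropy reduction: {relative_bits_reduction:.1%}")
```

Under these assumptions the cross-entropy drops from about 6.08 to about 5.46 bits per word, a relative reduction of roughly 10%.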

References [32]

[1]Learning Representations by Back-Propagating Errors

D. E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams - 1986

34 papers in library cite

Introduced backprop. Short and simple.

[2]Finding Structure in Time

Jeffrey L. Elman - 1990

23 papers in library cite

Good paper overall that introduces the concept of an RNN. However, applications and results are still very primitive.

[3]A Neural Probabilistic Language Model

Yoshua Bengio, R. Ducharme, Pascal Vincent - 2001

62 papers in library cite

What started it all. Very simple and elegant.

[4]Recurrent Neural Network Based Language Model

Tomas Mikolov, M. Karafiat, Lukas Burget, Jan Cernocky, Sanjeev Khudanpur - 2010

36 papers in library cite

The comeback of RNNs for language modeling. Not too exciting but impactful and a short read.

[5]LSTM Neural Networks for Language Modeling

M. Sundermeyer, R. Schluter, Hermann Ney - 2012

7 papers in library cite

It reads like an undergrad project - doesn't add anything new.

[6]Improved Backing-Off for M-Gram language Modeling

R. Kneser, Hermann Ney - 1995

11 papers in library cite

It's nice and simple, but it isn't about NNs and seems very incremental on top of existing backoff methods.

[7]Extensions of Recurrent Neural Network Language Model

Tomas Mikolov, S. Kombrink, Lukas Burget, Jan Cernocky, Sanjeev Khudanpur - 2011

16 papers in library cite

Doesn't add much.

[8]Hierarchical Probabilistic Neural Network Language Model

F. Morin, Yoshua Bengio - 2005

19 papers in library cite

Nice paper overall. Seems very impactful despite not being as relevant right now.

[9]Three New Graphical Models for Statistical Language Modelling

A. Mnih, Geoffrey Hinton - 2007

12 papers in library cite

I don't know why this is so impactful. I didn't like it and I think this was overly complex.

[10]Large Language Models in Machine Translation

Jeffrey Dean - 2007

3 papers in library cite

Very nice paper on how Google tackled machine translation with large n-gram language models. Interesting, despite not being NN-related.

[11]Strategies for Training Large Scale Neural Network Language Models

Tomas Mikolov, A. Deoras, D. Povey, Lukas Burget, Jan Cernocky - 2011

9 papers in library cite

Just builds on other things. Very minor stuff, in my opinion.

[12]Continuous Space Language Models

Holger Schwenk - 2007

12 papers in library cite

One more paper about speech recognition. Nothing special really.

[13]Empirical Evaluation and Combination of Advanced Language Modeling Techniques

Tomas Mikolov, A. Deoras, S. Kombrink, Lukas Burget, Jan Cernocky - 2011

13 papers in library cite

Early work proving that NNs can be good. But very uninteresting overall.

[14]Statistical Language Models Based on Neural Networks

Tomas Mikolov - 2012

17 papers in library cite

Mikolov's Thesis

[15]A Bit of Progress in Language Modeling

J. Goodman - 2001

15 papers in library cite

Focuses on n-grams.

[16]An Empirical Study of Smoothing Techniques for Language Modeling

S. F. Chen, J. Goodman - 1998

13 papers in library cite

[17]Class-Based N-Gram Models of Natural Language

P. F. Brown, P. V. Desouza, R. L. Mercer, Vincent J. Della Pietra, J. C. Lai - 1992

12 papers in library cite

[18]Estimation of Probabilities From Sparse Data for the Language Model Component of a Speech Recognizer

S. Katz - 1987

11 papers in library cite

[19]Classes for Fast Maximum Entropy Training

J. T. Goodman - 2001

7 papers in library cite

[20]Structured Language Modeling

C. Chelba, Frederick Jelinek - 2000

6 papers in library cite

[21]Random Forests and the Data Sparseness Problem in Language Modeling

P. Xu - 2005

4 papers in library cite

[22]A Dynamic Language Model for Speech Recognition

Frederick Jelinek, B. Merialdo, S. Roukos, M. Strauss - 1991

3 papers in library cite

[23]Shrinking Exponential Language Models

S. F. Chen - 2009

3 papers in library cite

[24]A Neural Syntactic Language Model

A. Emami - 2006

2 papers in library cite

[25]Efficient Subsampling for Training Complex Language Models

P. Xu, A. Gunawardana, Sanjeev Khudanpur - 2011

2 papers in library cite

[26]Entropy-Based Pruning of Back-Off Language Models

Andreas Stolcke - 1998

2 papers in library cite

[27]Speed Regularization and Optimality in Word Classing

Geoffrey Zweig, K. Makarychev - 2013

2 papers in library cite

[28]A Hierarchical Bayesian Language Model Based on Pitman-Yor Processes

Yee Whye Teh - 2006

1 paper in library cites

[29]Adaptive Statistical Language Modeling: A Maximum Entropy Approach

R. Rosenfeld - 1994

1 paper in library cites

[30]Discriminative Language Modeling With Conditional Random Fields and the Perceptron Algorithm

B. Roark, M. Saraclar, Michael Collins, M. J. Johnson - 2004

1 paper in library cites

[31]Factored Recurrent Neural Network Language Model in TED Lecture Transcription

Yonghui Wu, H. Yamamoto, X. Lu, S. Matsuda, C. Hori, H. Kashioka - 2012

1 paper in library cites

[32]Study on Interaction Between Entropy Pruning and Kneser-Ney Smoothing

C. Chelba, T. Brants, W. Neveitt, P. Xu - 2010

1 paper in library cites

Cited by 13 papers in your library

Cites 13 papers in your library

Read on October 23, 2025

It's somewhat shallow, but I can see the importance of this paper.
