2013

One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling

C. Chelba, Tomas Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, Tony Robinson


AI summary

The paper introduces a benchmark corpus of almost one billion words for statistical language modeling and reports baseline results for n-gram and recurrent neural network language models. A combination of techniques reduces perplexity by 35% relative to the unpruned Kneser-Ney 5-gram baseline, which has a perplexity of 67.6.

Main Contributions

  • Introduces a new benchmark corpus with almost one billion words of training data for statistical language modeling research.
  • Provides scripts to rebuild the training/held-out data, and publishes per-word log-probability values on each of ten held-out data sets for each baseline n-gram model.
  • Presents baseline results for various language modeling techniques, including n-gram and recurrent neural network-based language models.
  • Achieves a 35% reduction in perplexity (10% in cross-entropy) over the baseline Kneser-Ney 5-gram model through a combination of techniques.
  • Trains the largest recurrent neural network language model reported at the time of publication.
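
Because the benchmark publishes per-word log-probabilities, perplexity can be recomputed and models can be combined without retraining anything. The sketch below is illustrative only: it assumes the log-probabilities are base-10 values already loaded into Python lists (the benchmark's actual file format is not described here), and it uses fixed-weight linear interpolation as one common way to combine two models, which may differ from the combination used in the paper. The function names and toy numbers are hypothetical.

```python
import math


def perplexity(log10_probs):
    """Perplexity from per-word base-10 log-probabilities (assumed format)."""
    avg_log10 = sum(log10_probs) / len(log10_probs)
    return 10 ** (-avg_log10)


def interpolate(log10_a, log10_b, weight_a):
    """Mix two models word by word in probability space.

    weight_a is the interpolation weight given to model A; model B
    receives 1 - weight_a. Returns base-10 log-probabilities.
    """
    mixed = []
    for la, lb in zip(log10_a, log10_b):
        p = weight_a * 10 ** la + (1 - weight_a) * 10 ** lb
        mixed.append(math.log10(p))
    return mixed


# Toy per-word scores for two hypothetical models on the same four words.
model_a = [-1.0, -2.0, -1.5, -3.0]
model_b = [-2.0, -1.0, -2.5, -1.0]
mixed = interpolate(model_a, model_b, weight_a=0.5)

print(perplexity(model_a), perplexity(model_b), perplexity(mixed))
```

In this toy case the mixture scores better than either model alone because each model is confident on different words, which is the usual motivation for combining complementary language models.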

Abstract

We propose a new benchmark corpus to be used for measuring progress in statistical language modeling. With almost one billion words of training data, we hope this benchmark will be useful to quickly evaluate novel language modeling techniques, and to compare their contribution when combined with other advanced techniques. We show performance of several well-known types of language models, with the best results achieved with a recurrent neural network based language model. The baseline unpruned Kneser-Ney 5-gram model achieves perplexity 67.6. A combination of techniques leads to 35% reduction in perplexity, or 10% reduction in cross-entropy (bits), over that baseline. The benchmark is available as a code.google.com project; besides the scripts needed to rebuild the training/held-out data, it also makes available log-probability values for each word in each of ten held-out data sets, for each of the baseline n-gram models.
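
The two reductions quoted in the abstract describe the same improvement on different scales, since cross-entropy in bits is the base-2 logarithm of perplexity. The following minimal sketch checks the arithmetic using only the figures stated above (baseline perplexity 67.6 and a 35% reduction); the function name is hypothetical.

```python
import math


def cross_entropy_reduction(baseline_ppl, relative_ppl_reduction):
    """Relative cross-entropy reduction implied by a perplexity reduction.

    Cross-entropy in bits per word is log2(perplexity).
    """
    combined_ppl = baseline_ppl * (1 - relative_ppl_reduction)
    baseline_bits = math.log2(baseline_ppl)
    combined_bits = math.log2(combined_ppl)
    return combined_ppl, 1 - combined_bits / baseline_bits


combined_ppl, entropy_drop = cross_entropy_reduction(67.6, 0.35)
print(f"combined perplexity: {combined_ppl:.2f}")
print(f"cross-entropy reduction: {entropy_drop:.1%}")
```

A 35% drop from 67.6 gives a perplexity of about 43.9, and the corresponding drop in bits per word comes out at roughly 10%, consistent with the abstract.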


