2013
Cite Score
42
AI summary
The paper introduces a benchmark corpus of almost one billion words for statistical language modeling and reports baseline results for n-gram and recurrent neural network language models. A combination of techniques achieves a 35% reduction in perplexity relative to the unpruned Kneser-Ney 5-gram baseline (perplexity 67.6).
Abstract
We propose a new benchmark corpus to be used for measuring progress in statistical language modeling. With almost one billion words of training data, we hope this benchmark will be useful to quickly evaluate novel language modeling techniques, and to compare their contribution when combined with other advanced techniques. We show performance of several well-known types of language models, with the best results achieved with a recurrent neural network based language model. The baseline unpruned Kneser-Ney 5-gram model achieves perplexity 67.6. A combination of techniques leads to 35% reduction in perplexity, or 10% reduction in cross-entropy (bits), over that baseline. The benchmark is available as a code.google.com project; besides the scripts needed to rebuild the training/held-out data, it also makes available log-probability values for each word in each of ten held-out data sets, for each of the baseline n-gram models.
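The abstract's two figures (35% lower perplexity, about 10% lower cross-entropy in bits) are related by perplexity = 2^(cross-entropy in bits). The sketch below checks that arithmetic from the stated baseline of 67.6. It also shows how a perplexity could be recomputed from per-word log-probabilities such as those the benchmark releases. The function names are illustrative, and the base-10 logarithm is an assumption, since the abstract does not state the base of the released values.

```python
import math


def cross_entropy_bits(perplexity):
    """Per-word cross-entropy in bits for a given perplexity."""
    return math.log2(perplexity)


def perplexity_from_log10_probs(log10_probs):
    """Perplexity from per-word log-probabilities.

    Assumption: the values are base-10 logs; the abstract does not
    say which base the released log-probabilities use.
    """
    mean_neg_log10 = -sum(log10_probs) / len(log10_probs)
    return 10 ** mean_neg_log10


# Baseline from the abstract: unpruned Kneser-Ney 5-gram model.
baseline_ppl = 67.6
# A 35% reduction in perplexity, as stated for the combined model.
combined_ppl = baseline_ppl * (1 - 0.35)

baseline_bits = cross_entropy_bits(baseline_ppl)
combined_bits = cross_entropy_bits(combined_ppl)
relative_bits_reduction = (baseline_bits - combined_bits) / baseline_bits

print(f"baseline: {baseline_bits:.3f} bits, combined: {combined_bits:.3f} bits")
print(f"relative cross-entropy reduction: {relative_bits_reduction:.1%}")
```

The relative reduction in bits comes out at roughly 10%, consistent with the abstract. A fixed percentage drop in perplexity corresponds to a fixed absolute drop in bits (here -log2(0.65)), so the relative change in bits depends on the baseline.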