Papperoni

2011

Strategies for Training Large Scale Neural Network Language Models

Tomas Mikolov, A. Deoras, D. Povey, Lukas Burget, Jan Cernocky

Cite Score

30

AI summary

This paper describes how to train neural network language models effectively on large datasets by sorting the training data by relevance and by adding a hash-based maximum entropy model that is trained jointly with the network. It reports around a 10% relative reduction in word error rate on the English Broadcast News speech recognition task, measured against a large 4-gram model trained on 400M tokens.

Main Contributions

Introduces a method to sort training data by relevance for faster convergence and better performance.
Presents a hash-based implementation of a maximum entropy model that can be trained as part of the neural network model, reducing computational complexity.
Achieves around a 10% relative reduction in word error rate on the English Broadcast News speech recognition task, against a large 4-gram model trained on 400M tokens.
Runs the experiments with a recurrent neural network language model (RNN LM).
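To make the hash-based maximum entropy idea more concrete, here is a minimal sketch, not the paper's implementation. It assumes n-gram features up to trigrams, a single shared weight table indexed by a CRC32 hash, and plain SGD on the softmax log-likelihood. The class name `HashedMaxEntLM` and all of its parameters are hypothetical, and the sketch leaves out the recurrent network that the paper trains alongside the maximum entropy part.

```python
import math
import zlib


class HashedMaxEntLM:
    """Toy maximum entropy LM whose n-gram features share one hashed weight table."""

    def __init__(self, vocab, table_size=1 << 16, order=3, lr=0.1):
        self.vocab = list(vocab)
        self.table_size = table_size   # assumed size; collisions are simply tolerated
        self.order = order             # longest n-gram feature (context length + 1)
        self.lr = lr
        self.weights = [0.0] * table_size

    def _slots(self, history, word):
        """Return one weight index per n-gram feature ending in `word`."""
        slots = []
        for n in range(self.order):
            if n > len(history):
                break                  # not enough context for this feature order
            context = history[-n:] if n else []
            key = f"{n}|{' '.join(context + [word])}".encode("utf-8")
            slots.append(zlib.crc32(key) % self.table_size)
        return slots

    def probs(self, history):
        """Softmax over the summed feature weights of every vocabulary word."""
        scores = [sum(self.weights[i] for i in self._slots(history, w))
                  for w in self.vocab]
        top = max(scores)              # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

    def update(self, history, target):
        """One SGD step on the log-likelihood of `target` given `history`."""
        current = self.probs(history)
        for word, p_word in zip(self.vocab, current):
            grad = (1.0 if word == target else 0.0) - p_word
            for i in self._slots(history, word):
                self.weights[i] += self.lr * grad


tokens = "the cat sat on the mat the cat sat on the rug".split()
model = HashedMaxEntLM(vocab=sorted(set(tokens)))
for _ in range(20):
    for t in range(1, len(tokens)):
        model.update(tokens[max(0, t - 2):t], tokens[t])

next_word = dict(zip(model.vocab, model.probs(["the", "cat"])))
```

The point of the hash is that no feature dictionary is ever built: memory is fixed by `table_size`, and two features that collide simply share a weight.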

Abstract

We describe how to effectively train neural network based language models on large data sets. Fast convergence during training and better overall performance is observed when the training data are sorted by their relevance. We introduce hash-based implementation of a maximum entropy model, that can be trained as a part of the neural network model. This leads to significant reduction of computational complexity. We achieved around 10% relative reduction of word error rate on English Broadcast News speech recognition task, against large 4-gram model trained on 400M tokens.
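As a rough illustration of sorting training data by relevance, the sketch below scores each chunk of text by its cross-entropy under a unigram model built from a small in-domain sample, then orders the chunks so that the most relevant ones come last. The scoring function, the add-one smoothing, the "most relevant last" ordering, and the name `sort_by_relevance` are all assumptions made for this example. The abstract only says that the data are sorted by relevance.

```python
import math
from collections import Counter


def unigram_cross_entropy(chunk, counts, total, vocab_size):
    """Average negative log2 probability of `chunk` under an add-one unigram model."""
    tokens = chunk.split()
    log_prob = sum(
        math.log2((counts[tok] + 1) / (total + vocab_size)) for tok in tokens
    )
    return -log_prob / max(len(tokens), 1)


def sort_by_relevance(chunks, in_domain_text):
    """Order chunks from least to most similar to the in-domain sample."""
    counts = Counter(in_domain_text.split())
    total = sum(counts.values())
    vocab = set(counts)
    for chunk in chunks:
        vocab.update(chunk.split())
    # High cross-entropy means the chunk looks unlike the in-domain sample,
    # so sorting in descending order puts the most relevant chunks last.
    return sorted(
        chunks,
        key=lambda c: unigram_cross_entropy(c, counts, total, len(vocab)),
        reverse=True,
    )


held_out = "the anchor reported the news from the studio tonight"
chunks = [
    "protein folding simulation results",
    "the news anchor reported tonight",
    "the studio said the protein news",
]
ordered = sort_by_relevance(chunks, held_out)
```

Training would then read `ordered` front to back, so the model sees the material closest to the target domain at the end of the pass.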

References [21]

[1]A Neural Probabilistic Language Model

Yoshua Bengio, R. Ducharme, Pascal Vincent - 2001

62 papers in library cite

What started it all. Very simple and elegant.

[2]Recurrent Neural Network Based Language Model

Tomas Mikolov, M. Karafiat, Lukas Burget, Jan Cernocky, Sanjeev Khudanpur - 2010

36 papers in library cite

The comeback of RNNs for language modeling. Not too exciting but impactful and a short read.

[3]Curriculum Learning

Yoshua Bengio, J. Louradour, Ronan Collobert, Jason Weston - 2009

6 papers in library cite

Very nice paper that introduces curriculum learning. Possibly not too relevant, but good nonetheless.

[4]Learning and Development in Neural Networks: The Importance of Starting Small

Jeffrey L. Elman - 1993

5 papers in library cite

This is such a nice paper! Maybe it's because it was written for a specific audience, but it's such an easy read and ties back nicely to neural/biological concepts!

[5]Extensions of Recurrent Neural Network Language Model

Tomas Mikolov, S. Kombrink, Lukas Burget, Jan Cernocky, Sanjeev Khudanpur - 2011

16 papers in library cite

Doesn't add much.

[6]Hierarchical Probabilistic Neural Network Language Model

F. Morin, Yoshua Bengio - 2005

19 papers in library cite

Nice paper overall. Seems very impactful despite not being as relevant right now.

[7]Continuous Space Language Models

Holger Schwenk - 2007

12 papers in library cite

One more paper about speech recognition. Nothing special really.

[8]Empirical Evaluation and Combination of Advanced Language Modeling Techniques

Tomas Mikolov, A. Deoras, S. Kombrink, Lukas Burget, Jan Cernocky - 2011

13 papers in library cite

Early work proving that NNs can be good. But very uninteresting overall.

[9]Training Neural Network Language Models on Very Large Corpora

Holger Schwenk, Jean Luc Gauvain - 2005

7 papers in library cite

Seems very derivative of Schwenk's early work. It's also very focused on speech recognition, and "very large corpora" seems very relative.

[10]Can Artificial Neural Networks Learn Language Models

Weixin Xu, Alex Rudnicky - 2000

5 papers in library cite

Shitty paper and I hate that it was the first.

[11]Structured Output Layer Neural Network Language Model

H. S. Le, I. Oparin, A. Allauzen, Jean Luc Gauvain, F. Yvon - 2011

7 papers in library cite

[12]Classes for Fast Maximum Entropy Training

J. T. Goodman - 2001

7 papers in library cite

[13]A Maximum Entropy Approach to Adaptive Statistical Language Modeling

R. Rosenfeld - 1996

6 papers in library cite

[14]Shrinking Exponential Language Models

S. F. Chen - 2009

3 papers in library cite

[15]Speech Recognition With Segmental Conditional Random Fields: A Summary of the JHU CLSP 2010 Summer Workshop

Geoffrey Zweig, P. Nguyen, D. V. Compernolle, K. Demuynck, L. Atlas, Peter Clark, G. Sell, Mingliang Wang, F. Sha, H. Hermansky, D. Karakos, A. Jansen, S. Thomas, G. S. V. S. Sivaram, S. Bowman, J. Kao - 2011

3 papers in library cite

[16]The IBM Attila Speech Recognition Toolkit

H. Soltau, G. Saon, Brian Kingsbury - 2010

3 papers in library cite

[17]A Fast Rescoring Strategy to Capture Long-Distance Dependencies

A. Deoras, Tomas Mikolov, K. Church - 2011

2 papers in library cite

[18]Efficient Estimation of Maximum Entropy Language Models With N-Gram Features: An SRILM Extension

T. Alumae, M. Kurimo - 2010

2 papers in library cite

[19]Efficient Subsampling for Training Complex Language Models

P. Xu, A. Gunawardana, Sanjeev Khudanpur - 2011

2 papers in library cite

[20]Model Combination for Speech Recognition Using Empirical Bayes Risk Minimization

A. Deoras, D. Filimonov, M. Harper, Frederick Jelinek - 2010

1 paper in library cites

[21]Scaling Shrinkage-Based Language Models

S. Chen, L. Mangu, Bhuvana Ramabhadran, R. Sarikaya, A. Sethy - 2009

1 paper in library cites

Cited by 9 papers in your library

Cites 10 papers in your library

Read on March 20, 2025

Just builds on other things. Very minor stuff in my opinion.
