Papperoni

2012

A Fast and Simple Algorithm for Training Neural Probabilistic Language Models

A. Mnih, Yee Whye Teh

citations

Cite Score

33

AI summary

This paper introduces a fast and simple algorithm for training neural probabilistic language models (NPLMs) based on noise-contrastive estimation. On the Penn Treebank corpus it reduces training times by more than an order of magnitude without affecting model quality, and models trained with it on a 47M-word corpus obtain state-of-the-art results on the Microsoft Research Sentence Completion Challenge dataset.

Main Contributions

Proposes a fast and simple algorithm for training NPLMs based on noise-contrastive estimation.
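Demonstrates the algorithm's efficiency on the Penn Treebank corpus, reducing training times by more than an order of magnitude without affecting model quality.
Shows that the algorithm is more efficient and much more stable than importance sampling because it requires far fewer noise samples to perform well.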
Trains neural language models on a 47M-word corpus with an 80K-word vocabulary.
Achieves state-of-the-art results on the Microsoft Research Sentence Completion Challenge dataset.

Abstract

In spite of their superior performance, neural probabilistic language models (NPLMs) remain far less widely used than n-gram models due to their notoriously long training times, which are measured in weeks even for moderately-sized datasets. Training NPLMs is computationally expensive because they are explicitly normalized, which leads to having to consider all words in the vocabulary when computing the log-likelihood gradients. We propose a fast and simple algorithm for training NPLMs based on noise-contrastive estimation, a newly introduced procedure for estimating unnormalized continuous distributions. We investigate the behaviour of the algorithm on the Penn Treebank corpus and show that it reduces the training times by more than an order of magnitude without affecting the quality of the resulting models. The algorithm is also more efficient and much more stable than importance sampling because it requires far fewer noise samples to perform well. We demonstrate the scalability of the proposed approach by training several neural language models on a 47M-word corpus with an 80K-word vocabulary, obtaining state-of-the-art results on the Microsoft Research Sentence Completion Challenge dataset.
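
To make the abstract's central idea concrete, here is a minimal sketch of a per-example noise-contrastive estimation loss. It is not the authors' code. It assumes the usual binary-classification reading of NCE: the model scores one observed word and k words drawn from a noise distribution, and learns to tell them apart, so only k + 1 words are touched instead of the whole vocabulary. The names `nce_loss`, `score`, and `noise_prob`, the toy vocabulary, and the choice to treat the score as an already normalized log-probability are all assumptions made for illustration.

```python
import math
import random


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def nce_loss(score, target, noise_words, noise_prob):
    """Per-example NCE loss for one context (hypothetical helper).

    score(w)    -- unnormalized log-score of word w in the current context
    target      -- the observed next word
    noise_words -- k words sampled from the noise distribution
    noise_prob  -- dict mapping each word to its noise probability q(w)
    """
    k = len(noise_words)

    def delta(w):
        # Log-odds that w came from the data rather than from the noise.
        return score(w) - math.log(k * noise_prob[w])

    # The observed word should be classified as "data" ...
    loss = -log_sigmoid(delta(target))
    # ... and each noise sample as "noise"; log(1 - sigmoid(x)) = log_sigmoid(-x).
    for w in noise_words:
        loss -= log_sigmoid(-delta(w))
    return loss


# Toy usage: four-word vocabulary, unigram-style noise, made-up scores.
vocab = ["the", "cat", "sat", "mat"]
noise_prob = {"the": 0.4, "cat": 0.2, "sat": 0.2, "mat": 0.2}
scores = {"the": -2.0, "cat": 0.5, "sat": -1.0, "mat": -1.5}

rng = random.Random(0)
noise_words = rng.choices(vocab, weights=[noise_prob[w] for w in vocab], k=5)
loss = nce_loss(scores.get, "cat", noise_words, noise_prob)
```

The cost of one evaluation grows with the number of noise samples rather than with the vocabulary size, which is the property the abstract credits for the speedup over explicitly normalized training.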

References [21]

[1]A Neural Probabilistic Language Model

Yoshua Bengio, R. Ducharme, Pascal Vincent - 2001

62 papers in library cite

What started it all. Very simple and elegant.

[2]Recurrent Neural Network Based Language Model

Tomas Mikolov, M. Karafiat, Lukas Burget, Jan Cernocky, Sanjeev Khudanpur - 2010

36 papers in library cite

The comeback of RNNs for language modeling. Not too exciting but impactful and a short read.

[3]A Unified Architecture for Natural Language Processing: Deep Neural Networks With Multitask Learning

Ronan Collobert, Jason Weston - 2008

32 papers in library cite

Really did not add much to the game. I think this was more of a small perf. improvement over other existing things and set a few methodological standards. Maybe main contribution is Multitask Learning + Deep learning

[4]Srilm - An Extensible Language Modeling Toolkit

Andreas Stolcke - 2002

13 papers in library cite

Toolkit for N-grams. Not too relevant and sounds veeeery simple (sorry for those who implemented it). It's nice to see early implementation of OOP though. The paper is boring and doesn't really say much about the framework, more of a description of how to use the commands and n-gram models.

[5]A Maximum Entropy Approach to Natural Language Processing

A. L. Berger, S. A. D. Pietra, Vincent J. Della Pietra - 1996

10 papers in library cite

This paper is so good! Easy to follow and very nice results. The experiments are a bit meh, but otherwise wonderful.

[6]Word Representations: A Simple and General Method for Semi-Supervised Learning

J. Turian, L. Ratinov, Yoshua Bengio - 2010

17 papers in library cite

This basically introduced the concept of embeddings. Very nice.

[7]Parsing Natural Scenes and Natural Language With Recursive Neural Networks

Richard Socher, C. C. Lin, C. Manning, Andrew Y. Ng - 2011

10 papers in library cite

Good idea and nice results, but not my thing right now.

[8]Hierarchical Probabilistic Neural Network Language Model

F. Morin, Yoshua Bengio - 2005

19 papers in library cite

Nice paper overall. Seems very impactful despite not being as relevant right now.

[9]A Scalable Hierarchical Distributed Language Model

A. Mnih, Geoffrey E. Hinton - 2009

16 papers in library cite

Good paper that introduces hierarchical trees as an alternative to the expensive softmax output. I think this is not really relevant anymore, but good read.

[10]Three New Graphical Models for Statistical Language Modelling

A. Mnih, Geoffrey Hinton - 2007

12 papers in library cite

I don't know why this is so impactful. I didn't like it and I think this was overly complex.

[11]Empirical Evaluation and Combination of Advanced Language Modeling Techniques

Tomas Mikolov, A. Deoras, S. Kombrink, Lukas Burget, Jan Cernocky - 2011

13 papers in library cite

Early work proving that NNs can be good. But very uninteresting overall.

[12]Adaptive Importance Sampling to Accelerate Training of a Neural Probabilistic Language Model

Yoshua Bengio, Jean Sebastien Senecal - 2008

6 papers in library cite

Just a rerun of a previous paper. This is ok, but really not too different.

[13]Quick Training of Probabilistic Neural Nets by Importance Sampling

Yoshua Bengio, Jean Sebastien Senecal - 2003

11 papers in library cite

Good idea to overcome softmax computation cost. Not sure if too relevant today, but definitely better than the 2008 paper that is the same stuff.

[14]Training Neural Network Language Models on Very Large Corpora

Holger Schwenk, Jean Luc Gauvain - 2005

7 papers in library cite

Seems very derivative of Schwenk's early work. It's also very focused on speech recognition, and "very large corpora" seems very relative.

[15]Noise-Contrastive Estimation: A New Estimation Principle for Unnormalized Statistical Models

M. Gutmann, A. Hyvarinen - 2010

7 papers in library cite

[16]The Microsoft Research Sentence Completion Challenge

Geoffrey Zweig, C. J. Burges - 2011

6 papers in library cite

[17]Noise-Contrastive Estimation of Unnormalized Statistical Models, With Applications to Natural Image Statistics

M. U. Gutmann, A. Hyvarinen - 2012

2 papers in library cite

[18]A Family of Computationally Efficient and Simple Estimators for Unnormalized Statistical Models

M. Pihlaja, M. Gutmann, A. Hyvarinen - 2010

1 paper in library cites

[19]A Probabilistic Model for Semantic Word Vectors

A. L. Maas, Andrew Y. Ng - 2010

1 paper in library cites

[20]Improving a Statistical Language Model Through Non-Linear Prediction

A. Mnih, Z. Yuecheng, Geoffrey Hinton - 2009

1 paper in library cites

[21]Natural Language Processing With Python

S. Bird, E. Klein, E. Loper - 2009

1 paper in library cites

Cited by 5 papers in your library

Cites 14 papers in your library

Read on April 28, 2025

Too mathy and really did not have a lot of impact TBH
