Papperoni

2001

A Neural Probabilistic Language Model

Yoshua Bengio, R. Ducharme, Pascal Vincent

citations

Cite Score

87

AI summary

This paper introduces a neural probabilistic language model using distributed word representations and neural networks to overcome the curse of dimensionality in language modeling, achieving improved perplexity on the Brown corpus and AP News data compared to n-gram models.

Main Contributions

Introduces a neural probabilistic language model that learns distributed representations for words.
The model learns word feature vectors and the probability function simultaneously.
Demonstrates improved generalization by leveraging semantic similarity between words.
Achieves significantly better perplexity on the Brown corpus compared to state-of-the-art n-gram models.
Shows that the model can effectively utilize longer contexts for language modeling.
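
Since the results above are reported as perplexity, here is a minimal sketch of how that metric is computed from the probabilities a model assigns to each test word. The function name `perplexity` and the toy probability lists are illustrative assumptions, not values from the paper.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per word.

    token_probs: probabilities the model assigned to each actual next word
    in a test sequence (hypothetical toy values in this sketch).
    """
    if not token_probs:
        raise ValueError("need at least one probability")
    avg_neg_log_likelihood = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_likelihood)

# A model that always spreads its mass evenly over 4 words.
uniform = perplexity([0.25] * 8)

# A model that is more confident on most words scores a lower perplexity.
sharper = perplexity([0.5, 0.25, 0.5, 0.125])

print(uniform, sharper)
```

Lower is better: a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k words.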

Abstract

A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set. We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences. The model learns simultaneously (1) a distributed representation for each word along with (2) the probability function for word sequences, expressed in terms of these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to words forming an already seen sentence. Training such large models (with millions of parameters) within a reasonable time is itself a significant challenge. We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach significantly improves on state-of-the-art n-gram models, and that the proposed approach allows to take advantage of longer contexts.
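
The abstract does not spell out the architecture, so the following is only a rough sketch of the two pieces it names: (1) a shared table of word feature vectors and (2) a probability function over the next word computed from those vectors. It assumes the commonly cited form of the model, y = b + Wx + U tanh(d + Hx) followed by a softmax. All sizes, the random initialization, and the helper names (`matvec`, `random_matrix`, `nplm_forward`) are toy assumptions, and no training step is shown.

```python
import math
import random

def matvec(matrix, vector):
    """Plain-Python matrix-vector product."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]

def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; the initialization scheme is an assumption."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def nplm_forward(context_ids, C, H, d, U, W, b):
    """Return P(next word | context) for every word in the vocabulary."""
    # (1) Distributed representation: look up each context word's feature
    #     vector in the shared table C and concatenate them into x.
    x = [value for word_id in context_ids for value in C[word_id]]
    # (2) Probability function: tanh hidden layer plus direct connections.
    hidden = [math.tanh(d_k + a_k) for d_k, a_k in zip(d, matvec(H, x))]
    logits = [b_i + w_i + u_i
              for b_i, w_i, u_i in zip(b, matvec(W, x), matvec(U, hidden))]
    # Numerically stable softmax over the vocabulary.
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
V, m, n_ctx, h = 10, 4, 3, 8           # vocab size, feature dim, context length, hidden units
C = random_matrix(V, m, rng)           # word feature vectors, one row per word
H = random_matrix(h, n_ctx * m, rng)   # input-to-hidden weights
d = [0.0] * h                          # hidden biases
U = random_matrix(V, h, rng)           # hidden-to-output weights
W = random_matrix(V, n_ctx * m, rng)   # direct input-to-output weights
b = [0.0] * V                          # output biases

probs = nplm_forward([1, 5, 7], C, H, d, U, W, b)
print(len(probs), sum(probs))
```

Because every context word is mapped through the same table `C`, two words with nearby feature vectors yield nearby inputs `x` and therefore similar next-word distributions, which is the generalization mechanism the abstract describes.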

References [33]

[1]Finding Structure in Time

Jeffrey L. Elman - 1990

23 papers in library cite

Good paper overall that introduces the concept of an RNN. However, applications and results are still very primitive.

[2]Efficient Backprop

Yann Lecun, Leon Bottou, G. B. Orr, Klaus Robert Muller - 1998

20 papers in library cite

The first half is very very good. The remainder is very boring.

[3]Training Products of Experts by Minimizing Contrastive Divergence

Geoffrey Hinton - 2002

23 papers in library cite

Good read, but I think I need to revisit it after I understand RBMs better.

[4]Srilm - An Extensible Language Modeling Toolkit

Andreas Stolcke - 2002

13 papers in library cite

Toolkit for n-grams. Not too relevant and sounds veeeery simple (sorry to those who implemented it). It's nice to see an early implementation of OOP though. The paper is boring and doesn't really say much about the framework; it's more a description of how to use the commands and n-gram models.

[5]A Maximum Entropy Approach to Natural Language Processing

A. L. Berger, S. A. D. Pietra, Vincent J. Della Pietra - 1996

10 papers in library cite

This paper is so good! Easy to follow and very nice results. The experiments are a bit meh, but otherwise wonderful.

[6]Improved Backing-Off for M-Gram language Modeling

R. Kneser, Hermann Ney - 1995

11 papers in library cite

It's nice, it's simple... but it's not NNs, and it seems very incremental on top of existing backoff methods.

[7]Learning Distributed Representations of Concepts

Geoffrey E. Hinton - 1986

13 papers in library cite

Probably seminal, but a bit boring overall.

[8]Quick Training of Probabilistic Neural Nets by Importance Sampling

Yoshua Bengio, Jean Sebastien Senecal - 2003

11 papers in library cite

Good idea for overcoming the softmax computation cost. Not sure it's still relevant today, but definitely better than the 2008 paper that covers the same material.

[9]Connectionist Language Modeling for Large Vocabulary Continuous Speech Recognition

Holger Schwenk, Jean Luc Gauvain - 2002

14 papers in library cite

Only real relevance is being early. Otherwise not much to see.

[10]Can Artificial Neural Networks Learn Language Models

Weixin Xu, Alex Rudnicky - 2000

5 papers in library cite

Shitty paper and I hate that it was the first.

[11]Sequential Neural Text Compression

Jürgen Schmidhuber - 1996

3 papers in library cite

I really liked this paper. Maybe not useful, but the idea is very nice!

[12]WordNet: An Electronic Lexical Database

C. Fellbaum - 1998

12 papers in library cite

It's huge and I don't think it will add much (it is a book)

[13]Indexing by Latent Semantic Analysis

S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, R. Harshman - 1990

12 papers in library cite

LSA paper

[14]A Bit of Progress in Language Modeling

J. Goodman - 2001

15 papers in library cite

Focuses on n-grams.

[15]An Empirical Study of Smoothing Techniques for Language Modeling

S. F. Chen, J. Goodman - 1998

13 papers in library cite

[16]Class-Based N-Gram Models of Natural Language

P. F. Brown, P. V. Desouza, R. L. Mercer, Vincent J. Della Pietra, J. C. Lai - 1992

12 papers in library cite

[17]Estimation of Probabilities From Sparse Data for the Language Model Component of a Speech Recognizer

S. Katz - 1987

11 papers in library cite

[18]Interpolated Estimation of Markov Source Parameters From Sparse Data

Frederick Jelinek, R. L. Mercer - 1980

8 papers in library cite

[19]Distributional Clustering of English Words

Fernando Pereira, N. Tishby, L. Lee - 1993

4 papers in library cite

[20]Natural Language Processing With Modular PDP Networks and Distributed Lexicon

R. Miikkulainen, M. G. Dyer - 1991

4 papers in library cite

[21]Modeling High-Dimensional Discrete Data With Multi-Layer Neural Networks

Yoshua Bengio, Samy Bengio - 2000

3 papers in library cite

Hinrich Schutze - 1993

3 papers in library cite

[23]A Latent Semantic Analysis Framework for Large-Span Language Modeling

J. R. Bellegarda - 1997

2 papers in library cite

[24]Comparison of Part-of-Speech and Automatically Derived Category-Based Language Models for Speech Recognition

T. R. Niesler, E. W. D. Whittaker, P. C. Woodland - 1998

2 papers in library cite

[25]Distributional Clustering of Words for Text Classification

D. Baker, Andrew Mccallum - 1998

2 papers in library cite

[26]Extracting Distributed Representations of Concepts and Relations From Positive and Negative Propositions

A. Paccanaro, Geoffrey Hinton - 2000

2 papers in library cite

[27]Improved Clustering Techniques for Class-Based Statistical Language Modelling

Hermann Ney, R. Kneser - 1993

2 papers in library cite

[28]Improving Protein Secondary Structure Prediction Using Structured Neural Networks and Multiple Sequence Profiles

S. Riis, A. Krogh - 1996

2 papers in library cite

[29]New Distributed Probabilistic Language Models

Yoshua Bengio - 2002

2 papers in library cite

[30]Self-Organizing Letter Code-Book for Text-to-Phoneme Neural Network Model

K. J. Jensen, S. Riis - 2000

2 papers in library cite

[31]Taking on the Curse of Dimensionality in Joint Distributions Using Neural Networks

Samy Bengio, Yoshua Bengio - 2000

2 papers in library cite

[32]MPI: A Message Passing Interface Standard

J. Dongarra, D. Walker, T. M. P. I. Forum - 1995

1 paper in library cites

[33]Products of Hidden Markov Models

A. Brown, Geoffrey E. Hinton - 2000

1 paper in library cites

Cited by 62 papers in your library

Cites 13 papers in your library

Read on March 17, 2025

What started it all. Very simple and elegant.
