2002

Training Products of Experts by Minimizing Contrastive Divergence

Geoffrey Hinton

AI summary

This paper introduces a method for training Products of Experts (PoE) by minimizing contrastive divergence instead of maximizing the log likelihood of the data. The alternative objective avoids computing derivatives of the partition function, so training is efficient whenever the individual experts are tractable.

Main Contributions

Introduces a novel approach for training Products of Experts (PoE) by minimizing contrastive divergence.
Presents an alternative objective function to avoid computing derivatives of the partition function.
Demonstrates the effectiveness of the approach on synthetic data and handwritten digit recognition.
Shows that the method learns localized features and performs well in discrimination tasks.
Discusses the relationship between PoEs and Boltzmann machines, highlighting the advantages of the contrastive divergence learning algorithm.
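
To make the contrastive divergence idea concrete, here is a minimal sketch of one-step contrastive divergence (CD-1) for a tiny binary restricted Boltzmann machine, which can be viewed as a PoE with one expert per hidden unit. All names (`cd1_step`, `hidden_probs`, `visible_probs`), the learning rate, the toy data, and the choice to use hidden probabilities rather than sampled states in the update are illustrative assumptions, not the paper's exact procedure.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_bernoulli(probs, rng):
    # Draw one binary state per unit from its activation probability.
    return [1 if rng.random() < p else 0 for p in probs]

def hidden_probs(v, W, c):
    # p(h_j = 1 | v) for every hidden unit j.
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]

def visible_probs(h, W, b):
    # p(v_i = 1 | h) for every visible unit i.
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b))]

def cd1_step(v0, W, b, c, rng, lr=0.1):
    """Update W, b, c in place from one data vector; return the reconstruction."""
    ph0 = hidden_probs(v0, W, c)          # statistics on the data
    h0 = sample_bernoulli(ph0, rng)
    v1 = sample_bernoulli(visible_probs(h0, W, b), rng)  # one-step reconstruction
    ph1 = hidden_probs(v1, W, c)          # statistics on the reconstruction
    for i in range(len(b)):
        for j in range(len(c)):
            # Data correlation minus reconstruction correlation;
            # no partition function appears anywhere.
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
    for i in range(len(b)):
        b[i] += lr * (v0[i] - v1[i])
    for j in range(len(c)):
        c[j] += lr * (ph0[j] - ph1[j])
    return v1

# Hypothetical toy setup: 4 visible units, 2 hidden units, two training patterns.
rng = random.Random(0)
n_visible, n_hidden = 4, 2
W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)] for _ in range(n_visible)]
b = [0.0] * n_visible
c = [0.0] * n_hidden
data = [[1, 1, 0, 0], [0, 0, 1, 1]]
for _ in range(200):
    for v in data:
        recon = cd1_step(v, W, b, c, rng)
```

The point of the sketch is that each update needs only a data vector and its one-step reconstruction, rather than samples from the model's equilibrium distribution.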

Abstract

It is possible to combine multiple probabilistic models of the same data by multiplying their probability distributions together and then renormalizing. This is a very efficient way to model high-dimensional data which simultaneously satisfies many different low-dimensional constraints because each individual expert model can focus on giving high probability to data vectors that satisfy just one of the constraints. Data vectors that satisfy this one constraint but violate other constraints will be ruled out by their low probability under the other experts. Training a product of experts appears difficult because, in addition to maximizing the probability that each individual expert assigns to the observed data, it is necessary to make the experts be as different as possible. This ensures that the product of their distributions is small which allows the renormalization to magnify the probability of the data under the product of experts model. Fortunately, if the individual experts are tractable there is an efficient way to train a product of experts.
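
The multiply-and-renormalize step described in the abstract can be sketched on a toy discrete domain. The two experts below are hypothetical, each enforcing a single constraint on a 2-bit vector, and the function name `product_of_experts` is an assumption made for illustration.

```python
from itertools import product

def product_of_experts(experts, domain):
    """Multiply the experts' probabilities pointwise, then renormalize."""
    unnormalized = {}
    for x in domain:
        score = 1.0
        for expert in experts:
            score *= expert(x)
        unnormalized[x] = score
    z = sum(unnormalized.values())  # partition function over the whole domain
    return {x: s / z for x, s in unnormalized.items()}

# Toy domain: all 2-bit vectors.
domain = list(product([0, 1], repeat=2))

# Hypothetical experts; each is a normalized distribution over the domain
# that cares about only one constraint.
def expert_first_bit_on(x):
    return 0.45 if x[0] == 1 else 0.05

def expert_second_bit_off(x):
    return 0.45 if x[1] == 0 else 0.05

poe = product_of_experts([expert_first_bit_on, expert_second_bit_off], domain)
for x in domain:
    print(x, round(poe[x], 2))
```

Each expert alone gives the vector (1, 0) a probability of 0.45. The unnormalized products sum to only 0.25, so renormalization lifts the vector that satisfies both constraints to 0.81. Enumerating the domain to compute the partition function is feasible only for toy cases like this one.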

References [14]

[1] A Maximum Entropy Approach to Natural Language Processing

A. L. Berger, S. A. Della Pietra, V. J. Della Pietra - 1996

10 papers in library cite

This paper is so good! Easy to follow and very nice results. The experiments are a bit meh, but otherwise wonderful.

[2] The Wake-Sleep Algorithm for Unsupervised Neural Networks

Geoffrey Hinton, Peter Dayan, B. Frey, R. Neal - 1995

9 papers in library cite

It's okay... I get the feeling that this is early autoencoder work, from before the term existed. I don't think it adds anything new though.

[3] Information Processing in Dynamical Systems: Foundations of Harmony Theory

P. Smolensky - 1986

11 papers in library cite

88 pages; Introduced RBMs

[4] Learning and Relearning in Boltzmann Machines

Geoffrey E. Hinton, T. J. Sejnowski - 1986

9 papers in library cite

37 pages; Introduced Boltzmann machines

[5] Unsupervised Learning of Distributions on Binary Vectors Using Two Layer Networks

Y. Freund, D. Haussler - 1992

8 papers in library cite

[6] Learning Continuous Attractors in Recurrent Networks

S. H. Seung - 1998

5 papers in library cite

[7] Combining Probability Distributions: A Critique and an Annotated Bibliography

C. Genest, J. V. Zidek - 1986

3 papers in library cite

[8] Learning Representations by Recirculation

Geoffrey E. Hinton, J. L. McClelland - 1988

3 papers in library cite

[9] Mean Field Theory for Sigmoid Belief Networks

L. Saul, T. Jaakkola, M. Jordan - 1996

3 papers in library cite

[10] A Hierarchical Community of Experts

Geoffrey E. Hinton, B. Sallans, Zoubin Ghahramani - 1999

2 papers in library cite

[11] Bias/Variance Decompositions for Likelihood-Based Estimators

T. Heskes - 1998

2 papers in library cite

[12] Biologically Plausible Error-Driven Learning Using Local Activation Differences: The Generalized Recirculation Algorithm

R. O'Reilly - 1996

1 paper in library cites

[13] Learning Structural Descriptions From Examples

P. Winston - 1975

1 paper in library cites

[14] Using Generative Models for Hand-Written Digit Recognition

M. Revow, C. Williams, Geoffrey Hinton - 1996

1 paper in library cites

Read on June 26, 2025

Good read, but I think I need to revisit it after I understand RBMs better.