Papperoni

2007

Large Language Models in Machine Translation

Jeffrey Dean

AI summary

The paper introduces a distributed infrastructure for training language models on up to 2 trillion tokens, together with a new smoothing method called Stupid Backoff, and reports improvements in machine translation quality as measured by the BLEU score.

Main Contributions

Proposes a distributed language model training and deployment infrastructure.
Introduces a new smoothing method called Stupid Backoff.
Demonstrates that translation quality improves with increasing language model size, even at the largest sizes considered (up to 2 trillion tokens).
Builds 5-gram language models containing up to 300 billion n-grams.
Shows that Stupid Backoff approaches the quality of Kneser-Ney smoothing as the amount of training data increases.

Abstract

This paper reports on the benefits of large-scale statistical language modeling in machine translation. A distributed infrastructure is proposed which we use to train on up to 2 trillion tokens, resulting in language models having up to 300 billion n-grams. It is capable of providing smoothed probabilities for fast, single-pass decoding. We introduce a new smoothing method, dubbed Stupid Backoff, that is inexpensive to train on large data sets and approaches the quality of Kneser-Ney Smoothing as the amount of training data increases.
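
The abstract names Stupid Backoff without showing how a score is computed. The following is a minimal sketch, not the paper's distributed implementation: it scores a word by the relative frequency of the longest n-gram that was seen, and multiplies by a fixed factor each time it has to shorten the context. The function names, the in-memory `Counter`, and the toy corpus are illustrative assumptions. The default backoff factor of 0.4 is the value I recall the paper using and is not stated on this page. The scores are not normalized probabilities.

```python
from collections import Counter

ALPHA = 0.4  # assumed backoff factor; not stated on this page


def count_ngrams(tokens, max_order):
    """Count every n-gram of order 1..max_order in a token list."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def stupid_backoff(word, context, counts, total_tokens, alpha=ALPHA):
    """Score `word` given `context` (a sequence of preceding words).

    Uses the relative frequency of the longest seen n-gram; each time the
    context is shortened, the score is multiplied by `alpha`.
    """
    context = tuple(context)
    penalty = 1.0
    while context:
        ngram = context + (word,)
        if counts[ngram] > 0:
            return penalty * counts[ngram] / counts[context]
        context = context[1:]  # drop the oldest context word
        penalty *= alpha
    # Base case: unigram relative frequency.
    return penalty * counts[(word,)] / total_tokens


if __name__ == "__main__":
    tokens = "the cat sat on the mat the cat ate".split()
    counts = count_ngrams(tokens, max_order=3)
    print(stupid_backoff("sat", ["the", "cat"], counts, len(tokens)))
    print(stupid_backoff("cat", ["on", "the"], counts, len(tokens)))
```

Because no discounting or normalization is needed, training reduces to counting n-grams, which is what makes the method cheap on very large corpora.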

References [14]

[1]BLEU: A Method for Automatic Evaluation of Machine Translation

K. Papineni, S. Roukos, T. Ward, Wei-Jing Zhu - 2002

19 papers in library cite

Very cool idea. Simple yet very impactful!

[2]MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean, Sanjay Ghemawat - 2004

4 papers in library cite

Amazing paper that discusses how MapReduce works! Very simple, and really nice to read something not related to AI. A shame that it's off-topic.

[3]Improved Backing-Off for M-Gram Language Modeling

R. Kneser, Hermann Ney - 1995

11 papers in library cite

It's nice, it's simple... but it's not NN-based and seems very incremental on top of existing backoff methods.

[4]A Bit of Progress in Language Modeling

J. Goodman - 2001

15 papers in library cite

Focuses on n-grams.

[5]An Empirical Study of Smoothing Techniques for Language Modeling

S. F. Chen, J. Goodman - 1998

13 papers in library cite

[6]Estimation of Probabilities From Sparse Data for the Language Model Component of a Speech Recognizer

S. Katz - 1987

11 papers in library cite

[7]Interpolated Estimation of Markov Source Parameters From Sparse Data

Frederick Jelinek, R. L. Mercer - 1980

8 papers in library cite

[8]The Mathematics of Statistical Machine Translation: Parameter Estimation

P. F. Brown, S. D. Pietra, Vincent J. Della Pietra, R. L. Mercer - 1993

7 papers in library cite

[9]Statistical Significance Tests for Machine Translation Evaluation

P. Koehn - 2004

2 papers in library cite

[10]Computer-Intensive Methods for Testing Hypotheses

E. W. Noreen - 1989

1 paper in library cites

[11]Distributed Language Modeling for N-Best List Re-Ranking

Y. Z. Zhang, A. S. Hildebrand, S. Vogel - 2006

1 paper in library cites

[12]Dynamic Programming Search for Continuous Speech Recognition

Hermann Ney, S. Ortmanns - 1999

1 paper in library cites

[13]Large-Scale Distributed Language Modeling

A. Emami, K. Papineni, J. S. Sorensen - 2007

1 paper in library cites

[14]The Alignment Template Approach to Statistical Machine Translation

F. J. Och, Hermann Ney - 2004

1 paper in library cites

Read on March 24, 2025

Very nice paper on how Google scaled n-gram language models (up to 5-grams) for machine translation. Interesting, despite not being NN-related.