Papperoni

2014

On Using Very Large Target Vocabulary for Neural Machine Translation

Yoshua Bengio

citations

Cite Score

41

AI summary

This paper introduces an approximate training algorithm based on importance sampling that allows NMT models to be trained with a much larger target vocabulary. The results show improved translation performance without sacrificing training or decoding speed, and single models achieve state-of-the-art results on the WMT'14 English→French translation task.

Main Contributions

Introduces an approximate training algorithm based on importance sampling for NMT models.
The approach allows training NMT models with a much larger target vocabulary.
The proposed algorithm effectively keeps the computational complexity during training at the level of using only a small subset of the full vocabulary.
Shows that larger vocabularies can yield better translation performance without sacrificing training or decoding speed.
Achieves state-of-the-art translation performance with single NMT models on the WMT'14 English→French translation task.
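
As a rough illustration of the complexity claim in the list above, here is a minimal sketch in plain Python. It assumes the simplest form of the approximation, where the softmax normaliser is summed over a sampled subset of target words that contains the correct word. The function names, the toy Gaussian scores, and the subset size are all made up for this example and are not taken from the paper.

```python
import math
import random


def full_softmax_nll(scores, target):
    """Negative log-likelihood of `target` with the normaliser over every word."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[target]


def subset_softmax_nll(scores, target, subset):
    """Same loss, but the normaliser only sums over `subset`.

    `subset` must contain `target`. A real model would compute scores only
    for the subset; the full list is indexed here purely for convenience.
    """
    if target not in subset:
        raise ValueError("subset must contain the target word")
    sub = [scores[i] for i in subset]
    m = max(sub)
    log_z = m + math.log(sum(math.exp(s - m) for s in sub))
    return log_z - scores[target]


# Toy setup (hypothetical numbers): 50k-word vocabulary, 500-word subset.
rng = random.Random(0)
vocab_size = 50_000
scores = [rng.gauss(0.0, 1.0) for _ in range(vocab_size)]
target = 123
subset = set(rng.sample(range(vocab_size), 500)) | {target}

full = full_softmax_nll(scores, target)
approx = subset_softmax_nll(scores, target, subset)
print(f"full-vocabulary loss: {full:.3f}, subset loss: {approx:.3f}")
```

The subset loss costs time proportional to the subset size rather than the vocabulary size. It is never larger than the full loss, because its normaliser sums fewer positive terms.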

Abstract

Neural machine translation, a recently proposed approach to machine translation based purely on neural networks, has shown promising results compared to the existing approaches such as phrase-based statistical machine translation. Despite its recent success, neural machine translation has its limitation in handling a larger vocabulary, as training complexity as well as decoding complexity increase proportionally to the number of target words. In this paper, we propose a method based on importance sampling that allows us to use a very large target vocabulary without increasing training complexity. We show that decoding can be efficiently done even with the model having a very large target vocabulary by selecting only a small subset of the whole target vocabulary. The models trained by the proposed approach are empirically found to match, and in some cases outperform, the baseline models with a small vocabulary as well as the LSTM-based neural machine translation models. Furthermore, when we use an ensemble of a few models with very large target vocabularies, we achieve performance comparable to the state of the art (measured by BLEU) on both the English→German and English→French translation tasks of WMT'14.
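
The abstract's point about decoding with "a small subset of the whole target vocabulary" can be sketched as follows. This assumes the candidate list for a sentence is the union of the most frequent target words and a few dictionary translations per source word. The helper names, the toy dictionary, and the scores are hypothetical and only meant to show the shape of the idea.

```python
def build_candidates(source_words, frequent_targets, dictionary, k_common, k_per_word):
    """Collect a per-sentence candidate list of target words.

    frequent_targets: target words sorted from most to least frequent.
    dictionary: maps a source word to likely target words, best first.
    """
    candidates = set(frequent_targets[:k_common])
    for word in source_words:
        candidates.update(dictionary.get(word, [])[:k_per_word])
    return candidates


def pick_best(score_fn, candidates):
    """Pick the highest-scoring word, looking only at the candidates."""
    return max(sorted(candidates), key=score_fn)


# Toy data, invented for this example.
frequent_targets = ["le", "la", "de", "et"]
dictionary = {"cat": ["chat", "minou"], "sleeps": ["dort", "sommeil"]}
candidates = build_candidates(
    ["the", "cat", "sleeps"], frequent_targets, dictionary, k_common=2, k_per_word=1
)

# Stand-in for model scores at one decoding step.
toy_scores = {"le": 0.1, "la": 0.05, "chat": 0.7, "dort": 0.2}
best = pick_best(lambda w: toy_scores.get(w, 0.0), candidates)
print(sorted(candidates), best)
```

Each decoding step then scores only the candidate words, so the per-step cost depends on the candidate list size rather than on the full vocabulary.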

References [22]

[1]Efficient Estimation of Word Representations in Vector Space

Tomas Mikolov, K. Chen, G. S. Corrado, Jeffrey Dean - 2013

26 papers in library cite

Expanded word2vec. Very nice overall.

[2]Neural Machine Translation by Jointly Learning to Align and Translate

D. Bahdanau, Kyunghyun Cho, Yoshua Bengio - 2014

59 papers in library cite

Introduces the attention mechanism - amazing overall

[3]Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation

Kyunghyun Cho, B. V. Merrienboer, C. G. Gulcehre, D. Bahdanau, F. Bougares, Holger Schwenk, Yoshua Bengio - 2014

38 papers in library cite

Introduces RNN encoder-decoder. I love it :)

[4]BLEU: A Method for Automatic Evaluation of Machine Translation

K. Papineni, S. Roukos, T. Ward, Wei Jing Zhu - 2002

19 papers in library cite

Very cool idea. Simple yet very impactful!

[5]Sequence to Sequence Learning With Neural Networks

Ilya Sutskever, Oriol Vinyals, Quoc V. Le - 2014

58 papers in library cite

Good paper, but I think it only got famous because it set a strong new baseline for NNs in MT. Their main contribution was reversing the source sentence, TBH.

[6]On the Properties of Neural Machine Translation: Encoder-Decoder Approaches

Kyunghyun Cho, B. V. Merrienboer, D. Bahdanau, Yoshua Bengio - 2014

9 papers in library cite

This just builds on other work, comparing and analyzing previous architectures. It tackles a problem specific to its time: long sentences. Not too relevant today.

[7]Recurrent Continuous Translation Models

N. Kalchbrenner, Phil Blunsom - 2013

27 papers in library cite

Good paper, probably the first that used an encoder-decoder. But they used a conv. NN as the encoder instead of a recurrent one, which I don't really like.

[8]Theano: New Features and Speed Improvements

F. Bastien, P. Lamblin, Razvan Pascanu, James Bergstra, I. Goodfellow, A. Bergeron, A. Bouchard, N. Nicolas, Yoshua Bengio - 2012

13 papers in library cite

The paper itself is ok, but it seems like they just wanted to write a paper to say "hey, ours is so much better now after these improvements, we crushed Torch7". A bit tryhard tbh.

[9]Addressing the Rare Word Problem in Neural Machine Translation

T. Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, Wojciech Zaremba - 2014

14 papers in library cite

The method was very poorly explained. It was also worse and more complicated than a paper released earlier. Overall not that good.

[10]Adaptive Importance Sampling to Accelerate Training of a Neural Probabilistic Language Model

Yoshua Bengio, Jean Sebastien Senecal - 2008

6 papers in library cite

Just a rerun of a previous paper. This is ok, but really not too different.

[11]Theano: A CPU and GPU Math Expression Compiler

James Bergstra, O. Breuleux, F. Bastien, P. Lamblin, Razvan Pascanu, G. Desjardins, J. Turian, D. W. Farley, Yoshua Bengio - 2010

22 papers in library cite

Very nice framework. Symbolic programming is very nice. However, I think that this had very little impact and was mostly used by Bengio's lab.

[12]Statistical Phrase-Based Translation

P. Koehn, F. J. Och, D. Marcu - 2003

8 papers in library cite

[13]Noise-Contrastive Estimation: A New Estimation Principle for Unnormalized Statistical Models

M. Gutmann, A. Hyvarinen - 2010

7 papers in library cite

[14]Edinburgh's Phrase-Based Machine Translation Systems for WMT-14

N. Durrani, B. Haddow, P. Koehn, K. Heafield - 2014

6 papers in library cite

[15]Statistical Machine Translation

P. Koehn - 2010

5 papers in library cite

[16]A Simple, Fast, and Effective Reparameterization of IBM Model 2

C. Dyer, V. Chahuneau, Noah A. Smith - 2013

4 papers in library cite

[17]Learning Word Embeddings Efficiently With Noise-Contrastive Estimation

A. Mnih, Koray Kavukcuoglu - 2013

4 papers in library cite

[18]N-Gram Counts and Language Models From the Common Crawl

C. Buck, K. Heafield, B. V. Ooyen - 2014

3 papers in library cite

[19]Recursive hetero-associative Memories for Translation

M. L. Forcada, R. P. Neco - 1997

2 papers in library cite

[20]Eu-bridge MT: Combined Machine Translation

M. Freitag, S. Peitz, J. Wuebker, Hermann Ney, M. Huck, R. Sennrich, N. Durrani, M. Nadejde, P. Williams, P. Koehn - 2014

1 paper in library cites

[21]The DCU-ICTCAS MT system at WMT 2014 on German-English Translation Task

Lei Li, Xiaobao Wu, S. C. Vaillo, J. Xie, A. Way, Qian Liu - 2014

1 paper in library cites

[22]The RWTH Aachen German-English Machine Translation System for WMT 2014

S. Peitz, J. Wuebker, M. Freitag, Hermann Ney - 2014

1 paper in library cites

Cited by 12 papers in your library

Cites 11 papers in your library

Read on October 14, 2025

It's nice, but it starts getting a bit into the realm of "yeah, that seems like a minor improvement". It's nice that they use the importance sampling stuff from the previous paper though - I thought it had completely vanished :)
