2012

Subword Language Modeling With Neural Networks

Tomas Mikolov, Ilya Sutskever, A. Deoras, H. S. Le, S. Kombrink, Jan Cernocky

Cite Score

12

AI summary

This paper introduces a subword language model using neural networks, combining character and word-level advantages. It demonstrates that neural network models can be significantly smaller than compressed n-gram models while maintaining performance on the Broadcast news RT04 task, with further size reductions possible through sub-word units and quantization.

Main Contributions

Proposed a simple technique for learning sub-word level units from data, combining the advantages of character and word-level models.
Showed that neural network based language models can be an order of magnitude smaller than compressed n-gram models.
Demonstrated that using quantization, memory requirements can be reduced by around 90% while maintaining word error rate.
Explored the possibility of further reduction of size of the neural network language model by decomposing infrequent words into subwords.
Achieved comparable or better performance than n-gram models in speech recognition tasks with significantly smaller neural network models.
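
As a rough illustration of the quantization point above, here is a minimal sketch assuming plain uniform quantization of 32-bit float weights to 3-bit codes. The summary does not say which quantization scheme or bit width the paper uses, so `quantize`, `dequantize`, `memory_saving`, and the 3-bit setting are all assumptions, picked only because 3 bits out of 32 lands near the reported ~90% saving.

```python
import random


def quantize(weights, bits):
    """Uniformly quantize floats to integer codes in [0, 2**bits - 1]."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1
    # Guard against a constant weight vector (hi == lo).
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Map integer codes back to approximate float weights."""
    return [lo + c * scale for c in codes]


def memory_saving(bits, float_bits=32):
    """Fraction of weight storage saved, ignoring the (lo, scale) overhead."""
    return 1.0 - bits / float_bits


# Hypothetical weights; a real model would supply its own matrices.
random.seed(0)
weights = [random.gauss(0.0, 0.1) for _ in range(1000)]

codes, lo, scale = quantize(weights, bits=3)
restored = dequantize(codes, lo, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print(f"saving at 3 bits: {memory_saving(3):.1%}, max abs error: {max_err:.4f}")
```

Under these assumed settings the reconstruction error of any single weight is bounded by half a quantization step (`scale / 2`); whether that keeps word error rate unchanged is an empirical question the sketch does not answer.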

Abstract

We explore the performance of several types of language models on the word-level and the character-level language modeling tasks. This includes two recently proposed recurrent neural network architectures, a feedforward neural network model, a maximum entropy model and the usual smoothed n-gram models. We then propose a simple technique for learning sub-word level units from the data, and show that it combines advantages of both character and word-level models. Finally, we show that neural network based language models can be an order of magnitude smaller than compressed n-gram models, at the same level of performance when applied to a Broadcast news RT04 speech recognition task. By using sub-word units, the size can be reduced even more.
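
To make the sub-word idea concrete, here is a minimal sketch assuming the simplest possible scheme: keep the most frequent words whole and spell every other word out as characters. The abstract does not describe how the units are actually learned, so the frequency cutoff, the `+` continuation marker, and the function names below are hypothetical and should not be read as the paper's procedure.

```python
from collections import Counter


def build_vocab(tokens, max_words):
    """Keep the `max_words` most frequent words as whole-word units."""
    counts = Counter(tokens)
    return {word for word, _ in counts.most_common(max_words)}


def to_subwords(tokens, vocab, marker="+"):
    """Emit frequent words as-is; split rare words into marked characters."""
    units = []
    for tok in tokens:
        if tok in vocab:
            units.append(tok)
        else:
            # Every character except the last carries a continuation marker,
            # so word boundaries can be recovered later.
            units.extend(ch + marker for ch in tok[:-1])
            units.append(tok[-1])
    return units


def from_subwords(units, marker="+"):
    """Invert to_subwords, assuming no whole word ends with the marker."""
    words, current = [], ""
    for unit in units:
        if unit.endswith(marker):
            current += unit[: -len(marker)]
        else:
            words.append(current + unit)
            current = ""
    return words


tokens = "the cat sat on the mat the end".split()
vocab = build_vocab(tokens, max_words=1)
units = to_subwords(tokens, vocab)
print(units)
```

The design point this tries to show is the trade-off named in the abstract: the unit inventory stays small like a character model, while frequent words still get a single unit like a word model.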

References [23]

[1]Learning Internal Representations by Error Propagation

D. E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams - 1986

46 papers in library cite

I expected very little of this, but it was so good at explaining concepts! Very good read. It gets a bit boring when it starts explaining things toward the end of the chapter, but good nonetheless.

[2]Finding Structure in Time

Jeffrey L. Elman - 1990

23 papers in library cite

Good paper overall that introduces the concept of an RNN. However, applications and results are still very primitive.

[3]A Neural Probabilistic Language Model

Yoshua Bengio, R. Ducharme, Pascal Vincent - 2001

62 papers in library cite

What started it all. Very simple and elegant.

[4]Recurrent Neural Network Based Language Model

Tomas Mikolov, M. Karafiat, Lukas Burget, Jan Cernocky, Sanjeev Khudanpur - 2010

36 papers in library cite

The comeback of RNNs for language modeling. Not too exciting but impactful and a short read.

[5]Srilm - An Extensible Language Modeling Toolkit

Andreas Stolcke - 2002

13 papers in library cite

Toolkit for n-grams. Not too relevant and sounds veeeery simple (sorry for those who implemented it). It's nice to see an early implementation of OOP though. The paper is boring and doesn't really say much about the framework; it is more of a description of how to use the commands and n-gram models.

[6]Generating Text With Recurrent Neural Networks

Ilya Sutskever, James Martens, Geoffrey E. Hinton - 2011

13 papers in library cite

Pleasant paper but results are underwhelming. They use RNNs for character-level modeling, which is different. They also use the hessian-free method proposed by Martens, but don't go too deep into how it works, which is nice because otherwise it would be very mathy. Other papers cite this more as an example of usage rather than an actual milestone.

[7]Extensions of Recurrent Neural Network Language Model

Tomas Mikolov, S. Kombrink, Lukas Burget, Jan Cernocky, Sanjeev Khudanpur - 2011

16 papers in library cite

Doesn't add much.

[8]A Scalable Hierarchical Distributed Language Model

A. Mnih, Geoffrey E. Hinton - 2009

16 papers in library cite

Good paper that introduces hierarchical trees as an alternative to the expensive softmax output. I think this is not really relevant anymore, but good read.

[9]Learning Recurrent Neural Networks With Hessian-Free Optimization

James Martens, Ilya Sutskever - 2011

13 papers in library cite

Meh, very very mathy and seems like a minor improvement on LSTMs. They give very impressive results but I don't think it's too much of a fair comparison (vs. a 1999 architecture of LSTMs). I think they did well in showing that exploding/vanishing gradients can be overcome, but their method is not "it". As said by Bengio, they are partially responsible for the revival of RNNs (alongside Mikolov).

[10]Strategies for Training Large Scale Neural Network Language Models

Tomas Mikolov, A. Deoras, D. Povey, Lukas Burget, Jan Cernocky - 2011

9 papers in library cite

Just builds on other things. Very minor stuff in my opinion.

[11]Empirical Evaluation and Combination of Advanced Language Modeling Techniques

Tomas Mikolov, A. Deoras, S. Kombrink, Lukas Burget, Jan Cernocky - 2011

13 papers in library cite

Early work proving that NNs can be good. But very uninteresting overall.

[12]Hybrid Word-Subword Decoding for Spoken Term Detection

Lukas Burget - 2008

1 paper in library cites

The writing is bad and it's uninteresting. Also has nothing to do with NNs. I think Mikolov only cited it because it's from his alma mater. It doesn't explain anything about the approach (or if it does, it's very hard to follow).

[13]Hybrid Language Models Using Mixed Types of Sub-Lexical Units for Open Vocabulary German LVCSR

M. Shaik, A. Mousa, R. Schluter, Hermann Ney - 2011

1 paper in library cites

Very uninteresting, and nothing related to NNs. TBH I am not sure why Mikolov cited this specifically, as there are several other good examples. The main contribution here is using a mix of different techniques, but it doesn't bring anything original.

[14]Modelling Out-of-Vocabulary Words for Robust Speech Recognition

I. Bazzi - 2002

3 papers in library cite

[15]The IBM Attila Speech Recognition Toolkit

H. Soltau, G. Saon, Brian Kingsbury - 2010

3 papers in library cite

[16]A Fast Rescoring Strategy to Capture Long-Distance Dependencies

A. Deoras, Tomas Mikolov, K. Church - 2011

2 papers in library cite

[17]Adaptive Weighing of Context Models for Lossless Data Compression

M. Mahoney - 2005

2 papers in library cite

[18]A Succinct N-Gram Language Model

T. Watanabe, H. Tsukada, H. Isozaki - 2009

1 paper in library cites

[19]Compressing Trigram Language Models With Golomb Coding

K. Church, R. Wa, T. Hart, Jianfeng Gao - 2007

1 paper in library cites

[20]Learning Sub-Word Units for Open Vocabulary Speech Recognition

C. Parada, Mark Dredze, A. Sethy, A. Rastrow - 2011

1 paper in library cites

[21]Mandarin Word-Character Hybrid-Input Neural Network Language Model

M. Kang, T. Ng, L. Nguyen - 2011

1 paper in library cites

[22]Phonotactic and Acoustic Language Recognition

P. Matejka - 2009

1 paper in library cites

[23]Recovery of Rare Words in Lecture Speech

S. Kombrink, M. Hannemann, Lukas Burget, H. Hermansky - 2010

1 paper in library cites

Cited by 7 papers in your library

Cites 13 papers in your library

Read on June 20, 2025

Doesn't contribute a ton beyond showing that subword models can perform better than char-level models. Probably the most important thing here is the early use of subwords.
