Papperoni

2018

Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context

U. Khandelwal, He He, P. Qi, Dan Jurafsky

Cite Score: 19

AI summary

This paper analyzes how an LSTM language model uses prior context, through ablation studies on Penn Treebank and WikiText-2. The model uses about 200 tokens of context on average but sharply distinguishes nearby context (the most recent 50 tokens) from the distant history, where it ignores word order. The neural caching model (Grave et al., 2017b) especially helps the LSTM copy words from that distant context.

Main Contributions

Analyzes the role of context in an LSTM language model through ablation studies.
Finds that the model is capable of using about 200 tokens of context on average, but sharply distinguishes nearby context (recent 50 tokens) from the distant history.
Shows that the model is highly sensitive to the order of words within the most recent sentence, but ignores word order in the long-range context (beyond 50 tokens).
Demonstrates that the neural caching model (Grave et al., 2017b) especially helps the LSTM to copy words from within this distant context.
Provides a better understanding of how neural LMs use their context and sheds light on recent success from cache-based models.
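The ablations perturb prior context words by shuffling, replacing, or dropping them, then measure the increase in perplexity. Below is a minimal sketch of that idea, not the authors' code. The names `perturb_distant`, `perplexity`, and `perplexity_increase` are hypothetical, and sampling replacement tokens uniformly from a vocabulary is an assumption made for illustration; the paper's exact procedure may differ. Scoring with an actual language model is left out.

```python
import math
import random
from typing import List, Optional, Sequence


def perturb_distant(
    context: Sequence[str],
    nearby: int,
    mode: str,
    rng: random.Random,
    vocab: Optional[Sequence[str]] = None,
) -> List[str]:
    """Perturb every token except the most recent `nearby` tokens.

    mode is "shuffle", "drop", or "replace" (hypothetical names).
    """
    split = max(len(context) - nearby, 0)
    distant, recent = list(context[:split]), list(context[split:])
    if mode == "shuffle":
        rng.shuffle(distant)  # destroys word order, keeps the bag of words
    elif mode == "drop":
        distant = []  # truncates the context to the nearby tokens
    elif mode == "replace":
        if not vocab:
            raise ValueError("replace mode needs a non-empty vocab")
        # Assumption: uniform sampling from the vocabulary.
        distant = [rng.choice(vocab) for _ in distant]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return distant + recent


def perplexity(log_probs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities."""
    if not log_probs:
        raise ValueError("need at least one log probability")
    return math.exp(-sum(log_probs) / len(log_probs))


def perplexity_increase(
    base_log_probs: Sequence[float], perturbed_log_probs: Sequence[float]
) -> float:
    """Perplexity with perturbed context minus perplexity with intact context."""
    return perplexity(perturbed_log_probs) - perplexity(base_log_probs)
```

To use it, score the same target tokens twice with a language model of your choice, once with the intact context and once with the output of `perturb_distant`, and pass both lists of log probabilities to `perplexity_increase`. An increase near zero for `"shuffle"` with `nearby=50` would match the finding that word order in the distant context is ignored.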

Abstract

We know very little about how neural language models (LM) use prior linguistic context. In this paper, we investigate the role of context in an LSTM LM, through ablation studies. Specifically, we analyze the increase in perplexity when prior context words are shuffled, replaced, or dropped. On two standard datasets, Penn Treebank and WikiText-2, we find that the model is capable of using about 200 tokens of context on average, but sharply distinguishes nearby context (recent 50 tokens) from the distant history. The model is highly sensitive to the order of words within the most recent sentence, but ignores word order in the long-range context (beyond 50 tokens), suggesting the distant past is modeled only as a rough semantic field or topic. We further find that the neural caching model (Grave et al., 2017b) especially helps the LSTM to copy words from within this distant context. Overall, our analysis not only provides a better understanding of how neural LMs use their context, but also sheds light on recent success from cache-based models.
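The caching model mentioned at the end of the abstract can be sketched as follows. This is a simplified illustration in the spirit of the continuous cache of Grave et al., not their implementation: past hidden states are stored together with the word that followed them, the current hidden state is matched against them by dot product, and the resulting cache distribution is linearly interpolated with the model's own distribution. The function names, the flatness parameter `theta`, and the mixing weight `lam` are assumptions of this sketch.

```python
import math
from typing import Dict, List, Sequence, Tuple

Memory = List[Tuple[Sequence[float], str]]  # (past hidden state, word that followed it)


def cache_distribution(
    query: Sequence[float], memory: Memory, theta: float
) -> Dict[str, float]:
    """Distribution over cached words, weighted by similarity to `query`."""
    if not memory:
        return {}  # nothing to copy from yet
    scores = [theta * sum(q * h for q, h in zip(query, hid)) for hid, _ in memory]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dist: Dict[str, float] = {}
    for (_, word), weight in zip(memory, weights):
        # A word stored several times accumulates probability mass.
        dist[word] = dist.get(word, 0.0) + weight / total
    return dist


def interpolate(
    p_vocab: Dict[str, float], p_cache: Dict[str, float], lam: float
) -> Dict[str, float]:
    """Mix the model distribution with the cache distribution."""
    words = set(p_vocab) | set(p_cache)
    return {
        w: (1.0 - lam) * p_vocab.get(w, 0.0) + lam * p_cache.get(w, 0.0)
        for w in words
    }
```

Because the cache only needs to know which words appeared and roughly in what kind of state, not their order, it fits the finding that distant context is used as a rough topic signal. With an empty cache, `interpolate` should be skipped (or `lam` set to 0) so the result still sums to one.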

References [27]

[1]Long Short-Term Memory

Sepp Hochreiter, Jürgen Schmidhuber - 1997

94 papers in library cite

LSTMs FTW!

[2]Neural Machine Translation by Jointly Learning to Align and Translate

D. Bahdanau, Kyunghyun Cho, Yoshua Bengio - 2014

59 papers in library cite

Introduces the attention mechanism - amazing overall

[3]Building a Large Annotated Corpus of English: The Penn Treebank

M. P. Marcus, B. Santorini, Mary Ann Marcinkiewicz - 1993

22 papers in library cite

Well, not really interesting, but very cool to see how the Penn Treebank was made.

[4]Recurrent Neural Network Based Language Model

Tomas Mikolov, M. Karafiat, Lukas Burget, Jan Cernocky, Sanjeev Khudanpur - 2010

36 papers in library cite

The comeback of RNNs for language modeling. Not too exciting but impactful and a short read.

[5]Generating Sequences With Recurrent Neural Networks

Alex Graves - 2013

27 papers in library cite

Very cool, and the first to actually propose the attention mechanism! It gets a bit mathy but nothing too crazy. It also has the first examples of good machine-generated writing I've seen in these papers, so very nice results.

[6]Regularization of Neural Networks Using Dropconnect

L. Wan, M. Zeiler, S. Zhang, Rob Fergus - 2013

8 papers in library cite

I feel that the method is very complex and does not improve much on top of regular dropout.

[7]Pointer Sentinel Mixture Models

S. Merity, Caiming Xiong, J. Bradbury, Richard Socher - 2017

12 papers in library cite

I really liked the methodology, but I had to read it a few times to understand it intuitively - I think they should have done a better job at explaining it.

[8]A Theoretically Grounded Application of Dropout in Recurrent Neural Networks

Yarin Gal - 2015

9 papers in library cite

The methodology is very simple and effective, but maybe it was too simple for a paper, so they put A TON of math on top of it... Unnecessary really.

[9]Exploring the Limits of Language Modeling

R. Jozefowicz, Oriol Vinyals, M. Schuster, Noam Shazeer, Yonghui Wu - 2016

20 papers in library cite

It's funny because at first I did not like it, but then it clicked and I really liked it: they are trying to get around the large-vocabulary and rare-word problems. In the end it's SotA, but I think it's too convoluted and was replaced by Transformers.

[10]Using the Output Embedding to Improve Language Models

O. Press, Lior Wolf - 2017

7 papers in library cite

I did not like this paper at all. It is not bad, it's just that I expected *way* more. Good results, but uninteresting.

[11]The Goldilocks Principle: Reading Children's Books With Explicit Memory Representations

F. Hill, Antoine Bordes, S. Chopra, Jason Weston - 2015

14 papers in library cite

Cool use of memory networks.

[12]Improving Neural Language Models With a Continuous Cache

E. Grave, Armand Joulin, Nicolas Usunier - 2016

7 papers in library cite

This was a surprise to me - I expected this to suck. However, they provide a simple and intuitive way of improving LMs. Nice.

[13]Language Modeling With Gated Convolutional Networks

Yann N. Dauphin, A. Fan, Michael Auli, D. Grangier - 2016

8 papers in library cite

[14]Regularizing and Optimizing LSTM Language Models

S. Merity, Nitish Shirish Keskar, Richard Socher - 2017

6 papers in library cite

SotA LM in 2017

[15]On the State of the Art of Evaluation in Neural Language Models

G. Melis, C. Dyer, Phil Blunsom - 2018

6 papers in library cite

SotA LM in 2017

[16]The Stanford CoreNLP Natural Language Processing Toolkit

Christopher D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, D. Mcclosky - 2014

6 papers in library cite

[17]Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling

H. Inan, K. Khosravi, Richard Socher - 2017

6 papers in library cite

[18]Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies

Tal Linzen, E. Dupoux, Y. Goldberg - 2016

5 papers in library cite

[19]Breaking the Softmax Bottleneck: A High-Rank RNN Language Model

Zhilin Yang, Z. Dai, Ruslan Salakhutdinov, W. W. Cohen - 2017

4 papers in library cite

[20]Fine-Grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks

Y. Adi, E. Kermany, Yonatan Belinkov, O. Lavi, Y. Goldberg - 2016

4 papers in library cite

[21]Contextual LSTM (CLSTM) Models for Large Scale NLP Tasks

Sayan Ghosh, Oriol Vinyals, B. Strope, S. Roy, T. Dean, L. Heck - 2016

1 paper in library cites

[22]Larger-Context Language Modelling With Recurrent Neural Network

Tian Wang, Kyunghyun Cho - 2016

1 paper in library cites

[23]N-Gram Language Modeling Using Recurrent Neural Network Estimation

C. Chelba, M. Norouzi, Samy Bengio - 2017

1 paper in library cites

[24]Syntactic Topic Models

J. B. Graber, D. Blei - 2009

1 paper in library cites

[25]Topically Driven Neural Language Model

J. H. Lau, T. Baldwin, T. Cohn - 2017

1 paper in library cites

[26]Unbounded Cache Model for Online Language Modeling With Open Vocabulary

E. Grave, M. M. Cisse, Armand Joulin - 2017

1 paper in library cites

[27]Visualizing and Understanding Neural Models in NLP

Jiwei Li, X. Chen, Eduard Hovy, Dan Jurafsky - 2016

1 paper in library cites

Cited by: 2 papers in your library

Cites: 15 papers in your library

Read on November 16, 2025

Very nice paper, and the analysis makes sense! People thought that LSTMs had "infinite memory", but this paper shows that the model uses only about 200 tokens of context on average.
