Papperoni

2017

Pointer Sentinel Mixture Models

S. Merity, Caiming Xiong, J. Bradbury, Richard Socher

citations

Cite Score

66

AI summary

This paper introduces the pointer sentinel mixture model. Its pointer sentinel-LSTM variant achieves state-of-the-art language modeling performance on the Penn Treebank (70.9 perplexity) with fewer parameters than a standard softmax LSTM. The paper also introduces a new benchmark dataset for language modeling called WikiText.

Main Contributions

Introduces the pointer sentinel mixture model, instantiated as the pointer sentinel-LSTM, for language modeling.
The pointer sentinel-LSTM model achieves state-of-the-art language modeling performance on the Penn Treebank with 70.9 perplexity.
The pointer sentinel-LSTM model uses fewer parameters than a standard softmax LSTM.
Introduces a new benchmark dataset for language modeling called WikiText.
The pointer component is heavily used for rare names.
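
For readers unfamiliar with the metric behind the 70.9 figure, here is a minimal sketch of how perplexity is usually computed: the exponential of the average negative log-probability the model assigns to each token. The function name and the toy probabilities are illustrative assumptions, not taken from the paper.

```python
import math

def perplexity(token_probs):
    """Perplexity of a sequence, given the model's probability for each true token.

    token_probs is a hypothetical list of probabilities in (0, 1],
    one per token in the evaluated text.
    """
    # Mean negative log-likelihood per token (natural log).
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # Perplexity is the exponential of that mean.
    return math.exp(nll)

# A model that assigns probability 0.25 to every token is as uncertain
# as a uniform choice among 4 words.
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```

Lower is better: a perplexity of 70.9 roughly means the model is, on average, as uncertain as if it were choosing uniformly among about 71 words.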

Abstract

Recent neural network sequence models with softmax classifiers have achieved their best language modeling performance only with very large hidden states and large vocabularies. Even then they struggle to predict rare or unseen words even if the context makes the prediction unambiguous. We introduce the pointer sentinel mixture architecture for neural sequence models which has the ability to either reproduce a word from the recent context or produce a word from a standard softmax classifier. Our pointer sentinel-LSTM model achieves state of the art language modeling performance on the Penn Treebank (70.9 perplexity) while using far fewer parameters than a standard softmax LSTM. In order to evaluate how well language models can exploit longer contexts and deal with more realistic vocabularies and larger corpora we also introduce the freely available WikiText corpus.
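
The mixing step described in the abstract can be sketched as follows. This is a simplified NumPy illustration, assuming the attention scores over the context and the sentinel score have already been computed by the recurrent network; the function and argument names are hypothetical and this is not the authors' implementation. The sentinel competes with the context positions in one softmax, and the probability mass it receives acts as the gate that hands the prediction over to the vocabulary softmax.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def pointer_sentinel_mixture(vocab_logits, pointer_scores, sentinel_score, context_ids):
    """Combine a vocabulary softmax with a pointer over recent context words.

    vocab_logits:   (V,) scores from the softmax classifier
    pointer_scores: (L,) attention scores, one per context position
    sentinel_score: scalar score for the sentinel
    context_ids:    (L,) vocabulary ids of the context words
    All names are illustrative assumptions.
    """
    p_vocab = softmax(vocab_logits)
    # One softmax over the L context positions plus the sentinel.
    a = softmax(np.append(pointer_scores, sentinel_score))
    gate = a[-1]  # mass given to the sentinel = weight of the vocabulary softmax
    p = gate * p_vocab
    # Add pointer mass to each context word; repeated words accumulate.
    np.add.at(p, context_ids, a[:-1])
    return p, gate

# Toy example: vocabulary of 5 words, context contains word 1 twice and word 3 once.
p, gate = pointer_sentinel_mixture(
    vocab_logits=np.zeros(5),
    pointer_scores=np.zeros(3),
    sentinel_score=0.0,
    context_ids=np.array([1, 3, 1]),
)
```

Because the pointer weights and the gate come from the same softmax, the result is a valid probability distribution without any extra normalization, and a word that appears in the recent context can receive high probability even when the vocabulary softmax rates it as rare.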

References [27]

[1]Long Short-Term Memory

Sepp Hochreiter, Jürgen Schmidhuber - 1997

94 papers in library cite

LSTMs FTW!

[2]Neural Machine Translation by Jointly Learning to Align and Translate

D. Bahdanau, Kyunghyun Cho, Yoshua Bengio - 2014

59 papers in library cite

Introduces the attention mechanism - amazing overall

[3]Building a Large Annotated Corpus of English: The Penn Treebank

M. P. Marcus, B. Santorini, Mary Ann Marcinkiewicz - 1993

22 papers in library cite

Well, not really interesting, but very cool to see how the Penn Treebank was made.

[4]On the Difficulty of Training Recurrent Neural Networks

Razvan Pascanu, Tomas Mikolov, Yoshua Bengio - 2013

21 papers in library cite

It starts very mathy but in the end there are some very nice contributions! You don't actually need to understand the math to know what's going on in the end.

[5]Recurrent Neural Network Based Language Model

Tomas Mikolov, M. Karafiat, Lukas Burget, Jan Cernocky, Sanjeev Khudanpur - 2010

36 papers in library cite

The comeback of RNNs for language modeling. Not too exciting but impactful and a short read.

[6]Pointer Networks

Oriol Vinyals, M. Fortunato, Navdeep Jaitly - 2015

10 papers in library cite

Cool concept. Nice that it works and can find good solutions for TSP.

[7]Recurrent Neural Network Regularization

Wojciech Zaremba, Ilya Sutskever, Oriol Vinyals - 2014

22 papers in library cite

It's a very simple idea and, TBH, it's no different from dropout. It's good that the paper is very short and straightforward, but it could have been a paragraph long.

[8]End-to-End Memory Networks

S. Sukhbaatar, A. Szlam, Jason Weston, Rob Fergus - 2015

18 papers in library cite

This was so surprising! This is very similar to transformers and RAG. Who knew?!

[9]A Theoretically Grounded Application of Dropout in Recurrent Neural Networks

Yarin Gal, Zoubin Ghahramani - 2015

9 papers in library cite

The methodology is very simple and effective, but maybe it was too simple for a paper, so they put A TON of math on top of it... Unnecessary really.

[10]Long Short-Term Memory-Networks for Machine Reading

Jianpeng Cheng, Li Dong, Mirella Lapata - 2016

8 papers in library cite

I read this more as an example of intra-attention, but this is not the main focus of the paper. I think visualization/explanation is a bit bad, and it doesn't seem too impactful. I kept thinking that this is starting to get too complicated, and indeed it was surpassed by transformers right after that.

[11]How to Construct Deep Recurrent Neural Networks

Razvan Pascanu, C. G. Gulcehre, Kyunghyun Cho, Yoshua Bengio - 2013

7 papers in library cite

Very interesting despite not being too relevant. Good read and a new way of thinking about RNNs.

[12]One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling

C. Chelba, Tomas Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, Tony Robinson - 2013

13 papers in library cite

It's somewhat shallow, but I can see the importance of this paper.

[13]Context Dependent Recurrent Neural Network Language Model

Tomas Mikolov, Geoffrey Zweig - 2012

12 papers in library cite

Nothing too interesting, just using the context of the RNN.

[14]Ask Me Anything: Dynamic Memory Networks for Natural Language Processing

A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, I. Gulrajani, Victor Zhong, R. Paulus, Richard Socher - 2015

9 papers in library cite

[15]Pointing the Unknown Words

C. G. Gulcehre, S. Ahn, R. Nallapati, B. Zhou, Yoshua Bengio - 2016

7 papers in library cite

Bengio, pointer networks

[16]Recurrent Highway Networks

J. G. Zilly, R. K. Srivastava, J. Koutnik, Jürgen Schmidhuber - 2016

6 papers in library cite

Recurrent highway networks - maybe read after the original highway network

[17]Latent Predictor Networks for Code Generation

W. Ling, Edward Grefenstette, K. M. Hermann, T. Kocisky, A. Senior, Feng Wang, Phil Blunsom - 2016

3 papers in library cite

Code generation

[18]Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations

David Krueger, T. Maharaj, J. Kramar, M. Pezeshki, Nicolas Ballas, N. R. Ke, A. G. A. P. Goyal, Yoshua Bengio, Hugo Larochelle, Aaron Courville - 2016

3 papers in library cite

Zoneout: a variation of dropout, but with very few citations

[19]Text Understanding With the Attention Sum Reader Network

R. Kadlec, M. Schmid, O. Bajgar, Jan Kleindienst - 2016

7 papers in library cite

Not cited much overall, but cited a lot within this library

[20]Moses: Open Source Toolkit for Statistical Machine Translation

P. Koehn, H. Hoang, Alexandra Birch, Chris Callison Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, E. Herbst - 2007

8 papers in library cite

[21]Character-Aware Neural Language Models

Yoon Kim, Yacine Jernite, D. Sontag, Alexander M. Rush - 2016

7 papers in library cite

[22]A Maximum Entropy Approach to Adaptive Statistical Language Modeling

R. Rosenfeld - 1996

6 papers in library cite

[23]Dynamic Memory Networks for Visual and Textual Question Answering

Caiming Xiong, S. Merity, Richard Socher - 2016

5 papers in library cite

[24]Fine-Grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks

Y. Adi, E. Kermany, Yonatan Belinkov, O. Lavi, Y. Goldberg - 2016

4 papers in library cite

[25]Incorporating Copying Mechanism in Sequence-to-Sequence Learning

J. Gu, Z. L. Lu, H. Li, V. O. K. Li - 2016

4 papers in library cite

[26]Language Modeling With Sum-Product Networks

W. C. Cheng, S. Kok, H. V. Pham, H. L. Chieu, K. M. A. Chai - 2014

2 papers in library cite

[27]A Neural Knowledge Language Model

S. Ahn, H. Choi, T. Parnamaa, Yoshua Bengio - 2016

1 paper in library cites

Cited by 12 papers in your library

Cites 19 papers in your library

Read on November 2, 2025

I really liked the methodology, but I had to read it a few times to understand it intuitively. I think the authors should have done a better job of explaining it.
