Papperoni

2015

The Goldilocks Principle: Reading Children's Books With Explicit Memory Representations

F. Hill, Antoine Bordes, S. Chopra, Jason Weston


Cite Score: 29

AI summary

This paper introduces the Children's Book Test (CBT) and explores memory networks for language modeling. It finds that models with explicit memory perform well on semantic content words, and self-supervision enhances performance, achieving state-of-the-art results on the CNN QA benchmark.

Main Contributions

Introduces the Children's Book Test (CBT) dataset for evaluating language models on children's books.
Compares various state-of-the-art language models, including RNNs and Memory Networks, on the CBT dataset.
Shows that Memory Networks with explicit memory representations outperform other models in predicting semantic content words.
Finds that there is a 'Goldilocks principle' for the amount of text encoded in a single memory representation.
Achieves state-of-the-art performance on the CNN QA benchmark by applying self-supervision to Memory Networks.

Abstract

We introduce a new test of how well language models capture meaning in children's books. Unlike standard language modelling benchmarks, it distinguishes the task of predicting syntactic function words from that of predicting lower-frequency words, which carry greater semantic content. We compare a range of state-of-the-art models, each with a different way of encoding what has been previously read. We show that models which store explicit representations of long-term contexts outperform state-of-the-art neural language models at predicting semantic content words, although this advantage is not observed for syntactic function words. Interestingly, we find that the amount of text encoded in a single memory representation is highly influential to the performance: there is a sweet-spot, not too big and not too small, between single words and full sentences that allows the most meaningful information in a text to be effectively retained and recalled. Further, the attention over such window-based memories can be trained effectively through self-supervision. We then assess the generality of this principle by applying it to the CNN QA benchmark, which involves identifying named entities in paraphrased summaries of news articles, and achieve state-of-the-art performance.
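The window-based memory idea in the abstract can be illustrated with a small sketch. This is a rough illustration only, not the authors' implementation: it assumes memories are fixed-width word windows centred on candidate answers, that a window is embedded as the sum of its word vectors, and that candidates are scored by the softmax attention mass their windows receive. The function names, the `width` parameter, and the random (untrained) embedding table are all hypothetical.

```python
import math
import random


def window_memories(tokens, candidates, width=5):
    """Collect one (centre_word, window) memory per candidate occurrence."""
    half = width // 2
    memories = []
    for i, tok in enumerate(tokens):
        if tok in candidates:
            lo, hi = max(0, i - half), min(len(tokens), i + half + 1)
            memories.append((tok, tokens[lo:hi]))
    return memories


def random_table(vocab, dim, seed=0):
    """Hypothetical stand-in for learned embeddings: random vectors per word."""
    rng = random.Random(seed)
    return {w: [rng.uniform(-1, 1) for _ in range(dim)] for w in sorted(set(vocab))}


def embed(words, table, dim):
    """Bag-of-words embedding: sum of the word vectors."""
    vec = [0.0] * dim
    for w in words:
        for k, x in enumerate(table[w]):
            vec[k] += x
    return vec


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def answer(context, query, candidates, table, dim, width=5):
    """Attend over window memories and return the best-supported candidate."""
    memories = window_memories(context, candidates, width)
    if not memories:
        raise ValueError("no candidate occurs in the context")
    q = embed(query, table, dim)
    scores = [
        sum(a * b for a, b in zip(q, embed(win, table, dim)))
        for _, win in memories
    ]
    attn = softmax(scores)
    # Sum the attention mass of all windows centred on the same candidate.
    totals = {c: 0.0 for c in candidates}
    for (centre, _), p in zip(memories, attn):
        totals[centre] += p
    return max(totals, key=totals.get), totals


if __name__ == "__main__":
    ctx = "the cat sat on the mat while the dog slept near the door".split()
    query = "the slept".split()  # query with the missing word left out
    table = random_table(ctx + query, dim=8)
    print(answer(ctx, query, {"cat", "dog"}, table, dim=8, width=3))
```

With untrained embeddings the prediction is arbitrary; the point is only to show how `width` controls how much text each memory holds, which is the quantity the "sweet spot" claim is about.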


References [29]


[1]Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning

R. Williams - 1992

11 papers in library cite

It's alright for formalizing the concept, but it's a bit boring and doesn't add a lot from the middle on. Focuses too much on reviewing existing techniques and on stochastic units.

[2]Show, Attend and Tell: Neural Image Caption Generation With Visual Attention

K. Xu, Jimmy Lei Ba, R. Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, R. Zemel, Yoshua Bengio - 2015

12 papers in library cite

It's a nice paper. I liked the soft attention way more than the hard one, and I am a bit mad that it wasn't the best lol. Also, it's the first paper I read about multimodality, but it seems that this was bustling at the time. Also, results are kinda bad.

[3]Learning Long-Term Dependencies With Gradient Descent Is Difficult

Yoshua Bengio, Patrice Simard, Paolo Frasconi - 1994

31 papers in library cite

The first ones to notice that there is a problem with gradient descent when learning long-term dependencies, but way too mathy for me.

[4]Improving Neural Networks by Preventing Co-Adaptation of Feature Detectors

Geoffrey E. Hinton, N. Srivastava, Alex Krizhevsky, Ilya Sutskever, Ruslan R. Salakhutdinov - 2012

25 papers in library cite

Dropout, super impactful. The idea that you are training many estimators at once is also very nice.

[5]Effective Approaches to Attention-Based Neural Machine Translation

T. Luong, H. Pham, Christopher D. Manning - 2015

15 papers in library cite

Good paper, but very derivative. Attention methods start getting very complicated... I understand why Transformers took over TBH

[6]Teaching Machines to Read and Comprehend

K. M. Hermann, T. Kocisky, Edward Grefenstette, L. Espeholt, W. Kay, M. Suleyman, Phil Blunsom - 2015

31 papers in library cite

Nice way of converting unsupervised data to train for Q&A - and nice visualizations as well :) But I think their main contribution is the dataset. Maybe with the dataset they "unlocked" summarization?

[7]A Neural Attention Model for Abstractive Sentence Summarization

Alexander M. Rush, S. Chopra, Jason Weston - 2015

13 papers in library cite

TBH the paper is a bit boring and nothing new after reading a bunch of more modern techniques. I feel that they could have done a better job considering that seq2seq existed at the time. Either way, points for being the first to propose summarization with NNs.

[8]Regularization of Neural Networks Using Dropconnect

L. Wan, M. Zeiler, S. Zhang, Rob Fergus - 2013

8 papers in library cite

I feel that the method is very complex and does not improve much on top of regular dropout.

[9]End-to-End Memory Networks

S. Sukhbaatar, A. Szlam, Jason Weston, Rob Fergus - 2015

18 papers in library cite

This was so surprising! This is very similar to transformers and RAG. Who knew?!

[10]Memory Networks

Jason Weston, S. Chopra, Antoine Bordes - 2015

18 papers in library cite

The first half of the paper (when they discuss the concept in a very abstract way) is amazing. However, the actual methodology was very convoluted - I did not like it. I thought that Neural Turing Machines were inspired by this, but actually they are contemporaneous... So anyway, the concept is nice, execution is not.

[11]Improving Word Representations via Global Context and Multiple Word Prototypes

Eric H. Huang, Richard Socher, C. Manning, Andrew Y. Ng - 2012

7 papers in library cite

Good paper. I would say not too relevant, but has some nice concepts like a "global context vector" (document embedding) that comes from averaging word embeddings, and a joint training objective.

[12]Towards AI-complete Question Answering: A Set of Prerequisite Toy Tasks

Jason Weston, Antoine Bordes, S. Chopra, Tomas Mikolov - 2015

11 papers in library cite

It's a good idea and a nice read but the bad part is that most of the tasks are already easy.

[13]MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text

M. Richardson, C. J. C. Burges, Erin Renshaw - 2013

16 papers in library cite

Maybe the best dataset paper I have ever read. So well explained, thoroughly thought out! It's a shame it's a very small dataset...

[14]Context Dependent Recurrent Neural Network Language Model

Tomas Mikolov, Geoffrey Zweig - 2012

12 papers in library cite

Nothing too interesting, just using the context of the RNN.

[15]Large Scale Image Annotation: Learning to Rank With Joint Word-Image Embeddings

Jason Weston, Samy Bengio, Nicolas Usunier - 2010

3 papers in library cite

Very simple approach! Seems like an early word2vec. I like the joint embeddings as well. However, results are underwhelming.

[16]Inferring Algorithmic Patterns With Stack-Augmented Recurrent Nets

Armand Joulin, Tomas Mikolov - 2015

9 papers in library cite

Very underwhelming TBH. I expected more after reading the Neural Turing Machine paper. This reads like "yeah, we lost the race, here's what we were doing before they did something better"

[17]Ask Me Anything: Dynamic Memory Networks for Natural Language Processing

A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, I. Gulrajani, Victor Zhong, R. Paulus, Richard Socher - 2015

9 papers in library cite

[18]Large-Scale Simple Question Answering With Memory Networks

Antoine Bordes, Nicolas Usunier, S. Chopra, Jason Weston - 2015

5 papers in library cite

Mem networks for QA - sounds interesting

[19]A Cache-Based Natural Language Model for Speech Recognition

R. Kuhn, R. D. Mori - 1990

6 papers in library cite

[20]The Microsoft Research Sentence Completion Challenge

Geoffrey Zweig, C. J. Burges - 2011

6 papers in library cite

[21]The Stanford CoreNLP Natural Language Processing Toolkit

Christopher D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, D. Mcclosky - 2014

6 papers in library cite

[22]Learning to Transduce With Unbounded Memory

Edward Grefenstette, K. Hermann, M. Suleyman, Phil Blunsom - 2015

5 papers in library cite

[23]Unconstrained Online Handwriting Recognition With Recurrent Neural Networks

Alex Graves, Santiago Fernandez, M. Liwicki, H. Bunke, Jürgen Schmidhuber - 2008

5 papers in library cite

[24]Scalable Modified Kneser-Ney Language Model Estimation

K. Heafield, I. Pouzyrevsky, J. H. Clark, P. Koehn - 2013

2 papers in library cite

[25]Transition-Based Dependency Parsing With Stack Long Short-Term Memory

C. Dyer, M. Ballesteros, W. Ling, A. Matthews, N. Smith - 2015

2 papers in library cite

[26]Goldilocks and the Three Bears

J. Hassall - 1904

1 paper in library cites

[27]Interaction With Context During Human Sentence Processing

G. Altmann, M. Steedman - 1988

1 paper in library cites

[28]The Neurobiology of Semantic Memory

J. R. Binder, R. H. Desai - 2011

1 paper in library cites

[29]Word Frequency Distributions and Lexical Semantics

R. H. Baayen, R. Lieber - 1996

1 paper in library cites

Cited by 14 papers in your library

Cites 18 papers in your library

Read on October 30, 2025

Cool use of memory networks.
