Papperoni

2018

Annotation Artifacts in Natural Language Inference Data

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, Noah A. Smith

Cite Score: 43

AI summary

This paper studies annotation artifacts in NLI datasets such as SNLI and MultiNLI. It shows that a fastText classifier given only the hypothesis, with no access to the premise, predicts the correct label for about 67% of SNLI and 53% of MultiNLI. It identifies linguistic phenomena, such as negation and vagueness, that correlate with particular inference classes, and shows that NLI models rely heavily on these artifacts, which suggests that reported performance is overestimated and that more balanced datasets are needed.

Main Contributions

Identified annotation artifacts in NLI datasets (SNLI and MultiNLI) that allow for hypothesis-only classification.
Showed that a simple text categorization model (fastText) can correctly classify the hypothesis alone, without observing the premise, in about 67% of SNLI and 53% of MultiNLI.
Analyzed linguistic phenomena (negation, vagueness) correlated with specific inference classes.
Demonstrated that high-performing NLI models rely heavily on annotation artifacts for predictions.
Suggested that the success of NLI models may be overestimated due to the presence of annotation artifacts.
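
The hypothesis-only setup in the contributions above can be sketched in a few lines. This is a minimal illustration, not the paper's code: a small bag-of-words Naive Bayes classifier stands in for fastText, and the training triples, `train_hypothesis_only`, and `predict` are all hypothetical names and data invented for this example. The point it shows is that the premise is read from each example but never used.

```python
import math
from collections import Counter, defaultdict

# Toy (premise, hypothesis, label) triples, invented for illustration only.
TRAIN = [
    ("A man plays guitar on stage.", "A person is performing music.", "entailment"),
    ("A man plays guitar on stage.", "Nobody is playing an instrument.", "contradiction"),
    ("A man plays guitar on stage.", "A tall man plays his favorite song.", "neutral"),
    ("Two dogs run through a field.", "Some animals are outside.", "entailment"),
    ("Two dogs run through a field.", "The dogs are not moving.", "contradiction"),
    ("Two dogs run through a field.", "The dogs are chasing a rabbit for fun.", "neutral"),
]


def tokenize(text):
    """Lowercase, drop periods, split on whitespace."""
    return text.lower().replace(".", "").split()


def train_hypothesis_only(examples):
    """Count hypothesis words per label; the premise is deliberately ignored."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for _premise, hypothesis, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(hypothesis))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def predict(model, hypothesis):
    """Return the label with the highest Naive Bayes log-score (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in tokenize(hypothesis):
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_hypothesis_only(TRAIN)
negated = predict(model, "Nobody is not sleeping.")
vague = predict(model, "A tall man for fun.")
```

On this toy data, negation words push `negated` toward "contradiction" and added detail pushes `vague` toward "neutral", even though no premise is supplied at prediction time. On real NLI data the same experiment would use the actual SNLI or MultiNLI splits and a stronger classifier.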

Abstract

Large-scale datasets for natural language inference are created by presenting crowd workers with a sentence (premise), and asking them to generate three new sentences (hypotheses) that it entails, contradicts, or is logically neutral with respect to. We show that, in a significant portion of such data, this protocol leaves clues that make it possible to identify the label by looking only at the hypothesis, without observing the premise. Specifically, we show that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI (Bowman et al., 2015) and 53% of MultiNLI (Williams et al., 2018). Our analysis reveals that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes. Our findings suggest that the success of natural language inference models to date has been overestimated, and that the task remains a hard open problem.
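
One way to surface the word-to-class correlations the abstract mentions is pointwise mutual information (PMI) between hypothesis words and labels. The sketch below is an assumption about how such an analysis could look, not a reproduction of the paper's procedure: the data, the `pmi_table` helper, and the smoothing constant are all hypothetical.

```python
import math
from collections import Counter

# Toy (hypothesis, label) pairs, invented for illustration only.
DATA = [
    ("nobody is playing", "contradiction"),
    ("the dogs are not moving", "contradiction"),
    ("a person is outside", "entailment"),
    ("some animals are outside", "entailment"),
    ("a tall man is playing", "neutral"),
    ("the dogs are not tired", "contradiction"),
]


def pmi_table(data, smoothing=1.0):
    """PMI(word, label) = log( p(word, label) / (p(word) * p(label)) ).

    Counts are the number of hypotheses containing the word, with a constant
    added to every (word, label) cell; the constant is an assumed choice.
    """
    raw = Counter()
    for hypothesis, label in data:
        for word in set(hypothesis.split()):
            raw[(word, label)] += 1
    words = sorted({w for w, _ in raw})
    labels = sorted({label for _, label in data})

    joint = {(w, c): raw[(w, c)] + smoothing for w in words for c in labels}
    word_total = {w: sum(joint[(w, c)] for c in labels) for w in words}
    label_total = {c: sum(joint[(w, c)] for w in words) for c in labels}
    grand_total = sum(joint.values())

    return {
        (w, c): math.log(joint[(w, c)] * grand_total / (word_total[w] * label_total[c]))
        for w in words
        for c in labels
    }


pmi = pmi_table(DATA)
not_vs_contradiction = pmi[("not", "contradiction")]
not_vs_entailment = pmi[("not", "entailment")]
```

A positive PMI means the word appears with that label more often than chance would predict; in this toy data "not" scores positively for contradiction and negatively for entailment.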

References [29]

[1]SQuAD: 100,000+ Questions for Machine Comprehension of Text

P. Rajpurkar, J. Zhang, K. Lopyrev, Percy Liang - 2016

37 papers in library cite

Nice paper that introduced an important dataset. Not much else though.

[2]A Large Annotated Corpus for Learning Natural Language Inference

Samuel R. Bowman, G. Angeli, Christopher Potts, Christopher D. Manning - 2015

25 papers in library cite

Dataset collection is ok. The model that they create seems very low effort.

[3]A Broad-Coverage Challenge Corpus for Sentence Understanding Through Inference

A. Williams, Nikita Nangia, S. Bowman - 2018

19 papers in library cite

Very nice paper and cool dataset - good thing they expanded SNLI. Also, they at least tried to have a good baseline, and comparisons of domains are nice.

[4]Teaching Machines to Read and Comprehend

K. M. Hermann, T. Kocisky, Edward Grefenstette, L. Espeholt, W. Kay, M. Suleyman, Phil Blunsom - 2015

31 papers in library cite

Nice way of converting unsupervised data to train for Q&A - and nice visualizations as well :) But I think their main contribution is the dataset. Maybe with the dataset they "unlocked" summarization?

[5]The PASCAL Recognising Textual Entailment Challenge

Ido Dagan, O. Glickman, Bernardo Magnini - 2005

19 papers in library cite

It's very nice how they had the foresight to create a challenge that became relevant like 10 years later.

[6]Supervised Learning of Universal Sentence Representations From Natural Language Inference Data

Alexis Conneau, Douwe Kiela, Holger Schwenk, L. Barrault, Antoine Bordes - 2017

11 papers in library cite

It's nice. Maybe the first to do NLI right (after the SNLI paper tried but failed miserably). It is simple and effective. After that people started "performance maxxxing"

[7]Adversarial Examples for Evaluating Reading Comprehension Systems

R. Jia, Percy Liang - 2017

11 papers in library cite

I liked it a lot! It's good to see people testing things rather than just trying to beat SotA!

[8]A Decomposable Attention Model for Natural Language Inference

A. P. Parikh, O. Tackstrom, Dipanjan Das, Jakob Uszkoreit - 2016

11 papers in library cite

Very nice alternative to the common LSTM encoder-decoder architecture! Seems similar to the Transformer architecture in the sense that they don't use RNNs. Nice that they analyze computational complexity as well.

[9]A SICK Cure for the Evaluation of Compositional Distributional Semantic Models

Marco Marelli, S. Menini, M. Baroni, L. Bentivogli, R. Bernardi, R. Zamparelli - 2014

7 papers in library cite

Just a basic dataset paper. Nothing much.

[10]A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task

Danqi Chen, J. Bolton, Christopher D. Manning - 2016

9 papers in library cite

Very solid work that shows that things are not always what they seem - very nice!

[11]VQA: Visual Question Answering

S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, D. Parikh - 2015

6 papers in library cite

New task?

[12]Hypothesis Only Baselines in Natural Language Inference

A. Poliak, J. Naradowsky, A. Haldar, R. Rudinger, B. V. Durme - 2018

5 papers in library cite

Using only the hypothesis can solve the NLI task (it shouldn't!)

[13]The Effect of Different Writing Tasks on Linguistic Style: A Case Study of the ROC Story Cloze Task

Roy Schwartz, Maarten Sap, I. Konstas, L. Zilles, Yejin Choi, Noah A. Smith - 2017

3 papers in library cite

People write differently based on the prompt and that makes it easy to distinguish things

[14]Social Bias in Elicited Natural Language Inferences

R. Rudinger, C. May, B. V. Durme - 2017

3 papers in library cite

[15]Analyzing the Behavior of Visual Question Answering Models

A. Agrawal, D. Batra, D. Parikh - 2016

2 papers in library cite

[16]Revisiting Visual Question Answering Baselines

A. Jabri, Armand Joulin, Laurens Van Der Maaten - 2016

2 papers in library cite

[17]Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

Y. Goyal, Tejas Khot, D. Summers-Stay, D. Batra, D. Parikh - 2017

1 paper in library cites

[18]A Corpus and Evaluation Framework for Deeper Understanding of Commonsense Stories

N. Mostafazadeh, N. Chambers, X. He, D. Parikh, D. Batra, L. Vanderwende, P. Kohli, J. Allen - 2016

5 papers in library cite

[19]Illinois-LH: A Denotational and Distributional Approach to Semantics

A. Lai, J. Hockenmaier - 2014

5 papers in library cite

[20]Bag of Tricks for Efficient Text Classification

Armand Joulin, E. Grave, Piotr Bojanowski, Tomas Mikolov - 2017

4 papers in library cite

[21]Evaluating Compositionality in Sentence Embeddings

I. Dasgupta, Daniel Guo, Andreas Stuhlmuller, S. J. Gershman, N. D. Goodman - 2018

2 papers in library cite

[22]Natural Language Inference Over Interaction Space

Y. Gong, H. Luo, J. Zhang - 2018

2 papers in library cite

[23]Natural Language Inference With External Knowledge

Qian Chen, X. D. Zhu, Z. H. Ling, D. Inkpen, S. Wei - 2017

2 papers in library cite

[24]Pay Attention to the Ending: Strong Neural Baselines for the ROC Story Cloze Task

Zheng Cai, L. Tu, Kevin Gimpel - 2017

2 papers in library cite

[25]Annotating Relation Inference in Context via Question Answering

Omer Levy, Ido Dagan - 2016

1 paper in library cites

[26]Crowdsourcing Inference-Rule Evaluation

N. Zeichner, Jonathan Berant, Ido Dagan - 2012

1 paper in library cites

[27]Discovery of Inference Rules for Question-Answering

D. Lin, P. Pantel - 2001

1 paper in library cites

[28]Do Supervised Distributional Methods Really Learn Lexical Inference Relations?

Omer Levy, S. Remus, C. Biemann, Ido Dagan - 2015

1 paper in library cites

[29]Visual Referring Expression Recognition: What Do Our Systems Actually Learn?

V. Cirik, L. Morency, T. Berg-Kirkpatrick - 2018

1 paper in library cites

Cited by 6 papers in your library

Cites 17 papers in your library

Read on December 30, 2025

I love papers that show flaws in current methodology/datasets. Very nice for pointing it out!
