Papperoni

2019

HellaSwag: Can a Machine Really Finish Your Sentence?

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, Yejin Choi

AI summary

This paper introduces HellaSwag, a challenging dataset for commonsense natural language inference. Its wrong answers are machine-generated and selected through Adversarial Filtering (AF). State-of-the-art models such as BERT reached near-human performance on earlier benchmarks but still struggle on HellaSwag, which exposes limits in their commonsense reasoning.

Main Contributions

Introduced HellaSwag, a new dataset for commonsense natural language inference designed to be challenging for state-of-the-art models yet trivial for humans.
Used Adversarial Filtering (AF), a data collection paradigm in which a series of discriminators iteratively selects adversarial machine-generated wrong answers, and found it to be surprisingly robust.
Identified a 'Goldilocks' zone for dataset example length and complexity where machine-generated text is nonsensical to humans but misclassified by models.
Demonstrated that deep pretrained models like BERT struggle with HellaSwag (<48% accuracy, versus >95% for humans), suggesting they behave more like rapid surface learners than like systems with robust commonsense reasoning.
Proposed a new path for NLP research where benchmarks co-evolve adversarially with the state-of-the-art to present ever-harder challenges.

Abstract

Recent work by Zellers et al. (2018) introduced a new task of commonsense natural language inference: given an event description such as "A woman sits at a piano," a machine must select the most likely followup: "She sets her fingers on the keys." With the introduction of BERT (Devlin et al., 2018), near human-level performance was reached. Does this mean that machines can perform human level commonsense inference? In this paper, we show that commonsense inference still proves difficult for even state-of-the-art models, by presenting HellaSwag, a new challenge dataset. Though its questions are trivial for humans (>95% accuracy), state-of-the-art models struggle (<48%). We achieve this via Adversarial Filtering (AF), a data collection paradigm wherein a series of discriminators iteratively select an adversarial set of machine-generated wrong answers. AF proves to be surprisingly robust. The key insight is to scale up the length and complexity of the dataset examples towards a critical 'Goldilocks' zone wherein generated text is ridiculous to humans, yet often misclassified by state-of-the-art models. Our construction of HellaSwag, and its resulting difficulty, sheds light on the inner workings of deep pretrained models. More broadly, it suggests a new path forward for NLP research, in which benchmarks co-evolve with the evolving state-of-the-art in an adversarial way, so as to present ever-harder challenges.
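
The Adversarial Filtering loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. The function names (`adversarial_filter`, `toy_train`), the fixed candidate pools, the random train/test split per round, and the word-overlap scorer are all assumptions made for illustration; a real setup would train a model such as BERT as the discriminator and draw candidates from a text generator.

```python
import random
from typing import Callable, Dict, List

# Hypothetical type: (context, ending) -> score, higher means "looks more real".
Scorer = Callable[[str, str], float]


def adversarial_filter(
    contexts: List[str],
    pools: Dict[str, List[str]],
    train: Callable[[List[str], Dict[str, List[str]]], Scorer],
    num_wrong: int = 2,
    rounds: int = 4,
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Iteratively swap easy-to-spot wrong endings for more convincing ones."""
    rng = random.Random(seed)
    # Start from the first num_wrong machine-generated candidates per context.
    chosen = {c: list(pools[c][:num_wrong]) for c in contexts}
    for _ in range(rounds):
        # Assumed protocol: a fresh random split and a fresh discriminator each round.
        shuffled = contexts[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        train_split, test_split = shuffled[:half], shuffled[half:]
        scorer = train(train_split, chosen)
        for c in test_split:
            unused = [e for e in pools[c] if e not in chosen[c]]
            if not unused:
                continue
            # Compare the easiest current ending with the most convincing unused one.
            weakest = min(chosen[c], key=lambda e: scorer(c, e))
            strongest = max(unused, key=lambda e: scorer(c, e))
            if scorer(c, strongest) > scorer(c, weakest):
                chosen[c][chosen[c].index(weakest)] = strongest
    return chosen


def toy_train(train_contexts: List[str], current: Dict[str, List[str]]) -> Scorer:
    """Stand-in for fitting a discriminator: ignores its inputs, scores by word overlap."""
    def score(context: str, ending: str) -> float:
        ctx = set(context.lower().split())
        words = ending.lower().split()
        return sum(w in ctx for w in words) / max(len(words), 1)
    return score


# Made-up example data, for illustration only.
contexts = ["A woman sits at a piano", "A man opens the fridge"]
pools = {
    "A woman sits at a piano": [
        "She eats the moon", "She plays the piano",
        "A woman sits down", "The dog barks",
    ],
    "A man opens the fridge": [
        "He flies away", "The man opens a jar",
        "He closes the fridge", "Rain falls",
    ],
}
result = adversarial_filter(contexts, pools, toy_train)
```

Each swap replaces an ending the discriminator finds easy to reject with one it finds harder, so the retained wrong answers become more adversarial for that discriminator over the rounds. Whether the resulting endings remain obviously wrong to humans is a separate question that this sketch does not address.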

References [19]

[1] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova - 2018
