2019

HellaSwag: Can a Machine Really Finish Your Sentence?

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, Yejin Choi


AI summary

This paper introduces HellaSwag, a challenging new dataset for commonsense natural language inference. Its wrong answers are machine-generated and selected with Adversarial Filtering (AF). State-of-the-art models such as BERT struggle on it (<48% accuracy) despite reaching near-human performance on the earlier benchmark, while humans score above 95%, which highlights the limits of these models' commonsense reasoning.

Main Contributions

  • Introduced HellaSwag, a new dataset for commonsense natural language inference designed to be challenging for state-of-the-art models yet trivial for humans.
  • Used Adversarial Filtering (AF), a data collection paradigm in which a series of discriminators iteratively select an adversarial set of machine-generated wrong answers, and found it to be surprisingly robust.
  • Identified a 'Goldilocks' zone of example length and complexity in which machine-generated text is ridiculous to humans yet often misclassified by state-of-the-art models.
  • Demonstrated that deep pretrained models like BERT struggle with HellaSwag (<48% accuracy, versus >95% for humans), suggesting they behave more like rapid surface learners than robust commonsense reasoners.
  • Proposed a new path for NLP research where benchmarks co-evolve adversarially with the state-of-the-art to present ever-harder challenges.

Abstract

Recent work by Zellers et al. (2018) introduced a new task of commonsense natural language inference: given an event description such as "A woman sits at a piano," a machine must select the most likely followup: "She sets her fingers on the keys." With the introduction of BERT (Devlin et al., 2018), near human-level performance was reached. Does this mean that machines can perform human level commonsense inference? In this paper, we show that commonsense inference still proves difficult for even state-of-the-art models, by presenting HellaSwag, a new challenge dataset. Though its questions are trivial for humans (>95% accuracy), state-of-the-art models struggle (<48%). We achieve this via Adversarial Filtering (AF), a data collection paradigm wherein a series of discriminators iteratively select an adversarial set of machine-generated wrong answers. AF proves to be surprisingly robust. The key insight is to scale up the length and complexity of the dataset examples towards a critical 'Goldilocks' zone wherein generated text is ridiculous to humans, yet often misclassified by state-of-the-art models. Our construction of HellaSwag, and its resulting difficulty, sheds light on the inner workings of deep pretrained models. More broadly, it suggests a new path forward for NLP research, in which benchmarks co-evolve with the evolving state-of-the-art in an adversarial way, so as to present ever-harder challenges.
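The Adversarial Filtering loop described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: each ending is reduced to a single number standing in for text, the discriminator is a one-feature nearest-mean classifier rather than a pretrained model, and the names `make_toy_data`, `train_discriminator`, `accuracy`, and `adversarial_filter` are hypothetical. Only the overall shape follows the abstract: retrain a discriminator each round, then swap easily detected wrong answers for generated candidates that it finds harder.

```python
import random
from statistics import mean

def make_toy_data(n=200, k=3, pool_size=20, seed=0):
    """Build toy examples: one 'real' ending and a pool of generated candidates.
    Each ending is a single float standing in for text (an assumption)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        pool = [rng.gauss(1.0, 1.0) for _ in range(pool_size)]
        # Start with the first k candidates as the assigned wrong answers.
        data.append({"real": rng.gauss(0.0, 0.3), "pool": pool, "wrong": pool[:k]})
    return data

def train_discriminator(examples):
    """Fit a toy nearest-mean discriminator and return a 'fakeness' scorer."""
    real_mu = mean(ex["real"] for ex in examples)
    fake_mu = mean(w for ex in examples for w in ex["wrong"])
    # Positive score: closer to the generated mean than to the real mean.
    return lambda x: abs(x - real_mu) - abs(x - fake_mu)

def accuracy(examples, score):
    """Fraction of examples where the real ending has the lowest fakeness."""
    hits = sum(
        all(score(ex["real"]) < score(w) for w in ex["wrong"]) for ex in examples
    )
    return hits / len(examples)

def adversarial_filter(examples, rounds=10, seed=0):
    """Run the filtering loop, mutating each example's 'wrong' list in place.
    Returns the held-out discriminator accuracy measured at each round."""
    rng = random.Random(seed)
    history = []
    for _ in range(rounds):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        train, test = shuffled[:half], shuffled[half:]
        score = train_discriminator(train)  # a fresh discriminator every round
        history.append(accuracy(test, score))
        for ex in test:
            for i, w in enumerate(ex["wrong"]):
                unused = [c for c in ex["pool"] if c not in ex["wrong"]]
                if not unused:
                    continue
                hardest = min(unused, key=score)
                # Replace a distractor flagged as fake with one that looks less fake.
                if score(w) > 0 and score(hardest) < score(w):
                    ex["wrong"][i] = hardest
    return history

if __name__ == "__main__":
    data = make_toy_data()
    for round_idx, acc in enumerate(adversarial_filter(data)):
        print(f"round {round_idx}: held-out discriminator accuracy = {acc:.2f}")
```

In this toy setting the per-round accuracy shows how well a freshly trained discriminator separates real endings from the current wrong answers. Whether and how fast it falls depends entirely on the synthetic data, so the numbers say nothing about HellaSwag itself.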


References [19]


Jacob Devlin, M. W. Chang, K. Lee, Kristina Toutanova - 2018

39 papers in library cite

Alec Radford, Jeffrey Wu, Rewon Child, D. Luan, Dario Amodei, Ilya Sutskever - 2019

27 papers in library cite

M. E. Peters, M. Neumann, M. Iyyer, Matt Gardner, C. Clark, K. Lee, L. S. Zettlemoyer - 2018

27 papers in library cite

Alec Radford, K. Narasimhan, T. Salimans, Ilya Sutskever - 2018

23 papers in library cite

Yukun Zhu, R. Kiros, R. Zemel, Ruslan Salakhutdinov, R. Urtasun, Antonio Torralba, Sanja Fidler - 2015

18 papers in library cite

R. Jia, Percy Liang - 2017

11 papers in library cite

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, S. Bowman, Noah A. Smith - 2018

6 papers in library cite

Yejin Choi - 2018

5 papers in library cite

Ari Holtzman, J. Buys, L. Du, M. Forbes, Yejin Choi - 2019

5 papers in library cite

R. Krishna, K. Hata, F. Ren, Li Fei-Fei, J. C. Niebles - 2017

2 papers in library cite

Qian Chen, X. Zhu, Z. H. Ling, S. Wei, H. Jiang, D. Inkpen - 2017

5 papers in library cite

A. Poliak, J. Naradowsky, A. Haldar, R. Rudinger, B. Van Durme - 2018

5 papers in library cite

A. Rohrbach, A. Torabi, M. Rohrbach, N. Tandon, C. Pal, Hugo Larochelle, Aaron Courville, B. Schiele - 2017

2 papers in library cite

M. Glockner, V. Shwartz, Y. Goldberg - 2018

3 papers in library cite

Armand Joulin, E. Grave, Piotr Bojanowski, Tomas Mikolov - 2017

4 papers in library cite

R. Rudinger, Vera Demberg, A. Modi, B. Van Durme, M. Pinkal - 2015

1 paper in library cites

J. Gordon, B. Van Durme - 2013

1 paper in library cites

S. Williams, A. Waterman, D. Patterson - 2009

1 paper in library cites

Yonatan Belinkov, Yonatan Bisk - 2018

1 paper in library cites
