2019
Cite Score
72
AI summary
This paper introduces HellaSwag, a challenging new dataset for commonsense natural language inference. It is built with Adversarial Filtering (AF), in which discriminators iteratively select machine-generated wrong answers that are adversarial to current models. State-of-the-art models such as BERT struggle on it (<48% accuracy, versus >95% for humans) despite reaching near-human performance on previous benchmarks, which highlights the limits of their commonsense reasoning.
Abstract
Recent work by Zellers et al. (2018) introduced a new task of commonsense natural language inference: given an event description such as "A woman sits at a piano," a machine must select the most likely followup: "She sets her fingers on the keys." With the introduction of BERT (Devlin et al., 2018), near human-level performance was reached. Does this mean that machines can perform human level commonsense inference? In this paper, we show that commonsense inference still proves difficult for even state-of-the-art models, by presenting HellaSwag, a new challenge dataset. Though its questions are trivial for humans (>95% accuracy), state-of-the-art models struggle (<48%). We achieve this via Adversarial Filtering (AF), a data collection paradigm wherein a series of discriminators iteratively select an adversarial set of machine-generated wrong answers. AF proves to be surprisingly robust. The key insight is to scale up the length and complexity of the dataset examples towards a critical 'Goldilocks' zone wherein generated text is ridiculous to humans, yet often misclassified by state-of-the-art models. Our construction of HellaSwag, and its resulting difficulty, sheds light on the inner workings of deep pretrained models. More broadly, it suggests a new path forward for NLP research, in which benchmarks co-evolve with the evolving state-of-the-art in an adversarial way, so as to present ever-harder challenges.
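The AF loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: each "ending" is reduced to a single float, the discriminator is a nearest-mean scorer rather than a trained language model, and every name here (`make_toy_data`, `fit_discriminator`, `adversarial_filter`, `mean_wrong`) is hypothetical. It only shows the shape of the procedure: refit a discriminator on a random split each round, then replace wrong endings it finds easy with pool candidates it finds more convincing.

```python
import random


def make_toy_data(n=40, n_wrong=3, pool_size=8, seed=0):
    """Build toy examples; each ending is a single float standing in for text."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        data.append({
            "real": rng.uniform(0.8, 1.2),                               # human-written ending
            "wrong": [rng.uniform(-1.2, -0.8) for _ in range(n_wrong)],  # easy initial negatives
            "pool": [rng.uniform(-0.5, 0.5) for _ in range(pool_size)],  # spare generated endings
        })
    return data


def fit_discriminator(train):
    """Nearest-mean stand-in for a trained model; a higher score means 'looks more real'."""
    real = [ex["real"] for ex in train]
    fake = [w for ex in train for w in ex["wrong"]]
    mu_real = sum(real) / len(real)
    mu_fake = sum(fake) / len(fake)
    return lambda x: abs(x - mu_fake) - abs(x - mu_real)


def adversarial_filter(examples, rounds=10, seed=0):
    """Iteratively swap easy wrong endings for harder ones from each pool (in place)."""
    rng = random.Random(seed)
    for _ in range(rounds):
        # Random train / held-out split for this round.
        order = list(range(len(examples)))
        rng.shuffle(order)
        half = len(order) // 2
        train = [examples[i] for i in order[:half]]
        held_out = [examples[i] for i in order[half:]]
        realness = fit_discriminator(train)  # a fresh discriminator each round
        for ex in held_out:
            for j, w in enumerate(ex["wrong"]):
                if not ex["pool"]:
                    break
                best = max(ex["pool"], key=realness)
                if realness(best) > realness(w):
                    ex["pool"].remove(best)  # used candidates are discarded (a simplification)
                    ex["wrong"][j] = best
    return examples


def mean_wrong(examples):
    """Average value of the current wrong endings across all examples."""
    vals = [w for ex in examples for w in ex["wrong"]]
    return sum(vals) / len(vals)


if __name__ == "__main__":
    data = make_toy_data()
    print("mean wrong ending before AF:", mean_wrong(data))
    adversarial_filter(data)
    print("mean wrong ending after AF: ", mean_wrong(data))
```

In this toy setup the surviving wrong endings drift toward the real ones, which is the sense in which the filtered set becomes adversarial to the discriminator.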
References [19]
Jacob Devlin, M. W. Chang, K. Lee, Kristina Toutanova - 2018
39 papers in library cite
Alec Radford, Jeffrey Wu, Rewon Child, D. Luan, Dario Amodei, Ilya Sutskever - 2019
27 papers in library cite
M. E. Peters, M. Neumann, M. Iyyer, Matt Gardner, C. Clark, K. Lee, L. S. Zettlemoyer - 2018
27 papers in library cite
Alec Radford, K. Narasimhan, T. Salimans, Ilya Sutskever - 2018
23 papers in library cite
Yukun Zhu, R. Kiros, R. Zemel, Ruslan Salakhutdinov, R. Urtasun, Antonio Torralba, Sanja Fidler - 2015
18 papers in library cite
R. Jia, Percy Liang - 2017
11 papers in library cite
Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, S. Bowman, Noah A. Smith - 2018
6 papers in library cite
Yejin Choi - 2018
5 papers in library cite
Ari Holtzman, J. Buys, L. Du, M. Forbes, Yejin Choi - 2019
5 papers in library cite
R. Krishna, K. Hata, F. Ren, Li Fei-Fei, J. C. Niebles - 2017
2 papers in library cite
Qian Chen, X. Zhu, Z. H. Ling, S. Wei, H. Jiang, D. Inkpen - 2017
5 papers in library cite
A. Poliak, J. Naradowsky, A. Haldar, R. Rudinger, B. V. Durme - 2018
5 papers in library cite
A. Rohrbach, A. Torabi, M. Rohrbach, N. Tandon, C. Pal, Hugo Larochelle, Aaron Courville, B. Schiele - 2017
2 papers in library cite
M. Glockner, V. Shwartz, Y. Goldberg - 2018
3 papers in library cite
Armand Joulin, E. Grave, Piotr Bojanowski, Tomas Mikolov - 2017
4 papers in library cite
R. Rudinger, Vera Demberg, A. Modi, B. V. Durme, M. Pinkal - 2015
1 paper in library cites
J. Gordon, B. V. Durme - 2013
1 paper in library cites
S. Williams, A. Waterman, D. Patterson - 2009
1 paper in library cites
Yonatan Belinkov, Yonatan Bisk - 2018
1 paper in library cites
Cited by 6 papers in your library
Cites 14 papers in your library
Read on May 25, 2026