2021

What Will it Take to Fix Benchmarking in Natural Language Understanding?

Samuel R. Bowman, George E. Dahl

AI summary

This paper argues that current NLU benchmarks are broken and proposes four criteria (validity, reliable annotation, statistical power, and disincentives for biased models) that future benchmarks should satisfy to facilitate progress in language understanding.

Main Contributions

  • Argues that current NLU evaluation benchmarks are broken due to saturation and biases.
  • Proposes four criteria for effective NLU benchmarks: validity, reliable annotation, adequate statistical power, and disincentives for biased models.
  • Discusses limitations of existing benchmark creation paradigms (naturally-occurring, expert-authored, crowdsourcing, adversarial filtering).
  • Outlines potential solutions for improving each criterion, including hybrid data collection and auxiliary bias evaluation metrics.
  • Emphasizes the need for community infrastructure and incentive design to address bias and improve evaluation.
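
To make the statistical power criterion concrete, here is a minimal sketch of how one might estimate the test-set size needed to reliably detect a small accuracy gap between two systems. It is not the authors' procedure. It assumes a two-sided two-proportion z-test with a normal approximation and treats the two systems' errors as independent, which is a simplification when both are scored on the same items. The function names (approx_power, min_test_size) and the example accuracies are hypothetical.

```python
import math
from statistics import NormalDist


def approx_power(acc_a, acc_b, n, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test.

    Assumption (not from the paper): each system is scored on n items and
    their errors are independent, so the variances simply add.
    """
    if not (0 < acc_a < 1 and 0 < acc_b < 1):
        raise ValueError("accuracies must lie strictly between 0 and 1")
    # Standard error of the difference between the two observed accuracies.
    se = math.sqrt(acc_a * (1 - acc_a) / n + acc_b * (1 - acc_b) / n)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    effect = abs(acc_b - acc_a) / se
    # Probability of exceeding the critical value in the direction of the
    # true difference (the opposite tail is negligible and ignored).
    return 1 - NormalDist().cdf(z_crit - effect)


def min_test_size(acc_a, acc_b, target_power=0.8, alpha=0.05):
    """Smallest n at which approx_power reaches target_power."""
    if acc_a == acc_b:
        raise ValueError("no difference to detect")
    # Double n until the target is met, then binary-search the boundary.
    hi = 1
    while approx_power(acc_a, acc_b, hi, alpha) < target_power:
        hi *= 2
    lo = hi // 2
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if approx_power(acc_a, acc_b, mid, alpha) >= target_power:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    # Hypothetical scenario: a 90% baseline versus a 91% improved system.
    print(min_test_size(0.90, 0.91))
```

Under these assumptions, the required size grows roughly with the inverse square of the accuracy gap, which is why small test sets struggle to separate systems once scores are high.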

Abstract

Evaluation for many natural language understanding (NLU) tasks is broken: Unreliable and biased systems score so highly on standard benchmarks that there is little room for researchers who develop better systems to demonstrate their improvements. The recent trend to abandon IID benchmarks in favor of adversarially-constructed, out-of-distribution test sets ensures that current models will perform poorly, but ultimately only obscures the abilities that we want our benchmarks to measure. In this position paper, we lay out four criteria that we argue NLU benchmarks should meet. We argue most current benchmarks fail at these criteria, and that adversarial data collection does not meaningfully address the causes of these failures. Instead, restoring a healthy evaluation ecosystem will require significant progress in the design of benchmark datasets, the reliability with which they are annotated, their size, and the ways they handle social bias.

