2020

Utility Is in the Eye of the User: A Critique of NLP Leaderboards

Kawin Ethayarajh, Dan Jurafsky


AI summary

This paper critiques the NLP leaderboard paradigm, arguing that its focus on performance-based evaluation neglects other important model qualities such as fairness and efficiency. It proposes greater transparency on leaderboards through the reporting of practical statistics (e.g., model size, energy efficiency), so that users can better estimate a model's utility to them.

Main Contributions

  • Critiques current NLP leaderboards for prioritizing performance over other valuable qualities like fairness, compactness, and energy efficiency.
  • Frames leaderboards and NLP practitioners as consumers of models, using microeconomic theory to analyze the divergence in their utility functions.
  • Identifies limitations in leaderboard design, including the leaderboard's non-smooth utility function and its neglect of prediction costs (e.g., model size, energy efficiency, latency).
  • Advocates for increased transparency on leaderboards, recommending the reporting of practical statistics for models.
  • Proposes a dynamic, customizable leaderboard interface allowing users to re-rank models based on their individual utility functions and preferences.
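
The re-ranking idea in the last contribution can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not taken from the paper: the linear form of the utility function, the `ModelEntry`, `utility`, and `rerank` names, the weights, and the three made-up model entries. It only shows how a user who pays for model size and latency could see a different ordering than an accuracy-only leaderboard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelEntry:
    """One hypothetical leaderboard row with the extra statistics reported."""
    name: str
    accuracy: float    # task accuracy in [0, 1]
    size_mb: float     # model size in megabytes
    latency_ms: float  # inference latency per example in milliseconds


def utility(entry, w_accuracy=1.0, w_size=0.0, w_latency=0.0):
    """Assumed linear utility: reward accuracy, penalize size and latency.

    With w_size = w_latency = 0 this reduces to ranking by accuracy alone,
    which stands in for a leaderboard that ignores prediction costs.
    """
    return (w_accuracy * entry.accuracy
            - w_size * entry.size_mb
            - w_latency * entry.latency_ms)


def rerank(entries, **weights):
    """Return entries sorted from highest to lowest utility."""
    return sorted(entries, key=lambda e: utility(e, **weights), reverse=True)


# Made-up entries, not real benchmark results.
ENTRIES = [
    ModelEntry("model-a", accuracy=0.90, size_mb=1300.0, latency_ms=40.0),
    ModelEntry("model-b", accuracy=0.88, size_mb=250.0, latency_ms=12.0),
    ModelEntry("model-c", accuracy=0.85, size_mb=60.0, latency_ms=5.0),
]

# Accuracy-only view versus a cost-sensitive practitioner's view.
leaderboard_view = [e.name for e in rerank(ENTRIES)]
practitioner_view = [e.name for e in rerank(ENTRIES, w_size=0.0001, w_latency=0.001)]

print(leaderboard_view)   # ['model-a', 'model-b', 'model-c']
print(practitioner_view)  # ['model-b', 'model-c', 'model-a']
```

Under these assumed weights the largest model drops from first to last place, even though its accuracy is the highest. A real interface would let each user supply their own weights or a different functional form altogether.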

Abstract

Benchmarks such as GLUE have helped drive advances in NLP by incentivizing the creation of more accurate models. While this leaderboard paradigm has been remarkably successful, a historical focus on performance-based evaluation has been at the expense of other qualities that the NLP community values in models, such as compactness, fairness, and energy efficiency. In this opinion paper, we study the divergence between what is incentivized by leaderboards and what is useful in practice through the lens of microeconomic theory. We frame both the leaderboard and NLP practitioners as consumers and the benefit they get from a model as its utility to them. With this framing, we formalize how leaderboards in their current form can be poor proxies for the NLP community at large. For example, a highly inefficient model would provide less utility to practitioners but not to a leaderboard, since it is a cost that only the former must bear. To allow practitioners to better estimate a model's utility to them, we advocate for more transparency on leaderboards, such as the reporting of statistics that are of practical concern (e.g., model size, energy efficiency, and inference latency).


