2021
Cite Score
83
AI summary
This paper introduces a multitask benchmark covering 57 diverse subjects to evaluate the knowledge and problem-solving abilities of text models. It finds that the largest GPT-3 model beats random chance by almost 20 percentage points on average, but still falls short of expert-level accuracy on every task and shows lopsided performance across subjects.
Abstract
We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.
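The comparison against random chance described in the abstract can be illustrated with a minimal sketch. It assumes multiple-choice questions with four options (so chance is 25%) and an unweighted mean over tasks; the excerpt above does not state the number of options or how the average is weighted, so both are assumptions. The function names, task names, and records are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Assumption: four answer options per question, so random guessing scores 25%.
NUM_CHOICES = 4
CHANCE = 1 / NUM_CHOICES


def per_task_accuracy(records):
    """Return {task: accuracy} from (task, predicted, gold) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, gold in records:
        total[task] += 1
        correct[task] += int(predicted == gold)
    return {task: correct[task] / total[task] for task in total}


def macro_average(accuracies):
    """Unweighted mean over tasks (one possible averaging choice)."""
    return sum(accuracies.values()) / len(accuracies)


# Hypothetical toy predictions, not results from the paper.
records = [
    ("us_history", "A", "A"),
    ("us_history", "B", "C"),
    ("law", "D", "D"),
    ("law", "A", "B"),
    ("law", "C", "C"),
    ("law", "B", "B"),
]

acc = per_task_accuracy(records)
avg = macro_average(acc)
gain_points = (avg - CHANCE) * 100  # improvement over chance, in percentage points
```

Reporting the per-task dictionary alongside the average makes lopsided performance visible, since a single mean can hide tasks that sit near chance.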