2021

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt

Cite Score

83

AI summary

This paper introduces a multitask benchmark covering 57 subjects to evaluate the knowledge and problem-solving ability of text models. It finds that the largest GPT-3 model beats random chance by almost 20 percentage points on average, yet falls short of expert-level accuracy on every task and performs unevenly across subjects.

Main Contributions

  • Introduces a new multitask test (MMLU) covering 57 subjects across STEM, humanities, and social sciences to measure text model accuracy.
  • Evaluates knowledge acquired during pretraining in zero-shot and few-shot settings, without relying on large training sets.
  • Finds that the largest GPT-3 model achieves 43.9% accuracy on MMLU, significantly better than random chance but still far below expert-level accuracy.
  • Highlights that models struggle with calculation-heavy STEM subjects and socially important subjects like law and morality.
  • Shows that GPT-3 is poorly calibrated: its confidence is only weakly related to its actual accuracy.
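
The calibration finding in the last item can be made concrete with a small sketch. The code below computes expected calibration error (ECE), a common way to quantify the gap between a model's stated confidence and its observed accuracy. This page does not say which calibration measure the paper uses, so treat the metric choice, the function name `expected_calibration_error`, and the equal-width binning as assumptions made for illustration.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Return the expected calibration error over equal-width confidence bins.

    Hypothetical helper for illustration; not taken from the paper.
    confidences[i] is the model's probability for its chosen answer,
    correct[i] says whether that answer was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Group predictions by confidence; a confidence of 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    # Weighted average of |mean confidence - accuracy| across non-empty bins.
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece
```

A model that answers with 90% confidence but is right only a quarter of the time gets a large ECE, while a model whose confidence matches its hit rate scores close to zero.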

Abstract

We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.
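
The "improves over random chance ... on average" claim can be read as a per-task accuracy averaged across tasks and compared with a chance baseline. The sketch below shows that bookkeeping under stated assumptions: the questions are four-option multiple choice, so chance is 25% (this page does not state the option count), each task is weighted equally, and the helper names and toy answers are hypothetical rather than data from the paper.

```python
from typing import Dict, List, Tuple

# Assumption: four answer options per question, so guessing yields 25%.
CHANCE_ACCURACY = 1 / 4


def task_accuracy(predictions: List[str], answers: List[str]) -> float:
    """Fraction of questions in one task that were answered correctly."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("a task needs at least one question")
    hits = sum(1 for pred, gold in zip(predictions, answers) if pred == gold)
    return hits / len(answers)


def average_over_tasks(results: Dict[str, Tuple[List[str], List[str]]]) -> float:
    """Unweighted mean of per-task accuracies (each task counts equally)."""
    if not results:
        raise ValueError("results must contain at least one task")
    scores = [task_accuracy(preds, golds) for preds, golds in results.values()]
    return sum(scores) / len(scores)


# Toy, made-up answers for two hypothetical tasks.
toy_results = {
    "task_a": (["A", "B", "C", "D"], ["A", "B", "D", "D"]),
    "task_b": (["A", "A"], ["A", "B"]),
}
average = average_over_tasks(toy_results)
margin_over_chance = average - CHANCE_ACCURACY
```

Under the same four-option assumption, the 43.9% figure quoted in the summary sits about 18.9 points above the 25% baseline, which matches the abstract's "almost 20 percentage points".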
