Papperoni

2021

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt

Cite Score: 83

AI summary

This paper introduces a multitask benchmark covering 57 diverse subjects to evaluate the knowledge and problem-solving abilities of text models. The largest GPT-3 model beats random chance by almost 20 percentage points on average, but it remains well short of expert-level accuracy on every task, and its performance is lopsided across subjects.

Main Contributions

Introduces a new multitask test (MMLU) covering 57 subjects across STEM, humanities, and social sciences to measure text model accuracy.
MMLU evaluates knowledge acquired during pretraining using zero-shot and few-shot settings, without large training sets.
Finds that the largest GPT-3 model achieves 43.9% accuracy on MMLU, significantly better than random chance but still far below expert-level accuracy.
Highlights that models struggle with calculation-heavy STEM subjects and socially important subjects like law and morality.
Shows that GPT-3 is uncalibrated, with its confidence poorly related to actual accuracy.
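
The calibration point above can be made concrete with a small sketch. It computes expected calibration error (ECE), a common way to measure the gap between a model's confidence and its accuracy. Whether the paper uses this exact metric is not stated on this page, so treat the metric choice, the function name, and the bin count as assumptions.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    num_bins: int = 10,
) -> float:
    """Weighted average of |mean confidence - accuracy| over equal-width bins.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    # Group predictions by confidence; a confidence of 1.0 goes in the last bin.
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_conf - accuracy)
    return ece


# A model that is 90% confident but right only 1 time in 4 is badly calibrated.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, False, False, False]
)
print(round(overconfident, 2))
```

A well-calibrated model would score close to 0 here; larger values mean confidence is a worse guide to correctness.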

Abstract

We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.
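
The "almost 20 percentage points" figure follows from simple arithmetic once a chance baseline is fixed. The sketch below assumes four answer options per question (a 25% chance baseline), which this page does not state, and uses the 43.9% GPT-3 accuracy listed under Main Contributions. The function names are hypothetical.

```python
def chance_accuracy(num_options: int) -> float:
    """Expected accuracy, in percent, of uniform random guessing."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return 100.0 / num_options


def gain_over_chance(accuracy_pct: float, num_options: int = 4) -> float:
    """Percentage-point improvement over random guessing.

    num_options=4 is an assumption about the question format.
    """
    return accuracy_pct - chance_accuracy(num_options)


gpt3_accuracy_pct = 43.9  # figure quoted under Main Contributions
print(round(gain_over_chance(gpt3_accuracy_pct), 1))  # 18.9
```

Under that assumption the gain is 18.9 points, which matches the abstract's "almost 20".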

References [32]

[1]BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, M. W. Chang, K. Lee, Kristina Toutanova - 2018

39 papers in library cite

Simply amazing. It's very impressive how they make a leap vs. existing stuff (you can see from the references, pretty much no one is doing what they are doing, other than GPT)

[2]Language Models Are Few-Shot Learners

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei - 2020

21 papers in library cite

It's just training the GPT arch with more data and more params. Nothing too surprising, but kudos for identifying and formalizing few-shot learning.

[3]Computing Machinery and Intelligence

A. M. Turing - 1950

8 papers in library cite

A must-read, but it gets a bit boring halfway through (as he is describing every counter argument).

[4]RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, M. Ott, N. Goyal, J. Du, M. Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov - 2019

17 papers in library cite

I liked it a lot! It shows that you don't need to do something completely new to have good results and contribute to science. It could be a 5, but it's a 4 due to not bringing anything new

[5]Language Models Are Unsupervised Multitask Learners

Alec Radford, Jeffrey Wu, Rewon Child, D. Luan, Dario Amodei, Ilya Sutskever - 2019

27 papers in library cite

Amazing! Tons of important contributions. I think they could have explained the models a bit better, and I think this is where OpenAI starts to become evil (and not open)

[6]ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations

Z. Lan, Mingda Chen, S. Goodman, Kevin Gimpel, P. Sharma, Radu Soricut - 2019

8 papers in library cite

I like how they can achieve very close results with very few params! Very nice tricks to do that as well.

[7]GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

A. Wang, A. Singh, J. Michael, F. Hill, Omer Levy, Samuel R. Bowman - 2018

26 papers in library cite

I like it, but it's just a mesh of different existing datasets and F1 score. Nothing new really but I get why it's important

[8]Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei - 2020

12 papers in library cite

Very nice! An amazing contribution. Problem is, the paper is just like 3 pages of actual interesting content, and 10 pages of detailed results. Boring to read but very good otherwise.

[9]Think You Have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord - 2018

5 papers in library cite

Meh, this is just the dataset definition. I don't see anything special, just a new data source. No new methodology or anything.

[10]HellaSwag: Can a Machine Really Finish Your Sentence?

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, Yejin Choi - 2019

6 papers in library cite

It's very simple, just a revamp of the SWAG dataset. However, the authors showed a lot of foresight about the dataset saturation loop; it made me wonder if we stopped using adversarial examples because they just don't make sense anymore.

[11]Language Models as Knowledge Bases?

F. Petroni, Tim Rocktäschel, P. Lewis, A. Bakhtin, Yuxiang Wu, A. H. Miller, Sebastian Riedel - 2019

4 papers in library cite

Very nice, but I expected a bit more. I thought it would be more of a philosophical discussion rather than a benchmark analysis. Still, probably the first ones to notice that LMs contain knowledge.

[12]SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

A. Wang, Y. Pruksachatkun, Nikita Nangia, A. Singh, J. Michael, F. Hill, Omer Levy, Samuel R. Bowman - 2019

15 papers in library cite

Nothing too surprising, just getting a bunch of stuff from different places and putting it all together. At least they do a good analysis of the benchmark vs. existing methodologies.

[13]RACE: Large-Scale Reading Comprehension Dataset From Examinations

Guokun Lai, Q. Xie, Hanxiao Liu, Yiming Yang, Eduard Hovy - 2017

11 papers in library cite

I really like the idea of using human tests for testing AI. Also, very nice insight to use Chinese tests!

[14]MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text

M. Richardson, C. J. C. Burges, Erin Renshaw - 2013

16 papers in library cite

Maybe the best dataset paper I have ever read. So well explained, so thoroughly thought out! It's a shame it's a very small dataset...

[15]Exploring the Limits of Transfer Learning With a Unified Text-to-Text Transformer

Colin Raffel, Noam Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, Wei Li, P. J. Liu - 2019

17 papers in library cite

44 pages; T5 paper - "unifying framework where all text-based NLP problems are cast as text-to-text tasks"

[16]PIQA: Reasoning About Physical Commonsense in Natural Language

Yonatan Bisk, Rowan Zellers, R. L. Bras, Jianfeng Gao, Yejin Choi - 2019

5 papers in library cite

[17]Aligning AI With Shared Human Values

Dan Hendrycks, Collin Burns, Steven Basart, A. Critch, Jerry Li, Dawn Song, Jacob Steinhardt - 2020

3 papers in library cite

[18]UnifiedQA: Crossing Format Boundaries With a Single QA System

Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, Hannaneh Hajishirzi - 2020

5 papers in library cite

Not sure if I want to read it. Only ~400 citations.

[19]Evaluating Machines by Their Real-World Language Use

Rowan Zellers, Ari Holtzman, E. Clark, Lianhui Qin, Ali Farhadi, Yejin Choi - 2020

2 papers in library cite

[20]Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

T. Mihaylov, Peter Clark, Tushar Khot, Ashish Sabharwal - 2018

6 papers in library cite

[21]The Arcade Learning Environment: An Evaluation Platform for General Agents

M. G. Bellemare, Y. Naddaf, J. Veness, M. Bowling - 2013

5 papers in library cite

[22]On Calibration of Modern Neural Networks

C. Guo, G. Pleiss, Y. S. Sun, K. Q. Weinberger - 2017

4 papers in library cite

[23]Experience Grounds Language

Yonatan Bisk, Ari Holtzman, J. Thomason, Jacob Andreas, Yoshua Bengio, J. Chai, Mirella Lapata, A. Lazaridou, J. May, A. Nisnevich - 2020

3 papers in library cite

[24]Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift

Y. Ovadia, E. Fertig, J. Ren, Z. Nado, D. Sculley, S. Nowozin, J. V. Dillon, B. Lakshminarayanan, J. Snoek - 2019

2 papers in library cite

[25]Cosmos QA: Machine Reading Comprehension With Contextual Commonsense Reasoning

L. Huang, R. L. Bras, C. Bhagavatula, Yejin Choi - 2019

2 papers in library cite

[26]Deep Anomaly Detection With Outlier Exposure

Dan Hendrycks, Mantas Mazeika, T. Dietterich - 2019

2 papers in library cite

[27]A Survey of Evaluation Metrics Used for NLG Systems

A. B. Sai, A. K. Mohankumar, M. M. Khapra - 2020

1 paper in library cites

[28]From 'F' to 'A' on the N.Y. Regents Science Exams: An Overview of the Aristo Project

Peter Clark, Oren Etzioni, Daniel Khashabi, Tushar Khot, B. D. Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, N. Tandon, S. Bhakthavatsalam, D. Groeneveld, M. Guerquin, M. Schmitz - 2019

1 paper in library cites

[29]Natural Adversarial Examples

Dan Hendrycks, K. Zhao, Steven Basart, Jacob Steinhardt, Dawn Song - 2019

1 paper in library cites

[30]QASC: A Dataset for Question Answering via Sentence Composition

Tushar Khot, Peter Clark, M. Guerquin, P. Jansen, Ashish Sabharwal - 2019

1 paper in library cites

[31]Shortcut Learning in Deep Neural Networks

R. Geirhos, J. H. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, F. A. Wichmann - 2020

1 paper in library cites

[32]Verified Uncertainty Calibration

A. Kumar, Percy Liang, T. Ma - 2019

1 paper in library cites

Cited by 6 papers in your library

Cites 17 papers in your library

Read on May 24, 2026

OK, a good benchmark, but nothing surprising. I bet it saturated soon after (it did).

Tags

Vetto Study

Paper Aliases

No aliases