Papperoni

2022

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

Mirac Suzgun, Nathan Scales, Nathanael Scharli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, Jason Wei

Cite Score

48

AI summary

This paper introduces BIG-Bench Hard (BBH), a suite of 23 challenging BIG-Bench tasks, and shows that Chain-of-Thought (CoT) prompting significantly improves performance for PaLM and Codex, which surpass the average human-rater baseline on 10 and 17 of the 23 tasks, respectively.

Main Contributions

Introduces BIG-Bench Hard (BBH), a curated set of 23 challenging BIG-Bench tasks where prior language models underperformed human-rater baselines.
Demonstrates that Chain-of-Thought (CoT) prompting significantly improves language model performance on BBH tasks.
Shows that CoT prompting enables PaLM to surpass average human-rater performance on 10 of 23 BBH tasks, and Codex (code-davinci-002) on 17 of 23 tasks.
Analyzes the interaction between CoT prompting and model scale, finding that performance gains emerge with sufficiently large models.
Reveals that CoT prompting unlocks emergent task performance for several BBH tasks that otherwise exhibit flat scaling curves.

Abstract

BIG-Bench (Srivastava et al., 2022) is a diverse evaluation suite that focuses on tasks believed to be beyond the capabilities of current language models. Language models have already made good progress on this benchmark, with the best model in the BIG-Bench paper outperforming average reported human-rater results on 65% of the BIG-Bench tasks via few-shot prompting. But on what tasks do language models fall short of average human-rater performance, and are those tasks actually unsolvable by current language models? In this work, we focus on a suite of 23 challenging BIG-Bench tasks which we call BIG-Bench Hard (BBH). These are the tasks for which prior language model evaluations did not outperform the average human-rater. We find that applying chain-of-thought (CoT) prompting to BBH tasks enables PaLM to surpass the average human-rater performance on 10 of the 23 tasks, and Codex (code-davinci-002) to surpass the average human-rater performance on 17 of the 23 tasks. Since many tasks in BBH require multi-step reasoning, few-shot prompting without CoT, as done in the BIG-Bench evaluations (Srivastava et al., 2022), substantially underestimates the best performance and capabilities of language models, which is better captured via CoT prompting. As further analysis, we explore the interaction between CoT and model scale on BBH, finding that CoT enables emergent task performance on several BBH tasks with otherwise flat scaling curves.
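
To make the contrast in the abstract concrete, here is a minimal sketch of how an answer-only few-shot prompt differs from a CoT few-shot prompt. Everything in it is an assumption for illustration: the `Exemplar` class, the `build_prompt` and `extract_answer` helpers, the "Q:/A:" layout, and the "So the answer is" marker are hypothetical and are not taken from the paper's released prompts or code.

```python
import re
from dataclasses import dataclass


@dataclass
class Exemplar:
    """One few-shot demonstration (hypothetical structure)."""
    question: str
    rationale: str  # worked intermediate reasoning steps
    answer: str


def build_prompt(exemplars, question, use_cot):
    """Build a few-shot prompt, with or without rationales in the exemplars."""
    parts = []
    for ex in exemplars:
        if use_cot:
            # CoT: each demonstration shows the reasoning before the answer.
            parts.append(
                f"Q: {ex.question}\nA: {ex.rationale} So the answer is {ex.answer}."
            )
        else:
            # Answer-only: each demonstration jumps straight to the answer.
            parts.append(f"Q: {ex.question}\nA: {ex.answer}")
    # The test question is left open for the model to complete.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def extract_answer(completion):
    """Pull the final answer out of a CoT completion (assumed marker phrase)."""
    match = re.search(r"So the answer is (.+?)\.?\s*$", completion.strip())
    return match.group(1) if match else None
```

With `use_cot=True`, the model sees worked reasoning in every demonstration and is nudged to produce its own before answering, so scoring then needs a step like `extract_answer` to isolate the final answer from the generated rationale.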

References [53]

[1]BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, M. W. Chang, K. Lee, Kristina Toutanova - 2018

39 papers in library cite

Simply amazing. It's very impressive how they make a leap vs. existing stuff (you can see from the references, pretty much no one is doing what they are doing, other than GPT)

[2]Language Models Are Few-Shot Learners

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei - 2020

21 papers in library cite

It's just training the GPT arch with more data and more params. Nothing too surprising, but kudos for identifying and formalizing few-shot learning.

[3]Training Language Models to Follow Instructions With Human Feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, C. Wainwright, Pamela Mishkin, Chiyuan Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, Ryan Lowe - 2022

11 papers in library cite

No new research here. The only true contribution is scaling RLHF to GPT-3.

[4]Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, Wojciech Zaremba - 2021

9 papers in library cite

Very nice read! Nothing new in the methodology but very thoughtful and thorough analysis.

[5]Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei - 2020

12 papers in library cite

Very nice! An amazing contribution. Problem is, the paper is just like 3 pages of actual interesting content, and 10 pages of detailed results. Boring to read but very good otherwise.

[6]Program Synthesis With Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le - 2021

4 papers in library cite

Boring read. No news. They lost to OpenAI and HumanEval.

[7]Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aman Gupta, Adria Garriga Alonso - 2022

4 papers in library cite

Nice initiative but too much focus on models and performance rather than the bench itself. Also, it was saturated right after.

[8]Exploring the Limits of Transfer Learning With a Unified Text-to-Text Transformer

Colin Raffel, Noam Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, Wentao Li, P. J. Liu - 2019

17 papers in library cite

44 pages; T5 paper - "unifying framework where all text-based NLP problems are cast as text-to-text tasks"

[9]Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, E. Chi, Quoc V. Le, Denny Zhou - 2022

10 papers in library cite

CoT

[10]PaLM: Scaling Language Modeling With Pathways

Aakanksha Chowdhery, S. Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, A. Roberts, P. Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann - 2023

6 papers in library cite

[11]Finetuned Language Models Are Zero-Shot Learners

Jason Wei, Maarten Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, Quoc V. Le - 2021

3 papers in library cite

[12]Large Language Models Are Zero-Shot Reasoners

T. Kojima, Shixiang Shane Gu, M. Reid, Y. Matsuo, Y. Iwasawa - 2022

6 papers in library cite

REASONING!

[13]Emergent Abilities of Large Language Models

Jason Wei, Yi Tay, R. Bommasani, Colin Raffel, Barret Zoph, S. Borgeaud, D. Yogatama, Maarten Bosma, Denny Zhou, D. Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeffrey Dean, William Fedus - 2022

2 papers in library cite

[14]Self-Consistency Improves Chain of Thought Reasoning in Language Models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, E. Chi, Denny Zhou - 2022

5 papers in library cite

Missing author list

[15]Training Compute-Optimal Large Language Models

2022

4 papers in library cite

"DeepMind’s research showed that we weren't just compute-constrained, but data-constrained."

[16]Multitask Prompted Training Enables Zero-Shot Task Generalization

V. Sanh, A. Webson, Colin Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, T. L. Scao, A. Raja - 2021

4 papers in library cite

Missing author list

[17]Scaling Language Models: Methods, Analysis & Insights from Training Gopher

2021

4 papers in library cite

Gopher: Massive scaling + Google Deepmind

[18]Show Your Work: Scratchpads for Intermediate Computation With Language Models

Maxwell Nye, A. J. Andreassen, Guy Gur Ari, Henryk Michalewski, Jacob Austin, D. Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, D. Luan, Charles Sutton, Augustus Odena - 2021

5 papers in library cite

First reasoning?

[19]Language Models (Mostly) Know What They Know

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield Dodds, Nova Dassarma, Eli Tran Johnson, Scott Johnston, Sheer El Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, S. Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, J. Jacobson, Jackson Kernion, Shauna Kravec, Liane Lovitt, Kamal Ndousse, Catherine Olsson, Sam Ringer, Dario Amodei, Tom B. Brown, Jack Clark, Nicholas Joseph, Benjamin Mann, Sam McCandlish, Christopher Olah, Jared Kaplan - 2022

3 papers in library cite

[20]Scaling Laws for Transfer

Danny Hernandez, Jared Kaplan, Tom Henighan, Sam McCandlish - 2021

5 papers in library cite

[21]Cross-Task Generalization via Natural Language Crowdsourcing Instructions

Swaroop Mishra, Daniel Khashabi, Chitta Baral, Hananneh Hajishirzi - 2021

4 papers in library cite

[22]Calibrate Before Use: Improving Few-Shot Performance of Language Models

Zhuoye Zhao, E. Wallace, S. Feng, Dan Klein, Shivalika Singh - 2021

3 papers in library cite

[23]On the Advance of Making Language Models Better Reasoners

Yiwei Li, Zongyu Lin, S. Zhang, Q. Fu, Berlin Chen, J. G. Lou, Weizhu Chen - 2022

3 papers in library cite

[24]Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning

Antonia Creswell, M. Shanahan, Irina Higgins - 2022

3 papers in library cite

[25]True Few-Shot Learning With Language Models

Ethan Perez, Douwe Kiela, Kyunghyun Cho - 2021

3 papers in library cite

[26]Exploiting Cloze Questions for Few-Shot Text Classification and Natural Language Inference

Timo Schick, Hinrich Schutze - 2020

2 papers in library cite

[27]Language Models Are Few-Shot Multilingual Learners

Genta Indra Winata, Andrea Madotto, Zongyu Lin, Rosanne Liu, Jason Yosinski, Pascale Fung - 2021

2 papers in library cite

[28]Language Models Are Multilingual Chain-of-Thought Reasoners

F. Shi, Mirac Suzgun, M. Freitag, Xuezhi Wang, S. Srivats, S. Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, Jason Wei - 2023

2 papers in library cite

[29]Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

Denny Zhou, Nathanael Scharli, L. Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, C. Cui, O. Bousquet, Quoc V. Le, Ed H. Chi - 2023

2 papers in library cite

[30]Predictability and Surprise in Large Generative Models

Deep Ganguli, Danny Hernandez, Liane Lovitt, Nova Dassarma, Tom Henighan, Andy Jones, Nicholas Joseph, Jackson Kernion, Benjamin Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Nelson Elhage, Sheer El Showk, Stanislav Fort, Zac Hatfield Dodds, Scott Johnston, Shauna Kravec, Neel Nanda, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom B. Brown, Jared Kaplan, Sam McCandlish, Christopher Olah, Jack Clark - 2022

2 papers in library cite

[31]The Power of Scale for Parameter-Efficient Prompt Tuning

B. Lester, R. A. Rfou, Noah Constant - 2021

2 papers in library cite

[32]A Recipe for Arbitrary Text Style Transfer With Large Language Models

E. Reif, Daphne Ippolito, A. Yuan, A. Coenen, Chris Callison Burch, Jason Wei - 2022

1 paper in library cites

[33]AmbiPun: Generating Humorous Puns With Ambiguous Context

A. Mittal, Yuandong Tian, Nanyun Peng - 2022

1 paper in library cites

[34]An Explanation of In-Context Learning as Implicit Bayesian Inference

S. M. Xie, A. Raghunathan, Percy Liang, T. Ma - 2021

1 paper in library cites

[35]Binding Language Models in Symbolic Languages

Zhoujun Cheng, Tianbao Xie, P. Shi, Chun-Liang Li, R. Nadkarni, Y. Hu, Caiming Xiong, D. R. Radev, M. Ostendorf, Luke Zettlemoyer - 2022

1 paper in library cites

[36]Can Language Models Encode Perceptual Structure Without Grounding? A Case Study in Color

M. Abdou, A. Kulmizev, D. Hershcovich, S. Frank, Ellie Pavlick, A. Sogaard - 2021

1 paper in library cites

[37]Can Language Models Learn From Explanations in Context?

A. K. Lampinen, I. Dasgupta, S. C. Chan, K. Matthewson, M. H. Tessler, Antonia Creswell, J. L. Mcclelland, J. X. Wang, F. Hill - 2022

1 paper in library cites

[38]Competition-Level Code Generation With AlphaCode

Yiwei Li, D. Choi, J. Chung, Nate Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. D. Lago - 2022

1 paper in library cites

[39]Compositional Semantic Parsing With Large Language Models

A. Drozdov, Nathanael Scharli, E. Akyurek, Nathan Scales, X. Song, X. Chen, O. Bousquet, Denny Zhou - 2022

1 paper in library cites

[40]Do Prompt-Based Models Really Understand the Meaning of Their Prompts?

A. Webson, Ellie Pavlick - 2021

1 paper in library cites

[41]Few-Shot Self-Rationalization With Natural Language Prompts

A. Marasovic, I. Beltagy, D. Downey, M. E. Peters - 2022

1 paper in library cites

[42]Large Language Models Are Few(1)-Shot Table Reasoners

Weizhu Chen - 2022

1 paper in library cites

[43]Mapping Language Models to Grounded Conceptual Spaces

R. Patel, Ellie Pavlick - 2022

1 paper in library cites

[44]MetaICL: Learning to Learn in Context

S. Min, Martha Lewis, Luke Zettlemoyer, Hananneh Hajishirzi - 2022

1 paper in library cites

[45]Natural Language Inference With a Human Touch: Using Human Explanations to Guide Model Attention

J. Stacey, Yonatan Belinkov, M. Rei - 2021

1 paper in library cites

[46]On the Machine Learning of Ethical Judgments From Natural Language

Z. Talat, H. Blix, J. Valvoda, M. I. Ganesh, R. Cotterell, A. Williams - 2022

1 paper in library cites

[47]Prompt-and-Rerank: A Method for Zero-Shot and Few-Shot Arbitrary Textual Style Transfer With Small Language Models

Mirac Suzgun, L. M. Kyriazi, Dan Jurafsky - 2022

1 paper in library cites

[48]Reframing Human-AI Collaboration for Generating Free-Text Explanations

S. Wiegreffe, J. Hessel, Swabha Swayamdipta, M. Riedl, Yejin Choi - 2022

1 paper in library cites

[49]Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?

S. Min, X. Lyu, Ari Holtzman, M. Artetxe, Martha Lewis, Hananneh Hajishirzi, Luke Zettlemoyer - 2022

1 paper in library cites

[50]Scaling Laws vs Model Architectures: How Does Inductive Bias Influence Scaling?

Yi Tay, Mostafa Dehghani, S. Abnar, Hyung Won Chung, William Fedus, J. Rao, S. Narang, V. Q. Tran, D. Yogatama, D. Metzler - 2022

1 paper in library cites

[51]Text and Patterns: For Effective Chain of Thought, It Takes Two to Tango

Aman Madaan, A. Yazdanbakhsh - 2022

1 paper in library cites

[52]Towards Interpretable Natural Language Understanding With Explanations as Latent Variables

W. Zhou, Jiaxi Hu, Haowei Zhang, X. Liang, Maosong Sun, Caiming Xiong, Jie Tang - 2020

1 paper in library cites

[53]When Can Models Learn From Explanations? A Formal Framework for Understanding the Roles of Explanation Data

P. Hase, Mohit Bansal - 2022

1 paper in library cites

Cited by 4 papers in your library

Cites 19 papers in your library

Read on June 3, 2026

Good read and good analysis but no interesting research. They just distilled the benchmark and applied CoT. Good for showing that it works but not exciting.

Tags

Benchmark

Paper Aliases

No aliases