2022

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

Mirac Suzgun, Nathan Scales, Nathanael Scharli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, Jason Wei

Cite Score

48

AI summary

This paper introduces BIG-Bench Hard (BBH), a suite of 23 challenging language model tasks, and shows that Chain-of-Thought (CoT) prompting significantly improves performance for models like PaLM and Codex, surpassing human-rater baselines on many tasks.

Main Contributions

  • Introduces BIG-Bench Hard (BBH), a curated set of 23 challenging BIG-Bench tasks where prior language models underperformed human-rater baselines.
  • Demonstrates that Chain-of-Thought (CoT) prompting significantly improves language model performance on BBH tasks.
  • Shows that CoT prompting enables PaLM to surpass average human-rater performance on 10 of 23 BBH tasks, and Codex (code-davinci-002) on 17 of 23 tasks.
  • Analyzes the interaction between CoT prompting and model scale, finding that performance gains emerge with sufficiently large models.
  • Reveals that CoT prompting unlocks emergent task performance for several BBH tasks that otherwise exhibit flat scaling curves.

Abstract

BIG-Bench (Srivastava et al., 2022) is a diverse evaluation suite that focuses on tasks believed to be beyond the capabilities of current language models. Language models have already made good progress on this benchmark, with the best model in the BIG-Bench paper outperforming average reported human-rater results on 65% of the BIG-Bench tasks via few-shot prompting. But on what tasks do language models fall short of average human-rater performance, and are those tasks actually unsolvable by current language models? In this work, we focus on a suite of 23 challenging BIG-Bench tasks which we call BIG-Bench Hard (BBH). These are the tasks for which prior language model evaluations did not outperform the average human-rater. We find that applying chain-of-thought (CoT) prompting to BBH tasks enables PaLM to surpass the average human-rater performance on 10 of the 23 tasks, and Codex (code-davinci-002) to surpass the average human-rater performance on 17 of the 23 tasks. Since many tasks in BBH require multi-step reasoning, few-shot prompting without CoT, as done in the BIG-Bench evaluations (Srivastava et al., 2022), substantially underestimates the best performance and capabilities of language models, which is better captured via CoT prompting. As further analysis, we explore the interaction between CoT and model scale on BBH, finding that CoT enables emergent task performance on several BBH tasks with otherwise flat scaling curves.
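
The difference between answer-only few-shot prompting and CoT prompting described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Exemplar` type, the `build_prompt` helper, the "Q:/A:" layout, and the arithmetic exemplar are hypothetical and are not taken from the paper or from the BBH prompt files.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    question: str
    rationale: str  # worked reasoning, used only in CoT prompts
    answer: str


def build_prompt(exemplars, query, use_cot):
    """Format few-shot exemplars followed by the unanswered query."""
    blocks = []
    for ex in exemplars:
        if use_cot:
            # CoT: show the intermediate reasoning before the final answer.
            completion = f"{ex.rationale} So the answer is {ex.answer}."
        else:
            # Answer-only: the exemplar jumps straight to the answer.
            completion = ex.answer
        blocks.append(f"Q: {ex.question}\nA: {completion}")
    # The query is left open for the model to complete.
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)


# Hypothetical exemplar, invented for this sketch (not a BBH task).
EXEMPLARS = [
    Exemplar(
        question="I have 3 apples and buy 2 more. How many apples do I have?",
        rationale="I start with 3 apples. Buying 2 more gives 3 + 2 = 5.",
        answer="5",
    )
]
QUERY = "I have 4 pens and lose 1. How many pens do I have?"

answer_only_prompt = build_prompt(EXEMPLARS, QUERY, use_cot=False)
cot_prompt = build_prompt(EXEMPLARS, QUERY, use_cot=True)

if __name__ == "__main__":
    print(answer_only_prompt)
    print("-----")
    print(cot_prompt)
```

The two prompts share the same exemplar question and query; only the exemplar completion differs, so any change in model behavior can be attributed to the presence of the rationale.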

References [53]

Jacob Devlin, M. W. Chang, K. Lee, Kristina Toutanova - 2018

39 papers in library cite

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei - 2020

21 papers in library cite

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, C. Wainwright, Pamela Mishkin, Chiyuan Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, Ryan Lowe - 2022

11 papers in library cite

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, Wojciech Zaremba - 2021

9 papers in library cite

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei - 2020

12 papers in library cite

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le - 2021

4 papers in library cite

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aman Gupta, Adria Garriga Alonso - 2022

4 papers in library cite

Colin Raffel, Noam Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, Wentao Li, P. J. Liu - 2019

17 papers in library cite

Jason Wei, Xinpeng Wang, Dale Schuurmans, Maarten Bosma, Fanyue Xia, E. Chi, Quoc V. Le, Denny Zhou - 2022

10 papers in library cite

Aakanksha Chowdhery, S. Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, A. Roberts, P. Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann - 2023

6 papers in library cite

Jason Wei, Maarten Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, Quoc V. Le - 2021

3 papers in library cite

T. Kojima, Shixiang Shane Gu, M. Reid, Y. Matsuo, Y. Iwasawa - 2022

6 papers in library cite

Jason Wei, Yi Tay, R. Bommasani, Colin Raffel, Barret Zoph, S. Borgeaud, D. Yogatama, Maarten Bosma, Denny Zhou, D. Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeffrey Dean, William Fedus - 2022

2 papers in library cite

Xinpeng Wang, Jason Wei, Dale Schuurmans, Quoc Le, E. Chi, Denny Zhou - 2022

5 papers in library cite

Missing author list

2022

4 papers in library cite

V. Sanh, A. Webson, Colin Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, T. L. Scao, A. Raja - 2021

4 papers in library cite

Missing author list

2021

4 papers in library cite

Maxwell Nye, A. J. Andreassen, Guy Gur Ari, Henryk Michalewski, Jacob Austin, D. Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, D. Luan, Charles Sutton, Augustus Odena - 2021

5 papers in library cite

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield Dodds, Nova Dassarma, Eli Tran Johnson, Scott Johnston, Sheer El Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, S. Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, J. Jacobson, Jackson Kernion, Shauna Kravec, Liane Lovitt, Kamal Ndousse, Catherine Olsson, Sam Ringer, Dario Amodei, Tom B. Brown, Jack Clark, Nicholas Joseph, Benjamin Mann, Sam McCandlish, Christopher Olah, Jared Kaplan - 2022

3 papers in library cite

Danny Hernandez, Jared Kaplan, Tom Henighan, Sam McCandlish - 2021

5 papers in library cite

Swaroop Mishra, Daniel Khashabi, Chitta Baral, Hannaneh Hajishirzi - 2021

4 papers in library cite

Zhuoye Zhao, E. Wallace, S. Feng, Dan Klein, Shivalika Singh - 2021

3 papers in library cite

Yiwei Li, Zongyu Lin, S. Zhang, Q. Fu, Berlin Chen, J. G. Lou, Weizhu Chen - 2022

3 papers in library cite

Antonia Creswell, M. Shanahan, Irina Higgins - 2022

3 papers in library cite

Ethan Perez, Douwe Kiela, Kyunghyun Cho - 2021

3 papers in library cite

Timo Schick, Hinrich Schutze - 2020

2 papers in library cite

Genta Indra Winata, Andrea Madotto, Zongyu Lin, Rosanne Liu, Jason Yosinski, Pascale Fung - 2021

2 papers in library cite

F. Shi, Mirac Suzgun, M. Freitag, Xinpeng Wang, S. Srivats, S. Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, Jason Wei - 2023

2 papers in library cite

Denny Zhou, Nathanael Scharli, L. Hou, Jason Wei, Nathan Scales, Xinpeng Wang, Dale Schuurmans, C. Cui, O. Bousquet, Quoc V. Le, Ed H. Chi - 2023

2 papers in library cite

Deep Ganguli, Danny Hernandez, Liane Lovitt, Nova Dassarma, Tom Henighan, Andy Jones, Nicholas Joseph, Jackson Kernion, Benjamin Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Nelson Elhage, Sheer El Showk, Stanislav Fort, Zac Hatfield Dodds, Scott Johnston, Shauna Kravec, Neel Nanda, Kamal Ndousse, Catherine Olsson, Daniela Amodei, Dario Amodei, Tom B. Brown, Jared Kaplan, Sam McCandlish, Christopher Olah, Jack Clark - 2022

2 papers in library cite

B. Lester, R. A. Rfou, Noah Constant - 2021

2 papers in library cite

E. Reif, Daphne Ippolito, A. Yuan, A. Coenen, Chris Callison Burch, Jason Wei - 2022

1 paper in library cites

A. Mittal, Yuandong Tian, Nanyun Peng - 2022

1 paper in library cites

S. M. Xie, A. Raghunathan, Percy Liang, T. Ma - 2021

1 paper in library cites

Zhoujun Cheng, Tianbao Xie, P. Shi, Chun-Liang Li, R. Nadkarni, Y. Hu, Caiming Xiong, D. R. Radev, M. Ostendorf, Luke Zettlemoyer - 2022

1 paper in library cites

M. Abdou, A. Kulmizev, D. Hershcovich, S. Frank, Ellie Pavlick, A. Sogaard - 2021

1 paper in library cites

A. K. Lampinen, I. Dasgupta, S. C. Chan, K. Matthewson, M. H. Tessler, Antonia Creswell, J. L. Mcclelland, J. X. Wang, F. Hill - 2022

1 paper in library cites

Yiwei Li, D. Choi, J. Chung, Nate Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. D. Lago - 2022

1 paper in library cites

A. Drozdov, Nathanael Scharli, E. Akyurek, Nathan Scales, X. Song, X. Chen, O. Bousquet, Denny Zhou - 2022

1 paper in library cites

A. Webson, Ellie Pavlick - 2021

1 paper in library cites

A. Marasovic, I. Beltagy, D. Downey, M. E. Peters - 2022

1 paper in library cites

Weizhu Chen - 2022

1 paper in library cites

R. Patel, Ellie Pavlick - 2022

1 paper in library cites

S. Min, Martha Lewis, Luke Zettlemoyer, Hannaneh Hajishirzi - 2022

1 paper in library cites

J. Stacey, Yonatan Belinkov, M. Rei - 2021

1 paper in library cites

Z. Talat, H. Blix, J. Valvoda, M. I. Ganesh, R. Cotterell, A. Williams - 2022

1 paper in library cites

Mirac Suzgun, L. M. Kyriazi, Dan Jurafsky - 2022

1 paper in library cites

S. Wiegreffe, J. Hessel, Swabha Swayamdipta, M. Riedl, Yejin Choi - 2022

1 paper in library cites

S. Min, X. Lyu, Ari Holtzman, M. Artetxe, Martha Lewis, Hannaneh Hajishirzi, Luke Zettlemoyer - 2022

1 paper in library cites

Yi Tay, Mostafa Dehghani, S. Abnar, Hyung Won Chung, William Fedus, J. Rao, S. Narang, V. Q. Tran, D. Yogatama, D. Metzler - 2022

1 paper in library cites

Aman Madaan, A. Yazdanbakhsh - 2022

1 paper in library cites

W. Zhou, Jiaxi Hu, Haowei Zhang, X. Liang, Maosong Sun, Caiming Xiong, Jie Tang - 2020

1 paper in library cites

Cited by

4

papers in your library

Cites

19

papers in your library

Read

on June 3, 2026

Tags

Benchmark