2021

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman

Cite Score: 82

AI summary

This paper introduces GSM8K, a dataset of 8.5K math word problems, and proposes training verifiers to judge the correctness of model completions, demonstrating significant performance improvements on GSM8K and better scaling with data compared to finetuning baselines.

Main Contributions

  • Introduction of GSM8K, a curated dataset of 8.5K grade school math questions with natural language solutions.
  • Proposal of training verifiers to evaluate the correctness of model-generated solutions for multi-step mathematical reasoning.
  • Demonstration that verification significantly improves performance on GSM8K, achieving a boost comparable to a 30x model size increase.
  • Empirical evidence that verification scales more effectively with increased data than a finetuning baseline.
  • Identification of dropout as a strong regularizer that significantly improves performance for both finetuning and verification.

Abstract

State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K, a dataset of 8.5K high quality linguistically diverse grade school math word problems. We find that even the largest transformer models fail to achieve high test performance, despite the conceptual simplicity of this problem distribution. To increase performance, we propose training verifiers to judge the correctness of model completions. At test time, we generate many candidate solutions and select the one ranked highest by the verifier. We demonstrate that verification significantly improves performance on GSM8K, and we provide strong empirical evidence that verification scales more effectively with increased data than a finetuning baseline.
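The test-time procedure described in the abstract (sample many candidate solutions, then keep the one the verifier ranks highest) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `select_best_solution`, `sample_solution`, and `verifier_score` are hypothetical names standing in for a finetuned generator and a trained verifier, and the default of 8 candidates is an arbitrary placeholder.

```python
from typing import Callable, List, Tuple


def select_best_solution(
    problem: str,
    sample_solution: Callable[[str], str],        # hypothetical: draws one candidate solution
    verifier_score: Callable[[str, str], float],  # hypothetical: higher means "more likely correct"
    num_candidates: int = 8,                      # arbitrary default, not taken from the paper
) -> Tuple[str, float]:
    """Sample candidate solutions and return the one the verifier scores highest."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")

    # Step 1: generate many candidate solutions for the same problem.
    candidates: List[str] = [sample_solution(problem) for _ in range(num_candidates)]

    # Step 2: score each candidate with the verifier.
    scored = [(verifier_score(problem, candidate), candidate) for candidate in candidates]

    # Step 3: keep the highest-ranked candidate (ties resolve to the earliest sample).
    best_score, best_candidate = max(scored, key=lambda pair: pair[0])
    return best_candidate, best_score
```

In this sketch the generator and verifier are passed in as plain callables, so any sampling model and any scoring model can be plugged in without changing the selection logic.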
