Papperoni

2021

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman

Cite Score: 82

AI summary

This paper introduces GSM8K, a dataset of 8.5K math word problems, and proposes training verifiers to judge the correctness of model completions, demonstrating significant performance improvements on GSM8K and better scaling with data compared to finetuning baselines.

Main Contributions

Introduction of GSM8K, a curated dataset of 8.5K grade school math questions with natural language solutions.
Proposal of training verifiers to evaluate the correctness of model-generated solutions for multi-step mathematical reasoning.
Demonstration that verification significantly improves performance on GSM8K, achieving a boost comparable to a 30x model size increase.
Empirical evidence that verification scales more effectively with increased data than a finetuning baseline.
Identification of dropout as a strong regularizer that significantly improves performance for both finetuning and verification.

Abstract

State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K, a dataset of 8.5K high quality linguistically diverse grade school math word problems. We find that even the largest transformer models fail to achieve high test performance, despite the conceptual simplicity of this problem distribution. To increase performance, we propose training verifiers to judge the correctness of model completions. At test time, we generate many candidate solutions and select the one ranked highest by the verifier. We demonstrate that verification significantly improves performance on GSM8K, and we provide strong empirical evidence that verification scales more effectively with increased data than a finetuning baseline.
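The test-time procedure described in the abstract (sample many candidate solutions, then keep the one the verifier ranks highest) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `solve_with_verifier`, `select_best`, `make_toy_generator`, and `toy_verifier` are hypothetical names, the default of 100 samples is an arbitrary illustrative value, and the toy generator and verifier merely stand in for a finetuned language model and a trained verifier.

```python
import itertools
from typing import Callable, List

# Hypothetical interfaces: a generator maps a problem to one sampled solution,
# and a verifier maps (problem, solution) to a correctness score.
Generator = Callable[[str], str]
Verifier = Callable[[str, str], float]


def select_best(problem: str, candidates: List[str], verifier: Verifier) -> str:
    """Return the candidate solution with the highest verifier score."""
    if not candidates:
        raise ValueError("need at least one candidate solution")
    return max(candidates, key=lambda c: verifier(problem, c))


def solve_with_verifier(
    problem: str,
    generator: Generator,
    verifier: Verifier,
    num_samples: int = 100,  # illustrative default, not taken from the text above
) -> str:
    """Sample num_samples candidate solutions and keep the top-ranked one."""
    candidates = [generator(problem) for _ in range(num_samples)]
    return select_best(problem, candidates, verifier)


def make_toy_generator() -> Generator:
    """Stand-in for a language model: cycles through fixed candidate answers."""
    answers = itertools.cycle(["3 + 5 = 7", "3 + 5 = 8", "3 + 5 = 9"])
    return lambda problem: next(answers)


def toy_verifier(problem: str, solution: str) -> float:
    """Stand-in for a trained verifier: scores the correct answer highest."""
    return 0.9 if solution.endswith("= 8") else 0.1


if __name__ == "__main__":
    best = solve_with_verifier(
        "What is 3 + 5?", make_toy_generator(), toy_verifier, num_samples=6
    )
    print(best)  # prints "3 + 5 = 8"
```

Keeping generation and ranking as separate callables mirrors the split the abstract describes: the generator only has to produce a correct solution somewhere among its samples, and the verifier only has to recognize it.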

References [29]

[1]Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin - 2017

47 papers in library cite

I mean... it introduced Transformers!

[2]Language Models Are Few-Shot Learners

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei - 2020

21 papers in library cite

It's just training the GPT arch with more data and more params. Nothing too surprising, but kudos for identifying and formalizing few-shot learning.

[3]Sequence to Sequence Learning With Neural Networks

Ilya Sutskever, Oriol Vinyals, Quoc V. Le - 2014

58 papers in library cite

Good paper, but I think it only got famous because they set a new good baseline for NNs in MT. Their main contribution was reversing the source sentence TBH.

[4]Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei - 2020

12 papers in library cite

Very nice! An amazing contribution. Problem is, the paper is just like 3 pages of actual interesting content, and 10 pages of detailed results. Boring to read but very good otherwise.

[5]Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt - 2021

8 papers in library cite

Good idea to create the dataset, but it's just a regular, uninteresting dataset-description paper.

[6]SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

A. Wang, Y. Pruksachatkun, Nikita Nangia, A. Singh, J. Michael, F. Hill, Omer Levy, Samuel R. Bowman - 2019

15 papers in library cite

Nothing too surprising, just getting a bunch of stuff from different places and putting it all together. At least they do a good analysis of the benchmark vs. existing methodologies.

[7]MathQA: Towards Interpretable Math Word Problem Solving With Operation-Based Formalisms

A. Amini, S. Gabriel, P. Lin, R. Koncel-Kedziorski, Yejin Choi, Hannaneh Hajishirzi - 2019

6 papers in library cite

[8]A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers

S. Y. Miao, C. C. Liang, K. Y. Su - 2020

3 papers in library cite

[9]CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

A. Talmor, J. Herzig, N. Lourie, Jonathan Berant - 2019

3 papers in library cite

[10]Learning to Automatically Solve Algebra Word Problems

Nate Kushman, Y. Artzi, Luke Zettlemoyer, R. Barzilay - 2014

3 papers in library cite

[11]Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems

W. Ling, D. Yogatama, C. Dyer, Phil Blunsom - 2017

3 papers in library cite

[12]Collaborative Storytelling With Large-Scale Neural Language Models

E. Nichols, Leo Gao, R. Gomez - 2020

2 papers in library cite

[13]Deep Learning for Symbolic Mathematics

G. Lample, F. Charton - 2020

2 papers in library cite

[14]Generate & Rank: A Multi-Task Framework for Math Word Problems

Junhong Shen, Y. Yin, Lei Li, L. Shang, Xu Jiang, Mingchuan Zhang, Qian Liu - 2021

2 papers in library cite

[15]How Well Do Computers Solve Math Word Problems? Large-Scale Dataset Construction and Evaluation

Dong Huang, Sherry Shi, Chin Yew Lin, J. Yin, W. Ma - 2016

2 papers in library cite

[16]LogiQA: A Challenge Dataset for Machine Reading Comprehension With Logical Reasoning

Joseph Liu, L. Cui, Haozhe Liu, Dong Huang, Yuzhi Wang, Y. Z. Zhang - 2020

2 papers in library cite

[17]A Goal-Driven Tree-Structured Neural Model for Math Word Problems

Z. Xie, S. Sun - 2019

1 paper in library cites

[18]Ape210K: A Large-Scale and Template-Rich Dataset of Math Word Problems

W. Zhao, M. Shang, Yibo Liu, Lisa Wang, Joseph Liu - 2020

1 paper in library cites

[19]Deep Neural Solver for Math Word Problems

Yuzhi Wang, Xiaodong Liu, Sherry Shi - 2017

1 paper in library cites

[20]Graph-to-Tree Neural Networks for Learning Structured Input-Output Translation With Applications to Semantic Parsing and Math Word Problem

Shucheng Li, L. Wu, S. Feng, Fangli Xu, Fengyuan Xu, S. Zhong - 2020

1 paper in library cites

[21]Mapping Natural-Language Problems to Formal-Language Solutions Using Structured Neural Representations

K. Chen, Q. Huang, H. Palangi, P. Smolensky, K. D. Forbus, Jianfeng Gao - 2020

1 paper in library cites

[22]MathBERT: A Pre-Trained Language Model for General NLP Tasks in Mathematics Education

J. T. Shen, M. Yamashita, E. Prihar, N. Heffernan, Xiaobao Wu, B. Graff, D. L. Lee - 2021

1 paper in library cites

[23]MathBERT: A Pre-Trained Model for Mathematical Formula Understanding

S. Peng, K. Yuan, Leo Gao, Z. Tang - 2021

1 paper in library cites

[24]MWP-BERT: A Strong Baseline for Math Word Problems

Z. Liang, J. Zhang, J. Shao, X. Zhang - 2021

1 paper in library cites

[25]Neural Math Word Problem Solver With Reinforcement Learning

Dong Huang, Joseph Liu, Chin Yew Lin, J. Yin - 2018

1 paper in library cites

[26]Neural Symbolic Reader: Scalable Integration of Distributed and Symbolic Representations for Reading Comprehension

X. Chen, C. Liang, A. W. Yu, Denny Zhou, Dawn Song, Quoc V. Le - 2019

1 paper in library cites

[27]Point to the Expression: Solving Algebraic Word Problems Using the Expression-Pointer Transformer Model

B. Kim, K. S. Ki, D. L. Lee, G. Gweon - 2020

1 paper in library cites

[28]Semantically-Aligned Equation Generation for Solving and Reasoning Math Word Problems

T. R. Chiang, Y. N. Chen - 2018

1 paper in library cites

[29]Solving General Arithmetic Word Problems

S. Roy, Dan Roth - 2015

1 paper in library cites

Cited by 7 papers in your library

Cites 7 papers in your library

Read on May 30, 2026

Fun read. Simple and nice use of verifiers.
