Papperoni

2023

Let's Verify Step by Step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe

Cite Score

67

AI summary

This paper compares process supervision with outcome supervision for training reliable reward models on complex multi-step reasoning problems from the MATH dataset. It finds that process supervision significantly outperforms outcome supervision, with the process-supervised model solving 78% of problems from a representative subset of the MATH test set, and that active learning makes process supervision 2.6x more data-efficient.

Main Contributions

Process supervision significantly outperforms outcome supervision for training reliable reward models on the MATH dataset, solving 78.2% of problems from a representative subset.
A large reward model can reliably approximate human supervision for smaller reward models, enabling efficient large-scale data collection ablations.
Active learning leads to a 2.6x improvement in the data efficiency of process supervision.
The PRM800K dataset, containing 800,000 step-level human feedback labels, is released to promote related research.
The PRM (process-supervised reward model) demonstrates strong out-of-distribution generalization on recent STEM tests, outperforming ORM and majority voting.
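
The comparison between majority voting and reward-model reranking mentioned above can be sketched roughly as follows. This is a minimal illustration, not the paper's evaluation code: the function names, the sampled answers, and the reward scores are all hypothetical, and the scores stand in for whatever an ORM or PRM would assign to each sampled solution.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among the sampled solutions."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers, scores):
    """Return the final answer of the solution with the highest reward-model score."""
    best_index = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best_index]


# Hypothetical toy data: five sampled solutions to one problem.
sampled_answers = ["12", "15", "12", "15", "15"]
# Hypothetical reward-model scores, one per sampled solution.
reward_scores = [0.91, 0.20, 0.85, 0.35, 0.10]

print(majority_vote(sampled_answers))             # most common answer
print(best_of_n(sampled_answers, reward_scores))  # answer of the top-scored solution
```

In this toy case the two strategies disagree: voting picks the answer that appears most often, while reranking picks the answer the reward model trusts most, even if it is in the minority.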

Abstract

In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare both methods. Recent work has already begun this comparison, but many questions still remain. We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. Our process-supervised model solves 78% of problems from a representative subset of the MATH test set. Additionally, we show that active learning significantly improves the efficacy of process supervision. To support related research, we also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.
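
To make the outcome/process distinction concrete, here is a minimal sketch of how step-level feedback can be turned into a single score for a whole solution. It assumes the scoring rule described in the paper, where a solution's PRM score is the product of the per-step correctness probabilities. The function names, the 0.5 threshold, and the probabilities are hypothetical and chosen only for illustration.

```python
from math import prod


def prm_solution_score(step_probs):
    """Score a solution as the product of its per-step correctness probabilities."""
    return prod(step_probs)


def first_flagged_step(step_probs, threshold=0.5):
    """Return the index of the first step scored below the threshold, or None.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    for index, prob in enumerate(step_probs):
        if prob < threshold:
            return index
    return None


# Hypothetical per-step probabilities for two three-step solutions.
solution_a = [0.98, 0.95, 0.97]  # every step looks sound
solution_b = [0.99, 0.30, 0.96]  # the second step looks like a logical mistake

print(prm_solution_score(solution_a), first_flagged_step(solution_a))
print(prm_solution_score(solution_b), first_flagged_step(solution_b))
```

An outcome-supervised model would only see one label per solution, so it could not point at the second step of `solution_b`. Step-level scores both lower the overall score and locate where the reasoning went wrong.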

References [26]

[1]Training Language Models to Follow Instructions With Human Feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, C. Wainwright, Pamela Mishkin, Chiyuan Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, Ryan Lowe - 2022

11 papers in library cite

No new research here. Only true contribution is scaling RLHF to GPT 3.

[2]Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman - 2021

7 papers in library cite

Fun read. Simple and nice use of verifiers

[3]Deep Reinforcement Learning From Human Preferences

Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei - 2017

11 papers in library cite

Very nice idea overall!

[4]Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt - 2021

8 papers in library cite

Good idea in creating the dataset but it's just a regular and uninteresting dataset description paper

[5]Learning to Summarize from Human Feedback

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano - 2020

10 papers in library cite

Very thoughtful on explaining data collection and worrying about future consequences. Seems much closer to the RLHF we do now vs. what is in the human preferences paper.

[6]WebGPT: Browser-Assisted Question-Answering With Human Feedback

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeffrey Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, John Schulman - 2021

7 papers in library cite

TBH the nicest thing about this is the idea of using a text-based web browser. Other than that, just another application of RLHF + PPO.

[7]Fine-Tuning Language Models From Human Preferences

Geoffrey Irving - 2020

7 papers in library cite

It's so simple how they do it, plus I absolutely LOVED the "challenges" section and how honest they were about it. This is true research!

[8]Scaling Laws for Reward Model Overoptimization

Leo Gao, John Schulman, Jacob Hilton - 2022

3 papers in library cite

Very technical and a ton of practical insights. I understand the relevance, but I didn't find it interesting.

[9]Solving Math Word Problems With Process- And Outcome-Based Feedback

Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, Irina Higgins - 2022

4 papers in library cite

Very low 3. Almost 2. Boring, too big, and nothing surprising. Results were proved (partially) wrong later. I think they just formalized ORM vs. PRM

[10]Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, E. Chi, Quoc V. Le, Denny Zhou - 2022

10 papers in library cite

CoT

[11]GPT-4 Technical Report

OpenAI - 2023

6 papers in library cite

GPT 4

[12]Sparks of Artificial General Intelligence: Early Experiments With GPT-4

S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Yiwei Li, S. Lundberg - 2023

3 papers in library cite

[13]Large Language Models Are Zero-Shot Reasoners

T. Kojima, Shixiang Shane Gu, M. Reid, Y. Matsuo, Y. Iwasawa - 2022

6 papers in library cite

REASONING!

[14]Self-Consistency Improves Chain of Thought Reasoning in Language Models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, E. Chi, Denny Zhou - 2022

5 papers in library cite

[15]On Faithfulness and Factuality in Abstractive Summarization

J. Maynez, Shashi Narayan, B. Bohnet, R. Mcdonald - 2020

6 papers in library cite

[16]Solving Quantitative Reasoning Problems With Language Models

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, C. Anil, I. Schlag, T. G. Solo - 2022

3 papers in library cite

Introduced Minerva

[17]A General Language Assistant as a Laboratory for Alignment

Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Benjamin Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom B. Brown, Jack Clark, Sam McCandlish, Christopher Olah, Jared Kaplan - 2021

5 papers in library cite

Start of the Anthropic RLHF journey. Not sure if worth the read but it's an option.

[18]Show Your Work: Scratchpads for Intermediate Computation With Language Models

Maxwell Nye, A. J. Andreassen, Guy Gur Ari, Henryk Michalewski, Jacob Austin, D. Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, D. Luan, Charles Sutton, Augustus Odena - 2021

5 papers in library cite

First reasoning?

[19]Reinforcement Learning With a Corrupted Reward Channel

Tom Everitt, V. Krakovna, L. Orseau, Shane Legg - 2017

4 papers in library cite

[20]On the Advance of Making Language Models Better Reasoners

Yiwei Li, Zongyu Lin, S. Zhang, Q. Fu, Berlin Chen, J. G. Lou, Weizhu Chen - 2022

3 papers in library cite

[21]Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning

Antonia Creswell, M. Shanahan, Irina Higgins - 2022

3 papers in library cite

[22]STaR: Bootstrapping Reasoning With Reasoning

E. Zelikman, Yuhuai Wu, J. Mu, N. Goodman - 2022

3 papers in library cite

[23]Without Specific Countermeasures, the Easiest Path to Transformative AI Likely Leads to AI Takeover

A. Cotra - 2022

3 papers in library cite

[24]Collaborative Storytelling With Large-Scale Neural Language Models

E. Nichols, Leo Gao, R. Gomez - 2020

2 papers in library cite

[25]Generate & Rank: A Multi-Task Framework for Math Word Problems

Junhong Shen, Y. Yin, Lei Li, L. Shang, Xu Jiang, Mingchuan Zhang, Qian Liu - 2021

2 papers in library cite

[26]Supervise Process, Not Outcomes

Andreas Stuhlmuller, J. Byun - 2022

2 papers in library cite

Cited by 4 papers in your library

Cites 18 papers in your library

Read on May 31, 2026

Fun read, but nothing really surprising in terms of methodology. Just proving PRMs are better than ORMs (again, not surprising).

Tags

RLHF, Vetto Study
