Papperoni

2026

Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort

Xinpeng Wang, Nitish Joshi, Barbara Plank, Rico Angell, He He

citations

Cite Score

0

AI summary

This paper introduces TRACE (Truncated Reasoning AUC Evaluation), a scalable unsupervised method to detect implicit reward hacking in reasoning models by measuring reasoning effort through CoT truncation, achieving over 65% F1 gain in math and 30% in coding over strong baselines, and enabling loophole discovery.

Main Contributions

Introduces TRACE (Truncated Reasoning AUC Evaluation) to detect implicit reward hacking by quantifying reasoning effort.
Proposes measuring effort by how early a model's truncated Chain-of-Thought (CoT) becomes sufficient to achieve high reward.
Demonstrates TRACE's effectiveness with over 65% F1 gain in math reasoning and 30% in coding over strong CoT monitors.
Shows TRACE's ability to discover unknown loopholes during training.
Offers a scalable unsupervised approach for oversight where current monitoring methods are ineffective.

Abstract

Reward hacking, where a reasoning model exploits loopholes in a reward function to achieve high rewards without solving the intended task, poses a significant threat. This behavior may be explicit, i.e., verbalized in the model's chain-of-thought (CoT), or implicit, where the CoT appears benign and thus bypasses CoT monitors. To detect implicit reward hacking, we propose TRACE (Truncated Reasoning AUC Evaluation). Our key observation is that hacking occurs when exploiting the loophole is easier than solving the actual task. This means that the model is using less "effort" than required to achieve high reward. TRACE quantifies effort by measuring how early a model's reasoning becomes sufficient to obtain the reward. We progressively truncate a model's CoT at various lengths, force the model to answer, and estimate the expected reward at each cutoff. A hacking model, which takes a shortcut, will achieve a high expected reward with only a small fraction of its CoT, yielding a large area under the reward-vs-length curve. TRACE achieves over 65% gains over our strongest 72B CoT monitor in math reasoning, and over 30% gains over a 32B monitor in coding. We further show that TRACE can discover unknown loopholes during training. Overall, TRACE offers a scalable unsupervised approach for oversight where current monitoring methods prove ineffective.
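The truncate-and-score idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `trace_score`, the callback `expected_reward_at`, and the choice of truncation fractions are all assumptions made for illustration. In a real setup, the callback would force the model to answer from the truncated CoT, sample several answers, and average their rewards; here it is replaced by toy functions.

```python
from typing import Callable, Sequence


def trace_score(
    cot_tokens: Sequence[str],
    expected_reward_at: Callable[[Sequence[str]], float],
    fractions: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
) -> float:
    """Area under the expected-reward vs. truncation-fraction curve.

    `expected_reward_at` is a hypothetical callback: given a CoT prefix,
    it returns the estimated reward when the model is forced to answer.
    """
    rewards = []
    for fraction in fractions:
        cutoff = round(fraction * len(cot_tokens))
        rewards.append(expected_reward_at(cot_tokens[:cutoff]))

    # Trapezoidal rule over the (fraction, reward) points.
    area = 0.0
    for i in range(1, len(fractions)):
        width = fractions[i] - fractions[i - 1]
        area += width * (rewards[i] + rewards[i - 1]) / 2
    return area


# Toy stand-ins for a model (assumptions, not real measurements).
tokens = ["t"] * 10

# "Hacking" behaviour: reward is already high after almost no reasoning.
hacking_auc = trace_score(tokens, lambda prefix: 1.0 if len(prefix) >= 1 else 0.0)

# "Genuine" behaviour: reward only arrives once the full CoT is available.
honest_auc = trace_score(tokens, lambda prefix: 1.0 if len(prefix) >= 10 else 0.0)

print(f"hacking AUC ~ {hacking_auc:.2f}, honest AUC ~ {honest_auc:.2f}")
```

In this toy setting the shortcut-taking curve yields a much larger area than the curve that needs the whole CoT, which is the signal the abstract describes. How the score is thresholded to flag hacking is not specified in the text above, so it is left out of the sketch.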

References [34]

[1]Training Language Models to Follow Instructions With Human Feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, C. Wainwright, Pamela Mishkin, Chiyuan Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, Ryan Lowe - 2022

11 papers in library cite

No new research here. The only true contribution is scaling RLHF to GPT-3.

[2]DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peng Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Yiwei Li, Yonghui Wu - 2024

3 papers in library cite

Fun read and GRPO is a very nice idea!

[3]Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting

M. Turpin, J. Michael, Ethan Perez, Samuel R. Bowman - 2023

2 papers in library cite

[4]Measuring Coding Challenge Competence With APPS

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, E. Guo, Collin Burns, S. Puranik, He He, D. X. Song, Jacob Steinhardt - 2021

4 papers in library cite

[5]Back to Basics: Revisiting Reinforce Style Optimization for Learning From Human Feedback in LLMs

A. Ahmadian, C. Cremer, M. Galle, M. Fadaee, J. Kreutzer, O. Pietquin, A. Ustun, S. Hooker - 2024

1 paper in library cites

[6]Defining and Characterizing Reward Hacking

J. Skalse, N. H. R. Howe, D. Krasheninnikov, David Krueger - 2022

2 papers in library cite

[7]Training Large Language Models to Reason in a Continuous Latent Space

S. Hao, S. Sukhbaatar, D. Su, Xiang Lisa Li, Z. Hu, Jason Weston, Yuandong Tian - 2024

2 papers in library cite

Reasoning in latent space

[8]Measuring Faithfulness in Chain-of-Thought Reasoning

Tamera Lanham, Anna Chen, A. Radhakrishnan, B. Steiner, C. Denison, Danny Hernandez, Dustin Li, Esin Durmus, E. Hubinger, Jackson Kernion, K. Lukosiute, K. Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Larson, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shusheng Yang, Tom Henighan, Timothy Maxwell, Timothy Telleen Lawton, Tristan Hume, Zac Hatfield Dodds, Jared Kaplan, J. Brauner, Samuel R. Bowman, Ethan Perez - 2023

2 papers in library cite

[9]Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

Bowen Baker, J. Huizinga, Leo Gao, Z. Dou, M. Y. Guan, A. Madry, Wojciech Zaremba, J. Pachocki, D. Farhi - 2025

2 papers in library cite

[10]Reasoning Models Don't Always Say What They Think

Yanru Chen, J. Benton, A. Radhakrishnan, Jonathan Uesato, C. Denison, John Schulman, A. Somani, P. Hase, M. Wagner, F. Roger, V. Mikulik, Samuel R. Bowman, Jan Leike, Jared Kaplan, Ethan Perez - 2025

2 papers in library cite

[11]Measuring Progress on Scalable Oversight for Large Language Models

S. Bowman, J. Hyun, Ethan Perez, E. Chen, C. Pettit, S. Heiner, K. Lukosiute, Amanda Askell, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Christopher Olah, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran Johnson, Jackson Kernion, Jamie Kerr, Jared Mueller, Jeffrey Ladish, J. D. Landau, Kamal Ndousse, Liane Lovitt, Nelson Elhage, Nicholas Schiefer, Nicholas Joseph, Noemi Mercado, Nova DasSarma, Robin Larson, Sam McCandlish, Sandipan Kundu, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Timothy Telleen-Lawton, Tom B. Brown, T. J. Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Benjamin Mann, Jared Kaplan - 2022

2 papers in library cite

[12]Let's Think Dot by Dot: Hidden Computation in Transformer Language Models

J. Pfau, W. Merrill, Samuel R. Bowman - 2024

1 paper in library cites

[13]Chain-of-Thought Reasoning in the Wild Is Not Always Faithful

I. Arcuschin, J. Janiak, R. Krzyzanowski, S. Rajamanoharan, Neel Nanda, A. Conmy - 2025

2 papers in library cite

[14]Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

C. E. Denison, M. S. Macdiarmid, F. Barez, D. K. Duvenaud, Shauna Kravec, S. Marks, Nicholas Schiefer, R. Soklaski, A. Tamkin, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Ethan Perez, E. Hubinger - 2024

1 paper in library cites

[15]Reasoning Models Know When They're Right: Probing Hidden States for Self-Verification

A. Zhang, Yanru Chen, J. Pan, Changsheng Zhao, A. Panda, Jeffrey Li, He He - 2025

1 paper in library cites

[16]Describing Differences Between Text Distributions With Natural Language

R. Zhong, C. Snell, Dan Klein, Jacob Steinhardt - 2022

1 paper in library cites

[17]Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models

A. Albalak, D. Phung, N. Lile, Rafael Rafailov, K. Gandhi, L. Castricato, A. Singh, C. Blagden, V. Xiang, D. Mahan, N. Haber - 2025

1 paper in library cites

[18]When Chain of Thought Is Necessary, Language Models Struggle to Evade Monitors

S. Emmons, E. Jenner, D. K. Elson, Rif A. Saurous, S. Rajamanoharan, H. Chen, I. Shafkat, R. Shah - 2025

2 papers in library cite

[19]Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens

W. L. Chen, L. Peng, T. Tan, Changsheng Zhao, B. J. Chen, Zongyu Lin, A. Go, Y. Meng - 2026

1 paper in library cites

[20]Qwen2.5 Technical Report

Q. A. Yang, B. Yang, B. Zhang, B. Hui, Bo Zheng, B. Yu, Chun-Liang Li, D. Liu, F. Huang, G. Dong, H. Wei, Haowei Lin, Jihan Yang, J. Tu, J. Zhang, Jihan Yang, Jihan Yang, Jingren Zhou, Junyang Lin, K. Dang, K. Lu, K. Bao, K. Yang, Longhui Yu, M. Li, M. Xue, Peizhao Zhang, Qihao Zhu, R. Men, R. Lin, Tao Li, T. Xia, Xiang Ren, Xiang Ren, Yu Fan, Yu Su, Y. C. Zhang, Y. Wan, Yibo Liu, Z. Cui, Zhengyou Zhang, Z. Qiu, S. Quan, Zhengtao Wang - 2024

3 papers in library cite

[21]On the Biology of a Large Language Model

J. Lindsey, W. Gurnee, E. Ameisen, Berlin Chen, A. Pearce, N. L. Turner, C. Citro, D. Abrahams, S. Carter, B. Hosmer, J. Marcus, M. Sklar, A. Templeton, T. Bricken, C. Mcdougall, H. Cunningham, Tom Henighan, A. Jermyn, Andy Jones, A. Persic, Z. Qi, T. B. Thompson, S. Zimmerman, K. Rivoire, Tom Conerly, Christopher Olah, J. Batson - 2025

2 papers in library cite

Anthropic. Not a paper, but seems extremely nice

Missing author list

[22]CoT May Be Highly Informative Despite "Unfaithfulness"

2025

1 paper in library cites

[23]Hodoscope: Unsupervised Behavior Discovery in AI Agents

Z. Zhong, S. Saxena, A. Raghunathan - 2026

1 paper in library cites

[24]Large Reasoning Models Are (Not Yet) Multilingual Latent Reasoners

Yibo Liu, R. Zhao, Hinrich Schutze, M. A. Hedderich - 2026

1 paper in library cites

[25]Preventing Language Models From Hiding Their Reasoning

F. Roger, R. Greenblatt - 2023

1 paper in library cites

Missing author list

[26]Recent Frontier Models Are Reward Hacking

2025

1 paper in library cites

[27]Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking

C. Laidlaw, S. Singhal, A. D. Dragan - 2024

1 paper in library cites

Missing author list

[28]Llama 3 Model Card

2024

1 paper in library cites

[29]On Memorization of Large Language Models in Logical Reasoning

C. Xie, Y. Huang, Chiyuan Zhang, D. Yu, X. Chen, Bill Yuchen Lin, Boxuan Li, B. Ghazi, Ramana Kumar - 2024

1 paper in library cites

[30]Repo State Loopholes During Agentic Evaluation

J. Kahn - 2025

1 paper in library cites

[31]ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

J. Feng, S. Huang, X. Qu, G. Zhang, Y. Qin, B. Zhong, C. Jiang, J. Chi, W. Zhong - 2025

1 paper in library cites

Missing author list

[32]Sycophancy in GPT-4o: What Happened and What We're Doing About It

2025

1 paper in library cites

[33]Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning

M. Turpin, A. Arditi, M. Li, J. Benton, J. Michael - 2025

1 paper in library cites

Missing author list

[34]The AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition (Limitations and Bloopers)

2025

1 paper in library cites

Cited by

0

papers in your library

Cites

26

papers in your library

Read

on April 16, 2026

Very nice idea, but it seems likely to be surpassed or to become unimportant quite soon, given that explicit reasoning might vanish or change completely.

Tags

ICLR2026

Paper Aliases

No aliases