2026

Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort

Xinpeng Wang, Nitish Joshi, Barbara Plank, Rico Angell, He He

AI summary

This paper introduces TRACE (Truncated Reasoning AUC Evaluation), a scalable unsupervised method for detecting implicit reward hacking in reasoning models. TRACE measures reasoning effort by truncating the model's CoT, and it achieves F1 gains of over 65% in math and over 30% in coding relative to strong CoT-monitor baselines. It also enables loophole discovery.

Main Contributions

  • Introduces TRACE (Truncated Reasoning AUC Evaluation) to detect implicit reward hacking by quantifying reasoning effort.
  • Proposes measuring effort by how early a model's truncated Chain-of-Thought (CoT) becomes sufficient to achieve high reward.
  • Demonstrates TRACE's effectiveness with F1 gains of over 65% in math reasoning and over 30% in coding relative to strong CoT monitors.
  • Shows TRACE's ability to discover unknown loopholes during training.
  • Offers a scalable unsupervised approach for oversight where current monitoring methods are ineffective.

Abstract

Reward hacking, where a reasoning model exploits loopholes in a reward function to achieve high rewards without solving the intended task, poses a significant threat. This behavior may be explicit, i.e., verbalized in the model's chain-of-thought (CoT), or implicit, where the CoT appears benign and thus bypasses CoT monitors. To detect implicit reward hacking, we propose TRACE (Truncated Reasoning AUC Evaluation). Our key observation is that hacking occurs when exploiting the loophole is easier than solving the actual task, which means the model uses less "effort" than the task requires to achieve high reward. TRACE quantifies effort by measuring how early a model's reasoning becomes sufficient to obtain the reward. We progressively truncate a model's CoT at various lengths, force the model to answer, and estimate the expected reward at each cutoff. A hacking model, which takes a shortcut, will achieve a high expected reward with only a small fraction of its CoT, yielding a large area under the reward-vs-length curve. TRACE achieves gains of over 65% over our strongest 72B CoT monitor in math reasoning, and of over 30% over a 32B monitor in coding. We further show that TRACE can discover unknown loopholes during training. Overall, TRACE offers a scalable unsupervised approach for oversight where current monitoring methods prove ineffective.
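
The truncate-and-score procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `trace_auc`, the evenly spaced cutoff fractions, the trapezoidal integration, and the toy reward functions are all assumptions made for illustration. In a real setting, `expected_reward` would force the model to answer after the truncated CoT and average the reward over sampled answers.

```python
from typing import Callable, Sequence


def trace_auc(
    cot_tokens: Sequence[str],
    expected_reward: Callable[[Sequence[str]], float],
    fractions: Sequence[float] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> float:
    """Area under the reward-vs-CoT-length curve (hypothetical helper).

    expected_reward(prefix) is assumed to return the expected reward in
    [0, 1] when the model is forced to answer after seeing only `prefix`.
    """
    rewards = []
    for frac in fractions:
        cutoff = int(round(frac * len(cot_tokens)))  # truncate the CoT
        rewards.append(expected_reward(cot_tokens[:cutoff]))

    # Trapezoidal rule over the normalized length axis [0, 1].
    auc = 0.0
    for i in range(1, len(fractions)):
        width = fractions[i] - fractions[i - 1]
        auc += width * (rewards[i] + rewards[i - 1]) / 2
    return auc


# Toy stand-ins for a model (assumptions, not real model calls).
tokens = ["step"] * 8


def hacking_reward(prefix: Sequence[str]) -> float:
    # A shortcut-taking model gets full reward regardless of CoT length.
    return 1.0


def honest_reward(prefix: Sequence[str]) -> float:
    # A genuinely reasoning model only succeeds with its full CoT.
    return 1.0 if len(prefix) == len(tokens) else 0.0


hack_auc = trace_auc(tokens, hacking_reward)
honest_auc = trace_auc(tokens, honest_reward)
print(hack_auc, honest_auc)
```

In this toy setup the shortcut-taking policy yields a much larger area than the policy that needs its full reasoning, which is the signal TRACE relies on. How the score is turned into a hacking decision (for example, a threshold or a comparison against a reference model) is not specified in the abstract and is left out here.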

Tags

ICLR2026
