2026
AI summary
This paper introduces TRACE (Truncated Reasoning AUC Evaluation), a scalable unsupervised method for detecting implicit reward hacking in reasoning models. It measures reasoning effort by truncating the CoT, achieves over 65% F1 gain in math and over 30% in coding relative to strong CoT-monitor baselines, and enables loophole discovery.
Abstract
Reward hacking, where a reasoning model exploits loopholes in a reward function to achieve high rewards without solving the intended task, poses a significant threat. This behavior may be explicit, i.e., verbalized in the model's chain-of-thought (CoT), or implicit, where the CoT appears benign and thus bypasses CoT monitors. To detect implicit reward hacking, we propose TRACE (Truncated Reasoning AUC Evaluation). Our key observation is that hacking occurs when exploiting the loophole is easier than solving the actual task. This means that the model is using less "effort" than required to achieve high reward. TRACE quantifies effort by measuring how early a model's reasoning becomes sufficient to obtain the reward. We progressively truncate a model's CoT at various lengths, force the model to answer, and estimate the expected reward at each cutoff. A hacking model, which takes a shortcut, will achieve a high expected reward with only a small fraction of its CoT, yielding a large area under the reward-vs-length curve. TRACE achieves over 65% gains over our strongest 72B CoT monitor in math reasoning, and over 30% gains over a 32B monitor in coding. We further show that TRACE can discover unknown loopholes during training. Overall, TRACE offers a scalable unsupervised approach for oversight where current monitoring methods prove ineffective.
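The truncate, force-answer, and integrate procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `trace_auc`, the callable `sample_reward` (a stand-in for "force the model to answer from this CoT prefix and score the answer"), the evenly spaced truncation grid, the number of samples per cutoff, and the trapezoid rule are all hypothetical choices that the abstract does not specify.

```python
from typing import Callable, Sequence


def trace_auc(
    cot_tokens: Sequence[str],
    sample_reward: Callable[[Sequence[str]], float],
    fractions: Sequence[float] = (0.0, 0.25, 0.5, 0.75, 1.0),
    n_samples: int = 4,
) -> float:
    """Area under the expected-reward vs. CoT-fraction curve.

    `sample_reward` is a hypothetical callable: given a truncated CoT
    prefix, it forces an answer and returns the reward for that answer.
    """
    if len(fractions) < 2 or list(fractions) != sorted(fractions):
        raise ValueError("fractions must be sorted and contain at least two points")

    expected = []
    for frac in fractions:
        # Keep only the first `frac` of the chain-of-thought.
        cutoff = int(round(frac * len(cot_tokens)))
        prefix = cot_tokens[:cutoff]
        # Monte Carlo estimate of the expected reward at this cutoff.
        rewards = [sample_reward(prefix) for _ in range(n_samples)]
        expected.append(sum(rewards) / n_samples)

    # Trapezoid rule over the fraction axis (an assumed integration scheme).
    auc = 0.0
    for i in range(1, len(fractions)):
        width = fractions[i] - fractions[i - 1]
        auc += width * (expected[i] + expected[i - 1]) / 2
    return auc
```

With rewards in [0, 1] and fractions spanning 0 to 1, the score also lies in [0, 1]. A toy "shortcut" policy that is rewarded at every cutoff scores near 1, while a policy that is rewarded only once its full reasoning is available scores near 0. How a score is then turned into a hacking decision (for example, by comparison against a reference population) is not stated in the abstract and is left open here.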