Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
C. E. Denison, M. S. MacDiarmid, F. Barez, D. K. Duvenaud, Shauna Kravec, S. Marks, Nicholas Schiefer, R. Soklaski, A. Tamkin, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Ethan Perez, E. Hubinger