2024
Alignment Faking in Large Language Models
R. Greenblatt, C. Denison, B. Wright, F. Roger, M. MacDiarmid, S. Marks, J. Treutlein, T. Belonax, Jixuan Chen, David Duvenaud, A. Khan, J. Michael, S. Mindermann, Ethan Perez, L. Petrini, Jonathan Uesato, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, E. Hubinger