2024

Alignment Faking in Large Language Models

R. Greenblatt, C. Denison, B. Wright, F. Roger, M. MacDiarmid, S. Marks, J. Treutlein, T. Belonax, Jixuan Chen, David Duvenaud, A. Khan, J. Michael, S. Mindermann, Ethan Perez, L. Petrini, Jonathan Uesato, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, E. Hubinger
