2024

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

C. E. Denison, M. S. MacDiarmid, F. Barez, D. K. Duvenaud, Shauna Kravec, S. Marks, Nicholas Schiefer, R. Soklaski, A. Tamkin, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Ethan Perez, E. Hubinger

Cite Score: 7

