2023

Direct Preference Optimization: Your Language Model Is Secretly a Reward Model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn

Cite Score

83

AI summary

This paper introduces Direct Preference Optimization (DPO), an RL-free algorithm that aligns language models with human preferences by solving the standard RLHF problem with a simple classification loss. DPO performs comparably to or better than existing methods such as PPO-based RLHF on sentiment control, summarization, and single-turn dialogue.

Main Contributions

  • Introduces Direct Preference Optimization (DPO), a novel RL-free algorithm for training language models from preferences.
  • Proposes a new parameterization of the reward model in RLHF that enables extraction of the optimal policy in closed form.
  • Simplifies the RLHF problem to a single classification loss, eliminating the need for sampling during fine-tuning or extensive hyperparameter tuning.
  • Demonstrates that DPO is stable, performant, and computationally lightweight compared with existing RLHF methods.
  • Exceeds PPO-based RLHF at controlling the sentiment of generations, and matches or improves response quality in summarization and single-turn dialogue.
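
The reparameterization and the resulting classification loss named in the contributions can be written out as a short sketch. The notation here is an assumption of this note rather than something given on this page: \pi_\theta is the policy being trained, \pi_{ref} the frozen reference model, \beta the strength of the KL penalty, Z(x) a partition function that depends only on the prompt, and (x, y_w, y_l) a prompt with its preferred and dispreferred responses under a Bradley-Terry preference model.

```latex
% Reward expressed through the policy that is optimal for it
% under the KL-constrained RLHF objective:
r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% Substituting into the Bradley-Terry model cancels Z(x), because only the
% difference r(x, y_w) - r(x, y_l) enters. What remains is a binary
% classification loss over preference pairs:
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

Because Z(x) drops out, the loss needs only log-probabilities of responses already in the preference dataset, which is why no sampling from the LM is required during fine-tuning.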

Abstract

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds PPO-based RLHF in ability to control sentiment of generations, and matches or improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.
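
To make the "simple classification loss" concrete, here is a minimal per-example sketch in plain Python. It assumes the summed token log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses have already been computed under both the policy and the frozen reference model. The function and argument names are hypothetical, and beta=0.1 is an illustrative default rather than a recommended setting.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one preference pair (hypothetical helper, not the authors' code).

    Each argument is log pi(y | x) summed over the response tokens.
    """
    # Implicit rewards, up to a prompt-only constant: beta * log(pi / pi_ref).
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Binary cross-entropy on the reward margin between the two responses.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)
```

When the policy equals the reference model the margin is zero and the loss is log 2. Raising the chosen response's log-probability relative to the rejected one lowers the loss. In practice the same expression would be averaged over a batch and differentiated with an autodiff framework.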
