2020

Fine-Tuning Language Models From Human Preferences

Geoffrey Irving

citations

Cite Score

55

AI summary

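This paper presents a reward learning approach to fine-tuning large language models, using human preferences over text continuations as the training signal. The models were evaluated on four tasks: continuation with positive sentiment, continuation with physically descriptive language, and summarization on TL;DR and CNN/Daily Mail, drawing on the BookCorpus, TL;DR, and CNN/Daily Mail datasets. Stylistic continuation reached good results with only 5,000 human comparisons. Summarization models trained with 60,000 comparisons obtained reasonable ROUGE scores and were rated very well by human labelers, though largely by copying whole sentences from the input.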

Main Contributions

  • Introduces a framework for fine-tuning language models using reinforcement learning with a reward model trained on human preferences.
  • Demonstrates the effectiveness of the approach on stylistic continuation tasks, achieving good results with only 5,000 human comparisons.
  • Applies the method to summarization tasks on the TL;DR and CNN/Daily Mail datasets, training models with 60,000 human comparisons.
  • Analyzes the behavior of the summarization models, finding that they tend to copy whole sentences from the input, skipping irrelevant preamble.
  • Compares online and offline data collection methods, finding that online data collection is important for summarization but not for simpler style tasks.
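A minimal sketch of how a reward model can be fit to human comparisons, assuming each comparison shows a labeler several candidate continuations and records which one was chosen. The loss is a softmax cross-entropy over the per-candidate reward scores. The function name, the four-candidate setup, and the scores below are illustrative assumptions, not details taken from this page.

```python
import math

def preference_loss(rewards, chosen):
    """Negative log-probability that the labeler's chosen candidate wins,
    given one scalar reward-model score per candidate (softmax cross-entropy)."""
    m = max(rewards)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(r - m) for r in rewards))
    return log_z - rewards[chosen]

# Hypothetical reward-model scores for four candidate continuations.
scores = [0.2, 1.5, -0.3, 0.1]

# The labeler picked the candidate the model already scores highest.
loss_agree = preference_loss(scores, chosen=1)

# The labeler picked the candidate the model scores lowest.
loss_disagree = preference_loss(scores, chosen=2)
```

The loss is small when the reward model already ranks the human's choice first and large when it disagrees, so minimizing it over many comparisons pushes the scores toward the labelers' preferences. The fine-tuned policy is then trained with RL against those scores.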

Abstract

Reward learning enables the application of reinforcement learning (RL) to tasks where reward is defined by human judgment, building a model of reward by asking humans questions. Most work on reward learning has used simulated environments, but complex information about values is often expressed in natural language, and we believe reward learning for language is a key to making RL practical and safe for real-world tasks. In this paper, we build on advances in generative pretraining of language models to apply reward learning to four natural language tasks: continuing text with positive sentiment or physically descriptive language, and summarization tasks on the TL;DR and CNN/Daily Mail datasets. For stylistic continuation we achieve good results with only 5,000 comparisons evaluated by humans. For summarization, models trained with 60,000 comparisons copy whole sentences from the input but skip irrelevant preamble; this leads to reasonable ROUGE scores and very good performance according to our human labelers, but may be exploiting the fact that labelers rely on simple heuristics.
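To illustrate why copying a relevant sentence while skipping preamble can yield reasonable ROUGE scores, here is a simplified ROUGE-1 F1 sketch. It assumes whitespace tokenization, lowercasing, and no stemming, so it is not the official ROUGE implementation, and the example sentences are hypothetical.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap between two strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical reference summary and two candidate "summaries" copied from an article.
reference = "The council approved the budget ."
copied = "The council approved the new budget on Monday ."
preamble = "Welcome back to the show ."

score_copied = rouge1_f1(copied, reference)
score_preamble = rouge1_f1(preamble, reference)
```

In this toy case the copied content sentence overlaps heavily with the reference while the preamble does not, which is consistent with the abstract's observation that sentence copying can score reasonably on ROUGE without showing genuine abstraction.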

Cited by

7

papers in your library

Cites

22

papers in your library

Read

on November 22, 2025

Tags

RLHF, Vetto Study
