2020

Fine-Tuning Language Models From Human Preferences

Geoffrey Irving

citations

Cite Score

55

AI summary

This paper presents a reward learning approach to fine-tune large language models using human preferences on text continuations. The models were evaluated on sentiment, descriptiveness, and summarization tasks using the BookCorpus, CNN/Daily Mail, and TL;DR datasets. Stylistic continuation reaches good results with only 5,000 human comparisons, while summarization models trained with 60,000 comparisons are rated highly by human labelers, largely by copying whole sentences from the input.

Main Contributions

Introduces a framework for fine-tuning language models using reinforcement learning with a reward model trained on human preferences.
Demonstrates the effectiveness of the approach on stylistic continuation tasks, achieving good results with only 5,000 human comparisons.
Applies the method to summarization tasks on the TL;DR and CNN/Daily Mail datasets, training models with 60,000 human comparisons.
Analyzes the behavior of the summarization models, finding that they tend to copy whole sentences from the input, skipping irrelevant preamble.
Compares online and offline data collection methods, finding that online data collection is important for summarization but not for simpler style tasks.
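The reward model described in the contributions above can be illustrated with a small sketch. It assumes each human comparison offers a handful of candidate continuations and records which one the labeler picked, and it scores that choice with a softmax cross-entropy over the candidates' reward values. The function names and the plain-list representation are hypothetical, chosen only for illustration; this page does not spell out the loss, so treat this as one plausible formulation rather than the authors' exact code.

```python
import math


def preference_loss(rewards, chosen):
    """Negative log-probability of the human's choice under a softmax over rewards.

    rewards: list of scalar reward-model scores, one per candidate continuation.
    chosen:  index of the candidate the human labeler preferred.
    """
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(rewards)
    log_z = m + math.log(sum(math.exp(r - m) for r in rewards))
    return log_z - rewards[chosen]


def batch_loss(comparisons):
    """Average preference loss over (rewards, chosen) pairs."""
    return sum(preference_loss(r, c) for r, c in comparisons) / len(comparisons)
```

With equal scores for every candidate the loss is the log of the number of candidates, and it shrinks as the reward model assigns a higher score to the continuation the human preferred.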

Abstract

Reward learning enables the application of reinforcement learning (RL) to tasks where reward is defined by human judgment, building a model of reward by asking humans questions. Most work on reward learning has used simulated environments, but complex information about values is often expressed in natural language, and we believe reward learning for language is a key to making RL practical and safe for real-world tasks. In this paper, we build on advances in generative pretraining of language models to apply reward learning to four natural language tasks: continuing text with positive sentiment or physically descriptive language, and summarization tasks on the TL;DR and CNN/Daily Mail datasets. For stylistic continuation we achieve good results with only 5,000 comparisons evaluated by humans. For summarization, models trained with 60,000 comparisons copy whole sentences from the input but skip irrelevant preamble; this leads to reasonable ROUGE scores and very good performance according to our human labelers, but may be exploiting the fact that labelers rely on simple heuristics.
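The copying behavior mentioned in the abstract can be checked with a simple measurement. The sketch below estimates what fraction of a summary's sentences appear verbatim in the input. The helper names and the naive punctuation-based sentence splitter are assumptions made for illustration, not the paper's own analysis.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def copied_fraction(article, summary):
    """Fraction of summary sentences that occur verbatim as article sentences."""
    source = set(split_sentences(article))
    sentences = split_sentences(summary)
    if not sentences:
        return 0.0
    return sum(s in source for s in sentences) / len(sentences)


article = "Hi all, long time lurker. My landlord kept my deposit. I sent two letters."
summary = "My landlord kept my deposit. I sent two letters."
print(copied_fraction(article, summary))
```

In this toy input the summary skips the preamble sentence and copies the other two, so every summary sentence is found in the article.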

References [47]

[1]Adam: A Method for Stochastic Optimization

D. P. Kingma, Jimmy Lei Ba - 2014

49 papers in library cite

Amazing paper! Very well explained and huge impact. I am amazed that they made something so simple even when it requires a lot of background mathematical knowledge

[2]Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin - 2017

47 papers in library cite

I mean... it introduced Transformers!

[3]Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov - 2017

10 papers in library cite

Very simple methodology and very well explained. I also liked that they did a good job on motivating the method.

[4]Language Models Are Unsupervised Multitask Learners

Alec Radford, Jeffrey Wu, Rewon Child, D. Luan, Dario Amodei, Ilya Sutskever - 2019

27 papers in library cite

Amazing! Tons of important contributions. I think they could have explained the models a bit better, and I think this is where OpenAI starts to become evil (and not open)

[5]Deep Contextualized Word Representations

M. E. Peters, M. Neumann, M. Iyyer, Matt Gardner, C. Clark, K. Lee, L. S. Zettlemoyer - 2018

27 papers in library cite

I didn't really like the approach. Seems a bit derivative TBH. BERT seems more elegant.

[6]Improving Language Understanding by Generative Pre-Training

Alec Radford, K. Narasimhan, T. Salimans, Ilya Sutskever - 2018

23 papers in library cite

Very simple and very nice! Easy to understand and revolutionary maybe?

[7]Neural Machine Translation of Rare Words with Subword Units

R. Sennrich, B. Haddow, Alexandra Birch - 2016

22 papers in library cite

Very good! Simple, explains quite a lot and good results. Forms the basis for a lot of stuff now!

[8]Google's Neural Machine Translation System: Bridging the Gap Between Human and Machine Translation

Yonghui Wu, M. Schuster, Zhifeng Chen, Quoc V. Le, M. Norouzi, W. Macherey, M. Krikun, Yuan Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, Xiaobing Liu, Lukasz Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, Wei Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, Oriol Vinyals, G. S. Corrado, M. Hughes, Jeffrey Dean - 2016

15 papers in library cite

It's a very good paper but TBH doesn't bring anything new other than joining a bunch of existing stuff. I think it ended up being foundational because it's Google and several people used it as a base for future research. Good contribution then :)

[9]Deep Reinforcement Learning From Human Preferences

Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei - 2017

11 papers in library cite

Very nice idea overall!

[10]Universal Language Model Fine-Tuning for Text Classification

J. Howard, Sebastian Ruder - 2018

14 papers in library cite

Amazing! Bridging the gap between pre-training/fine-tuning in CV vs. NLP, plus giving amazing results!

[11]Teaching Machines to Read and Comprehend

K. M. Hermann, T. Kocisky, Edward Grefenstette, L. Espeholt, W. Kay, M. Suleyman, Phil Blunsom - 2015

31 papers in library cite

Nice way of converting unsupervised data to train for Q&A - and nice visualizations as well :) But I think their main contribution is the dataset. Maybe with the dataset they "unlocked" summarization?

[12]Get to the Point: Summarization With Pointer-Generator Networks

A. See, P. J. Liu, Christopher D. Manning - 2017

8 papers in library cite

It's a bit of the same thing as the other ones. I am not sure if this was the first or not, but I am getting a bit bored of this.

[13]Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books

Yukun Zhu, R. Kiros, R. Zemel, Ruslan Salakhutdinov, R. Urtasun, Antonio Torralba, Sanja Fidler - 2015

18 papers in library cite

I think their approach was a bit convoluted and didn't really add a lot. Main contribution here is probably BookCorpus

[14]A Deep Reinforced Model for Abstractive Summarization

R. Paulus, Caiming Xiong, Richard Socher - 2017

7 papers in library cite

It's nice that they introduce intra-attention and RL, but at this point I think a lot of the work in attention is very derivative.

[15]Semi-Supervised Sequence Learning

A. M. Dai, Quoc V. Le - 2015

27 papers in library cite

Very good paper that was probably the first to introduce pre-training in NLP!

[16]Scalable Agent Alignment via Reward Modeling: A Research Direction

Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, Shane Legg - 2018

5 papers in library cite

Low 4. Good push and good direction, but nothing groundbreaking - other research was already around about reward modeling. Good for them that they pushed for it.

[17]Learning to Generate Reviews and Discovering Sentiment

Alec Radford, R. Jozefowicz, Ilya Sutskever - 2017

8 papers in library cite

Very impressive results. We can see that they were onto something (later became GPT2)

[18]A Survey of Reinforcement Learning Informed by Natural Language

Jelena Luketina, Nantas Nardelli, Gregory Farquhar, Jakob Foerster, Jacob Andreas, Edward Grefenstette, Shimon Whiteson, Tim Rocktaschel - 2019

3 papers in library cite

It's a good overview and I like that it gave me context on what's happening, but a bit boring to read.

[19]Supervising Strong Learners by Amplifying Weak Experts

Paul Christiano, Buck Shlegeris, Dario Amodei - 2018

7 papers in library cite

Nice idea, but doesn't have any concrete implementations or proof that it works. Sounds too aspirational.

[20]TL;DR: Mining Reddit to Learn Automatic Summarization

M. Volske, Martin Potthast, S. Syed, Benno Stein - 2017

4 papers in library cite

[21]AI Safety via Debate

Geoffrey Irving, Paul Christiano, Dario Amodei - 2018

8 papers in library cite

[22]Learning to Understand Goal Specifications by Modelling Reward

D. Bahdanau, F. Hill, Jan Leike, E. Hughes, P. Kohli, Edward Grefenstette - 2019

4 papers in library cite

[23]Reward Learning From Human Preferences and Demonstrations in Atari

B. Ibarz, Jan Leike, T. Pohlen, Geoffrey Irving, Shane Legg, Dario Amodei - 2018

5 papers in library cite

[24]Finding Generalizable Evidence by Learning to Convince Q&A Models

Ethan Perez, S. Karamcheti, Rob Fergus, Jason Weston, Douwe Kiela, Kyunghyun Cho - 2019

4 papers in library cite

[25]Better Rewards Yield Better Summaries: Learning to Summarise Without References

F. Bohm, Y. Gao, C. M. Meyer, O. Shapira, Ido Dagan, I. Gurevych - 2019

3 papers in library cite

[26]Learning From Dialogue After Deployment: Feed Yourself, Chatbot!

B. Hancock, Antoine Bordes, P. E. Mazare, Jason Weston - 2019

3 papers in library cite

[27]Learning to Extract Coherent Summary via Deep Reinforcement Learning

Yuxiang Wu, B. Hu - 2018

3 papers in library cite

[28]Reliability and Learnability of Human Bandit Feedback for Sequence-to-Sequence Reinforcement Learning

J. Kreutzer, J. Uyheng, S. Riezler - 2018

3 papers in library cite

[29]Sequence Tutor: Conservative Fine-Tuning of Sequence Generation Models With KL-Control

N. Jaques, S. Gu, D. Bahdanau, J. M. H. Lobato, R. E. Turner, D. Eck - 2017

3 papers in library cite

[30]Towards Coherent and Cohesive Long-Form Text Generation

W. S. Cho, Pengchuan Zhang, Y. Z. Zhang, Xiujun Li, M. Galley, Chris Brockett, Mengdi Wang, Jianfeng Gao - 2019

3 papers in library cite

[31]Towards Coherent and Engaging Spoken Dialog Response Generation Using Automatic Conversation Evaluators

S. Yi, R. Goel, C. Khatri, T. Chung, Behnam Hedayatnia, Anu Venkatesh, Raefer Gabriel, D. H. Tur - 2019

3 papers in library cite

[32]Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog

N. Jaques, A. Ghandeharioun, J. H. Shen, C. Ferguson, A. Lapedriza, N. Jones, S. Gu, R. Picard - 2019

3 papers in library cite

[33]Bottom-Up Abstractive Summarization

Sebastian Gehrmann, Y. Deng, Alexander M. Rush - 2018

2 papers in library cite

[34]Controllable Neural Story Generation via Reinforcement Learning

P. Tambwekar, M. Dhuliawala, A. Mehta, L. J. Martin, B. Harrison, M. O. Riedl - 2018

2 papers in library cite

[35]Neural Text Summarization: A Critical Evaluation

W. Kryscinski, Nitish Shirish Keskar, B. McCann, Caiming Xiong, Richard Socher - 2019

2 papers in library cite

[36]Reinforcement Learning for Bandit Neural Machine Translation With Simulated Human Feedback

K. Nguyen, H. Daumé III, J. Boyd-Graber - 2017

2 papers in library cite

[37]Reward Learning for Efficient Reinforcement Learning in Extractive Document Summarisation

Y. Gao, C. M. Meyer, M. Mesgar, I. Gurevych - 2019

2 papers in library cite

[38]Active Learning for Speech Recognition: The Power of Gradients

J. Huang, Rewon Child, V. Rao, Hairong Liu, S. Satheesh, A. Coates - 2016

1 paper in library cites

[39]Deep Batch Active Learning by Diverse, Uncertain Gradient Lower Bounds

J. T. Ash, Chicheng Zhang, A. Krishnamurthy, John Langford, Alekh Agarwal - 2019

1 paper in library cites

[40]Dialogue Learning With Human-in-the-Loop

Jiwei Li, A. H. Miller, S. Chopra, Marc'Aurelio Ranzato, Jason Weston - 2016

1 paper in library cites

[41]Discriminative Active Learning

D. Gissin, S. Shalev-Shwartz - 2019

1 paper in library cites

[42]Discriminative Batch Mode Active Learning

Y. Guo, Dale Schuurmans - 2008

1 paper in library cites

[43]Generating Abstractive Summaries With Finetuned Language Models

Sebastian Gehrmann, Z. Ziegler, A. Rush - 2019

1 paper in library cites

[44]Image-Based Recommendations on Styles and Substitutes

J. McAuley, C. Targett, Q. Shi, A. van den Hengel - 2015

1 paper in library cites

[45]OpenAI Baselines

S. Sidor, Yuhuai Wu, P. Zhokhov - 2017

1 paper in library cites

[46]Preference-Based Interactive Multi-Document Summarisation

Y. Gao, C. M. Meyer, I. Gurevych - 2019

1 paper in library cites

[47]Sample Efficient Text Summarization Using a Single Pre-Trained Transformer

U. Khandelwal, K. Clark, Dan Jurafsky, Lukasz Kaiser - 2019

1 paper in library cites

Cited by 7 papers in your library

Cites 22 papers in your library

Read on November 22, 2025

It's so simple how they do it, plus I absolutely LOVED the "challenges" section and how honest they were about it. This is true research!

Tags

RLHF, Vetto Study
