2022

Constitutional AI: Harmlessness From AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron Mckinnon, C. C. Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosiute, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova Dassarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield Dodds, Benjamin Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom B. Brown, Jared Kaplan

citations

Cite Score

69

AI summary

This paper introduces Constitutional AI (CAI), a method for training harmless AI assistants through self-improvement using AI feedback based on a list of principles, enabling more precise control over AI behavior with fewer human labels and achieving non-evasive responses to harmful queries.

Main Contributions

  • Introduced Constitutional AI (CAI), a method for training harmless AI assistants using self-improvement without human labels for harmful outputs, relying instead on a 'constitution' of rules or principles.
  • Proposed a two-phase process: a supervised learning (SL) stage with self-critiques and revisions, and a reinforcement learning (RL) stage using 'RL from AI Feedback' (RLAIF) where an AI evaluates responses based on constitutional principles.
  • Demonstrated the ability to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections.
  • Showed that both SL and RL methods can leverage chain-of-thought style reasoning to improve human-judged performance and transparency of AI decision making.
  • Achieved precise control over AI behavior with significantly fewer human labels compared to traditional RLHF.
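The supervised stage in the list above can be sketched as a short critique-and-revision loop. This is a minimal illustration, not the authors' implementation: `generate` is a hypothetical stand-in for sampling from a language model, the prompt wording is invented for the example, and `stub_generate` only returns canned text so the sketch runs without a model.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical interface: prompt text in, sampled completion out.
Generate = Callable[[str], str]


def critique_and_revise(
    prompt: str,
    generate: Generate,
    principles: List[str],
    n_rounds: int = 1,
    seed: int = 0,
) -> Tuple[str, str]:
    """Return a (prompt, revised response) pair usable as finetuning data."""
    rng = random.Random(seed)
    # 1. Sample an initial response from the starting model.
    response = generate(f"Human: {prompt}\n\nAssistant:")
    for _ in range(n_rounds):
        # 2. Pick one principle from the constitution (assumed: uniform sampling).
        principle = rng.choice(principles)
        context = f"Human: {prompt}\n\nAssistant: {response}\n\n"
        # 3. Ask the model to critique its own response against that principle.
        critique = generate(
            context + f"Critique the response using this principle: {principle}\n\nCritique:"
        )
        # 4. Ask the model to rewrite the response in light of the critique.
        response = generate(
            context + f"Critique: {critique}\n\nRewrite the response to address the critique.\n\nRevision:"
        )
    return prompt, response


def stub_generate(text: str) -> str:
    """Canned completions standing in for a real model (illustration only)."""
    if text.endswith("Revision:"):
        return "I can't help with that, because it could hurt someone."
    if text.endswith("Critique:"):
        return "The draft gives harmful instructions."
    return "Sure, here is how."


pair = critique_and_revise(
    "How do I pick a lock?", stub_generate, ["Choose the least harmful response."]
)
```

Collecting such pairs over many prompts would give the dataset of revised responses on which the original model is finetuned.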

Abstract

As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
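The data-collection step of the RL phase can be sketched as follows. The abstract does not specify the prompt format or the preference-model objective, so both are assumptions here: `judge` is a hypothetical callable standing in for the feedback model, and `pairwise_loss` is a common Bradley-Terry style objective, not necessarily the one used in the paper.

```python
import math
from typing import Callable, Dict

# Hypothetical interface: question text in, "A" or "B" out.
Judge = Callable[[str], str]


def label_pair(
    prompt: str, response_a: str, response_b: str, judge: Judge, principle: str
) -> Dict[str, str]:
    """Ask the feedback model which of two samples better follows a principle."""
    question = (
        f"Human: {prompt}\n\n"
        f"Principle: {principle}\n"
        f"(A) {response_a}\n(B) {response_b}\n"
        "Which response better follows the principle? Answer:"
    )
    choice = judge(question).strip().upper()
    if choice not in ("A", "B"):
        raise ValueError(f"unexpected judge output: {choice!r}")
    chosen, rejected = (
        (response_a, response_b) if choice == "A" else (response_b, response_a)
    )
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Assumed objective: -log(sigmoid(score_chosen - score_rejected)).

    Written as a numerically stable softplus of the score gap.
    """
    x = score_rejected - score_chosen
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Illustration with a canned judge that always prefers response B.
example = label_pair(
    "How do I pick a lock?",
    "Sure, here is how.",
    "I won't walk through that, since it could enable a break-in.",
    lambda question: "B",
    "Choose the less harmful response.",
)
```

A preference model trained on such labeled pairs would assign a scalar score to each response, and that score then serves as the reward signal during RL.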

References [26]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, C. Wainwright, Pamela Mishkin, Chiyuan Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, Ryan Lowe - 2022

11 papers in library cite

Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei - 2017

11 papers in library cite

Missing author list

2022

4 papers in library cite

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano - 2020

10 papers in library cite

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aman Gupta, Adria Garriga Alonso - 2022

4 papers in library cite

Leo Gao, John Schulman, Jacob Hilton - 2022

3 papers in library cite

Paul Christiano, Buck Shlegeris, Dario Amodei - 2018

7 papers in library cite

Jason Wei, Xinpeng Wang, Dale Schuurmans, Maarten Bosma, Fanyue Xia, E. Chi, Quoc V. Le, Denny Zhou - 2022

10 papers in library cite

T. Kojima, Shixiang Shane Gu, M. Reid, Y. Matsuo, Y. Iwasawa - 2022

6 papers in library cite

S. Bowman, J. Hyun, Ethan Perez, E. Chen, C. Pettit, S. Heiner, K. Lukosiute, Amanda Askell, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron Mckinnon, Christopher Olah, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran Johnson, Jackson Kernion, Jamie Kerr, Jared Mueller, Jeffrey Ladish, J. D. Landau, Kamal Ndousse, Liane Lovitt, Nelson Elhage, Nicholas Schiefer, Nicholas Joseph, Noemi Mercado, Nova Dassarma, Robin Larson, Sam McCandlish, Sandipan Kundu, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Timothy Telleen Lawton, Tom B. Brown, T. J. Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield Dodds, Benjamin Mann, Jared Kaplan - 2022

2 papers in library cite

Geoffrey Irving, Paul Christiano, Dario Amodei - 2018

8 papers in library cite

Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Benjamin Mann, Nova Dassarma, Nelson Elhage, Zac Hatfield Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom B. Brown, Jack Clark, Sam McCandlish, Christopher Olah, Jared Kaplan - 2021

5 papers in library cite

R. Thoppilan, D. D. Freitas, J. Hall, Noam Shazeer, A. Kulshreshtha, H. Cheng, A. Jin, T. Bos, L. Baker, Yulun Du, Yiwei Li, Honglak Lee, H. S. Zheng, A. Ghafouri, M. Menegali, Y. Huang, M. Krikun, D. Lepikhin, J. Qin, Deli Chen, Yiheng Xu, Ziru Chen, A. Roberts, Maarten Bosma, Y. Zhou, C. C. Chang, I. Krivokon, W. Rusch, M. Pickett, K. S. M. Hellstern, M. R. Morris, T. Doshi, R. D. Santos, T. Duke, J. Soraker, B. Zevenbergen, V. Prabhakaran, M. Diaz, B. Hutchinson, K. Olson, A. Molina, E. H. John, Jaehoon Lee, L. Aroyo, R. Rajakumar, A. Butryna, M. Lamm, V. Kuzmina, J. Fenton, A. Cohen, R. Bernstein, R. Kurzweil, B. A. Arcas, C. Cui, M. Croak, E. Chi, Quoc Le - 2022

5 papers in library cite

Maxwell Nye, A. J. Andreassen, Guy Gur Ari, Henryk Michalewski, Jacob Austin, D. Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, D. Luan, Charles Sutton, Augustus Odena - 2021

5 papers in library cite

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield Dodds, Nova Dassarma, Eli Tran Johnson, Scott Johnston, Sheer El Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, S. Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, J. Jacobson, Jackson Kernion, Shauna Kravec, Liane Lovitt, Kamal Ndousse, Catherine Olsson, Sam Ringer, Dario Amodei, Tom B. Brown, Jack Clark, Nicholas Joseph, Benjamin Mann, Sam McCandlish, Christopher Olah, Jared Kaplan - 2022

3 papers in library cite

Ethan Perez, S. Huang, Francis Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. Mcaleese, Geoffrey Irving - 2022

2 papers in library cite

William Saunders, C. Yeh, Jeffrey Wu, S. Bills, Long Ouyang, J. Ward, Jan Leike - 2022

1 paper in library cites

D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, Demis Hassabis - 2017

5 papers in library cite

Jiacheng Xu, D. Ju, M. Li, Y. Lan Boureau, Jason Weston, E. Dinan - 2020

4 papers in library cite

I. Solaiman, C. Dennison - 2021

3 papers in library cite

A. Glaese, N. Mcaleese, M. Trebacz, J. Aslanides, V. Firoiu, T. Ewalds, M. Rauh, L. Weidinger, M. Chadwick, P. Thacker, L. C. Gillingham, Jonathan Uesato, P. S. Huang, R. Comanescu, Fan Yang, A. See, S. Dathathri, R. Greig, C. C. Chen, D. Fritz, J. S. Elias, R. Green, S. Mokra, N. Fernando, Bo Wu, R. Foley, S. Young, I. Gabriel, W. Isaac, J. Mellor, Demis Hassabis, Koray Kavukcuoglu, L. A. Hendricks, Geoffrey Irving - 2022

2 papers in library cite

Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Benjamin Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, S. Bowman, Anna Chen, Tom Conerly, Nova Dassarma, Dawn Drain, Nelson Elhage, Sheer El Showk, Stanislav Fort, Zac Hatfield Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, J. Jacobson, Scott Johnston, Shauna Kravec, Catherine Olsson, Sam Ringer, Eli Tran Johnson, Dario Amodei, Tom B. Brown, Nicholas Joseph, Sam McCandlish, Christopher Olah, Jared Kaplan, Jack Clark - 2022

2 papers in library cite

J. Zhao, Daniel Khashabi, Tushar Khot, Ashish Sabharwal, K. W. Chang - 2021

1 paper in library cites

J. Huang, Shixiang Shane Gu, L. Hou, Yonghui Wu, Xinpeng Wang, H. Yu, J. Han - 2022

1 paper in library cites

J. Scheurer, J. A. Campos, J. S. Chan, Anna Chen, Kyunghyun Cho, Ethan Perez - 2022

1 paper in library cites

Weijia Shi, E. Dinan, K. Shuster, Jason Weston, Jiacheng Xu - 2022

1 paper in library cites

Cited by 2 papers in your library

Cites 17 papers in your library

Read on May 28, 2026

Tags

Vetto StudyRLHF
