2022
Constitutional AI: Harmlessness From AI Feedback
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, C. C. Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosiute, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Benjamin Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom B. Brown, Jared Kaplan
Cite Score
69
AI summary
This paper introduces Constitutional AI (CAI), a method for training harmless AI assistants through self-improvement using AI feedback based on a list of principles, enabling more precise control over AI behavior with fewer human labels and achieving non-evasive responses to harmful queries.
Abstract
As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
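The data-generation side of the two phases described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, prompt wording, and the `toy_*` stand-ins are all hypothetical, a "model" is assumed to be any text-to-text callable, and the finetuning, preference-model training, and RL steps are omitted entirely.

```python
from typing import Callable, List, Tuple

# Assumption: a "model" is any callable mapping a text prompt to a completion.
Model = Callable[[str], str]


def critique_and_revise(model: Model, prompt: str, principle: str, rounds: int = 1) -> str:
    """Supervised phase: sample a response, then self-critique and revise it."""
    response = model(prompt)
    for _ in range(rounds):
        critique = model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response according to this principle: {principle}"
        )
        response = model(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return response


def build_sl_dataset(model: Model, prompts: List[str], principle: str) -> List[Tuple[str, str]]:
    """Pairs of (prompt, revised response) that a finetuning step would consume."""
    return [(p, critique_and_revise(model, p, principle)) for p in prompts]


def build_ai_preference_dataset(
    sample: Callable[[str, int], str],
    judge: Callable[[str, str, str, str], int],
    prompts: List[str],
    principle: str,
) -> List[Tuple[str, str, str]]:
    """RL phase: draw two samples per prompt and let a model pick the better one.

    Returns (prompt, chosen, rejected) triples for preference-model training.
    """
    data = []
    for p in prompts:
        a, b = sample(p, 0), sample(p, 1)
        choice = judge(p, a, b, principle)  # 0 means a is preferred, 1 means b
        chosen, rejected = (a, b) if choice == 0 else (b, a)
        data.append((p, chosen, rejected))
    return data


# --- Hypothetical toy stand-ins so the sketch runs without a real model ---

def toy_model(text: str) -> str:
    if "Rewrite the response" in text:
        return "I can't help with that, because it could cause harm."
    if "Critique the response" in text:
        return "The response assists with a harmful request."
    return "Sure, here is how."


def toy_sample(prompt: str, seed: int) -> str:
    options = ["Sure, here is how.", "I won't help with that, because it could cause harm."]
    return options[seed % 2]


def toy_judge(prompt: str, a: str, b: str, principle: str) -> int:
    # Toy rule: prefer the response that explains its objection.
    return 0 if ("because" in a and "because" not in b) else 1


if __name__ == "__main__":
    principle = "Choose the response that is least harmful."
    print(build_sl_dataset(toy_model, ["How do I pick a lock?"], principle))
    print(build_ai_preference_dataset(toy_sample, toy_judge, ["How do I pick a lock?"], principle))
```

In a real pipeline the toy callables would be replaced by calls to an actual language model, and the two returned datasets would feed the finetuning and preference-model training steps that this sketch leaves out.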