2020

Realtoxicityprompts: Evaluating Neural Toxic Degeneration in Language Models

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, Noah A. Smith

citations

Cite Score

55

AI summary

This paper introduces REALTOXICITYPROMPTS, a 100K-prompt dataset, to evaluate toxic degeneration in pretrained neural language models, revealing that LMs can generate toxic content even from non-toxic prompts, and that while some mitigation methods help, none are failsafe, highlighting issues with pretraining data.

Main Contributions

  • Introduction of REALTOXICITYPROMPTS, a dataset of 100K naturally occurring, sentence-level prompts with toxicity scores.
  • Empirical finding that pretrained LMs (GPT-1, GPT-2, GPT-3, CTRL) can degenerate into toxic text even from seemingly innocuous prompts.
  • Evaluation of controllable generation methods, showing that data- or compute-intensive methods are more effective but no current method is failsafe.
  • Analysis of pretraining corpora (OpenAI WebText and OpenWebText Corpus), identifying significant amounts of offensive and toxic content.
  • Recommendations for better data selection processes and public release of relevant information during data collection for pretraining corpora.

Abstract

Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment. We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration. We create and release REALTOXICITYPROMPTS, a dataset of 100K naturally occurring, sentence-level prompts derived from a large corpus of English web text, paired with toxicity scores from a widely-used toxicity classifier. Using REALTOXICITYPROMPTS, we find that pretrained LMs can degenerate into toxic text even from seemingly innocuous prompts. We empirically assess several controllable generation methods, and find that while data- or compute-intensive methods (e.g., adaptive pretraining on non-toxic data) are more effective at steering away from toxicity than simpler solutions (e.g., banning "bad" words), no current method is failsafe against neural toxic degeneration. To pinpoint the potential cause of such persistent toxic degeneration, we analyze two web text corpora used to pretrain several LMs (including GPT-2; Radford et al., 2019), and find a significant amount of offensive, factually unreliable, and otherwise toxic content. Our work provides a test bed for evaluating toxic generations by LMs and stresses the need for better data selection processes for pretraining.

Citation Graph

Loading graph...

References [83]

Sort:
Filter:

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin - 2017

47 papers in library cite

Jacob Devlin, M. W. Chang, K. Lee, Kristina Toutanova - 2018

39 papers in library cite

Yann Lecun, Leon Bottou, Yoshua Bengio, Patrick Haffner - 1998

62 papers in library cite

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei - 2020

21 papers in library cite

Yibo Liu, M. Ott, N. Goyal, J. Du, M. Joshi, Deli Chen, Omer Levy, Martha Lewis, Luke Zettlemoyer, Veselin Stoyanov - 2019

17 papers in library cite

Alec Radford, Jeffrey Wu, Rewon Child, D. Luan, Dario Amodei, Ilya Sutskever - 2019

27 papers in library cite

Alec Radford, K. Narasimhan, T. Salimans, Ilya Sutskever - 2018

23 papers in library cite

Tomas Mikolov - 2017

7 papers in library cite

J. Kirkpatrick, Razvan Pascanu, N. C. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. G. Barwinska, Demis Hassabis, C. Clopath, D. Kumaran, Raia Hadsell - 2017

5 papers in library cite

R. Sennrich, B. Haddow, Alexandra Birch - 2016

22 papers in library cite

Thomas Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, Sam Shleifer, P. V. Platen, C. Ma, Yacine Jernite, J. Plu, Chenfeng Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, Alexander M. Rush - 2019

7 papers in library cite

Yuxuan Zhu, R. Kiros, R. Zemel, Ruslan Salakhutdinov, R. Urtasun, Antonio Torralba, Sanja Fidler - 2015

18 papers in library cite

Geoffrey Irving - 2020

7 papers in library cite

A. Paszke, S. Gross, Francisco Massa, Adam Lerer, J. Bradbury, G. Chanan, T. Killeen, Zongyu Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. Devito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, Jinze Bai, S. Chintala - 2019

2 papers in library cite

Colin Raffel, Noam Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, Wentao Li, P. J. Liu - 2019

17 papers in library cite

Ari Holtzman, J. Buys, L. Du, M. Forbes, Yejin Choi - 2019

5 papers in library cite

Suchin Gururangan, A. Marasovic, Swabha Swayamdipta, K. Lo, I. Beltagy, D. Downey, Noah A. Smith - 2020

2 papers in library cite

M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchinson, E. Spitzer, I. D. Raji, T. Gebru - 2018

5 papers in library cite

Nitish Shirish Keskar, B. Mccann, L. R. Varshney, Caiming Xiong, Richard Socher - 2019

4 papers in library cite

S. L. Blodgett, S. Barocas, H. D. Iii, H. Wallach - 2020

7 papers in library cite

E. M. Bender, B. Friedman - 2018

4 papers in library cite

E. Sheng, K. W. Chang, P. Natarajan, Nanyun Peng - 2019

4 papers in library cite

S. Welleck, I. Kulikov, S. Roller, E. Dinan, Kyunghyun Cho, Jason Weston - 2019

3 papers in library cite

J. Dodge, Suchin Gururangan, D. Card, Richard Schwartz, Noah A. Smith - 2019

3 papers in library cite

E. Denton, A. Hanna, R. Amironesei, A. Smart, H. Nicole, M. K. Scheuerman - 2020

2 papers in library cite

E. Dinan, S. Humeau, B. Chintagunta, Jason Weston - 2019

2 papers in library cite

L. Breitfeller, E. Ahn, David Jurgens, Y. Tsvetkov - 2019

2 papers in library cite

C. May, A. Wang, S. Bordia, Samuel R. Bowman, R. Rudinger - 2019

2 papers in library cite

A. Gokaslan, V. Cohen - 2019

2 papers in library cite

S. Dathathri, Andrea Madotto, J. Lan, J. Hung, E. Frank, P. Molino, Jason Yosinski, Rosanne Liu - 2019

2 papers in library cite

Maarten Sap, S. Gabriel, Lianhui Qin, Dan Jurafsky, Noah A. Smith, Yejin Choi - 2020

2 papers in library cite

A. Sudhakar, B. Upadhyay, A. Maheswaran - 2019

1 paper in library cites

C. D. N. Mizil, M. Sudhof, Dan Jurafsky, J. Leskovec, Christopher Potts - 2013

1 paper in library cites

J. Golbeck, Z. Ashktorab, R. O. Banjo, A. Berlinger, S. Bhagwan, C. Buntain, P. Cheakalos, A. A. Geller, Q. Gergory, R. K. Gnanasekaran, R. R. Gunasekaran, K. M. Hoffman, J. Hottle, V. Jienjitlert, S. Khare, R. Lau, M. J. Martindale, S. Naik, H. L. Nixon, P. Ramachandran, K. M. Rogers, L. Rogers, M. S. Sarin, G. Shahane, J. Thanki, P. Vengataraman, Z. Wan, D. M. Wu - 2017

1 paper in library cites

Sayan Ghosh, M. Chollet, E. Laksana, Louis Philippe Morency, S. Scherer - 2017

1 paper in library cites

L. Green - 2002

1 paper in library cites

A. Wang, Kyunghyun Cho - 2019

1 paper in library cites

C. Disalvo, A. Clement, V. Pipek - 2012

1 paper in library cites

J. Ficler, Y. Goldberg - 2017

1 paper in library cites

J. Zhang, J. Chang, C. D. N. Mizil, L. Dixon, Y. Hua, D. Taraborelli, N. Thain - 2018

1 paper in library cites

S. L. Blodgett, L. Green, B. O'connor - 2016

1 paper in library cites

W. Stoop, F. Kunneman, A. V. D. Bosch, B. Miller - 2019

1 paper in library cites

X. F. Aran, T. V. Nuenen, J. M. Such, N. Criado - 2020

1 paper in library cites

C. Basta, M. R. C. Jussa, N. Casas - 2019

1 paper in library cites

E. Sanders - 2002

1 paper in library cites

J. Zhao, Tianle Wang, M. Yatskar, R. Cotterell, V. Ordonez, K. W. Chang - 2019

1 paper in library cites

M. X. Chen, B. N. Lee, G. Bansal, Yue Cao, S. Zhang, J. Y. Lu, J. Tsay, Yuzhi Wang, A. M. Dai, Ziru Chen, T. Sohn, Yonghui Wu - 2019

1 paper in library cites

A. Chung - 2019

1 paper in library cites

S. Sharoff - 2020

1 paper in library cites

A. M. Founta, C. Djouvas, D. Chatzakou, I. Leontiadis, J. Blackburn, G. Stringhini, A. Vakali, M. Sirivianos, N. Kourtellis - 2018

1 paper in library cites

Ari Holtzman, J. Buys, M. Forbes, Antoine Bosselut, D. Golub, Yejin Choi - 2018

1 paper in library cites

E. S. Jo, T. Gebru - 2020

1 paper in library cites

A. Z. Jacobs, H. M. Wallach - 2019

1 paper in library cites

L. Dixon, Jeffrey Li, J. S. Sorensen, N. Thain, L. Vasserman - 2018

1 paper in library cites

K. Kurita, N. Vyas, A. Pareek, A. W. Black, Y. Tsvetkov - 2019

1 paper in library cites

B. Ross, M. Rist, G. Carbonell, B. Cabrera, N. Kurowsky, M. Wojatzki - 2017

1 paper in library cites

A. Rajaraman, J. D. Ullman - 2011

1 paper in library cites

E. Dinan, A. Fan, L. Y. Wu, Jason Weston, Douwe Kiela, A. Williams - 2020

1 paper in library cites

J. H. Park, Pascale Fung - 2017

1 paper in library cites

X. Ma, Maarten Sap, Hannah Rashkin, Yejin Choi - 2020

1 paper in library cites

R. Baly, G. Karadzhov, D. Alexandrov, J. Glass, P. Nakov - 2018

1 paper in library cites

M. Karan, J. Snajder - 2019

1 paper in library cites

A. Rajadesingan, P. Resnick, C. Budak - 2020

1 paper in library cites

T. Davidson, D. Bhattacharya, I. Weber - 2019

1 paper in library cites

A. Romano - 2017

1 paper in library cites

M. Barthel, G. Stocking, J. Holcomb, A. Mitchell - 2016

1 paper in library cites

E. Fast, T. Vachovsky, M. S. Bernstein - 2016

1 paper in library cites

B. Hutchinson, V. Prabhakaran, E. Denton, K. Webster, Y. Zhong, S. Denuy - 2020

1 paper in library cites

J. Eisenstein, A. Ahmed, E. P. Xing - 2011

1 paper in library cites

A. King - 2019

1 paper in library cites

S. Mohan, A. Guha, M. Harris, F. Popowich, A. Schuster, C. Priebe - 2017

1 paper in library cites

S. Barocas, K. Crawford, A. Shapiro, H. Wallach - 2017

1 paper in library cites

K. Mcguffie, A. Newhouse - 2020

1 paper in library cites

Maarten Sap, D. Card, S. Gabriel, Yejin Choi, Noah A. Smith - 2019

1 paper in library cites

Nicholas Carlini, C. L. Liu, U. Erlingsson, J. Kos, D. X. Song - 2019

1 paper in library cites

B. Zadrozny, C. Elkan - 2002

1 paper in library cites

P. W. Koh, Percy Liang - 2017

1 paper in library cites

E. Wallace, S. Feng, N. Kandpal, Matt Gardner, Shivalika Singh - 2019

1 paper in library cites

B. Friedman, P. H. Kahn, A. Borning - 2008

1 paper in library cites

Cited by

1

papers in your library

Cites

19

papers in your library

Read

on May 30, 2026

Your review

Tags

Paper Aliases

No aliases