Papperoni

2018

Improving Language Understanding by Generative Pre-Training

Alec Radford, K. Narasimhan, T. Salimans, Ilya Sutskever

citations

Cite Score

89

AI summary

This paper introduces a semi-supervised approach for language understanding tasks: unsupervised pre-training of a Transformer language model on the BooksCorpus dataset, followed by discriminative fine-tuning on each target task. It achieves state-of-the-art results on 9 out of 12 tasks.

Main Contributions

Introduces a semi-supervised approach for language understanding tasks, leveraging unsupervised pre-training and supervised fine-tuning.
Utilizes a Transformer-based language model for pre-training on the BooksCorpus dataset.
Employs task-specific input adaptations during fine-tuning to achieve effective transfer with minimal architectural changes.
Demonstrates state-of-the-art results on 9 out of 12 language understanding tasks, including significant improvements on commonsense reasoning, question answering, and textual entailment.
Analyzes zero-shot behaviors of the pre-trained model, showcasing its acquisition of useful linguistic knowledge for downstream tasks.
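
The task-specific input adaptations mentioned above can be pictured as flattening each structured example into a single token sequence that an unchanged language model can read. The following is a minimal sketch under stated assumptions, not the authors' code: the special-token strings (`<s>`, `$`, `<e>`) and all function names are hypothetical placeholders, and the layouts reflect one reading of the approach.

```python
from typing import List

# Hypothetical special tokens (assumed for illustration only).
START, DELIM, EXTRACT = "<s>", "$", "<e>"


def classification_input(text: List[str]) -> List[str]:
    """Wrap a single text in start and extract tokens."""
    return [START, *text, EXTRACT]


def entailment_input(premise: List[str], hypothesis: List[str]) -> List[str]:
    """Concatenate premise and hypothesis with a delimiter in between."""
    return [START, *premise, DELIM, *hypothesis, EXTRACT]


def similarity_inputs(a: List[str], b: List[str]) -> List[List[str]]:
    """Similarity has no inherent ordering, so emit both orderings."""
    return [entailment_input(a, b), entailment_input(b, a)]


def multiple_choice_inputs(
    context: List[str], answers: List[List[str]]
) -> List[List[str]]:
    """Pair the shared context with each candidate answer separately."""
    return [[START, *context, DELIM, *ans, EXTRACT] for ans in answers]


if __name__ == "__main__":
    print(entailment_input(["it", "rains"], ["it", "is", "wet"]))
    print(multiple_choice_inputs(["the", "story"], [["end", "a"], ["end", "b"]]))
```

Each resulting sequence would be fed through the same pre-trained model, so only a small output layer on top needs to be task-specific.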

Abstract

Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering, semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our approach on a wide range of benchmarks for natural language understanding. Our general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied. For instance, we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI).
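
The two training stages described in the abstract can be written as objectives. This is a sketch recalled from the paper rather than quoted from this page, so the notation is an assumption: $\mathcal{U}$ is the unlabeled token corpus, $k$ the context window, $\Theta$ the model parameters, $\mathcal{C}$ the labeled dataset, and $\lambda$ a weight on the auxiliary language-modeling term.

```latex
% Stage 1: generative pre-training (language modeling on unlabeled text)
L_1(\mathcal{U}) = \sum_i \log P\left(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\right)

% Stage 2: discriminative fine-tuning on labeled pairs (x, y)
L_2(\mathcal{C}) = \sum_{(x, y)} \log P\left(y \mid x^1, \ldots, x^m\right)

% Combined fine-tuning objective with an auxiliary language-modeling term
L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \, L_1(\mathcal{C})
```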

References [71]

[1]Adam: A Method for Stochastic Optimization

D. P. Kingma, Jimmy Lei Ba - 2014

49 papers in library cite

Amazing paper! Very well explained and huge impact. I am amazed that they made something so simple even when it requires a lot of background mathematical knowledge

[2]Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin - 2017

47 papers in library cite

I mean... it introduced Transformers!

[3]Distributed Representations of Words and Phrases and Their Compositionality

Tomas Mikolov, Ilya Sutskever, K. Chen, G. S. Corrado, Jeffrey Dean - 2013

32 papers in library cite

Introduced word2vec. Game changer.

[4]GloVe: Global Vectors for Word Representation

Jeffrey Pennington, Richard Socher, Christopher D. Manning - 2014

31 papers in library cite

Not a bad paper, I just don't like the motivation and I think the methodology is poorly explained and hard to follow. I can't deny the good results though...

[5]A Fast Learning Algorithm for Deep Belief Nets

Geoffrey E. Hinton, S. Osindero, Y. Teh - 2006

43 papers in library cite

The paper does not explain anything. It just throws the idea and a bunch of math, but doesn't really care to explain the concepts.

[6]Convolutional Neural Networks for Sentence Classification

Yoon Kim - 2014

8 papers in library cite

It's nice, goes straight to the point. I can see why it has tons of citations. However, I am not sure it was as impactful as 20k citations.

[7]Deep Contextualized Word Representations

M. E. Peters, M. Neumann, M. Iyyer, Matt Gardner, C. Clark, K. Lee, L. S. Zettlemoyer - 2018

27 papers in library cite

I didn't really like the approach. Seems a bit derivative TBH. BERT seems more elegant.

[8]Layer Normalization

Jimmy Lei Ba, R. Kiros, Geoffrey E. Hinton - 2016

14 papers in library cite

Very nice! At first I had a little bit of prejudice because it seemed way too mathy, but actually the math is easy to follow and the results are very nice.

[9]A Stochastic Approximation Method

Herbert Robbins, Sutton Monro - 1951

3 papers in library cite

It's math. But it actually does a somewhat good job at explaining (but I don't think they tried too hard). It gets way better near the end.

[10]Distributed Representations of Sentences and Documents

Quoc Le, Tomas Mikolov - 2014

13 papers in library cite

Introduced document embeddings. Very nice overall.

[11]Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank

Richard Socher, A. Perelygin, Jeffrey Wu, J. Chuang, C. Manning, A. Ng, Christopher Potts - 2013

24 papers in library cite

I didn't really like the first paper and I don't really like this one. I think the dataset is more influential than the methodology. I think Stanford folks are too focused on old school NLP.

[12]Natural Language Processing (Almost) From Scratch

Ronan Collobert, Jason Weston, Leon Bottou, M. Karlen, Koray Kavukcuoglu, P. P. Kuksa - 2011

23 papers in library cite

It's one of those huge papers that gets very tiring by the end. However, it's a very nice contribution. I am biased towards not liking it because it's basically old style NLP using NNs, which to me is a bit meh. However, I think this sets very important foundations for pretraining, embeddings, and proving that NNs rock.

[13]SQuAD: 100,000+ Questions for Machine Comprehension of Text

P. Rajpurkar, J. Zhang, K. Lopyrev, Percy Liang - 2016

37 papers in library cite

Nice paper that introduced an important dataset. Not much else though.

[14]Extracting and Composing Robust Features With Denoising Autoencoders

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, Pierre Antoine Manzagol - 2008

25 papers in library cite

I am *so* glad we found an alternative to DBNs. Also, introduced the idea of denoising which is nice.

[15]Neural Machine Translation of Rare Words with Subword Units

R. Sennrich, B. Haddow, Alexandra Birch - 2016

22 papers in library cite

Very good! Simple, explains quite a lot and good results. Forms the basis for a lot of stuff now!

[16]GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

A. Wang, A. Singh, J. Michael, F. Hill, Omer Levy, Samuel R. Bowman - 2018

26 papers in library cite

I like it, but it's just a mash-up of different existing datasets and F1 score. Nothing new really, but I get why it's important.

[17]Bridging Nonlinearities and Stochastic Regularizers With Gaussian Error Linear Units

Dan Hendrycks, Kevin Gimpel - 2016

9 papers in library cite

Very understandable, and very nice! I don't think the justification is good, but hey, it works!

[18]A Unified Architecture for Natural Language Processing: Deep Neural Networks With Multitask Learning

Ronan Collobert, Jason Weston - 2008

32 papers in library cite

Really did not add much to the game. I think this was more of a small perf. improvement over other existing things and set a few methodological standards. Maybe main contribution is Multitask Learning + Deep learning

[19]Greedy Layer-Wise Training of Deep Networks

Yoshua Bengio, P. Lamblin, D. Popovici, Hugo Larochelle - 2006

33 papers in library cite

Bengio is perfect. This is everything that Hinton's paper hoped to be. Very well explained, and also tying back to real use cases (not just "hey, the math works and it reduced the score")

[20]Universal Language Model Fine-Tuning for Text Classification

J. Howard, Sebastian Ruder - 2018

14 papers in library cite

Amazing! Bridging the gap between pre-training/fine-tuning in CV vs. NLP, plus giving amazing results!

[21]A Large Annotated Corpus for Learning Natural Language Inference

Samuel R. Bowman, G. Angeli, Christopher Potts, Christopher D. Manning - 2015

25 papers in library cite

Dataset collection is ok. The model that they create seems very low effort.

[22]A Broad-Coverage Challenge Corpus for Sentence Understanding Through Inference

A. Williams, Nikita Nangia, S. Bowman - 2018

19 papers in library cite

Very nice paper and cool dataset - good thing they expanded SNLI. Also, they at least tried to have a good baseline, and comparisons of domains are nice.

[23]Teaching Machines to Read and Comprehend

K. M. Hermann, T. Kocisky, Edward Grefenstette, L. Espeholt, W. Kay, M. Suleyman, Phil Blunsom - 2015

31 papers in library cite

Nice way of converting unsupervised data to train for Q&A - and nice visualizations as well :) But I think their main contribution is the dataset. Maybe with the dataset they "unlocked" summarization?

[24]Why Does Unsupervised Pre-Training Help Deep Learning?

Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre Antoine Manzagol, Pascal Vincent, Samy Bengio - 2010

12 papers in library cite

Good paper, easy to follow, and brings some light to the pre-training stuff (layer-by-layer). I just wish it wasn't so long. It's a chore.

[25]Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books

Yukun Zhu, R. Kiros, R. Zemel, Ruslan Salakhutdinov, R. Urtasun, Antonio Torralba, Sanja Fidler - 2015

18 papers in library cite

I think their approach was a bit convoluted and didn't really add a lot. Main contribution here is probably BookCorpus

[26]Skip-Thought Vectors

R. Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, R. Urtasun, Antonio Torralba, Sanja Fidler - 2015

23 papers in library cite

Nice to see an alternative to Word2Vec for sentences, but I don't really like the approach. Good nonetheless.

[27]Supervised Learning of Universal Sentence Representations From Natural Language Inference Data

Alexis Conneau, Douwe Kiela, Holger Schwenk, L. Barrault, Antoine Bordes - 2017

11 papers in library cite

It's nice. Maybe the first to do NLI right (after the SNLI paper tried but failed miserably). It is simple and effective. After that people started "performance maxxxing"

[28]Automatically Constructing a Corpus of Sentential Paraphrases

W. Dolan, Chris Brockett - 2005

9 papers in library cite

Small dataset, questionable methodology, not useful for training models

[29]Efficient Learning of Sparse Representations With an Energy-Based Model

Marc'aurelio Ranzato, C. Poultney, S. Chopra, Yann Lecun - 2006

20 papers in library cite

It's ok. Not really good, but alright.

[30]Semi-Supervised Sequence Learning

A. M. Dai, Quoc V. Le - 2015

27 papers in library cite

Very good paper that was probably the first to introduce pre-training in NLP!

[31]RACE: Large-Scale Reading Comprehension Dataset From Examinations

Guokun Lai, Q. Xie, Haozhe Liu, Yining Yang, Eduard Hovy - 2017

11 papers in library cite

I really like the idea of using human tests for testing AI. Also, very nice insight to use Chinese tests!

[32]Learned in Translation: Contextualized Word Vectors

B. Mccann, J. Bradbury, Caiming Xiong, Richard Socher - 2017

14 papers in library cite

Doesn't seem too revolutionary tbh. It's a nice methodology though, and it's nice that they do it in a supervised way

[33]Generating Wikipedia by Summarizing Long Sequences

P. J. Liu, M. Saleh, E. Pot, B. Goodrich, R. Sepassi, Lukasz Kaiser, Noam Shazeer - 2018

7 papers in library cite

Very nice, the first ones that realized that they could use decoder only from Transformers. Also nice that they expanded the context window, and got nice results. Very inventive to use citations from wikipedia and the first paragraph as target.

[34]Semi-Supervised Sequence Tagging With Bidirectional Language Models

M. E. Peters, W. Ammar, C. Bhagavatula, Russell Power - 2017

5 papers in library cite

Nothing too different from other LSTM + Attention papers

[35]Reasoning About Entailment With Neural Attention

Tim Rocktaschel, Edward Grefenstette, K. Hermann, T. Kocisky, Phil Blunsom - 2016

5 papers in library cite

It's nice that they are SotA on top of SNLI, but they just apply existing methodologies.

[36]When and Why Are Pre-Trained Word Embeddings Useful for Neural Machine Translation?

Graham Neubig - 2018

1 paper in library cites

It's not bad, it's just that it severely lacks deeper analysis on their results. A few of their results are contradictory, and they don't analyze why.

[37]Unsupervised Pretraining for Sequence to Sequence Learning

P. Ramachandran, P. J. Liu, Quoc V. Le - 2017

9 papers in library cite

It's alright, but it's the same Seq2Seq thing with pretraining

[38]A Simple but Tough-to-Beat Baseline for Sentence Embeddings

S. Arora, Yiqing Liang, T. Ma - 2017

4 papers in library cite

[39]Unsupervised Machine Translation Using Monolingual Corpora Only

G. Lample, L. Denoyer, Marc'aurelio Ranzato - 2017

4 papers in library cite

How?

[40]Learning General Purpose Distributed Sentence Representations via Large Scale Multi-Task Learning

S. Subramanian, A. Trischler, Yoshua Bengio, C. Pal - 2018

4 papers in library cite

Bengio

[41]The Fifth PASCAL Recognizing Textual Entailment Challenge

L. Bentivogli, Peter Clark, Ido Dagan, D. Giampiccolo - 2009

7 papers in library cite

[42]SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation

D. Cer, M. Diab, E. Agirre, I. L. Gazpio, L. Specia - 2017

6 papers in library cite

[43]Discourse-Based Objectives for Fast Unsupervised Sentence Representation Learning

Yacine Jernite, S. Bowman, D. Sontag - 2017

4 papers in library cite

[44]Semi-Supervised Sequential Labeling and Segmentation Using Gigaword Scale Unlabeled Data

J. Suzuki, H. Isozaki - 2008

4 papers in library cite

[45]A Fast and Accurate Dependency Parser Using Neural Networks

Danqi Chen, C. Manning - 2014

3 papers in library cite

[46]An Efficient Framework for Learning Sentence Representations

L. Logeswaran, Honglak Lee - 2018

3 papers in library cite

[47]Discriminative Improvements to Distributional Sentence Similarity

Yangfeng Ji, J. Eisenstein - 2013

3 papers in library cite

[48]Roles of Pre-Training and Fine-Tuning in Context-Dependent DBN-HMMs for Real-World Speech Recognition

D. Yu, L. Deng, G. Dahl - 2010

3 papers in library cite

[49]Semi-Supervised Multitask Learning for Sequence Labeling

M. Rei - 2017

3 papers in library cite

[50]A Compare-Propagate Architecture With Alignment Factorization for Natural Language Inference

Yi Tay, L. A. Tuan, S. C. Hui - 2017

2 papers in library cite

[51]GPU Kernels for Block-Sparse Weights

Scott Gray, Alec Radford, D. P. Kingma - 2017

2 papers in library cite

[52]Quora Question Pairs

Ziru Chen, Haowei Zhang, X. Zhang, L. Zhao - 2018

2 papers in library cite

[53]Resolving Complex Cases of Definite Pronouns: The Winograd Schema Challenge

A. Rahman, V. Ng - 2012

2 papers in library cite

[54]SciTail: A Textual Entailment Dataset From Science Question Answering

Tushar Khot, Ashish Sabharwal, Peter Clark - 2018

2 papers in library cite

[55]Semi-Supervised Learning for Natural Language

Percy Liang - 2005

2 papers in library cite

[56]Semi-Supervised Learning Literature Survey

X. Zhu - 2005

2 papers in library cite

[57]A Simple and Effective Approach to the Story Cloze Test

S. Srinivasan, R. Arora, M. Riedl - 2018

1 paper in library cites

[58]Constituency Parsing With a Self-Attentive Encoder

N. Kitaev, Dan Klein - 2018

1 paper in library cites

[59]Corpus of Linguistic Acceptability

Alex Warstadt, A. Singh, Samuel R. Bowman - 2018

1 paper in library cites

[60]ECNU at SemEval-2017 Task 1: Leverage Kernel-Based Traditional NLP Features and Neural Networks to Build a Universal Model for Multilingual and Cross-Lingual Semantic Textual Similarity

J. Tian, Zijian Zhou, M. Lan, Yonghui Wu - 2017

1 paper in library cites

[61]Fixing Weight Decay Regularization in Adam

I. Loshchilov, Frank Hutter - 2017

1 paper in library cites

[62]Learning Entity Representation for Entity Disambiguation

Z. He, Shuming Liu, M. Li, M. Zhou, Li Zhang, Haiming Wang - 2013

1 paper in library cites

[63]LSDSem 2017 Shared Task: The Story Cloze Test

N. Mostafazadeh, M. Roth, A. Louis, N. Chambers, J. Allen - 2017

1 paper in library cites

[64]Multi-Range Reasoning for Machine Comprehension

Yi Tay, L. A. Tuan, S. C. Hui - 2018

1 paper in library cites

[65]Opportunities and Challenges in Working With Low-Resource Languages

Y. Tsvetkov - 2017

1 paper in library cites

[66]Semi-Supervised Conditional Random Fields for Improved Sequence Segmentation and Labeling

F. Jiao, Shijie Wang, C. H. Lee, R. Greiner, Dale Schuurmans - 2006

1 paper in library cites

[67]Semi-Supervised Text Classification Using EM

K. Nigam, Andrew Mccallum, T. Mitchell - 2006

1 paper in library cites

[68]Split-Brain Autoencoders: Unsupervised Learning by Cross-Channel Prediction

Richard Zhang, P. Isola, A. A. Efros - 2017

1 paper in library cites

[69]Stochastic Answer Networks for Natural Language Inference

Xiaodong Liu, K. Duh, Jianfeng Gao - 2018

1 paper in library cites

[70]Story Comprehension for Predicting What Happens Next

S. Chaturvedi, H. Peng, Dan Roth - 2017

1 paper in library cites

[71]Towards Human-Level Machine Reading Comprehension: Reasoning and Inference With Multiple Strategies

Yiheng Xu, Joseph Liu, Jianfeng Gao, Y. Shen, Xiaodong Liu - 2017

1 paper in library cites

Cited by

23

papers in your library

Cites

40

papers in your library

Read

on August 4, 2025

Very simple and very nice! Easy to understand and revolutionary maybe?
