Papperoni

2019

Are Sixteen Heads Really Better Than One?

P. Michel, Omer Levy, Graham Neubig

Cite Score: 45

AI summary

This paper observes that many attention heads in Transformer-based models (MT, BERT) can be removed after training without significant performance impact. A greedy algorithm is proposed for pruning heads, leading to inference-time efficiency gains up to 17.5% for BERT. Analysis reveals encoder-decoder attention layers are more sensitive to pruning.

Main Contributions

Demonstrates that a large percentage of attention heads can be removed at test time without significantly impacting performance in Transformer-based models.
Proposes a simple greedy algorithm for pruning attention heads.
Shows significant benefits for inference-time efficiency by pruning attention heads.
Reveals that encoder-decoder attention layers are more sensitive to pruning than self-attention layers in machine translation.
Provides evidence that the distinction between important and unimportant heads increases as training progresses.
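The greedy pruning idea in the contributions above can be illustrated with a minimal sketch. It assumes each head already has a scalar importance score and that a scoring function for the pruned model is available; `greedy_prune`, `importance`, `evaluate`, `max_drop`, and the toy scores are hypothetical names and values chosen for illustration, not the authors' implementation.

```python
from typing import Callable, Dict, Set, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def greedy_prune(
    importance: Dict[Head, float],
    evaluate: Callable[[Set[Head]], float],
    max_drop: float,
) -> Set[Head]:
    """Remove heads from least to most important until the score drops too far.

    `evaluate` takes the set of pruned heads and returns a task score
    (higher is better). Both arguments are assumptions of this sketch.
    """
    baseline = evaluate(set())
    pruned: Set[Head] = set()
    for head in sorted(importance, key=importance.get):
        candidate = pruned | {head}
        # Stop at the first head whose removal costs more than the tolerance.
        if baseline - evaluate(candidate) > max_drop:
            break
        pruned = candidate
    return pruned


# Toy setup: pretend the score falls by exactly the importance of each pruned head.
toy_importance = {(0, 0): 0.01, (0, 1): 0.30, (1, 0): 0.02, (1, 1): 0.25}


def toy_evaluate(pruned_heads: Set[Head]) -> float:
    return 1.0 - sum(toy_importance[h] for h in pruned_heads)


toy_pruned = greedy_prune(toy_importance, toy_evaluate, max_drop=0.05)
print(sorted(toy_pruned))
```

In a real setting `evaluate` would run the masked model on held-out data (for example BLEU or accuracy), so each call is expensive; the sketch only shows the control flow.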

Abstract

Attention is a powerful and ubiquitous mechanism for allowing neural models to focus on particular salient pieces of information by taking their weighted average when making predictions. In particular, multi-headed attention is a driving force behind many recent state-of-the-art natural language processing (NLP) models such as Transformer-based MT models and BERT. These models apply multiple attention mechanisms in parallel, with each attention "head" potentially focusing on different parts of the input, which makes it possible to express sophisticated functions beyond the simple weighted average. In this paper we make the surprising observation that even if models have been trained using multiple heads, in practice, a large percentage of attention heads can be removed at test time without significantly impacting performance. In fact, some layers can even be reduced to a single head. We further examine greedy algorithms for pruning down models, and the potential speed, memory efficiency, and accuracy improvements obtainable therefrom. Finally, we analyze the results with respect to which parts of the model are more reliant on having multiple heads, and provide precursory evidence that training dynamics play a role in the gains provided by multi-head attention.
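To make "weighted average" and "removing a head at test time" concrete, here is a small pure-Python sketch. It assumes scaled dot-product attention for a single query and assumes each head's output is already in the model dimension, so heads can be combined by a masked sum; `attend`, `masked_multi_head`, and the toy vectors are hypothetical and only illustrative.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """One head: a softmax-weighted average of the value vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def masked_multi_head(head_outputs: List[Vector], mask: Sequence[int]) -> Vector:
    """Sum per-head outputs; mask[h] = 0 removes head h, mask[h] = 1 keeps it."""
    dim = len(head_outputs[0])
    return [sum(m * out[i] for m, out in zip(mask, head_outputs)) for i in range(dim)]


# Toy example: two heads with different queries over the same keys and values.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
head_a = attend([10.0, 0.0], keys, values)  # focuses on the first position
head_b = attend([0.0, 10.0], keys, values)  # focuses on the second position

all_heads = masked_multi_head([head_a, head_b], mask=[1, 1])
only_head_a = masked_multi_head([head_a, head_b], mask=[1, 0])
```

Setting a mask entry to 0 removes that head's contribution without retraining, which is the kind of test-time ablation the abstract describes.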

References [38]

[1]Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin - 2017

47 papers in library cite

I mean... it introduced Transformers!

[2]BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, M. W. Chang, K. Lee, Kristina Toutanova - 2018

39 papers in library cite

Simply amazing. It's very impressive how they make a leap vs. existing stuff (you can see from the references, pretty much no one is doing what they are doing, other than GPT)

[3]Neural Machine Translation by Jointly Learning to Align and Translate

D. Bahdanau, Kyunghyun Cho, Yoshua Bengio - 2014

59 papers in library cite

Introduces the attention mechanism - amazing overall

[4]Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation

Kyunghyun Cho, B. V. Merrienboer, C. G. Gulcehre, D. Bahdanau, F. Bougares, Holger Schwenk, Yoshua Bengio - 2014

38 papers in library cite

Introduces RNN encoder-decoder. I love it :)

[5]Language Models Are Unsupervised Multitask Learners

Alec Radford, Jeffrey Wu, Rewon Child, D. Luan, Dario Amodei, Ilya Sutskever - 2019

27 papers in library cite

Amazing! Tons of important contributions. I think they could have explained the models a bit better, and I think this is where OpenAI starts to become evil (and not open)

[6]Effective Approaches to Attention-Based Neural Machine Translation

T. Luong, H. Pham, Christopher D. Manning - 2015

15 papers in library cite

Good paper, but very derivative. Attention methods start getting very complicated... I understand why Transformers took over TBH

[7]Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank

Richard Socher, A. Perelygin, Jean Wu, J. Chuang, C. Manning, A. Ng, Christopher Potts - 2013

24 papers in library cite

I didn't really like the first paper and I don't really like this one. I think the dataset is more influential than the methodology. I think Stanford folks are too focused on old school NLP.

[8]Optimal Brain Damage

Yann LeCun, John Denker, Sara Solla, Richard Howard, Lawrence Jackel - 1990

4 papers in library cite

It's a nice idea but I think it's a bit uninteresting - maybe it influenced other important work later?

[9]A Broad-Coverage Challenge Corpus for Sentence Understanding Through Inference

A. Williams, Nikita Nangia, S. Bowman - 2018

19 papers in library cite

Very nice paper and cool dataset - good thing they expanded SNLI. Also, they at least tried to have a good baseline, and comparisons of domains are nice.

[10]Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Z. Dai, Zhilin Yang, Yiming Yang, W. Cohen, J. Carbonell, Quoc Le, Ruslan Salakhutdinov - 2019

9 papers in library cite

It's so cool to see context expansion without the need to actually expand context! Such a simple concept and so effective!

[11]Automatically Constructing a Corpus of Sentential Paraphrases

W. Dolan, Chris Brockett - 2005

9 papers in library cite

Small dataset, questionable methodology, not useful for training models

[12]A Deep Reinforced Model for Abstractive Summarization

R. Paulus, Caiming Xiong, Richard Socher - 2017

7 papers in library cite

It's nice that they introduce intra-attention and RL, but at this point I think a lot of the work in attention is very derivative.

[13]A Decomposable Attention Model for Natural Language Inference

A. P. Parikh, O. Tackstrom, Dipanjan Das, Jakob Uszkoreit - 2016

11 papers in library cite

Very nice alternative to the common LSTM encoder-decoder architecture! Seems similar to the Transformers arch in the sense that they don't use RNNs. Nice that they analyze computational complexity as well.

[14]Long Short-Term Memory-Networks for Machine Reading

J. Cheng, Li Dong, Mirella Lapata - 2016

8 papers in library cite

I read this more as an example of intra-attention, but this is not the main focus of the paper. I think visualization/explanation is a bit bad, and it doesn't seem too impactful. I kept thinking that this is starting to get too complicated, and indeed it was surpassed by transformers right after that.

[15]Neural Network Acceptability Judgments

Alex Warstadt, A. Singh, S. Bowman - 2018

8 papers in library cite

CoLA dataset

[16]An Analysis of Encoder Representations in Transformer-Based Machine Translation

A. Raganato, J. Tiedemann - 2018

2 papers in library cite

[17]Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned

E. Voita, D. Talbot, F. Moiseev, R. Sennrich, Ivan Titov - 2019

2 papers in library cite

[18]Moses: Open Source Toolkit for Statistical Machine Translation

P. Koehn, H. Hoang, Alexandra Birch, Chris Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, E. Herbst - 2007

8 papers in library cite

[19]Improving Language Understanding With Unsupervised Learning

Alec Radford, K. Narasimhan, T. Salimans, Ilya Sutskever - 2018

4 papers in library cite

[20]Scaling Neural Machine Translation

M. Ott, S. Edunov, D. Grangier, Michael Auli - 2018

3 papers in library cite

[21]Sequence-Level Knowledge Distillation

Yoon Kim, A. Rush - 2016

3 papers in library cite

[22]Learning Both Weights and Connections for Efficient Neural Networks

S. Han, J. Pool, J. Tran, W. Dally - 2015

2 papers in library cite

[23]Linguistically-Informed Self-Attention for Semantic Role Labeling

E. Strubell, P. Verga, D. Andor, D. Weiss, Andrew McCallum - 2018

2 papers in library cite

[24]Pruning Convolutional Neural Networks for Resource Efficient Inference

P. Molchanov, S. Tyree, T. Karras, T. Aila, J. Kautz - 2017

2 papers in library cite

[25]Statistical Significance Tests for Machine Translation Evaluation

P. Koehn - 2004

2 papers in library cite

[26]Auto-Sizing Neural Networks: With Applications to N-Gram Language Models

K. Murray, D. Chiang - 2015

1 paper in library cites

[27]compare-mt: A Tool for Holistic Comparison of Language Generation Systems

Graham Neubig, Z. Dou, Jiaxi Hu, P. Michel, D. Pruthi, Xinpeng Wang - 2019

1 paper in library cites

[28]Compression of Neural Machine Translation Models via Pruning

A. See, M. Luong, C. Manning - 2016

1 paper in library cites

[29]DiSAN: Directional Self-Attention Network for RNN/CNN-Free Language Understanding

T. Shen, T. Zhou, G. Long, J. Jiang, Shirui Pan, Chengqi Zhang - 2018

1 paper in library cites

[30]Layer-Wise Relevance Propagation for Neural Networks With Local Renormalization Layers

A. Binder, G. Montavon, S. Lapuschkin, K. Muller, W. Samek - 2016

1 paper in library cites

[31]MTNT: A Testbed for Machine Translation of Noisy Text

P. Michel, Graham Neubig - 2018

1 paper in library cites

[32]Opening the Black Box of Deep Neural Networks via Information

R. Shwartz-Ziv, N. Tishby - 2017

1 paper in library cites

[33]Pruning Filters for Efficient Convnets

H. Li, A. Kadav, I. Durdanovic, H. Samet, H. Graf - 2016

1 paper in library cites

[34]Report on the 11th IWSLT Evaluation Campaign, IWSLT 2014

M. Cettolo, J. Niehues, S. Stuker, L. Bentivogli, M. Federico - 2015

1 paper in library cites

[35]Second Order Derivatives for Network Pruning: Optimal Brain Surgeon

B. Hassibi, D. Stork - 1993

1 paper in library cites

[36]Structured Pruning of Deep Convolutional Neural Networks

S. Anwar, K. Hwang, W. Sung - 2017

1 paper in library cites

[37]Weighted Transformer Network for Machine Translation

K. Ahmed, Nitish Shirish Keskar, Richard Socher - 2017

1 paper in library cites

[38]Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures

G. Tang, M. Muller, A. Rios, R. Sennrich - 2018

1 paper in library cites

Cited by 1 paper in your library

Cites 17 papers in your library

Read on December 29, 2025

I expected more. They don't answer the main question of the paper. I thought they would explain what the heads are for, but they only compare results with/without. It's more of an empirical analysis.
