2019

Are Sixteen Heads Really Better Than One?

P. Michel, Omer Levy, Graham Neubig

citations

Cite Score

45

AI summary

This paper observes that many attention heads in Transformer-based models (machine translation models and BERT) can be removed after training without a significant impact on performance. It proposes a greedy algorithm for pruning heads, which yields inference-time efficiency gains of up to 17.5% for BERT. The analysis shows that, in machine translation, encoder-decoder attention layers are more sensitive to pruning than self-attention layers.

Main Contributions

  • Demonstrates that a large percentage of attention heads can be removed at test time without significantly impacting performance in Transformer-based models.
  • Proposes a simple greedy algorithm for pruning attention heads.
  • Shows significant benefits for inference-time efficiency by pruning attention heads.
  • Reveals that encoder-decoder attention layers are more sensitive to pruning than self-attention layers in machine translation.
  • Provides evidence that the distinction between important and unimportant heads increases as training progresses.
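
The greedy pruning idea in the contributions above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the summary does not say how heads are scored, so the sketch uses a plain ablation criterion (at each step, remove the head whose removal costs the least score). The names `greedy_prune`, `toy_score`, and `TOY_WEIGHTS` are hypothetical, and the toy score stands in for a real validation metric such as BLEU or accuracy.

```python
from typing import Callable, FrozenSet, List, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def greedy_prune(
    heads: List[Head],
    score_fn: Callable[[FrozenSet[Head]], float],
    tolerance: float,
) -> FrozenSet[Head]:
    """Remove heads one at a time while the score stays within
    `tolerance` of the unpruned baseline. Returns the heads that are kept."""
    active = frozenset(heads)
    baseline = score_fn(active)
    while len(active) > 1:
        # Evaluate the model with each remaining head removed in turn.
        candidates = [(score_fn(active - {h}), h) for h in sorted(active)]
        # Greedy choice: the removal that leaves the highest score.
        best_score, best_head = max(candidates, key=lambda c: c[0])
        if baseline - best_score > tolerance:
            break  # any further removal would cost too much
        active = active - {best_head}
    return active


# Assumption: a toy additive score in which each head contributes a fixed amount.
TOY_WEIGHTS = {(0, 0): 0.50, (0, 1): 0.01, (1, 0): 0.30, (1, 1): 0.02}


def toy_score(active: FrozenSet[Head]) -> float:
    return sum(TOY_WEIGHTS[h] for h in active)


kept = greedy_prune(list(TOY_WEIGHTS), toy_score, tolerance=0.05)
print(sorted(kept))  # prints [(0, 0), (1, 0)]
```

In this toy setting the two low-contribution heads are pruned and each layer is left with a single head. Scoring by ablation needs one evaluation per remaining head per step, so a real system would likely use a cheaper importance estimate.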

Abstract

Attention is a powerful and ubiquitous mechanism for allowing neural models to focus on particular salient pieces of information by taking their weighted average when making predictions. In particular, multi-headed attention is a driving force behind many recent state-of-the-art natural language processing (NLP) models such as Transformer-based MT models and BERT. These models apply multiple attention mechanisms in parallel, with each attention “head” potentially focusing on different parts of the input, which makes it possible to express sophisticated functions beyond the simple weighted average. In this paper we make the surprising observation that even if models have been trained using multiple heads, in practice, a large percentage of attention heads can be removed at test time without significantly impacting performance. In fact, some layers can even be reduced to a single head. We further examine greedy algorithms for pruning down models, and the potential speed, memory efficiency, and accuracy improvements obtainable therefrom. Finally, we analyze the results with respect to which parts of the model are more reliant on having multiple heads, and provide precursory evidence that training dynamics play a role in the gains provided by multi-head attention.
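
To make the mechanism in the abstract concrete, here is a small sketch of attention as a weighted average, with several heads run in parallel and a binary mask that removes heads at test time. It is an illustration under simplifying assumptions, not the paper's formulation: each head receives its own already-projected query, keys, and values, head outputs are simply summed, and the learned projection matrices are omitted. The function names `attention_head` and `masked_multi_head` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """One head: scaled dot-product weights, then a weighted average of values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def masked_multi_head(
    queries: List[Vector],
    keys: List[List[Vector]],
    values: List[List[Vector]],
    head_mask: Sequence[int],
) -> Vector:
    """Sum the head outputs, each multiplied by its mask entry
    (1 keeps the head, 0 prunes it)."""
    out = [0.0] * len(values[0][0])
    for q, k, v, keep in zip(queries, keys, values, head_mask):
        head_out = attention_head(q, k, v)
        out = [o + keep * h for o, h in zip(out, head_out)]
    return out


# Toy inputs: two heads attending over two positions.
shared_keys = [[1.0, 0.0], [0.0, 1.0]]
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [shared_keys, shared_keys]
values = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 2.0], [4.0, 4.0]]]

full = masked_multi_head(queries, keys, values, head_mask=[1, 1])
pruned = masked_multi_head(queries, keys, values, head_mask=[1, 0])
print(full, pruned)
```

With the mask set to `[1, 0]`, the result equals the output of the first head alone, which is the test-time removal the abstract describes.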
