2019
Cite Score
45
AI summary
This paper observes that many attention heads in Transformer-based models (machine translation models and BERT) can be removed after training without significantly hurting performance. It proposes a greedy algorithm for pruning heads, which yields inference speed gains of up to 17.5% for a BERT-based model. The analysis finds that encoder-decoder attention layers are more sensitive to pruning than self-attention layers.
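As a rough illustration of what a greedy head-pruning loop can look like, the sketch below removes heads in order of increasing importance until a quality metric drops by more than a tolerance. It is not the authors' implementation: `greedy_prune`, `importance`, `evaluate`, and `max_drop` are hypothetical names, and how the importance scores are obtained is left outside the sketch.

```python
def greedy_prune(importance, evaluate, max_drop):
    """Greedily remove the least important heads.

    importance: dict mapping (layer, head) -> importance score (assumed given).
    evaluate:   callable taking the set of kept heads and returning a quality
                metric where higher is better (hypothetical helper).
    max_drop:   largest acceptable decrease relative to the unpruned model.
    Returns (kept_heads, pruned_heads_in_removal_order).
    """
    kept = set(importance)
    baseline = evaluate(kept)
    pruned = []
    # Visit heads from least to most important.
    for head in sorted(importance, key=importance.get):
        candidate = kept - {head}
        if not candidate:
            break  # always keep at least one head
        if baseline - evaluate(candidate) > max_drop:
            break  # removing this head costs too much quality; stop
        kept = candidate
        pruned.append(head)
    return kept, pruned


# Toy usage: the "metric" is just the summed importance of the kept heads.
toy_importance = {(0, 0): 0.01, (0, 1): 0.5, (1, 0): 0.02, (1, 1): 0.9}
toy_evaluate = lambda heads: sum(toy_importance[h] for h in heads)
kept, pruned = greedy_prune(toy_importance, toy_evaluate, max_drop=0.05)
```

In this toy run the two low-importance heads are removed and the loop stops before the third, since removing it would exceed the tolerance.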
Abstract
Attention is a powerful and ubiquitous mechanism for allowing neural models to focus on particular salient pieces of information by taking their weighted average when making predictions. In particular, multi-headed attention is a driving force behind many recent state-of-the-art natural language processing (NLP) models such as Transformer-based MT models and BERT. These models apply multiple attention mechanisms in parallel, with each attention “head” potentially focusing on different parts of the input, which makes it possible to express sophisticated functions beyond the simple weighted average. In this paper we make the surprising observation that even if models have been trained using multiple heads, in practice, a large percentage of attention heads can be removed at test time without significantly impacting performance. In fact, some layers can even be reduced to a single head. We further examine greedy algorithms for pruning down models, and the potential speed, memory efficiency, and accuracy improvements obtainable therefrom. Finally, we analyze the results with respect to which parts of the model are more reliant on having multiple heads, and provide precursory evidence that training dynamics play a role in the gains provided by multi-head attention.
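To make "removing a head at test time" concrete, here is a minimal pure-Python sketch, assuming each head is plain scaled dot-product attention and that head outputs are summed after being gated by a 0/1 mask. The learned query, key, value, and output projections of a real Transformer are omitted, and the function names (`attention_head`, `masked_multi_head`) and the `head_mask` parameter are illustrative assumptions rather than anything defined on this page.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(query, keys, values):
    """Single head: a weighted average of `values`, weighted by query-key similarity."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def masked_multi_head(queries, keys, values, head_mask):
    """Sum the per-head outputs, skipping heads whose mask entry is 0.

    queries[h], keys[h], values[h] hold head h's (already projected) inputs.
    head_mask[h] is 1 to keep head h and 0 to prune it.
    """
    dim = len(values[0][0])
    out = [0.0] * dim
    for h, gate in enumerate(head_mask):
        if gate == 0:
            continue  # a pruned head contributes nothing and costs no compute
        head_out = attention_head(queries[h], keys[h], values[h])
        out = [o + gate * x for o, x in zip(out, head_out)]
    return out


# Two toy heads; each attends uniformly over its two positions.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]
v = [[[2.0, 0.0], [4.0, 0.0]], [[0.0, 10.0], [0.0, 20.0]]]
full = masked_multi_head(q, k, v, head_mask=[1, 1])
pruned = masked_multi_head(q, k, v, head_mask=[1, 0])
```

Setting a mask entry to 0 drops that head's contribution entirely, which is the sense in which a head is "removed" in this sketch.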