2019
Cite Score
61
AI summary
This paper introduces an intra-layer model parallel approach, implemented in PyTorch, that enables training transformer models with billions of parameters. It trains models of up to 8.3B parameters on 512 GPUs and achieves SOTA results on the WikiText103, LAMBADA, and RACE datasets.
Abstract
Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory constraints. In this work, we present our techniques for training very large transformer models and implement a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters. Our approach does not require a new compiler or library changes, is orthogonal and complementary to pipeline model parallelism, and can be fully implemented with the insertion of a few communication operations in native PyTorch. We illustrate this approach by converging transformer-based models up to 8.3 billion parameters using 512 GPUs. We sustain 15.1 PetaFLOPs across the entire application with 76% scaling efficiency when compared to a strong single-GPU baseline that sustains 39 TeraFLOPs, which is 30% of peak FLOPs. To demonstrate that large language models can further advance the state of the art (SOTA), we train an 8.3 billion parameter transformer language model similar to GPT-2 and a 3.9 billion parameter model similar to BERT. We show that careful attention to the placement of layer normalization in BERT-like models is critical to achieving increased performance as the model size grows. Using the GPT-2 model we achieve SOTA results on the WikiText103 (10.8 compared to SOTA perplexity of 15.8) and LAMBADA (66.5% compared to SOTA accuracy of 63.2%) datasets. Our BERT model achieves SOTA results on the RACE dataset (90.9% compared to SOTA accuracy of 89.4%).
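The scaling-efficiency figure in the abstract can be sanity-checked from the other numbers it quotes. The sketch below assumes that efficiency means sustained aggregate throughput divided by the ideal linear-scaling throughput (number of GPUs times the single-GPU baseline). The abstract does not spell out this formula, so treat it as an illustrative reading rather than the authors' exact definition. The function and constant names are made up for this example.

```python
# Figures quoted in the abstract.
SINGLE_GPU_TFLOPS = 39.0   # sustained throughput of the single-GPU baseline
NUM_GPUS = 512             # GPUs used for the largest configuration
SUSTAINED_PFLOPS = 15.1    # sustained throughput across the whole application


def scaling_efficiency(sustained_pflops, num_gpus, single_gpu_tflops):
    """Fraction of ideal linear scaling achieved (assumed definition)."""
    # Ideal case: every GPU sustains the single-GPU baseline; convert TFLOPs to PFLOPs.
    ideal_pflops = num_gpus * single_gpu_tflops / 1000.0
    return sustained_pflops / ideal_pflops


efficiency = scaling_efficiency(SUSTAINED_PFLOPS, NUM_GPUS, SINGLE_GPU_TFLOPS)
print(f"ideal throughput: {NUM_GPUS * SINGLE_GPU_TFLOPS / 1000.0:.3f} PFLOPs")
print(f"scaling efficiency: {efficiency:.1%}")
```

Under this assumed definition the ratio comes out at roughly 0.756, which rounds to the 76% stated in the abstract.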