Papperoni

2020

Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes

Cho Jui Hsieh

Cite Score: 42

AI summary

This paper introduces LAMB, a layerwise adaptive large batch optimization technique that accelerates the training of deep neural networks. It reports superior performance on BERT and RESNET-50 training and reduces BERT training time from 3 days to 76 minutes on a TPUv3 Pod.

Main Contributions

Investigates a general adaptation strategy catered to large batch learning.
Develops LAMB, a new optimization algorithm for achieving adaptivity of learning rate in SGD.
Provides convergence analysis for both LARS and LAMB.
Demonstrates the strong empirical performance of LAMB on BERT and RESNET-50.
Reduces BERT training time from 3 days to 76 minutes.

Abstract

Training large deep neural networks on massive datasets is computationally very challenging. There has been a recent surge in interest in using large batch stochastic optimization methods to tackle this issue. The most prominent algorithm in this line of research is LARS, which by employing layerwise adaptive learning rates trains RESNET on ImageNet in a few minutes. However, LARS performs poorly for attention models like BERT, indicating that its performance gains are not consistent across tasks. In this paper, we first study a principled layerwise adaptation strategy to accelerate training of deep neural networks using large mini-batches. Using this strategy, we develop a new layerwise adaptive large batch optimization technique called LAMB; we then provide convergence analysis of LAMB as well as LARS, showing convergence to a stationary point in general nonconvex settings. Our empirical results demonstrate the superior performance of LAMB across various tasks such as BERT and RESNET-50 training with very little hyperparameter tuning. In particular, for BERT training, our optimizer enables use of very large batch sizes of 32768 without any degradation of performance. By increasing the batch size to the memory limit of a TPUv3 Pod, BERT training time can be reduced from 3 days to just 76 minutes (Table 1). The LAMB implementation is available online¹.
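The abstract names layerwise adaptation but does not spell out the update rule. The sketch below shows one step of the commonly described LAMB form: an Adam-style update per parameter, rescaled per layer by the ratio of the weight norm to the update norm. It is a rough illustration under stated assumptions, not the authors' implementation: the function name `lamb_step`, the list-of-lists layout, the default hyperparameters, the identity scaling of the weight norm, and the fallback to a trust ratio of 1 when either norm is zero are all choices made here.

```python
import math


def lamb_step(layers, grads, state, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.01):
    """Apply one LAMB-style update in place (hypothetical helper).

    layers, grads: lists of lists of floats, one inner list per layer.
    state: dict holding the step count and per-layer moment estimates.
    """
    state["t"] = state.get("t", 0) + 1
    t = state["t"]
    m_all = state.setdefault("m", [[0.0] * len(w) for w in layers])
    v_all = state.setdefault("v", [[0.0] * len(w) for w in layers])

    for w, g, m, v in zip(layers, grads, m_all, v_all):
        update = []
        for j in range(len(w)):
            # Adam-style first and second moment estimates with bias correction.
            m[j] = beta1 * m[j] + (1 - beta1) * g[j]
            v[j] = beta2 * v[j] + (1 - beta2) * g[j] ** 2
            m_hat = m[j] / (1 - beta1 ** t)
            v_hat = v[j] / (1 - beta2 ** t)
            # Per-parameter direction plus decoupled weight decay.
            update.append(m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w[j])

        # Layerwise trust ratio: ||w|| / ||update||.
        # Assumption: fall back to 1.0 when either norm is zero.
        w_norm = math.sqrt(sum(x * x for x in w))
        u_norm = math.sqrt(sum(x * x for x in update))
        trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0

        for j in range(len(w)):
            w[j] -= lr * trust * update[j]
    return layers


# Two toy layers with very different gradient scales.
weights = [[3.0, 4.0], [0.1, -0.2, 0.2]]
gradients = [[0.5, -0.5], [10.0, 10.0, -10.0]]
lamb_step(weights, gradients, {}, lr=0.1, weight_decay=0.0)
```

With weight decay set to zero, each layer moves by a distance of `lr` times its own weight norm regardless of how large its raw gradient is, which is the layerwise normalization that the abstract credits for stable training at large batch sizes.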

References [32]

[1]Deep Residual Learning for Image Recognition

K. He, X. Zhang, S. Ren, Jian Sun - 2016

20 papers in library cite

This is simply amazing. Very very simple idea, totally revolutionary. No maths, just "it works!". Amazing.

[2]BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, M. W. Chang, K. Lee, Kristina Toutanova - 2018

39 papers in library cite

Simply amazing. It's very impressive how they make a leap vs. existing stuff (you can see from the references, pretty much no one is doing what they are doing, other than GPT).

[3]On the Importance of Initialization and Momentum in Deep Learning

Ilya Sutskever, James Martens, G. Dahl, Geoffrey Hinton - 2013

13 papers in library cite

They give very good context and it's easy to understand that they are doing this as a counterpoint to HF. Surprising results as well. I just think it was made obsolete by ReLU.

[4]Large Scale Distributed Deep Networks

Jeffrey Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Quoc V. Le, Mark Z. Mao, Marc'aurelio Ranzato, A. Senior, P. Tucker, K. Yang, Andrew Y. Ng - 2012

16 papers in library cite

Good paper, nice algorithm. Nothing too crazy, but I understand the impact. I think the work to create the system was larger than the algorithm itself.

[5]Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

P. Goyal, Piotr Dollar, Ross Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, K. He - 2017

2 papers in library cite

A ton of citations! I want to see how they did it!

[6]On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

Nitish Shirish Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, P. T. P. Tang - 2016

4 papers in library cite

Adam generalization problem investigation

[7]Practical Recommendations for Gradient-Based Training of Deep Architectures

Yoshua Bengio - 2012

3 papers in library cite

[8]Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

Benjamin Recht, C. Ré, S. Wright, F. Niu - 2011

6 papers in library cite

[9]A Method of Solving a Convex Programming Problem With Convergence Rate O(1/k²)

Y. Nesterov - 1983

3 papers in library cite

[10]Measuring the Effects of Data Parallelism on Neural Network Training

C. J. Shallue, Jaehoon Lee, J. Antognini, Jascha Sohl Dickstein, R. Frostig, George E. Dahl - 2018

2 papers in library cite

[11]Optimizing Neural Networks With Kronecker-Factored Approximate Curvature

James Martens, R. Grosse - 2015

2 papers in library cite

[12]AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks

A. Devarakonda, M. Naumov, M. Garland - 2017

1 paper in library cites

[13]DAWNBench: An End-to-End Deep Learning Benchmark and Competition

C. Coleman, D. Narayanan, D. Kang, T. Z. Zhao, J. Zhang, L. Nardi, P. Bailis, K. Olukotun, C. Ré, Matei Zaharia - 2017

1 paper in library cites

[14]Don't Decay the Learning Rate, Increase the Batch Size

S. L. Smith, P. J. Kindermans, Quoc V. Le - 2017

1 paper in library cites

[15]Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes

T. Akiba, S. Suzuki, K. Fukuda - 2017

1 paper in library cites

[16]FireCaffe: Near-Linear Acceleration of Deep Neural Network Training on Compute Clusters

F. N. Iandola, M. W. Moskewicz, K. Ashraf, Kurt Keutzer - 2016

1 paper in library cites

[17]Highly Scalable Deep Learning Training System With Mixed-Precision: Training ImageNet in Four Minutes

X. Jia, S. Song, Weiran He, Yuzhi Wang, H. Rong, F. Zhou, L. Xie, Z. Guo, Yining Yang, Longhui Yu - 2018

1 paper in library cites

[18]Image Classification at Supercomputer Scale

C. Ying, S. Kumar, Deli Chen, Tianle Wang, Y. Cheng - 2018

1 paper in library cites

[19]ImageNet Training in Minutes

Y. You, Zhengyou Zhang, Cho Jui Hsieh, J. Demmel, Kurt Keutzer - 2018

1 paper in library cites

[20]ImageNet/ResNet-50 Training in 224 Seconds

H. Mikami, H. Suganuma, Y. Tanaka, Y. Kageyama - 2018

1 paper in library cites

[21]Incorporating Nesterov Momentum Into Adam

T. Dozat - 2016

1 paper in library cites

[22]Large-Batch Training for LSTM and Beyond

Y. You, J. Hseu, C. Ying, J. Demmel, Kurt Keutzer, Cho Jui Hsieh - 2019

1 paper in library cites

[23]Mini-Batch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization

S. Ghadimi, G. Lan, Haowei Zhang - 2014

1 paper in library cites

[24]Scale Out for Large Minibatch SGD: Residual Network Training on ImageNet-1K With Improved Accuracy and Reduced Time to Train

V. Codreanu, D. Podareanu, V. Saletore - 2017

1 paper in library cites

[25]Scaling Distributed Machine Learning With System and Algorithm Co-Design

M. Li - 2017

1 paper in library cites

[26]Scaling SGD Batch Size to 32K for ImageNet Training

Y. You, I. Gitman, B. Ginsburg - 2017

1 paper in library cites

[27]Second-Order Optimization Method for Large Mini-Batch: Training ResNet-50 on ImageNet in 35 Epochs

K. Osawa, Y. Tsuji, Y. Ueno, A. Naruse, R. Yokota, S. Matsuoka - 2018

1 paper in library cites

[28]signSGD: Compressed Optimisation for Non-Convex Problems

J. Bernstein, Y. X. Wang, K. Azizzadenesheli, A. Anandkumar - 2018

1 paper in library cites

[29]Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming

S. Ghadimi, G. Lan - 2013

1 paper in library cites

[30]Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming

S. Ghadimi, G. Lan - 2013

1 paper in library cites

[31]Train Longer, Generalize Better: Closing the Generalization Gap in Large Batch Training of Neural Networks

E. Hoffer, I. Hubara, D. Soudry - 2017

1 paper in library cites

[32]Yet Another Accelerated SGD: ResNet-50 Training on ImageNet in 74.7 Seconds

M. Yamazaki, A. Kasagi, A. Tabuchi, T. Honda, M. Miwa, N. Fukumoto, T. Tabaru, A. Ike, K. Nakashima - 2019

1 paper in library cites

Cited by 3 papers in your library

Cites 7 papers in your library

Read on December 28, 2025

The changes are very simple and very derivative, but they make it so complicated! Also, they don't discuss drawbacks, of which I am sure there are many!

Paper Aliases

Reducing BERT Pre-Training Time From 3 Days to 76 Minutes