Papperoni

2015

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

S. Ioffe, Christian Szegedy

Cite Score

98

AI summary

This paper introduces batch normalization, a technique that accelerates the training of deep neural networks by reducing internal covariate shift. An ensemble of batch-normalized networks reaches 4.82% top-5 test error on ImageNet classification, the best published result at the time.

Main Contributions

Introduces batch normalization, a new technique for normalizing layer inputs during training.
Demonstrates that batch normalization allows for the use of higher learning rates and reduces the need for careful initialization.
Shows that batch normalization can eliminate the need for Dropout in some cases.
Achieves state-of-the-art results on ImageNet classification, with an ensemble of batch-normalized networks reaching a top-5 test error of 4.82%.
Provides an algorithm for constructing, training, and performing inference with batch-normalized networks.
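
The difference between training and inference mentioned in the last contribution can be sketched in a few lines of plain Python. This is a minimal illustration for a single scalar feature, not the paper's exact procedure: the function names, the `running` dict, the `momentum` value, and `EPS` are all assumptions made for the example, and the population statistics are tracked with an exponential moving average for simplicity.

```python
import math

EPS = 1e-5  # assumed stability constant; not a value taken from this page


def batch_norm_train(batch, gamma, beta, running, momentum=0.9):
    """Normalize one feature over a mini-batch and update running statistics.

    `running` is a dict with keys "mean" and "var". The moving-average update
    and `momentum` are illustrative assumptions.
    """
    m = len(batch)
    mean = sum(batch) / m
    var = sum((x - mean) ** 2 for x in batch) / m  # biased mini-batch variance
    running["mean"] = momentum * running["mean"] + (1 - momentum) * mean
    running["var"] = momentum * running["var"] + (1 - momentum) * var
    # Scale and shift with the learned parameters gamma and beta.
    return [gamma * (x - mean) / math.sqrt(var + EPS) + beta for x in batch]


def batch_norm_infer(x, gamma, beta, running):
    """Normalize a single activation with fixed, previously tracked statistics."""
    return gamma * (x - running["mean"]) / math.sqrt(running["var"] + EPS) + beta


# Usage: one training step on a toy mini-batch, then one inference call.
running = {"mean": 0.0, "var": 1.0}
train_out = batch_norm_train([1.0, 2.0, 3.0, 4.0], gamma=1.0, beta=0.0, running=running)
test_out = batch_norm_infer(2.5, gamma=1.0, beta=0.0, running=running)
```

During training the output depends on the other examples in the mini-batch; at inference it depends only on the input and the stored statistics, which is why the two paths are kept separate here.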

Abstract

Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization, and in some cases eliminates the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.82% top-5 test error, exceeding the accuracy of human raters.
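
For reference, the per-mini-batch normalization that the abstract describes can be written out as below. This is a sketch for a single activation $x$ over a mini-batch of size $m$; the symbols $\epsilon$ (a small stability constant) and $\gamma$, $\beta$ (learned scale and shift) do not appear in the abstract and are given here as recalled from the paper.

```latex
\begin{align}
\mu_B      &= \frac{1}{m} \sum_{i=1}^{m} x_i \\
\sigma_B^2 &= \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2 \\
\hat{x}_i  &= \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \\
y_i        &= \gamma \hat{x}_i + \beta
\end{align}
```

The learned $\gamma$ and $\beta$ let the network undo the normalization if that turns out to be the better representation.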

References [24]

[1]Gradient-Based Learning Applied to Document Recognition

Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner - 1998

62 papers in library cite

I absolutely hated this paper. It has ~50 pages but feels like 200. It takes too long to explain some things and often just repeats itself. It also doesn't seem to add much on top of LeNet-5, and it focuses a lot on GTNs, which really didn't stick.

[2]Going Deeper With Convolutions

Christian Szegedy, Wei Liu, Y. Jia, P. Sermanet, S. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich - 2015

20 papers in library cite

Introduced the Inception architecture, which is nice. The paper is quite good, but I had to google some stuff to understand it fully. Nice contribution and SotA, but TBH I felt it wasn't that good of a read.

[3]Dropout: A Simple Way to Prevent Neural Networks From Overfitting

N. Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan R. Salakhutdinov - 2014

20 papers in library cite

Good paper, but it's mostly a review of the method described in the earlier dropout paper, with more results. It's longer as well, so I would suggest just reading the earlier one.

[4]Understanding the Difficulty of Training Deep Feedforward Neural Networks

Xavier Glorot, Yoshua Bengio - 2010

20 papers in library cite

Nice but underwhelming results (they still underperform vs. pretraining). I also didn't really like the way it's written. It's not bad, it's just a bit clunky. Worth the read though.

[5]Rectified Linear Units Improve Restricted Boltzmann Machines

V. Nair, Geoffrey E. Hinton - 2010

18 papers in library cite

I hate it when people introduce a new idea but don't bother to explain it! This is terrible compared to Bengio's paper.

[6]Delving Deep Into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

K. He, X. Zhang, S. Ren, Jian Sun - 2015

10 papers in library cite

I think the PReLU idea didn't catch on, but the initialization is very nice! Good read.

[7]Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

John Duchi, Elad Hazan, Yoram Singer - 2011

19 papers in library cite

I actually skimmed through most of this. It's not a bad paper, but it's a math paper, not AI.

[8]On the Difficulty of Training Recurrent Neural Networks

Razvan Pascanu, Tomas Mikolov, Yoshua Bengio - 2013

21 papers in library cite

It starts very mathy but in the end there are some very nice contributions! You don't actually need to understand the math to know what's going on in the end.

[9]Efficient Backprop

Yann LeCun, Léon Bottou, G. B. Orr, Klaus-Robert Müller - 1998

20 papers in library cite

The first half is very very good. The remainder is very boring.

[10]On the Importance of Initialization and Momentum in Deep Learning

Ilya Sutskever, James Martens, G. Dahl, Geoffrey Hinton - 2013

13 papers in library cite

They give very good context, and it's easy to see that they are doing this as a counterpoint to HF (Hessian-free optimization). Surprising results as well. I just think it was made obsolete by ReLU.

[11]Large Scale Distributed Deep Networks

Jeffrey Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, A. Senior, P. Tucker, K. Yang, Andrew Y. Ng - 2012

16 papers in library cite

Good paper, nice algorithm. Nothing too crazy, but I understand the impact. I think the work to create the system was larger than the algorithm itself.

[12]Exact Solutions to the Nonlinear Dynamics of Learning in Deep Linear Neural Networks

Andrew M. Saxe, James L. McClelland, Surya Ganguli - 2014

9 papers in library cite

TBH it's been almost 2 months since I read this paper (shame on me for forgetting to add it)... Anyway, as I recall, I liked it, but it's a bit underwhelming because it only solves the problem for linear networks.

[13]Deep Learning Made Easier by Linear Transformations in Perceptrons

Tapani Raiko, Harri Valpola, Yann LeCun - 2012

7 papers in library cite

Kudos for introducing shortcut connections (which would become important in the future), but to me it seems a bit mid.

[14]ImageNet Large Scale Visual Recognition Challenge

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Zhiheng Huang, A. Karpathy, A. Khosla, M. Bernstein, Alexander C. Berg, Li Fei-Fei - 2014

18 papers in library cite

ImageNet dataset challenge paper.

[15]Knowledge Matters: Importance of Prior Information for Optimization

C. G. Gulcehre, Yoshua Bengio - 2013

3 papers in library cite

Batch norm: "Our method bears similarity to the standardization layer"

[16]Nonlinear Image Representation Using Divisive Normalization

S. Lyu, E. Simoncelli - 2008

3 papers in library cite

[17]Deep Image: Scaling Up Image Recognition

R. Wu, Y. Shan, G. Sun - 2015

2 papers in library cite

[18]Improving Predictive Inference Under Covariate Shift by Weighting the Log-Likelihood Function

H. Shimodaira - 2000

2 papers in library cite

[19]Natural Neural Networks

G. Desjardins, Karen Simonyan, Razvan Pascanu, Koray Kavukcuoglu - 2015

2 papers in library cite

[20]A Convergence Analysis of Log-Linear Training

S. Wiesler, Hermann Ney - 2011

1 paper in library cites

[21]A Literature Survey on Domain Adaptation of Statistical Classifiers

J. J. Jiang - 2008

1 paper in library cites

[22]Independent Component Analysis: Algorithms and Applications

A. Hyvarinen, E. Oja - 2000

1 paper in library cites

[23]Mean-Normalized Stochastic Gradient for Large-Scale Deep Learning

S. Wiesler, A. Richard, R. Schluter, Hermann Ney - 2014

1 paper in library cites

[24]Parallel Training of Deep Neural Networks With Natural Gradient and Parameter Averaging

D. Povey, X. Zhang, Sanjeev Khudanpur - 2014

1 paper in library cites

Cited by

18

papers in your library

Cites

15

papers in your library

Read

on July 19, 2025

Very good paper! Similar feel to ResNets: a simple, elegant idea. Not too mathy.
