Papperoni

2016

Layer Normalization

Jimmy Lei Ba, R. Kiros, Geoffrey E. Hinton

Cite Score: 89

AI summary

This paper introduces layer normalization, a new normalization method for neural networks, which computes normalization statistics from summed inputs within a layer on a single training case, improving training speed and generalization performance for RNN models.

Main Contributions

Introduces Layer Normalization, a novel normalization technique.
Layer Normalization computes normalization statistics from summed inputs within a layer on a single training case.
Layer Normalization is effective for stabilizing hidden state dynamics in RNNs.
Layer Normalization reduces training time compared to existing techniques.
Demonstrates improved generalization performance of Layer Normalization on RNN models.
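
As a sketch of the statistics the contributions above refer to, consider one layer with H hidden units on a single training case. Here a_i is the summed input to unit i, g_i and b_i are that unit's gain and bias, and f is the non-linearity. The symbol names are assumptions chosen for illustration and may not match the paper's notation exactly.

```latex
\mu = \frac{1}{H} \sum_{i=1}^{H} a_i, \qquad
\sigma = \sqrt{\frac{1}{H} \sum_{i=1}^{H} \left( a_i - \mu \right)^2}, \qquad
h_i = f\!\left( \frac{g_i}{\sigma} \left( a_i - \mu \right) + b_i \right)
```

Because the sums run over the units of the layer rather than over a mini-batch, the result for one training case does not depend on which other cases are in the batch.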

Abstract

Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. This significantly reduces the training time in feed-forward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks. In this paper, we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. Like batch normalization, we also give each neuron its own adaptive bias and gain which are applied after the normalization but before the non-linearity. Unlike batch normalization, layer normalization performs exactly the same computation at training and test times. It is also straightforward to apply to recurrent neural networks by computing the normalization statistics separately at each time step. Layer normalization is very effective at stabilizing the hidden state dynamics in recurrent networks. Empirically, we show that layer normalization can substantially reduce the training time compared with previously published techniques.
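
The computation described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names `layer_norm` and `rnn_step`, the weight layout, the choice of tanh, and the small `eps` added for numerical stability are all assumptions made for this example.

```python
import math


def layer_norm(summed_inputs, gain, bias, eps=1e-5):
    """Normalize one training case across the units of a single layer.

    The mean and variance are taken over the hidden units, not over a
    mini-batch. `eps` is an assumed stabilizer, not part of the abstract.
    """
    n_units = len(summed_inputs)
    mean = sum(summed_inputs) / n_units
    var = sum((a - mean) ** 2 for a in summed_inputs) / n_units
    std = math.sqrt(var + eps)
    # Per-neuron gain and bias are applied after the normalization.
    return [g * (a - mean) / std + b
            for a, g, b in zip(summed_inputs, gain, bias)]


def rnn_step(x_t, h_prev, w_xh, w_hh, gain, bias):
    """One recurrent step; statistics are recomputed at every time step."""
    n_hidden = len(h_prev)
    # Summed inputs to each hidden unit from the input and the previous state.
    a_t = [
        sum(w_xh[i][j] * x_t[j] for j in range(len(x_t)))
        + sum(w_hh[i][j] * h_prev[j] for j in range(n_hidden))
        for i in range(n_hidden)
    ]
    # Normalize first, then apply the non-linearity (tanh is an assumption).
    return [math.tanh(v) for v in layer_norm(a_t, gain, bias)]


if __name__ == "__main__":
    # Hypothetical toy sizes: 2 inputs, 3 hidden units, 2 time steps.
    w_xh = [[0.1, 0.2], [0.3, -0.4], [0.5, 0.6]]
    w_hh = [[0.0, 0.1, 0.0], [0.2, 0.0, -0.1], [0.0, 0.3, 0.0]]
    gain, bias = [1.0] * 3, [0.0] * 3
    h = [0.0] * 3
    for x_t in ([0.5, -1.0], [1.5, 0.25]):
        h = rnn_step(x_t, h, w_xh, w_hh, gain, bias)
        print(h)
```

Nothing in `layer_norm` refers to other training cases or to stored running averages, which is why the same code path serves both training and test time in this sketch.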

References [32]

[1]Adam: A Method for Stochastic Optimization

D. P. Kingma, Jimmy Lei Ba - 2014

49 papers in library cite

Amazing paper! Very well explained and huge impact. I am amazed that they made something so simple even when it requires a lot of background mathematical knowledge

[2]Very Deep Convolutional Networks for Large-Scale Image Recognition

K. Simonyan, Andrew Zisserman - 2014

20 papers in library cite

This is very good! The great thing here is small filters and depth analysis, but truly they do some other stuff as well: SotA, generalization for other tasks, and open source their models. Very nice.

[3]ImageNet Classification With Deep Convolutional Neural Networks

Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton - 2012

71 papers in library cite

I'm giving this a 5 just because of the impact, but this is VEEERY derivative of earlier work. Kudos to them for putting it all together, but really there's nothing revolutionary here.

[4]Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

S. Ioffe, Christian Szegedy - 2015

18 papers in library cite

Very good paper! Similar feel as ResNets: simple idea, elegant. Not too mathy

[5]Microsoft COCO: Common Objects in Context

T. Y. Lin, M. Maire, S. Belongie, James Hays, Pietro Perona, D. Ramanan, Piotr Dollar, C. L. Zitnick - 2014

14 papers in library cite

I liked this paper a lot. It's a bit long and I was already a bit tired, but it was nice overall.

[6]Efficient Estimation of Word Representations in Vector Space

Tomas Mikolov, K. Chen, G. S. Corrado, Jeffrey Dean - 2013

26 papers in library cite

Expanded word2vec. Very nice overall.

[7]Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation

Kyunghyun Cho, B. V. Merrienboer, C. G. Gulcehre, D. Bahdanau, F. Bougares, Holger Schwenk, Yoshua Bengio - 2014

38 papers in library cite

Introduces RNN encoder-decoder. I love it :)

[8]Sequence to Sequence Learning With Neural Networks

Ilya Sutskever, Oriol Vinyals, Quoc V. Le - 2014

58 papers in library cite

Good paper, but I think it only got famous because they set a new good baseline for NNs in MT. Their main contribution was reversing the source sentence TBH.

[9]Generating Sequences With Recurrent Neural Networks

Alex Graves - 2013

27 papers in library cite

Very cool, and it is the first to actually propose the Attention mechanism! It gets a bit mathy but nothing too crazy. Also has the first examples of good machine-generated writing I've seen in these papers, so very nice results.

[10]Large Scale Distributed Deep Networks

Jeffrey Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Quoc V. Le, Mark Z. Mao, Marc'aurelio Ranzato, A. Senior, P. Tucker, K. Yang, Andrew Y. Ng - 2012

16 papers in library cite

Good paper, nice algorithm. Nothing too crazy, but I understand the impact. I think the work to create the system was larger than the algorithm itself.

[11]Teaching Machines to Read and Comprehend

K. M. Hermann, T. Kocisky, Edward Grefenstette, L. Espeholt, W. Kay, M. Suleyman, Phil Blunsom - 2015

31 papers in library cite

Nice way of converting unsupervised data to train for Q&A - and nice visualizations as well :) But I think their main contribution is the dataset. Maybe with the dataset they "unlocked" summarization?

[12]Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books

Yukun Zhu, R. Kiros, R. Zemel, Ruslan Salakhutdinov, R. Urtasun, Antonio Torralba, Sanja Fidler - 2015

18 papers in library cite

I think their approach was a bit convoluted and didn't really add a lot. Main contribution here is probably BookCorpus

[13]Skip-Thought Vectors

R. Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, R. Urtasun, Antonio Torralba, Sanja Fidler - 2015

23 papers in library cite

Nice to see an alternative to Word2Vec for sentences, but I don't really like the approach. Good nonetheless.

[14]DRAW: A Recurrent Neural Network for Image Generation

K. Gregor, Ivo Danihelka, Alex Graves, D. J. Rezende, Daan Wierstra - 2015

5 papers in library cite

This is SO cool! So interesting that it was biologically inspired. First actual image generation that seems to work, also a jump from the RBM/Autoencoder stuff, and the incremental drawing is amazing!

[15]Unifying Visual-Semantic Embeddings With Multimodal Neural Language Models

R. Kiros, Ruslan Salakhutdinov, Richard S. Zemel - 2014

5 papers in library cite

What I like the most about this paper is that they can use regularities, and I also like that they put the images and text in the same embedding space by increasing the similarity of positive samples and reducing the similarity of negative ones - neat!

[16]Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups

Geoffrey E. Hinton, L. Deng, D. Yu, George E. Dahl, A. Mohamed, Navdeep Jaitly, A. Senior, Vincent Vanhoucke, P. Nguyen, T. N. Sainath, Brian Kingsbury - 2012

8 papers in library cite

The core of the paper itself is a bit boring and doesn't introduce anything new (just RBMs and DBNs again) but I am giving this a 4 because it's probably the best explanation of RBMs and DBNs I've read so far.

[17]Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

Dario Amodei, S. Ananthanarayanan, R. Anubhai, Jinze Bai, E. Battenberg, C. Case, J. Casper, Bryan Catanzaro, Q. Cheng, Guanduo Chen - 2016

3 papers in library cite

Speech recog. improved

[18]Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks

T. Salimans, Diederik P. Kingma - 2016

4 papers in library cite

Cited by a lot of people

[19]Learning Deep Structure-Preserving Image-Text Embeddings

Liwei Wang, Yin Li, Svetlana Lazebnik - 2016

1 paper in library cites

SotA image-text embeddings

[20]Order-Embeddings of Images and Language

I. Vendrov, R. Kiros, Sanja Fidler, R. Urtasun - 2016

4 papers in library cite

learning a joint embedding space of images and sentences

[21]Recurrent Batch Normalization

T. Cooijmans, Nicolas Ballas, C. Laurent, Aaron Courville - 2016

3 papers in library cite

Method for implementing BN for RNNs

[22]Batch Normalized Recurrent Neural Networks

C. Laurent, G. Pereyra, P. Brakel, Y. Z. Zhang, Yoshua Bengio - 2015

1 paper in library cites

RNN + BN

[23]Seeing Stars: Exploiting Class Relationships for Sentiment Categorization With Respect to Rating Scales

Bo Pang, L. Lee - 2005

13 papers in library cite

[24]A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts

Bo Pang, Lillian Lee - 2004

8 papers in library cite

[25]Annotating Expressions of Opinions and Emotions in Language

J. Wiebe, Theresa Wilson, Claire Cardie - 2005

7 papers in library cite

[26]SemEval-2014 Task 1: Evaluation of Compositional Distributional Semantic Models on Full Sentences Through Semantic Relatedness and Textual Entailment

Marco Marelli, L. Bentivogli, M. Baroni, R. Bernardi, S. Menini, R. Zamparelli - 2014

7 papers in library cite

[27]Mining and Summarizing Customer Reviews

M. Hu, Bing Liu - 2004

6 papers in library cite

[28]Natural Gradient Works Efficiently in Learning

S. I. Amari - 1998

6 papers in library cite

[29]The Neural Autoregressive Distribution Estimator

Hugo Larochelle, I. Murray - 2011

5 papers in library cite

[30]IAM-OnDB: An On-Line English Sentence Database Acquired From Handwritten Text on a Whiteboard

M. Liwicki, H. Bunke - 2005

3 papers in library cite

[31]Theano: A Python Framework for Fast Computation of Mathematical Expressions

The Theano Development Team, R. Al-Rfou, G. Alain, Amjad Almahairi, C. Angermueller, D. Bahdanau, Nicolas Ballas, F. Bastien, J. Bayer, A. Belikov - 2016

2 papers in library cite

[32]Path-SGD: Path-Normalized Optimization in Deep Neural Networks

Behnam Neyshabur, Ruslan Salakhutdinov, N. Srebro - 2015

1 paper in library cites

Cited by 14 papers in your library

Cites 22 papers in your library

Read on July 20, 2025

Very nice! At first I had a little bit of prejudice because it seemed way too mathy, but actually the math is easy to follow and the results are very nice.
