Papperoni

2016

Understanding Deep Learning Requires Rethinking Generalization

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals

citations

Cite Score: 76

AI summary

This paper challenges traditional views of generalization by demonstrating that large deep neural networks, such as Inception V3 and AlexNet trained on CIFAR10 and ImageNet, can perfectly fit random labels and even random noise inputs. This suggests that current complexity measures and explicit regularization do not adequately explain their generalization performance.
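
The randomization idea in the summary can be sketched in a few lines of NumPy. This is only an illustration, not the authors' code: the helper names `corrupt_labels` and `gaussian_pixels` are hypothetical, and the arrays below are random stand-ins rather than real CIFAR10 data.

```python
import numpy as np

def corrupt_labels(labels, num_classes, p, rng):
    """With probability p, replace each label by a uniformly random class."""
    labels = np.asarray(labels)
    mask = rng.random(labels.shape) < p
    random_labels = rng.integers(0, num_classes, size=labels.shape)
    return np.where(mask, random_labels, labels)

def gaussian_pixels(images, rng):
    """Replace all images with Gaussian noise matching the dataset mean and std."""
    images = np.asarray(images, dtype=np.float64)
    return rng.normal(images.mean(), images.std(), size=images.shape)

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=1000)    # stand-in for 10-class labels (assumption)
images = rng.random((1000, 32, 32, 3))     # stand-in for 32x32 RGB images (assumption)

untouched = corrupt_labels(labels, 10, p=0.0, rng=rng)     # p = 0: true labels
fully_random = corrupt_labels(labels, 10, p=1.0, rng=rng)  # p = 1: pure random labels
noise = gaussian_pixels(images, rng)                       # inputs carry no structure
```

Training the same network on `fully_random` or `noise` instead of the real data is the kind of test the summary describes: if training error still reaches zero, the model's capacity alone cannot explain its generalization on real labels.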

Main Contributions

Deep neural networks can reach zero training error on completely random labels, and even on random pixels, with or without explicit regularization.
Traditional generalization theories (VC-dimension, Rademacher complexity, uniform stability) fail to explain why neural networks generalize well in practice.
Explicit regularization (weight decay, dropout, data augmentation) may improve generalization, but it is neither necessary nor by itself sufficient for controlling it.
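A theoretical construction shows that a simple depth-two ReLU network has perfect finite-sample expressivity once its number of parameters exceeds the number of data points: 2n + d weights suffice to represent any labeling of n points in d dimensions.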
Implicit regularization, such as that provided by SGD, is suggested as a potential factor in generalization; for linear models, SGD converges to a minimum ℓ2-norm solution.
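
The last point can be checked numerically on a small overparameterized linear model. The sketch below is an assumption-laden illustration, not the paper's experiment: it uses full-batch gradient descent in place of SGD, synthetic Gaussian inputs, and random ±1 labels. Starting from zero, every update is a combination of the data rows, so the iterate stays in their span and the interpolating solution it reaches should be the minimum-norm one, which `np.linalg.pinv` computes directly.

```python
import numpy as np

def gd_least_squares(X, y, steps=2000):
    """Full-batch gradient descent on 0.5 * ||Xw - y||^2, starting from w = 0."""
    w = np.zeros(X.shape[1])
    # Step size 1/L, where L = sigma_max(X)^2 is the gradient's Lipschitz constant.
    lr = 1.0 / np.linalg.norm(X, 2) ** 2
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y)
    return w

rng = np.random.default_rng(0)
n, d = 20, 100                           # more parameters than data points
X = rng.standard_normal((n, d))          # synthetic inputs (assumption)
y = rng.choice([-1.0, 1.0], size=n)      # random labels

w_gd = gd_least_squares(X, y)
w_min = np.linalg.pinv(X) @ y            # minimum l2-norm interpolating solution

train_residual = np.max(np.abs(X @ w_gd - y))   # random labels are fit exactly
gap = np.linalg.norm(w_gd - w_min)              # distance to the min-norm solution
```

Both `train_residual` and `gap` should be numerically negligible here: the model fits random labels perfectly, and among the infinitely many interpolating solutions gradient descent picks the one with the smallest norm.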

Abstract

Despite their massive size, successful deep artificial neural networks can exhibit a remarkably small difference between training and test performance. Conventional wisdom attributes small generalization error either to properties of the model family, or to the regularization techniques used during training. Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice. Specifically, our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization, and occurs even if we replace the true images by completely unstructured random noise. We corroborate these experimental findings with a theoretical construction showing that simple depth two neural networks already have perfect finite sample expressivity as soon as the number of parameters exceeds the number of data points as it usually does in practice. We interpret our experimental findings by comparison with traditional models.
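
The depth-two expressivity claim in the abstract can be made concrete with a short sketch. It follows the interleaving idea as I understand it: project the points onto a direction `a`, place one bias between each pair of consecutive projections, and solve the resulting lower-triangular system for the output weights. The function names and the random choice of `a` are my own assumptions for illustration.

```python
import numpy as np

def fit_depth_two_relu(Z, y, rng):
    """Return (a, b, w) such that predict(Z, a, b, w) reproduces y on the sample."""
    n, d = Z.shape
    a = rng.standard_normal(d)      # random direction; projections distinct almost surely
    x = Z @ a
    order = np.argsort(x)
    xs = x[order]
    # Interleave biases: b_1 < x_(1) < b_2 < x_(2) < ... < b_n < x_(n)
    b = np.empty(n)
    b[0] = xs[0] - 1.0
    b[1:] = (xs[:-1] + xs[1:]) / 2.0
    # A[i, j] = relu(x_(i) - b_j) is lower triangular with a positive diagonal,
    # hence invertible, so any target vector can be matched.
    A = np.maximum(xs[:, None] - b[None, :], 0.0)
    w = np.linalg.solve(A, y[order])
    return a, b, w

def predict(Z, a, b, w):
    """Depth-two network: sum_j w_j * relu(<a, z> - b_j)."""
    return np.maximum((Z @ a)[:, None] - b[None, :], 0.0) @ w

rng = np.random.default_rng(0)
n, d = 30, 5
Z = rng.standard_normal((n, d))                     # synthetic sample (assumption)
y = rng.integers(0, 10, size=n).astype(float)       # arbitrary (random) labels
a, b, w = fit_depth_two_relu(Z, y, rng)
max_error = np.max(np.abs(predict(Z, a, b, w) - y))
```

The network uses d weights in `a`, n biases in `b`, and n output weights in `w`, i.e. 2n + d parameters in total. Note that this only shows the network can memorize the sample; it says nothing about how it behaves on new points.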

References [32]

[1]Deep Residual Learning for Image Recognition

K. He, X. Zhang, S. Ren, Jian Sun - 2016

20 papers in library cite

This is simply amazing. Very very simple idea, totally revolutionary. No maths, just "it works!". Amazing.

[2]ImageNet Classification With Deep Convolutional Neural Networks

Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton - 2012

71 papers in library cite

I'm giving this a 5 just because of the impact, but this is VEEERY derivative of earlier work. Kudos to them for putting it all together, but really there's nothing revolutionary here.

[3]Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

S. Ioffe, Christian Szegedy - 2015

18 papers in library cite

Very good paper! Similar feel to ResNets: a simple, elegant idea. Not too mathy.

[4]Dropout: A Simple Way to Prevent Neural Networks From Overfitting

N. Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan R. Salakhutdinov - 2014

20 papers in library cite

Good paper, but it's mostly a review of the method described in the other paper with more results. It's longer as well, so I would suggest just reading the other one.

[5]Rethinking the Inception Architecture for Computer Vision

Christian Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Zbigniew Wojna - 2015

5 papers in library cite

It's nice to see all of the performance optimizations they do, but it's very derivative.

[6]Learning Multiple Layers of Features From Tiny Images

Alex Krizhevsky - 2009

27 papers in library cite

It's alright. It mainly focuses on RBMs and their features and the actual part that describes the dataset is like 1 page. However, it's maybe the best intuitive description of an RBM I have seen. Other than that, it reads very much like an undergraduate thesis.

[7]TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

M. Abadi, Ashish Agarwal, P. Barham, E. Brevdo, Zhifeng Chen, C. Citro, G. Corrado, A. Davis, Jeffrey Dean, M. Devin, Sanjay Ghemawat, I. Goodfellow, A. Harp, Geoffrey Irving, M. Isard, Y. Jia, R. Jozefowicz, Lukasz Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, Christopher Olah, M. Schuster, J. Shlens, B. Steiner, Ilya Sutskever, K. Talwar, P. Tucker, Vincent Vanhoucke, V. Vasudevan, F. Viegas, Oriol Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, Xiaoqiang Zheng - 2015

11 papers in library cite

This should be the gold standard for what framework papers should be. It's long, but it's not boring at all. It explains the core concepts without going so deep as to describe unimportant things, and it explains design decisions and shortcomings... overall amazing.

[8]Imagenet Large Scale Visual Recognition Challenge

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Zhiheng Huang, A. Karpathy, A. Khosla, M. Bernstein - 2014

18 papers in library cite

The ImageNet dataset challenge paper.

[9]Statistical Learning Theory

V. N. Vapnik - 1998

10 papers in library cite

Book, but there's a 12-page overview.

[10]The Loss Surfaces of Multilayer Networks

A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, Yann Lecun - 2015

4 papers in library cite

[11]Approximation by Superpositions of a Sigmoidal Function

G. Cybenko - 1989

2 papers in library cite

[12]Shallow vs. Deep Sum-Product Networks

O. Delalleau, Yoshua Bengio - 2011

2 papers in library cite

[13]A Generalized Representer Theorem

B. Schölkopf, R. Herbrich, A. J. Smola - 2001

1 paper in library cites

[14]Approximation Properties of a Multilayered Feedforward Artificial Neural Network

H. N. Mhaskar - 1993

1 paper in library cites

[15]Benefits of Depth in Neural Networks

M. Telgarsky - 2016

1 paper in library cites

[16]Convolutional Rectifier Networks as Generalized Tensor Decompositions

N. Cohen, A. Shashua - 2016

1 paper in library cites

[17]Deep vs. Shallow Networks : An Approximation Theory Perspective

H. Mhaskar, T. A. Poggio - 2016

1 paper in library cites

[18]General Conditions for Predictivity in Learning Theory

T. Poggio, R. Rifkin, S. Mukherjee, P. Niyogi - 2004

1 paper in library cites

[19]Generalization Properties and Implicit Regularization for Multiple Passes SGM

Junhong Lin, R. Camoriano, L. Rosasco - 2016

1 paper in library cites

[20]In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning

Behnam Neyshabur, R. Tomioka, N. Srebro - 2014

1 paper in library cites

[21]Learnability, Stability and Uniform Convergence

S. Shalev-Shwartz, O. Shamir, N. Srebro, K. Sridharan - 2010

1 paper in library cites

[22]Learning Feature Representations With K-Means

A. Coates, Andrew Y. Ng - 2012

1 paper in library cites

[23]Norm-Based Capacity Control in Neural Networks

Behnam Neyshabur, R. Tomioka, N. Srebro - 2015

1 paper in library cites

[24]On Early Stopping in Gradient Descent Learning

Y. Yao, L. Rosasco, A. Caponnetto - 2007

1 paper in library cites

[25]On the Computational Efficiency of Training Neural Networks

R. Livni, S. Shalev-Shwartz, O. Shamir - 2014

1 paper in library cites

[26]Rademacher and Gaussian Complexities: Risk Bounds and Structural Results

P. L. Bartlett, S. Mendelson - 2003

1 paper in library cites

[27]Randomization Tests

E. Edgington, P. Onghena - 2007

1 paper in library cites

[28]Stability and Generalization

O. Bousquet, A. Elisseeff - 2002

1 paper in library cites

[29]Statistical Learning: Stability Is Sufficient for Generalization and Necessary and Sufficient for Consistency of Empirical Risk Minimization

S. Mukherjee, P. Niyogi, T. Poggio, R. Rifkin - 2002

1 paper in library cites

[30]The Power of Depth for Feedforward Neural Networks

R. Eldan, O. Shamir - 2016

1 paper in library cites

[31]The Sample Complexity of Pattern Classification With Neural Networks - The Size of the Weights Is More Important Than the Size of the Network

P. L. Bartlett - 1998

1 paper in library cites

[32]Train Faster, Generalize Better: Stability of Stochastic Gradient Descent

Moritz Hardt, Benjamin Recht, Yoram Singer - 2016

1 paper in library cites

Cited by 2 papers in your library

Cites 9 papers in your library

Read on February 17, 2026

I like the push to rethink generalization, and the paper is very intuitive to read. It is a bit incredible that it took so long for people to realize this; it seems so obvious!