Papperoni

2014

One Weird Trick for Parallelizing Convolutional Neural Networks

Alex Krizhevsky

AI summary

This paper introduces a novel parallelization technique for training convolutional neural networks across multiple GPUs, leveraging data parallelism for convolutional layers and model parallelism for fully-connected layers, achieving better scaling than existing alternatives.
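The split described in the summary can be illustrated with a toy forward pass. This is a minimal sketch and not the paper's implementation: plain Python lists stand in for GPU tensors, `conv_stage` is a hypothetical per-example stand-in for the convolutional layers, and the whole batch is gathered at the switch point, which is only one possible way to move from data parallelism to model parallelism.

```python
def split(items, parts):
    """Split a list into `parts` contiguous chunks (later chunks may be shorter)."""
    size = -(-len(items) // parts)  # ceiling division
    return [items[i * size:(i + 1) * size] for i in range(parts)]

def conv_stage(example):
    # Hypothetical stand-in for the convolutional layers: acts on one example at a time.
    return [max(0.0, 2.0 * x - 1.0) for x in example]

def fc_stage(features, weight_columns):
    # Fully-connected layer restricted to the given output units (one column per unit).
    return [sum(f * w for f, w in zip(features, col)) for col in weight_columns]

def hybrid_forward(batch, weight_columns, num_workers):
    # Data parallelism: each worker runs the conv stage on its own shard of the batch.
    conv_out = [[conv_stage(x) for x in shard] for shard in split(batch, num_workers)]
    # Switch point: every worker receives the conv features of the whole batch.
    gathered = [feat for shard in conv_out for feat in shard]
    # Model parallelism: each worker owns a slice of the fully-connected output units.
    col_shards = split(weight_columns, num_workers)
    partial = [[fc_stage(feat, cols) for feat in gathered] for cols in col_shards]
    # Concatenate each example's partial outputs across workers, in worker order.
    return [[v for worker in partial for v in worker[n]] for n in range(len(gathered))]

def single_device_forward(batch, weight_columns):
    # Reference result computed without any parallelism.
    return [fc_stage(conv_stage(x), weight_columns) for x in batch]

# Made-up data: 4 examples with 2 features each, and 3 fully-connected output units.
batch = [[0.0, 1.0], [1.0, 2.0], [0.5, 3.0], [2.0, 0.0]]
weight_columns = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]

print(hybrid_forward(batch, weight_columns, 2) == single_device_forward(batch, weight_columns))
```

The final comparison checks that sharding the batch for the conv stage and sharding the output units for the fully-connected stage reproduces the single-device result.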

Main Contributions

Introduces a hybrid parallelization strategy combining data parallelism for convolutional layers and model parallelism for fully-connected layers.
Presents three schemes for implementing model parallelism in fully-connected layers, analyzing their communication costs and suitability for different hardware configurations.
Shows that variable batch sizes, with smaller batches for fully-connected layers, can lead to faster convergence and better minima.
Reports experimental results on ImageNet 2012 demonstrating good scaling with the proposed parallelization scheme.
Discusses the accuracy cost with large batch sizes and how it can be reduced using the variable batch size technique.
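A rough way to see why the two layer types favour different schemes is to compare how much each scheme has to communicate per training step. The sketch below uses a deliberately crude proxy: data parallelism exchanges one gradient value per weight, while model parallelism exchanges one activation per output unit per example. The layer shapes and the batch size of 128 are illustrative assumptions, not figures taken from the paper, and the function names are hypothetical.

```python
def data_parallel_cost(num_weights):
    # Proxy: each worker must synchronise a gradient for every weight in the layer.
    return num_weights

def model_parallel_cost(batch_size, activations_per_example):
    # Proxy: each worker must exchange the layer's activations for every example.
    return batch_size * activations_per_example

def cheaper_scheme(num_weights, activations_per_example, batch_size):
    # Pick whichever scheme moves fewer values under this simplified model.
    data = data_parallel_cost(num_weights)
    model = model_parallel_cost(batch_size, activations_per_example)
    return "data" if data <= model else "model"

# Assumed convolutional layer: 3x3 kernels, 256 input and 384 output channels,
# producing a 13x13 output map per channel.
conv_weights = 3 * 3 * 256 * 384
conv_acts = 13 * 13 * 384

# Assumed fully-connected layer: 4096 inputs and 4096 outputs.
fc_weights = 4096 * 4096
fc_acts = 4096

print("conv layer:", cheaper_scheme(conv_weights, conv_acts, 128))
print("fc layer:", cheaper_scheme(fc_weights, fc_acts, 128))
```

Under these assumptions the convolutional layer has few weights but large activations, so data parallelism moves less data, while the fully-connected layer has many weights but small activations, so model parallelism moves less.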

Abstract

I present a new way to parallelize the training of convolutional neural networks across multiple GPUs. The method scales significantly better than all alternatives when applied to modern convolutional neural networks.

References [7]

[1]ImageNet Classification With Deep Convolutional Neural Networks

Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton - 2012

71 papers in library cite

I'm giving this a 5 just because of the impact, but this is VERY derivative of earlier work. Kudos to them for putting it all together, but really there's nothing revolutionary here.

[2]ImageNet: A Large-Scale Hierarchical Image Database

J. Deng, W. Dong, Richard Socher, L.-J. Li, K. Li, Li Fei-Fei - 2009

28 papers in library cite

Very nice idea and huge impact!

[3]Large Scale Distributed Deep Networks

Jeffrey Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, A. Senior, P. Tucker, K. Yang, Andrew Y. Ng - 2012

16 papers in library cite

Good paper, nice algorithm. Nothing too crazy, but I understand the impact. I think the work to create the system was larger than the algorithm itself.

[4]Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

Benjamin Recht, C. Ré, S. Wright, F. Niu - 2011

6 papers in library cite

[5]Deep Learning with COTS HPC Systems

A. Coates, B. Huval, Tao Wang, D. Wu, Bryan Catanzaro, Andrew Y. Ng - 2013

2 papers in library cite

[6]GPU Asynchronous Stochastic Gradient Descent to Speed Up Neural Network Training

T. Paine, H. Jin, Jianchao Yang, Zhe Lin, T. Huang - 2013

1 paper in library cites

[7]Multi-GPU Training of ConvNets

O. Yadan, K. Adams, Y. Taigman, Marc'Aurelio Ranzato - 2013

1 paper in library cites

Read on July 25, 2025

Very nice paper. It takes a good new approach to training and explains the idea and rationale well. It doesn't have many citations, but the approach was used by SotA CNNs at the time.