2015

Rethinking the Inception Architecture for Computer Vision

Zbigniew Wojna

Cite Score

96

AI summary

This paper introduces a new Inception architecture for computer vision that utilizes factorized convolutions and aggressive regularization. The proposed Inception-v3 achieves state-of-the-art results on the ILSVRC 2012 classification challenge with 21.2% top-1 and 5.6% top-5 error for single frame evaluation. An ensemble of 4 models and multi-crop evaluation achieves 3.5% top-5 error and 17.3% top-1 error.

Main Contributions

Introduces design principles for scaling up convolutional networks, emphasizing computational efficiency and parameter count.
Presents a novel Inception architecture (Inception-v3) that utilizes factorized convolutions and aggressive regularization.
Achieves a new state-of-the-art top-1 error rate of 21.2% and top-5 error rate of 5.6% on the ILSVRC 2012 classification benchmark for single frame evaluation.
Demonstrates that high-quality results can be achieved with relatively low receptive field resolution (79x79), which is beneficial for detecting small objects.
Shows that combining lower parameter counts with batch-normalized auxiliary classifiers and label smoothing enables the training of high-quality networks on modest-sized training sets.
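Label smoothing, named in the last contribution above, can be sketched in a few lines. This is a minimal illustration, assuming the common formulation in which the one-hot target is mixed with a uniform distribution over the classes. The function name `smooth_labels`, the default `epsilon=0.1`, and the 4-class example are illustrative assumptions, not details taken from this page.

```python
def smooth_labels(true_class, num_classes, epsilon=0.1):
    """Return a smoothed target distribution.

    Each class receives epsilon / num_classes, and the true class
    additionally receives the remaining (1 - epsilon) of the mass.
    """
    uniform_share = epsilon / num_classes
    targets = [uniform_share] * num_classes
    targets[true_class] += 1.0 - epsilon
    return targets


# Hypothetical 4-class problem where the correct class is index 2.
targets = smooth_labels(true_class=2, num_classes=4, epsilon=0.1)
print(targets)
```

The smoothed targets still sum to 1, but the true class no longer gets probability exactly 1, which discourages the network from becoming overconfident.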

Abstract

Convolutional networks are at the core of most state-of-the-art computer vision solutions for a wide variety of tasks. Since 2014, very deep convolutional networks have started to become mainstream, yielding substantial gains in various benchmarks. Although increased model size and computational cost tend to translate to immediate quality gains for most tasks (as long as enough labeled data is provided for training), computational efficiency and low parameter count are still enabling factors for various use cases such as mobile vision and big-data scenarios. Here we explore ways to scale up networks that aim to use the added computation as efficiently as possible through suitably factorized convolutions and aggressive regularization. We benchmark our methods on the ILSVRC 2012 classification challenge validation set and demonstrate substantial gains over the state of the art: 21.2% top-1 and 5.6% top-5 error for single-frame evaluation using a network with a computational cost of 5 billion multiply-adds per inference and fewer than 25 million parameters. With an ensemble of 4 models and multi-crop evaluation, we report 3.5% top-5 error and 17.3% top-1 error.
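To make the idea of factorized convolutions concrete, the sketch below counts the weights of a large filter against a stack of smaller filters covering the same receptive field. It assumes equal input and output channel counts and ignores biases. The helper names and the channel count of 64 are illustrative assumptions, not values from the abstract.

```python
def conv_weights(kernel_h, kernel_w, in_channels, out_channels):
    """Number of weights in one convolution layer (biases ignored)."""
    return kernel_h * kernel_w * in_channels * out_channels


def stacked_weights(kernels, channels):
    """Total weights of a stack of convolutions with a fixed channel count."""
    return sum(conv_weights(h, w, channels, channels) for h, w in kernels)


channels = 64  # arbitrary illustrative value

# One 5x5 convolution versus two stacked 3x3 convolutions.
full_5x5 = stacked_weights([(5, 5)], channels)
two_3x3 = stacked_weights([(3, 3), (3, 3)], channels)

# One 3x3 convolution versus an asymmetric 1x3 followed by 3x1.
full_3x3 = stacked_weights([(3, 3)], channels)
asymmetric = stacked_weights([(1, 3), (3, 1)], channels)

print(f"5x5 -> two 3x3: {1 - two_3x3 / full_5x5:.0%} fewer weights")
print(f"3x3 -> 1x3 + 3x1: {1 - asymmetric / full_3x3:.0%} fewer weights")
```

Because the channel count cancels out of the ratio, the relative saving depends only on the kernel shapes: 18 versus 25 weights per channel pair in the first case, and 6 versus 9 in the second.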

References [23]

[1]Very Deep Convolutional Networks for Large-Scale Image Recognition

K. Simonyan, Andrew Zisserman - 2014

20 papers in library cite

This is very good! The great thing here is small filters and depth analysis, but truly they do some other stuff as well: SotA, generalization for other tasks, and open source their models. Very nice.

[2]ImageNet Classification With Deep Convolutional Neural Networks

Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton - 2012

71 papers in library cite

I'm giving this a 5 just because of the impact, but this is VEEERY derivative of earlier work. Kudos to them for putting it all together, but really there's nothing revolutionary here.

[3]Going Deeper With Convolutions

Christian Szegedy, Weizhou Liu, Y. Jia, P. Sermanet, S. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich - 2015

20 papers in library cite

Introduced the Inception architecture, which is nice. The paper is quite good, but I had to google some stuff to understand it fully. Nice contribution and SotA, but TBH I felt that it wasn't toooo good of a read.

[4]Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

S. Ioffe, Christian Szegedy - 2015

18 papers in library cite

Very good paper! Similar feel as ResNets: simple idea, elegant. Not too mathy

[5]Fully Convolutional Networks for Semantic Segmentation

J. Long, E. Shelhamer, Trevor Darrell - 2015

7 papers in library cite

I didn't really like the way the paper is written and the results seem a bit underwhelming. However, it's nice that they do it in a fully convolutional way and that they increase the speed.

[6]Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation

Ross Girshick, J. Donahue, Trevor Darrell, Jitendra Malik - 2014

18 papers in library cite

Good results, beat OverFeat, used pretraining to improve performance. The only issue is that the paper is overly long...

[7]Delving Deep Into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

K. He, X. Zhang, S. Ren, Jian Sun - 2015

10 papers in library cite

I think the PReLU idea didn't catch on, but the initialization is very nice! Good read.

[8]TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

M. Abadi, Akshat Agarwal, P. Barham, E. Brevdo, Ziru Chen, C. Citro, G. Corrado, A. Davis, Jeffrey Dean, M. Devin, Sanjay Ghemawat, I. Goodfellow, A. Harp, Geoffrey Irving, M. Isard, Y. Jia, R. Jozefowicz, Lukasz Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, Christopher Olah, M. Schuster, J. Shlens, B. Steiner, Ilya Sutskever, K. Talwar, P. Tucker, Vincent Vanhoucke, V. Vasudevan, F. Viegas, Oriol Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, Xiaoqiang Zheng - 2015

11 papers in library cite

This should be the gold standard for what framework papers should be. It's large, but it's not boring at all. It explains the core concepts without going so deep as to describe unimportant things, and it explains design decisions and shortcomings... overall amazing.

[9]Large-Scale Video Classification With Convolutional Neural Networks

A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, Li Fei Fei - 2014

2 papers in library cite

I liked it a lot. It's nothing "wow", but a very nice approach, and apparently the first to apply CNNs to video, which is nice :)

[10]On the Difficulty of Training Recurrent Neural Networks

Razvan Pascanu, Tomas Mikolov, Yoshua Bengio - 2013

21 papers in library cite

It starts very mathy but in the end there are some very nice contributions! You don't actually need to understand the math to know what's going on in the end.

[11]On the Importance of Initialization and Momentum in Deep Learning

Ilya Sutskever, James Martens, G. Dahl, Geoffrey Hinton - 2013

13 papers in library cite

They give very good context and it's easy to understand that they are doing this as a counterpoint to HF. Surprising results as well. I just think it was made obsolete by ReLU.

[12]Deeply-Supervised Nets

Chen Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, Zhuowen Tu - 2014

8 papers in library cite

Amazing potential, very nice idea, but bad writing.

[13]Imagenet Large Scale Visual Recognition Challenge

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Zhongqiang Huang, A. Karpathy, A. Khosla, M. Bernstein - 2014

18 papers in library cite

ImageNet dataset challenge paper.

[14]Compressing Neural Networks With the Hashing Trick

Weizhu Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, Yanru Chen - 2015

2 papers in library cite

Network compression?

[15]Fast Algorithms for Convolutional Neural Networks

A. Lavin - 2015

3 papers in library cite

Optimization trick

[16]Lecture 6.5 - Rmsprop: Divide the Gradient by a Running Average of Its Recent Magnitude

T. Tieleman, Geoffrey Hinton - 2012

4 papers in library cite

[17]Scalable Object Detection Using Deep Neural Networks

Dumitru Erhan, Christian Szegedy, A. Toshev, Dragomir Anguelov - 2014

4 papers in library cite

[18]Deeppose: Human Pose Estimation via Deep Neural Networks

A. Toshev, Christian Szegedy - 2014

2 papers in library cite

[19]Facenet: A Unified Embedding for Face Recognition and Clustering

F. Schroff, D. Kalenichenko, J. Philbin - 2015

1 paper in library cites

[20]Learning a Deep Compact Image Representation for Visual Tracking

N. Wang, D. Y. Yeung - 2013

1 paper in library cites

[21]Learning a Deep Convolutional Network for Image Super-Resolution

C. Dong, C. C. Loy, K. He, X. Tang - 2014

1 paper in library cites

[22]Ontological Supervision for Fine Grained Classification of Street View Storefronts

Y. M. Attias, Q. Yu, M. C. Stumpe, V. Shet, S. Arnoud, L. Yatziv - 2015

1 paper in library cites

[23]Svd-net: An Algorithm that Automatically Selects Network Structure

D. C. Psichogios, L. H. Ungar - 1993

1 paper in library cites

Cited by 5 papers in your library

Cites 15 papers in your library

Read on August 2, 2025

It's nice to see all of the performance optimizations they do, but it's very derivative.
