2014

Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

K. He, X. Zhang, S. Ren, Jian Sun

Cite Score: 90

AI summary

This paper introduces the spatial pyramid pooling network (SPP-net), a network structure that removes the fixed-size input constraint of CNNs. SPP-net boosts the accuracy of several CNN architectures on ImageNet 2012, achieves state-of-the-art classification results on Pascal VOC 2007 and Caltech101, and processes test images 24-102x faster than R-CNN in object detection with better or comparable accuracy.

Main Contributions

Introduces a spatial pyramid pooling (SPP) layer to remove the fixed-size constraint of CNNs.
Demonstrates that SPP-net can generate a fixed-length representation regardless of image size/scale.
Shows that SPP-net boosts the accuracy of various CNN architectures on ImageNet 2012.
Achieves state-of-the-art classification results on Pascal VOC 2007 and Caltech101 using a single full-image representation and no fine-tuning.
Presents an SPP-net detection method that is 24-102x faster than R-CNN when processing test images, with better or comparable accuracy on Pascal VOC 2007.
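
To make the fixed-length claim concrete, here is a minimal NumPy sketch of a pyramid pooling step. It is an illustration under assumptions, not the authors' implementation: the function name `spp`, the pyramid levels `(1, 2, 4)`, and the floor/ceil bin edges are all choices made for this example.

```python
import math
import numpy as np

def spp(feature_map, levels=(1, 2, 4)):
    """Max-pool a (C, H, W) feature map into a fixed-length vector.

    For each pyramid level n the map is split into an n x n grid of bins
    and every bin is max-pooled per channel. The output length is
    C * sum(n * n for n in levels), whatever H and W are (both >= 1).
    The levels used here are illustrative, not taken from the paper.
    """
    c, h, w = feature_map.shape
    pooled = []
    for n in levels:
        for i in range(n):
            # floor for the start and ceil for the end keep every bin non-empty
            r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
            for j in range(n):
                c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
                pooled.append(feature_map[:, r0:r1, c0:c1].max(axis=(1, 2)))
    return np.concatenate(pooled)

# Two feature maps with different spatial sizes give vectors of the same length.
small = spp(np.random.rand(256, 13, 13))
large = spp(np.random.rand(256, 10, 17))
print(small.shape, large.shape)  # (5376,) (5376,)
```

With 256 channels and 1 + 4 + 16 = 21 bins, both inputs produce 5376 values, so a fully connected layer placed after this step would always see the same input size.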

Abstract

Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g., 224x224) input image. This requirement is "artificial" and may reduce the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with another pooling strategy, "spatial pyramid pooling", to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. Pyramid pooling is also robust to object deformations. With these advantages, SPP-net should in general improve all CNN-based image classification methods. On the ImageNet 2012 dataset, we demonstrate that SPP-net boosts the accuracy of a variety of CNN architectures despite their different designs. On the Pascal VOC 2007 and Caltech101 datasets, SPP-net achieves state-of-the-art classification results using a single full-image representation and no fine-tuning. The power of SPP-net is also significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. This method avoids repeatedly computing the convolutional features. In processing test images, our method is 24-102x faster than the R-CNN method, while achieving better or comparable accuracy on Pascal VOC 2007. In ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014, our methods rank #2 in object detection and #3 in image classification among all 38 teams. This manuscript also introduces the improvement made for this competition.
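
The detection idea in the abstract (compute the feature maps once, then pool arbitrary regions into fixed-length representations) can be sketched as below. This is a hedged illustration only: `pool_region`, the single 2 x 2 pooling level, the random stand-in features, and the boxes given directly in feature-map cells are assumptions made for this example. How image-space proposals map onto the feature map is not covered here.

```python
import math
import numpy as np

def pool_region(feature_map, box, n=2):
    """Max-pool one window of a (C, H, W) feature map into n x n bins.

    `box` is (top, left, bottom, right) in feature-map cells, end-exclusive.
    Returns a vector of length C * n * n regardless of the window size.
    """
    top, left, bottom, right = box
    window = feature_map[:, top:bottom, left:right]
    c, h, w = window.shape
    out = np.empty((c, n, n), dtype=feature_map.dtype)
    for i in range(n):
        r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
        for j in range(n):
            c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
            out[:, i, j] = window[:, r0:r1, c0:c1].max(axis=(1, 2))
    return out.reshape(-1)

# Stand-in for convolutional features computed once on the whole image.
rng = np.random.default_rng(0)
features = rng.random((16, 30, 40))

# Hypothetical candidate regions of different sizes, all reusing `features`.
boxes = [(0, 0, 10, 10), (5, 8, 29, 35), (12, 3, 20, 9)]
descriptors = np.stack([pool_region(features, b) for b in boxes])
print(descriptors.shape)  # (3, 64)
```

The convolutional work happens once per image. Each extra region costs only a slice and a few max operations, which is the source of the speedup the abstract describes.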

References [40]

[1]Very Deep Convolutional Networks for Large-Scale Image Recognition

K. Simonyan, Andrew Zisserman - 2014

20 papers in library cite

This is very good! The great thing here is small filters and depth analysis, but truly they do some other stuff as well: SotA, generalization for other tasks, and open source their models. Very nice.

[2]ImageNet Classification With Deep Convolutional Neural Networks

Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton - 2012

71 papers in library cite

I'm giving this a 5 just because of the impact, but this is VEEERY derivative of earlier work. Kudos to them for putting it all together, but really there's nothing revolutionary here.

[3]ImageNet: A Large-Scale Hierarchical Image Database

J. Deng, W. Dong, Richard Socher, L.-J. Li, K. Li, Li Fei-Fei - 2009

28 papers in library cite

Very nice idea and huge impact!

[4]Going Deeper With Convolutions

Christian Szegedy, Wei Liu, Y. Jia, P. Sermanet, S. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich - 2015

20 papers in library cite

Introduced the Inception architecture, which is nice. The paper is quite good, but I had to google some stuff to understand it fully. Nice contribution and SotA, but TBH I felt that it wasn't toooo good of a read.

[5]Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation

Ross Girshick, J. Donahue, Trevor Darrell, Jitendra Malik - 2014

18 papers in library cite

Good results, beat OverFeat, used pretraining for improving performance. Only issue is that the paper is overly long...

[6]Visualizing and Understanding Convolutional Networks

Matthew D. Zeiler, Rob Fergus - 2014

15 papers in library cite

Very good explanation and visualization of CNNs, and also nice that they use their findings to improve the performance. The ablation study is also nice.

[7]Backpropagation Applied to Handwritten Zip-Code Recognition

Yann LeCun, B. Boser, John S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel - 1989

24 papers in library cite

The first convolutional NN! Very simple concept and very simply explained. Very good results and overall a good read.

[8]Caffe: Convolutional Architecture for Fast Feature Embedding

Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, Ross Girshick, S. Guadarrama, Trevor Darrell - 2014

12 papers in library cite

Nothing new really, but worth the read. It's nice because it's the precursor to current AI frameworks + has a Python interface. Also good that model representation is separate from implementation.

[9]Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories

Svetlana Lazebnik, Cordelia Schmid, Jean Ponce - 2006

14 papers in library cite

It's a fun read, but in the end is just an application of the spatial pyramid matching kernel from the other paper.

[10]Network in Network

M. Lin, Qiang Chen, Shuicheng Yan - 2013

11 papers in library cite

I think this was badly written and explained. The idea is nice but I didn't like the paper at all.

[11]DeepFace: Closing the Gap to Human-Level Performance in Face Verification

Y. Taigman, Ming Yang, Marc'Aurelio Ranzato, Lior Wolf - 2014

5 papers in library cite

Very impressive results but boring overall. I think the main thing is that they were the first to use a CNN, but it seems like the most important part of their method was the 3D alignment (which is nice, but out of scope for me).

[12]Video Google: A Text Retrieval Approach to Object Matching in Videos

Josef Sivic, Andrew Zisserman - 2003

5 papers in library cite

Fun read! It's not really related to AI, but TBH the way they do search on video is more interesting than the object recog.

[13]Learning Generative Visual Models From Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories

Li Fei-Fei, Rob Fergus, Pietro Perona - 2004

15 papers in library cite

I think most people cite this thinking this is where the Caltech 101 dataset comes from (it's not). Anyway, it's just an extension of the other dataset and it's very mathy, not NNs, and uninteresting.

[14]DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition

J. Donahue, Y. Jia, Oriol Vinyals, J. Hoffman, N. Zhang, E. Tzeng, Trevor Darrell - 2014

15 papers in library cite

Very nice paper. First I've seen (and based on the text, first ever) about feature extraction for images. It's very nice to see embeddings doing SotA

[15]The Pyramid Match Kernel: Discriminative Classification With Sets of Image Features

Kristen Grauman, Trevor Darrell - 2005

4 papers in library cite

Very simple and elegant solution to set matching. At first I didn't understand, but then it clicked. I think it could be used for other stuff as well!

[16]OverFeat: Integrated Recognition, Localization and Detection Using Convolutional Networks

P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, Rob Fergus, Yann LeCun - 2014

16 papers in library cite

Very convoluted method, was SotA for only a bit of time, and the paper is very boring.

[17]ImageNet Large Scale Visual Recognition Challenge

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Zhiheng Huang, A. Karpathy, A. Khosla, M. Bernstein - 2014

18 papers in library cite

ImageNet dataset challenge paper

[18]CNN Features Off-the-Shelf: An Astounding Baseline for Recognition

A. Razavian, H. Azizpour, J. Sullivan, S. Carlsson - 2014

6 papers in library cite

CNN feature extraction

[19]Return of the Devil in the Details: Delving Deep Into Convolutional Nets

K. Chatfield, K. Simonyan, A. Vedaldi, Andrew Zisserman - 2014

5 papers in library cite

Using CNNs as feature extractors

[20]Edge Boxes: Locating Object Proposals From Edges

C. L. Zitnick, Piotr Dollar - 2014

2 papers in library cite

A way of detecting objects that brings speedups

[21]Some Improvements on Deep Convolutional Neural Network Based Image Classification

A. G. Howard - 2013

4 papers in library cite

CNNs with images of different scales

[22]Histograms of Oriented Gradients for Human Detection

N. Dalal, B. Triggs - 2005

12 papers in library cite

[23]Distinctive Image Features From Scale-Invariant Keypoints

D. Lowe - 2004

9 papers in library cite

[24]Linear Spatial Pyramid Matching Using Sparse Coding for Image Classification

Jianchao Yang, K. Yu, Y. Gong, T. Huang - 2009

8 papers in library cite

[25]Object Detection With Discriminatively Trained Part-Based Models

P. F. Felzenszwalb, Ross Girshick, D. McAllester, D. Ramanan - 2010

8 papers in library cite

[26]The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results

Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, Andrew Zisserman - 2007

7 papers in library cite

[27]The Importance of Encoding Versus Training With Sparse Coding and Vector Quantization

A. Coates, Andrew Y. Ng - 2011

5 papers in library cite

[28]Deep Neural Networks for Object Detection

Christian Szegedy, A. Toshev, Dumitru Erhan - 2013

4 papers in library cite

[29]Learning and Transferring Mid-Level Image Representations Using Convolutional Neural Networks

Maxime Oquab, Leon Bottou, I. Laptev, Josef Sivic - 2014

4 papers in library cite

[30]LIBSVM: A Library for Support Vector Machines

C. C. Chang, C. J. Lin - 2001

4 papers in library cite

[31]Segmentation as Selective Search for Object Recognition

K. E. A. van de Sande, J. R. R. Uijlings, T. Gevers, A. W. M. Smeulders - 2011

3 papers in library cite

[32]Aggregating Local Image Descriptors Into Compact Codes

Hervé Jégou, F. Perronnin, M. Douze, J. Sanchez, P. Perez, Cordelia Schmid - 2012

2 papers in library cite

[33]Improving the Fisher Kernel for Large-Scale Image Classification

F. Perronnin, J. Sanchez, T. Mensink - 2010

2 papers in library cite

[34]Locality-Constrained Linear Coding for Image Classification

J. Wang, Jianchao Yang, K. Yu, F. Lv, T. Huang, Y. Gong - 2010

2 papers in library cite

[35]PANDA: Pose Aligned Networks for Deep Attribute Modeling

N. Zhang, M. Paluri, Marc'Aurelio Ranzato, Trevor Darrell, L. Bourdev - 2014

2 papers in library cite

[36]Regionlets for Generic Object Detection

Xiaoyu Wang, Ming Yang, S. Zhu, Yuanqing Lin - 2013

2 papers in library cite

[37]The Devil Is in the Details: An Evaluation of Recent Feature Encoding Methods

K. Chatfield, Victor Lempitsky, A. Vedaldi, Andrew Zisserman - 2011

2 papers in library cite

[38]Generic Object Detection With Dense Neural Patterns and Regionlets

W. Y. Zou, Xiaoyu Wang, Miao Sun, Yuanqing Lin - 2014

1 paper in library cites

[39]Kernel Codebooks for Scene Categorization

J. C. van Gemert, J. M. Geusebroek, C. J. Veenman, A. W. Smeulders - 2008

1 paper in library cites

[40]Multi-Scale Orderless Pooling of Deep Convolutional Activation Features

Y. Gong, Liwei Wang, R. Guo, Svetlana Lazebnik - 2014

1 paper in library cites

Cited by 6 papers in your library

Cites 21 papers in your library

Read on August 16, 2025

Very simple, general and effective method. The paper ends at page ~4 TBH, the rest is just results and gets boring. Good contribution though.
