Papperoni

2009

What Is the Best Multi-Stage Architecture for Object Recognition?

K. Jarrett, Koray Kavukcuoglu, Marc'Aurelio Ranzato, Yann LeCun

citations

Cite Score

65

AI summary

This paper explores multi-stage architectures for object recognition using filter banks, non-linear transformations, and feature pooling, evaluating different non-linearities, filter learning methods (random, unsupervised, supervised), and the number of stages, achieving state-of-the-art results on NORB and MNIST datasets.

Main Contributions

Showed that using non-linearities that include rectification and local contrast normalization is the single most important ingredient for good accuracy on object recognition benchmarks.
Showed that two stages of feature extraction yield better accuracy than one.
Showed that a two-stage system with random filters can yield almost 63% recognition rate on Caltech-101, provided that the proper non-linearities and pooling layers are used.
Achieved state-of-the-art performance on the NORB dataset (5.6% error) with supervised refinement.
Achieved the lowest known error rate on the undistorted, unprocessed MNIST dataset (0.53%) with unsupervised pre-training followed by supervised refinement.

Abstract

In many recent object recognition systems, feature extraction stages are generally composed of a filter bank, a non-linear transformation, and some sort of feature pooling layer. Most systems use only one stage of feature extraction in which the filters are hard-wired, or two stages where the filters in one or both stages are learned in supervised or unsupervised mode. This paper addresses three questions: 1. How do the non-linearities that follow the filter banks influence the recognition accuracy? 2. Does learning the filter banks in an unsupervised or supervised manner improve the performance over random filters or hard-wired filters? 3. Is there any advantage to using an architecture with two stages of feature extraction, rather than one? We show that using non-linearities that include rectification and local contrast normalization is the single most important ingredient for good accuracy on object recognition benchmarks. We show that two stages of feature extraction yield better accuracy than one. Most surprisingly, we show that a two-stage system with random filters can yield almost 63% recognition rate on Caltech-101, provided that the proper non-linearities and pooling layers are used. Finally, we show that with supervised refinement, the system achieves state-of-the-art performance on the NORB dataset (5.6%), and that unsupervised pre-training followed by supervised refinement produces good accuracy on Caltech-101 (> 65%) and the lowest known error rate on the undistorted, unprocessed MNIST dataset (0.53%).
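
To make the stage structure in the abstract concrete, here is a minimal NumPy sketch of one feature extraction stage: a random filter bank, a tanh non-linearity, absolute-value rectification, local contrast normalization, and average pooling. It is an illustration under stated assumptions, not the authors' implementation: the function names, filter sizes, and window sizes are all hypothetical, the normalization uses a uniform box window rather than a weighted one, and the input is a single-channel image.

```python
import numpy as np

def filter_bank(image, filters):
    """Valid cross-correlation of a 2-D image with each k x k filter."""
    n, k, _ = filters.shape
    h, w = image.shape
    out = np.zeros((n, h - k + 1, w - k + 1))
    for dy in range(k):
        for dx in range(k):
            patch = image[dy:dy + h - k + 1, dx:dx + w - k + 1]
            out += filters[:, dy, dx][:, None, None] * patch
    return out

def box_mean(x, size):
    """Mean over a size x size window (size odd, edge-padded) of a 2-D array."""
    pad = size // 2
    xp = np.pad(x, pad, mode="edge")
    h, w = x.shape
    acc = np.zeros_like(x, dtype=float)
    for dy in range(size):
        for dx in range(size):
            acc += xp[dy:dy + h, dx:dx + w]
    return acc / size ** 2

def local_contrast_normalize(maps, size=3, eps=1e-8):
    """Subtractive then divisive normalization across maps and space.

    Assumption: a uniform box window stands in for a weighted window.
    """
    centered = maps - box_mean(maps.mean(axis=0), size)
    sigma = np.sqrt(box_mean((centered ** 2).mean(axis=0), size))
    # Divide by the local deviation, floored at its global mean.
    return centered / (np.maximum(sigma, sigma.mean()) + eps)

def average_pool(maps, size=2):
    """Non-overlapping size x size average pooling on each feature map."""
    n, h, w = maps.shape
    h2, w2 = h // size, w // size
    trimmed = maps[:, :h2 * size, :w2 * size]
    return trimmed.reshape(n, h2, size, w2, size).mean(axis=(2, 4))

def stage(image, filters):
    """One stage: filter bank, tanh, abs rectification, normalization, pooling."""
    x = np.tanh(filter_bank(image, filters))
    x = np.abs(x)
    x = local_contrast_normalize(x)
    return average_pool(x)

rng = np.random.default_rng(0)
image = rng.standard_normal((16, 16))      # hypothetical 16 x 16 input
filters = rng.standard_normal((4, 5, 5))   # four random 5 x 5 filters
features = stage(image, filters)
print(features.shape)  # (4, 6, 6)
```

A second stage would apply the same sequence to the multi-channel output of the first, which requires filters that span several input maps; the sketch leaves that out to stay short.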

References [31]

[1]Gradient-Based Learning Applied to Document Recognition

Yann LeCun, Leon Bottou, Yoshua Bengio, Patrick Haffner - 1998

62 papers in library cite

I absolutely hated this paper. It has ~50 pages but feels like 200. It takes too long to explain some things and often just repeats itself. It also doesn't seem to add much on top of LeNet-5, and it focuses a lot on GTNs, which really didn't stick.

[2]Reducing the Dimensionality of Data With Neural Networks

Geoffrey Hinton, Ruslan Salakhutdinov - 2006

37 papers in library cite

I didn't like the way this is written; it is very hard to understand without a ton of background knowledge. But hey, it's the first deep learning model!

[3]Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories

Svetlana Lazebnik, Cordelia Schmid, Jean Ponce - 2006

14 papers in library cite

It's a fun read, but in the end it is just an application of the spatial pyramid matching kernel from the other paper.

[4]Greedy Layer-Wise Training of Deep Networks

Yoshua Bengio, P. Lamblin, D. Popovici, Hugo Larochelle - 2006

33 papers in library cite

Bengio is perfect. This is everything that Hinton's paper hoped to be. Very well explained, and it also ties back to real use cases (not just "hey, the math works and it reduced the score").

[5]Learning Generative Visual Models From Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories

Li Fei-Fei, Rob Fergus, Pietro Perona - 2004

15 papers in library cite

I think most people cite this thinking this is where the Caltech 101 dataset comes from (it's not). Anyway, it's just an extension of the other dataset and it's very mathy, not NNs, and uninteresting.

[6]Learning Methods for Generic Object Recognition With Invariance to Pose and Lighting

Yann LeCun, Fu Jie Huang, Leon Bottou - 2004

18 papers in library cite

Good paper, nice methodology for creating different images. However, I think that this was not too impactful... I don't see this being used a lot.

[7]Efficient Learning of Sparse Representations With an Energy-Based Model

Marc'Aurelio Ranzato, C. Poultney, S. Chopra, Yann LeCun - 2006

20 papers in library cite

It's ok. Not really good, but alright.

[8]Unsupervised Learning of Invariant Feature Hierarchies With Applications to Object Recognition

Marc'Aurelio Ranzato, F. Huang, Y. Boureau, Yann LeCun - 2007

8 papers in library cite

I thought that I had learned pretty much everything before 2015, but this paper proved me wrong. Amazing read and very nice methodology. Very simple as well!

[9]Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations

Honglak Lee, R. Grosse, R. Ranganath, Andrew Y. Ng - 2009

12 papers in library cite

[10]Histograms of Oriented Gradients for Human Detection

N. Dalal, B. Triggs - 2005

12 papers in library cite

[11]Sparse Coding With an Overcomplete Basis Set: A Strategy Employed by V1?

Bruno A. Olshausen, David J. Field - 1997

10 papers in library cite

[12]Sparse Deep Belief Net Model for Visual Area V2

Honglak Lee, C. Ekanadham, A. Ng - 2008

10 papers in library cite

[13]Distinctive Image Features From Scale-Invariant Keypoints

D. Lowe - 2004

9 papers in library cite

[14]Linear Spatial Pyramid Matching Using Sparse Coding for Image Classification

Jianchao Yang, K. Yu, Y. Gong, T. Huang - 2009

8 papers in library cite

[15]Shape Matching and Object Recognition Using Low Distortion Correspondences

A. C. Berg, T. L. Berg, Jitendra Malik - 2005

8 papers in library cite

[16]Object Recognition With Features Inspired by Visual Cortex

T. Serre, Lior Wolf, T. Poggio - 2005

7 papers in library cite

[17]Efficient Sparse Coding Algorithms

Honglak Lee, Alexis Battle, Rajat Raina, A. Ng - 2007

6 papers in library cite

[18]SVM-KNN: Discriminative Nearest Neighbor Classification for Visual Category Recognition

Hao Zhang, A. C. Berg, M. Maire, Jitendra Malik - 2006

6 papers in library cite

[19]Large-Scale Learning With SVM and Convolutional Nets for Generic Object Categorization

F. Huang, Yann LeCun - 2006

5 papers in library cite

[20]Why Is Real-World Visual Object Recognition Hard?

N. Pinto, D. D. Cox, J. J. DiCarlo - 2008

5 papers in library cite

[21]Learning Invariant Features Through Topographic Filter Maps

Koray Kavukcuoglu, Marc'Aurelio Ranzato, Rob Fergus, Yann LeCun - 2009

4 papers in library cite

[22]Fast Inference in Sparse Coding Algorithms With Applications to Object Recognition

Koray Kavukcuoglu, Marc'Aurelio Ranzato, Yann LeCun - 2008

3 papers in library cite

[23]Multiclass Object Recognition With Sparse, Localized Features

J. Mutch, D. Lowe - 2006

3 papers in library cite

[24]Nonlinear Image Representation Using Divisive Normalization

S. Lyu, E. Simoncelli - 2008

3 papers in library cite

[25]Semi-Supervised Learning of Compact Document Representations With Deep Networks

Marc'Aurelio Ranzato, M. Szummer - 2008

2 papers in library cite

[26]Discriminative Learned Dictionaries for Local Image Analysis

Julien Mairal, F. Bach, Jean Ponce, G. Sapiro, Andrew Zisserman - 2008

1 paper in library cites

[27]http://yann.lecun.com/exdb/mnist/

1 paper in library cites

[28]K-SVD and Its Non-Negative Variant for Dictionary Design

M. Aharon, M. Elad, A. Bruckstein - 2005

1 paper in library cites

[29]Learning the Discriminative Power-Invariance Trade-Off

M. Varma, D. Ray - 2007

1 paper in library cites

[30]Training Hierarchical Feed-Forward Visual Recognition Models Using Transfer Learning From Pseudo Tasks

A. Ahmed, K. Yu, Wei Xu, Y. Gong, E. Xing - 2008

1 paper in library cites

[31]Unsupervised Learning: Foundations of Neural Computation

Geoffrey Hinton, T. Sejnowski - 1999

1 paper in library cites

Cited by

20

papers in your library

Cites

9

papers in your library

Read

on August 2, 2025

So boring and convoluted... They try to make things more abstract by saying "filter banks" and "2 stages", but in the end it's CNNs. It seems like they are trying to tie NNs to the rest of ML, but it doesn't work.
