Papperoni

2014

Large-Scale Video Classification With Convolutional Neural Networks

A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, Li Fei-Fei

Cite Score: 82

AI summary

This paper evaluates CNNs for large-scale video classification on Sports-1M, a dataset of 1 million YouTube videos in 487 classes, and proposes a multiresolution, foveated architecture to speed up training. The best spatio-temporal networks improve on feature-based baselines (55.3% to 63.9%) but only modestly on single-frame models (59.3% to 60.9%). Retraining the top layers on UCF-101 raises accuracy to 63.3%, up from 43.9% for the UCF-101 baseline model.

Main Contributions

Extensive evaluation of CNNs for video classification on a large-scale dataset (Sports-1M) with 1 million videos across 487 categories.
Introduction of the Sports-1M dataset, a new large-scale video dataset for sports classification.
A multiresolution architecture that processes input at two spatial resolutions, improving runtime performance without sacrificing accuracy.
Demonstration of a performance improvement over strong feature-based baselines on the Sports-1M dataset (55.3% to 63.9%), with only a modest gain over single-frame models (59.3% to 60.9%).
Improvement on the UCF-101 dataset through transfer learning: retraining the top layers reaches 63.3%, up from 43.9% for the UCF-101 baseline model.
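
As a rough illustration of the two-resolution idea in the contributions above, the sketch below splits one frame into a low-resolution view of the whole image and a full-resolution crop of its center. It is a minimal sketch, not the authors' implementation: the function name foveated_streams, the 2x2 average pooling, the half-size center crop, and the grayscale list-of-rows frame format are all assumptions made for illustration.

```python
def foveated_streams(frame):
    """Split a grayscale frame (list of equal-length rows) into two half-size views.

    Hypothetical helper: pooling scheme and crop size are assumptions.
    """
    h, w = len(frame), len(frame[0])
    if h % 2 or w % 2:
        raise ValueError("this sketch assumes even height and width")

    # Context stream: the whole frame at half resolution (2x2 average pooling).
    context = [
        [
            (frame[r][c] + frame[r][c + 1] + frame[r + 1][c] + frame[r + 1][c + 1]) / 4
            for c in range(0, w, 2)
        ]
        for r in range(0, h, 2)
    ]

    # Fovea stream: the center region at the original resolution.
    top, left = h // 4, w // 4
    fovea = [row[left:left + w // 2] for row in frame[top:top + h // 2]]
    return context, fovea


# Tiny 4x4 example frame with pixel values 0..15.
frame = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
context, fovea = foveated_streams(frame)
print(context)  # [[2.5, 4.5], [10.5, 12.5]]
print(fovea)    # [[5.0, 6.0], [9.0, 10.0]]
```

Under these assumptions each stream has a quarter of the original pixels, so the two streams together carry half of the input, which is one plausible source of the runtime savings.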

Abstract

Convolutional Neural Networks (CNNs) have been established as a powerful class of models for image recognition problems. Encouraged by these results, we provide an extensive empirical evaluation of CNNs on large-scale video classification using a new dataset of 1 million YouTube videos belonging to 487 classes. We study multiple approaches for extending the connectivity of a CNN in time domain to take advantage of local spatio-temporal information and suggest a multiresolution, foveated architecture as a promising way of speeding up the training. Our best spatio-temporal networks display significant performance improvements compared to strong feature-based baselines (55.3% to 63.9%), but only a surprisingly modest improvement compared to single-frame models (59.3% to 60.9%). We further study the generalization performance of our best model by retraining the top layers on the UCF-101 Action Recognition dataset and observe significant performance improvements compared to the UCF-101 baseline model (63.3% up from 43.9%).
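
To make the size of the reported gains easier to compare, the short sketch below computes the difference in percentage points for each pair of figures quoted in the abstract. Only the percentages come from the abstract; the dictionary keys and variable names are illustrative.

```python
# (before, after) accuracies in percent, as quoted in the abstract.
reported = {
    "Sports-1M: feature-based baseline -> spatio-temporal CNN": (55.3, 63.9),
    "Sports-1M: single-frame CNN -> spatio-temporal CNN": (59.3, 60.9),
    "UCF-101: baseline model -> retrained top layers": (43.9, 63.3),
}

# Gain in percentage points, rounded to avoid floating-point noise.
gains = {name: round(after - before, 1) for name, (before, after) in reported.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain} points")
```

The gain from temporal modeling (1.6 points) is far smaller than the gain over hand-crafted features (8.6 points) or from transfer to UCF-101 (19.4 points), which is the "surprisingly modest" result the abstract highlights.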

References [28]

[1]ImageNet Classification With Deep Convolutional Neural Networks

Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton - 2012

71 papers in library cite

I'm giving this a 5 just because of the impact, but this is VEEERY derivative of earlier work. Kudos to them for putting it all together, but really there's nothing revolutionary here.

[2]ImageNet: A Large-Scale Hierarchical Image Database

J. Deng, W. Dong, Richard Socher, L. J. Li, K. Li, Li Fei-Fei - 2009

28 papers in library cite

Very nice idea and huge impact!

[3]Gradient-Based Learning Applied to Document Recognition

Yann Lecun, Leon Bottou, Yoshua Bengio, Patrick Haffner - 1998

62 papers in library cite

I absolutely hated this paper. It has ~50 pages but feels like 200. It takes too long to explain some things and often just repeats itself. It also doesn't seem to add much on top of LeNet-5, and it focuses a lot on GTNs, which really didn't stick.

[4]Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation

Ross Girshick, J. Donahue, Trevor Darrell, Jitendra Malik - 2014

18 papers in library cite

Good results, beat OverFeat, used pretraining to improve performance. Only issue is that the paper is overly long...

[5]Visualizing and Understanding Convolutional Networks

Matthew D. Zeiler, Rob Fergus - 2014

15 papers in library cite

Very good explanation and visualization of CNNs, and also nice that they use their findings to improve the performance. The ablation study is also nice.

[6]Video Google: A Text Retrieval Approach to Object Matching in Videos

Josef Sivic, Andrew Zisserman - 2003

5 papers in library cite

Fun read! It's not really related to AI, but TBH the way they do search on video is more interesting than the object recog.

[7]Large Scale Distributed Deep Networks

Jeffrey Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Quoc V. Le, Mark Z. Mao, Marc'aurelio Ranzato, A. Senior, P. Tucker, K. Yang, Andrew Y. Ng - 2012

16 papers in library cite

Good paper, nice algorithm. Nothing too crazy, but I understand the impact. I think the work to create the system was larger than the algorithm itself.

[8]Convolutional Neural Networks Applied to House Numbers Digit Classification

P. Sermanet, S. Chintala, Yann Lecun - 2012

6 papers in library cite

This reads like an undergrad's research paper or a recent master's paper. Only applying stuff to a new dataset.

[9]UCF101: A Dataset of 101 Human Actions Classes From Videos in the Wild

Khurram Soomro, Amir Roshan Zamir, Mubarak Shah - 2012

1 paper in library cites

Very low effort paper.

[10]OverFeat: Integrated Recognition, Localization and Detection Using Convolutional Networks

P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, Rob Fergus, Yann Lecun - 2014

16 papers in library cite

Very convoluted method, was SotA for only a bit of time, and the paper is very boring.

[11]CNN Features Off-the-Shelf: An Astounding Baseline for Recognition

A. Razavian, H. Azizpour, J. Sullivan, S. Carlsson - 2014

6 papers in library cite

CNN feature extraction

[12]Learning Hierarchical Features for Scene Labeling

Clement Farabet, C. Couprie, L. Najman, Yann Lecun - 2013

6 papers in library cite

Cited by deep learning paper (and 4 more)

[13]Histograms of Oriented Gradients for Human Detection

N. Dalal, B. Triggs - 2005

12 papers in library cite

[14]Learning Hierarchical Invariant Spatio-Temporal Features for Action Recognition With Independent Subspace Analysis

Quoc Le, W. Zou, S. Y. Yeung, A. Ng - 2011

4 papers in library cite

[15]Convolutional Learning of Spatio-Temporal Features

Graham W. Taylor, Rob Fergus, Yann Lecun, C. Bregler - 2010

3 papers in library cite

[16]Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images

Dan C. Ciresan, A. Giusti, Jürgen Schmidhuber - 2012

3 papers in library cite

[17]Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification

J. C. Niebles, C. W. Chen, Li Fei-Fei - 2010

2 papers in library cite

[18]Recognizing Realistic Actions From Videos "In the Wild"

Jingen Liu, J. Luo, Mubarak Shah - 2009

2 papers in library cite

[19]3D Convolutional Neural Networks for Human Action Recognition

S. Ji, W. Xu, M. Yang, K. Yu - 2013

1 paper in library cites

[20]A Statistical Approach to Texture Classification From Single Images

M. Varma, Andrew Zisserman - 2005

1 paper in library cites

[21]Action Recognition by Dense Trajectories

Heng Wang, A. Kläser, Cordelia Schmid, C. L. Liu - 2011

1 paper in library cites

[22]Behavior Recognition via Sparse Spatio-Temporal Features

Piotr Dollar, V. Rabaud, G. Cottrell, S. Belongie - 2005

1 paper in library cites

[23]Discriminative Tag Learning on Youtube Videos With Latent Sub-Tags

W. Yang, G. Toderici - 2011

1 paper in library cites

[24]Evaluation of Local Spatio-Temporal Features for Action Recognition

Heng Wang, M. M. Ullah, A. Kläser, I. Laptev, Cordelia Schmid - 2009

1 paper in library cites

[25]Indoor Semantic Segmentation Using Depth Information

C. Couprie, Clement Farabet, L. Najman, Yann Lecun - 2013

1 paper in library cites

[26]Learning Realistic Human Actions From Movies

I. Laptev, M. Marszalek, Cordelia Schmid, B. Rozenfeld - 2008

1 paper in library cites

[27]On Space-Time Interest Points

I. Laptev - 2005

1 paper in library cites

[28]Sequential Deep Learning for Human Action Recognition

M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, A. Baskurt - 2011

1 paper in library cites

Cited by: 2 papers in your library

Cites: 12 papers in your library

Read on August 3, 2025

I liked it a lot. It's nothing "wow", but a very nice approach, and apparently the first to apply CNNs to video, which is nice :)