2020

Do We Train on Test Data? Purging CIFAR of Near-Duplicates

Joachim Denzler


AI summary

This paper introduces the ciFAIR dataset, a variant of CIFAR-10 and CIFAR-100 in which test images that have near-duplicates in the training set are replaced with new images. Re-evaluating CNN architectures on the new test sets shows a significant drop in classification accuracy, which suggests that part of the reported performance comes from memorizing duplicates.

Main Contributions

  • Identified a significant number of near-duplicate images in CIFAR-10 and CIFAR-100 test sets.
  • Introduced the ciFAIR dataset by replacing duplicates in the test sets with new images.
  • Re-evaluated state-of-the-art CNN architectures on the ciFAIR dataset, demonstrating a notable performance drop.
  • Showed that models can achieve near-perfect classification on duplicate images, indicating memorization.
  • Found that the relative ranking of models remains consistent, suggesting that research efforts have not heavily overfitted to the duplicates.
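The comparison behind the last three points can be illustrated with a minimal sketch that reports accuracy separately on duplicate and non-duplicate test images. The function names and the `is_duplicate` flags are hypothetical and chosen for illustration only; they are not taken from the paper or its released code.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels (NaN for an empty subset)."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return float("nan")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def accuracy_by_subset(predictions, labels, is_duplicate):
    """Accuracy on all test images, on flagged duplicates, and on the rest.

    `is_duplicate` is a hypothetical per-image boolean flag marking test
    images that have a near-duplicate in the training set.
    """
    if not (len(predictions) == len(labels) == len(is_duplicate)):
        raise ValueError("all inputs must have the same length")
    dup = [(p, y) for p, y, d in zip(predictions, labels, is_duplicate) if d]
    rest = [(p, y) for p, y, d in zip(predictions, labels, is_duplicate) if not d]
    return {
        "overall": accuracy(predictions, labels),
        "duplicates": accuracy([p for p, _ in dup], [y for _, y in dup]),
        "non_duplicates": accuracy([p for p, _ in rest], [y for _, y in rest]),
    }
```

A large gap between the "duplicates" and "non_duplicates" values would be consistent with memorization, while a stable model ranking on the "non_duplicates" subset would match the last point above.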

Abstract

The CIFAR-10 and CIFAR-100 datasets are two of the most heavily benchmarked datasets in computer vision and are often used to evaluate novel methods and model architectures in the field of deep learning. However, we find that 3.3% and 10% of the images from the test sets of these datasets have duplicates in the training set. These duplicates are easily recognizable by memorization and may, hence, bias the comparison of image recognition techniques regarding their generalization capability. To eliminate this bias, we provide the “fair CIFAR” (ciFAIR) dataset, where we replaced all duplicates in the test sets with new images sampled from the same domain. We then re-evaluate the classification performance of various popular state-of-the-art CNN architectures on these new test sets to investigate whether recent research has overfitted to memorizing data instead of learning abstract concepts. We find a significant drop in classification accuracy of between 9% and 14% relative to the original performance on the duplicate-free test set. The ciFAIR dataset and pre-trained models are available at https://cvjena.github.io/cifair/, where we also maintain a leaderboard.
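The abstract does not say how the duplicates were located. One plausible way to shortlist candidates, shown here as a minimal sketch and not as the authors' procedure, is to look up each test image's nearest training image in some feature space and flag pairs whose cosine similarity exceeds a threshold. The feature vectors, the function names, and the threshold of 0.95 are all assumptions made for illustration; in practice, flagged pairs would still need manual inspection.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def find_near_duplicate_candidates(test_feats, train_feats, threshold=0.95):
    """Return (test_index, train_index, similarity) for suspicious pairs.

    For each test feature vector, the most similar training vector is found
    by brute-force cosine similarity. The pair is reported only if the
    similarity reaches `threshold` (an assumed value, not from the paper).
    """
    train_unit = [l2_normalize(v) for v in train_feats]
    candidates = []
    for test_idx, feat in enumerate(test_feats):
        query = l2_normalize(feat)
        best_idx, best_sim = -1, -1.0
        for train_idx, train_vec in enumerate(train_unit):
            sim = sum(a * b for a, b in zip(query, train_vec))
            if sim > best_sim:
                best_idx, best_sim = train_idx, sim
        if best_sim >= threshold:
            candidates.append((test_idx, best_idx, best_sim))
    return candidates
```

The brute-force loop is quadratic in the number of images, so a dataset of CIFAR's size would call for a vectorized or approximate nearest-neighbor search; the sketch only shows the logic.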
