ImageNet-X

Understanding model mistakes with human annotations of ImageNet.

ImageNet-X is a set of human annotations pinpointing failure types for the popular ImageNet dataset. ImageNet-X labels distinguishing object factors such as pose, size, color, lighting, occlusions, and co-occurrences for each image in the validation set and for a random subset of 12,000 training samples.

Paper

GitHub

Colab Notebook

ImageNet-X can be used to surface mistake types for any ImageNet classifier

Using ImageNet-X

Install imagenet-x package

pip install imagenet-x

Load annotations

from imagenet_x import load_annotations

annots = load_annotations(partition="val")

See this Colab for a step-by-step notebook on loading annotations and evaluating new models.

For advanced usage, see the README.
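To illustrate how per-image factor annotations might be combined with a classifier's predictions to surface mistake types, here is a minimal, self-contained sketch. It does not use the imagenet-x package API: the function name `error_ratios`, the input layout (one dict of 0/1 factor flags per image), and the toy data are all assumptions made for illustration. The metric shown, the error rate on images carrying a factor divided by the overall error rate, is one simple way to summarize which factors a model struggles with.

```python
from collections import defaultdict


def error_ratios(factor_labels, is_correct):
    """Return {factor: error rate on images with that factor / overall error rate}.

    factor_labels: list of dicts, one per image, mapping factor name -> 0 or 1
                   (hypothetical layout, not the package's actual schema)
    is_correct:    list of bools, one per image (True if the model was right)
    """
    if not is_correct or len(factor_labels) != len(is_correct):
        raise ValueError("inputs must be non-empty and of equal length")

    overall_error = sum(not c for c in is_correct) / len(is_correct)
    if overall_error == 0:
        raise ValueError("model makes no mistakes; ratios are undefined")

    counts = defaultdict(int)  # number of images carrying each factor
    errors = defaultdict(int)  # misclassified images carrying each factor
    for labels, correct in zip(factor_labels, is_correct):
        for factor, active in labels.items():
            if active:
                counts[factor] += 1
                errors[factor] += int(not correct)

    # A ratio above 1 means the model errs more often than average on that factor.
    return {f: (errors[f] / counts[f]) / overall_error for f in counts}


# Toy data, made up for illustration (not real ImageNet-X annotations).
toy_labels = [
    {"pose": 1, "lighting": 0},
    {"pose": 1, "lighting": 0},
    {"pose": 0, "lighting": 1},
    {"pose": 0, "lighting": 1},
]
toy_correct = [True, True, False, True]

ratios = error_ratios(toy_labels, toy_correct)
print(ratios)
```

In this toy example the only mistake falls on an image flagged for lighting, so that factor's ratio exceeds 1 while the ratio for pose is 0. With real data, the flags would come from the loaded annotations and the correctness vector from the model under evaluation.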

Research abstract

Deep learning vision systems are widely deployed across applications where reliability is critical. However, even today’s best models can fail to recognize an object when its pose, lighting, or background varies. While existing benchmarks surface examples that are challenging for models, they do not explain why such mistakes arise.

To address this need, we introduce ImageNet-X, a set of sixteen human annotations of factors such as pose, background, or lighting for the entire ImageNet1k validation set as well as a random subset of 12k training images. Equipped with ImageNet-X, we investigate 2,200 current recognition models and study the types of mistakes as a function of a model's (1) architecture (e.g., transformer vs. convolutional), (2) learning paradigm (e.g., supervised vs. self-supervised), and (3) training procedure (e.g., data augmentation).

Regardless of these choices, we find that models have consistent failure modes across ImageNet-X categories. We also find that while data augmentation can improve robustness to certain factors, it induces spill-over effects on other factors. For example, color-jitter augmentation improves robustness to color and brightness, but surprisingly hurts robustness to pose. Together, these insights suggest that to advance the robustness of modern vision models, future research should focus on collecting additional diverse data and understanding data augmentation schemes. Along with these insights, we release a toolkit based on ImageNet-X to spur further study into the mistakes that image recognition systems make.

Read the Paper