This repository contains Bash scripts to clean up the ImageNet 1k dataset and pretrained Pytorch models in different configurations.
The Bash scripts can be downloaded from https://www.dropbox.com/s/pyzem2svhnx5h6m/imagenet_clean_scripts.tar.gz?dl=0.
Pytorch pretrained models can be downloaded from https://www.dropbox.com/s/lzm60bz90wfl6ys/imagenet_clean_models.tar.gz?dl=0.
- Download ImageNet 1k (https://academictorrents.com/details/943977d8c96892d24237638335e481f3ccd54cfb) and/or ImagenetV2 (http://imagenetv2public.s3-website-us-west-2.amazonaws.com/) datasets.
- Bash to run the cleanup scripts.
- ImageMagick to apply the categorical fixes.
- Pytorch and Pytorch Image Models (https://github.com/rwightman/pytorch-image-models) to use the pretrained models.
Download and extract the scripts in a directory. Copy the imagenet_val_*.sh scripts into the validation set subdirectory of the dataset (val/) and execute the scripts in the following order:
- Fix image labels based on confident learning:
./imagenet_val_1_image_fixes.sh
- Remove the wrong-problematic images based on model consensus and confident learning:
./imagenet_val_2_image_removal.sh
- Apply categorical fixes:
./imagenet_val_3_categorical_fixes.sh
Download and extract the scripts in a directory. Copy the imagenet_train_*.sh scripts into the training set subdirectory of the dataset (train/) and execute the scripts in the following order:
- Fix image labels based on confident learning:
./imagenet_train_1_image_fixes.sh
- Remove the wrong-problematic images based on model consensus and confident learning:
./imagenet_train_2_image_removal.sh
- Apply categorical fixes:
./imagenet_train_3_categorical_fixes.sh
Optional steps:
- Removing the wrong images only found by confident learning (a subset of point 2): imagenet_train_2_image_removal1.sh
- Removing the wrong images only found by model consensus (a subset of point 2): imagenet_train_2_image_removal3.sh
- Applying the fixes and removal before category fixes for CAE-EDSR images (https://github.com/hendrycks/imagenet-r/tree/master/DeepAugment) before category fixes: imagenet_train_cae_edsr_1_image_fixes.sh and imagenet_train_cae_edsr_2_image_removal.sh
Note: The CAE and EDSR scripts expect that CAE/EDSR images must be renamed to a new name schema (e.g. n01440764_10042.JPEG -> n01440764_10042_CAE.JPEG)
Download and extract the scripts in a directory. Copy the imagenetv2_*.sh scripts into the ImageNetV2 subdirectory and execute the scripts in the following order:
- Fix image labels based on confident learning:
./imagenetv2_matched_frequency_format_1_image_fixes.sh
- Remove the wrong-problematic images based on model consensus and confident learning:
./imagenetv2_matched_frequency_format_2_image_removal.sh
- Apply categorical fixes:
./imagenetv2_matched_frequency_format_3_categorical_fixes.sh
Optional steps:
- Removing the wrong images only found by confident learning (a subset of point 2): imagenetv2_matched_frequency_format_2_image_removal1.sh
- Removing the wrong images only found by model consensus (a subset of point 2): imagenetv2_matched_frequency_format_2_image_removal3.sh
- Renaming the alphabethical folder names to nxxxxxxx format: imagenetv2_folder_name_fixes.sh
The pretrained models have the following name schema:
model_name-widthxheight-variant.pth.tar
- model_name - efficientnet_b0, shufflenet_v2_x1_5 or squeezenet1_1
- variant - baseline (trained on original ImageNet), clean (trained on ImageNet Clean), clean-imagenet-r (trained on ImageNet Clean with CAE/EDSR images)
Install Pytorch Image Models:
pip3 install timm
Validate an EfficientNet-B0 model (trained on ImageNet Clean, portrait input 216x384) on cleaned ImageNetV2 dataset (top-1/top-5 - 69.26 %/89.29 %):
./validate.py --model efficientnet_b0 --checkpoint efficientnet_b0-384x216-clean.pth.tar -b 64 --log-interval 100 --input-size 3 216 384 --num-classes 1000 IMAGENETV2_DIRECTORY
Validate a SqueezeNet 1.1 model (trained on ImageNet Clean+CAE/EDSR, landscape input 320x180) on ImageNet validation dataset (top-1/top-5 - 60.89 %/83.15 %):
./validate.py --torchvision-model squeezenet1_1 --checkpoint squeezenet1_1-180x320-clean-imagenetr.pth.tar -b 64 --log-interval 100 --input-size 3 320 180 --num-classes 1000 IMAGENET_VALIDATION_DIRECTORY
Validate a ShuffleNetV2 (x1_5) model (trained on original ImageNet, standard input 224x224) on cleaned ImageNet validation dataset (top-1/top-5 - 77.93 %/94.57 %):
./validate.py --hub-model-github-or-dir kecsap/vision --hub-model shufflenet_v2_x1_5 --checkpoint shufflenet_v2_x1_5-224x224-baseline.pth.tar -b 64 --log-interval 100 --num-classes 1000 IMAGENET_VALIDATION_DIRECTORY
If this helps your research, please cite the paper (https://arxiv.org/abs/2103.16324):
@misc{kertész2021automated,
title={Automated Cleanup of the ImageNet Dataset by Model Consensus, Explainability and Confident Learning},
author={Csaba Kertész},
year={2021},
eprint={2103.16324},
archivePrefix={arXiv},
primaryClass={cs.CV}
}