By Ze Liu*, Jia Ning*, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin and Han Hu.
This repo is the official implementation of "Video Swin Transformer". It is based on mmaction2.
06/25/2021 Initial commits
Video Swin Transformer is initially described in "Video Swin Transformer", which advocates an inductive bias of locality in video Transformers, leading to a better speed-accuracy trade-off compared to previous approaches which compute self-attention globally even with spatial-temporal factorization. The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage the power of pre-trained image models. Our approach achieves state-of-the-art accuracy on a broad range of video recognition benchmarks, including action recognition (84.9% top-1 accuracy on Kinetics-400 and 86.1% top-1 accuracy on Kinetics-600 with ~20x less pre-training data and ~3x smaller model size) and temporal modeling (69.6% top-1 accuracy on Something-Something v2).
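As a concrete illustration of the locality bias, the sketch below is a minimal PyTorch sketch, not the code of this repo (the actual backbone additionally handles shifted windows, padding, and relative position bias). It partitions a video feature map into non-overlapping 3D windows of size (8, 7, 7), the window size used by the released configs, within which self-attention is computed.

```python
import torch

def window_partition_3d(x, window_size):
    """Split a (B, D, H, W, C) video feature map into non-overlapping 3D windows.

    Returns a tensor of shape (num_windows * B, Wd * Wh * Ww, C), i.e. one token
    sequence per local 3D window, ready for windowed self-attention.
    """
    B, D, H, W, C = x.shape
    Wd, Wh, Ww = window_size
    x = x.view(B, D // Wd, Wd, H // Wh, Wh, W // Ww, Ww, C)
    windows = x.permute(0, 1, 3, 5, 2, 4, 6, 7).contiguous()
    return windows.view(-1, Wd * Wh * Ww, C)

# Toy example: 8 frames of 14x14 tokens with 96 channels, split into (8, 7, 7)
# windows -> 4 windows per clip, each a sequence of 392 tokens.
feat = torch.randn(1, 8, 14, 14, 96)
windows = window_partition_3d(feat, (8, 7, 7))
print(windows.shape)  # torch.Size([4, 392, 96])
```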
Results on Kinetics-400:

Backbone | Pretrain | Lr Schd | spatial crop | acc@1 (%) | acc@5 (%) | #params | FLOPs | config | model |
---|---|---|---|---|---|---|---|---|---|
Swin-T | ImageNet-1K | 30ep | 224 | 78.8 | 93.6 | 28M | 87.9G | config | github/baidu |
Swin-S | ImageNet-1K | 30ep | 224 | 80.6 | 94.5 | 50M | 165.9G | config | github/baidu |
Swin-B | ImageNet-1K | 30ep | 224 | 80.6 | 94.6 | 88M | 281.6G | config | github/baidu |
Swin-B | ImageNet-22K | 30ep | 224 | 82.7 | 95.5 | 88M | 281.6G | config | github/baidu |
Results on Kinetics-600:

Backbone | Pretrain | Lr Schd | spatial crop | acc@1 (%) | acc@5 (%) | #params | FLOPs | config | model |
---|---|---|---|---|---|---|---|---|---|
Swin-B | ImageNet-22K | 30ep | 224 | 84.0 | 96.5 | 88M | 281.6G | config | github/baidu |
Results on Something-Something v2:

Backbone | Pretrain | Lr Schd | spatial crop | acc@1 (%) | acc@5 (%) | #params | FLOPs | config | model |
---|---|---|---|---|---|---|---|---|---|
Swin-B | Kinetics 400 | 60ep | 224 | 69.6 | 92.7 | 89M | 320.6G | config | github/baidu |
Notes:
- Pre-trained image models can be downloaded from Swin Transformer for ImageNet Classification.
- The pre-trained model for SSv2 can be downloaded at github/baidu; a quick way to sanity-check a downloaded checkpoint is sketched below.
- Access code for baidu is `swin`.
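Before pointing the training or test scripts at a downloaded checkpoint, it can help to load it and inspect a few parameter names. This is only a hedged sketch: the path is a hypothetical local location, and the top-level key of the checkpoint dict ('state_dict' vs. 'model') may differ between the image and video releases.

```python
import torch

# Hypothetical local path to a downloaded checkpoint.
ckpt_path = 'checkpoints/swin_base_patch244_window1677_sthv2.pth'

ckpt = torch.load(ckpt_path, map_location='cpu')
# Released checkpoints usually wrap the weights under 'state_dict' or 'model'.
state_dict = ckpt.get('state_dict', ckpt.get('model', ckpt)) if isinstance(ckpt, dict) else ckpt
print(f'{len(state_dict)} tensors')
for name in list(state_dict)[:5]:
    print(name, tuple(state_dict[name].shape))
```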
Please refer to install.md for installation.
We also provide docker files for CUDA 10.1 (image url) and CUDA 11.0 (image url) for convenient usage.
Please refer to data_preparation.md for a general knowledge of data preparation. The supported datasets are listed in supported_datasets.md.
We also share our Kinetics-400 annotation files, k400_val and k400_train, for better comparison.
To test a trained video recognition model, run:

```shell
# single-gpu testing
python tools/test.py <CONFIG_FILE> <CHECKPOINT_FILE> --eval top_k_accuracy

# multi-gpu testing
bash tools/dist_test.sh <CONFIG_FILE> <CHECKPOINT_FILE> <GPU_NUM> --eval top_k_accuracy
```
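For quick single-video inference (as opposed to full benchmark evaluation), the mmaction2 high-level API can also be used from Python. Argument names and the return format of these helpers vary slightly across mmaction2 versions, so treat the following as a sketch rather than a fixed recipe; the checkpoint and video paths are placeholders.

```python
from mmaction.apis import inference_recognizer, init_recognizer

# Config shipped with this repo and a (hypothetical) local checkpoint path.
config = 'configs/recognition/swin/swin_tiny_patch244_window877_kinetics400_1k.py'
checkpoint = 'checkpoints/swin_tiny_patch244_window877_kinetics400_1k.pth'

# Build the recognizer and load the weights onto a single GPU.
model = init_recognizer(config, checkpoint, device='cuda:0')

# Run recognition on one video; depending on the mmaction2 version this returns
# the top-scoring (class index, score) pairs.
results = inference_recognizer(model, 'path/to/video.mp4')
print(results)
```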
To train a video recognition model with pre-trained image models (for the Kinetics-400 and Kinetics-600 datasets), run:
```shell
# single-gpu training
python tools/train.py <CONFIG_FILE> --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

# multi-gpu training
bash tools/dist_train.sh <CONFIG_FILE> <GPU_NUM> --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]
```
For example, to train a Swin-T model on the Kinetics-400 dataset with 8 GPUs, run:

```shell
bash tools/dist_train.sh configs/recognition/swin/swin_tiny_patch244_window877_kinetics400_1k.py 8 --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL>
```
To train a video recognizer with pre-trained video models (for the Something-Something v2 dataset), run:
```shell
# single-gpu training
python tools/train.py <CONFIG_FILE> --cfg-options load_from=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

# multi-gpu training
bash tools/dist_train.sh <CONFIG_FILE> <GPU_NUM> --cfg-options load_from=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]
```
For example, to train a Swin-B model on the SSv2 dataset with 8 GPUs, run:

```shell
bash tools/dist_train.sh configs/recognition/swin/swin_base_patch244_window1677_sthv2.py 8 --cfg-options load_from=<PRETRAIN_MODEL>
```
Note: `use_checkpoint` is used to save GPU memory. Please refer to this page for more details.
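`use_checkpoint` enables activation (gradient) checkpointing inside the backbone blocks: intermediate activations are recomputed during the backward pass instead of being stored, which lowers peak GPU memory at the cost of extra compute. The following is a minimal, self-contained PyTorch illustration of that trade-off, not the backbone code of this repo:

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """A stand-in for a transformer block (illustration only)."""

    def __init__(self, dim=96):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim),
            torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim),
        )

    def forward(self, x, use_checkpoint=False):
        if use_checkpoint:
            # Do not keep the MLP's intermediate activations; recompute them
            # when gradients are needed, trading compute for memory.
            return x + checkpoint(self.mlp, x)
        return x + self.mlp(x)

x = torch.randn(4, 392, 96, requires_grad=True)
y = Block()(x, use_checkpoint=True)
y.sum().backward()
```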
We use apex for mixed precision training by default. To install apex, use our provided docker or run:
```shell
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```
If you would like to disable apex, comment out the following code block in the configuration files:
```python
# do not use mmcv version fp16
fp16 = None
optimizer_config = dict(
    type="DistOptimizerHook",
    update_interval=1,
    grad_clip=None,
    coalesce=True,
    bucket_size_mb=-1,
    use_fp16=True,
)
```
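For reference, a plain fp32 setup without apex would fall back to the standard mmcv optimizer hook; the snippet below is the usual mmcv-style setting, not something taken from this repo's configs:

```python
# Plain fp32 training with the default mmcv optimizer hook (no apex).
# grad_clip can be set to e.g. dict(max_norm=40, norm_type=2) if desired.
fp16 = None
optimizer_config = dict(grad_clip=None)
```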
If you find our work useful in your research, please cite:
```
@article{liu2021video,
  title={Video Swin Transformer},
  author={Liu, Ze and Ning, Jia and Cao, Yue and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Hu, Han},
  journal={arXiv preprint arXiv:2106.13230},
  year={2021}
}

@article{liu2021Swin,
  title={Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
  author={Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining},
  journal={arXiv preprint arXiv:2103.14030},
  year={2021}
}
```
- Image Classification: See Swin Transformer for Image Classification.
- Object Detection: See Swin Transformer for Object Detection.
- Semantic Segmentation: See Swin Transformer for Semantic Segmentation.
- Self-Supervised Learning: See MoBY with Swin Transformer.