This file documents a large collection of baselines trained with detectron2 in Sep-Oct, 2019.
The corresponding configurations for all models can be found under the configs/ directory.
Unless otherwise noted, the following settings are used for all runs:
- All models were trained on Big Basin servers with 8 NVIDIA V100 GPUs, with data-parallel sync SGD and a total minibatch size of 16 images.
- All models were trained with CUDA 9.2 and cuDNN 7.4.2 or 7.6.3 (the performance difference between the two cuDNN versions was found to be negligible).
- The default settings are not directly comparable with Detectron. For example, our default training data augmentation uses scale jittering in addition to horizontal flipping. For configs that are closer to Detectron's settings, see Detectron1-Comparisons.
- No test-time augmentation is used for inference.
- Inference time is measured with batch size 1. It includes the time taken to postprocess results for evaluation, as well as some input latency, so it does not accurately reflect time-to-results. We'll provide better metrics for inference speed in the future.
- The model id column is provided for ease of reference.
- To check the integrity of a downloaded file: every model on this page contains its md5 prefix in its file name.
- All COCO models were trained on train2017 and evaluated on val2017.
- For Faster/Mask R-CNN, we provide baselines based on 3 different backbone combinations:
  - FPN: Use a ResNet+FPN backbone with standard conv and FC heads for mask and box prediction, respectively. It obtains the best speed/accuracy tradeoff, but the other two are still useful for research.
  - C4: Use a ResNet conv4 backbone with conv5 head. This is the original baseline in the Faster R-CNN paper.
  - DC5 (Dilated-C5): Use a ResNet conv5 backbone with dilations in conv5, and standard conv and FC heads for mask and box prediction, respectively. This is used by the Deformable ConvNet paper.
- Most models are trained with the 3x schedule (~37 COCO epochs). Although 1x models are heavily under-trained, we provide some ResNet-50 models with the 1x (~12 COCO epochs) training schedule for comparison when doing quick research iteration.
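The epoch counts quoted for the 1x and 3x schedules can be roughly reproduced from the minibatch size given above. The sketch below is only an illustration: the 90,000-iteration length of the 1x schedule and the 118,287-image size of train2017 are assumptions not stated in this file, and the `epochs` helper is hypothetical.

```python
# Rough epoch arithmetic for the 1x and 3x schedules.
# Assumptions (not stated in this file): the 1x schedule runs 90,000
# iterations, and COCO train2017 contains 118,287 images.
COCO_TRAIN2017_IMAGES = 118_287
IMAGES_PER_ITER = 16          # total minibatch size from the settings above
ITERS_1X = 90_000


def epochs(iterations, images_per_iter=IMAGES_PER_ITER,
           dataset_size=COCO_TRAIN2017_IMAGES):
    """Number of passes over the dataset made in `iterations` steps."""
    return iterations * images_per_iter / dataset_size


for name, multiplier in (("1x", 1), ("3x", 3)):
    print(f"{name}: {epochs(multiplier * ITERS_1X):.1f} epochs")
```

Under these assumptions the result is about 12.2 epochs for 1x and 36.5 for 3x, in line with the rounded figures above. The exact value shifts slightly if images without annotations are filtered out before training.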
We provide backbone models pretrained on the ImageNet-1k dataset. These models are different from those provided in Detectron: we do not fuse BatchNorm into an affine layer.
- R-50.pkl: converted copy of MSRA's original ResNet-50 model
- R-101.pkl: converted copy of MSRA's original ResNet-101 model
- X-101-32x8d.pkl: ResNeXt-101-32x8d model trained with Caffe2 at FB
Pretrained models in Detectron's format can still be used. For example:
- X-152-32x8d-IN5k.pkl: ResNeXt-152-32x8d model trained on ImageNet-5k with Caffe2 at FB (see ResNeXt paper for details on ImageNet-5k).
- R-50-GN.pkl: ResNet-50 with Group Normalization.
- R-101-GN.pkl: ResNet-101 with Group Normalization.
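For readers unfamiliar with the term, "fusing BatchNorm into an affine layer" means folding frozen BatchNorm statistics into a single per-channel scale and bias. The scalar sketch below illustrates the arithmetic only; it is not the conversion code used by Detectron or detectron2, and the function names and the `eps` default are illustrative assumptions.

```python
import math


def batchnorm(x, gamma, beta, mean, var, eps=1e-5):
    """Inference-time BatchNorm for one channel with frozen statistics."""
    return gamma * (x - mean) / math.sqrt(var + eps) + beta


def fuse_to_affine(gamma, beta, mean, var, eps=1e-5):
    """Fold frozen BatchNorm parameters into (scale, bias) so that
    scale * x + bias equals batchnorm(x, ...)."""
    scale = gamma / math.sqrt(var + eps)
    bias = beta - mean * scale
    return scale, bias


# Hypothetical per-channel values, chosen only for illustration.
gamma, beta, mean, var = 1.5, 0.2, 0.7, 4.0
scale, bias = fuse_to_affine(gamma, beta, mean, var)
x = 3.0
print(batchnorm(x, gamma, beta, mean, var), scale * x + bias)
```

The two printed values agree up to floating-point rounding. A fused model stores only `scale` and `bias`, whereas the models provided here keep the four BatchNorm parameters separately.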
All models available for download through this document are licensed under the Creative Commons Attribution-ShareAlike 3.0 license.
COCO object detection baselines with Faster R-CNN:
Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | model id | download |
---|---|---|---|---|---|---|---|
R50-C4 | 1x | 0.593 | 0.110 | 4.8 | 35.7 | 137257644 | model \| metrics |
R50-DC5 | 1x | 0.380 | 0.089 | 5.0 | 37.3 | 137847829 | model \| metrics |
R50-FPN | 1x | 0.210 | 0.060 | 3.0 | 37.9 | 137257794 | model \| metrics |
R50-C4 | 3x | 0.589 | 0.108 | 4.8 | 38.4 | 137849393 | model \| metrics |
R50-DC5 | 3x | 0.378 | 0.095 | 5.0 | 39.0 | 137849425 | model \| metrics |
R50-FPN | 3x | 0.209 | 0.058 | 3.0 | 40.2 | 137849458 | model \| metrics |
R101-C4 | 3x | 0.656 | 0.137 | 5.9 | 41.1 | 138204752 | model \| metrics |
R101-DC5 | 3x | 0.452 | 0.103 | 6.1 | 40.6 | 138204841 | model \| metrics |
R101-FPN | 3x | 0.286 | 0.071 | 4.1 | 42.0 | 137851257 | model \| metrics |
X101-FPN | 3x | 0.638 | 0.139 | 6.7 | 43.0 | 139173657 | model \| metrics |
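As noted in the settings list, each downloadable model carries an md5 prefix in its file name. The sketch below shows one way to check a downloaded file against that prefix. It assumes, without confirmation from this file, that the prefix is the last underscore-separated token of the file name (for example `<name>_<prefix>.pkl`); the helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_hex(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_prefix(path):
    """Check a file against the md5 prefix embedded in its name.

    Assumption: the name looks like "<name>_<md5 prefix>.<ext>".
    Returns False when no hex token is found in that position.
    """
    prefix = Path(path).stem.rsplit("_", 1)[-1].lower()
    if not prefix or any(c not in "0123456789abcdef" for c in prefix):
        return False
    return md5_hex(path).startswith(prefix)
```

A call such as `matches_md5_prefix("path/to/downloaded_model.pkl")` returns `True` only when the digest of the file contents starts with the token found in the name. A short prefix catches truncated or corrupted downloads but is not a substitute for a full checksum.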
COCO object detection baselines with RetinaNet:
Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | model id | download |
---|---|---|---|---|---|---|---|
R50 | 1x | 0.200 | 0.082 | 3.9 | 36.5 | 137593951 | model \| metrics |
R50 | 3x | 0.201 | 0.081 | 3.9 | 37.9 | 137849486 | model \| metrics |
R101 | 3x | 0.280 | 0.087 | 5.1 | 39.9 | 138363263 | model \| metrics |
COCO baselines with RPN and Fast R-CNN:
Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | prop. AR | model id | download |
---|---|---|---|---|---|---|---|---|
RPN R50-C4 | 1x | 0.130 | 0.056 | 1.5 | | 51.6 | 137258005 | model \| metrics |
RPN R50-FPN | 1x | 0.186 | 0.053 | 2.7 | | 58.0 | 137258492 | model \| metrics |
Fast R-CNN R50-FPN | 1x | 0.140 | 0.056 | 2.6 | 37.8 | | 137635226 | model \| metrics |
COCO instance segmentation baselines with Mask R-CNN:
Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
---|---|---|---|---|---|---|---|---|
R50-C4 | 1x | 0.621 | 0.140 | 5.2 | 36.8 | 32.2 | 137259246 | model \| metrics |
R50-DC5 | 1x | 0.471 | 0.126 | 6.5 | 38.3 | 34.2 | 137260150 | model \| metrics |
R50-FPN | 1x | 0.261 | 0.087 | 3.4 | 38.6 | 35.2 | 137260431 | model \| metrics |
R50-C4 | 3x | 0.622 | 0.137 | 5.2 | 39.8 | 34.4 | 137849525 | model \| metrics |
R50-DC5 | 3x | 0.470 | 0.111 | 6.5 | 40.0 | 35.9 | 137849551 | model \| metrics |
R50-FPN | 3x | 0.261 | 0.079 | 3.4 | 41.0 | 37.2 | 137849600 | model \| metrics |
R101-C4 | 3x | 0.691 | 0.163 | 6.3 | 42.6 | 36.7 | 138363239 | model \| metrics |
R101-DC5 | 3x | 0.545 | 0.129 | 7.6 | 41.9 | 37.3 | 138363294 | model \| metrics |
R101-FPN | 3x | 0.340 | 0.092 | 4.6 | 42.9 | 38.6 | 138205316 | model \| metrics |
X101-FPN | 3x | 0.690 | 0.155 | 7.2 | 44.3 | 39.5 | 139653917 | model \| metrics |
COCO person keypoint detection baselines with Keypoint R-CNN:
Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | kp. AP | model id | download |
---|---|---|---|---|---|---|---|---|
R50-FPN | 1x | 0.315 | 0.102 | 5.0 | 53.6 | 64.0 | 137261548 | model \| metrics |
R50-FPN | 3x | 0.316 | 0.095 | 5.0 | 55.4 | 65.5 | 137849621 | model \| metrics |
R101-FPN | 3x | 0.390 | 0.106 | 6.1 | 56.4 | 66.1 | 138363331 | model \| metrics |
X101-FPN | 3x | 0.738 | 0.168 | 8.7 | 57.3 | 66.0 | 139686956 | model \| metrics |
COCO panoptic segmentation baselines with Panoptic FPN:
Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | mask AP | PQ | model id | download |
---|---|---|---|---|---|---|---|---|---|
R50-FPN | 1x | 0.304 | 0.129 | 4.8 | 37.6 | 34.7 | 39.4 | 139514544 | model \| metrics |
R50-FPN | 3x | 0.302 | 0.127 | 4.8 | 40.0 | 36.5 | 41.5 | 139514569 | model \| metrics |
R101-FPN | 3x | 0.392 | 0.137 | 6.0 | 42.4 | 38.5 | 43.0 | 139514519 | model \| metrics |
Mask R-CNN baselines on the LVIS dataset, v0.5. These baselines are described in Table 3(c) of the LVIS paper.
NOTE: the 1x schedule here has the same number of iterations as the COCO baselines, which amounts to roughly 24 epochs of LVIS v0.5 data.
Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
---|---|---|---|---|---|---|---|---|
R50-FPN | 1x | 0.319 | 0.369 | 6.6 | 24.0 | 24.4 | 134714017 | model \| metrics |
R101-FPN | 1x | 0.395 | 0.385 | 7.6 | 25.8 | 26.1 | 134807205 | model \| metrics |
X101-FPN | 1x | 1.330 | 0.461 | 10.0 | 27.3 | 27.9 | 135397361 | model \| metrics |
Simple baselines for:
- Mask R-CNN on Cityscapes instance segmentation (trained on fine annotations only)
- Faster R-CNN on PASCAL VOC object detection (trained on VOC 2007 train+val + VOC 2012 train+val, tested on VOC 2007 using 11-point interpolated AP)
Name | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | box AP50 | mask AP | model id | download |
---|---|---|---|---|---|---|---|---|
R50-FPN, Cityscapes | 0.240 | 0.397 | 4.4 | | | 36.5 | 142423278 | model \| metrics |
R50-C4, VOC | 0.537 | 0.096 | 4.8 | 51.9 | 80.3 | | 142202221 | model \| metrics |
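The 11-point interpolated AP named in the VOC description above averages the interpolated precision at recall thresholds 0.0, 0.1, ..., 1.0. The sketch below illustrates that definition for a single class, given precision/recall pairs computed from ranked detections. It is not the evaluation code used to produce the numbers in this table, and `voc07_ap` is a hypothetical name.

```python
def voc07_ap(recall, precision):
    """11-point interpolated average precision (VOC 2007 style).

    `recall` and `precision` are equal-length sequences giving the
    operating points obtained by sweeping the detection score threshold.
    """
    ap = 0.0
    for i in range(11):
        threshold = i / 10
        # Interpolated precision: the best precision achieved at any
        # recall at or above the threshold, or 0 if none is reached.
        best = max(
            (p for r, p in zip(recall, precision) if r >= threshold),
            default=0.0,
        )
        ap += best / 11
    return ap


# A detector that reaches only 50% recall, always at precision 1.0,
# scores on 6 of the 11 thresholds (0.0 through 0.5).
print(voc07_ap([0.5], [1.0]))
```

Because the thresholds are fixed, this metric is coarser than the area-under-curve AP used for COCO, which is one reason the VOC and COCO numbers on this page are not directly comparable.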
Ablations for Deformable Conv and Cascade R-CNN:
Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
---|---|---|---|---|---|---|---|---|
Baseline R50-FPN | 1x | 0.261 | 0.087 | 3.4 | 38.6 | 35.2 | 137260431 | model \| metrics |
Deformable Conv | 1x | 0.342 | 0.084 | 3.5 | 41.5 | 37.5 | 138602867 | model \| metrics |
Cascade R-CNN | 1x | 0.317 | 0.090 | 4.0 | 42.1 | 36.4 | 138602847 | model \| metrics |
Ablations for GroupNorm:
Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
---|---|---|---|---|---|---|---|---|
Baseline R50-FPN | 3x | 0.261 | 0.079 | 3.4 | 41.0 | 37.2 | 137849600 | model \| metrics |
GroupNorm | 3x | 0.356 | 0.102 | 7.3 | 42.6 | 38.6 | 138602888 | model \| metrics |
GroupNorm (scratch) | 3x | 0.400 | 0.106 | 9.8 | 39.9 | 36.6 | 138602908 | model \| metrics |
A few very large models trained for a long time, for demo purposes.
Name | inference time (s/im) | train mem (GB) | box AP | mask AP | PQ | model id | download |
---|---|---|---|---|---|---|---|
Panoptic FPN R101 | 0.172 | 11.4 | 47.4 | 41.3 | 46.1 | 139797668 | model \| metrics |
Mask R-CNN X152 | 0.278 | 15.1 | 49.3 | 43.2 | | 18131413 | model \| metrics |
above + test-time aug. | | | 51.4 | 45.5 | | | |