This report documents the performance of the DeepDetect Open Source Deep Learning server on a variety of platforms and on popular or particularly effective neural network architectures. The full server source code is available from https://github.com/beniz/deepdetect.
Given the range of platforms, the results should serve as a reference for parties and users choosing the right NN model for their server or embedded system.
The platforms, ordered from most to least powerful:
- NVidia GTX 1080 Ti
- NVidia Jetson TX1
- NVidia Jetson TK1
- Raspberry Pi 3
Note that the 1080Ti and TX1 use NVidia's CuDNN acceleration library, while the TK1 uses a GPU implementation without CuDNN and the Raspberry Pi runs on CPU only.
For a detailed description of all platforms, see the dedicated platform section.
We conduct the experiment with multiple contemporary Neural Network (NN) models:
- GoogleNet
- VGG16 and VGG19
- Resnet 50, 101 and 152
- Densenet 121 and 201
- Squeezenet v1.0 and v1.1
- Mobilenet (original Caffe version and another with custom speed-up)
- Shufflenet
One important aspect of choosing a model is the limitation of the hardware, such as its computational throughput (in FLOPS) and the amount of available RAM. The number of floating-point operations required for a single pass of each model is displayed below, along with the number of parameters (weights in the network).
Performance is displayed below in log scale. Reported values are processing times per image, in ms. When the batch size is greater than one, the reported value is the average time per image for that batch size. On GPUs and platforms with limited memory, not all batch sizes are applicable.
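Since the figures are per-image times in ms while the analysis in this report talks about frames per second, a small conversion helper can be handy. The sketch below is illustrative only: the function names are hypothetical, and the tiers simply reuse the 10 fps and 25 fps thresholds mentioned in this report.

```python
def ms_per_image_to_fps(ms_per_image: float) -> float:
    """Convert an average per-image processing time (ms) into frames per second."""
    if ms_per_image <= 0:
        raise ValueError("ms_per_image must be positive")
    return 1000.0 / ms_per_image


def realtime_tier(ms_per_image: float) -> str:
    """Classify a timing against the 10 fps and 25 fps thresholds."""
    fps = ms_per_image_to_fps(ms_per_image)
    if fps >= 25.0:
        return ">=25 fps"
    if fps >= 10.0:
        return ">=10 fps"
    return "<10 fps"


# 40 ms per image corresponds to exactly 25 frames per second.
print(ms_per_image_to_fps(40.0), realtime_tier(40.0))
```

For batched results the same conversion applies, because the reported value is already averaged per image.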
The reported performances use a customized version of Caffe as backend.
The graph shows the performance difference between the Raspberry Pi 3 and the Raspberry Pi 4 (2 GB) using NCNN as a backend.
- All considered networks are tested on image classification tasks; we may add more tasks to the benchmark in the future.
- There is roughly an order of magnitude difference in performance between consecutive platforms, taken in decreasing order of performance.
- We seek the best architectures for embedded systems: Squeezenet, Shufflenet, Mobilenet and GoogleNet appear to be the most suited. We use the 10 fps and 25 fps thresholds as markers for potential real-time applications (possibly with batch size > 1).
- We use an improved depthwise convolutional layer to boost the performance of the Mobilenet and Shufflenet architectures. This new layer is available from our custom version of Caffe alongside many other improvements and features.
- Squeezenet v1.1 appears to be the clear winner for embedded platforms. With further analysis, low-parameter versions of MobileNet could prove competitive despite the grouped convolutions.
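The gap between platforms can be checked directly from the timing tables of this report. The sketch below uses the Squeezenet v1.1 timings at batch size 1 and assumes that the four timing tables follow the platform order listed at the top of the report; the function name is hypothetical.

```python
# ms per image at batch size 1 for Squeezenet v1.1, copied from the timing tables.
# ASSUMPTION: the four tables follow the platform order listed in this report.
SQUEEZENET_V11_MS = [
    ("GTX 1080 Ti", 8.6),
    ("Jetson TX1", 30.2),
    ("Jetson TK1", 90.2),
    ("Raspberry Pi 3", 910.0),
]


def slowdown_between_neighbours(timings):
    """Return (faster, slower, ratio) for each pair of consecutive platforms."""
    return [
        (fast_name, slow_name, slow_ms / fast_ms)
        for (fast_name, fast_ms), (slow_name, slow_ms) in zip(timings, timings[1:])
    ]


for fast, slow, ratio in slowdown_between_neighbours(SQUEEZENET_V11_MS):
    print(f"{slow} is {ratio:.1f}x slower than {fast}")
```

The ratio depends on the model and the batch size, so a single model at batch size 1 is only one data point.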
- Desktop GTX1080Ti (11.3 TFLOPS, 3584 cores)
On a desktop with a GTX1080Ti, most models run faster than 25 fps. The card has 11 GB of GDDR5X VRAM and 3584 CUDA cores running at a maximum of 1582 MHz, which amounts to 11.3 TFLOPS. While it is capable of real-time processing, its power consumption is not viable for embedded applications. Drawing 280 watts under load, the desktop setup suits analysis, surveillance and other workloads a desktop would handle, but not embedded applications.
- Jetson TX1 (1 TFLOPS, 256 cores)
Second on the list is the Nvidia Jetson TX1. Drawing at most 15 W in operation, the TX1 is a great candidate for embedded applications. At 1 TFLOPS theoretical output, the TX1 pushes squeezenet_v1.0, squeezenet_v1.1, mobilenet_depthwise, googlenet, and shufflenet to more than 25 fps. In the best case, the TX1 reaches up to 85 fps with squeezenet_v1.1 at batch sizes of 16 or more. For a project with critical time constraints, such as autonomous cars, the TX1 could prove to be a viable solution.
- Jetson TX2 (1.5 TFLOPS, 256 cores)
The Jetson TX2 offers a 1.5 TFLOPS output and is a great solution for fast and power-efficient embedded systems. The TX2 is equipped with an NVIDIA Pascal GPU. This 7.5-watt module can push up to 80 fps at a batch size of 128 for the Squeezenet model. Under the same conditions, Googlenet reaches up to 68 fps.
At a batch size of 64, the Jetson TX2 can reach up to 50 frames per second. For a project with real-time computation, such as autonomous cars, the Jetson TX2 would be an ideal candidate. The TX2's performance allows very fast computation at the edge.
- Jetson Nano (500 GFLOPS, 128 cores)
Drawing at most 5 W in operation, the Nano is a low-cost solution for embedded applications and AI at the edge. It has a 500 GFLOPS output. At a batch size of 1, ShuffleNet and SqueezeNet reach 12 and 25 fps respectively. The Jetson Nano can push up to 10 fps at batch sizes of 2 or more for Squeezenet-SSD-faces, SqueezeNet-SSD-voc and ResNet18-ocr. At a batch size of 64, the Nano computes up to 48 fps for SqueezeNet and ResNet18-ocr. For a large-scale project or one with budget constraints, the Jetson Nano seems to be an interesting solution.
- Jetson TK1 (300 GFLOPS, 192 cores)
With 12.5 watts rated under load on the development board (it should be lower on the module, according to NVIDIA) and a cost of 200 USD, the Jetson TK1 seems to hit the sweet spot of computational power versus cost for embedded applications. Given proper optimization, the TK1 could reach a processing speed of 25 fps. The TK1 would serve well in general-purpose image classification in manufacturing processes, surveillance, and replacing workforce in non-safety-critical tasks.
- Raspberry Pi3 model B (24 GFLOPS GPU and 2.3 DMIPS/MHz CPU, at 35 USD)
The last on our list is the Raspberry Pi3. At merely 4 watts under load, the Pi ought to be the preferred solution for remote sensing. Its downside is image processing speed, which peaks at roughly 1 fps.
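Combining the power figures quoted above with the timing tables gives a rough frames-per-second-per-watt comparison. This is only a sketch under stated assumptions: the wattages are the maximum or under-load figures quoted in the text, not power measured during the benchmark; the timings are the best Squeezenet v1.1 values from the timing tables, assuming the tables follow the platform order listed at the top of the report; the TX2 and Nano are left out because this report has no timing table for them.

```python
# (watts quoted in the text, best ms per image for Squeezenet v1.1 from the tables)
# ASSUMPTION: wattage is a rated figure, not a measurement taken during the benchmark.
PLATFORMS = {
    "GTX 1080 Ti": (280.0, 1.97),
    "Jetson TX1": (15.0, 11.5),
    "Jetson TK1": (12.5, 40.0),
    "Raspberry Pi 3": (4.0, 909.0),
}


def fps_per_watt(watts: float, ms_per_image: float) -> float:
    """Frames per second delivered per watt of rated power."""
    return (1000.0 / ms_per_image) / watts


for name, (watts, ms) in PLATFORMS.items():
    print(f"{name}: {fps_per_watt(watts, ms):.2f} fps/W")
```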
The reported performances use a customized version of Caffe as backend. The results of the comparison of each model across multiple platforms are displayed below. The legend shows the batch size, color-coded. Note that not all batch sizes are available for all architectures.
The challenge of implementing a NN on an embedded system is the limitation on memory and computational resources.
That is to say, the network should have a small computational footprint without losing accuracy. To this end, we looked into three rather novel architectures: SqueezeNet, MobileNet and ShuffleNet.
Mobilenet is an implementation of Google's MobileNet. Mobilenet has a Top-1 accuracy of 70.81% and a Top-5 accuracy of 89.5%, compared to 77.31% Top-1 and 93.64% Top-5 for the leading model in accuracy, Densenet201. The MobileNet architecture thus shows a rather minimal loss in accuracy while reducing the footprint from 4.7 Gflops to 0.56 Gflops.
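The trade-off can be made explicit with a couple of lines of arithmetic. The sketch below uses the Top-1 accuracies quoted above and the Gflops values from the table at the end of this report; the function name is hypothetical.

```python
def tradeoff(ref_gflops, ref_top1, small_gflops, small_top1):
    """Return (compute reduction factor, Top-1 accuracy drop in percentage points)."""
    return ref_gflops / small_gflops, ref_top1 - small_top1


# Densenet201 (reference) versus Mobilenet, values taken from this report.
reduction, drop = tradeoff(4.7727, 77.31, 0.5687, 70.81)
print(f"{reduction:.1f}x fewer flops for {drop:.1f} points of Top-1 accuracy")
```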
The initial result was rather underwhelming, however. While faster than densenet201, mobilenet was nowhere near the leading models in terms of speed. The reason lies in the vanilla implementation of grouped convolutions in Caffe. A dedicated rewrite of depthwise convolutions (modified from BVLC/caffe#5665) yielded an order of magnitude speed-up, making MobileNet usable again.
Our baseline was customized from https://github.com/shicai/MobileNet-Caffe.
The performance gain over the naive MobileNet implementation with vanilla Caffe is shown below. On CPU, the gain is negligible, as seen on the Raspberry Pi 3, a pure CPU platform. On GPU platforms, the gain improves with batch size.
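The effect of batch size on the gain can be read off the timing tables. The sketch below computes the speed-up of mobilenet_depthwise over mobilenet from the first timing table of this report, which is assumed here to be the GTX 1080 Ti; the function name is hypothetical.

```python
# ms per image from the first timing table of this report.
# ASSUMPTION: that table corresponds to the GTX 1080 Ti.
BATCH_SIZES = [1, 2, 4, 8, 16, 32, 64, 128]
MOBILENET_MS = [37.2, 36.3, 22.1, 21.2, 19.5, 18.2, 19.3, 16.8]
MOBILENET_DEPTHWISE_MS = [12.2, 6.2, 4.3, 3.52, 3.73, 3.23, 3.12, 2.63]


def speedups(baseline_ms, optimized_ms):
    """Element-wise ratio baseline / optimized (values above 1 mean faster)."""
    return [b / o for b, o in zip(baseline_ms, optimized_ms)]


for batch, gain in zip(BATCH_SIZES, speedups(MOBILENET_MS, MOBILENET_DEPTHWISE_MS)):
    print(f"batch {batch:>3}: {gain:.2f}x")
```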
ShuffleNet promises a more efficient NN through depthwise convolutions and a dedicated shuffling of channels.
We used a customized implementation from https://github.com/farmingyard/ShuffleNet, which exhibits good performance.
The benchmark uses the dd_bench.py Python script with images that can be downloaded from https://deepdetect.com/stuff/bench.tar.gz.
Assuming you have successfully built DeepDetect and it is up and running, the following call to the benchmark tool was used:
python dd_bench.py --host localhost --port 8080 --sname imageserv --gpu --remote-bench-data-dir <bench folder's location> --max-batch-size 128 --create <NN model folder name>
Replace <bench folder's location> with the path to your bench folder and <NN model folder name> with your model folder name or path, assuming it is saved under DeepDetect/models.
This creates a service named imageserv on the DD server listening on localhost:8080. It uses the available GPU, as requested by --gpu, and attempts increasing batch sizes up to 128.
- Note that attempting to create a service that already exists results in errors. Remove --create <model name> to avoid this issue. To automatically kill the service after benchmarking, add --auto-kill. For more information, run python dd_bench.py --help
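When the benchmark is run repeatedly over several models, it can help to assemble the command line programmatically. The sketch below only uses the flags that appear in this section; the helper name, its parameters and the example paths are hypothetical, and the resulting list would still have to be passed to a process launcher such as subprocess.run.

```python
import shlex


def build_dd_bench_command(bench_dir, model_dir, host="localhost", port=8080,
                           sname="imageserv", max_batch_size=128,
                           gpu=True, create=True, auto_kill=False):
    """Assemble the dd_bench.py argument list from the flags described in this report."""
    cmd = ["python", "dd_bench.py", "--host", host, "--port", str(port),
           "--sname", sname]
    if gpu:
        cmd.append("--gpu")
    cmd += ["--remote-bench-data-dir", bench_dir,
            "--max-batch-size", str(max_batch_size)]
    if create:
        # Leave this out when the service already exists, to avoid creation errors.
        cmd += ["--create", model_dir]
    if auto_kill:
        cmd.append("--auto-kill")
    return cmd


# Hypothetical paths, for illustration only.
cmd = build_dd_bench_command("/path/to/bench", "googlenet", auto_kill=True)
print(shlex.join(cmd))
```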
To use additional models for benchmarking, two files are needed:
- model.caffemodel
- deploy.prototxt
The former holds the trained weights of the model, while the latter is a structural description of the network.
To train your own model beforehand, please refer to the section here.
For a prototxt file taken from other sources, we need to make sure that the input and output are compatible with DeepDetect.
In the general case, we add a first layer that takes a 224x224 image as input, and a final layer that applies a softmax to the output. A useful reference template is https://github.com/beniz/deepdetect/blob/6d0a1f2d1e487b492e004d7d5972f302d4182ab1/templates/caffe/googlenet/deploy.prototxt
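Before benchmarking a new model, a quick sanity check of the model folder can catch the most common omissions. The sketch below is a deliberately naive text-based check and not part of DeepDetect: the function name and the example path are hypothetical, it only looks for the two file names given above, and it merely searches the prototxt text for a "Softmax" layer type and a 224 dimension rather than parsing the network definition.

```python
from pathlib import Path

REQUIRED_FILES = ("model.caffemodel", "deploy.prototxt")


def check_model_dir(model_dir):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    root = Path(model_dir)
    problems = [f"missing file: {name}" for name in REQUIRED_FILES
                if not (root / name).is_file()]
    prototxt = root / "deploy.prototxt"
    if prototxt.is_file():
        text = prototxt.read_text(encoding="utf-8", errors="replace")
        # Naive substring checks: they do not validate the network structure.
        if '"Softmax"' not in text:
            problems.append('deploy.prototxt has no layer of type "Softmax"')
        if "224" not in text:
            problems.append("deploy.prototxt does not mention a 224 input dimension")
    return problems


# Hypothetical folder name, for illustration only.
print(check_model_dir("models/my_model"))
```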
GTX 1080 Ti, time per image in ms (x: batch size not available):

batch size | mobilenet | mobilenet_depthwise | res50 | res101 | res152 | googlenet | densenet121 | densenet201 | Squeezenetv1.0 | Squeezenetv1.1 | vgg16 | vgg19 | shufflenet |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Top-1 accuracy (%) | 70.81 | missing | 75.3 | 76.4 | 77 | 67.9 | 74.9 | 77.3 | 59.5 | 59.5 | 70.5 | 71.3 | missing |
1 | 37.2 | 12.2 | 19.8 | 35.8 | 44.4 | 16.6 | 45.6 | 69 | 8.4 | 8.6 | 14 | 14.6 | 15 |
2 | 36.3 | 6.2 | 14.1 | 22.5 | 27.8 | 9.8 | 24 | 38.6 | 4.1 | 5.5 | 9.9 | 11.2 | 9.1 |
4 | 22.1 | 4.3 | 8.8 | 13.8 | 18.5 | 5.25 | 16.5 | 25.9 | 2.6 | 3.55 | 6.95 | 8.2 | 6.95 |
8 | 21.2 | 3.52 | 7.27 | 10.4 | 14.6 | 3.93 | 11.92 | 18.5 | 2.38 | 2.33 | 5.7 | 6.25 | 4.55 |
16 | 19.5 | 3.73 | 6.33 | 8.63 | 11.6 | 3.18 | 9.06 | 13.7 | 2.16 | 1.97 | 5.18 | 6.21 | 4.71 |
32 | 18.2 | 3.23 | 5.9 | 7.82 | x | 3.3 | x | x | 2.59 | 2.96 | 5.15 | 6.05 | 3.49 |
64 | 19.3 | 3.12 | x | x | x | 3.13 | x | x | 2.5 | 2.33 | 4.82 | 5.63 | 3.26 |
128 | 16.8 | 2.63 | x | x | x | 3.05 | x | x | 2.2 | 2.2 | 4.97 | 5.57 | 2.87 |

Jetson TX1, time per image in ms (x: batch size not available):

batch size | mobilenet | mobilenet_depthwise | res50 | res101 | res152 | googlenet | densenet121 | densenet201 | Squeezenetv1.0 | Squeezenetv1.1 | vgg16 | vgg19 | shufflenet |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Top-1 accuracy (%) | 70.81 | missing | 75.3 | 76.4 | 77 | 67.9 | 74.9 | 77.3 | 59.5 | 59.5 | 70.5 | 71.3 | missing |
1 | 171 | 33.8 | 89 | 142 | 195 | 43.6 | 134 | 248 | 33.4 | 30.2 | 133 | 152 | 60 |
2 | 173 | 29.2 | 77.7 | 122 | 180 | 29.6 | 98.5 | 159 | 23.7 | 17.9 | 165 | 187 | 38.8 |
4 | 164 | 27 | 69.6 | 112 | x | 24 | 93.7 | x | 20.7 | 14.2 | 127 | 149 | 21.7 |
8 | 155 | 26.1 | 66.7 | x | x | 21.8 | x | x | 18.6 | 12.1 | 110 | 130 | 20.6 |
16 | x | 25.6 | x | x | x | 20.2 | x | x | 17.7 | 11.8 | 100 | 120 | 21.8 |
32 | x | 25.5 | x | x | x | 19.7 | x | x | 17.5 | 11.8 | x | x | 22.9 |
64 | x | x | x | x | x | 20 | x | x | 17.6 | 11.5 | x | x | x |
128 | x | x | x | x | x | x | x | x | x | 11.6 | x | x | x |

Jetson TK1, time per image in ms (x: batch size not available):

batch size | mobilenet | mobilenet_depthwise | res50 | res101 | res152 | googlenet | densenet121 | densenet201 | Squeezenetv1.0 | Squeezenetv1.1 | vgg16 | vgg19 | shufflenet |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Top-1 accuracy (%) | 70.81 | missing | 75.3 | 76.4 | 77 | 67.9 | 74.9 | 77.3 | 59.5 | 59.5 | 70.5 | 71.3 | missing |
1 | 464 | 336 | 203 | 283 | 400 | 197 | 294 | 637 | 119 | 90.2 | x | x | 82.8 |
2 | 462 | 210 | 231 | 351 | 477 | 127 | 225 | x | 88 | 71.3 | x | x | 63.8 |
4 | 453 | 135 | 234 | x | x | 87.2 | x | x | 70.8 | 50.9 | x | x | 53.4 |
8 | 441 | 141 | x | x | x | 78.8 | x | x | 62.9 | 53.6 | x | x | 52 |
16 | 452 | 137 | x | x | x | 87.8 | x | x | 67 | 40 | x | x | 51.3 |
32 | x | x | x | x | x | 93 | x | x | 81 | 46.8 | x | x | x |
64 | x | x | x | x | x | x | x | x | x | 45.2 | x | x | x |
128 | x | x | x | x | x | x | x | x | x | x | x | x | x |

Raspberry Pi 3, time per image in ms (x: batch size not available):

batch size | mobilenet | mobilenet_depthwise | res50 | res101 | res152 | googlenet | densenet121 | densenet201 | Squeezenetv1.0 | Squeezenetv1.1 | vgg16 | vgg19 | shufflenet |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Top-1 accuracy (%) | 70.81 | missing | 75.3 | 76.4 | 77 | 67.9 | 74.9 | 77.3 | 59.5 | 59.5 | 70.5 | 71.3 | missing |
1 | 1246 | 1443 | 3560 | x | x | 7980 | x | x | 1492 | 910 | x | x | 1115 |
2 | 1230 | 1370 | x | x | x | 8008 | x | x | 1478 | 917 | x | x | 1067 |
4 | x | 1372 | x | x | x | 7943 | x | x | 1493 | 919 | x | x | 1047 |
8 | x | 1401 | x | x | x | 8015 | x | x | 1444 | 913 | x | x | 1046 |
16 | x | x | x | x | x | x | x | x | 1456 | 909 | x | x | x |
32 | x | x | x | x | x | x | x | x | x | x | x | x | x |
64 | x | x | x | x | x | x | x | x | x | x | x | x | x |
128 | x | x | x | x | x | x | x | x | x | x | x | x | x |

metric | mobilenet | mobilenet_depthwise | res50 | res101 | res152 | googlenet | densenet121 | densenet201 | Squeezenetv1.0 | Squeezenetv1.1 | vgg16 | vgg19 | shufflenet |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Gflops (single pass) | 0.5687 | 0.5514 | 3.8580 | 7.5702 | 11.282 | 1.5826 | 3.0631 | 4.7727 | 0.8475 | 0.3491 | 15.470 | 19.632 | 0.1234 |
Parameters (millions) | 4.2309 | 4.2309 | 25.556 | 44.548 | 60.191 | 6.9902 | 7.9778 | 20.012 | 1.2444 | 1.2315 | 138.34 | 143.65 | 1.8137 |
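
The parameter counts give a first estimate of how much memory the weights alone require, which matters on platforms with limited RAM. The sketch below assumes 32-bit floating-point weights (4 bytes per parameter), which is an assumption and not something stated in this report; it ignores activations, batch buffers and framework overhead, so actual usage will be higher. The function name is hypothetical.

```python
# Parameter counts in millions, copied from the table above (subset).
PARAMS_MILLIONS = {
    "mobilenet": 4.2309,
    "googlenet": 6.9902,
    "Squeezenetv1.1": 1.2315,
    "vgg16": 138.34,
    "shufflenet": 1.8137,
}


def weights_megabytes(params_millions: float, bytes_per_param: int = 4) -> float:
    """Size of the weights in MB (10^6 bytes). ASSUMPTION: 4 bytes per parameter."""
    return params_millions * bytes_per_param


for model, params in PARAMS_MILLIONS.items():
    print(f"{model}: {weights_megabytes(params):.1f} MB of weights")
```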

The bulk of this work was done by https://github.com/jsaksris/ during an internship at Jolibrain.