I am trying to reproduce the results obtained in this 2021 paper from SPCL: [https://arxiv.org/abs/2110.10802]
I built DaCe, ONNX Runtime (the modified branch for DaCe, as instructed) with CUDA enabled, and finally DaceML. I also installed the same PyTorch and CUDA versions that are used in the paper.
My main expectation is to see some speedup in the forward and backward passes of DaCe-compiled PyTorch layers compared to the native PyTorch implementation, as demonstrated in the paper's results.
However, upon running the test scripts from the official DaceML repo inside the tests/torch directory and benchmarking the test code with CUDA event timers, I am seeing a large (~20-30x) slowdown. This is consistent across both test_bert_encoder and test_efficientnet_block.
Here's an extract of the test code from the BERT layer benchmark with the corresponding output: [https://gist.github.com/vinaysaxena93/78e07fd687eace24b43831989bb3b283]
As you can see, the time taken by the DaceModule in the forward passes is far greater. The only modification I made to the code is adding CUDA events for timing. This result is consistent across all tests and my own PyTorch scripts as well.
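For reference, the timing I added looks roughly like the sketch below (simplified; the sizes are illustrative, and the DaceModule import path and its `cuda` keyword are my understanding of the DaceML 0.2 API; the full code is in the gist):

```python
# Simplified sketch of my timing harness (illustrative sizes; DaceModule import
# path and `cuda=True` kwarg are assumptions about the DaceML 0.2 API).
import torch
from transformers import BertConfig, BertLayer
from daceml.torch import DaceModule  # older releases expose this as daceml.pytorch

batch, seq_len, hidden = 8, 512, 768
inputs = torch.randn(batch, seq_len, hidden, device="cuda")

torch_model = BertLayer(BertConfig(hidden_size=hidden)).cuda().eval()
dace_model = DaceModule(torch_model, cuda=True)

def time_forward(model, n_warmup=5, n_iters=20):
    """Average forward-pass time in ms, measured with CUDA events."""
    for _ in range(n_warmup):          # warm-up (also triggers DaCe JIT compilation)
        model(inputs)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(n_iters):
        model(inputs)
    end.record()
    torch.cuda.synchronize()           # ensure all kernels finished before reading timers
    return start.elapsed_time(end) / n_iters

print(f"PyTorch : {time_forward(torch_model):.3f} ms")
print(f"DaceML  : {time_forward(dace_model):.3f} ms")
```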
Could you please tell me if I'm missing anything that could cause such slowdowns? I have not created a custom .dace.conf file; if that is the reason, could you please post some hints on how to improve the performance, or share which configuration was used for the evaluation presented in the paper? Any pointers would be very helpful.
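For context, my understanding is that individual settings can also be overridden programmatically instead of editing .dace.conf, roughly as below. I have not changed anything from the defaults yet, and the key names are assumptions on my part based on the DaCe configuration schema, so please correct me if the paper's setup used something different:

```python
# Hedged sketch: overriding DaCe settings in the benchmark script instead of
# editing .dace.conf. The key names below are assumptions based on my reading
# of the DaCe configuration schema and may not match the paper's setup.
import dace

dace.Config.set("compiler", "build_type", value="Release")                  # optimized build
dace.Config.set("compiler", "cuda", "default_block_size", value="64,8,1")   # CUDA thread-block size
```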
Please let me know if you need any more information. If required, I can also create a Dockerfile to reproduce my benchmark environment.
Configuration:
OS: Linux (Ubuntu 20.04)
CPU: Intel Core i5 12600K, 64 GB RAM
GPU: Nvidia RTX A2000 12 GB
DaCe version: 0.13.3
DaceML version: 0.2
ONNX Runtime: 1.7.0
CUDA: 11.4, cuDNN: 8.4
Python: 3.8.1, PyTorch: 1.8.1