When building ResNet50 I stumbled upon the issue that, due to very large weight tensors, the HLS codegen step, which internally uses the `array2hexstring` function, was taking a very long time to execute. For tensors of roughly 2 million entries it took around 30 s. When doing this for many layers, this step alone would take ~15 min per build, making development in later steps difficult due to low iteration speeds.

To speed this process up, I first focused on the BINARY datatype case and rewrote the function in C, integrating it via Python's `ctypes`. I also added tests that compare the results on randomized input tensors against the original Python implementation.
To benchmark, I tested two input tensor shapes, one with 64 and one with 2048 as the innermost dimension, both with roughly 2 million elements overall, and executed the function 5 times for each. For the 64 case I measured an overall runtime of 237.41 s (47.482 s per sample) in Python versus 2.856 s overall (0.571 s per sample) for the C function, an estimated speedup of ~83x. For the 2048 case I measured 232.201 s overall (46.44 s per sample) in Python versus 0.115 s overall (0.023 s per sample) in C, an estimated speedup of ~2019x, presumably due to lower function call overhead with the wider innermost dimension.
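A rough sketch of how such a comparison can be timed is below; the pure-Python reference here is a simplified stand-in for the BINARY path of `array2hexstring` (no padding, prefix, or other datatypes), not the actual implementation:

```python
import timeit

import numpy as np


def binary_row_to_hexstring(row):
    # Simplified pure-Python stand-in: pack 0/1 entries MSB-first
    # into an integer, then format it as a zero-padded hex string.
    val = 0
    for bit in row:
        val = (val << 1) | int(bit)
    return format(val, "0{}x".format(-(-len(row) // 4)))


# ~2 million elements with 2048 as the innermost dimension.
arr = np.random.randint(0, 2, size=(1024, 2048), dtype=np.uint8)

n_runs = 5
t_py = timeit.timeit(
    lambda: [binary_row_to_hexstring(row) for row in arr], number=n_runs
)
print(f"pure Python: {t_py:.3f}s overall, {t_py / n_runs:.3f}s per sample")
```

Timing the ctypes-backed function over the same tensor and run count then gives the per-sample numbers and speedup factors quoted above.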
In the future I would like to expand this to all `DataType`s and try to speed up the C implementation a bit more as well, but for now I don't think that further speedup is strictly necessary.