When building ResNet50 I stumbled upon the issue that, due to very large weight tensors, the HLS codegen step, which internally uses the `array2hexstring` function, was taking a very long time to execute. For tensors of roughly 2 million entries it took around 30 s. When doing this for many layers, this step alone would take ~15 min per build, making development in later steps difficult due to low iteration speeds.

To speed this process up, I first focused on the BINARY datatype case and rewrote the function in C, integrating it via Python's `ctypes`. I also added tests that compare the results on randomized input tensors against the original Python implementation.
To benchmark, I tested two input tensor shapes, one with 64 and one with 2048 as the innermost dimension, both with roughly 2 million elements overall, and executed the function 5 times for each. For the 64 case I measured an overall runtime of 237.41 s (47.482 s per sample) in Python versus 2.856 s overall (0.571 s per sample) for the C function, an estimated speedup of ~83x. For the 2048 case I measured 232.201 s overall (46.44 s per sample) in Python versus 0.115 s overall (0.023 s per sample) in C, an estimated speedup of ~2019x, presumably due to lower function call overhead with the wider innermost dimension.
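A rough sketch of how such a comparison can be timed is below; the pure-Python reference here is a simplified stand-in for the BINARY path of `array2hexstring` (no padding, prefix, or other datatypes), not the actual implementation:

```python
import timeit

import numpy as np


def binary_row_to_hexstring(row):
    # Simplified pure-Python stand-in: pack 0/1 entries MSB-first
    # into an integer, then format it as a zero-padded hex string.
    val = 0
    for bit in row:
        val = (val << 1) | int(bit)
    return format(val, "0{}x".format(-(-len(row) // 4)))


# ~2 million elements with 2048 as the innermost dimension.
arr = np.random.randint(0, 2, size=(1024, 2048), dtype=np.uint8)

n_runs = 5
t_py = timeit.timeit(
    lambda: [binary_row_to_hexstring(row) for row in arr], number=n_runs
)
print(f"pure Python: {t_py:.3f}s overall, {t_py / n_runs:.3f}s per sample")
```

Timing the ctypes-backed function over the same tensor and run count then gives the per-sample numbers and speedup factors quoted above.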
In the future I would like to expand this to all `DataType`s and try to speed up the C implementation a bit more as well, but for now I don't think that further speedup is strictly necessary.