Submission #6

Open. Wants to merge 26 commits into main.


@EAirPeter commented Nov 30, 2022

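Timings with HUST.PNG as the input:

Original: 4347 ms
Optimized, single thread: 65 ms
Optimized, 12 threads: 19 ms
Optimized, 24 threads: 17 ms

The number of threads is controlled through the OMP_NUM_THREADS environment variable.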
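As a rough illustration of how OMP_NUM_THREADS interacts with a `#pragma omp parallel for` loop, here is a minimal sketch. The function names `scale_rows` and `max_threads` are hypothetical and the loop body is a placeholder, not the kernel from this PR. The `#ifdef _OPENMP` guard is only there so the sketch also builds when OpenMP is not enabled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>  // only present when the compiler has OpenMP enabled (e.g. -fopenmp)
#endif

// Hypothetical stand-in for the work loop: doubles every element.
// Row iterations are independent, so OpenMP can hand them out to threads.
void scale_rows(std::vector<float>& data, int rows, int cols) {
    assert(data.size() == static_cast<std::size_t>(rows) * cols);
    // Without OpenMP the pragma is ignored and the loop simply runs serially.
    #pragma omp parallel for
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            data[static_cast<std::size_t>(r) * cols + c] *= 2.0f;
        }
    }
}

// Upper bound on the team size of the next parallel region.
// With OpenMP this reflects OMP_NUM_THREADS when the variable is set;
// in a build without OpenMP it is always 1.
int max_threads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```

With an OpenMP build, launching the program with the variable set in the environment (for example `OMP_NUM_THREADS=12` in front of the command) would be expected to make `max_threads()` report 12, with no change to the code.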


PROF_SCOPED_MARKER("WorkLoop");

#pragma omp parallel for
Contributor:

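Wouldn't it be better to partition the parallel work by submatrix?

That said, the data set seems a bit too small to put any real pressure on the cache.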

Author:

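What exactly do you mean by submatrix? A single 4x4 input tile as the unit, or a block made up of several 4x4 tiles?
The data set may well be too small. I was surprised that profiling showed so few cache misses.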

Contributor @shawlleyw, Dec 6, 2022:

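For example, if the input is a 1024*1024 image, you could split it so that each thread takes a 64 * 64 submatrix to compute. That gives better reuse of data along the column direction.

But a 1024*1024 matrix seems to be so small that it may even fit entirely in cache (?), so lines are not repeatedly evicted and cache misses stay low.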
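The submatrix partitioning suggested here could look roughly like the sketch below. It is not the PR's code: `invert_tiled` and its per-pixel operation are hypothetical placeholders, and the tile size is a parameter so that the 64 * 64 case is just one choice. A kernel this simple gains nothing from tiling by itself. The benefit described above only appears when the real kernel rereads neighbouring rows or columns within a tile.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical kernel: inverts 8-bit samples, processing the image tile by tile.
// Each (ty, tx) tile is one unit of parallel work.
void invert_tiled(const std::vector<unsigned char>& in,
                  std::vector<unsigned char>& out,
                  int height, int width, int tile) {
    assert(tile > 0);
    assert(in.size() == static_cast<std::size_t>(height) * width);
    assert(out.size() == in.size());

    // Round up so that partial tiles at the right and bottom edges are covered.
    const int tiles_y = (height + tile - 1) / tile;
    const int tiles_x = (width + tile - 1) / tile;

    // collapse(2) merges the two tile loops into one iteration space,
    // so threads are assigned whole tiles rather than whole rows.
    #pragma omp parallel for collapse(2)
    for (int ty = 0; ty < tiles_y; ++ty) {
        for (int tx = 0; tx < tiles_x; ++tx) {
            const int y_end = std::min((ty + 1) * tile, height);
            const int x_end = std::min((tx + 1) * tile, width);
            for (int y = ty * tile; y < y_end; ++y) {
                for (int x = tx * tile; x < x_end; ++x) {
                    const std::size_t idx = static_cast<std::size_t>(y) * width + x;
                    out[idx] = static_cast<unsigned char>(255 - in[idx]);
                }
            }
        }
    }
}
```

For the 1024*1024 example, calling it with `tile = 64` yields 16 x 16 = 256 tiles to distribute among the threads.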

Author:

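Yes, I considered trying that when I was working out how to load the input. My code loads the input row by row, which in principle should easily cause cache misses along the column direction. But the profiler reported few misses and I was short on time, so I did not go down that path. The approach does make sense.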

Author:

I checked: HUST.PNG is 4.19M and my CPU's L2$ is 6M, so there is indeed plenty of room.

Author:

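The machine I ran the profiler on has a 4M L3$. Since only the most recent 4 rows of the input are accessed at any one time, that is also more than enough.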
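The working-set argument can be checked with a little arithmetic. The sketch below assumes a 1024-pixel-wide image with 4 bytes per pixel and reads "4M" as 4 MiB. None of these figures is stated in the thread, although 1024 x 1024 x 4 bytes is 4,194,304 bytes, which is at least consistent with the 4.19M quoted above. The helper names are hypothetical.

```cpp
#include <cstddef>

// Bytes touched when `rows` rows of the image are live at the same time.
constexpr std::size_t working_set_bytes(std::size_t width, std::size_t rows,
                                        std::size_t bytes_per_pixel) {
    return width * rows * bytes_per_pixel;
}

// True when the working set is no larger than the cache capacity.
constexpr bool fits_in_cache(std::size_t working_set, std::size_t cache_bytes) {
    return working_set <= cache_bytes;
}

// Assumed figures (not confirmed by the thread).
constexpr std::size_t kWidth = 1024;                   // pixels per row
constexpr std::size_t kBytesPerPixel = 4;              // e.g. 8-bit RGBA
constexpr std::size_t kCacheBytes = 4u * 1024 * 1024;  // "4M" read as 4 MiB

// Four rows: 1024 * 4 * 4 = 16384 bytes (16 KiB), far below 4 MiB.
static_assert(working_set_bytes(kWidth, 4, kBytesPerPixel) == 16384,
              "4 rows occupy 16 KiB under the assumed format");
static_assert(fits_in_cache(working_set_bytes(kWidth, 4, kBytesPerPixel), kCacheBytes),
              "4 rows fit in the assumed cache");
```

Under these assumptions the four live rows occupy about 0.4% of the cache, which fits the low miss counts reported by the profiler.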

@shawlleyw mentioned this pull request Dec 31, 2022
3 participants