Increase the K tile size in L1 for matmul ops #846

Open

wants to merge 1 commit into main

Conversation

yzhang93 (Contributor)

No description provided.

@yzhang93 (Contributor Author) commented Oct 16, 2024

Currently 5 bf16 CI tests failed because of [AIE ERROR] _XAie_LoadProgMemSection():231: Overflow of program memory. All i32 and i8 tests passed.

@Abhishek-Varma Did you see a similar error? How did you solve it?

@Abhishek-Varma (Contributor)

> Currently 5 bf16 CI tests failed because of [AIE ERROR] _XAie_LoadProgMemSection():231: Overflow of program memory. All i32 and i8 tests passed.
>
> @Abhishek-Varma Did you see a similar error? How did you solve it?

Yes, I'm seeing this issue for larger-shape Matmul + Truncf with bf16.

From what I could speculate yesterday, in the Matmul + Truncf case it seemed to be due to the vector unrolling of arith.truncf. But in this CI run I see that it's now happening for Batch Matmul too.
I'm trying to play with Peano flags today to see if that helps.
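
Roughly, what I suspect is happening (a sketch with hypothetical tile shapes, not the exact IR from CI): a single arith.truncf over a large accumulator vector gets unrolled into one extract/truncf/insert triple per native-size slice, which multiplies the op count:

    // One truncf over the whole (hypothetical) accumulator tile:
    %t = arith.truncf %acc : vector<4x8xf32> to vector<4x8xbf16>

    // After unrolling to a native 1x8 shape, each slice becomes three ops:
    %s0 = vector.extract_strided_slice %acc
            {offsets = [0, 0], sizes = [1, 8], strides = [1, 1]}
            : vector<4x8xf32> to vector<1x8xf32>
    %t0 = arith.truncf %s0 : vector<1x8xf32> to vector<1x8xbf16>
    %r0 = vector.insert_strided_slice %t0, %cst
            {offsets = [0, 0], strides = [1, 1]}
            : vector<1x8xbf16> into vector<4x8xbf16>
    // ...and likewise for offsets [1, 0], [2, 0] and [3, 0].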

@Abhishek-Varma (Contributor) commented Oct 17, 2024

Since both of us are facing the same issue, I tried looking into the CI failure (the current speculation was that arith.truncf's vector unrolling is the culprit, so I wanted to check what the delta here is, in case it helps solve both of our cases).

I compared the e2e IR for the case with the PM (Program Memory) issue against the one that doesn't use this patch, to see what the delta is.

The only difference (besides the new tile size) that I could see was in the MAIN section's scf.for total iteration count.
For the case with the PM issue the total iteration count is 2, whereas the one without the patch has 6.
Because of this, AMDAIEAcquireReleaseToUseLock, when trying to unroll, changes the IR structure from:

core {
    // PROLOGUE
       vector.contract 
    // MAIN
    scf.for 
         vector.contract
    // EPILOGUE
       vector.contract
}

To

core {
    // PROLOGUE
       vector.contract 
    // MAIN
       vector.contract
       vector.contract
    // EPILOGUE
       vector.contract
}
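
Concretely, since the MAIN loop's total trip count is 2 here, the unroll flattens it completely; a sketch with made-up bounds:

    // MAIN loop in the PM-failure case: trip count 2.
    scf.for %iv = %c0 to %c2 step %c1 {
      vector.contract
    }

    // After AMDAIEAcquireReleaseToUseLock's unrolling: straight-line code.
    vector.contract    // iteration 0
    vector.contract    // iteration 1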

But I don't think this is the cause of the PM exhaustion, based on trying to understand the same issue for Truncf as well.

I tried changing the Peano flags too:

    -O2 \
    --inline-threshold=10 \
    -S input.ll \
    --disable-builtin=memset \
    -o input.opt.ll \
    -vectorize-loops=false \
    -vectorize-slp=false \
    --two-entry-phi-node-folding-threshold=10 \
    -mandatory-inlining-before-opt=false \
    -basic-aa-full-phi-analysis=true \
    -basic-aa-max-lookup-search-depth=10

but none of them seem to fix either this issue or the Truncf PM issue. :/

I even tried re-looking at the changes done in #822 (in case any of them was a culprit common to us), but those changes don't seem to touch this PR's delta.

@Abhishek-Varma (Contributor)

@jtuyls suggested adding --disable-loop-unrolling to the Peano flags, and it worked!

Here is the e2e IR of one of the failing tests with your patch; I verified the output too.

It also solved the PM issue for Truncf!

Thanks Jorn!

@yzhang93 (Contributor Author)

> @jtuyls suggested adding --disable-loop-unrolling to the Peano flags, and it worked!
>
> Here is the e2e IR of one of the failing tests with your patch; I verified the output too.
>
> It also solved the PM issue for Truncf!
>
> Thanks Jorn!

@Abhishek-Varma Thanks for looking into the issue! So what's the way forward? Do we disable loop unrolling in Peano, or add some optimizations on our side?

@jtuyls (Collaborator) commented Oct 17, 2024

> @Abhishek-Varma Thanks for looking into the issue! So what's the way forward? Do we disable loop unrolling in Peano, or add some optimizations on our side?

Maybe we can check whether it has any performance impact? At some point this might matter a lot, but I'm not sure it does yet. We could make it an optional flag for now? And potentially we can unroll/partially unroll on our side?
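
To illustrate the partial-unroll option (a sketch with hypothetical bounds): keep the MAIN scf.for but put two contractions per iteration, instead of letting Peano flatten the loop entirely:

    // Original MAIN loop, trip count 6:
    scf.for %iv = %c0 to %c6 step %c1 {
      vector.contract
    }

    // Partially unrolled by a factor of 2: trip count 3, two contractions per
    // iteration, so program memory grows much less than with a full unroll.
    scf.for %iv = %c0 to %c6 step %c2 {
      vector.contract
      vector.contract
    }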

@yzhang93 (Contributor Author) commented Oct 17, 2024

> > @Abhishek-Varma Thanks for looking into the issue! So what's the way forward? Do we disable loop unrolling in Peano, or add some optimizations on our side?
>
> Maybe we can check whether it has any performance impact? At some point this might matter a lot, but I'm not sure it does yet. We could make it an optional flag for now? And potentially we can unroll/partially unroll on our side?

Unfortunately, it does have a performance impact. For a large size such as 4096x4096x2048, the total execution time doubles when loop unrolling is disabled in Peano :( (The comparison is without any of the tile-size changes in this PR.)

Abhishek-Varma added a commit that referenced this pull request Oct 22, 2024
-- This commit makes the following updates to the insert-loops-for-vectorization
   pass.
-- It makes it work on bufferized inputs.
-- It also involves an update pertaining to collapsing unit dimensions of
   a candidate generic op.
-- It also involves coalescing of the loops generated for tiles.
-- This is the first logically grouped PR needed to make Matmul + Truncf work
   for larger shapes, and also to unblock other outstanding/dependent PRs like
   #846

Signed-off-by: Abhishek Varma <abhvarma@amd.com>