Increase the K tile size in L1 for matmul ops #846

yzhang93 · 2024-10-16T17:53:46Z

No description provided.

yzhang93 · 2024-10-16T18:55:40Z

Currently 5 bf16 ci tests failed because of [AIE ERROR] _XAie_LoadProgMemSection():231: Overflow of program memory. All i32 and i8 tests passed.

@Abhishek-Varma Did you see the similar error? How do you solve it?

Abhishek-Varma · 2024-10-17T06:08:13Z

Currently 5 bf16 ci tests failed because of [AIE ERROR] _XAie_LoadProgMemSection():231: Overflow of program memory. All i32 and i8 tests passed.

@Abhishek-Varma Did you see the similar error? How do you solve it?

Yes, I'm seeing this issue for larger shape Matmul + Truncf bf16.

From what I could speculate yesterday, in Matmul + Truncf's case, is that it seemed to be due to vector unrolling of arith.truncf. But in this CI I see that it's happening now for Batch Matmul too.
Trying to play with Peano flags today to see if it helps.

Abhishek-Varma · 2024-10-17T14:57:15Z

Since both of us are facing the same issue - I tried looking into the CI failure (as the current speculation was that arith.truncf's vector unrolling is the culprit - so wanted to check what the delta here is, in case it helps solve both of our cases).

I was trying to compare the e2e IR for the case of PM issue (Program Memory issue) with the one that doesn't use this patch - to see what the delta is.

The only difference (besides the new tile size) that I could see was in the MAIN section's scf.for total iteration count.
For the case with PM issue the total iteration count is 2 whereas for the one that doesn't has 6.
And due to this AMDAIEAcquireReleaseToUseLock, when trying to unroll, seems to be changing the IR structure from :-

core {
    // PROLOGUE
       vector.contract 
    // MAIN
    scf.for 
         vector.contract
    // EPILOGUE
       vector.contract
}

To

core {
    // PROLOGUE
       vector.contract 
    // MAIN
       vector.contract
       vector.contract
    // EPILOGUE
       vector.contract
}

But I don't think this is the cause of PM exhaustion - when trying to understand the same issue for Truncf as well.

I tried changing the Peano flags too :-

    -O2 \
    --inline-threshold=10 \
    -S input.ll \
    --disable-builtin=memset \
    -o input.opt.ll \
    -vectorize-loops=false \
    -vectorize-slp=false \
    --two-entry-phi-node-folding-threshold=10 \
    -mandatory-inlining-before-opt=false \
    -basic-aa-full-phi-analysis=true \
    -basic-aa-max-lookup-search-depth=10

but none seem to work for both this as well as the Truncf PM issue. :/

I even tried re-looking at the changes done in #822 (in case any of that was a culprit common to us) - but those changes seem to not touch at least this PR's delta.

Abhishek-Varma · 2024-10-17T15:49:15Z

@jtuyls suggested to add --disable-loop-unrolling in the Peano flag and it worked!

Here is the e2e IR of one of the test's failure with your patch - I verified the output too.

It also solved the PM issue for Truncf!

Thanks Jorn!

yzhang93 · 2024-10-17T16:46:56Z

@jtuyls suggested to add --disable-loop-unrolling in the Peano flag and it worked!

Here is the e2e IR of one of the test's failure with your patch - I verified the output too.

It also solved the PM issue for Truncf!

Thanks Jorn!

@Abhishek-Varma Thanks for looking into the issue! So what's the way to move forward? We disable that flag in Peano or add some optimizations?

jtuyls · 2024-10-17T16:58:33Z

s for looking into the issue! So what's the way to move forward? We disable that flag in Peano or add some optimizations?

Maybe we can check whether it has any performance impact? At some point, this might matter a lot, but not sure at this point. We could make it a an optional flag for now? And potentially, we can unroll/partially unroll on our side?

yzhang93 · 2024-10-17T19:35:55Z

s for looking into the issue! So what's the way to move forward? We disable that flag in Peano or add some optimizations?

Maybe we can check whether it has any performance impact? At some point, this might matter a lot, but not sure at this point. We could make it a an optional flag for now? And potentially, we can unroll/partially unroll on our side?

Unfortunately, it does have performance impact. For large size such as 4096x4096x2048, the total execution time doubles while disabling the loop unrolling in peano :( (Comparison is without any change of tile sizes in this PR.)

-- This commit makes the following updates to insert-loops-for-vectorization pass. -- It makes it to work on bufferized inputs. -- It also involves update pertaining to collapsing unit dimensions of a candidate generic op. -- Also involves coalescing of the loops generated for tiles. -- This is the first logically grouped PR needed to make Matmul + Truncf work for larger shape, and also unblock other outstanding/dependent PR like #846 Signed-off-by: Abhishek Varma <abhvarma@amd.com>

Increase the K tile size in L1 for matmul ops

e36ecf5

yzhang93 requested review from MaheshRavishankar, nirvedhmeshram, Abhishek-Varma and jtuyls as code owners October 16, 2024 17:53

Abhishek-Varma mentioned this pull request Oct 22, 2024

[Insert-Loops-For-Vec] Update insert-loops-for-vectorization pass #854

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Increase the K tile size in L1 for matmul ops #846

Increase the K tile size in L1 for matmul ops #846

yzhang93 commented Oct 16, 2024

yzhang93 commented Oct 16, 2024 •

edited

Loading

Abhishek-Varma commented Oct 17, 2024

Abhishek-Varma commented Oct 17, 2024 •

edited

Loading

Abhishek-Varma commented Oct 17, 2024

yzhang93 commented Oct 17, 2024

jtuyls commented Oct 17, 2024

yzhang93 commented Oct 17, 2024 •

edited

Loading

Increase the K tile size in L1 for matmul ops #846

Are you sure you want to change the base?

Increase the K tile size in L1 for matmul ops #846

Conversation

yzhang93 commented Oct 16, 2024

yzhang93 commented Oct 16, 2024 • edited Loading

Abhishek-Varma commented Oct 17, 2024

Abhishek-Varma commented Oct 17, 2024 • edited Loading

Abhishek-Varma commented Oct 17, 2024

yzhang93 commented Oct 17, 2024

jtuyls commented Oct 17, 2024

yzhang93 commented Oct 17, 2024 • edited Loading

yzhang93 commented Oct 16, 2024 •

edited

Loading

Abhishek-Varma commented Oct 17, 2024 •

edited

Loading

yzhang93 commented Oct 17, 2024 •

edited

Loading