tune heuristics for inner outer scheduler after cached inputs are moved to smem #3223

Draft: wants to merge 16 commits into base branch llu/inner_outer_smem_moveall
Conversation

@liqiangxl (Collaborator) commented Oct 18, 2024

Issue:
After all inner persistent buffers are moved to shared memory, the register buffer size is only 4 and 8 bytes per element for RMS norm backward and layer norm backward, respectively. The current heuristic uses the largest possible batch size, which leads to performance regressions. For example, RMS norm at hidden size 20K uses 20 batches with 128 threads. That batch size is too large and causes register spills; the heuristic should instead increase threads per block to reduce the batch size.

Fix:
(1) Max threads per block is increased from 256 to 512 due to reduced register pressure.
(2) Generate candidate heuristics with different threads per block and sort them by register usage and occupancy.
Generate candidate heuristics:

vectorization_factor = largest possible
for each threads_per_block in {128, 256, 512}:
       generate_a_heuristic(threads_per_block, vectorization_factor)

Sort candidates based on:

        // (1) prefer the candidate that won't cause register spills,
        //     extra_regs = available - required (based on buffer size)
        if (extra_regs_a > 0 && extra_regs_b < 0) {
          return true;
        } else if (extra_regs_a < 0 && extra_regs_b > 0) {
          return false;
        }
        // (2) prefer candidates with occupancy >= 16 warps per SM. Higher
        // occupancy doesn't always mean higher performance, but below
        // 16 warps per SM, increasing occupancy improves performance.
        if (a.warps_per_sm < 16 || b.warps_per_sm < 16) {
          return a.warps_per_sm > b.warps_per_sm;
        }
        // (3) tie breaker: smaller threads_per_block reduces reduction overhead
        return a.threads_per_block < b.threads_per_block;

Results:
Hopper:
(1) layer norm and RMS norm bwd, fp16
image

(2) layer norm and RMS norm bwd, fp32
image
Ampere: see dashboard

@liqiangxl (Collaborator, Author)
!build

@liqiangxl (Collaborator, Author)
!build --pybench

@liqiangxl (Collaborator, Author)
!build --pybench

@liqiangxl liqiangxl changed the base branch from main to llu/inner_outer_smem_moveall October 24, 2024 16:05