tune heuristics for inner outer scheduler after cached inputs are moved to smem #3223

Draft: wants to merge 16 commits into base branch llu/inner_outer_smem_moveall
Conversation

@liqiangxl (Collaborator) commented Oct 18, 2024

Issue:
After all inner persistent buffers are moved to shared memory, the register buffer size is only 4 and 8 bytes per element for RMS norm backward and layer norm backward, respectively. The current heuristic uses the largest possible batch size, which leads to performance regressions. For example, RMS norm at hidden size 20K uses 20 batches with 128 threads. That batch size is too large and causes register spills; the heuristic should instead increase threads per block to reduce the batch size.

Fix:
(1) Max threads per block is increased from 256 to 512 due to reduced register pressure.
(2) Generate candidate heuristics with different threads per block and sort them by register usage and occupancy.
Generate candidate heuristics:

vectorization_factor = largest possible
for each threads_per_block in {128, 256, 512}:
       generate_a_heuristic(threads_per_block, vectorization_factor)

Sort candidates based on:

        // (1) prefer the candidate that won't cause register spills,
        //     extra_regs = available - required (based on buffer size)
        if (extra_regs_a > 0 && extra_regs_b < 0) {
          return true;
        } else if (extra_regs_a < 0 && extra_regs_b > 0) {
          return false;
        }
        // (2) prefer candidates with occupancy >= 16 warps per SM. Higher
        // occupancy doesn't always mean higher performance, but below
        // 16 warps per SM, increasing occupancy improves performance.
        if (a.warps_per_sm < 16 || b.warps_per_sm < 16) {
          return a.warps_per_sm > b.warps_per_sm;
        }
        // (3) tie breaker: smaller threads_per_block reduces reduction overhead
        return a.threads_per_block < b.threads_per_block;

Results:
Hopper:
(1) layer norm and RMS norm bwd, fp16
image

(2) layer norm and RMS norm bwd, fp32
image
Ampere: see dashboard

@liqiangxl (Collaborator, Author)
!build

@liqiangxl (Collaborator, Author)
!build --pybench

@liqiangxl (Collaborator, Author)
!build --pybench

@liqiangxl liqiangxl changed the base branch from main to llu/inner_outer_smem_moveall October 24, 2024 16:05