RMSNorm Blocked Implementation #638

Open
wants to merge 25 commits into
base: main_perf

Commits on Jul 16, 2024

  1. Add Perf Kernels

    Add Perf Kernels
    
    This is a combination of 2 commits.
    
    Add Perf Kernels
    
    Add Perf Kernels
    
    This is a combination of 6 commits.
    
    add perf-kernels
    
    fix formatting issues
    
    fix unused variables and other bugs
    
    fix other issues
    
    remove scripts
    
    save
    
    check changes
    
    format
    
    save
    
    save
    
    try
    
    pre-commit check
    
    save
    micmelesse committed Jul 16, 2024
    2d2dbe1
  2. skip backward (#586)

    micmelesse committed Jul 16, 2024
    17575ea
  3. Change all block pointers to tensor pointers (#585)

    Change all block pointers to tensor pointers
    
    Block pointers are for NVIDIA TMAs. They are useful for regular loads as well, but are not well supported.
    
    Also cleaned up some code I came across along the way and updated the comment at the top.
    vgokhale authored and micmelesse committed Jul 16, 2024
    a3d784a
  4. Add support for bshd layout (#587)

    Add support for layouts commonly used by users.
    
    Add an option for the varlen / thd layout to specify equal context lengths for all batches, which is also often used by users.
    vgokhale authored and micmelesse committed Jul 16, 2024
    aa6685a
  5. Post-Merge CI (#612)

    * remove on push for Integration Tests
    
    * rename
    
    * add post merge test
    
    * save
    
    * dtype params
    
    * skip bad config
    
    * fix more stuff
    micmelesse committed Jul 16, 2024
    dbe1173

Commits on Jul 18, 2024

  1. Increase CI timeout (#615)

    Increase CI timeout
    vgokhale authored Jul 18, 2024
    23ba546

Commits on Jul 19, 2024

  1. Couple of FA optimizations (#608)

    Couple of FA optimizations
    
    Set SM scale multiplication to a constexpr. Minor asm improvement.
    
    Changed the acc scaling that accounts for the softmax division into a
    multiplication by the reciprocal. ~10% perf improvement.
    
    ---------
    
    Co-authored-by: Michael Melesse <micmelesse@gmail.com>
    vgokhale and micmelesse authored Jul 19, 2024
    df4c4d3
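
The reciprocal change described in the commit above can be illustrated outside of Triton. The following is a minimal pure-Python sketch, not the kernel code from the commit: the names `normalize_by_division` and `normalize_by_reciprocal` and the sample values are hypothetical. It only shows that dividing each accumulator element by the softmax denominator and multiplying by a reciprocal computed once give the same result up to rounding.

```python
import math


def normalize_by_division(acc, row_sum):
    """Divide every accumulator element by the softmax denominator."""
    return [x / row_sum for x in acc]


def normalize_by_reciprocal(acc, row_sum):
    """Compute the reciprocal once, then multiply every element by it."""
    inv_row_sum = 1.0 / row_sum
    return [x * inv_row_sum for x in acc]


# Hypothetical unnormalized accumulator row and its softmax denominator.
acc = [0.5, 1.25, 3.0, 0.125]
row_sum = 7.0

divided = normalize_by_division(acc, row_sum)
multiplied = normalize_by_reciprocal(acc, row_sum)

# The two results agree up to floating-point rounding.
assert all(math.isclose(a, b, rel_tol=1e-12) for a, b in zip(divided, multiplied))
```

The second form performs one division per row instead of one per element, which is the kind of saving the commit message attributes its speedup to; the actual gain depends on the hardware and the kernel.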

Commits on Jul 31, 2024

  1. streamk v0.1 (#619)

    * streamk v0.1
    
    * remove unused variable
    
    * fix format issues
    
    * add README
    
    * fix format issue
    
    * change num_sms to num_cus
    xiaohuguo2023 authored Jul 31, 2024
    52a908f

Commits on Aug 6, 2024

  1. Add explicit multiply-reduce GEMM kernel (#621)

    * Add explicit multiply-reduce GEMM kernel
    
    * Remove `SPLIT_K` argument from kernel
    
    * Remove `GROUP_SIZE_M` argument from kernel
    
    * Remove conditional call to `tl.dot` from kernel
    
    * Remove table with performance data from README
    brunomazzottiamd authored Aug 6, 2024
    1d2e066

Commits on Aug 13, 2024

  1. Copy *tune_gemm* from triton-mlir branch to main_perf branch (#614)

    * Copy *tune_gemm* from `triton-mlir` branch to `main_perf` branch
    
    The source commit in `triton-mlir` branch is the following one:
    ```
    commit cf44637
    Author: Lixun Zhang <Lixun.Zhang@amd.com>
    Date:   Tue Jul 23 14:22:01 2024 -0500
    
        [tuning] gemm tuning script v3.3 (#606)
    ```
    
    *tune_gemm* was copied from the source branch directory `scripts/amd/gemm`
    to the destination branch directory `python/perf-kernels/tune_gemm`.
    
    The SHA-256 hashes of *tune_gemm* files are the following ones:
    ```
    423aef1deb6c60f6578a1ecfc94d2473f8746b00d0368c553d31641fcfa5e354  README.md
    46ab93978fee33f75df23332f12546dae7910478c391f08b7b1ebd415d8266b7  icache_flush.py
    f18711544641b810a652e6a6629bfa2b613f6ade87399e88fdf05b81d4af58a4  matmul.py
    84a1c80ede36d3154e51188276eda2d2d0f52ed4f496ff69349c390d83b8ec10  matmul_kernel.py
    2812b40183637bc8d7e47d283c7d66b1792134a43de76f3eacf7b9b3e1c2431a  one_config.py
    0ac09c33b0173cea06ddabbf9f4e3afa1816781dea4fdcce5894a7e7d6a80e19  rocprof_gemm.py
    00eff41cf1c0bfc41d623e42b51706af67639fec76146741e2067d2a93e0148a  utils/file_generator.py
    cb7afb773ccee835b00396cccf87e0d44fe513131161f031fae42453725b3c82  utils/utils.py
    59f23811b660e49e566927853926a21f02a7014bb19c8ea67e6b382db6c59900  tune_gemm.py
    e787f35d750b869f113b3c01692f64243a9cb8a71a18ade2f0465f614f7284e4  tune_gemm.sh
    ```
    
    The files were kept as-is despite `pre-commit` intentions to change them.
    
    After that, the *tune_gemm* directory in code and documentation was fixed to reflect
    its new location.
    brunomazzottiamd authored Aug 13, 2024
    11e4447
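
A reader who wants to check copied files against a list of SHA-256 hashes like the one in the commit above could do so along these lines. This is a minimal sketch using only the Python standard library; the helper names `sha256_of_file` and `find_mismatches` are hypothetical and are not part of *tune_gemm*.

```python
import hashlib
import os


def sha256_of_file(path, chunk_size=1 << 16):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(expected, root="."):
    """Return the relative paths whose digest differs from `expected`.

    `expected` maps a relative path to its expected hex digest.
    A missing file is reported as a mismatch.
    """
    mismatches = []
    for rel_path, expected_digest in expected.items():
        full_path = os.path.join(root, rel_path)
        if not os.path.isfile(full_path):
            mismatches.append(rel_path)
        elif sha256_of_file(full_path) != expected_digest:
            mismatches.append(rel_path)
    return mismatches
```

Calling `find_mismatches` with a dictionary built from the hash list and `root` set to the destination directory returns an empty list when every file matches.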

Commits on Aug 16, 2024

  1. Clean up *tune_gemm* script from main_perf branch (#629)

    * Reformat *tune_gemm* files with Triton's pre-commit
    
    The following command was executed to reformat the files:
    ```
    $ pre-commit run --files \
        python/perf-kernels/tune_gemm/* \
        python/perf-kernels/tune_gemm/utils/*
    ```
    
    * Fix *tune_gemm* issue with (1, 1) bias tensors
    
    * Fix `ruff` F405 errors
    
    Fix the following linter error:
    F405 `identifier` may be undefined, or defined from star imports
    
    * Fix `ruff` F841 errors
    
    Fix the following linter error:
    F841 Local variable `identifier` is assigned to but never used
    
    * Fix minor issues in README file
    
    * Add `--` to `num_threads` argument.
    * Replace `--icahe` argument (non-existent argument) with
      `--icache_flush` (existent argument).
    
    * Remove old files from *tune_gemm* V1
    
    * Add dependency graph to README file
    
    * Selectively disable `yapf` for parts of `one_config.py`
    brunomazzottiamd authored Aug 16, 2024
    624335f

Commits on Aug 19, 2024

  1. [tune gemm v3.4] Add xcd-based pid remapping and change back to rocprofv1 (#630)
    
    * Change to rocprofv1
    
    * improve post processing of rocprof results
    
    - set --iters=200 as default. This is enough since the time is stable
    after the first few runs.
    - Filter out kernel times that are too large. We use the first kernel
    time as the threshold. There must be something wrong with a kernel run
    if its elapsedTime is larger than that of the first run. We need to
    investigate the reason. For now, just filter them out.
    
    * Add xcd-based pid remapping
    
    * Enable EVEN_K=false for large gemms
    
    * Update readme
    zhanglx13 authored Aug 19, 2024
    15cb3a8
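
The filtering rule described in the commit above (use the first kernel time as the threshold and drop anything larger) can be sketched as follows. This is an illustration under that stated rule only; `filter_kernel_times` and the sample timings are hypothetical and are not taken from the *tune_gemm* post-processing code.

```python
def filter_kernel_times(times):
    """Keep only the times that do not exceed the first measured time."""
    if not times:
        return []
    threshold = times[0]
    return [t for t in times if t <= threshold]


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


# Hypothetical elapsed times in microseconds; 42.0 is an outlier.
times = [10.0, 9.5, 9.6, 42.0, 9.4]
kept = filter_kernel_times(times)
assert kept == [10.0, 9.5, 9.6, 9.4]
```

The first run is always kept because it is compared against itself, so the filtered list is never empty for non-empty input.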
  2. 177d0bd

Commits on Sep 6, 2024

  1. Softmax kernel

    Rahul Batra committed Sep 6, 2024
    e42690d
  2. Merge pull request #634 from ROCm/main_perf-softmax

    Softmax kernel
    rahulbatra85 authored Sep 6, 2024
    6d283a2
  3. Move utility tools from triton-mlir to main_perf branch (#635)

    * Move utility tools from triton-mlir to main_perf branch
    
    - Plot layout script
    - occ.sh
    - amdgcn-cfg
    
    * yapf format
    
    * More formats
    
    * remove executability of plot_layout.py
    
    * Address ruff complaints
    
    * Move tune_gemm to tools
    zhanglx13 authored Sep 6, 2024
    3704738
  4. Add rmsnorm kernel

    Rahul Batra committed Sep 6, 2024
    f80aed7
  5. 9da4278

Commits on Sep 7, 2024

  1. Merge pull request #633 from ROCm/main_perf-rmsnorm

    Add rmsnorm kernel
    rahulbatra85 authored Sep 7, 2024
    c4bd738

Commits on Sep 13, 2024

  1. Online softmax implementation

    Rahul Batra committed Sep 13, 2024
    a782caf
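
The commit above gives no description beyond its title. For readers unfamiliar with the term, the general online softmax technique keeps a running maximum and a running sum while streaming over a row, rescaling the sum whenever the maximum changes. The pure-Python sketch below illustrates that general technique only; it is an assumption about what the title refers to and is not the Triton kernel from the commit. It assumes finite inputs.

```python
import math


def online_softmax(row):
    """Softmax with the max and the normalizer computed in one pass."""
    running_max = float("-inf")
    running_sum = 0.0
    for x in row:
        new_max = max(running_max, x)
        # Rescale the sum accumulated so far to the new maximum,
        # then add the contribution of the current element.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(x - new_max)
        running_max = new_max
    return [math.exp(x - running_max) / running_sum for x in row]


def two_pass_softmax(row):
    """Reference version: one pass for the max, one for the sum."""
    row_max = max(row)
    exps = [math.exp(x - row_max) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


row = [1.0, 3.0, -2.0, 0.5]
assert all(
    math.isclose(a, b, rel_tol=1e-12)
    for a, b in zip(online_softmax(row), two_pass_softmax(row))
)
```

The single statistics pass saves one read of the row compared with the reference version, which matters when the row does not fit in fast memory.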

Commits on Sep 16, 2024

  1. Merge pull request #639 from ROCm/softmax_updates

    Online softmax implementation
    rahulbatra85 authored Sep 16, 2024
    96b3d37

Commits on Sep 19, 2024

  1. Add Layernorm kernel

    Rahul Batra committed Sep 19, 2024
    042aa91

Commits on Sep 24, 2024

  1. Add use mask

    Rahul Batra committed Sep 24, 2024
    ccb3538
  2. Merge pull request #641 from ROCm/main_perf-layernorm

    Add Layernorm kernel
    rahulbatra85 authored Sep 24, 2024
    e13fc4c
  3. RMSNorm Blocked Implementation

    Rahul Batra committed Sep 24, 2024
    44e9360
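
The commit list does not say what "blocked" means for this kernel. One common reading is that each row is processed in fixed-size blocks: a first loop accumulates the sum of squares block by block, and a second loop normalizes and scales block by block. The pure-Python sketch below shows that reading only; `rmsnorm_blocked`, `block_size`, and `eps` are hypothetical names and defaults, and nothing here is taken from the PR's Triton code.

```python
import math


def rmsnorm_blocked(row, weight, block_size=4, eps=1e-6):
    """RMSNorm of one row, reading the row in blocks of `block_size`."""
    n = len(row)

    # Pass 1: accumulate the sum of squares one block at a time.
    sum_sq = 0.0
    for start in range(0, n, block_size):
        block = row[start:start + block_size]
        sum_sq += sum(x * x for x in block)

    inv_rms = 1.0 / math.sqrt(sum_sq / n + eps)

    # Pass 2: normalize and apply the per-element weight, again per block.
    out = []
    for start in range(0, n, block_size):
        xs = row[start:start + block_size]
        gs = weight[start:start + block_size]
        out.extend(x * inv_rms * g for x, g in zip(xs, gs))
    return out


def rmsnorm_reference(row, weight, eps=1e-6):
    """Unblocked reference: x / sqrt(mean(x^2) + eps) * weight."""
    inv_rms = 1.0 / math.sqrt(sum(x * x for x in row) / len(row) + eps)
    return [x * inv_rms * g for x, g in zip(row, weight)]


# Hypothetical row whose length is not a multiple of the block size.
row = [0.5, -1.0, 2.0, 3.5, -0.25, 1.5]
weight = [1.0, 0.5, 2.0, 1.0, 1.0, 0.25]
assert all(
    math.isclose(a, b, rel_tol=1e-12)
    for a, b in zip(rmsnorm_blocked(row, weight), rmsnorm_reference(row, weight))
)
```

Blocking keeps the working set per step bounded by `block_size`, so rows longer than what fits in registers or shared memory can still be handled, at the cost of reading the row twice.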