This is a squash of the following commits:
commit 5377f48
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Wed Feb 14 13:32:36 2024 -0700

    Tapir target tweaks, LTO touch-ups, etc.

commit dbfc195
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Feb 13 16:57:54 2024 -0700

    Fixes for LTO...

commit f9094d3
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Mon Feb 12 14:31:23 2024 -0700

    Chasing a bug in the LoopSpawning pass...

commit 4ddb9d1
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Mon Feb 12 11:09:20 2024 -0700

    Runtime tweaks for refactoring launch parameters (for cuda).
    Tweaks to multi-file (LTO) euler3d experiment.

commit f8ff7c5
Author: Patrick McCormick <>
Date:   Wed Feb 7 16:32:38 2024 -0700

    More launch explorations.

commit d720ced
Author: Patrick McCormick <>
Date:   Wed Feb 7 10:10:59 2024 -0700

    Tweaks on launch heuristics.

commit ab61a3c
Author: Patrick McCormick <>
Date:   Tue Feb 6 13:20:20 2024 -0700

    Working on experiments for benchmarking.

commit f510df1
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Feb 6 13:31:53 2024 -0700

    A bit more verbose output.

commit 3eec9c0
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Feb 6 13:12:41 2024 -0700

    Tweaks for launch heuristics (hacks).

commit 75895a7
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Fri Feb 2 08:37:20 2024 -0700

    More launch and compiler related tweaks and tests. Fix a mistake
    in the error reporting for the runtime's dylib handling...

commit e8ee550
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Thu Feb 1 13:01:21 2024 -0700

    Experimenting with launch details and some nvvm metadata.

commit 87fb4e4
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Mon Jan 29 11:22:56 2024 -0700

    Tweak to force environment variable to override occupancy-based
    launch parameter settings.

commit 721f9f9
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Mon Jan 29 09:23:19 2024 -0700

    Tweaks for attribute support (launch parameters) and runtime
    auto-adjustment to launch parameters.

commit 82c37ac
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Jan 23 11:56:22 2024 -0700

    Small touch-ups on build details in experiments. Still finding
    some issues with kokkos, latest cuda (12.x), and other details
    (e.g., host compiler).

commit c65d80c
Merge: 1241ae0 acc3dfb
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Jan 23 11:00:07 2024 -0700

    Merge remote-tracking branch 'origin/multi' into dev/16.x

commit acc3dfb
Author: Tarun Prabhu <tarun@lanl.gov>
Date:   Thu May 25 10:35:36 2023 -0600

    A squash of many commits covering a broad scope:

    1. Address some bugs/details/features introduced with the 16.x
       merge.
       - includes some minor tweaks for 16.x testing but this needs
         more work.
       - clang's sema probably needs to be revisited and improved.

    2. A significant overhaul of the runtime to support:
       - binding of calling threads to unique (gpu) streams
       - removal of a lot of crufty code that was no longer being
         used.
       - simplified kernel launch options/interface
       - occupancy-based launch parameters (can cause performance
         regressions)
       - better environment variable support for tweaking behaviors
         and more flexibility for experimentation, testing, and
         debugging.

    3. In alignment with lanl#2, portions of the transforms for CUDA
       and HIP have been cleaned up and simplified (in particular,
       kernel launch details are much cleaner now).

    4. Some bug fixes for attempts at post-processing code w/out
       parallel constructs. New "experiment" introduced to catch
       this as a regression.

    5. Some runtime building blocks for driving prefetch operations.

    6. Some new experiments/test codes.

    7. Fix for nested outlining -- we assumed cleanup by the
       dead-code elimination pass, but that fails with separate host
       and gpu code transformation modules. Had to introduce
       dead-code removal prior to gpu module passes (otherwise, the
       verifier pass fails).

    8. Runtime entry points for numpy allocation routines (e.g.,
       calloc, realloc, etc.). TODO: Potentially some room here for
       GPU-side operations to improve performance.

    9. Attribute support (e.g., target) for Kokkos 'statements'.

    10. General code cleanup -- removing warnings, unused code, etc.

    11. New support for launch parameter exploration within the
        experiments code base.

    12. Some work on -ffast-math crashes and issues. TODO: This code
        needs to be further developed (expanded support for
        double-precision, additional entry points, etc.). There are
        also some issues here in that what is specified on the
        command line can impact code on the host side but does not
        have a matching effect on the GPU side of the code
        transformation. TODO: ABI and other issues need to be
        further explored.

    13. Multiple targets within a code base are supported (e.g., run
        opencilk cpu threads and cuda-targeted forall loops).

    14. Fixes around multi-thread entry points within the runtime
        components.

    15. Testing and feature support for H100; sync'ing CUDA and PTX
        version info, etc.

commit 1241ae0
Author: Patrick McCormick <>
Date:   Fri Dec 8 16:50:13 2023 -0700

    Dealing with some crufty system libraries on Darwin... This will
    likely break on newer installs (e.g., Arch).

commit c734425
Author: Patrick McCormick <>
Date:   Fri Dec 8 15:23:21 2023 -0700

    Missed cleaning up some debug statements in last commit...
    TODO: -ffast-math stuff...

commit 6d81192
Author: Patrick McCormick <>
Date:   Fri Dec 8 15:00:12 2023 -0700

    Some testing on H100.

commit a7b07c0
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Fri Dec 8 14:58:00 2023 -0700

    Cuda runtime tweaks for multi-target and multi-threads. Likely
    still extremely buggy under duress...

commit a8bbeeb
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Fri Dec 8 12:47:46 2023 -0700

    Quick memory allocation/free mutex for multi-device use cases.

commit ea7a1b8
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Dec 5 12:59:32 2023 -0700

    More work on regressions, fast-math mode, hip performance, etc.

commit 40365ba
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Dec 5 13:02:21 2023 -0700

    More work on regressions, fast-math mode, hip performance, etc.

commit 9b75e11
Author: Patrick McCormick <>
Date:   Tue Dec 5 08:52:30 2023 -0700

    Working on some issues surrounding -ffast-math:

    1. ABI conflicts between the host stage and our module offload
       generation (e.g., host side passes generate vectorized code
       that is not supported on GPU backend(s)).

    2. Host architecture-centric tweaks occur before our GPU
       transform. That leads to addressing host architecture
       specific details as part of the transform (e.g., aarch64 and
       x86_64 will generate different calls vs. sticking with llvm
       intrinsics).

    A combo of ABI issues and/or the fact we're too late in the pass
    pipeline to address this with the current design means more work
    lies ahead...

commit 2668be2
Author: Patrick McCormick <>
Date:   Mon Nov 27 15:09:29 2023 -0700

    Work on hip performance details.

commit 243ff11
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Nov 28 12:52:52 2023 -0700

    Testing streams and odd stalls (UVM?). This version seems to
    remove the stalls but also on a system with a newer kernel
    drop... CUDA only at this point.

commit e7d0c09
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Mon Nov 27 15:02:47 2023 -0700

    Working on some runtime tweaks and clean up. Traced a new crash
    to the use of a ptxas whole-program optimization flag.

commit 9512eb5
Author: Patrick McCormick <>
Date:   Fri Nov 17 08:45:58 2023 -0700

    More work to set up the tests for better HIP and CUDA target
    flexibility, including some reduced complexity in the command
    line arg details in the makefile(s) (e.g., strip mining flags
    for GPU targets moved into the config files vs. being necessary
    in the makefile setup). Added better (correct) AMDGPU target
    attribute selection based on multiple target options (prior
    version was too hard-coded for gfx90a).

commit afdb9c2
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Thu Nov 16 12:59:11 2023 -0700

    A bit more verbose and shared cuda and hip feature management
    (e.g., streaming modes).

commit 4797933
Author: Patrick McCormick <>
Date:   Thu Nov 16 12:58:38 2023 -0700

    Bug fixes for new prefetch feature set.

commit 6875ccf
Author: Patrick McCormick <>
Date:   Thu Nov 16 11:05:27 2023 -0700

    More work on HIP performance debugging...

commit 3ac2d66
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Thu Nov 16 11:01:31 2023 -0700

    First cut at CUDA prefetch streams support. Needs testing...

commit 7e3a8a2
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Thu Nov 16 08:14:07 2023 -0700

    Some refactoring for HIP details, bug chasing, etc.

commit 3f1e09e
Author: Patrick McCormick <>
Date:   Wed Nov 15 09:40:07 2023 -0700

    Some hacking for trying to debug AMD HIP code gen/runtime
    issues. A few new environment variables to make chasing (our
    tails) easier...

    - KITRT_THREADS_PER_BLOCK=1024 (default 256)
    - KITRT_MAX_NUM_PREFETCH_STREAMS=2 (default 4: size of
      round-robin stream queue for concurrent prefetch calls)
    - KITRT_DEVICE_ID=5 (default 0: change the default GPU
      selection)
    - KITRT_MIN_WARPS_PER_EXEC_UNIT=1 (default 1: reducing resource
      usage per warp -- impacts register allocation, etc.)

    The prefetch stream queue is enabled via the command line with
    "-mllvm -hipabi-streams".

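As a rough illustration of the KITRT_* settings listed in commit 3f1e09e, the sketch below reads each variable with the default stated in that message and picks a stream index round-robin. Only the variable names and default values come from the commit message. The helper names `read_kitrt_settings` and `next_prefetch_stream`, the fall-back-on-bad-input behavior, and the choice of Python are assumptions made for illustration and are not the runtime's actual code.

```python
import os

# Variable names and defaults are taken from the commit message.
# Everything else here (helper names, error handling) is hypothetical.
KITRT_DEFAULTS = {
    "KITRT_THREADS_PER_BLOCK": 256,
    "KITRT_MAX_NUM_PREFETCH_STREAMS": 4,
    "KITRT_DEVICE_ID": 0,
    "KITRT_MIN_WARPS_PER_EXEC_UNIT": 1,
}


def read_kitrt_settings(environ=None):
    """Return a dict of settings, using the default when a variable
    is unset or is not a valid integer (assumed behavior)."""
    environ = os.environ if environ is None else environ
    settings = {}
    for name, default in KITRT_DEFAULTS.items():
        raw = environ.get(name)
        try:
            settings[name] = int(raw) if raw is not None else default
        except ValueError:
            settings[name] = default
    return settings


def next_prefetch_stream(call_count, num_streams):
    """Pick a stream index round-robin for the Nth prefetch call."""
    return call_count % num_streams


if __name__ == "__main__":
    cfg = read_kitrt_settings()
    n = cfg["KITRT_MAX_NUM_PREFETCH_STREAMS"]
    print(cfg)
    print([next_prefetch_stream(i, n) for i in range(6)])
```

Running it with, for example, `KITRT_THREADS_PER_BLOCK=1024` set in the environment would report 1024 for that entry and the stated defaults for the rest.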
commit d3f74a0
Author: Patrick McCormick <>
Date:   Wed Nov 8 20:43:08 2023 -0700

    Some cleanup and work to try and chase down HIP target runtime
    variability.

commit c2bb71e
Author: Patrick McCormick <>
Date:   Thu Nov 2 13:04:11 2023 -0600

    Chasing build issues/warnings/errors.

commit b4bafb4
Author: Patrick McCormick <>
Date:   Thu Nov 2 09:04:25 2023 -0600

    Chasing bugs...

commit 9299708
Author: Patrick McCormick <>
Date:   Tue Oct 31 16:23:02 2023 -0600

    Working on benchmarks.

commit 74cb34f
Author: Patrick McCormick <>
Date:   Wed Oct 25 14:31:56 2023 -0600

    Exploring full kokkos builds w/ clang.

commit 4dc3221
Author: Patrick McCormick <>
Date:   Tue Jun 27 14:13:50 2023 -0600

    Some cleanup and small tweaks.

commit 5136b24
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Wed Nov 8 20:27:27 2023 -0700

    Attempt at a quick multi-stream prefetch feature.

commit 8ece574
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Thu Nov 2 11:48:12 2023 -0600

    Small tweaks to sort out some performance details.

commit 9b00669
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Wed Nov 1 20:38:03 2023 -0600

    Tweak in attempt to debug potential numa issues that are
    impacting consistent performance across multiple application
    runs.

commit f5d53a6
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Oct 31 14:46:29 2023 -0600

    A bit more cleanup and adding new tests specific to kitsune.

commit ff078df
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Oct 31 09:47:10 2023 -0600

    A bit more cleanup and adding some infrastructure for the
    multi-target test code (added makefile and a kokkos version).
    Not all the pieces are in place to fully test.

commit d884674
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Oct 31 08:53:36 2023 -0600

    Clean up some code cruft -- no need to duplicate else branch
    cases.

commit 88bc75b
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Tue Oct 31 08:39:57 2023 -0600

    Forgot to save a cleaned up comment...

commit bd7941e
Author: Patrick McCormick <651611+pmccormick@users.noreply.github.com>
Date:   Mon Oct 30 20:01:09 2023 -0600

    New code to handle tapir attributes on Kokkos "statements". Some
    new code for cuda memory management details (calloc, realloc,
    etc.). Along with some prep work for upcoming memory management
    and movement changes.