[AutoBump] Merge with 4b7f07a0 (Aug 27) (12) #365

mgehre-amd · 2024-09-23T20:39:16Z

No description provided.

Comparison operations regression tests, from the original larger PR that has been broken down: llvm#92272 --------- Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com> Co-authored-by: Andrzej Warzyński <andrzej.warzynski@gmail.com>

S.substr(N) is simpler than S.slice(N, StringRef::npos) and S.slice(N, S.size()). Also, substr is probably better recognizable than slice thanks to std::string_view::substr.
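A minimal sketch of the equivalence, using `std::string_view` as a stand-in for `StringRef` so it compiles without LLVM headers. The `slice` helper and the three `dropWith*` functions are hypothetical names introduced for illustration; `slice` re-implements the clamped `[Start, End)` semantics and is not LLVM's code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for StringRef::slice(Start, End): returns the
// half-open range [Start, End), clamping both indices to the string length.
inline std::string_view slice(std::string_view S, std::size_t Start,
                              std::size_t End) {
  Start = std::min(Start, S.size());
  End = std::clamp(End, Start, S.size());
  return S.substr(Start, End - Start);
}

// "Drop the first N characters", written the two older ways...
inline std::string_view dropWithSliceNpos(std::string_view S, std::size_t N) {
  return slice(S, N, std::string_view::npos);
}
inline std::string_view dropWithSliceSize(std::string_view S, std::size_t N) {
  return slice(S, N, S.size());
}

// ...and with substr. std::string_view::substr throws when N > size(),
// so clamp explicitly here to mirror the clamping slice above.
inline std::string_view dropWithSubstr(std::string_view S, std::size_t N) {
  return S.substr(std::min(N, S.size()));
}
```

All three forms return the same suffix; the `substr` form simply states the intent with one argument.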

Use existing helper.

llvm#106000) This reverts commit 33f3ebc.

The helper can simply use VPRecipeBuilder::Plan.

I'm planning to change the inner loop to a range-based for loop.

…#105845)" (llvm#106000)" (llvm#106001) This reverts commit 4b6c064. Add a requirement for an amdgpu target in the test.

…/store on RV32. (llvm#105874) In order to support -unaligned-scalar-mem properly, we need to be more careful with immediates of global variables. We need to guarantee that adding 4 in RISCVExpandingPseudos won't overflow simm12. Since we don't know what the simm12 is until link time, the only way to guarantee this is to make sure the base address is at least 8-byte aligned. There were also several corner-case bugs in immediate folding where we would fold an immediate in the range [2044,2047], where adding 4 would overflow. These are not related to unaligned-scalar-mem.
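The overflow condition can be sketched as below. This is an illustration of the arithmetic only, not the code from the patch; all function names are hypothetical, and the alignment check encodes one reading of the argument above (that the low 12-bit part of an aligned address is a multiple of the alignment).

```cpp
#include <cassert>
#include <cstdint>

// True if V fits in a signed 12-bit immediate, i.e. the range [-2048, 2047].
constexpr bool fitsSimm12(int64_t V) { return V >= -2048 && V <= 2047; }

// An access that is expanded into two accesses at Off and Off + 4 needs
// both offsets to be encodable.
constexpr bool canFoldSplitOffset(int64_t Off) {
  return fitsSimm12(Off) && fitsSimm12(Off + 4);
}

// Assumption: for a base aligned to Align bytes, the largest possible low
// part is the largest multiple of Align that is <= 2047. The split is safe
// for every such base only if that worst case plus 4 still fits.
constexpr bool alignmentGuaranteesSplit(int64_t Align) {
  int64_t MaxLo = 2047 / Align * Align;
  return fitsSimm12(MaxLo + 4);
}
```

Under that assumption, 8-byte alignment gives a worst case of 2040 + 4 = 2044, which fits, while 4-byte alignment gives 2044 + 4 = 2048, which does not.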

…I.liveins(). NFC MachineRegisterInfo::liveins returns std::pair<MCRegister, Register>. Don't convert to std::pair<unsigned, unsigned>.

…5554) The stub class for `FloatType` is present in `ir.pyi`, but it is missing from the `__all__` export list.

This is a fix forward for the issue introduced in llvm#104523.

…en constructing the debug variable for __coro_frame (llvm#105626) As the title says, do not search for the DILocalVariable for __promise when constructing the debug variable for __coro_frame. This makes sense because the debug variable of `__coro_frame` shouldn't depend on the debug variable of `__promise`, and in fact it does not. Currently, we search for the debug variable for `__promise` only because we want to get the debug location and the debug scope of `__promise`. However, we can construct the debug location directly from the debug scope of the function being compiled, so it is no longer necessary to search for the `__promise` variable. This patch also makes the code that constructs the debug variable for `__coro_frame` more stable: we will now always be able to construct the debug variable for the coroutine frame, whether or not we found the debug variable for __promise. This patch is not strictly NFC, but it is intended not to affect any end users.

…ltiple-modules' As the title shows.

…lvm#87265) This PR introduces a new pass, "amdgpu-sw-lower-lds". The pass lowers local data store (LDS) uses in kernel and non-kernel functions in a module to use dynamically allocated global memory. A packed LDS layout is emulated in global memory. The memory instructions lowered from LDS to global memory are then instrumented for address sanitizer, to catch addressing errors. This pass only works when address sanitizer has been enabled and has instrumented the IR. It identifies that the IR has been instrumented using the "nosanitize_address" module flag.

For a kernel, LDS access can be static or dynamic, and either direct (accessed within the kernel) or indirect (accessed through non-kernels).

**Replacement of Kernel LDS accesses:**

- All the LDS accesses corresponding to a kernel will be packed together, where all static LDS accesses are allocated first and dynamic LDS follows. The total size with alignment is calculated. A new LDS global called "SW LDS" will be created for the kernel, and it will have the attribute "amdgpu-lds-size" attached with the calculated size as its value. All the LDS accesses in the module will be replaced by a GEP with an offset into the "SW LDS".
- A new "llvm.amdgcn.<kernel>.dynlds" is created per kernel accessing dynamic LDS. It will be marked as used by the kernel and will have MD_absolute_symbol metadata set to the total static LDS size, since dynamic LDS allocation starts after all static LDS allocation.
- Device global memory equal to the total LDS size will be allocated. In the prologue of the kernel, a single work-item from the work-group does a "malloc" and stores the pointer to the allocation in "SW LDS". To store the offsets corresponding to all LDS accesses, another global variable is created, which is called "SW LDS metadata" in this pass.
- **SW LDS:** An LDS global of ptr type with the name "llvm.amdgcn.sw.lds.<kernel-name>".
- **SW LDS Metadata:** A global of struct type with n members, where n equals the number of LDS globals accessed by the kernel (direct and indirect). Each member of the struct is another struct of type {i32, i32, i32}. The first member is the offset, the second is the size of the LDS global being replaced, and the third is the total aligned size. It will have the name "llvm.amdgcn.sw.lds.<kernel-name>.md". This global will have an initializer with the static LDS offsets and sizes initialized. For dynamic LDS entries, offsets will be initialized to the end offset of the previous static LDS allocation, and their sizes will initially be zero. These dynamic LDS offset and size values will be updated within the kernel, since the kernel can read the dynamic LDS size allocated at runtime by querying the "hidden_dynamic_lds_size" hidden kernel argument.
- In the epilogue of the kernel, the allocated memory is freed by the same single work-item.

**Replacement of non-kernel LDS accesses:**

- Multiple kernels can access the same non-kernel function. All the kernels accessing LDS through non-kernels are sorted and assigned a kernel-id. All the LDS globals accessed by non-kernels are sorted.
- This information is used to build two tables:
  - **Base table:** The base table has a single row, with the elements of the row placed by kernel ID. Each element in the row corresponds to the ptr of the "SW LDS" variable created for that kernel.
  - **Offset table:** The offset table has multiple rows and columns. Rows are numbered from 0 to (n-1), where n is the total number of kernels accessing the LDS through non-kernels. Each row has m elements, where m is the total number of unique LDS globals accessed by all non-kernels. Each element in the row corresponds to the ptr of the replacement of the LDS global done by that particular kernel.
- An LDS variable in a non-kernel will be replaced based on the information from the base and offset tables. Based on a kernel-id query, the ptr of "SW LDS" for the corresponding kernel is obtained from the base table. The offset into the base "SW LDS" is obtained from the corresponding element in the offset table. With this information, the replacement value is obtained.
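The initial contents of the "SW LDS metadata" described above can be sketched roughly as follows. This is not the pass's implementation: the type and function names are hypothetical, and padding each static entry to its own alignment is an assumption made for the example. It only illustrates "static first, dynamic entries start at the static end with size zero".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of one LDS global accessed by a kernel.
struct LdsGlobal {
  uint32_t Size;
  uint32_t Align;
  bool IsDynamic;
};

// Mirrors the {i32, i32, i32} member: offset, size, total aligned size.
struct MdEntry {
  uint32_t Offset;
  uint32_t Size;
  uint32_t AlignedSize;
};

constexpr uint32_t alignUp(uint32_t V, uint32_t A) {
  return (V + A - 1) / A * A;
}

// Build the initializer values: static globals are packed first; dynamic
// globals get the static end offset and a size of zero, to be updated at
// runtime inside the kernel.
std::vector<MdEntry> buildInitialMetadata(const std::vector<LdsGlobal> &Gs) {
  std::vector<MdEntry> Md(Gs.size());
  uint32_t Next = 0;
  for (std::size_t I = 0; I < Gs.size(); ++I) {
    if (Gs[I].IsDynamic)
      continue;
    uint32_t Off = alignUp(Next, Gs[I].Align);         // assumed padding rule
    uint32_t Aligned = alignUp(Gs[I].Size, Gs[I].Align);
    Md[I] = {Off, Gs[I].Size, Aligned};
    Next = Off + Aligned;
  }
  for (std::size_t I = 0; I < Gs.size(); ++I)
    if (Gs[I].IsDynamic)
      Md[I] = {Next, 0, 0};
  return Md;
}
```

For example, a 10-byte static global with alignment 4 and a 16-byte static global with alignment 8 occupy offsets 0 and 16 under these assumptions, and any dynamic entry starts at offset 32.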

/llvm-project/llvm/lib/Target/AMDGPU/AMDGPUSwLowerLDS.cpp:260:10: error: moving a local object in a return statement prevents copy elision [-Werror,-Wpessimizing-move] return std::move(OrderedKernels); ^ /llvm-project/llvm/lib/Target/AMDGPU/AMDGPUSwLowerLDS.cpp:260:10: note: remove std::move call here return std::move(OrderedKernels); ^~~~~~~~~~ ~ 1 error generated.
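A minimal illustration of the pattern the diagnostic points at. Only the variable name `OrderedKernels` comes from the error message; the function name and its contents are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Returning a local by name lets the compiler elide the copy, or fall back
// to an implicit move. Wrapping it in std::move prevents the elision, which
// is what -Wpessimizing-move reports.
std::vector<std::string> getOrderedKernels() {
  std::vector<std::string> OrderedKernels = {"kernel_b", "kernel_a"};
  std::sort(OrderedKernels.begin(), OrderedKernels.end());
  // Flagged form:  return std::move(OrderedKernels);
  return OrderedKernels;
}
```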

…mbdas (llvm#105999) Fixes llvm#104722. Missed handling `decltype(auto)` trailing return types for lambdas. This was a mistake and regression on my part with my PR, llvm#104722. Added some missing unit tests to test for the various placeholder trailing return types in lambdas.

… in compiler-rt with lit internal shell (llvm#105917) There are several files in the compiler-rt subproject that have command not found errors. This patch uses the `env` command to set the environment variables correctly when using the lit internal shell. fixes: llvm#102395 This change is relevant to [[RFC] Enabling the lit internal shell by Default](https://discourse.llvm.org/t/rfc-enabling-the-lit-internal-shell-by-default/80179)

…ce. NFC This matches copyPhysReg.

…uple. NFC

…arameter packs (llvm#102131) We established an instantiation scope in order for constraint equivalence checking to properly map the uninstantiated parameters. That mechanism mapped even packs to themselves. Consequently, parameter packs e.g. appearing in a function call, were not expanded. So they would end up becoming `SubstTemplateTypeParmPackType`s that circularly depend on the canonical declaration of the function template, which is not yet determined, hence the spurious error. No release note as I plan to backport it to 19. Fixes llvm#101735 --------- Co-authored-by: cor3ntin <corentinjabot@gmail.com>

…6039) Fixes linking error in llvm CI: "AMDGPUSwLowerLDS::run()': AMDGPUSwLowerLDS.cpp:(.text._ZN12_GLOBAL__N_116AMDGPUSwLowerLDS3runEv+0x164): undefined reference to `llvm::getAddressSanitizerParams(llvm::Triple const&, int, bool, unsigned long*, int*, bool*)'" llvm#87265 amdgpu-sw-lower-lds pass uses getAddressSanitizerParams method from AddressSanitizer pass. It misses linking of LLVMInstrumentation to AMDGPUCodegen. This PR adds it.

Take the intersection of the existing range attribute for the return value and the inferred range.
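As a rough sketch of what "take the intersection" means, here is a simplified non-wrapping, half-open interval. The `Range` type and `intersect` function are hypothetical names for this example; LLVM's own range type also handles wrapped ranges, which this example does not.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>

// Simplified half-open range [Lo, Hi); no wrap-around is modeled.
struct Range {
  int64_t Lo;
  int64_t Hi;
};

// Values allowed by both the existing attribute and the inferred range.
// Returns std::nullopt when the two ranges have no value in common.
std::optional<Range> intersect(Range Existing, Range Inferred) {
  Range R{std::max(Existing.Lo, Inferred.Lo),
          std::min(Existing.Hi, Inferred.Hi)};
  if (R.Lo >= R.Hi)
    return std::nullopt;
  return R;
}
```

With an existing range of [0, 100) and an inferred range of [10, 200), the merged result is [10, 100), which is at least as precise as either input.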

…lvm#104941) getMaskedTypeForICmpPair() tries to model non-and operands as x & -1. However, this can end up confusing the matching logic, by picking the -1 operand as the "common" operand, resulting in a successful, but useless, match. This is what causes commutation failures for some of the optimizations driven by this function. Fix this by treating a match against -1 as a non-match.

…lvm#104788) This is a followup to llvm#104579 to remove the limitation on sinking loads/stores of allocas entirely, even if this would introduce a phi node. Nowadays, SROA supports speculating load/store over select/phi. Additionally, SimplifyCFG with sinking only runs at the end of the function simplification pipeline, after SROA. I checked that the two tests modified here still successfully SROA after the SimplifyCFG transform. We should, however, keep the limitation on lifetime intrinsics. SROA does not have speculation support for these, and I've also found that the way these are handled in the backend is very problematic (llvm#104776), so I think we should leave them alone.

The legacy cost model in some parts checks if any of the operands are constants via SCEV. Update VPlan construction to replace live-ins that are constants via SCEV with such constants. This means VPlans (and codegen) reflect what we are computing the cost of, and it removes another case where the legacy and VPlan cost models diverged. Fixes llvm#105722.

This PR enables the "amdgpu-sw-lower-lds" pass in the pipeline. It also introduces the "amdgpu-enable-sw-lower-lds" command-line flag to enable/disable the pass.

…late decl (llvm#89934) In the last patch llvm#82310, we used template depths to tell if such alias decls contain lambdas, which is wrong because the lambda can also appear as a part of the default argument, and that would make `getTemplateInstantiationArgs` provide extra template arguments in undesired contexts. This leads to issue llvm#89853. Moreover, our approach for llvm#82104 was sadly wrong. We tried to teach `DeduceReturnType` to consider alias template arguments; however, giving these arguments in the context where they should have been substituted in a `TransformCallExpr` call is never correct. This patch addresses such problems by using a `RecursiveASTVisitor` to check if the lambda is contained by an alias `Decl`, as well as twiddling the lambda dependencies - we should also build a dependent lambda expression if the surrounding alias template arguments were dependent. Fixes llvm#89853 Fixes llvm#102760 Fixes llvm#105885

We can decide whether to expand isel or not in instruction selection pass and early-if-conversion pass. The transformation implemented in PPCExpandISel can be retired considering PPC backend doesn't generate `isel` instructions post-RA. Also if we are seeking performant branch-or-isel decision, we can turn to selectoptimize pass. --------- Co-authored-by: Kai Luo <lkail@cn.ibm.com>

This reverts commit c3776c1. This relands commit a959d70. This was originally causing bot failures on Python version 3.8. This relanding fixes that by adjusting the relevant type annotations that are not supported in earlier versions.

The renamable flag is useful during MachineCopyPropagation, but the renamable flag will be dropped after lowerCopy in some cases. This patch introduces extra arguments to pass the renamable flag to copyPhysReg.

While working on a MIR unittest, I noticed that parseMIR includes an unused argument that sets a function name. This is not only redundant but also irrelevant, as parseMIR is designed to parse an entire module, not specific functions, even though most unittests contain a single function per module. To streamline the API, I have removed this unnecessary argument from parseMIR. However, if this argument was originally included to enhance readability or for any other purpose, please let me know.

add f8E5M2 and tests for np_to_memref --------- Co-authored-by: Zhicheng Xiong <zhichengx@dc2-sim-c01-215.nvidia.com>

TSAN warns that `ptr` is read and written without protection in `clearExpiredEntries` and in the destructor of `Owner`. Add an atomic bool to synchronize these without incurring a cost when calling `get`.

…rt command in lit's internal shell (llvm#105961) This patch fixes the incorrect usage of lit's built-in `export` command. There is a typo in raising the error itself where the error being raised had the wrong number of parameters passed in. Fixes llvm#102386.

… tests (llvm#105754) This patch rewrites tests in clang and compiler-rt that use bash command substitution syntax $() to execute the dirname command. This is done so that the tests can be run using lit's internal shell. Fixes llvm#102384.

…sts with lit internal shell (llvm#105729) This patch addresses compatibility issues with the lit internal shell by removing the use of subshell execution (parentheses and subshell syntax) in the `merge-posix.test` and `vptr.cpp` tests. The lit internal shell does not support parentheses, so the tests have been refactored to use separate command invocations. This change is relevant for enabling the lit internal shell by default, as outlined in [[RFC] Enabling the Lit Internal Shell by Default](https://discourse.llvm.org/t/rfc-enabling-the-lit-internal-shell-by-default/80179) fixes: llvm#102401

…ompatibility (llvm#106115) This patch addresses compatibility issues with the lit internal shell by expanding and rewriting test scripts in the compiler-rt subproject. These changes were prompted by the FileNotFound unresolved errors encountered during the testing process, specifically when running the command `LIT_USE_INTERNAL_SHELL=1 ninja check compiler-rt`.

**Why the error occurred:** The original test scripts used process substitution `(<(...))` in their diff commands. Process substitution creates temporary files or FIFOs to hold command output, and these are then passed to `diff`. However, the lit internal shell, which is more limited than a typical shell like `bash`, does not support process substitution. When lit tries to execute these commands, it is unable to create or access the temporary files or FIFOs generated by process substitution. As a result, lit attempts to open a file or directory that doesn't exist, leading to the `FileNotFoundError`.

**Changes Made:**

- Instead of using process substitution, the commands now explicitly redirect the output of `llvm-profdata show` to temporary files before performing the `diff` comparison. This ensures that the lit internal shell can correctly find and open these files, resolving the `FileNotFoundError`.

This change is relevant to [[RFC] Enabling the lit internal shell by Default](https://discourse.llvm.org/t/rfc-enabling-the-lit-internal-shell-by-default/80179) fixes: llvm#106111

…or-loop (llvm#106150) This patch adds `REQUIRES: shell` to `focus-function.test` because the lit internal shell does not support the for loop syntax. This will make the test file unsupported when running llvm-lit with its internal shell implementation, which is enabled by setting `LIT_USE_INTERNAL_SHELL=1`. fixes: llvm#106111

By default, type legalization will try to promote the build_vector, but the generic type legalizer doesn't support that. Bitcast to vXi16 instead. Same as what we do for vXf16 without Zfhmin. Fixes llvm#100846.

Add NotConstant(Null) roots for nonnull arguments and then propagate them through nuw/inbounds GEPs. Having this functionality in SCCP is useful because it allows reliably eliminating null comparisons, independently of how deeply nested they are in selects/phis. This handles cases that would hit a cutoff in ValueTracking otherwise. The implementation is something of an MVP; there are a number of obvious extensions (e.g. allocas are also non-null).

…icit-bool-conversion (llvm#104882) When both readability-implicit-bool-conversion-check and readability-uppercase-literal-suffix-check are enabled, a fix has to be applied twice: first from (!i) -> (i == 0u), and then from (i == 0u) -> (i == 0U). With this option, the intermediate step is skipped. Adding this option allows this check to be in sync with readability-uppercase-literal-suffix, avoiding duplicate warnings. Fixes llvm#40544
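For reference, the three spellings mentioned above behave identically at runtime; only the source text differs. The function names below are hypothetical and exist just to put the forms side by side.

```cpp
#include <cassert>

// Original code: implicit conversion of an unsigned value to bool.
bool isZeroImplicit(unsigned i) { return !i; }

// Intermediate fix-it form with a lowercase literal suffix, which the
// uppercase-literal-suffix check would then flag again.
bool isZeroLowerSuffix(unsigned i) { return i == 0u; }

// Final form that satisfies both checks in a single step.
bool isZeroUpperSuffix(unsigned i) { return i == 0U; }
```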

…lvm#104785) There was some inconsistency with ConvertVectorToLLVM Pass builder, files and option names. This patch aims to move all occurrences to ConvertVectorToLLVM.

OpenCL's vload_half builtin expects two arguments, but the current TableGen definition expects three. This change fixes the mismatch and adds a test to check this.

…lvm#105663) Use SmallVectorImpl instead of SmallVector for function arguments to give the caller greater flexibility in choice of initial size.

jurahul and others added 30 commits August 25, 2024 05:40

[NFC] Use const members of StringToOffsetTable (llvm#105824)

1193f7d

[AMDGPU][LTO] Assume closed world after linking (llvm#105845)

33f3ebc

[llvm] Prefer StringRef::substr to StringRef::slice (NFC) (llvm#105943)

33e7cd6

S.substr(N) is simpler than S.slice(N, StringRef::npos) and S.slice(N, S.size()). Also, substr is probably better recognizable than slice thanks to std::string_view::substr.

[VPlan] Use getVPValueOrAddLiveIn in mapToVPValues (NFC).

d66cbec

Use existing helper.

Revert "[AMDGPU][LTO] Assume closed world after linking (llvm#105845)" (

4b6c064

llvm#106000) This reverts commit 33f3ebc.

[VPlan] Remove unneeded Plan arg from getVPValueOrAddLiveIn (NFC).

d853b3f

The helper can simply use VPRecipeBuilder::Plan.

[Mips] clang-format prescanForConstants (NFC)

675c748

I'm planning to change the inner loop to a range-based for loop.

Revert "Revert "[AMDGPU][LTO] Assume closed world after linking (llvm…

033e225

…#105845)" (llvm#106000)" (llvm#106001) This reverts commit 4b6c064. Add a requirement for an amdgpu target in the test.

[CodeGen] Use std::pair<MCRegister, Register> to match return from MR…

c503758

…I.liveins(). NFC MachineRegisterInfo::liveins returns std::pair<MCRegister, Register>. Don't convert to std::pair<unsigned, unsigned>.

[mlir] NFC: add missing 'FloatType' to core Python stub file (llvm#10…

92e00af

…5554) The stub class for `FloatType` is present in `ir.pyi`, but it is missing from the `__all__` export list.

[lldb] Support non-default libc++ ABI namespace

68a1593

This is a fix forward for the issue introduced in llvm#104523.

[NFC] Add an assertion requirement to an opt test (llvm#106027)

d982882

[doc] [C++20] [Modules] Add docs and release notes for '-Wdecls-in-mu…

88f9ac3

…ltiple-modules' As the title shows.

[AArch64] Use MCRegister in AArch64InstrInfo::copyGPRRegTuple interfa…

b12d338

…ce. NFC This matches copyPhysReg.

[AArch64] Pass DebugLoc by reference to AArch64InstrInfo::copyGPRRegT…

7e6b150

…uple. NFC

[compiler-rt][nsan] Adjust nan check

65d6c47

[SCCP] Merge return range attributes (llvm#105998)

dad14d4

Take the intersection of the existing range attribute for the return value and the inferred range.

[AMDGPU] Enable "amdgpu-sw-lower-lds" pass in pipeline. (llvm#89206)

1f02be2

This PR enables the "amdgpu-sw-lower-lds" pass in the pipeline. It also introduces the "amdgpu-enable-sw-lower-lds" command-line flag to enable/disable the pass.

zyn0217 and others added 29 commits August 27, 2024 09:25

[IR] Bump AttributeBitSet width to 16 bytes (llvm#106138)

1200d35

[TII][RISCV] Add renamable bit to copyPhysReg (llvm#91179)

b01c006

The renamable flag is useful during MachineCopyPropagation, but the renamable flag will be dropped after lowerCopy in some cases. This patch introduces extra arguments to pass the renamable flag to copyPhysReg.

[MLIR][Python] add f8E5M2 and tests for np_to_memref (llvm#106028)

c8cac33

add f8E5M2 and tests for np_to_memref --------- Co-authored-by: Zhicheng Xiong <zhichengx@dc2-sim-c01-215.nvidia.com>

[mlir] ThreadLocalCache: make TSAN happy about destructors (llvm#106170)

ce2b488

TSAN warns that `ptr` is read and written without protection in `clearExpiredEntries` and in the destructor of `Owner`. Add an atomic bool to synchronize these without incurring a cost when calling `get`.

[RISCV] Merge duplicate switch cases. NFC

f54ae6d

[RISCV] Custom legalize vXbf16 BUILD_VECTOR without Zfbfmin.

0ef8e71

By default, type legalization will try to promote the build_vector, but the generic type legalizer doesn't support that. Bitcast to vXi16 instead. Same as what we do for vXf16 without Zfhmin. Fixes llvm#100846.

pre-commit test for llvm#106195 (llvm#106196)

57c1e21

[SLP][REVEC] Expand getelementptr into vector form. (llvm#103704)

3d1c63e

[mlir][vector] Rename LowerVectorToLLVM to ConvertVectorToLLVM (NFC) (l…

cb9267f

…lvm#104785) There was some inconsistency with ConvertVectorToLLVM Pass builder, files and option names. This patch aims to move all occurrences to ConvertVectorToLLVM.

[SPIR-V] Fix vload_half builtin argument count (llvm#105585)

73834f4

OpenCL's vload_half builtin expects two arguments, but the current TableGen definition expects three. This change fixes the mismatch and adds a test to check this.

[Analysis][NFC] Use SmallVectorImpl consistently in ScalarEvolution (l…

0caa909

…lvm#105663) Use SmallVectorImpl instead of SmallVector for function arguments to give the caller greater flexibility in choice of initial size.

[gn build] Port 5f6172f

8f6864e

[gn build] Port 7bc9d95

27ec464

[gn build] Port 89c27d6

1deae20

[gn build] Port 8e901c2

df00828

[gn build] Port cbf34a5

9a4bf2c

[gn build] Port db94852

4b7f07a

[AutoBump] Merge with 4b7f07a (Aug 27)

3db25fc

cferry-AMD approved these changes Sep 30, 2024
