[AutoBump] Merge with 76782e28 (needs torch-mlir bump, has downstream PR) (31) #291

cferry-AMD · 2024-08-20T13:08:48Z

This goes up to April 16.
Needs Xilinx/torch-mlir#240

Like with the 'default' clause, this is being applied to only Compute Constructs for now. The 'if' clause takes a condition expression which is used as a runtime value. This is not a particularly complex semantic implementation, as there isn't much to this clause, other than its interactions with 'self', which will be managed in the patch to implement that.

Fix for llvm#88451 Do not perform semantic check about data transfer on assignment statement in device context.

Closes llvm#88066. Compared to before, the function names in the stdbit table are sorted by function name, not order-of-appearance in the standard. Since macros aren't printed by docgen.py and are still a TODO in the code, they are also not printed in the new stdbit.h docs. Adds some checks to docgen.py for conditions that tripped me up. Add code to docgen.py to add the include of the `|check|` rewriter, since all other generated files need it.

Leftover from a previous commit, this ends up not being used, so remove it.

While working on a followup patch, it became clear that this extra bit of 'OpenACC' before each clause name was redundant with the visitors, so remove it to make this a little less verbose.

…esystem` to have it use cached `status` (llvm#88152) As-is, calls to `exists()` fallback on the implementation in `ProxyFileSystem::exists` which explicitly calls out to the underlying `FS`, which for the `DependencyScanningFilesystem` (overlay) is the real underlying filesystem. Instead, directly overloading `exists` allows us to have it rely on the cached `status` behavior used elsewhere by the `DependencyScanningFilesystem`.

Looks like I forgot to do build CIndex.cpp when validating myself!

Running the test-release.sh script with PGO enabled causes build errors like: ld.lld: error: Function Import: link error: linking module flags 'ProfileSummary': IDs have conflicting values I believe this a build system bug due to the PGO profile data being generated unconditionally. If you run `ninja check-all` and then `ninja install` like we do in test-release.sh, then the profile data is regenerated during `ninja install` and some of the clang tools which are not test dependencies get build during the ninja install step with different profile data. When these tools link against the LLVM libraries, like libSupport, we end up with these errors.

…llvm#87533) InstCombine transforms add of 0 to or of 0. For system atomics, this is problematic because while PCIe supports add, it does not support the other operations. Undo this for system scope atomics.

The malloc->calloc fold creates a new MemoryAccess, which may end of at the same address as a previously deleted access inside SkipStores. To the most part, this is not a problem, because SkipStores is normally only used together with MemDefs. Neither the old malloc access nor the new calloc access will be part of MemDefs, so there is no problem here. However, SkipStores is also used in one more place: In the main DSE loop, ToCheck entries are checked against it. Fix this by not using SkipStores here, and instead using a separate set to track deletions inside this loop. This way it is not affected by the calloc optimization that happens outside it. This is all pretty ugly, but I haven't found another good way to fix it. Suggestions welcome. No test case as I don't have a reliable DSE-only test-case for this. Fixes llvm#84458.

…vm#88211) StringMap uses xxHash instead of DJB.

Otherwise the test would fail on Darwin and other platforms that use Itanium ABI but do not support alias/ifunc.

… NFC After inlining, `scanSection` is significantly longer (more than 100+ instructions on x86-64 built with Clang) when `i` does not always increment by one (MIPS).

Enable frame pointer optimization by default to match it with gcc. Fixes: llvm#75013

This patch moves CUDA-related `Sema` function into new `SemaCUDA` class, following the recent example of SYCL, OpenACC, and HLSL. This is a part of the effort to split Sema. Additional context can be found in llvm#82217, llvm#84184, llvm#87634.

This patch implements intrinsic that supports `std::is_pointer_interconvertible_base_of` type trait from [P0466R5](https://wg21.link/p0466r5) "Layout-compatibility and Pointer-interconvertibility Traits". Normative wording: > Comment: If `Base` and Derived are non-union class types and are not (possibly _cv_-qualified) versions of the same type, `Derived` is a complete type. > Condition: `Derived` is unambiguously derived from `Base` without regard to _cv_-qualifiers, and each object of type `Derived` is pointer-interconvertible (6.7.2 [basic.compound]) with its `Base` subobject, or `Base` and `Derived` are not unions and name the same class type without regard to _cv_-qualifiers. The paper also express the following intent: > Note that `is_pointer_interconvertible_base_of_v<T,T>` is always true under this wording, even though `T` is not derived from itself. I find the treatment of unions in the wording contradictory to this intent, and I'm not able to find anything relevant in minutes or on the reflector. That said, this patch implements what the wording says, since it's very explicit about unions.

This patch is breaking `dr` to `cwg` equivalence in our terminology, making room for tests for LWG issues that concern compiler intrinsics.

…tem modules (llvm#88432) Co-authored-by: Alex Lorenz <arphaman@gmail.com>

Some tests in `dr6xx.cpp` were using C11 `_Static_assert`, and were expecting extension warnings in C++98 mode because of that. This is noise, and we can do better than that.

Coding my way home: 4.42841S, 102.96190W

1. Move expansion to combineFP_EXTEND to help with small vectors; 2. Combine FP_ROUND to reduce fptrunc then fpextend after promotion;

This patch uses `getConstantRangeAtUse` to infer nsw/nuw flags with at-use info. It will enables more optimizations in InstCombine. Compile-time impact: http://llvm-compile-time-tracker.com/compare.php?from=a5ed14bc8e122fa5ac0aa81f8d8390931bd6b4e4&to=a83d3402b663439b91cb37a046fb7ac0220ba5c7&stat=instructions%3Au Related issue: llvm#87854

This patch enables vectorization for non-power-of-2 VFs. Initially only VFs where adding 1 makes the VF a power-of-2, i.e. we can still make relatively effective use of the vectors. It relies on the existing target cost-models to return accurate costs for non-power-of-2 vectors. I checked mostly AArch64 and X86 and there the costs seem reasonable for the costs I checked, although I expect there will be a need to refine both the cost-models and lowering to make most effective use of non-power-of-2 SLP vectorization. Note that re-ordering and shuffling is not implemented for nodes requiring padding yet to keep the initial implementation simpler. The feature is guarded by a new flag, off by defaul for now. PR: llvm#77790

The overhead of taking a std::mutex is much lower than taking a reader lock on a shared mutex, even under heavy contention. The benefit of shared_mutex only occurs as the amount of time spent in the critical sections grows large enough. In our case all we do is read a pointer and return the lock. As a result, using a shared lock can be ~50%-100% slower Here are the results for the provided benchmark on my machine: ``` 2024-04-07T12:48:51-04:00 Running ./libcxx/benchmarks/shared_mutex_vs_mutex.libcxx.out Run on (12 X 400 MHz CPU s) CPU Caches: L1 Data 32 KiB (x6) L1 Instruction 32 KiB (x6) L2 Unified 1024 KiB (x6) L3 Unified 32768 KiB (x1) Load Average: 2.70, 2.70, 1.63 --------------------------------------------------------------------- Benchmark Time CPU Iterations --------------------------------------------------------------------- BM_shared_mutex/threads:1 13.9 ns 13.9 ns 50533700 BM_shared_mutex/threads:2 34.5 ns 68.9 ns 9957784 BM_shared_mutex/threads:4 38.4 ns 137 ns 4987772 BM_shared_mutex/threads:8 51.1 ns 358 ns 1974160 BM_shared_mutex/threads:32 57.1 ns 682 ns 1043648 BM_mutex/threads:1 5.54 ns 5.53 ns 125867422 BM_mutex/threads:2 15.5 ns 30.9 ns 21830116 BM_mutex/threads:4 15.4 ns 57.2 ns 12136920 BM_mutex/threads:8 19.3 ns 140 ns 4997080 BM_mutex/threads:32 20.8 ns 252 ns 2859808 ```

…plemented (llvm#87871) Implemented: https://wg21.link/P0905R1 in 0913ca1 - https://eel.is/c++draft/cmp.weakeq (Removed in https://wg21.link/P1959R0) - https://eel.is/c++draft/cmp.strongeq (Removed in https://wg21.link/P1959R0) - https://eel.is/c++draft/cmp.partialord - https://github.com/llvm/llvm-project/blob/4abb722ffa7fcf809faa4a479fdf2f78c685b351/libcxx/include/__compare/ordering.h#L101 - https://eel.is/c++draft/cmp.weakord - https://github.com/llvm/llvm-project/blob/4abb722ffa7fcf809faa4a479fdf2f78c685b351/libcxx/include/__compare/ordering.h#L173 - https://eel.is/c++draft/cmp.strongord - https://github.com/llvm/llvm-project/blob/4abb722ffa7fcf809faa4a479fdf2f78c685b351/libcxx/include/__compare/ordering.h#L250 References: - https://wg21.link/P1959R0 - Remove `std::weak_equality` and `std::strong_equality`

…issing (llvm#88573) When then linker creates runtime pseudo relocations, it places them in a list with the assumption that the runtime will fix these relocations later, when the image gets loaded. If the relevant runtime function doesn't seem to be present in the linked image, error out. Normally when linking the mingw-w64 runtime libraries, this function always is available. However, if linking without including the mingw-w64 CRT startup files, and the image needs runtime pseudo relocations, make it clear that this won't work as expected at runtime. With ld.bfd, this situation is a hard error too; ld.bfd adds an undefined reference to this symbol if runtime pseudo relocations are needed. A later alternative would be to actually try to pull in the symbol (if seen in a static library, but not included yet). This would allow decoupling the function from the main mingw-w64 CRT startup code (making it optional, only running if the linker actually produced runtime pseudo relocations). Doing that would require restructuring the lld code (gathering pseudo relocations earlier, then loading the relocator function, then pulling in more object files to satisfy the dependencies of the relocator) though. Also, ld.bfd doesn't currently successfully pull in more object files to satisfy the dependency on _pei386_runtime_relocator, so with that in mind, there's not much extra value in making LLD do it currently either; we can't make such a change in mingw-w64's CRT until both linkers handle it. This fixes one issue brought up in llvm#84424.

We rename `TuneNoStripWSuffix` to `TunePreferWInst`. If all the users of an instruction just use the low 32 bits, we can convert it to its W variant. A quick test on Coremark (`-O3 -march=rv64gc`): | | W instructions | code size(.text) | |--------|----------------|------------------| | before | 302 | 12257 | | after | 343 | 12265 | | | +13.58% | +0.065% | Reviewers: asb, dtcxzyw, preames, lukel97, michaelmaitland, topperc Reviewed By: topperc, dtcxzyw Pull Request: llvm#87237

Close llvm#87609 We tried to profile the body of the lambda expressions in https://reviews.llvm.org/D153957. But as the original comments show, it is indeed dangerous. After we tried to skip calculating the ODR hash values recently, we have fall into this trap twice. So in this patch, I choose to not profile the body of the lambda expression. The signature of the lambda is still profiled.

This patch changes the preprocessor directives surrounding getCurrentTID, particularly moving it out of the block that is only defined when not building for Android. The getCurrentTID function is called in places that only require Linux definitions, so this function should have the same preprocessor scoping around it to prevent link time failures.

This PR includes: * vadd.vv/vand.vv/vor.vv/vxor.vv * vmseq.vv/vmsne.vv * vmin.vv/vminu.vv/vmax.vv/vmaxu.vv * vmul.vv/vmulh.vv/vmulhu.vv * vwadd.vv/vwaddu.vv * vwmul.vv/vwmulu * vwmacc.vv/vwmaccu.vv * vadc.vvm There is no test change, I may add it later. Fixes part of llvm#64422 Reviewers: michaelmaitland, preames, lukel97, topperc, asb Reviewed By: topperc, lukel97 Pull Request: llvm#88379

Functions inside __cpu_traits were needlessly prefixed with __parallel, which doesn't serve a real purpose anymore now that they are inside a traits class.

https://bugs.llvm.org/show_bug.cgi?id=51559 llvm#50901 IPSCCP pass removes the global variable and does not create a constant expression for the initializer value. Extend test coverage to include: - half, bfloat types. - checks for undef (int32 and ptr). There is no support for: - fp128, x86_fp80, ppc_fp128 types. llvm#88102

This commit ensures that the `CallDescription`s in `MallocChecker` are matched with the mode `CDM::CLibrary`, so: - they don't match methods or functions within user-defined namespaces; - they also match builtin variants of these functions (if any), so the checker can model `__builtin_alloca()` like `alloca()`. This change fixes llvm#81597. New tests were added to verify that `std::malloc` and `std::free` (from `<cstdlib>`) are modeled, but a method that's named e.g. `free` isn't confused with the memory release function. The responsibility for modeling `__builtin_alloca` and `__builtin_alloca_with_align` was moved from `BuiltinFunctionChecker` to `MallocChecker`, to avoid buggy interactions between the checkers and ensure that the builtin and non-builtin variants are handled by exactly the same logic. This change might be a step backwards for the users who don't have `unix.Malloc` enabled; but I suspect that `__builtin_alloca()` is so rare that it would be a waste of time to implement backwards compatibility for them. There were several test files that relied on `__builtin_alloca()` calls to get an `AllocaRegion`, these were modified to enable `unix.Malloc`. One of these files (cxx-uninitialized-object-ptr-ref.cpp) had some tests that relied on the fact that `malloc()` was treated as a "black box" in them, these were updated to use `calloc()` (to get initialized memory) and `free()` (to avoid memory leak reports). While I was developing this change, I found a very suspicious assert in `MallocChecker`. As it isn't blocking the goals of this commit, I just marked it with a FIXME, but I'll try to investigate and fix it in a follow-up change.

llvm#87980) This commit generalizes and cleans up the `ValueBoundsConstraintSet` API. The API used to provide function overloads for comparing/computing bounds of: - index-typed SSA value - dimension of shaped value - affine map + operands This commit removes all overloads. There is now a single entry point for each `compare` variant and each `computeBound` variant. These functions now take a `Variable`, which is internally represented as an affine map and map operands. This commit also adds support for computing bounds for an affine map + operands. There was previously no public API for that.

…m#88751) Fixes llvm#88478 Promoting the `EHCleanup` to `NormalAndEHCleanup` in `EmitCallArgs` surfaced another bug with deactivation of normal cleanups. Here we missed emitting CPP scope ends for deactivated normal cleanups. This patch also fixes that bug. We missed emitting CPP scope ends because we remove the `fallthrough` (clears the insertion point) before deactivating normal cleanups. This is to make the emitted "normal" cleanup code unreachable. But we still need to emit CPP scope ends in the original basic block even for a deactivated normal cleanup. (This worked correctly before we did not remove `fallthrough` for `EHCleanup`s).

We were calling classfiyPrim() instead of classify().

This gives us one extra SGPR to play with. The comment suggested that it could cause bugs, but I have tested it with Vulkan CTS with the default wave size for compute shaders set to 32 and did not find any problems.

…Y) & NegPow2C` (llvm#88859) Alive2: https://alive2.llvm.org/ce/z/NFtkSX This optimization will be beneficial to jemalloc users.

llvm-project/mlir/test/lib/Dialect/Test/TestDialect.cpp:597:31: error: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int64_t' (aka 'long') [-Werror,-Wsign-compare] if (getVarOperands().size() != expectedNumOperands) ~~~~~~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~ 1 error generated.

This patch updates the definition of `omp.distribute` to enforce the restrictions of a wrapper operation.

Re-land llvm#88395 Two build-bots were broken by the old version: - https://lab.llvm.org/buildbot/#/builders/285/builds/245 - https://lab.llvm.org/buildbot/#/builders/21/builds/96988 The problem in both cases was that the compiler did not support `std::filesystem` (which I use in the unit test). I have removed the dependency upon std::filesystem because there isn't an easy way to add the right linker options so that this is supported correctly in all build environments [1] [1] https://gitlab.kitware.com/cmake/cmake/-/issues/17834 --- This is a GNU extension: https://gcc.gnu.org/onlinedocs/gfortran/ACCESS.html Used in SALMON: https://salmon-tddft.jp/download.html Unfortunately the intrinsic takes a file path to operate on so there isn't an easy way to make the test robust. The unit test expects to be able to create, set read write and execute permissions, and delete files called std::filesystem::temp_directory_path() / <test_name>.<pid> The test will fail if a file already exists with that name. I have not implemented the intrinsic on Windows because this is wrapping a POSIX system call and Windows doesn't support all of the permission bits tested by the intrinsic. I don't have a Windows machine easily available to check if Gfortran implements this intrinsic on Windows.

The old code was replaced by llvm#80019.

erichkeane and others added 30 commits April 12, 2024 14:13

[flang][cuda] Fix crash in semantic (llvm#88577)

7b6b023

Fix for llvm#88451 Do not perform semantic check about data transfer on assignment statement in device context.

Fix typos (llvm#88565)

900be90

[NFC] Remove unused type alias in OpenACC

eaa3947

Leftover from a previous commit, this ends up not being used, so remove it.

[NFC][OpenACC] Rename visitor functions to omit redundant OpenACC

a472e64

While working on a followup patch, it became clear that this extra bit of 'OpenACC' before each clause name was redundant with the visitors, so remove it to make this a little less verbose.

[NFC] Clang-format previous OpenACC Patch

db8e182

Fix build failure from a472e64

03c393d

Looks like I forgot to do build CIndex.cpp when validating myself!

AMDGPU: Undo atomicrmw add/sub/xor 0 -> atomicrmw or canonicalization (…

9bd1085

…llvm#87533) InstCombine transforms add of 0 to or of 0. For system atomics, this is problematic because while PCIe supports add, it does not support the other operations. Undo this for system scope atomics.

[test] Improve tests for alias/ifunc attributes

b9bed1f

[unused-includes][ADT] Don't include DJB.h as it's not used. NFC. (ll…

f172bfc

…vm#88211) StringMap uses xxHash instead of DJB.

[test] alias-unused.cpp: set triple to Linux

f22c300

Otherwise the test would fail on Darwin and other platforms that use Itanium ABI but do not support alias/ifunc.

[ELF] Relocations: Avoid MIPS check for ELFCLASS64 with if constexpr.…

aef2810

… NFC After inlining, `scanSection` is significantly longer (more than 100+ instructions on x86-64 built with Clang) when `i` does not always increment by one (MIPS).

[M68k][clang] Enable frame pointer optimization by default (llvm#87264)

94b3c19

Enable frame pointer optimization by default to match it with gcc. Fixes: llvm#75013

[clang] Introduce SemaCUDA (llvm#88559)

0a6f6df

This patch moves CUDA-related `Sema` function into new `SemaCUDA` class, following the recent example of SYCL, OpenACC, and HLSL. This is a part of the effort to split Sema. Additional context can be found in llvm#82217, llvm#84184, llvm#87634.

[InstallAPI] Replace std::string -> StringRef

79dca25

[clang][NFC] Replace dr with cwg in DR test suite

ed128c7

This patch is breaking `dr` to `cwg` equivalence in our terminology, making room for tests for LWG issues that concern compiler intrinsics.

[modules] allow use of ptrauth module from no_undeclared_includes sys…

4138367

…tem modules (llvm#88432) Co-authored-by: Alex Lorenz <arphaman@gmail.com>

[clang][NFC] Tidy up extension warnings about _Static_assert

5b36bf5

Some tests in `dr6xx.cpp` were using C11 `_Static_assert`, and were expecting extension warnings in C++98 mode because of that. This is noise, and we can do better than that.

[mlir][spirv] Add op decoration NoContraction (llvm#88578)

8d8502a

[ORC] Replace more KV loop variables with structured bindings.

37ebf2a

Coding my way home: 4.42841S, 102.96190W

[X86][BF16] Improve vectorization of BF16 (llvm#88486)

3cf8535

1. Move expansion to combineFP_EXTEND to help with small vectors; 2. Combine FP_ROUND to reduce fptrunc then fpextend after promotion;

mstorsjo and others added 21 commits April 16, 2024 10:37

[libc++] Rename __cpu_traits functions (llvm#88741)

5b81156

Functions inside __cpu_traits were needlessly prefixed with __parallel, which doesn't serve a real purpose anymore now that they are inside a traits class.

[clang][Interp] Gracefully handle bitcasts to non-primitive types

9141e1c

We were calling classfiyPrim() instead of classify().

[AMDGPU] Fix implicit operands of VOPD cndmask instructions (llvm#87788)

4fc0a99

[AMDGPU] Stop reserving $vcc_hi in wave32 mode (llvm#87783)

f4f772c

This gives us one extra SGPR to play with. The comment suggested that it could cause bugs, but I have tested it with Vulkan CTS with the default wave size for compute shaders set to 32 and did not find any problems.

[clang][Interp][NFC] Add Block::dump()

485d556

[InstCombine] Fold `minmax (X & NegPow2C, Y & NegPow2C) -> minmax(X, …

80fce05

…Y) & NegPow2C` (llvm#88859) Alive2: https://alive2.llvm.org/ce/z/NFtkSX This optimization will be beneficial to jemalloc users.

[clang][Interp] Support __builtin_vectorelements

58b49ce

[MLIR][OpenMP] Make omp.distribute into a loop wrapper (llvm#87239)

70fe6ad

This patch updates the definition of `omp.distribute` to enforce the restrictions of a wrapper operation.

[Flang][OpenMP] NFC: Remove old reduction lowering code (llvm#88798)

76782e2

The old code was replaced by llvm#80019.

[AutoBump] Merge with 76782e2

03c95c3

cferry-AMD requested a review from mgehre-amd August 20, 2024 13:09

mgehre-amd approved these changes Aug 20, 2024

View reviewed changes

mgehre-amd mentioned this pull request Aug 20, 2024

Merge with fixes of 307f49f5 (needs LLVM bump, has downstream PR) (14) Xilinx/torch-mlir#240

Merged

mgehre-amd changed the title ~~[AutoBump] Merge with 76782e28 (31)~~ [AutoBump] Merge with 76782e28 (needs torch-mlir bump) (31) Aug 20, 2024

mgehre-amd changed the title ~~[AutoBump] Merge with 76782e28 (needs torch-mlir bump) (31)~~ [AutoBump] Merge with 76782e28 (needs torch-mlir bump, has downstream PR) (31) Aug 20, 2024

Base automatically changed from bump_to_be006372 to feature/fused-ops August 21, 2024 20:25

An error occurred while trying to automatically change base from bump_to_be006372 to feature/fused-ops August 21, 2024 20:25

mgehre-amd merged commit a3c3879 into feature/fused-ops Aug 21, 2024
5 checks passed

mgehre-amd deleted the bump_to_76782e28 branch August 21, 2024 20:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[AutoBump] Merge with 76782e28 (needs torch-mlir bump, has downstream PR) (31) #291

[AutoBump] Merge with 76782e28 (needs torch-mlir bump, has downstream PR) (31) #291

cferry-AMD commented Aug 20, 2024 •

edited by mgehre-amd

Loading

[AutoBump] Merge with 76782e28 (needs torch-mlir bump, has downstream PR) (31) #291

[AutoBump] Merge with 76782e28 (needs torch-mlir bump, has downstream PR) (31) #291

Conversation

cferry-AMD commented Aug 20, 2024 • edited by mgehre-amd Loading

cferry-AMD commented Aug 20, 2024 •

edited by mgehre-amd

Loading