Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AutoBump] Merge with 76782e28 (needs torch-mlir bump, has downstream PR) (31) #291

Merged
merged 712 commits into from
Aug 21, 2024

Conversation

cferry-AMD
Copy link
Collaborator

@cferry-AMD cferry-AMD commented Aug 20, 2024

This goes up to April 16.
Needs Xilinx/torch-mlir#240

erichkeane and others added 30 commits April 12, 2024 14:13
Like with the 'default' clause, this is being applied to only Compute
Constructs for now. The 'if' clause takes a condition expression which
is used as a runtime value.

This is not a particularly complex semantic implementation, as there
isn't much to this clause, other than its interactions with 'self',
  which will be managed in the patch to implement that.
Fix for llvm#88451

Do not perform semantic check about data transfer on assignment
statement in device context.
Closes llvm#88066.

Compared to before, the function names in the stdbit table are sorted by
function name, not order-of-appearance in the standard. Since macros
aren't printed by docgen.py and are still a TODO in the code, they are
also not printed in the new stdbit.h docs.

Adds some checks to docgen.py for conditions that tripped me up.

Add code to docgen.py to add the include of the `|check|` rewriter,
since all other generated files need it.
Leftover from a previous commit, this ends up not being used, so remove
it.
While working on a followup patch, it became clear that this extra bit
of 'OpenACC' before each clause name was redundant with the visitors, so
remove it to make this a little less verbose.
…esystem` to have it use cached `status` (llvm#88152)

As-is, calls to `exists()` fallback on the implementation in
`ProxyFileSystem::exists` which explicitly calls out to the underlying
`FS`, which for the `DependencyScanningFilesystem` (overlay) is the real
underlying filesystem.

Instead, directly overloading `exists` allows us to have it rely on the
cached `status` behavior used elsewhere by the
`DependencyScanningFilesystem`.
Looks like I forgot to do build CIndex.cpp when validating myself!
Running the test-release.sh script with PGO enabled causes build errors
like:

ld.lld: error: Function Import: link error: linking module flags
'ProfileSummary': IDs have conflicting values

I believe this a build system bug due to the PGO profile data being
generated unconditionally. If you run `ninja check-all` and then `ninja
install` like we do in test-release.sh, then the profile data is
regenerated during `ninja install` and some of the clang tools which are
not test dependencies get build during the ninja install step with
different profile data. When these tools link against the LLVM
libraries, like libSupport, we end up with these errors.
…llvm#87533)

InstCombine transforms add of 0 to or of 0. For system atomics, this is
problematic because while PCIe supports add, it does not support the
other operations. Undo this for system scope atomics.
The malloc->calloc fold creates a new MemoryAccess, which may end of at
the same address as a previously deleted access inside SkipStores.

To the most part, this is not a problem, because SkipStores is normally
only used together with MemDefs. Neither the old malloc access nor the
new calloc access will be part of MemDefs, so there is no problem here.

However, SkipStores is also used in one more place: In the main DSE
loop, ToCheck entries are checked against it. Fix this by not using
SkipStores here, and instead using a separate set to track deletions
inside this loop. This way it is not affected by the calloc optimization
that happens outside it.

This is all pretty ugly, but I haven't found another good way to fix it.
Suggestions welcome.

No test case as I don't have a reliable DSE-only test-case for this.

Fixes llvm#84458.
Otherwise the test would fail on Darwin and other platforms that use
Itanium ABI but do not support alias/ifunc.
… NFC

After inlining, `scanSection` is significantly longer (more than 100+
instructions on x86-64 built with Clang) when `i` does not always
increment by one (MIPS).
Enable frame pointer optimization by default to match it with gcc.

Fixes: llvm#75013
This patch moves CUDA-related `Sema` function into new `SemaCUDA` class,
following the recent example of SYCL, OpenACC, and HLSL. This is a part
of the effort to split Sema. Additional context can be found in
llvm#82217,
llvm#84184,
llvm#87634.
This patch implements intrinsic that supports
`std::is_pointer_interconvertible_base_of` type trait from
[P0466R5](https://wg21.link/p0466r5) "Layout-compatibility and
Pointer-interconvertibility Traits".

Normative wording:
> Comment: If `Base` and Derived are non-union class types and are not
(possibly _cv_-qualified) versions of the same type, `Derived` is a
complete type.
> Condition: `Derived` is unambiguously derived from `Base` without
regard to _cv_-qualifiers, and each object of type `Derived` is
pointer-interconvertible (6.7.2 [basic.compound]) with its `Base`
subobject, or `Base` and `Derived` are not unions and name the same
class type without regard to _cv_-qualifiers.

The paper also express the following intent:
> Note that `is_pointer_interconvertible_base_of_v<T,T>` is always true
under this wording, even though `T` is not derived from itself.

I find the treatment of unions in the wording contradictory to this
intent, and I'm not able to find anything relevant in minutes or on the
reflector. That said, this patch implements what the wording says, since
it's very explicit about unions.
This patch is breaking `dr` to `cwg` equivalence in our terminology, making room for tests for LWG issues that concern compiler intrinsics.
…tem modules (llvm#88432)

Co-authored-by: Alex Lorenz <arphaman@gmail.com>
Some tests in `dr6xx.cpp` were using C11 `_Static_assert`, and were expecting extension warnings in C++98 mode because of that. This is noise, and we can do better than that.
Coding my way home: 4.42841S, 102.96190W
1. Move expansion to combineFP_EXTEND to help with small vectors;
2. Combine FP_ROUND to reduce fptrunc then fpextend after promotion;
This patch uses `getConstantRangeAtUse` to infer nsw/nuw flags with
at-use info. It will enables more optimizations in InstCombine.

Compile-time impact:
http://llvm-compile-time-tracker.com/compare.php?from=a5ed14bc8e122fa5ac0aa81f8d8390931bd6b4e4&to=a83d3402b663439b91cb37a046fb7ac0220ba5c7&stat=instructions%3Au

Related issue: llvm#87854
This patch enables vectorization for non-power-of-2 VFs. Initially only
VFs where adding 1 makes the VF a power-of-2, i.e. we can still make
relatively effective use of the vectors.

It relies on the existing target cost-models to return accurate costs
for
non-power-of-2 vectors. I checked mostly AArch64 and X86 and
there the costs seem reasonable for the costs I checked, although
I expect there will be a need to refine both the cost-models and
lowering
to make most effective use of non-power-of-2 SLP vectorization.

Note that re-ordering and shuffling is not implemented for nodes
requiring padding yet to keep the initial implementation simpler.

The feature is guarded by a new flag, off by defaul for now.

PR: llvm#77790
The overhead of taking a std::mutex is much lower than taking a reader
lock on a shared mutex, even under heavy contention.

The benefit of shared_mutex only occurs as the amount of
time spent in the critical sections grows large enough.

In our case all we do is read a pointer and return the lock.
As a result, using a shared lock can be ~50%-100% slower

Here are the results for the provided benchmark on my machine:

```
2024-04-07T12:48:51-04:00
Running ./libcxx/benchmarks/shared_mutex_vs_mutex.libcxx.out
Run on (12 X 400 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 1024 KiB (x6)
  L3 Unified 32768 KiB (x1)
Load Average: 2.70, 2.70, 1.63
---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
BM_shared_mutex/threads:1        13.9 ns         13.9 ns     50533700
BM_shared_mutex/threads:2        34.5 ns         68.9 ns      9957784
BM_shared_mutex/threads:4        38.4 ns          137 ns      4987772
BM_shared_mutex/threads:8        51.1 ns          358 ns      1974160
BM_shared_mutex/threads:32       57.1 ns          682 ns      1043648
BM_mutex/threads:1               5.54 ns         5.53 ns    125867422
BM_mutex/threads:2               15.5 ns         30.9 ns     21830116
BM_mutex/threads:4               15.4 ns         57.2 ns     12136920
BM_mutex/threads:8               19.3 ns          140 ns      4997080
BM_mutex/threads:32              20.8 ns          252 ns      2859808
```
mstorsjo and others added 21 commits April 16, 2024 10:37
…issing (llvm#88573)

When then linker creates runtime pseudo relocations, it places them in a
list with the assumption that the runtime will fix these relocations
later, when the image gets loaded. If the relevant runtime function
doesn't seem to be present in the linked image, error out.

Normally when linking the mingw-w64 runtime libraries, this function
always is available. However, if linking without including the mingw-w64
CRT startup files, and the image needs runtime pseudo relocations, make
it clear that this won't work as expected at runtime.

With ld.bfd, this situation is a hard error too; ld.bfd adds an
undefined reference to this symbol if runtime pseudo relocations are
needed.

A later alternative would be to actually try to pull in the symbol (if
seen in a static library, but not included yet). This would allow
decoupling the function from the main mingw-w64 CRT startup code (making
it optional, only running if the linker actually produced runtime pseudo
relocations).

Doing that would require restructuring the lld code (gathering pseudo
relocations earlier, then loading the relocator function, then pulling in
more object files to satisfy the dependencies of the relocator) though.

Also, ld.bfd doesn't currently successfully pull in more object files to
satisfy the dependency on _pei386_runtime_relocator, so with that in
mind, there's not much extra value in making LLD do it currently either;
we can't make such a change in mingw-w64's CRT until both linkers
handle it.

This fixes one issue brought up in
llvm#84424.
We rename `TuneNoStripWSuffix` to `TunePreferWInst`.

If all the users of an instruction just use the low 32 bits, we can
convert it to its W variant.

A quick test on Coremark (`-O3 -march=rv64gc`):

|        | W instructions | code size(.text) |
|--------|----------------|------------------|
| before | 302            | 12257            |
| after  | 343            | 12265            |
|        | +13.58%        | +0.065%          |

Reviewers: asb, dtcxzyw, preames, lukel97, michaelmaitland, topperc

Reviewed By: topperc, dtcxzyw

Pull Request: llvm#87237
Close llvm#87609

We tried to profile the body of the lambda expressions in
https://reviews.llvm.org/D153957. But as the original comments show,
it is indeed dangerous. After we tried to skip calculating the ODR
hash values recently, we have fall into this trap twice.

So in this patch, I choose to not profile the body of the lambda
expression. The signature of the lambda is still profiled.
This patch changes the preprocessor directives surrounding
getCurrentTID, particularly moving it out of the block that is only
defined when not building for Android. The getCurrentTID function is
called in places that only require Linux definitions, so this function
should have the same preprocessor scoping around it to prevent link time
failures.
This PR includes:
* vadd.vv/vand.vv/vor.vv/vxor.vv
* vmseq.vv/vmsne.vv
* vmin.vv/vminu.vv/vmax.vv/vmaxu.vv
* vmul.vv/vmulh.vv/vmulhu.vv
* vwadd.vv/vwaddu.vv
* vwmul.vv/vwmulu
* vwmacc.vv/vwmaccu.vv
* vadc.vvm

There is no test change, I may add it later.

Fixes part of llvm#64422

Reviewers: michaelmaitland, preames, lukel97, topperc, asb

Reviewed By: topperc, lukel97

Pull Request: llvm#88379
Functions inside __cpu_traits were needlessly prefixed with __parallel,
which doesn't serve a real purpose anymore now that they are inside a
traits class.
https://bugs.llvm.org/show_bug.cgi?id=51559
llvm#50901

IPSCCP pass removes the global variable and does not create a constant
expression for the initializer value.

Extend test coverage to include:
- half, bfloat types.
- checks for undef (int32 and ptr).

There is no support for:
- fp128, x86_fp80, ppc_fp128  types.
llvm#88102
This commit ensures that the `CallDescription`s in `MallocChecker` are
matched with the mode `CDM::CLibrary`, so:
- they don't match methods or functions within user-defined namespaces;
- they also match builtin variants of these functions (if any), so the
checker can model `__builtin_alloca()` like `alloca()`.

This change fixes llvm#81597. New
tests were added to verify that `std::malloc` and `std::free` (from
`<cstdlib>`) are modeled, but a method that's named e.g. `free` isn't
confused with the memory release function.

The responsibility for modeling `__builtin_alloca` and
`__builtin_alloca_with_align` was moved from `BuiltinFunctionChecker` to
`MallocChecker`, to avoid buggy interactions between the checkers and
ensure that the builtin and non-builtin variants are handled by exactly
the same logic.

This change might be a step backwards for the users who don't have
`unix.Malloc` enabled; but I suspect that `__builtin_alloca()` is so
rare that it would be a waste of time to implement backwards
compatibility for them.

There were several test files that relied on `__builtin_alloca()` calls
to get an `AllocaRegion`, these were modified to enable `unix.Malloc`.
One of these files (cxx-uninitialized-object-ptr-ref.cpp) had some tests
that relied on the fact that `malloc()` was treated as a "black box" in
them, these were updated to use `calloc()` (to get initialized memory)
and `free()` (to avoid memory leak reports).

While I was developing this change, I found a very suspicious assert in
`MallocChecker`. As it isn't blocking the goals of this commit, I just
marked it with a FIXME, but I'll try to investigate and fix it in a
follow-up change.
llvm#87980)

This commit generalizes and cleans up the `ValueBoundsConstraintSet`
API. The API used to provide function overloads for comparing/computing
bounds of:
- index-typed SSA value
- dimension of shaped value
- affine map + operands

This commit removes all overloads. There is now a single entry point for
each `compare` variant and each `computeBound` variant. These functions
now take a `Variable`, which is internally represented as an affine map
and map operands.

This commit also adds support for computing bounds for an affine map +
operands. There was previously no public API for that.
…m#88751)

Fixes llvm#88478

Promoting the `EHCleanup` to `NormalAndEHCleanup` in `EmitCallArgs`
surfaced another bug with deactivation of normal cleanups. Here we
missed emitting CPP scope ends for deactivated normal cleanups. This
patch also fixes that bug.

We missed emitting CPP scope ends because we remove the `fallthrough`
(clears the insertion point) before deactivating normal cleanups. This
is to make the emitted "normal" cleanup code unreachable. But we still
need to emit CPP scope ends in the original basic block even for a
deactivated normal cleanup.
(This worked correctly before we did not remove `fallthrough` for
`EHCleanup`s).
We were calling classfiyPrim() instead of classify().
This gives us one extra SGPR to play with. The comment suggested that it
could cause bugs, but I have tested it with Vulkan CTS with the default
wave size for compute shaders set to 32 and did not find any problems.
…Y) & NegPow2C` (llvm#88859)

Alive2: https://alive2.llvm.org/ce/z/NFtkSX

This optimization will be beneficial to jemalloc users.
llvm-project/mlir/test/lib/Dialect/Test/TestDialect.cpp:597:31:
error: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int64_t' (aka 'long') [-Werror,-Wsign-compare]
  if (getVarOperands().size() != expectedNumOperands)
      ~~~~~~~~~~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~~~~~~~~
1 error generated.
This patch updates the definition of `omp.distribute` to enforce the
restrictions of a wrapper operation.
Re-land llvm#88395

Two build-bots were broken by the old version:
 - https://lab.llvm.org/buildbot/#/builders/285/builds/245
 - https://lab.llvm.org/buildbot/#/builders/21/builds/96988

The problem in both cases was that the compiler did not support
`std::filesystem` (which I use in the unit test).

I have removed the dependency upon std::filesystem because there isn't
an easy way to add the right linker options so that this is supported
correctly in all build environments [1]

[1] https://gitlab.kitware.com/cmake/cmake/-/issues/17834

---

This is a GNU extension:
https://gcc.gnu.org/onlinedocs/gfortran/ACCESS.html

Used in SALMON:
https://salmon-tddft.jp/download.html

Unfortunately the intrinsic takes a file path to operate on so there
isn't an easy way to make the test robust. The unit test expects to be
able to create, set read write and execute permissions, and delete files
called
std::filesystem::temp_directory_path() / <test_name>.<pid>

The test will fail if a file already exists with that name.

I have not implemented the intrinsic on Windows because this is wrapping
a POSIX system call and Windows doesn't support all of the permission
bits tested by the intrinsic. I don't have a Windows machine easily
available to check if Gfortran implements this intrinsic on Windows.
@mgehre-amd mgehre-amd changed the title [AutoBump] Merge with 76782e28 (31) [AutoBump] Merge with 76782e28 (needs torch-mlir bump) (31) Aug 20, 2024
@mgehre-amd mgehre-amd changed the title [AutoBump] Merge with 76782e28 (needs torch-mlir bump) (31) [AutoBump] Merge with 76782e28 (needs torch-mlir bump, has downstream PR) (31) Aug 20, 2024
Base automatically changed from bump_to_be006372 to feature/fused-ops August 21, 2024 20:25
An error occurred while trying to automatically change base from bump_to_be006372 to feature/fused-ops August 21, 2024 20:25
@mgehre-amd mgehre-amd merged commit a3c3879 into feature/fused-ops Aug 21, 2024
5 checks passed
@mgehre-amd mgehre-amd deleted the bump_to_76782e28 branch August 21, 2024 20:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.