[flang][OpenMP] Implement more robust loop-nest detection logic #127
base: amd-trunk-dev
Conversation
Can you please alter the test to not rely on debug statements for failure/success determination.
It is not ideal indeed. Do you have any suggestions on how to do this differently?
I thought about doing this as a unit test, but I think setting up the test and later reading it would be more complicated than a lit test that clearly shows what the loops look like and what the outcome should be.
We can add a special flag to print loop info to `llvm::outs()`. But I am not sure this is worth it, tbh.
Don't like relying on debug output either. It randomly interleaves with stdout and only works in builds with assertions. Additionally, it makes the test dependent on how often/in which order the `isPerfectlyNested` function is called internally, making it sensitive to even NFC patches. But I have seen this in LLVM and Clang often enough to consider it an established pattern there, although not for MLIR/Flang.
If you do this, you still must:
- Not redirect/pipe stdout and stderr at the same time: either guarantee that only one of them is used, or use `2>` instead of `&>`.
- Add `REQUIRES: asserts`, or it will fail in release builds.
In MLIR I indeed usually see an internal option that enables additional printing or prints debug counters, e.g. `-test-print-shape-mapping`; or a pass that just prints the analysis result, e.g. `-pass-pipeline=...test-print-dominance`; or a test of the diagnostic from `-Rpass` output.
Since it is a transformation pass, one would typically (also) test whether the output is what was expected.
Thanks for the info @Meinersbur, really useful.
I used `2>` and `REQUIRES: asserts`.

> Since it is a transformation pass, one would typically (also) test whether the output is what was expected.

All other `do-concurrent-conversion` tests verify the output. I wanted this one to test only one thing, which is loop-nest detection. My reasoning is that this test isolates this particular part of the pass so that we can debug issues in nest detection more easily.
Ping! Please 🙏 take a look when you have time.
The algorithm checks whether only "expected" instructions are present in the in-between code, but I think it is more relevant whether they have side effects. That's because:
- Any operation that does not have side effects can just be sunk into the inner loop or hoisted outside the outer loop, no matter whether it is used to compute the inner loop bounds or not. If it is invariant, an optimization can hoist it out again. I think this is the relevant property: "able to move all the code away" (sketched below).
- Ops might be needed to compute the inner loop's bounds but have side effects, e.g. a function call that accesses a global variable.
I do not understand the argument about mem alloc. When does this happen? Shouldn't mem2reg have removed such allocations?
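For illustration, a minimal sketch of the side-effect-based criterion suggested above (not code from this PR; it assumes MLIR's `isMemoryEffectFree` helper from `SideEffectInterfaces.h`, and the helper name `canBeMovedAway` is hypothetical):

```cpp
#include "mlir/IR/Operation.h"
#include "mlir/Interfaces/SideEffectInterfaces.h"

// An op sitting between the two loop headers does not break perfect nesting
// if it is free of memory effects: it can be sunk into the inner loop or
// hoisted out of the outer loop regardless of what it computes.
static bool canBeMovedAway(mlir::Operation *op) {
  return mlir::isMemoryEffectFree(op);
}
```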
```diff
@@ -36,7 +36,8 @@ namespace fir {
 #include "flang/Optimizer/Transforms/Passes.h.inc"
 } // namespace fir

-#define DEBUG_TYPE "fopenmp-do-concurrent-conversion"
+#define DEBUG_TYPE "do-concurrent-conversion"
+#define DBGS() (llvm::dbgs() << "[" DEBUG_TYPE << "]: ")
```
The canonical form is `LLVM_DEBUG(dbgs() << "text")`. I don't think introducing new patterns for a single file is a good idea. `DBGS` may easily conflict with something else, as was the case with `DEBUG`, which was eventually renamed to `LLVM_DEBUG`. Getting only a specific debug type can be done via the command line: `-mllvm -debug-only=do-concurrent-conversion`.
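For reference, a minimal sketch of that canonical pattern (the function name and message are illustrative, not taken from the PR):

```cpp
#include "llvm/Support/Debug.h"

#define DEBUG_TYPE "do-concurrent-conversion"

static void reportNesting(bool isPerfect) {
  // Printed only in builds with assertions, and only when
  // `-mllvm -debug-only=do-concurrent-conversion` is passed.
  LLVM_DEBUG(llvm::dbgs() << "[" DEBUG_TYPE "] perfectly nested: "
                          << (isPerfect ? "yes" : "no") << "\n");
}
```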
This same pattern is used in a lot of places in MLIR (38 existing uses). For example: `Dialect/Transform/IR/TransformOps.cpp`, `Dialect/Linalg/Transforms/Vectorization.cpp`, `Dialect/Linalg/Transforms/Transforms.cpp`, ...
Mmmh, OK. I don't think it is a good idea but apparently some MLIR folks do.
```
@@ -0,0 +1,77 @@
! Tests loop-nest detection algorithm for do-concurrent mapping.

! RUN: %flang_fc1 -emit-hlfir -fopenmp -fdo-concurrent-parallel=host \
```
`do-concurrent-conversion` is an MLIR-to-MLIR pass. Those tests usually only contain the input MLIR of the pass so we don't test more than necessary.
I did this to show the loops at the Fortran source level. It just makes it easy to correlate which loop nests we detect as perfectly nested.
If you don't think this is a good enough argument to have the test at the Fortran level, I will replace it with MLIR instead.
This happens in cases like the following (see the complete sample here):

```fortran
do concurrent(i=1:n, j=1:bar(n*m, n/m))
  a(i) = n
end do
```

If you look at the IR, you will see:

```mlir
fir.do_loop %arg1 = %42 to %44 step %c1 unordered {
  ...
  %53:3 = hlfir.associate %49 {adapt.valuebyref} : (i32) -> (!fir.ref<i32>, !fir.ref<i32>, i1)
  %54:3 = hlfir.associate %52 {adapt.valuebyref} : (i32) -> (!fir.ref<i32>, !fir.ref<i32>, i1)
  %55 = fir.call @_QFPbar(%53#1, %54#1) fastmath<contract> : (!fir.ref<i32>, !fir.ref<i32>) -> i32
  hlfir.end_associate %53#1, %53#2 : !fir.ref<i32>, i1
  hlfir.end_associate %54#1, %54#2 : !fir.ref<i32>, i1
  %56 = fir.convert %55 : (i32) -> index
  ...
  fir.do_loop %arg2 = %46 to %56 step %c1_4 unordered {
    ...
  }
}
```

The problem here are the `hlfir.end_associate` ops. Even though the "effectively 2" loops are perfectly nested, we have these `hlfir.end_associate` ops that are not part of the slice responsible for computing the upper bound (in this case) of the inner loop, even though in practice they exist only for the purpose of that computation.
These are very good points that I did not consider, to be honest. But my conclusion from what you mentioned about side effects is that flang potentially emits wrong IR! If you take the same sample mentioned in the previous comment again, shouldn't flang have emitted:

```mlir
%55 = fir.call @_QFPbar(%53#1, %54#1) fastmath<contract> : (!fir.ref<i32>, !fir.ref<i32>) -> i32
```

before the outermost loop (the `i` loop)?
How would that happen? From a Fortran perspective there should not be any side effects in the code inside the `do concurrent` construct.
flang is potentially emitting wrong IR atm (as I mentioned, but for different reasons, in my previous reply). Nothing is preventing `bar` from having side effects.
Just as a follow-up, if you modify the loop I posted earlier to be:

```fortran
do concurrent(i=1:n, j=1:bar(n*m, n/m))
  a(i) = bar(n,m)
end do
```

you get a diagnostic only for the call in the loop body. So, side effects are checked only for the body of the loop and not for the bounds calculations (which are still emitted inside the loop body).
From the spec, procedures referenced within a DO CONCURRENT construct must be pure, which is expected; but what I cannot put my hands on is what constitutes "within a DO CONCURRENT". Does it include the bounds expressions in the loop header, or only the body?
> I do not understand the argument about mem alloc. When does this happen? Shouldn't mem2reg have removed such allocations?

> The problem here are the `hlfir.end_associate` ops. Even though the "effectively 2" loops are perfectly nested, we have these `hlfir.end_associate` ops that are not part of the slice responsible for computing the upper bound (in this case) of the inner loop, even though in practice they exist only for the purpose of that computation.
Thanks for the explanation. I wouldn't consider the loops perfectly nested, though. Calling `bar` can have arbitrary side effects, like accessing and incrementing global variables.
For such cases OpenMP specifies that it is undefined when/how often side effects of upper/lower bound expressions are evaluated, but this does not apply here, so we cannot optimize based on that. Even if it did, the produced HLFIR seems indistinguishable from:

```fortran
do concurrent(i=1:n)
  ub = bar(n*m, n/m)
  do concurrent(j=1:ub)
    a(i) = n
  end do
end do
```

which (obviously?) is not perfectly nested. I think the frontend should be changed here: the call to `bar` should be emitted before the outer loop; it cannot depend on `i` and needs to be evaluated only once. Potentially add a special case for when `n` is zero to not call `bar` at all, if that is an issue.
In the meantime I think it is OK to just not support function calls as lb/ub expressions.
> If you take the same sample mentioned in the previous comment again, shouldn't flang have emitted:
>
> ```mlir
> %55 = fir.call @_QFPbar(%53#1, %54#1) fastmath<contract> : (!fir.ref<i32>, !fir.ref<i32>) -> i32
> ```
>
> before the outermost loop (the `i` loop)?
Wrote all of the above before I continued reading the following discussion... 😒
I think they mean only the body. The header expressions can be evaluated once before entering any concurrent execution, so they should be fine. Unfortunately, flang currently doesn't seem to do so.
Note that this makes the way flang emits the loop violate this:

```fortran
do concurrent(i=1:n)
  ub = bar(n*m, n/m)
  do concurrent(j=1:ub)
    a(i) = n
  end do
end do
```

since the non-pure `bar` is then called inside the outer `do concurrent`.
Update: I had a call with Kiran and Harish, and they will be working on making sure we emit code that is more consistent with the spec.
@Meinersbur @mjklemm Even though we still need to resolve the code-gen issue discussed above (bound computations being emitted inside the loop body), at the same time we now have better detection logic than before. And the algorithm to do so is cleaner and easier to understand.
Ping ping! 🔔 Please take a look when you have time 🙏
The previous loop-nest detection algorithm fell short, in some cases, of detecting whether a pair of `do concurrent` loops are perfectly nested or not. This is a re-implementation using forward and backward slice extraction algorithms to compare the set of ops required to set up the inner loop bounds vs. the set of ops nested in the outer loop other than the nested loop itself.
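To make that concrete, here is a minimal sketch of the slice comparison described above (not the PR's actual implementation; it assumes MLIR's `SliceAnalysis` utilities and `fir::DoLoopOp` accessors, the helper name is illustrative, and exact signatures can differ between LLVM versions):

```cpp
#include "flang/Optimizer/Dialect/FIROps.h"
#include "mlir/Analysis/SliceAnalysis.h"

static bool isPerfectlyNestedSketch(fir::DoLoopOp outer, fir::DoLoopOp inner) {
  // Backward slices: every op needed to set up the inner loop's bounds/step.
  llvm::SetVector<mlir::Operation *> boundsSlice;
  for (mlir::Value v :
       {inner.getLowerBound(), inner.getUpperBound(), inner.getStep()})
    if (mlir::Operation *def = v.getDefiningOp()) {
      boundsSlice.insert(def);
      mlir::getBackwardSlice(def, &boundsSlice);
    }

  // Forward slices: users of those ops, e.g. cleanup ops such as
  // hlfir.end_associate that exist only because of the bound computation.
  llvm::SetVector<mlir::Operation *> cleanupSlice;
  for (mlir::Operation *op : boundsSlice)
    mlir::getForwardSlice(op, &cleanupSlice);

  // Perfectly nested: everything in the outer body, other than the inner
  // loop and the terminator, belongs to one of the two slices.
  for (mlir::Operation &op : outer.getBody()->without_terminator())
    if (&op != inner.getOperation() && !boundsSlice.contains(&op) &&
        !cleanupSlice.contains(&op))
      return false;
  return true;
}
```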
LGTM and sorry for the delay