-
Notifications
You must be signed in to change notification settings - Fork 14.3k
[Passes] Move LoopInterchange into optimization pipeline #145503
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
As mentioned in llvm#145071, LoopInterchange should be part of the optimization pipeline rather than the simplification pipeline. This patch moves LoopInterchange into the optimization pipeline. More contexts: - By default, LoopInterchange attempts to improve data locality, however, it also takes vectorization opportunities into account. Given that, it is reasonable to run it as close to vectorization as possible. - I looked into previous changes related to the placement of LoopInterchange, but couldn’t find any strong motivation suggesting that it benefits other simplifications. - As far as I tried some tests (including llvm-test-suite), removing LoopInterchange from the simplification pipeline does not affect other simplifications. Therefore, there doesn't seem to be much value in keeping it there.
if (PTO.LoopInterchange) | ||
LPM.addPass(LoopInterchangePass()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure where LoopInterchange should be placed within the optimization pipeline...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, would be interesting to get some more data about the impact on different positions, but current palcement seems reasonable to me at least
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, collecting diverse data would be ideal, but since the current LoopInterchange is rarely triggered in practice, I haven’t been able to gather enough data to have a meaningful discussion about the optimal placement...
TSVC s235 might be an interesting example. If LoopDistribute were extended to support non-innermost loops and it ran before LoopInterchange, the following could become possible:
Original:
for (int nl = 0; nl < 200*(ntimes/LEN2); nl++)
for (int i = 0; i < LEN2; i++) {
a[i] += b[i] * c[i];
for (int j = 1; j < LEN2; j++) {
aa[j][i] = aa[j-1][i] + bb[j][i] * a[i];
}
}
Distributed:
for (int nl = 0; nl < 200*(ntimes/LEN2); nl++)
for (int i = 0; i < LEN2; i++)
a[i] += b[i] * c[i];
for (int nl = 0; nl < 200*(ntimes/LEN2); nl++)
for (int i = 0; i < LEN2; i++)
for (int j = 1; j < LEN2; j++)
aa[j][i] = aa[j-1][i] + bb[j][i] * a[i];
Interchanged:
for (int nl = 0; nl < 200*(ntimes/LEN2); nl++)
for (int i = 0; i < LEN2; i++)
a[i] += b[i] * c[i];
// The i-loop and j-loop are interchanged.
for (int nl = 0; nl < 200*(ntimes/LEN2); nl++)
for (int j = 1; j < LEN2; j++)
for (int i = 0; i < LEN2; i++)
aa[j][i] = aa[j-1][i] + bb[j][i] * a[i];
IIRC, when I ran the last code in the past, it was about 8x faster than the original. But achieving this would require a lot of work on LoopDistribute...
if (PTO.LoopInterchange) | ||
LPM2.addPass(LoopInterchangePass()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be reasonable to keep this, at least for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense in general but do you have any data on how this change impacts the # of loops interchanged, and the impact on perf for the cases where it helps?
This should also at least have a test checking the position of LoopInterchange in the pipline and ideally a phase-ordering tests that shows an improvement
As tested with the llvm-test-suite, this patch reduces the number of interchanged loops by one. The omitted transformation corresponds to the following loop in for (i = 0; i < 32; i++)
for (j = 0; j < 32; j++)
for (k = 0; k < 2; k++)
fill_in_key_entry(&g_keyinfo[k][i][j], n_rows, n_cols); This change is due to the ordering between full unrolling and loop interchange. In the original pipeline, the k-loop is lifted to the outermost position because doing so improves data locality. On the other hand, with this patch, the k-loop is fully unrolled first, preventing the interchange from being applied. While it's unclear to me which behavior is preferable in general, in this case, performance remained largely unaffected. I think LoopInterchange should not be in the simplification pipeline because loop versioning may be introduced inside the LoopInterchange in the future (ref: #123436), which is better suited to the optimization pipeline. Additionally, if we're going to change its position, I believe we should do it before #124911, which is the primary motivation behind submitting this patch now. Or should we wait until it actually becomes a problem?
Exactly, I'll add it. Just to confirm, are you referring to a test like https://github.com/llvm/llvm-project/blob/559218f7ee78f77d32c14cd14a56b799167596f5/llvm/test/Other/new-pass-manager.ll ?
I agree with it, but performance improvement is not a purpose of this patch, and I don't have a good example for it... |
@llvm/pr-subscribers-llvm-transforms Author: Ryotaro Kasuga (kasuga-fj) ChangesAs mentioned in #145071, LoopInterchange should be part of the optimization pipeline rather than the simplification pipeline. This patch moves LoopInterchange into the optimization pipeline. More contexts:
Full diff: https://github.com/llvm/llvm-project/pull/145503.diff 2 Files Affected:
diff --git a/llvm/lib/Passes/PassBuilderPipelines.cpp b/llvm/lib/Passes/PassBuilderPipelines.cpp
index c83d2dc1f1514..98821bb1408a7 100644
--- a/llvm/lib/Passes/PassBuilderPipelines.cpp
+++ b/llvm/lib/Passes/PassBuilderPipelines.cpp
@@ -690,9 +690,6 @@ PassBuilder::buildFunctionSimplificationPipeline(OptimizationLevel Level,
LPM2.addPass(LoopDeletionPass());
- if (PTO.LoopInterchange)
- LPM2.addPass(LoopInterchangePass());
-
// Do not enable unrolling in PreLinkThinLTO phase during sample PGO
// because it changes IR to makes profile annotation in back compile
// inaccurate. The normal unroller doesn't pay attention to forced full unroll
@@ -1547,6 +1544,10 @@ PassBuilder::buildModuleOptimizationPipeline(OptimizationLevel Level,
// this may need to be revisited once we run GVN before loop deletion
// in the simplification pipeline.
LPM.addPass(LoopDeletionPass());
+
+ if (PTO.LoopInterchange)
+ LPM.addPass(LoopInterchangePass());
+
OptimizePM.addPass(createFunctionToLoopPassAdaptor(
std::move(LPM), /*UseMemorySSA=*/false, /*UseBlockFrequencyInfo=*/false));
diff --git a/llvm/test/Transforms/LoopInterchange/position-in-pipeline.ll b/llvm/test/Transforms/LoopInterchange/position-in-pipeline.ll
new file mode 100644
index 0000000000000..d0b6447d92fe7
--- /dev/null
+++ b/llvm/test/Transforms/LoopInterchange/position-in-pipeline.ll
@@ -0,0 +1,48 @@
+; RUN: opt -passes='default<O3>' -enable-loopinterchange -disable-output \
+; RUN: -disable-verify -verify-analysis-invalidation=0 \
+; RUN: -debug-pass-manager=quiet %s 2>&1 | FileCheck %s
+
+; Test the position of LoopInterchange in the pass pipeline.
+
+; CHECK-NOT: Running pass: LoopInterchangePass
+; CHECK: Running pass: ControlHeightReductionPass
+; CHECK-NEXT: Running pass: LoopSimplifyPass
+; CHECK-NEXT: Running pass: LCSSAPass
+; CHECK-NEXT: Running pass: LoopRotatePass
+; CHECK-NEXT: Running pass: LoopDeletionPass
+; CHECK-NEXT: Running pass: LoopRotatePass
+; CHECK-NEXT: Running pass: LoopDeletionPass
+; CHECK-NEXT: Running pass: LoopInterchangePass
+; CHECK-NEXT: Running pass: LoopDistributePass
+; CHECK-NEXT: Running pass: InjectTLIMappings
+; CHECK-NEXT: Running pass: LoopVectorizePass
+
+
+define void @foo(ptr %a, i32 %n) {
+entry:
+ br label %for.i.header
+
+for.i.header:
+ %i = phi i32 [ 0, %entry ], [ %i.next, %for.i.latch ]
+ br label %for.j
+
+for.j:
+ %j = phi i32 [ 0, %for.i.header ], [ %j.next, %for.j ]
+ %tmp = mul i32 %i, %n
+ %offset = add i32 %tmp, %j
+ %idx = getelementptr inbounds i32, ptr %a, i32 %offset
+ %load = load i32, ptr %idx, align 4
+ %inc = add i32 %load, 1
+ store i32 %inc, ptr %idx, align 4
+ %j.next = add i32 %j, 1
+ %j.exit = icmp eq i32 %j.next, %n
+ br i1 %j.exit, label %for.i.latch, label %for.j
+
+for.i.latch:
+ %i.next = add i32 %i, 1
+ %i.exit = icmp eq i32 %i.next, %n
+ br i1 %i.exit, label %for.i.header, label %exit
+
+exit:
+ ret void
+}
|
; RUN: opt -passes='default<O3>' -enable-loopinterchange -disable-output \ | ||
; RUN: -disable-verify -verify-analysis-invalidation=0 \ | ||
; RUN: -debug-pass-manager=quiet %s 2>&1 | FileCheck %s | ||
|
||
; Test the position of LoopInterchange in the pass pipeline. | ||
|
||
; CHECK-NOT: Running pass: LoopInterchangePass | ||
; CHECK: Running pass: ControlHeightReductionPass | ||
; CHECK-NEXT: Running pass: LoopSimplifyPass | ||
; CHECK-NEXT: Running pass: LCSSAPass | ||
; CHECK-NEXT: Running pass: LoopRotatePass | ||
; CHECK-NEXT: Running pass: LoopDeletionPass | ||
; CHECK-NEXT: Running pass: LoopRotatePass | ||
; CHECK-NEXT: Running pass: LoopDeletionPass | ||
; CHECK-NEXT: Running pass: LoopInterchangePass | ||
; CHECK-NEXT: Running pass: LoopDistributePass | ||
; CHECK-NEXT: Running pass: InjectTLIMappings | ||
; CHECK-NEXT: Running pass: LoopVectorizePass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please let me know if there is a better way...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In light of enabling by default, would be good to check the compile time impact for the new position, which may have less overhead?
if (PTO.LoopInterchange) | ||
LPM.addPass(LoopInterchangePass()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, would be interesting to get some more data about the impact on different positions, but current palcement seems reasonable to me at least
I believe it reduces overhead, but I didn't actually measure it. Anyway, I’m planning to work on reducing the compile time of LoopInterchange, so I will take the impact then as well. |
As mentioned in #145071, LoopInterchange should be part of the optimization pipeline rather than the simplification pipeline. This patch moves LoopInterchange into the optimization pipeline.
More contexts: