[LoopInterchange] Enable it by default #124911

sjoerdmeijer · 2025-01-29T11:56:28Z

This is a work in progress patch to enable loop-interchange by default and is a continuation of the RFC:

https://discourse.llvm.org/t/enabling-loop-interchange/82589

Basically, we promised to fix any compile-time and correctness issues in the different components involved here (loop-interchange and dependence analaysis.) before discussing enabling interchange by default. We think are close to complete this; I would like to explain where we are and wanted to check if there are any thoughts or concerns. A quick overview of the correctness and compile-time improvements that we have made include:

Correctness:

Compile-times:

And in terms of remaining work, we think we are very close to fixing these depenence analysis issues:

The compile-time increase with a geomean increase of 0.19% looks good (after committing #124247), I think:

stage1-O3:
Benchmark
kimwitu++        +0.10%
sqlite3          +0.14%
consumer-typeset +0.07%
Bullet           +0.06%
tramp3d-v4       +0.21%
mafft            +0.39%
ClamAVi          +0.06%
lencod           +0.61%
SPASS            +0.17%
7zip             +0.08%
geomean          +0.19%

See also:
http://llvm-compile-time-tracker.com/compare.php?from=19a7fe03b4f58c4f73ea91d5e63bc4c6e61f987b&to=b24f1367d68ee675ea93ecda4939208c6b68ae4b&stat=instructions%3Au

We might want to look into lencod to see if we can improve more, but not sure it is strictly necessary.

@kasuga-fj

This is a work in progress patch to enable loop-interchange by default and is a continuation of the RFC: https://discourse.llvm.org/t/enabling-loop-interchange/82589 Basically, we promised to fix any compile-time and correctness issues in the different components involved here (loop-interchange and dependence analaysis.) before discussing enabling interchange by default. We think are close to complete this; I would like to explain where we are and wanted to check if there are any thoughts or concerns. A quick overview of the correctness and compile-time improvements that we have made include: Correctness: - [LoopInterchange] Remove 'S' Scalar Dependencies (llvm#119345) - [LoopInterchange] Fix overflow in cost calculation (llvm#111807) - [LoopInterchange] Handle LE and GE correctly (PR llvm#124901) @kasuga-fj - [DA] disambiguate evolution of base addresses (llvm#116628) Compile-times: - [LoopInterchange] Constrain number of load/stores in a loop (llvm#118973) - [LoopInterchange] Bail out early if minimum loop nest is not met (llvm#115128) - [LoopInterchange] Hoist isComputableLoopNest() in the control flow (llvm#124247) And in terms of remaining work, we think we are very close to fixing these depenence analysis issues: - [DA] do not handle array accesses of different offsets (llvm#123436) - [DA] Dependence analysis does not handle array accesses of different sizes (llvm#116630) - [DA] use NSW arithmetic llvm#116632 The compile-time increase with a geomean increase of 0.19% looks good (after committing llvm#124247), I think: stage1-O3: Benchmark kimwitu++ +0.10% sqlite3 +0.14% consumer-typeset +0.07% Bullet +0.06% tramp3d-v4 +0.21% mafft +0.39% ClamAVi +0.06% lencod +0.61% SPASS +0.17% 7zip +0.08% geomean +0.19% See also: http://llvm-compile-time-tracker.com/compare.php?from=19a7fe03b4f58c4f73ea91d5e63bc4c6e61f987b&to=b24f1367d68ee675ea93ecda4939208c6b68ae4b&stat=instructions%3Au We might want to look into lencod to see if we can improve more, but not sure it is strictly necessary.

Meinersbur · 2025-01-29T19:03:38Z

Could you provide numbers for performance changes (improvments and regressions), e.g. highlights from llvm-test-suite? Would help illustrating that we also gain something from slighlty increased compile-time.

sjoerdmeijer · 2025-01-30T14:48:48Z

Could you provide numbers for performance changes (improvments and regressions), e.g. highlights from llvm-test-suite? Would help illustrating that we also gain something from slighlty increased compile-time.

Hi @Meinersbur, thanks for the comments, that's a very fair question!

I would like to use two metrics for success here: the number of times loop-interchange triggers in the test-suite, and performance changes, with a higher weight for the former, initially. Let me explain this. We see this as a first enablement step: i.e. we have prioritised correctness and made interchange more conservative, in order to get a first version running by default. Enablement of interchange, loopcache analysis, and data dependence analysis would a big step forward. Even before we made interchange more conservative, interchange was not triggering on our motivating examples. So after this first step, we will follow up with a second optimisation step and lift restrictions to let it trigger on our examples.

The number of times interchange triggers is a good metric for the potential of the pass and how general it is. I have counted 37 occurrences of interchange if external benchmark ffmpeg is included and 22 times without. So that is really good, I think. I will paste the build log with optimisation remarks at the end of this message for more details.

About perf changes: for me a good result is if this shows no performance regressions, and maybe a few minor uplifts here and there. And the reason is what I wrote above, we very much see this as a first enablement step, and know we have to work on making it more optimal. And when I measured the llvm test-suite, this is exactly what I saw: no regression, minor uplifts (in Polybench and its solvers). But that was before the latest interchange patches went in, so I will verify and confirm this with the top of trunk. I would like to add that I have issues with the llvm test-suite for evaluating this kind of codegen changes as they may not be very visible, and I briefly talked about that in my llvm dev conference talk here (sorry for the plug, but it's a link to the exact timestamp). So llvm test-suite performance numbers need to be taken with a pinch of salt, but again, will do confirm.

test-suite-externals/ffmpeg/libavcodec/aacps.c:121:9: remark: Loop interchanged
test-suite-externals/ffmpeg/libavcodec/aacps_tablegen.h:77:9: remark: Loop interchanged
test-suite-externals/ffmpeg/libavcodec/dcaenc.c:971:9: remark: Loop interchanged
test-suite-externals/ffmpeg/libavcodec/dpx.c:700:13: remark: Loop interchanged
test-suite-externals/ffmpeg/libavcodec/dpx.c:665:13: remark: Loop interchanged
test-suite-externals/ffmpeg/libavcodec/ituh263enc.c:786:13: remark: Loop interchanged
test-suite-externals/ffmpeg/libavcodec/mpeg12enc.c:113:9: remark: Loop interchanged
test-suite-externals/ffmpeg/libavcodec/mpeg4videoenc.c:1176:13: remark: Loop interchanged
test-suite-externals/ffmpeg/libavcodec/mpegaudiodec_template.c:268:9: remark: Loop interchanged
test-suite-externals/ffmpeg/libavcodec/proresenc_anatoliy.c:525:9: remark: Loop interchanged
test-suite-externals/ffmpeg/libavcodec/xface.c:291:9: remark: Loop interchanged
test-suite-externals/ffmpeg/libavfilter/af_arnndn.c:1512:9: remark: Loop interchanged
test-suite-externals/ffmpeg/libavfilter/vf_datascope.c:1077:9: remark: Loop interchanged
test-suite-externals/ffmpeg/libavfilter/vf_datascope.c:678:9: remark: Loop interchanged
test-suite-externals/ffmpeg/libavfilter/vf_photosensitivity.c:182:9: remark: Loop interchanged
MultiSource/Applications/JM/ldecod/block.c:935:5: remark: Loop interchanged
MultiSource/Applications/JM/ldecod/block.c:1265:5: remark: Loop interchanged
MultiSource/Applications/JM/ldecod/macroblock.c:2594:5: remark: Loop interchanged
MultiSource/Applications/JM/lencod/block.c:2139:5: remark: Loop interchanged
MultiSource/Applications/JM/lencod/block.c:2860:5: remark: Loop interchanged
MultiSource/Applications/JM/lencod/block.c:3016:5: remark: Loop interchanged
MultiSource/Applications/obsequi/tables.c:224:7: remark: Loop interchanged
MultiSource/Benchmarks/mediabench/mpeg2/mpeg2dec/spatscal.c:298:5: remark: Loop interchanged
MultiSource/Benchmarks/nbench/nbench1.c:283:2: remark: Loop interchanged
MultiSource/Benchmarks/nbench/nbench1.c:1906:1: remark: Loop interchanged
MultiSource/Benchmarks/nbench/nbench1.c:1907:2: remark: Loop interchanged
SingleSource/Benchmarks/Linpack/linpack-pc.c:595:17: remark: Loop interchanged
SingleSource/Benchmarks/Linpack/linpack-pc.c:1204:17: remark: Loop interchanged
SingleSource/Benchmarks/Misc/dt.c:15:5: remark: Loop interchanged
SingleSource/Benchmarks/Polybench/datamining/correlation/correlation.c:103:5: remark: Loop interchanged
SingleSource/Benchmarks/Polybench/datamining/covariance/covariance.c:81:5: remark: Loop interchanged
SingleSource/Benchmarks/Polybench/linear-algebra/solvers/cholesky/cholesky.c:63:5: remark: Loop interchanged
SingleSource/Benchmarks/Polybench/linear-algebra/solvers/cholesky/cholesky.c:64:7: remark: Loop interchanged
SingleSource/Benchmarks/Polybench/linear-algebra/solvers/lu/lu.c:51:5: remark: Loop interchanged
SingleSource/Benchmarks/Polybench/linear-algebra/solvers/lu/lu.c:52:7: remark: Loop interchanged
SingleSource/Benchmarks/Polybench/linear-algebra/solvers/ludcmp/ludcmp.c:62:5: remark: Loop interchanged
SingleSource/Benchmarks/Polybench/linear-algebra/solvers/ludcmp/ludcmp.c:63:7: remark: Loop interchanged

Meinersbur · 2025-01-30T15:57:37Z

llvm/lib/Passes/PassBuilderPipelines.cpp

@@ -201,7 +201,7 @@ static cl::opt<bool> RunNewGVN("enable-newgvn", cl::init(false), cl::Hidden,
                               cl::desc("Run the NewGVN pass"));

 static cl::opt<bool> EnableLoopInterchange(
-    "enable-loopinterchange", cl::init(false), cl::Hidden,
+    "enable-loopinterchange", cl::init(true), cl::Hidden,


Consider to also add switches to the clang driver like -fvectorize/-fno-vectorze and -funroll-loops/fno-unroll-loops. GCC's is called -floop-interchange.

Thanks for the suggestion, agree that this would be convenient to have. I will create a separate patch for this.

mshockwave · 2025-01-30T19:11:57Z

Has #116144 (correctness issue) been addressed?

kasuga-fj · 2025-01-31T07:02:29Z

Has #116144 (correctness issue) been addressed?

It would have been fixed with #119345 . As far as I know, there are no known correctness issues on the LoopInterchange side.

sjoerdmeijer · 2025-01-31T10:40:16Z

Has #116144 (correctness issue) been addressed?

It would have been fixed with #119345 . As far as I know, there are no known correctness issues on the LoopInterchange side.

Correct, should have been fixed.
In addition, I have just closed the following bug reports that had the same underlying issue as that one:

sjoerdmeijer · 2025-01-31T13:47:30Z

Re: perf number, @Meinersbur : I indeed see no regressions and 3 improvements of ~70% that are real, I think. There are a couple of minor improvements, but that could be real or noise, but didn't look further into it. I did 3 runs with interchange disabled ("before") and 3 runs with interchange enabled ("after"), and lower is better:

Program                                                 exec_time
                                                        before1   before2 before3 after1 after2 after3 diff
Polybench/linear-algebra/solvers/cholesky/cholesky.test  11.48     11.34   11.20    6.60   6.58   6.63 74.5%
    Polybench/linear-algebra/solvers/ludcmp/ludcmp.test  23.56     23.57   23.55   14.16  14.13  14.08 67.5%
            Polybench/linear-algebra/solvers/lu/lu.test  23.60     23.48   23.69   14.25  14.17  14.21 67.2%

This introduces options -floop-interchange and -fno-loop-interchange to enable/disable the loop-interchange pass. This is part of the work that tries to get that pass enabled by default (llvm#124911), where it was remarked that a user facing option to control this would be convenient to have. The option (name) is the same as GCC's.

kasuga-fj · 2025-02-06T07:27:05Z

I indeed see no regressions and 3 improvements of ~70% that are real, I think. There are a couple of minor improvements, but that could be real or noise, but didn't look further into it. I did 3 runs with interchange disabled ("before") and 3 runs with interchange enabled ("after"), and lower is better:

I tried these three cases on my local and couldn't reproduce this result. I investigated and found that the loops aren't interchanged in all the three cases. I assume that all the loops that were interchanged in your experiment are the same one that initializes the array B in the init_array function, for example the following one in cholesky.c.

https://github.com/llvm/llvm-test-suite/blob/94ca4917f6b02da34902b65b26240c5729d55bcf/SingleSource/Benchmarks/Polybench/linear-algebra/solvers/cholesky/cholesky.c#L62-L65

This loop has S dependence for the outermost loop so that it used to be able to interchange, but not anymore. My guess is that you are using an older version of the compiler that does not have the patches listed in this PR description. With future improvements, these loops should become interchangeable again. I think we should address this issue after the interchange is enabled by default.

This introduces options -floop-interchange and -fno-loop-interchange to enable/disable the loop-interchange pass. This is part of the work that tries to get that pass enabled by default (llvm#124911), where it was remarked that a user facing option to control this would be convenient to have. The option (name) is the same as GCC's.

This introduces options `-floop-interchange` and `-fno-loop-interchange` to enable/disable the loop-interchange pass. This is part of the work that tries to get that pass enabled by default (#124911), where it was remarked that a user facing option to control this would be convenient to have. The option name is the same as GCC's.

This introduces options `-floop-interchange` and `-fno-loop-interchange` to enable/disable the loop-interchange pass. This is part of the work that tries to get that pass enabled by default (llvm#124911), where it was remarked that a user facing option to control this would be convenient to have. The option name is the same as GCC's.

sjoerdmeijer · 2025-02-24T10:53:28Z

This loop has S dependence for the outermost loop so that it used to be able to interchange, but not anymore. My guess is that you are using an older version of the compiler that does not have the patches listed in this PR description. With future improvements, these loops should become interchangeable again. I think we should address this issue after the interchange is enabled by default.

I've checked and you're right. I used a slightly older build, but with all the correctness issues addressed it indeed doesn't trigger anymore on cholesky, lu and ludcmp. It still triggers on other polybench kernels (correlation and covariance), but due to their extremely low runtime and/or isn't on the hot path, doesn't show improvements. I think this indeed shows the potential of interchange, and is something I would like to address after enabling this by default.

kasuga-fj · 2025-02-25T06:39:10Z

I think this indeed shows the potential of interchange

I agree with you.

sjoerdmeijer · 2025-06-10T13:20:01Z

I would like to change the status from WIP to ready-to-merge as I believe the outstanding issues has been resolved:

The last two DependenceAnalysis bugs and patches that address them have been merged:

We have enabled interchange in the Flang driver and compiler:

[flang] add -floop-interchange and enable it with opt levels #140182

This has been running for a while now, and one bug was found and fixed (thanks @kasuga-fj ):

[LoopInterchange] Handle confused dependence correctly #140709

And another interchange bug was fixed:

[LoopInterchange] Incorrect exchange happens when the Levels of Dependence is less than the depth of the loop nest #140238

With all of this fixed, I think loop interchange is now ready to be turned on here too.

Shall we merge this, @Meinersbur , @kasuga-fj , @sebpop, @nikic, @mshockwave?

nikic · 2025-06-10T13:23:15Z

Could you please provide updated compile-time and performance numbers? The implementation has changed a lot in the meantime.

sjoerdmeijer · 2025-06-10T13:36:35Z

Could you please provide updated compile-time and performance numbers? The implementation has changed a lot in the meantime.

Yep, that makes sense. We focused a lot on compile-times, and after our last patch to improve this, we measured a 0.19% geomean increase. For different configurations the picture looked very similar to this:

stage1-O3:
Benchmark
kimwitu++        +0.10%
sqlite3          +0.14%
consumer-typeset +0.07%
Bullet           +0.06%
tramp3d-v4       +0.21%
mafft            +0.39%
ClamAVi          +0.06%
lencod           +0.61%
SPASS            +0.17%
7zip             +0.08%
geomean          +0.19%

But I will measure again and confirm.

kasuga-fj · 2025-06-11T12:14:18Z

@nikic Hi, let me ask some questions.

Is this (using different type in getelementptr for the same ptr %A) a valid LLVM IR?
- Based on the document, it seems to be permitted to me.
Even if the above one is a valid input, I believe that this is a special case and it is highly unlikely to happen in practice. How much consideration should be given to cases like this when we enable some pass by default?
- I spent some time to investigate the correctness issues for LoopInterchange and DependenceAnalysis, and as far as I can tell, it seems that there are no other issues. But I cannot confidently say that they are sound.
- However, I also think it is impossible to prove the soundness for all cases.

(NOTE: I think fixing the above one is not very difficult. If we should fix it, I will probably be able to submit a PR for it within a few days.)

nikic · 2025-06-11T12:29:21Z

@kasuga-fj Yes, this is valid. The delinearization reasoning based on GEP source element types needs to be completely removed.

kasuga-fj · 2025-06-11T12:31:27Z

The delinearization reasoning based on GEP source element types needs to be completely removed.

OK, thanks, so I'll work on it.

sjoerdmeijer · 2025-06-16T16:50:50Z

Could you please provide updated compile-time and performance numbers? The implementation has changed a lot in the meantime.

Here are the new compile time numbers @nikic :

http://llvm-compile-time-tracker.com/compare.php?from=a8dacd40da0f48484bda05e066b5fa692bcddcc6&to=95270f01ac1f4f29deec6f5c29e4d7e74b4171e5&stat=instructions:u

It shows almost no increase in compile-times. The reason is that it doesn't trigger in CTMark. For the entire LLVM test-suite, it now only triggers 2 times:

test-suite/MultiSource/Applications/obsequi/tables.c:224:7: remark: Loop interchanged with enclosing loop. [-Rpass=loop-interchange]
test-suite/MultiSource/Applications/obsequi/tables.c:224:7: remark: Loop interchanged with enclosing loop. [-Rpass=loop-interchange]

So, one way to explain the numbers is that interchange is now really good in deciding that it is not going to do anything, which isn't a bad thing.
We knew it wasn't going to trigger as much as before, but we can start looking at lifting restrictions later when this is running for some time.

While collecting compile-time numbers, I noticed that there is a problem with this patch. I.e. the patch doesn't actually add the pass to the pipeline, so it is not running interchange. This is related to the PTO.LoopInterchange check which essentially is only true when -floop-interchange is enabled. I will fix and update this soon.

nikic · 2025-06-16T16:54:41Z

Could you please provide updated compile-time and performance numbers? The implementation has changed a lot in the meantime.

Here are the new compile time numbers @nikic :

http://llvm-compile-time-tracker.com/compare.php?from=a8dacd40da0f48484bda05e066b5fa692bcddcc6&to=95270f01ac1f4f29deec6f5c29e4d7e74b4171e5&stat=instructions:u

I don't think this actually enables loop interchange? You add the extra checks for EnableLoopInterchange, but EnableLoopInterchange itself is still false.

kasuga-fj · 2025-06-17T07:36:50Z

Maybe we should modify here?

llvm-project/clang/lib/Frontend/CompilerInvocation.cpp

Lines 1964 to 1965 in 80b79ce

    
           Opts.InterchangeLoops = 
        
               Args.hasFlag(OPT_floop_interchange, OPT_fno_loop_interchange, false);

sjoerdmeijer · 2025-06-17T12:13:17Z

Could you please provide updated compile-time and performance numbers? The implementation has changed a lot in the meantime.

Here are the new compile time numbers @nikic :
http://llvm-compile-time-tracker.com/compare.php?from=a8dacd40da0f48484bda05e066b5fa692bcddcc6&to=95270f01ac1f4f29deec6f5c29e4d7e74b4171e5&stat=instructions:u

I don't think this actually enables loop interchange? You add the extra checks for EnableLoopInterchange, but EnableLoopInterchange itself is still false.

Yep, you're right, here are the right numbers:

http://llvm-compile-time-tracker.com/compare.php?from=bf60aa1c551ef5de62fd1d1cdcbff58cba55cacd&to=54b48e5d738fa763fb1f31590801b701262ce045&stat=instructions:u

sjoerdmeijer · 2025-06-19T10:49:29Z

Maybe we should modify here?

llvm-project/clang/lib/Frontend/CompilerInvocation.cpp

Lines 1964 to 1965 in 80b79ce

Opts.InterchangeLoops =

Args.hasFlag(OPT_floop_interchange, OPT_fno_loop_interchange, false);

This patch needs rebasing, it's old, in the meantime, I've added support for -floop-interchange which is now interacting with this and causing me a bit of pain. I.e., we now have -floop-interchange (and -fno-loop-interchange) and -mllvm -enable-loopinterchange to enable/disable it, and I have to worry about the different combinations. My question is @nikic , @kasuga-fj , do we want to keep both options around? Or shall we only keep the user-facing -floop-interchange, and get rid of LLVM option -enable-interchange? That would simplify things.

kasuga-fj · 2025-06-20T16:02:44Z

Maybe we should modify here?

llvm-project/clang/lib/Frontend/CompilerInvocation.cpp

Lines 1964 to 1965 in 80b79ce

Opts.InterchangeLoops =

Args.hasFlag(OPT_floop_interchange, OPT_fno_loop_interchange, false);

This patch needs rebasing, it's old, in the meantime, I've added support for -floop-interchange which is now interacting with this and causing me a bit of pain. I.e., we now have -floop-interchange (and -fno-loop-interchange) and -mllvm -enable-loopinterchange to enable/disable it, and I have to worry about the different combinations. My question is @nikic , @kasuga-fj , do we want to keep both options around? Or shall we only keep the user-facing -floop-interchange, and get rid of LLVM option -enable-interchange? That would simplify things.

I personally think it's okay to remove EnableLoopInterchange if it causes confusion. But I don't know if there is a general policy on deleting existing options.

nikic · 2025-06-20T16:43:26Z

I think the cl::opt option is mainly useful for use with opt rather than clang. (Ideally we'd not use cl::opt for this and instead allow specifying pipeline tuning options from the command line...)

sjoerdmeijer · 2025-06-24T10:06:41Z

Thanks for your thoughts on this. I will go ahead and remove the cl::opt option. If I am not mistaken, I don't see precedent for having both the clang and cl::opt option, and for the use of opt I don't think we'll need the -enable-loopinterchange option.

artagnon · 2025-06-24T19:34:37Z

Yep, you're right, here are the right numbers:

http://llvm-compile-time-tracker.com/compare.php?from=bf60aa1c551ef5de62fd1d1cdcbff58cba55cacd&to=54b48e5d738fa763fb1f31590801b701262ce045&stat=instructions:u

I think this is quite a serious regression, especially given that there are no binary changes. Is the regression coming mostly from DA? In that case, can we try to early-exit in LoopInterchange and avoid computing DA, if there are likely going to be no binary changes?

sjoerdmeijer · 2025-06-25T10:31:57Z

Yep, you're right, here are the right numbers:
http://llvm-compile-time-tracker.com/compare.php?from=bf60aa1c551ef5de62fd1d1cdcbff58cba55cacd&to=54b48e5d738fa763fb1f31590801b701262ce045&stat=instructions:u

I think this is quite a serious regression, especially given that there are no binary changes. Is the regression coming mostly from DA? In that case, can we try to early-exit in LoopInterchange and avoid computing DA, if there are likely going to be no binary changes?

We see geomean compile-time increases of 0.20% 0.30%, 0.13%, 0.17% and 0.18%, which I think is actually really good given that we are in fact talking about enabling/running 4 components: interchange, DependenceAnalysis, Delinearization and LoopCacheAnalysis. Most of the compile-time is spent on the required extra analysis, and we have spent significant effort on reducing it to this minimum compile time increases. You're right it's not triggering much, but that is also by design at this point, and again it still requires to calculate extra analysis info.

kasuga-fj · 2025-06-26T15:29:57Z

@sjoerdmeijer What compilation options and target did you use? The result of LoopCacheAnalysis heavily depends on the cache line size, which is retrieved via TargetTransformInfo. In my experience, the profitability check doesn't work well when the cache line size is 0, and IIUIC, some targets set it to 0 by default. Changing it to a non-zero value may increase the number of loops that can be interchanged. I also have some relatively easy-to-implement ideas to relax the legality checks to accept more patterns, but I'm not sure if it will be beneficial in practice.

Regarding the compile-time increase, I also think it's quite significant, and there is still room for improvement. Here are some suggestions:

Delay the computation of LoopCacheAnalysis until it's actually needed. I did a quick test on CMakeFiles/lencod.dir/q_matrix.c.o, which causes a significant increase in compile time, and this approach seemed very effective at least for this case.
IIUIC, DA doesn't cache its results. Both LoopCacheAnalysis and LoopInterchange appear to query DA with the same inputs, so caching these results might help reduce compile time.
LoopInterchange uses a bubble-sort like algorithm to select candidate loops to be exchanged. This can result in the same pair being evaluated multiple times, so caching may also help here.
At the moment, setting a smaller default value for MaxMemInstrCount seems reasonable to me.
As @artagnon suggests, there are opportunities for early exits. For example, if a direction vector is composed entirely of * (e.g., [* * *]), continuing the process obviously doesn't make sense. Skipping such cases could reduce the number of queries to DA.

For now, I'll work on the first one. As for the others, I haven't tried them and I'm not sure if they will really help, but I believe they are worth investigating.

mshockwave · 2025-06-26T19:58:38Z

I think the cl::opt option is mainly useful for use with opt rather than clang. (Ideally we'd not use cl::opt for this and instead allow specifying pipeline tuning options from the command line...)

+1 on this, there are many many cases where we only want to run opt on an IR, either Clang takes too long or it's a reduced test case. And it will be really helpful if we could have a command line flag to turn it off. Could we move EnableLoopInterchange to tools/opt/NewPMDriver.cpp and use it to populate PipelineTuningOptions instead of removing it completely?

kasuga-fj · 2025-06-27T01:53:57Z

I think the cl::opt option is mainly useful for use with opt rather than clang. (Ideally we'd not use cl::opt for this and instead allow specifying pipeline tuning options from the command line...)

+1 on this, there are many many cases where we only want to run opt on an IR, either Clang takes too long or it's a reduced test case. And it will be really helpful if we could have a command line flag to turn it off. Could we move EnableLoopInterchange to tools/opt/NewPMDriver.cpp and use it to populate PipelineTuningOptions instead of removing it completely?

Thanks, I didn't know that file. I think we can do it.

madhur13490 · 2025-06-27T04:41:23Z

@sjoerdmeijer What compilation options and target did you use? The result of LoopCacheAnalysis heavily depends on the cache line size, which is retrieved via TargetTransformInfo. In my experience, the profitability check doesn't work well when the cache line size is 0, and IIUIC, some targets set it to 0 by default. Changing it to a non-zero value may increase the number of loops that can be interchanged. I also have some relatively easy-to-implement ideas to relax the legality checks to accept more patterns, but I'm not sure if it will be beneficial in practice.

Regarding the compile-time increase, I also think it's quite significant, and there is still room for improvement. Here are some suggestions:

Delay the computation of LoopCacheAnalysis until it's actually needed. I did a quick test on CMakeFiles/lencod.dir/q_matrix.c.o, which causes a significant increase in compile time, and this approach seemed very effective at least for this case.

IIUIC, DA doesn't cache its results. Both LoopCacheAnalysis and LoopInterchange appear to query DA with the same inputs, so caching these results might help reduce compile time.

LoopInterchange uses a bubble-sort like algorithm to select candidate loops to be exchanged. This can result in the same pair being evaluated multiple times, so caching may also help here.

At the moment, setting a smaller default value for MaxMemInstrCount seems reasonable to me.

As @artagnon suggests, there are opportunities for early exits. For example, if a direction vector is composed entirely of * (e.g., [* * *]), continuing the process obviously doesn't make sense. Skipping such cases could reduce the number of queries to DA.

For now, I'll work on the first one. As for the others, I haven't tried them and I'm not sure if they will really help, but I believe they are worth investigating.

@kasuga-fj @artagnon Thanks for the comment. As mentioned in the patch description, the following patches are purely based on early exits, and they handle major cases. These patches have brought down the impact significantly.

#118973
#115128
#124247
#118656

However, as @kasuga-fj mentions, there can still be some opportunities left, but unless we quantify, we are not sure how much benefit it will offer. Some of the opportunities you mentioned are known to me, but I have not quantified the gain yet. Tuning MaxMemInstrCount is another option, but that might limit the opportunities interchange has. So, fewer computations lead to less compile-time impact.

I'd also like to take a pause and ask what constitutes "significant" and what does not for compile-time numbers. I feel we are doing quite well on the geomean as the benefits of enabling the pass outweigh the cost. We can continue improving compile-time (patches welcome!), but the question remains unanswered - When do we say the penalty on compile-time is acceptable?

artagnon · 2025-06-27T12:11:41Z

When do we say the penalty on compile-time is acceptable?

ConstraintElim is comparable in complexity: it gets turned on here. Note that there were a lot of binary changes, which justified a geomean regression of 0.3% on ThinLTO. It also does a good job of not blowing up compile-time in degenerate cases (so, never a 0.76%).

That said, I'm very enthusiastic about the prospect of DA operating in LLVM: once the patch lands, I will attempt to add runtime-checks functionality to DA, and gradually move LAA's users to DA (this will be a long-running project). Once there are more users of DA in the tree, and with the potential deprecation of LAA, I expect the costs to eventually be amortized. That said, we don't have very many contributors who have expertise in DA/LCA/LI at this point, and this is a chicken-and-egg problem: I hope that we will further be able to reduce/amortize costs after landing.

kasuga-fj · 2025-06-27T12:22:52Z

there can still be some opportunities left, but unless we quantify, we are not sure how much benefit it will offer. Some of the opportunities you mentioned are known to me, but I have not quantified the gain yet.

Have you tried them? As far as my quick experiment, at least delaying the calculation of LoopCacheAnalysis was effective in some cases. I believe all the ideas I listed are not very difficult to implement, so if you've not tried, I will try them.

Tuning MaxMemInstrCount is another option, but that might limit the opportunities interchange has.

Since interchange fails in most cases anyway, I don’t think reducing the value would cause major practical issues (although I’m not sure whether it actually helps with compile-time). We can always raise it again after improving LoopInterchange and once having a too small value becomes a real problem.

I feel we are doing quite well on the geomean as the benefits of enabling the pass outweigh the cost.

I think we also need to consider individual cases, not just the geomean. A compile-time increase of over 20% despite no changes is, at the very least, quite substantial.

Although I'm very interested in enabling LoopInterchange by default, after the recent discussions, I’ve come to the conclusion that it’s still not ready. In addition to the compile-time concerns, the issue of GEP-driven delinearization also remains.

sjoerdmeijer requested review from nikic, sebpop, Meinersbur, kiranchandramohan, madhur13490, CongzheUalberta and kasuga-fj January 29, 2025 11:56

Meinersbur reviewed Jan 30, 2025

View reviewed changes

sjoerdmeijer mentioned this pull request Feb 5, 2025

[Clang][Driver] Add an option to control loop-interchange #125830

Merged

sjoerdmeijer mentioned this pull request May 16, 2025

[flang] add -floop-interchange and enable it with opt levels #140182

Merged

kasuga-fj mentioned this pull request May 16, 2025

[LoopInterchange] Incorrect exchange happens when the Levels of Dependence is less than the depth of the loop nest #140238

Closed

sjoerdmeijer changed the title ~~[LoopInterchange] Enable it by default (WIP)~~ [LoopInterchange] Enable it by default Jun 10, 2025

kasuga-fj mentioned this pull request Jun 20, 2025

[Delinearization] Add function for fixed size array without relying on GEP #145050

Open

kasuga-fj mentioned this pull request Jun 24, 2025

[Passes] Move LoopInterchange into optimization pipeline #145503

Open

artagnon requested a review from fhahn June 25, 2025 15:21

[LoopInterchange] Enable it by default #124911

Are you sure you want to change the base?

[LoopInterchange] Enable it by default #124911

Uh oh!

Conversation

sjoerdmeijer commented Jan 29, 2025 • edited by madhur13490 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Meinersbur commented Jan 29, 2025

Uh oh!

sjoerdmeijer commented Jan 30, 2025

Uh oh!

Meinersbur Jan 30, 2025

Choose a reason for hiding this comment

Uh oh!

sjoerdmeijer Jan 31, 2025

Choose a reason for hiding this comment

Uh oh!

mshockwave commented Jan 30, 2025

Uh oh!

kasuga-fj commented Jan 31, 2025

Uh oh!

sjoerdmeijer commented Jan 31, 2025

Uh oh!

sjoerdmeijer commented Jan 31, 2025

Uh oh!

kasuga-fj commented Feb 6, 2025

Uh oh!

sjoerdmeijer commented Feb 24, 2025

Uh oh!

kasuga-fj commented Feb 25, 2025

Uh oh!

sjoerdmeijer commented Jun 10, 2025

Uh oh!

nikic commented Jun 10, 2025

Uh oh!

sjoerdmeijer commented Jun 10, 2025

Uh oh!

kasuga-fj commented Jun 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nikic commented Jun 11, 2025

Uh oh!

kasuga-fj commented Jun 11, 2025

Uh oh!

sjoerdmeijer commented Jun 16, 2025

Uh oh!

nikic commented Jun 16, 2025

Uh oh!

kasuga-fj commented Jun 17, 2025

Uh oh!

sjoerdmeijer commented Jun 17, 2025

Uh oh!

sjoerdmeijer commented Jun 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kasuga-fj commented Jun 20, 2025

Uh oh!

nikic commented Jun 20, 2025

Uh oh!

sjoerdmeijer commented Jun 24, 2025

Uh oh!

artagnon commented Jun 24, 2025

Uh oh!

sjoerdmeijer commented Jun 25, 2025

Uh oh!

kasuga-fj commented Jun 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mshockwave commented Jun 26, 2025

Uh oh!

kasuga-fj commented Jun 27, 2025

Uh oh!

madhur13490 commented Jun 27, 2025

Uh oh!

artagnon commented Jun 27, 2025

Uh oh!

kasuga-fj commented Jun 27, 2025

Uh oh!

Uh oh!

sjoerdmeijer commented Jan 29, 2025 •

edited by madhur13490

Loading

kasuga-fj commented Jun 11, 2025 •

edited

Loading

sjoerdmeijer commented Jun 19, 2025 •

edited

Loading

kasuga-fj commented Jun 26, 2025 •

edited

Loading