Compiling for llvm-cpu without targeting a specific CPU is a bad experience #18561
Yes please! #15487
Let's stop overthinking this and do something simple like I suggest. Open to other options, but I would like to see this improved.
The suggested proposal SGTM. I might even want to default to having LLVM use the current host for the target CPU and available features (if those are different), then have users explicitly pass "generic" for the lowest common denominator. We could apply similar logic to the GPU backends - try to detect devices on the system (shell out to vulkaninfo / rocm-smi / nvidia-smi?) and default to what is available, but still support cross compilation with explicit device info and a "generic" target where possible.
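For concreteness, the two explicit modes might look like this on the command line (a sketch using the `--iree-llvmcpu-target-cpu` values discussed later in this thread; file names are illustrative):

```shell
# Target the CPU of the machine running the compiler (fast, not portable):
iree-compile model.mlir \
  --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-target-cpu=host \
  -o model_host.vmfb

# Explicitly opt in to the lowest common denominator (portable, slow):
iree-compile model.mlir \
  --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-target-cpu=generic \
  -o model_generic.vmfb
```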
Yuck - that is not a cheap thing to do and has a high risk of flakes - I am still not sure why proper documentation is insufficient? You must specify your target device (
Note that this is also what clang does - https://clang.llvm.org/docs/HIPSupport.html - you must pass
(if there's such a big concern about documentation not fixing this issue then I'd be ok with making compilation fail if the user doesn't specify an arch for a backend - whether a particular arg, generic, native, etc - but guessing is bad)
We can certainly update the docs (https://iree.dev/guides/deployment-configurations/cpu/#compile-a-program) and start with a warning from the compiler if information is omitted and generic is used as the default. I'm seeing a proliferation of flags (mainly in rocm usage, but also cpu) and the documentation can't keep up. I want more of that to be captured somewhere - docs, samples, the compiler itself, etc. See one example here: iree/experimental/regression_suite/shark-test-suite-models/sdxl/test_unet.py Lines 90 to 110 in 914858f
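A minimal sketch of what that warn-on-implicit-default behavior could look like (Python pseudologic with hypothetical names, not IREE's actual C++ implementation):

```python
def resolve_target_cpu(cpu_flag):
    """Resolve a --iree-llvmcpu-target-cpu style flag value (hypothetical sketch).

    Returns (cpu, warning); warning is None unless we silently fell back to
    the slow generic target without the user asking for it.
    """
    if cpu_flag is None:
        warning = (
            "defaulting to a generic CPU; this will be slow. "
            "Pass an explicit CPU name, 'host' to target this machine, "
            "or 'generic' to silence this warning."
        )
        return "generic", warning
    # An explicit value, including an explicit 'generic', is an informed
    # choice and produces no warning.
    return cpu_flag, None
```

The key design point is that "generic" chosen by the user and "generic" chosen by omission behave identically in codegen but differ in diagnostics.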
That's insanity - besides the debug flags (dumping statistics/etc) if any of those are required that's a bug. I think Mahesh has said it before: a feature is not done until it's on by default and if all of those flags are needed to make the model compile or perform then the engineering was never completed. The only two flags required there should be
I'm fine making

Agreed on all of the other points. Need to burn down all of the other flags. I'm just starting with this one.
I can take a pass at this, unless someone else wants to. Plan:
Can someone clarify why we have all three of these flags?
It seems like the triple could be a superset of the cpu? Is there some redundancy there? I see some riscv sample code setting both: iree/runtime/src/iree/hal/local/elf/testdata/generate.sh Lines 68 to 73 in d834aa7
but even our microkernels blog post (highlighting cpu performance work) only includes a few of the flags: iree/docs/website/docs/community/blog/posts/microkernels.md Lines 25 to 32 in d834aa7
Oh, the linalg tutorial from @bjacob explains that iree/docs/website/docs/community/blog/posts/linalg-tutorial.md Lines 183 to 193 in d834aa7
There is some complicated logic and then a few calls into LLVM itself in https://github.com/iree-org/iree/blob/main/compiler/plugins/target/LLVMCPU/LLVMTargetOptions.cpp. (This is why I filed #15487 - I've wanted someone directly familiar with LLVM CPU to be driving this)
This stuff always grows into a bit of a hairball. The condition we are trying to guard is that
The CPU flags mirror LLVM - we can't remove them, but we could more intelligently populate them - maybe - triple is often not enough. I think we do ask for defaults from LLVM today. I'm hesitant to suggest we diverge from clang behavior as then we have to support that (and if the issue here is that our documentation sucks, adding bespoke stuff only hurts that).
Made some progress stepping through the details:
@marbre pointed out that for bare metal arm, the target handling is letting some "errors" fall through: iree/compiler/plugins/target/LLVMCPU/LLVMTargetOptions.cpp Lines 83 to 97 in e19950c
iree/compiler/plugins/target/LLVMCPU/LLVMTargetOptions.cpp Lines 147 to 155 in e19950c
Sample logs: https://github.com/iree-org/iree-bare-metal-arm/actions/runs/10923467370/job/30320173426#step:11:262
Flags for those logs: https://github.com/iree-org/iree-bare-metal-arm/blob/23deb47d546786e7bd64fc6edd51a3095b6c1817/samples/simple_embedding/CMakeLists.txt#L98-L109 We may want to amend that logic here too. The flip side of not "breaking existing users" with that style of error reporting is that it potentially leaves performance on the table.
More context for my previous comment: #15387
Oh yeah, I only cared about ARM when I wrote that :-D
I noticed that we override the iree/compiler/plugins/target/LLVMCPU/LLVMTargetOptions.cpp Lines 139 to 146 in e19950c
However, we only override parts of the triple, not the full object/string. In particular, that code appears to leave the "arch" unchanged. Possible values for that are in https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/TargetParser/Triple.h. Does that mean that if you compile on x86_64, your code generated with llvm-cpu won't be compatible with aarch64? I'm wondering if this other default should be changed to an explicit "host" too: iree/compiler/plugins/target/LLVMCPU/LLVMTargetOptions.cpp Lines 669 to 671 in e19950c
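To make the question concrete, here is a hypothetical Python sketch of the behavior described above: only the components after the arch (vendor/OS/environment) get rewritten to match the host, while the leading arch component is preserved as-is (function name and string handling are my assumptions, not the actual C++ code):

```python
def normalize_triple_to_host(target_triple: str, host_triple: str) -> str:
    # Keep the target's arch (the first dash-separated component) and take
    # the remaining vendor-os[-environment] components from the host triple.
    target_arch, _, _ = target_triple.partition("-")
    _, _, host_rest = host_triple.partition("-")
    return f"{target_arch}-{host_rest}"
```

So with this behavior, a default arch inherited from an unspecified triple would survive even when the host OS/vendor are filled in: e.g. `normalize_triple_to_host("aarch64-unknown-unknown", "x86_64-pc-linux-gnu")` gives `"aarch64-pc-linux-gnu"` - the OS follows the host, the arch does not.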
Answering my own question - yes. Compiled with embedded linking and
I'm still wondering if we want to default the target triple to
Either way, we could have our docs explain OS, arch, features, etc. OS: matters when using the system linker (for full integration with debug tools); does not matter with embedded linking mode.
Ehhh... we only support the "host" CPU name on x86?
The https://github.com/llvm/llvm-project/blob/main/llvm/lib/TargetParser/Host.cpp file supports plenty of other architectures though... |
You're unfortunately finding the results of engineers not caring about anything but the exact configuration they are looking at in the moment they author something. Has been an issue for the lifetime of the project and probably always will be 😢
Thanks for digging into it, Scott. Feel free to loop one of the backend engineers in if you need help untangling it. I'm happy to nominate others to care in detail.
I think I see enough of the pieces now to refactor the code a bit and add some helpful warnings and documentation. I'm not sure how I'll test my changes though, since a fair portion of this is different depending on the architecture of the host machine running the compiler and I only have x86_64 dev machines. It would be helpful to get some more eyes on the various configurations we want to support and then do some manual QA testing that the compiler either detects the right features and generates good code, or bails with a helpful error.
Maybe more of a unit test via some magic env var or test-only flag: --iree-testing-assume-host=, then a lit test variant for each arch branch that runs device assignment and validates. We're not looking to test LLVM here, just to ensure that we're not fumbling the flag parsing.
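A minimal sketch of that test-only escape hatch, assuming an env var named after the suggested flag (both the name and the mechanism are hypothetical):

```python
import os
import platform


def detect_host_arch() -> str:
    # Test-only override so lit tests can exercise every arch branch of the
    # target-assignment logic without actually running on that hardware.
    override = os.environ.get("IREE_TESTING_ASSUME_HOST")  # hypothetical name
    if override:
        return override
    # Normal path: report the real host architecture.
    return platform.machine()
```

A lit test variant per architecture would then set the env var, run target assignment, and FileCheck the chosen CPU/features, without ever invoking LLVM's real host detection.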
That could work, yeah. When I say "test my changes" here, I'm still just referring to local development "testing", not automated CI testing - that would be a nice bonus.
Well, if you have the knobs to verify locally, then you're more than halfway to a lit test. That's how most of these things in llvm proper get tested.
Pushed an initial attempt at reworking how the target init is handled: #18587. I could pass that off to someone else and context switch to other tasks 🤔
Sorry, I had not kept up with the discussion here, was heads down in GPU data tiling. Here are the difficulties that I know of:
Here is what I would do:
Thanks for the context!
Defaulting to host (and not requiring it be set explicitly) goes against @benvanik 's suggestions up in the issue: #18561 (comment) (unless that was specifically referring to autodetection for gpu targets?)
Oh, this sounds useful. Are there functions in LLVM that would help get such a list?
Yeah, mostly about GPU targets (where we have to launch other tools that fully load/initialize drivers and such) - CPU detection in LLVM is free.
WDYT about #18682?
Progress on #18561. This introduces a warning (which we intend to promote to an error in the future) when targeting a generic CPU without explicitly asking for it. This addresses a performance footgun, as that IREE default results in low performance. Along the way this grew into a substantial change to e2e testing rules:
- `TARGET_CPU` and `TARGET_CPU_FEATURES` arguments are gone (they were redundant with `COMPILER_FLAGS`).
- For `TARGET_CPU_FEATURES_VARIANTS`, the special value `"default"` is renamed to `"generic"` and a new value `"host"` is also supported.
Example warning (this is customized to the target architecture, here x86):
```
/home/benoit/matmul_i8.mlir:0:0: warning: while creating CPU target: Defaulting to targeting a generic CPU for the target architecture will result in poor performance. Please specify a target CPU and/or a target CPU feature set. If it is intended to target a generic CPU, specify "generic" as the CPU.

This can be done in two ways:
1. With command-line flags:
    --iree-llvmcpu-target-cpu=...
    --iree-llvmcpu-target-cpu-features=...
2. Within the IR:
    #hal.executable.target< ... , cpu="...", cpu_features="...">

In the rest of this message, these fields are referred to as just `cpu` and `cpu_features`.

Examples:

    cpu=generic
        Target a generic CPU of the target architecture. The generated code
        will have poor performance, but will run on any CPU.

    cpu=host
        Target the host CPU. The generated code will have optimal performance
        on the host CPU but will crash on other CPUs not supporting the same
        CPU features.

    cpu="name"
        Target a specific CPU. This is mostly used on x86. The accepted values
        are the same as in Clang command lines.

        List of accepted x86 CPUs: nocona, core2, penryn, bonnell, atom,
        silvermont, slm, goldmont, goldmont-plus, tremont, nehalem, corei7,
        westmere, sandybridge, corei7-avx, ivybridge, core-avx-i, haswell,
        core-avx2, broadwell, skylake, skylake-avx512, skx, cascadelake,
        cooperlake, cannonlake, icelake-client, rocketlake, icelake-server,
        tigerlake, sapphirerapids, alderlake, raptorlake, meteorlake,
        arrowlake, arrowlake-s, lunarlake, gracemont, pantherlake,
        sierraforest, grandridge, graniterapids, graniterapids-d,
        emeraldrapids, clearwaterforest, knl, knm, k8, athlon64, athlon-fx,
        opteron, k8-sse3, athlon64-sse3, opteron-sse3, amdfam10, barcelona,
        btver1, btver2, bdver1, bdver2, bdver3, bdver4, znver1, znver2,
        znver3, znver4, znver5, x86-64, x86-64-v2, x86-64-v3, x86-64-v4

    cpu_features="+feature1,..."
        Target a CPU supporting the comma-separated list of (+-prefixed)
        features. The accepted values are the same as in Clang command lines.
```

---------

Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>
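As a usage sketch of the flags named in that warning (the CPU and feature values here are illustrative x86 examples; accepted values match Clang):

```shell
# Target a generic x86-64-v3 baseline but additionally require AVX-512:
iree-compile matmul_i8.mlir \
  --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-target-cpu=x86-64-v3 \
  --iree-llvmcpu-target-cpu-features=+avx512f,+avx512vl \
  -o matmul_i8.vmfb
```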
@ScottTodd, should we close this as completed by #18682, or leave this open until further changes are made, such as promoting that warning into an error? We shouldn't do that before a full release cycle of whichever distribution channel users get their IREE from, so I wonder if leaving this open means having an open, essentially unactionable issue for half a year.
Let's keep this open until we at least update our docs (https://iree.dev/guides/deployment-configurations/cpu/) to include the recommended best practices (e.g.

We'll push a new stable release within a few weeks, and the next should then be 6-8 weeks later.
I've seen multiple people falling down this hole: they run iree-compile on their model, targeting CPU. Then they get performance that is 10x-100x off of any reasonable expectation. Then they either go away silently or report back about poor experiences (not always reporting flags and such).
There are good reasons why a compiler like IREE shouldn't make assumptions about what the CPU target is, but on the other hand, not specifying a target CPU will almost always produce a grossly subpar experience, since the generic target (at least on x86) lacks so many features as to be basically useless for any high-performance numerics.
I've even fallen down this hole recently and had to go remember the incantation to select a specific CPU. In the case I was working on (an f16 CPU LLM), performance was 100x different between not specifying a target CPU and specifying "host". We need to guide people better than this.
Proposal
As mentioned, there are good reasons for a compiler not to make too many assumptions without being told what to do. But I think we can and should actively warn, possibly with a link to the documentation site, when the compiler is invoked without the user specifying a target CPU. Whatever the warning is, it should be very explicit that the user should pass
--iree-llvmcpu-target-cpu=host
to target the precise CPU they are running on. We should possibly also accept "generic" or something like that in case the user really wants to target the default and not get the warning. I basically want to guard against the case where the user has not specified anything and the compiler just silently generates 100x-too-slow code. In almost all cases, it will be better for the user to say something, and we should guide them on a proper choice.