[regression]: Increase in iree-compile memory for > 100X #18869
Comments
Commits between those versions: candidate-20241014.1046...candidate-20241015.1047
Other debugging tips: https://iree.dev/developers/debugging/compile-time-regressions/
My guess is that this PR is the culprit among those commits: #18730. @pdhirajkumarprasad can you try reverting that PR locally?
Yes, this causes the timeout issue. Smaller repro
Looking into this.
Some thoughts on how to catch this sort of regression sooner:
Increasing test coverage can give earlier signal for failures.
The tests that failed here (https://github.com/nod-ai/SHARK-TestSuite/actions/workflows/test_e2eshark.yml?query=branch%3Aalt-merge-reports I think?) run nightly and without prominent alerts for failures. We have other test suites that run on every commit (like https://github.com/iree-org/iree-test-suites/tree/main/onnx_models as part of https://github.com/iree-org/iree/blob/main/.github/workflows/pkgci_test_onnx.yml).
I'd prefer for all tests run on presubmit (especially if blocking) in iree-org to pull from only the iree-org repository and other public sources. Contributors should only need access to projects in this organization to be able to make changes with confidence. They shouldn't need access to another repository or any private files, private logs, special hardware, etc.
Presubmit tests should run in less than 30 minutes (ideally 15 minutes or faster). If test suites grow too large, they can be split into shards (given enough runner capacity) or we can apply some selection criteria to choose which run on every commit and which run less frequently.
Guarding against system health regressions in the compiler
We have #13207 tracking general approaches to getting the compiler to fail in useful ways when conditions like those here are observed. I wonder if the latest idea posted there would have caught this issue:
Guarding against system health regressions in test suites
A timeout on these tests may have caught this sooner (but without a helpful error message). We have such timeouts in iree-org/iree and iree-org/iree-test-suites tests (typically 60 seconds for unit tests or 600 seconds for really large tests). We could explore other watchdog processes, such as one on memory usage or disk space. These metrics are often correlated.
Viewing and tracking regular system health metrics
This failure looks like an infinite loop or extremely poor performance ("falling off a performance cliff"). In more usual cases where a metric grows by less than one order of magnitude, a test could avoid a timeout while still regressing significantly. In those cases we should lean on automated benchmarks and statistics tracking, rather than test suite controls or compiler heuristics. We have a flag to dump compilation statistics that we could have model test suites include and then dump the results to the logs. We have tracked these metrics in benchmark dashboards before, allowing us to spot historical trends or guard against regressions above a certain threshold (e.g. dispatch count increasing from 500 to 600, or executable binary size increasing from 500KB to 10MB). Note that if the compiler doesn't actually finish running (such as here, with the multi-hour CI runs), any summarization and statistics dumping that we want to run won't help, since the code won't reach that point before the run is cancelled.
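To make the watchdog idea concrete, here is a minimal sketch of a test-side guard that combines a wall-clock timeout with a memory cap. It is not how the existing test suites are implemented; `run_with_limits` and its parameters are hypothetical names, and it assumes a Unix host where `RLIMIT_AS` is enforced (e.g. Linux).

```python
import resource
import subprocess
import sys


def run_with_limits(cmd, timeout_s, max_mem_bytes):
    """Run cmd under a timeout and a memory cap (hypothetical helper).

    Returns "ok", "failed" (non-zero exit, including allocation failures
    caused by the cap), or "timeout".
    """

    def limit_memory():
        # Runs in the child just before exec. Only the soft limit is lowered,
        # and never above an existing hard limit.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        soft = max_mem_bytes
        if hard != resource.RLIM_INFINITY:
            soft = min(max_mem_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))

    try:
        proc = subprocess.run(
            cmd,
            preexec_fn=limit_memory,
            timeout=timeout_s,  # the child is killed when this expires
            capture_output=True,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if proc.returncode == 0 else "failed"


if __name__ == "__main__":
    # Stand-in child process; a real test would pass the compiler command.
    demo = [sys.executable, "-c", "print('compiled')"]
    print(run_with_limits(demo, timeout_s=600, max_mem_bytes=4 * 1024**3))
```

With a cap like this, a run that balloons past the limit fails quickly with an allocation error instead of occupying a CI runner for hours, though it still would not say which pass caused the growth.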
I've created a new tracker for this here: #18875
What happened?
For the given IR
iree-compile memory usage grows beyond 50 GB and compilation still does not complete. This behavior started from
while with
it compiles in < 1 second with < 300 MB
Steps to reproduce your issue
command:
What component(s) does this issue relate to?
Compiler
Version information
No response
Additional context
No response