
Improving VM conversion performance. #18957

Merged: 3 commits merged into main on Oct 30, 2024

Conversation

benvanik (Collaborator) commented:

The major change here is using a precomputed import table in the VM conversion patterns, which removes the symbol lookup that was happening on each call. In models with 100k calls to imports this is a substantial speedup.
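
Roughly, the pattern-side change looks like the following minimal sketch. It uses upstream MLIR types only; `ImportTable`, `buildImportTable`, and `ConvertCallToImport` are illustrative names rather than the actual IREE code:

```c++
// Sketch: build the name -> import mapping once per module, then hand it to the
// conversion patterns so each rewritten call is a hash lookup instead of a
// SymbolTable::lookupNearestSymbolFrom() walk.
#include "llvm/ADT/DenseMap.h"
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/Transforms/DialectConversion.h"

using ImportTable = llvm::DenseMap<mlir::StringAttr, mlir::func::FuncOp>;

// Built once before the conversion runs: one walk over the module instead of
// one symbol lookup per call site.
static ImportTable buildImportTable(mlir::ModuleOp module) {
  ImportTable table;
  for (auto funcOp : module.getOps<mlir::func::FuncOp>()) {
    if (funcOp.isExternal())  // externals stand in for vm.import in this sketch
      table[funcOp.getSymNameAttr()] = funcOp;
  }
  return table;
}

struct ConvertCallToImport
    : public mlir::OpConversionPattern<mlir::func::CallOp> {
  ConvertCallToImport(mlir::MLIRContext *context,
                      const ImportTable &importTable)
      : OpConversionPattern(context), importTable(importTable) {}

  mlir::LogicalResult matchAndRewrite(
      mlir::func::CallOp callOp, OpAdaptor adaptor,
      mlir::ConversionPatternRewriter &rewriter) const override {
    // Previously: a symbol lookup relative to callOp on every application.
    auto it = importTable.find(callOp.getCalleeAttr().getAttr());
    if (it == importTable.end())
      return rewriter.notifyMatchFailure(callOp, "callee is not an import");
    // The real pattern would emit the VM import call here; rebinding the call
    // to the resolved op keeps the sketch self-contained.
    rewriter.replaceOpWithNewOp<mlir::func::CallOp>(callOp, it->second,
                                                    adaptor.getOperands());
    return mlir::success();
  }

  const ImportTable &importTable;  // owned by the pass; outlives the patterns
};
```

Keying the map by `StringAttr` makes each lookup a hash of a uniqued pointer, and the table is built once in the pass and shared by reference across all pattern applications.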

While profiling I also squashed a few more perf issues involving symbol lookups, and made some passes that could nest on function-like ops do so, which lets the pass manager run them in parallel across functions.
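
For the pass-nesting part, the general MLIR mechanism is to anchor a pass on function-like ops so the pass manager can schedule the bodies concurrently. A sketch using an upstream pass for illustration (not the specific IREE passes that were changed):

```c++
// Sketch: a pass anchored on function-like ops lets mlir::PassManager run one
// instance per function concurrently, instead of a single serial run over the
// whole module.
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/Pass/PassManager.h"
#include "mlir/Transforms/Passes.h"

void buildExamplePipeline(mlir::PassManager &pm) {
  // Instead of a module-scoped pass that a single thread walks serially...
  //   pm.addPass(mlir::createCanonicalizerPass());
  // ...anchor it on function-like ops so each body is processed in parallel:
  pm.addNestedPass<mlir::func::FuncOp>(mlir::createCanonicalizerPass());
}
```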

These changes drop VM translation of the 405b model from ~3.5min to ~1.5min. Disabling verification (`-verify-each=0` to iree-opt or `-verify=false` to iree-compile) takes it to 1min.

Remaining work is mostly around parallelizing some passes that are not trivially parallelizable (FoldGlobals, DropUnusedCalls, etc.) and parallelizing some analyses (Explorer global init, call graph walking) that get really expensive when there are 250k calls and 500k ops. Any place that does a symbol use walk is going to suffer. Many of these fixes are in our code, but there are several upstream components that fall over with this amount of IR (CallGraph, DataFlowSolver, the verifier, etc.).
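
On the symbol-use-walk point, the usual mitigation is to batch the queries with a cached map rather than re-walking the IR per symbol. A minimal sketch using upstream MLIR's `SymbolUserMap`; the erase-unused-private-functions logic is only for illustration, not the actual FoldGlobals/DropUnusedCalls code:

```c++
// Sketch: with ~250k calls, SymbolTable::getSymbolUses() per symbol re-walks
// the module every time (roughly O(symbols * ops)). SymbolUserMap walks once
// and caches users, so each query afterwards is cheap.
#include "llvm/ADT/STLExtras.h"
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/IR/SymbolTable.h"

void eraseUnusedPrivateFuncs(mlir::ModuleOp module) {
  mlir::SymbolTableCollection symbolTables;
  mlir::SymbolUserMap userMap(symbolTables, module);  // single walk, cached
  for (auto funcOp :
       llvm::make_early_inc_range(module.getOps<mlir::func::FuncOp>())) {
    // Cheap per-symbol query instead of another full symbol-use walk.
    if (funcOp.isPrivate() && userMap.getUsers(funcOp).empty())
      funcOp.erase();
  }
}
```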

@benvanik added the compiler/dialects (Relating to the IREE compiler dialects (flow, hal, vm)) and performance ⚡ (Performance/optimization related work across the compiler and runtime) labels on Oct 30, 2024
@ScottTodd (Member) commented:

👀 tagging #11994 on this, and the notes about verification also point to #12095. Thanks for improving compilation time!

@ScottTodd (Member) left a comment:

LGTM when tests pass

@benvanik marked this pull request as ready for review on October 30, 2024 at 23:12
@benvanik merged commit 2ec9017 into main on Oct 30, 2024
39 checks passed
@benvanik deleted the users/benvanik/vm-conversion-perf branch on October 30, 2024 at 23:18