(POC) Thunks: Support inregister ABI #4061

Draft · wants to merge 5 commits into main
Conversation

@Sonicadvance1 (Member) commented Sep 11, 2024

These four commits are effectively atomic, so they could be merged in order as we sort out the details of this. I don't expect this to be merged in this state! Read the commit messages for some more details.

One thing that I have been thinking about is the overhead of building up and tearing down stack state on thunks. While thunking already gives us a significant performance improvement, there is still some inherent overhead around marshaling the data across the architecture boundary. We have punted on the problem for quite a while because it hasn't been /that/ significant of an issue.

We also have the problem that changing how thunks behave requires changing details both in the JIT and inside the thunk generator itself, which means encoding knowledge on both sides and can be a bit unwieldy. This gets the conversation going again for making thunks even more optimal by supporting an inregister implementation, so that in the typical case all of the arguments stay inside registers and don't touch the stack. The remaining thunk functions can still fall back to the stack implementation as needed.

As for why I thought this was interesting to start looking into: I did some initial investigation into what exactly the overhead of FEX's thunking is by microbenchmarking VDSO. I made a quick and dirty implementation of __vdso_time which ran entirely inregister, versus the existing path, and in a microbench I saw the average time per call improve from 45.300797ns down to 33.422434ns. That's ~11.87ns removed by keeping the arguments in registers rather than moving them to the stack, or only 73.7% of the original time per call.
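For reference, the microbench was just a tight loop over the call. Something along these lines is enough to see the difference (this is a sketch of the shape of the harness, not the exact benchmark used for the numbers above):

```cpp
// Sketch of a microbench: time a tight loop of calls through the VDSO path
// and report the average per-call cost.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <ctime>

int main() {
  constexpr uint64_t Iterations = 10'000'000;

  auto Begin = std::chrono::steady_clock::now();
  for (uint64_t i = 0; i < Iterations; ++i) {
    // time() typically resolves to __vdso_time, which goes through the thunked VDSO path.
    volatile time_t t = time(nullptr);
    (void)t;
  }
  auto End = std::chrono::steady_clock::now();

  double NsPerCall =
    std::chrono::duration<double, std::nano>(End - Begin).count() / Iterations;
  printf("%f ns per call\n", NsPerCall);
  return 0;
}
```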

While this time improvement isn't guaranteed for every thunk function, it's a good starting point towards seeing how much thunks could /possibly/ be improved. We already know that games absolutely abuse __vdso_clock_gettime for busylooping, so this would improve how quickly they can busy loop, but that isn't a very interesting performance win on its own. Instead, this led me down the path of seeing how many API calls recent games are making these days.

The initial game I grabbed a frame capture from was Spider-Man: Miles Morales. The frame I grabbed had 56k API calls to libvulkan. A decent amount but not an absolute ton. I then decided to grab a frame from Civilization 6's Graphics benchmark, 333,601 Vulkan API calls in one frame! Money.

So if we assumed the 11.87ns case per api call getting saved:

  • Miles Morales: ~665,188.328 ns per frame saved - 0.665ms per frame saved
  • Civ 6: 3,962,633.775 ns per frame saved - 3.962ms per frame saved!

Okay cool. So assuming we do actually improve the thunk call overhead by ~11.87ns per call, and assuming a game is doing 333k Vulkan calls per frame, we can potentially save nearly 4ms of CPU time off of a frame. And since a typical 60FPS game is running at 16.66ms per frame, that's nearly 25% of the frame time saved? Sounds like a win to me, let's do it!
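Written out, the back-of-the-envelope math (just restating the numbers above):

```cpp
// Back-of-the-envelope check of the frame-time math above.
constexpr double SavedPerCallNs = 11.87;          // measured __vdso_time delta
constexpr double CallsPerFrame = 333'601;         // Civ 6 graphics benchmark frame
constexpr double FrameBudgetMs = 1000.0 / 60.0;   // ~16.66ms at 60FPS

constexpr double SavedPerFrameMs = SavedPerCallNs * CallsPerFrame / 1'000'000.0; // ~3.96ms
constexpr double FrameFraction = SavedPerFrameMs / FrameBudgetMs;                // ~0.238, i.e. "nearly 25%"
```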

As a proof of concept I have only done a single VDSO function (clock_gettime) because it was easy on my end and showcased that it is possible and works. It also required a change in libVDSO_Guest.cpp which won't be required for Vulkan and GL.

This is going to require a decent amount of work on the thunkgen and also might require changes depending on what Alyssa wants to do with the IR. But it roughly matches where I want thunking to end up.

PS:
Another microbench of the clock_gettime VDSO path showed a reduction in overhead from 52.64ns to 49.93ns, a 2.70ns reduction per call, or a 5.15% reduction.

Assuming only a 2.7ns reduction in overhead, the Civ6 benchmark would still cut 0.903ms per frame off their time, which isn't quite as good, but is still reasonable. The true overhead reduction is likely to be somewhere in the middle of these values; we can't really tell until it's actually implemented.

With the ABI being described right after the hash of the instruction, we
can now take advantage of inregister ABI handling.

- This requires passing up to 13 registers to the IR operation
   - Six GPRs, six FPRs, and one return value which is a GPR or FPR
   - Theoretically there can be a thunk function which maxes that out
     entirely
- Inregister currently only exists for x86-64, because nothing really
  uses `vectorcall` for 32-bit.
- Requires passing all of this information to the backend so it can map
  the x86-64 SysV ABI directly to the AAPCS64 equivalent (see the
  mapping sketch below).
- Missing optimization: `SpillStaticRegs` only needs to spill specific
  registers rather than all of them.
- This takes advantage of the ABI semantics matching pretty well
  between AAPCS64 and Linux SysV x86-64, which is described near the
  `ThunkABIFlags` enum.
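For context, the direct SysV-to-AAPCS64 mapping this relies on looks roughly like the following. This is general calling-convention knowledge plus the six/six limit from the list above, not the actual table in the JIT:

```cpp
// Illustration of how the first six Linux SysV x86-64 argument registers line
// up with their AAPCS64 equivalents.
#include <array>
#include <string_view>
#include <utility>

using RegPair = std::pair<std::string_view, std::string_view>; // {x86-64, AArch64}

constexpr std::array<RegPair, 6> GPRArgs {{
  {"rdi", "x0"}, {"rsi", "x1"}, {"rdx", "x2"}, {"rcx", "x3"}, {"r8", "x4"}, {"r9", "x5"},
}};
constexpr std::array<RegPair, 6> FPRArgs {{
  {"xmm0", "v0"}, {"xmm1", "v1"}, {"xmm2", "v2"}, {"xmm3", "v3"}, {"xmm4", "v4"}, {"xmm5", "v5"},
}};
// Return value: rax -> x0 for integers/pointers, xmm0 -> v0 for float/double.
// Six GPR args + six FPR args + one return register = the 13 registers mentioned above.
```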
… inregister ABI

This is just a proof of concept for getting thunkgen wired up just
enough that I can get a workable example for VDSO.

Some assumptions made here:
- Only symbols explicitly marked with `inregister_abi` use this
- There is zero sanity checking if the number of FPRs and GPRs is within
  limits
- There is zero sanity checking if the types actually map correctly
- There is zero automated determination to enable inregister ABI
  handling, even though that should be possible.

I don't expect this code to get merged; I just needed it to give me
something to work with.

There is one nicety seen from this implementation:
- No need for a `fexfn_pack_` for in-register implementations.
  - Instead the thunk aliases directly to the thunk function!
  - Saves a branch because the JIT is doing the heavy lifting here
  - Needed a little bit of fixup in libVDSO_Guest because it is special
    cased there.

Future implementation details should make this nicer:
- thunkgen should understand ABI limitations and be able to determine if
  a function can be safely switched to inregister_abi (a rough sketch of
  such a check follows after this list).
  - If the argument type is a pointer, integer, or float/double, accept it.
  - If the number of GPRs is within the max, accept it.
  - If the number of FPRs is within the max, accept it.
  - If the return type is void, integer, or float/double, accept it.
  - If all passed-in types are 64 bits or less, accept it.
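A rough sketch of what such a check could look like (the type names, fields, and limits here are assumptions for illustration, not thunkgen's existing API):

```cpp
// Hypothetical "is this function inregister-eligible?" check.
#include <cstddef>
#include <vector>

enum class ArgKind { Pointer, Integer, Float, Double, Other };

struct FunctionSignature {
  std::vector<ArgKind> Args;   // Argument kinds, each assumed <= 64 bits wide.
  ArgKind ReturnKind;          // Ignored when ReturnsVoid is set.
  bool ReturnsVoid;
};

constexpr size_t MaxGPRs = 6;
constexpr size_t MaxFPRs = 6;

bool CanUseInregisterABI(const FunctionSignature& Sig) {
  size_t GPRs = 0;
  size_t FPRs = 0;

  for (ArgKind Kind : Sig.Args) {
    switch (Kind) {
    case ArgKind::Pointer:
    case ArgKind::Integer: ++GPRs; break;
    case ArgKind::Float:
    case ArgKind::Double:  ++FPRs; break;
    default: return false; // Aggregates, >64-bit types, etc. fall back to the stack ABI.
    }
  }

  // Return type must be void, integer/pointer, or float/double.
  if (!Sig.ReturnsVoid && Sig.ReturnKind == ArgKind::Other) {
    return false;
  }

  return GPRs <= MaxGPRs && FPRs <= MaxFPRs;
}
```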

Prior investigation has shown that Vulkan effectively compresses down to
24 different ABI functions: 12 of them hit the full-GPR path, 4 hit the
GPR+FPR path, and 8 have too many arguments to sanely convert to
inregister. This is quite low considering Vulkan has around 641 function
declarations.
@Sonicadvance1 (Member Author)

@alyssarosenzweig how do you feel about the IR additions for this?

@alyssarosenzweig (Collaborator) commented Oct 20, 2024

> • This requires passing up to 13 registers to the IR operation

This will explode RA. maybe not if 6 are FPRs? Definitely close to the edge at least

@alyssarosenzweig (Collaborator)

> • This requires passing up to 13 registers to the IR operation
>
> This will explode RA. maybe not if 6 are FPRs? Definitely close to the edge at least

If this works (I should check..) no objection from an IR perspective

@Sonicadvance1 (Member Author)

> • This requires passing up to 13 registers to the IR operation
>
> This will explode RA. maybe not if 6 are FPRs? Definitely close to the edge at least
>
> If this works (I should check..) no objection from an IR perspective

I haven't tried a max-spec IR operation that actually used all GPRs and FPRs, but theoretically it should work? If it doesn't, we could limit the count still.
