(POC) Thunks: Support inregister ABI #4061

Draft · wants to merge 5 commits into main
Conversation

@Sonicadvance1 (Member) commented Sep 11, 2024

These four commits are effectively atomic, so they could be merged in order as we sort out the details of this. I don't expect this to be merged in this state! Read the commit messages for some more details.

One thing that I have been thinking about is the overhead of building up and tearing down stack state on thunks. While thunking already gives us a significant performance improvement, there is still some inherent overhead around marshaling the data across the architecture boundary. We have punted on the problem for quite a while because it hasn't been /that/ significant of an issue.

We also have the problem that changing how thunks behave requires changing details both in the JIT and inside the thunk generator itself, which means encoding knowledge on both sides and can be a bit unwieldy. This gets the conversation going again for making thunks even more optimal by supporting an inregister implementation, so that in the typical case all of the arguments stay inside registers and don't touch the stack. The remaining thunk functions can still fall back to the stack implementation as needed.

As for why I thought this was interesting to start looking into: I did some initial investigation into what exactly the overhead of FEX's thunking is by microbenchmarking VDSO. I made a quick and dirty implementation of __vdso_time which ran entirely inregister, versus the existing path, and in a microbench I saw the average time per call improve from 45.300797ns down to 33.422434ns. That's ~11.87ns removed by keeping the arguments in registers rather than moving them to the stack, or only 73.7% of the original time per call.
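For reference, the microbench was just a tight loop over the call. Something along these lines is enough to see the difference (this is a sketch of the shape of the harness, not the exact benchmark used for the numbers above):

```cpp
// Sketch of a microbench: time a tight loop of calls through the VDSO path
// and report the average per-call cost.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <ctime>

int main() {
  constexpr uint64_t Iterations = 10'000'000;

  auto Begin = std::chrono::steady_clock::now();
  for (uint64_t i = 0; i < Iterations; ++i) {
    // time() typically resolves to __vdso_time, which goes through the thunked VDSO path.
    volatile time_t t = time(nullptr);
    (void)t;
  }
  auto End = std::chrono::steady_clock::now();

  double NsPerCall =
    std::chrono::duration<double, std::nano>(End - Begin).count() / Iterations;
  printf("%f ns per call\n", NsPerCall);
  return 0;
}
```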

While this time improvement isn't guaranteed for every thunk function, it's a good starting point towards seeing how much thunks could /possibly/ be improved. We already know that games absolutely abuse __vdso_clock_gettime for busylooping, so this would improve how quickly they can busy loop, but that isn't a very interesting performance win on its own. Instead, this led me down the path of seeing how many API calls recent games are making these days.

The initial game I grabbed a frame capture from was Spider-Man: Miles Morales. The frame I grabbed had 56k API calls to libvulkan. A decent amount but not an absolute ton. I then decided to grab a frame from Civilization 6's Graphics benchmark, 333,601 Vulkan API calls in one frame! Money.

So if we assumed the 11.87ns case per api call getting saved:

  • Miles Morales: ~665,188.328 ns per frame saved - 0.665ms per frame saved
  • Civ 6: 3,962,633.775 ns per frame saved - 3.962ms per frame saved!

Okay cool. So assuming we do actually improve the thunk call overhead by ~11.87ns per call, and assuming a game is doing 333k Vulkan calls per frame, we can potentially save nearly 4ms of CPU time off of a frame. And since a typical 60FPS game is running at 16.66ms per frame, that's nearly 25% of the frame time saved? Sounds like a win to me, let's do it!
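Written out, the back-of-the-envelope math (just restating the numbers above):

```cpp
// Back-of-the-envelope check of the frame-time math above.
constexpr double SavedPerCallNs = 11.87;          // measured __vdso_time delta
constexpr double CallsPerFrame = 333'601;         // Civ 6 graphics benchmark frame
constexpr double FrameBudgetMs = 1000.0 / 60.0;   // ~16.66ms at 60FPS

constexpr double SavedPerFrameMs = SavedPerCallNs * CallsPerFrame / 1'000'000.0; // ~3.96ms
constexpr double FrameFraction = SavedPerFrameMs / FrameBudgetMs;                // ~0.238, i.e. "nearly 25%"
```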

As a proof of concept I have only done a single VDSO function (clock_gettime) because it was easy on my end and showcased that it is possible and works. It also required a change in libVDSO_Guest.cpp which won't be required for Vulkan and GL.

This is going to require a decent amount of work on the thunkgen and also might require changes depending on what Alyssa wants to do with the IR. But it roughly matches where I want thunking to end up.

PS:
Another microbench of the clock_gettime VDSO path showed a reduction in overhead from 52.64ns to 49.93ns, a 2.70ns reduction per call, or a 5.15% reduction.

Assuming only a 2.7ns reduction in overhead, the Civ6 benchmark would still cut 0.903ms per frame off their time, which isn't quite as good, but is still reasonable. The true overhead reduction is likely to be somewhere in the middle of these values; we can't really tell until it's actually implemented.

With the ABI being described right after the hash of the instruction, we
can now take advantage of inregister ABI handling.

- This requires passing up to 13 registers to the IR operation
   - Six GPRs, six FPRs, and one return value which is a GPR or FPR
   - Theoretically there can be a thunk function which maxes that out
     entirely
- Inregister currently only exists for x86-64, because nothing really
  uses `vectorcall` for 32-bit.
- Requires passing all of this information to the backend so it can map
  the x86-64 SysV ABI directly to the AAPCS64 equivalent (see the
  mapping sketch below).
- Missing optimization: `SpillStaticRegs` only needs to spill specific
  registers rather than all of them.
- This takes advantage of the ABI semantics matching pretty well
  between AAPCS64 and Linux SysV x86-64, which is described near the
  `ThunkABIFlags` enum.
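For context, the direct SysV-to-AAPCS64 mapping this relies on looks roughly like the following. This is general calling-convention knowledge plus the six/six limit from the list above, not the actual table in the JIT:

```cpp
// Illustration of how the first six Linux SysV x86-64 argument registers line
// up with their AAPCS64 equivalents.
#include <array>
#include <string_view>
#include <utility>

using RegPair = std::pair<std::string_view, std::string_view>; // {x86-64, AArch64}

constexpr std::array<RegPair, 6> GPRArgs {{
  {"rdi", "x0"}, {"rsi", "x1"}, {"rdx", "x2"}, {"rcx", "x3"}, {"r8", "x4"}, {"r9", "x5"},
}};
constexpr std::array<RegPair, 6> FPRArgs {{
  {"xmm0", "v0"}, {"xmm1", "v1"}, {"xmm2", "v2"}, {"xmm3", "v3"}, {"xmm4", "v4"}, {"xmm5", "v5"},
}};
// Return value: rax -> x0 for integers/pointers, xmm0 -> v0 for float/double.
// Six GPR args + six FPR args + one return register = the 13 registers mentioned above.
```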
… inregister ABI

This is just a proof of concept for getting thunkgen wired up just
enough that I can get a workable example for VDSO.

Some assumptions made here:
- Only symbols explicitly marked with `inregister_abi` use this
- There is zero sanity checking if the number of FPRs and GPRs is within
  limits
- There is zero sanity checking if the types actually map correctly
- There is zero automated determination to enable inregister ABI
  handling, even though that should be possible.

I don't expect this code to get merged; I just needed it to give me
something to work with.

There is one nicety seen from this implementation:
- No need for a `fexfn_pack_` for in-register implementations.
  - Instead the thunk aliases directly to the thunk function!
  - Saves a branch because the JIT is doing the heavy lifting here
  - Needed a little bit of fixup in libVDSO_Guest because it is special
    cased there.

Future implementation details should make this nicer:
- thunkgen should understand ABI limitations and be able to determine if
  a function can be safely switched to inregister_abi (a rough sketch of
  such a check follows after this list).
  - If the argument type is a pointer, integer, or float/double, accept it.
  - If the number of GPRs is within the max, accept it.
  - If the number of FPRs is within the max, accept it.
  - If the return type is void, integer, or float/double, accept it.
  - If all passed-in types are 64 bits or less, accept it.
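A rough sketch of what such a check could look like (the type names, fields, and limits here are assumptions for illustration, not thunkgen's existing API):

```cpp
// Hypothetical "is this function inregister-eligible?" check.
#include <cstddef>
#include <vector>

enum class ArgKind { Pointer, Integer, Float, Double, Other };

struct FunctionSignature {
  std::vector<ArgKind> Args;   // Argument kinds, each assumed <= 64 bits wide.
  ArgKind ReturnKind;          // Ignored when ReturnsVoid is set.
  bool ReturnsVoid;
};

constexpr size_t MaxGPRs = 6;
constexpr size_t MaxFPRs = 6;

bool CanUseInregisterABI(const FunctionSignature& Sig) {
  size_t GPRs = 0;
  size_t FPRs = 0;

  for (ArgKind Kind : Sig.Args) {
    switch (Kind) {
    case ArgKind::Pointer:
    case ArgKind::Integer: ++GPRs; break;
    case ArgKind::Float:
    case ArgKind::Double:  ++FPRs; break;
    default: return false; // Aggregates, >64-bit types, etc. fall back to the stack ABI.
    }
  }

  // Return type must be void, integer/pointer, or float/double.
  if (!Sig.ReturnsVoid && Sig.ReturnKind == ArgKind::Other) {
    return false;
  }

  return GPRs <= MaxGPRs && FPRs <= MaxFPRs;
}
```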

Prior investigation has shown that Vulkan effectively compresses down to
24 different ABI functions: 12 of them hit the full-GPR path, 4 hit the
GPR+FPR path, and 8 have too many arguments to sanely convert to
inregister. This is quite low considering Vulkan has around 641 function
declarations.
@Sonicadvance1 (Member Author)

@alyssarosenzweig how do you feel about the IR additions for this?

@alyssarosenzweig (Collaborator) commented Oct 20, 2024

> • This requires passing up to 13 registers to the IR operation

This will explode RA. maybe not if 6 are FPRs? Definitely close to the edge at least

@alyssarosenzweig (Collaborator)

> • This requires passing up to 13 registers to the IR operation
>
> This will explode RA. maybe not if 6 are FPRs? Definitely close to the edge at least

If this works (I should check..) no objection from an IR perspective

@Sonicadvance1 (Member Author)

> • This requires passing up to 13 registers to the IR operation
>
> This will explode RA. maybe not if 6 are FPRs? Definitely close to the edge at least
>
> If this works (I should check..) no objection from an IR perspective

I haven't tried a max-spec IR operation that actually used all GPRs and FPRs, but theoretically it should work? If it doesn't, we could limit the count still.
