Speeding up exp with lookup tables? #1261

Open
ethansmith2000 opened this issue Oct 8, 2024 · 3 comments

Comments


ethansmith2000 commented Oct 8, 2024

Hi, I saw this recent post about FA-3 by PyTorch and noticed this bit:
[screenshot from the PyTorch FA-3 post]

Something I had been curious about for a while is how many values computed by exp() in bfloat16 map to out-of-bounds or degenerate outputs. Namely, of the 65536 possible input values, only 2267 map to outputs that are not 0, 1, or inf, and of course these all lie in predictable segments below or above certain thresholds.
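That enumeration can be sketched in a few lines, emulating bfloat16 by truncating float32 (bfloat16 is the top 16 bits of a float32). This is my own illustration; the exact count depends on the rounding mode used when converting exp()'s result back to bfloat16, so truncation here may not reproduce 2267 exactly:

```python
import numpy as np

# Every bfloat16 bit pattern, widened to float32.
bits = np.arange(1 << 16, dtype=np.uint32) << 16
x = bits.view(np.float32)

# exp() in higher precision, then truncate back to bfloat16.
with np.errstate(over="ignore", invalid="ignore"):
    y = np.exp(x.astype(np.float64)).astype(np.float32)
y_bf16 = ((y.view(np.uint32) >> 16) << 16).view(np.float32)

# Inputs whose exp() is something other than 0, 1, or inf (NaNs excluded).
interesting = np.isfinite(y_bf16) & (y_bf16 != 0.0) & (y_bf16 != 1.0)
print(int(interesting.sum()))
```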

Knowing this, is it possible to speed up the computation with a simple lookup table? I understand memory is a precious resource, so this may backfire, but I was curious whether it makes any sense at all.
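As a sketch of the idea (my own illustration, not anything in flash-attention): the entire bfloat16 exp() table fits in 65536 entries, and a lookup is just reindexing by the input's bit pattern:

```python
import numpy as np

# Precompute exp() for all 65536 bfloat16 inputs, indexed by bit pattern.
inputs = (np.arange(1 << 16, dtype=np.uint32) << 16).view(np.float32)
with np.errstate(over="ignore", invalid="ignore"):
    exp_table = np.exp(inputs.astype(np.float64)).astype(np.float32)

def exp_bf16_lut(x):
    """exp() via table lookup; x holds bfloat16 values stored as float32."""
    idx = np.asarray(x, dtype=np.float32).view(np.uint32) >> 16
    return exp_table[idx]

print(exp_bf16_lut(np.float32([0.0, 1.0, -2.0])))
```

On a GPU the question becomes where the table lives: 65536 float32 entries is 256 KB, more than a single SM's shared memory, so the full table would sit in global memory behind the caches.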

FP8 would let us slim this table down even more (though I know ops are sometimes cast to higher precision, so I'm not really sure).
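With an 8-bit float the table collapses to 256 entries. A sketch using the e5m2 format, which is just a float16 truncated to its top byte (e4m3 would need explicit decoding, so e5m2 is an assumption made here for brevity):

```python
import numpy as np

# All 256 e5m2 bit patterns, widened to fp16 and then fp32.
bits = np.arange(256, dtype=np.uint16) << 8
x = bits.view(np.float16).astype(np.float32)

# 256-entry exp() table: only 1 KB in float32, trivially cacheable.
with np.errstate(over="ignore", invalid="ignore"):
    exp_table = np.exp(x)
print(exp_table.shape)
```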

@ethansmith2000 ethansmith2000 changed the title Speeding up exp? Speeding up exp with lookup tables? Oct 8, 2024
Contributor

tridao commented Oct 8, 2024

That's a good idea!
I haven't tried it but it could potentially speed things up. The challenge might be that multiple threads indexing into the lookup table can cause bank conflicts. It's not immediately clear if this would be faster or slower than calling the exponential function.
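The bank-conflict concern can be illustrated on the CPU. A hypothetical sketch, assuming the usual layout of 32 banks of 4-byte words on recent NVIDIA GPUs: a warp's loads are serialized by the worst-loaded bank, while threads reading the same word in a bank get a free broadcast:

```python
import numpy as np

def conflict_degree(indices, n_banks=32):
    """Replays needed for one warp's loads: max distinct words in any bank."""
    indices = np.asarray(indices)
    degree = 1
    for b in range(n_banks):
        words = np.unique(indices[indices % n_banks == b])
        degree = max(degree, len(words))
    return degree

# Stride-1 access: each thread hits a different bank -> conflict-free.
print(conflict_degree(np.arange(32)))  # 1
# Data-dependent exp() lookup indices land wherever the inputs say.
rng = np.random.default_rng(0)
print(conflict_degree(rng.integers(0, 1024, size=32)))
```

With data-dependent indices into a LUT, nothing guarantees the conflict-free pattern, which is exactly why the comparison against the exponential instruction is hard to call in advance.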

@SonicCodes

[Screenshot of benchmark results.] Tried a bunch of optimizations here and there; there's a consistent ~10% slowdown versus expf when using a shared-memory LUT (1024 entries) on fp32.

But I could see a way to utilize registers: it seems you can get about a 2x speedup over expf if you can fit the codebook into registers as well. Loading and unloading takes time, which means your threads have to do longer stretches of computation to realize the speedup, but it seems interesting enough :)

@TanjIsGray

A ratio of just 512x slower than tensor ops is actually pretty good: one clock is 1024 ops in a 32x32 MatMul. A lookup table in ROM would be around 250 sq um, small enough to have one per CUDA core but far too large to fit inside a MatMul node. So that speed ratio reflects the beauty of the MatMul, not a flaw in the exponent operator.

3.9 TOps of special functions divided by 17,424 CUDA cores is around 0.22 GOps per core, roughly one every several clocks. Probably pipelined, but being able to launch a special op that often per tiny core is damn good.
