
Return 32-bit for external accumulation #1088

Open
grimulkan opened this issue Jul 24, 2024 · 0 comments
Would it be possible to return higher-precision tensors where relevant, as an option, to allow users to break the attention computation into blocks?

For example, in ring + flash attention (https://github.com/zhuzilin/ring-flash-attention), the computation is broken up across a number of GPUs, each calling the flash attention kernel, and the results are exchanged and accumulated outside flash-attn. But since the returned tensors are all 16-bit, even accumulating them in a 32-bit buffer seems to be less accurate than a single flash-attn call over the full sequence.
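To illustrate what "accumulate the results outside flash-attn" means here, below is a minimal sketch in plain PyTorch (not the flash-attn or ring-flash-attention API) of merging two partial attention results computed over disjoint key/value blocks, assuming each block call returns a partial output plus its per-row log-sum-exp; tensor names and shapes are illustrative only:

```python
import torch

def merge_blocks(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention results computed over disjoint
    key/value blocks, using their per-row log-sum-exp statistics."""
    # log-sum-exp over the union of both key blocks
    new_lse = torch.logaddexp(lse_a, lse_b)           # e.g. [batch, heads, q_len]
    # renormalization weights for each partial output
    w_a = torch.exp(lse_a - new_lse).unsqueeze(-1)
    w_b = torch.exp(lse_b - new_lse).unsqueeze(-1)
    merged = w_a * out_a + w_b * out_b                # e.g. [batch, heads, q_len, d]
    return merged, new_lse

# If out_a / out_b have already been downcast to fp16/bf16 by the kernel,
# upcasting them here (out_a.float(), out_b.float()) cannot recover the
# rounding that happened at the downcast, which is the loss described above.
```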

I'm guessing flash-attn internally accumulates at higher precision and downcasts for the return? If so, would it be possible to return the raw full-precision tensors for such purposes, assuming they exist in the internal implementation? If everything is truly 16-bit internally, then maybe there is some other reason for the accuracy difference.
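As a small numerical illustration of the concern (not flash-attn internals, just a sketch of the rounding behavior): downcasting each partial result to fp16 before accumulating in an fp32 buffer introduces one rounding step per block, whereas accumulating in fp32 throughout and downcasting once only rounds at the end:

```python
import torch

torch.manual_seed(0)
partials = [torch.randn(4096, dtype=torch.float32) for _ in range(8)]
exact = sum(partials)  # fp32 reference

# Single pass: accumulate everything in fp32, downcast once at the end.
single = exact.to(torch.float16)

# Blocked pass: each partial is downcast to fp16 first (as a kernel would do
# for its return value), then accumulated in an fp32 buffer.
blocked = torch.zeros(4096, dtype=torch.float32)
for p in partials:
    blocked += p.to(torch.float16).float()
blocked = blocked.to(torch.float16)

print((single.float() - exact).abs().max())   # one final rounding step
print((blocked.float() - exact).abs().max())  # typically larger: each block
                                              # adds its own fp16 rounding
```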

This would apply to both the forward and backward calls.

zhuzilin/ring-flash-attention#42
