Hi, vLLM community,

I want to make vLLM support new hardware: Tenstorrent's Grayskull (a general-purpose DLA, similar in role to a CUDA GPU, but not CUDA). After reading the documentation and the code, I have some understanding and some questions, and I need the community's help to clarify my thoughts and check my understanding. Please correct me if I have any misunderstandings.
My understandings
The essential part of vLLM is PagedAttention, a highly optimized "memory paging mechanism" implemented in CUDA.
The kernel is exposed to Python via the bindings in torch_bindings.cpp.
To utilize the Tenstorrent Grayskull, I have to:
Implement PagedAttention with a Tenstorrent Grayskull kernel. (That will be a huge amount of work.)
Expose the kernel to Python with bindings.
What I DON'T have to do:
Modify the implementations of models that vLLM already supports, because they already use vLLM's interface.
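To make sure I understand the "memory paging" idea correctly, here is my mental model as a pure-Python sketch (this is only an illustration of the concept, not vLLM's actual data structures): each sequence's KV cache is split into fixed-size physical blocks, and a per-sequence block table maps logical token positions to physical blocks, much like OS virtual-memory paging.

```python
# Conceptual sketch of a paged KV cache (illustration only, not vLLM code).
BLOCK_SIZE = 4  # tokens per physical block (chosen arbitrarily here)

class PagedKVCache:
    def __init__(self):
        self.blocks = []        # physical block storage (non-contiguous per sequence)
        self.block_tables = {}  # seq_id -> list of physical block indices
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append(self, seq_id, kv):
        """Append one token's KV entry to a sequence, allocating blocks on demand."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.seq_lens.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:  # current block is full -> allocate a fresh one
            self.blocks.append([None] * BLOCK_SIZE)
            table.append(len(self.blocks) - 1)
        self.blocks[table[n // BLOCK_SIZE]][n % BLOCK_SIZE] = kv
        self.seq_lens[seq_id] = n + 1

    def gather(self, seq_id):
        """Read a sequence's KV entries back in logical order via the block table."""
        table = self.block_tables[seq_id]
        return [self.blocks[table[i // BLOCK_SIZE]][i % BLOCK_SIZE]
                for i in range(self.seq_lens[seq_id])]
```

If this model is right, the attention kernel for a new backend mainly has to read K/V through such a block table instead of from contiguous tensors.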
My questions
I saw there are 2 versions of the kernel, v1 and v2. Do I need to implement v1, or can I just go with v2?
Where can I find a list of the APIs that I have to implement? I am afraid I may have missed something. In torch_bindings.cpp I saw a lot of operations being bound, but do I need to implement them all, or just paged_attention_v2()?
Can I first modify only the forward() function to adapt to vLLM's interface, without implementing PagedAttention? Would it work, just with worse performance?
Are there any special considerations caused by quantization?
Is there anything else I missed that I should know?
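For question 3 above, the fallback I have in mind is plain softmax attention over a contiguous KV list, without any paging. A pure-Python sketch of what I mean (hypothetical, just to make the question concrete):

```python
import math

def naive_attention(q, keys, values):
    """Single-query attention over contiguous K/V lists:
    softmax(q . K / sqrt(d)) . V. A paged kernel computes the same result,
    but reads K/V through a block table instead of a contiguous buffer."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)                           # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]
```

If something like this behind forward() is enough to get correct outputs (only slower, and wasting memory on contiguous KV), I could bring the backend up incrementally.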
Thank you for reading my long list of questions, and thanks in advance for any help :D