Please help me to understand: in the forward implementation of PagedAttention in /vllm-project/vllm/tree/main/vllm/model_executor/layers/attention.py, why do the tensors sent to xops.memory_efficient_attention_forward use 'query.unsqueeze(0)' rather than 'query = query.unflatten(0, (batch_size, seq_len))', which would be consistent with the self.alibi_slopes branch?
In my opinion, at the moment the forward function is entered,
query has shape [batch_size, seq_len, num_heads * head_size], so the tokens of different sequences stay in their own batch entry.
Then, after query = query.view(-1, self.num_heads, self.head_size), all tokens of all sequences in the batch are flattened into the first dim.
Before being fed to xops.memory_efficient_attention_forward, shouldn't the tokens be reshaped back with query.unflatten(0, (batch_size, seq_len))? Otherwise, query.unsqueeze(0) will concat the sequences of different batch entries into a single long one, and I guess that will make the predictions wrong.
for example:
when query = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
then unsqueeze(0) will feed the embedded sequence ["Hello, my name is The president of the United States is The capital of France is The future of AI is"] to attention as if it were one sequence.
Am I making a mistake somewhere?
Is there some trick in the 'query.unsqueeze(0)', so that there is no confusion between sequences?
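To make the shape question concrete, here is a small sketch of the two layouts being compared. This is not the actual vLLM code; it just uses NumPy reshapes in place of the torch ops (unsqueeze/unflatten), with made-up sizes:

```python
import numpy as np

batch_size, seq_len, num_heads, head_size = 4, 7, 2, 8

# query after query.view(-1, self.num_heads, self.head_size):
# all tokens of all sequences flattened into the first dim
q = np.zeros((batch_size * seq_len, num_heads, head_size))
print(q.shape)  # (28, 2, 8)

# Variant A: query.unsqueeze(0) -> one "batch" holding every token,
# sequences laid out back to back along the token dim
q_unsqueeze = q[None, ...]
print(q_unsqueeze.shape)  # (1, 28, 2, 8)

# Variant B: query.unflatten(0, (batch_size, seq_len)) -> one batch
# entry per sequence, as in the self.alibi_slopes branch
q_unflatten = q.reshape(batch_size, seq_len, num_heads, head_size)
print(q_unflatten.shape)  # (4, 7, 2, 8)
```

So with variant A the kernel sees a single 28-token sequence unless something else (e.g. the attn_bias argument) tells it where one sequence ends and the next begins.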