Please help me to understand: in the forward implementation of PagedAttention in /vllm-project/vllm/tree/main/vllm/model_executor/layers/attention.py, why do the tensors sent to xops.memory_efficient_attention_forward use 'query.unsqueeze(0)' rather than 'query = query.unflatten(0, (batch_size, seq_len))', which would be consistent with the self.alibi_slopes branch?
In my opinion, at the moment the forward function is entered,
query has shape [batch_size, seq_len, num_heads * head_size], so the tokens of different sequences stay in their own batch entry.
Then, after query = query.view(-1, self.num_heads, self.head_size), all tokens of all sequences in the batch are flattened into the first dim.
Before being fed to xops.memory_efficient_attention_forward, shouldn't the tokens be reshaped back with query.unflatten(0, (batch_size, seq_len))? Otherwise, query.unsqueeze(0) will concat the sequences of different batch entries into a single long one, and I guess that will make the predictions wrong.
for example:
when query = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
then unsqueeze(0) will feed the embedded sequence ["Hello, my name is The president of the United States is The capital of France is The future of AI is"] to attention as if it were one sequence.
Am I making a mistake somewhere?
Is there some trick in the 'query.unsqueeze(0)', so that there is no confusion between sequences?
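To make the shape question concrete, here is a small sketch of the two layouts being compared. This is not the actual vLLM code; it just uses NumPy reshapes in place of the torch ops (unsqueeze/unflatten), with made-up sizes:

```python
import numpy as np

batch_size, seq_len, num_heads, head_size = 4, 7, 2, 8

# query after query.view(-1, self.num_heads, self.head_size):
# all tokens of all sequences flattened into the first dim
q = np.zeros((batch_size * seq_len, num_heads, head_size))
print(q.shape)  # (28, 2, 8)

# Variant A: query.unsqueeze(0) -> one "batch" holding every token,
# sequences laid out back to back along the token dim
q_unsqueeze = q[None, ...]
print(q_unsqueeze.shape)  # (1, 28, 2, 8)

# Variant B: query.unflatten(0, (batch_size, seq_len)) -> one batch
# entry per sequence, as in the self.alibi_slopes branch
q_unflatten = q.reshape(batch_size, seq_len, num_heads, head_size)
print(q_unflatten.shape)  # (4, 7, 2, 8)
```

So with variant A the kernel sees a single 28-token sequence unless something else (e.g. the attn_bias argument) tells it where one sequence ends and the next begins.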