Question about recomputation #1740

puddingfjz · 2023-11-21T11:33:10Z

Why can recomputation be implemented by concatenating the prompt tokens and the generated tokens into a new prompt and running a single prefill pass over it?
For example, assume the prompt tokens are [t1, t2] and the generated tokens are [t3, t4]. As I understand it, the KV cache entry for t3 is computed from the hidden states of t3 in each layer, which depend on t1 and t2 but not on t4.
However, if we run the prefill stage on a new prompt [t1, t2, t3, t4], it seems the KV cache of t3 would be affected by t4.

I am really confused here and hope someone can help with this.
Thanks!

Answered by puddingfjz

puddingfjz · 2023-11-23T09:10:12Z

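The prompt stage uses BlockDiagonalCausalMask.from_seqlens() as the attention bias. Because that mask is causal, each token attends only to itself and to earlier tokens in its own sequence, so the KV cache of t3 computed from the new prompt [t1, t2, t3, t4] is not affected by t4.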
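
To make the effect of the causal mask concrete, here is a minimal sketch in plain Python. It is a toy model, not vLLM code, and it does not call xformers: the helpers `causal_layer` and `prefill`, the random weights, and the 4-dimensional embeddings are all made up for illustration. It runs a causally masked prefill over [t1, t2, t3, t4] and over [t1, t2, t3], then compares the per-layer K/V entries of the first three positions.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def causal_layer(xs, Wq, Wk, Wv):
    """One toy self-attention layer with a causal mask (hypothetical helper).
    Returns the new hidden states and the (K, V) pairs that would be cached."""
    d = len(xs[0])
    qs = [matvec(Wq, x) for x in xs]
    ks = [matvec(Wk, x) for x in xs]
    vs = [matvec(Wv, x) for x in xs]
    out = []
    for i, q in enumerate(qs):
        # Causal mask: position i only sees positions 0..i.
        scores = [sum(a * b for a, b in zip(q, ks[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        attn = [sum(w / total * vs[j][c] for j, w in enumerate(weights))
                for c in range(d)]
        # Residual connection, so later layers depend on earlier ones.
        out.append([x_c + a_c for x_c, a_c in zip(xs[i], attn)])
    return out, list(zip(ks, vs))

def prefill(xs, layers):
    """Run all layers over the whole sequence; return per-layer KV caches."""
    caches = []
    for Wq, Wk, Wv in layers:
        xs, kv = causal_layer(xs, Wq, Wk, Wv)
        caches.append(kv)
    return caches

random.seed(0)
d, n_layers = 4, 2  # assumed toy sizes

def rand_mat():
    return [[random.uniform(-1, 1) for _ in range(d)] for _ in range(d)]

layers = [(rand_mat(), rand_mat(), rand_mat()) for _ in range(n_layers)]
# Stand-in embeddings for t1..t4.
tokens = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(4)]

full = prefill(tokens, layers)         # recomputation over [t1, t2, t3, t4]
partial = prefill(tokens[:3], layers)  # what t3 originally saw: [t1, t2, t3]

# Largest difference between the K/V entries of t1..t3 in the two runs.
max_diff = max(
    abs(a - b)
    for layer_full, layer_part in zip(full, partial)
    for (k_f, v_f), (k_p, v_p) in zip(layer_full, layer_part)
    for a, b in zip(k_f + v_f, k_p + v_p)
)
print("max K/V difference for t1..t3:", max_diff)
```

The difference should be zero up to floating-point rounding, in every layer: under a causal mask, the hidden state of t3 never mixes in t4, so appending t4 to the prompt does not change what is cached for t3. Without the mask (letting each position attend to all four tokens), the two runs would generally disagree.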
