
H2O implementation #24

Open

gopikrishnajha opened this issue Oct 13, 2024 · 1 comment

@gopikrishnajha
In the update_kv function of the H2OKVCluster class, I see this code:

attn_weights = torch.matmul(query_states[..., -self.window_size:, :], key_states.transpose(2, 3)) / math.sqrt(head_dim)

As far as I know, there is no concept of a window in H2O. Shouldn't the entire query_states matrix be used to compute attn_weights? Why do you slice out only the last window_size rows of the query states for the matrix multiplication here?

@Zefan-Cai
Owner

Thank you for pointing this out!
The current code is inconsistent with standard H2O, as you mentioned. We have tested performance both with and without using the entire query_states to compute the attention scores, and found that this has very limited influence on performance. We will address this inconsistency in the updated code.
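For anyone comparing the two behaviors being discussed, here is a minimal sketch of the per-key scoring. It is not the repo's actual update_kv; the helper name h2o_key_scores and the tensor shapes are hypothetical, chosen to mirror the snippet above. Passing window_size=None accumulates attention mass from every query, as in standard H2O; passing an integer reproduces the windowed slicing in the quoted line.

```python
import math
import torch

def h2o_key_scores(query_states, key_states, window_size=None):
    """Accumulated attention mass per key, used to rank heavy hitters.

    Assumed shapes (hypothetical, mirroring the snippet above):
      query_states, key_states: (batch, num_heads, seq_len, head_dim)
    """
    head_dim = query_states.shape[-1]
    if window_size is not None:
        # Repo variant: only the last `window_size` queries vote on key importance.
        query_states = query_states[..., -window_size:, :]
    # Same scaled dot product as the quoted line, just with a configurable query span.
    attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(head_dim)
    # Causal mask: query at absolute position i may only attend to keys <= i.
    # When the queries are windowed, their rows are offset by (k_len - q_len).
    q_len = query_states.shape[-2]
    k_len = key_states.shape[-2]
    causal = torch.triu(
        torch.ones(q_len, k_len, dtype=torch.bool, device=query_states.device),
        diagonal=k_len - q_len + 1,
    )
    attn_weights = attn_weights.masked_fill(causal, float("-inf"))
    probs = torch.softmax(attn_weights, dim=-1)
    # Total attention each key receives across the kept queries.
    return probs.sum(dim=-2)  # (batch, num_heads, k_len)

# Usage: compare standard H2O scoring against the windowed variant.
q = torch.randn(1, 8, 128, 64)
k = torch.randn(1, 8, 128, 64)
scores_full = h2o_key_scores(q, k)                  # standard H2O
scores_win = h2o_key_scores(q, k, window_size=32)   # variant in the quoted code
```

Ranking keys by either score tensor and keeping the top entries is the eviction step; the two variants differ only in how many queries contribute to that ranking.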
