Set TP argument correctly when instantiating PagedKVCacheManager (IBM#94)

#### Motivation

Users are seeing runtime errors when trying to use TP>1 with speculative decoding.

#### Modifications

Set the tensor parallel argument correctly when instantiating the PagedKVCacheManager: pass `self.engine.world_size` as `tensor_parallel_size` instead of the hard-coded `1`.

#### Result

I have verified that this change resolves the reported issue.

#### Related Issues

https://huggingface.co/ibm-fms/llama3-8b-accelerator/discussions/1

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
tdoublep · May 10, 2024 · ddc56ee
1 parent e87d462
Showing 1 changed file with 1 addition and 1 deletion.
diff --git a/server/text_generation_server/models/paged_causal_lm.py b/server/text_generation_server/models/paged_causal_lm.py
@@ -327,7 +327,7 @@ def __init__(
             model_config.num_attention_heads,
             model_config.hidden_size,
             kv_heads=model_config.num_key_value_heads,
-            tensor_parallel_size=1,
+            tensor_parallel_size=self.engine.world_size,
             dtype=dtype,
             device=self.device,
             total_num_gpu_blocks=total_num_gpu_blocks,
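
To illustrate why a hard-coded `tensor_parallel_size=1` can fail once TP>1, here is a minimal sketch of how a paged KV cache might size its per-rank blocks. It is not the actual PagedKVCacheManager implementation: the function `local_kv_block_shape`, the `block_size` default, and the assumption that KV heads are split evenly across ranks are all hypothetical, chosen only to show the shape mismatch. The example numbers (32 attention heads, hidden size 4096, 8 KV heads) are assumed values for an 8B Llama-style model.

```python
def local_kv_block_shape(num_heads, hidden_size, kv_heads,
                         tensor_parallel_size, block_size=16):
    """Hypothetical helper: shape of one KV cache block on a single rank.

    Assumes KV heads are sharded evenly across tensor-parallel ranks.
    """
    if kv_heads % tensor_parallel_size != 0:
        raise ValueError("kv_heads must be divisible by tensor_parallel_size")
    head_dim = hidden_size // num_heads          # per-head feature size
    local_kv_heads = kv_heads // tensor_parallel_size  # heads held by this rank
    return (block_size, local_kv_heads, head_dim)


# Assumed config values for illustration only.
num_heads, hidden_size, kv_heads = 32, 4096, 8
world_size = 2  # stands in for self.engine.world_size

# Before the fix: cache sized as if the model were not sharded.
cache_shape_before = local_kv_block_shape(num_heads, hidden_size, kv_heads, 1)

# After the fix: cache sized for the heads this rank actually holds.
cache_shape_after = local_kv_block_shape(num_heads, hidden_size, kv_heads, world_size)

print(cache_shape_before)  # (16, 8, 128)
print(cache_shape_after)   # (16, 4, 128)
```

Under these assumptions, each rank of a TP=2 model produces keys and values for 4 heads, so a cache laid out for 8 heads no longer matches the tensors written into it. That kind of mismatch is one plausible source of the reported runtime errors.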