Has anyone in the community attempted to use Triton for streaming tokens, similar to OpenAI's ChatGPT?
I've come across the ModelStreamInfer method exposed in the grpc_service.proto interface, but it seems to respond with all the tokens at once.
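A note on the behavior described in the question: ModelStreamInfer is a bidirectional streaming RPC, but Triton only sends more than one response per request when the model is configured with a decoupled transaction policy (`model_transaction_policy { decoupled: true }` in config.pbtxt). Without that, the backend produces a single response carrying the full output, which looks like "all the tokens at once." Below is a minimal client-side sketch using `tritonclient.grpc`'s streaming API; the model name `llm` and the tensor names `text_input`/`text_output` are assumptions (they follow the TensorRT-LLM backend's convention) and must match your model.

```python
# Minimal sketch, not a verified solution. Assumes a decoupled model named
# "llm" whose config.pbtxt contains:
#
#     model_transaction_policy {
#       decoupled: true
#     }
#
# and whose I/O tensors are named "text_input" / "text_output" (assumed names;
# substitute your model's actual tensor names).
import queue
from functools import partial

import numpy as np
import tritonclient.grpc as grpcclient


def on_response(results: queue.Queue, result, error) -> None:
    # The client invokes this callback once per streamed response.
    results.put(error if error is not None else result)


def stream_tokens(prompt: str, model: str = "llm", url: str = "localhost:8001") -> None:
    results: queue.Queue = queue.Queue()
    client = grpcclient.InferenceServerClient(url=url)
    client.start_stream(callback=partial(on_response, results))

    text = grpcclient.InferInput("text_input", [1], "BYTES")
    text.set_data_from_numpy(np.array([prompt], dtype=np.object_))
    client.async_stream_infer(model_name=model, inputs=[text])

    while True:
        item = results.get()
        if isinstance(item, Exception):
            client.stop_stream()
            raise item
        token = item.as_numpy("text_output")
        if token is not None:
            print(token[0].decode(), end="", flush=True)
        # Newer Triton releases flag the last response of a decoupled stream
        # with the `triton_final_response` parameter; older releases may need
        # a backend-specific end-of-stream marker instead.
        final = item.get_response().parameters.get("triton_final_response")
        if final is not None and final.bool_param:
            break

    client.stop_stream()
    print()


if __name__ == "__main__":
    stream_tokens("Tell me a joke.")
```

The queue decouples the gRPC callback thread from the consuming code, so the callback stays cheap and server-side errors are re-raised where they are easiest to handle.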
Replies: 1 comment
-
If you just want an inference server solution for this, you can try https://github.com/autonomi-ai/nos/tree/main/examples/tutorials/03-llm-streaming-chat. Llama 7B works out of the box locally with a gRPC streaming interface and is pretty easy to set up.
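For anyone who hasn't used a gRPC streaming interface before, consuming one looks roughly like the sketch below. The `ChatService` stub, `ChatRequest` message, and `token` field are hypothetical placeholders, not NOS's actual API; see the linked tutorial for the real .proto definitions.

```python
# Hypothetical sketch of consuming a server-streaming chat RPC; the service,
# message, and field names are placeholders, not NOS's real interface.
import grpc

# Stand-ins for the modules `protoc` would generate from the service's .proto.
import chat_pb2
import chat_pb2_grpc


def stream_chat(prompt: str, target: str = "localhost:50051") -> None:
    with grpc.insecure_channel(target) as channel:
        stub = chat_pb2_grpc.ChatServiceStub(channel)
        # A server-streaming RPC returns an iterator; each item is yielded as
        # the server sends it, so tokens print incrementally rather than in
        # one final batch.
        for reply in stub.Chat(chat_pb2.ChatRequest(message=prompt)):
            print(reply.token, end="", flush=True)
    print()


if __name__ == "__main__":
    stream_chat("Tell me a joke.")
```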