Streaming API and web page for Large Language Models, based on Python.
This repository contains:
- Flask API: real token-level streaming generation for LLMs, served through a streaming response interface.
- Gradio app: a simple LLM web page.
- Request client: fast back-end requests against the API.
Take Llama3 as an example:
- Follow the Llama3 download instructions to fetch the Meta-Llama-3-8B-Instruct model, or get it from huggingface / modelscope.
- Follow the Llama3 quick-start to install the dependencies for Llama3.
Start our project:

- Install the dependencies for this repository:

      pip install flask gradio transformers

- [Optional] Modify the settings in settings.py.

- Run the Flask service:

      python llm_service.py --host 0.0.0.0 --port 8800 --ckpts /Meta-Llama-3-8B-Instruct

- Run the Gradio app:

      gradio llm_app.py --address http://127.0.0.1:80/

- Invoke the service directly:

      python llm_request.py --address http://127.0.0.1:80/
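As a rough illustration of the streaming response interface, a Flask endpoint can return a generator so that chunks are flushed to the client as they are produced, instead of waiting for the whole completion. The sketch below uses a fake token source; the route and function names are illustrative and are not the repository's actual llm_service.py code.

```python
# Minimal sketch of a token-streaming Flask endpoint (illustrative only).
from flask import Flask, Response

app = Flask(__name__)

def fake_token_stream(prompt):
    # Stand-in for the real LLM generator: yields one token at a time.
    for token in ["Hello", ",", " ", "world", "!"]:
        yield token

@app.route("/generate/<prompt>")
def generate(prompt):
    # A Response built from a generator is sent with chunked transfer
    # encoding, so the client can read tokens as they arrive.
    return Response(fake_token_stream(prompt), mimetype="text/plain")

# The real service would then start with something like:
# app.run(host="0.0.0.0", port=8800)
```

On the client side, a request made with `stream=True` can then consume the body incrementally rather than buffering the full reply.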
The project initially used the TextIteratorStreamer that ships with the official transformers library for streaming output, but generation was still very slow. After some investigation, I found that TextIteratorStreamer effectively converts "print-ready text" into a streaming structure: the LLM first has to finish generating a whole text block before it is converted, which is not what I wanted. I wanted the LLM to yield each token as soon as it is generated.
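The desired behavior can be sketched as a generator that yields each token id the moment the decoding step produces it. The "model" below is a mock that just increments the last id; a real implementation would run one transformer forward pass per step.

```python
# Token-by-token streaming sketch (mock model for illustration only).
def mock_next_token(ids):
    # Stand-in for a real forward pass: always predicts previous id + 1.
    return ids[-1] + 1

def stream_generate(input_ids, eos_id, max_new_tokens=10):
    ids = list(input_ids)
    for _ in range(max_new_tokens):
        nxt = mock_next_token(ids)  # one decoding step in a real model
        if nxt == eos_id:
            break
        ids.append(nxt)
        yield nxt  # the token is yielded as soon as it is produced

tokens = list(stream_generate([1], eos_id=5))
# tokens == [2, 3, 4]
```

Because the function is a generator, the caller (e.g. a Flask response) receives tokens incrementally instead of waiting for the full sequence.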
Subsequently, I came across LowinLi's project, which truly implements streaming output for pretrained models. When I eagerly applied it to the Llama3 model, it threw an error. After debugging, I found that Llama3 has two eos_tokens, which caused the generation loop to produce negative ids. So I modified that project: I cleaned up redundancies, adapted it for Llama3, and made it easier to read and understand.
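The essence of the Llama3 fix is to treat end-of-sequence as a *set* of token ids rather than a single id. The ids below are what the Llama3 tokenizer config reports for `<|end_of_text|>` and `<|eot_id|>`; treat them as an assumption and re-check them against your local checkpoint.

```python
# Sketch of the fix: stop on ANY of Llama3's eos tokens.
# Assumed ids: 128001 = <|end_of_text|>, 128009 = <|eot_id|>.
EOS_IDS = frozenset({128001, 128009})

def should_stop(token_id, eos_ids=EOS_IDS):
    # A membership test handles any number of eos tokens; comparing
    # against a single eos id is what broke the loop on Llama3.
    return token_id in eos_ids
```

A check like this replaces the single `token_id == eos_id` comparison inside the generation loop.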