diff --git a/llm/qwen/README.md b/llm/qwen/README.md
index cd2c88f5e75..6846fc71f2f 100644
--- a/llm/qwen/README.md
+++ b/llm/qwen/README.md
@@ -67,6 +67,35 @@ curl http://$ENDPOINT/v1/chat/completions \
   }' | jq -r '.choices[0].message.content'
 ```
 
+## Running Multimodal Qwen2-VL
+
+
+1. Start serving Qwen2-VL:
+
+```console
+sky launch -c qwen2-vl qwen2-vl-7b.yaml
+```
+2. Send a multimodal request to the endpoint for completion:
+```bash
+ENDPOINT=$(sky status --endpoint 8000 qwen2-vl)
+
+curl http://$ENDPOINT/v1/chat/completions \
+    -H 'Content-Type: application/json' \
+    -H 'Authorization: Bearer token' \
+    --data '{
+      "model": "Qwen/Qwen2-VL-7B-Instruct",
+      "messages": [
+        {
+          "role": "user",
+          "content": [
+            {"type": "text", "text": "Convert this logo to ASCII art"},
+            {"type": "image_url", "image_url": {"url": "https://pbs.twimg.com/profile_images/1584596138635632640/HWexMoH5_400x400.jpg"}}
+          ]
+        }],
+      "max_tokens": 1024
+  }' | jq .
+```
+
 ## Scale up the service with SkyServe
 
 1. With [SkyPilot Serving](https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html), a serving library built on top of SkyPilot, scaling up the Qwen service is as simple as running:
diff --git a/llm/qwen/qwen2-vl-7b.yaml b/llm/qwen/qwen2-vl-7b.yaml
new file mode 100644
index 00000000000..cc7600bbd9e
--- /dev/null
+++ b/llm/qwen/qwen2-vl-7b.yaml
@@ -0,0 +1,36 @@
+envs:
+  MODEL_NAME: Qwen/Qwen2-VL-7B-Instruct
+
+service:
+  # Path used to check the readiness of the replicas.
+  readiness_probe:
+    path: /v1/chat/completions
+    post_data:
+      model: $MODEL_NAME
+      messages:
+        - role: user
+          content: Hello! What is your name?
+      max_tokens: 1
+    initial_delay_seconds: 1200
+  # How many replicas to manage.
+  replicas: 2
+
+
+resources:
+  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
+  disk_tier: best
+  ports: 8000
+
+setup: |
+  # Install a newer transformers version for
+  # Qwen2-VL support.
+  pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830
+  pip install vllm==0.6.1.post2
+  pip install vllm-flash-attn
+
+run: |
+  export PATH=$PATH:/sbin
+  vllm serve $MODEL_NAME \
+    --host 0.0.0.0 \
+    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
+    --max-model-len 2048 | tee ~/openai_api_server.log
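
For reference, the multimodal request shown in the README snippet above can also be issued from Python, since vLLM serves an OpenAI-compatible API. The following is a minimal sketch, assuming the `openai` Python package is installed and the `ENDPOINT` environment variable holds the value returned by `sky status --endpoint 8000 qwen2-vl`:

```python
import os

from openai import OpenAI

# Assumes ENDPOINT was exported beforehand, e.g.:
#   export ENDPOINT=$(sky status --endpoint 8000 qwen2-vl)
endpoint = os.environ["ENDPOINT"]

# The placeholder API key matches the "Bearer token" header used in the curl example.
client = OpenAI(base_url=f"http://{endpoint}/v1", api_key="token")

response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Convert this logo to ASCII art"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://pbs.twimg.com/profile_images/1584596138635632640/HWexMoH5_400x400.jpg"
                },
            },
        ],
    }],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```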