[LLM] Update qwen examples (#3957)
* update qwen examples

* Fix misalign
Michaelvll authored Sep 18, 2024
1 parent e870839 commit 303d43f
Showing 4 changed files with 23 additions and 46 deletions.
24 changes: 12 additions & 12 deletions llm/qwen/README.md
@@ -3,9 +3,9 @@
[Qwen2](https://github.com/QwenLM/Qwen2) is one of the top open LLMs.
As of Jun 2024, Qwen1.5-110B-Chat is ranked higher than GPT-4-0613 on the [LMSYS Chatbot Arena Leaderboard](https://chat.lmsys.org/?leaderboard).

-📰 **Update (26 April 2024) -** SkyPilot now also supports the [**Qwen1.5-110B**](https://qwenlm.github.io/blog/qwen1.5-110b/) model! It performs competitively with Llama-3-70B across a [series of evaluations](https://qwenlm.github.io/blog/qwen1.5-110b/#model-quality). Use [serve-110b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/serve-110b.yaml) to serve the 110B model.
+📰 **Update (Jun 6 2024) -** SkyPilot now also supports the [**Qwen2**](https://qwenlm.github.io/blog/qwen2/) model! It further improves the competitive model, Qwen1.5.

-📰 **Update (6 Jun 2024) -** SkyPilot now also supports the [**Qwen2**](https://qwenlm.github.io/blog/qwen2/) model! It further improves the competitive model, Qwen1.5.
+📰 **Update (April 26 2024) -** SkyPilot now also supports the [**Qwen1.5-110B**](https://qwenlm.github.io/blog/qwen1.5-110b/) model! It performs competitively with Llama-3-70B across a [series of evaluations](https://qwenlm.github.io/blog/qwen1.5-110b/#model-quality). Use [qwen15-110b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/qwen15-110b.yaml) to serve the 110B model.

<p align="center">
<img src="https://i.imgur.com/d7tEhAl.gif" alt="qwen" width="600"/>
@@ -27,16 +27,16 @@ As of Jun 2024, Qwen1.5-110B-Chat is ranked higher than GPT-4-0613 on the [LMSYS

After [installing SkyPilot](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html), run your own Qwen model on vLLM with SkyPilot in 1-click:

-1. Start serving Qwen 110B on a single instance with any available GPU in the list specified in [serve-110b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/serve-110b.yaml) with a vLLM powered OpenAI-compatible endpoint (You can also switch to [serve-72b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/serve-72b.yaml) or [serve-7b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/serve-7b.yaml) for a smaller model):
+1. Start serving Qwen 110B on a single instance with any available GPU in the list specified in [qwen15-110b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/qwen15-110b.yaml) with a vLLM powered OpenAI-compatible endpoint (You can also switch to [qwen2-72b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/qwen2-72b.yaml) or [qwen2-7b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/qwen2-7b.yaml) for a smaller model):

```console
-sky launch -c qwen serve-110b.yaml
+sky launch -c qwen qwen15-110b.yaml
```
2. Send a request to the endpoint for completion:
```bash
-IP=$(sky status --ip qwen)
+ENDPOINT=$(sky status --endpoint 8000 qwen)

-curl http://$IP:8000/v1/completions \
+curl http://$ENDPOINT/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen1.5-110B-Chat",
@@ -47,7 +47,7 @@ curl http://$IP:8000/v1/completions \

3. Send a request for chat completion:
```bash
-curl http://$IP:8000/v1/chat/completions \
+curl http://$ENDPOINT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen1.5-110B-Chat",
@@ -69,7 +69,7 @@ curl http://$IP:8000/v1/chat/completions \

1. With [SkyPilot Serving](https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html), a serving library built on top of SkyPilot, scaling up the Qwen service is as simple as running:
```bash
-sky serve up -n qwen ./serve-72b.yaml
+sky serve up -n qwen ./qwen2-72b.yaml
```
This will start the service with multiple replicas on the cheapest available locations and accelerators. SkyServe will automatically manage the replicas, monitor their health, autoscale based on load, and restart them when needed.

@@ -82,13 +82,13 @@ sky serve status qwen
After a while, you will see the following output:
```console
Services
-NAME VERSION UPTIME STATUS REPLICAS ENDPOINT
+NAME VERSION UPTIME STATUS REPLICAS ENDPOINT
Qwen 1 - READY 2/2 3.85.107.228:30002

Service Replicas
-SERVICE_NAME ID VERSION IP LAUNCHED RESOURCES STATUS REGION
-Qwen 1 1 - 2 mins ago 1x Azure({'A100-80GB': 8}) READY eastus
-Qwen 2 1 - 2 mins ago 1x GCP({'L4': 8}) READY us-east4-a
+SERVICE_NAME ID VERSION ENDPOINT LAUNCHED RESOURCES STATUS REGION
+Qwen 1 1 - 2 mins ago 1x Azure({'A100-80GB': 8}) READY eastus
+Qwen 2 1 - 2 mins ago 1x GCP({'L4': 8}) READY us-east4-a
```
As shown, the service is now backed by 2 replicas, one on Azure and one on GCP, and the accelerator
type is chosen to be **the cheapest available one** on the clouds. That said, it maximizes the
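The updated README addresses the cluster through `sky status --endpoint 8000 qwen` instead of a raw IP and port. For reference, a minimal end-to-end check against the renamed 110B task could look like the sketch below; it assumes the `qwen` cluster launches cleanly and serves `Qwen/Qwen1.5-110B-Chat`, the model name used in the README's own requests.

```bash
# Sketch: launch the renamed 110B task, then hit the OpenAI-compatible endpoint.
# Assumes the cluster comes up healthy and serves Qwen/Qwen1.5-110B-Chat.
sky launch -c qwen qwen15-110b.yaml

ENDPOINT=$(sky status --endpoint 8000 qwen)

curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen1.5-110B-Chat",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "max_tokens": 64
      }'
```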
16 changes: 4 additions & 12 deletions llm/qwen/serve-110b.yaml → llm/qwen/qwen15-110b.yaml
@@ -24,20 +24,12 @@ resources:
   ports: 8000
 
 setup: |
-  conda activate qwen
-  if [ $? -ne 0 ]; then
-    conda create -n qwen python=3.10 -y
-    conda activate qwen
-  fi
-  pip install vllm==0.4.2
-  pip install flash-attn==2.5.9.post1
+  pip install vllm==0.6.1.post2
+  pip install vllm-flash-attn
 run: |
-  conda activate qwen
-  export PATH=$PATH:/sbin
-  python -u -m vllm.entrypoints.openai.api_server \
+  vllm serve $MODEL_NAME \
     --host 0.0.0.0 \
-    --model $MODEL_NAME \
     --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
-    --max-num-seqs 16 | tee ~/openai_api_server.log
+    --max-model-len 1024 | tee ~/openai_api_server.log
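Taken together, the new setup drops the conda environment and pins `vllm==0.6.1.post2` plus `vllm-flash-attn`, and the run section switches from `python -u -m vllm.entrypoints.openai.api_server` to the `vllm serve` CLI. Roughly, the updated run command expands to the sketch below; the value of `MODEL_NAME` is an assumption (the `envs` section is outside this hunk), taken from the model name the README's requests use, and the GPU count is only an example.

```bash
# Sketch of what the updated run section executes on the cluster.
# MODEL_NAME is assumed to be Qwen/Qwen1.5-110B-Chat (not shown in this hunk);
# SKYPILOT_NUM_GPUS_PER_NODE is injected by SkyPilot on the provisioned node.
MODEL_NAME=Qwen/Qwen1.5-110B-Chat
SKYPILOT_NUM_GPUS_PER_NODE=8

vllm serve $MODEL_NAME \
  --host 0.0.0.0 \
  --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
  --max-model-len 1024 | tee ~/openai_api_server.log
```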
16 changes: 4 additions & 12 deletions llm/qwen/serve-72b.yaml → llm/qwen/qwen2-72b.yaml
@@ -24,20 +24,12 @@ resources:
   ports: 8000
 
 setup: |
-  conda activate qwen
-  if [ $? -ne 0 ]; then
-    conda create -n qwen python=3.10 -y
-    conda activate qwen
-  fi
-  pip install vllm==0.4.2
-  pip install flash-attn==2.5.9.post1
+  pip install vllm==0.6.1.post2
+  pip install vllm-flash-attn
 run: |
-  conda activate qwen
-  export PATH=$PATH:/sbin
-  python -u -m vllm.entrypoints.openai.api_server \
+  vllm serve $MODEL_NAME \
     --host 0.0.0.0 \
-    --model $MODEL_NAME \
     --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
-    --max-num-seqs 16 | tee ~/openai_api_server.log
+    --max-model-len 1024 | tee ~/openai_api_server.log
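The same setup and run changes apply to the 72B task. Because the run command reads the model from `$MODEL_NAME`, a different Qwen2 checkpoint can be substituted at launch time without editing the file, assuming the task declares `MODEL_NAME` under `envs` (that section is outside this hunk):

```bash
# Sketch: reuse the 72B task file but point it at another Qwen2 checkpoint.
# Assumes the YAML declares MODEL_NAME under `envs`, which this diff does not show.
sky launch -c qwen qwen2-72b.yaml --env MODEL_NAME=Qwen/Qwen2-72B-Instruct
```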
13 changes: 3 additions & 10 deletions llm/qwen/serve-7b.yaml → llm/qwen/qwen2-7b.yaml
@@ -22,19 +22,12 @@ resources:
   ports: 8000
 
 setup: |
-  conda activate qwen
-  if [ $? -ne 0 ]; then
-    conda create -n qwen python=3.10 -y
-    conda activate qwen
-  fi
-  pip install vllm==0.4.2
-  pip install flash-attn==2.5.9.post1
+  pip install vllm==0.6.1.post2
+  pip install vllm-flash-attn
 run: |
-  conda activate qwen
   export PATH=$PATH:/sbin
-  python -m vllm.entrypoints.openai.api_server \
+  vllm serve $MODEL_NAME \
     --host 0.0.0.0 \
-    --model $MODEL_NAME \
     --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
     --max-model-len 1024 | tee ~/openai_api_server.log
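All three updated run sections still tee the server output to `~/openai_api_server.log`, so the switch to the `vllm serve` entrypoint can be checked the same way as before. Two quick checks, sketched under the assumption that the cluster is named `qwen`:

```bash
# Stream the run section's output for the latest job on the cluster.
sky logs qwen

# Or tail the log file that `vllm serve ... | tee` writes on the head node.
ssh qwen 'tail -n 50 ~/openai_api_server.log'
```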
