Integrating the Yi series models (#3958)
* Add files via upload
* Update and rename qwen2-7b.yaml to yi15-6b.yaml
* Add files via upload
* Update yi15-9b.yaml
* Update yi15-34b.yaml
* Update yi15-6b.yaml
* Add files via upload
* Update yicoder-1_5b.yaml
* Update yicoder-9b.yaml
* Add files via upload
* Update yi15-34b.yaml
* Update yi15-6b.yaml
* Update yi15-9b.yaml
* Update yicoder-1_5b.yaml
* Update yicoder-9b.yaml
Showing 6 changed files with 152 additions and 0 deletions.

llm/yi/README.md

# Serving Yi on Your Own Kubernetes or Cloud

🤖 The Yi series models are the next generation of open-source large language models trained from scratch by [01.AI](https://www.lingyiwanwu.com/en).

**Update (Sep 19, 2024) -** SkyPilot now supports the [**Yi**](https://01-ai.github.io/) models (Yi-Coder and Yi-1.5)!

<p align="center">
<img src="https://raw.githubusercontent.com/01-ai/Yi/main/assets/img/coder/bench1.webp" alt="yi" width="600"/>
</p>

## Why use SkyPilot instead of commercial hosted solutions?

* Get the best GPU availability by utilizing multiple resource pools across Kubernetes clusters and multiple regions/clouds.
* Pay the absolute minimum — SkyPilot picks the cheapest resources across Kubernetes clusters and regions/clouds. No managed-solution markups.
* Scale up to multiple replicas across different locations and accelerators, all served with a single endpoint (see the sketch after this list).
* Everything stays in your Kubernetes or cloud account (your VMs & buckets).
* Completely private - no one else sees your chat history.
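
A minimal sketch of the multi-replica path, using SkyPilot's SkyServe. It assumes a `service:` section (replica count plus a readiness probe) has been added to yi15-34b.yaml; SkyServe requires one, and the YAMLs in this commit do not include it:

```console
# Deploy replicas behind one load-balanced endpoint (assumes a `service:`
# section in the YAML, e.g. with a readiness probe on /v1/models).
sky serve up -n yi yi15-34b.yaml

# Inspect replica status and the single service endpoint.
sky serve status yi
```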
## Running the Yi models with SkyPilot

After [installing SkyPilot](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html), launch your own Yi model on vLLM with SkyPilot in one command:

1. Start serving Yi-1.5 34B on a single instance, using any available GPU from the list in [yi15-34b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/yi/yi15-34b.yaml), behind a vLLM-powered OpenAI-compatible endpoint (you can also switch to [yicoder-9b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/yi/yicoder-9b.yaml) or [another model](https://github.com/skypilot-org/skypilot/tree/master/llm/yi) for a smaller model):
```console
sky launch -c yi yi15-34b.yaml
```
2. Send a request to the endpoint for completion:
```bash
ENDPOINT=$(sky status --endpoint 8000 yi)

curl http://$ENDPOINT/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "01-ai/Yi-1.5-34B-Chat",
    "prompt": "Who are you?",
    "max_tokens": 512
  }' | jq -r '.choices[0].text'
```
3. Send a request for chat completion:
```bash
curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "01-ai/Yi-1.5-34B-Chat",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Who are you?"
      }
    ],
    "max_tokens": 512
  }' | jq -r '.choices[0].message.content'
```
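
A few follow-up commands are handy once the cluster is up. This is an illustrative sketch: `/v1/models` is the standard OpenAI-compatible model listing served by vLLM, and the 16K checkpoint passed via `--env` is a hypothetical substitution that must still fit the GPUs listed in the YAML:

```console
# Check that the server is healthy and see which model it registered.
curl http://$ENDPOINT/v1/models | jq

# The YAMLs read the model from the MODEL_NAME env var, so the same file can
# serve a different Yi checkpoint (hypothetical example).
sky launch -c yi yi15-34b.yaml --env MODEL_NAME=01-ai/Yi-1.5-34B-Chat-16K

# Tear down the cluster when finished to stop incurring charges.
sky down yi
```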

llm/yi/yi15-34b.yaml

envs:
  MODEL_NAME: 01-ai/Yi-1.5-34B-Chat

resources:
  # 34B parameters in bf16 are ~68 GB of weights alone, hence at least
  # 2x A100-80GB or 4x A100-40GB (more GPUs leave headroom for KV cache).
  accelerators: {A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
  disk_size: 1024
  disk_tier: best
  memory: 32+
  ports: 8000

setup: |
  pip install vllm==0.6.1.post2
  pip install vllm-flash-attn

run: |
  export PATH=$PATH:/sbin
  # SkyPilot sets SKYPILOT_NUM_GPUS_PER_NODE at runtime; use it to shard
  # the model across all GPUs on the node.
  vllm serve $MODEL_NAME \
    --host 0.0.0.0 \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-model-len 1024 | tee ~/openai_api_server.log

llm/yi/yi15-6b.yaml

envs:
  MODEL_NAME: 01-ai/Yi-1.5-6B-Chat

resources:
  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
  disk_tier: best
  ports: 8000

setup: |
  pip install vllm==0.6.1.post2
  pip install vllm-flash-attn

run: |
  export PATH=$PATH:/sbin
  vllm serve $MODEL_NAME \
    --host 0.0.0.0 \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-model-len 1024 | tee ~/openai_api_server.log

llm/yi/yi15-9b.yaml

envs:
  MODEL_NAME: 01-ai/Yi-1.5-9B-Chat

resources:
  accelerators: {L4:8, A10g:8, A10:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
  disk_tier: best
  ports: 8000

setup: |
  pip install vllm==0.6.1.post2
  pip install vllm-flash-attn

run: |
  export PATH=$PATH:/sbin
  vllm serve $MODEL_NAME \
    --host 0.0.0.0 \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-model-len 1024 | tee ~/openai_api_server.log

llm/yi/yicoder-1_5b.yaml

envs:
  MODEL_NAME: 01-ai/Yi-Coder-1.5B-Chat

resources:
  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
  disk_tier: best
  ports: 8000

setup: |
  pip install vllm==0.6.1.post2
  pip install vllm-flash-attn

run: |
  export PATH=$PATH:/sbin
  vllm serve $MODEL_NAME \
    --host 0.0.0.0 \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-model-len 1024 | tee ~/openai_api_server.log

llm/yi/yicoder-9b.yaml

envs:
  MODEL_NAME: 01-ai/Yi-Coder-9B-Chat

resources:
  accelerators: {L4:8, A10g:8, A10:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
  disk_tier: best
  ports: 8000

setup: |
  pip install vllm==0.6.1.post2
  pip install vllm-flash-attn

run: |
  export PATH=$PATH:/sbin
  vllm serve $MODEL_NAME \
    --host 0.0.0.0 \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-model-len 1024 | tee ~/openai_api_server.log
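
The Yi-Coder configs launch exactly like the chat configs; only the YAML file and model name change. A quick sketch (the cluster name `yicoder` and the prompt are arbitrary choices):

```console
sky launch -c yicoder yicoder-9b.yaml
ENDPOINT=$(sky status --endpoint 8000 yicoder)

# Ask the coder model for code via the same OpenAI-compatible chat route.
curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "01-ai/Yi-Coder-9B-Chat",
    "messages": [{"role": "user", "content": "Write a Python function that reverses a string."}],
    "max_tokens": 256
  }' | jq -r '.choices[0].message.content'
```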