Integrating the Yi series models (#3958)
* Add files via upload

* Update and rename qwen2-7b.yaml to yi15-6b.yaml

* Add files via upload

* Update yi15-9b.yaml

* Update yi15-34b.yaml

* Update yi15-6b.yaml

* Add files via upload

* Update yicoder-1_5b.yaml

* Update yicoder-9b.yaml

* Add files via upload

* Update yi15-34b.yaml

* Update yi15-6b.yaml

* Update yi15-9b.yaml

* Update yicoder-1_5b.yaml

* Update yicoder-9b.yaml
Haijian06 authored Sep 19, 2024
1 parent 7ca0c48 commit e558ec2
Showing 6 changed files with 152 additions and 0 deletions.
60 changes: 60 additions & 0 deletions llm/yi/README.md
@@ -0,0 +1,60 @@
# Serving Yi on Your Own Kubernetes or Cloud

🤖 The Yi series models are the next generation of open-source large language models trained from scratch by [01.AI](https://www.lingyiwanwu.com/en).

**Update (Sep 19, 2024) -** SkyPilot now supports the [**Yi**](https://01-ai.github.io/) models (Yi-Coder, Yi-1.5)!

<p align="center">
<img src="https://raw.githubusercontent.com/01-ai/Yi/main/assets/img/coder/bench1.webp" alt="yi" width="600"/>
</p>

## Why use SkyPilot instead of commercial hosted solutions?

* Get the best GPU availability by utilizing multiple resource pools across Kubernetes clusters and multiple regions/clouds.
* Pay the absolute minimum: SkyPilot picks the cheapest resources across Kubernetes clusters and regions/clouds. No managed-solution markups.
* Scale up to multiple replicas across different locations and accelerators, all served with a single endpoint.
* Everything stays in your Kubernetes or cloud account (your VMs & buckets).
* Completely private: no one else sees your chat history.


## Running the Yi models with SkyPilot

After [installing SkyPilot](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html), run your own Yi model on vLLM with SkyPilot in one command:

1. Start serving Yi-1.5 34B on a single instance with any of the GPUs listed in [yi15-34b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/yi/yi15-34b.yaml), behind a vLLM-powered OpenAI-compatible endpoint (you can also switch to [yicoder-9b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/yi/yicoder-9b.yaml) or [another model](https://github.com/skypilot-org/skypilot/tree/master/llm/yi) for a smaller deployment):

```console
sky launch -c yi yi15-34b.yaml
```
2. Send a request to the endpoint for completion:
```bash
ENDPOINT=$(sky status --endpoint 8000 yi)

curl http://$ENDPOINT/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "01-ai/Yi-1.5-34B-Chat",
"prompt": "Who are you?",
"max_tokens": 512
}' | jq -r '.choices[0].text'
```

3. Send a request for chat completion:
```bash
curl http://$ENDPOINT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "01-ai/Yi-1.5-34B-Chat",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Who are you?"
}
],
"max_tokens": 512
}' | jq -r '.choices[0].message.content'
```
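
4. When you are done, tear down the cluster to stop incurring charges:
```console
sky down yi
```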
20 changes: 20 additions & 0 deletions llm/yi/yi15-34b.yaml
@@ -0,0 +1,20 @@
envs:
MODEL_NAME: 01-ai/Yi-1.5-34B-Chat

resources:
accelerators: {A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
disk_size: 1024
disk_tier: best
memory: 32+
ports: 8000

setup: |
pip install vllm==0.6.1.post2
pip install vllm-flash-attn
run: |
export PATH=$PATH:/sbin
vllm serve $MODEL_NAME \
--host 0.0.0.0 \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--max-model-len 1024 | tee ~/openai_api_server.log
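Since `MODEL_NAME` is declared under `envs`, you can point this same resource profile at a different checkpoint at launch time with SkyPilot's `--env` flag, without editing the YAML (the checkpoint name below is illustrative):

```console
sky launch -c yi yi15-34b.yaml --env MODEL_NAME=01-ai/Yi-1.5-34B-Chat-16K
```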
18 changes: 18 additions & 0 deletions llm/yi/yi15-6b.yaml
@@ -0,0 +1,18 @@
envs:
MODEL_NAME: 01-ai/Yi-1.5-6B-Chat

resources:
accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
disk_tier: best
ports: 8000

setup: |
pip install vllm==0.6.1.post2
pip install vllm-flash-attn
run: |
export PATH=$PATH:/sbin
vllm serve $MODEL_NAME \
--host 0.0.0.0 \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--max-model-len 1024 | tee ~/openai_api_server.log
18 changes: 18 additions & 0 deletions llm/yi/yi15-9b.yaml
@@ -0,0 +1,18 @@
envs:
MODEL_NAME: 01-ai/Yi-1.5-9B-Chat

resources:
accelerators: {L4:8, A10g:8, A10:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
disk_tier: best
ports: 8000

setup: |
pip install vllm==0.6.1.post2
pip install vllm-flash-attn
run: |
export PATH=$PATH:/sbin
vllm serve $MODEL_NAME \
--host 0.0.0.0 \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--max-model-len 1024 | tee ~/openai_api_server.log
18 changes: 18 additions & 0 deletions llm/yi/yicoder-1_5b.yaml
@@ -0,0 +1,18 @@
envs:
MODEL_NAME: 01-ai/Yi-Coder-1.5B-Chat

resources:
accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
disk_tier: best
ports: 8000

setup: |
pip install vllm==0.6.1.post2
pip install vllm-flash-attn
run: |
export PATH=$PATH:/sbin
vllm serve $MODEL_NAME \
--host 0.0.0.0 \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--max-model-len 1024 | tee ~/openai_api_server.log
18 changes: 18 additions & 0 deletions llm/yi/yicoder-9b.yaml
@@ -0,0 +1,18 @@
envs:
MODEL_NAME: 01-ai/Yi-Coder-9B-Chat

resources:
accelerators: {L4:8, A10g:8, A10:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
disk_tier: best
ports: 8000

setup: |
pip install vllm==0.6.1.post2
pip install vllm-flash-attn
run: |
export PATH=$PATH:/sbin
vllm serve $MODEL_NAME \
--host 0.0.0.0 \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--max-model-len 1024 | tee ~/openai_api_server.log
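All five YAMLs share the same structure, so any of them slots into the workflow from the README unchanged; for example, to serve Yi-Coder 9B on its own cluster and fetch its endpoint:

```console
sky launch -c yi-coder yicoder-9b.yaml
ENDPOINT=$(sky status --endpoint 8000 yi-coder)
```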
