WIP IBM release #68

Closed

Changes from all commits (74 commits):
0aa4139
[Doc] Documentation on supported hardware for quantization methods (#…
mgoin Jun 21, 2024
b4ed395
[BugFix] exclude version 1.15.0 for modelscope (#5668)
zhyncs Jun 21, 2024
0f677c3
[ci][test] fix ca test in main (#5746)
youkaichao Jun 21, 2024
1133e22
[LoRA] Add support for pinning lora adapters in the LRU cache (#5603)
rohithkrn Jun 21, 2024
fda6325
[CI][Hardware][Intel GPU] add Intel GPU(XPU) ci pipeline (#5616)
jikunshang Jun 22, 2024
094fdac
[Model] Support Qwen-VL and Qwen-VL-Chat models with text-only inputs…
DamonFool Jun 22, 2024
e51d665
[Misc] Remove #4789 workaround left in vllm/entrypoints/openai/run_ba…
zifeitong Jun 22, 2024
4a0c093
[Bugfix] Fix pin_lora error in TPU executor (#5760)
WoosukKwon Jun 22, 2024
83066f6
[Docs][TPU] Add installation tip for TPU (#5761)
WoosukKwon Jun 22, 2024
e31d19c
[core][distributed] improve shared memory broadcast (#5754)
youkaichao Jun 22, 2024
40e9542
[BugFix] [Kernel] Add Cutlass2x fallback kernels (#5744)
varun-sundar-rabindranath Jun 23, 2024
a2bf2e2
[Distributed] Add send and recv helpers (#5719)
andoorve Jun 23, 2024
664ebf4
[Bugfix] Add phi3v resize for dynamic shape and fix torchvision requi…
Isotr0py Jun 24, 2024
5c50a9a
[doc][faq] add warning to download models for every nodes (#5783)
youkaichao Jun 24, 2024
e6035dc
[Doc] Add "Suggest edit" button to doc pages (#5789)
mgoin Jun 24, 2024
98bbdeb
[Doc] Add Phi-3-medium to list of supported models (#5788)
mgoin Jun 24, 2024
6244a71
[Bugfix] Fix FlexibleArgumentParser replaces _ with - for actual args…
CatherineSue Jun 24, 2024
68062fb
[ci] Remove aws template (#5757)
khluu Jun 25, 2024
f4c2e68
[Doc] Add notice about breaking changes to VLMs (#5818)
DarkLight1337 Jun 25, 2024
726516c
[Speculative Decoding] Support draft model on different tensor-paral…
wooyeonlee0 Jun 25, 2024
f36cd77
[Misc] Remove useless code in cpu_worker (#5824)
DamonFool Jun 25, 2024
77e41ec
[Core] Add fault tolerance for `RayTokenizerGroupPool` (#5748)
Yard1 Jun 25, 2024
ec820f3
[doc][distributed] add both gloo and nccl tests (#5834)
youkaichao Jun 25, 2024
b6ef994
[CI/Build] Add unit testing for FlexibleArgumentParser (#5798)
mgoin Jun 25, 2024
98fc761
[Misc] Update `w4a16` `compressed-tensors` support to include `w8a16`…
dsikka Jun 25, 2024
2aeab77
[Hardware][TPU] Refactor TPU backend (#5831)
WoosukKwon Jun 25, 2024
83a217e
[Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improv…
mawong-amd Jun 25, 2024
cb6c339
[Hardware][TPU] Raise errors for unsupported sampling params (#5850)
WoosukKwon Jun 25, 2024
323eb56
[CI/Build] Add E2E tests for MLPSpeculator (#5791)
tdoublep Jun 26, 2024
b57155e
[Bugfix] Fix assertion in NeuronExecutor (#5841)
aws-patlange Jun 26, 2024
9504961
[Core] Refactor Worker and ModelRunner to consolidate control plane c…
stephanie-wang Jun 26, 2024
53b9418
[Misc][Doc] Add Example of using OpenAI Server with VLM (#5832)
ywang96 Jun 26, 2024
c598c5b
[bugfix][distributed] fix shm broadcast when the queue size is full (…
youkaichao Jun 26, 2024
e5e4b11
[Bugfix] Fix embedding to support 2D inputs (#5829)
WoosukKwon Jun 26, 2024
eaf51e6
[Bugfix][TPU] Fix KV cache size calculation (#5860)
WoosukKwon Jun 26, 2024
214cf9d
[CI/Build] Refactor image test assets (#5821)
DarkLight1337 Jun 26, 2024
d530475
[Kernel] Adding bias epilogue support for `cutlass_scaled_mm` (#5560)
ProExpertProg Jun 26, 2024
2b2a6f0
[Frontend] Add tokenize/detokenize endpoints (#5054)
sasha0552 Jun 26, 2024
7b07f28
[Hardware][TPU] Support parallel sampling & Swapping (#5855)
WoosukKwon Jun 26, 2024
7ef738c
[Bugfix][TPU] Fix CPU cache allocation (#5869)
WoosukKwon Jun 26, 2024
b6c26b3
Support CPU inference with VSX PowerPC ISA (#5652)
ChipKerchner Jun 26, 2024
e957974
[doc] update usage of env var to avoid conflict (#5873)
youkaichao Jun 26, 2024
c6a2818
[Misc] Add example for LLaVA-NeXT (#5879)
ywang96 Jun 27, 2024
35ebe7d
[BugFix] Fix cuda graph for MLPSpeculator (#5875)
njhill Jun 27, 2024
8b8b470
[Doc] Add note about context length in Phi-3-Vision example (#5887)
DarkLight1337 Jun 27, 2024
8444703
[VLM][Bugfix] Make sure that `multi_modal_kwargs` is broadcasted prop…
xwjiang2010 Jun 27, 2024
78b8c94
[Model] Add base class for LoRA-supported models (#5018)
DarkLight1337 Jun 27, 2024
147bca0
[Bugfix] Fix img_sizes Parsing in Phi3-Vision (#5888)
ywang96 Jun 27, 2024
9e1d61e
[CI/Build] [1/3] Reorganize entrypoints tests (#5526)
DarkLight1337 Jun 27, 2024
7e0358a
[Model][Bugfix] Implicit model flags and reenable Phi-3-Vision (#5896)
DarkLight1337 Jun 27, 2024
59f2ce5
[doc][misc] add note for Kubernetes users (#5916)
youkaichao Jun 27, 2024
e3f2711
Squash 4645
prashantgupta24 Jun 27, 2024
6c13375
Squash 5930
prashantgupta24 Jun 27, 2024
d9562cf
[Core] Make Ray an optional "extras" requirement
njhill Apr 29, 2024
cf8b27e
🚧 add ibm-adapter branch
prashantgupta24 Jun 27, 2024
95d1306
🎨 fix format
prashantgupta24 Jun 27, 2024
cd15111
🎨 fix format
prashantgupta24 Jun 27, 2024
cfa5530
[Model] Add Gemma 2 (#5908)
WoosukKwon Jun 27, 2024
88e5be1
[core][misc] remove logical block (#5882)
youkaichao Jun 27, 2024
d38ed5d
[Kernel][ROCm][AMD] fused_moe Triton configs v2 for mi300X (#5932)
divakar-amd Jun 27, 2024
bc30b64
[Hardware][TPU] Optimize KV cache swapping (#5878)
WoosukKwon Jun 28, 2024
9bafffb
[VLM][BugFix] Make sure that `multi_modal_kwargs` can broadcast prope…
xwjiang2010 Jun 28, 2024
bba1cc6
[Bugfix][Hardware][Intel CPU] Fix unpassed multi_modal_kwargs for CPU…
Isotr0py Jun 28, 2024
4a5916d
[Core] Registry for processing model inputs (#5214)
DarkLight1337 Jun 28, 2024
0060a6f
Unmark fused_moe config json file as executable (#5960)
tlrmchlsmth Jun 28, 2024
358f984
[Hardware][Intel] OpenVINO vLLM backend (#5379)
ilya-lavrenov Jun 28, 2024
c5c4a9c
[Bugfix] Better error message for MLPSpeculator when `num_speculative…
tdoublep Jun 28, 2024
1099339
[CI/Build] [2/3] Reorganize entrypoints tests (#5904)
DarkLight1337 Jun 28, 2024
3536f3c
[Distributed] Make it clear that % should not be in tensor dict keys.…
xwjiang2010 Jun 28, 2024
1b8ffd1
[Spec Decode] Introduce DraftModelRunner (#5799)
comaniac Jun 28, 2024
b82befe
[Bugfix] Fix compute datatype for cutlass 3.x epilogues (#5931)
tlrmchlsmth Jun 28, 2024
b3a4ff5
[ Misc ] Remove `fp8_shard_indexer` from Col/Row Parallel Linear (Sim…
robertgshaw2-neuralmagic Jun 28, 2024
919dc5d
[ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP…
robertgshaw2-neuralmagic Jun 28, 2024
cf4072b
🎨 format code
prashantgupta24 Jun 28, 2024
14 changes: 14 additions & 0 deletions .buildkite/run-openvino-test.sh
@@ -0,0 +1,14 @@
# This script builds the OpenVINO docker image and runs offline inference inside the container.
# It serves as a sanity check for compilation and basic model usage.
set -ex

# Try building the docker image
docker build -t openvino-test -f Dockerfile.openvino .

# Setup cleanup
remove_docker_container() { docker rm -f openvino-test || true; }
trap remove_docker_container EXIT
remove_docker_container

# Run the image and launch offline inference
docker run --network host --env VLLM_OPENVINO_KVCACHE_SPACE=1 --name openvino-test openvino-test python3 /workspace/vllm/examples/offline_inference.py
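
The script is self-contained; a minimal sketch of invoking it for a local sanity check, assuming Docker is installed and the working directory is the repository root:

# Run the OpenVINO CI check locally (illustrative invocation):
bash .buildkite/run-openvino-test.sh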
17 changes: 13 additions & 4 deletions .buildkite/test-pipeline.yaml
@@ -1,7 +1,10 @@
# In this file, you can add more tests to run either by adding a new step or
# adding a new command to an existing step. See different options here for examples.
# This script will be feed into Jinja template in `test-template-aws.j2` to generate
# the final pipeline yaml file.

# This script will be feed into Jinja template in `test-template-aws.j2` at
# https://github.com/vllm-project/buildkite-ci/blob/main/scripts/test-template-aws.j2
# to generate the final pipeline yaml file.


steps:
- label: Regression Test
@@ -24,7 +27,9 @@ steps:

- label: Core Test
mirror_hardwares: [amd]
command: pytest -v -s core
commands:
- pytest -v -s core
- pytest -v -s distributed/test_parallel_state.py

- label: Distributed Comm Ops Test
#mirror_hardwares: [amd]
@@ -51,7 +56,7 @@ steps:
- TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_basic_distributed_correctness.py
- TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_chunked_prefill_distributed.py
- TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_chunked_prefill_distributed.py
- pytest -v -s spec_decode/e2e/test_integration_dist.py
- pytest -v -s spec_decode/e2e/test_integration_dist_tp2.py
- CUDA_VISIBLE_DEVICES=0,1 pytest -v -s test_sharded_state_loader.py
- CUDA_VISIBLE_DEVICES=0,1 pytest -v -s distributed/test_utils.py

@@ -68,6 +73,7 @@ steps:
# See https://github.com/vllm-project/vllm/pull/5473#issuecomment-2166601837 for context.
- TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_basic_distributed_correctness.py
- TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_basic_distributed_correctness.py
- pytest -v -s spec_decode/e2e/test_integration_dist_tp4.py

- label: Engine Test
mirror_hardwares: [amd]
@@ -197,6 +203,9 @@ steps:
gpu: a100
num_gpus: 4
commands:
# FIXIT: find out which code initialize cuda before running the test
# before the fix, we need to use spawn to test it
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
# NOTE: don't test llama model here, it seems hf implementation is buggy
# see https://github.com/vllm-project/vllm/pull/5689 for details
- pytest -v -s distributed/test_custom_all_reduce.py
139 changes: 0 additions & 139 deletions .buildkite/test-template-aws.j2

This file was deleted.

23 changes: 8 additions & 15 deletions CMakeLists.txt
@@ -2,7 +2,8 @@ cmake_minimum_required(VERSION 3.21)

project(vllm_extensions LANGUAGES CXX)

option(VLLM_TARGET_DEVICE "Target device backend for vLLM" "cuda")
# CUDA by default, can be overridden by using -DVLLM_TARGET_DEVICE=... (used by setup.py)
set(VLLM_TARGET_DEVICE "cuda" CACHE STRING "Target device backend for vLLM")

message(STATUS "Build type: ${CMAKE_BUILD_TYPE}")
message(STATUS "Target device: ${VLLM_TARGET_DEVICE}")
@@ -32,8 +33,7 @@ set(HIP_SUPPORTED_ARCHS "gfx906;gfx908;gfx90a;gfx940;gfx941;gfx942;gfx1030;gfx11
# versions are derived from Dockerfile.rocm
#
set(TORCH_SUPPORTED_VERSION_CUDA "2.3.0")
set(TORCH_SUPPORTED_VERSION_ROCM_5X "2.0.1")
set(TORCH_SUPPORTED_VERSION_ROCM_6X "2.1.1")
set(TORCH_SUPPORTED_VERSION_ROCM "2.4.0")

#
# Try to find python package with an executable that exactly matches
@@ -98,18 +98,11 @@ elseif(HIP_FOUND)
# .hip extension automatically, HIP must be enabled explicitly.
enable_language(HIP)

# ROCm 5.x
if (ROCM_VERSION_DEV_MAJOR EQUAL 5 AND
NOT Torch_VERSION VERSION_EQUAL ${TORCH_SUPPORTED_VERSION_ROCM_5X})
message(WARNING "Pytorch version ${TORCH_SUPPORTED_VERSION_ROCM_5X} "
"expected for ROCMm 5.x build, saw ${Torch_VERSION} instead.")
endif()

# ROCm 6.x
if (ROCM_VERSION_DEV_MAJOR EQUAL 6 AND
NOT Torch_VERSION VERSION_EQUAL ${TORCH_SUPPORTED_VERSION_ROCM_6X})
message(WARNING "Pytorch version ${TORCH_SUPPORTED_VERSION_ROCM_6X} "
"expected for ROCMm 6.x build, saw ${Torch_VERSION} instead.")
# ROCm 5.X and 6.X
if (ROCM_VERSION_DEV_MAJOR GREATER_EQUAL 5 AND
NOT Torch_VERSION VERSION_EQUAL ${TORCH_SUPPORTED_VERSION_ROCM})
message(WARNING "Pytorch version ${TORCH_SUPPORTED_VERSION_ROCM} "
"expected for ROCm build, saw ${Torch_VERSION} instead.")
endif()
else()
message(FATAL_ERROR "Can't find CUDA or HIP installation.")
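
The CMakeLists.txt change replaces option(), which only defines boolean ON/OFF cache entries, with a string cache variable, so the target backend can be selected explicitly. A short sketch of the two ways the value can be supplied, with illustrative paths and backend names:

# Via the environment variable that setup.py forwards to CMake:
VLLM_TARGET_DEVICE=openvino python3 -m pip install .

# Or directly on a CMake invocation (illustrative source and build directories):
cmake -S . -B build -DVLLM_TARGET_DEVICE=cpu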
4 changes: 2 additions & 2 deletions Dockerfile
@@ -142,7 +142,7 @@ RUN ldconfig /usr/local/cuda-$(echo $CUDA_VERSION | cut -d. -f1,2)/compat/
# install vllm wheel first, so that torch etc will be installed
RUN --mount=type=bind,from=build,src=/workspace/dist,target=/vllm-workspace/dist \
--mount=type=cache,target=/root/.cache/pip \
python3 -m pip install dist/*.whl --verbose
pip install "$(echo dist/*.whl)[ray]" --verbose
#################### vLLM installation IMAGE ####################


@@ -172,7 +172,7 @@ FROM vllm-base AS vllm-openai

# install additional dependencies for openai api server
RUN --mount=type=cache,target=/root/.cache/pip \
pip install accelerate hf_transfer modelscope
pip install accelerate hf_transfer 'modelscope!=1.15.0'

ENV VLLM_USAGE_SOURCE production-docker-image

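The wheel-install line pairs with the "[Core] Make Ray an optional extras requirement" commit: the image keeps the distributed stack by requesting the ray extra when installing the locally built wheel. The equivalent pattern for an index install, assuming the extra is published under the same name, is:

# Pull in the optional Ray dependency alongside the core package (assumed extra name):
pip install "vllm[ray]"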
26 changes: 26 additions & 0 deletions Dockerfile.openvino
@@ -0,0 +1,26 @@
# The vLLM Dockerfile is used to construct a vLLM image that can be used directly
# to run the OpenAI compatible server.

FROM ubuntu:22.04 AS dev

RUN apt-get update -y && \
apt-get install -y python3-pip git
WORKDIR /workspace

# copy requirements
COPY requirements-build.txt /workspace/vllm/
COPY requirements-common.txt /workspace/vllm/
COPY requirements-openvino.txt /workspace/vllm/

COPY vllm/ /workspace/vllm/vllm
COPY setup.py /workspace/vllm/

# install build requirements
RUN PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" python3 -m pip install -r /workspace/vllm/requirements-build.txt
# build vLLM with OpenVINO backend
RUN PIP_PRE=1 PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu https://storage.openvinotoolkit.org/simple/wheels/nightly/" VLLM_TARGET_DEVICE="openvino" python3 -m pip install /workspace/vllm/

COPY examples/ /workspace/vllm/examples
COPY benchmarks/ /workspace/vllm/benchmarks

CMD ["/bin/bash"]
22 changes: 22 additions & 0 deletions Dockerfile.ppc64le
@@ -0,0 +1,22 @@
FROM mambaorg/micromamba
ARG MAMBA_DOCKERFILE_ACTIVATE=1
USER root

RUN apt-get update -y && apt-get install -y git wget vim numactl gcc-12 g++-12 protobuf-compiler libprotobuf-dev && update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12

# Some packages in requirements-cpu are installed here
# IBM provides optimized packages for ppc64le processors in the open-ce project for mamba
# Currently these may not be available for venv or pip directly
RUN micromamba install -y -n base -c https://ftp.osuosl.org/pub/open-ce/1.11.0-p10/ -c defaults python=3.10 pytorch-cpu=2.1.2 torchvision-cpu=0.16.2 && micromamba clean --all --yes

COPY ./ /workspace/vllm

WORKDIR /workspace/vllm

# These packages will be in rocketce eventually
RUN pip install -v -r requirements-cpu.txt --prefer-binary --extra-index-url https://repo.fury.io/mgiessing

RUN VLLM_TARGET_DEVICE=cpu python3 setup.py install

WORKDIR /vllm-workspace
ENTRYPOINT ["/opt/conda/bin/python3", "-m", "vllm.entrypoints.openai.api_server"]