Use continuous batching by default #882

Open: wants to merge 58 commits into base: master

Changes shown from 43 of 58 commits.

Commits:
ec5f305  Use continuous batching by default (Wovchena, Sep 19, 2024)
dd7a5cf  Merge branch 'master' into use-continuos-batching-by-default (andrei-kochin, Sep 19, 2024)
229b7c5  Merge branch 'master' into use-continuos-batching-by-default (andrei-kochin, Sep 20, 2024)
41d1fe7  Update src/cpp/src/llm_pipeline.cpp (andrei-kochin, Sep 20, 2024)
36150c4  Update src/cpp/src/llm_pipeline.cpp (andrei-kochin, Sep 20, 2024)
4a4a09e  Update src/cpp/src/llm_pipeline.cpp (andrei-kochin, Sep 20, 2024)
1a58b5e  Update src/cpp/src/llm_pipeline.cpp (andrei-kochin, Sep 20, 2024)
90d81e6  Reorder cout (Wovchena, Sep 20, 2024)
6dc43a3  Update src/cpp/src/llm_pipeline.cpp (andrei-kochin, Sep 20, 2024)
03e2f32  Update src/cpp/src/llm_pipeline.cpp (andrei-kochin, Sep 20, 2024)
e561e93  Update src/cpp/src/llm_pipeline.cpp (andrei-kochin, Sep 20, 2024)
37ea2ad  Update src/cpp/src/llm_pipeline.cpp (andrei-kochin, Sep 20, 2024)
b62aee9  Update src/cpp/src/llm_pipeline.cpp (andrei-kochin, Sep 23, 2024)
07505b3  Update src/cpp/src/llm_pipeline.cpp (andrei-kochin, Sep 23, 2024)
e078818  Update src/cpp/src/llm_pipeline.cpp (andrei-kochin, Sep 23, 2024)
001d3a0  Update src/cpp/src/llm_pipeline.cpp (andrei-kochin, Sep 23, 2024)
a0a964f  Update src/cpp/src/llm_pipeline.cpp (andrei-kochin, Sep 25, 2024)
3cb2105  Update src/cpp/src/llm_pipeline.cpp (andrei-kochin, Sep 25, 2024)
40ea516  Limit max new tokens. (popovaan, Sep 25, 2024)
193df7e  Fixed error (popovaan, Sep 25, 2024)
1704548  Clean up (Wovchena, Sep 30, 2024)
086c7b8  Default destructors (Wovchena, Sep 30, 2024)
607d90d  Merge branch 'master' into use-continuos-batching-by-default (Wovchena, Sep 30, 2024)
741c13b  Default ~PerfTime (Wovchena, Sep 30, 2024)
06d1b1e  Merge branch 'master' into use-continuos-batching-by-default (ilya-lavrenov, Oct 10, 2024)
8d7d39d  Update src/cpp/src/llm_pipeline.cpp (andrei-kochin, Oct 11, 2024)
c4e8e05  Update src/cpp/src/llm_pipeline.cpp (andrei-kochin, Oct 11, 2024)
8116342  Update src/cpp/src/llm_pipeline.cpp (andrei-kochin, Oct 11, 2024)
b87d0f6  Update src/cpp/src/llm_pipeline.cpp (andrei-kochin, Oct 11, 2024)
1806fa0  CB: fix deadlock (#71) (Wovchena, Oct 11, 2024)
c9dc107  Merge branch 'master' into use-continuos-batching-by-default (ilya-lavrenov, Oct 12, 2024)
4bbcd0e  Increase timeouts for tests (ilya-lavrenov, Oct 12, 2024)
743e018  Update causal_lm_cpp.yml (ilya-lavrenov, Oct 12, 2024)
cfccefa  Use split_core_complile_config for CB (ilya-lavrenov, Oct 12, 2024)
03965d6  Update causal_lm_cpp.yml (ilya-lavrenov, Oct 12, 2024)
784c331  Drop request if it's aborted by streamer (ilya-lavrenov, Oct 13, 2024)
93b8c38  Update src/cpp/src/continuous_batching_impl.cpp (ilya-lavrenov, Oct 13, 2024)
043d842  Drop request in case of exceptions, etc (ilya-lavrenov, Oct 14, 2024)
fdad63c  Turned off prefix caching (ilya-lavrenov, Oct 14, 2024)
a21f725  Apply suggestions from code review (ilya-lavrenov, Oct 14, 2024)
a66be9e  Apply suggestions from code review (ilya-lavrenov, Oct 14, 2024)
82fceb5  Update continuous_batching_impl.cpp (ilya-lavrenov, Oct 14, 2024)
a246c1c  Merge branch 'master' into use-continuos-batching-by-default (ilya-lavrenov, Oct 14, 2024)
4ee8f12  Merge branch 'master' into use-continuos-batching-by-default (ilya-lavrenov, Oct 14, 2024)
73a8872  Apply suggestions from code review (ilya-lavrenov, Oct 14, 2024)
4019678  Apply suggestions from code review (ilya-lavrenov, Oct 14, 2024)
ed7668e  Merge branch 'master' into use-continuos-batching-by-default (ilya-lavrenov, Oct 14, 2024)
feae546  Update causal_lm_cpp.yml (ilya-lavrenov, Oct 14, 2024)
5bdf779  Apply suggestions from code review (ilya-lavrenov, Oct 14, 2024)
e3f2949  Merge branch 'master' into use-continuos-batching-by-default (ilya-lavrenov, Oct 16, 2024)
f1a9ab5  Merge branch 'master' into use-continuos-batching-by-default (andrei-kochin, Oct 17, 2024)
debbdd4  Merge branch 'master' into use-continuos-batching-by-default (ilya-lavrenov, Oct 18, 2024)
7827199  Apply suggestions from code review (ilya-lavrenov, Oct 21, 2024)
3de57d3  Merge branch 'master' into use-continuos-batching-by-default (ilya-lavrenov, Oct 21, 2024)
42d26df  Merge branch 'master' into use-continuos-batching-by-default (ilya-lavrenov, Oct 22, 2024)
467ab86  Apply suggestions from code review (ilya-lavrenov, Oct 22, 2024)
5b7f94a  Merge branch 'master' into use-continuos-batching-by-default (andrei-kochin, Oct 24, 2024)
5a391a8  Merge branch 'master' into use-continuos-batching-by-default (andrei-kochin, Oct 30, 2024)
48 changes: 24 additions & 24 deletions .github/workflows/causal_lm_cpp.yml
@@ -57,7 +57,7 @@ jobs:
- run: >
. ./ov/setupvars.sh
&& export PYTHONPATH=./build/:$PYTHONPATH
&& timeout 25s ./build/samples/cpp/greedy_causal_lm/greedy_causal_lm ./open_llama_3b_v2/ "return 0"
&& timeout 50s ./build/samples/cpp/greedy_causal_lm/greedy_causal_lm ./open_llama_3b_v2/ "return 0"
| diff <(timeout 25s samples/python/greedy_causal_lm/greedy_causal_lm.py ./open_llama_3b_v2/ "return 0") -

cpp-beam_search_causal_lm-ubuntu:
@@ -100,7 +100,7 @@ jobs:
source ./ov/setupvars.sh
export PYTHONPATH=./build/:$PYTHONPATH # C++ ignores that

timeout 25s ${{ matrix.executable }} ./TinyLlama-1.1B-Chat-v1.0/ "Why is the Sun yellow?" > ./pred.txt
timeout 50s ${{ matrix.executable }} ./TinyLlama-1.1B-Chat-v1.0/ "Why is the Sun yellow?" > ./pred.txt
python -c "
import transformers
with open('pred.txt', 'r') as file:
@@ -116,7 +116,7 @@ jobs:
"
echo "Why is the Sun yellow?" passed

timeout 25s ${{ matrix.executable }} ./TinyLlama-1.1B-Chat-v1.0/ 69 > ./pred.txt
timeout 50s ${{ matrix.executable }} ./TinyLlama-1.1B-Chat-v1.0/ 69 > ./pred.txt
python -c "
import transformers
with open('pred.txt', 'r') as file:
@@ -132,7 +132,7 @@ jobs:
"
echo 69 passed

timeout 25s ${{ matrix.executable }} ./TinyLlama-1.1B-Chat-v1.0/ Hi > ./pred.txt
timeout 50s ${{ matrix.executable }} ./TinyLlama-1.1B-Chat-v1.0/ Hi > ./pred.txt
python -c "
import transformers
with open('pred.txt', 'r') as file:
@@ -148,7 +148,7 @@ jobs:
"
echo "Hi" passed

timeout 25s ${{ matrix.executable }} ./TinyLlama-1.1B-Chat-v1.0/ "return 0" > ./pred.txt
timeout 50s ${{ matrix.executable }} ./TinyLlama-1.1B-Chat-v1.0/ "return 0" > ./pred.txt
python -c "
import transformers
with open('pred.txt', 'r') as file:
@@ -164,7 +164,7 @@ jobs:
"
echo "return 0" passed

timeout 25s ${{ matrix.executable }} ./TinyLlama-1.1B-Chat-v1.0/ "你好! 你好嗎?" > ./pred.txt
timeout 50s ${{ matrix.executable }} ./TinyLlama-1.1B-Chat-v1.0/ "你好! 你好嗎?" > ./pred.txt
python -c "
import transformers
with open('pred.txt', 'r', errors='ignore') as file:
@@ -180,7 +180,7 @@ jobs:
"
echo "你好! 你好嗎?" passed

timeout 1m ${{ matrix.executable }} ./TinyLlama-1.1B-Chat-v1.0/ "Alan Turing was a" "return 0" "你好! 你好嗎?" > ./pred.txt
timeout 2m ${{ matrix.executable }} ./TinyLlama-1.1B-Chat-v1.0/ "Alan Turing was a" "return 0" "你好! 你好嗎?" > ./pred.txt
python -c "
import transformers
with open('pred.txt', 'r', errors='ignore') as file:
@@ -285,7 +285,7 @@ jobs:
- run: >
. ./ov/setupvars.sh
&& export PYTHONPATH=./build/:$PYTHONPATH
&& timeout 2m ./build/samples/cpp/greedy_causal_lm/greedy_causal_lm ./Qwen-7B-Chat/ 69 | diff <(timeout 2m samples/python/greedy_causal_lm/greedy_causal_lm.py ./Qwen-7B-Chat/ 69) -
&& timeout 4m ./build/samples/cpp/greedy_causal_lm/greedy_causal_lm ./Qwen-7B-Chat/ 69 | diff <(timeout 4m samples/python/greedy_causal_lm/greedy_causal_lm.py ./Qwen-7B-Chat/ 69) -

cpp-beam_search_causal_lm-Qwen1_5-7B-Chat:
runs-on: ubuntu-20.04-16-cores
@@ -318,8 +318,8 @@ jobs:
- run: >
. ./ov/setupvars.sh
&& export PYTHONPATH=./build/:$PYTHONPATH
&& timeout 50s ./build/samples/cpp/beam_search_causal_lm/beam_search_causal_lm ./Qwen1.5-7B-Chat/ "你好!"
| diff <(timeout 50s ./samples/python/beam_search_causal_lm/beam_search_causal_lm.py ./Qwen1.5-7B-Chat/ "你好!") -
&& timeout 100s ./build/samples/cpp/beam_search_causal_lm/beam_search_causal_lm ./Qwen1.5-7B-Chat/ "你好!"
| diff <(timeout 100s ./samples/python/beam_search_causal_lm/beam_search_causal_lm.py ./Qwen1.5-7B-Chat/ "你好!") -

cpp-beam_search_causal_lm-Phi-2:
runs-on: ubuntu-20.04-16-cores
@@ -352,8 +352,8 @@ jobs:
- run: >
. ./ov/setupvars.sh
&& export PYTHONPATH=./build/:$PYTHONPATH
&& timeout 50s ./build/samples/cpp/beam_search_causal_lm/beam_search_causal_lm ./phi-2/ 69
| diff <(timeout 50s ./samples/python/beam_search_causal_lm/beam_search_causal_lm.py ./phi-2/ 69) -
&& timeout 100s ./build/samples/cpp/beam_search_causal_lm/beam_search_causal_lm ./phi-2/ 69
| diff <(timeout 100s ./samples/python/beam_search_causal_lm/beam_search_causal_lm.py ./phi-2/ 69) -

cpp-beam_search_causal_lm-notus-7b-v1:
runs-on: ubuntu-20.04-16-cores
@@ -386,8 +386,8 @@ jobs:
- run: >
. ./ov/setupvars.sh
&& export PYTHONPATH=./build/:$PYTHONPATH
&& timeout 50s ./build/samples/cpp/beam_search_causal_lm/beam_search_causal_lm ./notus-7b-v1/ 69
| diff <(timeout 50s ./samples/python/beam_search_causal_lm/beam_search_causal_lm.py ./notus-7b-v1/ 69) -
&& timeout 100s ./build/samples/cpp/beam_search_causal_lm/beam_search_causal_lm ./notus-7b-v1/ 69
| diff <(timeout 100s ./samples/python/beam_search_causal_lm/beam_search_causal_lm.py ./notus-7b-v1/ 69) -

cpp-speculative_decoding_lm-ubuntu:
runs-on: ubuntu-20.04-16-cores
@@ -479,7 +479,7 @@ jobs:
predicted_greedy = f.readline()
with open('predictions_prompt_lookup.txt', 'r') as f:
predicted_prompt_lookup = f.readline()
assert predicted_greedy == predicted_prompt_lookup
assert predicted_greedy == predicted_prompt_lookup, f"Expected {predicted_greedy}, actual {predicted_prompt_lookup}"
"
echo "Prompt lookup" passed
- name: run and compare (model with seq_length_axis = 1)
@@ -501,7 +501,7 @@ jobs:
predicted_greedy = f.readline()
with open('predictions_prompt_lookup.txt', 'r') as f:
predicted_prompt_lookup = f.readline()
assert predicted_greedy == predicted_prompt_lookup
assert predicted_greedy == predicted_prompt_lookup, f"Expected {predicted_greedy}, actual {predicted_prompt_lookup}"
"
echo "Prompt lookup" passed

@@ -536,7 +536,7 @@ jobs:
- name: Run Generation
run: |
source ./ov/setupvars.sh
timeout 50s ./build/samples/cpp/greedy_causal_lm/greedy_causal_lm ./phi-1_5/ "Alan Turing was a" > ./pred_greedy.txt
timeout 100s ./build/samples/cpp/greedy_causal_lm/greedy_causal_lm ./phi-1_5/ "Alan Turing was a" > ./pred_greedy.txt
- name: Compare
run: |
python -c "
@@ -556,7 +556,7 @@ jobs:
- run: >
. ./ov/setupvars.sh
&& export PYTHONPATH=./build/:$PYTHONPATH
&& timeout 50s samples/python/greedy_causal_lm/greedy_causal_lm.py ./phi-1_5/ "Alan Turing was a"
&& timeout 100s samples/python/greedy_causal_lm/greedy_causal_lm.py ./phi-1_5/ "Alan Turing was a"
| diff ./pred_greedy.txt -

cpp-greedy_causal_lm-redpajama-3b-chat:
@@ -590,7 +590,7 @@ jobs:
- name: Run Generation
run: |
source ./ov/setupvars.sh
timeout 50s ./build/samples/cpp/greedy_causal_lm/greedy_causal_lm ./redpajama-3b-chat/ "Alan Turing was a" > ./pred_greedy.txt
timeout 100s ./build/samples/cpp/greedy_causal_lm/greedy_causal_lm ./redpajama-3b-chat/ "Alan Turing was a" > ./pred_greedy.txt
- name: Compare
run: |
python -c "
@@ -610,7 +610,7 @@ jobs:
- run: >
. ./ov/setupvars.sh
&& export PYTHONPATH=./build/:$PYTHONPATH
&& timeout 50s samples/python/greedy_causal_lm/greedy_causal_lm.py ./redpajama-3b-chat/ "Alan Turing was a"
&& timeout 100s samples/python/greedy_causal_lm/greedy_causal_lm.py ./redpajama-3b-chat/ "Alan Turing was a"
| diff ./pred_greedy.txt -

cpp-chat_sample-ubuntu:
@@ -645,7 +645,7 @@ jobs:
run: |
source ./ov/setupvars.sh
printf 'What is 2 + 2?\nWhat is the previous answer?\nAdd 1 to it.\nSubtract 5 from it.\nWhy is the sun yellow?\nWhat was my first question?\n' > ./input.txt
timeout 30s ./build/samples/cpp/chat_sample/chat_sample ./TinyLlama-1.1B-Chat-v1.0/ < input.txt > ./pred.txt
timeout 60s ./build/samples/cpp/chat_sample/chat_sample ./TinyLlama-1.1B-Chat-v1.0/ < input.txt > ./pred.txt
python -c "
from transformers import LlamaTokenizer, AutoModelForCausalLM
model_id = 'TinyLlama/TinyLlama-1.1B-Chat-v1.0'
@@ -675,7 +675,7 @@ jobs:
diff pred.txt ref.txt
echo "Chat sample cpp" passed
export PYTHONPATH=./build/:$PYTHONPATH
timeout 30s ./samples/python/chat_sample/chat_sample.py ./TinyLlama-1.1B-Chat-v1.0/ < input.txt > ./pred2.txt
timeout 60s ./samples/python/chat_sample/chat_sample.py ./TinyLlama-1.1B-Chat-v1.0/ < input.txt > ./pred2.txt
diff pred2.txt ref.txt
echo "Chat sample python" passed

@@ -708,7 +708,7 @@ jobs:
- name: Run visual_language_chat sample - MiniCPM-V-2_6
run: >
source ./ov/setupvars.sh
&& timeout 120s ./build/samples/cpp/visual_language_chat/visual_language_chat ./miniCPM-V-2_6/ cat.jpg
&& timeout 240s ./build/samples/cpp/visual_language_chat/visual_language_chat ./miniCPM-V-2_6/ cat.jpg
<<< $'What is on the image?\nWhat is special on the image?'
- name: Download and convert LLaVa 1.5 model and an image
run: |
@@ -729,7 +729,7 @@ jobs:
source ./ov/setupvars.sh
export PYTHONPATH=./build/:$PYTHONPATH
printf 'What is on the image?\nWhat is special on the image?\n' > ./input.txt
timeout 120s python ./samples/python/visual_language_chat/visual_language_chat.py ./miniCPM-V-2_6/ cat.jpg < input.txt > ./pred.txt
timeout 240s python ./samples/python/visual_language_chat/visual_language_chat.py ./miniCPM-V-2_6/ cat.jpg < input.txt > ./pred.txt

cpp-continuous-batching-ubuntu:
runs-on: ubuntu-20.04-8-cores
36 changes: 32 additions & 4 deletions src/cpp/src/continuous_batching_impl.cpp
@@ -4,6 +4,7 @@
#include "text_callback_streamer.hpp"
#include "continuous_batching_impl.hpp"
#include "paged_attention_transformations.hpp"
#include "utils.hpp"

namespace ov::genai {
template<class... Ts> struct overloaded : Ts... {using Ts::operator()...;};
@@ -18,15 +19,18 @@ ContinuousBatchingPipeline::ContinuousBatchingImpl::ContinuousBatchingImpl(
m_tokenizer = tokenizer;
ov::Core core;

auto [core_plugin_config, compile_plugin_config] = ov::genai::utils::split_core_complile_config(plugin_config);
core.set_property(core_plugin_config);

// The model can be compiled for GPU as well
std::shared_ptr<ov::Model> model = core.read_model(models_path + "/openvino_model.xml");

DeviceConfig device_config(core, scheduler_config, device, plugin_config);
DeviceConfig device_config(core, scheduler_config, device, compile_plugin_config);

bool is_need_per_layer_cache_control = scheduler_config.use_cache_eviction;
apply_paged_attention_transformations(model, device_config, is_need_per_layer_cache_control);

ov::InferRequest infer_request = core.compile_model(model, device_config.get_device(), plugin_config).create_infer_request();
ov::InferRequest infer_request = core.compile_model(model, device_config.get_device(), compile_plugin_config).create_infer_request();

// setup KV caches
m_cache_manager = std::make_shared<CacheManager>(device_config, core);
@@ -255,9 +259,26 @@ ContinuousBatchingPipeline::ContinuousBatchingImpl::generate(const std::vector<o
std::vector<EncodedGenerationResult> results;
results.reserve(m_awaiting_requests.size());

auto drop_requests = [&] () {
for (const std::shared_ptr<ov::genai::SequenceGroup> request : m_requests) {
for (const auto& sequence: request->get_sequences()) {
if (m_scheduler->has_block_table(sequence->get_id())) {
m_scheduler->free_sequence(sequence->get_id());
}
}
m_sampler->clear_beam_search_info(request->get_request_id());
}
m_requests.clear();
};

bool continue_generation = true;
while (has_non_finished_requests() && continue_generation) {
step();
try {
step();
} catch (...) {
drop_requests();
throw;
}
if (streamer_ptr && generations.at(0)->can_read()) {
std::unordered_map<uint64_t, GenerationOutput> token = generations.at(0).get()->back();
OPENVINO_ASSERT(1 == token.size());
Expand All @@ -269,6 +290,12 @@ ContinuousBatchingPipeline::ContinuousBatchingImpl::generate(const std::vector<o
streamer_ptr->end();
}

if (!continue_generation) {
drop_requests();
} else {
OPENVINO_ASSERT(m_requests.empty(), "Internal error: current request is supposed to be dropped within step() function as completed");
}

for (size_t generation_idx = 0; generation_idx < generations.size(); ++generation_idx) {
const auto& generation = generations[generation_idx];
EncodedGenerationResult result;
@@ -304,7 +331,8 @@ ContinuousBatchingPipeline::ContinuousBatchingImpl::generate(const std::vector<s
constexpr bool add_generation_prompt = true;
std::string history = m_tokenizer.apply_chat_template(m_history, add_generation_prompt);
timer.start();
input_ids.push_back(m_tokenizer.encode(history).input_ids);
// ov::genai::add_special_tokens(false) is aligned with stateful pipeline
input_ids.push_back(m_tokenizer.encode(history, ov::genai::add_special_tokens(false)).input_ids);
timer.end();
} else {
input_ids.reserve(prompts.size());
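The new error handling in generate() wraps each step() call in try/catch so that a failed or streamer-aborted generation releases every scheduled sequence and its sampler state before the exception or early return propagates. Below is a minimal standalone sketch of that clean-up-then-rethrow pattern; the FakeRequest/FakeScheduler types and run_generation() function are illustrative stand-ins, not part of the openvino.genai API.

// Standalone sketch of the clean-up-then-rethrow pattern used in generate() above.
// FakeRequest, FakeScheduler and run_generation are hypothetical stand-ins.
#include <iostream>
#include <memory>
#include <stdexcept>
#include <vector>

struct FakeRequest { int id; };

struct FakeScheduler {
    void free_request(int id) { std::cout << "freed request " << id << "\n"; }
};

void run_generation(std::vector<std::shared_ptr<FakeRequest>>& requests, FakeScheduler& scheduler) {
    // Mirrors drop_requests(): release every scheduled resource and clear the queue.
    auto drop_requests = [&]() {
        for (const auto& request : requests)
            scheduler.free_request(request->id);
        requests.clear();
    };

    size_t steps = 0;
    while (!requests.empty()) {
        try {
            // Stand-in for step(): fail on the second iteration to exercise the catch path.
            if (++steps == 2)
                throw std::runtime_error("infer request failed");
            requests.pop_back();  // pretend one request finished normally
        } catch (...) {
            drop_requests();  // leave no dangling block tables or sampler state behind
            throw;            // propagate the original error to the caller
        }
    }
}

int main() {
    FakeScheduler scheduler;
    std::vector<std::shared_ptr<FakeRequest>> requests{
        std::make_shared<FakeRequest>(FakeRequest{0}),
        std::make_shared<FakeRequest>(FakeRequest{1})};
    try {
        run_generation(requests, scheduler);
    } catch (const std::exception& e) {
        std::cout << "generate() would rethrow: " << e.what() << "\n";
    }
    std::cout << "pending requests after failure: " << requests.size() << "\n";
    return 0;
}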
8 changes: 0 additions & 8 deletions src/cpp/src/continuous_batching_impl_interface.hpp
@@ -26,14 +26,6 @@ class ContinuousBatchingPipeline::ImplInterface {
float m_paged_attention_time_ms = 0.0f;
float m_matmul_time_ms = 0.0f;
float m_infer_total_ms = 0.0f;

~PerfTime() {
std::cout << "Inference requests aggregated statistic: " << std::endl;
std::cout << "Paged attention % of inference execution: " << (m_paged_attention_time_ms / m_infer_total_ms) * 100 << std::endl;
std::cout << "MatMul % of inference execution: " << (m_matmul_time_ms / m_infer_total_ms) * 100 << std::endl;
std::cout << "Total inference execution secs: " << m_infer_total_ms / 1000. << std::endl;
std::cout << std::endl;
}
} m_perf;
bool m_is_chat_conversation = false;
ChatHistory m_history;
30 changes: 30 additions & 0 deletions src/cpp/src/llm_pipeline.cpp
@@ -7,6 +7,7 @@
#include <algorithm>
#include <nlohmann/json.hpp>
#include <openvino/openvino.hpp>
#include <limits>
#include "openvino/genai/continuous_batching_pipeline.hpp"
#include "openvino/genai/generation_config.hpp"
#include "openvino/genai/llm_pipeline.hpp"
@@ -522,6 +523,7 @@ ov::genai::LLMPipeline::LLMPipeline(
const ov::genai::Tokenizer& tokenizer,
OptionalGenerationConfig generation_config
) {
OPENVINO_THROW("Not supported");
auto start_time = std::chrono::steady_clock::now();
m_pimpl = std::make_unique<StatefulLLMPipeline>(request, tokenizer, generation_config);
auto stop_time = std::chrono::steady_clock::now();
@@ -534,12 +536,25 @@
const std::string& device,
const ov::AnyMap& plugin_config
){
// std::cout << "Using continuous batching backend.\n";
auto start_time = std::chrono::steady_clock::now();
if (plugin_config.find(ov::genai::scheduler_config.name()) != plugin_config.end()) {
auto config_without_scheduler_config = plugin_config;
config_without_scheduler_config.erase(ov::genai::scheduler_config.name());
auto& scheduler_config = plugin_config.at(ov::genai::scheduler_config.name()).as<SchedulerConfig>();
m_pimpl = std::make_unique<ContinuousBatchingAdapter>(model_path, tokenizer, scheduler_config, device, config_without_scheduler_config);
// std::cout << "Found custom SchedulerConfig.\n";
} else if (true) {
SchedulerConfig scheduler_config;
scheduler_config.cache_size = 1;
scheduler_config.enable_prefix_caching = false;
m_pimpl = std::make_unique<ContinuousBatchingAdapter>(
model_path,
tokenizer,
scheduler_config,
device,
plugin_config
);
} else if ("NPU" == device) {
m_pimpl = std::make_unique<StaticLLMPipeline>(model_path, tokenizer, device, plugin_config);
} else {
@@ -554,12 +569,24 @@
const std::string& device,
const ov::AnyMap& config
){
// std::cout << "Using continuous batching backend.\n";
auto start_time = std::chrono::steady_clock::now();
if (config.find(ov::genai::scheduler_config.name()) != config.end()) {
auto config_without_scheduler_config = config;
config_without_scheduler_config.erase(ov::genai::scheduler_config.name());
auto& scheduler_config = config.at(ov::genai::scheduler_config.name()).as<SchedulerConfig>();
m_pimpl = std::make_unique<ContinuousBatchingAdapter>(path, scheduler_config, device, config_without_scheduler_config);
// std::cout << "Found custom SchedulerConfig.\n";
} else if (true) {
SchedulerConfig scheduler_config;
scheduler_config.cache_size = 1;
scheduler_config.enable_prefix_caching = false;
m_pimpl = std::make_unique<ContinuousBatchingAdapter>(
path,
scheduler_config,
device,
config
);
} else if ("NPU" == device) {
m_pimpl = std::make_unique<StaticLLMPipeline>(path, device, config);
} else {
@@ -592,6 +619,9 @@ void ov::genai::LLMPipeline::set_generation_config(const GenerationConfig& confi
if (config.eos_token_id == -1)
m_pimpl->m_generation_config.eos_token_id = default_eos_token_id;

if (config.max_new_tokens == SIZE_MAX)
m_pimpl->m_generation_config.max_new_tokens = 100;

m_pimpl->m_generation_config.validate();
}

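With this change the path-based constructors route to ContinuousBatchingAdapter by default, using a built-in SchedulerConfig (cache_size = 1, prefix caching disabled) unless the caller supplies its own SchedulerConfig through the properties map, and set_generation_config() now caps an unset max_new_tokens at 100. The following is a hedged caller-side sketch of opting into an explicit SchedulerConfig; the include paths and the generate() overload are assumptions based on this diff and the existing samples, not a verified final API.

// Hypothetical usage sketch: opting into an explicit SchedulerConfig now that
// continuous batching is the default LLMPipeline backend. Includes and the
// generate() overload are assumptions, not verified against the final API.
#include <iostream>

#include "openvino/genai/continuous_batching_pipeline.hpp"  // assumed to provide SchedulerConfig
#include "openvino/genai/llm_pipeline.hpp"

int main() {
    ov::genai::SchedulerConfig scheduler_config;
    scheduler_config.cache_size = 2;                // larger KV-cache budget than the cache_size = 1 default above
    scheduler_config.enable_prefix_caching = true;  // re-enable prefix caching (disabled by default in this PR)

    // Passing the scheduler_config property selects the "custom SchedulerConfig"
    // branch of the constructor shown in this diff.
    ov::genai::LLMPipeline pipe(
        "./TinyLlama-1.1B-Chat-v1.0/",
        "CPU",
        ov::AnyMap{{ov::genai::scheduler_config.name(), scheduler_config}});

    ov::genai::GenerationConfig config;
    config.max_new_tokens = 100;  // matches the new default applied when max_new_tokens is left unset

    std::cout << pipe.generate("Why is the Sun yellow?", config) << std::endl;
    return 0;
}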
4 changes: 0 additions & 4 deletions src/cpp/src/timer.hpp
@@ -25,8 +25,4 @@ class ManualTimer {
auto m_end = std::chrono::steady_clock::now();
m_total += std::chrono::duration<double, std::milli>(m_end - m_start).count();
}

~ManualTimer() {
std::cout << m_title << ": " << m_total / 1000. << " secs" << std::endl;
}
};
3 changes: 3 additions & 0 deletions src/python/py_generate_pipeline.cpp
@@ -2,6 +2,7 @@
// SPDX-License-Identifier: Apache-2.0

#include <filesystem>
#include <pybind11/iostream.h>
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>
#include <pybind11/stl_bind.h>
@@ -403,6 +404,7 @@ PYBIND11_MODULE(py_generate_pipeline, m) {
ScopedVar env_manager(utils::ov_tokenizers_module_path());
return std::make_unique<LLMPipeline>(model_path, device, utils::properties_to_any_map(config));
}),
py::call_guard<py::scoped_ostream_redirect, py::scoped_estream_redirect>(), // Respect std::cout flushes from constructor.
py::arg("model_path"), "folder with openvino_model.xml and openvino_tokenizer[detokenizer].xml files",
py::arg("device") = "CPU", "device on which inference will be done",
py::arg("config") = ov::AnyMap({}), "openvino.properties map",
@@ -422,6 +424,7 @@ PYBIND11_MODULE(py_generate_pipeline, m) {
ScopedVar env_manager(utils::ov_tokenizers_module_path());
return std::make_unique<LLMPipeline>(model_path, tokenizer, device, utils::properties_to_any_map(config));
}),
py::call_guard<py::scoped_ostream_redirect, py::scoped_estream_redirect>(), // Respect std::cout flushes from constructor.
py::arg("model_path"),
py::arg("tokenizer"),
py::arg("device") = "CPU",
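The py::call_guard additions make std::cout/std::cerr output produced while the C++ constructors run visible through Python's sys.stdout/sys.stderr (for example in notebooks) instead of going straight to the process streams. Below is a minimal standalone pybind11 sketch of the same pattern; the NoisyPipeline class and noisy_demo module are illustrative names, not the real bindings.

// Minimal pybind11 sketch of the call_guard pattern added above.
// NoisyPipeline / noisy_demo are hypothetical names for illustration only.
#include <iostream>

#include <pybind11/iostream.h>
#include <pybind11/pybind11.h>

namespace py = pybind11;

struct NoisyPipeline {
    NoisyPipeline() { std::cout << "Using continuous batching backend.\n"; }
};

PYBIND11_MODULE(noisy_demo, m) {
    py::class_<NoisyPipeline>(m, "NoisyPipeline")
        .def(py::init<>(),
             // Redirect C++ std::cout/std::cerr into Python's streams for the
             // duration of the constructor call, so prints appear in order.
             py::call_guard<py::scoped_ostream_redirect,
                            py::scoped_estream_redirect>());
}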