A small launch script included in this change:

```bash
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

python online_evaluate.py
```
# ChatQnA Accuracy

ChatQnA is a Retrieval-Augmented Generation (RAG) pipeline that enhances generative models with external information retrieval.

To evaluate accuracy, we use two recently published datasets and more than ten popular, comprehensive metrics:

- Datasets
  - [MultiHop](https://arxiv.org/pdf/2401.15391) (English dataset)
  - [CRUD](https://arxiv.org/abs/2401.17043) (Chinese dataset)
- Metrics (measuring the accuracy of both context retrieval and response generation)
  - Evaluation of retrieval/reranking (see the sketch after this list)
    - MRR@10
    - MAP@10
    - Hits@10
    - Hits@4
    - LLM-as-a-Judge
  - Evaluation of the generated response from the end-to-end pipeline
    - BLEU
    - ROUGE-L
    - LLM-as-a-Judge
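For reference, here is a minimal sketch of how the rank-based retrieval metrics are computed for a single query; corpus-level scores are averages over all queries. This is an illustration, not this repo's implementation:

```python
# Minimal sketch (not the repo's implementation) of the rank-based
# retrieval metrics listed above, for a single query.

def mrr_at_k(ranked_docs, gold_docs, k=10):
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, doc in enumerate(ranked_docs[:k], start=1):
        if doc in gold_docs:
            return 1.0 / rank
    return 0.0

def hits_at_k(ranked_docs, gold_docs, k=10):
    """1.0 if any relevant document appears within the top k, else 0.0."""
    return float(any(doc in gold_docs for doc in ranked_docs[:k]))

# Example: the gold evidence document is retrieved at rank 3.
ranked = ["doc7", "doc2", "doc5", "doc9"]
gold = {"doc5"}
print(mrr_at_k(ranked, gold))        # 0.333...
print(hits_at_k(ranked, gold, k=4))  # 1.0
print(hits_at_k(ranked, gold, k=2))  # 0.0
```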
## Prerequisites

### Environment

```bash
git clone https://github.com/opea-project/GenAIEval
cd GenAIEval
pip install -r requirements.txt
pip install -e .
```
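To sanity-check the installation, you can try importing one of the package's modules (the import path below is the one used by the evaluation scripts later in this document):

```bash
python -c "from evals.metrics.ragas import RagasMetric; print('GenAIEval is importable')"
```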
## MultiHop (English dataset)

[MultiHop-RAG](https://arxiv.org/pdf/2401.15391) is a QA dataset for evaluating retrieval and reasoning over documents with metadata in RAG pipelines. It contains 2,556 queries, with the evidence for each query spread across 2 to 4 documents. The queries also involve document metadata, reflecting complex scenarios common in real-world RAG applications.

### Launch Service of RAG System

Please refer to this [guide](https://github.com/opea-project/GenAIExamples/blob/main/ChatQnA/README.md) to launch the `ChatQnA` service.

### Launch Service of LLM-as-a-Judge

To set up the judge LLM, we can use [tgi-gaudi](https://github.com/huggingface/tgi-gaudi) to launch a service. For example, the following command sets up the [mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) model on 2 Gaudi2 cards:

```bash
# please set your llm_port and hf_token
docker run -p {your_llm_port}:80 --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HF_TOKEN={your_hf_token} --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.1 --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --max-input-tokens 2048 --max-total-tokens 4096 --sharded true --num-shard 2

# for better performance, set `PREFILL_BATCH_BUCKET_SIZE`, `BATCH_BUCKET_SIZE`, `max-batch-total-tokens`, and `max-batch-prefill-tokens`
docker run -p {your_llm_port}:80 --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HF_TOKEN={your_hf_token} -e PREFILL_BATCH_BUCKET_SIZE=1 -e BATCH_BUCKET_SIZE=8 --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.5 --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --max-input-tokens 2048 --max-total-tokens 4096 --sharded true --num-shard 2 --max-batch-total-tokens 65536 --max-batch-prefill-tokens 2048
```
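Once the container is up, you can verify the judge service with TGI's standard `/generate` route (substitute your chosen port):

```bash
curl http://localhost:{your_llm_port}/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is Retrieval-Augmented Generation?", "parameters": {"max_new_tokens": 32}}'
```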

### Prepare Dataset

We use the evaluation dataset from the [MultiHop-RAG](https://github.com/yixuantt/MultiHop-RAG) repo. Use the command below to fetch it:

```bash
git clone https://github.com/yixuantt/MultiHop-RAG.git
```
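Optionally, peek at the dataset before evaluating; assuming `MultiHopRAG.json` is a JSON list of query records (one per query), this prints the count and the start of the first record:

```bash
python -c "import json; d = json.load(open('MultiHop-RAG/dataset/MultiHopRAG.json')); print(len(d)); print(json.dumps(d[0], ensure_ascii=False, indent=2)[:400])"
```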

### Evaluation

Use the command below to run the evaluation. Note that on the first run you must add the `--ingest_docs` argument to ingest the documents into the vector database; omit it on subsequent runs. Set `--retrieval_metrics` to get retrieval-related metrics (MRR@10/MAP@10/Hits@10/Hits@4). Set `--ragas_metrics` and `--llm_endpoint` to get end-to-end RAG pipeline metrics (faithfulness/answer_relevancy/...), which are judged by LLMs. `--limits` defaults to 100, meaning only 100 examples are evaluated by LLM-as-a-Judge, since that step is very time-consuming.

If you use Docker Compose to deploy the `ChatQnA` system, you can simply run the evaluation as follows:

```bash
python eval_multihop.py --docs_path MultiHop-RAG/dataset/corpus.json --dataset_path MultiHop-RAG/dataset/MultiHopRAG.json --ingest_docs --retrieval_metrics --ragas_metrics --llm_endpoint http://{llm_as_judge_ip}:{llm_as_judge_port}/generate
```

If you use Kubernetes manifests/Helm to deploy the `ChatQnA` system, you must specify additional endpoint arguments:

```bash
python eval_multihop.py --docs_path MultiHop-RAG/dataset/corpus.json --dataset_path MultiHop-RAG/dataset/MultiHopRAG.json --ingest_docs --retrieval_metrics --ragas_metrics --llm_endpoint http://{llm_as_judge_ip}:{llm_as_judge_port}/generate --database_endpoint http://{your_dataprep_ip}:{your_dataprep_port}/v1/dataprep --embedding_endpoint http://{your_embedding_ip}:{your_embedding_port}/v1/embeddings --tei_embedding_endpoint http://{your_tei_embedding_ip}:{your_tei_embedding_port} --retrieval_endpoint http://{your_retrieval_ip}:{your_retrieval_port}/v1/retrieval --service_url http://{your_chatqna_ip}:{your_chatqna_port}/v1/chatqna
```

The default argument values are:

| Argument | Default value |
| -------- | ------------- |
| service_url | http://localhost:8888/v1/chatqna |
| database_endpoint | http://localhost:6007/v1/dataprep |
| embedding_endpoint | http://localhost:6000/v1/embeddings |
| tei_embedding_endpoint | http://localhost:8090 |
| retrieval_endpoint | http://localhost:7000/v1/retrieval |
| reranking_endpoint | http://localhost:8000/v1/reranking |
| output_dir | ./output |
| temperature | 0.1 |
| max_new_tokens | 1280 |
| chunk_size | 256 |
| chunk_overlap | 100 |
| search_type | similarity |
| retrival_k | 10 |
| fetch_k | 20 |
| lambda_mult | 0.5 |
| dataset_path | None |
| docs_path | None |
| limits | 100 |

You can see full argument details with:

```bash
python eval_multihop.py --help
```

## CRUD (Chinese dataset)

[CRUD-RAG](https://arxiv.org/abs/2401.17043) is a Chinese benchmark for RAG systems. This example uses CRUD-RAG to evaluate the RAG system.

### Prepare Dataset

We use the evaluation dataset from the [CRUD-RAG](https://github.com/IAAR-Shanghai/CRUD_RAG) repo. Use the commands below to prepare it:

```bash
git clone https://github.com/IAAR-Shanghai/CRUD_RAG
mkdir data/
cp CRUD_RAG/data/crud_split/split_merged.json data/
cp -r CRUD_RAG/data/80000_docs/ data/
python process_crud_dataset.py
```
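The evaluation script selects sub-datasets from `split_merged.json` by key (`questanswer_1doc` for question answering, `event_summary` for summarization; see `eval_crud.py` at the end of this page). You can list the available splits with:

```bash
python -c "import json; print(list(json.load(open('data/split_merged.json')).keys()))"
```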

### Launch Service of RAG System

Please refer to this [guide](https://github.com/opea-project/GenAIExamples/blob/main/ChatQnA/README.md) to launch the `ChatQnA` service. For the Chinese dataset, replace the English embedding and LLM models with Chinese ones, for example `EMBEDDING_MODEL_ID="BAAI/bge-base-zh-v1.5"` and `LLM_MODEL_ID="Qwen/Qwen2-7B-Instruct"`.
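For instance, with a Docker Compose deployment you would export the Chinese model IDs before bringing the stack up (variable names as given above; check the ChatQnA guide for how your deployment consumes them):

```bash
export EMBEDDING_MODEL_ID="BAAI/bge-base-zh-v1.5"
export LLM_MODEL_ID="Qwen/Qwen2-7B-Instruct"
```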

### Evaluation

Use the commands below to run the evaluation. As before, add `--ingest_docs` on the first run to ingest the documents into the vector database, and omit it on subsequent runs.

If you use Docker Compose to deploy the `ChatQnA` system, you can simply run the evaluation as follows:

```bash
python eval_crud.py --dataset_path ./data/split_merged.json --docs_path ./data/80000_docs --ingest_docs

# if you want to get ragas metrics
python eval_crud.py --dataset_path ./data/split_merged.json --docs_path ./data/80000_docs --contain_original_data --llm_endpoint "http://{llm_as_judge_ip}:{llm_as_judge_port}" --ragas_metrics
```
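Per-task results are written to `{output_dir}/{task}.json` (see `eval_crud.py` below). Assuming the saved file mirrors the in-memory results structure, you can re-read the overall scores afterwards with:

```bash
python -c "import json; print(json.load(open('output/question_answering.json'))['overall'])"
```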

If you use Kubernetes manifests/Helm to deploy the `ChatQnA` system, you must specify additional endpoint arguments:

```bash
python eval_crud.py --dataset_path ./data/split_merged.json --docs_path ./data/80000_docs --ingest_docs --database_endpoint http://{your_dataprep_ip}:{your_dataprep_port}/v1/dataprep --embedding_endpoint http://{your_embedding_ip}:{your_embedding_port}/v1/embeddings --retrieval_endpoint http://{your_retrieval_ip}:{your_retrieval_port}/v1/retrieval --service_url http://{your_chatqna_ip}:{your_chatqna_port}/v1/chatqna
```

The default argument values are:

| Argument | Default value |
| -------- | ------------- |
| service_url | http://localhost:8888/v1/chatqna |
| database_endpoint | http://localhost:6007/v1/dataprep |
| embedding_endpoint | http://localhost:6000/v1/embeddings |
| retrieval_endpoint | http://localhost:7000/v1/retrieval |
| reranking_endpoint | http://localhost:8000/v1/reranking |
| output_dir | ./output |
| temperature | 0.1 |
| max_new_tokens | 1280 |
| chunk_size | 256 |
| chunk_overlap | 100 |
| dataset_path | ./data/split_merged.json |
| docs_path | ./data/80000_docs |
| tasks | ["question_answering"] |

You can see full argument details with:

```bash
python eval_crud.py --help
```

## Acknowledgements

This example is largely adapted from the [MultiHop-RAG](https://github.com/yixuantt/MultiHop-RAG) and [CRUD-RAG](https://github.com/IAAR-Shanghai/CRUD_RAG) repos. We thank the authors for their great work!
For reference, the full `eval_crud.py` script invoked by the commands above:

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import argparse
import json
import os

from evals.evaluation.rag_eval import Evaluator
from evals.evaluation.rag_eval.template import CRUDTemplate
from evals.metrics.ragas import RagasMetric
from tqdm import tqdm


class CRUD_Evaluator(Evaluator):
    def get_ground_truth_text(self, data: dict):
        if self.task == "summarization":
            ground_truth_text = data["summary"]
        elif self.task == "question_answering":
            ground_truth_text = data["answers"]
        elif self.task == "continuation":
            ground_truth_text = data["continuing"]
        elif self.task == "hallucinated_modified":
            ground_truth_text = data["hallucinatedMod"]
        else:
            raise NotImplementedError(
                f"Unknown task {self.task}, only support "
                "summarization, question_answering, continuation and hallucinated_modified."
            )
        return ground_truth_text

    def get_query(self, data: dict):
        if self.task == "summarization":
            query = data["text"]
        elif self.task == "question_answering":
            query = data["questions"]
        elif self.task == "continuation":
            query = data["beginning"]
        elif self.task == "hallucinated_modified":
            query = data["newsBeginning"]
        else:
            raise NotImplementedError(
                f"Unknown task {self.task}, only support "
                "summarization, question_answering, continuation and hallucinated_modified."
            )
        return query

    def get_document(self, data: dict):
        if self.task == "summarization":
            document = data["text"]
        elif self.task == "question_answering":
            document = data["news1"]
        elif self.task == "continuation":
            document = data["beginning"]
        elif self.task == "hallucinated_modified":
            document = data["newsBeginning"]
        else:
            raise NotImplementedError(
                f"Unknown task {self.task}, only support "
                "summarization, question_answering, continuation and hallucinated_modified."
            )
        return document

    def get_template(self):
        if self.task == "summarization":
            template = CRUDTemplate.get_summarization_template()
        elif self.task == "question_answering":
            template = CRUDTemplate.get_question_answering_template()
        elif self.task == "continuation":
            template = CRUDTemplate.get_continuation_template()
        else:
            raise NotImplementedError(
                f"Unknown task {self.task}, only support summarization, question_answering and continuation."
            )
        return template

    def post_process(self, result):
        # Extract the answer text from between <response>...</response> tags.
        return result.split("<response>")[-1].split("</response>")[0].strip()

    def get_ragas_metrics(self, results, arguments):
        from langchain_huggingface import HuggingFaceEndpointEmbeddings

        embeddings = HuggingFaceEndpointEmbeddings(model=arguments.tei_embedding_endpoint)

        metric = RagasMetric(
            threshold=0.5,
            model=arguments.llm_endpoint,
            embeddings=embeddings,
            metrics=["faithfulness", "answer_relevancy"],
        )

        ragas_inputs = {
            "question": [],
            "answer": [],
            "ground_truth": [],
            "contexts": [],
        }

        valid_results = self.remove_invalid(results["results"])

        for data in tqdm(valid_results):
            data = data["original_data"]

            query = self.get_query(data)
            generated_text = data["generated_text"]
            ground_truth = data["ground_truth_text"]
            retrieved_documents = data["retrieved_documents"]

            ragas_inputs["question"].append(query)
            ragas_inputs["answer"].append(generated_text)
            ragas_inputs["ground_truth"].append(ground_truth)
            # Only the top-3 retrieved documents are passed as contexts.
            ragas_inputs["contexts"].append(retrieved_documents[:3])

        ragas_metrics = metric.measure(ragas_inputs)
        return ragas_metrics


def args_parser():
    parser = argparse.ArgumentParser()

    parser.add_argument(
        "--service_url", type=str, default="http://localhost:8888/v1/chatqna", help="ChatQnA service URL."
    )
    parser.add_argument("--output_dir", type=str, default="./output", help="Directory to save evaluation results.")
    parser.add_argument(
        "--temperature", type=float, default=0.1, help="Controls the randomness of the model's text generation."
    )
    parser.add_argument(
        "--max_new_tokens", type=int, default=1280, help="Maximum number of new tokens to be generated by the model."
    )
    parser.add_argument(
        "--chunk_size", type=int, default=256, help="The maximum number of characters that a chunk can contain."
    )
    parser.add_argument(
        "--chunk_overlap",
        type=int,
        default=100,
        help="The number of characters that should overlap between two adjacent chunks.",
    )
    parser.add_argument("--dataset_path", default="../data/split_merged.json", help="Path to the dataset.")
    parser.add_argument("--docs_path", default="../data/80000_docs", help="Path to the retrieval documents.")

    # Retriever-related options
    parser.add_argument("--tasks", default=["question_answering"], nargs="+", help="Tasks to perform.")
    parser.add_argument("--ingest_docs", action="store_true", help="Whether to ingest documents into the vector database.")
    parser.add_argument(
        "--database_endpoint", type=str, default="http://localhost:6007/v1/dataprep", help="Dataprep service URL."
    )
    parser.add_argument(
        "--embedding_endpoint", type=str, default="http://localhost:6000/v1/embeddings", help="Embedding service URL."
    )
    parser.add_argument(
        "--retrieval_endpoint", type=str, default="http://localhost:7000/v1/retrieval", help="Retrieval service URL."
    )
    parser.add_argument(
        "--tei_embedding_endpoint",
        type=str,
        default="http://localhost:8090",
        help="Service URL of the TEI embedding endpoint.",
    )
    parser.add_argument("--ragas_metrics", action="store_true", help="Whether to compute ragas metrics.")
    parser.add_argument("--llm_endpoint", type=str, default=None, help="LLM-as-a-Judge service URL.")
    parser.add_argument(
        "--show_progress_bar", action="store", default=True, type=bool, help="Whether to show a progress bar."
    )
    parser.add_argument("--contain_original_data", action="store_true", help="Whether to keep the original data in the saved results.")

    args = parser.parse_args()
    return args


def main():
    args = args_parser()
    if os.path.isfile(args.dataset_path):
        with open(args.dataset_path) as f:
            all_datasets = json.load(f)
    else:
        raise FileNotFoundError(f"Evaluation dataset file {args.dataset_path} does not exist.")
    os.makedirs(args.output_dir, exist_ok=True)
    for task in args.tasks:
        if task == "question_answering":
            dataset = all_datasets["questanswer_1doc"]
        elif task == "summarization":
            dataset = all_datasets["event_summary"]
        else:
            raise NotImplementedError(f"Unknown task {task}, only support summarization and question_answering.")
        output_save_path = os.path.join(args.output_dir, f"{task}.json")
        evaluator = CRUD_Evaluator(dataset=dataset, output_path=output_save_path, task=task)
        if args.ingest_docs:
            CRUD_Evaluator.ingest_docs(args.docs_path, args.database_endpoint, args.chunk_size, args.chunk_overlap)
        results = evaluator.evaluate(
            args, show_progress_bar=args.show_progress_bar, contain_original_data=args.contain_original_data
        )
        print(results["overall"])
        if args.ragas_metrics:
            ragas_metrics = evaluator.get_ragas_metrics(results, args)
            print(ragas_metrics)
        print(f"Evaluation results of task {task} saved to {output_save_path}.")


if __name__ == "__main__":
    main()
```