diff --git a/AudioQnA/benchmark/accuracy/README.md b/AudioQnA/benchmark/accuracy/README.md
index 557dd0562..699fd3e85 100644
--- a/AudioQnA/benchmark/accuracy/README.md
+++ b/AudioQnA/benchmark/accuracy/README.md
@@ -1,4 +1,4 @@
-# AudioQnA accuracy Evaluation
+# AudioQnA Accuracy
 
 AudioQnA is an example that demonstrates the integration of Generative AI (GenAI) models for performing question answering (QnA) on audio scenes, which involves Automatic Speech Recognition (ASR) and Text-to-Speech (TTS). The following is the pipeline for evaluating the ASR accuracy.
diff --git a/AudioQnA/benchmark/accuracy/run_acc.sh b/AudioQnA/benchmark/accuracy/run_acc.sh
new file mode 100644
index 000000000..c0dc95059
--- /dev/null
+++ b/AudioQnA/benchmark/accuracy/run_acc.sh
@@ -0,0 +1,5 @@
+
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+python online_evaluate.py
diff --git a/ChatQnA/benchmark/accuracy/README.md b/ChatQnA/benchmark/accuracy/README.md
new file mode 100644
index 000000000..0cfae4564
--- /dev/null
+++ b/ChatQnA/benchmark/accuracy/README.md
@@ -0,0 +1,170 @@
+# ChatQnA Accuracy
+
+ChatQnA is a Retrieval-Augmented Generation (RAG) pipeline that enhances generative models with external information retrieval.
+
+To evaluate accuracy, we use two recently published datasets and more than ten popular, comprehensive metrics:
+
+- Datasets
+  - [MultiHop](https://arxiv.org/pdf/2401.15391) (English dataset)
+  - [CRUD](https://arxiv.org/abs/2401.17043) (Chinese dataset)
+- Metrics (measuring the accuracy of both context retrieval and response generation)
+  - Evaluation of retrieval/reranking
+    - MRR@10
+    - MAP@10
+    - Hits@10
+    - Hits@4
+    - LLM-as-a-Judge
+  - Evaluation of the generated response from the end-to-end pipeline
+    - BLEU
+    - ROUGE-L
+    - LLM-as-a-Judge
+
+## Prerequisites
+
+### Environment
+
+```bash
+git clone https://github.com/opea-project/GenAIEval
+cd GenAIEval
+pip install -r requirements.txt
+pip install -e .
+```
+
+## MultiHop (English dataset)
+
+[MultiHop-RAG](https://arxiv.org/pdf/2401.15391) is a QA dataset for evaluating retrieval and reasoning across documents with metadata in RAG pipelines. It contains 2556 queries, with the evidence for each query distributed across 2 to 4 documents. The queries also involve document metadata, reflecting complex scenarios commonly found in real-world RAG applications.
+
+### Launch Service of RAG System
+
+Please refer to this [guide](https://github.com/opea-project/GenAIExamples/blob/main/ChatQnA/README.md) to launch the `ChatQnA` service.
+
+### Launch Service of LLM-as-a-Judge
+
+To set up the judge LLM, we can use [tgi-gaudi](https://github.com/huggingface/tgi-gaudi) to launch a service.
+For example, the following command launches the [mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) model on two Gaudi2 cards:
+
+```
+# please set your llm_port and hf_token
+
+docker run -p {your_llm_port}:80 --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HF_TOKEN={your_hf_token} --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.1 --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --max-input-tokens 2048 --max-total-tokens 4096 --sharded true --num-shard 2
+
+# for better performance, set `PREFILL_BATCH_BUCKET_SIZE`, `BATCH_BUCKET_SIZE`, `max-batch-total-tokens`, `max-batch-prefill-tokens`
+docker run -p {your_llm_port}:80 --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HF_TOKEN={your_hf_token} -e PREFILL_BATCH_BUCKET_SIZE=1 -e BATCH_BUCKET_SIZE=8 --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.5 --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --max-input-tokens 2048 --max-total-tokens 4096 --sharded true --num-shard 2 --max-batch-total-tokens 65536 --max-batch-prefill-tokens 2048
+```
+
+### Prepare Dataset
+
+We use the evaluation dataset from the [MultiHop-RAG](https://github.com/yixuantt/MultiHop-RAG) repo. Use the command below to prepare the dataset:
+
+```bash
+git clone https://github.com/yixuantt/MultiHop-RAG.git
+```
+
+### Evaluation
+
+Use the command below to run the evaluation. Note that for the first run, the `--ingest_docs` argument should be added to ingest the documents into the vector database; for subsequent runs, this argument should be omitted. Set `--retrieval_metrics` to get retrieval-related metrics (MRR@10/MAP@10/Hits@10/Hits@4). Set `--ragas_metrics` and `--llm_endpoint` to get end-to-end RAG pipeline metrics (faithfulness/answer_relevancy/...), which are judged by an LLM. `--limits` defaults to 100, meaning only 100 examples are evaluated by LLM-as-a-Judge, since this step is very time consuming.
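+
+As background for these metrics, each retrieval score is computed per query from the golden evidence facts and the retrieved contexts, then averaged over all queries. The sketch below shows one common formulation of Hits@k, MRR@10, and MAP@10; it is illustrative only (the function name and the substring-based relevance rule are assumptions, not the exact `RetrievalBaseMetric` implementation used by `eval_multihop.py`). The evaluation commands follow after it.
+
+```python
+# Illustrative sketch only; actual scores come from evals.metrics.retrieval.RetrievalBaseMetric.
+def retrieval_scores(golden_facts, retrieved_docs, k=10):
+    top_k = retrieved_docs[:k]
+    # A position counts as a "hit" if the retrieved document contains any golden evidence fact.
+    relevant = [any(fact in doc for fact in golden_facts) for doc in top_k]
+
+    hits_at_k = 1.0 if any(relevant) else 0.0
+    # MRR: reciprocal rank of the first relevant position (0 if none).
+    mrr_at_k = next((1.0 / rank for rank, rel in enumerate(relevant, 1) if rel), 0.0)
+    # MAP: average of precision@rank over the relevant positions.
+    precisions = [sum(relevant[:rank]) / rank for rank, rel in enumerate(relevant, 1) if rel]
+    map_at_k = sum(precisions) / len(precisions) if precisions else 0.0
+
+    return {f"Hits@{k}": hits_at_k, f"MRR@{k}": mrr_at_k, f"MAP@{k}": map_at_k}
+
+
+print(retrieval_scores(["golden fact"], ["a document containing the golden fact", "unrelated text"]))
+# {'Hits@10': 1.0, 'MRR@10': 1.0, 'MAP@10': 1.0}
+```
+
+Per-query scores like these are averaged over the whole dataset, which is what the `--retrieval_metrics` output of `eval_multihop.py` reports.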
+
+If you are using Docker Compose to deploy the `ChatQnA` system, you can simply run the evaluation as follows:
+
+```bash
+python eval_multihop.py --docs_path MultiHop-RAG/dataset/corpus.json --dataset_path MultiHop-RAG/dataset/MultiHopRAG.json --ingest_docs --retrieval_metrics --ragas_metrics --llm_endpoint http://{llm_as_judge_ip}:{llm_as_judge_port}/generate
+```
+
+If you are using Kubernetes manifests/Helm to deploy the `ChatQnA` system, you must specify additional arguments as follows:
+
+```bash
+python eval_multihop.py --docs_path MultiHop-RAG/dataset/corpus.json --dataset_path MultiHop-RAG/dataset/MultiHopRAG.json --ingest_docs --retrieval_metrics --ragas_metrics --llm_endpoint http://{llm_as_judge_ip}:{llm_as_judge_port}/generate --database_endpoint http://{your_dataprep_ip}:{your_dataprep_port}/v1/dataprep --embedding_endpoint http://{your_embedding_ip}:{your_embedding_port}/v1/embeddings --tei_embedding_endpoint http://{your_tei_embedding_ip}:{your_tei_embedding_port} --retrieval_endpoint http://{your_retrieval_ip}:{your_retrieval_port}/v1/retrieval --service_url http://{your_chatqna_ip}:{your_chatqna_port}/v1/chatqna
+```
+
+The default values for the arguments are:
+|Argument|Default value|
+|--------|-------------|
+|service_url|http://localhost:8888/v1/chatqna|
+|database_endpoint|http://localhost:6007/v1/dataprep|
+|embedding_endpoint|http://localhost:6000/v1/embeddings|
+|tei_embedding_endpoint|http://localhost:8090|
+|retrieval_endpoint|http://localhost:7000/v1/retrieval|
+|reranking_endpoint|http://localhost:8000/v1/reranking|
+|output_dir|./output|
+|temperature|0.1|
+|max_new_tokens|1280|
+|chunk_size|256|
+|chunk_overlap|100|
+|search_type|similarity|
+|retrival_k|10|
+|fetch_k|20|
+|lambda_mult|0.5|
+|dataset_path|None|
+|docs_path|None|
+|limits|100|
+
+You can check the details of all arguments with the command below:
+
+```bash
+python eval_multihop.py --help
+```
+
+## CRUD (Chinese dataset)
+
+[CRUD-RAG](https://arxiv.org/abs/2401.17043) is a Chinese benchmark for Retrieval-Augmented Generation (RAG) systems. This example utilizes CRUD-RAG to evaluate the RAG system.
+
+### Prepare Dataset
+
+We use the evaluation dataset from the [CRUD-RAG](https://github.com/IAAR-Shanghai/CRUD_RAG) repo. Use the commands below to prepare the dataset:
+
+```bash
+git clone https://github.com/IAAR-Shanghai/CRUD_RAG
+mkdir data/
+cp CRUD_RAG/data/crud_split/split_merged.json data/
+cp -r CRUD_RAG/data/80000_docs/ data/
+python process_crud_dataset.py
+```
+
+### Launch Service of RAG System
+
+Please refer to this [guide](https://github.com/opea-project/GenAIExamples/blob/main/ChatQnA/README.md) to launch the `ChatQnA` service. For the Chinese dataset, you should replace the English embedding and LLM models with Chinese ones, for example, `EMBEDDING_MODEL_ID="BAAI/bge-base-zh-v1.5"` and `LLM_MODEL_ID=Qwen/Qwen2-7B-Instruct`.
+
+### Evaluation
+
+Use the command below to run the evaluation. Note that for the first run, the `--ingest_docs` argument should be added to ingest the documents into the vector database; for subsequent runs, this argument should be omitted.
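+
+Before launching the full evaluation, it can be helpful to confirm that the deployed `ChatQnA` service answers a single query. Below is a minimal smoke-test sketch (the host/port placeholders and the sample question are assumptions to replace with your own; the payload follows the `{"messages": ...}` format accepted by the `/v1/chatqna` endpoint):
+
+```python
+# Quick sanity check of the ChatQnA megaservice before running eval_crud.py (sketch only).
+import requests
+
+service_url = "http://{your_chatqna_ip}:{your_chatqna_port}/v1/chatqna"  # replace the placeholders
+payload = {"messages": "什么是检索增强生成(RAG)?"}  # any short question in the dataset language works
+
+response = requests.post(service_url, json=payload, timeout=120)
+response.raise_for_status()
+print(response.text)  # generated answer; the service may return a streamed response body
+```
+
+If this request succeeds, the same address can be passed to `eval_crud.py` via `--service_url`.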
+ +If you are using docker compose to deploy `ChatQnA` system, you can simply run the evaluation as following: + +```bash +python eval_crud.py --dataset_path ./data/split_merged.json --docs_path ./data/80000_docs --ingest_docs + +# if you want to get ragas metrics +python eval_crud.py --dataset_path ./data/split_merged.json --docs_path ./data/80000_docs --contain_original_data --llm_endpoint "http://{llm_as_judge_ip}:{llm_as_judge_port}" --ragas_metrics +``` + +If you are using Kubernetes manifest/helm to deploy `ChatQnA` system, you must specify more arguments as following: + +```bash +python eval_crud.py --dataset_path ./data/split_merged.json --docs_path ./data/80000_docs --ingest_docs --database_endpoint http://{your_dataprep_ip}:{your_dataprep_port}/v1/dataprep --embedding_endpoint http://{your_embedding_ip}:{your_embedding_port}/v1/embeddings --retrieval_endpoint http://{your_retrieval_ip}:{your_retrieval_port}/v1/retrieval --service_url http://{your_chatqna_ip}:{your_chatqna_port}/v1/chatqna +``` + +The default values for arguments are: +|Argument|Default value| +|--------|-------------| +|service_url|http://localhost:8888/v1/chatqna| +|database_endpoint|http://localhost:6007/v1/dataprep| +|embedding_endpoint|http://localhost:6000/v1/embeddings| +|retrieval_endpoint|http://localhost:7000/v1/retrieval| +|reranking_endpoint|http://localhost:8000/v1/reranking| +|output_dir|./output| +|temperature|0.1| +|max_new_tokens|1280| +|chunk_size|256| +|chunk_overlap|100| +|dataset_path|./data/split_merged.json| +|docs_path|./data/80000_docs| +|tasks|["question_answering"]| + +You can check arguments details use below command: + +```bash +python eval_crud.py --help +``` + +## Acknowledgements + +This example is mostly adapted from [MultiHop-RAG](https://github.com/yixuantt/MultiHop-RAG) and [CRUD-RAG](https://github.com/IAAR-Shanghai/CRUD_RAG) repo, we thank the authors for their great work! diff --git a/ChatQnA/benchmark/accuracy/eval_crud.py b/ChatQnA/benchmark/accuracy/eval_crud.py new file mode 100644 index 000000000..f6e3e25a0 --- /dev/null +++ b/ChatQnA/benchmark/accuracy/eval_crud.py @@ -0,0 +1,210 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + + +import argparse +import json +import os + +from evals.evaluation.rag_eval import Evaluator +from evals.evaluation.rag_eval.template import CRUDTemplate +from evals.metrics.ragas import RagasMetric +from tqdm import tqdm + + +class CRUD_Evaluator(Evaluator): + def get_ground_truth_text(self, data: dict): + if self.task == "summarization": + ground_truth_text = data["summary"] + elif self.task == "question_answering": + ground_truth_text = data["answers"] + elif self.task == "continuation": + ground_truth_text = data["continuing"] + elif self.task == "hallucinated_modified": + ground_truth_text = data["hallucinatedMod"] + else: + raise NotImplementedError( + f"Unknown task {self.task}, only support " + "summarization, question_answering, continuation and hallucinated_modified." 
+ ) + return ground_truth_text + + def get_query(self, data: dict): + if self.task == "summarization": + query = data["text"] + elif self.task == "question_answering": + query = data["questions"] + elif self.task == "continuation": + query = data["beginning"] + elif self.task == "hallucinated_modified": + query = data["newsBeginning"] + else: + raise NotImplementedError( + f"Unknown task {self.task}, only support " + "summarization, question_answering, continuation and hallucinated_modified." + ) + return query + + def get_document(self, data: dict): + if self.task == "summarization": + document = data["text"] + elif self.task == "question_answering": + document = data["news1"] + elif self.task == "continuation": + document = data["beginning"] + elif self.task == "hallucinated_modified": + document = data["newsBeginning"] + else: + raise NotImplementedError( + f"Unknown task {self.task}, only support " + "summarization, question_answering, continuation and hallucinated_modified." + ) + return document + + def get_template(self): + if self.task == "summarization": + template = CRUDTemplate.get_summarization_template() + elif self.task == "question_answering": + template = CRUDTemplate.get_question_answering_template() + elif self.task == "continuation": + template = CRUDTemplate.get_continuation_template() + else: + raise NotImplementedError( + f"Unknown task {self.task}, only support " + "summarization, question_answering, continuation and hallucinated_modified." + ) + return template + + def post_process(self, result): + return result.split("")[-1].split("")[0].strip() + + def get_ragas_metrics(self, results, arguments): + from langchain_huggingface import HuggingFaceEndpointEmbeddings + + embeddings = HuggingFaceEndpointEmbeddings(model=arguments.tei_embedding_endpoint) + + metric = RagasMetric( + threshold=0.5, + model=arguments.llm_endpoint, + embeddings=embeddings, + metrics=["faithfulness", "answer_relevancy"], + ) + + all_answer_relevancy = 0 + all_faithfulness = 0 + ragas_inputs = { + "question": [], + "answer": [], + "ground_truth": [], + "contexts": [], + } + + valid_results = self.remove_invalid(results["results"]) + + for data in tqdm(valid_results): + data = data["original_data"] + + query = self.get_query(data) + generated_text = data["generated_text"] + ground_truth = data["ground_truth_text"] + retrieved_documents = data["retrieved_documents"] + + ragas_inputs["question"].append(query) + ragas_inputs["answer"].append(generated_text) + ragas_inputs["ground_truth"].append(ground_truth) + ragas_inputs["contexts"].append(retrieved_documents[:3]) + + ragas_metrics = metric.measure(ragas_inputs) + return ragas_metrics + + +def args_parser(): + parser = argparse.ArgumentParser() + + parser.add_argument( + "--service_url", type=str, default="http://localhost:8888/v1/chatqna", help="Service URL address." 
+ ) + parser.add_argument("--output_dir", type=str, default="./output", help="Directory to save evaluation results.") + parser.add_argument( + "--temperature", type=float, default=0.1, help="Controls the randomness of the model's text generation" + ) + parser.add_argument( + "--max_new_tokens", type=int, default=1280, help="Maximum number of new tokens to be generated by the model" + ) + parser.add_argument( + "--chunk_size", type=int, default=256, help="the maximum number of characters that a chunk can contain" + ) + parser.add_argument( + "--chunk_overlap", + type=int, + default=100, + help="the number of characters that should overlap between two adjacent chunks", + ) + parser.add_argument("--dataset_path", default="../data/split_merged.json", help="Path to the dataset") + parser.add_argument("--docs_path", default="../data/80000_docs", help="Path to the retrieval documents") + + # Retriever related options + parser.add_argument("--tasks", default=["question_answering"], nargs="+", help="Task to perform") + parser.add_argument("--ingest_docs", action="store_true", help="Whether to ingest documents to vector database") + parser.add_argument( + "--database_endpoint", type=str, default="http://localhost:6007/v1/dataprep", help="Service URL address." + ) + parser.add_argument( + "--embedding_endpoint", type=str, default="http://localhost:6000/v1/embeddings", help="Service URL address." + ) + parser.add_argument( + "--retrieval_endpoint", type=str, default="http://localhost:7000/v1/retrieval", help="Service URL address." + ) + parser.add_argument( + "--tei_embedding_endpoint", + type=str, + default="http://localhost:8090", + help="Service URL address of tei embedding.", + ) + parser.add_argument("--ragas_metrics", action="store_true", help="Whether to compute ragas metrics.") + parser.add_argument("--llm_endpoint", type=str, default=None, help="Service URL address.") + parser.add_argument( + "--show_progress_bar", action="store", default=True, type=bool, help="Whether to show a progress bar" + ) + parser.add_argument("--contain_original_data", action="store_true", help="Whether to contain original data") + + args = parser.parse_args() + return args + + +def main(): + args = args_parser() + if os.path.isfile(args.dataset_path): + with open(args.dataset_path) as f: + all_datasets = json.load(f) + else: + raise FileNotFoundError(f"Evaluation dataset file {args.dataset_path} not exist.") + os.makedirs(args.output_dir, exist_ok=True) + for task in args.tasks: + if task == "question_answering": + dataset = all_datasets["questanswer_1doc"] + elif task == "summarization": + dataset = all_datasets["event_summary"] + else: + raise NotImplementedError( + f"Unknown task {task}, only support " + "summarization, question_answering, continuation and hallucinated_modified." 
+ ) + output_save_path = os.path.join(args.output_dir, f"{task}.json") + evaluator = CRUD_Evaluator(dataset=dataset, output_path=output_save_path, task=task) + if args.ingest_docs: + CRUD_Evaluator.ingest_docs(args.docs_path, args.database_endpoint, args.chunk_size, args.chunk_overlap) + results = evaluator.evaluate( + args, show_progress_bar=args.show_progress_bar, contain_original_data=args.contain_original_data + ) + print(results["overall"]) + if args.ragas_metrics: + ragas_metrics = evaluator.get_ragas_metrics(results, args) + print(ragas_metrics) + print(f"Evaluation results of task {task} saved to {output_save_path}.") + + +if __name__ == "__main__": + main() diff --git a/ChatQnA/benchmark/accuracy/eval_multihop.py b/ChatQnA/benchmark/accuracy/eval_multihop.py new file mode 100644 index 000000000..9b07ea2e3 --- /dev/null +++ b/ChatQnA/benchmark/accuracy/eval_multihop.py @@ -0,0 +1,279 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +import argparse +import json +import os + +import requests +from evals.evaluation.rag_eval import Evaluator +from evals.metrics.ragas import RagasMetric +from evals.metrics.retrieval import RetrievalBaseMetric +from tqdm import tqdm + + +class MultiHop_Evaluator(Evaluator): + def get_ground_truth_text(self, data: dict): + return data["answer"] + + def get_query(self, data: dict): + return data["query"] + + def get_template(self): + return None + + def get_reranked_documents(self, query, docs, arguments): + data = { + "initial_query": query, + "retrieved_docs": [{"text": doc} for doc in docs], + "top_n": 10, + } + headers = {"Content-Type": "application/json"} + + response = requests.post(arguments.reranking_endpoint, data=json.dumps(data), headers=headers) + if response.ok: + reranked_documents = response.json()["documents"] + return reranked_documents + else: + print(f"Request for retrieval failed due to {response.text}.") + return [] + + def get_retrieved_documents(self, query, arguments): + data = {"text": query} + headers = {"Content-Type": "application/json"} + response = requests.post(arguments.embedding_endpoint, data=json.dumps(data), headers=headers) + if response.ok: + embedding = response.json()["embedding"] + else: + print(f"Request for embedding failed due to {response.text}.") + return [] + data = { + "text": query, + "embedding": embedding, + "search_type": arguments.search_type, + "k": arguments.retrival_k, + "fetch_k": arguments.fetch_k, + "lambda_mult": arguments.lambda_mult, + } + response = requests.post(arguments.retrieval_endpoint, data=json.dumps(data), headers=headers) + if response.ok: + retrieved_documents = response.json()["retrieved_docs"] + return [doc["text"] for doc in retrieved_documents] + else: + print(f"Request for retrieval failed due to {response.text}.") + return [] + + def get_retrieval_metrics(self, all_queries, arguments): + print("start to retrieve...") + metric = RetrievalBaseMetric() + hits_at_10 = 0 + hits_at_4 = 0 + map_at_10 = 0 + mrr_at_10 = 0 + total = 0 + for data in tqdm(all_queries): + if data["question_type"] == "null_query": + continue + query = data["query"] + retrieved_documents = self.get_retrieved_documents(query, arguments) + if arguments.rerank: + retrieved_documents = self.get_reranked_documents(query, retrieved_documents, arguments) + golden_context = [each["fact"] for each in data["evidence_list"]] + test_case = { + "input": query, + "golden_context": golden_context, + "retrieval_context": retrieved_documents, 
+ } + results = metric.measure(test_case) + hits_at_10 += results["Hits@10"] + hits_at_4 += results["Hits@4"] + map_at_10 += results["MAP@10"] + mrr_at_10 += results["MRR@10"] + total += 1 + + # Calculate average metrics over all queries + hits_at_10 = hits_at_10 / total + hits_at_4 = hits_at_4 / total + map_at_10 = map_at_10 / total + mrr_at_10 = mrr_at_10 / total + + return { + "Hits@10": hits_at_10, + "Hits@4": hits_at_4, + "MAP@10": map_at_10, + "MRR@10": mrr_at_10, + } + + def evaluate(self, all_queries, arguments): + results = [] + accuracy = 0 + index = 0 + for data in tqdm(all_queries): + if data["question_type"] == "null_query": + continue + + generated_text = self.send_request(data, arguments) + data["generated_text"] = generated_text + + # same method with paper: https://github.com/yixuantt/MultiHop-RAG/issues/8 + if data["answer"] in generated_text: + accuracy += 1 + result = {"id": index, **self.scoring(data)} + results.append(result) + index += 1 + + valid_results = self.remove_invalid(results) + + try: + overall = self.compute_overall(valid_results) if len(valid_results) > 0 else {} + except Exception as e: + print(repr(e)) + overall = dict() + + overall.update({"accuracy": accuracy / len(results)}) + return overall + + def get_ragas_metrics(self, all_queries, arguments): + from langchain_huggingface import HuggingFaceEndpointEmbeddings + + embeddings = HuggingFaceEndpointEmbeddings(model=arguments.tei_embedding_endpoint) + + metric = RagasMetric(threshold=0.5, model=arguments.llm_endpoint, embeddings=embeddings) + all_answer_relevancy = 0 + all_faithfulness = 0 + ragas_inputs = { + "question": [], + "answer": [], + "ground_truth": [], + "contexts": [], + } + + for data in tqdm(all_queries): + if data["question_type"] == "null_query": + continue + retrieved_documents = self.get_retrieved_documents(data["query"], arguments) + generated_text = self.send_request(data, arguments) + data["generated_text"] = generated_text + + ragas_inputs["question"].append(data["query"]) + ragas_inputs["answer"].append(generated_text) + ragas_inputs["ground_truth"].append(data["answer"]) + ragas_inputs["contexts"].append(retrieved_documents[:3]) + + if len(ragas_inputs["question"]) >= arguments.limits: + break + + ragas_metrics = metric.measure(ragas_inputs) + return ragas_metrics + + +def args_parser(): + parser = argparse.ArgumentParser() + + parser.add_argument( + "--service_url", type=str, default="http://localhost:8888/v1/chatqna", help="Service URL address." + ) + parser.add_argument("--output_dir", type=str, default="./output", help="Directory to save evaluation results.") + parser.add_argument( + "--temperature", type=float, default=0.1, help="Controls the randomness of the model's text generation" + ) + parser.add_argument( + "--max_new_tokens", type=int, default=1280, help="Maximum number of new tokens to be generated by the model" + ) + parser.add_argument( + "--chunk_size", type=int, default=256, help="the maximum number of characters that a chunk can contain" + ) + parser.add_argument( + "--chunk_overlap", + type=int, + default=100, + help="the number of characters that should overlap between two adjacent chunks", + ) + parser.add_argument("--search_type", type=str, default="similarity", help="similarity type") + parser.add_argument("--retrival_k", type=int, default=10, help="Number of Documents to return.") + parser.add_argument( + "--fetch_k", type=int, default=20, help="Number of Documents to fetch to pass to MMR algorithm." 
+ ) + parser.add_argument( + "--lambda_mult", + type=float, + default=0.5, + help="Number between 0 and 1 that determines the degree of diversity among the results with 0 corresponding to maximum diversity and 1 to minimum diversity. Defaults to 0.5.", + ) + parser.add_argument("--dataset_path", default=None, help="Path to the dataset") + parser.add_argument("--docs_path", default=None, help="Path to the retrieval documents") + + # Retriever related options + parser.add_argument("--ingest_docs", action="store_true", help="Whether to ingest documents to vector database") + parser.add_argument("--retrieval_metrics", action="store_true", help="Whether to compute retrieval metrics.") + parser.add_argument("--ragas_metrics", action="store_true", help="Whether to compute ragas metrics.") + parser.add_argument("--limits", type=int, default=100, help="Number of examples to be evaluated by llm-as-judge") + parser.add_argument( + "--database_endpoint", type=str, default="http://localhost:6007/v1/dataprep", help="Service URL address." + ) + parser.add_argument( + "--embedding_endpoint", type=str, default="http://localhost:6000/v1/embeddings", help="Service URL address." + ) + parser.add_argument( + "--tei_embedding_endpoint", + type=str, + default="http://localhost:8090", + help="Service URL address of tei embedding.", + ) + parser.add_argument( + "--retrieval_endpoint", type=str, default="http://localhost:7000/v1/retrieval", help="Service URL address." + ) + parser.add_argument("--rerank", action="store_true", help="Whether to use rerank microservice.") + parser.add_argument( + "--reranking_endpoint", type=str, default="http://localhost:8000/v1/reranking", help="Service URL address." + ) + parser.add_argument("--llm_endpoint", type=str, default=None, help="Service URL address.") + parser.add_argument( + "--show_progress_bar", action="store", default=True, type=bool, help="Whether to show a progress bar" + ) + parser.add_argument("--contain_original_data", action="store_true", help="Whether to contain original data") + + args = parser.parse_args() + return args + + +def main(): + args = args_parser() + + evaluator = MultiHop_Evaluator() + + with open(args.docs_path, "r") as file: + doc_data = json.load(file) + + documents = [] + for doc in doc_data: + metadata = {"title": doc["title"], "published_at": doc["published_at"], "source": doc["source"]} + documents.append(doc["body"]) + + # save docs to a tmp file + tmp_corpus_file = "tmp_corpus.txt" + with open(tmp_corpus_file, "w") as f: + for doc in documents: + f.write(doc + "\n") + + if args.ingest_docs: + evaluator.ingest_docs(tmp_corpus_file, args.database_endpoint, args.chunk_size, args.chunk_overlap) + + with open(args.dataset_path, "r") as file: + all_queries = json.load(file) + + # get retrieval quality + if args.retrieval_metrics: + retrieval_metrics = evaluator.get_retrieval_metrics(all_queries, args) + print(retrieval_metrics) + + # get rag quality + if args.ragas_metrics: + ragas_metrics = evaluator.get_ragas_metrics(all_queries, args) + print(ragas_metrics) + + +if __name__ == "__main__": + main() diff --git a/ChatQnA/benchmark/accuracy/process_crud_dataset.py b/ChatQnA/benchmark/accuracy/process_crud_dataset.py new file mode 100644 index 000000000..8bcc81c1a --- /dev/null +++ b/ChatQnA/benchmark/accuracy/process_crud_dataset.py @@ -0,0 +1,9 @@ +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +import os + +path = os.path.join(os.path.dirname(__file__), "./data/80000_docs") +for file in os.listdir(path): + 
src_file = os.path.join(path, file) + os.rename(src_file, src_file + ".txt") diff --git a/ChatQnA/benchmark/accuracy/run_acc.sh b/ChatQnA/benchmark/accuracy/run_acc.sh new file mode 100644 index 000000000..311dbb038 --- /dev/null +++ b/ChatQnA/benchmark/accuracy/run_acc.sh @@ -0,0 +1,64 @@ +#!/bin/bash +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +set -x + +function main { + + init_params "$@" + # run_benchmark + echo $dataset + if [[ ${dataset} == "MultiHop" ]]; then + run_multihop + elif [[ ${dataset} == "crud" ]]; then + run_crud + fi + +} + +# init params +function init_params { + for var in "$@" + do + case $var in + --dataset=*) + dataset=$( echo $var |cut -f2 -d=) + ;; + *) + echo "Error: No such parameter: ${var}" + exit 1 + ;; + esac + done +} + +# run_multihop +function run_multihop { + git clone https://github.com/yixuantt/MultiHop-RAG.git + + python eval_multihop.py \ + --docs_path MultiHop-RAG/dataset/corpus.json \ + --dataset_path MultiHop-RAG/dataset/MultiHopRAG.json \ + --ingest_docs \ + --retrieval_metrics + +} + +# run_crud +function run_crud { + + git clone https://github.com/IAAR-Shanghai/CRUD_RAG + mkdir data/ + cp CRUD_RAG/data/crud_split/split_merged.json data/ + cp -r CRUD_RAG/data/80000_docs/ data/ + python process_crud_dataset.py + + python eval_crud.py \ + --dataset_path ./data/split_merged.json \ + --docs_path ./data/80000_docs \ + --ingest_docs +} + + +main "$@" diff --git a/CodeGen/benchmark/accuracy/README.md b/CodeGen/benchmark/accuracy/README.md index 4e52a93e0..1a8ebf632 100644 --- a/CodeGen/benchmark/accuracy/README.md +++ b/CodeGen/benchmark/accuracy/README.md @@ -1,4 +1,4 @@ -# CodeGen accuracy Evaluation +# CodeGen Accuracy ## Evaluation Framework @@ -13,7 +13,7 @@ Please refer to [CodeGen Examples](https://github.com/opea-project/GenAIExamples Use `curl` command to test codegen service and ensure that it has started properly ```bash -export CODEGEN_ENDPOINT = "http://${your_ip}:7778/v1/codegen" +export CODEGEN_ENDPOINT="http://${your_ip}:7778/v1/codegen" curl $CODEGEN_ENDPOINT \ -H "Content-Type: application/json" \ -d '{"messages": "Implement a high-level API for a TODO list application. The API takes as input an operation request and updates the TODO list in place. If the request is invalid, raise an exception."}' @@ -24,7 +24,7 @@ curl $CODEGEN_ENDPOINT \ For evaluating the models on coding tasks or specifically coding LLMs, we follow the [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness) and provide the command line usage and function call usage. [HumanEval](https://huggingface.co/datasets/openai_humaneval), [HumanEval+](https://huggingface.co/datasets/evalplus/humanevalplus), [InstructHumanEval](https://huggingface.co/datasets/codeparrot/instructhumaneval), [APPS](https://huggingface.co/datasets/codeparrot/apps), [MBPP](https://huggingface.co/datasets/mbpp), [MBPP+](https://huggingface.co/datasets/evalplus/mbppplus), and [DS-1000](https://github.com/HKUNLP/DS-1000/) for both completion (left-to-right) and insertion (FIM) mode are available. -#### command line usage +#### Environment ```shell git clone https://github.com/opea-project/GenAIEval @@ -32,15 +32,14 @@ cd GenAIEval pip install -r requirements.txt pip install -e . 
-cd evals/evaluation/bigcode_evaluation_harness/examples -python main.py --model Qwen/CodeQwen1.5-7B-Chat \ - --tasks humaneval \ - --codegen_url $CODEGEN_ENDPOINT \ - --max_length_generation 2048 \ - --batch_size 1 \ - --save_generations \ - --save_references \ - --allow_code_execution +``` + +#### Evaluation + +``` +export CODEGEN_ENDPOINT="http://${your_ip}:7778/v1/codegen" +export CODEGEN_MODEL=your_model +bash run_acc.sh $CODEGEN_MODEL $CODEGEN_ENDPOINT ``` **_Note:_** Currently, our framework is designed to execute tasks in full. To ensure the accuracy of results, we advise against using the 'limit' or 'limit_start' parameters to restrict the number of test samples. diff --git a/CodeGen/benchmark/accuracy/main.py b/CodeGen/benchmark/accuracy/main.py new file mode 100644 index 000000000..d9ed623ff --- /dev/null +++ b/CodeGen/benchmark/accuracy/main.py @@ -0,0 +1,17 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +# +from evals.evaluation.bigcode_evaluation_harness import evaluate, setup_parser + + +def main(): + eval_args = setup_parser() + results = evaluate(eval_args) + print(results) + + +if __name__ == "__main__": + main() diff --git a/CodeGen/benchmark/accuracy/run_acc.sh b/CodeGen/benchmark/accuracy/run_acc.sh new file mode 100644 index 000000000..a5c451965 --- /dev/null +++ b/CodeGen/benchmark/accuracy/run_acc.sh @@ -0,0 +1,13 @@ + + +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +python main.py --model $1 \ + --tasks humaneval \ + --codegen_url $2 \ + --max_length_generation 2048 \ + --batch_size 1 \ + --save_generations \ + --save_references \ + --allow_code_execution diff --git a/FaqGen/benchmark/accuracy/README.md b/FaqGen/benchmark/accuracy/README.md index 1ff2ce1f1..9ca392e5c 100644 --- a/FaqGen/benchmark/accuracy/README.md +++ b/FaqGen/benchmark/accuracy/README.md @@ -1,4 +1,4 @@ -# FaqGen Evaluation +# FaqGen Accuracy ## Dataset diff --git a/FaqGen/benchmark/accuracy/run_acc.sh b/FaqGen/benchmark/accuracy/run_acc.sh new file mode 100644 index 000000000..766b718ff --- /dev/null +++ b/FaqGen/benchmark/accuracy/run_acc.sh @@ -0,0 +1,4 @@ +# Copyright (C) 2024 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +python evaluate.py