update examples accuracy (#941)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
lkk12014402 and pre-commit-ci[bot] authored Oct 14, 2024
1 parent 441f8cc commit 088ab98
Showing 12 changed files with 784 additions and 14 deletions.
2 changes: 1 addition & 1 deletion AudioQnA/benchmark/accuracy/README.md
@@ -1,4 +1,4 @@
# AudioQnA accuracy Evaluation
# AudioQnA Accuracy

AudioQnA is an example that demonstrates the integration of Generative AI (GenAI) models for performing question answering (QnA) on audio scenes; it contains Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) components. The following is the pipeline for evaluating the ASR accuracy.

5 changes: 5 additions & 0 deletions AudioQnA/benchmark/accuracy/run_acc.sh
@@ -0,0 +1,5 @@

# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

python online_evaluate.py
170 changes: 170 additions & 0 deletions ChatQnA/benchmark/accuracy/README.md
@@ -0,0 +1,170 @@
# ChatQnA Accuracy

ChatQnA is a Retrieval-Augmented Generation (RAG) pipeline, which can enhance generative models through external information retrieval.

To evaluate accuracy, we use two recently published datasets and more than ten popular, comprehensive metrics:

- Datasets
  - [MultiHop](https://arxiv.org/pdf/2401.15391) (English dataset)
  - [CRUD](https://arxiv.org/abs/2401.17043) (Chinese dataset)
- Metrics (measuring the accuracy of both context retrieval and response generation)
  - Evaluation of retrieval/reranking
    - MRR@10
    - MAP@10
    - Hits@10
    - Hits@4
    - LLM-as-a-Judge
  - Evaluation of the generated response from the end-to-end pipeline
    - BLEU
    - ROUGE(L)
    - LLM-as-a-Judge

## Prerequisite

### Environment

```bash
git clone https://github.com/opea-project/GenAIEval
cd GenAIEval
pip install -r requirements.txt
pip install -e .
```
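
Optionally, you can verify the installation by importing the evaluator class used by the scripts in this example; this is just a quick sanity check:

```bash
# Verify that the GenAIEval package installed above is importable.
python -c "from evals.evaluation.rag_eval import Evaluator; print('GenAIEval installed')"
```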

## MultiHop (English dataset)

[MultiHop-RAG](https://arxiv.org/pdf/2401.15391): a QA dataset for evaluating retrieval and reasoning across documents with metadata in RAG pipelines. It contains 2,556 queries, with the evidence for each query distributed across two to four documents. The queries also involve document metadata, reflecting complex scenarios commonly found in real-world RAG applications.

### Launch Service of RAG System

Please refer to this [guide](https://github.com/opea-project/GenAIExamples/blob/main/ChatQnA/README.md) to launch the `ChatQnA` service.

### Launch Service of LLM-as-a-Judge

To set up the judge LLM, we can use [tgi-gaudi](https://github.com/huggingface/tgi-gaudi) to launch a service. For example, the following command sets up the [mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) model on 2 Gaudi2 cards:

```bash
# please set your llm_port and hf_token
docker run -p {your_llm_port}:80 --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HF_TOKEN={your_hf_token} --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.1 --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --max-input-tokens 2048 --max-total-tokens 4096 --sharded true --num-shard 2
# for better performance, set `PREFILL_BATCH_BUCKET_SIZE`, `BATCH_BUCKET_SIZE`, `max-batch-total-tokens`, `max-batch-prefill-tokens`
docker run -p {your_llm_port}:80 --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HF_TOKEN={your_hf_token} -e PREFILL_BATCH_BUCKET_SIZE=1 -e BATCH_BUCKET_SIZE=8 --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.5 --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --max-input-tokens 2048 --max-total-tokens 4096 --sharded true --num-shard 2 --max-batch-total-tokens 65536 --max-batch-prefill-tokens 2048
```
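
Before running the evaluation, you may want to confirm that the judge endpoint is reachable. A minimal sanity check, assuming the port you chose above and TGI's standard `/generate` API:

```bash
# Quick check that the TGI judge endpoint responds (replace {your_llm_port} with the port used above).
curl http://localhost:{your_llm_port}/generate \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"inputs": "What is Retrieval-Augmented Generation?", "parameters": {"max_new_tokens": 32}}'
```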

### Prepare Dataset

We use the evaluation dataset from the [MultiHop-RAG](https://github.com/yixuantt/MultiHop-RAG) repo. Use the command below to prepare the dataset.

```bash
git clone https://github.com/yixuantt/MultiHop-RAG.git
```

### Evaluation

Use the command below to run the evaluation. Note that for the first run, the argument `--ingest_docs` should be added to ingest the documents into the vector database; for subsequent runs, it should be omitted. Set `--retrieval_metrics` to get retrieval-related metrics (MRR@10/MAP@10/Hits@10/Hits@4). Set `--ragas_metrics` and `--llm_endpoint` to get end-to-end RAG pipeline metrics (faithfulness/answer_relevancy/...), which are judged by LLMs. `--limits` defaults to 100, which means only 100 examples are evaluated by LLM-as-a-Judge, since this step is very time-consuming.

If you are using Docker Compose to deploy the `ChatQnA` system, you can simply run the evaluation as follows:

```bash
python eval_multihop.py --docs_path MultiHop-RAG/dataset/corpus.json --dataset_path MultiHop-RAG/dataset/MultiHopRAG.json --ingest_docs --retrieval_metrics --ragas_metrics --llm_endpoint http://{llm_as_judge_ip}:{llm_as_judge_port}/generate
```

If you are using Kubernetes manifests/Helm to deploy the `ChatQnA` system, you need to specify additional arguments, as follows:

```bash
python eval_multihop.py --docs_path MultiHop-RAG/dataset/corpus.json --dataset_path MultiHop-RAG/dataset/MultiHopRAG.json --ingest_docs --retrieval_metrics --ragas_metrics --llm_endpoint http://{llm_as_judge_ip}:{llm_as_judge_port}/generate --database_endpoint http://{your_dataprep_ip}:{your_dataprep_port}/v1/dataprep --embedding_endpoint http://{your_embedding_ip}:{your_embedding_port}/v1/embeddings --tei_embedding_endpoint http://{your_tei_embedding_ip}:{your_tei_embedding_port} --retrieval_endpoint http://{your_retrieval_ip}:{your_retrieval_port}/v1/retrieval --service_url http://{your_chatqna_ip}:{your_chatqna_port}/v1/chatqna
```
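
If you are unsure about the service addresses in a Kubernetes deployment, they can typically be listed with `kubectl`. The namespace and service names below are placeholders and depend on your manifests/Helm values:

```bash
# List ChatQnA-related services with their cluster IPs and ports (namespace is an assumption).
kubectl get svc -n chatqna
# Optionally forward a service port to localhost so the evaluation script can reach it from outside the cluster.
kubectl port-forward svc/chatqna 8888:8888 -n chatqna
```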

The default values for arguments are:
|Argument|Default value|
|--------|-------------|
|service_url|http://localhost:8888/v1/chatqna|
|database_endpoint|http://localhost:6007/v1/dataprep|
|embedding_endpoint|http://localhost:6000/v1/embeddings|
|tei_embedding_endpoint|http://localhost:8090|
|retrieval_endpoint|http://localhost:7000/v1/retrieval|
|reranking_endpoint|http://localhost:8000/v1/reranking|
|output_dir|./output|
|temperature|0.1|
|max_new_tokens|1280|
|chunk_size|256|
|chunk_overlap|100|
|search_type|similarity|
|retrival_k|10|
|fetch_k|20|
|lambda_mult|0.5|
|dataset_path|None|
|docs_path|None|
|limits|100|

You can check argument details with the command below:

```bash
python eval_multihop.py --help
```

## CRUD (Chinese dataset)

[CRUD-RAG](https://arxiv.org/abs/2401.17043) is a Chinese benchmark for RAG (Retrieval-Augmented Generation) systems. This example uses CRUD-RAG to evaluate the RAG system.

### Prepare Dataset

We use the evaluation dataset from the [CRUD-RAG](https://github.com/IAAR-Shanghai/CRUD_RAG) repo. Use the commands below to prepare the dataset.

```bash
git clone https://github.com/IAAR-Shanghai/CRUD_RAG
mkdir data/
cp CRUD_RAG/data/crud_split/split_merged.json data/
cp -r CRUD_RAG/data/80000_docs/ data/
python process_crud_dataset.py
```
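
After these steps, the `data/` directory should contain the merged evaluation split and the document corpus referenced by the defaults below:

```bash
# Quick check of the prepared dataset layout (file names follow the copy commands above).
ls data/
# expected: 80000_docs/  split_merged.json
```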

### Launch Service of RAG System

Please refer to this [guide](https://github.com/opea-project/GenAIExamples/blob/main/ChatQnA/README.md) to launch the `ChatQnA` service. For the Chinese dataset, you should replace the English embedding and LLM models with Chinese ones, for example `EMBEDDING_MODEL_ID="BAAI/bge-base-zh-v1.5"` and `LLM_MODEL_ID=Qwen/Qwen2-7B-Instruct`.

### Evaluation

Use the command below to run the evaluation. Note that for the first run, the argument `--ingest_docs` should be added to ingest the documents into the vector database; for subsequent runs, it should be omitted.

If you are using Docker Compose to deploy the `ChatQnA` system, you can simply run the evaluation as follows:

```bash
python eval_crud.py --dataset_path ./data/split_merged.json --docs_path ./data/80000_docs --ingest_docs

# if you want to get ragas metrics
python eval_crud.py --dataset_path ./data/split_merged.json --docs_path ./data/80000_docs --contain_original_data --llm_endpoint "http://{llm_as_judge_ip}:{llm_as_judge_port}" --ragas_metrics
```

If you are using Kubernetes manifests/Helm to deploy the `ChatQnA` system, you need to specify additional arguments, as follows:

```bash
python eval_crud.py --dataset_path ./data/split_merged.json --docs_path ./data/80000_docs --ingest_docs --database_endpoint http://{your_dataprep_ip}:{your_dataprep_port}/v1/dataprep --embedding_endpoint http://{your_embedding_ip}:{your_embedding_port}/v1/embeddings --retrieval_endpoint http://{your_retrieval_ip}:{your_retrieval_port}/v1/retrieval --service_url http://{your_chatqna_ip}:{your_chatqna_port}/v1/chatqna
```

The default values for arguments are:
|Argument|Default value|
|--------|-------------|
|service_url|http://localhost:8888/v1/chatqna|
|database_endpoint|http://localhost:6007/v1/dataprep|
|embedding_endpoint|http://localhost:6000/v1/embeddings|
|retrieval_endpoint|http://localhost:7000/v1/retrieval|
|reranking_endpoint|http://localhost:8000/v1/reranking|
|output_dir|./output|
|temperature|0.1|
|max_new_tokens|1280|
|chunk_size|256|
|chunk_overlap|100|
|dataset_path|./data/split_merged.json|
|docs_path|./data/80000_docs|
|tasks|["question_answering"]|

You can check argument details with the command below:

```bash
python eval_crud.py --help
```

## Acknowledgements

This example is mostly adapted from the [MultiHop-RAG](https://github.com/yixuantt/MultiHop-RAG) and [CRUD-RAG](https://github.com/IAAR-Shanghai/CRUD_RAG) repos; we thank the authors for their great work!
210 changes: 210 additions & 0 deletions ChatQnA/benchmark/accuracy/eval_crud.py
@@ -0,0 +1,210 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0


import argparse
import json
import os

from evals.evaluation.rag_eval import Evaluator
from evals.evaluation.rag_eval.template import CRUDTemplate
from evals.metrics.ragas import RagasMetric
from tqdm import tqdm


class CRUD_Evaluator(Evaluator):
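    # Each CRUD-RAG task stores its query, reference document, and ground truth under different JSON fields;
    # the helper methods below select the appropriate field for the current task.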
def get_ground_truth_text(self, data: dict):
if self.task == "summarization":
ground_truth_text = data["summary"]
elif self.task == "question_answering":
ground_truth_text = data["answers"]
elif self.task == "continuation":
ground_truth_text = data["continuing"]
elif self.task == "hallucinated_modified":
ground_truth_text = data["hallucinatedMod"]
else:
raise NotImplementedError(
f"Unknown task {self.task}, only support "
"summarization, question_answering, continuation and hallucinated_modified."
)
return ground_truth_text

def get_query(self, data: dict):
if self.task == "summarization":
query = data["text"]
elif self.task == "question_answering":
query = data["questions"]
elif self.task == "continuation":
query = data["beginning"]
elif self.task == "hallucinated_modified":
query = data["newsBeginning"]
else:
raise NotImplementedError(
f"Unknown task {self.task}, only support "
"summarization, question_answering, continuation and hallucinated_modified."
)
return query

def get_document(self, data: dict):
if self.task == "summarization":
document = data["text"]
elif self.task == "question_answering":
document = data["news1"]
elif self.task == "continuation":
document = data["beginning"]
elif self.task == "hallucinated_modified":
document = data["newsBeginning"]
else:
raise NotImplementedError(
f"Unknown task {self.task}, only support "
"summarization, question_answering, continuation and hallucinated_modified."
)
return document

def get_template(self):
if self.task == "summarization":
template = CRUDTemplate.get_summarization_template()
elif self.task == "question_answering":
template = CRUDTemplate.get_question_answering_template()
elif self.task == "continuation":
template = CRUDTemplate.get_continuation_template()
else:
raise NotImplementedError(
f"Unknown task {self.task}, only support "
"summarization, question_answering, continuation and hallucinated_modified."
)
return template

def post_process(self, result):
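        # Keep only the model answer enclosed in <response> ... </response> tags from the raw generation.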
return result.split("<response>")[-1].split("</response>")[0].strip()

def get_ragas_metrics(self, results, arguments):
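        # Score the generated answers with RAGAS (faithfulness and answer_relevancy), using the
        # TEI embedding endpoint and the judge LLM endpoint supplied on the command line.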
from langchain_huggingface import HuggingFaceEndpointEmbeddings

embeddings = HuggingFaceEndpointEmbeddings(model=arguments.tei_embedding_endpoint)

metric = RagasMetric(
threshold=0.5,
model=arguments.llm_endpoint,
embeddings=embeddings,
metrics=["faithfulness", "answer_relevancy"],
)

all_answer_relevancy = 0
all_faithfulness = 0
ragas_inputs = {
"question": [],
"answer": [],
"ground_truth": [],
"contexts": [],
}

valid_results = self.remove_invalid(results["results"])

for data in tqdm(valid_results):
data = data["original_data"]

query = self.get_query(data)
generated_text = data["generated_text"]
ground_truth = data["ground_truth_text"]
retrieved_documents = data["retrieved_documents"]

ragas_inputs["question"].append(query)
ragas_inputs["answer"].append(generated_text)
ragas_inputs["ground_truth"].append(ground_truth)
ragas_inputs["contexts"].append(retrieved_documents[:3])

ragas_metrics = metric.measure(ragas_inputs)
return ragas_metrics


def args_parser():
parser = argparse.ArgumentParser()

parser.add_argument(
"--service_url", type=str, default="http://localhost:8888/v1/chatqna", help="Service URL address."
)
parser.add_argument("--output_dir", type=str, default="./output", help="Directory to save evaluation results.")
parser.add_argument(
"--temperature", type=float, default=0.1, help="Controls the randomness of the model's text generation"
)
parser.add_argument(
"--max_new_tokens", type=int, default=1280, help="Maximum number of new tokens to be generated by the model"
)
parser.add_argument(
"--chunk_size", type=int, default=256, help="the maximum number of characters that a chunk can contain"
)
parser.add_argument(
"--chunk_overlap",
type=int,
default=100,
help="the number of characters that should overlap between two adjacent chunks",
)
parser.add_argument("--dataset_path", default="../data/split_merged.json", help="Path to the dataset")
parser.add_argument("--docs_path", default="../data/80000_docs", help="Path to the retrieval documents")

# Retriever related options
parser.add_argument("--tasks", default=["question_answering"], nargs="+", help="Task to perform")
parser.add_argument("--ingest_docs", action="store_true", help="Whether to ingest documents to vector database")
parser.add_argument(
"--database_endpoint", type=str, default="http://localhost:6007/v1/dataprep", help="Service URL address."
)
parser.add_argument(
"--embedding_endpoint", type=str, default="http://localhost:6000/v1/embeddings", help="Service URL address."
)
parser.add_argument(
"--retrieval_endpoint", type=str, default="http://localhost:7000/v1/retrieval", help="Service URL address."
)
parser.add_argument(
"--tei_embedding_endpoint",
type=str,
default="http://localhost:8090",
help="Service URL address of tei embedding.",
)
parser.add_argument("--ragas_metrics", action="store_true", help="Whether to compute ragas metrics.")
parser.add_argument("--llm_endpoint", type=str, default=None, help="Service URL address.")
parser.add_argument(
"--show_progress_bar", action="store", default=True, type=bool, help="Whether to show a progress bar"
)
parser.add_argument("--contain_original_data", action="store_true", help="Whether to contain original data")

args = parser.parse_args()
return args


def main():
args = args_parser()
if os.path.isfile(args.dataset_path):
with open(args.dataset_path) as f:
all_datasets = json.load(f)
else:
        raise FileNotFoundError(f"Evaluation dataset file {args.dataset_path} does not exist.")
os.makedirs(args.output_dir, exist_ok=True)
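    # Evaluate each requested task on its CRUD-RAG dataset split and save per-task results.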
for task in args.tasks:
if task == "question_answering":
dataset = all_datasets["questanswer_1doc"]
elif task == "summarization":
dataset = all_datasets["event_summary"]
else:
raise NotImplementedError(
f"Unknown task {task}, only support "
"summarization, question_answering, continuation and hallucinated_modified."
)
output_save_path = os.path.join(args.output_dir, f"{task}.json")
evaluator = CRUD_Evaluator(dataset=dataset, output_path=output_save_path, task=task)
if args.ingest_docs:
CRUD_Evaluator.ingest_docs(args.docs_path, args.database_endpoint, args.chunk_size, args.chunk_overlap)
results = evaluator.evaluate(
args, show_progress_bar=args.show_progress_bar, contain_original_data=args.contain_original_data
)
print(results["overall"])
if args.ragas_metrics:
ragas_metrics = evaluator.get_ragas_metrics(results, args)
print(ragas_metrics)
print(f"Evaluation results of task {task} saved to {output_save_path}.")


if __name__ == "__main__":
main()