Repository: https://github.com/tech-podcasts/JinJinLeDao_QA_Dataset
HuggingFace: https://huggingface.co/datasets/wavpub/JinJinLeDao_QA_Dataset
The dataset contains over 18,000 Chinese question-answer pairs extracted from 281 episodes of the Chinese podcast "JinJinLeDao". The subtitles were extracted using the OpenAI Whisper transcription tool, and the question-answer pairs were generated using GPT-3.5 by dividing the subtitles into blocks and prompting the model to generate questions and answers.
This dataset can be used for various natural language processing tasks, such as question answering and text generation, among others.
The dataset is in Chinese (Mandarin).
The dataset contains over 18,000 question-answer pairs.
Each data instance contains the following fields:
question: The generated question based on the text block.
answer: The corresponding answer to the generated question.
episode: The title of the podcast episode from which the question-answer pair was extracted.
podcast: The name of the specific program within the "JinJinLeDao" podcast where the episode was featured.
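As a minimal sketch, a record with these four fields could be modeled and checked as shown here. The `QARecord` class and `parse_record` helper are hypothetical names introduced for illustration, and the sample values are invented placeholders rather than rows from the dataset. The sketch also assumes every field is stored as a string, which the description does not state explicitly.

```python
from dataclasses import dataclass

# The four field names described above.
REQUIRED_FIELDS = ("question", "answer", "episode", "podcast")


@dataclass(frozen=True)
class QARecord:
    """One question-answer pair (hypothetical container, not part of the dataset)."""
    question: str
    answer: str
    episode: str
    podcast: str


def parse_record(raw: dict) -> QARecord:
    """Check that all four fields are present and build a QARecord."""
    missing = [name for name in REQUIRED_FIELDS if name not in raw]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Assumption: all fields are text, so coerce them to str.
    return QARecord(**{name: str(raw[name]) for name in REQUIRED_FIELDS})


# Placeholder values for illustration only.
example = parse_record({
    "question": "placeholder question",
    "answer": "placeholder answer",
    "episode": "placeholder episode title",
    "podcast": "placeholder program name",
})
```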
The dataset does not have predefined splits. Users can split the data according to their own requirements.
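One possible way to create splits is sketched below, assuming the records are available as a list of dictionaries with the fields described above. The function name `split_by_episode`, the default `test_fraction`, and the `seed` are illustrative choices, not something the dataset prescribes. The sketch groups by the `episode` field so that all pairs from one episode land in the same split, which avoids leaking near-duplicate context between train and test.

```python
import random


def split_by_episode(records, test_fraction=0.1, seed=0):
    """Split records into (train, test) so that no episode appears in both."""
    # Sort first so the shuffle is reproducible for a given seed.
    episodes = sorted({r["episode"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(episodes)

    # Hold out at least one episode for the test split.
    n_test = max(1, round(len(episodes) * test_fraction))
    test_episodes = set(episodes[:n_test])

    train = [r for r in records if r["episode"] not in test_episodes]
    test = [r for r in records if r["episode"] in test_episodes]
    return train, test
```

A purely random split over individual pairs is simpler, but pairs generated from the same text block may then end up on both sides of the split.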
The dataset was created to provide a resource for Chinese-language natural language processing research.
The source data consists of 281 episodes of the Chinese podcast "JinJinLeDao", which were transcribed using the OpenAI Whisper transcription tool.
The source language producers are the hosts of the "JinJinLeDao" podcast.
The dataset was annotated using an automated process, in which GPT-3.5 was used to generate questions and answers based on text prompts.
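The block-and-prompt process can be illustrated with a rough sketch. This is not the authors' actual pipeline: the block size (`max_chars`), the helper names `split_into_blocks` and `build_prompt`, and the prompt wording are all assumptions made for illustration, and the call to the model itself is deliberately left out.

```python
def split_into_blocks(lines, max_chars=1000):
    """Group transcript lines into blocks of roughly max_chars characters.

    max_chars is an assumed value; the original block size is not documented.
    """
    blocks, current, size = [], [], 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty subtitle lines
        # Start a new block when adding this line would exceed the limit.
        if current and size + len(line) > max_chars:
            blocks.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


def build_prompt(block):
    """Build a text prompt for one block (hypothetical wording)."""
    return (
        "Read the following podcast transcript excerpt and write "
        "question-answer pairs about it in Chinese.\n\n" + block
    )
```

Each prompt produced this way would then be sent to the language model, and the returned pairs stored together with the episode and program names.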
The initial annotation of the dataset was carried out through an automated process, without the involvement of human annotators. However, we later introduced a manual correction step to improve the accuracy of the data, and we would like to express our gratitude to Chunhui Gao for taking the time to assist us with this task.
The dataset does not contain any personal or sensitive information, except for some user names mentioned in the audio content.
The dataset was created for academic and research purposes only.
As the dataset was generated using an automated process, there may be biases in the generated questions and answers.
The dataset was generated using an automated process, which may result in lower quality data compared to manually annotated datasets.
The dataset was curated by JinJinLeDao and Hongyang Jin.
The dataset is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
If you use this dataset in your research, please cite the following paper:
N/A
Thanks to JinJinLeDao for providing the data and to Hongyang Jin for curating and sharing this dataset. We would also like to express our gratitude to Chunhui Gao for his assistance in improving the accuracy of the data.