This repository presents corpus2question, a method for summarizing and exploring datasets based on latent questions on documents. It also contains the reference implementation for the paper Can questions summarize a corpus? Using question generation for characterizing COVID-19 research.
corpus2question relies on the question generation network used in doc2query and frequency aggregations. Check our tutorial for a small example.
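To give a rough feel for the frequency aggregation step, here is a minimal sketch in plain Python. It is not the reference implementation: the function names `normalize` and `top_questions`, the lowercase/whitespace normalization, and the choice to count a question at most once per document are all assumptions made for illustration. The input stands in for questions that a doc2query-style model would generate for each document.

```python
from collections import Counter


def normalize(question: str) -> str:
    """Lowercase and collapse whitespace so near-identical questions match.

    Assumption: this normalization is illustrative, not taken from the paper.
    """
    return " ".join(question.lower().split())


def top_questions(questions_per_doc, k=10):
    """Return the k most frequent questions across all documents.

    questions_per_doc: iterable of lists, one list of generated questions
    per document (hypothetical input format).
    """
    counts = Counter()
    for questions in questions_per_doc:
        # Assumption: count each distinct question once per document,
        # so a single document cannot dominate the ranking.
        counts.update({normalize(q) for q in questions})
    return counts.most_common(k)


# Toy stand-in for model output over three documents.
generated = [
    ["What is the incubation period of COVID-19?",
     "what is the incubation period of covid-19?"],
    ["What is the incubation period of COVID-19?",
     "How is COVID-19 transmitted?"],
    ["How is COVID-19 transmitted?",
     "What is the incubation period of  COVID-19?"],
]

for question, count in top_questions(generated, k=2):
    print(count, question)
```

The most frequent questions then act as a summary of what the corpus is about; swapping in real generated questions only requires producing one list of strings per document.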
All raw generated questions over the CORD-19 dataset are available at this link in the CSV format. You can also find the aggregated top 10k at this link. The reference implementation for the paper is available at this notebook.
If you use corpus2question in your academic work, or use the generated questions over the CORD-19 dataset, please cite us with:
@misc{surita2020questions,
title={Can questions summarize a corpus? Using question generation for characterizing COVID-19 research},
author={Gabriela Surita and Rodrigo Nogueira and Roberto Lotufo},
year={2020},
eprint={2009.09290},
archivePrefix={arXiv},
primaryClass={cs.IR}
}