processed 718492 article entries from CORD-19 metadata file
total distinct cord id with JSON entry: 640739 (77753 duplicated cord ids)
total CORD-19 official PMC-derived JSON fulltext: 184289
total CORD-19 official PDF-derived JSON fulltex: 212489
total entry with at least one JSON fulltext (official CORD-19): 235673
total CORD-19 JSON fulltext we harvested too: 217243
via Unpaywall PDF and GROBID: 217070
via PMC/NLM and Pub2TEI: 137335
-> we missed 18430 entries, see file /media/lopez/store/cord-19/data/missed_entries.csv for the list missed entries
total extra fultext we harvested in addition: 79660
via Unpaywall PDF and GROBID: 79639
via PMC/NLM and Pub2TEI: 16907
-> see file /media/lopez/store/cord-19/data/extra_entries.csv for the list of extra entries
total additional fulltexts we harvested: 61230
total distinct cord id with TEI XML full text (our fulltext harvesting and conversion): 296903
296819 PDF files
Official changelog for this version:
---SUMMARY---
total metadata rows: 718492
CORD UIDs (new: 12137, removed: 2371)
Full text:
PDF - 240618 json (new: 4484, removed: 203)
PMC - 184667 json (new: 4047)
CORD-19 metadata.json file can have multiple rows with the same cord identifier see the FAQ:
- We observed that CORD-19 metadata.json file can indeed have explicit duplicates.
However, it is not clear why it is the case, when sources and metadata are identical:
$ grep 'fq4xq00d' metadata.csv
fq4xq00d,,WHO,Offline: COVID-19—what countries must do now,,,,unk,,2020,"Horton, Richard",Lancet,,#31271,,,,,214779677
fq4xq00d,,WHO,Offline: COVID-19—what countries must do now,,,,unk,,2020,"Horton, Richard",Lancet,,#31270,,,,,214779677
We counted 241335 distinct cord id
out of 253454 entries, a total of 12,119 explicit duplicates.
- However, for the large majority of duplicate cases, we have the same article, but different
cord id
identifiers. For example:
a78dpgjk,,WHO,Evaluating Potential Deceased Donor Renal Transplant Recipients for Asymptomatic COVID-19,,,,unk,,2020,"Ho, Quan Yao; Chung, Shimin Jasmine; Low, Shoen Choon Seng; Chen, Robert Chun; Tee, Swee Ping; Chan, Fu Zi Yvonne; Tan, Ban Hock; Kee, Terence Yi Shern",Transplantation Direct,,#379989,,,,,219879474
6zepsday,d2e68d7eb8b42ae905c460038d091585c354fc5a,Medline; PMC,Evaluating Potential Deceased Donor Renal Transplant Recipients for Asymptomatic COVID-19,10.1097/txd.0000000000001010,PMC7266364,32607425,cc-by-nc-nd,,2020-05-22,"Ho, Quan Yao; Chung, Shimin Jasmine; Low, Shoen Choon Seng; Chen, Robert Chun; Teh, Swee Ping; Chan, Fu Zi Yvonne; Tan, Ban Hock; Kee, Terence Yi Shern",Transplant Direct,,,,document_parses/pdf_json/d2e68d7eb8b42ae905c460038d091585c354fc5a.json,document_parses/pmc_json/PMC7266364.xml.json,https://www.ncbi.nlm.nih.gov/pubmed/32607425/; https://doi.org/10.1097/txd.0000000000001010,219879474
5gas6xri,,WHO,99mTc-Leukocyte Scintigraphy Revealed Viral Pulmonary Infection in a COVID-19 Patient,,,,unk,"Tc-leukocyte scintigraphy was performed on a 40-year-old woman with spiking fevers. A focus of intense uptake in the right upper thorax was identified, concerning for infection along the central line in the superior vena cava. Additionally, heterogeneously increased uptake in both lungs was noted, which suggested pulmonary infection. CT images of the chest showed patchy ground-glass changes in both lungs and a large consolidation in the right lower lobe, which were consistent with changes for COVID-19 (coronavirus disease 2019). Severe acute respiratory syndrome coronavirus 2 RNA test was positive. This case demonstrates that leukocyte uptake in bilateral lungs could reveal viral pulmonary infection in COVID-19.",2020,"Zheng, Jiefu; Liu, Yiyan",Clin. nucl. med,,#672034,,,,,221562769
65enq06o,23988b36eee15a54f905ec54811ca232192f10d6,Medline; PMC,(99m)Tc-Leukocyte Scintigraphy Revealed Viral Pulmonary Infection in a COVID-19 Patient,10.1097/rlu.0000000000003219,PMC7473795,32701817,no-cc,"(99m)Tc-leukocyte scintigraphy was performed on a 40-year-old woman with spiking fevers. A focus of intense uptake in the right upper thorax was identified, concerning for infection along the central line in the superior vena cava. Additionally, heterogeneously increased uptake in both lungs was noted, which suggested pulmonary infection. CT images of the chest showed patchy ground-glass changes in both lungs and a large consolidation in the right lower lobe, which were consistent with changes for COVID-19 (coronavirus disease 2019). Severe acute respiratory syndrome coronavirus 2 RNA test was positive. This case demonstrates that leukocyte uptake in bilateral lungs could reveal viral pulmonary infection in COVID-19.",2020-07-22,"Zheng, Jiefu; Liu, Yiyan",Clin Nucl Med,,,,document_parses/pdf_json/23988b36eee15a54f905ec54811ca232192f10d6.json,,https://www.ncbi.nlm.nih.gov/pubmed/32701817/; https://doi.org/10.1097/rlu.0000000000003219,221562769
We counted 161839 distinct entries out of 253454 entries, a total of 91,257 duplicates: 12,119 explicit duplicates and 79,138 article-level duplicates.
total entries from CORD-19 metadata file (including all duplicates): 253454
total distinct entries: 161839
total duplicates: 91257
total duplicated crod id: 12119 (explicit duplicates)
total article-level duplicates: 79138
- Result from our harvesting (given by option
--disgnostic
):
*
total entries: 161839
---
total valid entries: 138440 entries with valid OA URL and PDF and TEI XML
---
total invalid OA URL: 20697
total entries with valid OA URL: 141142
---
total invalid PDF: 22274
total entries with successfully downloaded PDF: 139565
---
total invalid TEI: 23399
total entries with successfully convereted TEI XML: 138440
---
total entries with GROBID TEI file: 138432
total entries with Pub2TEI TEI file: 104288
total entries with at least one TEI file: 140322
139192 PDF -> 138432 TEI XML (760 GROBID conversion failures)
104288 nxlm -> 104288 TEI XML (but 45 tranformations reported as containing some kind of failure)
entries with at least one fulltext tei file -> 140322
CORD-19 dataset coverage check (see check_cord19_coverage.py
):
253454 entries
241335 distinct cord id
total distinct PMC-derived JSON: 77115
total distinct PDF-derived JSON: 94541
total distinct entry with at least one JSON: 103587
Dataset "official" reported numbers see https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2020-09-11/changelog:
PDF - 106144 json
PMC - 77321 json
Here are the results regarding the CORD-19 version 5 (metadata.csv) to illustrate the interest of the tool:
official CORD-19 | this harvester | |
---|---|---|
total entries | 45,828 | 45,828 |
entries with valid OA URL | - | 42,742 |
entries with successfully downloaded PDF | - | 42,362 |
entries with structured full texts via GROBID | ~33,000 (JSON) | 41,070 (TEI XML) |
entries with structured full texts via PMC JATS | - | 15,955 (TEI XML) |
total entries with at least one structured full text | ~33,000 (JSON) | 41,609 (TEI XML) |