This repository contains code used to extract details of preprints related to COVID-19 and visualise their distribution over time. With thanks to Bianca Kramer for improving the visualisations.
A citable version of this repository is also available on figshare, here: https://doi.org/10.6084/m9.figshare.12033672.
Preprint data is harvested from Crossref and arXiv using the R packages rcrossref and aRxiv, respectively.
With respect to Crossref, all records defined as "posted-content" are harvested using the cr_types function of the rcrossref package, and filtered for partial matches to keywords relating to COVID-19 ("coronavirus", "covid-19", "sars-cov", "ncov-2019", "2019-ncov") in either their titles or abstracts. The institution, publisher and group-title properties are then used to match preprints to relevant preprint repositories. In some cases, multiple Crossref records are registered for a single preprint (e.g. ChemRxiv registers a new Crossref record for each new version of a preprint). In these cases, only the earliest posted version is included in this dataset. Additionally, some preprints are deposited to multiple preprint repositories; in these cases, both preprint records are included.
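The filtering and de-duplication logic described above can be sketched outside of R. The following Python snippet is only an illustration, not the repository's code. It assumes case-insensitive substring matching (the text says only "partial matches"), and the field names `repository`, `preprint_id` and `posted` are hypothetical: how versions of one preprint are actually linked is not stated here. The sample records are made up.

```python
from datetime import date

# Keywords quoted in the text above. Case-insensitive substring matching
# is an assumption; the text only says "partial matches".
KEYWORDS = ("coronavirus", "covid-19", "sars-cov", "ncov-2019", "2019-ncov")


def mentions_covid(record):
    """Return True if any keyword occurs in the record's title or abstract."""
    title = record.get("title") or ""
    abstract = record.get("abstract") or ""
    text = f"{title} {abstract}".lower()
    return any(keyword in text for keyword in KEYWORDS)


def earliest_versions(records):
    """Keep the earliest posted record per (repository, preprint) pair.

    The repository is part of the key, so a preprint deposited to two
    repositories keeps one record in each.
    """
    earliest = {}
    for record in records:
        key = (record["repository"], record["preprint_id"])  # hypothetical fields
        kept = earliest.get(key)
        if kept is None or record["posted"] < kept["posted"]:
            earliest[key] = record
    return sorted(earliest.values(), key=lambda r: r["posted"])


# Made-up records for illustration only; none of these are real preprints.
records = [
    {"repository": "ChemRxiv", "preprint_id": "a", "posted": date(2020, 3, 20),
     "title": "Docking study of SARS-CoV-2 protease (version 2)", "abstract": None},
    {"repository": "ChemRxiv", "preprint_id": "a", "posted": date(2020, 3, 10),
     "title": "Docking study of SARS-CoV-2 protease", "abstract": None},
    {"repository": "bioRxiv", "preprint_id": "b", "posted": date(2020, 2, 1),
     "title": "Modelling epidemic spread", "abstract": "We study the 2019-nCoV outbreak."},
    {"repository": "medRxiv", "preprint_id": "b", "posted": date(2020, 2, 3),
     "title": "Modelling epidemic spread", "abstract": "We study the 2019-nCoV outbreak."},
    {"repository": "bioRxiv", "preprint_id": "c", "posted": date(2020, 1, 5),
     "title": "Yeast metabolism", "abstract": "No relevant terms."},
]

selected = earliest_versions([r for r in records if mentions_covid(r)])
for r in selected:
    print(r["repository"], r["posted"], r["title"])
```

In this sketch the second ChemRxiv version and the unrelated yeast record are dropped, while the bioRxiv and medRxiv copies of the same preprint are both kept.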
With respect to arXiv, records are harvested by searching directly (using the arxiv_search function of the aRxiv package) for COVID-19 related keywords in titles or abstracts.
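As a rough illustration of what such a search could look like, the Python sketch below builds a query string using the arXiv API's `ti:` (title) and `abs:` (abstract) field prefixes joined with `OR`. The exact query used by this repository is not given here, so the keyword list (reused from the Crossref step) and the helper name `build_arxiv_query` are assumptions.

```python
# Assumption: the same keywords as in the Crossref step are used for arXiv.
KEYWORDS = ("coronavirus", "covid-19", "sars-cov", "ncov-2019", "2019-ncov")


def build_arxiv_query(keywords):
    """Build a query matching any keyword in the title (ti:) or abstract (abs:)."""
    clauses = []
    for keyword in keywords:
        clauses.append(f'ti:"{keyword}"')
        clauses.append(f'abs:"{keyword}"')
    return " OR ".join(clauses)


query = build_arxiv_query(KEYWORDS)
print(query)
```

A string of this form is the kind of value that could be passed as the query argument of arxiv_search, though the repository may construct its search differently.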