# | Introduced in | # reviews | Domain | Avg. size | Avg. ratio of included (TA) | Avg. ratio of included (FT) | Additional data | Data URL | Cochrane | Publicly available | Included in CSMeD |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | Cohen et al. (2006) | 15 | Drug | 1,249 | 7.7% | — | — | Web | — | ✓ | ✓ |
2 | Wallace et al. (2010) | 3 | Clinical | 3,456 | 7.9% | — | — | GitHub | — | ✓ | ✓ |
3 | Howard et al. (2015) | 5 | Mixed | 19,271 | 4.6% | — | — | Supplementary | — | ✓ | ✓ |
4 | Miwa et al. (2015) | 4 | Social science | 8,933 | 6.4% | — | — | — | — | — | — |
5 | Scells et al. (2017) | 93 | Clinical | 1,159 | 1.2% | — | Search queries | GitHub | ✓ | ✓ | ✓ |
6 | CLEF TAR 2017 | 50 | DTA | 5,339 | 4.4% | — | Review protocol | GitHub | ✓ | ✓ | ✓ |
7 | CLEF TAR 2018 | 30 | DTA | 7,283 | 4.7% | — | Review protocol | GitHub | ✓ | ✓ | ✓ |
8 | CLEF TAR 2019 | 49 | Mixed** | 2,659 | 8.9% | — | Review protocol | GitHub | ✓ | ✓ | ✓ |
9 | Alharbi et al. (2019) | 25 | Clinical | 4,402 | 0.4% | — | Review updates | GitHub | ✓ | ✓ | ✓ |
10 | Parmar (2021) | 6 | Biomedical | 3,019 | 21.6% | 7.3% | — | — | — | — | — |
11 | Hannousse et al. (2022) | 7 | Computer Science | 340 | 11.7% | — | Review protocol | GitHub | — | ✓ | ✓ |
12 | Wang et al. (2022) | 40 | Clinical | 1,326 | — | — | Review protocol | GitHub | — | ✓ | — |
** CLEF TAR 2019 contains 38 reviews of interventions, 8 DTA, 1 Prognosis and 2 Qualitative systematic reviews.
TA stands for Title + Abstract screening phase, FT for Full-text screening phase.
Avg. size describes the size of a review in terms of the number of records retrieved by the search query. Avg. ratio of included (TA) describes the average ratio of included records in the TA phase; Avg. ratio of included (FT) describes the average ratio of included records in the FT phase.
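As a worked example of how the columns combine (using the Cohen et al. (2006) row from the table above), multiplying the average review size by the TA inclusion ratio gives a rough count of records that survive title/abstract screening:

```python
# Cohen et al. (2006): avg. size 1,249 records, 7.7% included at TA.
avg_size = 1249
ratio_ta = 0.077

# Approximate number of records passing title/abstract screening.
passing_ta = round(avg_size * ratio_ta)
print(passing_ta)  # → 96
```

So on average fewer than a hundred of the ~1,250 retrieved records per review proceed past the TA phase in that dataset.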
Dataset name | #reviews | #docs. | #included | %included | Avg. #words in document | Avg. #words in review |
---|---|---|---|---|---|---|
CSMeD-train | 148 | 2,053 | 904 | 44.0% | 4,535 | 1,493 |
CSMeD-dev | 36 | 644 | 202 | 31.4% | 4,419 | 1,402 |
CSMeD-test | 29 | 636 | 278 | 43.7% | 4,957 | 2,318 |
CSMeD-test-small | 16 | 50 | 22 | 44.0% | 5,042 | 2,354 |
Column '#docs' refers to the total number of documents contained in the dataset, and '#included' gives the number of documents included at the full-text step. CSMeD-test-small is a subset of CSMeD-test.
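The %included column is simply #included divided by #docs; the table's percentages can be recomputed directly from the counts:

```python
# Recompute %included for each CSMeD split from (#docs, #included).
splits = {
    "CSMeD-train": (2053, 904),
    "CSMeD-dev": (644, 202),
    "CSMeD-test": (636, 278),
    "CSMeD-test-small": (50, 22),
}

for name, (docs, included) in splits.items():
    print(f"{name}: {100 * included / docs:.1f}%")
# → CSMeD-train: 44.0%, CSMeD-dev: 31.4%,
#   CSMeD-test: 43.7%, CSMeD-test-small: 44.0%
```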
Assuming you have conda
installed, run:
$ conda create -n csmed python=3.10
$ conda activate csmed
(csmed)$ pip install -r requirements.txt
To obtain the datasets, you need to configure the following:
- Get a cookie from https://www.cochranelibrary.com/
Furthermore, to obtain full-text PDFs, you need to configure the following:
- SemanticScholar API key: https://www.semanticscholar.org/product/api
- CORE API key: https://core.ac.uk/services/api
- GROBID: https://grobid.readthedocs.io/en/latest/Install-Grobid/
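One convenient way to keep these credentials at hand is to export them as environment variables before running the configuration step. The variable names below are illustrative assumptions, not necessarily what the configuration script reads; the GROBID URL assumes the default port of a local GROBID server:

```shell
# Illustrative variable names only -- adjust to whatever the
# configuration prompts actually ask for.
export SEMANTIC_SCHOLAR_API_KEY="your-s2-key"
export CORE_API_KEY="your-core-key"
export GROBID_URL="http://localhost:8070"  # GROBID's default port

echo "GROBID at $GROBID_URL"
```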
If you have all the prerequisites, run:
(csmed)$ python configure.py
Then follow the prompts, providing API keys, cookies, an email address for PubMed Entrez, and the path to the GROBID server. You do not need to provide all of this information; the bare minimum required to construct the datasets is the Cochrane Library cookie and the PubMed Entrez email address.
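For context on why the email address matters: NCBI's E-utilities guidelines ask clients to identify themselves with `email` (and `tool`) parameters on each request. A minimal sketch of how such a request URL is built (the PMID and email here are placeholders):

```python
from urllib.parse import urlencode

# Base endpoint of the NCBI E-utilities efetch service.
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(pmid: str, email: str) -> str:
    """Build an efetch URL for a PubMed record, identifying the client."""
    params = {
        "db": "pubmed",
        "id": pmid,
        "rettype": "abstract",
        "retmode": "text",
        "tool": "csmed",          # illustrative tool name
        "email": email,           # contact address required by NCBI policy
    }
    return BASE + "?" + urlencode(params)

print(efetch_url("12345678", "you@example.com"))
```

The `email` value ends up URL-encoded in the query string, which is how Entrez associates requests with a contact address.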
To download the datasets, run:
(csmed)$ python scripts/prepare_prospective_dataset.py