# | Introduced in | # reviews | Domain | Avg. size | Avg. ratio of included (TA) | Avg. ratio of included (FT) | Additional data | Data URL | Cochrane | Publicly available | Included in CSMeD |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | Cohen et al. (2006) | 15 | Drug | 1,249 | 7.7% | — | — | Web | — | ✓ | ✓ |
2 | Wallace et al. (2010) | 3 | Clinical | 3,456 | 7.9% | — | — | GitHub | — | ✓ | ✓ |
3 | Howard et al. (2015) | 5 | Mixed | 19,271 | 4.6% | — | — | Supplementary | — | ✓ | ✓ |
4 | Miwa et al. (2015) | 4 | Social science | 8,933 | 6.4% | — | — | — | — | — | — |
5 | Scells et al. (2017) | 93 | Clinical | 1,159 | 1.2% | — | Search queries | GitHub | ✓ | ✓ | ✓ |
6 | CLEF TAR 2017 | 50 | DTA | 5,339 | 4.4% | — | Review protocol | GitHub | ✓ | ✓ | ✓ |
7 | CLEF TAR 2018 | 30 | DTA | 7,283 | 4.7% | — | Review protocol | GitHub | ✓ | ✓ | ✓ |
8 | CLEF TAR 2019 | 49 | Mixed** | 2,659 | 8.9% | — | Review protocol | GitHub | ✓ | ✓ | ✓ |
9 | Alharbi et al. (2019) | 25 | Clinical | 4,402 | 0.4% | — | Review updates | GitHub | ✓ | ✓ | ✓ |
10 | Parmar (2021) | 6 | Biomedical | 3,019 | 21.6% | 7.3% | — | — | — | — | — |
11 | Hannousse et al. (2022) | 7 | Computer Science | 340 | 11.7% | — | Review protocol | GitHub | — | ✓ | ✓ |
12 | Wang et al. (2022) | 40 | Clinical | 1,326 | — | — | Review protocol | GitHub | — | ✓ | — |
** CLEF TAR 2019 contains 38 reviews of interventions, 8 DTA, 1 Prognosis and 2 Qualitative systematic reviews.
TA stands for Title + Abstract screening phase, FT for Full-text screening phase.
Avg. size describes the size of a review in terms of the number of records retrieved by the search query. Avg. ratio of included (TA) describes the average ratio of included records in the TA phase; Avg. ratio of included (FT) describes the average ratio of included records in the FT phase.
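As a worked example of how the columns combine (using the Cohen et al. (2006) row from the table above), multiplying the average review size by the TA inclusion ratio gives a rough count of records that survive title/abstract screening:

```python
# Cohen et al. (2006): avg. size 1,249 records, 7.7% included at TA.
avg_size = 1249
ratio_ta = 0.077

# Approximate number of records passing title/abstract screening.
passing_ta = round(avg_size * ratio_ta)
print(passing_ta)  # → 96
```

So on average fewer than a hundred of the ~1,250 retrieved records per review proceed past the TA phase in that dataset.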
Dataset name | #reviews | #docs. | #included | %included | Avg. #words in document | Avg. #words in review |
---|---|---|---|---|---|---|
CSMeD-train | 148 | 2,053 | 904 | 44.0% | 4,535 | 1,493 |
CSMeD-dev | 36 | 644 | 202 | 31.4% | 4,419 | 1,402 |
CSMeD-test | 29 | 636 | 278 | 43.7% | 4,957 | 2,318 |
CSMeD-test-small | 16 | 50 | 22 | 44.0% | 5,042 | 2,354 |
Column '#docs' refers to the total number of documents contained in the dataset, and '#included' gives the number of documents included at the full-text step. CSMeD-test-small is a subset of CSMeD-test.
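The %included column is simply #included divided by #docs; the table's percentages can be recomputed directly from the counts:

```python
# Recompute %included for each CSMeD split from (#docs, #included).
splits = {
    "CSMeD-train": (2053, 904),
    "CSMeD-dev": (644, 202),
    "CSMeD-test": (636, 278),
    "CSMeD-test-small": (50, 22),
}

for name, (docs, included) in splits.items():
    print(f"{name}: {100 * included / docs:.1f}%")
# → CSMeD-train: 44.0%, CSMeD-dev: 31.4%,
#   CSMeD-test: 43.7%, CSMeD-test-small: 44.0%
```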
Assuming you have conda
installed, run:
$ conda create -n csmed python=3.10
$ conda activate csmed
(csmed)$ pip install -r requirements.txt
To obtain the datasets, you need to configure the following:
- Get a cookie from https://www.cochranelibrary.com/
Furthermore, to obtain full-text PDFs, you need to configure the following:
- SemanticScholar API key: https://www.semanticscholar.org/product/api
- CORE API key: https://core.ac.uk/services/api
- GROBID: https://grobid.readthedocs.io/en/latest/Install-Grobid/
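One convenient way to keep these credentials at hand is to export them as environment variables before running the configuration step. The variable names below are illustrative assumptions, not necessarily what the configuration script reads; the GROBID URL assumes the default port of a local GROBID server:

```shell
# Illustrative variable names only -- adjust to whatever the
# configuration prompts actually ask for.
export SEMANTIC_SCHOLAR_API_KEY="your-s2-key"
export CORE_API_KEY="your-core-key"
export GROBID_URL="http://localhost:8070"  # GROBID's default port

echo "GROBID at $GROBID_URL"
```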
If you have all the prerequisites, run:
(csmed)$ python configure.py
Then follow the prompts, providing API keys, cookies, an email address for PubMed Entrez, and the path to the GROBID server. You do not need to provide all of this information; the bare minimum required to construct the datasets is the Cochrane Library cookie and the PubMed Entrez email address.
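For context on why the email address matters: NCBI's E-utilities guidelines ask clients to identify themselves with `email` (and `tool`) parameters on each request. A minimal sketch of how such a request URL is built (the PMID and email here are placeholders):

```python
from urllib.parse import urlencode

# Base endpoint of the NCBI E-utilities efetch service.
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(pmid: str, email: str) -> str:
    """Build an efetch URL for a PubMed record, identifying the client."""
    params = {
        "db": "pubmed",
        "id": pmid,
        "rettype": "abstract",
        "retmode": "text",
        "tool": "csmed",          # illustrative tool name
        "email": email,           # contact address required by NCBI policy
    }
    return BASE + "?" + urlencode(params)

print(efetch_url("12345678", "you@example.com"))
```

The `email` value ends up URL-encoded in the query string, which is how Entrez associates requests with a contact address.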
To download the datasets, run:
(csmed)$ python scripts/prepare_prospective_dataset.py