This project forked from wojciechkusa/systematic-review-datasets

[NeurIPS 2023] CSMeD: Bridging the Dataset Gap in Automated Citation Screening for Systematic Literature Reviews

Home Page: https://systematic-review-datasets.streamlit.app

License: Apache License 2.0

CSMeD: Citation Screening Meta-Dataset for systematic review automation evaluation


Citation screening datasets for title and abstract screening

| # | Introduced in | # reviews | Domain | Avg. size | Avg. ratio of included (TA) | Avg. ratio of included (FT) | Additional data | Data URL | Cochrane | Publicly available | Included in CSMeD |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Cohen et al. (2006) | 15 | Drug | 1,249 | 7.7% | | | Web | | | |
| 2 | Wallace et al. (2010) | 3 | Clinical | 3,456 | 7.9% | | | GitHub | | | |
| 3 | Howard et al. (2015) | 5 | Mixed | 19,271 | 4.6% | | | Supplementary | | | |
| 4 | Miwa et al. (2015) | 4 | Social science | 8,933 | 6.4% | | | | | | |
| 5 | Scells et al. (2017) | 93 | Clinical | 1,159 | 1.2% | | Search queries | GitHub | | | |
| 6 | CLEF TAR 2017 | 50 | DTA | 5,339 | 4.4% | | Review protocol | GitHub | | | |
| 7 | CLEF TAR 2018 | 30 | DTA | 7,283 | 4.7% | | Review protocol | GitHub | | | |
| 8 | CLEF TAR 2019 | 49 | Mixed** | 2,659 | 8.9% | | Review protocol | GitHub | | | |
| 9 | Alharbi et al. (2019) | 25 | Clinical | 4,402 | 0.4% | | Review updates | GitHub | | | |
| 10 | Parmar (2021) | 6 | Biomedical | 3,019 | 21.6% | 7.3% | | | | | |
| 11 | Hannousse et al. (2022) | 7 | Computer Science | 340 | 11.7% | | Review protocol | GitHub | | | |
| 12 | Wang et al. (2022) | 40 | Clinical | 1,326 | | | Review protocol | GitHub | | | |

** CLEF TAR 2019 contains 38 reviews of interventions, 8 DTA, 1 Prognosis and 2 Qualitative systematic reviews.

TA stands for the title and abstract screening phase, FT for the full-text screening phase. Avg. size is the average number of records retrieved by a review's search query. Avg. ratio of included (TA) is the average ratio of records included in the TA phase, and Avg. ratio of included (FT) is the average ratio of records included in the FT phase.
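As an illustration of these column definitions, here is a minimal sketch that computes the two averages for a handful of reviews. The review figures are made up for the example, and the sketch assumes the ratio is macro-averaged (computed per review, then averaged across reviews); the source does not state which averaging scheme was used.

```python
from statistics import mean

# Hypothetical reviews (made-up numbers): records retrieved by the search,
# and records included after title and abstract (TA) screening.
reviews = [
    {"retrieved": 1000, "included_ta": 50},
    {"retrieved": 2000, "included_ta": 40},
    {"retrieved": 500, "included_ta": 60},
]

# Average review size: mean number of retrieved records per review.
avg_size = mean(r["retrieved"] for r in reviews)

# Assumed macro average: compute the ratio per review, then average the ratios.
avg_ratio_ta = mean(r["included_ta"] / r["retrieved"] for r in reviews)

print(f"Avg. size: {avg_size:,.0f}")
print(f"Avg. ratio of included (TA): {avg_ratio_ta:.1%}")
```

A micro average (total included divided by total retrieved) would weight large reviews more heavily and generally gives a different number.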

CSMeD-FT: Full-text screening dataset

| Dataset name | #reviews | #docs. | #included | %included | Avg. #words in document | Avg. #words in review |
|---|---|---|---|---|---|---|
| CSMeD-train | 148 | 2,053 | 904 | 44.0% | 4,535 | 1,493 |
| CSMeD-dev | 36 | 644 | 202 | 31.4% | 4,419 | 1,402 |
| CSMeD-test | 29 | 636 | 278 | 43.7% | 4,957 | 2,318 |
| CSMeD-test-small | 16 | 50 | 22 | 44.0% | 5,042 | 2,354 |

The '#docs.' column gives the total number of documents in the dataset, and '#included' gives the number of documents included at the full-text screening step. CSMeD-test-small is a subset of CSMeD-test.
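The %included column follows directly from the two counts. A short check, using the figures from the CSMeD-FT table above (the function and variable names are illustrative and not part of the repository):

```python
# Figures copied from the CSMeD-FT table: (number of documents, number included).
splits = {
    "CSMeD-train": (2053, 904),
    "CSMeD-dev": (644, 202),
    "CSMeD-test": (636, 278),
    "CSMeD-test-small": (50, 22),
}


def included_percentage(n_docs: int, n_included: int) -> float:
    """Return the share of included documents as a percentage, rounded to one decimal."""
    return round(100 * n_included / n_docs, 1)


percentages = {name: included_percentage(*counts) for name, counts in splits.items()}
for name, pct in percentages.items():
    print(f"{name}: {pct}%")
```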

Installation

Requirements

Assuming you have conda installed, run:

$ conda create -n csmed python=3.10
$ conda activate csmed
(csmed)$ pip install -r requirements.txt

Data acquisition prerequisites

To obtain the datasets, you need to configure the following:

  1. Cookie from the Cochrane Library
  2. Email address for PubMed Entrez

Furthermore, to obtain full-text PDFs, you need to configure the following:

  1. SemanticScholar API key: https://www.semanticscholar.org/product/api
  2. CORE API key: https://core.ac.uk/services/api
  3. GROBID: https://grobid.readthedocs.io/en/latest/Install-Grobid/

If you have all the prerequisites, run:

(csmed)$ python configure.py

Then follow the prompts to provide API keys, cookies, an email address for PubMed Entrez, and paths to the GROBID server. You don't need to provide all of this information: the bare minimum to construct the datasets is the cookie from the Cochrane Library and the email address for PubMed Entrez.
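If you plan to supply a GROBID server, it can help to confirm that the server is reachable before running the configuration. The sketch below is not part of this repository: the helper names are hypothetical, and it assumes GROBID is running locally on its default port (8070) and exposes its standard `/api/isalive` health-check endpoint.

```python
from urllib.error import URLError
from urllib.request import urlopen


def grobid_isalive_url(base_url: str) -> str:
    """Build the URL of GROBID's health-check endpoint from the server's base URL."""
    return base_url.rstrip("/") + "/api/isalive"


def is_grobid_alive(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if a GROBID server answers the health check with 'true'."""
    try:
        with urlopen(grobid_isalive_url(base_url), timeout=timeout) as response:
            return response.read().decode("utf-8").strip().lower() == "true"
    except (URLError, OSError, ValueError):
        # Connection refused, timeout, DNS failure, or malformed URL:
        # treat the server as unavailable.
        return False


if __name__ == "__main__":
    # Assumption: GROBID runs locally on its default port.
    print(is_grobid_alive("http://localhost:8070"))
```

Adjust the base URL if your GROBID instance runs on another host or port.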

Downloading datasets

To download the datasets, run:

(csmed)$ python scripts/prepare_prospective_dataset.py
