asreview / paper-megameta-postprocessing-screeningresults

This repository is part of the so-called Mega-Meta study, which reviews factors contributing to substance use, anxiety, and depressive disorders. It contains the scripts for post-processing the screening results.

Home Page: https://www.asreview.ai

License: MIT License

R 79.87% Jupyter Notebook 14.93% Python 5.20%

asreview mega-meta systematic-review deduplication

paper-megameta-postprocessing-screeningresults's Introduction

ASReview: Active learning for Systematic Reviews

Systematically screening large amounts of textual data is time-consuming and often tiresome. The rapidly evolving field of Artificial Intelligence (AI) has allowed the development of AI-aided pipelines that assist in finding relevant texts for search tasks. A well-established approach to increasing efficiency is screening prioritization via Active Learning.

The Active learning for Systematic Reviews (ASReview) project, published in Nature Machine Intelligence, implements different machine learning algorithms that interactively query the researcher. ASReview LAB is designed to accelerate the screening of textual data: a human reads as few records as possible while missing no or very few relevant records (false negatives). ASReview LAB will save time, increase the quality of output, and strengthen the transparency of work when screening large amounts of textual data to retrieve relevant information. Active learning can support decision-making in any discipline or industry.

ASReview software implements three different modes:

Oracle: Screen textual data in interaction with the active learning model. The reviewer is the 'oracle', making the labeling decisions.
Exploration: Explore or demonstrate ASReview LAB with a completely labeled dataset. This mode is suitable for teaching purposes.
Simulation: Evaluate the performance of active learning models on fully labeled data. Simulations can be run in ASReview LAB or via the command line interface with more advanced options.

Installation

The ASReview software requires Python 3.8 or later. Detailed step-by-step instructions to install Python and ASReview are available for Windows and macOS users.

pip install asreview

Upgrade ASReview with the following command:

pip install --upgrade asreview

To install ASReview LAB with Docker, see Install with Docker.

Getting started

Getting Started with ASReview LAB.

Citation

If you wish to cite the underlying methodology of the ASReview software, please use the following publication in Nature Machine Intelligence:

van de Schoot, R., de Bruin, J., Schram, R. et al. An open source machine learning framework for efficient and transparent systematic reviews. Nat Mach Intell 3, 125–133 (2021). https://doi.org/10.1038/s42256-020-00287-7

For citing the software, please refer to the specific release of the ASReview software on Zenodo https://doi.org/10.5281/zenodo.3345592. The menu on the right can be used to find the citation in your preferred format.

For more scientific publications on the ASReview software, go to asreview.ai/papers.

Contact

For an overview of the team working on ASReview, see ASReview Research Team. ASReview LAB is maintained by Jonathan de Bruin and Yongchao Terry Ma.

The best resources to find an answer to your question or ways to get in contact with the team are:

Documentation - asreview.readthedocs.io
Newsletter - asreview.ai/newsletter/subscribe
Quick tour - ASReview LAB quick tour
Issues or feature requests - ASReview issue tracker
FAQ - ASReview Discussions
Donation - asreview.ai/donate
Contact - [email protected]

License

The ASReview software has an Apache 2.0 LICENSE. The ASReview team accepts no responsibility or liability for the use of the ASReview tool or any direct or indirect damages arising out of the application of the tool.

paper-megameta-postprocessing-screeningresults's Issues

Conservative deduplication does not run

It appears that there is an issue with the conservative deduplication strategy.
After the doi_retrieval step in Python, loading the data back into R for the deduplication part adds an unnamed column (shown as ...1 below):

> # IMPORTING RESULTS
> ## from doi retrieval 
> df <- read_xlsx(paste0(OUTPUT_PATH, DOI_RETRIEVED_PATH))
New names:
* `` -> ...1

which caused a hiccup in the conservative deduplication part:

New names:
* ...1 -> ...6
New names:
* ...1 -> ...6
 Error: not compatible: 
not compatible: 
- Cols in y but not x: `...1`.
- Cols in x but not y: `...6`.

Run `rlang::last_error()` to see where the error occurred.

This issue causes the conservative deduplication function to fail and therefore needs to be fixed.
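
One possible workaround, sketched here in Python (the language of the DOI retrieval step), is to drop auto-named columns before the data goes back to R. This assumes the stray column is a row index that was written out without a header, which the issue does not confirm. The function name and the sample records are hypothetical.

```python
import re

# Names that readers typically assign to a header-less column:
# "" (empty), "Unnamed: 0" (pandas), "...1" (readxl name repair).
_AUTO_NAME = re.compile(r"^(|Unnamed: \d+|\.\.\.\d+)$")


def drop_unnamed_columns(records):
    """Return copies of the records without auto-named columns."""
    return [
        {key: value for key, value in record.items() if not _AUTO_NAME.match(key)}
        for record in records
    ]


# Hypothetical records as they might look after a round trip through Excel.
records = [
    {"...1": 0, "title": "Paper A", "doi": "10.1000/a"},
    {"...1": 1, "title": "Paper B", "doi": None},
]
cleaned = drop_unnamed_columns(records)
print(list(cleaned[0]))
```

With identical column sets on both sides, the comparison that raised "Cols in y but not x" should no longer see a mismatch, provided the unnamed column was the only difference.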

solve duplicates

After merging the three datasets, it appears there are still some duplicates in the dataset. This holds for relevant papers, for irrelevant papers, and for unseen papers. We would need two scripts: one that searches for more DOIs, for example in Crossref, so that we can apply another round of deduplication based on DOIs, and one for title matching.
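
The DOI-based round could look roughly like the sketch below. It assumes records are dictionaries with a "doi" field and it does not query Crossref; the function names and the list of prefixes are illustrative choices, not taken from the repository. Records without a DOI are kept, since they cannot be matched this way.

```python
def normalize_doi(doi):
    """Lowercase a DOI and strip common URL or 'doi:' prefixes; None if empty."""
    if not doi:
        return None
    doi = doi.strip().lower()
    prefixes = (
        "https://doi.org/", "http://doi.org/",
        "https://dx.doi.org/", "http://dx.doi.org/", "doi:",
    )
    for prefix in prefixes:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip() or None


def deduplicate_by_doi(records):
    """Keep the first record per normalized DOI; keep all records without a DOI."""
    seen = set()
    unique = []
    for record in records:
        doi = normalize_doi(record.get("doi"))
        if doi is not None:
            if doi in seen:
                continue  # same DOI already kept, so this is a duplicate
            seen.add(doi)
        unique.append(record)
    return unique
```

Lowercasing is safe here because DOIs are case-insensitive, so "10.1000/ABC" and "https://doi.org/10.1000/abc" refer to the same paper.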

Deduplication during quality check is not conservative

The deduplication strategy within the quality check script is only based on title deduplication.
This may be dangerous for records which have the same title by coincidence.
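
A more conservative rule could require matching titles and no contradicting DOI. The sketch below is one way to express that, assuming records are dictionaries with "title" and "doi" fields. The definition of "conservative" used here is an assumption and may differ from the strategy in the repository's scripts.

```python
import re


def normalize_title(title):
    """Lowercase and collapse everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]+", " ", (title or "").lower()).strip()


def is_conservative_duplicate(a, b):
    """True only if titles match and the DOIs do not contradict each other."""
    title_a = normalize_title(a.get("title"))
    title_b = normalize_title(b.get("title"))
    if not title_a or title_a != title_b:
        return False
    doi_a = (a.get("doi") or "").strip().lower()
    doi_b = (b.get("doi") or "").strip().lower()
    # Two different, non-empty DOIs indicate two different papers.
    if doi_a and doi_b and doi_a != doi_b:
        return False
    return True


def deduplicate_conservatively(records):
    """Keep a record unless it duplicates one that was already kept."""
    kept = []
    for record in records:
        if not any(is_conservative_duplicate(record, other) for other in kept):
            kept.append(record)
    return kept
```

Under this rule, two records titled "Editorial" with different DOIs both survive, while a third "Editorial" without a DOI is merged into the first. The pairwise comparison is quadratic in the number of records, so a large dataset would need grouping by normalized title first.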

add file with requirement

A file requirements.txt needs to be added containing a list of required R-packages including version information.

create two datasets

Can you create two datasets as output:

one with all the information for the quality checks
one clean dataset which can be used for future studies

For the second dataset there should be five columns:

(ir)relevant for topic areas 1-3, one column per topic area (output of the combined screening phases using ASReview)
misclassified (as part of the quality checks 1->0 or 0->1)
final label which can be used for future studies

In this second dataset, records should appear only once.
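
As a rough sketch of the requested clean dataset, the Python function below builds one row per record with the five label columns. Everything about the data layout is an assumption: the column names, the "record_id" key, the rule that a record counts as relevant if it is relevant for any topic area, and the rule that a quality-check correction flips that label. When a record occurs more than once, only the first occurrence is used.

```python
def build_clean_dataset(records):
    """Return one row per record_id with five label columns (names hypothetical)."""
    clean = {}
    for record in records:
        record_id = record["record_id"]
        if record_id in clean:
            continue  # records should appear only once
        topic_labels = [record[f"relevant_topic_{i}"] for i in (1, 2, 3)]
        # Assumption: relevant overall if relevant for at least one topic area.
        screened_label = int(any(topic_labels))
        misclassified = int(record["misclassified"])
        # Assumption: a quality-check correction flips the screening label.
        final_label = 1 - screened_label if misclassified else screened_label
        clean[record_id] = {
            "record_id": record_id,
            "relevant_topic_1": topic_labels[0],
            "relevant_topic_2": topic_labels[1],
            "relevant_topic_3": topic_labels[2],
            "misclassified": misclassified,
            "final_label": final_label,
        }
    return list(clean.values())
```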

improve gitignore

Please create a proper gitignore file via https://www.toptal.com/developers/gitignore.

Also, make sure the datasets are included in the gitignore.

rlang returns an error

This code chunk returns an error:

# First pivot the title and doi columns
mismatch_included_no_source <- mir %>% 
  select(-contains("source")) %>%
  pivot_longer(cols = ends_with(c("title","doi")),
    names_to = c("intended_subject", ".value"),
    names_pattern = "(.+)_(.+)"
  )

error:

Error: `cols` must select at least one column.
Run `rlang::last_error()` to see where the error occurred.
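
The error means that no column name survived both selection steps. A quick way to check this for a given set of column names is to mimic the two steps outside R, as in the Python sketch below. The helper names are hypothetical, and the column names of mir are not shown in the issue, so any names passed in are assumptions.

```python
import re


def columns_to_pivot(columns):
    """Mimic select(-contains("source")) followed by ends_with(c("title", "doi"))."""
    # The tidyselect helpers contains() and ends_with() ignore case by default.
    remaining = [c for c in columns if "source" not in c.lower()]
    selected = [c for c in remaining if c.lower().endswith(("title", "doi"))]
    if not selected:
        raise ValueError("no column ends with 'title' or 'doi'")
    return selected


def split_name(column):
    """Split a column name with the greedy pattern (.+)_(.+)."""
    match = re.fullmatch(r"(.+)_(.+)", column)
    return match.groups() if match else None
```

Because the first group is greedy, a name such as "substance_use_title" splits at the last underscore into "substance_use" and "title". A name without an underscore does not match the pattern at all.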

request for descriptive stats

I would very much like to obtain a table with descriptive statistics including:

Generic stats:

total number of records per subject area
missing information (abstracts, titles, DOI)
number of prior relevant/irrelevant papers used in the first phase
number of labelled records in the first phase (plus % relevant)
number of labelled records in the second phase (plus % relevant)

Quality stats:

number of irrelevant papers which appeared to be relevant after screening by a 2nd screener
number of relevant papers which appeared to be irrelevant after screening by a 2nd screener
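
Part of the generic statistics could be computed along the lines of the Python sketch below. It assumes records are dictionaries with hypothetical fields "subject", "title", "abstract", "doi", and "label" (1 relevant, 0 irrelevant, None unseen). It does not distinguish screening phases or cover the quality statistics.

```python
from collections import defaultdict


def describe(records):
    """Count totals, missing fields, and labels per subject area."""
    stats = defaultdict(lambda: {
        "total": 0,
        "missing_title": 0,
        "missing_abstract": 0,
        "missing_doi": 0,
        "labelled": 0,
        "relevant": 0,
    })
    for record in records:
        row = stats[record["subject"]]
        row["total"] += 1
        for field in ("title", "abstract", "doi"):
            if not record.get(field):
                row[f"missing_{field}"] += 1
        label = record.get("label")
        if label is not None:
            row["labelled"] += 1
            row["relevant"] += int(label == 1)
    for row in stats.values():
        # Percentage of labelled records that are relevant; None if nothing is labelled.
        row["pct_relevant"] = (
            round(100 * row["relevant"] / row["labelled"], 1) if row["labelled"] else None
        )
    return dict(stats)
```

Grouping by a (subject, phase) pair instead of the subject alone would give the per-phase counts, assuming the data records the phase in which each label was assigned.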

Data for quality check 2 is incomplete

This issue is meant as an extra reminder that the data for quality check 2 (articles which have been incorrectly included) should be updated! Thus far we are working with the preliminary results.

This issue can be resolved when the final data for quality check 2 is available and the master-script is adapted to import this instead of the preliminary results.

Quality checks unclear

While reading your impressive documentation, it remains unclear to me what you did in step 4 of the post-processing, 'Deal with noisy labels corrected in two rounds of quality checks'.
I cannot find a script with a similar name or an explanation.
Could you point me to where it is or, if it does not exist yet, add it to the documentation?