gao-lab / scrnaseq_benchmark

This project forked from tabdelaal/scrnaseq_benchmark


Benchmarking classification tools for scRNA-seq data

License: MIT License

Python 54.62% R 41.92% Dockerfile 3.17% Shell 0.29%

scrnaseq_benchmark's Introduction

A comparison of automatic cell identification methods for single-cell RNA-sequencing data

We present a comprehensive evaluation of the performance of state-of-the-art classification methods, in addition to general-purpose classifiers, for automatic cell identification in single-cell RNA-sequencing datasets. Our goal is to provide the community with a fair evaluation of all available methods, to facilitate the users’ choice, and to direct further developments towards the challenging aspects of automated cell type identification. (Published in Genome Biology, Sep. 2019: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1795-z)

Repository description

We provide all the scripts to run and evaluate all classifiers, and to reproduce the results introduced in the paper.

  1. The 'Scripts' folder contains a wrapper function for each classifier to read the data and apply that classification method.
  2. The Cross_Validation R script can be used to produce training and test indices for cross-validation.
  3. The rank_gene_dropouts Python script can be used to apply feature selection using the dropout method and rank genes accordingly.
  4. The evaluate R script can be used to evaluate the predictions of a certain classifier and obtain scores such as accuracy, median F1-score, and % unlabeled cells.

For more details, please check the function documentation.
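
The usage examples below call these functions directly, which assumes the corresponding R scripts have already been sourced into your session. A minimal setup sketch (the file paths shown here are illustrative and may differ from the actual repository layout):

source('Cross_Validation.R')      # Step 1: produce fixed training/test indices
source('Scripts/run_scPred.R')    # example classifier wrapper from the 'Scripts' folder
source('evaluate.R')              # compute accuracy, median F1-score, % unlabeled cells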

General Usage

To benchmark and fairly evaluate the performance of different classifiers using the benchmark datasets (filtered datasets can be downloaded from https://zenodo.org/record/3357167), apply the following steps:

Step 1

Apply the Cross_Validation R function on the corresponding dataset to obtain fixed training and test cell indices, stratified across different cell types. For example, using the Tabula Muris (TM) dataset:

Cross_Validation('~/TM/Labels.csv', 1, '~/TM/')

This command will create a CV_folds.RData file used as input in Step 2.
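
If you want to inspect the folds before moving on, the file can be loaded directly. A minimal sketch, assuming the variable names written by Cross_Validation (the same names used in the marker-gene example further below):

load('~/TM/CV_folds.RData')
n_folds          # number of cross-validation folds
str(Train_Idx)   # list with the training cell indices of each fold
str(Test_Idx)    # list with the test cell indices of each fold
Cells_to_Keep    # cells retained for the analysis after filtering
col_Index        # column of Labels.csv that holds the cell type annotation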

Step 2

Run each classifier wrapper. For example, running scPred on the TM dataset:

run_scPred('~/TM/Filtered_TM_data.csv','~/TM/Labels.csv','~/TM/CV_folds.RData','~/Results/TM/')

This command will output the true and predicted cell labels as csv files, as well as the classifier computation time.

Step 3

Evaluate the classifier prediction by

result <- evaluate('~/Results/TM/scPred_True_Labels.csv', '~/Results/TM/scPred_Pred_Labels.csv')

This command will return the corresponding accuracy, median F1-score, F1-scores for all cell populations, % unlabeled cells, and confusion matrix.
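
The returned list can be inspected directly in R. The element names in this sketch are illustrative assumptions; check the evaluate function documentation for the exact names it returns:

result$Acc       # overall accuracy (illustrative element name)
result$MedF1     # median F1-score across cell populations
result$F1        # per-population F1-scores
result$PercUnl   # percentage of unlabeled cells
result$Conf      # confusion matrix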

Usage with feature selection

Step 1

Apply the Cross_Validation R function on the corresponding dataset to obtain fixed training and test cell indices, stratified across different cell types. For example, using the Tabula Muris (TM) dataset:

Cross_Validation('~/TM/Labels.csv', 1, '~/TM/')

This command will create a CV_folds.RData file used as input in Steps 2 and 3.

Step 2

Apply the rank_gene_dropouts Python script to obtain the gene ranking for each training fold using the dropout criterion:

rank_gene_dropouts('~/TM/Filtered_TM_data.csv', '~/TM/CV_folds.RData', '~/TM/')

This command will create a rank_genes_dropouts.csv file used as input in Step 3.
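
A quick sanity check of the ranking file can be done in R. This is only a sketch, under the assumption that the file stores one ranked gene list per training fold:

gene_ranking <- read.csv('~/TM/rank_genes_dropouts.csv')
dim(gene_ranking)   # assumption: one column per training fold, genes ordered by dropout-based importance
head(gene_ranking)  # top-ranked genes for each fold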

Step 3

Run each classifier wrapper. For example, running scPred on the TM dataset with 1000 genes:

run_scPred('~/TM/Filtered_TM_data.csv','~/TM/Labels.csv','~/TM/CV_folds.RData','~/Results/TM/',
GeneOrderPath = '~/TM/rank_genes_dropouts.csv',NumGenes = 1000)

This command will output the true and predicted cell labels as csv files, as well as the classifier computation time.

Step 4

Evaluate the classifier prediction by

result <- evaluate('~/Results/TM/scPred_True_Labels.csv', '~/Results/TM/scPred_Pred_Labels.csv')

This command will return the corresponding accuracy, median F1-score, F1-scores for all cell populations, % unlabeled cells, and confusion matrix.

Evaluate Marker-based methods using DE genes

To evaluate the marker-based methods SCINA, DigitalCellSorter and Garnett using DE genes learned from the data, you may follow these steps:

Step 1

Apply the Cross_Validation R function on the corresponding dataset to obtain fixed training and test cell indices, stratified across different cell types. For example, using the Zheng_sorted dataset:

Cross_Validation('~/Zheng_sorted/Labels.csv', 1, '~/Zheng_sorted/')

This command will create a CV_folds.RData file used as input in Step 2.

Step 2

For each fold, use the training data to obtain the DE genes with the DEgenesMAST R function, and pass these DE genes to the corresponding method (here we use SCINA as an example) to obtain cell predictions for the test data.

## n_folds, Train_Idx, Test_Idx, Cells_to_Keep and col_Index are created by Cross_Validation
load('CV_folds.RData')
Data <- read.csv('~/Zheng_sorted/Filtered_DownSampled_SortedPBMC_data.csv', row.names = 1)
Labels <- as.matrix(read.csv('~/Zheng_sorted/Labels.csv'))
Labels <- as.vector(Labels[,col_Index])
Data <- Data[Cells_to_Keep,]
Labels <- Labels[Cells_to_Keep]

for (i in c(1:n_folds))
{
    ## learn DE marker genes from the training cells of this fold
    MarkerGenes <- DEgenesMAST(t(Data[Train_Idx[[i]],]), Labels[Train_Idx[[i]]], Normalize = TRUE, LogTransform = TRUE)
    ## write the MarkerGenes into a marker genes file format, depending on the tested method, for example for SCINA
    write.csv(MarkerGenes, 'MarkerGenes.csv')
    ## run the SCINA wrapper using these DE marker genes
    run_SCINA(Data[Test_Idx[[i]],], Labels[Test_Idx[[i]]], 'MarkerGenes.csv', '~/Results/Zheng_sorted/')
}
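
After the loop finishes, the per-fold predictions can be scored with the same evaluate function used above. The result file names in this sketch are assumptions following the naming pattern shown for scPred:

result <- evaluate('~/Results/Zheng_sorted/SCINA_True_Labels.csv', '~/Results/Zheng_sorted/SCINA_Pred_Labels.csv')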

Snakemake

To support future extension of this benchmarking work with new classifiers and datasets, we provide a Snakemake workflow to automate the performed benchmarking analyses (https://github.com/tabdelaal/scRNAseq_Benchmark/tree/snakemake_and_docker).

scrnaseq_benchmark's People

Contributors

arlene-1998, davycats, jeff1995, lcmmichielsen, tabdelaal

