scLearn: Learning for single cell assignment

Introduction of scLearn

  • scLearn is a learning-based framework that automatically infers the quantitative measurement/similarity and the threshold used for single cell assignment, so that assignment performance generalizes well across different tasks and cell types. The main contributions of scLearn are: (1) it is robust across different assignment tasks, with well-generalized assignment performance; (2) it is efficient at identifying novel cell types that are absent from the reference datasets; and (3) it proposes, for the first time, a multi-label single cell assignment strategy that assigns a single cell to its time status and its cell type simultaneously, which has proven effective for cell development and lineage analysis with additional temporal information. scLearn is developed as an R package with built-in comprehensive human and mammalian single cell reference datasets and pre-trained models, which can be used directly to facilitate broad applications of single cell assignment.
  • scLearn is a learning-based framework designed to carry out a cell search intuitively by measuring the similarity between query cells and each reference cell cluster centroid, using a measurement and similarity thresholds learned from the reference datasets rather than a manually designed measurement/similarity or an empirically selected threshold. scLearn comprises three main steps: data preprocessing, model learning, and cell assignment:
    • Data preprocessing: First, routine normalization and quality control are performed on the single cell RNA-sequencing data. scLearn then removes rare cell types with fewer than 10 cells from the reference datasets. Finally, scLearn performs feature selection with M3Drop, which is based on the dropout rate and has proven suitable for single cell assignment.
    • Model learning: scLearn establishes a learning-based model that automatically learns the measurement used for cell assignment from the reference cells. In this model, identifying the query cell type is formulated as single-label single cell assignment. Model learning comprises the following parts:
      • Discriminative component analysis (DCA) is applied to learn a transformation matrix on the basis of prior sample similarity or dissimilarity. This matrix is used to formulate an optimal measurement that naturally fits the relationship between the samples.
      • In addition, assigning a query cell to its time point and cell type simultaneously is formulated as multi-label single cell assignment. In this case, scLearn extends the DCA-based matrix transformation to a multi-label dimension reduction that maximizes the dependence between the original feature space and the associated labels (multi-label dimension reduction via dependence maximization, MDDM).
      • In either case, the derived transformation matrix is multiplied by the original reference data matrix and by the query data matrix, and the learned measurement is obtained from the distance/similarity between the transformed data samples. For single-label single cell assignment, bootstrap sampling is also used in this step to reduce sampling imbalance and to obtain a stable learning-based model.
      • Single cell assignment methods should support the detection of novel cell types. However, all existing single cell assignment strategies adopt an empirical similarity threshold, such as a Pearson correlation coefficient of 0.7 or a cosine similarity of 0.5, even though the appropriate threshold differs among datasets with different cell types and annotations. In general, the similarity thresholds of datasets with fine-grained annotation (deep annotation, i.e., cells are categorized in a fine-grained manner) should be larger than those of datasets with coarse-grained annotation (shallow annotation, i.e., cells are categorized in a coarse-grained manner), because the cells in the former are more similar to one another than the cells in the latter. One threshold for all datasets and all cell types is therefore not suitable. For this reason, in this step scLearn learns a similarity threshold for each cell type in each dataset instead of specifying a priori thresholds.
    • Cell assignment: Finally, according to the learned measurement and the learned threshold obtained with the learning-based model, scLearn assigns the cell type of the query cells by comparison with the reference datasets.
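The assignment-with-rejection idea described above can be illustrated with a small toy sketch. This is not the scLearn implementation: the function name `assign_by_centroid`, the toy matrices, and the threshold values are all hypothetical, and plain Pearson correlation stands in for the measurement that scLearn learns from the reference data.

```r
# Toy sketch of centroid-based assignment with per-cell-type rejection thresholds.
# Every name below is hypothetical and is NOT part of the scLearn API.
assign_by_centroid <- function(query, centroids, thresholds) {
  # query:      features x cells matrix
  # centroids:  features x cell types matrix (one centroid per reference cell type)
  # thresholds: named numeric vector with one similarity threshold per cell type
  sim <- cor(query, centroids)                 # cells x cell types (Pearson correlation)
  best <- apply(sim, 1, which.max)             # index of the most similar centroid
  best_type <- colnames(centroids)[best]
  best_sim <- sim[cbind(seq_len(nrow(sim)), best)]
  # Reject a cell when even its best match falls below that cell type's threshold
  passed <- best_sim >= unname(thresholds[best_type])
  result <- ifelse(passed, best_type, "unassigned")
  names(result) <- colnames(query)
  result
}

centroids <- cbind(alpha = c(5, 1, 0, 0), beta = c(0, 0, 4, 6))
query <- cbind(cell1 = c(4, 2, 0, 1),
               cell2 = c(0, 1, 5, 5),
               cell3 = c(3, 3, 3, 2.5))
thresholds <- c(alpha = 0.8, beta = 0.9)       # assumed values, for illustration only

assigned <- assign_by_centroid(query, centroids, thresholds)
print(assigned)
```

In this toy example the third cell resembles neither centroid closely, so it is labeled "unassigned". Using a separate threshold per cell type mirrors the point made above that one global cutoff is unlikely to suit both fine-grained and coarse-grained annotations.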

The scLearn workflow

scLearn comprises three steps: data preprocessing, model learning, and cell assignment.

  • (1) The first step comprises routine normalization, cell quality control, rare cell-type filtering, and feature selection. Abbreviations: nGene, number of genes; nUMI, number of unique molecular identifiers; P-mitGene, percentage of mitochondrial genes; G, cell group.
  • (2) In the second step, DCA is applied to learn the transformation matrix for single-label single cell assignment, and MDDM is applied for multi-label single cell assignment. The learned transformation matrix is then used to obtain the transformed reference cell samples for the subsequent assignment. The thresholds for labeling a cell as “unassigned” are also learned automatically for each cell type. Abbreviations: G, cell group; DCA, discriminative component analysis; LTM, Learned Transformation Matrix, calculated as the optimal transformation matrix for single-label single cell assignment or by equation 6 for multi-label single cell assignment (see Materials and Methods); TRCM, Transformed Reference Cell Matrix, calculated by equation 1 (see Materials and Methods).
  • (3) In the third step, the transformed query cell samples are obtained from the LTM, with an optional cell quality control procedure. The transformed query samples are compared against the transformed reference cell matrix to derive the measurement used for cell-type assignment with rejection. Abbreviation: TQCM, Transformed Query Cell Matrix, calculated by equation 2 (see Materials and Methods).
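As a rough illustration of steps (2) and (3), the sketch below applies one transformation matrix to a toy reference matrix and a toy query matrix. It assumes the transformation is a plain matrix product, as described above. The object names (`ltm`, `trcm`, `tqcm`, `align_features`) are hypothetical, the random matrix merely stands in for a learned one, and zero-filling features that are missing from the query is an assumption of this sketch rather than documented scLearn behavior.

```r
# Hypothetical toy objects; none of these names come from the scLearn package.
set.seed(1)
genes <- paste0("gene", 1:6)
reference <- matrix(rpois(6 * 5, 3), nrow = 6,
                    dimnames = list(genes, paste0("ref", 1:5)))
# The query lacks gene4 and lists its genes in a different order
query <- matrix(rpois(5 * 4, 3), nrow = 5,
                dimnames = list(genes[c(2, 1, 3, 5, 6)], paste0("q", 1:4)))
# Stand-in for a learned transformation matrix: 2 components x 6 selected genes
ltm <- matrix(rnorm(2 * 6), nrow = 2,
              dimnames = list(c("comp1", "comp2"), genes))

# Reorder the rows of a matrix to match the given features.
# ASSUMPTION: features missing from the matrix are filled with zeros.
align_features <- function(mat, features) {
  out <- matrix(0, nrow = length(features), ncol = ncol(mat),
                dimnames = list(features, colnames(mat)))
  shared <- intersect(features, rownames(mat))
  out[shared, ] <- mat[shared, , drop = FALSE]
  out
}

query_aligned <- align_features(query, colnames(ltm))
trcm <- ltm %*% reference       # transformed reference cells: components x cells
tqcm <- ltm %*% query_aligned   # transformed query cells: components x cells
print(dim(trcm))
print(dim(tqcm))
```

The product is only defined when the rows of the expression matrix match the columns of the transformation matrix in number and order, which is why the query is aligned to the reference features first.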

Install

  • Install: You can install the scLearn package from GitHub using the devtools package (requires R >= 3.6.1).

    library(devtools)
    library(SingleCellExperiment)
    library(M3Drop)
    install_github("bm2-lab/scLearn")
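Before running the commands above, it can help to check that the packages they load are already installed. The following is a small convenience sketch in base R, not part of scLearn; the variable names are illustrative, and the commented install commands assume the usual sources (CRAN for devtools, Bioconductor for SingleCellExperiment and M3Drop).

```r
# Packages loaded by the install snippet above
required <- c("devtools", "SingleCellExperiment", "M3Drop")

# requireNamespace() returns FALSE instead of raising an error when a package is absent
is_installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # Typical installation commands, left commented out so the check has no side effects:
  # install.packages(c("devtools", "BiocManager"))
  # BiocManager::install(c("SingleCellExperiment", "M3Drop"))
} else {
  message("All required packages are installed.")
}
```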

Tutorial

Single-label single cell assignment

  • For illustration, we use the datasets baron-human.rds and xin-human.rds as examples.

    • Data preprocessing:
    # loading the reference dataset
    data<-readRDS('baron-human.rds')
    rawcounts<-assays(data)[[1]]
    refe_ann<-as.character(data$cell_type1)
    names(refe_ann)<-colnames(data)
    # cell quality control and rare cell type filtered and feature selection
    data_qc<-Cell_qc(rawcounts,refe_ann,species="Hs")
    data_type_filtered<-Cell_type_filter(data_qc$expression_profile,data_qc$sample_information_cellType,min_cell_number = 10)
    high_varGene_names <- Feature_selection_M3Drop(data_type_filtered$expression_profile)
    • Model learning:
    # training the model. To improve the accuracy for "unassigned" cells, you can increase "bootstrap_times", but training will take longer. The default value of "bootstrap_times" is 10.
    scLearn_model_learning_result<-scLearn_model_learning(high_varGene_names,data_type_filtered$expression_profile,data_type_filtered$sample_information_cellType,bootstrap_times=1)
    • Cell assignment:
    # loading the query cells and performing cell quality control.
    data2<-readRDS('xin-human.rds')
    rawcounts2<-assays(data2)[[1]]
    ### the true labels of this test dataset
    #query_ann<-as.character(data2$cell_type1)
    #names(query_ann)<-colnames(data2)
    #query_ann<-query_ann[query_ann %in% c("alpha","beta","delta","gamma")]
    #rawcounts2<-rawcounts2[,names(query_ann)]
    #data_qc_query<-Cell_qc(rawcounts2,query_ann,species="Hs")
    ### 
    data_qc_query<-Cell_qc(rawcounts2,species="Hs",gene_low=50,umi_low=50)
    # Assignment with trained model above. To get a less strict result for "unassigned" cells, you can decrease "diff" and "vote_rate". If you are sure that the cell type of query cells must be in the reference dataset, you can set "threshold_use" as FALSE. It means you don't want to use the thresholds learned by scLearn.
    scLearn_predict_result<-scLearn_cell_assignment(scLearn_model_learning_result,data_qc_query$expression_profile,diff=0.05,threshold_use=TRUE,vote_rate=0.6)
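This README does not describe the structure of `scLearn_predict_result`, so the sketch below takes plain named character vectors instead of the result object itself. It shows one possible way to score predictions against true labels such as the `query_ann` vector in the commented-out lines above. The helper `assignment_accuracy` and the toy labels are hypothetical and are not part of scLearn.

```r
# Hypothetical helper, NOT part of scLearn: compares predicted labels with true labels.
# Both inputs are character vectors named by cell ID.
assignment_accuracy <- function(predicted, truth) {
  shared <- intersect(names(predicted), names(truth))  # score only cells present in both
  stopifnot(length(shared) > 0)
  predicted <- predicted[shared]
  truth <- truth[shared]
  list(
    accuracy = mean(predicted == truth),
    unassigned_rate = mean(predicted == "unassigned"),
    confusion = table(truth = truth, predicted = predicted)
  )
}

# Toy labels for illustration only
truth <- c(c1 = "alpha", c2 = "beta", c3 = "beta", c4 = "delta")
predicted <- c(c1 = "alpha", c2 = "beta", c3 = "unassigned", c4 = "gamma", c5 = "alpha")

scores <- assignment_accuracy(predicted, truth)
print(scores$accuracy)
print(scores$confusion)
```

Matching by cell ID rather than by position matters here because quality control can drop query cells, so the predictions and the annotation vector may not have the same length or order.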
    

Multi-label single cell assignment

  • For illustration, we use the dataset ESC.rds as an example.

    • Data preprocessing:
    # loading the reference dataset
    data<-readRDS('ESC.rds')
    rawcounts<-assays(data)[[1]]
    refe_ann1<-as.character(data$cell_type1)
    names(refe_ann1)<-colnames(data)
    refe_ann2<-as.character(data$cell_type2)
    names(refe_ann2)<-colnames(data)
    # cell quality control and rare cell type filtered and feature selection
    data_qc<-Cell_qc(rawcounts,refe_ann1,refe_ann2,species="Hs")
    data_type_filtered<-Cell_type_filter(data_qc$expression_profile,data_qc$sample_information_cellType,data_qc$sample_information_timePoint,min_cell_number = 10)
    high_varGene_names <- Feature_selection_M3Drop(data_type_filtered$expression_profile)
    • Model learning:
    # training the model
    scLearn_model_learning_result<-scLearn_model_learning(high_varGene_names,data_type_filtered$expression_profile,data_type_filtered$sample_information_cellType,data_type_filtered$sample_information_timePoint,dim_para=0.999)
    • Cell assignment: Here we simply use 'ESC.rds' itself to test the multi-label single cell assignment.
    # loading the query cells and performing cell quality control
    data2<-readRDS('ESC.rds')
    rawcounts2<-assays(data2)[[1]]
    ### the true labels of this test dataset
    #query_ann1<-as.character(data2$cell_type1)
    #names(query_ann1)<-colnames(data2)
    #query_ann2<-as.character(data2$cell_type2)
    #names(query_ann2)<-colnames(data2)
    #rawcounts2<-rawcounts2[,names(query_ann1)]
    #data_qc_query<-Cell_qc(rawcounts2,query_ann1,query_ann2,species="Hs")
    data_qc_query<-Cell_qc(rawcounts2,species="Hs",gene_low=50,umi_low=50)
    # Assignment with trained model above
    scLearn_predict_result<-scLearn_cell_assignment(scLearn_model_learning_result,data_qc_query$expression_profile)

Pre-trained scLearn models

Citation

B. Duan, C. Zhu, G. Chuai, C. Tang, X. Chen, S. Chen, S. Fu, G. Li, Q. Liu, Learning for single-cell assignment. Sci. Adv. 6, eabd0855 (2020)

Contact

[email protected] or [email protected]

sclearn's Issues

Error when running the cell assignment code

Dear authors,
When I ran the code "scLearn_predict_result<-scLearn_cell_assignment(scLearn_model_learning_result,data_qc_query$expression_profile,diff=0.05,threshold_use=TRUE,vote_rate=0.6)", I got an error:
Error in scLearn_cell_assignment(scLearn_model_learning_result, data_qc_query$expression_profile, :
unused argument (threshold_use = TRUE)
Even after I changed the value of threshold_use to FALSE, the problem remained.
Can you explain why this happens? Thank you for your time.

Downloading the pre-trained models is not possible

It is not possible to download the pre-trained models! The download requires creating an account and verifying it with a passcode sent by SMS, but the process fails every time with no explanation whatsoever.

The meaning of the additional information in predict_result

I ran the code that you provide on GitHub and got scLearn_predict_result, but I do not understand the meaning of the additional information in predict_result. I have read the code that produces the additional information and still have no clue. What's more, I can't find the cell types of the query cells, so I don't know how to calculate the accuracy of scLearn. You provide neither the cell types of the query cells nor a method for calculating the accuracy. I really hope to receive your reply and would appreciate it if you could help with these problems.

How to distinguish similar cell types?

Dear authors,
I find it hard to distinguish some cell types with the default parameters when they are similar in the reference dataset. There are too many 'unassigned' cells in the output. How can I adjust the parameters of scLearn_model_learning() to get a less strict model? Thanks.

Error when running the example data

Hello, I encountered an error when running:
scLearn_predict_result<-scLearn_cell_assignment(scLearn_model_learning_result,data_qc_query$expression_profile).
The message is:
[1] "The number of missing features in the query data is 0 "
[1] "The rate of missing features in the query data is 0 "
Error in scLearn_model_learning_result$trans_matrix_learned[[r]] %*% expression_profile_query_hvg :
non-conformable arguments

Could you give me some advice on this error? The R version is 4.0.2.

The installation of scLearn failed

Hi, when I ran the command devtools::install_github("bm2-lab/scLearn"), I got the following error:

Installing package into ‘C:/Software/R/Rlibrary’
(as ‘lib’ is unspecified)
* installing *source* package 'scLearn' ...
** using staged installation
** R
** byte-compile and prepare package for lazy loading
Error: (converted from warning) package 'dml' was built under R version 4.0.3
Execution halted
ERROR: lazy loading failed for package 'scLearn'
* removing 'C:/Software/R/Rlibrary/scLearn'

I have checked my R version, which is 4.0.2, and the package requires R 3.6.1 or later. I also reinstalled the dml package, but the installation failed again. Have you encountered this error, or do you know how I could fix it? Thanks.

How to get the accuracy reported in the paper

Dear authors,
When I ran the code that you shared on GitHub, I got the following output:
[1] "The number of missing features in the query data is 79 "
[1] "The rate of missing features in the query data is 0.0671768707482993 "
The reference dataset is deng-reads.rds and the query dataset is zeisel.rds.
But I can't reproduce the accuracy that you report in the paper, for example "deng_zeisel scLearn 0.996".
Is something missing from the code I ran?
I would appreciate it if you could help me with this. Thank you for your time.

Cell assignment error

Hello, I encountered an error when running the cell assignment part:

data2<-readRDS('pancreas_human_segerstolpe.rds')
rawcounts2<-assays(data2)[[1]]
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘assays’ for signature ‘"list"’

The pancreas_human_segerstolpe.rds file was downloaded from the pre-trained scLearn models of the 30 datasets.
How should I deal with this error? Looking forward to your reply.
