kkang7 / cdseq_r_package Goto Github PK

CDSeq R Package

R 78.15% C 0.08% C++ 21.77%

cdseq_r_package's Introduction

CDSeq

CDSeq is a complete deconvolution method for dissecting bulk RNA-Seq data. The input of CDSeq is, ideally, bulk RNA-Seq read counts (similar to the input format required by DESeq2), and CDSeq will estimate, simultaneously, the cell-type-specific gene expression profiles and the sample-specific cell-type proportions, no reference of pure cell line GEPs or scRNAseq reference is needed for running CDSeq.

For example, if you have a bulk RNA-Seq data, a G by M matrix A, which is a G by M matrix. G denotes the number of genes and M is the sample size, then CDSeq will output B (a G by T matrix) and C (a T by M matrix), where T is the number of cell types, B is the estimate of cell-type-specific GEPs and C is the estimate of sample-specific cell-type proportions.

Importantly, you can ask CDSeq to estimate the number of cell types, i.e. T, by providing a vector of possible integer values for T. For example, if the user input for T is a vector, i.e. (T={2,3,4,5,6}), then CDSeq will estimate the most likely number for T.

Installation

You can install the released version of CDSeq from CRAN with:

install.packages("CDSeq")

And the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("kkang7/CDSeq_R_Package")

build the vignette with

# install.packages("devtools")
devtools::install_github("kkang7/CDSeq_R_Package", build_vignettes = TRUE)

Known issue about MacOS installation

It is possible for Mac users to run into some errors when install from source due to problems of Rcpp compiler tools. Follow the instruction here may help: https://thecoatlessprofessor.com/programming/cpp/r-compiler-tools-for-rcpp-on-macos/

Example

Load package

library(CDSeq)

When the number of cell types is a scalar

## basic example code
result1<-CDSeq(bulk_data =  mixtureGEP, 
               cell_type_number = 6, 
               mcmc_iterations = 5, # increase the mcmc_iterations to 700 or above
               gene_length = as.vector(gene_length), 
               reference_gep = refGEP,  # gene expression profile of pure cell lines
               cpu_number = 1)

When the number of cell types is a vector

The cell_type_number can also be a vector which contains different integer values. CDSeq will perform estimation for each integer in the vector and estimate the number of cell types in the mixtures. For example, one can set cell_type_number = 2:10 as follows, and CDSeq will estimate the most likely number of cell types from 2 to 10.

result2<-CDSeq(bulk_data =  mixtureGEP, 
              cell_type_number = 2:10, 
              mcmc_iterations = 5, 
              dilution_factor = 1, 
              block_number = 1, 
              gene_length = as.vector(gene_length), 
              reference_gep = refGEP, # gene expression profile of pure cell lines
              cpu_number = 1, # use multiple cores to save time. Set the cpu_number = length(cell_type_number) if there is enough cores.
              print_progress_msg_to_file = 0)

Use single cell to annotate CDSeq-estimated cell types

cdseq.result <- CDSeq::CDSeq(bulk_data = pbmc_mix,
                             cell_type_number = seq(3,12,3),
                             beta = 0.5,
                             alpha = 5,
                             mcmc_iterations = 700,
                             cpu_number = 4,
                             dilution_factor = 10)

cdseq.result.celltypeassign <- cellTypeAssignSCRNA(cdseq_gep = cdseq.result$estGEP, # CDSeq-estimated cell-type-specific GEPs
                                                   cdseq_prop = cdseq.result$estProp, # CDSeq-estimated cell type proportions
                                                   sc_gep = sc_gep,         # PBMC single cell data
                                                   sc_annotation = sc_annotation,# PBMC single data annotations
                                                   sc_pt_size = 3,
                                                   cdseq_pt_size = 6,
                                                   seurat_nfeatures = 100,
                                                   seurat_npcs = 50,
                                                   seurat_dims=1:5,
                                                   plot_umap = 1,
                                                   plot_tsne = 0)

Setting recommendations

We provide recommendations for parameter settings. Note that these recommendations are merely emperical and there is no theoretical justifications yet. User can tune the parameters based on specific applications and domain knowledges.

Parameters	Recommended setting
beta	0.5
alpha	5
mcmc_iteration	700-2000
dilution_factor	2-10
gene_subset_size	200-500
block_number	>5

Check vignette for more details and examples: browseVignettes(“CDSeq”).

Contact

email: [email protected]

cdseq_r_package's People

Contributors

Stargazers

Watchers

Forkers

liangdp1984 pelzko grst feeeengym jmkarp musculusmus b8307038 romcia mdmkac1

cdseq_r_package's Issues

No reference Gene expression profiles

HI,

i am try to estimate GEP using CDseq, the point is that i don't have reference_gep but i have a "bona fide" matrix of 6 cell type proportions for all my samples, my question is, can i use it in your program to get the GEP specific for this cell types or not?

sc_gep and sc_annotation

Hello. Should sc_gep and sc_annotation any PBMC single-cell data and its corresponding cell type identities?

Eg:
sc_gep = seurat object
sc_annotation = seurat$celltypes

Is this correct? My run for cellTypeAssignSCRNA() failed when I used the default sc_gep/sc_annotation objects inside the package. Error says that the sc_gep should have the same genes and in the same order as the cdseq_gep output from CDSeq().

input to cellTypeAssignSCRNA

Hello there!

I have been using CDSeqR package in the last month and I have a doubt on the usage of the function cellTypeAssignSCRNA.

My understanding is that for the input sc_gep I have to use the GEP as counts and not normalised data. But what is the reason? Considering the first step of the function (cell type assignment using the input sc_gep), I would have thought that the format required for sc_gep was normalised data. In fact, the function computes the correlation between sc_gep and cdseq_gep where the latter is in the format of normalised data. Would it be wrong then to use normalised data for sc_gep?

Thanks in advance for your time and help.

How to assign cell types without either reference GEP or single-cell?

Hello,

I was initially drawn to this package because in the original publication, it was said that we could use it to deconvolve bulk RNA-seq data and assign cell types using marker genes, but apparently it is not possible in the current version of the package.

I saw that there used to be a cellTypeAssignMarkerGenes function that did exactly what I wanted to do, which is use marker genes to assign cell types to the deconvolved data, but it seems to have been removed as it was considered an "unused function" in commit 44502f2.

I am wondering if there is any other way to achieve what I am trying to do in the current version of the package? I do not have access to either a reference GEP or a reference single cell RNAseq.

length(gene_length) should be equal to nrow(bulk_data)

Hi I tried the following

          result2<-CDSeq(bulk_data =  M2, 
          cell_type_number = 2:10, 
          mcmc_iterations = 5, 
          dilution_factor = 1, 
          block_number = 1, 
          gene_length = as.vector(gene_length), 
          reference_gep = refGEP, # gene expression profile of pure cell lines
          cpu_number = 1, # use multiple cores to save time. Set the cpu_number = length(cell_type_number) if there is enough cores.
          print_progress_msg_to_file = 0)

and got the following error

CDSeq is running in non Reduce-Recover mode. To use Reduce-Recover mode, assign a value to block_number that is greater than 1.

Error in CDSeq(bulk_data = M2, cell_type_number = 2:10, mcmc_iterations = 5, :
length(gene_length) should be equal to nrow(bulk_data)

my data looks like

head(M2)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
[1,] 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
[2,] 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0
[3,] 94 101 92 126 213 263 515 99 173 107 180 301 109 185 130 52 203 84 118 170
[4,] 8 7 8 10 24 16 20 11 12 7 8 15 9 11 10 13 13 14 6 23
[5,] 25 34 26 27 64 66 114 32 73 46 42 52 38 70 68 43 73 47 40 59
[6,] 2 6 6 7 0 5 6 3 4 4 3 12 2 8 5 4 5 6 3 5
[,21] [,22] [,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37] [,38]
[1,] 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0
[2,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[3,] 188 132 481 173 103 116 163 183 198 149 131 89 225 150 226 160 586 657

Input for CDSeq?

Dear all,

I am completely new to CDSeq but am very interesting to try it. From what I read so far, I have the feeling that the input should be raw counts and not normalized read counts. Is it correct?
My worry is that a lot of publicly available datasets, as the ones we can get from cbioportal, are provided as normalized counts and there is no way to get raw counts to my knowledge. So, is there no way we can apply CDSeq on such datasets?

Thank you in advance,
Jane

estProp always even across types

To test CDSeq, I used counts from published bulk RNA-seq of human skin (accession numbers below). Gene-level counts were used as input. Surprisingly, estProp from the output always distributes the proportion of single cells evenly across the types. the below command runs to completion. Is there a reason that this might happen? A similar result is also returned with subsets of these samples.

cdsrr<-CDSeq(bulk_data =  srr, 
             cell_type_number = 10, 
             mcmc_iterations = 5, # increase the mcmc_iterations to 700 or above
             gene_length = as.vector(length),
             cpu_number = 6)

SRR12330925
SRR12330926
SRR12330927
SRR12330928
SRR12330929
SRR12330930
SRR12330931
SRR12330932
SRR12330933
SRR12330947
SRR12330948
SRR12330949
SRR12330950
SRR12330951
SRR12330952
SRR12330953

Undefined variable in vignette?

@kkang7 I have a question regarding a variable that is used in the vignette. As seen below true_prop is used to define row names of the dataframe that will then be plotted to compare to ground truth as stated in the vignette.

We next compare the CDSeq-estimated sample-specific cell-type proportions with the ground truth. We first prepare data frame for ggplot.

This is the only section of the code that contains this variable. I am not quite sure where/when this variable is created. To clarify I am using the "Annotate CDSeq-estimated cell types using scRNASeq" method. I mention this because in the previous methods (e.g., when cell type is a scalar) CDSeq-estimated cell type proportions are compared to ground truth using this function:
trueProp <- true_prop_cell[result1$cell_type_assignment,]

for (i in 1:4) {
  if(i==1){
    rownames(cdseq.result.celltypeassign[[i]]$cdseq_prop_merged) <- rownames(true_prop)
    cdseq_prop_df <- melt(cdseq.result.celltypeassign[[i]]$cdseq_prop_merged)
    names(cdseq_prop_df)[1:3] <- c("celltype", "sampleID",paste0("estprop_",i))
    true_prop_df <- melt(true_prop)
    names(true_prop_df)[1:3] <- c("celltype", "sampleID","trueprop")
    
    merge_df <- merge(cdseq_prop_df,true_prop_df, by=c("celltype","sampleID"))
  }else{
    rownames(cdseq.result.celltypeassign[[i]]$cdseq_prop_merged) <- rownames(true_prop)
    cdseq_prop_df <- melt(cdseq.result.celltypeassign[[i]]$cdseq_prop_merged)
    names(cdseq_prop_df)[1:3] <- c("celltype", "sampleID",paste0("estprop_",i))
  
    merge_df <- merge(merge_df, cdseq_prop_df, by=c("celltype","sampleID"))
  }
}
}

If anyone has encountered this issue or can provide clarification it would be greatly appreciated.

After reading a little more I understand that true_prop comes from the SyntheticMixtureData which itself is created by randomly mixing pure cell lines that were downloaded. The proportions of this random mixture are used as the ground truth. Is the purpose of creating true_prop to compare estimated proportions to a random distribution of cell types in the mixture(i.e., a random proportion)? Thus I assume this is for testing the accuracy of the estimated proportions as seen in the supplements of the paper.

question about scRNA-seq data and thetheory of CDSeq

hello ，I read your paper and have some problems :
1.I have scRNA-seq data and how clould i use it to match the algorithm result?
2.I can not understand the real difference from LDA to CDseq? what is CDSeq's gene length means?

there s a tie between two or more possible cell type annotations

Dear Developers, I'm using CDSeq to deconvolute rnaseq data. I get the following waring:

In cellTypeAssignSCRNA(cdseq_gep = cdseq.result$estGEP, cdseq_prop = cdseq.result$estProp,  :
  In cluster 26, there s a tie between two or more possible cell type annotations. A random one is chosen. Try to reduce pseudo_cell_count or adjust resolution parameter.

Is there a way to identify the cell types that get the tie? I want to see the alternatives to see how far are the cell types in the tie.
thanks

Release Version

We are currently working on putting cdSeq for R on bioconda, but for that we would need you to create a release. Would that be possible?
Thank you very much in advance!

Error in CDSeq()

HI
I am using CDSeqR to deconvolve some bulk samples.
Hoever I am getting the following error:

res_cdseq = CDSeq::CDSeq(bulk_data = T,cell_type_number = 2:8,cpu_number = 1,mcmc_iterations = 1000, reference_gep = C)
CDSeq is running in non Reduce-Recover mode. To use Reduce-Recover mode, assign a value to block_number that is greater than 1.

Error in { : task 1 failed - "std::bad_alloc"
In addition: Warning messages:
1: In CDSeq::CDSeq(bulk_data = T, cell_type_number = 2:8, cpu_number = 1,  :
  gene_length is NOT provided. CDSeq will estiamte read rate not gene rate. Please provide gene length if you are interested in GEP estimation.
2: In CDSeq::CDSeq(bulk_data = T, cell_type_number = 2:8, cpu_number = 1,  :
  reference_gep is NOT read counts data, RPKM normalization is NOT performed.
3: In CDSeq::CDSeq(bulk_data = T, cell_type_number = 2:8, cpu_number = 1,  :
  Gene length is NOT provided, RPKM normalization is NOT performed.
4: In CDSeq::CDSeq(bulk_data = T, cell_type_number = 2:8, cpu_number = 1,  :
   bulk_data is NOT read count data. Please provide read counts data if possible for potentially better estimations.

I have not provided any gene lengths as I dont have that information, but my data is read counts and not FPKM or any other transformation.

Here is my data:
https://github.com/RK1912/Deconv_data

Do let me kwow if I need to make any changes to the data .

Thanks!

can't install the CDSeq package

when I try to use command "install.packages("CDSeq")", error occurs that:
Warning in install.packages :
package ‘CDSeq’ is not available for this version of R.

My R version is R 4.2.1 in Mac M1. I wonder which R version is the CDSeq package needed?

Thank you!

CDSeq not available for R 4.2?

Hi, I am trying to download CDSeq, but get the following error.

`install.packages("CDSeq")
Installing package into ‘/home/ariel/R/x86_64-pc-linux-gnu-library/4.2’
(as ‘lib’ is unspecified)
Warning in install.packages :
package ‘CDSeq’ is not available for this version of R

A version of this package for your version of R might be available elsewhere,
see the ideas at
https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages`

Is CDSeq not compatible with R 4.2, or is there something I'm doing wrong? I'm run Linux, so the known issue with Macs doesn't apply.

warning messages with CDSeq

Hi
I am working on the deconvolution of the RNASeq data starting with the gene counts file which is in the following format

Gene Sample1 Sample2 Sample3 Sample4 Sample5
g1 0.12 0.23 0.34 0.45 0.56
g2 0.14 0.14 0.14 0.14 0.14
g3 0.16 0.05 -0.06 -0.17 -0.28
g4 0.18 0.65 1.12 1.59 2.06
g5 0.2 0.984 1.768 2.552 3.336

After running CDSeq using this code result <- CDSeq(table, cell_type_number = 3), I got the following warning messages.
1: In CDSeq(table, cell_type_number = 3) :
gene_length is NOT provided. CDSeq will estiamte read rate not gene rate. Please provide gene length if you are interested in GEP estimation.
2: In CDSeq(table, cell_type_number = 3) :
Reference gene expression profile is missing. Cell type identification and RPKM normalization will NOT be performed by CDSeq. Users can identify CDSeq-identified cell-types using marker genes or reference gene expression profiles.
3: In CDSeq(table, cell_type_number = 3) :
bulk_data is NOT read count data. Please provide read counts data if possible for potentially better estimations.

My questions are...

if we are not sure with the 'cell_type_number', how can one give some value here? actually our intention is to predict the cell type and proportions. how can we define this number. here blindly I gave 3.
Do we need to provide the gene length? As I am providing gene counts, I guess gene lengths may not be required to provide. am I correct?
I feel this is a serious point to consider as it is linked with the identification of cell type. I don’t understand what is this reference gene expression profiles. I have just the gene counts file with me.
I am already providing a read count data, but still I am getting this warning.

can anyone explain this?
Thank you

Specific cell type expression matrix

Hi,

i am wondering if it is possible to get expression matrix for each cell type, i mean in the example you have 6 cell types and hence you can get 6 cell type matrix RNA data ( gene by sample)

I am looking at Cell2RNA function but i guess is not what i want

Many thanks in advance

Run time of CDSeq()

Hi! Thanks for developing CDSeqR. I’m just curious what is the expected run time or if you have estimated run times based on general parameters of CDSeq(). My input for bulk data is a gene (rows) x samples (columns) data. I have 48 samples and 60k genes. When I run CDSeq, I left it for 8hrs, and it was still running so I just aborted it. I wonder if that long is expected. Thank you!

Error in `contrasts<-`(`tmp`, value = contr.funs[1 + isOF[nn]])

I am unable to get to the bottom of this error when attempting to run 'cellTypeAssignSCRNA()' without a reference set.

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
  contrasts can be applied only to factors with 2 or more levels

It prints to the console just after

running Harmony..

so I think it is occurring somewhere between lines 385 and 412 in file 'cellTypeAssignSCRNA.R'.

Including known cell type proportions in CDSeq initial deconvolution

Hi there! I am using your package and have run into an issue.
We know that there are about 7 different cell types in our bulk samples (based on scRNAseq profiling), with one cell type taking up 70-80% of the sample.
When we apply your deconvolution method, we are only able to pick up that one cell type, even when we increase the number of cell types to many greater than 7 (say to 10-15).
Is there a way to input known cell type proportions in the initial CDSeq() step?
Thank you!

.

Renaming clusters after cellTypeAssignSCRNA?

After running cellTypeAssignSCRNA I have 12 clusters of single cells with the CDSeq estimates assigned to specific clusters. Currently the clusters have generic labeling (i.e., cluster_0, cluster_1, ...) but after reviewing each clusters marker genes I would like to assign meaningful cell type labels (i.e, endothelial cells, monocytes,...) for each cluster. I am not sure if the umap and tsne ggplot are able to be edited after running cellTypeAssignSCRNA. Renaming cluster labels in the plot environment as shown below will change this data slot in the result but will not change the output of the umap plot.
levels(cdseq.result.celltypeassign[["cdseq_scRNA_umap"]][["plot_env"]][["cluster_label"]]) <- new.cluster.ids
Any suggestions are greatly appreciated.

Error in CDSeq()

HI
I am using CDSeqR to deconvolve some bulk samples.
Hoever I am getting the following error:

res_cdseq = CDSeq::CDSeq(bulk_data = T,cell_type_number = 2:8,cpu_number = 1,mcmc_iterations = 1000, reference_gep = C)
CDSeq is running in non Reduce-Recover mode. To use Reduce-Recover mode, assign a value to block_number that is greater than 1.

Error in { : task 1 failed - "std::bad_alloc"
In addition: Warning messages:
1: In CDSeq::CDSeq(bulk_data = T, cell_type_number = 2:8, cpu_number = 1,  :
  gene_length is NOT provided. CDSeq will estiamte read rate not gene rate. Please provide gene length if you are interested in GEP estimation.
2: In CDSeq::CDSeq(bulk_data = T, cell_type_number = 2:8, cpu_number = 1,  :
  reference_gep is NOT read counts data, RPKM normalization is NOT performed.
3: In CDSeq::CDSeq(bulk_data = T, cell_type_number = 2:8, cpu_number = 1,  :
  Gene length is NOT provided, RPKM normalization is NOT performed.
4: In CDSeq::CDSeq(bulk_data = T, cell_type_number = 2:8, cpu_number = 1,  :
   bulk_data is NOT read count data. Please provide read counts data if possible for potentially better estimations.

I have not provided any gene lengths as I dont have that information, but my data is read counts and not FPKM or any other transformation.

Here is my data:
https://github.com/RK1912/Deconv_data

Do let me kwow if I need to make any changes to the data .

Thanks!

Limited by default 500. How can we provide/calculate "gene_length" and "reference_gep"?

Hello. How can we provide/calculate "gene_length" and "reference_gep" for our bulk RNA-seq counts data? It is not clear from the paper and vignette how it was derived/created. Thank you.

CDSeq running in parallel

Hi!

I'm having issues running the CDSeq function in "parallel" mode. When I use block_number = 1 I have no issues and I get all the results needed. When I increase the number of blocks and the cpu_number accordingly, I get the following error:

Error in dimnames(CDSeq_result$cellTypeAssignSplit)[[1]] <- gene_names : attempt to set an attribute on NULL

Debugging the CDSeq.R file I noticed that CDSeq_result exists, but the attribute cellTypeAssignSplit is NULL. Using block_number = 1 instead the attribute exists. The attribute cellTypeAssignSplit is assigned in the foreach loop a few lines above and I noticed that this doesn't return cellTypeAssignSplit in case of block_number > 1. In fact for block_number = 1:

return(list(estProp = estProp, estGEP = estGEP, 
                        celltype_assignment = celltype_assignment, 
                        cellTypeAssignSplit = result$cellTypeAssignSplit, 
                        processID = Sys.getpid()))

while for block_number > 1:

return(list(estProp = estProp, processID = Sys.getpid()))

Is this intentional? How can I solve my issue?

Thanks in advance for your help!
Giulio