plger / scdblfinder

Methods for detecting doublets in single-cell sequencing data

Home Page: https://plger.github.io/scDblFinder/

License: GNU General Public License v3.0

R 99.78% Dockerfile 0.22%

doublets single-cell

scdblfinder's Issues

Error in ecdf(d$cxds_score[w]) : 'x' must have 1 or more non-missing values

I have 18 samples stored as a list of 18 SCE objects, on which I ran emptyDrops() and scDblFinder() in loops. One of the samples (loop 11) failed at the scDblFinder() step with the error "Error in ecdf(d$cxds_score[w]) : 'x' must have 1 or more non-missing values". I know the scripts work because the other 17 samples ran fine, so I'm not sure what is wrong with this one object, which was constructed the same way as all the others. I think the line that fails is 2*ecdf(d$cxds_score[w])(d$cxds_score[w]), but I don't know how to fix my file.

I reran all the scripts, thinking something might not have been saved correctly in between on the HPC, but I got the same error in the same spot.

Update: if I run with use.cxds = FALSE, it finishes. I'm still uncertain why this one SCE object has this problem.

Samples = list.files(path='./mats')
# initialise the list before filling it in the loop
RawDat = vector(mode='list',length=length(Samples))
names(RawDat) = Samples
for(i in Samples){
  RawDat[[i]] <- read10xCounts(paste("/PATH/RAW/mats",i,sep='/'))
  cat(paste(i,', ',sep=''))
}

# call empty droplets per sample
emp_out = vector(mode='list',length=length(Samples))
names(emp_out) = Samples
for(i in Samples){
  emp_out[[i]] <- emptyDrops(counts(RawDat[[i]]))
  cat(paste(i,', ',sep=''))
}

# keep barcodes called as cells (which() also drops NA FDR values)
CellDat = vector(mode='list',length=length(Samples))
names(CellDat) = Samples
for(i in Samples){
  CellDat[[i]] <- RawDat[[i]][,which(emp_out[[i]]$FDR <= 0.001)]
  cat(paste(i,', ',sep=''))
}

set.seed(101)
for(i in Samples){
  CellDat[[i]] <- scDblFinder(CellDat[[i]])
  cat(paste(i,', ',sep=''))
}

# ERROR at loop 11 (loops 1-10 and 12-18 finish)
Clustering cells...
11 clusters
Creating ~8883 artifical doublets...
Dimensional reduction
Finding KNN...
Evaluating cell neighborhoods...
Training model...
Error in ecdf(d$cxds_score[w]) : 'x' must have 1 or more non-missing values

> CellDat[[11]]
class: SingleCellExperiment 
dim: 32285 14805 
metadata(1): Samples
assays(1): counts
rownames(32285): ENSMUSG00000051951 ENSMUSG00000089699 ...
  ENSMUSG00000095019 ENSMUSG00000095041
rowData names(3): ID Symbol Type
colnames: NULL
colData names(2): Sample Barcode
reducedDimNames(0):
altExpNames(0):

> CellDat[[12]]
class: SingleCellExperiment 
dim: 32285 2131 
metadata(2): Samples scDblFinder.stats
assays(1): counts
rownames(32285): ENSMUSG00000051951 ENSMUSG00000089699 ...
  ENSMUSG00000095019 ENSMUSG00000095041
rowData names(4): ID Symbol Type scDblFinder.selected
colnames(2131): cell1 cell2 ... cell2130 cell2131
colData names(12): Sample Barcode ... scDblFinder.mostLikelyOrigin
  scDblFinder.originAmbiguous
reducedDimNames(0):
altExpNames(0):
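
The message in this report is raised by base R's `ecdf()`, which stops when every value it receives is missing. Below is a minimal sketch of that behaviour, assuming (the report does not confirm this) that the `cxds_score` values selected by `w` were all `NA` for the failing sample. `safe_ecdf` is a hypothetical helper written for illustration and is not part of scDblFinder.

```r
# Reproduce the error message with an all-NA input (base R only).
scores <- c(NA_real_, NA_real_, NA_real_)
msg <- tryCatch(
  { ecdf(scores); "no error" },
  error = function(e) conditionMessage(e)
)
print(msg)

# Hypothetical guard: only build the ECDF when there is something to rank,
# otherwise return NA for every element.
safe_ecdf <- function(x) {
  if (all(is.na(x))) return(rep(NA_real_, length(x)))
  ecdf(x)(x)
}
print(safe_ecdf(scores))
print(safe_ecdf(c(0.1, 0.5, 0.9)))
```

Checking `all(is.na(...))` on the intermediate scores of the failing sample would be one way to test the assumption.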

scDblFinder.sample differs from the sample column specified in the `samples` argument of `scDblFinder`?

Hi, I ran scDblFinder in "split" sample mode to detect doublets with the following code (since the data is large, I only provide the code):

set.seed(221113L)
sce_qc <- scDblFinder::scDblFinder(
    sce_raw[, !sce_raw$low_lib_size],
    clusters = TRUE, dims = 50L, 
    samples = "Sample", multiSampleMode = "split",
    returnType = "sce"
)

When I check the results, the scDblFinder.sample column seems strange:

data.frame(colData(sce_qc)) %>%
    dplyr::select(Sample, scDblFinder.sample) %>% 
    dplyr::filter(Sample != scDblFinder.sample)
# here is some output
                   Sample scDblFinder.sample
AAACCCAAGCCTCTCT-1    B4T              B16T2
AAACCCAAGTGTAGAT-1    B4T                B1T
AAACGCTGTGTATTGC-1    B4T              B14T2
AAAGTGAGTAGATCGG-1    B4T               B16U
AACAAAGGTGGATCGA-1    B4T                B1U
AACAAGAGTCTACATG-1    B4T              B14T1
AACCAACAGGTAAACT-1    B4T                B1T
AACGGGAGTGAGATCG-1    B4T              B14T2
AAGAACATCTCTCGCA-1    B4T               B12T
AAGATAGAGCCTCATA-1    B4T                B1U
AAGATAGAGTAAGACT-1    B4T                B1T
AAGATAGCAAATGGCG-1    B4T               B16U
AAGGAATGTTGAATCC-1    B4T               B12U

I don't know why they differ when I used "split" mode. According to the scDblFinder help page, "split" mode runs the whole process separately for each sample, so I think the two columns should be the same. Is that right?

Note to self: figure out why computeDoubletDensity is unhappy on R-devel

Probably something to do with I():

http://bioconductor.org/checkResults/devel/bioc-LATEST/scDblFinder/malbec2-checksrc.html

Also note the many other complaints in the CHECK report. Some of these are mine, some of these are for @plger.

Installing fails as requires R 4.1.0

Trying to update gives the following error

ERROR: this R is version 4.0.5, package 'scDblFinder' requires R >= 4.1

Isn't 4.1 still in development?
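
The running R version can be compared against the stated requirement before attempting the update. This is a minimal base-R sketch; the `"4.1"` threshold is taken from the error message above, and `r_is_new_enough` is a hypothetical helper name.

```r
# Compare an R version against a minimum requirement.
# `current` defaults to the running R; it is a parameter only so that
# the check can be exercised with other version strings.
r_is_new_enough <- function(required, current = getRversion()) {
  current >= numeric_version(required)
}

print(getRversion())
print(r_is_new_enough("4.1"))
# The situation from the report: R 4.0.5 against a requirement of R >= 4.1.
print(r_is_new_enough("4.1", current = numeric_version("4.0.5")))
```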

Error in sample.int(length(x), size, replace, prob)

Hello,

Thank you so much for writing this tool. I have used it on some single-cell datasets and it has worked nicely. But when I try a different dataset, prepared in the same way as the previous ones, I get an error, and I was wondering if you have seen this before.

These are my commands:
#this first one works fine
seurat.sce <- as.SingleCellExperiment(seurat)

#This is the one that gives me the error
seurat.sce <- scDblFinder(seurat.sce,clusters = 'seurat_clusters')

The error is:
19 clusters
Creating ~10468 artifical doublets...
Error in sample.int(length(x), size, replace, prob) :
invalid 'replace' argument

Here is my session info:
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.3

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] grid stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] scDblFinder_1.4.0 UpSetR_1.4.0 eulerr_6.1.0 SeuratObject_4.0.0
[5] Seurat_4.0.0 ggvenn_0.1.8 ggplot2_3.3.3 dplyr_1.0.4

loaded via a namespace (and not attached):
[1] VGAM_1.1-5 plyr_1.8.6
[3] igraph_1.2.6 lazyeval_0.2.2
[5] enrichR_3.0 polylabelr_0.2.0
[7] splines_4.0.3 BiocParallel_1.24.1
[9] densityClust_0.3 listenv_0.8.0
[11] scattermore_0.7 scater_1.18.6
[13] GenomeInfoDb_1.26.2 fastICA_1.2-2
[15] digest_0.6.27 htmltools_0.5.1.1
[17] viridis_0.5.1 fansi_0.4.2
[19] magrittr_2.0.1 tensor_1.5
[21] cluster_2.1.0 ROCR_1.0-11
[23] limma_3.46.0 globals_0.14.0
[25] matrixStats_0.58.0 docopt_0.7.1
[27] colorspace_2.0-0 ggrepel_0.9.1
[29] xfun_0.20 sparsesvd_0.2
[31] crayon_1.4.1 RCurl_1.98-1.2
[33] jsonlite_1.7.2 spatstat_1.64-1
[35] spatstat.data_2.0-0 survival_3.2-7
[37] zoo_1.8-8 glue_1.4.2
[39] polyclip_1.10-0 gtable_0.3.0
[41] zlibbioc_1.36.0 XVector_0.30.0
[43] leiden_0.3.7 DelayedArray_0.16.1
[45] BiocSingular_1.6.0 future.apply_1.7.0
[47] SingleCellExperiment_1.12.0 BiocGenerics_0.36.0
[49] abind_1.4-5 scales_1.1.1
[51] pheatmap_1.0.12 edgeR_3.32.1
[53] DBI_1.1.1 miniUI_0.1.1.1
[55] Rcpp_1.0.6 viridisLite_0.3.0
[57] xtable_1.8-4 dqrng_0.2.1
[59] reticulate_1.18 rsvd_1.0.3
[61] stats4_4.0.3 htmlwidgets_1.5.3
[63] httr_1.4.2 FNN_1.1.3
[65] RColorBrewer_1.1-2 ellipsis_0.3.1
[67] ica_1.0-2 scuttle_1.0.4
[69] pkgconfig_2.0.3 farver_2.0.3
[71] uwot_0.1.10 deldir_0.2-9
[73] locfit_1.5-9.4 utf8_1.1.4
[75] tidyselect_1.1.0 labeling_0.4.2
[77] rlang_0.4.10 reshape2_1.4.4
[79] later_1.1.0.1 munsell_0.5.0
[81] tools_4.0.3 xgboost_1.3.2.1
[83] cli_2.3.0 generics_0.1.0
[85] ggridges_0.5.3 stringr_1.4.0
[87] fastmap_1.1.0 goftest_1.2-2
[89] fitdistrplus_1.1-3 DDRTree_0.1.5
[91] purrr_0.3.4 RANN_2.6.1
[93] sparseMatrixStats_1.2.1 pbapply_1.4-3
[95] future_1.21.0 nlme_3.1-152
[97] mime_0.9 monocle_2.18.0
[99] slam_0.1-48 scran_1.18.6
[101] compiler_4.0.3 rstudioapi_0.13
[103] beeswarm_0.3.1 plotly_4.9.3
[105] png_0.1-7 testthat_3.0.1
[107] spatstat.utils_2.0-0 statmod_1.4.35
[109] tibble_3.0.6 stringi_1.5.3
[111] desc_1.2.0 bluster_1.0.0
[113] lattice_0.20-41 Matrix_1.3-2
[115] HSMMSingleCell_1.10.0 vctrs_0.3.6
[117] pillar_1.4.7 lifecycle_0.2.0
[119] combinat_0.0-8 lmtest_0.9-38
[121] BiocNeighbors_1.8.2 RcppAnnoy_0.0.18
[123] bitops_1.0-6 data.table_1.13.6
[125] cowplot_1.1.1 irlba_2.3.3
[127] GenomicRanges_1.42.0 httpuv_1.5.5
[129] patchwork_1.1.1 R6_2.5.0
[131] promises_1.2.0.1 KernSmooth_2.23-18
[133] gridExtra_2.3 vipor_0.4.5
[135] IRanges_2.24.1 parallelly_1.23.0
[137] codetools_0.2-18 MASS_7.3-53
[139] assertthat_0.2.1 pkgload_1.1.0
[141] SummarizedExperiment_1.20.0 rprojroot_2.0.2
[143] rjson_0.2.20 withr_2.4.1
[145] qlcMatrix_0.9.7 sctransform_0.3.2
[147] GenomeInfoDbData_1.2.4 S4Vectors_0.28.1
[149] mgcv_1.8-33 parallel_4.0.3
[151] beachmat_2.6.4 rpart_4.1-15
[153] tidyr_1.1.2 DelayedMatrixStats_1.12.3
[155] MatrixGenerics_1.2.1 Rtsne_0.15
[157] Biobase_2.50.0 shiny_1.6.0
[159] ggbeeswarm_0.6.0 tinytex_0.30

Thank you so much for your help!
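
One way base R produces the message reported above is when the `replace` argument evaluates to `NA`, for example when it is computed from a comparison involving a missing value. Whether that is what happens inside scDblFinder for this dataset is an assumption, not something the report confirms. A minimal sketch:

```r
# A comparison with a missing value yields NA rather than TRUE/FALSE.
n_needed <- NA_integer_
replace_flag <- n_needed > 5L
print(replace_flag)

# Passing that NA as `replace` makes sample() fail; capture the message.
msg <- tryCatch(
  { sample(1:5, 2, replace = replace_flag); "no error" },
  error = function(e) conditionMessage(e)
)
print(msg)
```

If this is the cause, looking for missing values or empty levels in `seurat_clusters` would be a reasonable first check.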

Forgot to import sweep from DelayedArray

Oops:

library(scDblFinder)
example(computeDoubletDensity, echo=FALSE)
library(DelayedArray)
scores <- computeDoubletDensity(DelayedArray(counts))
## Error in .check_Ops_vector_arg_length(e, x_nrow, e_what = e_what, x_what = x_what) :
##   when the right operand is not a DelayedArray object (or derivative),
##   its length (250000) cannot be greater than the first dimension of the
##   left operand (10000)

Should be a very simple matter of slapping @importFrom DelayedArray sweep on top of .spawn_doublet_pcs(). Still spawns a warning but I think that's a DelayedArray problem rather than anything on our end.

Filter ATAC before running scDblFinder

Good day,

I want to use scDblFinder on my scATAC data. For RNA, you advise users to perform an initial QC first, so that it does not influence the modeling of doublets and the doublet calls are more precise.

I have checked the ATAC vignette but could not find information about this particular point. What would you recommend?

I appreciate any suggestions you can give me.

Very minor coding bug with multiSampleMode = 'split' and returnType = 'table'

Looking at the code, I think that if you specify multiSampleMode = 'split', you always get an augmented sce back, regardless of the specified returnType. Is that correct? Unless I'm missing something :)

Will

package ‘scDblFinder’ is not available (for R version 3.6.0) ???

Hi, thanks for developing this cool tool.
When I try to install it in R 3.6.0 using "BiocManager::install("scDblFinder")" , it says "package ‘scDblFinder’ is not available (for R version 3.6.0)".

Could you give any advice? Thanks.

if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("scDblFinder")

scDblFinder not deterministic when using batch+BPPARAM

Hello,

I found that scDblFinder is not deterministic when running it with multiple batches and with the BPPARAM argument, producing quite different results. Below is a MWE, do you have any idea what is going on? Output is deterministic when either setting BPPARAM to SerialParam or removing the batch argument.

library(BiocParallel)
library(SingleCellExperiment)
library(scDblFinder)
library(parallel)
library(scRNAseq)
library(scran)

sce <- scRNAseq::LawlorPancreasData()

sce$batch <- factor(c(rep("A", 100), rep("B", 200), rep("C", 100), rep("D", 238)))

sce$cluster <- as.character(scran::quickCluster(sce))

k <- 
mclapply(1:3, mc.cores=3, function(x){
  
  set.seed(123)
  m <- 
    scDblFinder::scDblFinder(sce=sce, 
                             clusters=as.character(sce$cluster), 
                             samples=sce$batch,
                             BPPARAM=MulticoreParam(workers = 3))
  return(m$scDblFinder.score)
  
}); names(k) <- paste0("run_",1:length(k))

par(mfrow=c(2,2))
plot(k$run_1, k$run_2)
plot(k$run_1, k$run_3)
plot(k$run_2, k$run_3)


R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 12.0.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] scran_1.18.3                scRNAseq_2.4.0              scDblFinder_1.4.0           SingleCellExperiment_1.12.0
 [5] SummarizedExperiment_1.20.0 Biobase_2.50.0              GenomicRanges_1.42.0        GenomeInfoDb_1.26.2        
 [9] IRanges_2.24.1              S4Vectors_0.28.1            BiocGenerics_0.36.0         MatrixGenerics_1.2.0       
[13] matrixStats_0.57.0          BiocParallel_1.24.1        

loaded via a namespace (and not attached):
  [1] ggbeeswarm_0.6.0              colorspace_2.0-0              ellipsis_0.3.1                scuttle_1.0.4                
  [5] bluster_1.0.0                 XVector_0.30.0                BiocNeighbors_1.8.2           rstudioapi_0.13              
  [9] bit64_4.0.5                   interactiveDisplayBase_1.28.0 AnnotationDbi_1.52.0          fansi_0.4.2                  
 [13] xml2_1.3.2                    sparseMatrixStats_1.2.0       cachem_1.0.1                  scater_1.18.3                
 [17] Rsamtools_2.6.0               dbplyr_2.1.1                  shiny_1.6.0                   BiocManager_1.30.10          
 [21] compiler_4.0.3                httr_1.4.2                    dqrng_0.2.1                   lazyeval_0.2.2               
 [25] assertthat_0.2.1              Matrix_1.3-2                  fastmap_1.1.0                 limma_3.46.0                 
 [29] later_1.1.0.1                 BiocSingular_1.6.0            htmltools_0.5.1.1             prettyunits_1.1.1            
 [33] tools_4.0.3                   rsvd_1.0.3                    igraph_1.2.6                  gtable_0.3.0                 
 [37] glue_1.4.2                    GenomeInfoDbData_1.2.4        dplyr_1.0.5                   rappdirs_0.3.1               
 [41] Rcpp_1.0.7                    vctrs_0.3.6                   Biostrings_2.58.0             ExperimentHub_1.16.1         
 [45] rtracklayer_1.50.0            DelayedMatrixStats_1.12.2     stringr_1.4.0                 beachmat_2.6.4               
 [49] mime_0.9                      lifecycle_1.0.0               irlba_2.3.3                   ensembldb_2.14.0             
 [53] renv_0.13.2                   statmod_1.4.35                XML_3.99-0.5                  AnnotationHub_2.22.0         
 [57] edgeR_3.32.1                  zlibbioc_1.36.0               scales_1.1.1                  ProtGenerics_1.22.0          
 [61] hms_1.0.0                     promises_1.1.1                AnnotationFilter_1.14.0       yaml_2.2.1                   
 [65] curl_4.3                      memoise_2.0.0                 gridExtra_2.3                 ggplot2_3.3.5                
 [69] biomaRt_2.46.1                stringi_1.5.3                 RSQLite_2.2.3                 BiocVersion_3.12.0           
 [73] GenomicFeatures_1.42.1        rlang_0.4.12                  pkgconfig_2.0.3               bitops_1.0-6                 
 [77] lattice_0.20-41               purrr_0.3.4                   GenomicAlignments_1.26.0      bit_4.0.4                    
 [81] tidyselect_1.1.0              magrittr_2.0.1                R6_2.5.0                      generics_0.1.0               
 [85] DelayedArray_0.16.1           DBI_1.1.1                     withr_2.4.2                   pillar_1.6.0                 
 [89] RCurl_1.98-1.2                tibble_3.1.1                  crayon_1.4.1                  xgboost_1.3.2.1              
 [93] utf8_1.1.4                    BiocFileCache_1.14.0          viridis_0.5.1                 progress_1.2.2               
 [97] locfit_1.5-9.4                grid_4.0.3                    data.table_1.13.6             blob_1.2.1                   
[101] digest_0.6.27                 xtable_1.8-4                  httpuv_1.5.5                  openssl_1.4.3                
[105] munsell_0.5.0                 beeswarm_0.2.3                viridisLite_0.3.0             vipor_0.4.5                  
[109] askpass_1.1
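
With forked workers, a `set.seed()` call in the calling session does not by itself fix the random streams used by the workers. BiocParallel parameter objects accept an `RNGseed` argument for this. The sketch below shows that mechanism on plain random draws, not on scDblFinder itself; whether it makes a given scDblFinder version fully deterministic is not established here. `draw` is a hypothetical helper, and the code does nothing if BiocParallel is not installed.

```r
# Draw random numbers in parallel workers, seeding through the
# BiocParallel parameter object rather than through set.seed().
if (requireNamespace("BiocParallel", quietly = TRUE)) {
  draw <- function(seed) {
    bp <- BiocParallel::MulticoreParam(workers = 2, RNGseed = seed)
    unlist(BiocParallel::bplapply(1:4, function(i) rnorm(1), BPPARAM = bp))
  }
  # Two independent runs with the same RNGseed should give the same draws.
  reproducible <- identical(draw(123), draw(123))
} else {
  reproducible <- NA  # BiocParallel not available; nothing was run
}
print(reproducible)
```

The same parameter object could then be passed as `BPPARAM` in the example above in place of `MulticoreParam(workers = 3)`.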

Why does this error occur?

bcmvn.MM482 <- find.pK(sweep.stats.MM482)
DimPlot(object = MM482.BM, reduction = 'umap', group.by = "RNA_snn_res.0.5", label = TRUE, repel = TRUE, raster=FALSE) + NoLegend()
FeaturePlot(MM482.BM, features = "scDblFinder.score", cols = c("yellow", "red"), reduction = 'umap', raster = FALSE) + DarkTheme()
FeaturePlot(MM482.BM, features = "pANN_0.25_0.005_555",cols = c("yellow", "red"), reduction = 'umap', raster=FALSE) + DarkTheme()
DimPlot(MM482.BM,pt.size = 1,label=FALSE, label.size = 5,reduction = "umap",group.by = "DF.classifications_0.25_0.005_555")
DimPlot(MM482.BM,pt.size = 1,label=FALSE, label.size = 5,reduction = "umap",group.by = "DF.classifications_0.25_0.005_483")
MM482.BM <- doubletFinder_v3(MM482.BM, PCs = use.pcs, pN = 0.25, pK = mpk.MM482, nExp = nExp_poi.adj.MM482, reuse.pANN = "pANN_0.25_0.005_555", sct = FALSE)
MM482.singlet <- subset(x = MM482.BM, subset = DF.classifications_0.25_0.005_483 == "Singlet")
MM482.singlet

FeaturePlot(MM482.BM, features = "scDblFinder.score", cols = c("yellow", "red"), reduction = 'umap', raster = FALSE) + DarkTheme()
Error: None of the requested features were found: scDblFinder.score in slot data
In addition: Warning message:
In FetchData(object = object, vars = c(dims, "ident", features), :
The following requested variables were not found: scDblFinder.score
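
The pasted code only shows DoubletFinder calls and never shows `scDblFinder` being run on `MM482.BM`, so the `scDblFinder.score` column may simply be missing from the object's metadata. That reading is an assumption based only on the snippet. A minimal sketch of adding the column, where `add_doublet_scores` is a hypothetical helper that needs Seurat and scDblFinder installed when it is called:

```r
# Hypothetical helper: run scDblFinder on a Seurat object and copy the
# per-cell results back into the Seurat metadata.
add_doublet_scores <- function(seu, ...) {
  sce <- Seurat::as.SingleCellExperiment(seu)
  sce <- scDblFinder::scDblFinder(sce, ...)
  # Align by cell name in case the column order differs.
  idx <- match(colnames(seu), colnames(sce))
  seu$scDblFinder.score <- sce$scDblFinder.score[idx]
  seu$scDblFinder.class <- as.character(sce$scDblFinder.class)[idx]
  seu
}

# Usage (not run here):
# MM482.BM <- add_doublet_scores(MM482.BM)
# FeaturePlot(MM482.BM, features = "scDblFinder.score", reduction = "umap")
```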

Error when running library(scDblFinder) on jupyter through %%R

Describe the bug
When I ran library(scDblFinder) in Jupyter, this error appeared: Error: package or namespace load failed for ‘scDblFinder’ in dyn.load(file, DLLpath = DLLpath, ...):
unable to load shared object '/home/nguyen/R/x86_64-pc-linux-gnu-library/4.1/xgboost/libs/xgboost.so':
/home/nguyen/anaconda3/lib/python3.10/site-packages/zmq/backend/cython/../../../../.././libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /home/nguyen/R/x86_64-pc-linux-gnu-library/4.1/xgboost/libs/xgboost.so)
I tried updating libstdc++.so.6 and removing and reinstalling the package, but it still does not work.
Please help me.
I am using R version 4.1.2
I set up scDblFinder through BiocManager: version 1.13.13
Thank you very much.

scDblFinder.class no longer added to SCE output

I'm having a new issue where scDblFinder.class is no longer added to my SCE. I am using the development version. It seems to run with no issue, so I'm at a bit of a loss.

> sce <- scDblFinder(sce, samples = "Sample", BPPARAM = BP, verbose = TRUE)
Training model...
Error in base::table(...) : all arguments must have the same length
> names(colData(sce))
 [1] "Sample"                        "Barcode"                       "Group"                        
 [4] "Batch"                         "sum"                           "detected"                     
 [7] "subsets_Mito_sum"              "subsets_Mito_detected"         "subsets_Mito_percent"         
[10] "total"                         "discard"                       "Phase"                        
[13] "G1.score"                      "S.score"                       "G2M.score"                    
[16] "scDblFinder.sample"            "scDblFinder.cluster"           "scDblFinder.distanceToNearest"
[19] "scDblFinder.nearestClass"      "scDblFinder.difficulty"        "scDblFinder.ratio"            
[22] "scDblFinder.cxds_score"        "scDblFinder.weighted"          "scDblFinder.score"            
[25] "scDblFinder.mostLikelyOrigin"  "scDblFinder.originAmbiguous"

sessionInfo:

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.7 LTS

Matrix products: default
BLAS:   /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] BiocStyle_2.17.1            openxlsx_4.2.2              pheatmap_1.0.12             PCAtools_2.1.22            
 [5] lattice_0.20-41             reshape2_1.4.4              ggrepel_0.8.2               BiocParallel_1.23.2        
 [9] scDblFinder_1.3.9           cowplot_1.1.0               scuttle_0.99.15             celldex_0.99.1             
[13] dittoSeq_1.1.9              SingleR_1.3.8               scran_1.17.18               scater_1.17.5              
[17] ggplot2_3.3.2               DropletUtils_1.9.12         SingleCellExperiment_1.11.7 SummarizedExperiment_1.19.7
[21] DelayedArray_0.15.11        matrixStats_0.57.0          Matrix_1.2-18               Biobase_2.49.1             
[25] GenomicRanges_1.41.6        GenomeInfoDb_1.25.11        IRanges_2.23.10             S4Vectors_0.27.13          
[29] BiocGenerics_0.35.4        

loaded via a namespace (and not attached):
  [1] ggbeeswarm_0.6.0              colorspace_1.4-1              ellipsis_0.3.1                ggridges_0.5.2               
  [5] bluster_0.99.1                XVector_0.29.3                BiocNeighbors_1.7.0           yaImpute_1.0-32              
  [9] rstudioapi_0.11               farver_2.0.3                  bit64_4.0.5                   interactiveDisplayBase_1.27.5
 [13] AnnotationDbi_1.51.3          R.methodsS3_1.8.1             knitr_1.30                    pROC_1.16.2                  
 [17] dbplyr_1.4.4                  R.oo_1.24.0                   shiny_1.5.0                   HDF5Array_1.17.11            
 [21] BiocManager_1.30.10           compiler_4.0.2                httr_1.4.2                    dqrng_0.2.1                  
 [25] assertthat_0.2.1              fastmap_1.0.1                 limma_3.45.14                 later_1.1.0.1                
 [29] BiocSingular_1.5.1            htmltools_0.5.0               tools_4.0.2                   rsvd_1.0.3                   
 [33] igraph_1.2.5                  gtable_0.3.0                  glue_1.4.2                    GenomeInfoDbData_1.2.3       
 [37] dplyr_1.0.2                   rappdirs_0.3.1                Rcpp_1.0.5                    vctrs_0.3.4                  
 [41] rhdf5filters_1.1.3            ExperimentHub_1.15.3          DelayedMatrixStats_1.11.1     xfun_0.18                    
 [45] stringr_1.4.0                 mime_0.9                      lifecycle_0.2.0               irlba_2.3.3                  
 [49] statmod_1.4.34                AnnotationHub_2.21.5          edgeR_3.31.4                  zlibbioc_1.35.0              
 [53] scales_1.1.1                  promises_1.1.1                rhdf5_2.33.10                 RColorBrewer_1.1-2           
 [57] yaml_2.2.1                    curl_4.3                      memoise_1.1.0                 gridExtra_2.3                
 [61] stringi_1.5.3                 RSQLite_2.2.1                 BiocVersion_3.12.0            zip_2.1.1                    
 [65] rlang_0.4.7                   pkgconfig_2.0.3               bitops_1.0-6                  evaluate_0.14                
 [69] purrr_0.3.4                   Rhdf5lib_1.11.3               labeling_0.3                  bit_4.0.4                    
 [73] tidyselect_1.1.0              plyr_1.8.6                    magrittr_1.5                  R6_2.4.1                     
 [77] generics_0.0.2                DBI_1.1.0                     pillar_1.4.6                  withr_2.3.0                  
 [81] RCurl_1.98-1.2                tibble_3.0.3                  crayon_1.3.4                  intrinsicDimension_1.2.0     
 [85] xgboost_1.2.0.1               BiocFileCache_1.13.1          rmarkdown_2.4                 viridis_0.5.1                
 [89] locfit_1.5-9.4                grid_4.0.2                    data.table_1.13.0             blob_1.2.1                   
 [93] digest_0.6.25                 xtable_1.8-4                  httpuv_1.5.4                  R.utils_2.10.1               
 [97] scds_1.5.0                    munsell_0.5.0                 beeswarm_0.2.3                viridisLite_0.3.0

size factors should be positive in computeDoubletDensity

Dear all,

With seemingly arbitrary samples (until recently I did not know which property was decisive), computeDoubletDensity failed with 'Error in .local(x, ...) : size factors should be positive'.

In some cases increasing subset.row was helpful (e.g. selecting the top 5000 highly variable features, instead of 2000). Sometimes though this could not resolve the issue.

I saw the other, similar issue raised here (#32). My data set, though, had already been cleaned of cells with very low total read counts, and the error persisted.

Solution:

One has to exclude cells which have zero total reads for features provided in subset.row:

factors <- scuttle::librarySizeFactors(expr_mat[subset.row,])
which(factors == 0)

Would it be acceptable to add a more meaningful error message? E.g. inform the user about cell names which have zero as library size?

I expect that handling such an error inside your function is not in your interest. If it is, though:
(i) What would happen if library sizes were increased by a common value to avoid zeros? I mean adding 1 or 0.0001 or so, similar to log1p.
(ii) If such cells are excluded, an NA could be returned as the doublet score. Or a -1 or so? Or they could be excluded completely from the return value, which would cause other problems though.

Thanks.
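
The check described in this report can be sketched in base R, since a library size factor is zero exactly when a cell has no counts in the selected rows. The toy matrix and the names `g1`..`g4` and `c1`..`c3` are made up for illustration; `expr_mat` and `subset.row` follow the names used above.

```r
# Toy count matrix: 4 genes x 3 cells. Cell c2 has counts only in genes
# that are not part of subset.row.
expr_mat <- matrix(
  c(3, 1, 0, 0,
    0, 0, 5, 2,
    4, 0, 1, 0),
  nrow = 4,
  dimnames = list(paste0("g", 1:4), paste0("c", 1:3))
)
subset.row <- c("g1", "g2")

# Per-cell totals over the selected features only.
lib <- colSums(expr_mat[subset.row, , drop = FALSE])
zero_cells <- names(lib)[lib == 0]
print(zero_cells)

# Drop those cells before computing doublet scores.
filtered <- expr_mat[, lib > 0, drop = FALSE]
print(dim(filtered))
```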

error in serialize(data)

Hi,

Thanks for maintaining this tool. I ran into a problem when using it with MulticoreParam.

code:

library(scDblFinder)
library(BiocParallel)

sce = as.SingleCellExperiment(seurat_filtered)
sce = scDblFinder(sce, samples="sample_label", BPPARAM=MulticoreParam(4))

Error in serialize(data, node$con, xdr = FALSE) : 
  error writing to connection
Error in manager$availability[[as.character(result$node)]] <- TRUE : 
  wrong args for environment subassignment
In addition: Warning messages:
1: In serialize(data, node$con, xdr = FALSE) :
  'package:stats' may not be available when loading
2: In serialize(data, node$con, xdr = FALSE) :
  'package:stats' may not be available when loading
3: In serialize(data, node$con, xdr = FALSE) :
  'package:stats' may not be available when loading
Error in serialize(data, node$con, xdr = FALSE) : 
  error writing to connection

When I remove BPPARAM=MulticoreParam(4), the code runs through without error (although slowly), so I guess the problem is related to the parallel processing. The object I am dealing with is 4.3 GB, while the server has more than 140 GB of memory, so I guess it shouldn't be a memory issue. Do you have any idea about this problem and a potential solution?

Thanks,

Issue running scDblFinder on scATAC-seq data

Hello,

I am trying to run scDblFinder to find doublets in my scATAC-seq data but run into the following error early on:

Error in names(res) <- nms :
'names' attribute [4] must be the same length as the vector [2]
In addition: Warning message:
stop worker failed:
attempt to select less than one element in OneIndex

From preliminary google searches, this problem seems external to scDblFinder, but any insight you may have will be very helpful.
I am running the following code:

cancer_2_h_new
#An object of class Seurat
#251195 features across 23665 samples within 1 assay
#Active assay: ATAC (251195 features, 251195 variable features)
#4 dimensional reductions calculated: lsi, umap, harmony, umap_harmony

cancer_sce = as.SingleCellExperiment(cancer_2_h_new)

set.seed(123)
library(scDblFinder)
library(BiocParallel)

sce <- scDblFinder(cancer_sce, samples="Mouse", aggregateFeatures=TRUE, nfeatures=25,BPPARAM=MulticoreParam(3), processing = "normFeatures")

rbind error when running multiple samples

Hello,
I am able to run the developer version of scDblFinder with one sample, but when using an SCE with multiple samples (named in colData) I run into the following error (true whether I load in an SCE or matrix with a vector of sample IDs):

masterSCE = scDblFinder(sce = sce, samples = "sample_ID", nfeatures = 1000, score = 'xgb',verbose = TRUE)
Error in .format_mismatch_message(x_colnames, object_colnames) :
the DataFrame objects to rbind do not have the same column names ('ratio.k20' is unique)

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C LC_TIME=English_United States.1252

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] forcats_0.5.0 stringr_1.4.0 dplyr_1.0.2 purrr_0.3.4 readr_1.4.0 tidyr_1.1.2 tibble_3.0.3 ggplot2_3.3.2 tidyverse_1.3.0 DropletUtils_1.9.13 SingleCellExperiment_1.11.8
[12] SummarizedExperiment_1.19.9 Biobase_2.49.1 GenomicRanges_1.41.6 GenomeInfoDb_1.25.11 IRanges_2.23.10 S4Vectors_0.27.13 BiocGenerics_0.35.4 MatrixGenerics_1.1.3 matrixStats_0.57.0 scDblFinder_1.3.9

loaded via a namespace (and not attached):
[1] ggbeeswarm_0.6.0 colorspace_1.4-1 ellipsis_0.3.1 rprojroot_1.3-2 scuttle_0.99.18 bluster_0.99.1 XVector_0.29.3 BiocNeighbors_1.7.0 fs_1.5.0 yaImpute_1.0-32 rstudioapi_0.11 remotes_2.2.0
[13] fansi_0.4.1 lubridate_1.7.9 xml2_1.3.2 R.methodsS3_1.8.1 scater_1.17.5 jsonlite_1.7.1 pROC_1.16.2 broom_0.7.1 dbplyr_1.4.4 R.oo_1.24.0 HDF5Array_1.17.13 BiocManager_1.30.10
[25] compiler_4.0.2 httr_1.4.2 dqrng_0.2.1 backports_1.1.10 assertthat_0.2.1 Matrix_1.2-18 limma_3.45.14 cli_2.0.2 BiocSingular_1.5.2 prettyunits_1.1.1 tools_4.0.2 rsvd_1.0.3
[37] igraph_1.2.5 gtable_0.3.0 glue_1.4.2 GenomeInfoDbData_1.2.4 Rcpp_1.0.5 cellranger_1.1.0 vctrs_0.3.4 rhdf5filters_1.1.3 DelayedMatrixStats_1.11.1 ps_1.3.4 rvest_0.3.6 beachmat_2.5.8
[49] lifecycle_0.2.0 irlba_2.3.3 statmod_1.4.34 edgeR_3.31.4 zlibbioc_1.35.0 scales_1.1.1 hms_0.5.3 rhdf5_2.33.10 yaml_2.2.1 curl_4.3 gridExtra_2.3 stringi_1.5.3
[61] scran_1.17.20 pkgbuild_1.1.0 BiocParallel_1.23.2 rlang_0.4.7 pkgconfig_2.0.3 bitops_1.0-6 lattice_0.20-41 Rhdf5lib_1.11.3 processx_3.4.4 tidyselect_1.1.0 plyr_1.8.6 magrittr_1.5
[73] R6_2.4.1 generics_0.0.2 DelayedArray_0.15.15 DBI_1.1.0 pillar_1.4.6 haven_2.3.1 withr_2.3.0 RCurl_1.98-1.2 modelr_0.1.8 crayon_1.3.4 intrinsicDimension_1.2.0 xgboost_1.2.0.1
[85] viridis_0.5.1 locfit_1.5-9.4 grid_4.0.2 readxl_1.3.1 data.table_1.13.0 blob_1.2.1 callr_3.4.4 reprex_0.3.0 R.utils_2.10.1 scds_1.5.0 munsell_0.5.0 beeswarm_0.2.3
[97] viridisLite_0.3.0 vipor_0.4.5

Deprecated "dgCMatrix"

Hi Pierre-Luc,

fyi, the Matrix package >= 1.5.0 has deprecated the as(., "dgCMatrix") syntax, now erroring if that is used. It now must be as(., "CsparseMatrix"). I would therefore suggest to update the respective lines in the source and require Matrix to be >= 1.5.0 in the DESCRIPTION.

I do not have a MRE at hand now, but once you update Matrix you get something like:

> bp  <- BiocParallel::MulticoreParam(mc_workers, RNGseed=1234)
> sce <- scDblFinder::scDblFinder(sce, clusters="cluster", samples="zt", BPPARAM=bp)
Error: BiocParallel errors
  2 remote errors, element index: 1, 2
  0 unevaluated and other errors
  first remote error:
Error in value[[3L]](cond): An error occured while processing sample 'zt1':
Error: as(<dgeMatrix>, "dgCMatrix") is deprecated since Matrix 1.5-0; do as(., "CsparseMatrix") instead

best,
-Alex
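
For reference, a minimal sketch of the replacement coercion described above, using only the Matrix package. The toy matrix and the variable names are illustrative assumptions and are not taken from the scDblFinder source.

```r
library(Matrix)

# A small dense Matrix object (class dgeMatrix), standing in for the
# intermediate object named in the error message.
dense <- Matrix(c(1, 0, 0, 2, 0, 3), nrow = 2, sparse = FALSE)

# Coerce to the virtual class; Matrix chooses the concrete sparse class.
sparse <- as(dense, "CsparseMatrix")

class(sparse)
```

For a general numeric matrix like this one, the result is still a dgCMatrix, so code that expects that class downstream should be unaffected.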

xgboost install

In R version 4.1.2
Error: package or namespace load failed for ‘scDblFinder’ in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]): there is no package called ‘xgboost’

I have tried
devtools::install_github("plger/scDblFinder")


rjson   (0.2.20 -> 0.2.21 ) [CRAN]
xgboost (NA     -> 1.5.2.1) [CRAN]
Skipping 33 packages ahead of CRAN: BiocGenerics, S4Vectors, DelayedArray, Biobase, MatrixGenerics, Rhtslib, zlibbioc, GenomeInfoDbData, XVector, BiocParallel, Rsamtools, Biostrings, SummarizedExperiment, GenomicRanges, GenomeInfoDb, IRanges, BiocNeighbors, beachmat, ScaledMatrix, sparseMatrixStats, limma, DelayedMatrixStats, SingleCellExperiment, BiocIO, GenomicAlignments, BiocSingular, scuttle, metapod, bluster, edgeR, rtracklayer, scater, scran

Installing 2 packages: rjson, xgboost

Installing packages into ‘/home/jovyan/R/x86_64-pc-linux-gnu-library/4.1’
(as ‘lib’ is unspecified)

Warning message in i.p(...):
“installation of package ‘xgboost’ had non-zero exit status”
✔  checking for file ‘/tmp/RtmprOot6n/remotes3b5e71218072/plger-scDblFinder-fec63bf/DESCRIPTION’ (454ms)
─  preparing ‘scDblFinder’:
✔  checking DESCRIPTION meta-information
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  looking to see if a ‘data/datalist’ file should be added
─  building ‘scDblFinder_1.9.5.tar.gz’
   

Installing package into ‘/home/jovyan/R/x86_64-pc-linux-gnu-library/4.1’
(as ‘lib’ is unspecified)

Warning message in i.p(...):
“installation of package ‘/tmp/RtmprOot6n/file3b5e6d7b9987/scDblFinder_1.9.5.tar.gz’ had non-zero exit status”

and

install.packages("xgboost")

(as ‘lib’ is unspecified)

Warning message in install.packages("xgboost"):
“installation of package ‘xgboost’ had non-zero exit status”

Any suggestions for ways around this, short of rolling back my R version?

Merging multiple samples with scATAC prior to scDblFinder

Hi! Thanks for your great software.

I am using your package with some Multiome data (calling doublets separately for RNA and ATAC). I have multiple samples and was following your recommendation of creating a SingleCellExperiment with all samples together. This is straightforward for the RNA data (they all quantify the same rows/genes), but for the ATAC data, peaks/rows are unique for each library. I could do a merge of the ATAC peaks for each sample, and re-quantify in those regions (e.g. like this), but it seems like a lot of pre-processing and modifying the raw count data prior to running the doublet finder algorithm.

So my question is: how much better is it to merge samples than to run the doublet detection separately for each sample? If the results are (more or less) robust, maybe it is OK to run the samples separately?

thanks for your help

Error in scDblFinder() due to NULL `th`

Hi,

Trying the function on the Kumar dataset, I got:

> scDblFinder(sce, trans = "scran", verbose=FALSE)
Error in `$<-.data.frame`(`*tmp*`, "classification", value = logical(0)) : 
  replacement has 0 rows, data has 457

Debugging the function a bit, I noticed that doubletThresholding() returns NULL, which causes the error with the empty values at this line in the code of scDblFinder:

d$classification <- ifelse(d$ratio >= th, "doublet", "singlet")

Given that th is NULL.
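
The failure mode can be reproduced in base R without the package. This is a sketch with toy values standing in for d$ratio and th; the classify() guard is a hypothetical illustration, not the fix used in scDblFinder.

```r
# Toy stand-ins for d$ratio and the threshold returned by doubletThresholding().
ratio <- c(0.1, 0.5, 0.9)
th <- NULL

# Comparing against NULL yields logical(0), and ifelse() passes that through,
# which is why the data.frame assignment sees a replacement with 0 rows.
calls <- ifelse(ratio >= th, "doublet", "singlet")
length(calls)

# Hypothetical guard: fail early with a clearer message.
classify <- function(ratio, th) {
  if (is.null(th) || length(th) != 1L || is.na(th)) {
    stop("threshold is NULL or invalid; cannot classify cells")
  }
  ifelse(ratio >= th, "doublet", "singlet")
}

classify(ratio, 0.5)
```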

Error when using custom clusters = '...'

Thanks for this tool! I ran into this problem when I tried to set custom clusters = '...' while running:
scDblFinder(sce, samples = "batch", clusters = 'my_cluster_label')
Error in scDblFinder(sce[, x], artificialDoublets = artificialDoublets, :
Only one cluster generated

my_cluster_label is a column name in colData(sce).
I tried converting sce@colData$my_cluster_label to factor/numeric/character, and also setting clusters = sce@colData$my_cluster_label directly, but none of these helped.
Without setting clusters, I do get the 'scDblFinder.class' label in my sce.
I was using scDblFinder v1.1.8.
Thanks for any help!!!

Batch mode with BPPARAM=MulticoreParam() not working

Hi, I'm not sure if this is an issue with scDblFinder, BiocParallel, or me, but this worked in the past and no longer does for some reason.

library(Seurat)
#> Attaching SeuratObject
library(scDblFinder)
library(BiocParallel)

l <- c(pbmc_small, pbmc_small)
l[[1]][["batch"]] = "A"
l[[2]][["batch"]] = "B"
seu <- merge(x=l[[1]], y=l[[2]])
#> Warning in CheckDuplicateCellNames(object.list = objects): Some cell names are
#> duplicated across objects provided. Renaming to enforce unique cell names.

sce <- as.SingleCellExperiment(seu)
out <- scDblFinder(sce, samples = "batch", BPPARAM=MulticoreParam(2))
#> Warning in parallel::mccollect(wait = FALSE, timeout = 1): 1 parallel job did
#> not deliver a result
#> Error in result[[njob]] <- value: attempt to select less than one element in OneIndex

Created on 2021-05-02 by the reprex package (v2.0.0)

Session info

sessioninfo::session_info()
─ Session info ─────────────────────────────────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 4.0.5 (2021-03-31)
 os       macOS Big Sur 10.16         
 system   x86_64, darwin17.0          
 ui       RStudio                     
 language (EN)                        
 collate  en_GB.UTF-8                 
 ctype    en_GB.UTF-8                 
 tz       Europe/London               
 date     2021-05-02                  

─ Packages ─────────────────────────────────────────────────────────────────────────────────────────────
 package              * version    date       lib source                                 
 abind                  1.4-5      2016-07-21 [1] CRAN (R 4.0.2)                         
 assertthat             0.2.1      2019-03-21 [1] CRAN (R 4.0.2)                         
 backports              1.2.1      2020-12-09 [1] CRAN (R 4.0.2)                         
 beachmat               2.6.4      2020-12-20 [1] Bioconductor                           
 beeswarm               0.3.1      2021-03-07 [1] CRAN (R 4.0.3)                         
 Biobase                2.50.0     2020-10-27 [1] Bioconductor                           
 BiocGenerics           0.36.1     2021-04-16 [1] Bioconductor                           
 BiocNeighbors          1.8.2      2020-12-07 [1] Bioconductor                           
 BiocParallel         * 1.24.1     2020-11-06 [1] Bioconductor                           
 BiocSingular           1.6.0      2020-10-27 [1] Bioconductor                           
 bitops                 1.0-7      2021-04-24 [1] CRAN (R 4.0.2)                         
 bluster                1.0.0      2020-10-27 [1] Bioconductor                           
 callr                  3.7.0      2021-04-20 [1] CRAN (R 4.0.2)                         
 cli                    2.5.0      2021-04-26 [1] CRAN (R 4.0.2)                         
 clipr                  0.7.1      2020-10-08 [1] CRAN (R 4.0.2)                         
 cluster                2.1.2      2021-04-17 [1] CRAN (R 4.0.2)                         
 codetools              0.2-18     2020-11-04 [1] CRAN (R 4.0.5)                         
 colorspace             2.0-0      2020-11-11 [1] CRAN (R 4.0.2)                         
 cowplot                1.1.1      2020-12-30 [1] CRAN (R 4.0.2)                         
 crayon                 1.4.1      2021-02-08 [1] CRAN (R 4.0.2)                         
 data.table             1.14.0     2021-02-21 [1] CRAN (R 4.0.2)                         
 DBI                    1.1.1      2021-01-15 [1] CRAN (R 4.0.3)                         
 DelayedArray           0.16.3     2021-03-24 [1] Bioconductor                           
 DelayedMatrixStats     1.12.3     2021-02-03 [1] Bioconductor                           
 deldir                 0.2-10     2021-02-16 [1] CRAN (R 4.0.2)                         
 digest                 0.6.27     2020-10-24 [1] CRAN (R 4.0.2)                         
 dplyr                  1.0.5      2021-03-05 [1] CRAN (R 4.0.2)                         
 dqrng                  0.2.1      2019-05-17 [1] CRAN (R 4.0.2)                         
 edgeR                  3.32.1     2021-01-14 [1] Bioconductor                           
 ellipsis               0.3.1      2020-05-15 [1] CRAN (R 4.0.2)                         
 evaluate               0.14       2019-05-28 [1] CRAN (R 4.0.1)                         
 fansi                  0.4.2      2021-01-15 [1] CRAN (R 4.0.2)                         
 fastmap                1.1.0      2021-01-25 [1] CRAN (R 4.0.2)                         
 fitdistrplus           1.1-3      2020-12-05 [1] CRAN (R 4.0.2)                         
 fs                     1.5.0      2020-07-31 [1] CRAN (R 4.0.2)                         
 future                 1.21.0     2020-12-10 [1] CRAN (R 4.0.3)                         
 future.apply           1.7.0      2021-01-04 [1] CRAN (R 4.0.2)                         
 generics               0.1.0      2020-10-31 [1] CRAN (R 4.0.2)                         
 GenomeInfoDb           1.26.7     2021-04-08 [1] Bioconductor                           
 GenomeInfoDbData       1.2.4      2021-01-17 [1] Bioconductor                           
 GenomicRanges          1.42.0     2020-10-27 [1] Bioconductor                           
 ggbeeswarm             0.6.0      2017-08-07 [1] CRAN (R 4.0.2)                         
 ggplot2                3.3.3      2020-12-30 [1] CRAN (R 4.0.2)                         
 ggrepel                0.9.1      2021-01-15 [1] CRAN (R 4.0.2)                         
 ggridges               0.5.3      2021-01-08 [1] CRAN (R 4.0.2)                         
 globals                0.14.0     2020-11-22 [1] CRAN (R 4.0.2)                         
 glue                   1.4.2      2020-08-27 [1] CRAN (R 4.0.2)                         
 goftest                1.2-2      2019-12-02 [1] CRAN (R 4.0.2)                         
 gridExtra              2.3        2017-09-09 [1] CRAN (R 4.0.2)                         
 gtable                 0.3.0      2019-03-25 [1] CRAN (R 4.0.2)                         
 highr                  0.9        2021-04-16 [1] CRAN (R 4.0.5)                         
 htmltools              0.5.1.1    2021-01-22 [1] CRAN (R 4.0.2)                         
 htmlwidgets            1.5.3      2020-12-10 [1] CRAN (R 4.0.3)                         
 httpuv                 1.6.0      2021-04-23 [1] CRAN (R 4.0.2)                         
 httr                   1.4.2      2020-07-20 [1] CRAN (R 4.0.2)                         
 ica                    1.0-2      2018-05-24 [1] CRAN (R 4.0.2)                         
 igraph                 1.2.6      2020-10-06 [1] CRAN (R 4.0.2)                         
 IRanges                2.24.1     2020-12-12 [1] Bioconductor                           
 irlba                  2.3.3      2019-02-05 [1] CRAN (R 4.0.2)                         
 jsonlite               1.7.2      2020-12-09 [1] CRAN (R 4.0.2)                         
 KernSmooth             2.23-18    2020-10-29 [1] CRAN (R 4.0.5)                         
 knitr                  1.33       2021-04-24 [1] CRAN (R 4.0.2)                         
 later                  1.2.0      2021-04-23 [1] CRAN (R 4.0.2)                         
 lattice                0.20-41    2020-04-02 [1] CRAN (R 4.0.5)                         
 lazyeval               0.2.2      2019-03-15 [1] CRAN (R 4.0.2)                         
 leiden                 0.3.7      2021-01-26 [1] CRAN (R 4.0.3)                         
 lifecycle              1.0.0      2021-02-15 [1] CRAN (R 4.0.2)                         
 limma                  3.46.0     2020-10-27 [1] Bioconductor                           
 listenv                0.8.0      2019-12-05 [1] CRAN (R 4.0.2)                         
 lmtest                 0.9-38     2020-09-09 [1] CRAN (R 4.0.2)                         
 locfit                 1.5-9.4    2020-03-25 [1] CRAN (R 4.0.2)                         
 magrittr               2.0.1      2020-11-17 [1] CRAN (R 4.0.2)                         
 MASS                   7.3-53.1   2021-02-12 [1] CRAN (R 4.0.5)                         
 Matrix                 1.3-2      2021-01-06 [1] CRAN (R 4.0.5)                         
 MatrixGenerics         1.2.1      2021-01-30 [1] Bioconductor                           
 matrixStats            0.58.0     2021-01-29 [1] CRAN (R 4.0.2)                         
 mgcv                   1.8-35     2021-04-18 [1] CRAN (R 4.0.2)                         
 mime                   0.10       2021-02-13 [1] CRAN (R 4.0.2)                         
 miniUI                 0.1.1.1    2018-05-18 [1] CRAN (R 4.0.2)                         
 munsell                0.5.0      2018-06-12 [1] CRAN (R 4.0.2)                         
 nlme                   3.1-152    2021-02-04 [1] CRAN (R 4.0.5)                         
 parallelly             1.24.0     2021-03-14 [1] CRAN (R 4.0.2)                         
 patchwork              1.1.1      2020-12-17 [1] CRAN (R 4.0.2)                         
 pbapply                1.4-3      2020-08-18 [1] CRAN (R 4.0.2)                         
 pillar                 1.6.0      2021-04-13 [1] CRAN (R 4.0.5)                         
 pkgconfig              2.0.3      2019-09-22 [1] CRAN (R 4.0.2)                         
 plotly                 4.9.3      2021-01-10 [1] CRAN (R 4.0.2)                         
 plyr                   1.8.6      2020-03-03 [1] CRAN (R 4.0.2)                         
 png                    0.1-7      2013-12-03 [1] CRAN (R 4.0.2)                         
 polyclip               1.10-0     2019-03-14 [1] CRAN (R 4.0.2)                         
 processx               3.5.1      2021-04-04 [1] CRAN (R 4.0.2)                         
 promises               1.2.0.1    2021-02-11 [1] CRAN (R 4.0.2)                         
 ps                     1.6.0      2021-02-28 [1] CRAN (R 4.0.3)                         
 purrr                  0.3.4      2020-04-17 [1] CRAN (R 4.0.2)                         
 R6                     2.5.0      2020-10-28 [1] CRAN (R 4.0.2)                         
 RANN                   2.6.1      2019-01-08 [1] CRAN (R 4.0.2)                         
 RColorBrewer           1.1-2      2014-12-07 [1] CRAN (R 4.0.2)                         
 Rcpp                   1.0.6      2021-01-15 [1] CRAN (R 4.0.2)                         
 RcppAnnoy              0.0.18     2020-12-15 [1] CRAN (R 4.0.2)                         
 RCurl                  1.98-1.3   2021-03-16 [1] CRAN (R 4.0.2)                         
 reprex               * 2.0.0      2021-04-02 [1] CRAN (R 4.0.2)                         
 reshape2               1.4.4      2020-04-09 [1] CRAN (R 4.0.2)                         
 reticulate             1.19       2021-04-21 [1] CRAN (R 4.0.2)                         
 rlang                  0.4.10     2020-12-30 [1] CRAN (R 4.0.2)                         
 rmarkdown              2.7        2021-02-19 [1] CRAN (R 4.0.2)                         
 ROCR                   1.0-11     2020-05-02 [1] CRAN (R 4.0.2)                         
 rpart                  4.1-15     2019-04-12 [1] CRAN (R 4.0.5)                         
 rstudioapi             0.13       2020-11-12 [1] CRAN (R 4.0.2)                         
 rsvd                   1.0.5      2021-04-16 [1] CRAN (R 4.0.5)                         
 Rtsne                  0.15       2018-11-10 [1] CRAN (R 4.0.2)                         
 S4Vectors              0.28.1     2020-12-09 [1] Bioconductor                           
 scales                 1.1.1      2020-05-11 [1] CRAN (R 4.0.2)                         
 scater                 1.18.6     2021-02-26 [1] Bioconductor                           
 scattermore            0.7        2020-11-24 [1] CRAN (R 4.0.2)                         
 scDblFinder          * 1.5.16     2021-04-19 [1] Github (plger/scDblFinder@d11467b)     
 scran                  1.18.7     2021-04-16 [1] Bioconductor                           
 sctransform            0.3.2.9006 2021-04-01 [1] Github (ChristophH/sctransform@73e2e3e)
 scuttle                1.0.4      2020-12-17 [1] Bioconductor                           
 sessioninfo            1.1.1      2018-11-05 [1] CRAN (R 4.0.2)                         
 Seurat               * 4.0.1      2021-04-13 [1] Github (satijalab/seurat@4e868fc)      
 SeuratObject         * 4.0.0      2021-01-15 [1] CRAN (R 4.0.2)                         
 shiny                  1.6.0      2021-01-25 [1] CRAN (R 4.0.3)                         
 SingleCellExperiment   1.12.0     2020-10-27 [1] Bioconductor                           
 sparseMatrixStats      1.2.1      2021-02-02 [1] Bioconductor                           
 spatstat.core          2.1-2      2021-04-18 [1] CRAN (R 4.0.2)                         
 spatstat.data          2.1-0      2021-03-21 [1] CRAN (R 4.0.3)                         
 spatstat.geom          2.1-0      2021-04-15 [1] CRAN (R 4.0.2)                         
 spatstat.sparse        2.0-0      2021-03-16 [1] CRAN (R 4.0.2)                         
 spatstat.utils         2.1-0      2021-03-15 [1] CRAN (R 4.0.2)                         
 statmod                1.4.35     2020-10-19 [1] CRAN (R 4.0.2)                         
 stringi                1.5.3      2020-09-09 [1] CRAN (R 4.0.2)                         
 stringr                1.4.0      2019-02-10 [1] CRAN (R 4.0.2)                         
 styler                 1.4.1      2021-03-30 [1] CRAN (R 4.0.2)                         
 SummarizedExperiment   1.20.0     2020-10-27 [1] Bioconductor                           
 survival               3.2-11     2021-04-26 [1] CRAN (R 4.0.2)                         
 tensor                 1.5        2012-05-05 [1] CRAN (R 4.0.2)                         
 tibble                 3.1.1      2021-04-18 [1] CRAN (R 4.0.2)                         
 tidyr                  1.1.3      2021-03-03 [1] CRAN (R 4.0.3)                         
 tidyselect             1.1.0      2020-05-11 [1] CRAN (R 4.0.2)                         
 utf8                   1.2.1      2021-03-12 [1] CRAN (R 4.0.2)                         
 uwot                   0.1.10     2020-12-15 [1] CRAN (R 4.0.2)                         
 vctrs                  0.3.7      2021-03-29 [1] CRAN (R 4.0.2)                         
 vipor                  0.4.5      2017-03-22 [1] CRAN (R 4.0.2)                         
 viridis                0.6.0      2021-04-15 [1] CRAN (R 4.0.5)                         
 viridisLite            0.4.0      2021-04-13 [1] CRAN (R 4.0.5)                         
 withr                  2.4.2      2021-04-18 [1] CRAN (R 4.0.5)                         
 xfun                   0.22       2021-03-11 [1] CRAN (R 4.0.2)                         
 xgboost                1.4.1.1    2021-04-22 [1] CRAN (R 4.0.2)                         
 xtable                 1.8-4      2019-04-21 [1] CRAN (R 4.0.2)                         
 XVector                0.30.0     2020-10-28 [1] Bioconductor                           
 yaml                   2.2.1      2020-02-01 [1] CRAN (R 4.0.2)                         
 zlibbioc               1.36.0     2020-10-28 [1] Bioconductor                           
 zoo                    1.8-9      2021-03-09 [1] CRAN (R 4.0.2)                         

[1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library

Error in installing scDblFinder: object ‘colBlockApply’ is not exported by 'namespace:beachmat'

Hi,

I installed scDblFinder by BiocManager::install("scDblFinder"). When I ran library(scDblFinder), I got this error message:

Error: package or namespace load failed for ‘scDblFinder’:
 object ‘colBlockApply’ is not exported by 'namespace:beachmat'

I tried updating the beachmat package, but the error remains.

Could you provide any suggestions?
Thank you for your help!
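
One way to check whether an installed package actually exports a given symbol is sketched below, using base R only. exports_symbol() is a hypothetical helper written for illustration and is not part of scDblFinder or beachmat.

```r
# Returns TRUE only if the package is installed and exports the symbol.
exports_symbol <- function(pkg, sym) {
  requireNamespace(pkg, quietly = TRUE) &&
    sym %in% getNamespaceExports(pkg)
}

# Sanity check against a base package.
exports_symbol("stats", "median")

# The check relevant to the error above.
exports_symbol("beachmat", "colBlockApply")
```

If the second call returns FALSE even though beachmat is installed, that may point to a version mismatch between the installed beachmat and the one scDblFinder was built against.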

Possible Bug: cxds2 call does not include artificial doublets in whichDbls

Here is where cxds2 is being called:

scDblFinder/R/scDblFinder.R

Line 406 in 30090a0

cxds_score <- cxds2(e, whichDbls=which(ctype==2L | !inclInTrain))

Here is an example ctype that I came across (assigned just above):

# dimensions of various things
ncol_sce <- 26657
len_wDbl <- 0
ncol_ad <- 21325
knownUse <- 'discard'

# using above dimensions to make ctype
ctype_bad <- factor(
  rep(
    c(1L, ifelse(knownUse=="positive", 2L, 1L), 2L),
    c(ncol_sce, len_wDbl, ncol_ad), 
    ),
  labels = c("real", "doublet")
)

# ctype looks reasonable
table(ctype_bad)
#ctype_bad
#   real doublet 
#  26657   21325

# given above, and call to cxds2: artificial doublets not added to whichDbls
which(ctype_bad == 2L)
# integer(0)

Just wondering if this is a bug or intentional. Thanks!
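
A smaller base-R reproduction of the comparison in question, with toy sizes instead of the real ones. The two alternatives at the end are only illustrations of how the intended cells could be selected, not a statement of what the package should do.

```r
# 3 "real" cells followed by 2 artificial doublets.
ctype <- factor(rep(c(1L, 2L), c(3, 2)), labels = c("real", "doublet"))

# A factor is compared through its labels, so no element equals 2L.
by_code <- which(ctype == 2L)

# Comparing the underlying integer codes, or the label, selects the doublets.
by_int   <- which(as.integer(ctype) == 2L)
by_label <- which(ctype == "doublet")

by_code   # integer(0)
by_int    # 4 5
by_label  # 4 5
```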

did not converge

Thanks for providing this great AI tool!
I have a question, or rather a request for suggestions. With a larger number of cells as input, I encountered:

did not converge in 20 iterations

I didn't find any parameters to increase the number of iterations.

Thanks

Miscellaneous clustering observations

It seems that you could replace

scDblFinder/R/clustering.R

Lines 45 to 46 in 6bfd3b0

    
x <- t(vapply(split(names(k),k), FUN.VALUE=numeric(ncol(x)),
              FUN=function(i) colMeans(x[i,,drop=FALSE])))

with something like:

x <- t(assay(sumCountsAcrossCells(x, k, average=TRUE)))

using scuttle's sumCountsAcrossCells function. This implements parallelization if required, and is also safer when x does not have any names, in which case names(k) is probably not going to do the right thing.
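
Independently of scuttle, the names(k) concern can be sidestepped in base R by splitting on positions instead of names. The sketch below uses toy data and is not the sumCountsAcrossCells approach suggested above; it only illustrates per-cluster averaging when neither x nor k carries names.

```r
# Toy data: 4 cells (rows) by 3 features (columns), with no row names.
x <- matrix(as.numeric(1:12), nrow = 4)
k <- c(1, 1, 2, 2)   # cluster assignment per cell, unnamed

# Index by position rather than by names(k), which is NULL here.
by_pos <- t(vapply(split(seq_along(k), k), FUN.VALUE = numeric(ncol(x)),
                   FUN = function(i) colMeans(x[i, , drop = FALSE])))

# Base-R shortcut: per-cluster sums divided by cluster sizes.
by_rowsum <- rowsum(x, k) / as.vector(table(k))

all.equal(unname(by_pos), unname(by_rowsum))
```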

Also, scran::buildKNNGraph can be swapped for the lower-level bluster::makeKNNGraph, which does the same thing but doesn't need the d= and transposed= arguments because your input should directly match up to what it wants. Again, this function can be augmented with parallelization via BPPARAM=, if one so pleases.

BiocParallel error - could not find symbol 'useNames' in environment of the generic function

Hi there,

I realize this may be an issue external to scDblFinder, but I figured I would post it in case you had an insight. Feel free to close if you're not sure.

When running scDblFinder with multiple samples, I'm getting a BiocParallel error that I can't seem to figure out. Any help is appreciated!

sce <- scDblFinder(sce, samples="orig.ident", BPPARAM=MulticoreParam(2))

Error: BiocParallel errors
  element index: 1, 2, 3, 4
  first error: An error occured while processing sample 'Control':
Error in rowVars(DelayedArray(x)): could not find symbol "useNames" in environment of the generic function

> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] scran_1.20.1                scuttle_1.2.1               BiocParallel_1.26.2        
 [4] forcats_0.5.1               stringr_1.4.0               dplyr_1.0.7                
 [7] purrr_0.3.4                 readr_2.0.2                 tidyr_1.1.4                
[10] tibble_3.1.5                ggplot2_3.3.5               tidyverse_1.3.1            
[13] SingleCellExperiment_1.14.1 SummarizedExperiment_1.22.0 Biobase_2.52.0             
[16] GenomicRanges_1.44.0        GenomeInfoDb_1.28.4         IRanges_2.26.0             
[19] S4Vectors_0.30.2            BiocGenerics_0.38.0         MatrixGenerics_1.5.1       
[22] matrixStats_0.61.0          scDblFinder_1.6.0           SeuratObject_4.0.2         
[25] Seurat_4.0.5               

loaded via a namespace (and not attached):
  [1] utf8_1.2.2                reticulate_1.22           tidyselect_1.1.1         
  [4] htmlwidgets_1.5.4         grid_4.1.1                Rtsne_0.15               
  [7] munsell_0.5.0             ScaledMatrix_1.0.0        codetools_0.2-18         
 [10] ica_1.0-2                 statmod_1.4.36            xgboost_1.4.1.1          
 [13] future_1.22.1             miniUI_0.1.1.1            withr_2.4.2              
 [16] colorspace_2.0-2          knitr_1.36                rstudioapi_0.13          
 [19] ROCR_1.0-11               tensor_1.5                listenv_0.8.0            
 [22] GenomeInfoDbData_1.2.6    polyclip_1.10-0           parallelly_1.28.1        
 [25] vctrs_0.3.8               generics_0.1.0            xfun_0.27                
 [28] R6_2.5.1                  ggbeeswarm_0.6.0          rsvd_1.0.5               
 [31] locfit_1.5-9.4            bitops_1.0-7              spatstat.utils_2.2-0     
 [34] DelayedArray_0.18.0       assertthat_0.2.1          promises_1.2.0.1         
 [37] scales_1.1.1              beeswarm_0.4.0            gtable_0.3.0             
 [40] beachmat_2.8.1            globals_0.14.0            goftest_1.2-3            
 [43] rlang_0.4.12              splines_4.1.1             lazyeval_0.2.2           
 [46] spatstat.geom_2.3-0       broom_0.7.9               yaml_2.2.1               
 [49] reshape2_1.4.4            abind_1.4-5               modelr_0.1.8             
 [52] backports_1.2.1           httpuv_1.6.3              tools_4.1.1              
 [55] ellipsis_0.3.2            spatstat.core_2.3-0       RColorBrewer_1.1-2       
 [58] ggridges_0.5.3            Rcpp_1.0.7                plyr_1.8.6               
 [61] sparseMatrixStats_1.4.2   zlibbioc_1.38.0           RCurl_1.98-1.5           
 [64] rpart_4.1-15              deldir_1.0-5              pbapply_1.5-0            
 [67] viridis_0.6.2             cowplot_1.1.1             zoo_1.8-9                
 [70] haven_2.4.3               ggrepel_0.9.1             cluster_2.1.2            
 [73] fs_1.5.0                  magrittr_2.0.1            data.table_1.14.2        
 [76] scattermore_0.7           lmtest_0.9-38             reprex_2.0.1             
 [79] RANN_2.6.1                fitdistrplus_1.1-6        hms_1.1.1                
 [82] patchwork_1.1.1           mime_0.12                 evaluate_0.14            
 [85] xtable_1.8-4              readxl_1.3.1              gridExtra_2.3            
 [88] compiler_4.1.1            scater_1.20.1             KernSmooth_2.23-20       
 [91] crayon_1.4.1              htmltools_0.5.2           mgcv_1.8-36              
 [94] later_1.3.0               tzdb_0.1.2                lubridate_1.8.0          
 [97] DBI_1.1.1                 dbplyr_2.1.1              MASS_7.3-54              
[100] Matrix_1.3-4              cli_3.0.1                 metapod_1.0.0            
[103] igraph_1.2.7              pkgconfig_2.0.3           plotly_4.10.0            
[106] spatstat.sparse_2.0-0     xml2_1.3.2                vipor_0.4.5              
[109] dqrng_0.3.0               XVector_0.32.0            rvest_1.0.2              
[112] digest_0.6.28             sctransform_0.3.2         RcppAnnoy_0.0.19         
[115] spatstat.data_2.1-0       rmarkdown_2.11            cellranger_1.1.0         
[118] leiden_0.3.9              uwot_0.1.10               edgeR_3.34.1             
[121] DelayedMatrixStats_1.14.3 shiny_1.7.1               lifecycle_1.0.1          
[124] nlme_3.1-152              jsonlite_1.7.2            BiocNeighbors_1.10.0     
[127] viridisLite_0.4.0         limma_3.48.3              fansi_0.5.0              
[130] pillar_1.6.4              lattice_0.20-44           fastmap_1.1.0            
[133] httr_1.4.2                survival_3.2-11           glue_1.4.2               
[136] png_0.1-7                 bluster_1.2.1             stringi_1.7.5            
[139] BiocSingular_1.8.1        irlba_2.3.3               future.apply_1.8.1

Error in if (length(expected) > 1 && x > min(expected) && x < max(expected)) return(0): missing value where TRUE/FALSE needed

Hi! Thank you for this great tool. I am encountering the error in the title when running scDblFinder on a large dataset (CellRanger estimated ~20,000 cells):

Assuming the input to be a matrix of counts or expected counts.

Aggregating features...

Warning message:
"Quick-TRANSfer stage steps exceeded maximum (= 1905250)"
Creating ~11084 artificial doublets...

Dimensional reduction

Evaluating kNN...

Training model...

Error in if (length(expected) > 1 && x > min(expected) && x < max(expected)) return(0): missing value where TRUE/FALSE needed

I have not encountered this error in several other (much smaller) samples I have tried, so is this related to the dataset being too large?

Traceback:
1. scDblFinder(peak_assay, aggregateFeatures = TRUE, nfeatures = 25, 
 .     processing = "normFeatures")
2. .scDblscore(d, scoreType = score, addVals = pca[, includePCs, 
 .     drop = FALSE], threshold = threshold, dbr = dbr, dbr.sd = dbr.sd, 
 .     nrounds = nrounds, max_depth = max_depth, iter = iter, BPPARAM = BPPARAM, 
 .     features = trainingFeatures, verbose = verbose, metric = metric, 
 .     filterUnidentifiable = removeUnidentifiable, unident.th = unident.th)
3. which((d$type == "real" & doubletThresholding(d, dbr = dbr, dbr.sd = dbr.sd, 
 .     stringency = 0.7, perSample = perSample, returnType = "call") == 
 .     "doublet") | (d$type == "doublet" & d$score < unident.th & 
 .     filterUnidentifiable) | !d$include.in.training)
4. doubletThresholding(d, dbr = dbr, dbr.sd = dbr.sd, stringency = 0.7, 
 .     perSample = perSample, returnType = "call")
5. .optimThreshold(d, dbr = .gdbr(d, dbr), dbr.sd = dbr.sd, stringency = stringency)
6. optimize(totfn, c(0, 1), maximum = FALSE)
7. (function (arg) 
 . f(arg, ...))(0.381966011250105)
8. f(arg, ...)
9. .prop.dev(d$type, d$score, expected, x)
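
The error message itself is what R produces when an if() condition evaluates to NA. The sketch below reproduces that with toy values; the assumption that x became NA or NaN upstream is a guess from the traceback, and in_range() is a hypothetical defensive variant, not the package's code.

```r
expected <- c(0.05, 0.15)
x <- NA_real_   # assumed: a proportion that ended up NA/NaN upstream

# The same condition as in the error; capture the message instead of stopping.
res <- tryCatch(
  if (length(expected) > 1 && x > min(expected) && x < max(expected)) 0 else 1,
  error = function(e) conditionMessage(e)
)
res

# Hypothetical defensive variant: treat NA as "condition not met".
in_range <- function(x, expected) {
  isTRUE(length(expected) > 1 && x > min(expected) && x < max(expected))
}

in_range(x, expected)
in_range(0.1, expected)
```

Checking the scores for NA or NaN before thresholding (for example with anyNA()) would be a natural first diagnostic step under that assumption.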

Session info

`R version 4.2.2 (2022-10-31)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS/LAPACK: /opt/conda/envs/NET_R_env/lib/libopenblasp-r0.3.21.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg38_1.4.5 BSgenome_1.66.3                  
 [3] Biostrings_2.66.0                 XVector_0.38.0                   
 [5] CopyscAT_0.40                     MASS_7.3-60                      
 [7] jsonlite_1.8.4                    sp_1.6-1                         
 [9] rtracklayer_1.58.0                gplots_3.1.3                     
[11] tibble_3.2.1                      tidyr_1.3.0                      
[13] edgeR_3.40.2                      limma_3.54.2                     
[15] stringr_1.5.0                     mclust_6.0.0                     
[17] changepoint_2.2.4                 zoo_1.8-12                       
[19] data.table_1.14.8                 igraph_1.4.3                     
[21] FNN_1.1.3.2                       Rtsne_0.16                       
[23] biomaRt_2.54.0                    fastcluster_1.2.3                
[25] NMF_0.26                          cluster_2.1.4                    
[27] rngtools_1.5.2                    registry_0.5-1                   
[29] viridis_0.6.3                     viridisLite_0.4.2                
[31] dplyr_1.1.2                       RColorBrewer_1.1-3               
[33] scDblFinder_1.13.13               SingleCellExperiment_1.20.1      
[35] SummarizedExperiment_1.28.0       MatrixGenerics_1.10.0            
[37] matrixStats_0.63.0                glue_1.6.2                       
[39] ggplot2_3.4.1                     EnsDb.Hsapiens.v86_2.99.0        
[41] ensembldb_2.22.0                  AnnotationFilter_1.22.0          
[43] GenomicFeatures_1.50.2            AnnotationDbi_1.60.0             
[45] Biobase_2.58.0                    GenomicRanges_1.50.2             
[47] GenomeInfoDb_1.34.9               IRanges_2.32.0                   
[49] S4Vectors_0.36.2                  BiocGenerics_0.44.0              
[51] Signac_1.10.0                     SeuratObject_4.1.3               
[53] Seurat_4.3.0                     

loaded via a namespace (and not attached):
  [1] rappdirs_0.3.3            pbdZMQ_0.3-9             
  [3] scattermore_1.0           bit64_4.0.5              
  [5] irlba_2.3.5.1             DelayedArray_0.24.0      
  [7] KEGGREST_1.38.0           RCurl_1.98-1.12          
  [9] doParallel_1.0.17         generics_0.1.3           
 [11] ScaledMatrix_1.6.0        cowplot_1.1.1            
 [13] RSQLite_2.2.20            RANN_2.6.1               
 [15] future_1.32.0             bit_4.0.5                
 [17] spatstat.data_3.0-1       xml2_1.3.4               
 [19] httpuv_1.6.11             hms_1.1.3                
 [21] evaluate_0.21             promises_1.2.0.1         
 [23] fansi_1.0.4               restfulr_0.0.15          
 [25] progress_1.2.2            caTools_1.18.2           
 [27] dbplyr_2.3.1              DBI_1.1.3                
 [29] htmlwidgets_1.6.2         spatstat.geom_3.2-1      
 [31] purrr_1.0.1               ellipsis_0.3.2           
 [33] gridBase_0.4-7            deldir_1.0-9             
 [35] sparseMatrixStats_1.10.0  vctrs_0.6.2              
 [37] ROCR_1.0-11               abind_1.4-5              
 [39] cachem_1.0.8              withr_2.5.0              
 [41] progressr_0.13.0          sctransform_0.3.5        
 [43] GenomicAlignments_1.34.1  prettyunits_1.1.1        
 [45] scran_1.26.2              goftest_1.2-3            
 [47] IRdisplay_1.1             lazyeval_0.2.2           
 [49] crayon_1.5.2              spatstat.explore_3.2-1   
 [51] pkgconfig_2.0.3           nlme_3.1-162             
 [53] vipor_0.4.5               ProtGenerics_1.30.0      
 [55] rlang_1.1.0               globals_0.16.2           
 [57] lifecycle_1.0.3           miniUI_0.1.1.1           
 [59] filelock_1.0.2            BiocFileCache_2.6.0      
 [61] rsvd_1.0.5                polyclip_1.10-4          
 [63] lmtest_0.9-40             Matrix_1.5-4             
 [65] IRkernel_1.3.2            base64enc_0.1-3          
 [67] beeswarm_0.4.0            ggridges_0.5.4           
 [69] png_0.1-8                 rjson_0.2.21             
 [71] bitops_1.0-7              KernSmooth_2.23-21       
 [73] blob_1.2.3                DelayedMatrixStats_1.20.0
 [75] parallelly_1.35.0         spatstat.random_3.1-5    
 [77] beachmat_2.14.2           scales_1.2.1             
 [79] memoise_2.0.1             magrittr_2.0.3           
 [81] plyr_1.8.8                ica_1.0-3                
 [83] zlibbioc_1.44.0           compiler_4.2.2           
 [85] dqrng_0.3.0               BiocIO_1.8.0             
 [87] fitdistrplus_1.1-11       Rsamtools_2.14.0         
 [89] cli_3.6.1                 listenv_0.9.0            
 [91] patchwork_1.1.2           pbapply_1.7-0            
 [93] tidyselect_1.2.0          stringi_1.7.12           
 [95] yaml_2.3.7                BiocSingular_1.14.0      
 [97] locfit_1.5-9.7            ggrepel_0.9.3            
 [99] grid_4.2.2                fastmatch_1.1-3          
[101] tools_4.2.2               future.apply_1.10.0      
[103] parallel_4.2.2            uuid_1.1-0               
[105] bluster_1.8.0             foreach_1.5.2            
[107] metapod_1.6.0             gridExtra_2.3            
[109] digest_0.6.31             BiocManager_1.30.20      
[111] shiny_1.7.4               Rcpp_1.0.10              
[113] scuttle_1.8.4             later_1.3.1              
[115] RcppAnnoy_0.0.20          httr_1.4.5               
[117] colorspace_2.1-0          XML_3.99-0.14            
[119] tensor_1.5                reticulate_1.28          
[121] splines_4.2.2             uwot_0.1.14              
[123] RcppRoll_0.3.0            statmod_1.5.0            
[125] spatstat.utils_3.0-3      scater_1.26.1            
[127] xgboost_1.7.5.1           plotly_4.10.1            
[129] xtable_1.8-4              R6_2.5.1                 
[131] pillar_1.9.0              htmltools_0.5.5          
[133] mime_0.12                 fastmap_1.1.1            
[135] BiocParallel_1.32.6       BiocNeighbors_1.16.0     
[137] codetools_0.2-19          utf8_1.2.3               
[139] lattice_0.21-8            spatstat.sparse_3.0-1    
[141] curl_4.3.3                ggbeeswarm_0.7.2         
[143] leiden_0.4.3              gtools_3.9.4             
[145] survival_3.5-5            repr_1.1.6               
[147] munsell_0.5.0             GenomeInfoDbData_1.2.9   
[149] iterators_1.0.14          reshape2_1.4.4           
[151] gtable_0.3.3

"griffiths" threshold measure should not be per-cluster

Hi there,

Aaron pointed me towards this package and I think there's something I can help with. In the doubletThresholding function, the "griffiths" method is calculating deviations from the per-cluster medians. However, this is a problem when a cluster consists only of doublets, as cells may not deviate from the cluster average, which will be high. These clusters are quite common if you have very different cell types in your experiment.

Rather, in the paper I used this approach for, I calculated deviations per-sample. The idea being that different samples have different characteristics (timepoint, level of digestion etc.) and therefore should be handled separately.

I thought about submitting a PR to tweak this, but I wasn't sure whether you wanted the function to be used across samples (in which case $cluster can more or less be replaced with $sample), or on one sample at a time (in which case the code becomes much simpler).

Cheers,
Jonny.
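
A toy illustration of the point above, in base R. The data, the column names and the median/MAD deviation used here are assumptions made for the example; this is not the code of doubletThresholding, only a sketch of why a doublet-only cluster can escape a per-cluster reference.

```r
# Toy data: two samples; cluster "C" consists only of doublets (high scores).
set.seed(1)
d <- data.frame(
  sample  = rep(c("s1", "s2"), each = 150),
  cluster = rep(rep(c("A", "B", "C"), each = 50), times = 2),
  score   = c(rnorm(100, 1, 0.2), rnorm(50, 5, 0.2),
              rnorm(100, 1, 0.2), rnorm(50, 5, 0.2))
)

# Hypothetical helper: deviation from the group median, in units of group MAD.
madDev <- function(x, g) {
  (x - ave(x, g, FUN = median)) / ave(x, g, FUN = mad)
}

d$dev_cluster <- madDev(d$score, d$cluster)  # per-cluster reference
d$dev_sample  <- madDev(d$score, d$sample)   # per-sample reference

# Fraction of the doublet-only cluster flagged at 3 MADs under each scheme
isC <- d$cluster == "C"
flagged_cluster <- mean(d$dev_cluster[isC] > 3)
flagged_sample  <- mean(d$dev_sample[isC] > 3)
```

With a per-cluster reference, the cells of cluster "C" sit close to their own (high) median and are almost never flagged; with a per-sample reference they all stand out.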

Unreasonably high doublets rate

Dear developers,

Thank you very much for developing this useful tool. I tried it on my dataset using the samples = sampleID argument. However, I still get a doublet rate of >10%, which seems unreasonable. Could you help, please?

Here is my code:

bp <- SnowParam(8, RNGseed=1234) #to make the results reproducible. Unix use MulticoreParam()
bpstart(bp)
split_D <- scDblFinder(split_D, samples = 'sampleID', BPPARAM = bp) # split_D is my SCE object
bpstop(bp)
split_D@colData$scDblFinder.class %>% table

singlet doublet 
  31037    3260

Here are the numbers of cells for each sampleID:

split_D@colData$sampleID
4210      5831      6486      2981      5037      5525      1424      2803.

I double checked in the resulting SCE object and the scDblFinder.sample equals the sampleID.

According to 10X, each sample at this cell number should contain <5% doublets: https://kb.10xgenomics.com/hc/en-us/articles/360001378811-What-is-the-maximum-number-of-cells-that-can-be-profiled-
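
For reference, the comparison can be put into numbers with a few lines of base R. The cell counts and the number of called doublets are the ones reported above; the rate of roughly 0.8% doublets per 1000 recovered cells is an assumed rule of thumb used only for this sketch, not a value taken from this thread.

```r
# Cells per sample and doublet calls, as reported above
n_cells    <- c(4210, 5831, 6486, 2981, 5037, 5525, 1424, 2803)
n_doublets <- 3260

# Observed overall doublet rate
observed <- n_doublets / sum(n_cells)

# ASSUMPTION: roughly 0.8% doublets per 1000 recovered cells (rule of thumb)
rate_per_1000 <- 0.008
expected <- rate_per_1000 * n_cells / 1000

# Expected number of doublets over all samples under that assumption
expected_total <- sum(expected * n_cells)

round(100 * observed, 1)   # overall observed rate, in percent
round(100 * expected, 1)   # per-sample expected rates, in percent
```

Under that assumption the per-sample expectation ranges from about 1% to about 5%, while the counts above correspond to an overall rate of about 9.5%.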

sessionInfo()
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)

Matrix products: default

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] BiocParallel_1.32.5         scDblFinder_1.13.7          SingleCellExperiment_1.20.0 SummarizedExperiment_1.28.0
 [5] Biobase_2.58.0              GenomicRanges_1.50.2        GenomeInfoDb_1.34.6         IRanges_2.32.0             
 [9] S4Vectors_0.36.1            BiocGenerics_0.44.0         MatrixGenerics_1.10.0       matrixStats_0.63.0         
[13] future_1.31.0               dittoSeq_1.10.0             forcats_0.5.2               stringr_1.5.0              
[17] dplyr_1.0.10                purrr_1.0.1                 readr_2.1.3                 tidyr_1.2.1                
[21] tibble_3.1.8                ggplot2_3.4.0               tidyverse_1.3.2             plyr_1.8.8                 
[25] data.table_1.14.6           SeuratObject_4.1.3          Seurat_4.3.0

Using both scATAC and scGEX combined from 10x Multiome for doublet calling

Hi,

I have 10X Multiome data where we have scATAC and scGEX data from the same nucleus. As I understand it, scDblFinder can use either ATAC or gene expression data to call doublets. Would it be possible to use both assays together for doublet calling?

Thanks

Size factors should be positive

Hi scDblFinder team,

Thanks for such a great package. I have recently been using it to find doublets in my Seurat object. I converted my Seurat object to a SingleCellExperiment, but when I run scDblFinder I get an error: Size factors should be positive. My Seurat object was not log-transformed, but even after log-transforming the data I still get the same error. I have other datasets on which it runs smoothly. The only difference is that the failing datasets contain a mouse cell spike-in, although I removed these cells before running scDblFinder. Is there any solution for this issue?

Thanks,
Yale
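
One situation that produces non-positive size factors is a cell (column) whose total count is zero, for instance after features or cells have been removed from the object. Whether that is what happens in this dataset is an assumption; the base R sketch below uses a toy matrix and simple library-size factors only to show how such cells can be found and dropped.

```r
# Toy count matrix (genes x cells); the third cell has no counts at all.
counts <- matrix(c(5, 0, 2,
                   1, 3, 0,
                   0, 0, 0,
                   4, 1, 1),
                 nrow = 3,
                 dimnames = list(paste0("gene", 1:3), paste0("cell", 1:4)))

# Library-size factors: total counts per cell, scaled to a mean of 1
lib <- colSums(counts)
sf  <- lib / mean(lib)

# Cells whose size factor is not strictly positive
bad <- names(sf)[!(sf > 0)]

# Dropping them leaves only cells with positive size factors
filtered <- counts[, sf > 0, drop = FALSE]
```

On a real object, the same check would be applied to the column sums of the count assay before calling scDblFinder.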

I'm having an issue with sctransform, not sure

I'm somewhat new to using this so I am not sure how to fix it.

Org_nodoub <- processing_seurat_sctransform(Org_nodoub,
vars_to_regress = c("nCount_RNA","percent.mito","percent.ribo"),
npcs = 30,
res = 0.5)
Error in qr.resid(qr = qr, y = data.expr[x, ]) :
'qr' and 'y' must have the same number of rows

How to find all features used in training

Hi scDblFinder team!

It's mentioned in the paper that scDblFinder uses multiple features obtained from the kNN network, such as projections on principal components, library size, the number of detected features, and co-expression scores. But I can only find scDblFinder.weighted and scDblFinder.cxds_score in the output R object. Could you tell me how to obtain, from the R object, all the features used to train the GBDT (gradient boosted decision tree) model?

Thanks

probs outside [0,1] when training model

Hello, when combining samples in a single SCE, I find that I get an error during model training. This does not occur when running scDblFinder on single samples. The error is independent of cluster method (it shows up with "overcluster" as well):

masterSCE1 = scDblFinder(sce = masterSCE, samples = "sample_ID", nfeatures = 1000, clust.method = "fastcluster", score='xgb',verbose = TRUE, use.cxds = TRUE)
Training model...
Error in quantile.default(d$score[w], 1 - dbr) : 'probs' outside [0,1]
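
The message itself comes from base R: quantile() rejects any probability outside [0,1], so 1 - dbr fails as soon as dbr is negative or greater than 1. The sketch below reproduces this in isolation. The helper cutoff() and the way dbr_merged is derived (a rate growing with cell number, evaluated on a merged object) are hypothetical and only illustrate one way such a value could arise.

```r
scores <- c(0.1, 0.2, 0.4, 0.8, 0.9)

# Hypothetical helper: score cutoff such that a fraction `dbr` of cells
# lies above it.
cutoff <- function(scores, dbr) quantile(scores, 1 - dbr)

ok <- cutoff(scores, dbr = 0.2)   # valid: probs = 0.8

# ASSUMPTION: a rate that grows with the number of cells (here 1% per 1000)
# evaluated on a merged object of 150,000 cells gives dbr = 1.5.
dbr_merged <- 0.01 * 150000 / 1000
msg <- tryCatch(cutoff(scores, dbr = dbr_merged),
                error = function(e) conditionMessage(e))
```

Here msg holds the same "'probs' outside [0,1]" message as in the report above.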

########################

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C LC_TIME=English_United States.1252

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] DropletUtils_1.9.13 SingleCellExperiment_1.11.8 SummarizedExperiment_1.19.9 Biobase_2.49.1 GenomicRanges_1.41.6 GenomeInfoDb_1.25.11 IRanges_2.23.10 S4Vectors_0.27.13
[9] BiocGenerics_0.35.4 MatrixGenerics_1.1.3 matrixStats_0.57.0 scDblFinder_1.3.13 future_1.19.1 Seurat_3.2.2 forcats_0.5.0 stringr_1.4.0
[17] dplyr_1.0.2 purrr_0.3.4 readr_1.4.0 tidyr_1.1.2 tibble_3.0.3 ggplot2_3.3.2 tidyverse_1.3.0

loaded via a namespace (and not attached):
[1] R.utils_2.10.1 reticulate_1.16 tidyselect_1.1.0 htmlwidgets_1.5.2 grid_4.0.2 BiocParallel_1.23.2 Rtsne_0.15 pROC_1.16.2
[9] munsell_0.5.0 codetools_0.2-16 ica_1.0-2 statmod_1.4.34 scran_1.17.21 xgboost_1.2.0.1 miniUI_0.1.1.1 withr_2.3.0
[17] colorspace_1.4-1 rstudioapi_0.11 intrinsicDimension_1.2.0 ROCR_1.0-11 tensor_1.5 listenv_0.8.0 labeling_0.3 GenomeInfoDbData_1.2.4
[25] polyclip_1.10-0 farver_2.0.3 rhdf5_2.33.11 rprojroot_1.3-2 vctrs_0.3.4 generics_0.0.2 R6_2.4.1 ggbeeswarm_0.6.0
[33] rsvd_1.0.3 locfit_1.5-9.4 rhdf5filters_1.1.3 bitops_1.0-6 spatstat.utils_1.17-0 DelayedArray_0.15.16 assertthat_0.2.1 promises_1.1.1
[41] scales_1.1.1 beeswarm_0.2.3 gtable_0.3.0 beachmat_2.5.8 globals_0.13.0 processx_3.4.4 goftest_1.2-2 rlang_0.4.8
[49] splines_4.0.2 lazyeval_0.2.2 broom_0.7.1 BiocManager_1.30.10 yaml_2.2.1 reshape2_1.4.4 abind_1.4-5 modelr_0.1.8
[57] backports_1.1.10 httpuv_1.5.4 tools_4.0.2 ellipsis_0.3.1 RColorBrewer_1.1-2 ggridges_0.5.2 Rcpp_1.0.5 plyr_1.8.6
[65] zlibbioc_1.35.0 RCurl_1.98-1.2 ps_1.4.0 prettyunits_1.1.1 rpart_4.1-15 deldir_0.1-29 pbapply_1.4-3 viridis_0.5.1
[73] cowplot_1.1.0 zoo_1.8-8 haven_2.3.1 ggrepel_0.8.2 cluster_2.1.0 fs_1.5.0 magrittr_1.5 data.table_1.13.0
[81] RSpectra_0.16-0 lmtest_0.9-38 reprex_0.3.0 RANN_2.6.1 fitdistrplus_1.1-1 hms_0.5.3 patchwork_1.0.1 mime_0.9
[89] xtable_1.8-4 scds_1.5.0 readxl_1.3.1 gridExtra_2.3 compiler_4.0.2 scater_1.17.5 KernSmooth_2.23-17 crayon_1.3.4
[97] R.oo_1.24.0 htmltools_0.5.0 mgcv_1.8-31 later_1.1.0.1 lubridate_1.7.9 DBI_1.1.0 dbplyr_1.4.4 MASS_7.3-51.6
[105] rappdirs_0.3.1 Matrix_1.2-18 cli_2.0.2 R.methodsS3_1.8.1 igraph_1.2.5 pkgconfig_2.0.3 plotly_4.9.2.1 scuttle_0.99.18
[113] xml2_1.3.2 yaImpute_1.0-32 vipor_0.4.5 dqrng_0.2.1 XVector_0.29.3 rvest_0.3.6 callr_3.4.4 digest_0.6.25
[121] sctransform_0.3.1 RcppAnnoy_0.0.16 spatstat.data_1.4-3 cellranger_1.1.0 leiden_0.3.3 edgeR_3.31.4 uwot_0.1.8 DelayedMatrixStats_1.11.1
[129] curl_4.3 shiny_1.5.0 lifecycle_0.2.0 nlme_3.1-148 jsonlite_1.7.1 Rhdf5lib_1.11.3 BiocNeighbors_1.7.0 limma_3.45.14
[137] viridisLite_0.3.0 fansi_0.4.1 pillar_1.4.6 lattice_0.20-41 fastmap_1.0.1 httr_1.4.2 pkgbuild_1.1.0 survival_3.1-12
[145] glue_1.4.2 remotes_2.2.0 spatstat_1.64-1 png_0.1-7 bluster_0.99.1 HDF5Array_1.17.14 stringi_1.5.3 blob_1.2.1
[153] BiocSingular_1.5.2 irlba_2.3.3 future.apply_1.6.0

doublets mostly differ between scDblFinder and doubletFinder

Hi,

I tested your algorithm and doubletFinder on a single 10X PBMC sample of about 7600 cells (after some basic filtering).

I do not see a lot of overlap between the two. See below the output from table(doubletFinder,scDblFinder)
             scDblFinder
doubletFinder doublet singlet
      Doublet     124     317
      Singlet     340    6983

scDblFinder call:
pbmc.sce <- scDblFinder(pbmc.sce, clusters="res.1.2",dbr=0.06, dims=50)

doubletFinder_v3 call:
pbmc <- doubletFinder_v3(pbmc, PCs = 1:50, pN = 0.25, pK = 0.02, nExp = nExp_poi.adj, reuse.pANN = F, sct = T)
pbmc <- doubletFinder_v3(pbmc, PCs = 1:50, pN = 0.25, pK = 0.02, nExp = nExp_poi.adj, reuse.pANN = "pANN_0.25_0.02_466", sct = T)

Is this expected?
Did I make a mistake in the calls?

Thanks
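
To put a number on the overlap, the contingency table above can be summarised with a Jaccard index and Cohen's kappa. This is a small base R sketch on the reported counts; the choice of these two agreement measures is an assumption made for illustration, not something either package computes.

```r
# Contingency table reported above (rows: doubletFinder, columns: scDblFinder)
tab <- matrix(c(124, 340, 317, 6983), nrow = 2,
              dimnames = list(doubletFinder = c("Doublet", "Singlet"),
                              scDblFinder   = c("doublet", "singlet")))

both    <- tab["Doublet", "doublet"]   # called doublet by both methods
only_df <- tab["Doublet", "singlet"]   # doubletFinder only
only_sd <- tab["Singlet", "doublet"]   # scDblFinder only

# Jaccard index of the two doublet sets
jaccard <- both / (both + only_df + only_sd)

# Cohen's kappa: agreement beyond what is expected by chance
n           <- sum(tab)
p_obs       <- sum(diag(tab)) / n
p_exp       <- sum(rowSums(tab) * colSums(tab)) / n^2
cohen_kappa <- (p_obs - p_exp) / (1 - p_exp)
```

For these counts the Jaccard index is about 0.16 and kappa about 0.23, which matches the impression of limited overlap between the two sets of calls.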

Error with empty factor levels

Hello,

I noted that the following error comes up when there are empty factor levels in the clusters argument:

Error in value[[3L]](cond) : 
  An error occured while processing sample 'batch1':
Error in sample.int(length(x), size, replace, prob): invalid 'replace' argument

Reproducible example to illustrate it:

library(scDblFinder)
library(SingleCellExperiment)
sce <- mockDoubletSCE()
sce <- cbind(sce, sce)
sce <- sce[,!grepl("\\+", as.character(sce$cluster))]
sce$cluster <- as.character(sce$cluster)

#/ Create a cluster level not present in one of the batches by simply forming a cluster of one cell:
sce$cluster[ncol(sce)] <- "cluster3" # batch 1 (see below) will be empty for that factor level "cluster3"
sce$cluster <- factor(sce$cluster)

#/ simulate batch:
sce$batch <- c(rep("batch1", floor(ncol(sce)/2)), rep("batch2", ceiling(ncol(sce)/2)))

#/ will exit with error
scDblFinder(sce = sce, clusters = sce$cluster, samples = sce$batch)

#/ fine as empty factor level is removed
scDblFinder(sce = sce, clusters = as.character(sce$cluster), samples = sce$batch)

So I guess if you add a checkpoint that removes empty factor levels things should be fine.
Empty factor levels could happen if you work with multiple batches and/or an integrated dataset with clusters being specific for one batch/condition etc.
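
The behaviour behind this can be shown in base R alone: subsetting a factor keeps all of its levels, so a cluster that is absent from one batch remains as an empty level until droplevels() is applied. The vectors below are toy data, and the per-batch cleanup is only a sketch of the kind of checkpoint suggested above.

```r
# Clusters for a two-batch dataset; "cluster3" occurs only in batch2
cluster <- factor(c("cluster1", "cluster2", "cluster1", "cluster2", "cluster3"))
batch   <- c("batch1", "batch1", "batch1", "batch2", "batch2")

# Subsetting a factor keeps all levels, including those now unused
cl_b1 <- cluster[batch == "batch1"]
table(cl_b1)                 # contains a zero count for cluster3

# Hypothetical checkpoint: drop unused levels within each batch
cl_b1_clean <- droplevels(cl_b1)
per_batch   <- lapply(split(cluster, batch), droplevels)
```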

Issue with knownDoublets

The knownDoublets option in scDblFinder throws an error when given both knownDoublets and samples.

Reproducible example:

library(scDblFinder)
sce <- mockDoubletSCE()
sce$type <- sce$type %in% "doublet"
sce$channel <- c(rep("sample1", floor(ncol(sce)/2)), rep("sample2", ceiling(ncol(sce)/2)))[1:ncol(sce)]
scldbl <- scDblFinder(sce = sce, 
                      samples = "channel",
                      knownDoublets = "type")

yields
Error in value[[3L]](cond) : An error occured while processing sample 'cluster1': Error in .checkColArg(sce, knownDoublets): knownDoublets should have a length equal to the number of columns in sce.

but

scldbl <- scDblFinder(sce = sce, 
                      #samples = "channel",
                      knownDoublets = "type")

succeeds
Clustering cells...
4 clusters
Creating ~5000 artifical doublets...
Dimensional reduction
Finding KNN...
Evaluating cell neighborhoods...
Training model...
Finding threshold...
Threshold found:0.425
19 (3.7%) doublets called

Pre-filter the count matrix before scDblFinder?

Dear Developers,
I'm including this awesome tool in my scRNA-seq analysis workflow but hope you could help clarify the correct procedures.

I noticed that, on the GitHub README page, the function takes the count matrix without empty cells as input. My question is: do I need to apply the usual filters (such as lower/upper thresholds on the number of genes per cell or the total UMI counts per cell) before I feed the data to scDblFinder? I have seen people do different things, so I thought I would double-check with you.

Thanks,
Jp

Doublet numbers still not reproduced even though I used BPPARAM and bpstart

Dear developers,
Thank you for the nice package.

I know doublet reproducibility has already been discussed a lot in the issues, and I have read them.
But when I adapt that code to my data, the results are still not reproducible; I get different results every time.
I checked my data using the code that was posted in issue #53.
This is the code I used and the results.

> sce <- as.SingleCellExperiment(my_seurat_object)
> bp <- MulticoreParam(2, RNGseed=123)
> bpstart(bp)
> m1 <- scDblFinder(sce, clusters=sce$cluster, BPPARAM=bp)$scDblFinder.score
Creating ~5000 artificial doublets...
Dimensional reduction
Evaluating kNN...
Training model...
iter=0, 83 cells excluded from training.
iter=1, 83 cells excluded from training.
iter=2, 80 cells excluded from training.
Threshold found:0.738
50 (4.7%) doublets called
> bpstop(bp)

> bpstart(bp)
> m2 <- scDblFinder(sce, clusters=sce$cluster, BPPARAM=bp)$scDblFinder.score
Creating ~5000 artificial doublets...
Dimensional reduction
Evaluating kNN...
Training model...
iter=0, 76 cells excluded from training.
iter=1, 89 cells excluded from training.
iter=2, 79 cells excluded from training.
Threshold found:0.784
44 (4.1%) doublets called
> bpstop(bp)
> identical(m1,m2)
[1] FALSE

Do you have any idea what is going on? My BiocParallel package version is already 1.28.3.
I have tried many times, but the results never match. Please help!
This is the sessionInfo of my R session.

R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rsvd_1.0.5                  batchelor_1.10.0            remotes_2.4.2               Nebulosa_1.4.0              patchwork_1.1.1            
 [6] SeuratWrappers_0.3.0        harmony_0.1.0               Rcpp_1.0.8.3                cowplot_1.1.1               dplyr_1.0.9                
[11] Seurat_4.1.0                SeuratObject_4.0.4          scDblFinder_1.11.4          SingleCellExperiment_1.16.0 SummarizedExperiment_1.24.0
[16] GenomicRanges_1.46.1        GenomeInfoDb_1.30.1         IRanges_2.28.0              S4Vectors_0.32.4            MatrixGenerics_1.6.0       
[21] matrixStats_0.62.0          scaterlegacy_1.5.0          ggplot2_3.3.6               Biobase_2.54.0              BiocGenerics_0.40.0        
[26] BiocParallel_1.28.3        

loaded via a namespace (and not attached):
  [1] utf8_1.2.2                shinydashboard_0.7.2      ks_1.13.5                 R.utils_2.11.0            reticulate_1.24          
  [6] tidyselect_1.1.2          RSQLite_2.2.12            AnnotationDbi_1.56.2      htmlwidgets_1.5.4         grid_4.1.2               
 [11] Rtsne_0.16                munsell_0.5.0             ScaledMatrix_1.2.0        codetools_0.2-18          ica_1.0-2                
 [16] xgboost_1.6.0.1           statmod_1.4.36            scran_1.22.1              future_1.24.0             miniUI_0.1.1.1           
 [21] withr_2.5.0               spatstat.random_2.2-0     colorspace_2.0-3          filelock_1.0.2            rstudioapi_0.13          
 [26] ROCR_1.0-11               tensor_1.5                listenv_0.8.0             labeling_0.4.2            tximport_1.22.0          
 [31] GenomeInfoDbData_1.2.7    polyclip_1.10-0           farver_2.1.0              bit64_4.0.5               rhdf5_2.38.1             
 [36] parallelly_1.31.0         vctrs_0.4.1               generics_0.1.2            BiocFileCache_2.2.1       R6_2.5.1                 
 [41] ggbeeswarm_0.6.0          locfit_1.5-9.5            bitops_1.0-7              rhdf5filters_1.6.0        spatstat.utils_2.3-0     
 [46] cachem_1.0.6              DelayedArray_0.20.0       assertthat_0.2.1          BiocIO_1.4.0              promises_1.2.0.1         
 [51] scales_1.2.0              beeswarm_0.4.0            gtable_0.3.0              beachmat_2.10.0           globals_0.14.0           
 [56] goftest_1.2-3             rlang_1.0.2               splines_4.1.2             rtracklayer_1.54.0        lazyeval_0.2.2           
 [61] spatstat.geom_2.4-0       BiocManager_1.30.16       yaml_2.3.5                reshape2_1.4.4            abind_1.4-5              
 [66] httpuv_1.6.5              tools_4.1.2               ellipsis_0.3.2            spatstat.core_2.4-2       RColorBrewer_1.1-3       
 [71] ggridges_0.5.3            plyr_1.8.7                sparseMatrixStats_1.6.0   progress_1.2.2            zlibbioc_1.40.0          
 [76] purrr_0.3.4               RCurl_1.98-1.6            prettyunits_1.1.1         rpart_4.1.16              deldir_1.0-6             
 [81] pbapply_1.5-0             viridis_0.6.2             zoo_1.8-10                ggrepel_0.9.1             cluster_2.1.3            
 [86] magrittr_2.0.3            data.table_1.14.2         scattermore_0.8           ResidualMatrix_1.4.0      lmtest_0.9-40            
 [91] RANN_2.6.1                mvtnorm_1.1-3             fitdistrplus_1.1-8        hms_1.1.1                 mime_0.12                
 [96] xtable_1.8-4              XML_3.99-0.9              mclust_5.4.9              gridExtra_2.3             scater_1.22.0            
[101] compiler_4.1.2            biomaRt_2.50.3            tibble_3.1.7              KernSmooth_2.23-20        crayon_1.5.1             
[106] R.oo_1.24.0               htmltools_0.5.2           mgcv_1.8-40               later_1.3.0               tidyr_1.2.0              
[111] DBI_1.1.2                 dbplyr_2.1.1              MASS_7.3-56               rappdirs_0.3.3            Matrix_1.4-1             
[116] cli_3.3.0                 R.methodsS3_1.8.1         metapod_1.2.0             parallel_4.1.2            igraph_1.3.1             
[121] pkgconfig_2.0.3           GenomicAlignments_1.30.0  scuttle_1.4.0             plotly_4.10.0             spatstat.sparse_2.1-1    
[126] xml2_1.3.3                vipor_0.4.5               dqrng_0.3.0               XVector_0.34.0            stringr_1.4.0            
[131] digest_0.6.29             pracma_2.3.8              sctransform_0.3.3         RcppAnnoy_0.0.19          spatstat.data_2.2-0      
[136] Biostrings_2.62.0         leiden_0.3.9              uwot_0.1.11               edgeR_3.36.0              DelayedMatrixStats_1.16.0
[141] restfulr_0.0.13           curl_4.3.2                shiny_1.7.1               Rsamtools_2.10.0          rjson_0.2.21             
[146] lifecycle_1.0.1           nlme_3.1-157              jsonlite_1.8.0            Rhdf5lib_1.16.0           BiocNeighbors_1.12.0     
[151] viridisLite_0.4.0         limma_3.50.3              fansi_1.0.3               pillar_1.7.0              lattice_0.20-45          
[156] ggrastr_1.0.1             KEGGREST_1.34.0           fastmap_1.1.0             httr_1.4.2                survival_3.3-1           
[161] glue_1.6.2                png_0.1-7                 bluster_1.4.0             bit_4.0.4                 stringi_1.7.6            
[166] blob_1.2.3                BiocSingular_1.10.0       memoise_2.0.1             irlba_2.3.5               future.apply_1.8.1

Dimension reduction option

Would be useful to be able to specify a dimensionality reduction in scDblFinder rather than automatically defaulting to PCA.
What if you have some other latent space calculated and want to work there?
What if you have corrected your PCA space with fastMNN or harmony and want to work there?

Exploring Overrepresented Doublets

Hello,

First, thank you for a great package. It's been incredibly useful in analyzing our datasets. I'm very interested in the methods you've developed to look at over-represented doublets as a way to explore physically interacting cells in our data.

I have data from multiple samples run on a single lane, which I demultiplexed using genotype-based methods (souporcell in this case). I am using your tool to identify doublets derived from the same sample but two different cell types, so I ran scDblFinder with the knownDoublets argument set to the output from souporcell. This worked great, but now I'm trying to use the scDblFinder.stats in the metadata to explore overrepresented doublets and look for possible cell-cell interactions. The output, however, doesn't specify whether the identified doublets derive from a single sample and, if so, which one. In other words, I can't tell whether doublets formed before or after the samples were combined, or whether they are present in all samples or derive from just one condition. I would like to see whether doublets from a treated sample are also present in the untreated sample, or whether they occur in one but not the other. I know that the samples argument is available, but the documentation says it is for multiple lanes or independently processed samples, not for multiplexed samples. Is there an argument to scDblFinder through which I can provide the multiplexed sample assignment for cells run on the same 10X lane?

Thank you for your help!

Return threshold in output?

Would it be possible to return the doublet threshold as part of the standard output, rather than only printing it to the console, so that it can be used programmatically?
The full console output becomes very unwieldy in an R environment with multiple large datasets, so it would be very helpful to have the threshold in the standard output to help interpret the doublet scores.

scDblFinder - known doublets

Hi, thanks for a great package. When I run scDblFinder on a single cell experiment object with arguments knowns= and knownsUse="discard", the output sce$scDblFinder.class calls some of the known doublets as singlets.
The help for scDblFinder seems to state that with option "discard", the known doublets, while not used for training, should still be called as doublets, so I'm not sure why this is happening. I can of course just add those known doublets back in as doublets manually, but wondered if there was an issue with the scDblFinder code here?
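
The manual workaround mentioned above can be sketched in base R as follows. The data frame is toy data; apart from scDblFinder.class, the column names (cell, known, final.class) are made up for the example, and on a real object the same assignment would be applied to the corresponding colData columns.

```r
# Toy per-cell results; `known` marks cells already known to be doublets.
res <- data.frame(
  cell  = paste0("cell", 1:6),
  known = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE),
  scDblFinder.class = factor(
    c("singlet", "singlet", "doublet", "doublet", "singlet", "singlet"),
    levels = c("singlet", "doublet"))
)

# Known doublets that were nevertheless called singlets
missed <- res$known & res$scDblFinder.class == "singlet"

# Combined call: doublet if either known or called
res$final.class <- res$scDblFinder.class
res$final.class[res$known] <- "doublet"
```

Keeping the combined call in a separate column leaves the original classification untouched, so the two can still be compared afterwards.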

rowMeans Error

Hello,

I am receiving an error when I try to use scDblFinder which returns the following messages/error:

Clustering cells...
Identifying top genes per cluster...
Error in base::rowMeans(x, na.rm = na.rm, dims = dims, ...) : 
  'x' must be an array of at least two dimensions

Here is the code I used:

library(scDblFinder)
library(DropletUtils)
library(SingleCellExperiment)

tenX <- "/path/to/10x/filtered_gene_bc_matrices/"

counts <- read10xCounts(tenX)

sce <- SingleCellExperiment(list(counts=counts))
sce <- scDblFinder(sce, verbose = TRUE)

And this is what the sce object looks like:

class: SingleCellExperiment 
dim: 32838 19330 
metadata(0):
assays(1): counts
rownames(32838): ENSG00000243485 ENSG00000237613 ... ENSG00000198695
  ENSG00000198727
rowData names(0):
colnames: NULL
colData names(0):
reducedDimNames(0):
spikeNames(0):
altExpNames(0):

Here is my session info if needed:

R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 10 (buster)

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0

locale:
[1] C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] DropletUtils_1.6.1          SingleCellExperiment_1.8.0 
 [3] SummarizedExperiment_1.16.1 DelayedArray_0.12.3        
 [5] BiocParallel_1.20.1         matrixStats_0.57.0         
 [7] Biobase_2.46.0              GenomicRanges_1.38.0       
 [9] GenomeInfoDb_1.22.1         IRanges_2.20.2             
[11] S4Vectors_0.24.4            BiocGenerics_0.32.0        
[13] scDblFinder_1.1.8          

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5               rsvd_1.0.3               locfit_1.5-9.4          
 [4] lattice_0.20-40          R6_2.4.1                 ggplot2_3.3.2           
 [7] pillar_1.4.6             zlibbioc_1.32.0          rlang_0.4.8             
[10] data.table_1.13.2        irlba_2.3.3              R.oo_1.24.0             
[13] R.utils_2.10.1           Matrix_1.2-18            BiocNeighbors_1.4.2     
[16] statmod_1.4.34           igraph_1.2.6             RCurl_1.98-1.2          
[19] munsell_0.5.0            HDF5Array_1.14.4         compiler_3.6.3          
[22] vipor_0.4.5              BiocSingular_1.2.2       pkgconfig_2.0.3         
[25] ggbeeswarm_0.6.0         tidyselect_1.1.0         tibble_3.0.4            
[28] gridExtra_2.3            GenomeInfoDbData_1.2.2   edgeR_3.28.1            
[31] randomForest_4.6-14      viridisLite_0.3.0        crayon_1.3.4            
[34] dplyr_1.0.2              R.methodsS3_1.8.1        bitops_1.0-6            
[37] grid_3.6.3               gtable_0.3.0             lifecycle_0.2.0         
[40] magrittr_1.5             scales_1.1.1             dqrng_0.2.1             
[43] XVector_0.26.0           viridis_0.5.1            limma_3.42.2            
[46] scater_1.14.6            DelayedMatrixStats_1.8.0 ellipsis_0.3.1          
[49] generics_0.0.2           vctrs_0.3.4              Rhdf5lib_1.8.0          
[52] tools_3.6.3              glue_1.4.2               beeswarm_0.2.3          
[55] purrr_0.3.4              scran_1.14.6             colorspace_1.4-1        
[58] rhdf5_2.30.1

Any help is appreciated!

Running scDblFinder deterministic and serial with the 'samples' parameter

Hello,

I was trying to run scDblFinder with the samples parameter and set.seed(), but without BPPARAM, and noticed that the results were not reproducible (the number of doublets found differed between runs).
Either removing the samples parameter or adding BPPARAM=MulticoreParam(1, RNGseed=seed) produced reproducible results.
However, I was searching for a way for serial execution suitable for running in RStudio (I keep having problems with BiocParallel) and needed to consider individual samples. So, after some testing I ended up using BPPARAM=SerialParam(RNGseed = seed), which seems to lead to the behaviour I was looking for.
I did not find any comment on SerialParam() in the documentation. Would this also be your suggested solution in my case or could there be a better alternative?

I'm grateful for any clarification.

Best wishes,
Christian
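The serial setup described above can be sketched as follows. This is a minimal, hedged example rather than an official recommendation: it assumes `mockDoubletSCE()` from scDblFinder to simulate data, and the `sample_id` column, the seed value, and the `run_once()` helper are hypothetical names introduced only for illustration.

```r
library(scDblFinder)
library(BiocParallel)

seed <- 101

# Simulated data standing in for a real multi-sample SCE (assumption)
set.seed(seed)
sce <- mockDoubletSCE(ncells = c(300, 300))

# Hypothetical per-cell sample labels
sce$sample_id <- rep(c("A", "B"), length.out = ncol(sce))

# Hypothetical helper: one serial run with a fixed RNG seed,
# returning the number of cells per class
run_once <- function(x) {
  res <- scDblFinder(x, samples = "sample_id",
                     BPPARAM = SerialParam(RNGseed = seed))
  table(res$scDblFinder.class)
}

first  <- run_once(sce)
second <- run_once(sce)

# TRUE when both runs call the same numbers of singlets and doublets
identical(first, second)
```

Comparing the class tables from two runs is only a coarse check; comparing the per-cell scores would be a stricter test of reproducibility.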

Installing scDblFinder with conda not working

Hi,

I am trying to install scDblFinder through conda using:

conda install -c bioconda bioconductor-scdblfinder

However, this fails. The error message follows:

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: / 
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
failed                                                                                                                              

UnsatisfiableError:

I also tried BiocManager::install("scDblFinder"), again without success. I would appreciate any help on this matter, since scDblFinder seems to be the state-of-the-art software for doublet removal.

Thanks

Unsupported matrix format (DelayedArray)

Hi, thanks for the nice package!
I found that this package does not support DelayedArray,
so please consider adding support for it.

library("TENxPBMCData")
library("scRNAseq")
library("scDblFinder")

# Dense matrix
sce <- ZeiselBrainData()
is(counts(sce))
sce <- scDblFinder(sce)

# Sparse matrix
sce2 <- BaronPancreasData('human')
is(counts(sce2))
sce2 <- scDblFinder(sce2)

# DelayedArray
sce3 <- TENxPBMCData(dataset = "pbmc3k")
is(counts(sce3))
sce3 <- scDblFinder(sce3)
# Overclustering...
# clusters
#   1   2   3   4   5   6   7   8   9  10  11
# 354 345 175 162 150 193 276 293 301 312 139
# Creating ~2700 artifical doublets...
# Error in cbind(...) :
#   missing 'cbind' method for DataTable class DelayedMatrix
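Until DelayedArray input is handled, one possible workaround is to realize the counts as an in-memory sparse matrix before calling scDblFinder. The sketch below is an assumption-laden illustration, not a fix from the package authors: it assumes the data fit in memory, and the toy matrix and object names are hypothetical.

```r
library(DelayedArray)
library(Matrix)

# Toy count matrix wrapped as a DelayedMatrix (stand-in for counts(sce3))
set.seed(1)
m <- matrix(rpois(200, lambda = 1), nrow = 20,
            dimnames = list(paste0("gene", 1:20), paste0("cell", 1:10)))
dm <- DelayedArray(m)
is(dm, "DelayedMatrix")

# Realize the delayed matrix in memory as a sparse matrix
sm <- as(dm, "dgCMatrix")
is(sm, "dgCMatrix")

# Applied to the example above, this would look like (untested assumption):
# counts(sce3) <- as(counts(sce3), "dgCMatrix")
# sce3 <- scDblFinder(sce3)
```

This trades the memory savings of the delayed representation for compatibility, so it is only reasonable for datasets small enough to hold in RAM.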

Merge data or individual data

Hi scDblFinder developers,
Thank you so much for such a wonderful package. It runs sooooo fast~~~.

I have a few questions regarding the usage of scDblFinder. Maybe because it is fairly new, there is no tutorial to follow.

I assume the "samples" parameter can be used for batch information when dealing with multi-sample/multi-batch data. In that case, which would be better: detecting/removing doublets in each dataset individually and then merging for further analysis, or working on the merged dataset as a whole?
Also, what is the "normal/common" doublet rate, based on experience?

Thank you.
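On the doublet-rate question, a commonly quoted rule of thumb for 10x Chromium data is roughly 1% doublets per 1000 cells captured. The base R sketch below only illustrates that arithmetic; `expected_doublet_rate()`, `per_1k`, and the capture sizes are hypothetical, and the 1% figure is an assumption that should be adjusted for the platform used.

```r
# Hypothetical helper: expected doublet fraction for one capture,
# assuming the rate grows linearly with the number of cells captured
expected_doublet_rate <- function(n_cells, per_1k = 0.01) {
  per_1k * n_cells / 1000
}

# Hypothetical captures of different sizes
n_cells <- c(sampleA = 2000, sampleB = 4000, sampleC = 8000)
rates <- expected_doublet_rate(n_cells)

# Expected number of doublets per capture
round(rates * n_cells)
```

Under this assumption the expected rate depends on the size of each capture rather than on the size of the merged dataset.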

	x <- t(vapply(split(names(k),k), FUN.VALUE=numeric(ncol(x)),
	FUN=function(i) colMeans(x[i,,drop=FALSE])))


Reproducible example to illustrate it:
