ltla / tiledbarray

Clone of the Bioconductor repository for the TileDBArray package.

Home Page: https://bioconductor.org/packages/devel/bioc/html/TileDBArray.html

License: MIT License

R 81.64% C++ 13.40% Shell 4.96%

tiledbarray's Introduction

DelayedArray backends for TileDB

This package implements a DelayedArray backend for TileDB to read, write, and store dense and sparse arrays. The resulting TileDBArray objects are directly compatible with any Bioconductor package that accepts DelayedArray objects, serving as a drop-in replacement for the predominant HDF5Array currently used throughout the Bioconductor ecosystem for representing large datasets. See the official Bioconductor landing page for more details.
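A minimal sketch of the drop-in usage described above, mirroring the coercions shown in the issues below (the exact behavior may vary by package version):

```r
# Coerce an ordinary in-memory matrix to an on-disk TileDBArray and use it
# through the standard DelayedArray interface.
library(TileDBArray)

mat <- matrix(rnorm(100), nrow = 10, ncol = 10)
X <- as(mat, "TileDBArray")  # data now backed by a TileDB store on disk
dim(X)                       # behaves like an ordinary matrix
sum(X)                       # DelayedArray operations work transparently
```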

tiledbarray's People

Contributors

eddelbuettel, hpages, ltla, nturaga


tiledbarray's Issues

Please enable Travis CI

The repository currently does not have continuous integration set up. There are many available options; one that I have found useful is to rely on pre-made Docker containers that already hold the build requirements, as this really shortens the build and test time. [1]

I could add such a .travis.yml if that is seen as interesting enough to explore, but I currently cannot enable Travis CI as it sees me as having a fork:

(screenshot of the Travis CI fork notice)

If you enable Travis support I can easily add a .travis.yml. The container to use is the (oddly named, from a one-off repository I keep using) eddelbuettel/rocker-tiledb:bioc defined in this Dockerfile.

(Interestingly enough, the Dockerfile replicates an issue I was having on my machine for one use case: the matrix ops fail. I may try some bisecting.)

[1] Direct Travis use with caching may achieve similar timings, but I find that system somewhat opaque and more difficult to debug. Using a Docker container is simple and portable, though clearly not as flexible, as one has to add new dependencies explicitly. Tradeoffs, as always.

First head-to-head against HDF5Array

This is a condensed version of a real application involving PCA on sparse log-transformed expression values:

sce <- scRNAseq::MacoskoRetinaData() 
y <- scuttle::normalizeCounts(counts(sce))
dim(y)
## [1] 24658 49300

library(BiocSingular)
library(HDF5Array)
system.time(hdf.mat <- writeHDF5Array(y, filepath="macosko.h5", name="logcounts"))
##    user  system elapsed 
## 144.265   3.220 147.627 
system.time(hdf.pcs <- runPCA(t(hdf.mat), 10, BSPARAM=RandomParam(deferred=TRUE)))
##    user  system elapsed 
## 861.133  57.775 918.967 

library(TileDBArray)
system.time(tdb.mat <- writeTileDBArray(y, path="macosko_tdb", attr="logcounts"))
##    user  system elapsed 
##  66.415   1.717  20.009 
system.time(tdb.pcs <- runPCA(t(tdb.mat), 10, BSPARAM=RandomParam(deferred=TRUE)))
##    user  system elapsed 
## 888.668 167.635 347.845 

Note that this is not quite a fair comparison:

  • HDF5 library read/writes are single-threaded, while the TileDB library will happily use multiple cores.
  • HDF5Array write for sparse matrices is currently rather inefficient, see Bioconductor/HDF5Array#30.

Nonetheless, these results are encouraging given that no effort has been made to optimize the TileDB calls either. For starters, I suspect the tile extents are too small. (Defaults to 100 in each dimension.)
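As a hedged illustration of the tuning hinted at above: assuming the writer forwards an extent= argument to the realization machinery (suggested by the extent slot in TileDBArraySeed, but version-dependent and unverified here), larger tile extents could be tried like so:

```r
# Sketch only: extent= pass-through is an assumption, not a documented API.
library(TileDBArray)
library(Matrix)

y <- rsparsematrix(1000, 1000, density = 0.01)  # small stand-in matrix
tdb <- writeTileDBArray(y, path = tempfile(), extent = 1000L)
dim(tdb)
```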

Session information
R version 4.0.0 Patched (2020-05-01 r78341)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS:   /home/luna/Software/R/R-4-0-branch-dev/lib/libRblas.so
LAPACK: /home/luna/Software/R/R-4-0-branch-dev/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] TileDBArray_0.0.1           HDF5Array_1.17.2           
 [3] rhdf5_2.33.3                BiocSingular_1.5.0         
 [5] scRNAseq_2.3.6              SingleCellExperiment_1.11.5
 [7] SummarizedExperiment_1.19.5 DelayedArray_0.15.5        
 [9] matrixStats_0.56.0          Matrix_1.2-18              
[11] Biobase_2.49.0              GenomicRanges_1.41.5       
[13] GenomeInfoDb_1.25.2         IRanges_2.23.10            
[15] S4Vectors_0.27.12           BiocGenerics_0.35.4        

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6                  rsvd_1.0.3                   
 [3] lattice_0.20-41               zoo_1.8-8                    
 [5] assertthat_0.2.1              digest_0.6.25                
 [7] mime_0.9                      BiocFileCache_1.13.0         
 [9] R6_2.4.1                      RSQLite_2.2.0                
[11] httr_1.4.1                    pillar_1.4.4                 
[13] zlibbioc_1.35.0               rlang_0.4.6                  
[15] curl_4.3                      irlba_2.3.3                  
[17] tiledb_0.7.0                  blob_1.2.1                   
[19] BiocParallel_1.23.0           RcppCCTZ_0.2.7               
[21] AnnotationHub_2.21.1          RCurl_1.98-1.2               
[23] bit_1.1-15.2                  shiny_1.4.0.2                
[25] compiler_4.0.0                httpuv_1.5.4                 
[27] base64enc_0.1-3               pkgconfig_2.0.3              
[29] htmltools_0.5.0               tidyselect_1.1.0             
[31] tibble_3.0.1                  GenomeInfoDbData_1.2.3       
[33] interactiveDisplayBase_1.27.5 crayon_1.3.4                 
[35] dplyr_1.0.0                   dbplyr_1.4.4                 
[37] later_1.1.0.1                 rhdf5filters_1.1.0           
[39] bitops_1.0-6                  rappdirs_0.3.1               
[41] grid_4.0.0                    xtable_1.8-4                 
[43] lifecycle_0.2.0               DBI_1.1.0                    
[45] magrittr_1.5                  scuttle_0.99.9               
[47] XVector_0.29.2                promises_1.1.1               
[49] DelayedMatrixStats_1.11.0     ellipsis_0.3.1               
[51] generics_0.0.2                vctrs_0.3.1                  
[53] Rhdf5lib_1.11.2               tools_4.0.0                  
[55] bit64_0.9-7                   nanotime_0.2.4               
[57] glue_1.4.1                    purrr_0.3.4                  
[59] BiocVersion_3.12.0            fastmap_1.0.1                
[61] yaml_2.2.1                    AnnotationDbi_1.51.0         
[63] BiocManager_1.30.10           ExperimentHub_1.15.0         
[65] memoise_1.1.0                

Instructions on how to use / save / read TileDBArrays in a SummarizedExperiment

So great to see the TileDBArray backend join the Bioconductor family 👍.

A quick question: are you planning to create helper functions to save / read SummarizedExperiments with TileDBArray backends? Analogous to what Herve Pages has provided in the HDF5Array package here for instance?

Or perhaps there already are examples of how to persist TileDBArray-backed SummarizedExperiments that I could learn from?
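As far as I know there is no official equivalent of saveHDF5SummarizedExperiment() yet; a hand-rolled sketch (saveTileDBSE() and loadTileDBSE() are hypothetical helpers, not package functions) might realize each assay on disk and serialize the object shell with saveRDS():

```r
library(SummarizedExperiment)
library(TileDBArray)

# Hypothetical helper: realize every assay as a TileDBArray under `dir`,
# then save the SummarizedExperiment shell (which keeps only the on-disk
# seeds) alongside them.
saveTileDBSE <- function(se, dir) {
  dir.create(dir, showWarnings = FALSE)
  for (a in assayNames(se)) {
    assay(se, a) <- writeTileDBArray(assay(se, a),
                                     path = file.path(dir, a))
  }
  saveRDS(se, file.path(dir, "se.rds"))
  invisible(se)
}

# Hypothetical counterpart: reload the shell; the assays resolve back to
# the TileDB stores as long as the paths are still valid.
loadTileDBSE <- function(dir) readRDS(file.path(dir, "se.rds"))
```

Note the obvious caveat: unlike saveHDF5SummarizedExperiment(), this sketch stores absolute paths inside the seeds, so the directory cannot be moved freely.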

How to create a huge on-disk array directly from R

To analyze a huge multi-dimensional array, I checked some on-disk implementations such as DelayedArray, HDF5Array, and TileDBArray, but all of them seem to assume that a huge array is already stored in HDF5 or TileDB; if we want to create an on-disk array from R, we can only create a small array that fits in memory first.

For example, in the code below, small_arr can be created but large_arr cannot, because we have to create a huge in-memory array first before it is converted to a TileDBArray.

library("TileDBArray")
small_arr <- as(array(runif(10*20*30), c(10,20,30)), "TileDBArray")
large_arr <- as(array(runif(10000*1000*1000), c(10000,1000,1000)), "TileDBArray")

Can I create a TileDBArray without defining the intermediate array object?
Perhaps using new() (the constructor) would solve this problem?

TileDBArray/R/TileDBArray.R

Lines 123 to 124 in 469ad47

new("TileDBArraySeed", dim=d, dimnames=dimnames, path=x,
    sparse=is.sparse(s), attr=attr, type=my.type, extent=meta$extent)
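Rather than calling new() directly, one possible route is to create the on-disk array first and fill it block by block, so that no full in-memory copy is ever needed. This sketch assumes TileDBArray's RealizationSink support and DelayedArray's sink-grid utilities behave as they do for other backends:

```r
library(DelayedArray)
library(TileDBArray)

# Create an empty on-disk array of the target shape, then stream blocks
# into it one at a time. Dimensions here are small stand-ins.
sink <- TileDBRealizationSink(dim = c(100L, 50L, 20L))
grid <- defaultSinkAutoGrid(sink)
for (b in seq_along(grid)) {
  vp <- grid[[b]]
  block <- array(runif(prod(dim(vp))), dim = dim(vp))  # generate one block
  sink <- write_block(sink, vp, block)
}
close(sink)
arr <- as(sink, "TileDBArray")
dim(arr)
```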

Is there a method to append (write) to the TileDB store on disk (e.g. on-disk rbind/cbind)?

Hope the title is self-descriptive.

set.seed(123) # fix RNG seed
mat <- matrix(0, nrow=8, ncol=20) # create matrix with zeros
mat[sample(seq_len(8*20), 15)] <- seq(1, 15) # add non-zero entries
spmat <- as(mat, "dgCMatrix") # make it sparse
path <- tempfile() # get temp file path
out <- TileDBArray::writeTileDBArray(x = spmat, # write to location
                                     path = path)
# error: the path already exists (and probably inefficient anyway,
# since `out` presumably has to be read back from disk)
out <- TileDBArray::writeTileDBArray(x = cbind(out,out),
                                     path = path)

EDIT: example from https://dirk.eddelbuettel.com/papers/useR2021_tiledb_tutorial.pdf
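For what it's worth, a workaround for the error above is to realize the delayed cbind() to a fresh path rather than overwriting the original store. A sketch (untested against every version):

```r
library(Matrix)
library(TileDBArray)

set.seed(123)
mat <- matrix(0, nrow = 8, ncol = 20)
mat[sample(seq_len(8 * 20), 15)] <- seq_len(15)
spmat <- as(mat, "dgCMatrix")
out <- writeTileDBArray(spmat, path = tempfile())

# cbind() on DelayedArray objects is delayed, so realizing the combined
# matrix to a *new* path avoids clashing with the existing store:
wide <- writeTileDBArray(cbind(out, out), path = tempfile())
dim(wide)  # 8 x 40
```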

Creation of TileDBArray fails on Linux ARM64

Hello,

The CMD build of TileDBArray fails on Linux ARM64 with the following error:

 ~/bbs-3.17-bioc/R/bin/R CMD build --keep-empty-dirs --no-resave-data TileDBArray
* checking for file ‘TileDBArray/DESCRIPTION’ ... OK
* preparing ‘TileDBArray’:
* checking DESCRIPTION meta-information ... OK
* cleaning src
* running ‘cleanup’
* installing the package to build vignettes
* creating vignettes ... ERROR
--- re-building ‘userguide.Rmd’ using rmarkdown
Quitting from lines 35-38 (userguide.Rmd) 
Error: processing vignette 'userguide.Rmd' failed with diagnostics:
Expecting an external pointer: [type=NULL].
--- failed re-building ‘userguide.Rmd’

SUMMARY: processing the following file failed:
  ‘userguide.Rmd’

Error: Vignette re-building failed.
Execution halted

I have also tested with as(X, "TileDBArray") and it fails the same way:

* creating vignettes ... ERROR
--- re-building ‘userguide.Rmd’ using rmarkdown
Quitting from lines 43-44 (userguide.Rmd) 
Error: processing vignette 'userguide.Rmd' failed with diagnostics:
Expecting an external pointer: [type=NULL].
--- failed re-building ‘userguide.Rmd’

Do you have any idea what the problem could be?

Extension to multi-dimensional array (proposal)

Thanks for the great work.
I'd like to use TileDBArray with a multi-dimensional array as follows, but this code fails with an error.

library(TileDBArray)

mat <- as(array(rbinom(5*6, 2, 0.1), c(5,6)), "TileDBArray")
arr <- as(array(rbinom(5*6*7, 2, 0.1), c(5,6,7)), "TileDBArray")

If this worked, we could accelerate not only matrix decomposition algorithms but also tensor decomposition algorithms, just as Aaron Lun previously accelerated runPCA using TileDBArray.
https://www.youtube.com/watch?v=wQJbSh-NHeg&t=1819s

Could you extend TileDBArray to multi-dimensional arrays?

Thoughts on using tiledb_array

Having played around with tiledb_array on 0.7.0, here are some thoughts on its integration with TileDBArray. I will use the following example to provide some context:

library(tiledb)
tmp <- tempfile()
dir.create(tmp)

d1  <- tiledb_dim("d1", domain = c(1L, 5L))
d2  <- tiledb_dim("d2", domain = c(1L, 5L))
dom <- tiledb_domain(c(d1, d2))
val <- tiledb_attr("val", type = "FLOAT64")
sch <- tiledb_array_schema(dom, c(val))
tiledb_array_create(tmp, sch)

A <- tiledb_array(uri = tmp)
A[] <- data.frame(d1=rep(1:5,5), d2=rep(1:5,each=5), val=1:25)

Error when the index is a symbol

There are some odd substitute() calls inside the [ method that probably cause this:

A[list(c(1,2), c(4,5)),] 
## $d1
##  [1] 1 1 1 1 1 2 2 2 2 2 4 4 4 4 4 5 5 5 5 5
## 
## $d2
##  [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
## 
## $val
##  [1]  1  6 11 16 21  2  7 12 17 22  4  9 14 19 24  5 10 15 20 25

Y <- list(c(1,2), c(4,5))
A[Y,]
## Error in is[[1]] : object of type 'symbol' is not subsettable

This blocks programmatic usage for the time being.

Preferred subset output

In fact, it doesn't even have to be the [ function; you could give me an entirely different function that does this. Let's call it tiledb_extract_indices() for now. For the inputs, I would like:

  • the tiledb_array object x, let's say with N dimensions.
  • one or more indexing arguments, say i for the first dimension, j for the second, and so on. These would be integer vectors (or NULL, if we want all of that dimension).

For outputs, I would like an N-column matrix of coordinates and a vector (or a data.frame, I suppose, to handle multiple attributes) of values. The matrix and the vector/data.frame would have the same number of rows but are kept separate to make it easier to distinguish between location and value.

The coordinates themselves would refer to the coordinates of the indexing arguments i and j and friends, not the coordinates on the full array in x. This is important as it disambiguates between duplicated values in i. For example, I would like to be able to do this:

# (Ideally, d1 and d2 would be their own matrix or df so that it is easy
# to understand which elements are indices and which are values.
# Nonetheless, I'll show it like this to make it easier to compare with 
# the current state of affairs.)
tiledb_extract_indices(x, i=c(2,2,2,2), j=1)
## $d1
##  [1] 1 2 3 4
##
## $d2
##  [1] 1 1 1 1
##
## $val
##  [1] 2 2 2 2

From this output, I can easily construct an array or sparse matrix with rows defined by i and columns defined by j. If I need the full indices (with respect to the entire array), I can simply subset i by d1 and j by d2. In contrast, the current behavior is to do:

A[list(2,2,2,2),list(1)]
## $d1
## [1] 2 2 2 2
## 
## $d2
## [1] 1 1 1 1
## 
## $val
## [1] 2 2 2 2

This is harder to reason with because I now need to figure out which of the 2's in d1 match up with the 2's in the row-subsetting list i.
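To illustrate how the proposed output would be consumed: with coordinates expressed relative to the indexing vectors, Matrix::sparseMatrix() can assemble the result directly. (d1, d2 and val below merely mimic the hypothetical tiledb_extract_indices() return values.)

```r
library(Matrix)

i <- c(2, 2, 2, 2)  # row-indexing vector (note the duplicates)
j <- 1              # column-indexing vector
d1 <- 1:4           # coordinates relative to i, as proposed above
d2 <- rep(1L, 4)    # coordinates relative to j
val <- rep(2, 4)

# Rows defined by i, columns by j; duplicates are unambiguous because the
# coordinates index into i and j rather than into the full array.
res <- sparseMatrix(i = d1, j = d2, x = val,
                    dims = c(length(i), length(j)))
```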

Similarly, tiledb_extract_indices() would be in charge of figuring out how to create a query from arbitrary integer vectors in i, j and friends. The current state requires me to perform a series of loops to arrange the inputs in the right manner (namely, to identify contiguous runs and create a list with one entry per run's start and end points), which is unlikely to be efficient.
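The run-finding step described above can be sketched as follows: collapse a sorted integer vector into contiguous [start, end] runs, the shape a TileDB range query expects. runs_of() is a hypothetical helper, not part of any package.

```r
# Collapse an integer index vector into contiguous runs.
# runs_of(c(1, 2, 4, 5)) yields two runs: [1, 2] and [4, 5].
runs_of <- function(idx) {
  idx <- sort(unique(idx))
  ends <- c(which(diff(idx) != 1L), length(idx))  # last position of each run
  starts <- c(1L, head(ends, -1L) + 1L)           # first position of each run
  Map(function(s, e) c(start = idx[s], end = idx[e]), starts, ends)
}

runs_of(c(1, 2, 4, 5))
# expect: list(c(start = 1, end = 2), c(start = 4, end = 5))
```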
