Having played around with tiledb_array on 0.7.0, here are some thoughts on its integration with TileDBArray. I will use the following example to provide some context:
library(tiledb)
tmp <- tempfile()
dir.create(tmp)
d1 <- tiledb_dim("d1", domain = c(1L, 5L))
d2 <- tiledb_dim("d2", domain = c(1L, 5L))
dom <- tiledb_domain(c(d1, d2))
val <- tiledb_attr("val", type = "FLOAT64")
sch <- tiledb_array_schema(dom, c(val))
tiledb_array_create(tmp, sch)
A <- tiledb_array(uri = tmp)
A[] <- data.frame(d1=rep(1:5,5), d2=rep(1:5,each=5), val=1:25)
Error when the index is a symbol
There are some odd substitute() calls inside the [ method that probably cause this:
A[list(c(1,2), c(4,5)),]
## $d1
## [1] 1 1 1 1 1 2 2 2 2 2 4 4 4 4 4 5 5 5 5 5
##
## $d2
## [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
##
## $val
## [1] 1 6 11 16 21 2 7 12 17 22 4 9 14 19 24 5 10 15 20 25
Y <- list(c(1,2), c(4,5))
A[Y,]
## Error in is[[1]] : object of type 'symbol' is not subsettable
This blocks programmatic usage for the time being.
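To illustrate the suspected mechanism, here is a minimal base R sketch. The functions peek_first() and peek_first_eval() are toy stand-ins invented for this example, not the actual tiledb code; they only show how inspecting an unevaluated argument via substitute() works for a literal list(...) call but fails once the same value is passed through a variable.

```r
# Toy stand-in for a `[` method that inspects its unevaluated argument.
# This is NOT the tiledb source, only an illustration of the failure mode.
peek_first <- function(i) {
  is <- substitute(i)  # captures the expression the caller wrote, not its value
  is[[1]]              # fine for a call such as list(...), fails for a bare symbol
}

# A literal call: the first element of the captured call is the symbol `list`.
peek_first(list(c(1, 2), c(4, 5)))

# The same value passed via a variable: substitute() now yields the symbol Y.
Y <- list(c(1, 2), c(4, 5))
res <- tryCatch(peek_first(Y), error = function(e) conditionMessage(e))
res

# Evaluating the argument first makes literals and variables behave alike.
peek_first_eval <- function(i) {
  is <- i              # forces the promise, so we work with the value
  is[[1]]
}
peek_first_eval(Y)
```

If the [ method does something similar internally, evaluating the index before inspecting it would presumably remove the restriction.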
Preferred subset output
In fact, it doesn't even have to be the [ function; you could give me an entirely different function that does this. Let's call it tiledb_extract_indices() for now. For the inputs, I would like:
- the tiledb_array object x, let's say with N dimensions.
- one or more indexing arguments, say i for the first dimension, j for the second, and so on. These would be integer vectors (or NULL, if we want all of that dimension).
For outputs, I would like an N-column matrix of coordinates and a vector (or a data.frame, I suppose, to handle multiple attributes) of values. The matrix and the data.frame have the same number of rows but are kept separate to make it easier to distinguish between location and value.
The coordinates themselves would refer to positions within the indexing arguments i, j and friends, not the coordinates on the full array in x. This is important as it disambiguates between duplicated values in i. For example, I would like to be able to do this:
# (Ideally, d1 and d2 would be their own matrix or df so that it is easy
# to understand which elements are indices and which are values.
# Nonetheless, I'll show it like this to make it easier to compare with
# the current state of affairs.)
tiledb_extract_indices(x, i=c(2,2,2,2), j=1)
## $d1
## [1] 1 2 3 4
##
## $d2
## [1] 1 1 1 1
##
## $val
## [1] 2 2 2 2
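To pin down the semantics I have in mind, here is a rough reference sketch on an ordinary dense matrix. The name extract_indices_dense() is made up for this illustration and it does not touch TileDB at all; it also ignores sparsity, where presumably only non-empty cells would be returned.

```r
# Hypothetical reference implementation on a plain matrix, for illustration only.
extract_indices_dense <- function(m, i = NULL, j = NULL) {
  if (is.null(i)) i <- seq_len(nrow(m))  # NULL means "all of this dimension"
  if (is.null(j)) j <- seq_len(ncol(m))
  # Coordinates are positions within i and j, not within the full array.
  pos <- expand.grid(d1 = seq_along(i), d2 = seq_along(j))
  list(
    coords = as.matrix(pos),
    values = data.frame(val = m[cbind(i[pos$d1], j[pos$d2])])
  )
}

# Same layout as the example array: val = d1 + 5 * (d2 - 1).
m <- matrix(1:25, nrow = 5)
out <- extract_indices_dense(m, i = c(2, 2, 2, 2), j = 1)
out$coords  # d1 runs over 1..4 (positions in i), d2 is always 1
out$values  # val is 2 in every row
```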
From this output, I can easily construct an array or sparse matrix with rows defined by i and columns defined by j. If I need the full indices (with respect to the entire array), I can simply subset i by d1 and j by d2. In contrast, the current behavior is to do:
A[list(2,2,2,2),list(1)]
## $d1
## [1] 2 2 2 2
##
## $d2
## [1] 1 1 1 1
##
## $val
## [1] 2 2 2 2
This is harder to reason about because I now need to figure out which of the 2's in d1 match up with the 2's in the row-subsetting list i.
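By contrast, with coordinates expressed as positions within i and j, assembling the result is a single matrix-indexed assignment. The sketch below uses a hand-written list in the proposed output format (no such function exists yet, so the out object is an assumption) and plain base R.

```r
i <- c(2, 2, 2, 2)
j <- 1

# Hand-written stand-in for the proposed output:
# d1 and d2 are positions within i and j, val holds the attribute values.
out <- list(d1 = 1:4, d2 = c(1L, 1L, 1L, 1L), val = c(2, 2, 2, 2))

# Dense result with one row per entry of i and one column per entry of j.
res <- matrix(NA_real_, nrow = length(i), ncol = length(j))
res[cbind(out$d1, out$d2)] <- out$val

# Coordinates with respect to the full array, if they are needed.
full_d1 <- i[out$d1]
full_d2 <- j[out$d2]
```

The same d1, d2 and val vectors could be handed to Matrix::sparseMatrix() with dims = c(length(i), length(j)) to get a sparse result instead.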
Similarly, tiledb_extract_indices() would be in charge of figuring out how to create a query from arbitrary integer vectors in i, j and friends. The current state requires me to perform a series of loops to arrange the inputs in the right manner (namely, to identify contiguous runs and create a list with one entry per run holding its start and end points), which is unlikely to be efficient.