vertica / ddR
Standard API for Distributed Data Structures in R
License: GNU General Public License v2.0
This happens when I do a local install of ddR from source (R CMD INSTALL .) on a Mac laptop and then attempt to use head or tail on a darray or dframe. The issue is in the C++ for getPartitionIdsAndOffsets, so it breaks all the unit tests that call this.
It does not happen on Linux (Ubuntu 16.04). It does not happen on the Mac when I use the current CRAN version of the package (ddR_0.1.2). So it could be some kind of configuration problem specific to my machine.
Here's an example:
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ddR_0.1.3
loaded via a namespace (and not attached):
[1] Rcpp_0.12.5
> a = darray(dim = c(10, 10), psize = c(10, 5))
> head(collect(a))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0 0 0 0 0 0 0 0 0 0
[2,] 0 0 0 0 0 0 0 0 0 0
[3,] 0 0 0 0 0 0 0 0 0 0
[4,] 0 0 0 0 0 0 0 0 0 0
[5,] 0 0 0 0 0 0 0 0 0 0
[6,] 0 0 0 0 0 0 0 0 0 0
> head(a)
*** caught segfault ***
address 0x18, cause 'memory not mapped'
Traceback:
1: .Call("ddR_getPartitionIdsAndOffsets", PACKAGE = "ddR", indices,
psizes, nparts)
2: getPartitionIdsAndOffsets(list(sort(i), sort(j)), psize(x), nparts(x))
3: x[1:n, 1:(dim(x)[[2]]), drop = FALSE]
4: x[1:n, 1:(dim(x)[[2]]), drop = FALSE]
5: head.DObject(a)
6: head(a)
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: 1
R is aborting now ...
Segmentation fault: 11
Probably the most common operation in R is subsetting, i.e. [ and $. While trying to use ddR and looking at ops.R, I notice that the subsetting operators all collect.
How would one return a distributed object from subsetting? For example, remove the 10% of the rows of a dframe that contain NA. The resulting dframe will still be large, so it's best to keep it distributed.
One idea is to have DObjects closed under subsetting and leave it to the user to call collect() explicitly.
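For the NA example above, here is a sketch of one way to keep the result distributed today using only dmapply and parts, rather than a statement about what ops.R should do. It assumes a row-partitioned dframe named df and that ddR tolerates unevenly sized row partitions after filtering:
# Filter NA rows partition by partition so the result stays distributed.
drop_na_rows <- function(part) part[complete.cases(part), , drop = FALSE]
clean <- dmapply(drop_na_rows, parts(df),
                 output.type = "dframe",
                 combine = "rbind",
                 nparts = c(length(parts(df)), 1))
The result stays a dframe, and the user still decides when (or whether) to call collect().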
I am running into a memory issue with drandomForest. I receive this message, which we have all seen before:
Error in drandomForest.default(m, y, ..., ntree = ntree, nExecutor = nExecutor, :
cannot allocate vector of size 7.9 Gb
The function call:
tree.result <- drandomForest(Target ~ ., data = as.data.frame(signalDF), mtry = predictors, nExecutor = 4)
Executing on a 64-bit machine with 32 GB of memory. The number of predictors is 64, and the dataset contains ~4 million rows. The size of the data is slightly over 2 GB. We are using the function in classification mode (y is set as a factor). Predictors are a combination of numeric variables and factors.
Please advise. Thanks!
The same README.md is in the top-level directory and the ddR directory. Maybe we can use a symlink or something so there's only one copy.
I read the article "dmapply: A functional primitive to express distributed machine learning algorithms in R", and a diagram there shows peer-to-peer communication between the workers. That was a surprise. I started investigating the ddR code and found the following: ddR.R#L278-L285. So essentially all communication goes through the master. Am I missing something, or is the diagram just misleading?
I'm not that experienced with snow clusters. Could peer-to-peer communication potentially be done within the parallel package framework?
Looking at fork_driver.R, we can see that dmapply is essentially mcmapply. So we rely on the fact that every object can easily be lazily copied from the master to a worker process. But we are missing the fact that elements of a dlist can be objects which keep some of their data outside R's heap, behind external pointers.
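A minimal illustration of that pitfall, outside ddR (hedged: it uses xml2, which is not part of ddR, and whose parsed documents live outside R's heap behind an external pointer). With a FORK driver this mostly works because of copy-on-write, but anything that has to pass through R serialization, e.g. a PSOCK-style transfer, does not:
library(xml2)
doc <- read_xml("<root><a/></root>")
# serialize()/unserialize() is what shipping an object to a PSOCK worker amounts to;
# the external pointer inside the object comes back as a null pointer.
doc2 <- unserialize(serialize(doc, NULL))
xml_name(doc)        # "root"
try(xml_name(doc2))  # errors: the external pointer is no longer valid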
The changes in #15 brought this bug out. Global variables are not exported to a PSOCK cluster. This causes the kmeans example to fail. A minimal example:
library(ddR)
globvar <- 100
f <- function(x) globvar
useBackend(parallel, type="FORK")
# Works fine
dl <- dlist(1:10, letters)
collect(dlapply(dl, f))
useBackend(parallel, type="PSOCK")
# Fails to find globvar
dl <- dlist(1:10, letters)
collect(dlapply(dl, f))
So I think we should make a call for how ddR should work for portability. Here's what I see as the options:
Right now 2) is the most appealing, because it's clear what's happening. 1) would not be enough; for example, I often compose a large function out of several small functions. 3) is appealing, but significantly more complex.
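In the meantime, one workaround that should behave the same on FORK and PSOCK backends (a sketch, relying only on dmapply's documented MoreArgs mechanism) is to pass the values explicitly instead of relying on globals:
globvar <- 100
f2 <- function(x, g) g
dl <- dlist(1:10, letters)
# Everything f2 needs is shipped through MoreArgs, so nothing has to be
# looked up in the master's global environment on the worker.
collect(dmapply(f2, dl, MoreArgs = list(g = globvar)))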
> a = as.dframe(iris)
> colnames(a)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
> a$Species
Species
<NA>
Levels: setosa versicolor virginica
> iris$Species
[1] setosa setosa setosa setosa setosa setosa
[7] setosa setosa setosa setosa setosa setosa
[13] setosa setosa setosa setosa setosa setosa
Probably the best way to fix this is to include dframes in the OO model as described in the design section of the wiki.
It would be better for the output of as.darray to be of type sparse_darray whenever the input is a sparse matrix.
I think it makes sense to extend the S4 method as() instead of using as.dlist, as.darray, and so on. Correct me if I missed something.
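A rough sketch of what that could look like (hedged: whether DObject is the right target class, rather than its subclasses, is an assumption on my part):
# Register S4 coercion methods so as(x, "DObject") dispatches to the
# existing helpers instead of callers using as.darray/as.dframe/as.dlist directly.
setAs("matrix",     "DObject", function(from) as.darray(from))
setAs("data.frame", "DObject", function(from) as.dframe(from))
setAs("list",       "DObject", function(from) as.dlist(from))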
It would be more convenient for programmers if parts(NULL) returned NULL instead of raising an error. Of course, the returned NULL would then have to be handled in dmapply. This way, a program would not need redundant guard code when an optional dobject can be NULL.
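For illustration, this is the kind of guard that callers currently have to write themselves and that the change would make unnecessary (maybe_parts is a hypothetical helper, not part of ddR):
# Guard an optional dobject before handing it to dmapply.
maybe_parts <- function(d) if (is.null(d)) NULL else parts(d)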
This is a question more than an issue. I would like to create a new backend using Service Fabric Stateful Services. Can you point me to some examples, or at least to where in the source I should be looking for the interfaces I need to implement to create my own backend?
Thanks.
Following up on #15, here are some ideas for improvements to ddR.
The changes below might require more conversation, since I don't know the reasons behind the design decisions:
This is too slow:
> system.time({
+ colnames(ds)
+ })
user system elapsed
16.015 41.366 206.624
Here ds is a dframe with 4 partitions. Each chunk is about 10 million rows and 8 columns. Running on a 2016 MacBook Pro with a FORK cluster.
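A possible workaround while this is slow (a sketch, assuming the dframe is partitioned by rows so every partition carries the same column names): compute the names on a single partition on a worker instead of collecting the whole dframe on the master.
# dmapply over just the first partition returns a one-element dlist whose
# only element is that partition's column names.
cn <- collect(dmapply(colnames, parts(ds)[1]))[[1]]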
When I run the current example kmeans script I see:
campus-030-003 ~/dev/ddR/examples $ Rscript dkmeans-example.R
Loading required package: methods
Welcome to 'ddR' (Distributed Data-structures in R)!
For more information, visit: https://github.com/vertica/ddR
Attaching package: ‘ddR’
The following objects are masked from ‘package:base’:
cbind, rbind
Generating data with rows= 1e+06 and cols= 100
Warning message:
did not converge in 10 iterations
training dkmeans model on distributed data: 12.153
Warning message:
did not converge in 10 iterations
training normal kmeans model on centralized data: 16.524
Now I've modified the kmeans example to this:
library(ddR)
library(kmeans.ddR)
nInst = 2 # Change level of parallelism
useBackend(parallel, executors = nInst)
# Uncomment the following lines to use Distributed R
#library(distributedR.ddR)
#useBackend(distributedR)
# Set up data size
numcols = 100
numrows = 100000
K = 3
set.seed(37)
centers = matrix(rnorm(K * numcols, sd = 10), nrow = K)
dnumrows = as.integer(numrows/nInst)
# Generate one partition of synthetic data: nrow points scattered around the K centers.
generateKMeansData <- function(id, centers, nrow, ncol) {
  offsets = matrix(rnorm(nrow * ncol), nrow = nrow, ncol = ncol)
  cluster_ids = sample.int(nrow(centers), nrow, replace = TRUE)
  feature_obs = centers[cluster_ids, ] + offsets
  feature_obs
}
cat(sprintf("Generating %d x %d matrix for clustering with %d means\n",
numrows, numcols, K))
dfeature <- dmapply(generateKMeansData, id = 1:nInst,
                    MoreArgs = list(centers = centers, nrow = dnumrows, ncol = numcols),
                    output.type = "darray",
                    combine = "rbind", nparts = c(nInst, 1))
cat("training dkmeans model on distributed data\n")
dtraining_time <- system.time(
dmodel <- dkmeans(dfeature, K, iter.max = 100)
)[3]
cat(dtraining_time, "\n")
feature <- collect(dfeature)
cat("training normal kmeans model on centralized data\n")
training_time <- system.time(
model <- kmeans(feature, K, iter.max = 100, algorithm = "Lloyd")
)[3]
cat(training_time, "\n")
And I see:
> dtraining_time <- system.time(
+ dmodel <- dkmeans(dfeature, K, iter.max = 100)
+ )[3]
Warning message:
did not converge in 100 iterations
So the distributed version doesn't converge when k = 3, while base R does. I'll follow up tomorrow by looking at the code for dkmeans.
The equivalent rbind works fine in base R, but on dframes it fails:
> df <- as.dframe(iris)
> df2 <- rbind(df, df)
Error in do_dmapply(ddR.env$driver, func = match.fun(FUN), ..., MoreArgs = MoreArgs, :
Each partition of the result should be of type = data.frame, to match with output.type =dframe
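Until rbind on dframes works, a possible workaround (a sketch, assuming row-wise partitioning and the same dmapply/parts/combine pattern used in the kmeans example above) is to stack the partitions of both dframes explicitly; each partition is already a data.frame, which satisfies the requirement in the error message:
df3 <- dmapply(function(p) p, c(parts(df), parts(df)),
               output.type = "dframe",
               combine = "rbind",
               nparts = c(2 * length(parts(df)), 1))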