vertica / ddR
Standard API for Distributed Data Structures in R
License: GNU General Public License v2.0
This happens when I do a local install of ddR from source (R CMD INSTALL .) on a Mac laptop and then attempt to use head or tail on a darray or dframe. The issue is in the C++ for getPartitionIdsAndOffsets, so it breaks all the unit tests that call this.
It does not happen on Linux (Ubuntu 16.04). It does not happen on the Mac when I use the current CRAN version of the package (ddR_0.1.2). So it could be some kind of configuration problem specific to my machine.
Here's an example:
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ddR_0.1.3
loaded via a namespace (and not attached):
[1] Rcpp_0.12.5
> a = darray(dim = c(10, 10), psize = c(10, 5))
> head(collect(a))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0 0 0 0 0 0 0 0 0 0
[2,] 0 0 0 0 0 0 0 0 0 0
[3,] 0 0 0 0 0 0 0 0 0 0
[4,] 0 0 0 0 0 0 0 0 0 0
[5,] 0 0 0 0 0 0 0 0 0 0
[6,] 0 0 0 0 0 0 0 0 0 0
> head(a)
*** caught segfault ***
address 0x18, cause 'memory not mapped'
Traceback:
1: .Call("ddR_getPartitionIdsAndOffsets", PACKAGE = "ddR", indices,
psizes, nparts)
2: getPartitionIdsAndOffsets(list(sort(i), sort(j)), psize(x), nparts(x))
3: x[1:n, 1:(dim(x)[[2]]), drop = FALSE]
4: x[1:n, 1:(dim(x)[[2]]), drop = FALSE]
5: head.DObject(a)
6: head(a)
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: 1
R is aborting now ...
Segmentation fault: 11
Probably the most common operation in R is subsetting, i.e. [ and $. While trying to use ddR and looking at ops.R, I notice that the subsetting operators all collect.
How would one return a distributed object from subsetting? For example, remove the 10% of the rows of a dframe that contain NA. The resulting dframe will still be large, so it's best to keep it distributed.
One idea is to have DObjects closed under subsetting and leave it to the user to call collect() explicitly.
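For the NA example above, here is a sketch of one way to keep the result distributed today using only dmapply and parts, rather than a statement about what ops.R should do. It assumes a row-partitioned dframe named df and that ddR tolerates unevenly sized row partitions after filtering:
# Filter NA rows partition by partition so the result stays distributed.
drop_na_rows <- function(part) part[complete.cases(part), , drop = FALSE]
clean <- dmapply(drop_na_rows, parts(df),
                 output.type = "dframe",
                 combine = "rbind",
                 nparts = c(length(parts(df)), 1))
The result stays a dframe, and the user still decides when (or whether) to call collect().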
I am running into a memory issue with drandomForest. I receive this message, which we have all seen before:
Error in drandomForest.default(m, y, ..., ntree = ntree, nExecutor = nExecutor, :
cannot allocate vector of size 7.9 Gb
The function call:
tree.result <- drandomForest(Target ~ ., data = as.data.frame(signalDF), mtry = predictors, nExecutor = 4)
Executing on a 64-bit machine with 32 GB of memory. The number of predictors is 64, and the dataset contains ~4 million rows. The size of the data is slightly over 2 GB. We are using the function in classification mode (y is set as a factor). Predictors are a combination of numeric variables and factors.
Please advise. Thanks!
The same README.md is in the top-level directory and the ddR directory. Maybe we can use a symlink or something so there's only one copy.
I read the article "dmapply: A functional primitive to express distributed machine learning algorithms in R", and a diagram there shows peer-to-peer communication between the workers. That was a surprise. I started investigating the ddR code and found the following: ddR.R#L278-L285. So essentially all communication goes through the master. Am I missing something, or is the diagram just misleading?
I'm not that experienced with snow clusters. Could peer-to-peer communication potentially be done within the parallel package framework?
Looking at fork_driver.R, we can see that dmapply is essentially mcmapply. So we rely on the fact that every object can easily be lazily copied from the master to a worker process. But we are missing the fact that elements of a dlist can be objects which keep some of their data outside R's heap, behind external pointers.
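A minimal illustration of that pitfall, outside ddR (hedged: it uses xml2, which is not part of ddR, and whose parsed documents live outside R's heap behind an external pointer). With a FORK driver this mostly works because of copy-on-write, but anything that has to pass through R serialization, e.g. a PSOCK-style transfer, does not:
library(xml2)
doc <- read_xml("<root><a/></root>")
# serialize()/unserialize() is what shipping an object to a PSOCK worker amounts to;
# the external pointer inside the object comes back as a null pointer.
doc2 <- unserialize(serialize(doc, NULL))
xml_name(doc)        # "root"
try(xml_name(doc2))  # errors: the external pointer is no longer valid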
The changes in #15 brought this bug out. Global variables are not exported to a PSOCK cluster. This causes the kmeans example to fail. A minimal example:
library(ddR)
globvar <- 100
f <- function(x) globvar
useBackend(parallel, type="FORK")
# Works fine
dl <- dlist(1:10, letters)
collect(dlapply(dl, f))
useBackend(parallel, type="PSOCK")
# Fails to find globvar
dl <- dlist(1:10, letters)
collect(dlapply(dl, f))
So I think we should make a call for how ddR should work for portability. Here's what I see as the options:
Right now 2) is the most appealing, because it's clear what's happening. 1) would not be enough; for example, I often compose a large function out of several small functions. 3) is appealing, but significantly more complex.
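In the meantime, one workaround that should behave the same on FORK and PSOCK backends (a sketch, relying only on dmapply's documented MoreArgs mechanism) is to pass the values explicitly instead of relying on globals:
globvar <- 100
f2 <- function(x, g) g
dl <- dlist(1:10, letters)
# Everything f2 needs is shipped through MoreArgs, so nothing has to be
# looked up in the master's global environment on the worker.
collect(dmapply(f2, dl, MoreArgs = list(g = globvar)))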
> a = as.dframe(iris)
> colnames(a)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
> a$Species
Species
<NA>
Levels: setosa versicolor virginica
> iris$Species
[1] setosa setosa setosa setosa setosa setosa
[7] setosa setosa setosa setosa setosa setosa
[13] setosa setosa setosa setosa setosa setosa
Probably the best way to fix this is to include dframes in the OO model as described in the design section of the wiki.
It would be better for the output of as.darray to be of type sparse_darray whenever the input is a sparse matrix.
I think it makes sense to extend the S4 method as() instead of using as.dlist, as.darray, and so on. Correct me if I missed something.
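A rough sketch of what that could look like (hedged: whether DObject is the right target class, rather than its subclasses, is an assumption on my part):
# Register S4 coercion methods so as(x, "DObject") dispatches to the
# existing helpers instead of callers using as.darray/as.dframe/as.dlist directly.
setAs("matrix",     "DObject", function(from) as.darray(from))
setAs("data.frame", "DObject", function(from) as.dframe(from))
setAs("list",       "DObject", function(from) as.dlist(from))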
It would be more convenient for programmers if parts(NULL) returned NULL instead of raising an error. Of course, the returned NULL would then have to be handled in dmapply. This way, a program would not need redundant guard code when an optional dobject can be NULL.
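For illustration, this is the kind of guard that callers currently have to write themselves and that the change would make unnecessary (maybe_parts is a hypothetical helper, not part of ddR):
# Guard an optional dobject before handing it to dmapply.
maybe_parts <- function(d) if (is.null(d)) NULL else parts(d)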
This is a question more than an issue. I would like to create a new backend using Service Fabric Stateful Services. Can you point me to some examples, or at least to where in the source I should be looking for the interfaces I need to implement to create my own backend?
Thanks.
Following up on #15, here are some ideas for improvements to ddR.
The changes below might require more conversation, since I don't know the reasons behind the design decisions:
This is too slow:
> system.time({
+ colnames(ds)
+ })
user system elapsed
16.015 41.366 206.624
Here ds is a dframe with 4 partitions. Each chunk is about 10 million rows and 8 columns. Running on a 2016 MacBook Pro with a FORK cluster.
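A possible workaround while this is slow (a sketch, assuming the dframe is partitioned by rows so every partition carries the same column names): compute the names on a single partition on a worker instead of collecting the whole dframe on the master.
# dmapply over just the first partition returns a one-element dlist whose
# only element is that partition's column names.
cn <- collect(dmapply(colnames, parts(ds)[1]))[[1]]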
When I run the current example kmeans script I see:
campus-030-003 ~/dev/ddR/examples $ Rscript dkmeans-example.R
Loading required package: methods
Welcome to 'ddR' (Distributed Data-structures in R)!
For more information, visit: https://github.com/vertica/ddR
Attaching package: ‘ddR’
The following objects are masked from ‘package:base’:
cbind, rbind
Generating data with rows= 1e+06 and cols= 100
Warning message:
did not converge in 10 iterations
training dkmeans model on distributed data: 12.153
Warning message:
did not converge in 10 iterations
training normal kmeans model on centralized data: 16.524
Now I've modified the kmeans example to this:
library(ddR)
library(kmeans.ddR)
nInst = 2 # Change level of parallelism
useBackend(parallel, executors = nInst)
# Uncomment the following lines to use Distributed R
#library(distributedR.ddR)
#useBackend(distributedR)
# Set up data size
numcols = 100
numrows = 100000
K = 3
set.seed(37)
centers = matrix(rnorm(K * numcols, sd = 10), nrow = K)
dnumrows = as.integer(numrows/nInst)
# Generate one partition of synthetic data: nrow points scattered around the K centers.
generateKMeansData <- function(id, centers, nrow, ncol) {
  offsets = matrix(rnorm(nrow * ncol), nrow = nrow, ncol = ncol)
  cluster_ids = sample.int(nrow(centers), nrow, replace = TRUE)
  feature_obs = centers[cluster_ids, ] + offsets
  feature_obs
}
cat(sprintf("Generating %d x %d matrix for clustering with %d means\n",
numrows, numcols, K))
dfeature <- dmapply(generateKMeansData, id = 1:nInst,
                    MoreArgs = list(centers = centers, nrow = dnumrows, ncol = numcols),
                    output.type = "darray",
                    combine = "rbind", nparts = c(nInst, 1))
cat("training dkmeans model on distributed data\n")
dtraining_time <- system.time(
dmodel <- dkmeans(dfeature, K, iter.max = 100)
)[3]
cat(dtraining_time, "\n")
feature <- collect(dfeature)
cat("training normal kmeans model on centralized data\n")
training_time <- system.time(
model <- kmeans(feature, K, iter.max = 100, algorithm = "Lloyd")
)[3]
cat(training_time, "\n")
And I see:
> dtraining_time <- system.time(
+ dmodel <- dkmeans(dfeature, K, iter.max = 100)
+ )[3]
Warning message:
did not converge in 100 iterations
So the distributed version doesn't converge when k = 3, while base R does. I'll follow up tomorrow by looking at the code for dkmeans.
The equivalent rbind works fine in base R, but on dframes it fails:
> df <- as.dframe(iris)
> df2 <- rbind(df, df)
Error in do_dmapply(ddR.env$driver, func = match.fun(FUN), ..., MoreArgs = MoreArgs, :
Each partition of the result should be of type = data.frame, to match with output.type =dframe
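Until rbind on dframes works, a possible workaround (a sketch, assuming row-wise partitioning and the same dmapply/parts/combine pattern used in the kmeans example above) is to stack the partitions of both dframes explicitly; each partition is already a data.frame, which satisfies the requirement in the error message:
df3 <- dmapply(function(p) p, c(parts(df), parts(df)),
               output.type = "dframe",
               combine = "rbind",
               nparts = c(2 * length(parts(df)), 1))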