
ddR's People

Contributors

clarkfitzg, etduwx, fun-indra, ironholds, lawremi, vishrutg, zeliff

ddR's Issues

head and tail on DObjects cause segfaults on Mac

This happens when I do a local install of ddR from source (R CMD INSTALL .) on a Mac laptop and then attempt to use head or tail on a darray or dframe. The issue is in the C++ code for getPartitionIdsAndOffsets, so it breaks all the unit tests that call it.

It does not happen on Linux (Ubuntu 16.04). It does not happen on the Mac when I use the current CRAN version of the package (ddR_0.1.2). So it could be some kind of configuration problem specific to my machine.

Here's an example:

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ddR_0.1.3

loaded via a namespace (and not attached):
[1] Rcpp_0.12.5
> a = darray(dim = c(10, 10), psize = c(10, 5))
> head(collect(a))
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    0    0    0    0    0    0    0    0    0     0
[2,]    0    0    0    0    0    0    0    0    0     0
[3,]    0    0    0    0    0    0    0    0    0     0
[4,]    0    0    0    0    0    0    0    0    0     0
[5,]    0    0    0    0    0    0    0    0    0     0
[6,]    0    0    0    0    0    0    0    0    0     0
> head(a)

 *** caught segfault ***
address 0x18, cause 'memory not mapped'

Traceback:
 1: .Call("ddR_getPartitionIdsAndOffsets", PACKAGE = "ddR", indices,
psizes, nparts)
 2: getPartitionIdsAndOffsets(list(sort(i), sort(j)), psize(x), nparts(x))
 3: x[1:n, 1:(dim(x)[[2]]), drop = FALSE]
 4: x[1:n, 1:(dim(x)[[2]]), drop = FALSE]
 5: head.DObject(a)
 6: head(a)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: 1
R is aborting now ...
Segmentation fault: 11

Possible to return DObject when subsetting?

Probably the most common operation in R is subsetting, i.e. [ and $. While trying to use ddR and looking at ops.R, I noticed that the subsetting operators all collect.

How would one return a distributed object from subsetting? For example, remove the 10% of the rows of a dframe that contain NA. The resulting dframe will still be large, so it's best to keep it distributed.

One idea is to have DObjects closed under subsetting and leave it to the user to call collect() explicitly; the explicit version of such an operation is sketched below.
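
For concreteness, a sketch of the explicit route that a distributed [ could wrap: filter each partition locally via dmapply over parts(), so the result stays distributed. The output.type/combine/nparts arguments follow the pattern of the kmeans example later on this page.

library(ddR)
useBackend(parallel)

df <- as.dframe(airquality)   # airquality contains rows with NA

# Filter each partition locally; the result is again a distributed dframe
drop_na <- function(part) part[complete.cases(part), , drop = FALSE]
filtered <- dmapply(drop_na, parts(df),
                    output.type = "dframe", combine = "rbind",
                    nparts = c(length(parts(df)), 1))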

Memory Issue

I am running into a memory issue with drandomForest. I receive this message, which we have all seen before:

Error in drandomForest.default(m, y, ..., ntree = ntree, nExecutor = nExecutor,  :
  cannot allocate vector of size 7.9 Gb

The function call:

tree.result <- drandomForest(Target ~ ., data = as.data.frame(signalDF),
                             mtry = predictors, nExecutor = 4)

Executing on a 64-bit machine with 32 GB of memory. The number of predictors is 64, and the dataset contains ~4 million rows. The size of the data is slightly over 2 GB. We are using the function in classification mode (y is set as a factor). The predictors are a combination of numeric variables and factors.

Please advise. Thanks!
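
For scale, a quick back-of-the-envelope check of the reported numbers (assuming double-precision values):

# ~4 million rows x 64 predictors of 8-byte doubles
4e6 * 64 * 8 / 2^30   # about 1.9 GiB, consistent with "slightly over 2GB"

Model fitting typically needs several transient copies of the data, and with nExecutor = 4 each worker may hold its own copy, so peak usage can far exceed the nominal 2 GB.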

Duplicated README files

The same README.md is in the top-level directory and the ddR directory. Maybe we can use a symlink or something so there's only one copy.

worker to worker communication

I read the article "dmapply: A functional primitive to express distributed machine learning algorithms in R" and found the following diagram:
[screenshot: diagram from the paper showing communication patterns, including worker-to-worker (peer-to-peer) communication]

It was a surprise to see peer-to-peer communication. I started investigating the ddR code and found the following: ddR.R#L278-L285. So essentially all communication goes through the master. Am I missing something, or is the diagram just misleading?

I'm not that experienced with snow clusters. Could peer-to-peer communication potentially be done within the parallel package framework?

Fork backend potential issues - external pointers

Looking at fork_driver.R, we can see that dmapply is essentially mcmapply. So we rely on the fact that every object can be cheaply and lazily copied from the master to a worker process. But we are missing the fact that elements of a dlist can be objects which keep some of their data outside R's heap, behind external pointers.
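
A minimal illustration of the hazard, using xml2 as an example of an externalptr-backed object (any such object behaves the same way): external pointers are serialized as null pointers, so these objects break whenever they cross a serialization boundary, e.g. when a worker returns them to the master.

library(xml2)

doc  <- read_xml("<root><child/></root>")
doc2 <- unserialize(serialize(doc, NULL))  # simulates a worker/master round-trip

xml_name(doc)    # "root"
xml_name(doc2)   # Error: external pointer is not valid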

DESIGN: will ddR support implicit use of global variables?

The changes in #15 brought this bug out. Global variables are not exported to a PSOCK cluster. This causes the kmeans example to fail. A minimal example:

library(ddR)

globvar <- 100
f <- function(x) globvar

useBackend(parallel, type="FORK")

# Works fine
dl <- dlist(1:10, letters)
collect(dlapply(dl, f))

useBackend(parallel, type="PSOCK")

# Fails to find globvar
dl <- dlist(1:10, letters)
collect(dlapply(dl, f))

So I think we should make a call on how ddR should work here, for portability. Here's what I see as the options:

  1. Only allow pure functions (the simplest approach)
  2. Add a parameter to pass an environment where the function is to be evaluated (supported by the Spark and parallel backends)
  3. Programmatically gather function dependencies from the code (SparkR does this)

Right now 2) is the most appealing, because it's clear what's happening. 1) would not be enough; for example, I often compose a large function out of several small functions. 3) is appealing, but significantly more complex.
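
A sketch of what option 2) looks like in terms of raw parallel primitives (the explicit cluster handling here is illustrative; ddR would do this internally when handed an environment):

library(parallel)

globsrc <- new.env()
globsrc$globvar <- 100

f <- function(x) globvar

cl <- makeCluster(2, type = "PSOCK")
clusterExport(cl, "globvar", envir = globsrc)  # ship globvar to each worker
clusterCall(cl, f, 1)                          # both workers now return 100
stopCluster(cl)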

BUG: $ method for dframes returns single NA

> a = as.dframe(iris)
> colnames(a)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
> a$Species
Species
   <NA>
Levels: setosa versicolor virginica
> iris$Species
  [1] setosa     setosa     setosa     setosa     setosa     setosa
  [7] setosa     setosa     setosa     setosa     setosa     setosa
 [13] setosa     setosa     setosa     setosa     setosa     setosa

Probably the best way to fix this is to include dframes in the OO model as described in the design section of the wiki.
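
Until then, collecting first is an obvious stopgap:

> head(collect(a)$Species)
[1] setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica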

parts(NULL) returns an error

It would be more convenient for programmers if parts(NULL) returned NULL instead of raising an error. Of course, the returned NULL would then have to be handled in dmapply. This way, a program would not need redundant code paths when an optional dobject can be NULL.
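
In the meantime, a one-line wrapper gives a program the proposed semantics (a sketch):

# NULL-safe variant of parts(), matching the behavior proposed above
parts_or_null <- function(x) if (is.null(x)) NULL else parts(x)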

Agent/Runner interfaces

This is a question more than an issue. I would like to create a new backend using Service Fabric stateful services. Can you point me to some examples, or at least to where in the source I should look for the interfaces I need to implement to create my own backend?

Thanks.
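
For orientation (not an official answer): the fork_driver.R mentioned in the external-pointers issue above is a compact example of a backend, and the tracebacks elsewhere on this page show the central generic is do_dmapply(driver, func, ...). A hypothetical skeleton; the ddRDriver class name and the method's formals are inferred from the ddR sources and should be verified there before use:

library(ddR)

# Hypothetical: a driver class for a Service Fabric backend
setClass("SFDriver", contains = "ddRDriver")

# do_dmapply is the generic each backend implements; the formals below
# are a guess based on the traceback in the rbind issue on this page
setMethod("do_dmapply", signature("SFDriver"),
  function(driver, func, ..., MoreArgs = list()) {
    # dispatch func over Service Fabric stateful services here
    stop("not implemented yet")
  })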

Refactoring wish list

Following up on #15 here are some ideas for improvements to ddR.

  • Makefile to automate testing, doc builds, and examples. See #15
  • Overhaul the init and useBackend methods. See #15
  • Internal documentation on ddR's programming model and how to write backends. See the ddR wiki
  • Rewrite examples for clarity, reproducibility, and best practices
  • Examples of reading / processing / writing actual data
  • Benchmarks for the various backends (from @dselivanov)
  • Simplify do_dmapply
  • Set up continuous integration on Travis (or a similar service); probably requires admin access to the repo
  • Make distributed objects act more like their local counterparts through more OO code

The changes below might require more conversation, since I don't know the reasons behind the design decisions:

  • Allow partitioning dataframes only on rows
  • Make ddR column major order like R
  • Change name from arrays to matrices

PERF: colnames() takes several minutes

This is too slow:

> system.time({
+ colnames(ds)
+ })
   user  system elapsed
 16.015  41.366 206.624

Here ds is a dframe with 4 partitions. Each chunk is about 10 million rows and 8 columns.

Running on a 2016 MacBook Pro with a FORK cluster.
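
Since every chunk has the same column names, one workaround is to compute them on a single partition and ship only the names back. A sketch, assuming the usual dmapply-over-parts() idiom:

# Apply colnames to the first partition only; only the names travel back
nm <- collect(dmapply(colnames, parts(ds)[1]))[[1]]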

example kmeans doesn't converge

When I run the current kmeans example script, I see:

campus-030-003 ~/dev/ddR/examples $ Rscript dkmeans-example.R 
Loading required package: methods

Welcome to 'ddR' (Distributed Data-structures in R)!
For more information, visit: https://github.com/vertica/ddR

Attaching package: ‘ddR’

The following objects are masked from ‘package:base’:

    cbind, rbind

Generating data with rows= 1e+06  and cols= 100 
Warning message:
did not converge in 10 iterations 
training dkmeans model on distributed data:  12.153 
Warning message:
did not converge in 10 iterations 
training normal kmeans model on centralized data:  16.524 

Now I've modified the kmeans example to this:

library(ddR)
library(kmeans.ddR)

nInst = 2 # Change level of parallelism
useBackend(parallel, executors = nInst)
# Uncomment the following lines to use Distributed R 
#library(distributedR.ddR)
#useBackend(distributedR)

# Set up data size
numcols = 100
numrows = 100000
K = 3

set.seed(37)
centers = matrix(rnorm(K * numcols, sd = 10), nrow = K)
dnumrows = as.integer(numrows/nInst)

generateKMeansData <- function(id, centers, nrow, ncol) {
    # use the nrow/ncol arguments rather than the globals dnumrows/numcols,
    # so the function does not depend on the calling environment
    offsets = matrix(rnorm(nrow * ncol), nrow = nrow, ncol = ncol)
    cluster_ids = sample.int(nrow(centers), nrow, replace = TRUE)
    feature_obs = centers[cluster_ids, ] + offsets
    feature_obs
}

cat(sprintf("Generating %d x %d matrix for clustering with %d means\n",
            numrows, numcols, K))

dfeature <- dmapply(generateKMeansData, id = 1:nInst,
                    MoreArgs = list(centers = centers, nrow = dnumrows, ncol = numcols),
                    output.type = "darray",
                    combine = "rbind", nparts = c(nInst, 1))

cat("training dkmeans model on distributed data\n")
dtraining_time <- system.time(
    dmodel <- dkmeans(dfeature, K, iter.max = 100)
)[3]
cat(dtraining_time, "\n")

feature <- collect(dfeature)
cat("training normal kmeans model on centralized data\n")
training_time <- system.time(
    model <- kmeans(feature, K, iter.max = 100, algorithm = "Lloyd")
)[3]
cat(training_time, "\n")

And I see:

> dtraining_time <- system.time(
+     dmodel <- dkmeans(dfeature, K, iter.max = 100)
+ )[3]
Warning message:
did not converge in 100 iterations 

So the distributed version doesn't converge when K = 3, while base R does. I'll follow up tomorrow by looking at the code for dkmeans.
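
One thing to check before digging into the code: whether the two implementations start from the same centers. A sketch of a controlled comparison; whether dkmeans accepts a matrix of starting centers the way stats::kmeans does is an assumption to verify in kmeans.ddR:

set.seed(37)
init <- feature[sample(nrow(feature), K), ]   # shared starting centers

model  <- kmeans(feature, centers = init, iter.max = 100, algorithm = "Lloyd")
dmodel <- dkmeans(dfeature, centers = init, iter.max = 100)  # if supported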

rbind, cbind fail on dataframes

This works fine in base R.

> df <- as.dframe(iris)
> df2 <- rbind(df, df)
Error in do_dmapply(ddR.env$driver, func = match.fun(FUN), ..., MoreArgs = MoreArgs,  :
  Each partition of the result should be of type = data.frame, to match with output.type =dframe
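
Until this is fixed, the same result can be built manually from both objects' partitions; a sketch following the dmapply pattern used elsewhere on this page:

p <- c(parts(df), parts(df))       # partitions of both inputs, in order
stacked <- dmapply(function(x) x, p,
                   output.type = "dframe", combine = "rbind",
                   nparts = c(length(p), 1))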
