s-u/iotools
High-performance I/O tools to run distributed R jobs seamlessly on Hadoop and handle chunk-wise data processing
We need to deal cleanly with Windows-style line endings; I just had a bug with unparsed '\r' characters.
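A minimal base-R sketch of the fix (the helper name `strip_cr` is hypothetical, not part of iotools): after splitting input on '\n' only, strip the trailing '\r' that CRLF endings leave on each line.

```r
# Hypothetical helper (not part of iotools): after splitting input on
# '\n' only, Windows-style CRLF endings leave a trailing '\r' on each
# line; drop it before parsing fields.
strip_cr <- function(lines) sub("\r$", "", lines)

strip_cr(c("a,b\r", "c,d\r", "e,f"))
# "a,b" "c,d" "e,f"
```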
write.csv.raw <- function(x, file = "", append = FALSE, sep = ",", nsep = "\t",
                          col.names = TRUE, fileEncoding = "") {
  if (is.character(file)) {
    file <- if (nzchar(fileEncoding))
      file(file, ifelse(append, "ab", "wb"), encoding = fileEncoding)
    else
      file(file, ifelse(append, "ab", "wb"))
    on.exit(close(file))
  } else if (!isOpen(file, "w")) {
    open(file, "wb")
    on.exit(close(file))
  }
  r <- as.output(x, sep = sep, nsep = nsep)
  writeBin(r, con = file)
}
This is an exact duplicate of its immediate predecessor.
I ran git clone https://github.com/s-u/iotools.git, but I cannot build it. Could you help me do so?
It would be nice if the sep argument to mstrsplit accepted a value like NA or NULL signaling that there are no separators and each row corresponds to a single element.
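Until such an option exists, the requested behavior can be emulated in base R (a sketch of the desired semantics, not the proposed API): treat each line as one element of a one-column matrix.

```r
# Emulate the requested sep = NA behavior: no field separators,
# each row of the input becomes one element of a one-column matrix.
x <- charToRaw("a\nb\nc\n")
m <- matrix(strsplit(rawToChar(x), "\n", fixed = TRUE)[[1]], ncol = 1)
m
#      [,1]
# [1,] "a"
# [2,] "b"
# [3,] "c"
```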
There is a bug in write.csv.raw() with this line:
Line 135 in d418a82
rawToChar() is not necessary and creates a bug: writeBin() appends a null character to the output when its input is a character vector. The file then cannot be read with read.csv.raw() because the first character of the data is a null byte (at the end of the column names, after the \n).
I'm seeing the following in R 3.1.1 on both Mac and Linux.
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
vagrant@vagrant-ubuntu-trusty-64:~$ R CMD INSTALL iotools
* installing to library ‘/home/vagrant/R/x86_64-pc-linux-gnu-library/3.1’
* installing *source* package ‘iotools’ ...
** libs
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -I../inst/include -fpic -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -g -c as_output.c -o as_output.o
In file included from as_output.c:2:0:
as_output.c: In function ‘dybuf_collect’:
as_output.c:64:24: error: ‘HEAD’ undeclared (first use in this function)
int l = LENGTH(CAR(HEAD));
^
/usr/share/R/include/Rinternals.h:319:43: note: in definition of macro ‘SHORT_VEC_LENGTH’
# define SHORT_VEC_LENGTH(x) (((VECSEXP) (x))->vecsxp.length)
^
/usr/share/R/include/Rinternals.h:325:21: note: in expansion of macro ‘IS_LONG_VEC’
# define LENGTH(x) (IS_LONG_VEC(x) ? R_BadLongVector(x, __FILE__, __LINE__) : SHORT_VEC_LENGTH(x))
^
as_output.c:64:13: note: in expansion of macro ‘LENGTH’
int l = LENGTH(CAR(HEAD));
^
as_output.c:64:20: note: in expansion of macro ‘CAR’
int l = LENGTH(CAR(HEAD));
^
as_output.c:64:24: note: each undeclared identifier is reported only once for each function it appears in
int l = LENGTH(CAR(HEAD));
^
/usr/share/R/include/Rinternals.h:319:43: note: in definition of macro ‘SHORT_VEC_LENGTH’
# define SHORT_VEC_LENGTH(x) (((VECSEXP) (x))->vecsxp.length)
^
/usr/share/R/include/Rinternals.h:325:21: note: in expansion of macro ‘IS_LONG_VEC’
# define LENGTH(x) (IS_LONG_VEC(x) ? R_BadLongVector(x, __FILE__, __LINE__) : SHORT_VEC_LENGTH(x))
^
as_output.c:64:13: note: in expansion of macro ‘LENGTH’
int l = LENGTH(CAR(HEAD));
^
as_output.c:64:20: note: in expansion of macro ‘CAR’
int l = LENGTH(CAR(HEAD));
^
as_output.c:70:13: warning: passing argument 1 of ‘Rf_unprotect’ makes integer from pointer without a cast
UNPROTECT(res);
^
/usr/share/R/include/Rinternals.h:633:35: note: in definition of macro ‘UNPROTECT’
#define UNPROTECT(n) Rf_unprotect(n)
^
/usr/share/R/include/Rinternals.h:1258:6: note: expected ‘int’ but argument is of type ‘SEXP’
void Rf_unprotect(int);
^
In file included from as_output.c:9:0:
as_output.c: At top level:
../inst/include/utils.h:14:19: warning: inline function ‘Rspace’ declared but never defined
R_INLINE Rboolean Rspace(unsigned int c);
^
../inst/include/utils.h:14:19: warning: inline function ‘Rspace’ declared but never defined
/usr/lib/R/etc/Makeconf:128: recipe for target 'as_output.o' failed
make: *** [as_output.o] Error 1
ERROR: compilation failed for package ‘iotools’
* removing ‘/home/vagrant/R/x86_64-pc-linux-gnu-library/3.1/iotools’
Could input.file take the function used to open the connection as an argument? I'd like to be able to call input.file on compressed files.
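A base-R sketch of the requested interface (`read_lines_via` and `open_with` are hypothetical names, not part of iotools): let the caller supply the function used to open the connection, so gzfile can be passed for compressed input.

```r
# Hypothetical sketch of the requested interface: the caller supplies
# the connection-opening function (file, gzfile, bzfile, ...).
read_lines_via <- function(path, open_with = file) {
  con <- open_with(path, "r")
  on.exit(close(con))
  readLines(con)
}

# Usage with a compressed file:
tmp <- tempfile(fileext = ".gz")
gz <- gzfile(tmp, "w"); writeLines(c("a,1", "b,2"), gz); close(gz)
read_lines_via(tmp, open_with = gzfile)
# "a,1" "b,2"
unlink(tmp)
```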
Currently we do not do any special parsing of quotes (other than
always converting columns with quotes to characters instead of
other types). It would be easy to at least do some basic quote
parsing, even if we don't go as far as parsing nested/embedded
strings.
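For reference, base R's scan() illustrates the basic behavior described here (a comparison point, not the iotools implementation): a field wrapped in quotes keeps the separator inside it.

```r
# Base-R reference for basic quote handling: the quoted field "b,c"
# survives splitting on the comma separator.
line <- 'a,"b,c",d'
scan(text = line, what = character(), sep = ",", quote = '"', quiet = TRUE)
# "a"   "b,c" "d"
```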
I am trying out the iotools package in pseudo-distributed mode with CentOS and CDH4 (I have tested to ensure other Hadoop jobs are running). I have installed the package and configured HADOOP_PREFIX.
To start with, I am trying to run an identity MapReduce job using the hmr() function, but it is giving the following error:
" Could not find or load main class org.apache.hadoop.util.PlatformName"
" Could not find or load main class org.apache.hadoop.util.RunJar"
These classes are from the Hadoop core jar. Not sure where I am going wrong.
It would be really great if you could shed some light on what the issue might be.
imstrsplit should have a prefetch function so that subsequent chunks can be requested when it is run in a foreach loop.
write.csv(iris, "iris.csv", row.names=FALSE, quote=FALSE)
a = iotools:::read.table.raw("iris.csv", header=TRUE, sep=",", colClasses=c(rep("numeric", 4), "character"))
head(a)
## V1 V2 V3 V4 V5
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
By complete accident while doing some debugging, I realized that (despite having read the argument list in ?dstrfw several times), contrary to what I had been doing, dstrfw provides functionality to handle input column names directly (rather than fixing them up ex post).
I only noticed this because I was reading the function's source code; it took very careful poring over every word of the rest of ?dstrfw to find the morsel at the end of the Value section that documents this feature.
It's a great feature, so it should get higher prominence! In particular, it should definitely be mentioned in the col_types argument explanation.
I'm happy to submit a PR to this end.
library(iotools)
download.file("http://euler.stat.yale.edu/~mjk56/fix_sample.tsv", "fix_sample.tsv")
dstrsplit(read.chunk(chunk.reader("fix_sample.tsv")), col_types=rep("character", 46))
produces the error message "embedded nul in string". It would be nice if dstrsplit could handle this for us.
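As a work-around until dstrsplit handles this, NUL bytes can be dropped from the raw chunk before parsing (`strip_nul` is a hypothetical helper, not part of iotools):

```r
# Hypothetical helper: remove embedded NUL bytes from a raw chunk so
# the parser no longer fails with "embedded nul in string".
strip_nul <- function(r) r[r != as.raw(0L)]

chunk <- as.raw(c(0x61, 0x00, 0x62))  # the bytes "a", NUL, "b"
rawToChar(strip_nul(chunk))
# "ab"
```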
The iotools:::input function checks whether the input is a connection using is.connection. This does not appear to be a defined function in R version 3.1.2.
Purpose: split the iris data frame by the column 'Species' and then count the number of rows.
Problem: using chunk.tapply seems to give an unexpected result.
The plain way,
sapply(split(iris, iris$Species), nrow)
yields the (named) vector c(50, 50, 50).
Trying to reproduce this with chunk.tapply:
write.table(iris[, c(5, 1:4)], "iris.csv",
            row.names = FALSE, sep = ",", col.names = FALSE)
View(read.csv("iris.csv", header = FALSE))
library("iotools")
chunk.tapply(input = "iris.csv",
             FUN = function(x) {
               df <- dstrsplit(x,
                               col_types = c("character", rep("numeric", 4)),
                               sep = ",")
               # print(df)  # to observe what's happening
               nrow(df)
             },
             CH.MERGE = c,
             sep = ",")
yields c(100, 50). Printing the data frames after dstrsplit shows that the rows corresponding to 'setosa' and 'versicolor' aren't treated as different data frames. Is there something wrong with what I am doing?
iotools version: 0.1-12 (the one on CRAN)
Calling read.csv.raw fails if all rows are skipped and colClasses is absent. The built-in function utils::read.csv, after which the interface of the really great and fast iotools::read.csv.raw is modelled, does not fail. Would it be possible to catch this admittedly special case? In my case it would make programming a bit easier.
Many thanks,
Daniel
A short reproducible example:
packageVersion("iotools") # ‘0.2.3’
x <- matrix(sample(100), ncol = 10)
write.csv(x, file = "test.csv", row.names = FALSE)
read.csv("test.csv", skip = 10) # dim == c(0, 11)
read.csv("test.csv", skip = 10, colClasses = rep("integer", 10)) # dim == c(0, 11)
read.csv.raw("test.csv", skip = 10) # Error in subset[, i] : subscript out of bounds
read.csv.raw("test.csv", skip = 10, colClasses = rep("integer", 10)) # dim == c(0, 11)
unlink("test.csv")
By the way, I prefer the real/expected column names returned by read.csv.raw whenever colClasses is specified to the unexpected column names returned by read.csv ('X' concatenated with the values of the last, here skipped, line, instead of "V" followed by the column number, which ?read.csv documents as the default).
Calling ctapply() on matrices results in a memory footprint which cannot be removed by gc().
mat <- matrix(rnorm(1e7), 1e2, 1e5)
idx <- sort(rep_len(1:20, nrow(mat)))
gc(reset = TRUE)
str( ctapply(mat, idx, colSums, MERGE = rbind) )
gc(reset = TRUE)
str( ctapply(mat, idx, colSums, MERGE = rbind) )
gc(reset = TRUE)
My session info:
R version 3.1.2 (2014-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] iotools_0.1-6
loaded via a namespace (and not attached):
[1] parallel_3.1.2
When a raw vector is given to mstrsplit, we memchr forward to the first newline in the process of finding the column headers. This causes a NULL pointer dereference when there is only one row of data, and a rather nasty bug.
iotools::mstrsplit(charToRaw("asdf|asdf|asdf"))
I think there is a problem with my parallel implementation of chunk.apply. I'll follow up with a good example and hopefully a diagnosis and pull request.
Just to inform you about a typo: "setep" instead of "step" in the Note section:
https://github.com/s-u/iotools/blob/master/man/chunk.apply.Rd#L35
> dstrsplit(charToRaw("1|0|\n1||\n"), list(a=1,b=1L,NA))
a b
1 1 0
2 1 0
should be
1 1 0
2 1 NA
(PS: a work-around is to use numeric, as floats are parsed correctly)