s-u/iotools
High-performance I/O tools to run distributed R jobs seamlessly on Hadoop and handle chunk-wise data processing
We need to deal cleanly with Windows-style line endings; I just had a bug with unparsed '\r' characters.
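A minimal base-R sketch of the fix (the helper name `strip_cr` is hypothetical, not part of iotools): after splitting input on '\n' only, strip the trailing '\r' that CRLF endings leave on each line.

```r
# Hypothetical helper (not part of iotools): after splitting input on
# '\n' only, Windows-style CRLF endings leave a trailing '\r' on each
# line; drop it before parsing fields.
strip_cr <- function(lines) sub("\r$", "", lines)

strip_cr(c("a,b\r", "c,d\r", "e,f"))
# "a,b" "c,d" "e,f"
```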
write.csv.raw <- function(x, file = "", append = FALSE, sep = ",", nsep = "\t",
                          col.names = TRUE, fileEncoding = "") {
  if (is.character(file)) {
    file <- if (nzchar(fileEncoding))
      file(file, ifelse(append, "ab", "wb"), encoding = fileEncoding)
    else
      file(file, ifelse(append, "ab", "wb"))
    on.exit(close(file))
  } else if (!isOpen(file, "w")) {
    open(file, "wb")
    on.exit(close(file))
  }
  r <- as.output(x, sep = sep, nsep = nsep)
  writeBin(r, con = file)
}
This is an exact duplicate of its immediate predecessor.
I ran git clone https://github.com/s-u/iotools.git, but I cannot build it. Could you help me do so?
It would be nice if the sep argument to mstrsplit accepted a value like NA or NULL signaling that there are no separators and each row corresponds to a single element.
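Until such an option exists, the requested behavior can be emulated in base R (a sketch of the desired semantics, not the proposed API): treat each line as one element of a one-column matrix.

```r
# Emulate the requested sep = NA behavior: no field separators,
# each row of the input becomes one element of a one-column matrix.
x <- charToRaw("a\nb\nc\n")
m <- matrix(strsplit(rawToChar(x), "\n", fixed = TRUE)[[1]], ncol = 1)
m
#      [,1]
# [1,] "a"
# [2,] "b"
# [3,] "c"
```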
There is a bug in write.csv.raw() with this line:
Line 135 in d418a82
rawToChar() is not necessary and creates a bug: writeBin() appends a null character to the output when its input is a character vector. The file then cannot be read with read.csv.raw() because the first character of the data is a null byte (at the end of the column names, after the \n).
I'm seeing the following in R 3.1.1 on both Mac and Linux.
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
vagrant@vagrant-ubuntu-trusty-64:~$ R CMD INSTALL iotools
* installing to library ‘/home/vagrant/R/x86_64-pc-linux-gnu-library/3.1’
* installing *source* package ‘iotools’ ...
** libs
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -I../inst/include -fpic -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -g -c as_output.c -o as_output.o
In file included from as_output.c:2:0:
as_output.c: In function ‘dybuf_collect’:
as_output.c:64:24: error: ‘HEAD’ undeclared (first use in this function)
int l = LENGTH(CAR(HEAD));
^
/usr/share/R/include/Rinternals.h:319:43: note: in definition of macro ‘SHORT_VEC_LENGTH’
# define SHORT_VEC_LENGTH(x) (((VECSEXP) (x))->vecsxp.length)
^
/usr/share/R/include/Rinternals.h:325:21: note: in expansion of macro ‘IS_LONG_VEC’
# define LENGTH(x) (IS_LONG_VEC(x) ? R_BadLongVector(x, __FILE__, __LINE__) : SHORT_VEC_LENGTH(x))
^
as_output.c:64:13: note: in expansion of macro ‘LENGTH’
int l = LENGTH(CAR(HEAD));
^
as_output.c:64:20: note: in expansion of macro ‘CAR’
int l = LENGTH(CAR(HEAD));
^
as_output.c:64:24: note: each undeclared identifier is reported only once for each function it appears in
int l = LENGTH(CAR(HEAD));
^
/usr/share/R/include/Rinternals.h:319:43: note: in definition of macro ‘SHORT_VEC_LENGTH’
# define SHORT_VEC_LENGTH(x) (((VECSEXP) (x))->vecsxp.length)
^
/usr/share/R/include/Rinternals.h:325:21: note: in expansion of macro ‘IS_LONG_VEC’
# define LENGTH(x) (IS_LONG_VEC(x) ? R_BadLongVector(x, __FILE__, __LINE__) : SHORT_VEC_LENGTH(x))
^
as_output.c:64:13: note: in expansion of macro ‘LENGTH’
int l = LENGTH(CAR(HEAD));
^
as_output.c:64:20: note: in expansion of macro ‘CAR’
int l = LENGTH(CAR(HEAD));
^
as_output.c:70:13: warning: passing argument 1 of ‘Rf_unprotect’ makes integer from pointer without a cast
UNPROTECT(res);
^
/usr/share/R/include/Rinternals.h:633:35: note: in definition of macro ‘UNPROTECT’
#define UNPROTECT(n) Rf_unprotect(n)
^
/usr/share/R/include/Rinternals.h:1258:6: note: expected ‘int’ but argument is of type ‘SEXP’
void Rf_unprotect(int);
^
In file included from as_output.c:9:0:
as_output.c: At top level:
../inst/include/utils.h:14:19: warning: inline function ‘Rspace’ declared but never defined
R_INLINE Rboolean Rspace(unsigned int c);
^
../inst/include/utils.h:14:19: warning: inline function ‘Rspace’ declared but never defined
/usr/lib/R/etc/Makeconf:128: recipe for target 'as_output.o' failed
make: *** [as_output.o] Error 1
ERROR: compilation failed for package ‘iotools’
* removing ‘/home/vagrant/R/x86_64-pc-linux-gnu-library/3.1/iotools’
Could input.file take the function used to open the connection as an argument? I'd like to be able to call input.file on compressed files.
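A base-R sketch of the requested interface (`read_lines_via` and `open_with` are hypothetical names, not part of iotools): let the caller supply the function used to open the connection, so gzfile can be passed for compressed input.

```r
# Hypothetical sketch of the requested interface: the caller supplies
# the connection-opening function (file, gzfile, bzfile, ...).
read_lines_via <- function(path, open_with = file) {
  con <- open_with(path, "r")
  on.exit(close(con))
  readLines(con)
}

# Usage with a compressed file:
tmp <- tempfile(fileext = ".gz")
gz <- gzfile(tmp, "w"); writeLines(c("a,1", "b,2"), gz); close(gz)
read_lines_via(tmp, open_with = gzfile)
# "a,1" "b,2"
unlink(tmp)
```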
Currently we do not do any special parsing of quotes (other than
always converting columns with quotes to characters instead of
other types). It would be easy to at least do some basic quote
parsing, even if we don't go as far as parsing nested/embedded
strings.
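For reference, base R's scan() illustrates the basic behavior described here (a comparison point, not the iotools implementation): a field wrapped in quotes keeps the separator inside it.

```r
# Base-R reference for basic quote handling: the quoted field "b,c"
# survives splitting on the comma separator.
line <- 'a,"b,c",d'
scan(text = line, what = character(), sep = ",", quote = '"', quiet = TRUE)
# "a"   "b,c" "d"
```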
I am trying out the iotools package in pseudo-distributed mode with CentOS and CDH4 (I have tested to ensure other Hadoop jobs are running). I have installed the package and configured HADOOP_PREFIX.
To start with, I am trying to run an identity MapReduce job using the hmr() function, but it is giving the following error:
" Could not find or load main class org.apache.hadoop.util.PlatformName"
" Could not find or load main class org.apache.hadoop.util.RunJar"
These classes are from the Hadoop core jar. Not sure where I am going wrong.
It would be really great if you could shed some light on what the issue might be.
imstrsplit should have a prefetch function so that subsequent chunks can be requested when it is run in a foreach loop.
write.csv(iris, "iris.csv", row.names=FALSE, quote=FALSE)
a = iotools:::read.table.raw("iris.csv", header=TRUE, sep=",", colClasses=c(rep("numeric", 4), "character"))
head(a)
## V1 V2 V3 V4 V5
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
By complete accident while doing some debugging, I realized that (despite having read the argument list in ?dstrfw several times), contrary to what I had been doing, dstrfw provides functionality to handle input column names directly (rather than fixing them up ex post).
I only noticed this because I was reading the function's source code; it took very careful poring over every word of the rest of ?dstrfw to find the morsel at the end of the Value section that documents this feature.
It's a great feature, so it should get higher prominence! In particular, it should definitely be mentioned in the col_types argument explanation.
I'm happy to submit a PR to this end.
library(iotools)
download.file("http://euler.stat.yale.edu/~mjk56/fix_sample.tsv", "fix_sample.tsv")
dstrsplit(read.chunk(chunk.reader("fix_sample.tsv")), col_types=rep("character", 46))
produces the error message "embedded nul in string". It would be nice if dstrsplit could handle this for us.
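As a work-around until dstrsplit handles this, NUL bytes can be dropped from the raw chunk before parsing (`strip_nul` is a hypothetical helper, not part of iotools):

```r
# Hypothetical helper: remove embedded NUL bytes from a raw chunk so
# the parser no longer fails with "embedded nul in string".
strip_nul <- function(r) r[r != as.raw(0L)]

chunk <- as.raw(c(0x61, 0x00, 0x62))  # the bytes "a", NUL, "b"
rawToChar(strip_nul(chunk))
# "ab"
```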
The iotools:::input function checks whether the input is a connection using is.connection. This does not appear to be a defined function in R version 3.1.2.
Purpose: split the iris data frame by the column 'Species' and then count the number of rows.
Problem: using chunk.tapply seems to give an unexpected result.
The plain way,
sapply(split(iris, iris$Species), nrow)
yields the (named) vector c(50, 50, 50).
Trying to reproduce this with chunk.tapply:
write.table(iris[, c(5, 1:4)], "iris.csv",
            row.names = FALSE, sep = ",", col.names = FALSE)
View(read.csv("iris.csv", header = FALSE))
library("iotools")
chunk.tapply(input = "iris.csv",
             FUN = function(x) {
               df <- dstrsplit(x,
                               col_types = c("character", rep("numeric", 4)),
                               sep = ",")
               # print(df)  # to observe what's happening
               nrow(df)
             },
             CH.MERGE = c,
             sep = ",")
yields c(100, 50). Printing the data frames after dstrsplit shows that the rows corresponding to 'setosa' and 'versicolor' aren't treated as different data frames. Is there something wrong with what I am doing?
iotools version: 0.1-12 (the one on CRAN)
Calling read.csv.raw fails if all rows are skipped and colClasses is absent. The built-in function utils::read.csv, after which the interface of the really great and fast iotools::read.csv.raw is modelled, does not fail. Would it be possible to catch this admittedly special case? In my case it would make programming a bit easier.
Many thanks,
Daniel
A short reproducible example:
packageVersion("iotools") # ‘0.2.3’
x <- matrix(sample(100), ncol = 10)
write.csv(x, file = "test.csv", row.names = FALSE)
read.csv("test.csv", skip = 10) # dim == c(0, 11)
read.csv("test.csv", skip = 10, colClasses = rep("integer", 10)) # dim == c(0, 11)
read.csv.raw("test.csv", skip = 10) # Error in subset[, i] : subscript out of bounds
read.csv.raw("test.csv", skip = 10, colClasses = rep("integer", 10)) # dim == c(0, 11)
unlink("test.csv")
By the way, I prefer the real/expected column names returned by read.csv.raw whenever colClasses is specified to the unexpected column names returned by read.csv ('X' concatenated with the values of the last, here skipped, line, instead of "V" followed by the column number, which ?read.csv documents as the default).
Calling ctapply() on matrices results in a memory footprint which cannot be removed by gc().
mat <- matrix(rnorm(1e7), 1e2, 1e5)
idx <- sort(rep_len(1:20, nrow(mat)))
gc(reset = TRUE)
str( ctapply(mat, idx, colSums, MERGE = rbind) )
gc(reset = TRUE)
str( ctapply(mat, idx, colSums, MERGE = rbind) )
gc(reset = TRUE)
My session info:
R version 3.1.2 (2014-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] iotools_0.1-6
loaded via a namespace (and not attached):
[1] parallel_3.1.2
When a raw vector is given to mstrsplit, we memchr forward to the first newline in the process of finding the column headers. This causes a NULL pointer dereference when there is only one row of data, and a rather nasty bug.
iotools::mstrsplit(charToRaw("asdf|asdf|asdf"))
I think there is a problem with my parallel implementation of chunk.apply. I'll follow up with a good example and hopefully a diagnosis and pull request.
Just to inform you about a typo: "setep" instead of "step" in the Note section:
https://github.com/s-u/iotools/blob/master/man/chunk.apply.Rd#L35
> dstrsplit(charToRaw("1|0|\n1||\n"), list(a=1,b=1L,NA))
a b
1 1 0
2 1 0
should be
1 1 0
2 1 NA
(PS: a work-around is to use numeric, as floats are parsed correctly)