iotools's Issues

Windows line endings

We need to deal cleanly with Windows-style (CRLF) line endings; I just hit
a bug caused by unparsed '\r' characters.
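
One possible workaround, until the parser handles this itself, is to normalise the raw input before parsing. The sketch below uses base R only; `strip_crlf` is a hypothetical helper, not part of iotools, and it assumes the input is already a raw vector.

```r
# Hypothetical helper: drop every '\r' that is immediately followed by '\n',
# turning Windows-style CRLF line endings into plain LF.
strip_crlf <- function(r) {
  stopifnot(is.raw(r))
  if (length(r) < 2L) return(r)
  cr <- as.raw(0x0d)
  lf <- as.raw(0x0a)
  # TRUE at position i when the byte at i + 1 is a line feed
  next_is_lf <- c(r[-1L] == lf, FALSE)
  r[!(r == cr & next_is_lf)]
}

input <- charToRaw("a,b\r\nc,d\r\n")
clean <- strip_crlf(input)
```

A lone '\r' that is not followed by '\n' is left in place, so data that legitimately contains carriage returns is not altered.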

col.names argument not used in write.csv.raw

write.csv.raw = function(x, file = "", append = FALSE, sep = ",", nsep="\t",
                          col.names = TRUE, fileEncoding = "") {
  if (is.character(file)) {
    file <- if (nzchar(fileEncoding))
            file(file, ifelse(append, "ab", "wb"), encoding = fileEncoding)
        else file(file, ifelse(append, "ab", "wb"))
    on.exit(close(file))
  } else if (!isOpen(file, "w")) {
    open(file, "wb")
    on.exit(close(file))
  }

  r = as.output(x, sep = sep, nsep=nsep)
  writeBin(r, con=file)
}

mstrsplit with no separator

It would be nice if mstrsplit accepted a sep value, such as NA or NULL, signaling that there are no separators and that each row corresponds to a single element.
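
To make the request concrete, here is a rough base R sketch of the behaviour being asked for. `lines_as_matrix` is a hypothetical name and this is not how mstrsplit is implemented; it only illustrates the expected result, a one-column character matrix with one row per input line.

```r
# Hypothetical illustration of "no separator" splitting:
# every line of the raw input becomes a single matrix element.
lines_as_matrix <- function(r) {
  stopifnot(is.raw(r))
  rows <- strsplit(rawToChar(r), "\n", fixed = TRUE)[[1L]]
  matrix(rows, ncol = 1L)
}

m <- lines_as_matrix(charToRaw("a|b\nc|d\n"))
```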

write.csv.raw colnames adds null character

There is a bug in write.csv.raw() with this line:

cr = rawToChar(as.output(matrix(colnames(x),nrow=1),sep = sep))

rawToChar() is not necessary and creates a bug: writeBin() appends a null character to the output when its input is a character vector. The file then cannot be read with read.csv.raw(), because the first character of the data is null (it follows the \n at the end of the column names).
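
The difference can be reproduced with base R alone. This small sketch writes the same header line once as a character string and once as a raw vector, then reads the bytes back; the file path and variable names are only for illustration.

```r
path <- tempfile()

# Character input: writeBin() writes the string plus a terminating nul byte.
con <- file(path, "wb")
writeBin("a,b\n", con)
close(con)
from_char <- readBin(path, what = "raw", n = 16L)

# Raw input: the bytes are written verbatim, with no terminator.
con <- file(path, "wb")
writeBin(charToRaw("a,b\n"), con)
close(con)
from_raw <- readBin(path, what = "raw", n = 16L)

unlink(path)
```

Here from_char has one more byte than from_raw, and that extra byte is the nul that ends up at the start of the data.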

Install failing for R 3.1.1

I'm seeing the following in R 3.1.1 on both Mac and Linux.

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
vagrant@vagrant-ubuntu-trusty-64:~$ R CMD INSTALL iotools 
* installing to library ‘/home/vagrant/R/x86_64-pc-linux-gnu-library/3.1’
* installing *source* package ‘iotools’ ...
** libs
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -I../inst/include     -fpic  -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -g  -c as_output.c -o as_output.o
In file included from as_output.c:2:0:
as_output.c: In function ‘dybuf_collect’:
as_output.c:64:24: error: ‘HEAD’ undeclared (first use in this function)
     int l = LENGTH(CAR(HEAD));
                        ^
/usr/share/R/include/Rinternals.h:319:43: note: in definition of macro ‘SHORT_VEC_LENGTH’
 # define SHORT_VEC_LENGTH(x) (((VECSEXP) (x))->vecsxp.length)
                                           ^
/usr/share/R/include/Rinternals.h:325:21: note: in expansion of macro ‘IS_LONG_VEC’
 # define LENGTH(x) (IS_LONG_VEC(x) ? R_BadLongVector(x, __FILE__, __LINE__) : SHORT_VEC_LENGTH(x))
                     ^
as_output.c:64:13: note: in expansion of macro ‘LENGTH’
     int l = LENGTH(CAR(HEAD));
             ^
as_output.c:64:20: note: in expansion of macro ‘CAR’
     int l = LENGTH(CAR(HEAD));
                    ^
as_output.c:64:24: note: each undeclared identifier is reported only once for each function it appears in
     int l = LENGTH(CAR(HEAD));
                        ^
/usr/share/R/include/Rinternals.h:319:43: note: in definition of macro ‘SHORT_VEC_LENGTH’
 # define SHORT_VEC_LENGTH(x) (((VECSEXP) (x))->vecsxp.length)
                                           ^
/usr/share/R/include/Rinternals.h:325:21: note: in expansion of macro ‘IS_LONG_VEC’
 # define LENGTH(x) (IS_LONG_VEC(x) ? R_BadLongVector(x, __FILE__, __LINE__) : SHORT_VEC_LENGTH(x))
                     ^
as_output.c:64:13: note: in expansion of macro ‘LENGTH’
     int l = LENGTH(CAR(HEAD));
             ^
as_output.c:64:20: note: in expansion of macro ‘CAR’
     int l = LENGTH(CAR(HEAD));
                    ^
as_output.c:70:13: warning: passing argument 1 of ‘Rf_unprotect’ makes integer from pointer without a cast
   UNPROTECT(res);
             ^
/usr/share/R/include/Rinternals.h:633:35: note: in definition of macro ‘UNPROTECT’
 #define UNPROTECT(n) Rf_unprotect(n)
                                   ^
/usr/share/R/include/Rinternals.h:1258:6: note: expected ‘int’ but argument is of type ‘SEXP’
 void Rf_unprotect(int);
      ^
In file included from as_output.c:9:0:
as_output.c: At top level:
../inst/include/utils.h:14:19: warning: inline function ‘Rspace’ declared but never defined
 R_INLINE Rboolean Rspace(unsigned int c);
                   ^
../inst/include/utils.h:14:19: warning: inline function ‘Rspace’ declared but never defined
/usr/lib/R/etc/Makeconf:128: recipe for target 'as_output.o' failed
make: *** [as_output.o] Error 1
ERROR: compilation failed for package ‘iotools’
* removing ‘/home/vagrant/R/x86_64-pc-linux-gnu-library/3.1/iotools’

Quote handling

Currently we do not do any special parsing of quotes (other than
always converting columns with quotes to characters instead of
other types). It would be easy to at least do some basic quote
parsing, even if we don't go as far as parsing nested/embedded
strings.
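
For reference, base R's scan() already does this kind of basic quote parsing, which may be a useful target for the expected behaviour. The sketch below is not iotools code; `split_quoted` is a hypothetical wrapper.

```r
# Hypothetical wrapper around base::scan(): split one line on `sep`,
# treating double-quoted fields as single values.
split_quoted <- function(line, sep = ",") {
  scan(text = line, what = character(), sep = sep, quote = "\"",
       quiet = TRUE)
}

fields <- split_quoted('1,"a,b",3')
```

The separator inside the quoted field is kept as part of the value, so the line yields three fields rather than four.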

Could not find class error

I am trying out the iotools package in pseudo-distributed mode on CentOS with CDH4 (I have verified that other Hadoop jobs run). I have installed the package and configured HADOOP_PREFIX.

To start with, I am trying to run an identity MapReduce job using the hmr() function, but it gives the following errors:
"Could not find or load main class org.apache.hadoop.util.PlatformName"
"Could not find or load main class org.apache.hadoop.util.RunJar"

These classes are from the Hadoop core jar. I am not sure where I am going wrong.
It would be great if you could shed some light on what the issue could be.

header option not working with read.table.raw

write.csv(iris, "iris.csv", row.names=FALSE, quote=FALSE)
a = iotools:::read.table.raw("iris.csv", header=TRUE, sep=",", colClasses=c(rep("numeric", 4), "character"))
head(a)
##    V1  V2  V3  V4     V5
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
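
Until header=TRUE is honoured, one workaround is to read the header line separately and assign the names afterwards. The sketch below uses base R only; `header_names` is a hypothetical helper, and the small data frame stands in for iris so that the example is self-contained.

```r
# Hypothetical helper: return the column names found on the first line.
header_names <- function(path, sep = ",") {
  first <- readLines(path, n = 1L)
  strsplit(first, sep, fixed = TRUE)[[1L]]
}

path <- tempfile(fileext = ".csv")
df <- data.frame(x = 1:2, y = c("a", "b"))
write.csv(df, path, row.names = FALSE, quote = FALSE)

nms <- header_names(path)
unlink(path)
```

The result could then be applied with names(a) <- nms, provided the header line itself is excluded from the parsed data.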

?dstrfw should be more upfront about the use of col_types names

By complete accident doing some debugging, I realized that (despite my having read the argument list in ?dstrfw several times), contrary to what I had been doing, dstrfw provides functionality to handle input column names directly (rather than fixing up ex-post).

I only saw this because I was reading the source code of the function -- it took very careful poring over every word of the rest of ?dstrfw to find the morsel at the end of the Value section which states this feature.

It's a great feature, so it should get higher prominence! In particular it should definitely get mention in the col_types argument explanation.

I'm happy to submit a PR to this end.

Let dstrsplit remove nul characters

library(iotools)
download.file("http://euler.stat.yale.edu/~mjk56/fix_sample.tsv", "fix_sample.tsv")
dstrsplit(read.chunk(chunk.reader("fix_sample.tsv")), col_types=rep("character", 46))

produces the error message "embedded nul in string". It would be nice if dstrsplit could remove the nul characters for us.
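
In the meantime, a workaround is to drop the nul bytes from the raw chunk before handing it to dstrsplit. This is a minimal base R sketch; `drop_nul` is a hypothetical helper and the sample bytes are made up for illustration.

```r
# Hypothetical helper: remove every nul (0x00) byte from a raw vector.
drop_nul <- function(r) {
  stopifnot(is.raw(r))
  r[r != as.raw(0)]
}

# "a", nul, tab, "b", newline
chunk <- as.raw(c(0x61, 0x00, 0x09, 0x62, 0x0a))
clean <- drop_nul(chunk)
```

The cleaned vector can be converted with rawToChar() without triggering the embedded nul error.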

problem with chunk.tapply

Purpose: Split the iris dataframe by the column 'Species' and then count the number of rows.

Problem: chunk.tapply seems to give an unexpected result.

Plain way,

sapply(split(iris, iris$Species), nrow)

yields the (named) vector c(50, 50, 50).


Try to reproduce this with chunk.tapply:

write.table(iris[, c(5,1:4)]
            , "iris.csv"
            , row.names = FALSE
            , sep = ","
            , col.names = FALSE)

View(read.csv("iris.csv", header = FALSE))

library("iotools")
chunk.tapply(input = "iris.csv"
             , FUN = function(x){
                       df <- dstrsplit(x
                                 , col_types = c("character"
                                                 , rep("numeric", 4))
                                 , sep = ","
                                 )
                       # print(df) # to observe whats happening
                       nrow(df)
                       }
             , CH.MERGE =  c
             , sep = ",")

yields c(100, 50). Printing the data frames after dstrsplit shows that the rows for 'setosa' and 'versicolor' are not treated as separate data frames. Is there something wrong with what I am doing?

iotools version: 0.1-12 (the one on CRAN)

read.csv.raw fails if all rows are skipped and if colClasses are absent -- unlike utils::read.csv

Calling read.csv.raw fails if all rows are skipped and colClasses is absent. The built-in function utils::read.csv, after which the interface of the really great and fast iotools::read.csv.raw is modelled, does not fail. Would it be possible to catch this admittedly special case? In my case it would make programming a bit easier.

Many thanks,
Daniel

A short reproducible example:

packageVersion("iotools")   # ‘0.2.3’
x <- matrix(sample(100), ncol = 10)
write.csv(x, file = "test.csv", row.names = FALSE)

read.csv("test.csv", skip = 10)                                             # dim == c(0, 11)
read.csv("test.csv", skip = 10, colClasses = rep("integer", 10))            # dim == c(0, 11)
read.csv.raw("test.csv", skip = 10)    # Error in subset[, i] : subscript out of bounds
read.csv.raw("test.csv", skip = 10, colClasses = rep("integer", 10))        # dim == c(0, 11)

unlink("test.csv")

By the way, I prefer the real/expected column names which are returned by read.csv.raw whenever colClasses are specified compared to the unexpected column names returned by read.csv ('X' concatenated with the values of the last and here skipped line [-- instead of "The default is to use "V" followed by the column number" quoting ?read.csv]).

calling ctapply on large matrices results in gc problems

Calling ctapply() on matrices leaves a memory footprint that cannot be released by gc().

mat <- matrix(rnorm(1e7), 1e2, 1e5)
idx <- sort(rep_len(1:20, nrow(mat)))
gc(reset = TRUE)
str( ctapply(mat, idx, colSums, MERGE = rbind) )
gc(reset = TRUE)
str( ctapply(mat, idx, colSums, MERGE = rbind) )
gc(reset = TRUE)

My session info:

R version 3.1.2 (2014-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] iotools_0.1-6

loaded via a namespace (and not attached):
[1] parallel_3.1.2

mstrsplit segfault with raw vector when no newlines present

When a raw vector is given to mstrsplit, we memchr forward to
the first newline in the process of finding the column headers.
If there is only one row of data and no newline, this yields a
NULL pointer, which causes a rather nasty segfault.

iotools::mstrsplit(charToRaw("asdf|asdf|asdf"))
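
Until the C code checks for the missing newline, a caller could guard against this case by making sure the raw vector ends with a line feed. The sketch below is base R only; `ensure_trailing_newline` is a hypothetical helper, and whether it avoids the crash in every affected version is an assumption that has not been verified here.

```r
# Hypothetical guard: append '\n' when the raw input does not end with one.
ensure_trailing_newline <- function(r) {
  stopifnot(is.raw(r))
  lf <- as.raw(0x0a)
  if (length(r) == 0L || r[length(r)] != lf) c(r, lf) else r
}

safe <- ensure_trailing_newline(charToRaw("asdf|asdf|asdf"))
```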
