Giter Site home page Giter Site logo

maditr's Introduction

maditr: Fast Data Aggregation, Modification, and Filtering

CRAN_Status_Badge Coverage Status

Links

Installation

maditr is on CRAN, so for installation you can print in the console install.packages("maditr").

Overview

Package provides pipe-style interface for data.table package. It preserves all data.table features without significant impact on performance. let and take functions are simplified interfaces for most common data manipulation tasks.

  • To select rows from data: rows(mtcars, am==0)
  • To select columns from data: columns(mtcars, mpg, vs:carb)
  • To aggregate data: take(mtcars, mean_mpg = mean(mpg), by = am)
  • To aggregate all non-grouping columns: take_all(mtcars, mean, by = am)
  • To aggregate several columns with one summary: take(mtcars, mpg, hp, fun = mean, by = am)
  • To get total summary skip by argument: take_all(mtcars, mean)
  • Use magrittr pipe %>% to chain several operations:
     mtcars %>%
        let(mpg_hp = mpg/hp) %>%
        take(mean(mpg_hp), by = am)
  • To modify variables or add new variables:
      mtcars %>%
         let(new_var = 42,
             new_var2 = new_var*hp) %>%
         head()
  • To drop variable assign NULL: let(mtcars, am = NULL) %>% head()
  • To modify all non-grouping variables:
    iris %>%
      let_all(
          scaled = (.x - mean(.x))/sd(.x),
          by = Species) %>%
       head()
  • To aggregate all variables conditionally on name:
    iris %>%
      take_all(
          mean = if(startsWith(.name, "Sepal")) mean(.x),
          median = if(startsWith(.name, "Petal")) median(.x),
          by = Species
      )
  • For parametric assignment use :=:
    new_var = "my_var"
    old_var = "mpg"
    mtcars %>%
        let((new_var) := get(old_var)*2) %>%
        head()
     
    # or,  
    expr = quote(mean(cyl))
    mtcars %>% 
        let((new_var) := eval(expr)) %>% 
        head()
    
    # the same with `take` 
    by_var = "vs,am"
    take(mtcars, (new_var) := eval(expr), by = by_var)

query_if function translates its arguments one-to-one to [.data.table method. Additionally there are some conveniences such as automatic data.frame conversion to data.table.

vlookup & xlookup

Let's make datasets for lookups:

library(maditr)

workers = fread("
    name company
    Nick Acme
    John Ajax
    Daniela Ajax
")

positions = fread("
    name position
    John designer
    Daniela engineer
    Cathie manager
")

# xlookup
workers = let(workers,
  position = xlookup(name, positions$name, positions$position)
)

# vlookup
# by default we search in the first column and return values from second column
workers = let(workers,
  position = vlookup(name, positions, no_match = "Not found")
)

# the same 
workers = let(workers,
  position = vlookup(name, positions, 
                     result_column = "position", 
                     no_match = "Not found") # or, result_column = 2 
)

head(workers)

More examples

We will use for demonstartion well-known mtcars dataset and some examples from dplyr package.

library(maditr)

data(mtcars)

# Newly created variables are available immediately
mtcars %>%
    let(
        cyl2 = cyl * 2,
        cyl4 = cyl2 * 2
    ) %>% head()

# You can also use let() to remove variables and
# modify existing variables
mtcars %>%
    let(
        mpg = NULL,
        disp = disp * 0.0163871 # convert to litres
    ) %>% head()


# window functions are useful for grouped computations
mtcars %>%
    let(rank = rank(-mpg, ties.method = "min"),
        by = cyl) %>%
    head()

# You can drop variables by setting them to NULL
mtcars %>%
    let(cyl = NULL) %>%
    head()

# keeps all existing variables
mtcars %>%
    let(displ_l = disp / 61.0237) %>%
    head()

# keeps only the variables you create
mtcars %>%
    take(displ_l = disp / 61.0237) %>% 
    head()


# can refer to both contextual variables and variable names:
var = 100
mtcars %>%
    let(cyl = cyl * var) %>%
    head()

# select rows
mtcars %>%
    rows(am==0) %>% 
    head()

# select rows with compound condition
mtcars %>%
    rows(am==0 & mpg>mean(mpg))

# select columns
mtcars %>% 
    columns(vs:carb, cyl)
    
mtcars %>% 
    columns(-am, -cyl)    

# regular expression pattern
columns(iris, "^Petal") # variables which start from 'Petal'
columns(iris, "Width$") # variables which end with 'Width'

# move Species variable to the front
# pattern "^." matches all variables
columns(iris, Species, "^.")

# pattern "^.*al" means "contains 'al'"
columns(iris, "^.*al")

# numeric indexing - all variables except Species
columns(iris, 1:4) 

# A 'take' with summary functions applied without 'by' argument returns an aggregated data
mtcars %>%
    take(mean = mean(disp), n = .N)

# Usually, you'll want to group first
mtcars %>%
    take(mean = mean(disp), n = .N, by = am)

# grouping by multiple variables
mtcars %>%
    take(mean = mean(disp), n = .N, by = list(am, vs))

# You can group by expressions:
mtcars %>%
    take_all(
        mean,
        by = list(vsam = vs + am)
    )

# modify all non-grouping variables in-place
mtcars %>%
    let_all((.x - mean(.x))/sd(.x), by = am) %>%
    head()

# modify all non-grouping variables to new variables
mtcars %>%
    let_all(scaled = (.x - mean(.x))/sd(.x), by = am) %>%
    head()

# conditionally modify all variables
iris %>%
    let_all(mean = if(is.numeric(.x)) mean(.x)) %>%
    head()

# modify all variables conditionally on name
iris %>%
    let_all(
        mean = if(startsWith(.name, "Sepal")) mean(.x),
        median = if(startsWith(.name, "Petal")) median(.x),
        by = Species
    ) %>%
    head()

# aggregation with 'take_all'
mtcars %>%
    take_all(mean = mean(.x), sd = sd(.x), n = .N, by = am)

# conditionally aggregate all variables
iris %>%
    take_all(mean = if(is.numeric(.x)) mean(.x))

# aggregate all variables conditionally on name
iris %>%
    take_all(
        mean = if(startsWith(.name, "Sepal")) mean(.x),
        median = if(startsWith(.name, "Petal")) median(.x),
        by = Species
    )

# parametric evaluation:
var = quote(mean(cyl))
mtcars %>% 
    let(mean_cyl = eval(var)) %>% 
    head()
take(mtcars, eval(var))

# all together
new_var = "mean_cyl"
mtcars %>% 
    let((new_var) := eval(var)) %>% 
    head()
take(mtcars, (new_var) := eval(var))

Variable selection in the expressions

You can use 'columns' inside expression in the 'take'/'let'. 'columns' will be replaced with data.table with selected columns. In 'let' in the expressions with ':=', 'cols' or '%to%' can be placed in the left part of the expression. It is usefull for multiple assignment. There are four ways of column selection:

  1. Simply by column names
  2. By variable ranges, e. g. vs:carb. Alternatively, you can use '%to%' instead of colon: 'vs %to% carb'.
  3. With regular expressions. Characters which start with '^' or end with $ considered as Perl-style regular expression patterns. For example, '^Petal' returns all variables started with 'Petal'. 'Width$' returns all variables which end with 'Width'. Pattern '^.' matches all variables and pattern '^.*my_str' is equivalent to contains "my_str"'.
  4. By character variables with interpolated parts. Expression in the curly brackets inside characters will be evaluated in the parent frame with 'text_expand' function. For example, a{1:3} will be transformed to the names 'a1', 'a2', 'a3'. 'cols' is just a shortcut for 'columns'.
# range selection
iris %>% 
    let(
        avg = rowMeans(Sepal.Length %to% Petal.Width)
    ) %>% 
    head()

# multiassignment
iris %>% 
    let(
        # starts with Sepal or Petal
        multipled1 %to% multipled4 := cols("^(Sepal|Petal)")*2
    ) %>% 
    head()


mtcars %>% 
    let(
        # text expansion
        cols("scaled_{names(mtcars)}") := lapply(cols("{names(mtcars)}"), scale)
    ) %>% 
    head()

# range selection in 'by'
# selection of range + additional column
mtcars %>% 
    take(
        res = sum(cols(mpg, disp %to% drat)),
        by = vs %to% gear
    )

Joins

Here we use the same datasets as with lookups:

workers = fread("
    name company
    Nick Acme
    John Ajax
    Daniela Ajax
")

positions = fread("
    name position
    John designer
    Daniela engineer
    Cathie manager
")

workers
positions

Different kinds of joins:

workers %>% dt_inner_join(positions)
workers %>% dt_left_join(positions)
workers %>% dt_right_join(positions)
workers %>% dt_full_join(positions)

# filtering joins
workers %>% dt_anti_join(positions)
workers %>% dt_semi_join(positions)

To suppress the message, supply by argument:

workers %>% dt_left_join(positions, by = "name")

Use a named by if the join variables have different names:

positions2 = setNames(positions, c("worker", "position")) # rename first column in 'positions'
workers %>% dt_inner_join(positions2, by = c("name" = "worker"))

'dplyr'-like interface for data.table.

There are a small subset of 'dplyr' verbs to work with data.table. Note that there is no group_by verb - use by or keyby argument when needed.

  • dt_mutate adds new variables or modify existing variables. If data is data.table then it modifies in-place.
  • dt_summarize computes summary statistics. Splits the data into subsets, computes summary statistics for each, and returns the result in the "data.table" form.
  • dt_summarize_all the same as dt_summarize but work over all non-grouping variables.
  • dt_filter Selects rows/cases where conditions are true. Rows where the condition evaluates to NA are dropped.
  • dt_select Selects column/variables from the data set. Range of variables are supported, e. g. vs:carb. Characters which start with ^ or end with \$ considered as Perl-style regular expression patterns. For example, '^Petal' returns all variables started with 'Petal'. 'Width\$' returns all variables which end with 'Width'. Pattern ^. matches all variables and pattern '^.*my_str' is equivalent to contains "my_str". See examples.
  • dt_arrange sorts dataset by variable(-s). Use '-' to sort in desending order. If data is data.table then it modifies in-place.
# examples from 'dplyr'
# newly created variables are available immediately
mtcars  %>%
    dt_mutate(
        cyl2 = cyl * 2,
        cyl4 = cyl2 * 2
    ) %>%
    head()


# you can also use dt_mutate() to remove variables and
# modify existing variables
mtcars %>%
    dt_mutate(
        mpg = NULL,
        disp = disp * 0.0163871 # convert to litres
    ) %>%
    head()


# window functions are useful for grouped mutates
mtcars %>%
    dt_mutate(
        rank = rank(-mpg, ties.method = "min"),
        keyby = cyl) %>%
    print()


# You can drop variables by setting them to NULL
mtcars %>% dt_mutate(cyl = NULL) %>% head()

# A summary applied without by returns a single row
mtcars %>%
    dt_summarise(mean = mean(disp), n = .N)

# Usually, you'll want to group first
mtcars %>%
    dt_summarise(mean = mean(disp), n = .N, by = cyl)


# Multiple 'by' - variables
mtcars %>%
    dt_summarise(cyl_n = .N, by = list(cyl, vs))

# Newly created summaries immediately
# doesn't overwrite existing variables
mtcars %>%
    dt_summarise(disp = mean(disp),
                  sd = sd(disp),
                  by = cyl)

# You can group by expressions:
mtcars %>%
    dt_summarise_all(mean, by = list(vsam = vs + am))

# filter by condition
mtcars %>%
    dt_filter(am==0)

# filter by compound condition
mtcars %>%
    dt_filter(am==0,  mpg>mean(mpg))


# select
mtcars %>% dt_select(vs:carb, cyl)
mtcars %>% dt_select(-am, -cyl)

# regular expression pattern
dt_select(iris, "^Petal") # variables which start from 'Petal'
dt_select(iris, "Width$") # variables which end with 'Width'
# move Species variable to the front
# pattern "^." matches all variables
dt_select(iris, Species, "^.")
# pattern "^.*al" means "contains 'al'"
dt_select(iris, "^.*al")
dt_select(iris, 1:4) # numeric indexing - all variables except Species

# sorting
dt_arrange(mtcars, cyl, disp)
dt_arrange(mtcars, -disp)

maditr's People

Contributors

gdemin avatar mattdowle avatar michaelchirico avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

maditr's Issues

to_wide should not use dcast(fun.aggregate=identity)

Hi @gdemin
Recently data.table merged some new dcast code which uses more strict checking of fun.aggregate, which is supposed to be a function which returns a single value. This is documented on ?dcast: " The aggregating function should take a vector as input and return a single value (or a list of length one) as output."

Using new data.table from github master, we ran example("to_wide") which gave the following:

> iris %>%
+     to_long(list(Sepal = cols("^Sepal"), Petal = cols("^Petal"))) %>%
+     let(
+         variable = factor(variable, levels = 1:2, labels = c("Length", "Width"))
+     ) %>%
+     to_wide(values_in = c(Sepal, Petal))
Error: Aggregating function(s) should take vector inputs and return a single value (length=1). However, function(s) returns length!=1. This value will have to be used to fill any missing combinations, and therefore must be length=1. Either override by setting the 'fill' argument explicitly or modify your function to handle this case appropriately.

Details: Rdatatable/data.table#6032

It seems that to_wide is calling dcast with fun.aggregate=identity which is problematic because that returns a vector with length>1. Can you please modify your code so that it uses a fun.aggregate that returns a single value? (length=1)

Thanks in advance!

Return top row(s) by group

Is there a way to do this same command in maditr? It returns the top row by group.

library(maditr)

dt <- data.table(x = c(1,2,3,4,5), y = c("a","a","a","b","b"))

dt[, head(.SD, 1), by = y]

Conda forge r-maditr

Hello Gregory,

Thank you for this package. I was just wondering why there is no availability to get it from the conda conda install -c conda-forge command line.
As a work around, I will try with the reshape2 package. I am using the dcast() function.

Question about na.rm in maditr

This is not an issue, is a question ..

Can I get the same result in this tables with expss and maditr? Can I modify the use of mean in fun?
Column 2016 difference between vox & maspais. I'm looking for equal result in expss and maditr...
`
year <- c(2016,2017,2018,2019,2020,2016)
psoe <- c(21,32,34,36,36.5,21)
pp <- c(15,16,13.5,17,17,15)
vox <- c(11,12,11,12,13,NA)
cs <- c(1,1,2,4,2,NA)
maspais <- c(1,2,3,1,3,5)
others <- c(0.3,1,0.4,0.3,0.5, 0.1)

data <- data.frame(year,psoe,pp,vox,cs,maspais,others)

take(data, psoe,pp,vox,cs,maspais,others, fun = mean, by=year)
data %>%
tab_cols(year) %>%
tab_cells(psoe,pp,vox,cs,maspais,others) %>%
tab_stat_mean() %>%
tab_pivot() %>%
t()
`
Thanks in advance

Want to write a dt_rename but failed

In dplyr there is a function called rename, while in data.table we have setnames.

Recently I tried to write a dt_rename and dt_transmute using data.table but failed. This should be easy, but I don't know how to let an R function to take an expression or multiple expressions, and melt and reorganized them. Is there some tricks?

Thanks.

let dt_* functions more like dplyr

Awesome piece! Great syntax for data.table. But maybe some adjustment for the in-place operation might help. For instance, using dt_mutate could not yield any results unless we use head() or print(), why not just add %>% .[] and show it to make it more like dplyr, or even use as_tibble() to show it in a new way. I think change in place might be useless for pipe work.
Again, thank you for all the awesome efforts you have made to the community!

Try to use dt_distinct, any clues?

I would like to add dt_distinct to maditr, however, failed. Here's my codes:

dt_distinct = function(data,...,.keep_all = FALSE){

  data = as.data.table(data)
  if(is.null(substitute(...))) return(unique(data))
  
  if(.keep_all) data[,.SD[1],by = list(...)]
  else data %>% dt_select(...) %>% unique()

}

However, it never returns the correct answer for more than one variables. Is there any way to get it right?

naming conventionos

Congrats to this very interesting package! I am just wondering about the naming conventions. Is there a reason they are not analogous to dplyr or sth. which is already established? Your conventions might be easier to adopt if you chose more similar verbs like i.e. dt_filter, dt_select, dt_arrange etc.

Could dt_select support selection by column names pattern?

In dplyr, we have iris %>% select(starts_with("Se")).
In maditr, the analogous way is iris %>% dt_select(startsWith(colnames(.),"Se")).

Is there some way to make it more convenient? Such as:
iris %>% dt_select("^Se")
Once receive a character, understand it as a regular expression.

Get lag and lead using API

It is great design to have "by" parameter in dt_mutate and dt_summarise, however, when I try:

iris %>% dt_mutate(id = lag(Sepal.Length,2)) %>% as.data.table

It did not work. Is there some way to make it work? Should we design a new API or just optimize it inside dt_mutate?

maditr does not work without `library(maditr)`

library(dplyr)
iris %>% 
  maditr::let(lat_lon = 42) %>% 
  head

Does not work, if I do NOT first library(maditr). This is probably caused by the eval.parent,

let.data.frame <- function (data, ..., by, keyby, i) {
    eval.parent(substitute(let_if(data, i, ..., by = by, keyby = keyby)))
}

I am not an expert in NSE but what do you think about this simple fix

let.data.frame <- function (data, ..., by, keyby, i) {
    eval.parent(substitute(maditr::let_if(data, i, ..., by = by, keyby = keyby)))
}

so you don't assume people have run library(maditr).

Basically this arises from me trying to see if maditr can be integrate with disk.frame see DiskFrame/disk.frame#231

Make all methods S3-methods for easier integration with other packages like disk.frame

I want to integrate maditr with disk.frame but for that I need the methods in maditr to be S3 compatible. E.g.

let = function(data, ....) {
  UseMethod("let")
}

let.default = function(data, # omitted) {
  # your original let function implementation here; omitted
}

then in disk.frame I can implement the let method

let.disk.frame <- function(data,...) {
  # implement it for disk.frame
}

which will enable large datasets that doesn't fit into ram to benefit from maditr.

Imports for data.table version are incorrect

Version 0.5.0 of maditr lists data.table (≥ 1.10) for Imports in the DESCRIPTION file. The maditr really needs data.table (>= 1.11). Otherwise the object groupingsets is not available from namespace:data.table

install.packages("maditr")
#> Installing package into ‘/home/R/site-library_R-3.4’
#> (as ‘lib’ is unspecified)
#> trying URL 'http://erratic/cran/src/contrib/maditr_0.5.0.tar.gz'
#> Content type 'application/x-gzip' length 22693 bytes (22 KB)
#> ==================================================
#> downloaded 22 KB
#>
#> * installing *source* package ‘maditr’ ...
#> ** package ‘maditr’ successfully unpacked and MD5 sums checked
#> ** R
#> ** inst
#> ** preparing package for lazy loading
#> Error : 'groupingsets' is not an exported object from 'namespace:data.table'
#> Error : unable to load R code in package ‘maditr’
#> ERROR: lazy loading failed for package ‘maditr’
#> * removing ‘/home/R/site-library_R-3.4/maditr’
packageVersion("data.table")
#> [1] ‘1.10.4’

sessionInfo()
#> R version 3.4.3 (2017-11-30)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 16.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/atlas-base/libf77blas.so.3.0
#> LAPACK: /home/R/R-3.4.3/lib/R/lib/libRlapack.so
#>
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C       
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C          
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C     
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#>
#> loaded via a namespace (and not attached):
#> [1] compiler_3.4.3 tools_3.4.3

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.