
Magic potions to clean and transform your data 🧙

Home Page: https://easystats.github.io/datawizard/

License: Other

Topics: data-wrangling, janitor, dplyr, tidyr, reshape, rstats, manipulation, hacktoberfest, r-package

datawizard's Introduction

datawizard: Easy Data Wrangling and Statistical Transformations


{datawizard} is a lightweight package to easily manipulate, clean, transform, and prepare your data for analysis. It is part of the easystats ecosystem, a suite of R packages to deal with your entire statistical analysis, from cleaning the data to reporting the results.

It covers two aspects of data preparation:

  • Data manipulation: {datawizard} offers a set of functions very similar to those of the tidyverse packages, such as {dplyr} and {tidyr}, to select, filter, and reshape data, with a few key differences. 1) All data manipulation functions start with the prefix data_* (which makes them easy to identify). 2) Although most functions can be used exactly like their tidyverse equivalents, they are also string-friendly (which makes them easy to program with and use inside functions). 3) {datawizard} is super lightweight (no dependencies, similar to poorman), which makes it ideal for developers to use in their packages.

  • Statistical transformations: {datawizard} also has powerful functions to easily apply common data transformations, including standardization, normalization, rescaling, rank-transformation, scale reversing, recoding, binning, etc.



Installation


Type Source Command
Release CRAN install.packages("datawizard")
Development r-universe install.packages("datawizard", repos = "https://easystats.r-universe.dev")
Development GitHub remotes::install_github("easystats/datawizard")

Tip

Instead of library(datawizard), use library(easystats). This will make all features of the easystats ecosystem available.

To stay updated, use easystats::install_latest().

Citation

To cite the package, run the following command:

citation("datawizard")
To cite package 'datawizard' in publications use:

  Patil et al., (2022). datawizard: An R Package for Easy Data
  Preparation and Statistical Transformations. Journal of Open Source
  Software, 7(78), 4684, https://doi.org/10.21105/joss.04684

A BibTeX entry for LaTeX users is

  @Article{,
    title = {{datawizard}: An {R} Package for Easy Data Preparation and Statistical Transformations},
    author = {Indrajeet Patil and Dominique Makowski and Mattan S. Ben-Shachar and Brenton M. Wiernik and Etienne Bacher and Daniel Lüdecke},
    journal = {Journal of Open Source Software},
    year = {2022},
    volume = {7},
    number = {78},
    pages = {4684},
    doi = {10.21105/joss.04684},
  }

Features

Most courses and tutorials about statistical modeling assume that you are working with a clean and tidy dataset. In practice, however, a major part of statistical modeling is preparing your data: cleaning up values, creating new columns, reshaping the dataset, or transforming some variables. {datawizard} provides easy-to-use tools to perform these common, critical, and sometimes tedious data preparation tasks.

Data wrangling

Select, filter and remove variables

The package provides helpers to filter rows meeting certain conditions…

data_match(mtcars, data.frame(vs = 0, am = 1))
#>                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8

… or logical expressions:

data_filter(mtcars, vs == 0 & am == 1)
#>                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
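
Because the functions are string-friendly, the same filter can presumably also be passed as a character string, which is handy when programming inside functions (a small example, assuming string input is supported as described in the introduction):

# same filter, passed as a string
data_filter(mtcars, "vs == 0 & am == 1")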

Finding columns in a data frame, or retrieving the data of selected columns, can be achieved using find_columns() or get_columns():

# find column names matching a pattern
find_columns(iris, starts_with("Sepal"))
#> [1] "Sepal.Length" "Sepal.Width"

# return data columns matching a pattern
get_columns(iris, starts_with("Sepal")) |> head()
#>   Sepal.Length Sepal.Width
#> 1          5.1         3.5
#> 2          4.9         3.0
#> 3          4.7         3.2
#> 4          4.6         3.1
#> 5          5.0         3.6
#> 6          5.4         3.9

It is also possible to extract one or more variables:

# single variable
data_extract(mtcars, "gear")
#>  [1] 4 4 4 3 3 3 3 4 4 4 4 3 3 3 3 3 3 4 4 4 3 3 3 3 3 4 5 5 5 5 5 4

# more variables
head(data_extract(iris, ends_with("Width")))
#>   Sepal.Width Petal.Width
#> 1         3.5         0.2
#> 2         3.0         0.2
#> 3         3.2         0.2
#> 4         3.1         0.2
#> 5         3.6         0.2
#> 6         3.9         0.4

Due to the consistent API, removing variables is just as simple:

head(data_remove(iris, starts_with("Sepal")))
#>   Petal.Length Petal.Width Species
#> 1          1.4         0.2  setosa
#> 2          1.4         0.2  setosa
#> 3          1.3         0.2  setosa
#> 4          1.5         0.2  setosa
#> 5          1.4         0.2  setosa
#> 6          1.7         0.4  setosa

Reorder or rename

head(data_relocate(iris, select = "Species", before = "Sepal.Length"))
#>   Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1  setosa          5.1         3.5          1.4         0.2
#> 2  setosa          4.9         3.0          1.4         0.2
#> 3  setosa          4.7         3.2          1.3         0.2
#> 4  setosa          4.6         3.1          1.5         0.2
#> 5  setosa          5.0         3.6          1.4         0.2
#> 6  setosa          5.4         3.9          1.7         0.4
head(data_rename(iris, c("Sepal.Length", "Sepal.Width"), c("length", "width")))
#>   length width Petal.Length Petal.Width Species
#> 1    5.1   3.5          1.4         0.2  setosa
#> 2    4.9   3.0          1.4         0.2  setosa
#> 3    4.7   3.2          1.3         0.2  setosa
#> 4    4.6   3.1          1.5         0.2  setosa
#> 5    5.0   3.6          1.4         0.2  setosa
#> 6    5.4   3.9          1.7         0.4  setosa

Merge

x <- data.frame(a = 1:3, b = c("a", "b", "c"), c = 5:7, id = 1:3)
y <- data.frame(c = 6:8, d = c("f", "g", "h"), e = 100:102, id = 2:4)

x
#>   a b c id
#> 1 1 a 5  1
#> 2 2 b 6  2
#> 3 3 c 7  3
y
#>   c d   e id
#> 1 6 f 100  2
#> 2 7 g 101  3
#> 3 8 h 102  4

data_merge(x, y, join = "full")
#>    a    b c id    d   e
#> 3  1    a 5  1 <NA>  NA
#> 1  2    b 6  2    f 100
#> 2  3    c 7  3    g 101
#> 4 NA <NA> 8  4    h 102

data_merge(x, y, join = "left")
#>   a b c id    d   e
#> 3 1 a 5  1 <NA>  NA
#> 1 2 b 6  2    f 100
#> 2 3 c 7  3    g 101

data_merge(x, y, join = "right")
#>    a    b c id d   e
#> 1  2    b 6  2 f 100
#> 2  3    c 7  3 g 101
#> 3 NA <NA> 8  4 h 102

data_merge(x, y, join = "semi", by = "c")
#>   a b c id
#> 2 2 b 6  2
#> 3 3 c 7  3

data_merge(x, y, join = "anti", by = "c")
#>   a b c id
#> 1 1 a 5  1

data_merge(x, y, join = "inner")
#>   a b c id d   e
#> 1 2 b 6  2 f 100
#> 2 3 c 7  3 g 101

data_merge(x, y, join = "bind")
#>    a    b c id    d   e
#> 1  1    a 5  1 <NA>  NA
#> 2  2    b 6  2 <NA>  NA
#> 3  3    c 7  3 <NA>  NA
#> 4 NA <NA> 6  2    f 100
#> 5 NA <NA> 7  3    g 101
#> 6 NA <NA> 8  4    h 102

Reshape

A common data wrangling task is to reshape data.

Either to go from wide/Cartesian to long/tidy format

wide_data <- data.frame(replicate(5, rnorm(10)))

head(data_to_long(wide_data))
#>   name       value
#> 1   X1 -0.08281164
#> 2   X2 -1.12490028
#> 3   X3 -0.70632036
#> 4   X4 -0.70278946
#> 5   X5  0.07633326
#> 6   X1  1.93468099

or the other way

long_data <- data_to_long(wide_data, rows_to = "Row_ID") # Save row number

data_to_wide(long_data,
  names_from = "name",
  values_from = "value",
  id_cols = "Row_ID"
)
#>    Row_ID          X1          X2          X3         X4          X5
#> 1       1 -0.08281164 -1.12490028 -0.70632036 -0.7027895  0.07633326
#> 2       2  1.93468099 -0.87430362  0.96687656  0.2998642 -0.23035595
#> 3       3 -2.05128979  0.04386162 -0.71016648  1.1494697  0.31746484
#> 4       4  0.27773897 -0.58397514 -0.05917365 -0.3016415 -1.59268440
#> 5       5 -1.52596060 -0.82329858 -0.23094342 -0.5473394 -0.18194062
#> 6       6 -0.26916362  0.11059280  0.69200045 -0.3854041  1.75614174
#> 7       7  1.23305388  0.36472778  1.35682290  0.2763720  0.11394932
#> 8       8  0.63360774  0.05370100  1.78872284  0.1518608 -0.29216508
#> 9       9  0.35271746  1.36867235  0.41071582 -0.4313808  1.75409316
#> 10     10 -0.56048248 -0.38045724 -2.18785470 -1.8705001  1.80958455

Empty rows and columns

tmp <- data.frame(
  a = c(1, 2, 3, NA, 5),
  b = c(1, NA, 3, NA, 5),
  c = c(NA, NA, NA, NA, NA),
  d = c(1, NA, 3, NA, 5)
)

tmp
#>    a  b  c  d
#> 1  1  1 NA  1
#> 2  2 NA NA NA
#> 3  3  3 NA  3
#> 4 NA NA NA NA
#> 5  5  5 NA  5

# indices of empty columns or rows
empty_columns(tmp)
#> c 
#> 3
empty_rows(tmp)
#> [1] 4

# remove empty columns or rows
remove_empty_columns(tmp)
#>    a  b  d
#> 1  1  1  1
#> 2  2 NA NA
#> 3  3  3  3
#> 4 NA NA NA
#> 5  5  5  5
remove_empty_rows(tmp)
#>   a  b  c  d
#> 1 1  1 NA  1
#> 2 2 NA NA NA
#> 3 3  3 NA  3
#> 5 5  5 NA  5

# remove empty columns and rows
remove_empty(tmp)
#>   a  b  d
#> 1 1  1  1
#> 2 2 NA NA
#> 3 3  3  3
#> 5 5  5  5

Recode or cut data frame

set.seed(123)
x <- sample(1:10, size = 50, replace = TRUE)

table(x)
#> x
#>  1  2  3  4  5  6  7  8  9 10 
#>  2  3  5  3  7  5  5  2 11  7

# cut into 3 groups, based on distribution (quantiles)
table(categorize(x, split = "quantile", n_groups = 3))
#> 
#>  1  2  3 
#> 13 19 18
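
Other split options are available as well; for instance, splitting at the median (a sketch, assuming the "median" option of the split argument; see the function's documentation):

# cut into 2 groups, split at the median
table(categorize(x, split = "median"))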

Data Transformations

The package also contains multiple functions to help transform data.

Standardize

For example, to standardize (z-score) data:

# before
summary(swiss)
#>    Fertility      Agriculture     Examination      Education    
#>  Min.   :35.00   Min.   : 1.20   Min.   : 3.00   Min.   : 1.00  
#>  1st Qu.:64.70   1st Qu.:35.90   1st Qu.:12.00   1st Qu.: 6.00  
#>  Median :70.40   Median :54.10   Median :16.00   Median : 8.00  
#>  Mean   :70.14   Mean   :50.66   Mean   :16.49   Mean   :10.98  
#>  3rd Qu.:78.45   3rd Qu.:67.65   3rd Qu.:22.00   3rd Qu.:12.00  
#>  Max.   :92.50   Max.   :89.70   Max.   :37.00   Max.   :53.00  
#>     Catholic       Infant.Mortality
#>  Min.   :  2.150   Min.   :10.80   
#>  1st Qu.:  5.195   1st Qu.:18.15   
#>  Median : 15.140   Median :20.00   
#>  Mean   : 41.144   Mean   :19.94   
#>  3rd Qu.: 93.125   3rd Qu.:21.70   
#>  Max.   :100.000   Max.   :26.60

# after
summary(standardize(swiss))
#>    Fertility         Agriculture       Examination         Education      
#>  Min.   :-2.81327   Min.   :-2.1778   Min.   :-1.69084   Min.   :-1.0378  
#>  1st Qu.:-0.43569   1st Qu.:-0.6499   1st Qu.:-0.56273   1st Qu.:-0.5178  
#>  Median : 0.02061   Median : 0.1515   Median :-0.06134   Median :-0.3098  
#>  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
#>  3rd Qu.: 0.66504   3rd Qu.: 0.7481   3rd Qu.: 0.69074   3rd Qu.: 0.1062  
#>  Max.   : 1.78978   Max.   : 1.7190   Max.   : 2.57094   Max.   : 4.3702  
#>     Catholic       Infant.Mortality  
#>  Min.   :-0.9350   Min.   :-3.13886  
#>  1st Qu.:-0.8620   1st Qu.:-0.61543  
#>  Median :-0.6235   Median : 0.01972  
#>  Mean   : 0.0000   Mean   : 0.00000  
#>  3rd Qu.: 1.2464   3rd Qu.: 0.60337  
#>  Max.   : 1.4113   Max.   : 2.28566

Winsorize

To winsorize data:

# before
anscombe
#>    x1 x2 x3 x4    y1   y2    y3    y4
#> 1  10 10 10  8  8.04 9.14  7.46  6.58
#> 2   8  8  8  8  6.95 8.14  6.77  5.76
#> 3  13 13 13  8  7.58 8.74 12.74  7.71
#> 4   9  9  9  8  8.81 8.77  7.11  8.84
#> 5  11 11 11  8  8.33 9.26  7.81  8.47
#> 6  14 14 14  8  9.96 8.10  8.84  7.04
#> 7   6  6  6  8  7.24 6.13  6.08  5.25
#> 8   4  4  4 19  4.26 3.10  5.39 12.50
#> 9  12 12 12  8 10.84 9.13  8.15  5.56
#> 10  7  7  7  8  4.82 7.26  6.42  7.91
#> 11  5  5  5  8  5.68 4.74  5.73  6.89

# after
winsorize(anscombe)
#>    x1 x2 x3 x4   y1   y2   y3   y4
#> 1  10 10 10  8 8.04 9.13 7.46 6.58
#> 2   8  8  8  8 6.95 8.14 6.77 5.76
#> 3  12 12 12  8 7.58 8.74 8.15 7.71
#> 4   9  9  9  8 8.81 8.77 7.11 8.47
#> 5  11 11 11  8 8.33 9.13 7.81 8.47
#> 6  12 12 12  8 8.81 8.10 8.15 7.04
#> 7   6  6  6  8 7.24 6.13 6.08 5.76
#> 8   6  6  6  8 5.68 6.13 6.08 8.47
#> 9  12 12 12  8 8.81 9.13 8.15 5.76
#> 10  7  7  7  8 5.68 7.26 6.42 7.91
#> 11  6  6  6  8 5.68 6.13 6.08 6.89

Center

To grand-mean center data:

center(anscombe)
#>    x1 x2 x3 x4          y1         y2    y3         y4
#> 1   1  1  1 -1  0.53909091  1.6390909 -0.04 -0.9209091
#> 2  -1 -1 -1 -1 -0.55090909  0.6390909 -0.73 -1.7409091
#> 3   4  4  4 -1  0.07909091  1.2390909  5.24  0.2090909
#> 4   0  0  0 -1  1.30909091  1.2690909 -0.39  1.3390909
#> 5   2  2  2 -1  0.82909091  1.7590909  0.31  0.9690909
#> 6   5  5  5 -1  2.45909091  0.5990909  1.34 -0.4609091
#> 7  -3 -3 -3 -1 -0.26090909 -1.3709091 -1.42 -2.2509091
#> 8  -5 -5 -5 10 -3.24090909 -4.4009091 -2.11  4.9990909
#> 9   3  3  3 -1  3.33909091  1.6290909  0.65 -1.9409091
#> 10 -2 -2 -2 -1 -2.68090909 -0.2409091 -1.08  0.4090909
#> 11 -4 -4 -4 -1 -1.82090909 -2.7609091 -1.77 -0.6109091

Ranktransform

To rank-transform data:

# before
head(trees)
#>   Girth Height Volume
#> 1   8.3     70   10.3
#> 2   8.6     65   10.3
#> 3   8.8     63   10.2
#> 4  10.5     72   16.4
#> 5  10.7     81   18.8
#> 6  10.8     83   19.7

# after
head(ranktransform(trees))
#>   Girth Height Volume
#> 1     1    6.0    2.5
#> 2     2    3.0    2.5
#> 3     3    1.0    1.0
#> 4     4    8.5    5.0
#> 5     5   25.5    7.0
#> 6     6   28.0    9.0

Rescale

To rescale a numeric variable to a new range:

change_scale(c(0, 1, 5, -5, -2))
#> [1]  50  60 100   0  30
#> attr(,"min_value")
#> [1] -5
#> attr(,"max_value")
#> [1] 5
#> attr(,"new_min")
#> [1] 0
#> attr(,"new_max")
#> [1] 100
#> attr(,"range_difference")
#> [1] 10
#> attr(,"to_range")
#> [1]   0 100
#> attr(,"class")
#> [1] "dw_transformer" "numeric"

Rotate or transpose

x <- mtcars[1:3, 1:4]

x
#>                mpg cyl disp  hp
#> Mazda RX4     21.0   6  160 110
#> Mazda RX4 Wag 21.0   6  160 110
#> Datsun 710    22.8   4  108  93

data_rotate(x)
#>      Mazda RX4 Mazda RX4 Wag Datsun 710
#> mpg         21            21       22.8
#> cyl          6             6        4.0
#> disp       160           160      108.0
#> hp         110           110       93.0

Data properties

datawizard provides comprehensive descriptive summaries for all variables in a data frame:

data(iris)
describe_distribution(iris)
#> Variable     | Mean |   SD |  IQR |        Range | Skewness | Kurtosis |   n | n_Missing
#> ----------------------------------------------------------------------------------------
#> Sepal.Length | 5.84 | 0.83 | 1.30 | [4.30, 7.90] |     0.31 |    -0.55 | 150 |         0
#> Sepal.Width  | 3.06 | 0.44 | 0.52 | [2.00, 4.40] |     0.32 |     0.23 | 150 |         0
#> Petal.Length | 3.76 | 1.77 | 3.52 | [1.00, 6.90] |    -0.27 |    -1.40 | 150 |         0
#> Petal.Width  | 1.20 | 0.76 | 1.50 | [0.10, 2.50] |    -0.10 |    -1.34 | 150 |         0

Or for just a single variable:

describe_distribution(mtcars$wt)
#> Mean |   SD |  IQR |        Range | Skewness | Kurtosis |  n | n_Missing
#> ------------------------------------------------------------------------
#> 3.22 | 0.98 | 1.19 | [1.51, 5.42] |     0.47 |     0.42 | 32 |         0

There are also some additional data properties that can be computed using this package.

x <- (-10:10)^3 + rnorm(21, 0, 100)
smoothness(x, method = "diff")
#> [1] 1.791243
#> attr(,"class")
#> [1] "parameters_smoothness" "numeric"

Function design and pipe-workflow

The design of the {datawizard} functions follows a principle that makes them easy for users to understand and remember:

  1. the first argument is the data
  2. for methods that work on data frames, the next two arguments select and exclude variables
  3. the remaining arguments relate to the specific task of the function

Most importantly, functions that accept data frames usually take one as their first argument and also return a (modified) data frame. Thus, {datawizard} integrates smoothly into a “pipe-workflow”.

iris |>
  # all rows where Species is "versicolor" or "virginica"
  data_filter(Species %in% c("versicolor", "virginica")) |>
  # select only columns with "." in names (i.e. drop Species)
  get_columns(contains("\\.")) |>
  # move columns that end with "Length" to start of data frame
  data_relocate(ends_with("Length")) |>
  # remove fourth column
  data_remove(4) |>
  head()
#>    Sepal.Length Petal.Length Sepal.Width
#> 51          7.0          4.7         3.2
#> 52          6.4          4.5         3.2
#> 53          6.9          4.9         3.1
#> 54          5.5          4.0         2.3
#> 55          6.5          4.6         2.8
#> 56          5.7          4.5         2.8

Contributing and Support

If you want to file an issue or contribute to the package in another way, please follow this guide. For questions about the functionality, you may contact us via email or file an issue.

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

datawizard's People

Contributors

bwiernik, dominiquemakowski, etiennebacher, github-actions[bot], indrajeetpatil, mattansb, rempsyc, strengejacke


datawizard's Issues

Allow more thresholds for winsorizing

Originally posted #47

Current behavior: Thresholds are relative, based on the top/bottom percentiles.

Suggested behavior: This can be expanded in several ways:

  • Fixed values: use passed lower and upper thresholds (see the sketch below)
  • Relative Z-score
  • Relative robust Z-score
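
A minimal sketch of the fixed-value variant (hypothetical helper, not part of the current API):

# hypothetical: winsorize at user-supplied fixed thresholds
winsorize_fixed <- function(x, lower, upper) {
  x[x < lower] <- lower
  x[x > upper] <- upper
  x
}
winsorize_fixed(c(1, 5, 10, 50, 100), lower = 5, upper = 50)
#> [1]  5  5 10 50 50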

Docs

  • Some of the docs are referencing functions that use to be coupled with these functions in their previous homes. (e.g., standardize references model standardization).
  • Docs from the functions previously in effectsize need to have md formatting turned on.

winsorize factors

Originally posted #47

Current behavior: factors aren't winsorized.

Suggested behavior: Add an include_factors argument (defaulting to FALSE for backward compatibility) to allow some form of winsorizing for factors, where levels with low frequency (defined by a threshold) are combined into ".other" (see the sketch below).
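
A minimal sketch of what that could look like (hypothetical helper; names and defaults are placeholders):

# hypothetical: collapse low-frequency factor levels into ".other"
collapse_rare_levels <- function(f, threshold = 0.1) {
  freq <- prop.table(table(f))
  rare <- names(freq)[freq < threshold]
  levels(f)[levels(f) %in% rare] <- ".other"
  f
}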

improve printing for transformed variable

> datawizard::standardize(c(4, 5, 2, 4, 42))
[1] -0.4317236 -0.3733826 -0.5484057 -0.4317236  1.7852356
attr(,"center")
[1] 11.4
attr(,"scale")
[1] 17.1406
attr(,"robust")
[1] FALSE

It would be nice to have better printing, which could simply be print(as.vector(x)) (but we could also think about adding a small description with the mean, the scale, and the method used). See the sketch below.
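
A sketch of such a print method for the "dw_transformer" class (assuming the attributes should simply be hidden by default):

# hide transformation attributes when printing
print.dw_transformer <- function(x, ...) {
  print(as.vector(x), ...)
  invisible(x)
}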

standardization of matrix converts it to vector

Is there a reason for this, or is this a bug?

Tagging @DominiqueMakowski because he wrote the original function.

x <- matrix(1:8, nrow = 4)

datawizard::standardise(x)
#> [1] -1.4288690 -1.0206207 -0.6123724 -0.2041241  0.2041241  0.6123724  1.0206207
#> [8]  1.4288690
#> attr(,"center")
#> [1] 4.5
#> attr(,"scale")
#> [1] 2.44949
#> attr(,"robust")
#> [1] FALSE



x <- array(1:8, dim = c(4, 2))

datawizard::standardise(x)
#> [1] -1.4288690 -1.0206207 -0.6123724 -0.2041241  0.2041241  0.6123724  1.0206207
#> [8]  1.4288690
#> attr(,"center")
#> [1] 4.5
#> attr(,"scale")
#> [1] 2.44949
#> attr(,"robust")
#> [1] FALSE

Created on 2021-11-29 by the reprex package (v2.0.1)
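
If column-wise standardization is the intended behavior, a base-R workaround is to apply over the columns (a sketch, using the matrix x from above):

# standardize each column separately, keeping the matrix shape
apply(x, 2, datawizard::standardise)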

feature request: providing optional argument to prettify ANOVA output

I had tried this in easystats/parameters#370, but due to my scattered brain, I made a mess of it. So here is a clearer description of this feature request.

Feature details

While reporting an ANOVA, I want to report df, residual df, etc. in the same row, but almost all packages and GUI software report them in separate rows. So I need to tidy the output further, and I am guessing I am not the only one who needs to do this.

We could provide an optional argument (e.g., pretty_anova) for relevant methods in the model_parameters() function so that users, if they wish, can ask for the "prettified" output.

What would the output look like if this feature is supported?

The default will be pretty_anova = FALSE, so the output will continue to be backward compatible.

library(parameters)
m <- aov(Sepal.Length ~ Species, data = iris)

# default will continue to be what we have now
as.data.frame(parameters(m))
#>   Parameter Sum_Squares  df Mean_Square        F            p
#> 1   Species    63.21213   2  31.6060667 119.2645 1.669669e-31
#> 2 Residuals    38.95620 147   0.2650082       NA           NA

# pretty printing (not wedded to this argument name)
as.data.frame(parameters(m, pretty_anova = TRUE))
#>   Parameter Sum_Squares Sum_Squares_Error df df_error Mean_Square Mean_Square_Error        F            p
#> 1   Species    63.21213          38.95620  2      147  31.6060667         0.2650082 119.2645 1.669669e-31

Any unwanted side effects?

  • Q: Since effectsize depends on parameters, will this break effectsize?
    A: No, the default value for this argument is backward compatible, so this will not have any adverse effects for effectsize.

  • Q: What if the user sets both pretty_anova = TRUE and (for instance) eta_squared = TRUE, won't that cause problems for effectsize?
    A: No, this can be avoided by delaying the reformatting/prettification of the data frame until after the effect size computation takes place. So the data frame can be "prettified" after this point (see the sketch below):

https://github.com/easystats/parameters/blob/02c6a92fc76b2f8fc1d840ee96a609327355a96f/R/methods_htest.R#L212-L221
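
A rough sketch of that reformatting step (hypothetical helper, operating on the data frame returned by parameters(); column names follow the example output above):

# hypothetical: merge the "Residuals" row into the effect rows
prettify_anova <- function(tab) {
  res <- tab[tab$Parameter == "Residuals", ]
  eff <- tab[tab$Parameter != "Residuals", ]
  eff$Sum_Squares_Error <- res$Sum_Squares
  eff$df_error <- res$df
  eff$Mean_Square_Error <- res$Mean_Square
  eff
}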

Relevant context

This is relevant only for the following objects:

c("aov", "aovlist", "anova", "Gam", "manova", "maov")

Writing a JOSS paper

Once the API is more stable (which should be the case in a few months), we should definitely do this.

Potential improvements to `reshape_*()`

  • Work with rownames and not only row numbers
  • longer: argument: names_pattern
  • longer: ".value" special name
  • wider: arguments:
    • names_prefix
    • names_glue
    • names_sort
    • names_repair
    • values_fill
    • values_fn
  • wider: names_sep alias for sep argument
  • Non-standard evaluation in reshape_wider() for names_from and values_from? (need an example where this is needed)
  • tidyr::pivot_longer() examples:
    • 3
    tidyr::who |> 
      datawizard::reshape_longer(cols = 5:60,
                                 names_to = c("diagnosis", "gender", "age"),
                                 names_pattern = "new_?(.*)_(.)(.*)",
                                 values_to = "count")
    • 4
    anscombe |> 
      datawizard::reshape_longer(cols = colnames(anscombe),
                                 names_to = c(".value", "set"),
                                 names_pattern = "(.)(.)")
  • tidyr::pivot_wider() examples:
    • 2
     tidyr::fish_encounters |> 
        datawizard::reshape_wider(names_from = "station", values_from = "seen", values_fill = 0)
    • 5
    tidyr::us_rent_income |> 
        datawizard::reshape_wider(names_from = "variable",
                                  names_glue = "{variable}_{.value}",
                                  values_from = c("estimate", "moe"))
    • 6
    warpbreaks[c("wool", "tension", "breaks")] |> 
        datawizard::reshape_wider(
            names_from = "wool",
            values_from = "breaks",
            values_fn = mean
        )

improvisations for `center` documentation

  • explain how it's different from standardize (subtract mean versus subtract mean and divide by SD)
  • add alias centre (and resist American hegemony 😅)
  • examples about how to use select and exclude arguments

Implement `is_empty_object()` function

Multiple packages in our ecosystem have this same function as an internal function, and so there is unnecessary duplication of code. We can define it only once in {datawizard} and other packages can start using it.

# is the object empty?
.is_empty_object <- function(x) {
  if (is.list(x)) {
    x <- tryCatch(
      {
        .compact_list(x)
      },
      error = function(x) {
        x
      }
    )
  }
  
  if (inherits(x, c("tbl_df", "tbl"))) x <- as.data.frame(x)
  x <- suppressWarnings(x[!is.na(x)])
  length(x) == 0 || is.null(x)
}

# remove NULL elements from lists
.compact_list <- function(x, remove_na = FALSE) {
  if (remove_na) {
    x[!sapply(x, function(i) length(i) == 0 || is.null(i) || (length(i) == 1 & is.na(i)) || any(i == "NULL", na.rm = TRUE))]
  } else {
    x[!sapply(x, function(i) length(i) == 0 || is.null(i) || any(i == "NULL", na.rm = TRUE))]
  }
}
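
For reference, the expected behavior of the helper (examples derived from the definition above):

.is_empty_object(NULL)      # TRUE
.is_empty_object(list())    # TRUE
.is_empty_object(c(NA, NA)) # TRUE
.is_empty_object(1:3)       # FALSE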

Request for function that tabulates factors

I enjoy using describe_distribution() for numerical vectors but would like an alternative for tabulating factors.

set.seed(2021)
x <- factor(sample(LETTERS[1:3], size = 100, replace = TRUE))
datawizard::describe_distribution(x)
#> Mean | SD | Min | Max | Skewness | Kurtosis |   n | n_Missing
#> -------------------------------------------------------------
#> NA   | NA |   A |   C |    -0.14 |     -1.3 | 100 |         0

Created on 2021-12-13 by the reprex package (v2.0.1)

I would propose something like this:

describe_factor(x) # or tabulate_factor(x)
#> Variable | Level | Count | Proportion
#> -------------------------------------
#> x        | A     |    26 |       0.26
#> x        | B     |    40 |       0.40
#> x        | C     |    34 |       0.34

If you like the idea, I can make a pull request to start this off.
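
A minimal sketch of such a function (hypothetical; the name and output format would need discussion):

describe_factor <- function(x) {
  counts <- table(x)
  data.frame(
    Variable = deparse(substitute(x)),
    Level = names(counts),
    Count = as.vector(counts),
    Proportion = as.vector(prop.table(counts))
  )
}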

CRAN checklist

I want to push this to CRAN asap. So listing things that need to be done before I push it to CRAN.
Please edit as you please:

  • improve README
  • devtools::check_win_devel()
  • devtools::check_win_release()

move visualisation_matrix lower to make it more re-usable

A robust solution to generate the data / visualisation grid could be useful to other developers/packages as well (@vincentarelbundock), so we might want to see whether to move it to datawizard or insight. We could rename it to something like data_grid() (or get_datagrid() if it goes in insight), and re-export it under its current name (visualisation_matrix) in modelbased to avoid any breaking changes.

What do you think?

raking function to create weights

Maybe based on iterake, we could add a lightweight re-implementation of that algorithm. It would be a nice addition to demean() and rescale_weights() for dealing with (rather) common problems that don't yet have a straightforward solution in many R packages.

expand `demean` to `degroup`

Some ideas for expanding the demean function into a more general degroup (or decenter, or ??) function (listed by ease of implementation as I perceive it):

  1. Allow for group-centering around other functions. Popular choices I've seen: median(), min(), max(), also Mode() is popular for categorical predictors.
Mode <- function(x, multimodal = FALSE) {
  uniqv <- unique(x)
  tab <- tabulate(match(x, uniqv))
  if (multimodal) {
    idx <- which(tab==max(tab))
  } else {
    idx <- which.max(tab)
  }
  uniqv[idx]
}
  2. Allow for more than one grouping variable.
     Order of operations would be: split y by G1, then split y_between by G2, etc. (Would need a better naming scheme?)

  3. Center around an indexed value. For example, center y around y[time == 0], or y[condition == "a"]. Can be mixed with (1): max(y[time == 0]), etc. See the sketch below.
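
A tiny sketch of point (3), centering around an indexed value (hypothetical usage, not current API):

y <- c(10, 12, 15, 11)
time <- c(0, 1, 2, 3)
y - y[time == 0]
#> [1] 0 2 5 1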

Standardize: unnecessary argument

Just noticed this:

datawizard/R/standardize.R

Lines 186 to 187 in 5ab5cdc

append = FALSE,
suffix = "_z",

It seems like we could have just had suffix = NULL by default, and if suffix = "_z" then it's added; there is no need for a toggle argument append on top of that, as it makes the API more confusing and weighs it down.

I can make a quick fix providing it won't break a thousand things ^^

standardize: define fixed args in the S3 method

You need to define the S3 method differently in datawizard.

If this is the generic:

standardize <- function(x,
                        robust = FALSE,
                        two_sd = FALSE,
                        weights = NULL,
                        verbose = TRUE,
                        ...) {
  UseMethod("standardize")
}

Then all other methods need to have at least these arguments.

In datawizard, it should be:

standardize <- function(x, ...) {
  UseMethod("standardize")
}

Originally posted by @strengejacke in easystats/modelbased#169 (comment)

revising `describe_distribution` output for `factor` class

As awesome as this function is for numeric type variables, I am not sure if this is the best output we can provide for factor type variables.

as.data.frame(parameters::describe_distribution(as.factor(mtcars$am)))
#>   Mean SD Min Max  Skewness Kurtosis  n n_Missing
#> 1   NA NA   0   1 0.4008089 -1.96655 32         0

I was thinking instead we can probably take inspiration from skimr output for factor class:

as.data.frame(skimr::skim(as.factor(mtcars$am)))
#>   skim_type skim_variable n_missing complete_rate factor.ordered
#> 1    factor          data         0             1          FALSE
#>   factor.n_unique factor.top_counts
#> 1               2      0: 19, 1: 13

circular dependencies

Note that adjust (https://github.com/easystats/datawizard/blob/master/R/adjust.R), factor_analysis (https://github.com/easystats/datawizard/blob/master/R/factor_analysis.R), and a few other functions use other easystats packages (mostly insight and parameters::model_parameters()).

This is not ideal, since we will have another instance of circular dependency.

Not sure what we can do about this, though. Functions used from insight, etc. are better off there and should not be moved to datawizard, so we will just have to live with this, I am guessing 🤷‍♂️

Streamline data transformation code

Four functions, namely standardize, normalize, ranktransform, and change_scale, have very similar code, which makes it 1) unnecessarily lengthy and 2) harder to maintain. It could be streamlined with a nice internal helper (see the sketch below).
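
A sketch of what such a shared internal could look like (hypothetical; the real refactoring would also need to handle attributes, grouped data, etc.):

# hypothetical: apply a transformation to selected columns of a data frame
.transform_columns <- function(data, fun, select = NULL, exclude = NULL, ...) {
  cols <- if (is.null(select)) names(data) else select
  cols <- setdiff(cols, exclude)
  data[cols] <- lapply(data[cols], fun, ...)
  data
}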

including mode for the distribution in `describe_distribution`

Saw some discussion on Twitter about how base R doesn't have a function to calculate the mode. So I was wondering if it might be a good idea for us to include this info.

library(parameters)

describe_distribution(mtcars$wt)
#> Mean |   SD |  IQR |        Range | Skewness | Kurtosis |  n | n_Missing
#> ------------------------------------------------------------------------
#> 3.22 | 0.98 | 1.19 | [1.51, 5.42] |     0.47 |     0.42 | 32 |         0

getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

Created on 2021-06-03 by the reprex package (v2.0.0)

Modulation of appending/overwriting standardized variables in `standardize()` not working

Related to #30

d <- iris[1:4, ]

## NOT WORKING
# this should only return the two standardized variables, including suffix
datawizard::standardise(d, select = c("Sepal.Length", "Sepal.Width"), append = FALSE, suffix = "_z")
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1    1.2402159   1.3887301          1.4         0.2  setosa
#> 2    0.3382407  -0.9258201          1.4         0.2  setosa
#> 3   -0.5637345   0.0000000          1.3         0.2  setosa
#> 4   -1.0147221  -0.4629100          1.5         0.2  setosa

## working
# this should return the original data frame and column bound 
# the standardized variables, including suffix
datawizard::standardise(d, select = c("Sepal.Length", "Sepal.Width"), append = TRUE, suffix = "_z")
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length_z
#> 1          5.1         3.5          1.4         0.2  setosa      1.2402159
#> 2          4.9         3.0          1.4         0.2  setosa      0.3382407
#> 3          4.7         3.2          1.3         0.2  setosa     -0.5637345
#> 4          4.6         3.1          1.5         0.2  setosa     -1.0147221
#>   Sepal.Width_z
#> 1     1.3887301
#> 2    -0.9258201
#> 3     0.0000000
#> 4    -0.4629100

## NOT WORKING
# this should only return the standardized variables, w/o suffix
datawizard::standardise(d, select = c("Sepal.Length", "Sepal.Width"), append = FALSE, suffix = NULL)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1    1.2402159   1.3887301          1.4         0.2  setosa
#> 2    0.3382407  -0.9258201          1.4         0.2  setosa
#> 3   -0.5637345   0.0000000          1.3         0.2  setosa
#> 4   -1.0147221  -0.4629100          1.5         0.2  setosa

## NOT WORKING
# this should return the original data frame with the standardized variables
# *overwriting* the related variables
datawizard::standardise(d, select = c("Sepal.Length", "Sepal.Width"), append = TRUE, suffix = NULL)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length
#> 1    1.2402159   1.3887301          1.4         0.2  setosa          5.1
#> 2    0.3382407  -0.9258201          1.4         0.2  setosa          4.9
#> 3   -0.5637345   0.0000000          1.3         0.2  setosa          4.7
#> 4   -1.0147221  -0.4629100          1.5         0.2  setosa          4.6
#>   Sepal.Width
#> 1         3.5
#> 2         3.0
#> 3         3.2
#> 4         3.1

Created on 2021-11-04 by the reprex package (v2.0.1)

set_na()

If we're going to copy some useful functions from the sj* packages, we should also have a set_na() function. This is a quite common operation after importing data from Stata or SPSS. A sketch follows below.
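
A minimal sketch of what set_na() might look like (hypothetical signature):

set_na <- function(x, values) {
  x[x %in% values] <- NA
  x
}
set_na(c(1, 2, 99, 4, 99), values = 99)
#> [1]  1  2 NA  4 NA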

`standardize()` error for `nlme::gls`: `"Error in eval(predvars, data, env): object <group name> not found"`

This seems only to happen when the object passed to `correlation` or `weights` is constructed *before* the main call. (This is also true for `nlme::lme()`.)

library(nlme)
library(effectsize)

cr <- corAR1(form = ~ 1 | Mare)
fm1 <- gls(follicles ~ sin(2 * pi * Time) + cos(2 * pi * Time), Ovary,
           correlation = cr)

standardize(fm1)
#> Error: Unable to refit the model with standardized data.
#> Failed with the following error:
#> "Error in eval(predvars, data, env): object 'Mare' not found
#> �"
#> 
#> Try instead to standardize the data (standardize(data)) and refit the model manually.




fm2 <- gls(follicles ~ sin(2 * pi * Time) + cos(2 * pi * Time), Ovary,
           correlation = corAR1(form = ~ 1 | Mare))

standardize(fm2) # Works
#> Generalized least squares fit by REML
#>   Model: follicles ~ sin(2 * pi * Time) + cos(2 * pi * Time) 
#>   Data: data_std 
#>   Log-restricted-likelihood: -297.8955
#> 
#> Coefficients:
#>        (Intercept) sin(2 * pi * Time) cos(2 * pi * Time) 
#>        -0.02178075        -0.01786764        -0.02994492 
#> 
#> Correlation Structure: AR(1)
#>  Formula: ~1 | Mare 
#>  Parameter estimate(s):
#>       Phi 
#> 0.7881418 
#> Degrees of freedom: 308 total; 305 residual
#> Residual standard error: 1.00673

Created on 2022-01-07 by the reprex package (v2.0.1)

Not sure how to fix this...

list of possible functions to be included in `datawizard`

Please edit the list as you come across such functions in the core packages.

I am organizing by core package:

bayestestR

  • simulate_correlation
  • simulate_difference
  • simulate_ttest

effectsize

  • adjust
  • standardize
  • normalize

correlation

  • simulate_simpson
  • winsorize

parameters

  • center
  • data_partition
  • data_to_numeric
  • demean (?)
  • degroup (?)
  • reduce_parameters
  • principal_components()
  • factor_analysis()
  • n_factors()
  • n_clusters()
  • check_factorstructure()
  • check_clusterstructure()

performance

modelbased

report

  • data_rename
  • data_findcols
  • data_remove
  • data_reorder
  • data_addprefix
  • data_addsuffix

insight

  • check_if_installed
  • data_to_long
  • data_to_wide
  • format_message
  • All the table printing (possibly the implementation of a generic easydf/easydata class for data frames that would be printed naturally, with possible methods like data_add_title() / data_add_footer() that can be nicely chained)

getting rid of `parameters` dependency

Functions where it is needed and can't be avoided.

  • principal_components

out <- parameters::model_parameters(pca, sort = sort, threshold = threshold)

  • factor_analysis

if (!is.null(cor)) {
  out <- parameters::model_parameters(
    psych::fa(
      cor,
      nfactors = n,
      rotate = rotation,
      n.obs = nrow(x),
      ...
    ),
    sort = sort,
    threshold = threshold
  )
} else {
  out <- parameters::model_parameters(
    psych::fa(x, nfactors = n, rotate = rotation, ...),
    sort = sort,
    threshold = threshold
  )
}

Pulling functions from {effectsize}

Some ideas:

  • standardize() + unstandardize() (with the .numeric(), .factor(), .data.frame(), .grouped_df() methods)
    The method for models (.default()) can live in {effectsize} - meaning datawizard would be imported by effectsize.
  • change_scale()
  • normalize()
  • ranktransform()
  • adjust() / data_adjust() (which isn't even used by {effectsize})

Bug in standardize

datawizard::standardise(data.frame(x = c(1,2,3,NA), y = c(1,1,1,NA)))
#> Error in `[[<-.data.frame`(`*tmp*`, var, value = c(1, 1, 1)): replacement has 3 rows, data has 4

Created on 2021-11-04 by the reprex package (v2.0.1)

Remove deprecated `change_scale()` function

You should be using data_rescale() instead. I will be making this change in the version after next.

#' @rdname data_rescale
#' @export
change_scale <- function(x, ...) {
  # TODO: don't deprecate for now, so we have time to change it across
  # the -verse, but do so in the next round:
  # .Deprecated("data_rescale")
  data_rescale(x, ...)
}

cc @strengejacke, @mattansb, @DominiqueMakowski You need to make these changes in see, parameters, effectsize, and modelbased.

https://github.com/search?p=1&q=org%3Aeasystats+change_scale&type=Code

  • see
  • parameters
  • modelbased
  • effectsize

logo

Unfortunately, we don't have the rights to Gandalf :p

I made a first draft with a monster-wizard tidying up a bunch of data (cells).


Design

A list of questions:

  • NSE, characters, or both? It's likely that we will need to plan something like a super-robust data_select() that can accommodate various types of input (+ regex etc.). This will also be very useful in other easystats packages.
  • Naming scheme:
    • column names: cols? vars? colnames?
    • row names: rows? rownames?
    • ...
  • Regarding pattern-matching, should we follow the tidyverse's API (starts_with, ends_with, matches) or use raw regex? Or something in between, with a few commonly-used regex helpers (rgx_start, rgx_end, rgx_digits, ..., though obviously not to the extent of RVerbalExpressions)? See the sketch below.
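
A sketch of such regex helpers (hypothetical names and semantics):

# hypothetical helpers that build raw regex strings
rgx_start  <- function(x) paste0("^", x)
rgx_end    <- function(x) paste0(x, "$")
rgx_digits <- function() "[0-9]+"

rgx_start("Sepal")
#> [1] "^Sepal"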

"tidyselect" helpers

Would be nice to implement some of the basic tidyselect helpers, for use in reshape_longer() and reshape_wider(), for example. This can be done with base R; see the sketch below.
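
A base-R sketch of one such helper (simplified semantics; assumes columns are matched by name prefix):

# hypothetical base-R stand-in for tidyselect's starts_with()
starts_with_base <- function(match, vars) {
  grep(paste0("^", match), vars, value = TRUE)
}
starts_with_base("Sepal", colnames(iris))
#> [1] "Sepal.Length" "Sepal.Width"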

Suggestion for new package covering missing stat functions

Would it be an idea to add another package with stat functions you would expect to be in stats but which are not?

Harmonic mean, geometric mean, mode, coefficient of variation, standard errors, these kinds of things? I could think of tens of obvious functions. Two quick sketches follow below.
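
Quick base-R sketches of two of these, for illustration:

geometric_mean <- function(x) exp(mean(log(x)))
harmonic_mean <- function(x) 1 / mean(1 / x)

geometric_mean(c(1, 4, 16))
#> [1] 4
harmonic_mean(c(2, 6))
#> [1] 3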
