nelson-gon / mde

mde: Missing Data Explorer

Home Page: https://nelson-gon.github.io/mde

License: GNU General Public License v3.0

R 100.00%

missing-data missing-values r-package r-stats data-analysis data-exploration r recode missing data-science

mde's Introduction

🖤 Building Impactful Software.
💻 Simplicity-focused open source advocate. Author of several open source R and Python packages like cytounet, urlfix, manymodelr, pyfdc, mde. A full list is available at https://nelson-gon.github.io/projects.
📫 Please contact me via LinkedIn at https://www.linkedin.com/in/nelsongon/

Thank you 🖤

Keep Building 🏗

mde's People

Forkers

romainfrancois jimsforks minghao2016 shahronak47

mde's Issues

Allow multiple values as replacements

Given a data.frame object with NAs, I would like to be able to use recode_na_as to replace these with my own defined values, with one twist: these values should be recycled across the data.frame. Currently, this is not possible; only the first value is used, with the warning:

In x[list] <- values :
number of items to replace is not a multiple of replacement length
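
A minimal base R sketch of what recycled replacement could look like. It assumes that "recycled" means cycling through the supplied values within each column's NA positions; the helper name recycle_na_fill is hypothetical and not part of mde.

```r
# Hypothetical helper (not part of mde): fill NAs column by column,
# recycling `values` to the number of NAs found in each column.
recycle_na_fill <- function(df, values) {
  df[] <- lapply(df, function(col) {
    na_idx <- which(is.na(col))
    # rep_len() recycles `values` to exactly length(na_idx) items,
    # which avoids the "not a multiple of replacement length" warning.
    col[na_idx] <- rep_len(values, length(na_idx))
    col
  })
  df
}

df <- data.frame(A = c(1, NA, NA, NA), B = c(NA, 2, NA, 4))
filled <- recycle_na_fill(df, values = c(0, 99))
print(filled)
```

Recycling per column rather than across the whole data.frame is a design choice made here for illustration; the issue does not say which behaviour is wanted.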

Allow recoding based on rows

Recode as NA if a value satisfies some condition

In recode_na_as and recode_as_na, is it possible to recode values that meet some condition, for example gteq, lteq, lt, etc.?
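
As a rough illustration of condition-based recoding, the base R sketch below sets values to NA wherever a user-supplied predicate holds. The helper recode_as_na_where and its condition argument are assumptions made for this example, not mde functions.

```r
# Hypothetical helper (not part of mde): set values to NA wherever
# `condition(col)` evaluates to TRUE.
recode_as_na_where <- function(df, condition) {
  df[] <- lapply(df, function(col) {
    hit <- condition(col)
    # Existing NAs produce NA in `hit`; treat those as "no match".
    col[!is.na(hit) & hit] <- NA
    col
  })
  df
}

df <- data.frame(A = c(1, 5, 10), B = c(7, NA, 2))
# "gteq" expressed as an ordinary predicate function.
out <- recode_as_na_where(df, function(x) x >= 5)
print(out)
```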

Support tidy select in recode_na_as and recode_as_na

Currently, the subset_cols argument of recode_na_as and recode_as_na is not very versatile. Could support be extended to tidyselect-like features? (Think contains, starts_with, ends_with.)

Recode character NA as another value

Description

Given a character NA such as "na", I would like to convert this to 420 for example.

Similar Features

This is similar to recode_as_na, recode_na_as, recode_na_as_str. The only difference is that we would take any value and convert it to some other value.

Feature Details

Admittedly, the above may deviate slightly from the package's name, since the value being recoded is not necessarily a missing value. However, for "na" it should be possible to do that.

Proposed Implementation

Currently, this can be achieved via a two-step process.

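# Step 1: turn the character placeholders into real NAs.
to_na <- recode_as_na(df, value = c("na", "null"), ...)
# Step 2: replace those NAs (using the result of step 1) with the target value.
from_na <- recode_na_as(to_na, value = 0, ...)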

This was inspired by the Stack Overflow question https://stackoverflow.com/q/70385221/10323798

na_summary should be able to specify columns to include or exclude

Description

The function na_summary() should be able to specify a subset of columns to include or exclude.

Similar Features

This would enhance the na_summary() function. The percent_missing() function already has this feature in the argument exclude_cols. The argument ... in the na_summary() function might provide the feature, but there is no documented example of its usage.

Feature Details

The documented usage is
na_summary(df, grouping_cols = NULL, sort_by = NULL, descending = FALSE, ...)
where ... is documented as "Arguments to other functions".
But here ... fails. That is, the exclude_cols argument, which is an argument to the percent_missing() function, fails with an error when used in the na_summary() function. Here is an MWE:

> library(mde)
Welcome to mde. This is mde version 0.2.1.
 Please file issues and feedback at https://www.github.com/Nelson-Gon/mde/issues
Turn this message off using 'suppressPackageStartupMessages(library(mde))'
 Happy Exploration :)
> na_summary(airquality)
  variable missing complete percent_complete percent_missing
1      Day       0      153        100.00000        0.000000
2    Month       0      153        100.00000        0.000000
3    Ozone      37      116         75.81699       24.183007
4  Solar.R       7      146         95.42484        4.575163
5     Temp       0      153        100.00000        0.000000
6     Wind       0      153        100.00000        0.000000
> percent_missing(airquality, exclude_cols = c("Day","Temp"))
     Ozone  Solar.R Wind Month
1 24.18301 4.575163    0     0
> na_summary(airquality, exclude_cols = c("Day","Temp"))
Error in na_summary.data.frame(airquality, exclude_cols = c("Day", "Temp")) : 
  Binding of datasets failed. Please check using percent_missing and get_na_counts first

Proposed Implementation

I can think of at least four possible implementations:
A. Provide example documentation of how to use the argument exclude_cols of the percent_missing() function as the argument ... in the na_summary() function.
or
B. Add the argument exclude_cols or include_cols to the na_summary() function for variables you want to exclude or include.
or
C. Add a boolean argument for missingness versus completeness so that you will either get a completeness report (with the statistics columns complete and percent_complete) or a minimal missingness report (with the statistics columns missing and percent_missing).
or
D. Add the argument exclude_stats or include_stats to the na_summary() function for statistics you want to exclude or include.

I propose A. if the issue is only a matter of improving the documentation. Otherwise it is a feature request which can be implemented in several different ways such as B, C, or D.

Dictionary Style Missing Data Recode

Description

I would like to replace NAs for several columns at once.

Similar Features

This could be done several times with recode_na_as but I would rather do it once.

Feature Details

Given a dataset:

df <- structure(list(A = c(3L, 4L, NA, 9L, NA), B = c(9L, NA, NA, 2L, 
12L)), class = "data.frame", row.names = c(NA, -5L))

Recode A as 0 and B as 2020. Do this at once instead of:

mde::recode_na_as(df, pattern_type = "ends_with", pattern="A",value=0)
mde::recode_na_as(df, pattern_type = "ends_with", pattern="B",value=2020)

Proposed Implementation

None yet.
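
One possible starting point, offered only as a base R sketch: pass a named list that maps each column to its replacement value. The helper recode_na_dict and its replacements argument are hypothetical names, not part of mde.

```r
# Hypothetical helper (not part of mde): `replacements` is a named list
# mapping column name -> value used to replace NAs in that column.
recode_na_dict <- function(df, replacements) {
  # Fail early if the "dictionary" names a column that does not exist.
  stopifnot(all(names(replacements) %in% names(df)))
  for (col in names(replacements)) {
    df[[col]][is.na(df[[col]])] <- replacements[[col]]
  }
  df
}

df <- structure(list(A = c(3L, 4L, NA, 9L, NA), B = c(9L, NA, NA, 2L, 12L)),
                class = "data.frame", row.names = c(NA, -5L))
# Recode A as 0 and B as 2020 in a single call.
out <- recode_na_dict(df, list(A = 0, B = 2020))
print(out)
```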

Support multiple patterns

In recode_*_*, I would like to be able to use multiple patterns and/or use regex. In this example, I would like to do something like:
recode_na_as(df,value=0,pattern_type="starts_with",pattern="this|col") as opposed to:
recode_na_as(df,value=0,pattern_type="starts_with",pattern="this") %>% recode_na_as(value=0,pattern_type="starts_with",pattern="col")
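
The idea can be sketched in base R by matching column names against a single regular expression with alternation. The helper recode_na_regex below is hypothetical and only illustrates the column selection; it does not reflect how mde handles pattern_type internally.

```r
# Hypothetical helper (not part of mde): replace NAs only in columns
# whose names match the regular expression `pattern`.
recode_na_regex <- function(df, value, pattern) {
  cols <- grep(pattern, names(df), value = TRUE)
  for (col in cols) {
    df[[col]][is.na(df[[col]])] <- value
  }
  df
}

df <- data.frame(this_a = c(NA, 1), col_b = c(2, NA), other = c(NA, 3))
# "^(this|col)" plays the role of starts_with "this" OR starts_with "col".
out <- recode_na_regex(df, value = 0, pattern = "^(this|col)")
print(out)
```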

`na_summary` fails for `data.frame` objects with logical columns

Describe the bug

na_summary fails for logical columns.

To Reproduce

test_df <- data.frame(A = 1:4, B = as.logical(c(1, NA, 2, 4)))
mde::na_summary(test_df)

Expected behavior

An output from na_summary.

Unexpected behavior

Error: Problem with summarise() input ..1.
i ..1 = across(everything(), ~get_na_means(.)).
x no applicable method for 'get_na_means' applied to an object of class "logical"

System Details

Developer version https://github.com/Nelson-Gon/mde/tree/221eeab7c0ba19dd66a3186bfeb344c3a4032cb9

Conditionally Recode NA based on Percent Missingness

Description

I would like to recode_as_na if and only if the percentage of missing values meets some target criterion.

Similar Features

This is similar to drop_na_if, except that we are looking to keep values, not drop them.

Feature Details

I have described it in enough detail above.

Proposed Implementation

I currently have no proposed implementation.

drop_na_if returns percentages data frame instead

When using drop_na_if, it returns the percentages/decimals data.frame object instead of the original data.frame with columns dropped.

This is clearly a bug and is not what one would expect to happen.

Issues with descending order

na_summary's descending argument ascends instead.

Control over warning messages

In recode_as_na and other functions that force coercion, there should be an option to turn off the coercion warning (not pretty, but someone might prefer not to see the warnings).

na_counts and na_summary fail for logical vectors

When a data frame contains logical vectors, both na_counts and na_summary will fail with the error:

No applicable methods for na_counts applied to an object of class logical

To Reproduce

xdupe <- as.logical(c("T", "F", "F", "F", "T", "T", "F"))
ydupe <- as.logical(c("T", "F", "F", "F", "F", "T", "T"))
cities <- c("Knox", "Whiteville", "Madison", "York", "Paris", "Corona", "Bakersfield")
df <- data.frame(cities, xdupe, ydupe)
df$cities <- as.character(df$cities)
mde::na_summary(df)

Expected behavior

Expected to get a summary of missingness

Unexpected behavior

See above

System Details

R 4.3.1 mde 0.3.2

Exclude columns by RegEx match.

Description

For functions that use exclusion, it would be great to exclude via a regular expression or wildcard.

Similar Features

This is almost similar to pattern_type plus pattern except that this would exclude not include.

Feature Details

In shinymde, there's an exclude columns option when summarising missingness. This is tedious if you have thousands of columns. A simple RegEx match would save much more time.

Proposed Implementation

Support regular expression entries in the exclude_columns argument.
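
A minimal base R sketch of regex-based exclusion, using the built-in airquality data. The helper percent_missing_excluding is a hypothetical stand-in written for this example; it is not mde's percent_missing and does not use an exclude_columns argument.

```r
# Hypothetical helper (not part of mde): percent of NAs per column,
# skipping every column whose name matches `exclude_pattern`.
percent_missing_excluding <- function(df, exclude_pattern) {
  keep <- !grepl(exclude_pattern, names(df))
  # is.na() on a data.frame gives a logical matrix; colMeans() of that
  # matrix is the fraction of NAs in each remaining column.
  colMeans(is.na(df[, keep, drop = FALSE])) * 100
}

# Exclude Day and Temp with one regular expression.
out <- percent_missing_excluding(airquality, "^(Day|Temp)$")
print(out)
```

With thousands of columns, a single pattern such as a shared prefix would replace a long hand-typed exclusion list.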

Issues with recoding as NA

When using recode_as_na, character vectors are coerced to integer instead.

Example:

df <- data.frame(col_1 = c(45, 23, 89, "this", "and"),
                 col_2 = c(5,6,7,0,"this")) 
  col_1 col_2
1    45     5
2    23     6
3    89     7
4  this     0
5   and  this

Unexpected behavior


mde::recode_as_na(df,c("this","and","this"))
  col_1 col_2
1     2     2
2     1     3
3     3     4
4    NA     1
5    NA    NA

Compared to using characters:

df <- data.frame(col_1 = c(45, 23, 89, "this", "and"),
                col_2 = c(5,6,7,0,"this"), stringsAsFactors = FALSE)
 mde::recode_as_na(df,c("this","and","this"))
  col_1 col_2
1    45     5
2    23     6
3    89     7
4  <NA>     0
5  <NA>  <NA>

Improve grouped_sort in na_summary

Fix CRAN

Patch for checks Nelson-Gon/manymodelr#17 and r-lib/testthat#1051

drop_na_at should return the entire dataset

drop_na_at currently drops NAs and returns only columns for which missing values have been dropped. This might be less useful if one would like to do the analysis at once.

The package does not focus on imputation (just exploration), so it would be great to keep the entire dataset intact. Stated differently, one should drop_na_at if such a drop results in an equal number of rows (highly unlikely).

Support case (in)sensitivity in recoding

Support sorted output in na_summary

Drop rows based on missingness counts

Description

I would like to drop rows that contain missing values based on counts.

Similar Features

This is similar to drop_row_if except it would use counts not percents.

Feature Details

Given an example data set:

df <- data.frame(A=1:5, B=c(1,NA,NA,2, 3), C= c(1,NA,NA,2,3))

I would like to drop rows that have x number of NAs.

Proposed Implementation

Use drop_row_if but provide an argument for counts too.
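
For illustration, a count-based row filter can be written in base R with rowSums(). The helper drop_rows_by_na_count and its max_na argument are hypothetical, and the "more than max_na NAs" rule is an assumption about what "x number of NAs" should mean.

```r
# Hypothetical helper (not part of mde): drop rows that contain more
# than `max_na` missing values.
drop_rows_by_na_count <- function(df, max_na) {
  na_per_row <- rowSums(is.na(df))
  df[na_per_row <= max_na, , drop = FALSE]
}

df <- data.frame(A = 1:5, B = c(1, NA, NA, 2, 3), C = c(1, NA, NA, 2, 3))
# Rows 2 and 3 each have two NAs, so they are dropped when max_na = 1.
out <- drop_rows_by_na_count(df, max_na = 1)
print(out)
```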

Exclude certain columns when dropping NAs

In drop_na_if, one should be able to drop only for certain columns, i.e., subset and then drop at those columns.

This can probably be done using drop_na_at as the "backend" (behind the scenes).

Conditionally drop rows

Description

Drop rows with x% missing

Similar Features

This may be similar to column_based_recode or drop_na_if but for rows not columns.

Feature Details

Given a data.frame:


df <- data.frame(A=rep(NA,4), B=c(rep(NA,3),1))

I would like to keep only rows that have x% of observed values.

Proposed Implementation

None yet.
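
A possible base R sketch, assuming the intent is to keep rows with at least x% observed (non-missing) values. The helper keep_rows_by_observed and its argument name are hypothetical.

```r
# Hypothetical helper (not part of mde): keep rows whose percentage of
# observed (non-NA) values is at least `min_percent_observed`.
keep_rows_by_observed <- function(df, min_percent_observed) {
  percent_observed <- rowMeans(!is.na(df)) * 100
  df[percent_observed >= min_percent_observed, , drop = FALSE]
}

df <- data.frame(A = rep(NA, 4), B = c(rep(NA, 3), 1))
# Only row 4 has any observed value (1 of 2 columns, i.e. 50%).
out <- keep_rows_by_observed(df, min_percent_observed = 50)
print(out)
```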

Expand tests

A large portion of functions seem to have no tests at all, or only sketchy ones (recode_na_for, for instance). This will make future updates time-consuming. Perhaps use a coverage tool?

Provide warning and convert factors to character in recode_as_na_str

Description

When using recode_as_na_str, the result will return factor levels.

Similar Features

This is similar to recode_as_na_str

Feature Details

I would like to have a warning that tells me that factors have been converted to character during the recoding process.

Proposed Implementation

if (is.factor(x)) warning("X has been converted to character"). Proceed as usual.
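
The proposal above can be sketched for a single vector in base R as follows. The function recode_as_na_chr is a hypothetical stand-in, not the actual recode_as_na_str implementation.

```r
# Hypothetical vector-level helper (not part of mde): warn when a factor
# is converted to character, then recode matching values as NA.
recode_as_na_chr <- function(x, value) {
  if (is.factor(x)) {
    warning("Factor input has been converted to character.")
    x <- as.character(x)
  }
  x[x %in% value] <- NA
  x
}

x <- factor(c("a", "missing", "b"))
# suppressWarnings() is used here only to keep the example output quiet.
out <- suppressWarnings(recode_as_na_chr(x, "missing"))
print(out)
```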

Support grouping in drop_na_if

Description

I would like to drop groups that have x% missing.

Similar Features

This is similar to drop_na_if which currently doesn't support grouping.

Feature Details

The provided detail is sufficient.

Proposed Implementation

Add a grouping_cols argument to drop_na_if
Calculate percent missingness and drop as required.

Support mean, mode, max, sd, sem, etc

In recoding values, I would like to be able to use common functions like mean, mode, sd, max, min or my own user defined equation.
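
As a rough sketch of function-based recoding in base R, the helper below replaces NAs in numeric columns with the result of a user-supplied function applied to the observed values. The name recode_na_with is hypothetical and not part of mde.

```r
# Hypothetical helper (not part of mde): replace NAs in numeric columns
# with `fun` applied to the non-missing values of the same column.
recode_na_with <- function(df, fun) {
  df[] <- lapply(df, function(col) {
    if (is.numeric(col)) {
      col[is.na(col)] <- fun(col[!is.na(col)])
    }
    col
  })
  df
}

df <- data.frame(A = c(1, NA, 3), B = c(NA, 10, 20))
out_mean <- recode_na_with(df, mean)
# Any function returning a single value works, including user-defined ones.
out_custom <- recode_na_with(df, function(x) max(x) * 2)
print(out_mean)
```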

Rethink na_summary output

The current output from na_summary looks messy especially with respect to naming.

Support grouped recoding

Given a data.frame object, it is possible that one would like to recode_na_as a given value only for specific "individuals"/groups. Is there a way to support grouped replacements and, importantly, "subsetting" these groups?

Example:

some_data <- data.frame(ID=c("A1","A2","A3", "A4"), 
                        A=c(5,NA,0,8), B=c(10,0,0,1),
                        C=c(1,NA,NA,25))

For the above data, I would like to replace NAs with the value 233 or 420 only for IDs corresponding to A1 and A2.
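
One way this could work, sketched in base R: restrict the replacement to rows whose ID is in a chosen set. The helper recode_na_for_ids and its arguments are hypothetical and are not the mde API.

```r
# Hypothetical helper (not part of mde): replace NAs with `value` only
# in rows whose `id_col` entry is one of `ids`.
recode_na_for_ids <- function(df, id_col, ids, value) {
  in_group <- df[[id_col]] %in% ids
  for (col in setdiff(names(df), id_col)) {
    target <- in_group & is.na(df[[col]])
    df[[col]][target] <- value
  }
  df
}

some_data <- data.frame(ID = c("A1", "A2", "A3", "A4"),
                        A = c(5, NA, 0, 8), B = c(10, 0, 0, 1),
                        C = c(1, NA, NA, 25))
# The NA in row A3 is left alone because A3 is not in the selected IDs.
out <- recode_na_for_ids(some_data, "ID", c("A1", "A2"), 233)
print(out)
```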

Support dates in missingness reports

Description

Extend functionality for objects of class POSIXct or Date.

Similar Features

get_na_*

Feature Details

Sufficiently described.

Proposed Implementation

Write functions such as get_na_means.POSIXct or "inherit" from other classes.

Recode as NA based on a partial match

Description

I would like to recode_as_na based on string matching.

Similar Features

This is similar to recode_as_na

Feature Details

Given a data.frame:

partial_match <- data.frame(A=c("Hi","match_me"), B=c(NA, "not_me"))

I would like to change all values that contain "me" to NA.

Proposed Implementation

None, yet.
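
A possible base R starting point uses grepl() for the substring match. The helper recode_as_na_partial is hypothetical, and treating the pattern as a fixed string rather than a regex is an assumption made for this sketch.

```r
# Hypothetical helper (not part of mde): set character (or factor)
# values to NA when they contain the substring `pattern`.
recode_as_na_partial <- function(df, pattern) {
  df[] <- lapply(df, function(col) {
    if (is.character(col) || is.factor(col)) {
      col <- as.character(col)
      # grepl() returns FALSE for NA input, so existing NAs are untouched.
      col[grepl(pattern, col, fixed = TRUE)] <- NA
    }
    col
  })
  df
}

partial_match <- data.frame(A = c("Hi", "match_me"), B = c(NA, "not_me"),
                            stringsAsFactors = FALSE)
out <- recode_as_na_partial(partial_match, "me")
print(out)
```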

Moving to `dplyr` 1.0.0.

Please see this issue

Conditional NA recoding based on other columns

See this post for more.

Allow grouping in na_summary

In na_summary, I would like to get this summary by group.

Topic Based Vignettes

Description

I would like to have vignettes that deal with a single topic.

Similar Features

This is similar to the general package vignette.

Feature Details

I would like to have different vignettes for the following topics

Exploring missingness
Recoding

Proposed Implementation

Extend vignettes as proposed above.

Dealing with row names in na_summary

Description

I would like to reorder (renumber) row names when sorting in na_summary.

Similar Features

This is related to na_summary when sorted.

Feature Details

Given a data.frame object, running na_summary on this data works as expected, except that the returned rows keep their original row names. Example:

df <- data.frame(A=1:5,B=c(NA,NA,25,24,53), C=c(NA,1,2,3,4))

na_summary(df,sort_by="variable",descending=TRUE)                 
  variable missing complete percent_complete percent_missing
3        C       1        4               80              20
2        B       2        3               60              40
1        A       0        5              100               0

In the above result, we could instead change 3 to 1 and 1 to 3 to match the new order.

Proposed Implementation

Change row.names to 1:nrow(df). This might be fine for numeric rownames but not non-numeric indices. Say we had some names, it might be problematic to change these to numeric indices. Perhaps add a warning/argument to ask users what they would like to do with the indices?
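
The trade-off described above can be sketched in base R: renumber only when every row name looks like a plain integer, and warn otherwise. The helper reset_numeric_row_names is hypothetical, and the small summary_df below is made-up illustration data rather than real na_summary output.

```r
# Hypothetical helper (not part of mde): renumber row names as 1..n only
# when all existing row names look like plain integers.
reset_numeric_row_names <- function(df) {
  if (all(grepl("^[0-9]+$", rownames(df)))) {
    rownames(df) <- NULL  # resets to automatic 1..n numbering
  } else {
    warning("Row names look meaningful; leaving them unchanged.")
  }
  df
}

# Made-up summary table, sorted in descending order of `variable`.
summary_df <- data.frame(variable = c("A", "B", "C"), missing = c(0, 2, 1))
sorted <- summary_df[order(summary_df$variable, decreasing = TRUE), ]
out <- reset_numeric_row_names(sorted)
print(out)
```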
