nelson-gon / mde

mde: Missing Data Explorer

Home Page: https://nelson-gon.github.io/mde

License: GNU General Public License v3.0

R 100.00%
missing-data missing-values r-package r-stats data-analysis data-exploration r recode missing data-science

mde's People

Contributors

jordanjenkins, nelson-gon, shahronak47

mde's Issues

Allow multiple values as replacements

Given a data.frame object with NAs, I would like to be able to use recode_na_as to replace these with my own defined values, with one trick: these values should be recycled across the data.frame. Currently, this is not possible; only the first value is used, with the warning:

In x[list] <- values :
number of items to replace is not a multiple of replacement length
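
The requested behaviour can be sketched in base R. This is only an illustration, not mde's implementation: recycle_na_fill is a hypothetical helper, and it assumes that "recycled" means recycling the values over the NA positions of each column separately.

```r
# Hypothetical helper (not part of mde): fill NAs with `values`,
# recycled over the NA positions of each column.
recycle_na_fill <- function(df, values) {
  df[] <- lapply(df, function(col) {
    na_idx <- which(is.na(col))
    # rep_len() recycles `values` to exactly the number of NAs in this column,
    # which avoids the "not a multiple of replacement length" warning
    col[na_idx] <- rep_len(values, length(na_idx))
    col
  })
  df
}

example_df <- data.frame(A = c(1, NA, NA, 4), B = c(NA, 2, NA, NA))
filled <- recycle_na_fill(example_df, values = c(10, 20))
```

Recycling across the whole data.frame rather than per column would need a running offset between columns, which is left out here.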

Recode character NA as another value

Description

Given a character NA such as "na", I would like to convert this to 420 for example.

Similar Features

This is similar to recode_as_na, recode_na_as, recode_na_as_str. The only difference is that we would take any value and convert it to some other value.

Feature Details

Admittedly, the above may slightly deviate from the package's name, as the value being recoded may not necessarily be a missing value. However, for "na" it should be possible to do that.

Proposed Implementation

Currently, this can be achieved via a two-step process.

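# Step 1: recode the character placeholders as NA
to_na <- recode_as_na(df, value = c("na", "null"), ...)
# Step 2: recode the resulting NAs as the target value
from_na <- recode_na_as(to_na, value = 0, ...)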

This was inspired by the Stack Overflow question https://stackoverflow.com/q/70385221/10323798
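
For comparison, a one-step version can be sketched in base R. The function name recode_value and its arguments are hypothetical and are not part of mde; the sketch assumes character columns.

```r
# Hypothetical helper (not part of mde): replace any of the values in `from`
# with `to` in every column of a data.frame.
recode_value <- function(df, from, to) {
  df[] <- lapply(df, function(col) {
    col[col %in% from] <- to
    col
  })
  df
}

raw <- data.frame(x = c("1", "na", "3"),
                  y = c("null", "b", "na"),
                  stringsAsFactors = FALSE)
recoded <- recode_value(raw, from = c("na", "null"), to = "420")
```

Because the columns are character, the replacement is stored as the string "420"; converting to numeric afterwards would be a separate step.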

na_summary should be able to specify columns to include or exclude

Description

The function na_summary() should be able to specify a subset of columns to include or exclude.

Similar Features

This would enhance the na_summary() function. The percent_missing() function already has this feature in the argument exclude_cols. The argument ... in the na_summary() function might provide the feature, but there is no documented example of its usage.

Feature Details

The documented usage is
na_summary(df, grouping_cols = NULL, sort_by = NULL, descending = FALSE, ...)
where ... is Arguments to other functions.
But here ... fails. That is, the exclude_cols argument, which is an argument to the percent_missing() function, fails with an error when used in the na_summary() function. Here is an MWE:

> library(mde)
Welcome to mde. This is mde version 0.2.1.
 Please file issues and feedback at https://www.github.com/Nelson-Gon/mde/issues
Turn this message off using 'suppressPackageStartupMessages(library(mde))'
 Happy Exploration :)
> na_summary(airquality)
  variable missing complete percent_complete percent_missing
1      Day       0      153        100.00000        0.000000
2    Month       0      153        100.00000        0.000000
3    Ozone      37      116         75.81699       24.183007
4  Solar.R       7      146         95.42484        4.575163
5     Temp       0      153        100.00000        0.000000
6     Wind       0      153        100.00000        0.000000
> percent_missing(airquality, exclude_cols = c("Day","Temp"))
     Ozone  Solar.R Wind Month
1 24.18301 4.575163    0     0
> na_summary(airquality, exclude_cols = c("Day","Temp"))
Error in na_summary.data.frame(airquality, exclude_cols = c("Day", "Temp")) : 
  Binding of datasets failed. Please check using percent_missing and get_na_counts first

Proposed Implementation

I can think of at least four possible implementations:
A. Provide example documentation of how to use the argument exclude_cols of the percent_missing() function as the argument ... in the na_summary() function.
or
B. Add the argument exclude_cols or include_cols to the na_summary() function for variables you want to exclude or include.
or
C. Add a boolean argument for missingness versus completeness so that you will either get a completeness report (with the statistics columns complete and percent_complete) or a minimal missingness report (with the statistics columns missing and percent_missing).
or
D. Add the argument exclude_stats or include_stats to the na_summary() function for statistics you want to exclude or include.

I propose A if the issue is only a matter of improving the documentation. Otherwise, it is a feature request that can be implemented in several different ways, such as B, C, or D.

Dictionary Style Missing Data Recode

Description

I would like to replace NAs for several columns at once.

Similar Features

This could be done several times with recode_na_as but I would rather do it once.

Feature Details

Given a dataset:

df <- structure(list(A = c(3L, 4L, NA, 9L, NA), B = c(9L, NA, NA, 2L, 
12L)), class = "data.frame", row.names = c(NA, -5L))

Recode A as 0 and B as 2020. Do this at once instead of:

mde::recode_na_as(df, pattern_type = "ends_with", pattern="A",value=0)
mde::recode_na_as(df, pattern_type = "ends_with", pattern="B",value=2020)

Proposed Implementation

None yet.
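
One possible shape for this, sketched in base R under the assumption that the "dictionary" is a named list mapping column names to replacement values. recode_na_dict is a hypothetical name, not an existing mde function.

```r
# Example data from this issue
df <- structure(list(A = c(3L, 4L, NA, 9L, NA),
                     B = c(9L, NA, NA, 2L, 12L)),
                class = "data.frame", row.names = c(NA, -5L))

# Hypothetical helper (not part of mde): `replacements` is a named list,
# one replacement value per column name.
recode_na_dict <- function(df, replacements) {
  stopifnot(all(names(replacements) %in% names(df)))
  for (col in names(replacements)) {
    df[[col]][is.na(df[[col]])] <- replacements[[col]]
  }
  df
}

recoded <- recode_na_dict(df, list(A = 0L, B = 2020L))
```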

Support multiple patterns

In recode_*_*, I would like to be able to use multiple patterns and/or regex. In this example, I would like to do something like:
recode_na_as(df, value=0, pattern_type="starts_with", pattern="this|col")
as opposed to:
recode_na_as(df, value=0, pattern_type="starts_with", pattern="this") %>% recode_na_as(value=0, pattern_type="starts_with", pattern="col")
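
As a rough illustration of the idea in base R (not how mde selects columns internally), a single regular expression with alternation can match several column-name prefixes at once. recode_na_matching and the example column names are hypothetical.

```r
# Hypothetical helper (not part of mde): recode NAs as `value` in every
# column whose name matches the regular expression `pattern`.
recode_na_matching <- function(df, value, pattern) {
  cols <- grep(pattern, names(df), value = TRUE)
  for (col in cols) {
    df[[col]][is.na(df[[col]])] <- value
  }
  df
}

wide <- data.frame(this_one = c(NA, 1), col_two = c(2, NA), other = c(NA, 3))
# "^(this|col)" anchors the alternation to the start of the column name,
# which mimics starts_with for two prefixes in one call
out <- recode_na_matching(wide, value = 0, pattern = "^(this|col)")
```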

`na_summary` fails for `data.frame` objects with logical columns

Describe the bug

na_summary fails for logical columns.

To Reproduce

test_df <- data.frame(A = 1:4, B = as.logical(c(1, NA, 2, 4)))
mde::na_summary(test_df)

Expected behavior

An output from na_summary.

Unexpected behavior

Error: Problem with summarise() input ..1.
i ..1 = across(everything(), ~get_na_means(.)).
x no applicable method for 'get_na_means' applied to an object of class "logical"

System Details

Developer version https://github.com/Nelson-Gon/mde/tree/221eeab7c0ba19dd66a3186bfeb344c3a4032cb9

Conditionally Recode NA based on Percent Missingness

Description

I would like to recode_as_na if and only if the percentage of missing values meets some target criterion.

Similar Features

This is similar to drop_na_if, except we're looking to keep, not drop, values.

Feature Details

I have described it in enough detail above.

Proposed Implementation

I currently have no proposed implementation.
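
A minimal base R sketch of what this might look like. The name recode_as_na_if, the percent_na argument, and the ">=" comparison are all assumptions made for illustration; the issue does not specify them.

```r
# Hypothetical helper (not part of mde): recode `value` as NA, but only in
# columns whose percentage of missing values is at least `percent_na`.
recode_as_na_if <- function(df, value, percent_na = 50) {
  for (col in names(df)) {
    pct_missing <- 100 * mean(is.na(df[[col]]))
    if (pct_missing >= percent_na) {
      df[[col]][df[[col]] %in% value] <- NA
    }
  }
  df
}

scores <- data.frame(A = c(NA, NA, -99, 1), B = c(-99, 2, 3, 4))
out <- recode_as_na_if(scores, value = -99, percent_na = 50)
```

Here column A is 50% missing, so its -99 is recoded; column B has no missing values and is left alone.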

drop_na_if returns percentages data frame instead

When using drop_na_if, it returns the percentages/decimals data.frame object instead of the original data.frame with columns dropped.

This is clearly a bug and is not what one would expect to happen.

Control over warning messages

In recode_as_na and other functions that force coercion, there should be an option to turn off the coercion warning (not pretty, but someone might prefer not to see the warnings).

na_counts and na_summary fail for logical vectors

When a data frame contains logical vectors, both na_counts and na_summary will fail with the error:

No applicable methods for na_counts applied to an object of class logical

To Reproduce

xdupe <- as.logical(c("T", "F", "F", "F", "T", "T", "F"))
ydupe <- as.logical(c("T", "F", "F", "F", "F", "T", "T"))
cities <- c("Knox", "Whiteville", "Madison", "York", "Paris", "Corona", "Bakersfield")
df <- data.frame(cities, xdupe, ydupe)
df$cities <- as.character(df$cities)
mde::na_summary(df)

Expected behavior

Expected to get a summary of missingness

Unexpected behavior

See above.

System Details

R 4.3.1 mde 0.3.2

Exclude columns by RegEx match.

Description

For functions that use exclusion, it would be great to exclude via a regular expression or wildcard.

Similar Features

This is similar to pattern_type plus pattern, except that this would exclude rather than include.

Feature Details

In shinymde, there's an exclude columns option when summarising missingness. This is tedious if you have thousands of columns. A simple RegEx match would save much more time.

Proposed Implementation

Support regular expression entries in the exclude_columns argument.
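
The idea can be sketched in base R with grepl on the column names. percent_missing_excluding is a hypothetical function written for this illustration; it does not call mde and its output format is simpler than mde's.

```r
# Hypothetical helper (not part of mde): percentage of missing values per
# column, skipping columns whose names match `exclude_pattern`.
percent_missing_excluding <- function(df, exclude_pattern) {
  keep <- !grepl(exclude_pattern, names(df))
  vapply(df[keep], function(col) 100 * mean(is.na(col)), numeric(1))
}

# airquality ships with R's datasets package
res <- percent_missing_excluding(airquality, exclude_pattern = "^(Day|Temp)$")
```

With thousands of columns, a pattern such as "^id_" would replace a long explicit list of names.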

Issues with recoding as NA

When using recode_as_na, character columns that data.frame() has converted to factors are returned as integer level codes instead.

Example:

df <- data.frame(col_1 = c(45, 23, 89, "this", "and"),
                 col_2 = c(5,6,7,0,"this")) 
  col_1 col_2
1    45     5
2    23     6
3    89     7
4  this     0
5   and  this

Unexpected behavior


mde::recode_as_na(df,c("this","and","this"))
  col_1 col_2
1     2     2
2     1     3
3     3     4
4    NA     1
5    NA    NA

Compared to using characters:

df <- data.frame(col_1 = c(45, 23, 89, "this", "and"),
                col_2 = c(5,6,7,0,"this"), stringsAsFactors = FALSE)
 mde::recode_as_na(df,c("this","and","this"))
  col_1 col_2
1    45     5
2    23     6
3    89     7
4  <NA>     0
5  <NA>  <NA>

drop_na_at should return the entire dataset

drop_na_at currently drops NAs and returns only columns for which missing values have been dropped. This might be less useful if one would like to do the analysis at once.

The package does not focus on imputation (just exploration), so it would be great to keep the entire dataset intact. Stated differently, one should drop_na_at if such a drop results in an equal number of rows (highly unlikely).

Drop rows based on missingness counts

Description

I would like to drop rows that contain missing values based on counts.

Similar Features

This is similar to drop_row_if except it would use counts not percents.

Feature Details

Given an example data set:

df <- data.frame(A=1:5, B=c(1,NA,NA,2, 3), C= c(1,NA,NA,2,3))

I would like to drop rows that have x number of NAs.

Proposed Implementation

Use drop_row_if but provide an argument for counts too.
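
A count-based filter could look roughly like this in base R. drop_rows_by_na_count is a hypothetical name, and the sketch assumes "x number of NAs" means "at least x".

```r
# Example data from this issue
df <- data.frame(A = 1:5, B = c(1, NA, NA, 2, 3), C = c(1, NA, NA, 2, 3))

# Hypothetical helper (not part of mde): drop rows that contain at least
# `na_count` missing values.
drop_rows_by_na_count <- function(df, na_count) {
  keep <- rowSums(is.na(df)) < na_count
  df[keep, , drop = FALSE]
}

kept <- drop_rows_by_na_count(df, na_count = 2)
```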

Exclude certain columns when dropping NAs

In drop_na_if, one should be able to drop_na_if only for certain columns, i.e., subset and drop_at.

This can probably be done using drop_na_at as the "backend" (behind the scenes).

Conditionally drop rows

Description

Drop rows with x% missing

Similar Features

This may be similar to column_based_recode or drop_na_if but for rows not columns.

Feature Details

Given a data.frame:


df <- data.frame(A=rep(NA,4), B=c(rep(NA,3),1))

I would like to keep only rows that have x% of observed values.

Proposed Implementation

None yet.
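
One way this might work, sketched in base R. keep_rows_observed is a hypothetical name, and the sketch assumes the threshold is a minimum percentage of observed (non-missing) values per row.

```r
# Example data from this issue
df <- data.frame(A = rep(NA, 4), B = c(rep(NA, 3), 1))

# Hypothetical helper (not part of mde): keep rows whose percentage of
# observed values is at least `min_percent_observed`.
keep_rows_observed <- function(df, min_percent_observed) {
  pct_observed <- 100 * rowMeans(!is.na(df))
  df[pct_observed >= min_percent_observed, , drop = FALSE]
}

out <- keep_rows_observed(df, min_percent_observed = 50)
```

Only the last row has any observed value (1 of 2 columns, i.e. 50%), so it is the only row kept.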

Expand tests

A large portion of the functions seem to have no tests at all, or only sketchy tests (recode_na_for, for instance). This will make future updates time-consuming. Perhaps use coverage?

Provide warning and convert factors to character in recode_as_na_str

Description

When using recode_as_na_str, the result will return factor levels.

Similar Features

This is similar to recode_as_na_str

Feature Details

I would like to have a warning that tells me that factors have been converted to character during the recoding process.

Proposed Implementation

if (is.factor(x)) warning("X has been converted to character"). Proceed as usual.

Support grouping in drop_na_if

Description

I would like to drop groups that have x% missing.

Similar Features

This is similar to drop_na_if which currently doesn't support grouping.

Feature Details

The provided detail is sufficient.

Proposed Implementation

  • Add a grouping_cols argument to drop_na_if
  • Calculate percent missingness and drop as required.
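
The two steps above can be sketched in base R as follows. This is an illustration under assumptions, not mde code: drop_group_if, target_col, and the "strictly below the threshold is kept" rule are all hypothetical, and only a single grouping column is handled.

```r
# Hypothetical helper (not part of mde): drop every group whose percentage of
# missing values in `target_col` is at least `percent_na`.
drop_group_if <- function(df, grouping_col, target_col, percent_na = 50) {
  # Percent missingness per group
  pct_na <- tapply(df[[target_col]], df[[grouping_col]],
                   function(x) 100 * mean(is.na(x)))
  keep_groups <- names(pct_na)[pct_na < percent_na]
  df[df[[grouping_col]] %in% keep_groups, , drop = FALSE]
}

grouped <- data.frame(g = c("a", "a", "b", "b"),
                      x = c(NA, NA, 1, NA),
                      stringsAsFactors = FALSE)
out <- drop_group_if(grouped, grouping_col = "g", target_col = "x",
                     percent_na = 60)
```

Group "a" is 100% missing and is dropped; group "b" is 50% missing and is kept.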

Support grouped recoding

Given a data.frame object, it is possible that one would like to recode_na_as a given value only for specific "individuals"/groups. Is there a way to support grouped replacements and, importantly, "subsetting" these groups?

Example:

some_data <- data.frame(ID=c("A1","A2","A3", "A4"), 
                        A=c(5,NA,0,8), B=c(10,0,0,1),
                        C=c(1,NA,NA,25))

For the above data, I would like to replace NAs with the value 233 or 420 only for IDs corresponding to A1 and A2.
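
In base R, the requested subsetting could be sketched like this. recode_na_for_ids is a hypothetical helper written for illustration, not an mde function, and it uses a single replacement value.

```r
# Example data from this issue
some_data <- data.frame(ID = c("A1", "A2", "A3", "A4"),
                        A = c(5, NA, 0, 8), B = c(10, 0, 0, 1),
                        C = c(1, NA, NA, 25))

# Hypothetical helper (not part of mde): recode NAs as `value`, but only in
# rows whose `id_col` entry is one of `ids`.
recode_na_for_ids <- function(df, id_col, ids, value) {
  in_group <- df[[id_col]] %in% ids
  for (col in setdiff(names(df), id_col)) {
    hit <- in_group & is.na(df[[col]])
    df[[col]][hit] <- value
  }
  df
}

out <- recode_na_for_ids(some_data, id_col = "ID",
                         ids = c("A1", "A2"), value = 233)
```

The NA in column C for A3 is left untouched because A3 is not in the selected IDs.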

Support dates in missingness reports

Description

Extend functionality for objects of class POSIXct or Date.

Similar Features

get_na_*

Feature Details

Sufficiently described.

Proposed Implementation

Write functions such as get_na_means.POSIXct or "inherit" from other classes.
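
To illustrate the "inherit" option with S3 dispatch, here is a small sketch. The generic na_percent is hypothetical and deliberately does not reuse mde's own generic names; the point is only that a default method built on is.na() already covers Date, POSIXct, and logical vectors without class-specific methods.

```r
# Hypothetical generic (not part of mde): percentage of missing values
na_percent <- function(x, ...) UseMethod("na_percent")

# The default method is used for any class without its own method,
# including Date, POSIXct, and logical
na_percent.default <- function(x, ...) 100 * mean(is.na(x))

dates <- as.Date(c("2020-01-01", NA))
times <- as.POSIXct(c("2020-01-01 10:00:00", NA), tz = "UTC")
flags <- c(TRUE, NA, FALSE, NA)

date_pct <- na_percent(dates)
time_pct <- na_percent(times)
flag_pct <- na_percent(flags)
```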

Recode as NA based on a partial match

Description

I would like to recode_as_na based on string matching.

Similar Features

This is similar to recode_as_na

Feature Details

Given a data.frame:

partial_match <- data.frame(A=c("Hi","match_me"), B=c(NA, "not_me"))

I would like to change all values that contain "me" to NA.

Proposed Implementation

None, yet.
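
A possible starting point, sketched in base R with grepl. recode_as_na_partial is a hypothetical name; the sketch assumes a literal (non-regex) substring match and character columns.

```r
# Example data from this issue (kept as character, not factor)
partial_match <- data.frame(A = c("Hi", "match_me"), B = c(NA, "not_me"),
                            stringsAsFactors = FALSE)

# Hypothetical helper (not part of mde): recode as NA every value that
# contains the literal substring `pattern`.
recode_as_na_partial <- function(df, pattern) {
  df[] <- lapply(df, function(col) {
    # fixed = TRUE treats `pattern` as plain text; grepl() returns FALSE for NA
    col[grepl(pattern, col, fixed = TRUE)] <- NA
    col
  })
  df
}

out <- recode_as_na_partial(partial_match, pattern = "me")
```

Note that "not_me" also contains "me", so it is recoded as well; excluding it would need a stricter pattern.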

Topic Based Vignettes

Description

I would like to have vignettes that deal with a single topic.

Similar Features

This is similar to the general package vignette.

Feature Details

I would like to have different vignettes for the following topics

  • Exploring missingness
  • Recoding

Proposed Implementation

Extend vignettes as proposed above.

Dealing with row names in na_summary

Description

I would like to preserve or "reorder" row names when sorting in na_summary.

Similar Features

This is related to na_summary when sorted.

Feature Details

Given a data.frame object, running na_summary on this data works as expected, except that the returned row names keep their original numbering. Example:

df <- data.frame(A=1:5,B=c(NA,NA,25,24,53), C=c(NA,1,2,3,4))

na_summary(df,sort_by="variable",descending=TRUE)                 
  variable missing complete percent_complete percent_missing
3        C       1        4               80              20
2        B       2        3               60              40
1        A       0        5              100               0

In the above result, we could instead renumber the rows, changing 3 to 1 and 1 to 3, as per the new order.

Proposed Implementation

Change row.names to 1:nrow(df). This might be fine for numeric rownames but not non-numeric indices. Say we had some names, it might be problematic to change these to numeric indices. Perhaps add a warning/argument to ask users what they would like to do with the indices?
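
The argument-based option could be sketched as below. sort_summary and reset_rownames are hypothetical names used for illustration only; this is not na_summary's code.

```r
# Hypothetical helper (not part of mde): sort a summary data.frame and let
# the caller decide whether row names are renumbered.
sort_summary <- function(summary_df, sort_by, descending = FALSE,
                         reset_rownames = TRUE) {
  ord <- order(summary_df[[sort_by]], decreasing = descending)
  out <- summary_df[ord, , drop = FALSE]
  # Setting row names to NULL restores the default 1..n numbering
  if (reset_rownames) rownames(out) <- NULL
  out
}

summary_df <- data.frame(variable = c("A", "B", "C"),
                         missing = c(0, 2, 1),
                         stringsAsFactors = FALSE)
renumbered <- sort_summary(summary_df, "variable", descending = TRUE)
preserved <- sort_summary(summary_df, "variable", descending = TRUE,
                          reset_rownames = FALSE)
```

Keeping reset_rownames = FALSE available leaves meaningful, non-numeric row names intact for users who rely on them.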
