njtierney / naniar

Tidy data structures, summaries, and visualisations for missing data

Home Page: http://naniar.njtierney.com/

License: Other

R 100.00%

missing-data data-visualisation ggplot2 missingness tidy-data r-package

naniar's Introduction

naniar

naniar provides principled, tidy ways to summarise, visualise, and manipulate missing data with minimal deviations from the workflows in ggplot2 and tidy data. It does this by providing:

Shadow matrices, a tidy data structure for missing data:
- bind_shadow() and nabular()
Shorthand summaries for missing data:
- n_miss() and n_complete()
- pct_miss() and pct_complete()
Numerical summaries of missing data in variables and cases:
- miss_var_summary() and miss_var_table()
- miss_case_summary() and miss_case_table()
Statistical tests of missingness:
- mcar_test() for Little’s (1988) missing completely at random (MCAR) test
Visualisation for missing data:
- geom_miss_point()
- gg_miss_var()
- gg_miss_case()
- gg_miss_fct()

For more details on the workflow and theory underpinning naniar, read the vignette Getting started with naniar.

For a short primer on the data visualisation available in naniar, read the vignette Gallery of Missing Data Visualisations.

For full details of the package, including

Installation

You can install naniar from CRAN:

install.packages("naniar")

Or you can install the development version from GitHub using remotes:

# install.packages("remotes")
remotes::install_github("njtierney/naniar")

A short overview of naniar

Visualising missing data might sound a little strange: how do you visualise something that is not there? One approach comes from ggobi and manet, which replace NA values with values 10% lower than the minimum value in that variable. This visualisation is provided with the geom_miss_point() ggplot2 geom, which we illustrate by exploring the relationship between Ozone and Solar radiation from the airquality dataset.

library(ggplot2)

ggplot(data = airquality,
       aes(x = Ozone,
           y = Solar.R)) +
  geom_point()
#> Warning: Removed 42 rows containing missing values or values outside the scale range
#> (`geom_point()`).

ggplot2 does not display these missing values; it drops them and gives a warning message.

We can instead use geom_miss_point() to display the missing data:

library(naniar)

ggplot(data = airquality,
       aes(x = Ozone,
           y = Solar.R)) +
  geom_miss_point()

geom_miss_point() has shifted the missing values to now be 10% below the minimum value. The missing values are a different colour so that missingness becomes pre-attentive. As it is a ggplot2 geom, it supports features like faceting and other ggplot features.

p1 <-
ggplot(data = airquality,
       aes(x = Ozone,
           y = Solar.R)) + 
  geom_miss_point() + 
  facet_wrap(~Month, ncol = 2) + 
  theme(legend.position = "bottom")

p1
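The shift that geom_miss_point() applies can be approximated in base R. The sketch below is not naniar's implementation: shift_missing() is a hypothetical helper, and it assumes that "10% below the minimum" means the minimum minus 10% of the observed range, with no jitter added.

```r
# Sketch only: shift_missing() is a hypothetical helper, not a naniar function.
# Assumption: the offset is 10% of the observed range, measured from the minimum.
shift_missing <- function(x, prop_below = 0.1) {
  x_min <- min(x, na.rm = TRUE)
  x_range <- max(x, na.rm = TRUE) - x_min
  # Replace each NA with one value just below the observed minimum
  ifelse(is.na(x), x_min - prop_below * x_range, x)
}

shifted <- shift_missing(airquality$Ozone)

# Every formerly missing value now sits below the smallest observed value
range(airquality$Ozone, na.rm = TRUE)
min(shifted)
```

Plotting the shifted values against each other places the missing cases in a band along the axes, which is the effect the geom produces.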

Data Structures

naniar provides a data structure for working with missing data, the shadow matrix (Swayne and Buja, 1998). The shadow matrix has the same dimensions as the data and consists of binary indicators of missingness, where missing is represented as “NA” and not missing is represented as “!NA”. Variable names are kept the same, with the suffix “_NA” added.

head(airquality)
#>   Ozone Solar.R Wind Temp Month Day
#> 1    41     190  7.4   67     5   1
#> 2    36     118  8.0   72     5   2
#> 3    12     149 12.6   74     5   3
#> 4    18     313 11.5   62     5   4
#> 5    NA      NA 14.3   56     5   5
#> 6    28      NA 14.9   66     5   6

as_shadow(airquality)
#> # A tibble: 153 × 6
#>    Ozone_NA Solar.R_NA Wind_NA Temp_NA Month_NA Day_NA
#>    <fct>    <fct>      <fct>   <fct>   <fct>    <fct> 
#>  1 !NA      !NA        !NA     !NA     !NA      !NA   
#>  2 !NA      !NA        !NA     !NA     !NA      !NA   
#>  3 !NA      !NA        !NA     !NA     !NA      !NA   
#>  4 !NA      !NA        !NA     !NA     !NA      !NA   
#>  5 NA       NA         !NA     !NA     !NA      !NA   
#>  6 !NA      NA         !NA     !NA     !NA      !NA   
#>  7 !NA      !NA        !NA     !NA     !NA      !NA   
#>  8 !NA      !NA        !NA     !NA     !NA      !NA   
#>  9 !NA      !NA        !NA     !NA     !NA      !NA   
#> 10 NA       !NA        !NA     !NA     !NA      !NA   
#> # ℹ 143 more rows
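To make the structure concrete, here is a minimal base R sketch of how a shadow matrix could be built. make_shadow() is a hypothetical stand-in written for illustration; it is not how as_shadow() is implemented.

```r
# Sketch only: make_shadow() is a hypothetical stand-in for as_shadow().
make_shadow <- function(data) {
  shadow <- lapply(data, function(col) {
    # Two-level factor: "!NA" for observed values, "NA" for missing ones
    factor(is.na(col), levels = c(FALSE, TRUE), labels = c("!NA", "NA"))
  })
  shadow <- as.data.frame(shadow)
  # Keep the original names and add the "_NA" suffix
  names(shadow) <- paste0(names(data), "_NA")
  shadow
}

aq_shadow <- make_shadow(airquality)
head(aq_shadow)

# Binding the shadow to the data gives the 12-column "nabular" layout
aq_nabular <- cbind(airquality, aq_shadow)
dim(aq_nabular)
```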

Binding the shadow data to the data helps you keep better track of the missing values. This format is called “nabular”, a portmanteau of NA and tabular. You can bind the shadow to the data using bind_shadow or nabular:

bind_shadow(airquality)
#> # A tibble: 153 × 12
#>    Ozone Solar.R  Wind  Temp Month   Day Ozone_NA Solar.R_NA Wind_NA Temp_NA
#>    <int>   <int> <dbl> <int> <int> <int> <fct>    <fct>      <fct>   <fct>  
#>  1    41     190   7.4    67     5     1 !NA      !NA        !NA     !NA    
#>  2    36     118   8      72     5     2 !NA      !NA        !NA     !NA    
#>  3    12     149  12.6    74     5     3 !NA      !NA        !NA     !NA    
#>  4    18     313  11.5    62     5     4 !NA      !NA        !NA     !NA    
#>  5    NA      NA  14.3    56     5     5 NA       NA         !NA     !NA    
#>  6    28      NA  14.9    66     5     6 !NA      NA         !NA     !NA    
#>  7    23     299   8.6    65     5     7 !NA      !NA        !NA     !NA    
#>  8    19      99  13.8    59     5     8 !NA      !NA        !NA     !NA    
#>  9     8      19  20.1    61     5     9 !NA      !NA        !NA     !NA    
#> 10    NA     194   8.6    69     5    10 NA       !NA        !NA     !NA    
#> # ℹ 143 more rows
#> # ℹ 2 more variables: Month_NA <fct>, Day_NA <fct>
nabular(airquality)
#> # A tibble: 153 × 12
#>    Ozone Solar.R  Wind  Temp Month   Day Ozone_NA Solar.R_NA Wind_NA Temp_NA
#>    <int>   <int> <dbl> <int> <int> <int> <fct>    <fct>      <fct>   <fct>  
#>  1    41     190   7.4    67     5     1 !NA      !NA        !NA     !NA    
#>  2    36     118   8      72     5     2 !NA      !NA        !NA     !NA    
#>  3    12     149  12.6    74     5     3 !NA      !NA        !NA     !NA    
#>  4    18     313  11.5    62     5     4 !NA      !NA        !NA     !NA    
#>  5    NA      NA  14.3    56     5     5 NA       NA         !NA     !NA    
#>  6    28      NA  14.9    66     5     6 !NA      NA         !NA     !NA    
#>  7    23     299   8.6    65     5     7 !NA      !NA        !NA     !NA    
#>  8    19      99  13.8    59     5     8 !NA      !NA        !NA     !NA    
#>  9     8      19  20.1    61     5     9 !NA      !NA        !NA     !NA    
#> 10    NA     194   8.6    69     5    10 NA       !NA        !NA     !NA    
#> # ℹ 143 more rows
#> # ℹ 2 more variables: Month_NA <fct>, Day_NA <fct>

Using the nabular format helps you manage where missing values are in your dataset and makes it easy to do visualisations where you split by missingness:

airquality %>%
  bind_shadow() %>%
  ggplot(aes(x = Temp,
             fill = Ozone_NA)) + 
  geom_density(alpha = 0.5)

And even visualise imputations:

airquality %>%
  bind_shadow() %>%
  as.data.frame() %>% 
   simputation::impute_lm(Ozone ~ Temp + Solar.R) %>%
  ggplot(aes(x = Solar.R,
             y = Ozone,
             colour = Ozone_NA)) + 
  geom_point()
#> Warning: Removed 7 rows containing missing values or values outside the scale range
#> (`geom_point()`).

Or create an upset plot of the combinations of missingness across cases, using the gg_miss_upset function:

gg_miss_upset(airquality)

naniar does this while following consistent principles that are easy to read, thanks to the tools of the tidyverse.

naniar also provides handy visualisations for each variable:

gg_miss_var(airquality)

Or the number of missings in a given variable at a repeating span:

gg_miss_span(pedestrian,
             var = hourly_counts,
             span_every = 1500)

You can read about all of the visualisations in naniar in the vignette Gallery of missing data visualisations using naniar.

naniar also provides handy helpers for calculating the number, proportion, and percentage of missing and complete observations:

n_miss(airquality)
#> [1] 44
n_complete(airquality)
#> [1] 874
prop_miss(airquality)
#> [1] 0.04793028
prop_complete(airquality)
#> [1] 0.9520697
pct_miss(airquality)
#> [1] 4.793028
pct_complete(airquality)
#> [1] 95.20697
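For readers who want to see what these helpers count, the same quantities can be computed in base R. This is a sketch of the arithmetic only, not naniar's code; the variable names are made up for illustration.

```r
# is.na() on a data frame returns a logical matrix, one cell per value
n_missing  <- sum(is.na(airquality))
n_present  <- sum(!is.na(airquality))

# The mean of a logical matrix is the proportion of TRUE cells
prop_missing <- mean(is.na(airquality))
pct_missing  <- 100 * prop_missing

c(n_missing = n_missing, n_present = n_present, pct_missing = pct_missing)
```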

Numerical summaries for missing data

naniar provides numerical summaries of missing data that follow a consistent naming rule: every function name begins with miss_. Summaries that focus on variables, or on a single selected variable, start with miss_var_, and summaries for cases (the initial collected row order of the data) start with miss_case_. All of these functions that return data frames also work with dplyr’s group_by().

For example, we can look at the number and percent of missings in each variable and case with miss_var_summary() and miss_case_summary(), which both return output ordered by the number of missing values.

miss_var_summary(airquality)
#> # A tibble: 6 × 3
#>   variable n_miss pct_miss
#>   <chr>     <int>    <num>
#> 1 Ozone        37    24.2 
#> 2 Solar.R       7     4.58
#> 3 Wind          0     0   
#> 4 Temp          0     0   
#> 5 Month         0     0   
#> 6 Day           0     0
miss_case_summary(airquality)
#> # A tibble: 153 × 3
#>     case n_miss pct_miss
#>    <int>  <int>    <dbl>
#>  1     5      2     33.3
#>  2    27      2     33.3
#>  3     6      1     16.7
#>  4    10      1     16.7
#>  5    11      1     16.7
#>  6    25      1     16.7
#>  7    26      1     16.7
#>  8    32      1     16.7
#>  9    33      1     16.7
#> 10    34      1     16.7
#> # ℹ 143 more rows
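As a rough illustration of what these summaries contain, the per-variable and per-case counts can be reproduced with colSums() and rowSums(). This is a base R sketch under my own naming, not the package's implementation.

```r
# Missing values per variable, ordered from most to least missing
var_n_miss <- colSums(is.na(airquality))
var_summary <- data.frame(
  variable = names(var_n_miss),
  n_miss   = unname(var_n_miss),
  pct_miss = 100 * unname(var_n_miss) / nrow(airquality)
)
var_summary <- var_summary[order(-var_summary$n_miss), ]
var_summary

# Missing values per case (row)
case_n_miss <- rowSums(is.na(airquality))
head(sort(case_n_miss, decreasing = TRUE))
```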

You could also group_by() to work out the number of missings in each variable across the levels within it.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
airquality %>%
  group_by(Month) %>%
  miss_var_summary()
#> # A tibble: 25 × 4
#> # Groups:   Month [5]
#>    Month variable n_miss pct_miss
#>    <int> <chr>     <int>    <num>
#>  1     5 Ozone         5     16.1
#>  2     5 Solar.R       4     12.9
#>  3     5 Wind          0      0  
#>  4     5 Temp          0      0  
#>  5     5 Day           0      0  
#>  6     6 Ozone        21     70  
#>  7     6 Solar.R       0      0  
#>  8     6 Wind          0      0  
#>  9     6 Temp          0      0  
#> 10     6 Day           0      0  
#> # ℹ 15 more rows

You can read more about all of these functions in the vignette “Getting Started with naniar”.

Statistical tests of missingness

naniar provides mcar_test() for Little’s (1988) statistical test for missing completely at random (MCAR) data. The null hypothesis in this test is that the data is MCAR, and the test statistic is a chi-squared value. Given the high statistic value and low p-value, we can conclude that the airquality data is not missing completely at random:

mcar_test(airquality)
#> # A tibble: 1 × 4
#>   statistic    df p.value missing.patterns
#>       <dbl> <dbl>   <dbl>            <int>
#> 1      35.1    14 0.00142                4

Contributions

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Future Work

Extend the geom_miss_* family to include categorical variables, Bivariate plots: scatterplots, density overlays
SQL translation for databases
Big Data tools (sparklyr, sparklingwater)
Work well with other imputation engines / processes
Provide tools for assessing goodness of fit for classical approaches of MCAR, MAR, and MNAR (graphical inference from nullabor package)

Acknowledgements

Firstly, thanks to Di Cook for giving the initial inspiration for the package and laying down the rich theory and literature that the work in naniar is built upon. Naming credit (once again!) goes to Miles McBain. Among various other things, Miles also worked out how to overload the missing data and make it work as a geom. Thanks also to Colin Fay for helping me understand tidy evaluation and for features such as replace_to_na, miss_*_cumsum, and more.

A note on the name

naniar was previously named ggmissing and initially provided a ggplot geom and some other visualisations. ggmissing was changed to naniar to reflect the fact that this package is going to be bigger in scope, and is not just related to ggplot2. Specifically, the package is designed to provide a suite of tools for visualising missing values and imputations, and for manipulating and summarising missing data.

…But why naniar?

Well, I think it is useful to think of missing values in data being like this other dimension, perhaps like C.S. Lewis’s Narnia - a different world, hidden away. You go inside, and sometimes it seems like you’ve spent no time in there but time has passed very quickly, or the opposite. Also, NAniar = na in r, and if you so desire, naniar may sound like “noneoya” in an nz/aussie accent. Full credit to @MilesMcbain for the name, and @Hadley for the rearranged spelling.


naniar's Issues

Look into using ggduo to visualise missings

https://ggobi.github.io/ggally/#ggallyggduo

Create data functions for facilitating univariate/1D plots of missings

add badges for appveyor and travis-CI

introduce functions to work with spark

@MilesMcBain mentioned that it is difficult to work with missing data within spark.

For example:

library(sparklyr)
library(tibble)
library(dplyr)

dat <- tribble(
    ~A, ~B, ~C,
    NA,  1,  1,
    1,  1, NA,
    NA, NA,  1,
    NA, NA, NA,
    1,  1,  1 
)

sc <- spark_connect(master = "local")
spark_dat <- copy_to(sc, dat)

# A crappy, non-scalable way to do complete.cases
complete_cases <- 
    spark_dat %>% 
    filter(!is.na(A) & !is.na(B) & !is.na(C)) %>%
    collect()

# A crappy, non-scalable way to find rows with any NA
any_na <-
    spark_dat %>%
    filter(!(!is.na(A) & !is.na(B) & !is.na(C))) %>%
    collect()

It would be great to have naniar functions that also worked with spark.

Not sure how much work this would involve, but it looks like rstudio have a pretty nice extension API.

Just a thought for now, there's a lot of other things that I want to finish up first, but this should be on my roadmap towards version 1.0.0

create other flavours of missing values

Building on issues #25 and #31, and discussions with @rgayler, there needs to be a way to create different flavours of missing values to indicate different mechanisms.

An example of this could be where a weather station records -99 as a missing value, but missing specifically because the weather was so cold the instruments stopped working.

Currently in R there is only one kind of NA value (ignoring NA_integer_ ... and friends).

So there needs to be a way to specify your own missing value NA_this (or something).

This might be a function like tidyr::replace_na, perhaps instead called replace_na_why or something.

This might look like

data %>%
replace_na_why(.condition = var == -99,
              .why = "weather station too cold",
              .suffix = "TC")

This would then create a value NA_TC, which then has a mechanism recorded.

Since R does not treat these as missing, we would incorporate this into the shadow matrix values

!NA, NA, and NA_.why
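A rough base R sketch of this idea follows. replace_na_why() does not exist in naniar; the function, its arguments, and the returned list are all assumptions made for illustration, loosely following the proposal above.

```r
# Hypothetical sketch: replace_na_why() is a proposed name, not a naniar function.
replace_na_why <- function(x, condition, suffix) {
  # Treat an NA in the condition as "does not apply"
  special <- !is.na(condition) & condition
  # 1 = observed, 2 = ordinary NA, 3 = NA with a recorded mechanism
  code <- ifelse(is.na(x), 2L, ifelse(special, 3L, 1L))
  x[special] <- NA
  list(
    values = x,
    shadow = factor(code, levels = 1:3,
                    labels = c("!NA", "NA", paste0("NA_", suffix)))
  )
}

temp <- c(12.1, -99, NA, 8.4)
res <- replace_na_why(temp, condition = temp == -99, suffix = "TC")
res$values
res$shadow
```

The data column ends up with ordinary NA values, so R treats them as missing, while the shadow column keeps the reason.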

perform reductionism on `geom_missing_point` and `stat_missing_point`

In trying to get the geom_missing_point to work, we might have added extra code that isn't needed to make it do what we want - would be good to only have the essential bits so that the code is a bit tidier.

create `.ts` and `.mts` S3 methods for `as_shadow` and `bind_shadow`

Problem

as_shadow and bind_shadow methods only work for data.frames - ideally these would work for .ts and .mts objects

Solution

Create a method for as_shadow
Create a method for bind_shadow
Write unit tests for these

Related issues

Related to #4

Think about how to visualise cases where more than one variable is imputed

library(simputation)
library(narnia)
library(tidyverse)
# impute Ozone

So I can use the shadow matrix and then impute values and keep track of them, which is really nifty!

aq_shadow <- bind_shadow(airquality)

# impute the values and visualise
impute_lm(Ozone ~ Temp + Wind, 
          dat = aq_shadow) %>%
  ggplot(aes(x = Temp,
             y = Ozone,
             colour = Ozone_NA)) + 
  geom_point()

But if I impute more than one variable, and I want to visualise both imputations, I cannot make the imputations visible on the one graphic

aq_shadow %>%
  impute_lm(Ozone ~ Temp + Wind) %>%
  impute_lm(Solar.R ~ Temp + Wind) %>%
  ggplot(aes(x = Solar.R,
             y = Ozone,
             colour = Ozone_NA)) + 
  geom_point()

I would need to make two graphics. Or think of a way to incorporate this extra information.

shadow_shift where there is no variance

Previously in #36 I worked out how to handle cases where there is only one complete value.

Now I need to work out a way to handle cases where there is no variance

library(naniar)

shadow_shift(c(NA,4,4))
#> [1] 4 4 4

It performs fine when there is some small variance

shadow_shift(c(NA,4,3))
#> [1] 2.884298 4.000000 3.000000
shadow_shift(c(NA,4,NA))
#> [1] 3.424959 4.000000 3.457942
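One possible way to handle the zero-variance case is to fall back to a different scale when the range is zero. The sketch below is an assumption about how that could work, not naniar's shadow_shift(): shift_missing_safe() is a hypothetical name, the fallback rule is my own choice, and no jitter is added.

```r
# Hypothetical sketch: not naniar's shadow_shift(), and no jitter is applied.
shift_missing_safe <- function(x, prop_below = 0.1) {
  x_min <- min(x, na.rm = TRUE)
  x_range <- max(x, na.rm = TRUE) - x_min
  # Assumed fallback: with no spread, scale by |min|, or by 1 if min is 0
  if (x_range == 0) {
    x_range <- if (x_min == 0) 1 else abs(x_min)
  }
  ifelse(is.na(x), x_min - prop_below * x_range, x)
}

no_variance   <- shift_missing_safe(c(NA, 4, 4))
some_variance <- shift_missing_safe(c(NA, 4, 3))
no_variance
some_variance
```

With this rule the missing value in c(NA, 4, 4) lands below 4 instead of on top of it; whether scaling by the magnitude of the minimum is the right choice is open to debate.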

Time Series: Make numerical summary for missings in a run

In time series data, it is important to be able to look at the number of missings that occur in a single "run" - the "number of missings in a row".

Below we see that there are sections of the time series plot that are missing, indicated by the gaps:

library(imputeTS)

plot(tsNH4)

There is one approach to plot the number of the missings in plotNA.gapsize():

plotNA.gapsize(tsNH4)

We need this as a numerical summary so the user can have more flexible options.
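A numerical summary of runs could start from base R's rle(). The following is a minimal sketch with a hypothetical helper name and a made-up example vector, not a proposed naniar API.

```r
# Hypothetical helper: lengths of each consecutive run of missing values
miss_run_lengths <- function(x) {
  runs <- rle(is.na(x))
  # Keep only the runs where the value is missing
  runs$lengths[runs$values]
}

x <- c(1, NA, NA, 3, NA, 5, NA, NA, NA)
run_lengths <- miss_run_lengths(x)
run_lengths

# Tabulate how many runs there are of each length
table(run_lengths)
```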

clean up old functions and tidy up old vignette

We don't really need miss_cat, shadow_df, and shadow_cat anymore.

However, I think that the old vignette presents the original way we were thinking of how to create the plots. This vignette isn't entirely accurate anymore, but I'd like to preserve the way it describes how we walk through the process of creating the first plots.

A new vignette would need to be created that provides a stronger motivation for the use of ggmissing.

Develop automated testing for ggmissing

We do a lot of informal testing of the code after we include a feature, but it will save us time if we think of what sorts of tests to write for each function. I guess we ask questions like:

What objects does this function work on?
What object should it return?
Is there an expected data structure or format the object should be in?

And go from there. More info on testing: http://r-pkgs.had.co.nz/tests.html

Add jitter/noise to `shadow_shift` to separate out repeats of the same value

List of packages / algorithms to support

The idea in narnia is that it will not provide its own imputation engine, but will rather leverage off of existing packages in the R ecosystem.

Here I document the packages that I would like to support:

simputation
mi
mice
norm (em.norm) (Thank you Sergey!)
Amelia

There are many more ways to impute data, so I encourage members of the community to chime in!

another way to display info. about missing

One simple way to display missing data in bivariate plots is to create a separate variable for where the values are missing and to plot that information. See script and an example plot here:

https://gist.github.com/soodoku/36fecfc442342c0e01aad6742b8ee47e

Develop data functions to facilitate bivariate plots: scatterplots, density overlays

geom_imputed_* and friends

This would be a new geom built for imputed data / imputed dataframes.

Not sure how the specifics of this would work, but something like:

ggplot(data = data_imputed,
       aes(x = var1,
           y = var2)) + 
  geom_imputed_point()

Then this could display something similar to geom_missing_point(), but instead show the imputed values in addition to the regular data.

This might use shadow_bind or shadow_augment or similar to represent the imputations somehow.

How to identify imputed and missing values locations when imputing values

Related to #47

The which_na function stores the rows and cols of the missing values in naniar. This could be stored behind the scenes to give locations of imputed values, and missing values.

naniar::which_na(airquality)
      row col
 [1,]   5   1
 [2,]  10   1
 [3,]  25   1
 [4,]  26   1
 [5,]  27   1
.
.
.
[40,]  11   2
[41,]  27   2
[42,]  96   2
[43,]  97   2
[44,]  98   2

An implementation should then use this information to identify which values are still missing, and which are imputed.
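The idea can be sketched in base R by recording the locations before imputing and checking them afterwards. Here which(..., arr.ind = TRUE) stands in for which_na(), and mean imputation of Ozone is used only as a placeholder for a real imputation step.

```r
# Record the (row, col) location of every missing value before imputing
na_locations <- which(is.na(airquality), arr.ind = TRUE)

# Placeholder imputation: fill Ozone with its mean, leave Solar.R untouched
imputed <- airquality
imputed$Ozone[is.na(imputed$Ozone)] <- mean(airquality$Ozone, na.rm = TRUE)

# Look up the recorded locations in the imputed data
still_missing <- is.na(imputed[na_locations])

sum(!still_missing)  # locations that were imputed
sum(still_missing)   # locations that are still missing
```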

blog posts and other resources to look at

http://datascienceplus.com/imputing-missing-data-with-r-mice-package/

in example, missing data is not coloured

Basically, running the code

library(dplyr)
library(ggplot2)
library(ggmissing)

brfss %>%
  mutate(miss_cat = miss_cat(., "PHYSHLTH", "MENTHLTH")) %>% 
  ggplot(data = .,
         aes(x = shadow_shift(PHYSHLTH),
             y = shadow_shift(POORHLTH),
             colour = miss_cat)) + 
  geom_point()

Produces a plot with the legend: "Not Missing", when in fact there is missing data in POORHLTH:

sum(is.na(brfss$POORHLTH))

Unsure why that is happening.

revealers, helpers to clear up common / other representations of missing values

derive_shadows, or other things...helpers to clear up common representations of missing values such as "NA", "N/A", etc. and might allow for an easier way for users to describe different missing data codes, such as -99, which might indicate missing, but some other kind of missing value, perhaps a different mechanism of missingness

ideas for function names:

narify (play on clarify)
darken
shade
refract

Other commands that might be useful?

is_na
fill_na
drop_na
is_null

develop `shadow_shift` method for categorical variables

_case or _row

In ggmissing the functions that refer to case refer to the rows of a dataframe.

Perhaps it would make more sense to rename these as _row?

e.g.,

gg_missing_case() becomes gg_missing_row() (or possibly rows)
summary_missing_case() becomes summary_missing_row() (or possibly rows)

And so on.

Do you have any preference, @dicook ?

Update Vignette to include new code for shadow_df

Time series: Proportion of missings in a given window

The package imputeTS does a moving window plot of the missing values for a given moving window / break set.

# visualisation missing values for time series.
library(imputeTS)
# imputeTS

plotNA.distributionBar(tsNH4)

This is nice, if you know the right window size to look for, but I see at least three things that need to be implemented in naniar:

A function that creates a dataframe counting the number of missings for a given static window size
As for the first, but allowing you to specify windows of varying length (for example, what if you want to compare weekdays vs weekends, or something?)
A plotting function or possibly a geom that takes a ts object and produces an equivalent plot to above
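The first two items could be prototyped with tapply(). In this sketch miss_by_window() is a hypothetical helper and the example vector is made up; a fixed window is just one way of building the grouping vector, so varying windows work the same way.

```r
# Hypothetical helper: count missings within each window of a series
miss_by_window <- function(x, window) {
  data.frame(
    window = sort(unique(window)),
    n_miss = as.vector(tapply(is.na(x), window, sum)),
    n_obs  = as.vector(tapply(x, window, length))
  )
}

x <- c(1, NA, NA, 4, 5, NA, 7, 8, 9, NA)

# Static window of size 4: positions 1-4, 5-8, 9-10
fixed_window <- (seq_along(x) - 1) %/% 4 + 1
fixed_counts <- miss_by_window(x, fixed_window)
fixed_counts

# Windows of varying length: any grouping vector of the same length works
varying_counts <- miss_by_window(x, c(rep("a", 2), rep("b", 8)))
varying_counts
```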

Perform a literature review of missing data packages available on CRAN

There are a lot of packages on CRAN that work with missing data.

I'm trying to pin down the ones most used by users. This poll on twitter indicates that mice is the most popular, but also that people use Amelia, vtreat, Hmisc::aregImpute(), and VIM.

Below is some code that shows how many packages have "miss" and "imput" in the description, and the number of downloads they have from the rstudio server each month:

# super handy code from Julia Silge's blog:
# https://juliasilge.com/blog/mining-cran-description/

library(tidyverse)
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Loading tidyverse: dplyr
#> Conflicts with tidy packages ----------------------------------------------
#> filter(): dplyr, stats
#> lag():    dplyr, stats

cran <- tools::CRAN_package_db()

# the returned data frame has two columns with the same name???
cran <- cran[,-65]

# make it a tibble
cran <- tbl_df(cran)

# I want to find which packages mention the words "missing data" in their description.

cran_miss <- cran %>%
  select(Package,
         Description) %>% 
  mutate(has_missing = grepl("imput | miss", Description)) %>%
  filter(has_missing)

library(cranlogs)

# use cranlogs to find how often packages are downloaded
cran_miss_download <- cran_downloads(packages = cran_miss$Package,
               when = "last-month") %>%
  as_tibble() %>%
  group_by(package) %>%
  mutate(n_dl = sum(count)) %>%
  select(package, n_dl) %>%
  ungroup() %>%
  distinct() %>%
  arrange(-n_dl)

cran_miss_download %>%
  slice(1:20) %>%
  ggplot(aes(x = reorder(package, n_dl),
             y = n_dl)) + 
  geom_col() +
  coord_flip()

# OK, what if we filter out Hmisc, Purrr, gtools, and R.methodsS3

cran_miss_download %>%
  slice(5:25) %>%
  ggplot(aes(x = reorder(package, n_dl),
             y = n_dl)) + 
  geom_col() +
  coord_flip()

remove `any_na` in favour of `anyNA`

anyNA is much, much faster.

test_na_vec <- c(NA, rep(1:10, 100))

mb1 <- 
microbenchmark::microbenchmark(
"any_na" = ggmissing::any_na(test_na_vec),
"anyNA" = anyNA(c(NA, 1, 1, 1)),
unit = "eps"
)

mb1
#> Unit: evaluations per second
#>    expr        min         lq       mean     median        uq        max
#>  any_na   653.6824   5610.225   8372.278   7598.131  11636.05   13607.85
#>   anyNA 71947.6221 556468.125 815641.789 761343.349 924643.48 1897533.21
#>  neval
#>    100
#>    100

New naming scheme for the missing diagnostics / summary functions

Currently I'm finding it a bit hard to remember which functions I want to do what summary of the missing data.

I am moving towards the format miss_type_value/fun, because it makes more sense to me when tabbing through functions.

miss_* = I want to explore missing values

miss_case_* = I want to explore missing cases

miss_case_pct = I want to find the percentage of cases containing a missing value
miss_case_summary = I want to find the number / percentage of missings in each case
miss_case_table = I want a tabulation of the number / percentage of cases missing

This is more consistent and easier to reason with. I will not be providing .Deprecated for these functions, naniar is still early days, and these functions shouldn't break much analysis code, and are easy to fix.

percent_missing_case()  --> miss_case_pct
percent_missing_var()   --> miss_var_pct
percent_missing_df()    --> miss_df_pct

summary_missing_case()  --> miss_case_summary
summary_missing_var()   --> miss_var_summary

table_missing_case()   --> miss_case_table
table_missing_var()    --> miss_var_table

small typo in the repository description

At the top of the page: enhanace --> enhance

Does `shadow_shift` need to use range() instead of min() to shift missing values ?

suggestion for `shadow_shift` and `label_missing`

It would be cool if shadow_shift behaved like this:

library(dplyr)
library(naniar)

data %>%
  shadow_shift(var1,var2)

And then it added some shifted values to the variables, and then added a column identifying the missingness.

Currently to do this, it would look like this:

airquality %>%
  mutate(Ozone_shift = shadow_shift(Ozone),
         Solar.R_shift = shadow_shift(Solar.R),
         miss_label = label_missing_2d(Ozone,Solar.R)) %>%
  head()
#>   Ozone Solar.R Wind Temp Month Day Ozone_shift Solar.R_shift  miss_label
#> 1    41     190  7.4   67     5   1    41.00000     190.00000 Not Missing
#> 2    36     118  8.0   72     5   2    36.00000     118.00000 Not Missing
#> 3    12     149 12.6   74     5   3    12.00000     149.00000 Not Missing
#> 4    18     313 11.5   62     5   4    18.00000     313.00000 Not Missing
#> 5    NA      NA 14.3   56     5   5   -11.75002     -23.01917     Missing
#> 6    28      NA 14.9   66     5   6    28.00000     -18.27869     Missing

On this note, it would be great to have label_missing take any number of arguments
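One way this could work is to collect the vectors with `...` and combine their missingness with Reduce(). The function below is a hypothetical sketch, not the package's label_missing_2d().

```r
# Hypothetical sketch: label a row "Missing" if any supplied vector is NA there
label_missing_any <- function(...) {
  any_missing <- Reduce(`|`, lapply(list(...), is.na))
  ifelse(any_missing, "Missing", "Not Missing")
}

miss_labels <- with(airquality, label_missing_any(Ozone, Solar.R, Wind))
head(miss_labels)
table(miss_labels)
```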

Add pedestrian count data to narnia

At the moment I am using the airquality dataset, but it would be good to have something updated.

A good candidate for this is the pedestrian count data from Melbourne.

However, I will also need to look into other missing data packages to see what they use, as these may also be interesting.

Take inspiration from missingno

Really cool python package! https://github.com/ResidentMario/missingno

Document the datasets

I'm not sure what the history of these datasets is:

brfss
tao

I can probably remove df, as this was created from wakefield.

Read up on correctly documenting datasets: http://r-pkgs.had.co.nz/data.html

Develop `shadow_shift` method for factors - perhaps add another level (smaller than smallest)

Investigate shadow_cat, miss_cat and remove if not needed

From what I can tell so far, shadow_cat and miss_cat are no longer needed: geom_missing_point() calls shadow_shift and label_missing_2d, and needs neither shadow_cat nor miss_cat. I will unexport these functions in the next commit and then do a bit of investigating to work out if they really are needed anymore.

Look at how other stats/programming languages handle missing values

Following on from discussion with @MilesMcBain in #31, it would be worthwhile to see how other languages and stats languages handle missing values and critique/borrow from these.

Here's a start on STATA: http://www.ats.ucla.edu/stat/stata/modules/missing.html

Create S3 Methods for `shadow_shift` to check for variable types

Specifically, we need S3 methods for:

numeric
factor
integer
... and more?
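A sketch of what S3 dispatch might look like is below. The generic shift_na() and its methods are hypothetical names chosen to avoid clashing with the package; the numeric rule (10% of the range below the minimum, no jitter) and the factor rule (a new first level) are assumptions.

```r
# Hypothetical sketch: shift_na() is an illustrative generic, not naniar's API
shift_na <- function(x, ...) UseMethod("shift_na")

shift_na.numeric <- function(x, prop_below = 0.1, ...) {
  x_min <- min(x, na.rm = TRUE)
  x_range <- max(x, na.rm = TRUE) - x_min
  ifelse(is.na(x), x_min - prop_below * x_range, x)
}

shift_na.factor <- function(x, ...) {
  # Assumed rule: add a "missing" level that sorts before the existing levels
  out <- factor(x, levels = c("missing", levels(x)))
  out[is.na(x)] <- "missing"
  out
}

shift_na.default <- function(x, ...) {
  stop("shift_na() has no method for class ", class(x)[1])
}

shift_na(c(1L, NA, 5L))
shift_na(factor(c("a", NA, "b")))
```

Integer vectors dispatch to the numeric method through R's implicit class, so a separate integer method may only be needed if integers should be treated differently.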

Update the vignettes

The vignettes are out of date, I need to:

Move "extending ggplot2" vignette into its own repository
Tidy up the notes from the shadow-mechanics.
Create a new vignette that more closely follows the process I described in my WOMBAT / MeDaScin talk

develop `shadow_shift` method for time series

fork the repo into "dev" and "master"

Getting things ready for onboarding, and submission to CRAN.

Remove dependency on ggalt

ggalt needs to build with PROJ.4 which means ggmissing will fail when installing on a Linux system if PROJ.4 is not installed. This happened to me today. A quick google suggests this library has something to do with map projections... so nothing missing data related. This is also why Travis build is failing.

I recreated geom_lollipop for gg_missing_var using stock ggplot2. I started on gg_missing_case but realised I have no idea what that plot is supposed to look like, and since I won't install ggalt, I can't find out.

See: https://github.com/njtierney/ggmissing/blob/miles/R/gg_missing_var.R

replace shadow_df with as_shadow, or rename

Currently these functions produce different output:

library(naniar)
shadow_df(airquality)
#> # A tibble: 153 × 6
#>    Ozone Solar.R  Wind  Temp Month   Day
#>    <lgl>   <lgl> <lgl> <lgl> <lgl> <lgl>
#> 1  FALSE   FALSE FALSE FALSE FALSE FALSE
#> 2  FALSE   FALSE FALSE FALSE FALSE FALSE
#> 3  FALSE   FALSE FALSE FALSE FALSE FALSE
#> 4  FALSE   FALSE FALSE FALSE FALSE FALSE
#> 5   TRUE    TRUE FALSE FALSE FALSE FALSE
#> 6  FALSE    TRUE FALSE FALSE FALSE FALSE
#> 7  FALSE   FALSE FALSE FALSE FALSE FALSE
#> 8  FALSE   FALSE FALSE FALSE FALSE FALSE
#> 9  FALSE   FALSE FALSE FALSE FALSE FALSE
#> 10  TRUE   FALSE FALSE FALSE FALSE FALSE
#> # ... with 143 more rows
as_shadow(airquality)
#> # A tibble: 153 × 6
#>    Ozone_NA Solar.R_NA Wind_NA Temp_NA Month_NA Day_NA
#>      <fctr>     <fctr>  <fctr>  <fctr>   <fctr> <fctr>
#> 1       !NA        !NA     !NA     !NA      !NA    !NA
#> 2       !NA        !NA     !NA     !NA      !NA    !NA
#> 3       !NA        !NA     !NA     !NA      !NA    !NA
#> 4       !NA        !NA     !NA     !NA      !NA    !NA
#> 5        NA         NA     !NA     !NA      !NA    !NA
#> 6       !NA         NA     !NA     !NA      !NA    !NA
#> 7       !NA        !NA     !NA     !NA      !NA    !NA
#> 8       !NA        !NA     !NA     !NA      !NA    !NA
#> 9       !NA        !NA     !NA     !NA      !NA    !NA
#> 10       NA        !NA     !NA     !NA      !NA    !NA
#> # ... with 143 more rows

I think these need more clearly distinct names, since as_shadow and shadow_df sound very similar despite returning different structures.
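
To make the difference between the two outputs concrete, here is a base R sketch that builds both shapes from is.na(). It only mirrors the printed structures above and is not the package's implementation. The object names shadow_lgl and shadow_fct are made up for illustration.

```r
# Logical shadow: TRUE where a value is missing (the shape shadow_df() prints).
shadow_lgl <- as.data.frame(is.na(airquality))

# Factor shadow: the same information relabelled as "!NA" / "NA",
# with an _NA suffix on each column (the shape as_shadow() prints).
shadow_fct <- as.data.frame(
  lapply(shadow_lgl, factor, levels = c(FALSE, TRUE), labels = c("!NA", "NA"))
)
names(shadow_fct) <- paste0(names(shadow_lgl), "_NA")

head(shadow_lgl, 6)
head(shadow_fct, 6)
```

Both carry the same information, which suggests one could be derived from the other rather than keeping two unrelated functions.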

add_prop_miss should take a `vars`/`select` argument

This would mean that users could specify which variables to include, following the rules (and getting the benefits) of select, so that they could use starts_with and friends.

This might look something like:

data %>%
    add_prop_miss(vars(starts_with("male")))

# or

data %>%
    add_prop_miss(vars("male", "female", "job_type"))

Possibly related to the implementation of add_tally and add_count.
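
As a rough illustration of the intended behaviour, the base R sketch below computes the row-wise proportion of missing values over a chosen subset of columns. The helper add_prop_miss_sketch and its cols argument are hypothetical and do not reflect the eventual add_prop_miss interface. The prefix match with startsWith() only approximates what starts_with() would do.

```r
# Hypothetical helper (not naniar's add_prop_miss): add a prop_miss column
# holding the proportion of missing values per row, over selected columns.
add_prop_miss_sketch <- function(data, cols = names(data)) {
  is_missing <- is.na(data[, cols, drop = FALSE])
  data$prop_miss <- unname(rowMeans(is_missing))
  data
}

# Pick columns by prefix, similar in spirit to starts_with("O").
o_cols <- names(airquality)[startsWith(names(airquality), "O")]

head(add_prop_miss_sketch(airquality, o_cols))

# Or name the columns explicitly.
head(add_prop_miss_sketch(airquality, c("Ozone", "Solar.R")))
```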

add white lines to clearly separate the missing values in `geom_missing_point`

For example:

library(naniar)
library(ggplot2)
# method to draw new white band
y_white_band <- min(airquality$Solar.R, 
                    na.rm = TRUE) - 0.95 * min(airquality$Solar.R, 
                                               na.rm = TRUE)

x_white_band <- min(airquality$Ozone, 
                    na.rm = TRUE) - 0.95 * min(airquality$Ozone, 
                                               na.rm = TRUE)

ggplot(data = airquality,
       aes(x = Ozone,
           y = Solar.R)) +
  geom_missing_point() + 
  # geom_hline(yintercept = -3,
  geom_hline(yintercept = y_white_band,
             size = 2,
             colour = "white") +
  # geom_vline(xintercept = -3,
  geom_vline(xintercept = x_white_band,
             size = 2,
             colour = "white")

As opposed to

ggplot(data = airquality,
       aes(x = Ozone,
           y = Solar.R)) +
  geom_missing_point()

Useful patterns with `bind_shadow`

You can use the _NA suffix to then group by and summarise:

library(narnia)
library(tidyverse)

airquality %>%
  bind_shadow() %>%
  group_by(Ozone_NA) %>%
  summarise_at(.vars = "Solar.R",
               .funs = c("mean", "sd", "var", "min", "max"),
               na.rm = TRUE)

#> # A tibble: 2 x 6
#>   Ozone_NA     mean       sd      var   min   max
#>     <fctr>    <dbl>    <dbl>    <dbl> <dbl> <dbl>
#> 1      !NA 184.8018 91.15230 8308.742     7   334
#> 2       NA 189.5143 87.69478 7690.375    31   332

shadow_shift returns the same number when there is only one non missing value

When there is only one distinct non-missing value, shadow_shift fills the missing values with that same value rather than placing them below it. This is because the shift is calculated from the range of the data, which is zero in this case.

library(naniar)
shadow_shift(c(10,10))
#> [1] 10 10

shadow_shift(c(NA,10))
#> [1] 10 10

shadow_shift(c(NA,NA,10))
#> [1] 10 10 10

However, when there are at least two distinct non-missing values, it works fine.

shadow_shift(c(NA,10,9))
#> [1]  8.893292 10.000000  9.000000

shadow_shift(c(NA,NA,10,9))
#> [1]  8.910828  8.898420 10.000000  9.000000

shadow_shift(c(NA,10,9,NA))
#> [1]  8.905714 10.000000  9.000000  8.879006

To fix this I will create a separate step for when the input contains only one distinct non-missing value.
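
A minimal sketch of such a separate step is below. The function shift_below_sketch and its prop_below argument are hypothetical. The fallback rule for a zero range (use a fraction of the value itself, or 1 when the value is 0) is an assumption and not necessarily what shadow_shift ends up doing. Jitter is omitted so the results are deterministic.

```r
# Hypothetical stand-in for shadow_shift, without jitter.
shift_below_sketch <- function(x, prop_below = 0.1) {
  # Nothing to shift, or nothing to shift relative to.
  if (!anyNA(x) || all(is.na(x))) return(x)

  x_min <- min(x, na.rm = TRUE)
  x_range <- diff(range(x, na.rm = TRUE))

  if (x_range == 0) {
    # Separate step: only one distinct observed value, so the range is zero.
    # Assumed fallback: offset by a fraction of the value, or by 1 if it is 0.
    offset <- if (x_min == 0) 1 else abs(x_min) * prop_below
  } else {
    offset <- x_range * prop_below
  }

  x[is.na(x)] <- x_min - offset
  x
}

shift_below_sketch(c(NA, 10))      # missing value now sits below 10
shift_below_sketch(c(NA, 10, 9))   # missing value sits below 9
```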

behaviour of is.na

is.na will return true for NA

is.na(NA)
#> [1] TRUE

But for a quoted character, "NA", this is apparently not missing

is.na("NA")
#> [1] FALSE

Somewhat surprisingly, NaN values are also regarded as missing values.

is.na(NaN)
#> [1] TRUE

is.na("NaN")
#> [1] FALSE

I think that the quoted character "NA", "na", "Na", etc. should be regarded as missing, or there should be a specific function to coerce them to NAs. This function should also allow for some user-specified NA characters. Perhaps something like coerce_na.
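
A rough idea of how such a function might behave, written in base R. The name coerce_na_sketch, its arguments, and the default list of NA strings are all assumptions for illustration and not an existing naniar function.

```r
# Hypothetical helper: turn NA-like strings in a character vector into real NAs.
coerce_na_sketch <- function(x,
                             na_strings = c("NA", "NaN", "N/A", ""),
                             ignore_case = TRUE) {
  stopifnot(is.character(x))
  key <- trimws(x)                 # ignore surrounding whitespace
  if (ignore_case) {
    key <- toupper(key)
    na_strings <- toupper(na_strings)
  }
  x[key %in% na_strings] <- NA_character_
  x
}

coerce_na_sketch(c("NA", "na", "Na", "12", " n/a "))

# User-specified NA strings
coerce_na_sketch(c("missing", "3", "unknown"),
                 na_strings = c("missing", "unknown"))
```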

There might also be scope here for another NA function where people can specify different factors / structure for the NA values. For example, -99, -98, might be missing values, but could have specific reasons / mechanisms for being missing, and so should be recorded differently. This might be a function called label_na.
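
One possible shape for this, as a sketch only: the function label_na_sketch, the codes argument, and the reason labels "refused" and "not_asked" are invented for illustration. It replaces the sentinel codes with NA and keeps the reason in a separate column.

```r
# Hypothetical helper: split sentinel codes into an NA value plus a reason.
# `codes` is a named vector: names are reasons, values are sentinel codes.
label_na_sketch <- function(x, codes) {
  idx <- match(x, codes)             # NA where x is not a sentinel code
  data.frame(
    value  = ifelse(is.na(idx), x, NA),
    reason = names(codes)[idx]
  )
}

label_na_sketch(c(5, -99, 7, -98),
                codes = c(refused = -99, not_asked = -98))
```

Keeping the reason alongside the value means the mechanism is not lost once the codes are converted to NA.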

I'm not really sure if it is a problem that NaN values are considered missing. Perhaps it might be useful to provide some specific handlers for _na type objects, which also handle NaNs in a potentially more opinionated way.

Make missing data summaries play well with dplyr::group_by

At this stage the missing data summaries like percent_missing_df() don't work with dplyr::group_by().

library(naniar)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# this gives the same output
out_1 <- airquality %>% percent_missing_df()

out_1
#> [1] 4.793028

# as this
out_2 <- airquality %>%
  group_by(Month) %>%
  percent_missing_df()

out_2
#> [1] 4.793028

all.equal(out_1,out_2)
#> [1] TRUE

This is the case for all of the missing_data_tidier.R functions.

I'm not sure exactly why this is happening, but I do think that having these functions play with dplyr::group_by is important.
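
For comparison, this is the kind of per-group result I would expect, sketched in base R with split(). The helper pct_missing_sketch is a hypothetical stand-in for percent_missing_df. A summary written this way looks only at the cells of the data frame it is given, which may be why grouping metadata gets ignored, though that is a guess.

```r
# Hypothetical stand-in for percent_missing_df(): percent of missing cells.
pct_missing_sketch <- function(data) {
  mean(is.na(data)) * 100
}

# Whole data set
pct_missing_sketch(airquality)

# One value per month, which is what a group-aware version should return
by_month <- sapply(split(airquality, airquality$Month), pct_missing_sketch)
by_month
```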

add separating lines for geom_missing_point

This helps more clearly identify the values as missing, and distinct from the dataset.

library(ggmissing)
library(ggplot2)
# method to draw new white band
y_white_band <- min(airquality$Solar.R, 
                    na.rm = TRUE) - 0.95 * min(airquality$Solar.R, 
                                               na.rm = TRUE)

x_white_band <- min(airquality$Ozone, 
                    na.rm = TRUE) - 0.95 * min(airquality$Ozone, 
                                               na.rm = TRUE)

ggplot(data = airquality,
       aes(x = Ozone,
           y = Solar.R)) +
  geom_missing_point() + 
  # geom_hline(yintercept = -3,
  geom_hline(yintercept = y_white_band,
             size = 2,
             colour = "white") +
  # geom_vline(xintercept = -3,
  geom_vline(xintercept = x_white_band,
             size = 2,
             colour = "white")

However, there will need to be a different method when exploring imputations.