
brolgar's Introduction

👋 G'day!

I'm a research software engineer working with Nick Golding in the Infectious Disease Ecology and Modelling team at the Telethon Kids Institute. My primary goal is to improve data analysis with free and open source software.

  • 🗒️ Available to teach courses on data science, reproducible reporting, analysis pipelines, and building R packages
  • 🦠 Working on creating analysis pipelines and R packages for disease modelling
  • 💬 Ask me about missing data, longitudinal data, writing functions, and ☕
  • 🛠️ Maintain the greta software for statistical modelling
  • 💻 Mostly program in R
  • 😄 Pronouns: He / Him

brolgar's People

Contributors

dicook, hadley, kant, njtierney, tprvan


brolgar's Issues

`near` helper functions

  • is_near - returns a logical indicating whether a value is near another value (within a tolerance)
  • top_near - like top_n, returns the "top n" nearest results within a tolerance. So top_near(data, x, near, tol), where x is a vector, near is the number to compare each value against, and tol is the tolerance within which to accept values in the first place.
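These could be sketched in base R along the following lines (a sketch only, with the names proposed above; here `x` is passed as a column name for simplicity, rather than a vector):

```r
# Sketch only: possible shapes for the proposed helpers
is_near <- function(x, y, tol = .Machine$double.eps^0.5) {
  abs(x - y) < tol
}

top_near <- function(data, x, near, tol) {
  # keep rows where column `x` is within `tol` of `near`,
  # ordered from nearest to furthest
  dist <- abs(data[[x]] - near)
  keep <- data[dist < tol, , drop = FALSE]
  keep[order(dist[dist < tol]), , drop = FALSE]
}

is_near(c(1, 1.1, 2), 1, tol = 0.2)
#> [1]  TRUE  TRUE FALSE
```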

add_l_* functions

The add_l_ set of functions complement the l_ set of functions:

e.g.,

  • add_slope
  • add_length

This avoids needing to join again later by joining straight to the data.
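For example, add_slope could be a thin wrapper that computes the slopes and immediately joins them back on (a sketch only, assuming the current l_slope() interface and using rlang/dplyr for the join key):

```r
# Sketch: compute per-id slopes and join them straight back onto the data
add_slope <- function(data, id, formula) {
  q_id <- rlang::enquo(id)
  slopes <- l_slope(data, !!q_id, formula)
  dplyr::left_join(data, slopes, by = rlang::as_name(q_id))
}

# wages would gain intercept and slope columns, repeated within each id:
# add_slope(wages, id, lnw ~ exper)
```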

function to put ids into K sets of random groups

This can be combined with sample_frac_obs(), where you might want to prune down the observations, and then get a quick view of the longitudinal observations, like so:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(brolgar)

wages %>%
  sample_frac_obs(id = id, size = 0.1) %>%
  group_by(id) %>%
  nest() %>%
  mutate(.rand_id = sample(1:10, nrow(.), replace = TRUE)) %>%
  unnest() %>%
  ggplot(aes(x = lnw,
             y = exper,
             group = id)) + 
  geom_line() + 
  facet_wrap(~.rand_id)

Created on 2019-04-08 by the reprex package (v0.2.1)

Example datasets

  • Health + Medical data
  • University rankings
  • census info (over long time?)

add feat_lag

This should create p columns, with p either calculated from the maximum number of observations for a key, or supplied by the user.

So there would be: lag_1, ..., lag_p columns
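A per-key sketch of what feat_lag might compute (helper name hypothetical):

```r
# Sketch: build lag_1, ..., lag_p columns from one key's measurements
lag_cols <- function(x, p) {
  out <- lapply(seq_len(p), function(i) dplyr::lag(x, n = i))
  names(out) <- paste0("lag_", seq_len(p))
  tibble::as_tibble(out)
}

lag_cols(c(10, 20, 30), p = 2)
#> # A tibble: 3 x 2
#>   lag_1 lag_2
#>   <dbl> <dbl>
#> 1    NA    NA
#> 2    10    NA
#> 3    20    10
```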

long data format visualisations using gghighlight

Something like this, where we split plots into facets based on features, would be neat. It might be worthwhile to include a function like gather_features() to facilitate this.

library(gghighlight)
#> Loading required package: ggplot2
library(brolgar)
library(tidyverse)
wages_ts %>%
  features(ln_wages, feat_monotonic) %>%
  left_join(wages_ts, by = "id") %>%
  select(id:xp) %>%
  gather(key = "features",
         value = "inc_dec",
         - id,
         - unvary,
         - monotonic,
         - ln_wages,
         - xp) %>%
  filter(inc_dec) %>%
  ggplot(aes(x = xp,
             y = ln_wages,
             group = id)) + 
  geom_line() + 
  gghighlight() + 
  facet_wrap(~features)

Created on 2019-07-19 by the reprex package (v0.3.0)
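A gather_features() helper as floated above might be little more than a pivot over the logical feature columns (name and interface hypothetical):

```r
# Sketch: pivot logical feature columns into long format for facetting
gather_features <- function(data, feature_cols) {
  tidyr::pivot_longer(data,
                      cols = dplyr::all_of(feature_cols),
                      names_to = "feature",
                      values_to = "value")
}

# e.g. gather_features(wages_features, c("increase", "decrease", "unvary"))
```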

l_slope is a bit slow

It would be great to improve the speed of l_slope

l_slope_old <- function(df, id, formula){
  l <- split(df, df[[id]])
  sl <- purrr::map(l, ~ eval(substitute(lm(formula, data = .)))) %>%
    purrr::map_dfr( ~ as.data.frame(t(as.matrix(coef(
      .
    ))))) %>%
    dplyr::mutate(id = as.integer(names(l))) %>%
    dplyr::rename_all( ~ c("intercept", "slope", "id")) %>%
    dplyr::select(id, intercept, slope) %>%
    tibble::as_tibble()
  return(sl)
}

library(brolgar)

bm1 <- bench::mark(old = l_slope_old(wages, "id", "lnw~exper"),
                   new = l_slope(wages, id, lnw~exper))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.

summary(bm1)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 x 10
#>   expression      min     mean   median      max `itr/sec` mem_alloc  n_gc
#>   <chr>      <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt> <dbl>
#> 1 old        748.71ms 748.71ms 748.71ms 748.71ms     1.34     7.25MB    16
#> 2 new           2.45s    2.45s    2.45s    2.45s     0.408   11.93MB    67
#> # … with 2 more variables: n_itr <int>, total_time <bch:tm>
summary(bm1, relative = TRUE)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 x 10
#>   expression   min  mean median   max `itr/sec` mem_alloc  n_gc n_itr
#>   <chr>      <dbl> <dbl>  <dbl> <dbl>     <dbl>     <dbl> <dbl> <dbl>
#> 1 old         1     1      1     1         3.27      1     1        1
#> 2 new         3.27  3.27   3.27  3.27      1         1.65  4.19     1
#> # … with 1 more variable: total_time <dbl>

Created on 2019-04-02 by the reprex package (v0.2.1)

use tsibble not data.frame

brolgar should use tsibble data methods for S3, rather than provide additional data.frame methods. Perhaps in the future the methods can be extended to work for data.frame as well, but until the API is stable, I don't see much point in doing a lot of extra work.

Change wages_ts to wages?

Unless I want to keep an example dataset in there that describes "how to create your own tsibble data". But maybe that can be the world_heights data.

faceting functions to split/cut of a feature

Idea 1: facet along some feature/variable

gg... + 
  facet_along(~var, n_facets = 10, order = "ascend")

So here the facet_along() function would break a given variable up into 10 groups in ascending order.

This means that you could create a feature on, say, slope, and then get 10 evenly spaced groups from high to low.
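Until such a function exists, the same effect can be approximated with dplyr::ntile() on a computed feature (a sketch, reusing key_slope() as described elsewhere in these issues):

```r
# Sketch: emulate facet_along(~ .slope_xp, n_facets = 10) by hand
library(dplyr)
library(ggplot2)
library(brolgar)

wages_ts %>%
  left_join(key_slope(wages_ts, ln_wages ~ xp), by = "id") %>%
  mutate(slope_group = ntile(.slope_xp, 10)) %>%
  ggplot(aes(x = xp,
             y = ln_wages,
             group = id)) +
  geom_line() +
  facet_wrap(~slope_group)
```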

Idea 2: facet on a subsample of the data

gg... + 
  facet_sample(~var, n_facet = ..., n_sample = ...)

This would be a similar idea but is specifically for reducing a current spaghetti plot into many plots with fewer observations. So:

  • n_facet = 10 and n_sample = 10 will yield 1 sample per facet (n_sample / n_facet)
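A hand-rolled version of facet_sample might sample keys and deal them out across facets (a sketch; the .facet column and argument names are hypothetical):

```r
# Sketch: sample 100 ids and deal them across 10 facets (10 ids each)
library(dplyr)
library(ggplot2)
library(brolgar)

sampled_ids <- sample(unique(wages_ts$id), size = 100)
facet_lookup <- tibble::tibble(
  id = sampled_ids,
  .facet = sample(rep(1:10, length.out = 100))
)

wages_ts %>%
  inner_join(facet_lookup, by = "id") %>%
  ggplot(aes(x = xp,
             y = ln_wages,
             group = id)) +
  geom_line() +
  facet_wrap(~.facet)
```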

rename / rework l_n_obs

It is a very handy function, and I find that the other ways of getting the same piece of information are somewhat cumbersome:

library(brolgar)
library(tsibble)

l_n_obs(wages_ts)
#> # A tibble: 888 x 2
#>       id n_obs
#>    <int> <int>
#>  1    31     8
#>  2    36    10
#>  3    53     8
#>  4   122    10
#>  5   134    12
#>  6   145     9
#>  7   155    11
#>  8   173     6
#>  9   206     3
#> 10   207    11
#> # … with 878 more rows

wages_ts %>% features(id, length)
#> # A tibble: 888 x 2
#>       id    V1
#>    <int> <int>
#>  1    31     8
#>  2    36    10
#>  3    53     8
#>  4   122    10
#>  5   134    12
#>  6   145     9
#>  7   155    11
#>  8   173     6
#>  9   206     3
#> 10   207    11
#> # … with 878 more rows

wages_ts %>% features(!!index(wages_ts), length)
#> # A tibble: 888 x 2
#>       id    V1
#>    <int> <int>
#>  1    31     8
#>  2    36    10
#>  3    53     8
#>  4   122    10
#>  5   134    12
#>  6   145     9
#>  7   155    11
#>  8   173     6
#>  9   206     3
#> 10   207    11
#> # … with 878 more rows

wages_ts %>% features(!!index(wages_ts), 
                      list(n_obs = length))
#> # A tibble: 888 x 2
#>       id n_obs
#>    <int> <int>
#>  1    31     8
#>  2    36    10
#>  3    53     8
#>  4   122    10
#>  5   134    12
#>  6   145     9
#>  7   155    11
#>  8   173     6
#>  9   206     3
#> 10   207    11
#> # … with 878 more rows

Created on 2019-07-11 by the reprex package (v0.2.1)

So then, what can I call it?

Use `key_slope` not `l_slope`

In an effort to remove the l_ function family and stay consistent with naming of things in tidyverts / tsibble, l_slope should instead be key_slope, which returns intercept+slope info for each key.

`add_key_slope` does not work, but `key_slope` does

Somehow ln_wages is not passed properly; I'm probably missing something obvious?

library(brolgar)

# fine
key_slope(wages_ts,ln_wages ~ xp)
#> # A tibble: 888 x 3
#>       id .intercept .slope_xp
#>    <int>      <dbl>     <dbl>
#>  1    31       1.41    0.101 
#>  2    36       2.04    0.0588
#>  3    53       2.29   -0.358 
#>  4   122       1.93    0.0374
#>  5   134       2.03    0.0831
#>  6   145       1.59    0.0469
#>  7   155       1.66    0.0867
#>  8   173       1.61    0.100 
#>  9   206       1.73    0.180 
#> 10   207       1.62    0.0884
#> # … with 878 more rows

# errors
add_key_slope(wages_ts,ln_wages ~ xp)
#> Error in eval(predvars, data, env): object 'ln_wages' not found

Created on 2019-07-13 by the reprex package (v0.3.0)

Do not create longnostics for `id`'s with only one observation

We are not really interested in calculating statistics for cases where there is only one observation: the mean of one number is not very useful, and you cannot calculate things like the variance or standard deviation of a single value.

  • Add a bookkeeping function to identify ids with only one observation
  • Outline a workflow for removing those ids with only one observation
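The bookkeeping step could look something like this (a sketch with dplyr):

```r
# Sketch: find ids with a single observation, then drop them
library(dplyr)
library(brolgar)

single_obs_ids <- wages %>%
  count(id) %>%
  filter(n == 1) %>%
  pull(id)

wages_many_obs <- wages %>%
  filter(!(id %in% single_obs_ids))
```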

l_slope should be able to take any number of RHS on the formula

Currently l_slope can only take one explanatory variable on the RHS:

library(brolgar)
data(wages)
l_slope(wages, id, lnw~exper)
#> # A tibble: 888 x 3
#> # Groups:   id [888]
#>       id intercept   slope
#>    <int>     <dbl>   <dbl>
#>  1    31      1.41  0.101 
#>  2    36      2.04  0.0588
#>  3    53      2.29 -0.358 
#>  4   122      1.93  0.0374
#>  5   134      2.03  0.0831
#>  6   145      1.59  0.0469
#>  7   155      1.66  0.0867
#>  8   173      1.61  0.100 
#>  9   206      1.73  0.180 
#> 10   207      1.62  0.0884
#> # … with 878 more rows
l_slope(wages, id, lnw~exper+ged)
#> Error: `nm` must be `NULL` or a character vector the same length as `x`
#> Backtrace:
#>      █
#>   1. ├─base::tryCatch(...)
#>   2. │ └─base:::tryCatchList(expr, classes, parentenv, handlers)
#>   3. │   ├─base:::tryCatchOne(...)
#>   4. │   │ └─base:::doTryCatch(return(expr), name, parentenv, handler)
#>   5. │   └─base:::tryCatchList(expr, names[-nh], parentenv, handlers[-nh])
#>   6. │     └─base:::tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>   7. │       └─base:::doTryCatch(return(expr), name, parentenv, handler)
#>   8. ├─base::withCallingHandlers(...)
#>   9. ├─base::saveRDS(...)
#>  10. ├─base::do.call(...)
#>  11. ├─(function (what, args, quote = FALSE, envir = parent.frame()) ...
#>  12. ├─(function (input) ...
#>  13. │ └─rmarkdown::render(input, quiet = TRUE, envir = globalenv())
#>  14. │   └─knitr::knit(...)
#>  15. │     └─knitr:::process_file(text, output)
#>  16. │       ├─base::withCallingHandlers(...)
#>  17. │       ├─knitr:::process_group(group)
#>  18. │       └─knitr:::process_group.block(group)
#>  19. │         └─knitr:::call_block(x)
#>  20. │           └─knitr:::block_exec(params)
#>  21. │             ├─knitr:::in_dir(...)
#>  22. │             └─knitr:::evaluate(...)
#>  23. │               └─evaluate::evaluate(...)
#>  24. │                 └─evaluate:::evaluate_call(...)
#>  25. │                   ├─evaluate:::timing_fn(...)
#>  26. │                   ├─evaluate:::handle(...)
#>  27. │                   │ └─base::try(f, silent = TRUE)
#>  28. │                   │   └─base::tryCatch(...)
#>  29. │                   │     └─base:::tryCatchList(expr, classes, parentenv, handlers)
#>  30. │                   │       └─base:::tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>  31. │                   │         └─base:::doTryCatch(return(expr), name, parentenv, handler)
#>  32. │                   ├─base::withCallingHandlers(...)
#>  33. │                   ├─base::withVisible(eval(expr, envir, enclos))
#>  34. │                   └─base::eval(expr, envir, enclos)
#>  35. │                     └─base::eval(expr, envir, enclos)
#>  36. └─brolgar::l_slope(wages, id, lnw ~ exper + ged)
#>  37.   └─`%>%`(...) /Users/ntie0001/github/njtierney/brolgar/R/lognostics.R:249:2
#>  38.     ├─base::withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
#>  39.     └─base::eval(quote(`_fseq`(`_lhs`)), env, env)
#>  40.       └─base::eval(quote(`_fseq`(`_lhs`)), env, env)
#>  41.         └─brolgar:::`_fseq`(`_lhs`)
#>  42.           └─magrittr::freduce(value, `_function_list`)
#>  43.             ├─base::withVisible(function_list[[k]](value))
#>  44.             └─function_list[[k]](value)
#>  45.               └─dplyr::rename_all(., ~c("id", "intercept", "slope"))
#>  46.                 └─dplyr:::vars_select_syms(vars, funs, .tbl, strict = TRUE)
#>  47.                   └─rlang::set_names(syms(vars), fun(vars))
#>  48.                     └─rlang:::set_names_impl(x, x, nm, ...)

Created on 2019-04-02 by the reprex package (v0.2.1)

There needs to be better dynamic naming in the rename_all step.
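One option (a sketch only, not necessarily the fix that landed) is to take the names from the fitted coefficients, which scales to any number of RHS terms:

```r
# Sketch: derive column names from coef() rather than hard-coding them
library(brolgar)

coef_names <- function(fit) {
  nms <- names(coef(fit))
  nms[nms == "(Intercept)"] <- "intercept"
  nms
}

fit <- lm(lnw ~ exper + ged, data = wages)
coef_names(fit)
#> [1] "intercept" "exper"     "ged"
```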

devtools::check() fails because wages_ts data is not up to date

Current status of the names in brolgar:

library(brolgar)
names(wages)
#> [1] "id"       "lnw"      "exper"    "ged"      "postexp"  "black"   
#> [7] "hispanic" "hgc"      "uerate"
names(wages_ts)
#> [1] "id"            "ln_wages"      "xp"            "ged"          
#> [5] "postexp"       "black"         "hispanic"      "high_grade"   
#> [9] "unemploy_rate"
packageVersion("brolgar")
#> [1] '0.0.2.9000'

Created on 2019-07-14 by the reprex package (v0.3.0)

But when I run this code in a chunk in a vignette:

is_xp_in_wages <- "xp" %in% names(wages_ts)

if (!is_xp_in_wages){
  stop("xp isn't in names of wages_ts?", 
       glue::glue_collapse(names(wages_ts), sep = ", ", last = ", and "),
       glue::glue_collapse(class(wages_ts), sep = ", "))
}

I get this error:

E  creating vignettes (13.9s)
   Quitting from lines 37-44 (visualisation-gallery.Rmd) 
   Error: processing vignette 'visualisation-gallery.Rmd' failed with diagnostics:
   xp isn't in names of wages_ts?id, lnw, exper, ged, postexp, black, hispanic, hgc, and ueratetbl_ts, tbl_df, tbl, data.frame
   Execution halted

I cannot work out how to fix this!

Multivariate summaries

By @dicook (imported from tprvan/brolgar#4)

  • dimension reduction methods for the indicators, with the purpose of summarising the type of structure existing in the data, and finding which indicators could be combined without loss of information
  • clustering of subjects based on indicators, to find which subjects are similar to each other
  • computing multivariate summaries of subjects, e.g. the multivariate median as defined in library(TukeyRegion), or things like data depth?

features for monotonicity

  • increasing - values are always increasing

  • decreasing - values are always decreasing

  • unvarying - values are unchanging

  • feat_monotonic: returns all three
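Minimal sketches of what these features could compute, assuming diff()-based checks (an assumption, not the shipped implementation):

```r
# Sketch: possible definitions for the monotonicity features
increasing <- function(x) all(diff(x) > 0)
decreasing <- function(x) all(diff(x) < 0)
unvarying  <- function(x) all(diff(x) == 0)

increasing(c(1, 2, 3)) # TRUE
decreasing(c(3, 2, 1)) # TRUE
unvarying(c(5, 5, 5))  # TRUE
```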

consider using percentile to cut up groups in `stratify_keys()`

Rather than trying to break groups up into 10 groups of equal size, when using along, you could try and break them up into ntiles, perhaps using dplyr::ntile():

library(brolgar)
#> Loading required package: tsibble
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:tsibble':
#> 
#>     id
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
ntile(world_heights$height_cm, 10)
#>    [1]  7  4  5  5  1  8  8  7  6  7  5  7  9  7  1  1  1  7  7  6  6  4  5
#>   [24]  8  6  6  6  7  8  8  8  6  6  6  6  6  6  6  7  8  8  8 10  7  6  5
#>   [47]  2  3  6  8  9  9  9  9  8  8  8  9  9  9  9  9 10 10  3  2  1  2  3
#>   [70]  2  3  4  5  5  6  7  7  7  8 10 10 10  8  8  9  9  9  9  9  1  3  3
#>   [93]  2  2  3  2  1  2  2  1  2  3  5 10  2  3  3  4  4  5  5  6 10 10  3
#>  [116]  4  2  3  7  6  6  8  8  8  7  2  2  2  2  2  2  3  3  6  7 10  4  5
#>  [139]  3  3  3  3  3  4  5  4  5  6  5  5  5  5  6  7  9  9  9  6  9  9  9
#>  [162]  5  4  5  2  9  9  6  9  8  9  8  9  9  9  9  9  5  7  1  3  3  1  1
#>  [185]  1  1  1  1  1  4  2  2  2  2  3  4  4  4  5  3  5  7  5  7  9  9  9
#>  [208]  8  9  9  9  8  9  9  9  9  8  8  9  9  9 10 10 10 10 10  7  5  5  3
#>  [231]  2  3  7  8  8  8  7  8  4  7  8 10 10 10 10  8 10 10 10  4  4  4  5
#>  [254]  5  4  4  4  3  3  3  3  5  6  7  8  8  9  1  2  2  2  2  3  4  5  6
#>  [277]  7  7  8  9  5  5  5  2  6  4  4  3  5  7  7  7  7  2  4  2  2  4  2
#>  [300]  6  7  6  7  6  4 10  4  8  8  7  6  5  5  6  7  8  9  9  8  2  2  2
#>  [323]  1  6  4  5  7  9 10 10 10  5  5  4  4  6  7  7  4  4  4  2  1  3  1
#>  [346]  3  2  2  3  4  4  5  6  8  8  8 10 10 10 10 10  5  4  6  6  6  7  7
#>  [369]  8  8 10 10 10 10 10 10 10  7  6  6  7  7  9 10  6  5  6  5  2  5  5
#>  [392]  4  6  8  8  9  9  7  9  7  8  7  5  4  3  1  1  1  3  5  6 10 10  9
#>  [415] 10 10 10  7  6  1  5  6  7  5  6  6  5  9  8  7  6  6  8  9  9 10 10
#>  [438] 10 10 10  2  1  1  1  4  3  5  5  5  3  3  3  3  4  4  4  4  5  5  6
#>  [461]  7  8  9 10 10 10 10  4  3  3  9  8  8  8  4  2  3 10  6  8  3  3  3
#>  [484]  2  6  6  7  4  2  3  6  6  5  5  3  4  5  6  7  7  8 10 10 10 10 10
#>  [507] 10 10  3  6  6  3  4  5  5  6  7  7  8  8  8  7  5  8  6  7  6  7 10
#>  [530] 10 10  1  1  1  1  1  1  1  1  2  2  8  3  7  7  7  6  7  8  8  7  4
#>  [553]  6  6  6  4  7  7  7  1  1  1  5  8  9  5  5  8  8  9  9  3  3  3  3
#>  [576]  4  6  5  6  5  5  2  2  2  2  3  3  3  3  3  4  5  7 10 10 10  1  2
#>  [599]  2  3  2  2  2  2  3  3  2  3  3  3  1  1  1  1  1  1  1  1  1  1  1
#>  [622]  1  1  1  1  1  1  2  2  3  4  4  6  4  4  6  5  6  7  7  8  9  9  7
#>  [645]  5  7  5  8  8  6  5  6  5  5  6  4  6  1  7  7  7  8  9 10 10 10 10
#>  [668] 10  7  7  6  4  6  6  5  4  4  4  4  3  3  4  3  2  2  2  2  3  3  4
#>  [691]  5  6  6  7  9 10 10 10  8  5  6  8  6 10 10  1  1  1  1  1  1  2  2
#>  [714]  4  5  9  4  8  8  8  8  9 10  4  3  2  3  4  2  4  3  8  9  9  3  4
#>  [737]  5  6  6  8  8  8  9  9  4  3  4  4  5  3  8  8  8  1  1  1  1  1  1
#>  [760]  1  1  1  1  1  2  2  2  2  2  2  3  5  8  9  9 10  5  5  5  6  7  6
#>  [783]  6  6  7  6  4  6  5  5  7  7  5  5  4  3  4 10  6  7  8  7  2  2  1
#>  [806]  3  2  2  2  3  3  3  2  5  7  4  5  5  5  5  5  1  1  1  1  1  1  1
#>  [829]  1  1  2  2  2  3  4 10  7  5  8  7  6  9  9  7  8  9  9  9  9  7  7
#>  [852]  7  8  4  3  3  3  2  2  2  1  1  1  2  2  2  3  3  3  3  3  1  4  4
#>  [875]  4  5  6  8  4  9  9 10 10  2  3  7  6  7  8  8  8  7  9  1  1  3  3
#>  [898]  5  8  5  7  6  4  5  4  4  1  2  1  1  1  2  1  1  1  3  3  2  1  1
#>  [921]  4  5  5  9  9  9  9  9  9  2  1  1  2  1  2  2  2  2  7  7  4  3  4
#>  [944]  4  3  3  4  5  5  7  7  8  9 10 10 10 10 10 10 10  9  9  9  9 10  4
#>  [967]  4  5  5  7  8  6 10  7  9  9  9  9  8  9  9  8  2  1  4  2  4  6  5
#>  [990]  4  3  5  7  8  8  7  2  1  1  4  4  4  4  5  4  7  6  7  6  7  8  8
#> [1013]  9  9  9 10 10 10 10 10 10 10 10  8  4  2  3  4  4  6  3  6  3  5  8
#> [1036]  9  8  1  1  1  1  1  4  4  9  8  5  4  4  2  2  2  1  2  1  1  1  1
#> [1059]  1  2  2  3  3  5  1  1  1  1  1  2  3  4  5  8  6  2  4  5  5  5  5
#> [1082]  7 10  8 10 10 10  3  3  3  4  4  3  3  2  3  4  5  4  2  3  3  3  3
#> [1105]  3  3  3  3  4  5  6  7  9  9  3  2  3  4  3  6  7 10  9  3  2  2  3
#> [1128]  3  2  2  3  2  1  1  1  2  2  3  2  2  4  5  6  7  7  5  6  7  9 10
#> [1151] 10 10  9  9  9  9  7  7  7  6  5  7  8  8  9  9  4  9  6  9  8  7  9
#> [1174]  9  9 10 10 10 10  5  8  5  6 10 10  1  3  2  3  2  4  4  5  4  6  4
#> [1197]  9 10  3  2  4  4  9 10  9  6 10 10 10  9  9  9  9  7  9  8  8  8  7
#> [1220]  8  6  8  7  7  6  2  1  1  1  5  7  8  9 10  2  3  2  3  2  1  3  2
#> [1243]  2  2  3  3  3  4  4  4  5  8 10 10 10  2  3  4  7  9  6  8  8  6  7
#> [1266]  8  6  7  7  8  8  8  7  7  6  6  5  5  6  5  6  6  7  7  8  9  9  9
#> [1289] 10  6 10 10 10 10 10 10  6  5  3  5  8  8  9  8  9  5  6  2  5  6  7
#> [1312]  8  4  3  5  4  4  4  6  6  6  7  6  4  4  5  4  5  5  6  5  1  1  2
#> [1335]  1  2  2  1  2  5  5  6  7  8  9  5  4  5  7  8  8  8  6 10 10  2  4
#> [1358]  8  7  7  8  6  7  7  6  7  7  5  5  6  7  8  7  6  6  6  5  5  4  6
#> [1381]  3  8  8  8  7  2  2  4  5  2  2  1  3  3  3  2  3  4 10  7  6  3  4
#> [1404]  7  8  8  6  6  4  8  7  5  5  4  5  5  6  6  7  8  8 10 10 10 10 10
#> [1427] 10  9  9  9  9 10  9  9 10  9  9  8  9  7  7  8  9 10 10 10 10 10 10
#> [1450] 10 10  4  6  6  4  4  4  4  8  9  9  9  8  1  1  1  1  1  1  1  1  1
#> [1473]  1  1  2  2  1  1  1  2  4  2  1  1  2  2  6  3  6  8  8  7  6  7  6
#> [1496]  8  9  9  8

Created on 2019-07-23 by the reprex package (v0.3.0)

This could avoid all the work of setting up the number of groups and re-arranging and joining everything? I guess the problem is that each id needs to have the value summarised somehow as well.
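One way to handle that (a sketch): summarise the variable per key first, then ntile the summaries, so that all observations for a key land in the same group:

```r
# Sketch: per-key summary first, then ntile the summaries
library(dplyr)
library(brolgar)

height_groups <- world_heights %>%
  group_by(country) %>%
  summarise(max_height = max(height_cm, na.rm = TRUE)) %>%
  mutate(height_group = ntile(max_height, 10))

# every row for a country now shares one height_group
world_heights %>%
  left_join(height_groups, by = "country")
```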

l_diff_range (or equivalent)

diff_range <- function(x, na.rm = TRUE){
    diff(range(x, na.rm = na.rm))
}

diff_range(c(1:20))
#> [1] 19
diff_range(rnorm(10))
#> [1] 2.998619

# or in a dataset:
library(brolgar)
library(tibble)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
world_heights
#> # A tibble: 1,499 x 4
#>     code country  year height_cm
#>    <dbl> <chr>   <dbl>     <dbl>
#>  1   276 Germany  1550      168.
#>  2   616 Poland   1550      170.
#>  3   276 Germany  1650      170.
#>  4   616 Poland   1650      168.
#>  5   250 France   1660      163.
#>  6   250 France   1670      161.
#>  7   250 France   1680      162.
#>  8   250 France   1690      161.
#>  9   643 Russia   1700      164.
#> 10   250 France   1710      165 
#> # … with 1,489 more rows

world_heights %>%
    group_by(country) %>%
    summarise(diff_range = diff_range(height_cm)) %>%
    arrange(-diff_range)
#> # A tibble: 153 x 2
#>    country        diff_range
#>    <chr>               <dbl>
#>  1 Netherlands          18.5
#>  2 Malaysia             18.4
#>  3 Czech Republic       17.9
#>  4 Denmark              17.8
#>  5 Austria              17.2
#>  6 Germany              17.2
#>  7 Russia               17.2
#>  8 Croatia              17  
#>  9 Hungary              16.7
#> 10 Guyana               16.1
#> # … with 143 more rows

Created on 2019-06-07 by the reprex package (v0.2.1)

explore what is going wrong with `monotonics`

They should be mutually exclusive, but apparently not:

library(brolgar)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

wages_ts %>%
  features(ln_wages, feat_monotonic)
#> # A tibble: 888 x 4
#>       id increase decrease unvary
#>    <int> <lgl>    <lgl>    <lgl> 
#>  1    31 FALSE    FALSE    FALSE 
#>  2    36 FALSE    FALSE    FALSE 
#>  3    53 FALSE    FALSE    FALSE 
#>  4   122 FALSE    FALSE    FALSE 
#>  5   134 FALSE    FALSE    FALSE 
#>  6   145 FALSE    FALSE    FALSE 
#>  7   155 FALSE    FALSE    FALSE 
#>  8   173 FALSE    FALSE    FALSE 
#>  9   206 TRUE     FALSE    FALSE 
#> 10   207 FALSE    FALSE    FALSE 
#> # … with 878 more rows

wages_ts %>%
  features(ln_wages, feat_monotonic) %>%
  filter(increase)
#> # A tibble: 88 x 4
#>       id increase decrease unvary
#>    <int> <lgl>    <lgl>    <lgl> 
#>  1   206 TRUE     FALSE    FALSE 
#>  2   266 TRUE     TRUE     TRUE  
#>  3   295 TRUE     FALSE    FALSE 
#>  4   304 TRUE     TRUE     TRUE  
#>  5   518 TRUE     FALSE    FALSE 
#>  6   911 TRUE     TRUE     TRUE  
#>  7  1032 TRUE     TRUE     TRUE  
#>  8  1219 TRUE     TRUE     TRUE  
#>  9  1282 TRUE     TRUE     TRUE  
#> 10  1508 TRUE     FALSE    FALSE 
#> # … with 78 more rows

Created on 2019-07-14 by the reprex package (v0.3.0)
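One plausible culprit, assuming the features are implemented with all(diff(x) ...) style checks (an assumption): for a key with a single observation, diff(x) has length zero, and all() of an empty logical vector is TRUE, so all three features come back TRUE at once:

```r
# all() on an empty logical is TRUE, so a single observation
# satisfies every monotonic check simultaneously
x <- 2.5
all(diff(x) > 0)  # TRUE
all(diff(x) < 0)  # TRUE
all(diff(x) == 0) # TRUE
```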

Longnostics for time

These should be summaries for things like:

  • Days between measurements
  • Range of days between measurements (1st and last)
  • Average/mean/median/etc number of days between measurements

Date features should also be factored into the other longnostics, where appropriate.
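A sketch of per-key gap summaries (assuming a date or time index column; function and column names hypothetical):

```r
# Sketch: summaries of the gaps between measurements for each key
library(dplyr)

time_longnostics <- function(data, id, time) {
  data %>%
    group_by({{ id }}) %>%
    arrange({{ time }}, .by_group = TRUE) %>%
    summarise(
      gap_min    = min(diff({{ time }})),
      gap_max    = max(diff({{ time }})),
      gap_median = median(diff({{ time }}))
    )
}
```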

longnostics instead of lognostics

lognostics reminds me of log more than cognostics - I wonder if longnostics would be a better word to describe longitudinal cognostics?

functions to select a random number of ids

Like so:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(brolgar)

wages %>%
  group_by(id) %>%
  nest() %>%
  sample_n(size = 10) %>% 
  unnest()
#> # A tibble: 57 x 15
#>       id   lnw exper   ged postexp black hispanic   hgc hgc.9 uerate   ue.7
#>    <int> <dbl> <dbl> <int>   <dbl> <int>    <int> <int> <int>  <dbl>  <dbl>
#>  1  8300  2.53 5.64      1   5.64      0        1    12     3   6.49 -0.505
#>  2  8300  2.52 6.54      1   6.54      0        1    12     3   5.99 -1.00 
#>  3  8300  2.49 7.77      1   7.77      0        1    12     3   6.2  -0.805
#>  4  1273  2.25 0.565     1   0         0        0    11     2   4.99 -2.00 
#>  5  1273  2.21 1.14      1   0.578     0        0    11     2   5.49 -1.50 
#>  6  1273  1.79 2.32      1   1.75      0        0    11     2   8     1    
#>  7  1273  1.82 3.45      1   2.89      0        0    11     2   5.89 -1.10 
#>  8  1273  1.54 4.10      1   3.53      0        0    11     2   6.07 -0.926
#>  9  1273  1.20 5.41      1   4.84      0        0    11     2   7.5   0.5  
#> 10  1273  1.71 6.66      1   6.09      0        0    11     2   6.09 -0.905
#> # … with 47 more rows, and 4 more variables: ue.centert1 <dbl>,
#> #   ue.mean <dbl>, ue.person.cen <dbl>, ue1 <dbl>

wages %>%
  group_by(id) %>%
  nest() %>%
  sample_frac(size = 0.1) %>%
  unnest()
#> # A tibble: 631 x 15
#>       id   lnw exper   ged postexp black hispanic   hgc hgc.9 uerate
#>    <int> <dbl> <dbl> <int>   <dbl> <int>    <int> <int> <int>  <dbl>
#>  1  3440  1.86 0.163     0       0     0        0    11     2   5.59
#>  2  3440  1.91 0.983     0       0     0        0    11     2   6.7 
#>  3  3440  1.59 1.61      0       0     0        0    11     2   8.7 
#>  4  3440  2.26 2.45      0       0     0        0    11     2   9.6 
#>  5  3440  2.16 3.40      0       0     0        0    11     2  11.4 
#>  6  3440  2.10 4.38      0       0     0        0    11     2   9.4 
#>  7  3440  2.16 5.55      0       0     0        0    11     2   7.1 
#>  8  3440  2.12 6.70      0       0     0        0    11     2   8   
#>  9  3440  2.16 7.70      0       0     0        0    11     2   7.1 
#> 10  3440  2.25 9.05      0       0     0        0    11     2   6.9 
#> # … with 621 more rows, and 5 more variables: ue.7 <dbl>,
#> #   ue.centert1 <dbl>, ue.mean <dbl>, ue.person.cen <dbl>, ue1 <dbl>

sample_n_obs <- function(data, id, size){
  
  q_id <- rlang::enquo(id)
  
  data %>%
    dplyr::group_by(!!q_id) %>%
    tidyr::nest() %>%
    dplyr::sample_n(size = size) %>%
    tidyr::unnest()
  
}

sample_frac_obs <- function(data, id, size){
  
  q_id <- rlang::enquo(id)
  
  data %>%
    dplyr::group_by(!!q_id) %>%
    tidyr::nest() %>%
    dplyr::sample_frac(size = size) %>%
    tidyr::unnest()
  
}

library(ggplot2)

sample_n_obs(wages,
             id,
             10) %>%
  ggplot(aes(x = exper,
             y = uerate,
             group = id)) + 
  geom_line()

sample_frac_obs(wages,
                id,
                0.05) %>%
  ggplot(aes(x = exper,
             y = uerate,
             group = id)) + 
  geom_line()

Created on 2019-04-08 by the reprex package (v0.2.1)

document wages data

  • What are each of the variables in wages?
  • Is there a link to the original data?
  • Can we change the variable names to be more informative?
