
brolgar's Introduction

👋 G'day!

I'm a research software engineer working with Nick Golding in the Infectious Disease Ecology and Modelling team at the Telethon Kids Institute. My primary goal is to improve data analysis with free and open source software.

  • 🗒️ Available to teach courses on data science, reproducible reporting, analysis pipelines, and building R packages
  • 🦠 Working on creating analysis pipelines and R packages for disease modelling
  • 💬 Ask me about missing data, longitudinal data, writing functions, and ☕
  • 🛠️ Maintain the greta software for statistical modelling
  • 💻 Mostly program in R
  • 😄 Pronouns: He / Him

brolgar's People

Contributors

dicook, hadley, kant, njtierney, tprvan


brolgar's Issues

`near` helper functions

  • is_near - returns a logical indicating whether a value is near another value (within a tolerance)
  • top_near - like top_n, returns the "top n" nearest results within a tolerance. So top_near(data, x, near, tol), where x is a vector, near is the number to compare each value against, and tol is the tolerance within which to accept values in the first place.
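These could be sketched in base R along the following lines (a sketch only, with the names proposed above; here `x` is passed as a column name for simplicity, rather than a vector):

```r
# Sketch only: possible shapes for the proposed helpers
is_near <- function(x, y, tol = .Machine$double.eps^0.5) {
  abs(x - y) < tol
}

top_near <- function(data, x, near, tol) {
  # keep rows where column `x` is within `tol` of `near`,
  # ordered from nearest to furthest
  dist <- abs(data[[x]] - near)
  keep <- data[dist < tol, , drop = FALSE]
  keep[order(dist[dist < tol]), , drop = FALSE]
}

is_near(c(1, 1.1, 2), 1, tol = 0.2)
#> [1]  TRUE  TRUE FALSE
```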

add_l_* functions

The add_l_ set of functions complement the l_ set of functions:

e.g.,

  • add_slope
  • add_length

This avoids needing to join again later by joining straight to the data.
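For example, add_slope could be a thin wrapper that computes the slopes and immediately joins them back on (a sketch only, assuming the current l_slope() interface and using rlang/dplyr for the join key):

```r
# Sketch: compute per-id slopes and join them straight back onto the data
add_slope <- function(data, id, formula) {
  q_id <- rlang::enquo(id)
  slopes <- l_slope(data, !!q_id, formula)
  dplyr::left_join(data, slopes, by = rlang::as_name(q_id))
}

# wages would gain intercept and slope columns, repeated within each id:
# add_slope(wages, id, lnw ~ exper)
```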

function to put ids into K sets of random groups

This can be combined with sample_frac_obs(), where you might want to prune down the observations, and then get a quick view of the longitudinal observations, like so:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(brolgar)

wages %>%
  sample_frac_obs(id = id, size = 0.1) %>%
  group_by(id) %>%
  nest() %>%
  mutate(.rand_id = sample(1:10, nrow(.), replace = TRUE)) %>%
  unnest() %>%
  ggplot(aes(x = lnw,
             y = exper,
             group = id)) + 
  geom_line() + 
  facet_wrap(~.rand_id)

Created on 2019-04-08 by the reprex package (v0.2.1)

Example datasets

  • Health + Medical data
  • University rankings
  • census info (over long time?)

add feat_lag

This should create p columns, with p either calculated from the maximum number of observations for a key, or supplied by the user.

So there would be: lag_1, ..., lag_p columns
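A per-key sketch of what feat_lag might compute (helper name hypothetical):

```r
# Sketch: build lag_1, ..., lag_p columns from one key's measurements
lag_cols <- function(x, p) {
  out <- lapply(seq_len(p), function(i) dplyr::lag(x, n = i))
  names(out) <- paste0("lag_", seq_len(p))
  tibble::as_tibble(out)
}

lag_cols(c(10, 20, 30), p = 2)
#> # A tibble: 3 x 2
#>   lag_1 lag_2
#>   <dbl> <dbl>
#> 1    NA    NA
#> 2    10    NA
#> 3    20    10
```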

long data format visualisations using gghighlight

Something like this, where we split plots into facets based on features, would be neat. It might be worthwhile to include a function like gather_features() to facilitate this.

library(gghighlight)
#> Loading required package: ggplot2
library(brolgar)
library(tidyverse)
wages_ts %>%
  features(ln_wages, feat_monotonic) %>%
  left_join(wages_ts, by = "id") %>%
  select(id:xp) %>%
  gather(key = "features",
         value = "inc_dec",
         - id,
         - unvary,
         - monotonic,
         - ln_wages,
         - xp) %>%
  filter(inc_dec) %>%
  ggplot(aes(x = xp,
             y = ln_wages,
             group = id)) + 
  geom_line() + 
  gghighlight() + 
  facet_wrap(~features)

Created on 2019-07-19 by the reprex package (v0.3.0)
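A gather_features() helper as floated above might be little more than a pivot over the logical feature columns (name and interface hypothetical):

```r
# Sketch: pivot logical feature columns into long format for facetting
gather_features <- function(data, feature_cols) {
  tidyr::pivot_longer(data,
                      cols = dplyr::all_of(feature_cols),
                      names_to = "feature",
                      values_to = "value")
}

# e.g. gather_features(wages_features, c("increase", "decrease", "unvary"))
```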

l_slope is a bit slow

It would be great to improve the speed of l_slope

l_slope_old <- function(df, id, formula){
  l <- split(df, df[[id]])
  sl <- purrr::map(l, ~ eval(substitute(lm(formula, data = .)))) %>%
    purrr::map_dfr( ~ as.data.frame(t(as.matrix(coef(
      .
    ))))) %>%
    dplyr::mutate(id = as.integer(names(l))) %>%
    dplyr::rename_all( ~ c("intercept", "slope", "id")) %>%
    dplyr::select(id, intercept, slope) %>%
    tibble::as_tibble()
  return(sl)
}

library(brolgar)

bm1 <- bench::mark(old = l_slope_old(wages, "id", "lnw~exper"),
                   new = l_slope(wages, id, lnw~exper))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.

summary(bm1)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 x 10
#>   expression      min     mean   median      max `itr/sec` mem_alloc  n_gc
#>   <chr>      <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt> <dbl>
#> 1 old        748.71ms 748.71ms 748.71ms 748.71ms     1.34     7.25MB    16
#> 2 new           2.45s    2.45s    2.45s    2.45s     0.408   11.93MB    67
#> # … with 2 more variables: n_itr <int>, total_time <bch:tm>
summary(bm1, relative = TRUE)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 x 10
#>   expression   min  mean median   max `itr/sec` mem_alloc  n_gc n_itr
#>   <chr>      <dbl> <dbl>  <dbl> <dbl>     <dbl>     <dbl> <dbl> <dbl>
#> 1 old         1     1      1     1         3.27      1     1        1
#> 2 new         3.27  3.27   3.27  3.27      1         1.65  4.19     1
#> # … with 1 more variable: total_time <dbl>

Created on 2019-04-02 by the reprex package (v0.2.1)

use tsibble not data.frame

brolgar should use tsibble data methods for S3, rather than provide additional data.frame methods. Perhaps in the future the methods can be extended to work for data.frame as well, but until the API is stable, I don't see much point in doing a lot of extra work.

Change wages_ts to wages?

Unless I want to keep an example dataset in there that describes "how to create your own tsibble data". But maybe that can be the world_heights data.

faceting functions to split/cut of a feature

Idea 1: facet along some feature/variable

gg... + 
  facet_along(~var, n_facets = 10, order = "ascend")

So here the facet_along() function would break a given variable up into 10 groups in ascending order.

This means that you could create a feature on, say, slope, and then get 10 evenly spaced groups from high to low.
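Until such a function exists, the same effect can be approximated with dplyr::ntile() on a computed feature (a sketch, reusing key_slope() as described elsewhere in these issues):

```r
# Sketch: emulate facet_along(~ .slope_xp, n_facets = 10) by hand
library(dplyr)
library(ggplot2)
library(brolgar)

wages_ts %>%
  left_join(key_slope(wages_ts, ln_wages ~ xp), by = "id") %>%
  mutate(slope_group = ntile(.slope_xp, 10)) %>%
  ggplot(aes(x = xp,
             y = ln_wages,
             group = id)) +
  geom_line() +
  facet_wrap(~slope_group)
```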

Idea 2: facet on a subsample of the data

gg... + 
  facet_sample(~var, n_facet = ..., n_sample = ...)

This would be a similar idea but is specifically for reducing a current spaghetti plot into many plots with fewer observations. So:

  • n_facet = 10 and n_sample = 10 will yield 1 sample per facet (n_sample / n_facet)
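A hand-rolled version of facet_sample might sample keys and deal them out across facets (a sketch; the .facet column and argument names are hypothetical):

```r
# Sketch: sample 100 ids and deal them across 10 facets (10 ids each)
library(dplyr)
library(ggplot2)
library(brolgar)

sampled_ids <- sample(unique(wages_ts$id), size = 100)
facet_lookup <- tibble::tibble(
  id = sampled_ids,
  .facet = sample(rep(1:10, length.out = 100))
)

wages_ts %>%
  inner_join(facet_lookup, by = "id") %>%
  ggplot(aes(x = xp,
             y = ln_wages,
             group = id)) +
  geom_line() +
  facet_wrap(~.facet)
```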

rename / rework l_n_obs

It is a very handy function, and I find that the other ways of getting the same piece of information are somewhat cumbersome:

library(brolgar)
library(tsibble)

l_n_obs(wages_ts)
#> # A tibble: 888 x 2
#>       id n_obs
#>    <int> <int>
#>  1    31     8
#>  2    36    10
#>  3    53     8
#>  4   122    10
#>  5   134    12
#>  6   145     9
#>  7   155    11
#>  8   173     6
#>  9   206     3
#> 10   207    11
#> # … with 878 more rows

wages_ts %>% features(id, length)
#> # A tibble: 888 x 2
#>       id    V1
#>    <int> <int>
#>  1    31     8
#>  2    36    10
#>  3    53     8
#>  4   122    10
#>  5   134    12
#>  6   145     9
#>  7   155    11
#>  8   173     6
#>  9   206     3
#> 10   207    11
#> # … with 878 more rows

wages_ts %>% features(!!index(wages_ts), length)
#> # A tibble: 888 x 2
#>       id    V1
#>    <int> <int>
#>  1    31     8
#>  2    36    10
#>  3    53     8
#>  4   122    10
#>  5   134    12
#>  6   145     9
#>  7   155    11
#>  8   173     6
#>  9   206     3
#> 10   207    11
#> # … with 878 more rows

wages_ts %>% features(!!index(wages_ts), 
                      list(n_obs = length))
#> # A tibble: 888 x 2
#>       id n_obs
#>    <int> <int>
#>  1    31     8
#>  2    36    10
#>  3    53     8
#>  4   122    10
#>  5   134    12
#>  6   145     9
#>  7   155    11
#>  8   173     6
#>  9   206     3
#> 10   207    11
#> # … with 878 more rows

Created on 2019-07-11 by the reprex package (v0.2.1)

So then, what can I call it?

Use `key_slope` not `l_slope`

In an effort to remove the l_ function family and stay consistent with naming of things in tidyverts / tsibble, l_slope should instead be key_slope, which returns intercept+slope info for each key.

`add_key_slope` does not work, but `key_slope` does

Somehow ln_wages is not passed properly; I'm probably missing something obvious?

library(brolgar)

# fine
key_slope(wages_ts,ln_wages ~ xp)
#> # A tibble: 888 x 3
#>       id .intercept .slope_xp
#>    <int>      <dbl>     <dbl>
#>  1    31       1.41    0.101 
#>  2    36       2.04    0.0588
#>  3    53       2.29   -0.358 
#>  4   122       1.93    0.0374
#>  5   134       2.03    0.0831
#>  6   145       1.59    0.0469
#>  7   155       1.66    0.0867
#>  8   173       1.61    0.100 
#>  9   206       1.73    0.180 
#> 10   207       1.62    0.0884
#> # … with 878 more rows

# errors
add_key_slope(wages_ts,ln_wages ~ xp)
#> Error in eval(predvars, data, env): object 'ln_wages' not found

Created on 2019-07-13 by the reprex package (v0.3.0)

Do not create longnostics for `id`'s with only one observation

We are not really interested in calculating statistics for cases where there is only one observation: the mean of one number is not very useful, and you cannot calculate things like the variance or standard deviation of a single value.

  • Add a bookkeeping function to identify ids with only one observation
  • Outline a workflow for removing those ids with only one observation
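The bookkeeping step could look something like this (a sketch with dplyr):

```r
# Sketch: find ids with a single observation, then drop them
library(dplyr)
library(brolgar)

single_obs_ids <- wages %>%
  count(id) %>%
  filter(n == 1) %>%
  pull(id)

wages_many_obs <- wages %>%
  filter(!(id %in% single_obs_ids))
```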

l_slope should be able to take any number of RHS on the formula

Currently l_slope can only take one explanatory variable on the RHS:

library(brolgar)
data(wages)
l_slope(wages, id, lnw~exper)
#> # A tibble: 888 x 3
#> # Groups:   id [888]
#>       id intercept   slope
#>    <int>     <dbl>   <dbl>
#>  1    31      1.41  0.101 
#>  2    36      2.04  0.0588
#>  3    53      2.29 -0.358 
#>  4   122      1.93  0.0374
#>  5   134      2.03  0.0831
#>  6   145      1.59  0.0469
#>  7   155      1.66  0.0867
#>  8   173      1.61  0.100 
#>  9   206      1.73  0.180 
#> 10   207      1.62  0.0884
#> # … with 878 more rows
l_slope(wages, id, lnw~exper+ged)
#> Error: `nm` must be `NULL` or a character vector the same length as `x`
#> Backtrace:
#>      █
#>   1. ├─base::tryCatch(...)
#>   2. │ └─base:::tryCatchList(expr, classes, parentenv, handlers)
#>   3. │   ├─base:::tryCatchOne(...)
#>   4. │   │ └─base:::doTryCatch(return(expr), name, parentenv, handler)
#>   5. │   └─base:::tryCatchList(expr, names[-nh], parentenv, handlers[-nh])
#>   6. │     └─base:::tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>   7. │       └─base:::doTryCatch(return(expr), name, parentenv, handler)
#>   8. ├─base::withCallingHandlers(...)
#>   9. ├─base::saveRDS(...)
#>  10. ├─base::do.call(...)
#>  11. ├─(function (what, args, quote = FALSE, envir = parent.frame()) ...
#>  12. ├─(function (input) ...
#>  13. │ └─rmarkdown::render(input, quiet = TRUE, envir = globalenv())
#>  14. │   └─knitr::knit(...)
#>  15. │     └─knitr:::process_file(text, output)
#>  16. │       ├─base::withCallingHandlers(...)
#>  17. │       ├─knitr:::process_group(group)
#>  18. │       └─knitr:::process_group.block(group)
#>  19. │         └─knitr:::call_block(x)
#>  20. │           └─knitr:::block_exec(params)
#>  21. │             ├─knitr:::in_dir(...)
#>  22. │             └─knitr:::evaluate(...)
#>  23. │               └─evaluate::evaluate(...)
#>  24. │                 └─evaluate:::evaluate_call(...)
#>  25. │                   ├─evaluate:::timing_fn(...)
#>  26. │                   ├─evaluate:::handle(...)
#>  27. │                   │ └─base::try(f, silent = TRUE)
#>  28. │                   │   └─base::tryCatch(...)
#>  29. │                   │     └─base:::tryCatchList(expr, classes, parentenv, handlers)
#>  30. │                   │       └─base:::tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>  31. │                   │         └─base:::doTryCatch(return(expr), name, parentenv, handler)
#>  32. │                   ├─base::withCallingHandlers(...)
#>  33. │                   ├─base::withVisible(eval(expr, envir, enclos))
#>  34. │                   └─base::eval(expr, envir, enclos)
#>  35. │                     └─base::eval(expr, envir, enclos)
#>  36. └─brolgar::l_slope(wages, id, lnw ~ exper + ged)
#>  37.   └─`%>%`(...) /Users/ntie0001/github/njtierney/brolgar/R/lognostics.R:249:2
#>  38.     ├─base::withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
#>  39.     └─base::eval(quote(`_fseq`(`_lhs`)), env, env)
#>  40.       └─base::eval(quote(`_fseq`(`_lhs`)), env, env)
#>  41.         └─brolgar:::`_fseq`(`_lhs`)
#>  42.           └─magrittr::freduce(value, `_function_list`)
#>  43.             ├─base::withVisible(function_list[[k]](value))
#>  44.             └─function_list[[k]](value)
#>  45.               └─dplyr::rename_all(., ~c("id", "intercept", "slope"))
#>  46.                 └─dplyr:::vars_select_syms(vars, funs, .tbl, strict = TRUE)
#>  47.                   └─rlang::set_names(syms(vars), fun(vars))
#>  48.                     └─rlang:::set_names_impl(x, x, nm, ...)

Created on 2019-04-02 by the reprex package (v0.2.1)

There needs to be better dynamic naming in the rename_all step.
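One option (a sketch only, not necessarily the fix that landed) is to take the names from the fitted coefficients, which scales to any number of RHS terms:

```r
# Sketch: derive column names from coef() rather than hard-coding them
library(brolgar)

coef_names <- function(fit) {
  nms <- names(coef(fit))
  nms[nms == "(Intercept)"] <- "intercept"
  nms
}

fit <- lm(lnw ~ exper + ged, data = wages)
coef_names(fit)
#> [1] "intercept" "exper"     "ged"
```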

devtools::check() fails because wages_ts data is not up to date

Current status of the names in brolgar:

library(brolgar)
names(wages)
#> [1] "id"       "lnw"      "exper"    "ged"      "postexp"  "black"   
#> [7] "hispanic" "hgc"      "uerate"
names(wages_ts)
#> [1] "id"            "ln_wages"      "xp"            "ged"          
#> [5] "postexp"       "black"         "hispanic"      "high_grade"   
#> [9] "unemploy_rate"
packageVersion("brolgar")
#> [1] '0.0.2.9000'

Created on 2019-07-14 by the reprex package (v0.3.0)

But when I run this code in a chunk in a vignette:

is_xp_in_wages <- "xp" %in% names(wages_ts)

if (!is_xp_in_wages){
  stop("xp isn't in names of wages_ts?", 
       glue::glue_collapse(names(wages_ts), sep = ", ", last = ", and "),
       glue::glue_collapse(class(wages_ts), sep = ", "))
}

I get this error:

E  creating vignettes (13.9s)
   Quitting from lines 37-44 (visualisation-gallery.Rmd) 
   Error: processing vignette 'visualisation-gallery.Rmd' failed with diagnostics:
   xp isn't in names of wages_ts?id, lnw, exper, ged, postexp, black, hispanic, hgc, and ueratetbl_ts, tbl_df, tbl, data.frame
   Execution halted

I cannot work out how to fix this!

Multivariate summaries

By @dicook (imported from tprvan/brolgar#4)

  • dimension reduction methods for the indicators, with the purpose of summarising the type of structure existing in the data, and finding which indicators could be combined without loss of information
  • clustering of subjects based on indicators, to find which subjects are similar to each other
  • computing multivariate summaries of subjects, e.g. the multivariate median as defined in library(TukeyRegion), or things like data depth?

features for monotonicity

  • increasing - values are always increasing

  • decreasing - values are always decreasing

  • unvarying - values are unchanging

  • feat_monotonic: returns all three
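Minimal sketches of what these features could compute, assuming diff()-based checks (an assumption, not the shipped implementation):

```r
# Sketch: possible definitions for the monotonicity features
increasing <- function(x) all(diff(x) > 0)
decreasing <- function(x) all(diff(x) < 0)
unvarying  <- function(x) all(diff(x) == 0)

increasing(c(1, 2, 3)) # TRUE
decreasing(c(3, 2, 1)) # TRUE
unvarying(c(5, 5, 5))  # TRUE
```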

consider using percentile to cut up groups in `stratify_keys()`

Rather than trying to break groups up into 10 groups of equal size, when using along, you could try and break them up into ntiles, perhaps using dplyr::ntile():

library(brolgar)
#> Loading required package: tsibble
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:tsibble':
#> 
#>     id
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
ntile(world_heights$height_cm, 10)
#>    [1]  7  4  5  5  1  8  8  7  6  7  5  7  9  7  1  1  1  7  7  6  6  4  5
#>   [24]  8  6  6  6  7  8  8  8  6  6  6  6  6  6  6  7  8  8  8 10  7  6  5
#>   [47]  2  3  6  8  9  9  9  9  8  8  8  9  9  9  9  9 10 10  3  2  1  2  3
#>   [70]  2  3  4  5  5  6  7  7  7  8 10 10 10  8  8  9  9  9  9  9  1  3  3
#>   [93]  2  2  3  2  1  2  2  1  2  3  5 10  2  3  3  4  4  5  5  6 10 10  3
#>  [116]  4  2  3  7  6  6  8  8  8  7  2  2  2  2  2  2  3  3  6  7 10  4  5
#>  [139]  3  3  3  3  3  4  5  4  5  6  5  5  5  5  6  7  9  9  9  6  9  9  9
#>  [162]  5  4  5  2  9  9  6  9  8  9  8  9  9  9  9  9  5  7  1  3  3  1  1
#>  [185]  1  1  1  1  1  4  2  2  2  2  3  4  4  4  5  3  5  7  5  7  9  9  9
#>  [208]  8  9  9  9  8  9  9  9  9  8  8  9  9  9 10 10 10 10 10  7  5  5  3
#>  [231]  2  3  7  8  8  8  7  8  4  7  8 10 10 10 10  8 10 10 10  4  4  4  5
#>  [254]  5  4  4  4  3  3  3  3  5  6  7  8  8  9  1  2  2  2  2  3  4  5  6
#>  [277]  7  7  8  9  5  5  5  2  6  4  4  3  5  7  7  7  7  2  4  2  2  4  2
#>  [300]  6  7  6  7  6  4 10  4  8  8  7  6  5  5  6  7  8  9  9  8  2  2  2
#>  [323]  1  6  4  5  7  9 10 10 10  5  5  4  4  6  7  7  4  4  4  2  1  3  1
#>  [346]  3  2  2  3  4  4  5  6  8  8  8 10 10 10 10 10  5  4  6  6  6  7  7
#>  [369]  8  8 10 10 10 10 10 10 10  7  6  6  7  7  9 10  6  5  6  5  2  5  5
#>  [392]  4  6  8  8  9  9  7  9  7  8  7  5  4  3  1  1  1  3  5  6 10 10  9
#>  [415] 10 10 10  7  6  1  5  6  7  5  6  6  5  9  8  7  6  6  8  9  9 10 10
#>  [438] 10 10 10  2  1  1  1  4  3  5  5  5  3  3  3  3  4  4  4  4  5  5  6
#>  [461]  7  8  9 10 10 10 10  4  3  3  9  8  8  8  4  2  3 10  6  8  3  3  3
#>  [484]  2  6  6  7  4  2  3  6  6  5  5  3  4  5  6  7  7  8 10 10 10 10 10
#>  [507] 10 10  3  6  6  3  4  5  5  6  7  7  8  8  8  7  5  8  6  7  6  7 10
#>  [530] 10 10  1  1  1  1  1  1  1  1  2  2  8  3  7  7  7  6  7  8  8  7  4
#>  [553]  6  6  6  4  7  7  7  1  1  1  5  8  9  5  5  8  8  9  9  3  3  3  3
#>  [576]  4  6  5  6  5  5  2  2  2  2  3  3  3  3  3  4  5  7 10 10 10  1  2
#>  [599]  2  3  2  2  2  2  3  3  2  3  3  3  1  1  1  1  1  1  1  1  1  1  1
#>  [622]  1  1  1  1  1  1  2  2  3  4  4  6  4  4  6  5  6  7  7  8  9  9  7
#>  [645]  5  7  5  8  8  6  5  6  5  5  6  4  6  1  7  7  7  8  9 10 10 10 10
#>  [668] 10  7  7  6  4  6  6  5  4  4  4  4  3  3  4  3  2  2  2  2  3  3  4
#>  [691]  5  6  6  7  9 10 10 10  8  5  6  8  6 10 10  1  1  1  1  1  1  2  2
#>  [714]  4  5  9  4  8  8  8  8  9 10  4  3  2  3  4  2  4  3  8  9  9  3  4
#>  [737]  5  6  6  8  8  8  9  9  4  3  4  4  5  3  8  8  8  1  1  1  1  1  1
#>  [760]  1  1  1  1  1  2  2  2  2  2  2  3  5  8  9  9 10  5  5  5  6  7  6
#>  [783]  6  6  7  6  4  6  5  5  7  7  5  5  4  3  4 10  6  7  8  7  2  2  1
#>  [806]  3  2  2  2  3  3  3  2  5  7  4  5  5  5  5  5  1  1  1  1  1  1  1
#>  [829]  1  1  2  2  2  3  4 10  7  5  8  7  6  9  9  7  8  9  9  9  9  7  7
#>  [852]  7  8  4  3  3  3  2  2  2  1  1  1  2  2  2  3  3  3  3  3  1  4  4
#>  [875]  4  5  6  8  4  9  9 10 10  2  3  7  6  7  8  8  8  7  9  1  1  3  3
#>  [898]  5  8  5  7  6  4  5  4  4  1  2  1  1  1  2  1  1  1  3  3  2  1  1
#>  [921]  4  5  5  9  9  9  9  9  9  2  1  1  2  1  2  2  2  2  7  7  4  3  4
#>  [944]  4  3  3  4  5  5  7  7  8  9 10 10 10 10 10 10 10  9  9  9  9 10  4
#>  [967]  4  5  5  7  8  6 10  7  9  9  9  9  8  9  9  8  2  1  4  2  4  6  5
#>  [990]  4  3  5  7  8  8  7  2  1  1  4  4  4  4  5  4  7  6  7  6  7  8  8
#> [1013]  9  9  9 10 10 10 10 10 10 10 10  8  4  2  3  4  4  6  3  6  3  5  8
#> [1036]  9  8  1  1  1  1  1  4  4  9  8  5  4  4  2  2  2  1  2  1  1  1  1
#> [1059]  1  2  2  3  3  5  1  1  1  1  1  2  3  4  5  8  6  2  4  5  5  5  5
#> [1082]  7 10  8 10 10 10  3  3  3  4  4  3  3  2  3  4  5  4  2  3  3  3  3
#> [1105]  3  3  3  3  4  5  6  7  9  9  3  2  3  4  3  6  7 10  9  3  2  2  3
#> [1128]  3  2  2  3  2  1  1  1  2  2  3  2  2  4  5  6  7  7  5  6  7  9 10
#> [1151] 10 10  9  9  9  9  7  7  7  6  5  7  8  8  9  9  4  9  6  9  8  7  9
#> [1174]  9  9 10 10 10 10  5  8  5  6 10 10  1  3  2  3  2  4  4  5  4  6  4
#> [1197]  9 10  3  2  4  4  9 10  9  6 10 10 10  9  9  9  9  7  9  8  8  8  7
#> [1220]  8  6  8  7  7  6  2  1  1  1  5  7  8  9 10  2  3  2  3  2  1  3  2
#> [1243]  2  2  3  3  3  4  4  4  5  8 10 10 10  2  3  4  7  9  6  8  8  6  7
#> [1266]  8  6  7  7  8  8  8  7  7  6  6  5  5  6  5  6  6  7  7  8  9  9  9
#> [1289] 10  6 10 10 10 10 10 10  6  5  3  5  8  8  9  8  9  5  6  2  5  6  7
#> [1312]  8  4  3  5  4  4  4  6  6  6  7  6  4  4  5  4  5  5  6  5  1  1  2
#> [1335]  1  2  2  1  2  5  5  6  7  8  9  5  4  5  7  8  8  8  6 10 10  2  4
#> [1358]  8  7  7  8  6  7  7  6  7  7  5  5  6  7  8  7  6  6  6  5  5  4  6
#> [1381]  3  8  8  8  7  2  2  4  5  2  2  1  3  3  3  2  3  4 10  7  6  3  4
#> [1404]  7  8  8  6  6  4  8  7  5  5  4  5  5  6  6  7  8  8 10 10 10 10 10
#> [1427] 10  9  9  9  9 10  9  9 10  9  9  8  9  7  7  8  9 10 10 10 10 10 10
#> [1450] 10 10  4  6  6  4  4  4  4  8  9  9  9  8  1  1  1  1  1  1  1  1  1
#> [1473]  1  1  2  2  1  1  1  2  4  2  1  1  2  2  6  3  6  8  8  7  6  7  6
#> [1496]  8  9  9  8

Created on 2019-07-23 by the reprex package (v0.3.0)

This could avoid all the work of setting up the number of groups and re-arranging and joining everything? I guess the problem is that each id needs to have the value summarised somehow as well.
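One way to handle that (a sketch): summarise the variable per key first, then ntile the summaries, so that all observations for a key land in the same group:

```r
# Sketch: per-key summary first, then ntile the summaries
library(dplyr)
library(brolgar)

height_groups <- world_heights %>%
  group_by(country) %>%
  summarise(max_height = max(height_cm, na.rm = TRUE)) %>%
  mutate(height_group = ntile(max_height, 10))

# every row for a country now shares one height_group
world_heights %>%
  left_join(height_groups, by = "country")
```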

l_diff_range (or equivalent)

diff_range <- function(x, na.rm = TRUE){
    diff(range(x, na.rm = na.rm))
}

diff_range(c(1:20))
#> [1] 19
diff_range(rnorm(10))
#> [1] 2.998619

# or in a dataset:
library(brolgar)
library(tibble)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
world_heights
#> # A tibble: 1,499 x 4
#>     code country  year height_cm
#>    <dbl> <chr>   <dbl>     <dbl>
#>  1   276 Germany  1550      168.
#>  2   616 Poland   1550      170.
#>  3   276 Germany  1650      170.
#>  4   616 Poland   1650      168.
#>  5   250 France   1660      163.
#>  6   250 France   1670      161.
#>  7   250 France   1680      162.
#>  8   250 France   1690      161.
#>  9   643 Russia   1700      164.
#> 10   250 France   1710      165 
#> # … with 1,489 more rows

world_heights %>%
    group_by(country) %>%
    summarise(diff_range = diff_range(height_cm)) %>%
    arrange(-diff_range)
#> # A tibble: 153 x 2
#>    country        diff_range
#>    <chr>               <dbl>
#>  1 Netherlands          18.5
#>  2 Malaysia             18.4
#>  3 Czech Republic       17.9
#>  4 Denmark              17.8
#>  5 Austria              17.2
#>  6 Germany              17.2
#>  7 Russia               17.2
#>  8 Croatia              17  
#>  9 Hungary              16.7
#> 10 Guyana               16.1
#> # … with 143 more rows

Created on 2019-06-07 by the reprex package (v0.2.1)

explore what is going wrong with `monotonics`

They should be mutually exclusive, but apparently not:

library(brolgar)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

wages_ts %>%
  features(ln_wages, feat_monotonic)
#> # A tibble: 888 x 4
#>       id increase decrease unvary
#>    <int> <lgl>    <lgl>    <lgl> 
#>  1    31 FALSE    FALSE    FALSE 
#>  2    36 FALSE    FALSE    FALSE 
#>  3    53 FALSE    FALSE    FALSE 
#>  4   122 FALSE    FALSE    FALSE 
#>  5   134 FALSE    FALSE    FALSE 
#>  6   145 FALSE    FALSE    FALSE 
#>  7   155 FALSE    FALSE    FALSE 
#>  8   173 FALSE    FALSE    FALSE 
#>  9   206 TRUE     FALSE    FALSE 
#> 10   207 FALSE    FALSE    FALSE 
#> # … with 878 more rows

wages_ts %>%
  features(ln_wages, feat_monotonic) %>%
  filter(increase)
#> # A tibble: 88 x 4
#>       id increase decrease unvary
#>    <int> <lgl>    <lgl>    <lgl> 
#>  1   206 TRUE     FALSE    FALSE 
#>  2   266 TRUE     TRUE     TRUE  
#>  3   295 TRUE     FALSE    FALSE 
#>  4   304 TRUE     TRUE     TRUE  
#>  5   518 TRUE     FALSE    FALSE 
#>  6   911 TRUE     TRUE     TRUE  
#>  7  1032 TRUE     TRUE     TRUE  
#>  8  1219 TRUE     TRUE     TRUE  
#>  9  1282 TRUE     TRUE     TRUE  
#> 10  1508 TRUE     FALSE    FALSE 
#> # … with 78 more rows

Created on 2019-07-14 by the reprex package (v0.3.0)
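One plausible culprit, assuming the features are implemented with all(diff(x) ...) style checks (an assumption): for a key with a single observation, diff(x) has length zero, and all() of an empty logical vector is TRUE, so all three features come back TRUE at once:

```r
# all() on an empty logical is TRUE, so a single observation
# satisfies every monotonic check simultaneously
x <- 2.5
all(diff(x) > 0)  # TRUE
all(diff(x) < 0)  # TRUE
all(diff(x) == 0) # TRUE
```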

Longnostics for time

These should be summaries for things like:

  • Days between measurements
  • Range of days between measurements (1st and last)
  • Average/mean/median/etc number of days between measurements

Date features should also be factored into the other longnostics, where appropriate.
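A sketch of per-key gap summaries (assuming a date or time index column; function and column names hypothetical):

```r
# Sketch: summaries of the gaps between measurements for each key
library(dplyr)

time_longnostics <- function(data, id, time) {
  data %>%
    group_by({{ id }}) %>%
    arrange({{ time }}, .by_group = TRUE) %>%
    summarise(
      gap_min    = min(diff({{ time }})),
      gap_max    = max(diff({{ time }})),
      gap_median = median(diff({{ time }}))
    )
}
```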

longnostics instead of lognostics

lognostics reminds me of log more than cognostics - I wonder if longnostics would be a better word to describe longitudinal cognostics?

functions to select a random number of ids

Like so:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(brolgar)

wages %>%
  group_by(id) %>%
  nest() %>%
  sample_n(size = 10) %>% 
  unnest()
#> # A tibble: 57 x 15
#>       id   lnw exper   ged postexp black hispanic   hgc hgc.9 uerate   ue.7
#>    <int> <dbl> <dbl> <int>   <dbl> <int>    <int> <int> <int>  <dbl>  <dbl>
#>  1  8300  2.53 5.64      1   5.64      0        1    12     3   6.49 -0.505
#>  2  8300  2.52 6.54      1   6.54      0        1    12     3   5.99 -1.00 
#>  3  8300  2.49 7.77      1   7.77      0        1    12     3   6.2  -0.805
#>  4  1273  2.25 0.565     1   0         0        0    11     2   4.99 -2.00 
#>  5  1273  2.21 1.14      1   0.578     0        0    11     2   5.49 -1.50 
#>  6  1273  1.79 2.32      1   1.75      0        0    11     2   8     1    
#>  7  1273  1.82 3.45      1   2.89      0        0    11     2   5.89 -1.10 
#>  8  1273  1.54 4.10      1   3.53      0        0    11     2   6.07 -0.926
#>  9  1273  1.20 5.41      1   4.84      0        0    11     2   7.5   0.5  
#> 10  1273  1.71 6.66      1   6.09      0        0    11     2   6.09 -0.905
#> # … with 47 more rows, and 4 more variables: ue.centert1 <dbl>,
#> #   ue.mean <dbl>, ue.person.cen <dbl>, ue1 <dbl>

wages %>%
  group_by(id) %>%
  nest() %>%
  sample_frac(size = 0.1) %>%
  unnest()
#> # A tibble: 631 x 15
#>       id   lnw exper   ged postexp black hispanic   hgc hgc.9 uerate
#>    <int> <dbl> <dbl> <int>   <dbl> <int>    <int> <int> <int>  <dbl>
#>  1  3440  1.86 0.163     0       0     0        0    11     2   5.59
#>  2  3440  1.91 0.983     0       0     0        0    11     2   6.7 
#>  3  3440  1.59 1.61      0       0     0        0    11     2   8.7 
#>  4  3440  2.26 2.45      0       0     0        0    11     2   9.6 
#>  5  3440  2.16 3.40      0       0     0        0    11     2  11.4 
#>  6  3440  2.10 4.38      0       0     0        0    11     2   9.4 
#>  7  3440  2.16 5.55      0       0     0        0    11     2   7.1 
#>  8  3440  2.12 6.70      0       0     0        0    11     2   8   
#>  9  3440  2.16 7.70      0       0     0        0    11     2   7.1 
#> 10  3440  2.25 9.05      0       0     0        0    11     2   6.9 
#> # … with 621 more rows, and 5 more variables: ue.7 <dbl>,
#> #   ue.centert1 <dbl>, ue.mean <dbl>, ue.person.cen <dbl>, ue1 <dbl>

sample_n_obs <- function(data, id, size){
  
  q_id <- rlang::enquo(id)
  
  data %>%
    dplyr::group_by(!!q_id) %>%
    tidyr::nest() %>%
    dplyr::sample_n(size = size) %>%
    tidyr::unnest()
  
}

sample_frac_obs <- function(data, id, size){
  
  q_id <- rlang::enquo(id)
  
  data %>%
    dplyr::group_by(!!q_id) %>%
    tidyr::nest() %>%
    dplyr::sample_frac(size = size) %>%
    tidyr::unnest()
  
}

library(ggplot2)

sample_n_obs(wages,
             id,
             10) %>%
  ggplot(aes(x = exper,
             y = uerate,
             group = id)) + 
  geom_line()

sample_frac_obs(wages,
                id,
                0.05) %>%
  ggplot(aes(x = exper,
             y = uerate,
             group = id)) + 
  geom_line()

Created on 2019-04-08 by the reprex package (v0.2.1)

document wages data

  • What are each of the variables in wages?
  • Is there a link to the original data?
  • Can we change the variable names to be more informative?
