bergelsonlab / blabr


License: Other

R 100.00%

blabr's Issues

reduce the number of "Undefined global functions or variables"

This part of devtools::check() output is really useful to flag unqualified function calls. However, as it currently outputs a list of 100+ names, it isn't practical to use it if you only care about the new code. It would be great if this list was cut down considerably.

Make sure that at least exported functions have a test that can help check that the output hasn't changed.
Use codetools::checkUsagePackage('blabr') to list the usages with the file paths and line numbers.
Most of the names come from using tidy selection or data masking in tidyverse verbs. For those, using the .data "pronoun" should do it.
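
As a minimal sketch of the last point, the hypothetical helper below (not an existing blabr function; the column names are borrowed from the LENA stats tests) uses the .data pronoun so that bare column names are no longer reported as undefined globals. In package code, .data itself should also be imported, e.g. with `@importFrom rlang .data`. The pronoun applies to data-masking verbs (mutate, filter, summarise); tidy-selection verbs take strings or all_of() instead.

```r
# Hypothetical helper for illustration; not part of blabr.
# In a package, add `#' @importFrom rlang .data` so that `.data` itself
# is not flagged as an undefined global.
count_utterances <- function(tbl) {
  # `.data$spkr` instead of bare `spkr` keeps the check output clean
  grouped <- dplyr::group_by(tbl, .data$spkr)
  dplyr::summarise(grouped,
                   utterance_count = sum(.data$utterance_count),
                   .groups = "drop")
}
```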

decouple handling BLAB_DATA repos and loading the data

There are two steps in all get_* functions:

Find the repo, find the version, find the files.
Load the files using readr and column specifications.

The two steps don't need to be coupled and having them apart will allow more flexibility. For example, data files cloned or created in a non-standard location could still be read.
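
One possible shape for the split, as a sketch only: `find_data_file()` and `read_data_file()` are hypothetical names, and the default root folder `~/BLAB_DATA` is an assumption based on the path mentioned in other issues. Version lookup is left out.

```r
# Step 1 (hypothetical): locate a file inside a BLAB_DATA-style folder.
find_data_file <- function(repo, filename,
                           root = file.path(path.expand("~"), "BLAB_DATA")) {
  path <- file.path(root, repo, filename)
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  path
}

# Step 2 (hypothetical): load a file from any path with explicit column types.
read_data_file <- function(path, col_types) {
  readr::read_csv(path, col_types = col_types)
}
```

With this split, a file in a non-standard location can be passed straight to `read_data_file()` without going through the repo lookup.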

compartmentalize test code

Move R/test-helpers.R to testthat/helpers.R so that packages only needed for testing aren't referred to in the package's code. I put it under R/ because testthat at some point said that using testthat/helpers*.R was no longer recommended (indirect evidence here). Apparently, referring to functions from packages that aren't your package's "Import" dependencies isn't a problem, that's what "Suggests" are for. Nevertheless, to me, it makes so much more sense to have testing code separate from the package's code.
Remove library calls from all test files, switch to qualified function calls. I don't know why, but at some point I decided it was OK to put library() calls inside tests. As a result, tests will pass even if there is an unqualified call to, say, mutate in the package's main code that is being tested. I don't know why I expected it to work otherwise, but I am not alone: see this SO question.

add `expect_column_hashes_to_be_equal` to `test-helpers.R`

The following pattern is used too many times not to be factored out to a function:

  library(digest)

  hashes_list <- speaker_stats %>%
    summarise(across(everything(), digest)) %>%
    as.list
  expected_hashes_list <-
    list(
      interval_start = "75ff43e40a186ae138dc9b709b691a45",
      interval_end = "5e39906727aa950a55bff1f80d4226bb",
      spkr = "8b19ab3ad09943f2c807002c40ebe943",
      adult_word_count = "a3dd76d9042133d4ee0a6ccbc654ba48",
      utterance_count = "d36f09f3bbada305d0925623a9ffb990",
      segment_duration = "178fa344206b188de05bea4f07fe2b50"
    )
  expect_equal(hashes_list, expected_hashes_list)

I think a good place for this function is R/test-helpers.R but check r-pkgs first.
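
A minimal sketch of what the factored-out function could look like. It swaps `summarise(across(everything(), digest))` for `lapply()`, which also hashes each column as a whole vector and returns a named list, so the comparison is the same; the argument names are assumptions.

```r
# Sketch of the proposed helper: hash every column of `tbl` and compare
# the result to a named list of expected hashes.
expect_column_hashes_to_be_equal <- function(tbl, expected_hashes_list) {
  # One md5 hash (digest's default algorithm) per column, named by column
  hashes_list <- lapply(tbl, digest::digest)
  testthat::expect_equal(hashes_list, expected_hashes_list)
}
```

In a test, the pattern above then shrinks to `expect_column_hashes_to_be_equal(speaker_stats, expected_hashes_list)`.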

add Description to DESCRIPTION

fix test-lena

The test for add_lena_stats referred to a GIN file on my computer, which is not the best idea. I left comments in the code of the test with instructions on what to do.
The other two failing tests fail for specific columns only: ctc in one case, adult_word_count in another. I wonder if this has something to do with floating point arithmetic differences between my MacBook and my Dell. I should try running on the MacBook. The arithmetic issue already came up recently but I don't remember the context; I should search through Slack messages with Elika. Most likely, it was about the VIHI sampling paper.

avoid library() calls

Multiple sources state that library() calls should be avoided. It is ok to have them in the test files though.

From R Packages:

You should never use require() or library() in a package: instead, use the Depends or Imports fields in the DESCRIPTION.

export public-facing functions

blabr used to export every variable defined in its code. For various reasons, it is not a good idea so I switched to explicitly exporting functions/variables that users might use. If you encounter a function that is not exported, please add it in a comment below.

In the meantime, use blabr:::<function> to access a function that used to be available after library(blabr) and now isn't.

get_windows needs to have an argument to set TargetOnset

Right now the code is written in that function such that it's looking for the target onset to be present in the dataframe. For the studies that don't involve incorporating the message report, this isn't in there. The current work-around is to just set it as a value in the global environment, but that's fairly delicate and should be replaced with an argument inside the function that can be optionally set to a single value.
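
A sketch of how the optional argument could be handled. `resolve_target_onset()` and its `target_onset` argument are hypothetical, and the actual body of get_windows may need this logic in a different place.

```r
# Hypothetical helper; the real get_windows() may look different.
resolve_target_onset <- function(fixations, target_onset = NULL) {
  if (!is.null(target_onset)) {
    # A single value supplied by the caller applies to every row
    stopifnot(is.numeric(target_onset), length(target_onset) == 1)
    fixations$TargetOnset <- target_onset
  } else if (!"TargetOnset" %in% names(fixations)) {
    stop("No TargetOnset column found; pass `target_onset` explicitly.")
  }
  fixations
}
```

This removes the dependence on a value sitting in the global environment: the onset comes either from the dataframe or from the argument.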

function to write session info as the poor man's renv

Probably this is enough:

writeLines(capture.output(sessionInfo()), "session-info_{username}.txt")

Code above (except for the filename) from here
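
Wrapped in a function, this could look like the sketch below. `write_session_info()` is a hypothetical name, and taking the username from `Sys.info()` is an assumption about what {username} in the filename should stand for.

```r
# Hypothetical helper; fills in the {username} part of the filename
# from Sys.info() (an assumption, see above).
write_session_info <- function(dir = ".") {
  username <- Sys.info()[["user"]]
  path <- file.path(dir, paste0("session-info_", username, ".txt"))
  writeLines(utils::capture.output(utils::sessionInfo()), path)
  invisible(path)
}
```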

[possibly] switch to using roxygen for maintaining NAMESPACE

Right now, everything that starts with a letter gets exported + magrittr's pipe. Maybe that is the right way to go, I do not know enough about r packages to tell. But it does seem strange to not be explicit about what names we expose.

With roxygen, functions get exported if they have the @export keyword before them.
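
For illustration, a sketch with two made-up functions (`shout()` and `trim_spaces()` are not blabr functions): only the one tagged with @export ends up in NAMESPACE after running document().

```r
#' Convert strings to upper case after trimming whitespace
#'
#' @param x A character vector.
#' @return `x`, trimmed and in upper case.
#' @export
shout <- function(x) {
  toupper(trim_spaces(x))
}

# No @export tag: this helper stays internal to the package.
trim_spaces <- function(x) {
  trimws(x)
}
```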

Do not forget to expose magrittr's pipe in case of deciding to switch.

sample_intervals_with_highest should stop if there are NAs

Maybe ignore those rows. Anything but the current behavior of selecting the NA rows.
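
A sketch of the "stop" option, using a hypothetical guard that could run at the top of sample_intervals_with_highest; the argument names are assumptions.

```r
# Hypothetical guard; `column` is the name of the column being ranked.
stop_if_na <- function(intervals, column) {
  n_na <- sum(is.na(intervals[[column]]))
  if (n_na > 0) {
    stop(n_na, " row(s) have NA in `", column, "`; remove or fix them first.")
  }
  invisible(intervals)
}
```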

remove fuzzyjoin dependency (lena intervals code)

The specific join (interval-on-interval) requires a BioConductor package IRanges which requires the user to install from source on M1s. This is annoying and unnecessary: with the size of the table being joined cross-join+filter will be just fine.

Remove from DESCRIPTION:

fuzzyjoin,
IRanges,
biocViews:

Remove from lena.r:

fuzzyjoin::interval_left_join
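
A sketch of the cross-join+filter replacement on toy data. The column names are illustrative assumptions, not the ones used in lena.r, and `dplyr::cross_join()` requires dplyr 1.1.0 or later.

```r
# Toy data; column names are illustrative assumptions, not lena.r's.
intervals <- data.frame(interval_start = c(0, 300),
                        interval_end = c(300, 600))
segments <- data.frame(seg_start = c(10, 290, 700),
                       seg_end = c(20, 310, 710))

# Pair every interval with every segment, then keep overlapping pairs.
pairs <- dplyr::cross_join(intervals, segments)
overlapping <- dplyr::filter(pairs,
                             .data$seg_start < .data$interval_end,
                             .data$seg_end > .data$interval_start)
```

Unlike interval_left_join, this keeps only intervals with at least one overlapping segment, so intervals without matches would need to be joined back in to keep left-join behavior.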

(potential bug) themes might not use theme_bw

The %+replace% is supposed to clear all elements from the previous theme, making calling theme_bw(...) have no effect.

check whether any theme_bw settings are left in the themes,
update/ignore/fix.

retry git commands that time out once

Here is an error I want to see less of. One context in which this error comes up often is running test-get-vihi-annotations while connected to VPN so this might be a way to test. Another option is to set the timeout to some ridiculously small amount of time.

Error in run_git_command(repo, "fetch --tags --prune --prune-tags"): Error executing git command:

fetch --tags --prune --prune-tags

Error message:

fatal: unable to access 'https://github.com/bergelsonlab/vihi_annotations.git/': Failed to connect to github.com port 443 after 21057 ms: Couldn't connect to server
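
A minimal sketch of a retry wrapper. `retry_once()` is a hypothetical helper, and as written it retries after any error, not only after timeouts; narrowing it to connection failures would need a check on the error message.

```r
# Hypothetical wrapper: run `f()` and, if it errors, wait and try once more.
retry_once <- function(f, wait_seconds = 1) {
  tryCatch(
    f(),
    error = function(e) {
      message("Command failed, retrying once: ", conditionMessage(e))
      Sys.sleep(wait_seconds)
      f()  # a second failure propagates to the caller as usual
    }
  )
}
```

Usage could look like `retry_once(function() run_git_command(repo, "fetch --tags --prune --prune-tags"))`.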

add `io.R` to have all read/write functions in one place

Minimally, combine seedlings.R and rttm.R into io.R.

teach prepare_intervals to output aclew-like intervals

Right now, the function emulates the way LENA creates intervals for lena5min.csv files. The first and last intervals start and end outside of the recording: 15:42-15:52 -> (15:40-15:45, 15:45-15:50, 15:50-15:55). They get trimmed afterward, but they are still there, while for ACLEW we don't want those short intervals at all. Keeping those intervals should be an option. Also, for ACLEW we'll need to introduce buffers to accommodate context regions.

add an option to drop start and end intervals if they are shorter than required,
add an option to introduce buffers before and after all regions.

See vihi-sampling code where intervals are created for the seedlings corpus.
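
A sketch of the first option only (dropping short first and last intervals); buffers are not covered. `make_intervals()` and its arguments are hypothetical and are not the signature of prepare_intervals. Times are numbers of minutes since midnight to keep the arithmetic simple.

```r
# Hypothetical sketch; times are in minutes since midnight for simplicity.
make_intervals <- function(start, end, width = 5, drop_short = FALSE) {
  stopifnot(end > start)
  # Align boundaries to multiples of `width`, the way LENA does
  first_start <- floor(start / width) * width
  last_end <- ceiling(end / width) * width
  starts <- seq(first_start, last_end - width, by = width)
  intervals <- data.frame(
    interval_start = pmax(starts, start),     # trim to the recording
    interval_end = pmin(starts + width, end)
  )
  if (drop_short) {
    # Keep only intervals that are still `width` long after trimming
    full <- (intervals$interval_end - intervals$interval_start) >= width
    intervals <- intervals[full, , drop = FALSE]
  }
  intervals
}
```

For a recording from 15:42 to 15:52 (942 to 952 minutes), the default call returns three trimmed intervals, and `drop_short = TRUE` leaves only 15:45-15:50.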

get_pn_opus_path does not return anything

make pkgdown website

Tutorial from Lisa DeBruine

Don't forget to make a hex! Use this link. Adjust link to change size, or change size in hexSticker.

add days_to_months function

round(age/30.435)
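
Wrapped in a function, a minimal sketch could be the following; the argument name `age_days` is an assumption, and the divisor is the one from the expression above.

```r
# Sketch: convert age in days to whole months using the divisor above.
days_to_months <- function(age_days) {
  round(age_days / 30.435)
}
```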

rewrite fixations_to_timepoints (fka binifyFixations) using join_by

Note: fixations_to_timepoints isn't yet implemented at all.

t_series <- fixations %>%
  summarise(t_min = min(current_fix_start),
            t_max = max(current_fix_end)) %>%
  mutate(across(c(t_min, t_max),
                ~ floor(. / bin_size) * bin_size)) %>%
  mutate(t = list(seq(t_min, t_max, by = bin_size))) %>%
  select(t) %>%
  unnest(cols = t)

t_series %>%
  inner_join(
    fixations %>%
      mutate(across(c(current_fix_start, current_fix_end),
                    ~ floor(. / bin_size) * bin_size)),
    by = join_by(between(t, current_fix_start, current_fix_end))
  )

Update June 7 2024

The main part (speeding up by switching to a join) was done in a632aa0.

Check the stuff from the comment below
see a632aa0's message for how the function can be further sped up.

check access to github

If someone has a BLAB_DATA repository set to use ssh and their key is no longer valid, they'll get a very cryptic error:

The access error is also there but it doesn't cause an actual R error and instead causes a weird strsplit error.

How it should work:

blabr looks for, e.g., ~/BLAB_DATA/cdi_spreadsheet.
- If it doesn't exist, it should check whether an https request to github is possible.
  - If it is, suggest cloning.
  - If it isn't, say that it is BLab-only data.
- If it does exist, try whatever ssh/https uri was set up for the "origin" remote.
  - If it works, continue.
  - If it doesn't, suggest changing to https.
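
The decision tree above can be sketched as a pure function. `suggest_repo_action()` is hypothetical; the three checks themselves (folder exists, https request succeeds, origin remote works) are assumed to be done elsewhere and passed in as logical values.

```r
# Hypothetical sketch of the decision logic only; the actual checks
# are assumed to happen elsewhere and arrive as TRUE/FALSE.
suggest_repo_action <- function(repo_exists,
                                https_reachable = NA,
                                origin_works = NA) {
  if (!repo_exists) {
    if (isTRUE(https_reachable)) "suggest cloning" else "BLab-only data"
  } else {
    if (isTRUE(origin_works)) "continue" else "suggest switching to https"
  }
}
```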

clean up get_lena_speaker_stats

Due to peculiarities of its files (or my code, not sure), there are extra NAs.

See aclew.R in ACLEW_correllations in vihi_sample in one_time_scripts to see what actually happens there.

add_its_stats(add_stats_function = get_lena_speaker_stats) %>%
  # Sometimes some numeric columns in its files (as read by rlena) are
  # inexplicably NA instead of zero ¯\_(ツ)_/¯. We'd rather have zeros there
  # unless it is a fully NA row signifying an interval with no LENA segments.
  mutate(across(c(adult_word_count, utterance_count, segment_duration),
                ~ if_else(is.na(spkr), true = .x,
                          false = replace_na(.x, 0)))) %>%

big_aggregate missing documentation

rewrite calculate_lena_like_stats using add_lena_stats

In calculate_lena_like_stats, first, create a tibble with the intervals and then apply add_lena_stats to it. The interval creation should output both the wav and wall time.

fix test-seedlings.R

The test uses 01_12_audio_sparse_code.csv which has changed - update the test.

a function for by-kid, by-item cdi info

Often we run eye-tracking studies examining infant word comprehension. It would be useful to cross-check our collected data against parent report with a function that works as follows:

For each item in an eye-tracking study, this function would provide each child's CDI value (understands, produces, neither, not on CDI) for each item.

proposed output:

subj	target	CDI
s01	apple	produces
s01	shang	NA
s02	banana	produces
s02	train	understands
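
A sketch of how such a table could be built with a left join, on made-up data. The long-format CDI table and its column names (`subj`, `item`, `CDI`) are assumptions about how the parent-report data would be shaped; items that are not on the CDI come out as NA, as in the proposed output.

```r
# Made-up inputs; the shape of the CDI table is an assumption.
trials <- data.frame(subj = c("s01", "s01", "s02", "s02"),
                     target = c("apple", "shang", "banana", "train"))
cdi_long <- data.frame(subj = c("s01", "s02", "s02"),
                       item = c("apple", "banana", "train"),
                       CDI = c("produces", "produces", "understands"))

# Left join: every (subj, target) pair is kept, unmatched ones get NA.
by_item <- merge(trials, cdi_long,
                 by.x = c("subj", "target"), by.y = c("subj", "item"),
                 all.x = TRUE)
```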

add examples to functions

Alternatively, remove the @examples keyword.

From document():

Warning: [/Users/ek221/blab/blabr/blabr/R/get_data.R:53] @examples requires a value
Warning: [/Users/ek221/blab/blabr/blabr/R/read_bl.R:38] @examples requires a value
Warning: [/Users/ek221/blab/blabr/blabr/R/read_bl.R:193] @examples requires a value

tidyverse is not installed when installing blabr

Installing blabr without first installing tidyverse results in an error. Tidyverse is a meta-package, so this could probably be resolved by loading individual packages instead of loading the whole tidyverse. Otherwise, we could just add tidyverse as a dependency.

get_seedlings_nouns should look for versions in both repositories

Right now, if you request a dev version, it will look for the most current version in the dev repository only. It should look in both repos and print either only the public version if it is the most current overall, or both.

update DESCRIPTION

Add authors and contributors, and assign yourself as "creator" which actually means "current maintainer".

add missing documentation (lots!)

Here is the relevant output of check():

> checking for missing documentation entries ... WARNING
  Undocumented code objects:
    ‘FindFrozenTrials’ ‘FindLowData’ ‘RemoveFrozenTrials’ ‘RemoveLowData’
    ‘add_chi_noun_onset’ ‘all_basiclevel’ ‘all_errors’ ‘anonymous’
    ‘audio_cnames’ ‘binifyFixations’ ‘bl_types’ ‘blab_data’
    ‘cdi_get_words’ ‘cdi_words’ ‘characters_to_factors’
    ‘check_annot_codes’ ‘checkout_branch’ ‘checkout_commit’
    ‘chi_noun_onset’ ‘count_chi’ ‘count_chi_types’ ‘count_device_and_toy’
    ‘count_mot_fat’ ‘count_object_present’ ‘count_utterance’
    ‘expandFixList’ ‘fixations_report’ ‘get_df_file’
    ‘get_late_target_onset’ ‘get_mesrep’ ‘get_pairs’ ‘get_vocab_score’
    ‘get_windows’ ‘git_bin’ ‘home_dir’ ‘keypress_issues’
    ‘keypress_retrieved’ ‘late_target_retrieved’ ‘load_tsv’
    ‘malformed_speaker_codes’ ‘obj_pres’ ‘object2string’ ‘on_cdi’
    ‘outlier’ ‘reliability’ ‘rename_audio_header’ ‘rename_video_header’
    ‘string2object’ ‘subj_mos’ ‘subj_nums’ ‘subjectList’ ‘sync_repo’
    ‘sync_to_upstream’ ‘theme_AMERICA’ ‘theme_blab’ ‘theme_spooky’
    ‘utt_type’ ‘video_cnames’
  All user-level objects in a package should have documentation entries.
  See chapter ‘Writing R documentation files’ in the ‘Writing R
  Extensions’ manual.

blabr requires package "subprocess" which isn't on CRAN anymore

We should either remove this dependency or figure out if the archived version works

big_aggregate throwing error

threw a bunch of warnings about invalid factor levels, should probably convert things to character somewhere higher up (and then back to factor if needed)--haven't checked all the NA cases.
actual error message below
Adding missing grouping variables: audio_video
Adding missing grouping variables: SubjectNumber
Joining, by = c("subj", "month", "audio_video")
Joining, by = c("subj", "month", "audio_video")
Joining, by = c("subj", "month", "audio_video")
Joining, by = c("subj", "month", "audio_video")
Joining, by = c("subj", "month", "audio_video")
Joining, by = c("subj", "month", "audio_video")
Joining, by = c("subj", "month", "audio_video")
Joining, by = c("subj", "month", "audio_video")
mutate_each() is deprecated.
Use mutate_all(), mutate_at() or mutate_if() instead.
To map funs over all variables, use mutate_all()
Error in mutate_impl(.data, dots) :
Evaluation error: object 'TVS' not found.

fix assertthat assertions

assertthat::not_empty(tbl) is incorrect! It only returns a boolean result, no assertion is made. The correct version is

assertthat::assert_that(assertthat::not_empty(tbl))
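
A small demonstration of the difference on an empty data frame (the variable names are made up for the example):

```r
tbl <- data.frame()

# On its own, not_empty() only returns a logical; nothing is raised.
is_ok <- assertthat::not_empty(tbl)

# Wrapped in assert_that(), the same check raises an error.
result <- tryCatch(
  assertthat::assert_that(assertthat::not_empty(tbl)),
  error = function(e) "assertion failed"
)
```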

update README.md

BLAB_DATA cloning,
subprocess package,
probably something else too

installing libraries

when you do devtools::install_github('BergelsonLab/blabr') it would be nice if it checked what libraries were already installed instead of reinstalling e.g. tidyverse, etc.

fix inconsistent docs

From check():

> checking for code/documentation mismatches ... WARNING
  Codoc mismatches from documentation object 'big_aggregate':
  big_aggregate
    Code: function(x, exclude = NULL, output = NULL, exclude_chi = FALSE)
    Docs: function(x, exclude = NULL, output = NULL)
    Argument names in code not in docs:
      exclude_chi

> checking Rd \usage sections ... WARNING
  Undocumented arguments in documentation object 'join_full_audio_video'
    ‘output_name’ ‘keep_na’ ‘keep_comments’
  Documented arguments not in \usage in documentation object 'join_full_audio_video':
    ‘output’

  Functions with \usage entries need to have the appropriate \alias
  entries, and all their arguments documented.
  The \usage entries must correspond to syntactically valid R code.
  See chapter ‘Writing R documentation files’ in the ‘Writing R
  Extensions’ manual.

fix test in test-seedlings.R

I can't even get the input md5sum to match. Not sure what this is about.

test-seedlings.R

fix get_cdi_spreadsheet and get_motor_spreadsheet

They need version-specific col_types. The most current version won't load at all because we force col_types to match the data now. Once done, uncomment test_that("same results after loading from csv and feather") in test-get_data.R.
