
sosprosody's Introduction

sosprosody


This package provides a variety of helper functions for working with Praat, TextGrids, PitchTiers, and intonation-related data wrangling. It was written for my own research, but others may find it helpful as well. It extends the rPraat package.

Installation

You can install the development version of sosprosody like so:

# install.packages("devtools")
devtools::install_github('tsostarics/sosprosody')

Examples

This package provides format and print methods for TextGrid and PitchTier objects loaded with the respective read functions from {rPraat}.

library(rPraat)
library(sosprosody)
pitchtier <- pt.read("grandmother_LHH_003.PitchTier")
textgrid <- tg.read("grandmother_LHH_003.TextGrid")

pitchtier
#> grandmother_LHH_003.PitchTier: 106 total pitch pulses.
#> 181|                                                                           |
#>    |                                                          OOO              |
#>    |                                                      OOOO                 |
#>    |                                                    OO                     |
#>    |                                                 OOO                       |
#>    |                                              OOO                          |
#>    |                                           OOO                             |
#>    |         OOOOOOOOOOOOOOOO            OOOOOO                                |
#>    |                         OOO OOOOOOOO                                      |
#>  58|                            O                                              |
#>    0                                                                         1.6
textgrid
#> grandmother_LHH_003.TextGrid
#> [                words: 4/6 labeled intervals from 0 to 1.6                    ]
#> [                phones: 15/17 labeled intervals from 0 to 1.6                 ]

Here’s an example of processing all the TextGrids and PitchTiers in a directory into dataframe representations.

nuclear_words <- "grandmother"
tg_df <- batch_process_textgrids("./")
#> Processed 2 TextGrids
pt_df <- batch_process_pitchtiers("./")
#> Processed 2 PitchTiers
nuclear_regions <- get_nuclear_textgrids(tg_df, nuclear_words)

str(tg_df)
#> 'data.frame':    30 obs. of  9 variables:
#>  $ file       : chr  "grandmother_LHH_003" "grandmother_LHH_003" "grandmother_LHH_003" "grandmother_LHH_003" ...
#>  $ word_start : num  0.14 0.14 0.14 0.14 0.44 0.5 0.57 0.57 0.57 0.57 ...
#>  $ word_end   : num  0.44 0.44 0.44 0.44 0.5 0.57 1.33 1.33 1.33 1.33 ...
#>  $ word_label : chr  "laura" "laura" "laura" "laura" ...
#>  $ word_i     : int  1 1 1 1 2 3 4 4 4 4 ...
#>  $ phone_start: num  0.14 0.26 0.27 0.39 0.44 0.5 0.57 0.63 0.72 0.84 ...
#>  $ phone_end  : num  0.26 0.27 0.39 0.44 0.5 0.57 0.63 0.72 0.84 0.89 ...
#>  $ phone_label: chr  "l" "ɒ" "ɹ" "ə" ...
#>  $ phone_i    : int  1 2 3 4 5 6 7 8 9 10 ...
#>  - attr(*, "tiertype")= chr "interval"
str(pt_df)
#> 'data.frame':    245 obs. of  6 variables:
#>  $ file               : chr  "grandmother_LHH_003" "grandmother_LHH_003" "grandmother_LHH_003" "grandmother_LHH_003" ...
#>  $ timepoint          : num  0.227 0.237 0.247 0.257 0.267 ...
#>  $ hz                 : num  80.5 82.4 83 82.6 82.3 ...
#>  $ semitone_difference: num  -0.3946 0 0.1251 0.0499 -0.0256 ...
#>  $ semitones_from     : num  82.4 82.4 82.4 82.4 82.4 ...
#>  $ erb                : num  2.49 2.54 2.56 2.55 2.54 ...

We can then do some common preprocessing steps, such as coding the nuclear word in the phrase, applying running median smoothing, and normalizing the timepoints.

processed_pt_df <- preprocess_pitchtracks(pt_df,
                                          nuclear_df = nuclear_regions, 
                                          runmed_k = 5,
                                          time_normalize = TRUE,
                                          .fromzero = TRUE)

The processed pitch tier data frame can then be plotted like so:

library(ggplot2)

processed_pt_df |> 
  ggplot(aes(x = timepoint_norm, 
             y = hz_runmed, 
             color = is_nuclear,
             group = file)) +
  geom_point() +
  geom_line() +
  # The rest is just for visuals
  scale_color_brewer(palette = 'Dark2') +
  theme_minimal() +
  coord_fixed(ratio = 1/180) +
  theme(legend.position = 'none') +
  annotate(geom = "label", x = .9, y = 85, label = "L*+HLL") +
  annotate(geom = "label", x = .75, y = 150, label = "LHH")


sosprosody's Issues

`add_tier` expects list of lists

add_tier expects a list of lists, where each inner list has a tier structure (i.e., contains named elements $name, $type, $t1, $t2, $label). But if the goal is to add a single tier, one would think passing a single list with this structure would work:

tst <- sosprosody:::new_textgrid()
new_tier <- tst$words
new_tier$name <- "words2"
add_tier(tst, new_tier_list = new_tier)

But this results in `Error in tier[["name"]] : subscript out of bounds`, raised inside sosprosody::as_textgrid, which uses an lapply over new_tier_list. new_tier needs to be wrapped in a list to work:

tst <- sosprosody:::new_textgrid()
new_tier <- tst$words
new_tier$name <- "words2"
add_tier(tst, new_tier_list = list(new_tier))

This behavior should either be documented in add_tier and as_textgrid, or changed so that an unwrapped tier is handled appropriately.
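One way to implement the second option is a small guard before the lapply; a sketch, assuming a bare tier can be recognized by its top-level $name element (maybe_wrap_tier is a hypothetical helper name, not part of the package):

```r
# Hypothetical guard for add_tier()/as_textgrid(): if the user passes a bare
# tier rather than a list of tiers, wrap it so the lapply() sees one tier.
maybe_wrap_tier <- function(new_tier_list) {
  if (!is.null(new_tier_list[["name"]]))
    new_tier_list <- list(new_tier_list)
  new_tier_list
}

tier <- list(name = "words2", type = "interval", t1 = 0, t2 = 1, label = "")
length(maybe_wrap_tier(tier))        # a bare tier gets wrapped into a length-1 list
length(maybe_wrap_tier(list(tier)))  # an already-wrapped list is left unchanged
```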

fix vignette path for ubuntu

R CMD check fails on Ubuntu because the example files are not being read in correctly. This results in NULLs and empty lists, which cause an error where `from` isn't a finite number. I don't have time to fix this right now.

implement split methods for grouped/ungrouped dataframe classes

Currently, functions such as piecewise_interpolate_pulses indiscriminately check for grouping structure in a passed dataframe and regroup the output accordingly. One might consider relegating this portion of the implementation to an S3 method dispatched on grouped_df. The main benefit is being able to more easily keep grouped_dfs grouped and ungrouped dataframes ungrouped in the output.

> mtcars |> class()
[1] "data.frame"
> mtcars |> as_tibble() |> class()
[1] "tbl_df"     "tbl"        "data.frame"
> mtcars |> group_by(gear) |> class()
[1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

We'd be looking to add methods like the following:

piecewise_interpolate_pulses.grouped_df(...)
piecewise_interpolate_pulses.default(...)

The way .grouping is handled could differ between the two methods, with .grouping optional for grouped_df. Something like three cases:

  • With ungrouped dataframes, .grouping is required, use the current implementation
  • For grouped_df, if .grouping is provided, use the current implementation
  • For grouped_df, if .grouping is not provided, use the group indices as the unique identifiers

I'll need to think on this last case though; it will lead to issues if someone thinks their grouping structure uniquely identifies individual timeseries but it actually doesn't.
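A sketch of what the dispatch could look like; the bodies below are placeholders illustrating the three cases, not the current implementation:

```r
# Placeholder sketch of the proposed S3 dispatch for piecewise_interpolate_pulses.
piecewise_interpolate_pulses <- function(pt_df, ..., .grouping = NULL) {
  UseMethod("piecewise_interpolate_pulses")
}

piecewise_interpolate_pulses.grouped_df <- function(pt_df, ..., .grouping = NULL) {
  # Case 3: no .grouping given, fall back to the dataframe's grouping columns
  if (is.null(.grouping))
    .grouping <- dplyr::group_vars(pt_df)
  # Cases 2 and 3 then defer to the current implementation
  piecewise_interpolate_pulses.default(pt_df, ..., .grouping = .grouping)
}

piecewise_interpolate_pulses.default <- function(pt_df, ..., .grouping = NULL) {
  # Case 1: ungrouped dataframes must supply .grouping explicitly
  if (is.null(.grouping))
    stop("`.grouping` is required for ungrouped dataframes")
  # ... current implementation ...
}
```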

Allow pulse interpolation for intervals with duplicate labels

Consider an audio file sample that has been annotated for regions of various pitch levels (high, low, extralow, etc.). The textgrid with these annotations has been converted to a dataframe like so:

     file interval_start interval_end label interval_i
1  sample       0.000000     1.717237                1
2  sample       1.717237     2.230513  step          2
3  sample       2.230513     3.190339   low          3
4  sample       3.190339     3.744677  high          4
5  sample       3.744677     4.663441   low          5
6  sample       4.663441     5.720789  high          6
7  sample       5.720789     6.069817   low          7
8  sample       6.069817     7.953540  high          8
9  sample       7.953540     8.661861   low          9
10 sample       8.661861     9.205933  elow         10
11 sample       9.205933     9.431775   low         11
12 sample       9.431775     9.693545  elow         12
13 sample       9.693545    10.237618  high         13
14 sample      10.237618    11.208951   mid         14
15 sample      11.208951    11.722227  high         15
16 sample      11.722227    12.179042   low         16
17 sample      12.179042    12.866832  high         17
18 sample      12.866832    15.151020               18  

Let's say you wanted to extract a different number of pulses for each pitch level specified in label. For example, 10 pulses for every high region and 30 for every low region. Currently, there are a few related issues with piecewise_interpolate_pulses that makes this task difficult.

  1. It assumes that the number of sections is equal to the number of unique labels. This means that asking for, say, 50 pulses from "the" low section when there are multiple sections labeled "low" will give odd results:
labeled_points |> 
  dplyr::filter(label != "") |> 
  piecewise_interpolate_pulses(section_by = 'label',
                               .grouping = 'file',
                               time_by = 'timepoint',
                               pulses_per_section = c(step = 30,
                                                      high = 20,
                                                      low = 50,
                                                      mid = 5,
                                                      elow = 15)) |> 
  ggplot(aes(x = pulse_i, color = label, y = hz)) +
  geom_point()

(plot of pulse_i against hz, colored by label)

To avoid the above issue, the user can ensure that each interval is uniquely identified by an index and generate the same number of pulses for everything, like so:

labeled_points |> 
  dplyr::filter(label != "") |> 
  piecewise_interpolate_pulses(section_by = 'interval_i',
                               .grouping = 'file',
                               time_by = 'timepoint',
                               pulses_per_section = 10) |> 
  ggplot(aes(x = pulse_i, color = interval_i, y = hz)) +
  geom_point()

(plot of pulse_i against hz, colored by interval_i)

But this doesn't solve the problem piecewise_interpolate_pulses is used for: different numbers of pulses for different sections.

  2. It requires the user to specify either a single number of pulses to use for all sections, or to explicitly enumerate how many pulses each and every section must receive. Doing this manually quickly becomes time-consuming. The user could create their own named numeric vector and programmatically fill it with (mostly) the same values, but this shouldn't be put on the user.

The implementation should be changed like so:

  • pulses_per_section should more strictly be a named integer vector; however, it should allow one and only one element to be unnamed. The unnamed value should be recycled for all sections that are not explicitly specified.
pulses_per_section = c('high' = 10,
                       'low' = 30,
                       25)

would be converted to:

pulses_per_section = c('high' = 10,
                       'low' = 30,
                       'elow' = 25,
                       'mid' = 25,
                       'step' = 25)
  • Add in an option to use a column that specifies the numeric indices of each interval. If not provided, we can estimate this by observing where the pulse labels change. However, if two adjacent intervals have the same label, they'll be treated as one section. Ultimately we need to know each section and its label, since $|\{\text{labels}\}| \leq |\{\text{sections}\}|$. The label of each section can then be straightforwardly used to look up the correct number of pulses, given the fix described above.
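The proposed recycling rule for pulses_per_section could be sketched as a small helper; recycle_pulses is a hypothetical name, not part of the package:

```r
# Hypothetical helper implementing the proposed recycling: exactly one unnamed
# value is allowed and is used as the default for any unlisted section label.
recycle_pulses <- function(pulses_per_section, section_labels) {
  nm <- names(pulses_per_section)
  if (is.null(nm)) nm <- rep("", length(pulses_per_section))
  unnamed_i <- which(nm == "")
  stopifnot(length(unnamed_i) <= 1L)  # at most one unnamed default
  out <- pulses_per_section[nm != ""]
  if (length(unnamed_i) == 1L) {
    fill_labels <- setdiff(section_labels, names(out))
    fill <- rep(pulses_per_section[[unnamed_i]], length(fill_labels))
    names(fill) <- fill_labels
    out <- c(out, fill)
  }
  out[section_labels]  # return in section-label order
}

recycle_pulses(c('high' = 10, 'low' = 30, 25),
               c("high", "low", "elow", "mid", "step"))
```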

rename summarize to reframe for dplyr >= 1.1

dplyr 1.1 deprecates a use case of summarize where multiple rows can be returned when summarizing a group. This functionality is taken over by the reframe function, which is not available in previous versions of dplyr. Either a check for the user's installed dplyr version is needed to choose between summarize and reframe, or the dependency version will need to be raised when 1.1 is fully released.
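The version check could be wrapped in a thin shim; reframe_compat is a hypothetical name and this is just one way to do it:

```r
# Hypothetical compatibility shim: use reframe() when available (dplyr >= 1.1.0),
# otherwise fall back to the old multi-row summarise() behavior.
reframe_compat <- function(.data, ...) {
  if (utils::packageVersion("dplyr") >= "1.1.0") {
    dplyr::reframe(.data, ...)
  } else {
    suppressWarnings(dplyr::summarise(.data, ...))
  }
}
```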

deprecate overly specific functionality

There are some parts of the package that are a bit too narrow in scope and really only apply to the workflow I use when making the stimuli for my dissertation experiment. The main things I have in mind are:

  • batch processing functions, which are just wrappers over reading in PitchTiers or TextGrids and merging a few tiers. Related functions are batch_process_pitchtiers and batch_process_textgrids
  • nuclear_df related functionality, which is too specific and obsolete with the widespread release of dplyr's non-equi join functionality. The actual goal of this was just non-equi joins, but it's presented as limited to solely the nuclear portion of an utterance, which is only true for my stimuli and likely not true for broader usage. Related functions are preprocess_pitchtracks, code_nuclear_pulses, and get_nuclear_textgrids
  • nest_tiers is another case that I think is obsolete; again, it's just an instance of a non-equi join. That said, it could perhaps be rewritten to recursively nest multiple tiers.

Updating these would require me to return to my targets workflow and offload all the functions to a sourced file containing the helper functions, which isn't the end of the world, but would require a few hours of breaking and then re-fixing the targets pipeline.

Additionally, if these are removed then adding a vignette to show how to accomplish the tasks these were made for could be useful.

Refactor piecewise_interpolate_pulses

I think the forced sorting attempt is slowing things down a bit for functions like piecewise_interpolate_pulses and average_pitchtracks. I should revisit this at some point and see if I can modularize things a bit better to be more lenient. Also, the section indices should be better documented.

average_pitchtracks fails when only 1 pulse exists

The following is fine:

data.frame(file = c('a', 'a', 'b', 'b'),
           t = c(1, 2, 1, 2),
           f = c(60,100,80, 120),
           grp = 'all',
           section = 1) |> 
  average_pitchtracks(section_by = 'section',
                      pulses_per_section = 30,
                      time_by = 't',
                      .pitchval = 'f',
                      aggregate_by = file ~ grp)

But the following throws an error because file b only has 1 pulse:

data.frame(file = c('a', 'a', 'b'),
           t = c(1, 2, 1),
           f = c(60,100,80),
           grp = 'all',
           section = 1) |> 
  average_pitchtracks(section_by = 'section',
                      pulses_per_section = 30,
                      time_by = 't',
                      .pitchval = 'f',
                      aggregate_by = file ~ grp)

It's fine that trajectories can't be interpolated from only a single point, but it would be nice if these files were dropped, or if the offending files were listed in the error message.
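Until then, a user-side workaround is to drop the offending files before averaging; a base R sketch using the same toy data:

```r
pt_df <- data.frame(file = c('a', 'a', 'b'),
                    t = c(1, 2, 1),
                    f = c(60, 100, 80),
                    grp = 'all',
                    section = 1)

# Keep only files with at least two pulses so interpolation has two endpoints
pulse_counts <- table(pt_df$file)
pt_df <- pt_df[pt_df$file %in% names(pulse_counts[pulse_counts >= 2]), ]
pt_df$file
#> [1] "a" "a"
```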

Add indices & filename postprocessing functionality to `textgrid_to_dataframes`

textgrid_to_dataframes does a good job converting each tier to a dataframe representation, but it needs to include the numeric indices of each interval in addition to the labels.

Also, it needs an option to remove the .TextGrid part of the file column; I'm constantly forgetting to remove this before joining with a pitch tier dataframe (e.g., sample does not match sample.TextGrid).

Both of these should be simple to implement and will avoid some common manual postprocessing steps.
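In the meantime, the suffix can be stripped manually before the join; a base R one-liner:

```r
tg_df <- data.frame(file = c("sample.TextGrid", "sample2.TextGrid"))

# Drop a trailing ".TextGrid" so the file column matches the pitch tier dataframe
tg_df$file <- sub("\\.TextGrid$", "", tg_df$file)
tg_df$file
#> [1] "sample"  "sample2"
```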

duplicate value handling in `interpolate_equal_pulses`

There are two related edge cases that result in division by 0, and hence NaNs, in the output:

  1. If the first two rows of any subtable in a grouped dataframe (or just the first two rows of a single ungrouped dataframe) have the same timepoint and frequency value, the first pulse will return NaN at the first timepoint. E.g.:
   tstfile tsthz tsttp
1        a    20     7
2        a    20     7
3        a    30     8
...

will return the below for the first row with interpolate_equal_pulses

  tstfile tsttp tsthz
1 a         7     NaN
  2. If all of the rows are the same, the whole output will be NaN (the above, but for however many pulses are requested). Because of this case, it wouldn't be sufficient to simply replace the first pulse with the first value in the original dataframe to fix the previous case.

There are three ways to handle this:

  1. Throw a warning that NaNs were detected and that there are duplicates in the data which the user should fix (dplyr::distinct works well)
  2. Throw an error with the same information as above
  3. Check for duplicates and filter them out, e.g. replacing:
int_df[[.pitchval]] <- interpolate_pitchpoints(int_df[[time_by]],
                                               pt_df[[time_by]],
                                               pt_df[[.pitchval]])

with

is_not_duplicate <- !duplicated(pt_df[[time_by]])
int_df[[.pitchval]] <- interpolate_pitchpoints(int_df[[time_by]],
                                               pt_df[[time_by]][is_not_duplicate],
                                               pt_df[[.pitchval]][is_not_duplicate])

For the time being I'm throwing a warning about this, since I haven't thought long enough about what kinds of issues might come up with adding the duplicate filtering.
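The same filtering can also be applied on the user side for now; a base R sketch mirroring option 3, using the column names from the example above:

```r
pt_df <- data.frame(tstfile = c("a", "a", "a"),
                    tsthz   = c(20, 20, 30),
                    tsttp   = c(7, 7, 8))

# Drop rows whose timepoint duplicates an earlier row before interpolating
pt_df[!duplicated(pt_df$tsttp), ]
#>   tstfile tsthz tsttp
#> 1       a    20     7
#> 3       a    30     8
```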
