Today I encountered an issue with the behavior of intersection.
Say I have a WORD tier that looks like this:
UPON | A | A | TIME
And I have a PHONE tier that looks like this:
AH0 | P | AA1 | N | AH0 | EY1 | T | AY1 | M
Assuming these are time-aligned correctly, when I call intersection, I get a list that looks something like this:
['UPON-AH0', 'UPON-P', 'UPON-AA1', 'UPON-N', 'A-AH0', 'A-EY1', 'TIME-T', 'TIME-AY1', 'TIME-M']
Because I have two intervals in the WORD tier with the same label, I can't tell from this intersection whether I have two distinct words "A" with the respective transcriptions "AH0" and "EY1", or one word "A" transcribed as "AH0 EY1".
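To make the ambiguity concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the (start, end, label) tuples, the timings, and the name flat_intersection are all made up for illustration, and the WORD intervals are inferred from the output above. It builds the same flat labels from two different WORD tiers.

```python
# Intervals are (start, end, label) tuples; all times are invented for illustration.
PHONES = [
    (0.0, 0.1, "AH0"), (0.1, 0.2, "P"), (0.2, 0.3, "AA1"), (0.3, 0.4, "N"),
    (0.4, 0.5, "AH0"), (0.5, 0.6, "EY1"),
    (0.6, 0.7, "T"), (0.7, 0.8, "AY1"), (0.8, 0.9, "M"),
]
# Two distinct words "A" ...
TWO_AS = [(0.0, 0.4, "UPON"), (0.4, 0.5, "A"), (0.5, 0.6, "A"), (0.6, 0.9, "TIME")]
# ... versus a single word "A" covering both phones.
ONE_A = [(0.0, 0.4, "UPON"), (0.4, 0.6, "A"), (0.6, 0.9, "TIME")]


def flat_intersection(left, right):
    """Hypothetical pairwise intersection: one interval per overlapping pair."""
    out = []
    for l_start, l_end, l_label in left:
        for r_start, r_end, r_label in right:
            start, end = max(l_start, r_start), min(l_end, r_end)
            if start < end:  # keep only overlaps with positive duration
                out.append((start, end, f"{l_label}-{r_label}"))
    return out


labels_two = [label for _, _, label in flat_intersection(TWO_AS, PHONES)]
labels_one = [label for _, _, label in flat_intersection(ONE_A, PHONES)]
print(labels_two == labels_one)  # the two tiers cannot be told apart by label
```

Both WORD tiers produce the same label list, so only the interval boundaries (not the labels) carry the distinction.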
Obviously, there is no single right way to solve this, but since we do know that the word entries are distinct, I would suggest that the label instead be the WORD label plus a tuple of all the PHONE labels that coincide with it. Something like this:
['UPON-(AH0, P, AA1, N)', 'A-(AH0)', 'A-(EY1)', 'TIME-(T, AY1, M)']
This would also mean that the interval boundaries would be those of the left-hand side tier. So my example above would be the result of
word_tier.intersection(phone_tier)
If you instead did
phone_tier.intersection(word_tier)
you would get
['AH0-UPON', 'P-UPON', 'AA1-UPON', 'N-UPON', 'AH0-A', 'EY1-A', 'T-TIME', 'AY1-TIME', 'M-TIME']
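A rough sketch of what I have in mind, in plain Python rather than against the library's actual classes. The (start, end, label) tuples, the timings, and the name grouped_intersection are assumptions for illustration only. Each left-hand interval keeps its own boundaries and collects the labels of every right-hand interval that overlaps it.

```python
# Intervals are (start, end, label) tuples; all times are invented for illustration.
WORDS = [(0.0, 0.4, "UPON"), (0.4, 0.5, "A"), (0.5, 0.6, "A"), (0.6, 0.9, "TIME")]
PHONES = [
    (0.0, 0.1, "AH0"), (0.1, 0.2, "P"), (0.2, 0.3, "AA1"), (0.3, 0.4, "N"),
    (0.4, 0.5, "AH0"), (0.5, 0.6, "EY1"),
    (0.6, 0.7, "T"), (0.7, 0.8, "AY1"), (0.8, 0.9, "M"),
]


def grouped_intersection(left, right):
    """Hypothetical grouped intersection: one output interval per left interval."""
    out = []
    for l_start, l_end, l_label in left:
        # Labels of all right-hand intervals with a positive-duration overlap.
        overlapping = [
            r_label
            for r_start, r_end, r_label in right
            if max(l_start, r_start) < min(l_end, r_end)
        ]
        if overlapping:  # left intervals with no overlap are dropped
            label = f"{l_label}-({', '.join(overlapping)})"
            out.append((l_start, l_end, label))  # boundaries come from the left tier
    return out


by_word = [label for _, _, label in grouped_intersection(WORDS, PHONES)]
by_phone = [label for _, _, label in grouped_intersection(PHONES, WORDS)]
print(by_word)
print(by_phone)
```

Note that this sketch always uses the parenthesized form, so with the PHONE tier on the left it yields 'AH0-(UPON)' rather than 'AH0-UPON'. Whether a single coinciding label should drop the parentheses is a separate design choice.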
What do you think?