source/commentary alignment detection is incorrect (jdsw, closed)

thatbudakguy commented on May 30, 2024

source/commentary alignment detection is incorrect

from jdsw.

Comments (7)

GDRom commented on May 30, 2024

I thought about this a little longer, and I think I may have found the issue, and why annotations are falsely ascribed to the commentary at a significantly higher rate than they are falsely ascribed to the source.

The relevant code block first builds sbck_map through += etc., thus adding a list to that variable. Now imagine a certain string occurs both in source and commentary. We'd end up adding two lists to that variable, leaving sbck_map = [True][False], right? Given that a commentary may discuss a string from the source, this should be fairly frequent.

    sbck_map = []
    for source, commentary in sbck:
        sbck_map += [True for i in range(len(source))]
        sbck_map += [False for i in range(len(commentary))]
    assert len(sbck_map) == len(sbck_chars)

So in the second relevant for-loop, we first check if the target string occurs at all; all good so far. But in the else statement, we have a second if...else check; this works if sbck_map[i] consists of a single list. But if sbck_map[i] = [True][False], thus indicating that this string occurs in both source and commentary, shouldn't the if check directly revert to its else, hence append "commentary" (given that [True][False] != [True])?

    for target, annotation, *extra in jdsw:
        remaining = sbck_chars[pointer:]
        location = remaining.find(target)
        if location == -1:
            output.append((target, annotation, BLANK))
        else:
            pointer = location
            if sbck_map[location] is True:
                output.append((target, annotation, "source"))
            else:
                output.append((target, annotation, "commentary"))
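
A side note on the loop as quoted: `remaining.find(target)` returns a position relative to `remaining`, not to `sbck_chars`, so using it directly as an index into `sbck_map` or as the new `pointer` only lines up while `pointer` is 0. Below is a minimal sketch that keeps every index absolute. It assumes `pointer` is meant to be an absolute offset into `sbck_chars`; the function name, the `BLANK` value, and the sample data are hypothetical and not taken from the repository.

```python
BLANK = ""  # assumption: the real BLANK value is not shown in this thread


def label_annotations(jdsw, sbck_chars, sbck_map):
    """Label each JDSW entry as source/commentary by its position in sbck_chars."""
    output = []
    pointer = 0
    for target, annotation, *extra in jdsw:
        # str.find(sub, start) searches from `start` but returns an absolute index
        location = sbck_chars.find(target, pointer)
        if location == -1:
            output.append((target, annotation, BLANK))
        else:
            pointer = location
            label = "source" if sbck_map[location] else "commentary"
            output.append((target, annotation, label))
    return output


# Made-up data: six characters, one boolean per character.
sample_chars = "甲乙丙丁甲乙"
sample_map = [True, True, False, True, False, False]
sample_jdsw = [("乙", "a"), ("甲", "b"), ("戊", "c")]
labels = label_annotations(sample_jdsw, sample_chars, sample_map)
```

Passing the start offset to `str.find` avoids slicing, so the returned index can be used on `sbck_map` without any further arithmetic.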

thatbudakguy commented on May 30, 2024

I think understanding the += operator behavior might be helpful here. For lists, it behaves like list.extend (not list.append): it adds all the items from the right operand to the left operand in place, so the result is still a single flat list.

>>> a = [1, 2, 3]
>>> b = [4, 5, 6]
>>> a += b
>>> a
[1, 2, 3, 4, 5, 6]

In the loop, that will look like:

sbck_map += [True, True, True, True, ...]
sbck_map += [False, False, False, False, ...]

Which will produce a flat list:

[True, True, True, True, False, False, False, False, ...]

Remember that source and commentary are each a column from the tab-separated SBCK file, taken one row at a time. So, the output should be one True for each character in the source column, then one False for each character in the commentary column, repeating for every row in the file. The assert sanity-checks this:

assert len(sbck_map) == len(sbck_chars)

i.e. the number of booleans should exactly equal the number of characters in the SBCK file.
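
To make that concrete, here is a small self-contained sketch of the same construction. The two rows are made up, and building `sbck_chars` by concatenating the columns is an assumption, since the thread does not show how that string is produced.

```python
# Hypothetical (source, commentary) rows standing in for the SBCK file.
rows = [("子曰", "孔曰"), ("學而", "注")]

sbck_chars = ""
sbck_map = []
for source, commentary in rows:
    sbck_chars += source + commentary      # assumed: text is the columns in order
    sbck_map += [True] * len(source)       # one True per source character
    sbck_map += [False] * len(commentary)  # one False per commentary character

# One boolean per character, in a single flat list.
assert len(sbck_map) == len(sbck_chars)
```

Every element of `sbck_map` is a plain bool, so `sbck_map[i]` can never be a nested list such as `[True][False]`.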

GDRom commented on May 30, 2024

Ah, ok, I see. Thanks for this explanation; that makes sense. Which obviously means that my comment above is fully invalid.

thatbudakguy commented on May 30, 2024

No worries!

thatbudakguy commented on May 30, 2024

@GDRom I think I may have tracked this bug down, thanks to your manually annotated alignment of the Lunyu. wanted to get your take on how to proceed. here's what I see...

we begin with the text "傳不" and corresponding LDM annotation "直專反注同鄭..." (quite long), which appear on line 8 of our JDSW digital copy of the first chapter of the lunyu. your notes say we should find this in the source text in the SBCK edition, and we do find it on line 18 there (emphasis mine): "與朋友交言而不信乎傳不習乎"
next we have the text "道" and corresponding LDM annotation "音導本或作導包云治也注及下同", which appear on line 9 of our JDSW. the single character "道" appears a total of 11 times in our SBCK edition. one of them is on line 12, and since it occurs before our previously found annotation (on line 18), we rightly discard it. then things get interesting — all other instances of "道" seem to be much further down the page, with the next one occurring all the way on line 55 of the SBCK edition, skipping over a sizable portion of text. this in itself isn't impossible, just unusual. your notes say we should find it in the source, and indeed it's in the source on line 55: "無改於父之道可謂孝矣(孔安國/曰孝子)"
next we have the text "千乗" and corresponding LDM annotation "繩證反注同千乘大國之賦也", which appear on line 10 of our JDSW. your notes say we ought to find it in the SBCK source, and we do find it four times: once on line 19, and three more times in a block of commentary that spans lines 23, 24, and 25. now we have a dilemma, however: we've already moved up to line 55 as a consequence of finding "道" there, and thus we can't consider any of the cases of "千乗", which are all "behind" us. it seems the correct approach would've been to find it on line 19: "(言九所傳之事得無/素不講習而傳乎)子曰導千乗"

my thought is this: is it possible that we should've actually found "道" as a graphic variant much earlier than line 55? there are only a few characters separating "傳不" and "千乗" in the source; it reads: "傳不習乎子曰導千乗". is "導" the variant we were looking for? i might have missed this but i don't think it's in your notes; you helpfully noted other variants with "variant SBCK".

if my guess is right, then the fix is just what we imagined: matching on variants. my only worry is that, by being too lenient, we might eagerly apply an LDM annotation to a variant, when instead the annotation rightly applies to the actual character itself further down the page (i.e. if LDM had intended to annotate "道" on line 55 instead of the much nearer "導"). i don't know enough to know if this worry is a real concern at this point, but maybe further testing will show us the way.
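
a rough sketch of what matching on variants could look like, using the passage quoted above. the `CANONICAL` table and the `normalize` and `align` helpers are hypothetical names, and the table contains only the two variant pairs mentioned in this thread; this is not the project's actual code.

```python
# Hypothetical variant table: maps a variant character to one canonical form.
CANONICAL = {"導": "道", "乘": "乗"}


def normalize(text):
    """Replace every known variant character with its canonical form."""
    return "".join(CANONICAL.get(ch, ch) for ch in text)


def align(targets, text):
    """Find each target in order, never searching behind the previous hit.

    Returns (target, index) pairs; index is None when nothing is found.
    """
    norm_text = normalize(text)  # one-to-one mapping, so indexes still line up
    pointer = 0
    hits = []
    for target in targets:
        index = norm_text.find(normalize(target), pointer)
        if index == -1:
            hits.append((target, None))
        else:
            hits.append((target, index))
            pointer = index + len(target)  # design choice: resume after the hit
    return hits


sbck = "與朋友交言而不信乎傳不習乎子曰導千乗"
result = align(["傳不", "道", "千乗"], sbck)
```

because the search is greedy, it always takes the nearest match, which is exactly the leniency concern raised above: a variant close by wins over the exact character further down the page.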

GDRom commented on May 30, 2024

@thatbudakguy In short, you are correct.

I went through the passage in question to check whether your hypothesis is correct in these instances, and it's spot on.

So we have the following three glosses:
傳不(直專反...)
道(音導本或作導...)
千乗(繩證反...)

In the SBCK, they occur on lines 11 and 12;
in the kanripo/TLS version, this occurs from lines 35-38.

As you noted, variants mess the automatic alignment up. In SBCK, 道 is written as 導; in kanripo/TLS, 千乗 is written as 千乘. So yes, we would have been looking for 導 (which obviously couldn't be matched).

Initial thoughts:
Tests sound like a good way to go.
Also, and fortunately, LDM notes as well that the 道 in question is sometimes written as 導 (本或作導); we might be able to draw from his comments when one-character sequences are problematic. For 千乗 vs. 千乘 -- given that two-character sequences are highly unlikely to occur in both ways in a single text, we could apply variant readings more liberally to them.
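
As an illustration of drawing on LDM's own notes, the sketch below pulls a variant out of an annotation that contains 本或作 and then searches for either form. It is only a sketch: it assumes the variant is the single character immediately following 本或作, which holds for the example in this thread but has not been checked against the rest of the text, and both function names are hypothetical.

```python
import re

# Assumption: the variant is the one character right after "本或作".
VARIANT_NOTE = re.compile(r"本或作(.)")


def forms_from_annotation(headword, annotation):
    """Return the headword plus any variants the annotation itself mentions."""
    forms = [headword]
    for variant in VARIANT_NOTE.findall(annotation):
        if variant not in forms:
            forms.append(variant)
    return forms


def find_first(forms, text, start=0):
    """Return the earliest index at or after `start` where any form occurs, or -1."""
    hits = [i for i in (text.find(form, start) for form in forms) if i != -1]
    return min(hits) if hits else -1


forms = forms_from_annotation("道", "音導本或作導包云治也注及下同")
position = find_first(forms, "傳不習乎子曰導千乗", start=2)
```

This only widens the search for headwords whose annotation explicitly records a variant, which keeps the matching stricter than a global variant table would.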

thatbudakguy commented on May 30, 2024

closing for now since we're tossing out the strategy of attempting to align to an SBCK edition that includes commentary — instead, we align directly to the Zhengwen edition, keeping only the places where LDM's headwords match that text. this is the strategy outlined in #11. if it turns out to not work well, we can revisit.
