Comments (7)

GDRom avatar GDRom commented on May 30, 2024

I thought about this for a little longer, and I think I may have found the issue and why the ratio of annotations falsely ascribed to commentary is significantly higher than the ratio falsely ascribed to source.

The relevant code block first creates sbck_map through += etc., thus adding a list to that variable. Now imagine a certain string occurs in both source and commentary. We'd end up adding two lists to that variable, thus with sbck_map = [True][False], right? Given that a commentary may discuss a string from the source, this should be fairly frequent.

    sbck_map = []
    for source, commentary in sbck:
        sbck_map += [True for i in range(len(source))]
        sbck_map += [False for i in range(len(commentary))]
    assert len(sbck_map) == len(sbck_chars)

In the second relevant for-loop, we first check whether the target string occurs at all; all good so far. But in the else branch, we have a second if...else check; this works if sbck_map[i] consists of a single list. But if sbck_map[i] = [True][False], indicating that this string occurs in both source and commentary, shouldn't the if check fall straight through to its else and hence append "commentary" (given that [True][False] != [True])?

    for target, annotation, *extra in jdsw:
        remaining = sbck_chars[pointer:]
        location = remaining.find(target)
        if location == -1:
            output.append((target, annotation, BLANK))
        else:
            pointer = location
            if sbck_map[location] is True:
                output.append((target, annotation, "source"))
            else:
                output.append((target, annotation, "commentary"))
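
A side note on the loop as quoted, offered as a possible contributing factor rather than a confirmed diagnosis: location is computed on the slice remaining, so it is an offset relative to pointer, not an index into sbck_chars or sbck_map. A minimal sketch of a lookup that works with absolute indices instead, assuming the same variable names; the classify helper and the toy data are hypothetical:

```python
def classify(sbck_chars, sbck_map, targets):
    """Label each target as "source" or "commentary" using absolute indices."""
    output = []
    pointer = 0
    for target in targets:
        # str.find accepts a start index and returns a position in the full string
        location = sbck_chars.find(target, pointer)
        if location == -1:
            output.append((target, None))
            continue
        pointer = location  # later targets are searched from here onward
        label = "source" if sbck_map[location] else "commentary"
        output.append((target, label))
    return output


# Toy row: 4 source characters followed by 4 commentary characters.
sbck_chars = "abcdwxyz"
sbck_map = [True] * 4 + [False] * 4
result = classify(sbck_chars, sbck_map, ["c", "x"])
```

With a sliced search, "x" would be found at offset 3 of "cdwxyz", and index 3 of sbck_map belongs to the source column, so the label would come out wrong.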

from jdsw.

thatbudakguy avatar thatbudakguy commented on May 30, 2024

I think understanding the += operator behavior might be helpful here. For lists it behaves like list.extend: it adds all the items from the right operand to the left operand and stores the result in the left operand.

>>> a = [1, 2, 3]
>>> b = [4, 5, 6]
>>> a += b
>>> a
[1, 2, 3, 4, 5, 6]

In the loop, that will look like:

sbck_map += [True, True, True, True, ...]
sbck_map += [False, False, False, False, ...]

Which will produce a flat list:

[True, True, True, True, False, False, False, False, ...]

Remember that source and commentary are each a column from the tab-separated SBCK file, taken one row at a time. So, the output should be one True for each character in the source column, then one False for each character in the commentary column, repeating for every row in the file. The assert sanity-checks this:

assert len(sbck_map) == len(sbck_chars)

i.e. the number of booleans should exactly equal the number of characters in the SBCK file.
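
To make this concrete, here is a small self-contained sketch. The two-row sbck list is made-up toy data, and building sbck_chars by joining the two columns is an assumption about how the real script does it:

```python
# Toy stand-in for the SBCK rows: (source, commentary) pairs.
sbck = [("abc", "xy"), ("de", "z")]

sbck_chars = "".join(source + commentary for source, commentary in sbck)

sbck_map = []
for source, commentary in sbck:
    sbck_map += [True] * len(source)       # one True per source character
    sbck_map += [False] * len(commentary)  # one False per commentary character

assert len(sbck_map) == len(sbck_chars)
# Every entry is a plain bool, so sbck_map[i] describes the character sbck_chars[i].
assert all(isinstance(flag, bool) for flag in sbck_map)
```

So a string that appears in both columns never produces a nested or doubled entry; each occurrence simply has its own positions in the flat list.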

GDRom avatar GDRom commented on May 30, 2024

Ah, ok, I see. Thanks for this explanation; that makes sense. Which obviously means that my comment above is fully invalid.

thatbudakguy avatar thatbudakguy commented on May 30, 2024

No worries!

thatbudakguy avatar thatbudakguy commented on May 30, 2024

@GDRom I think I may have tracked this bug down, thanks to your manually annotated alignment of the Lunyu. wanted to get your take on how to proceed. here's what I see...

  1. we begin with the text "傳不" and corresponding LDM annotation "直專反注同鄭..." (quite long), which appear on line 8 of our JDSW digital copy of the first chapter of the lunyu. your notes say we should find this in the source text in the SBCK edition, and we do find it on line 18 there (emphasis mine): "與朋友交言而不信乎 傳不 習乎"
  2. next we have the text "道" and corresponding LDM annotation "音導本或作導包云治也注及下同", which appear on line 9 of our JDSW. the single character "道" appears a total of 11 times in our SBCK edition. one of them is on line 12, and since it occurs before our previously found annotation (on line 18), we rightly discard it. then things get interesting: all other instances of "道" seem to be much further down the page, with the next one occurring all the way on line 55 of the SBCK edition, skipping over a sizable portion of text. this in itself isn't impossible, just unusual. your notes say we should find it in the source, and indeed it's in the source on line 55: "無改於父之 道 可謂孝矣(孔安國/曰孝子)"
  3. next we have the text "千乗" and corresponding LDM annotation "繩證反注同千乘大國之賦也", which appear on line 10 of our JDSW. your notes say we ought to find it in the SBCK source, and we do find it four times: once on line 19, and three more times in a block of commentary that spans lines 23, 24, and 25. now we have a dilemma, however: we've already moved up to line 55 as a consequence of finding "道" there, and thus we can't consider any of the cases of "千乗", which are all "behind" us. it seems the correct approach would've been to find it on line 19: "(言九所傳之事得無/素不講習而傳乎)子曰導 千乗"
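
the three steps above can be reproduced in miniature. this is only a sketch on a made-up string (the real SBCK file has many lines between these passages), using a hypothetical align helper that only ever searches forward:

```python
def align(text, targets):
    """Greedy forward-only search: return the index of each target, or -1."""
    positions = []
    pointer = 0
    for target in targets:
        location = text.find(target, pointer)
        if location != -1:
            pointer = location  # never look behind this point again
        positions.append(location)
    return positions


# "導" stands where LDM's headword "道" is expected; the real "道" comes much later.
filler = "〇" * 5  # placeholder for the intervening text
text = "傳不習乎子曰導千乗之國" + filler + "無改於父之道"
positions = align(text, ["傳不", "道", "千乗"])
```

here "道" is only found past the filler, so "千乗" comes back as -1 even though it sits right after "導".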

my thought is this: is it possible that we should've actually found "道" as a graphic variant much earlier than line 55? there are only a few characters separating "傳不" and "千乗" in the source; it reads: "傳不習乎子曰導千乗". is "導" the variant we were looking for? i might have missed this but i don't think it's in your notes; you helpfully noted other variants with "variant SBCK".

if my guess is right, then the fix is just what we imagined: matching on variants. my only worry is that, by being too lenient, we might eagerly apply an LDM annotation to a variant, when instead the annotation rightly applies to the actual character itself further down the page (i.e. if LDM had intended to annotate "道" on line 55 instead of the much nearer "導"). i don't know enough to know if this worry is a real concern at this point, but maybe further testing will show us the way.
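
one way the variant idea could look, as a sketch only. the VARIANTS table and the find_with_variants helper are hypothetical and not part of the existing script; the table would have to be curated by hand or from notes:

```python
# Hypothetical variant table: headword character -> alternative graphs.
VARIANTS = {"道": ["導"], "乗": ["乘"], "乘": ["乗"]}


def candidate_forms(target):
    """Expand a headword into every spelling allowed by VARIANTS."""
    forms = [""]
    for char in target:
        options = [char] + VARIANTS.get(char, [])
        forms = [prefix + option for prefix in forms for option in options]
    return forms


def find_with_variants(text, target, start=0):
    """Return the earliest match of any variant spelling at or after start."""
    hits = [text.find(form, start) for form in candidate_forms(target)]
    hits = [hit for hit in hits if hit != -1]
    return min(hits) if hits else -1


text = "傳不習乎子曰導千乗之國"
first = find_with_variants(text, "傳不")
second = find_with_variants(text, "道", first)
third = find_with_variants(text, "千乗", second)
```

taking the earliest hit is exactly the eager behavior the worry above describes, so a real version would probably need a distance limit or a preference for exact matches.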

GDRom avatar GDRom commented on May 30, 2024

@thatbudakguy In short, you are correct.

I went through the passage in question to check whether your hypothesis is correct in these instances, and it's spot on.

So we have the following three glosses:
傳不(直專反...)
道(音導本或作導...)
千乗(繩證反...)

In the SBCK, they occur on lines 11 and 12;
in the kanripo/TLS version, they occur on lines 35-38.

As you noted, variants mess up the automatic alignment. In SBCK, 道 is written as 導; in kanripo/TLS, 千乗 is written as 千乘. So yes, in the SBCK we should have been looking for 導 (which a search for 道 obviously couldn't match).

Initial thoughts:
Tests sound like a good way to go.
Also, and fortunately, LDM notes as well that the 道 in question is sometimes written as 導 (本或作導); we might be able to draw from his comments when one-character sequences are problematic. For 千乗 vs. 千乘 -- given that two-character sequences are highly unlikely to occur in both ways in a single text, we could apply variant readings more liberally to them.
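
A rough sketch of both ideas, assuming LDM's variant notes use the formula 本或作 followed by a single character (other formulas would need to be added). The function names and the table are hypothetical:

```python
import re

# Captures the single character that follows the formula 本或作 in an annotation.
VARIANT_NOTE = re.compile(r"本或作(.)")


def variants_from_annotation(annotation):
    """Collect variant characters that the annotation itself mentions."""
    return VARIANT_NOTE.findall(annotation)


def allowed_forms(headword, annotation, table):
    """Single characters use only variants LDM names; longer headwords use the table."""
    if len(headword) == 1:
        return [headword] + variants_from_annotation(annotation)
    forms = [""]
    for char in headword:
        options = [char] + table.get(char, [])
        forms = [prefix + option for prefix in forms for option in options]
    return forms


table = {"乗": ["乘"]}  # hypothetical hand-made variant table
single = allowed_forms("道", "音導本或作導包云治也注及下同", table)
double = allowed_forms("千乗", "繩證反注同千乘大國之賦也", table)
```

This keeps one-character headwords conservative (a variant is only tried when LDM himself mentions it) while letting longer sequences draw on the table more freely.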

thatbudakguy avatar thatbudakguy commented on May 30, 2024

closing for now since we're tossing out the strategy of attempting to align to an SBCK edition that includes commentary — instead, we align directly to the Zhengwen edition, keeping only the places where LDM's headwords match that text. this is the strategy outlined in #11. if it turns out to not work well, we can revisit.
