direct-phonology / jdsw

Parsing the "Jingdian Shiwen" with spaCy

License: MIT License
needs some reworking of txt2conllu to handle some of the texts still remaining.
the README is fairly out of date. lots of things are still in active development, but we should at least make it more true to what's in the repo right now.
depends on #2.
this is helpful for knowing when things actually get loaded from the filesystem, and for showing progress on long-running tasks.
see https://docs.python.org/3/library/fileinput.html, in particular methods like fileinput.filename(), fileinput.lineno(), etc.
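a minimal sketch of what such a script could look like, assuming plain-text input with one record per line (the function name and return shape are illustrative, not the actual script API):

```python
import fileinput

def process(files=None):
    """Read any number of files, or stdin when files is None/empty."""
    records = []
    with fileinput.input(files=files) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # fileinput tracks the current file and cumulative line number,
            # which is useful for logging and progress on long-running tasks.
            records.append((fileinput.filename(), fileinput.lineno(), line))
    return records
```

because fileinput falls back to stdin when no file names are given, the same script works both as `script.py a.txt b.txt` and as `cat a.txt | script.py`.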
this should allow us to remove the *_all.py versions of scripts; all scripts now accept any number of files or read directly from stdin. they also output a single file; no need to maintain the segmentation of the input. hopefully this will help obscure differences in segmentation between the JDSW and SBCK editions of texts.
- sbck2csv (add file name as a column in output)
- xml2conllu accepts a single file and thus remains unchanged.
this will help in cases where we can predict part of a fanqie but the other part is a polyphone. likewise for tones. also, we can push the fanqie combination logic out to its own step.
we should also include the original annotation in the final column, for reference
the implementation of #10 seems to be falsely attributing a substantial portion of annotations to the commentary, when really the annotation occurs first in the source (or vice versa). per @GDRom:
That's 23 of the 34 mismatches. Not sure what happened here -- two hypotheses:
- This might be an issue with greediness of the search pattern?
- This might be related to the algorithm not finding a first match, and is then somewhat off the rails?
Other mismatches (the eleven remaining) would be harder imho to implement, but given the ratio -- 11 out of 55 -- doable by hand afterwards.
see https://github.com/direct-phonology/jdsw/blob/main/out/manual_eval/lunyu/001.txt for a hand-annotated copy indicating where the algorithm was incorrect.
we can use str.maketrans for this. this task isn't a strict prerequisite, but it will improve the tokenization for #32.
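a quick sketch of the str.maketrans approach; the variant pairs below are illustrative placeholders, not a vetted mapping:

```python
# Illustrative variant-character pairs; the real mapping would come
# from the project's own variant lists.
VARIANTS = {
    "户": "戶",  # hypothetical example pair
    "歳": "歲",
}
TABLE = str.maketrans(VARIANTS)

def normalize(text: str) -> str:
    """Replace known variant characters with their canonical forms."""
    return text.translate(TABLE)
```

str.translate with a maketrans table does all replacements in a single pass, so it stays fast even on large texts.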
some text or person names we can preselect/annotate and merge, since we know they'll likely always refer to named entities. the giveaway for most of these is that they precede a 云.
now that we have blank CoNLL-U ready to be annotated, we can structure the code around performing transformations in a pipeline format, similar to spaCy:
pyconll may be useful here. the basic structure of a pipe probably involves initialization (passing in a config) and then a method (__call__, maybe) that the pipeline will call by passing in the output from the previous pipe, and which should return the same type of output as its input.
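the pipe protocol described above could be sketched like this (class names and the string-based demo pipe are illustrative, not the real components):

```python
class Pipe:
    """Base pipe: initialized with a config, called with the previous
    pipe's output, and returns the same type of output."""

    def __init__(self, config=None):
        self.config = config or {}

    def __call__(self, doc):
        # identity by default; subclasses transform and return the doc
        return doc

class Uppercase(Pipe):
    """Toy pipe for demonstration only."""
    def __call__(self, doc):
        return doc.upper()

class Pipeline:
    """Threads a doc through each pipe in order, spaCy-style."""

    def __init__(self, pipes):
        self.pipes = pipes

    def __call__(self, doc):
        for pipe in self.pipes:
            doc = pipe(doc)
        return doc
```

since every pipe takes and returns the same type, pipes can be reordered or dropped without changing the surrounding code.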
this will ultimately be migrated out of this repo along with most of lib/, but it's time to start prototyping some visualizations for Old Chinese documents, inspired by displaCy.
functionality we want to stub/borrow:
the "default" visualization style should use a sensible font and visually highlight annotations, as well as mark them up as such (we can try <ruby>, or look at other things for this, like SVG's <tspan>).
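the <ruby> idea could start as small as a helper that wraps an annotated character with its annotation as ruby text (function name and markup layout are a sketch, not the visualizer's actual API):

```python
import html

def ruby(base: str, annotation: str) -> str:
    """Render a character and its annotation as HTML ruby markup."""
    # html.escape guards against markup characters in the source text
    return (
        f"<ruby>{html.escape(base)}"
        f"<rt>{html.escape(annotation)}</rt></ruby>"
    )
```

browsers render the <rt> text half-size above (or beside) the base character, which is already close to the traditional annotation layout.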
as an exercise, it would be interesting to try to replicate the "classic" look of vertical running text, right-to-left, with annotations rendered half-size but in the same line and format, in red.
We can implement this as a finite-state transducer.
if we are aligning the JDSW to the Kanseki versions of texts, we can "borrow" the sentence segmentation from the Kanseki versions in order to construct a more traditionally .conll-like representation:
# text = 九五:飛龍在天,利見大人。
1 九 … SpaceAfter=No|Translit=kjuwX
2 五 … SpaceAfter=No|Translit=nguX
3 : … SpaceAfter=No
4 飛 … SpaceAfter=No|Translit=pj+j
5 龍 … SpaceAfter=No|Translit=lowng
6 在 … SpaceAfter=No|Translit=dzojX
7 天 … SpaceAfter=No|Translit=then
8 , … SpaceAfter=No
9 利 … SpaceAfter=No|Translit=lijH
10 見 … SpaceAfter=No|Translit=kenH
11 大 … SpaceAfter=No|Translit=daH
12 人 … SpaceAfter=No|Translit=nyin
13 。 … SpaceAfter=No
we could also consider leaving the punctuation in, so that models can learn that it doesn't have an annotation (and annotate <NA> or similar to distinguish it from missing values).
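a sketch of emitting rows in the format above, assuming the Translit key in MISC carries the reading and that a hypothetical <NA> value marks punctuation (the punctuation set here is illustrative):

```python
# Illustrative punctuation set; the real list would be derived from the corpus.
PUNCT = set("::,,。、")

def token_line(idx, form, translit=None):
    """Format one CoNLL-U token line (10 tab-separated columns)."""
    misc = "SpaceAfter=No"
    if form in PUNCT:
        misc += "|Translit=<NA>"  # explicit "no annotation", not missing
    elif translit:
        misc += f"|Translit={translit}"
    # ID, FORM, then LEMMA/UPOS/XPOS/FEATS/HEAD/DEPREL/DEPS left blank
    cols = [str(idx), form] + ["_"] * 7 + [misc]
    return "\t".join(cols)
```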
a quick check should suffice to see whether our big lists of text and person names actually show up in the JDSW.
closed by kanripo/KR1g0003#3.
right now there are a variety of cases where Reconstruction can raise a MultipleReadingsError: when fetching an initial, a rime, or an entire reading. for at least some of these cases, I think we could do something a little smarter. an example:
say the candidate readings are drjangH, drjang, and trjangX. the possible initials are dr and tr; the rime in every case is jang. there's still ambiguity here, but much less ambiguity than simply giving up and not assigning a reading! if we can come up with a systematic way of noting the ambiguity, as B&S do for their OC reconstruction (using things like brackets), we might still salvage some information that would help an algorithm or a human manually correcting the data. for example:
[dr|tr]jang[X|H|_]
and if we annotate each part in a separate field, this might make it into the CoNLL-U as:
MCInitial=[dr/tr]|MCRime=jang|MCTone=[X/H/_]
(using / instead of | since that character is reserved to separate annotations in CoNLL-U MISC and FEATS fields.)
this also helps in the (unfortunately many) cases where LDM did provide an annotation, but one or both of the characters in his fanqie happen to be polyphones.
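a sketch of collapsing candidate readings into the bracketed notation above; the parse of a Baxter-style reading into initial/rime/tone is deliberately naive (initial = leading consonant letters, optional final X/H tone letter) and only meant to illustrate the idea:

```python
import re

# Naive split of a Baxter-notation MC reading: everything before the
# medial/vowel is the "initial", a trailing X/H is the tone, the rest
# is the "rime". This is an assumption for illustration, not a full parser.
READING = re.compile(r"([^aeiouj+]*)(.*?)([XH]?)")

def collapse(readings):
    """Merge candidate readings into bracketed initial/rime/tone notation."""
    parts = []
    for r in readings:
        m = READING.fullmatch(r)
        parts.append((m.group(1), m.group(2), m.group(3) or "_"))

    def join(values):
        uniq = list(dict.fromkeys(values))  # dedupe, keep order
        return uniq[0] if len(uniq) == 1 else "[" + "|".join(uniq) + "]"

    initials, rimes, tones = zip(*parts)
    return join(initials) + join(rimes) + join(tones)
```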
it's probably long past time we did this, just to make repeated actions easier and keep a consistent structure.
Prodigy's combo interface for annotating relations and spans at the same time is actually geared more towards relations than it is spans. Our project is the opposite — the spans are the most important. I think it's worth it to separate the tasks, so that:
- we can use the spans interface for annotating spans, which is somewhat nicer-looking
- the relations task can be handled separately; not sure how sophisticated this needs to be, maybe not very. we've discussed using a new span label HEAD for this.
inspecting the stored labels in the phonologizer showed a few confusing entries:
"子亷反下同",
"居冝反下同",
"恱歳反下同",
"恱絹反下同",
"戸郞反下同",
"於䖍反下同",
"步卜反下同"
it looks like these are all of the form:
XY反下同
which would seem to mean that our "below all the same" multi-fanqie logic needs another look:
https://github.com/direct-phonology/och-g2p/blob/94089afb82012a4cdd5cd52c457e01c3a475857e/scripts/annotate.py#L158-L193
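one possible check: split the 下同 suffix off before the fanqie itself is parsed, so it can't leak into the stored label. the pattern below is an assumption about the annotation shape (two characters, 反, then 下同 or 下皆同), not a full grammar:

```python
import re

# Matches annotations of the form XY反下同 / XY反下皆同
# ("X-Y fanqie; same below").
XIATONG = re.compile(r"^(..)反下皆?同$")

def split_xiatong(annotation):
    """Return (fanqie, applies_below). Non-matching input passes through."""
    m = XIATONG.match(annotation)
    if m:
        return m.group(1) + "反", True
    return annotation, False
```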
for #10, we need sbck versions (including commentary) for:
we also need to re-segment these so that the chapter segmentation matches the jdsw versions.
@GDRom has already done some work to do this manually; we want to see if we can automate it.
#10 is a prerequisite for getting the JDSW in shape to align.
#9 is a prerequisite for getting 正文 versions to align to.
this uses a modified version of the algorithm from #10:
we're still looking for clean 正文 versions of these texts:
note that, for the 公羊 at least, Kanripo suggests a 正文 is available: https://www.kanripo.org/titlesearch?query=%E6%98%A5%E7%A7%8B%E5%85%AC%E7%BE%8A
but, the link to it is broken: https://www.kanripo.org/text/KR1e0005/
maybe Christian Wittern can help with this.
it's been trained on a dataset that closely matches ours, so it's much better than working from scratch.
in several cases, the text of the jingdian shiwen from kanripo (KR1g0003) is missing entire pages, which can lead to chapters in the text being "conjoined" (see below for example).
we probably need to address this by updating the text at the source (https://github.com/kanripo/KR1g0003) via pull request and manually inserting the correct text. there are scans of the SBCK edition on Wikimedia and CTEXT has a version with simplified characters for reference.
(古猛/反)仲夏(戸嫁反/下同)謂食(音/嗣)齊(才細反/下皆同)頒爵(音/班)必¶
當(丁浪/反)媒氏(音/梅)而取(音娶本/又作娶)稽士(古兮/反)之烖(音/災)妖¶
孽(又作蠥魚列反妖又作祅說文云衣服歌謡/草木之怪謂之祅禽獸蟲蝗之怪謂之蠥)螟(亡丁/反)螽¶
(�)¶
pb:KR1g0003_SBCK_012-8a¶
pb:KR1g0003_SBCK_012-9a¶
(苦浪反又音/剛又户剛反)與茵(音/因)縮二(所六/反)以犢(音獨本/亦作特)相朝(直/遥)¶
(反下及/注同)灌用(古亂反/注同)鬱鬯(丑亮/反)脯醢(上音甫/下音海)繁纓¶
this helps with faster POS annotation in prodigy. the order:
There are instances in which LDM's JDSW correctly notes a reading not included in the SBGY. Our current approach fails to take these instances into account. These instances are rare, however, and often tied to archaic texts (like the Shangshu).
Examples thus far encountered include:
Suggested approach:
if we generate a dataset of all annotations, we might be able to find some interesting results seeing how the annotations cluster together — whether they have fanqie, include citations, etc.
annotations seem to have a reliable internal structure which might lend itself well to dependency parsing. perhaps we can define custom terms (for fanqie, qualifiers, citations, etc.) and then use a dependency parser to automatically parse out the interesting parts of each annotation.
copied/adapted notes from 5/27 meeting:
Assumption: If LDM annotates two successive characters, the second annotation refers to the instance of that character that is closest in the source text to the previous character.
this will produce a version of the JDSW that leaves out any annotations referring to commentaries, which we can later align to the 正文 versions.
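one reading of that assumption can be sketched as a simple forward scan: each annotated character is aligned to its first occurrence in the source at or after the previous match (names and the return shape are illustrative):

```python
def align(annotated_chars, source):
    """Align each annotated character to the nearest occurrence in the
    source text at or after the previous match.

    Returns a list of (char, source_index), with None when no
    occurrence is found."""
    positions = []
    cursor = 0
    for ch in annotated_chars:
        idx = source.find(ch, cursor)
        if idx == -1:
            positions.append((ch, None))
            continue
        positions.append((ch, idx))
        cursor = idx + 1  # only look forward from the last match
    return positions
```

the forward-only cursor is what encodes the "closest to the previous character" assumption; repeated characters in the source are disambiguated by annotation order.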
using the categorization system established by tharsen & wang, we can see how spaCy's span categorizer performs on its own, irrespective of tokenization, etc.
it would probably be helpful to have a graphical interface for this via streamlit, for testing out arbitrary annotations and seeing what the model predicts.
via @GDRom:
Secondly, just as a word of caution, medials may complicate things further; the specifics in Baxter's MC is sometimes not indicated well by fanqie, so it would have to be a looser Regex in that position (for example MC dzy- -jang becomes dzyang; the fanqie may just indicate, however, an initial MC dz instead of dzy-)
we need to know the valid patterns to correct this.
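pending the full list of valid patterns, the looser match could be sketched as a regex that allows an optional glide after the fanqie-derived initial; the glide set here is an assumption for illustration only:

```python
import re

def initial_pattern(initial):
    """Build a loose pattern for an initial that also accepts a
    following medial glide (e.g. dz- should also match dzy-)."""
    # [jy]? is an assumed glide set, not the full inventory of valid medials
    return re.compile(re.escape(initial) + r"[jy]?")

def initial_matches(initial, reading):
    """True if the reading could begin with this (possibly loose) initial."""
    return initial_pattern(initial).match(reading) is not None
```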
we could cover some frequently missed characters, like 户, with semantic variant lookups.
So that they can be shared in .jsonl form rather than stuck in prodigy's database. See db-out.