
jdsw's Issues

update README

the README is fairly out of date. lots of things are still in active development, but we should at least make it more true to what's in the repo right now.

add logging

this is helpful for knowing when things actually get loaded from the filesystem, and for showing progress on long-running tasks.
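
a minimal sketch, using the standard library's logging module (the format, level, and example path are illustrative):

    import logging

    # illustrative defaults; adjust format/level per script
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    logger = logging.getLogger(__name__)

    path = "lunyu/001.txt"  # hypothetical example path
    logger.info("loading %s from filesystem", path)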

convert bulk processing scripts to use fileinput

see https://docs.python.org/3/library/fileinput.html, in particular methods like fileinput.filename(), fileinput.lineno(), etc.

this should allow us to remove the *_all.py versions of scripts: all scripts will then accept any number of files or read directly from stdin. they will also output a single file, with no need to maintain the segmentation of the input. hopefully this will help obscure differences in segmentation between the JDSW and SBCK editions of texts.

  • sbck2csv (add file name as a column in output)

xml2conllu accepts a single file and thus remains unchanged.
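
a minimal sketch of the fileinput pattern, reading any number of files (or stdin when none are given) and emitting a single stream; the CSV-ish output is illustrative, but it shows how sbck2csv could add the file name as a column:

    import fileinput

    # reads every file named on the command line, or stdin if none are given
    for line in fileinput.input():
        name = fileinput.filename()      # current file ('<stdin>' for stdin)
        lineno = fileinput.filelineno()  # line number within the current file
        print(f"{name},{lineno},{line.rstrip()}")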

shift output representation to separate initials, finals, and tones

this will help in cases where we can predict part of a fanqie but the other part is a polyphone. likewise for tones. also, we can push the fanqie combination logic out to its own step.

we should also include the original annotation in the final column, for reference.
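
a minimal sketch of the split, assuming Baxter-style MC transcriptions where a final X marks rising tone and a final H marks departing tone, and taking the initial as the longest match from a fixed inventory. the inventory below is abridged and illustrative, and entering tone (signaled by a -p/-t/-k coda) is left unmarked:

    # abridged, illustrative initial inventory; longest match wins
    INITIALS = sorted(
        ["p", "ph", "b", "m", "t", "th", "d", "n", "tr", "trh", "dr", "nr",
         "ts", "tsh", "dz", "s", "z", "tsy", "tsyh", "dzy", "sy", "zy", "ny",
         "k", "kh", "g", "ng", "x", "h", "'", "l", "y"],
        key=len, reverse=True,
    )

    def split_reading(reading):
        """Split an MC reading into (initial, final, tone)."""
        tone = "_"  # level (or entering, which the coda indicates)
        if reading.endswith(("X", "H")):
            reading, tone = reading[:-1], reading[-1]
        for initial in INITIALS:
            if reading.startswith(initial):
                return initial, reading[len(initial):], tone
        return "", reading, tone  # vowel-initial or unanalyzed

    print(split_reading("drjangH"))  # ('dr', 'jang', 'H')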

source/commentary alignment detection is incorrect

the implementation of #10 seems to be falsely attributing a substantial portion of annotations to the commentary, when really the annotation occurs first in the source (or vice versa). per @GDRom:

That's 23 of the 34 mismatches. Not sure what happened here -- two hypotheses:

  • This might be an issue with greediness of the search pattern?
  • This might be related to the algorithm not finding a first match, and is then somewhat off the rails?

Other mismatches (the ten remaining) would be harder imho to implement, but given the ratio -- 11 out of 55 -- doable by hand afterwards.

see https://github.com/direct-phonology/jdsw/blob/main/out/manual_eval/lunyu/001.txt for a hand-annotated copy indicating where the algorithm was incorrect.

  • generate test cases from manually annotated copy
  • make algorithm tolerant of variants (probably want to use str.maketrans for this)
  • make algorithm tolerant of small differences in phrasing (probably want to use levenshtein for this; see the sketch below)
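
a minimal sketch of both ideas: str.maketrans for one-to-one graphic variants, and difflib (a stdlib stand-in for Levenshtein distance) for small differences in phrasing. the variant pairs and threshold are illustrative, not the project's actual tables:

    import difflib

    # illustrative variant pairs only; the real table would be hand-curated
    VARIANTS = str.maketrans({"戸": "戶", "歩": "步"})

    def normalize(text):
        return text.translate(VARIANTS)

    def close_enough(a, b, threshold=0.8):
        """Fuzzy match on normalized strings, as a stand-in for Levenshtein."""
        return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold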

find named entities in annotations

this task isn't a strict prerequisite, but it will improve the tokenization for #32.

we can preselect/annotate and merge some text and person names, since we know they'll likely always refer to named entities. the giveaway for most of these is that they precede a 云 (see the sketch after the lists below).

texts

  • 薛?
  • 二傳?
  • 切韻 (the Qieyun)
  • 說文 (the Shuowen Jiezi)
  • 左傳 (the Chunqiu Zuozhuan), also 左氏?
  • 公羊 (the Chunqiu Gongyang), also 公羊傳
  • 漢書 (the Hanshu)
  • 禮記 (the Liji)
  • 爾雅 (the Erya)
  • 廣雅 (the Guangya, an expanded Erya)
  • 論語 (the Analects)
  • 易 (the Zhouyi), usually for qualifiers like "...易内同"
  • 三蒼
  • 本 (generic "some book/work"), also 一本
  • 卦 (generic "this hexagram", in the Zhouyi)

people

  • 徐 (surname)
  • 李 (surname)
  • 毛 (surname)
  • 崔 (surname), also 崔本 (a work?)
  • 郭 (surname)
  • 秀 (surname)
  • 京 (surname)
  • 干 (surname)
  • 荀 (surname)
  • 司馬 (surname)
  • 盧植 (Lu Zhi)
  • 馬融 (Ma Rong), also 馬
  • 鄭玄 (Zheng Xuan), also 鄭 or 鄭康成 (his courtesy name)
  • 孟康
  • 劉昌宗
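
a minimal sketch of the 云 giveaway, matching a known name immediately before 云. the name set is an abridged selection from the lists above; longer names are sorted first so that 鄭玄 matches before 鄭:

    import re

    NAMES = ["說文", "爾雅", "左傳", "鄭玄", "馬融", "鄭", "徐", "馬"]
    NAME_BEFORE_YUN = re.compile(
        "(" + "|".join(re.escape(n) for n in sorted(NAMES, key=len, reverse=True)) + ")云"
    )

    print(NAME_BEFORE_YUN.findall("說文云衣服歌謡草木之怪謂之祅"))  # ['說文']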

implement pipeline pattern for data transformations

now that we have blank CoNLL-U ready to be annotated, we can structure the code around performing transformations in a pipeline format, similar to spaCy:

  1. user defines what transformations should take place, in a list/collection
  2. at runtime, a script initializes each of the transformations ("pipes") and adds them to a pipeline
  3. each CoNLL-U file is loaded and parsed using pyconll
  4. the data is passed through the pipeline, and the output of each pipe is the input to the next pipe
  5. the final output is re-serialized to CoNLL-U and the input file is overwritten

the basic structure of a pipe probably involves initialization (passing in a config) and then a method (__call__, maybe) that the pipeline calls with the output of the previous pipe, and which should return the same type of output as its input.
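
a minimal sketch of that structure, assuming pyconll's load_from_file and conll() serialization; the Pipe/Pipeline names are illustrative:

    import pyconll

    class Pipe:
        """One transformation: configured at init, applied via __call__."""
        def __init__(self, config=None):
            self.config = config or {}

        def __call__(self, conll):
            raise NotImplementedError  # subclasses return the same type they receive

    class Pipeline:
        def __init__(self, pipes):
            self.pipes = pipes

        def run(self, path):
            conll = pyconll.load_from_file(path)  # 3. load and parse
            for pipe in self.pipes:               # 4. each pipe's output feeds the next
                conll = pipe(conll)
            with open(path, "w") as f:            # 5. re-serialize and overwrite
                f.write(conll.conll())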

add visualization

this will ultimately be migrated out of this repo along with most of lib/, but it's time to start prototyping some visualizations for Old Chinese documents, inspired by displaCy.

functionality we want to stub/borrow:

  • export to HTML (SVG?)
  • run a local server to visualize results
  • color in sections of the text (base text vs. annotation)

the "default" visualization style should use a sensible font and visually highlight annotations, as well as mark them up as such (we can try <ruby> or look at other things for this, like SVG's <tspan>.

as an exercise, it would be interesting to try to replicate the "classic" look of vertical running text, right-to-left, with annotations rendered half-size but in the same line and format, in red.
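
a minimal sketch of the <ruby> idea, wrapping a base character and its annotation (HTML escaping omitted; the example annotation is illustrative):

    def ruby(base, annotation):
        """Render a base character with its annotation as HTML <ruby> markup."""
        return f"<ruby>{base}<rt>{annotation}</rt></ruby>"

    print(ruby("長", "丁丈反"))  # <ruby>長<rt>丁丈反</rt></ruby>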

segment sentences and number tokens in output

if we are aligning the JDSW to the Kanseki versions of texts, we can "borrow" the sentence segmentation from the Kanseki versions in order to construct a more traditional .conll-like representation:

# text = 九五:飛龍在天,利見大人。
1    九    …    SpaceAfter=No|Translit=kjuwX
2    五    …    SpaceAfter=No|Translit=nguX
3    :    …    SpaceAfter=No
4    飛    …    SpaceAfter=No|Translit=pj+j
5    龍    …    SpaceAfter=No|Translit=lowng
6    在    …    SpaceAfter=No|Translit=dzojX
7    天    …    SpaceAfter=No|Translit=then
8    ,    …    SpaceAfter=No
9    利    …    SpaceAfter=No|Translit=lijH
10   見    …    SpaceAfter=No|Translit=kenH
11   大    …    SpaceAfter=No|Translit=daH
12   人    …    SpaceAfter=No|Translit=nyin
13   。    …    SpaceAfter=No

we could also consider leaving the punctuation in, so that models can learn that it doesn't have an annotation (annotating it as <NA> or similar to distinguish it from missing values).

better handling for cases of multiple readings

right now there are a variety of cases where Reconstruction can raise a MultipleReadingsError: when fetching an initial, a rime, or an entire reading. for at least some of these cases, I think we could do something a little smarter. an example:

  1. we see a 長 in the text without an annotation from LDM, and go to the guangyun looking for a reading.
  2. we see that 長 has three available readings: drjangH, drjang, trjangX.
  3. we divide each of these into initial, rime, and tone (see #6).
  4. for the initial, we have two options: dr and tr.
  5. for the rime, we have only one option, which we can confidently annotate: jang.
  6. for the tone, we have three options: level, rising, and departing.

there's still ambiguity here, but much less ambiguity than simply giving up and not assigning a reading! if we can come up with a systematic way of noting the ambiguity, as B&S do for their OC reconstruction (using things like brackets), we might still salvage some information that would help an algorithm or a human manually correcting the data. for example:

[dr|tr]jang[X|H|_]

and if we annotate each part in a separate field, this might make it into the CoNLL-U as:

MCInitial=[dr/tr]|MCRime=jang|MCTone=[X/H/_]

(using the / instead of | since that character is reserved to separate annotations in CoNLL-U MISC and FEATS fields.)

this also helps in the (unfortunately many) cases where LDM did provide an annotation, but one or both of the characters in his fanqie happen to be polyphones.
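
a minimal sketch of the collapsing step, assuming each reading has already been split into an (initial, rime, tone) triple per #6; the field names match the proposed MISC format above:

    def collapse(values):
        """Single value if unambiguous, else a bracketed list of options."""
        unique = sorted(set(values))
        return unique[0] if len(unique) == 1 else "[" + "/".join(unique) + "]"

    # the three readings of 長: drjangH, drjang, trjangX
    readings = [("dr", "jang", "H"), ("dr", "jang", "_"), ("tr", "jang", "X")]
    initials, rimes, tones = zip(*readings)

    print(f"MCInitial={collapse(initials)}|MCRime={collapse(rimes)}|MCTone={collapse(tones)}")
    # MCInitial=[dr/tr]|MCRime=jang|MCTone=[H/X/_]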

restructure as spaCy project

it's probably long past time we did this, just to make repeated actions easier and keep a consistent structure.

Separate relation and span annotation

Prodigy's combo interface for annotating relations and spans at the same time is actually geared more towards relations than it is spans. Our project is the opposite — the spans are the most important. I think it's worth it to separate the tasks, so that:

  • if we want to try a totally heuristics-based strategy for relations, we can do that after the spans are set
  • we can use the dedicated spans interface for annotating spans, which is somewhat nicer-looking
  • the logic in the current recipe gets spread out and becomes a little more modular

handle "all below/above" annotations

inspecting the stored labels in the phonologizer showed a few confusing entries:

      "\u5b50\u4eb7\u53cd\u4e0b\u540c",
      "\u5c45\u519d\u53cd\u4e0b\u540c",
      "\u6071\u6b73\u53cd\u4e0b\u540c",
      "\u6071\u7d79\u53cd\u4e0b\u540c",
      "\u6238\u90de\u53cd\u4e0b\u540c",
      "\u65bc\u458d\u53cd\u4e0b\u540c",
      "\u6b69\u535c\u53cd\u4e0b\u540c"

it looks like these are all of the form:
XY反下同

which would seem to mean that our "below all the same" multi-fanqie logic needs another look:
https://github.com/direct-phonology/och-g2p/blob/94089afb82012a4cdd5cd52c457e01c3a475857e/scripts/annotate.py#L158-L193
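
a minimal sketch of detecting the pattern, assuming the label is exactly two fanqie speller characters followed by 反下同:

    import re

    # "XY反下同": fanqie spellers X and Y, "same for all below"
    FANQIE_BELOW = re.compile(r"^(?P<upper>.)(?P<lower>.)反下同$")

    m = FANQIE_BELOW.match("子亷反下同")
    if m:
        print(m.group("upper"), m.group("lower"))  # 子 亷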

add sbck versions of missing texts

for #10, we need sbck versions (including commentary) for:

  • guliang
  • laozi
  • liji
  • lunyu
  • maoshi
  • shangshu
  • xiaojing
  • zhouyi
  • zhuangzi
  • zuozhuan

we also need to re-segment these so that the chapter segmentation matches the jdsw versions.

automatically align jdsw with cleaned source texts

@GDRom has already done some work to do this manually; we want to see if we can automate it.

#10 is a prerequisite for getting the JDSW in shape to align.
#9 is a prerequisite for getting 正文 versions to align to.

this uses a modified version of the algorithm from #10 (a sketch follows the steps below):

  1. Look through the cleaned JDSW from #10 and break it up into a k:v store, where each key is every unbroken sequence of characters prior to an annotation
  2. For each key: value pair...
    a. Look through the source text (same chapter) and find the first instance of the key (unbroken) that occurs after the previous annotation (annotations must be sequential)
    b. If that key is found, take the JDSW annotation and insert it into the source text at that point
    c. If that key isn't found, just skip since we'll already know about it from #10
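
a minimal sketch of those steps, assuming jdsw_entries is the ordered list of (key, annotation) pairs from step 1 and source is the cleaned source-text chapter; the inline "(annotation)" insertion format is illustrative:

    def align(jdsw_entries, source):
        out, cursor = [], 0
        for key, annotation in jdsw_entries:
            idx = source.find(key, cursor)  # 2a. first match after the previous one
            if idx == -1:
                continue                    # 2c. skip; #10 already logged these
            end = idx + len(key)
            out.append(source[cursor:end])  # copy source up through the key
            out.append(f"({annotation})")   # 2b. insert the JDSW annotation
            cursor = end
        out.append(source[cursor:])
        return "".join(out)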

add zhengwen versions of missing texts

we're still looking for clean 正文 versions of these texts:

note that, for the 公羊 at least, Kanripo suggests a 正文 is available: https://www.kanripo.org/titlesearch?query=%E6%98%A5%E7%A7%8B%E5%85%AC%E7%BE%8A

but, the link to it is broken: https://www.kanripo.org/text/KR1e0005/

maybe Christian Wittern can help with this.

fix missing pages in the jdsw

problem

in several cases, the text of the jingdian shiwen from kanripo (KR1g0003) is missing entire pages, which can lead to chapters in the text being "conjoined" (see below for example).

we probably need to address this by updating the text at the source (https://github.com/kanripo/KR1g0003) via pull request and manually inserting the correct text. there are scans of the SBCK edition on Wikimedia and CTEXT has a version with simplified characters for reference.

example

(古猛/反)仲夏(戸嫁反/下同)謂食(音/嗣)齊(才細反/下皆同)頒爵(音/班)必¶
當(丁浪/反)媒氏(音/梅)而取(音娶本/又作娶)稽士(古兮/反)之烖(音/災)妖¶
孽(又作蠥魚列反妖又作祅說文云衣服歌謡/草木之怪謂之祅禽獸蟲蝗之怪謂之蠥)螟(亡丁/反)螽¶
(�)¶
pb:KR1g0003_SBCK_012-8a¶
pb:KR1g0003_SBCK_012-9a¶
(苦浪反又音/剛又户剛反)與茵(音/因)縮二(所六/反)以犢(音獨本/亦作特)相朝(直/遥)¶
(反下及/注同)灌用(古亂反/注同)鬱鬯(丑亮/反)脯醢(上音甫/下音海)繁纓¶

checklist

liji

  • 009/010
  • 028/029
  • 031/032
  • 036/037
  • 048/049

maoshi

  • 008/009
  • 014/015
  • 016/017/018
  • 020/021

shangshu

  • 020/021/022
  • 029/030/031
  • 038/039
  • 044/045
  • 057/058

yili

  • 009/010
  • 014/015

zhouyi

  • 043/044
  • 053/054/055

zhuangzi

  • 001/002
  • 006/007
  • 017/018

zuozhuan

  • 026/027
  • 028/029

bugs internal to txt files

  • maoshi 003
  • maoshi 010
  • maoshi 012
  • maoshi 019
  • maoshi 020
  • maoshi 024
  • maoshi 025 (3)
  • maoshi 026
  • maoshi 030
  • zhouli 001
  • zhouli 002
  • zhouli 003 (2)
  • zhouli 004
  • zhouli 005
  • zhouli 006 (2)
  • yili (4)
  • liji (16)
  • zuozhuan (23)
  • gongyang (2)
  • guliang (5)
  • zhuangzi (10)

output list of variant readings occurring in JDSW but not attested in SBGY

There are instances in which LDM's JDSW correctly notes a reading not included in the SBGY. Our current approach fails to take these instances into account. These instances are rare, however, and often tied to archaic texts (like the Shangshu).

Examples thus far encountered include:

  • 女 read as MC nyoX < OCNR *naʔ (meaning 'you') [this would usually be written 汝]
  • 女 is, for comparison, usually read as MC nrjoX < OCNR *nraʔ ('woman', 'female'), or, relatively rarely, also verbally as MC nrjoH < OCNR *nraʔ-s ('give as wife'). The SBGY notes these two readings.

Suggested approach (sketched below):

  • output a list of all occurrences where LDM's notes cannot be reproduced through SBGY;
  • I will have to look through that list to check our logic.
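
a minimal sketch of that output, assuming sbgy_readings maps each character to its attested SBGY readings and jdsw_readings holds readings derived from LDM's notes; the data is the 女 example above:

    # JDSW's nyoX for 女 is not among the two SBGY readings
    sbgy_readings = {"女": {"nrjoX", "nrjoH"}}
    jdsw_readings = [("女", "nyoX")]

    for char, reading in jdsw_readings:
        if reading not in sbgy_readings.get(char, set()):
            print(f"{char}: {reading} not attested in SBGY")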

run topic modeling algorithm on annotations

if we generate a dataset of all annotations, we might be able to find some interesting results seeing how the annotations cluster together — whether they have fanqie, include citations, etc.
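
a minimal sketch using scikit-learn, with a character-level bag-of-words; the sample annotations and component count are illustrative:

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    # illustrative sample annotations: fanqie, 音-glosses, a citation
    annotations = ["古猛反", "音嗣", "丁浪反", "說文云草木之怪謂之祅", "音災"]

    X = CountVectorizer(analyzer="char").fit_transform(annotations)
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topics = lda.fit_transform(X)  # one topic distribution per annotation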

parse annotations using a model

annotations seem to have a reliable internal structure which might lend itself well to dependency parsing. perhaps we can define custom terms (for fanqie, qualifiers, citations, etc.) and then use a dependency parser to automatically parse out the interesting parts of each annotation.

clean annotations of commentaries from jdsw

copied/adapted notes from 5/27 meeting:

  1. Look through the JDSW and break it up into a k:v store, where each key is every unbroken sequence of characters prior to an annotation
  2. For each key: value pair...
    a. Look through the source text (same chapter) and find the first instance of the key (unbroken) that occurs after the previous annotation (annotations must be sequential)
    b. If that key is found and it's in the source text (not a commentary), leave it alone in the JDSW
    c. If that key is found and it's in the commentary (indicated in SBCK editions in brackets), drop it from the JDSW
    d. If that key isn't found at all, log it along with the previous and next annotations so that @GDRom can investigate manually

Assumption: If LDM annotates two successive characters, the second annotation refers to the instance of that character that is closest in the source text to the previous character.

this will produce a version of the JDSW that leaves out any annotations referring to commentaries, which we can later align to the 正文 versions.
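
a minimal sketch of steps 2b–2d, assuming jdsw_entries is the ordered k:v store from step 1 and that commentary spans in the SBCK text are wrapped in brackets (the bracket characters here are an assumption):

    import re

    COMMENTARY = re.compile(r"\[[^\]]*\]")  # assumed commentary delimiters

    def clean(jdsw_entries, sbck_text):
        spans = [m.span() for m in COMMENTARY.finditer(sbck_text)]
        kept, cursor = [], 0
        for key, annotation in jdsw_entries:
            idx = sbck_text.find(key, cursor)
            if idx == -1:
                print(f"not found: {key!r}")  # 2d. log for manual investigation
                continue
            cursor = idx + len(key)
            if not any(a <= idx < b for a, b in spans):
                kept.append((key, annotation))  # 2b. keep; commentary hits dropped (2c)
        return kept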

train a span categorizer on jeff & hantao's data

using the categorization system established by tharsen & wang, we can see how spaCy's span categorizer performs on its own, irrespective of tokenization, etc.

it would probably be helpful to have a graphical interface for this via streamlit, for testing out arbitrary annotations and seeing what the model predicts.

correct readings generated by fanqie annotations

via @GDRom:

Secondly, just as a word of caution, medials may complicate things further; the specifics in Baxter's MC are sometimes not indicated well by fanqie, so it would have to be a looser regex in that position (for example, MC dzy- + -jang becomes dzyang; the fanqie may just indicate, however, an initial MC dz- instead of dzy-)

we need to know the valid patterns to correct this.
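
a minimal sketch of what a looser match might look like, assuming (per the caution above) that a fanqie-indicated dz- may stand for dzy- before a -j- medial; the mapping is illustrative, not a vetted table of valid patterns:

    # illustrative only: initials a fanqie speller might stand for before -j-
    LOOSE_BEFORE_J = {"dz": {"dz", "dzy"}}

    def candidate_initials(fanqie_initial, rime):
        if rime.startswith("j"):
            return LOOSE_BEFORE_J.get(fanqie_initial, {fanqie_initial})
        return {fanqie_initial}

    print(candidate_initials("dz", "jang"))  # {'dz', 'dzy'}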
