direct-phonology / jdsw

Parsing the "Jingdian Shiwen" with spaCy

License: MIT License
needs some reworking of txt2conllu to handle some of the texts still remaining.
the README is fairly out of date. lots of things are still in active development, but we should at least make it more true to what's in the repo right now.
depends on #2.
this is helpful for knowing when things actually get loaded from the filesystem, and for showing progress on long-running tasks.
see https://docs.python.org/3/library/fileinput.html, in particular methods like fileinput.filename(), fileinput.lineno(), etc.
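a minimal sketch of what such a script could look like, assuming plain-text input with one record per line (the function name and return shape are illustrative, not the actual script API):

```python
import fileinput

def process(files=None):
    """Read any number of files, or stdin when files is None/empty."""
    records = []
    with fileinput.input(files=files) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # fileinput tracks the current file and cumulative line number,
            # which is useful for logging and progress on long-running tasks.
            records.append((fileinput.filename(), fileinput.lineno(), line))
    return records
```

because fileinput falls back to stdin when no file names are given, the same script works both as `script.py a.txt b.txt` and as `cat a.txt | script.py`.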
this should allow us to remove the *_all.py versions of scripts; all scripts now accept any number of files or read directly from stdin. they also output a single file; no need to maintain the segmentation of the input. hopefully this will help obscure differences in segmentation between the JDSW and SBCK editions of texts.
- sbck2csv (add file name as a column in output)
- xml2conllu accepts a single file and thus remains unchanged.
this will help in cases where we can predict part of a fanqie but the other part is a polyphone. likewise for tones. also, we can push the fanqie combination logic out to its own step.
we should also include the original annotation in the final column, for reference
the implementation of #10 seems to be falsely attributing a substantial portion of annotations to the commentary, when really the annotation occurs first in the source (or vice versa). per @GDRom:
That's 23 of the 34 mismatches. Not sure what happened here -- two hypotheses:
- This might be an issue with greediness of the search pattern?
- This might be related to the algorithm not finding a first match, and is then somewhat off the rails?
Other mismatches (the eleven remaining) would be harder imho to implement, but given the ratio -- 11 out of 55 -- doable by hand afterwards.
see https://github.com/direct-phonology/jdsw/blob/main/out/manual_eval/lunyu/001.txt for a hand-annotated copy indicating where the algorithm was incorrect.
we can use str.maketrans for this. this task isn't a strict prerequisite, but it will improve the tokenization for #32.
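a quick sketch of the str.maketrans approach; the variant pairs below are illustrative placeholders, not a vetted mapping:

```python
# Illustrative variant-character pairs; the real mapping would come
# from the project's own variant lists.
VARIANTS = {
    "户": "戶",  # hypothetical example pair
    "歳": "歲",
}
TABLE = str.maketrans(VARIANTS)

def normalize(text: str) -> str:
    """Replace known variant characters with their canonical forms."""
    return text.translate(TABLE)
```

str.translate with a maketrans table does all replacements in a single pass, so it stays fast even on large texts.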
some text or person names we can preselect/annotate and merge, since we know they'll likely always refer to named entities. the giveaway for most of these is that they precede a 云.
now that we have blank CoNLL-U ready to be annotated, we can structure the code around performing transformations in a pipeline format, similar to spaCy:
pyconll may be useful here. the basic structure of a pipe probably involves initialization (passing in a config) and then a method (__call__, maybe) that the pipeline will call by passing in the output from the previous pipe, and which should return the same type of output as its input.
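the pipe protocol described above could be sketched like this (class names and the string-based demo pipe are illustrative, not the real components):

```python
class Pipe:
    """Base pipe: initialized with a config, called with the previous
    pipe's output, and returns the same type of output."""

    def __init__(self, config=None):
        self.config = config or {}

    def __call__(self, doc):
        # identity by default; subclasses transform and return the doc
        return doc

class Uppercase(Pipe):
    """Toy pipe for demonstration only."""
    def __call__(self, doc):
        return doc.upper()

class Pipeline:
    """Threads a doc through each pipe in order, spaCy-style."""

    def __init__(self, pipes):
        self.pipes = pipes

    def __call__(self, doc):
        for pipe in self.pipes:
            doc = pipe(doc)
        return doc
```

since every pipe takes and returns the same type, pipes can be reordered or dropped without changing the surrounding code.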
this will ultimately be migrated out of this repo along with most of lib/, but it's time to start prototyping some visualizations for Old Chinese documents, inspired by displaCy.
functionality we want to stub/borrow:
the "default" visualization style should use a sensible font and visually highlight annotations, as well as mark them up as such (we can try <ruby>, or look at other things for this, like SVG's <tspan>).
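the <ruby> idea could start as small as a helper that wraps an annotated character with its annotation as ruby text (function name and markup layout are a sketch, not the visualizer's actual API):

```python
import html

def ruby(base: str, annotation: str) -> str:
    """Render a character and its annotation as HTML ruby markup."""
    # html.escape guards against markup characters in the source text
    return (
        f"<ruby>{html.escape(base)}"
        f"<rt>{html.escape(annotation)}</rt></ruby>"
    )
```

browsers render the <rt> text half-size above (or beside) the base character, which is already close to the traditional annotation layout.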
as an exercise, it would be interesting to try to replicate the "classic" look of vertical running text, right-to-left, with annotations rendered half-size but in the same line and format, in red.
We can implement this as a finite-state transducer.
if we are aligning the JDSW to the Kanseki versions of texts, we can "borrow" the sentence segmentation from the Kanseki versions in order to construct a more traditionally .conll-like representation:
# text = 九五:飛龍在天,利見大人。
1 九 … SpaceAfter=No|Translit=kjuwX
2 五 … SpaceAfter=No|Translit=nguX
3 : … SpaceAfter=No
4 飛 … SpaceAfter=No|Translit=pj+j
5 龍 … SpaceAfter=No|Translit=lowng
6 在 … SpaceAfter=No|Translit=dzojX
7 天 … SpaceAfter=No|Translit=then
8 , … SpaceAfter=No
9 利 … SpaceAfter=No|Translit=lijH
10 見 … SpaceAfter=No|Translit=kenH
11 大 … SpaceAfter=No|Translit=daH
12 人 … SpaceAfter=No|Translit=nyin
13 。 … SpaceAfter=No
we could also consider leaving the punctuation in, so that models can learn that it doesn't have an annotation (and annotate <NA> or similar to distinguish it from missing values).
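a sketch of emitting rows in the format above, assuming the Translit key in MISC carries the reading and that a hypothetical <NA> value marks punctuation (the punctuation set here is illustrative):

```python
# Illustrative punctuation set; the real list would be derived from the corpus.
PUNCT = set("::,,。、")

def token_line(idx, form, translit=None):
    """Format one CoNLL-U token line (10 tab-separated columns)."""
    misc = "SpaceAfter=No"
    if form in PUNCT:
        misc += "|Translit=<NA>"  # explicit "no annotation", not missing
    elif translit:
        misc += f"|Translit={translit}"
    # ID, FORM, then LEMMA/UPOS/XPOS/FEATS/HEAD/DEPREL/DEPS left blank
    cols = [str(idx), form] + ["_"] * 7 + [misc]
    return "\t".join(cols)
```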
a quick check should suffice to see whether our big lists of text and person names actually show up in the JDSW.
closed by kanripo/KR1g0003#3.
right now there are a variety of cases where Reconstruction can raise a MultipleReadingsError: when fetching an initial, a rime, or an entire reading. for at least some of these cases, I think we could do something a little smarter. an example:
say the candidate readings are drjangH, drjang, and trjangX. the possible initials are dr and tr; the rime in every case is jang. there's still ambiguity here, but much less ambiguity than simply giving up and not assigning a reading! if we can come up with a systematic way of noting the ambiguity, as B&S do for their OC reconstruction (using things like brackets), we might still salvage some information that would help an algorithm or a human manually correcting the data. for example:
[dr|tr]jang[X|H|_]
and if we annotate each part in a separate field, this might make it into the CoNLL-U as:
MCInitial=[dr/tr]|MCRime=jang|MCTone=[X/H/_]
(using / instead of | since that character is reserved to separate annotations in CoNLL-U MISC and FEATS fields.)
this also helps in the (unfortunately many) cases where LDM did provide an annotation, but one or both of the characters in his fanqie happen to be polyphones.
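a sketch of collapsing candidate readings into the bracketed notation above; the parse of a Baxter-style reading into initial/rime/tone is deliberately naive (initial = leading consonant letters, optional final X/H tone letter) and only meant to illustrate the idea:

```python
import re

# Naive split of a Baxter-notation MC reading: everything before the
# medial/vowel is the "initial", a trailing X/H is the tone, the rest
# is the "rime". This is an assumption for illustration, not a full parser.
READING = re.compile(r"([^aeiouj+]*)(.*?)([XH]?)")

def collapse(readings):
    """Merge candidate readings into bracketed initial/rime/tone notation."""
    parts = []
    for r in readings:
        m = READING.fullmatch(r)
        parts.append((m.group(1), m.group(2), m.group(3) or "_"))

    def join(values):
        uniq = list(dict.fromkeys(values))  # dedupe, keep order
        return uniq[0] if len(uniq) == 1 else "[" + "|".join(uniq) + "]"

    initials, rimes, tones = zip(*parts)
    return join(initials) + join(rimes) + join(tones)
```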
it's probably long past time we did this, just to make repeated actions easier and keep a consistent structure.
Prodigy's combo interface for annotating relations and spans at the same time is actually geared more towards relations than it is spans. Our project is the opposite — the spans are the most important. I think it's worth it to separate the tasks, so that:
- we can use the spans interface for annotating spans, which is somewhat nicer-looking
- the relations task can be handled separately; not sure how sophisticated this needs to be, maybe not very. we've discussed using a new span label HEAD for this.
inspecting the stored labels in the phonologizer showed a few confusing entries:
"子亷反下同",
"居冝反下同",
"恱歳反下同",
"恱絹反下同",
"戸郞反下同",
"於䖍反下同",
"步卜反下同"
it looks like these are all of the form:
XY反下同
which would seem to mean that our "below all the same" multi-fanqie logic needs another look:
https://github.com/direct-phonology/och-g2p/blob/94089afb82012a4cdd5cd52c457e01c3a475857e/scripts/annotate.py#L158-L193
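one possible check: split the 下同 suffix off before the fanqie itself is parsed, so it can't leak into the stored label. the pattern below is an assumption about the annotation shape (two characters, 反, then 下同 or 下皆同), not a full grammar:

```python
import re

# Matches annotations of the form XY反下同 / XY反下皆同
# ("X-Y fanqie; same below").
XIATONG = re.compile(r"^(..)反下皆?同$")

def split_xiatong(annotation):
    """Return (fanqie, applies_below). Non-matching input passes through."""
    m = XIATONG.match(annotation)
    if m:
        return m.group(1) + "反", True
    return annotation, False
```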
for #10, we need sbck versions (including commentary) for:
we also need to re-segment these so that the chapter segmentation matches the jdsw versions.
@GDRom has already done some work to do this manually; we want to see if we can automate it.
#10 is a prerequisite for getting the JDSW in shape to align.
#9 is a prerequisite for getting 正文 versions to align to.
this uses a modified version of the algorithm from #10:
we're still looking for clean 正文 versions of these texts:
note that, for the 公羊 at least, Kanripo suggests a 正文 is available: https://www.kanripo.org/titlesearch?query=%E6%98%A5%E7%A7%8B%E5%85%AC%E7%BE%8A
but, the link to it is broken: https://www.kanripo.org/text/KR1e0005/
maybe Christian Wittern can help with this.
it's been trained on a dataset that closely matches ours, so it's much better than working from scratch.
in several cases, the text of the jingdian shiwen from kanripo (KR1g0003) is missing entire pages, which can lead to chapters in the text being "conjoined" (see below for example).
we probably need to address this by updating the text at the source (https://github.com/kanripo/KR1g0003) via pull request and manually inserting the correct text. there are scans of the SBCK edition on Wikimedia and CTEXT has a version with simplified characters for reference.
(古猛/反)仲夏(戸嫁反/下同)謂食(音/嗣)齊(才細反/下皆同)頒爵(音/班)必¶
當(丁浪/反)媒氏(音/梅)而取(音娶本/又作娶)稽士(古兮/反)之烖(音/災)妖¶
孽(又作蠥魚列反妖又作祅說文云衣服歌謡/草木之怪謂之祅禽獸蟲蝗之怪謂之蠥)螟(亡丁/反)螽¶
(�)¶
pb:KR1g0003_SBCK_012-8a¶
pb:KR1g0003_SBCK_012-9a¶
(苦浪反又音/剛又户剛反)與茵(音/因)縮二(所六/反)以犢(音獨本/亦作特)相朝(直/遥)¶
(反下及/注同)灌用(古亂反/注同)鬱鬯(丑亮/反)脯醢(上音甫/下音海)繁纓¶
this helps with faster POS annotation in prodigy. the order:
There are instances in which LDM's JDSW correctly notes a reading not included in the SBGY. Our current approach fails to take these instances into account. These instances are rare, however, and often tied to archaic texts (like the Shangshu).
Examples thus far encountered include:
Suggested approach:
if we generate a dataset of all annotations, we might be able to find some interesting results seeing how the annotations cluster together — whether they have fanqie, include citations, etc.
annotations seem to have a reliable internal structure which might lend itself well to dependency parsing. perhaps we can define custom terms (for fanqie, qualifiers, citations, etc.) and then use a dependency parser to automatically parse out the interesting parts of each annotation.
copied/adapted notes from 5/27 meeting:
Assumption: If LDM annotates two successive characters, the second annotation refers to the instance of that character that is closest in the source text to the previous character.
this will produce a version of the JDSW that leaves out any annotations referring to commentaries, which we can later align to the 正文 versions.
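one reading of that assumption can be sketched as a simple forward scan: each annotated character is aligned to its first occurrence in the source at or after the previous match (names and the return shape are illustrative):

```python
def align(annotated_chars, source):
    """Align each annotated character to the nearest occurrence in the
    source text at or after the previous match.

    Returns a list of (char, source_index), with None when no
    occurrence is found."""
    positions = []
    cursor = 0
    for ch in annotated_chars:
        idx = source.find(ch, cursor)
        if idx == -1:
            positions.append((ch, None))
            continue
        positions.append((ch, idx))
        cursor = idx + 1  # only look forward from the last match
    return positions
```

the forward-only cursor is what encodes the "closest to the previous character" assumption; repeated characters in the source are disambiguated by annotation order.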
using the categorization system established by tharsen & wang, we can see how spaCy's span categorizer performs on its own, irrespective of tokenization, etc.
it would probably be helpful to have a graphical interface for this via streamlit, for testing out arbitrary annotations and seeing what the model predicts.
via @GDRom:
Secondly, just as a word of caution, medials may complicate things further; the specifics in Baxter's MC is sometimes not indicated well by fanqie, so it would have to be a looser Regex in that position (for example MC dzy- -jang becomes dzyang; the fanqie may just indicate, however, an initial MC dz instead of dzy-)
we need to know the valid patterns to correct this.
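pending the full list of valid patterns, the looser match could be sketched as a regex that allows an optional glide after the fanqie-derived initial; the glide set here is an assumption for illustration only:

```python
import re

def initial_pattern(initial):
    """Build a loose pattern for an initial that also accepts a
    following medial glide (e.g. dz- should also match dzy-)."""
    # [jy]? is an assumed glide set, not the full inventory of valid medials
    return re.compile(re.escape(initial) + r"[jy]?")

def initial_matches(initial, reading):
    """True if the reading could begin with this (possibly loose) initial."""
    return initial_pattern(initial).match(reading) is not None
```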
we could cover some frequently missed characters, like 户, with semantic variant lookups.
So that they can be shared in .jsonl form rather than stuck in prodigy's database. See db-out.