Classical Chinese Model needed (nlp-cube, open issue, 31 comments)

adobe commented on May 11, 2024
Classical Chinese Model needed

Comments (31)

tiberiu44 commented on May 11, 2024

I see. I can imagine joint sentence segmentation and parsing working with the arc transition system: whenever the stack is emptied, a sentence boundary should be generated.
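
A minimal sketch of that idea, with a hypothetical next_action classifier standing in for the trained model (illustrative only, not NLP-Cube's actual implementation):

def parse_and_segment(tokens, next_action):
    """Toy arc-standard loop that also emits sentence boundaries.

    next_action(stack, buffer) -> "SHIFT" | "LEFT_ARC" | "RIGHT_ARC" | "POP_ROOT"
    """
    stack, buffer = [], list(range(len(tokens)))
    arcs, boundaries = [], []
    while buffer or stack:
        action = next_action(stack, buffer)
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT_ARC":      # stack[-1] governs stack[-2]
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif action == "RIGHT_ARC":     # stack[-2] governs stack[-1]
            dep = stack.pop()
            arcs.append((stack[-1], dep))
        elif action == "POP_ROOT":      # attach the last remaining word to ROOT
            arcs.append((-1, stack.pop()))
        if not stack:                   # stack emptied: a sentence ended here
            boundaries.append(len(tokens) - len(buffer))
    return arcs, boundaries

Each boundary records how many tokens have been consumed so far, so consecutive boundaries delimit the predicted sentences.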

We've finished work on the parser and tagger for version 2.0, but we still haven't found a good solution for tokenization/sentence splitting.

I think I will give this new approach a try, but it will take some time to implement. I'll let you know when it's done and maybe you can test it on your corpus.

Thanks for the feedback,
Tibi

tiberiu44 commented on May 11, 2024

@KoichiYasuoka - I hope you are doing well in this time of crisis.

It's been a long time since our last progress update on this issue. We started training the 2.0 models for NLP-Cube and they should be out soon. I saw the Classical Chinese corpus in the UD Treebanks (v2.5). The model will be included in this release. Congratulations and thank you for your work.

You might also be interested to know that we are setting up a "model zoo" for NLP-Cube, so contributors can publish their pre-trained models. We will try to make research attribution easy by printing a banner with copyright and/or citation options for these models.

tiberiu44 commented on May 11, 2024

Hi @KoichiYasuoka,

We've finished releasing the current version of NLP-Cube, and we included the Classical Chinese model from UD 2.7. Sentence segmentation seems to be problematic for this treebank. You can check branch 3.0 of the repo for more info: https://github.com/adobe/NLP-Cube/tree/3.0

If you have any suggestions regarding sentence segmentation, please let me know. Right now we are using xlm-roberta-base for language modeling, but maybe there is some other LM that can provide better results.

Best,
Tiberiu

tiberiu44 commented on May 11, 2024

This is perfect. I will use your model to train the Classical Chinese pipeline:

python3 cube/trainer.py --task=tokenizer --train=scripts/train/2.7/language/lzh.yaml --store=data/lzh-trf-tokenizer --num-workers=0 --lm-device=cuda:0 --gpus=1 --lm-model=transformer:KoichiYasuoka/roberta-classical-chinese-large-char

Given that this is a dedicated model, I hope it will provide better results than any other LM.

Thank you for this.

tiberiu44 commented on May 11, 2024

Thank you for the feedback. I'm working on that right now. Hope to get it fixed soon.

tiberiu44 commented on May 11, 2024

I looked over the corpus, and I see there are no delimiters (punctuation marks) for sentences. Is this OK?

KoichiYasuoka commented on May 11, 2024

Yes, OK. Classical Chinese does not have any punctuation or spaces between words or sentences. Therefore, in my humble opinion, tokenization is a hard task without POS-tagging, and sentence segmentation is a hard task without dependency parsing...

tiberiu44 commented on May 11, 2024

I think we could go for joint POS-tagging and tokenization. Unfortunately, the algorithm we use for dependency parsing requires building an N×N matrix over all N words, which is likely to cause an out-of-memory error if we use all tokens. Do you know of any other approach that does not require dependency parsing for sentence segmentation?
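
For a rough sense of scale (a back-of-the-envelope sketch, not a measurement of NLP-Cube itself): with one float32 score per head-dependent pair, the arc-score matrix alone grows quadratically in N.

# Size of an N x N float32 arc-score matrix (4 bytes per score).
# A real biaffine-style parser also stores label scores and activations,
# so actual memory use is higher still.
for n in (1_000, 10_000, 100_000):
    print(f"N={n:>7,}: {n*n*4/2**30:10.3f} GiB")
# N=  1,000:      0.004 GiB
# N= 10,000:      0.373 GiB
# N=100,000:     37.253 GiB  (a whole unsegmented document at once)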

KoichiYasuoka commented on May 11, 2024

Umm... I only know the Straka & Straková (2017) approach using dynamic programming (see section 4.3), but it requires tentative parse trees...

tiberiu44 commented on May 11, 2024

@KoichiYasuoka - I haven't had any success with the tokenizer/sentence splitter so far. We are working on rolling out version 2.0, which uses a single model conditionally trained with language embeddings. We have great accuracy figures for the parser and tagger. However, we are still experiencing difficulties with the tokenizer (for all languages).

We tried joint tagging/parsing and tokenization, but we simply got the same results as when doing the tasks independently. Any suggestions on how to proceed?

KoichiYasuoka commented on May 11, 2024

Umm... For Japanese tokenization (word splitting) and POS-tagging, we often apply Conditional Random Fields, as in Kudo et al. (2004). For Classical Chinese, we also use a CRF in our UD-Kanbun.

For sentence segmentation in Classical Chinese, recent progress has been made by Hu et al. (2019) at https://seg.shenshen.wiki/. Hu et al. use a BERT model trained on an enormous corpus of Classical Chinese texts, some 3.3×10⁹ characters...
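
For reference, the character-level labeling formulation behind such CRF tokenizers looks roughly like this; a minimal sketch (illustrative only, not the actual UD-Kanbun code):

def char_tags(words):
    """Per-character BIES tags for a word-segmented sentence."""
    tags=[]
    for w in words:
        if len(w)==1:
            tags.append("S")                    # single-character word
        else:
            tags+=["B"]+["I"]*(len(w)-2)+["E"]  # begin/inside/end of a word
    return tags

print(char_tags(["君子","乎"]))  # ['B', 'E', 'S']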

KoichiYasuoka commented on May 11, 2024

@tiberiu44 - Thank you for using our UD_Classical_Chinese-Kyoto for your NLP-Cube. We've just finished adding 19 more volumes from "禮記" into https://github.com/UniversalDependencies/UD_Classical_Chinese-Kyoto/tree/dev for the v2.6 release of UD Treebanks (scheduled for May 15, 2020). Enjoy!

KoichiYasuoka commented on May 11, 2024

Thank you @tiberiu44 for releasing NLP-Cube 3.0. But, well, pytorch-lightning==1.1.7 is too old for the recent torchtext==0.10.0, so I use pytorch-lightning==1.2.10 instead:

>>> from cube.api import Cube
>>> nlp=Cube()
>>> nlp.load("lzh")
>>> doc=nlp("不入虎穴不得虎子")
>>> print(doc)
1	不入虎穴不得虎子	叔津	PROPN	n,名詞,人,複合的人名	NameType=Prs	0	root	_	_

Umm... tokenization of classical Chinese doesn't work here...

tiberiu44 commented on May 11, 2024

Yes, I see something is definitely wrong with the model. I just tried your example and tokenization did not work. However, on longer examples it seems to behave differently:

Out[13]:
1	子曰學而時習之不亦說乎	子春城	PROPN	n,名詞,人,名	NameType=Giv	2	nsubj	_	_
2	有	有	VERB	v,動詞,存在,存在	_	0	root	_	_
3	朋	朋	NOUN	n,名詞,人,関係	_	2	obj	_	_
4	自	自	ADP	v,前置詞,経由,*	_	6	case	_	_
5	遠	遠	VERB	v,動詞,描写,量	Degree=Pos|VerbForm=Part	6	amod	_	_
6	方	方	NOUN	n,名詞,固定物,関係	Case=Loc	7	obl	_	_
7	來	來	VERB	v,動詞,行為,移動	_	2	ccomp	_	_
8	不	不	ADV	v,副詞,否定,無界	Polarity=Neg	14	advmod	_	_
9	亦	亦	ADV	v,副詞,頻度,重複	_	10	advmod	_	_
10	樂	樂	VERB	v,動詞,行為,態度	_	2	conj	_	_
11	乎	乎	ADP	v,前置詞,基盤,*	_	12	case	_	_
12	人	人	NOUN	n,名詞,人,人	_	7	obl	_	_
13	不	不	ADV	v,副詞,否定,無界	Polarity=Neg	14	advmod	_	_
14	知	知	VERB	v,動詞,行為,動作	_	10	parataxis	_	_

1	而	而	CCONJ	p,助詞,接続,並列	_	3	advmod	_	_
2	不	不	ADV	v,副詞,否定,無界	Polarity=Neg	3	advmod	_	_
3	慍	慍	VERB	v,動詞,行為,態度	_	6	csubj	_	_
4	不	不	ADV	v,副詞,否定,無界	Polarity=Neg	6	advmod	_	_
5	亦	亦	ADV	v,副詞,頻度,重複	_	6	advmod	_	_
6	君子	君子	NOUN	n,名詞,人,役割	_	0	root	_	_
7	乎	乎	PART	p,助詞,句末,*	_	6	discourse:sp	_	_

I will try retraining the tokenizer with a different LM.

KoichiYasuoka commented on May 11, 2024

Umm... the first eleven characters seem untokenized:

>>> from cube.api import Cube
>>> nlp=Cube()
>>> nlp.load("lzh")
>>> doc=nlp("子曰道千乘之國敬事而信節用而愛人使民以時")
>>> print(doc)
1	子曰道千乘之國敬事而信	子春于	PROPN	n,名詞,人,名	NameType=Giv	2	nsubj	_	_
2	節	節	VERB	v,動詞,描写,態度	Degree=Pos	0	root	_	_
3	用	用	VERB	v,動詞,行為,動作	_	2	flat:vv	_	_

1	而	而	CCONJ	p,助詞,接続,並列	_	2	advmod	_	_
2	愛	愛	VERB	v,動詞,行為,交流	_	6	csubj	_	_
3	人	人	NOUN	n,名詞,人,人	_	2	obj	_	_
4	使	使	VERB	v,動詞,行為,使役	_	2	parataxis	_	_
5	民	民	NOUN	n,名詞,人,人	_	4	obj	_	_
6	以	以	VERB	v,動詞,行為,動作	_	0	root	_	_
7	時	時	NOUN	n,名詞,時,*	Case=Tem	6	obj	_	_

tiberiu44 commented on May 11, 2024

Yes, this seems to be a recurring issue with any text I try. I'm retraining the tokenizer/sentence splitter right now (it will take a couple of hours). Hopefully this will solve the problem. I'll let you know as soon as I publish the new model.

KoichiYasuoka commented on May 11, 2024

Thank you @tiberiu44, and I will wait for the new tokenizer. Ah, well, for sentence segmentation of Classical Chinese, I released https://huggingface.co/KoichiYasuoka/roberta-classical-chinese-large-char and https://github.com/KoichiYasuoka/SuPar-Kanbun, using the segmentation algorithm of 一种基于循环神经网络的古文断句方法 (a recurrent-neural-network method for sentence segmentation of classical texts). I hope these help you.

KoichiYasuoka commented on May 11, 2024

Thank you @tiberiu44 for releasing nlpcube 0.3.0.7. I tried the new Classical Chinese model with pytorch-lightning==1.2.10 and torchtext==0.10.0:

>>> from cube.api import Cube
>>> nlp=Cube()
>>> nlp.load("lzh")
>>> doc=nlp("不入虎穴不得虎子")
>>> print(doc)
1	不	不	ADV	v,副詞,否定,無界	Polarity=Neg	2	advmod	_	_
2	入	入	VERB	v,動詞,行為,移動	_	0	root	_	_
3	虎	虎	NOUN	n,名詞,主体,動物	_	4	nmod	_	_
4	穴	<UNK>	NOUN	n,名詞,可搬,道具	_	2	obj	_	_

1	不	不	ADV	v,副詞,否定,無界	Polarity=Neg	2	advmod	_	_
2	得	得	VERB	v,動詞,行為,得失	_	0	root	_	_

1	虎	虎	NOUN	n,名詞,主体,動物	_	0	root	_	_

1	子	子產	PROPN	n,名詞,人,名	NameType=Giv	0	root	_	_

The tokenization seems to work well this time. Now the problem is the sentence segmentation...

tiberiu44 commented on May 11, 2024

So far, I only got a sentence F-score of 20 (best result using your RoBERTa model):

Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     98.40 |     97.34 |     97.87 |
Sentences  |     34.06 |     15.03 |     20.86 |
Words      |     98.40 |     97.34 |     97.87 |
UPOS       |     92.36 |     91.37 |     91.86 |     93.86
XPOS       |     89.27 |     88.31 |     88.78 |     90.72
UFeats     |     92.95 |     91.95 |     92.45 |     94.46
AllTags    |     87.35 |     86.41 |     86.88 |     88.77
Lemmas     |     92.01 |     91.02 |     91.51 |     93.51
UAS        |     66.76 |     66.04 |     66.40 |     67.84
LAS        |     61.46 |     60.80 |     61.13 |     62.46
CLAS       |     60.49 |     59.19 |     59.83 |     60.96
MLAS       |     56.81 |     55.59 |     56.20 |     57.25
BLEX       |     56.06 |     54.86 |     55.45 |     56.49

The UAS and LAS scores are low because every time the system gets a sentence boundary wrong, it also mislabels the root node.

KoichiYasuoka commented on May 11, 2024

20.86% is much worse than the result (80%) of 一种基于循环神经网络的古文断句方法. OK, let me try it myself with transformers on Google Colab:

!pip install 'transformers>=4.7.0' datasets seqeval
!test -d UD_Classical_Chinese-Kyoto || git clone https://github.com/universaldependencies/UD_Classical_Chinese-Kyoto
!test -f run_ner.py || curl -LO https://raw.githubusercontent.com/huggingface/transformers/v`pip list | sed -n 's/^transformers *\([^ ]*\) *$/\1/p'`/examples/pytorch/token-classification/run_ner.py

# Convert each UD .conllu file into character-level JSON for run_ner.py.
# Tags mark sentence positions: S = one-character sentence, B = first
# character, M = middle, E3/E2/E = last three characters of a sentence.
for d in ["train","dev","test"]:
  with open("UD_Classical_Chinese-Kyoto/lzh_kyoto-ud-"+d+".conllu","r",encoding="utf-8") as f:
    r=f.read()
  with open(d+".json","w",encoding="utf-8") as f:
    tokens=[]
    tags=[]
    i=0  # characters accumulated in the current sentence
    for s in r.split("\n"):
      t=s.split("\t")
      if len(t)==10:  # token line: collect its characters
        for c in t[1]:
          tokens.append(c)
          i+=1
      else:  # blank/comment line: the sentence ended, emit its tags
        if i==1:
          tags.append("S")
        elif i==2:
          tags+=["B","E"]
        elif i==3:
          tags+=["B","E2","E"]
        elif i>3:
          tags+=["B"]+["M"]*(i-4)+["E3","E2","E"]
        i=0
        if len(tokens)>80:  # flush one training example of >80 characters
          print("{\"tokens\":[\""+"\",\"".join(tokens)+"\"],\"tags\":[\""+"\",\"".join(tags)+"\"]}",file=f)
          tokens=[]
          tags=[]

!python run_ner.py --model_name_or_path KoichiYasuoka/roberta-classical-chinese-large-char --train_file train.json --validation_file dev.json --test_file test.json --output_dir my.danku --do_train --do_eval

I got "eval metrics" as follows:

***** eval metrics *****
  epoch                   =        3.0
  eval_accuracy           =     0.9212
  eval_f1                 =     0.8995
  eval_loss               =     0.2794
  eval_precision          =     0.8991
  eval_recall             =     0.8998
  eval_runtime            = 0:00:09.70
  eval_samples            =        329
  eval_samples_per_second =     33.901
  eval_steps_per_second   =      4.328

Then I tried to sentencize the paragraph I wrote two years ago (#100 (comment)):

import torch
from transformers import AutoTokenizer,AutoModelForTokenClassification
tkz=AutoTokenizer.from_pretrained("my.danku")
mdl=AutoModelForTokenClassification.from_pretrained("my.danku")
s="天平二年正月十三日萃于帥老之宅申宴會也于時初春令月氣淑風和梅披鏡前之粉蘭薰珮後之香加以曙嶺移雲松掛羅而傾盖夕岫結霧鳥封縠而迷林庭舞新蝶空歸故鴈於是盖天坐地促膝飛觴忘言一室之裏開衿煙霞之外淡然自放快然自足若非翰苑何以攄情詩紀落梅之篇古今夫何異矣宜賦園梅聊成短詠"
e=tkz.encode(s,return_tensors="pt")
# best tag per character, dropping the [CLS]/[SEP] positions
p=[mdl.config.id2label[q] for q in torch.argmax(mdl(e)[0],dim=2)[0].tolist()[1:-1]]
# append "。" after sentence-final (E) and single-character (S) tags
print("".join(c+"。" if q=="E" or q=="S" else c for c,q in zip(s,p)))

And I got the result "天平二年正月十三日萃于帥老之宅。申宴會也。于時初春令月。氣淑風和。梅披鏡前之粉。蘭薰珮後之香。加以曙嶺移雲。松掛羅而傾盖。夕岫結霧。鳥封縠而迷林。庭舞新蝶。空歸故鴈。於是盖天坐地。促膝飛觴。忘言一室之裏。開衿煙霞之外。淡然自放。快然自足。若非翰苑何以攄情。詩紀落梅之篇。古今夫何異矣。宜賦園梅。聊成短詠。"
How about your system @tiberiu44?

tiberiu44 commented on May 11, 2024

Unfortunately, I cannot run the test right now and I will be away from keyboard most of the day. I will try your approach with transformers tomorrow.

The latest models are pushed if you want to try them. If you have already loaded lzh, you will need to trigger a redownload of the model.

The easiest way is to remove all lzh files located in ~/.nlpcube/3.0 (anything that starts with lzh, including a folder).
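
For example, a hypothetical cleanup snippet (assuming the cache location mentioned above; adjust if your setup differs):

import pathlib, shutil

# Delete every cached lzh artifact so nlp.load("lzh") re-downloads the model.
for p in pathlib.Path.home().glob(".nlpcube/3.0/lzh*"):
    if p.is_dir():
        shutil.rmtree(p)   # the model folder
    else:
        p.unlink()         # loose lzh files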

KoichiYasuoka commented on May 11, 2024

Thank you @tiberiu44 for releasing nlpcube 0.3.1.0. I cleaned up my ~/.nlpcube/3.0/lzh:

>>> from cube.api import Cube
>>> nlp=Cube()
>>> nlp.load("lzh")
>>> doc=nlp("天平二年正月十三日萃于帥老之宅申宴會也于時初春令月氣淑風和梅披鏡前之粉蘭薰珮後之香加以曙嶺移雲松掛羅而傾盖夕岫結霧鳥封縠而迷林庭舞新蝶空歸故鴈於是盖天坐地促膝飛觴忘言一室之裏開衿煙霞之外淡然自放快然自足若非翰苑何以攄情詩紀落梅之篇古今夫何異矣宜賦園梅聊成短詠")
>>> print("".join(s.text.replace(" ","")+"。" for s in doc.sentences))

And I got the result "天平二年正月十三日萃于帥老之宅申宴會也。于時初春令月氣淑風和。梅披鏡前之粉蘭薰珮後之香。加以曙嶺移雲松掛羅而傾盖。夕岫結霧。鳥封縠而迷林庭舞新蝶空歸故鴈。於是盖天坐地促膝飛觴忘言一室之裏開衿煙霞之外淡然自放快然自足若非翰苑何以攄情。詩紀落梅之篇古今夫何異矣。宜賦園梅。聊。成。短詠。" Umm... "聊。成。短詠。" seems meaningless, but the other segmentations are rather good. Then, how do we improve...

tiberiu44 commented on May 11, 2024

On your previous example, the current version of the tokenizer generates this sentence segmentation:

1	天平	天平	NOUN	n,名詞,時,*	Case=Tem	3	nmod	_	_
2	二	二	NUM	n,数詞,数字,*	_	3	nummod	_	_
3	年	年	NOUN	n,名詞,時,*	Case=Tem	8	obl:tmod	_	_
4	正	正	NOUN	n,名詞,時,*	_	5	amod	_	_
5	月	月	NOUN	n,名詞,時,*	Case=Tem	8	obl:tmod	_	_
6	十三	十三	NUM	n,数詞,数,*	_	7	nummod	_	_
7	日	日	NOUN	n,名詞,時,*	Case=Tem	8	obl:tmod	_	_
8	萃	<UNK>	VERB	v,動詞,行為,動作	_	0	root	_	_
9	于	于	ADP	v,前置詞,基盤,*	_	13	case	_	_
10	帥	帥	NOUN	n,名詞,人,役割	_	11	amod	_	_
11	老	老	NOUN	n,名詞,人,人	_	13	nmod	_	_
12	之	之	SCONJ	p,助詞,接続,属格	_	11	case	_	_
13	宅	宅	NOUN	n,名詞,固定物,建造物	Case=Loc	8	obl:lmod	_	_
14	申	申	VERB	v,動詞,行為,動作	_	8	parataxis	_	_
15	宴	宴	VERB	v,動詞,行為,交流	VerbForm=Part	14	obj	_	_
16	會	會	VERB	v,動詞,行為,交流	_	15	flat:vv	_	_
17	也	也	PART	p,助詞,句末,*	_	8	discourse:sp	_	_

1	于	于	ADP	v,前置詞,基盤,*	_	2	case	_	_
2	時	時	NOUN	n,名詞,時,*	Case=Tem	8	obl:tmod	_	_
3	初	初	NOUN	n,名詞,時,*	Case=Tem	4	nmod	_	_
4	春	春	NOUN	n,名詞,時,*	Case=Tem	6	nmod	_	_
5	令	令	NOUN	n,名詞,人,役割	_	6	nmod	_	_
6	月	月	NOUN	n,名詞,時,*	Case=Tem	8	nsubj	_	_
7	氣	氣	NOUN	n,名詞,描写,形質	_	8	nsubj	_	_
8	淑	淑	VERB	v,動詞,描写,態度	Degree=Pos	0	root	_	_
9	風	風	NOUN	n,名詞,天象,気象	_	10	nsubj	_	_
10	和	和	VERB	v,動詞,描写,形質	Degree=Pos	8	conj	_	_

1	梅	梅	NOUN	n,名詞,固定物,樹木	_	2	nsubj	_	_
2	披	披	VERB	v,動詞,行為,動作	_	0	root	_	_
3	鏡	<UNK>	NOUN	n,名詞,可搬,道具	_	4	nmod	_	_
4	前	前	NOUN	n,名詞,固定物,関係	Case=Loc	6	nmod	_	_
5	之	之	SCONJ	p,助詞,接続,属格	_	4	case	_	_
6	粉	<UNK>	NOUN	n,名詞,不可譲,身体	_	2	obj	_	_

1	蘭	蘭	NOUN	n,名詞,可搬,道具	_	2	nsubj	_	_
2	薰	<UNK>	NOUN	n,名詞,可搬,道具	_	0	root	_	_
3	珮	<UNK>	NOUN	n,名詞,可搬,道具	_	4	nmod	_	_
4	後	後	NOUN	n,名詞,固定物,関係	Case=Tem	6	nmod	_	_
5	之	之	SCONJ	p,助詞,接続,属格	_	4	case	_	_
6	香	香	NOUN	n,名詞,描写,形質	_	2	obj	_	_

1	加	加	VERB	v,動詞,行為,得失	_	5	advmod	_	_
2	以	以	VERB	v,動詞,行為,動作	_	5	advcl	_	_
3	曙	<UNK>	NOUN	n,名詞,描写,形質	_	4	nmod	_	_
4	嶺	<UNK>	NOUN	n,名詞,固定物,地形	Case=Loc	2	obj	_	_
5	移	移	VERB	v,動詞,行為,移動	_	0	root	_	_
6	雲	雲	NOUN	n,名詞,天象,気象	_	5	obj	_	_

1	松	松	PROPN	n,名詞,人,名	NameType=Giv	0	root	_	_

1	掛	<UNK>	VERB	v,動詞,行為,動作	_	0	root	_	_
2	羅	羅	NOUN	n,名詞,可搬,道具	_	1	obj	_	_
3	而	而	CCONJ	p,助詞,接続,並列	_	4	cc	_	_
4	傾	傾	VERB	v,動詞,行為,動作	_	1	conj	_	_
5	盖	<UNK>	NOUN	n,名詞,可搬,道具	_	4	obj	_	_

1	夕	夕	NOUN	n,名詞,時,*	Case=Tem	2	nmod	_	_
2	岫	<UNK>	NOUN	n,名詞,固定物,地形	Case=Loc	3	nsubj	_	_
3	結	結	VERB	v,動詞,行為,動作	_	0	root	_	_
4	霧	<UNK>	NOUN	n,名詞,可搬,道具	_	3	obj	_	_

1	鳥	鳥	NOUN	n,名詞,主体,動物	_	2	nsubj	_	_
2	封	封	VERB	v,動詞,行為,役割	_	45	csubj	_	_
3	縠	<UNK>	NOUN	n,名詞,可搬,道具	_	2	obj	_	_
4	而	而	CCONJ	p,助詞,接続,並列	_	5	cc	_	_
5	迷	<UNK>	VERB	v,動詞,行為,動作	_	2	conj	_	_
6	林	林	NOUN	n,名詞,固定物,地形	Case=Loc	31	obj	_	_
7	庭	庭	NOUN	n,名詞,固定物,建造物	Case=Loc	40	obl:lmod	_	_
8	舞	舞	VERB	v,動詞,行為,動作	_	2	conj	_	_
9	新	新	VERB	v,動詞,描写,形質	Degree=Pos|VerbForm=Part	10	amod	_	_
10	蝶	<UNK>	NOUN	n,名詞,可搬,道具	_	5	obj	_	_
11	空	空	ADV	v,動詞,描写,形質	Degree=Pos|VerbForm=Conv	40	advmod	_	_
12	歸	歸	VERB	v,動詞,行為,移動	_	2	conj	_	_
13	故	故	NOUN	n,名詞,時,*	Case=Tem	14	nmod	_	_
14	鴈	<UNK>	NOUN	n,名詞,主体,動物	_	40	nsubj	_	_
15	於	於	ADP	v,前置詞,基盤,*	_	16	case	_	_
16	是	是	PRON	n,代名詞,指示,*	PronType=Dem	2	obl	_	_
17	盖	<UNK>	NOUN	n,名詞,不可譲,身体	_	40	nsubj	_	_
18	天	天	NOUN	n,名詞,制度,場	Case=Loc	2	obl	_	_
19	坐	坐	VERB	v,動詞,行為,動作	_	2	conj	_	_
20	地	地	NOUN	n,名詞,固定物,地形	Case=Loc	5	obj	_	_
21	促	<UNK>	VERB	v,動詞,行為,動作	_	2	conj	_	_
22	膝	<UNK>	NOUN	n,名詞,可搬,道具	_	31	obj	_	_
23	飛	飛	VERB	v,動詞,行為,動作	_	2	conj	_	_
24	觴	<UNK>	NOUN	n,名詞,可搬,道具	_	31	obj	_	_
25	忘	忘	VERB	v,動詞,行為,動作	_	2	conj	_	_
26	言	言	NOUN	n,名詞,可搬,伝達	_	31	obj	_	_
27	一	一	NUM	n,数詞,数字,*	_	28	nummod	_	_
28	室	室	NOUN	n,名詞,固定物,建造物	Case=Loc	36	nmod	_	_
29	之	之	SCONJ	p,助詞,接続,属格	_	28	case	_	_
30	裏	<UNK>	NOUN	n,名詞,固定物,関係	Case=Loc	2	conj	_	_
31	開	開	VERB	v,動詞,行為,動作	_	2	conj	_	_
32	衿	<UNK>	NOUN	n,名詞,不可譲,身体	_	31	obj	_	_
33	煙	<UNK>	NOUN	n,名詞,固定物,樹木	_	31	obj	_	_
34	霞	<UNK>	NOUN	n,名詞,固定物,樹木	_	33	flat	_	_
35	之	之	SCONJ	p,助詞,接続,属格	_	28	case	_	_
36	外	外	NOUN	n,名詞,固定物,関係	Case=Loc	2	obj	_	_
37	淡	<UNK>	ADV	v,動詞,描写,形質	Degree=Pos|VerbForm=Conv	2	conj	_	_
38	然	然	PART	p,接尾辞,*,*	_	37	fixed	_	_
39	自	自	PRON	n,代名詞,人称,他	PronType=Prs|Reflex=Yes	40	nsubj	_	_
40	放	放	VERB	v,動詞,行為,動作	_	2	conj	_	_
41	快	<UNK>	VERB	v,動詞,描写,態度	Degree=Pos	40	advmod	_	_
42	然	然	PART	p,接尾辞,*,*	_	37	fixed	_	_
43	自	自	PRON	n,代名詞,人称,他	PronType=Prs|Reflex=Yes	50	obj	_	_
44	足	足	VERB	v,動詞,描写,量	Degree=Pos	2	conj	_	_
45	若	若	VERB	v,動詞,行為,分類	Degree=Equ	0	root	_	_
46	非	非	ADV	v,副詞,否定,体言否定	Polarity=Neg	48	amod	_	_
47	翰	翰	NOUN	n,名詞,可搬,道具	_	48	nmod	_	_
48	苑	苑	NOUN	n,名詞,固定物,建造物	Case=Loc	51	nsubj	_	_
49	何	何	PRON	n,代名詞,疑問,*	PronType=Int	50	obj	_	_
50	以	以	VERB	v,動詞,行為,動作	_	51	advcl	_	_
51	攄	<UNK>	VERB	v,動詞,行為,動作	_	44	parataxis	_	_
52	情	情	NOUN	n,名詞,描写,態度	_	51	obj	_	_

1	詩	詩	NOUN	n,名詞,主体,書物	_	2	nsubj	_	_
2	紀	紀	VERB	v,動詞,行為,動作	_	0	root	_	_
3	落	落	VERB	v,動詞,行為,移動	VerbForm=Part	4	amod	_	_
4	梅	梅	NOUN	n,名詞,固定物,樹木	_	6	nmod	_	_
5	之	之	SCONJ	p,助詞,接続,属格	_	4	case	_	_
6	篇	篇	NOUN	n,名詞,可搬,伝達	_	2	obj	_	_

1	古	古	NOUN	n,名詞,時,*	Case=Tem	5	nsubj	_	_
2	今	今	NOUN	n,名詞,時,*	Case=Tem	1	conj	_	_
3	夫	夫	PART	p,助詞,句頭,*	_	5	discourse	_	_
4	何	何	ADV	v,副詞,疑問,原因	AdvType=Cau	5	advmod	_	_
5	異	異	VERB	v,動詞,描写,形質	Degree=Pos	0	root	_	_
6	矣	矣	PART	p,助詞,句末,*	_	5	discourse:sp	_	_

1	宜	宜	AUX	v,助動詞,必要,*	Mood=Nec	2	aux	_	_
2	賦	賦	VERB	v,動詞,行為,動作	_	0	root	_	_
3	園	園	NOUN	n,名詞,固定物,建造物	Case=Loc	4	nmod	_	_
4	梅	梅	NOUN	n,名詞,固定物,樹木	_	2	obj	_	_

1	聊	<UNK>	ADV	v,動詞,行為,動作	VerbForm=Conv	2	advmod	_	_
2	成	成	VERB	v,動詞,行為,生産	_	0	root	_	_
3	短	短	VERB	v,動詞,描写,量	Degree=Pos	4	advmod	_	_
4	詠	詠	VERB	v,動詞,行為,伝達	_	2	ccomp	_	_

Is this an improvement?

KoichiYasuoka commented on May 11, 2024

Yes, yes @tiberiu44, the results seem much better, except for "松". But I could not download the improved model after I cleaned up ~/.nlpcube/3.0/lzh. Well, has the new model been released?

tiberiu44 commented on May 11, 2024

It's not published yet. The sentence segmentation is still bad, and tokenization is worse:

Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     93.29 |     92.62 |     92.96 |
Sentences  |     27.12 |      7.65 |     11.94 |
Words      |     93.29 |     92.62 |     92.96 |
UPOS       |     87.02 |     86.40 |     86.71 |     93.28
XPOS       |     84.06 |     83.46 |     83.76 |     90.11
UFeats     |     88.16 |     87.53 |     87.84 |     94.50
AllTags    |     82.22 |     81.64 |     81.93 |     88.14
Lemmas     |     89.80 |     89.15 |     89.47 |     96.26
UAS        |     43.40 |     43.09 |     43.24 |     46.52
LAS        |     39.54 |     39.26 |     39.40 |     42.38
CLAS       |     38.00 |     36.86 |     37.42 |     39.96
MLAS       |     35.55 |     34.49 |     35.01 |     37.39
BLEX       |     36.87 |     35.76 |     36.31 |     38.77

KoichiYasuoka commented on May 11, 2024

I've released https://huggingface.co/KoichiYasuoka/roberta-classical-chinese-large-sentence-segmentation for sentence segmentation of Classical Chinese. You can use it with transformers>=4.1:

import torch
from transformers import AutoTokenizer,AutoModelForTokenClassification
tokenizer=AutoTokenizer.from_pretrained("KoichiYasuoka/roberta-classical-chinese-large-sentence-segmentation")
model=AutoModelForTokenClassification.from_pretrained("KoichiYasuoka/roberta-classical-chinese-large-sentence-segmentation")
s="天平二年正月十三日萃于帥老之宅申宴會也于時初春令月氣淑風和梅披鏡前之粉蘭薰珮後之香加以曙嶺移雲松掛羅而傾盖夕岫結霧鳥封縠而迷林庭舞新蝶空歸故鴈於是盖天坐地促膝飛觴忘言一室之裏開衿煙霞之外淡然自放快然自足若非翰苑何以攄情詩紀落梅之篇古今夫何異矣宜賦園梅聊成短詠"
p=[model.config.id2label[q] for q in torch.argmax(model(tokenizer.encode(s,return_tensors="pt"))[0],dim=2)[0].tolist()[1:-1]]
print("".join(c+"。" if q=="E" or q=="S" else c for c,q in zip(s,p)))

tiberiu44 commented on May 11, 2024

Do we have permission to use your model in NLP-Cube? Do you need any citation or notice when somebody loads it?

KoichiYasuoka commented on May 11, 2024

The models are distributed under the Apache License 2.0. You can use them (almost) freely except for trademarks.

tiberiu44 commented on May 11, 2024

This sounds good. I will update the runtime code for the tokenizer to be able to use transformer models for tokenization.

tiberiu44 commented on May 11, 2024

One more question: does your model also support tokenization or just sentence segmentation?

KoichiYasuoka commented on May 11, 2024

https://huggingface.co/KoichiYasuoka/roberta-classical-chinese-large-sentence-segmentation is only for sentence segmentation. And I've just released https://huggingface.co/KoichiYasuoka/roberta-classical-chinese-large-upos for POS-tagging with tokenization:

>>> import torch
>>> from transformers import AutoTokenizer,AutoModelForTokenClassification
>>> tokenizer=AutoTokenizer.from_pretrained("KoichiYasuoka/roberta-classical-chinese-large-upos")
>>> model=AutoModelForTokenClassification.from_pretrained("KoichiYasuoka/roberta-classical-chinese-large-upos")
>>> s="子曰學而時習之不亦說乎有朋自遠方來不亦樂乎人不知而不慍不亦君子乎"
>>> p=[model.config.id2label[q] for q in torch.argmax(model(tokenizer.encode(s,return_tensors="pt"))[0],dim=2)[0].tolist()[1:-1]]
>>> print(list(zip(s,p)))
[('子', 'NOUN'), ('曰', 'VERB'), ('學', 'VERB'), ('而', 'CCONJ'), ('時', 'NOUN'), ('習', 'VERB'), ('之', 'PRON'), ('不', 'ADV'), ('亦', 'ADV'), ('說', 'VERB'), ('乎', 'PART'), ('有', 'VERB'), ('朋', 'NOUN'), ('自', 'ADP'), ('遠', 'VERB'), ('方', 'NOUN'), ('來', 'VERB'), ('不', 'ADV'), ('亦', 'ADV'), ('樂', 'VERB'), ('乎', 'PART'), ('人', 'NOUN'), ('不', 'ADV'), ('知', 'VERB'), ('而', 'CCONJ'), ('不', 'ADV'), ('慍', 'VERB'), ('不', 'ADV'), ('亦', 'ADV'), ('君', 'B-NOUN'), ('子', 'I-NOUN'), ('乎', 'PART')]

You can see that "君子" is tokenized as a single word, with the POS tags B-NOUN and I-NOUN.
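
Recovering word-level tokens from those B-/I- prefixed tags is a simple merge; a minimal sketch (illustrative, not part of the released model):

def merge_char_tags(chars, tags):
    """Merge characters into words: an "I-" tag continues the previous word."""
    words=[]
    for c,t in zip(chars,tags):
        if t.startswith("I-") and words:
            words[-1]=(words[-1][0]+c, words[-1][1])  # extend the current word
        else:
            words.append((c, t.split("-")[-1]))       # start a new word
    return words

print(merge_char_tags("君子乎",["B-NOUN","I-NOUN","PART"]))
# [('君子', 'NOUN'), ('乎', 'PART')]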
