
Comments (3)

A-baoYang commented on May 26, 2024

Since the LTP SRL results are indexed against LTP's word segmentation output, tokenize each segmented word separately and count the tokens that precede it. That gives the correct start position and token length.

Tokenize each segmented word individually:
[image: tokens for each segmented word]

The fully tokenized sentence:
[image: tokens for the full sentence]

Get the start position and length of a segmented word, for example "15240点":

# Rewritten position conversion function
def token_pos_trans(token_ws, tup):
    """Use LTP's word segmentation, tokenized word by word (token_ws),
       to derive a semantic role's position after tokenization.

    """
    start, end = tup
    # Sum the token counts of all preceding words to get the target start position
    start_pos = sum(len(t) for t in token_ws[:start])
    length = 0
    for i in range(start, end+1):
        length += len(token_ws[i])
    return (start_pos, length)
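
As a rough illustration of how the function behaves, the sketch below runs it on a made-up word list. `toy_tokenize` is a hypothetical stand-in for the pretrained model's tokenizer: it keeps each run of ASCII letters or digits as one token and splits everything else per character. A real WordPiece tokenizer may split differently, so the numbers are only illustrative.

```python
import re

def toy_tokenize(word):
    # Hypothetical stand-in for the model tokenizer (an assumption, not BertTokenizer):
    # one token per run of ASCII letters/digits, one token per other character.
    return re.findall(r"[A-Za-z0-9]+|.", word)

def token_pos_trans(token_ws, tup):
    # Same logic as the function above: count the tokens of all preceding words.
    start, end = tup
    start_pos = sum(len(t) for t in token_ws[:start])
    length = sum(len(t) for t in token_ws[start:end + 1])
    return (start_pos, length)

# Made-up segmentation result, standing in for one sentence of LTP output.
words = ["指数", "收于", "15240点", ",", "上涨"]
token_ws = [toy_tokenize(w) for w in words]

# "15240点" is 6 characters but only 2 toy tokens.
print(token_pos_trans(token_ws, (2, 2)))  # (4, 2)

# For the last word, the character offset and the token offset no longer agree.
char_start = sum(len(w) for w in words[:4])
print(char_start, token_pos_trans(token_ws, (4, 4)))  # 11 (7, 2)
```

The word that follows "15240点" starts at character 11 but at token 7, which is the kind of drift the conversion is meant to correct.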


from astock.

weituo2002 commented on May 26, 2024

I use LTP for word segmentation and semantic role labeling. LTP is loaded with the small model file. I am using this project's source code as follows.

from ltp import LTP

ltp = LTP(path="./ltp/small")

# `tokenizer` is the BertTokenizer created in the second snippet;
# `df_train` is train.csv loaded as a DataFrame.
a = tokenizer.tokenize(df_train.loc[4, 'text_a'])
for idx, i in enumerate(df_train.verb_mask[4][2]):
    if i != 0:
        print(a[idx-1])

# Segment the sentence with LTP, then run SRL on the hidden states
seg, hidden = ltp.seg([df_train.loc[4, 'text_a']])
srl = ltp.srl(hidden)

def list_to_string(a):
    return ''.join(a)

for i, s in enumerate(srl[0]):
    if len(s) != 0:
        print(f'verb: {seg[0][i]} arv: ({[[ar[0], list_to_string([str(seg[0][k]) for k in range(ar[1],ar[2]+1)])] for ar in srl[0][i]]})')

The verb/A0/A1 results I get are almost identical to those in the "train.csv" file, but the words start at different positions in the "text_a" sentence.
If the "text_a" sentence is entirely in Chinese, the start positions match exactly. If the sentence contains numbers or English, the start positions differ from those in "train.csv".
To verify this, I rewrote the code as follows.

from transformers import BertTokenizer

PRE_TRAINED_MODEL_NAME = "hfl/chinese-roberta-wwm-ext"
tokenizer = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME, do_lower_case=True)
print(df_train.loc[5, 'text_a'])
print(len(df_train.loc[5, 'text_a']))
a = tokenizer.tokenize(df_train.loc[5, 'text_a'])
for idx, i in enumerate(df_train.verb_mask[5][2]):
    if i != 0:
        print(a[idx-1])
seg, hidden = ltp.seg([df_train.loc[5, 'text_a']])
srl = ltp.srl(hidden)

def list_to_string(a):
    return ''.join(a)

# Calculate the verb/A0/A1 start position in "text_a",
# counted in characters of the preceding LTP words
def start_pos(segment, word_idx):
    n = 0
    for i in range(0, word_idx):
        n = n + len(segment[i])
    return n

for i, s in enumerate(srl[0]):
    if len(s) != 0:
        print(f'verb: {seg[0][i]} arv: ({[[ar[0], list_to_string([str(seg[0][k]) for k in range(ar[1],ar[2]+1)])] for ar in srl[0][i]]})')
        # Add start position and length
        print(f'verb: {seg[0][i], start_pos(seg[0],i),len(seg[0][i])} arv: ({[[ar[0], [(seg[0][k], start_pos(seg[0],k),len(seg[0][k])) for k in range(ar[1], ar[2] + 1)]] for ar in srl[0][i]]})')

I loaded the dataset "train.csv". df_train.loc[5,'text_a'] is as follows.
复星医药公告,控股子公司复星医药产业收到国家药品监督管理局关于同意其获许可的马来酸阿伐曲泊帕片开展临床试验的通知书。该药品用于成人慢性免疫性血小板减少症适应症(ITP),复星医药产业拟于近期条件具备后于**境内开展该新药针对该适应症的III期临床试验。
The results are as follows.
The word segmentation result:

[['复星', '医药', '公告', ',', '控股', '子公司', '复星医药', '产业', '收到', '国家', '药品', '监督', '管理局', '关于', '同意', '其', '获', '许可', '的', '马来酸阿伐曲泊帕片', '开展', '临床', '试验', '的', '通知书', '。', '该药品', '用于', '成人', '慢性免疫性', '血小板减少症', '适应症', '(', 'ITP', ')', ',', '复星医药', '产业', '拟于', '近期', '条件', '具备', '后于', '**', '境内', '开展', '该', '新药', '针对', '该', '适应症', '的', 'III期', '临床', '试验', '。']]

The semantic role labeling result:
[[(8, [('A0', 4, 7), ('A1', 9, 24)]), (14, [('A1', 15, 22)]), (16, [('A1', 17, 17)]), (20, [('A1', 21, 22)]), (27, [('A2', 28, 34)]), (45, [('A0', 36, 37), ('ARGM-TMP', 38, 42), ('ARGM-LOC', 43, 44), ('A1', 46, 54)])]]

verb: 收到 arv: ([['A0', '控股子公司复星医药产业'], ['A1', '国家药品监督管理局关于同意其获许可的马来酸阿伐曲泊帕片开展临床试验的通知书']])
verb: ('收到', 18, 2) arv: ([['A0', [('控股', 7, 2), ('子公司', 9, 3), ('复星医药', 12, 4), ('产业', 16, 2)]], ['A1', [('国家', 20, 2), ('药品', 22, 2), ('监督', 24, 2), ('管理局', 26, 3), ('关于', 29, 2), ('同意', 31, 2), ('其', 33, 1), ('获', 34, 1), ('许可', 35, 2), ('的', 37, 1), ('马来酸阿伐曲泊帕片', 38, 9), ('开展', 47, 2), ('临床', 49, 2), ('试验', 51, 2), ('的', 53, 1), ('通知书', 54, 3)]]])
verb: 同意 arv: ([['A1', '其获许可的马来酸阿伐曲泊帕片开展临床试验']])
verb: ('同意', 31, 2) arv: ([['A1', [('其', 33, 1), ('获', 34, 1), ('许可', 35, 2), ('的', 37, 1), ('马来酸阿伐曲泊帕片', 38, 9), ('开展', 47, 2), ('临床', 49, 2), ('试验', 51, 2)]]])
verb: 获 arv: ([['A1', '许可']])
verb: ('获', 34, 1) arv: ([['A1', [('许可', 35, 2)]]])
verb: 开展 arv: ([['A1', '临床试验']])
verb: ('开展', 47, 2) arv: ([['A1', [('临床', 49, 2), ('试验', 51, 2)]]])
verb: 用于 arv: ([['A2', '成人慢性免疫性血小板减少症适应症(ITP)']])
verb: ('用于', 61, 2) arv: ([['A2', [('成人', 63, 2), ('慢性免疫性', 65, 5), ('血小板减少症', 70, 6), ('适应症', 76, 3), ('(', 79, 1), ('ITP', 80, 3), (')', 83, 1)]]])
verb: 开展 arv: ([['A0', '复星医药产业'], ['ARGM-TMP', '拟于近期条件具备后于'], ['ARGM-LOC', '**境内'], ['A1', '该新药针对该适应症的III期临床试验']])
verb: ('开展', 105, 2) arv: ([['A0', [('复星医药', 85, 4), ('产业', 89, 2)]], ['ARGM-TMP', [('拟于', 91, 2), ('近期', 93, 2), ('条件', 95, 2), ('具备', 97, 2), ('后于', 99, 2)]], ['ARGM-LOC', [('**', 101, 2), ('境内', 103, 2)]], ['A1', [('该', 107, 1), ('新药', 108, 2), ('针对', 110, 2), ('该', 112, 1), ('适应症', 113, 3), ('的', 116, 1), ('III期', 117, 4), ('临床', 121, 2), ('试验', 123, 2)]]])

The verb positions in the "train.csv" file:
[(18, 2), (31, 2), (34, 1), (47, 2), (61, 2), (104, 2)]
The start positions of the verbs ('收到', 18, 2), ('同意', 31, 2), ('获', 34, 1), ('开展', 47, 2) and ('用于', 61, 2) match "train.csv" exactly.
The positions stop matching after the English string "ITP". For example, the last verb is ('开展', 105, 2) here but (104, 2) in train.csv.
The same holds for A0 and A1.

Another example uses the first record of train.csv, df_train.loc[0,'text_a'].
The "text_a" is:
中泰化学披露三季报,公司2020年前三季度营业收入649.7亿元,同比增长0.71%;净利润亏损4.46亿元,上年同期(调整后)为盈利4.33亿元。第三季度净利润为亏损1.39亿元。

The word segmentation result:
[['中泰', '化学', '披露', '三季报', ',', '公司', '2020年', '前', '三季度', '营业', '收入', '649.7亿', '元', ',', '同比', '增长', '0.71%', ';', '净利润', '亏损', '4.46亿', '元', ',', '上年', '同期', '(', '调整', '后', ')', '为', '盈利', '4.33亿', '元', '。', '第三', '季度', '净利润', '为', '亏损', '1.39亿', '元', '。']]

The semantic role labeling result:
[[(2, [('A0', 0, 1), ('A1', 3, 3)]), (15, [('ARGM-ADV', 14, 14), ('A1', 16, 16)]), (19, [('A1', 18, 18), ('ARGM-EXT', 20, 21)]), (29, [('A0', 23, 28), ('A1', 30, 32)]), (37, [('ARGM-TMP', 34, 35), ('A0', 36, 36), ('A1', 38, 40)])]]

verb: 披露 arv: ([['A0', '中泰化学'], ['A1', '三季报']])
verb: ('披露', 4, 2) arv: ([['A0', [('中泰', 0, 2), ('化学', 2, 2)]], ['A1', [('三季报', 6, 3)]]])
verb: 增长 arv: ([['ARGM-ADV', '同比'], ['A1', '0.71%']])
verb: ('增长', 35, 2) arv: ([['ARGM-ADV', [('同比', 33, 2)]], ['A1', [('0.71%', 37, 5)]]])
verb: 亏损 arv: ([['A1', '净利润'], ['ARGM-EXT', '4.46亿元']])
verb: ('亏损', 46, 2) arv: ([['A1', [('净利润', 43, 3)]], ['ARGM-EXT', [('4.46亿', 48, 5), ('元', 53, 1)]]])
verb: 为 arv: ([['A0', '上年同期(调整后)'], ['A1', '盈利4.33亿元']])
verb: ('为', 64, 1) arv: ([['A0', [('上年', 55, 2), ('同期', 57, 2), ('(', 59, 1), ('调整', 60, 2), ('后', 62, 1), (')', 63, 1)]], ['A1', [('盈利', 65, 2), ('4.33亿', 67, 5), ('元', 72, 1)]]])
verb: 为 arv: ([['ARGM-TMP', '第三季度'], ['A0', '净利润'], ['A1', '亏损1.39亿元']])
verb: ('为', 81, 1) arv: ([['ARGM-TMP', [('第三', 74, 2), ('季度', 76, 2)]], ['A0', [('净利润', 78, 3)]], ['A1', [('亏损', 82, 2), ('1.39亿', 84, 5), ('元', 89, 1)]]])

The verb positions in the "train.csv" file:
[(4, 2), (31, 2), (41, 2), (58, 1), (74, 1)]
The start position of the verb ('披露', 4, 2) matches "train.csv" exactly.
The positions stop matching once the numbers "2020" and "649.7" appear: the verb is ('增长', 35, 2) here but (31, 2) in train.csv.
The same holds for A0 and A1.

My guess is that this is caused by character encoding.
I hope the author can give an answer, thanks!


Haiyao-Nero commented on May 26, 2024

The words produced by LTP's segmentation differ from the tokens produced by the pretrained model's tokenizer. The semantic role positions, such as ([[(8, [('A0', 4, 7), ('A1', 9, 24)]), (14, [('A1', 15, 22)]), (16, [('A1', 17, 17)]), (20, [('A1', 21, 22)]), (27, [('A2', 28, 34)]), (45, [('A0', 36, 37), ('ARGM-TMP', 38, 42), ('ARGM-LOC', 43, 44), ('A1', 46, 54)])]]), refer to positions in the token sequence produced by the pretrained model's tokenizer, which may be a little confusing.
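
A small sketch of how one might find where the two numbering schemes part ways, using the first few LTP words from the second example in this thread. `toy_tokenize` and `offsets` are hypothetical helpers: the toy tokenizer treats each run of ASCII letters or digits as one token, which only approximates the pretrained tokenizer, so the exact token offsets may not match train.csv.

```python
import re

def toy_tokenize(word):
    # Hypothetical approximation of a subword tokenizer (an assumption):
    # one token per run of ASCII letters/digits, one token per other character.
    return re.findall(r"[A-Za-z0-9]+|.", word)

def offsets(lengths):
    # Cumulative start offset of each unit, given the length of each unit.
    starts, pos = [], 0
    for n in lengths:
        starts.append(pos)
        pos += n
    return starts

# First LTP words of the sentence starting with 中泰化学 quoted earlier in the thread.
words = ['中泰', '化学', '披露', '三季报', ',', '公司', '2020年', '前', '三季度']

char_starts = offsets([len(w) for w in words])                 # counted in characters
token_starts = offsets([len(toy_tokenize(w)) for w in words])  # counted in tokens

for w, c, t in zip(words, char_starts, token_starts):
    print(w, c, t, "" if c == t else "<- differs")

# Index of the first word whose two start offsets disagree.
first_diff = next(i for i, (c, t) in enumerate(zip(char_starts, token_starts)) if c != t)
print(words[first_diff])  # 前
```

Both offsets agree up to and including '2020年', so '披露' starts at 4 either way, and they drift apart from the next word on because '2020年' covers more characters than tokens.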

