Thank you very much for your answers to the previous questions! In the dataset, for exa

What is the data format of the verb, A0, and A1 features? (astock, closed)

jinanzou commented on May 26, 2024

What is the data format of the verb, A0, and A1 features?

from astock.

Comments (3)

A-baoYang commented on May 26, 2024

The LTP SRL results are indexed against LTP's word segmentation results. To convert them, tokenize each segmented word separately and count the tokens that come before it; this gives the correct start position and token length.

Tokenize each segmented word:

The full tokenized sentence:

Get the start position and length of a segmented word, for example "15240点":

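# Rewritten position-conversion function
def token_pos_trans(token_ws, tup):
    """Use the LTP word segmentation result, tokenized word by word (token_ws),
       to derive the position of a semantic role after tokenization.

    """
    start, end = tup
    # Sum the token counts of all preceding words to get the target start position
    start_pos = sum(len(t) for t in token_ws[:start])
    length = 0
    for i in range(start, end+1):
        length += len(token_ws[i])
    return (start_pos, length)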
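
As a rough illustration of how the function above might be called, here is a minimal sketch. The real workflow would tokenize each word with the pretrained model's tokenizer; `toy_tokenize` below is a hypothetical stand-in (not `BertTokenizer`) so that the example runs without downloading a model, and its token boundaries are not claimed to match the real tokenizer.

```python
import re

def toy_tokenize(word):
    """Hypothetical stand-in for a subword tokenizer (NOT BertTokenizer).

    Each run of ASCII letters/digits becomes one token; every other
    character becomes its own token.
    """
    return re.findall(r"[A-Za-z0-9]+|.", word)

def token_pos_trans(token_ws, tup):
    """Same logic as the function above: word-index span -> (token start, token length)."""
    start, end = tup
    start_pos = sum(len(t) for t in token_ws[:start])
    length = sum(len(t) for t in token_ws[start:end + 1])
    return (start_pos, length)

# LTP-style segmented words (a fragment of a sentence quoted later in this thread).
words = ["公司", "2020年", "前", "三季度", "营业", "收入", "649.7亿", "元"]

# Tokenize every word on its own, keeping one token list per word.
token_ws = [toy_tokenize(w) for w in words]

# Word span (6, 7) covers "649.7亿" and "元" (inclusive word indices).
print(token_pos_trans(token_ws, (6, 7)))
```

With this toy tokenizer the span starts at token 12 and is 5 tokens long, whereas the same span starts at character 15 and is 7 characters long, so the two indexings drift apart as soon as a multi-character token appears.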

weituo2002 commented on May 26, 2024

I use LTP for word segmentation and semantic role labeling. LTP is loaded with the small model file. I am using the source code of this project as follows.

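from ltp import LTP

# Load LTP with the small model files.
ltp = LTP(path = "./ltp/small")

# Assumes df_train (loaded from train.csv) and tokenizer (the BertTokenizer
# created in the next snippet) already exist.
a = tokenizer.tokenize(df_train.loc[4,'text_a'])
# Print the tokens flagged by the verb mask; mask index idx maps to token idx-1.
for idx,i in enumerate(df_train.verb_mask[4][2]):
    if i != 0:
        print(a[idx-1])

# Word segmentation, then semantic role labeling on the hidden states.
seg,hidden = ltp.seg([df_train.loc[4,'text_a']])
srl = ltp.srl(hidden)

def list_to_string(a):
    return ''.join(a)

# Print each predicate together with the text of its arguments.
for i,s in enumerate(srl[0]):
    if len(s)!=0:
        print(f'verb: {seg[0][i]} arv: ({[[ar[0], list_to_string([str(seg[0][k]) for k in range(ar[1],ar[2]+1)])] for ar in srl[0][i]]})')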

The verb/A0/A1 results I get are almost identical to those in the "train.csv" file, but the words start at different positions in the "text_a" sentence.
I found that if the "text_a" sentence is entirely in Chinese, the start positions match exactly. However, if the sentence contains numbers or English, the start positions differ from those in the "train.csv" file.
To verify this, I rewrote the code as follows.

from transformers import BertTokenizer

PRE_TRAINED_MODEL_NAME = "hfl/chinese-roberta-wwm-ext"
tokenizer = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME,do_lower_case=True)
print(df_train.loc[5,'text_a'])
print(len(df_train.loc[5,'text_a']))
a = tokenizer.tokenize(df_train.loc[5,'text_a'])
for idx,i in enumerate(df_train.verb_mask[5][2]):
    if i != 0:
        print(a[idx-1])
seg,hidden = ltp.seg([df_train.loc[5,'text_a']])
srl = ltp.srl(hidden)

def list_to_string(a):
    return ''.join(a)

# Calculate the verb/A0/A1 start position (as a character offset) in "text_a".
def start_pos(segment,word_idx):
    n = 0
    for i in range(0,word_idx):
        n = n + len(segment[i])
    return n

for i,s in enumerate(srl[0]):
    if len(s)!=0:
        print(f'verb: {seg[0][i]} arv: ({[[ar[0], list_to_string([str(seg[0][k]) for k in range(ar[1],ar[2]+1)])] for ar in srl[0][i]]})')
        # Add the start position and length.
        print(f'verb: {seg[0][i], start_pos(seg[0],i),len(seg[0][i])} arv: ({[[ar[0], [(seg[0][k], start_pos(seg[0],k),len(seg[0][k])) for k in range(ar[1], ar[2] + 1)]] for ar in srl[0][i]]})')

I loaded the dataset "train.csv"; df_train.loc[5,'text_a'] is as follows.
复星医药公告，控股子公司复星医药产业收到国家药品监督管理局关于同意其获许可的马来酸阿伐曲泊帕片开展临床试验的通知书。该药品用于成人慢性免疫性血小板减少症适应症（ITP），复星医药产业拟于近期条件具备后于**境内开展该新药针对该适应症的III期临床试验。
The results are as follows.
The word segmentation result:

[['复星', '医药', '公告', '，', '控股', '子公司', '复星医药', '产业', '收到', '国家', '药品', '监督', '管理局', '关于', '同意', '其', '获', '许可', '的', '马来酸阿伐曲泊帕片', '开展', '临床', '试验', '的', '通知书', '。', '该药品', '用于', '成人', '慢性免疫性', '血小板减少症', '适应症', '（', 'ITP', '）', '，', '复星医药', '产业', '拟于', '近期', '条件', '具备', '后于', '**', '境内', '开展', '该', '新药', '针对', '该', '适应症', '的', 'III期', '临床', '试验', '。']]

The semantic role labeling result:
[[(8, [('A0', 4, 7), ('A1', 9, 24)]), (14, [('A1', 15, 22)]), (16, [('A1', 17, 17)]), (20, [('A1', 21, 22)]), (27, [('A2', 28, 34)]), (45, [('A0', 36, 37), ('ARGM-TMP', 38, 42), ('ARGM-LOC', 43, 44), ('A1', 46, 54)])]]

verb: 收到 arv: ([['A0', '控股子公司复星医药产业'], ['A1', '国家药品监督管理局关于同意其获许可的马来酸阿伐曲泊帕片开展临床试验的通知书']])
verb: ('收到', 18, 2) arv: ([['A0', [('控股', 7, 2), ('子公司', 9, 3), ('复星医药', 12, 4), ('产业', 16, 2)]], ['A1', [('国家', 20, 2), ('药品', 22, 2), ('监督', 24, 2), ('管理局', 26, 3), ('关于', 29, 2), ('同意', 31, 2), ('其', 33, 1), ('获', 34, 1), ('许可', 35, 2), ('的', 37, 1), ('马来酸阿伐曲泊帕片', 38, 9), ('开展', 47, 2), ('临床', 49, 2), ('试验', 51, 2), ('的', 53, 1), ('通知书', 54, 3)]]])
verb: 同意 arv: ([['A1', '其获许可的马来酸阿伐曲泊帕片开展临床试验']])
verb: ('同意', 31, 2) arv: ([['A1', [('其', 33, 1), ('获', 34, 1), ('许可', 35, 2), ('的', 37, 1), ('马来酸阿伐曲泊帕片', 38, 9), ('开展', 47, 2), ('临床', 49, 2), ('试验', 51, 2)]]])
verb: 获 arv: ([['A1', '许可']])
verb: ('获', 34, 1) arv: ([['A1', [('许可', 35, 2)]]])
verb: 开展 arv: ([['A1', '临床试验']])
verb: ('开展', 47, 2) arv: ([['A1', [('临床', 49, 2), ('试验', 51, 2)]]])
verb: 用于 arv: ([['A2', '成人慢性免疫性血小板减少症适应症（ITP）']])
verb: ('用于', 61, 2) arv: ([['A2', [('成人', 63, 2), ('慢性免疫性', 65, 5), ('血小板减少症', 70, 6), ('适应症', 76, 3), ('（', 79, 1), ('ITP', 80, 3), ('）', 83, 1)]]])
verb: 开展 arv: ([['A0', '复星医药产业'], ['ARGM-TMP', '拟于近期条件具备后于'], ['ARGM-LOC', '**境内'], ['A1', '该新药针对该适应症的III期临床试验']])
verb: ('开展', 105, 2) arv: ([['A0', [('复星医药', 85, 4), ('产业', 89, 2)]], ['ARGM-TMP', [('拟于', 91, 2), ('近期', 93, 2), ('条件', 95, 2), ('具备', 97, 2), ('后于', 99, 2)]], ['ARGM-LOC', [('**', 101, 2), ('境内', 103, 2)]], ['A1', [('该', 107, 1), ('新药', 108, 2), ('针对', 110, 2), ('该', 112, 1), ('适应症', 113, 3), ('的', 116, 1), ('III期', 117, 4), ('临床', 121, 2), ('试验', 123, 2)]]])

The verb positions in the "train.csv" file:
[(18, 2), (31, 2), (34, 1), (47, 2), (61, 2), (104, 2)]
The start positions of the verbs ('收到', 18, 2), ('同意', 31, 2), ('获', 34, 1), ('开展', 47, 2), and ('用于', 61, 2) are exactly the same as in "train.csv".
But they are no longer consistent after the English "ITP" appears. For example, I get the verb ('开展', 105, 2), but it is (104, 2) in train.csv.
The same applies to A0 and A1.

Another example uses the first record of train.csv: df_train.loc[0,'text_a'].
The "text_a" is:
中泰化学披露三季报，公司2020年前三季度营业收入649.7亿元，同比增长0.71%；净利润亏损4.46亿元，上年同期（调整后）为盈利4.33亿元。第三季度净利润为亏损1.39亿元。

The word segmentation result:
[['中泰', '化学', '披露', '三季报', '，', '公司', '2020年', '前', '三季度', '营业', '收入', '649.7亿', '元', '，', '同比', '增长', '0.71%', '；', '净利润', '亏损', '4.46亿', '元', '，', '上年', '同期', '（', '调整', '后', '）', '为', '盈利', '4.33亿', '元', '。', '第三', '季度', '净利润', '为', '亏损', '1.39亿', '元', '。']]

The semantic role labeling result:
[[(2, [('A0', 0, 1), ('A1', 3, 3)]), (15, [('ARGM-ADV', 14, 14), ('A1', 16, 16)]), (19, [('A1', 18, 18), ('ARGM-EXT', 20, 21)]), (29, [('A0', 23, 28), ('A1', 30, 32)]), (37, [('ARGM-TMP', 34, 35), ('A0', 36, 36), ('A1', 38, 40)])]]

verb: 披露 arv: ([['A0', '中泰化学'], ['A1', '三季报']])
verb: ('披露', 4, 2) arv: ([['A0', [('中泰', 0, 2), ('化学', 2, 2)]], ['A1', [('三季报', 6, 3)]]])
verb: 增长 arv: ([['ARGM-ADV', '同比'], ['A1', '0.71%']])
verb: ('增长', 35, 2) arv: ([['ARGM-ADV', [('同比', 33, 2)]], ['A1', [('0.71%', 37, 5)]]])
verb: 亏损 arv: ([['A1', '净利润'], ['ARGM-EXT', '4.46亿元']])
verb: ('亏损', 46, 2) arv: ([['A1', [('净利润', 43, 3)]], ['ARGM-EXT', [('4.46亿', 48, 5), ('元', 53, 1)]]])
verb: 为 arv: ([['A0', '上年同期（调整后）'], ['A1', '盈利4.33亿元']])
verb: ('为', 64, 1) arv: ([['A0', [('上年', 55, 2), ('同期', 57, 2), ('（', 59, 1), ('调整', 60, 2), ('后', 62, 1), ('）', 63, 1)]], ['A1', [('盈利', 65, 2), ('4.33亿', 67, 5), ('元', 72, 1)]]])
verb: 为 arv: ([['ARGM-TMP', '第三季度'], ['A0', '净利润'], ['A1', '亏损1.39亿元']])
verb: ('为', 81, 1) arv: ([['ARGM-TMP', [('第三', 74, 2), ('季度', 76, 2)]], ['A0', [('净利润', 78, 3)]], ['A1', [('亏损', 82, 2), ('1.39亿', 84, 5), ('元', 89, 1)]]])
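
For readers unfamiliar with the compact SRL structure printed above, the following is a minimal sketch of how it can be read, assuming each entry is `(verb_word_index, [(role, start_word_index, end_word_index), ...])` with inclusive end indices, which is what the printed output suggests. It uses only the first 17 segmented words and the first two SRL entries from this example; `decode_srl` and `char_start` are hypothetical helper names, not part of LTP.

```python
# First 17 segmented words and the first two SRL entries from the output above.
words = ['中泰', '化学', '披露', '三季报', '，', '公司', '2020年', '前', '三季度',
         '营业', '收入', '649.7亿', '元', '，', '同比', '增长', '0.71%']
srl = [(2, [('A0', 0, 1), ('A1', 3, 3)]),
       (15, [('ARGM-ADV', 14, 14), ('A1', 16, 16)])]

def char_start(words, word_idx):
    """Character offset of a word: total length of all words before it."""
    return sum(len(w) for w in words[:word_idx])

def decode_srl(words, srl):
    """Turn (verb_idx, [(role, start, end), ...]) entries into readable records.

    Hypothetical helper; start/end are treated as inclusive word indices.
    """
    records = []
    for verb_idx, args in srl:
        # (verb text, character start position, character length)
        verb = (words[verb_idx], char_start(words, verb_idx), len(words[verb_idx]))
        # Join the words covered by each argument span into one string.
        roles = [(role, ''.join(words[s:e + 1])) for role, s, e in args]
        records.append((verb, roles))
    return records

for verb, roles in decode_srl(words, srl):
    print(verb, roles)
```

The positions produced this way are character offsets into "text_a", which is the same quantity the `start_pos` helper in this comment computes.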

The verb positions in the "train.csv" file:
[(4, 2), (31, 2), (41, 2), (58, 1), (74, 1)]
The start position of the verb ('披露', 4, 2) is exactly the same as in "train.csv".
But the positions are no longer consistent after the numbers "2020" and "649.7" appear. The verb ('增长', 35, 2) has a different start position from (31, 2) in train.csv.
The same applies to A0 and A1.

I guess it's because of the character encoding?
I hope the author can give an answer, thanks!

Haiyao-Nero commented on May 26, 2024

The word segmentation produced by LTP is different from the tokens produced by the tokenizer of the pretrained model. The semantic roles, like ([[(8, [('A0', 4, 7), ('A1', 9, 24)]), (14, [('A1', 15, 22)]), (16, [('A1', 17, 17)]), (20, [('A1', 21, 22)]), (27, [('A2', 28, 34)]), (45, [('A0', 36, 37), ('ARGM-TMP', 38, 42), ('ARGM-LOC', 43, 44), ('A1', 46, 54)])]]), represent positions in the tokens split by the pretrained model, which may be a little confusing.
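
One way to bridge the two indexings is to map character offsets onto token indices. The sketch below assumes WordPiece-style tokens whose surface forms (with a leading "##" removed) concatenate back to the lowercased text, with no unknown tokens and no whitespace. The token list is illustrative only and is not actual output of the pretrained tokenizer; the function names are hypothetical.

```python
def char_to_token_index(tokens):
    """Map each character offset to the index of the token that covers it.

    Assumes the tokens (minus any leading "##") concatenate back to the text.
    """
    mapping = []
    for tok_idx, tok in enumerate(tokens):
        surface = tok[2:] if tok.startswith("##") else tok
        mapping.extend([tok_idx] * len(surface))
    return mapping

def char_span_to_token_span(tokens, char_start, char_len):
    """Convert a (character start, character length) span to (token start, token length)."""
    mapping = char_to_token_index(tokens)
    first = mapping[char_start]
    last = mapping[char_start + char_len - 1]
    return first, last - first + 1

# Illustrative token list (an assumption, NOT real tokenizer output).
text = "（ITP），复星"
tokens = ["（", "it", "##p", "）", "，", "复", "星"]

# "复星" starts at character 6 but, with these tokens, at token 5.
print(char_span_to_token_span(tokens, text.index("复星"), 2))
```

In this toy case the three characters of "ITP" collapse into two tokens, so every later span shifts by one position. The actual shift in the dataset depends on how the real tokenizer splits each number or English word.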
