Comments (3)
Since the LTP SRL results are indexed against LTP's word segmentation results, just tokenize every segmented word and count the number of preceding tokens; that gives you the correct start position and token length.
Tokenize every segmented word:
Get the start position and length of a segmented word, for example "15240点":
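# Rewritten position-conversion function
def token_pos_trans(token_ws, tup):
    """Use the LTP segmentation result with each word tokenized separately
    (token_ws) to derive the position of a semantic role after tokenization.
    """
    start, end = tup
    # Sum the token counts of all preceding words to get the target position
    start_pos = sum(len(t) for t in token_ws[:start])
    length = 0
    for i in range(start, end+1):
        length += len(token_ws[i])
    return (start_pos, length)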
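A minimal sketch of how the function above could be called. The `toy_tokenize` helper and the word list are made up for illustration; `toy_tokenize` is only a stand-in for the pretrained model's tokenizer, and a real WordPiece tokenizer may split numbers and English words differently.

```python
import re

def toy_tokenize(word):
    """Hypothetical stand-in for a BERT-style tokenizer: each run of ASCII
    letters or digits stays one token, every other character is one token."""
    return re.findall(r"[A-Za-z]+|[0-9]+|.", word)

def token_pos_trans(token_ws, tup):
    """Same logic as the function above: map an inclusive word span
    (start, end) to (token start position, token length)."""
    start, end = tup
    start_pos = sum(len(t) for t in token_ws[:start])
    length = sum(len(token_ws[i]) for i in range(start, end + 1))
    return (start_pos, length)

# Made-up segmentation result, for illustration only.
words = ["指数", "收", "于", "15240点", "附近"]
# Tokenize every segmented word separately.
token_ws = [toy_tokenize(w) for w in words]

span = token_pos_trans(token_ws, (3, 3))      # the word "15240点"
char_start = sum(len(w) for w in words[:3])   # character-based start
print(token_ws[3], span, (char_start, len(words[3])))
```

With this toy tokenizer "15240点" covers 2 tokens but 6 characters, so every word after it sits 4 positions earlier in token space than in character space.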
from astock.
I use LTP for word segmentation and semantic role labeling. LTP is loaded with the small model file. I am using the source code of this project as follows.
from ltp import LTP

ltp = LTP(path = "./ltp/small")
# tokenizer is the BertTokenizer created further below;
# df_train is the DataFrame loaded from train.csv
a = tokenizer.tokenize(df_train.loc[4,'text_a'])
for idx,i in enumerate(df_train.verb_mask[4][2]):
    if i != 0:
        print(a[idx-1])
seg,hidden = ltp.seg([df_train.loc[4,'text_a']])
srl = ltp.srl(hidden)

def list_to_string(a):
    return ''.join(a)

for i,s in enumerate(srl[0]):
    if len(s)!=0:
        print(f'verb: {seg[0][i]} arv: ({[[ar[0], list_to_string([str(seg[0][k]) for k in range(ar[1],ar[2]+1)])] for ar in srl[0][i]]})')
I get verb/A0/A1 results that are almost identical to the "train.csv" file, but the verb/A0/A1 words start at different positions in the "text_a" sentence.
I found that if the "text_a" sentence is entirely in Chinese, the words start at exactly the same positions. However, if the sentence contains numbers or English, the words do not start at the same positions as in the "train.csv" file.
To verify this, I rewrote the code as follows.
from transformers import BertTokenizer

PRE_TRAINED_MODEL_NAME = "hfl/chinese-roberta-wwm-ext"
tokenizer = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME,do_lower_case=True)
print(df_train.loc[5,'text_a'])
print(len(df_train.loc[5,'text_a']))
a = tokenizer.tokenize(df_train.loc[5,'text_a'])
for idx,i in enumerate(df_train.verb_mask[5][2]):
    if i != 0:
        print(a[idx-1])
seg,hidden = ltp.seg([df_train.loc[5,'text_a']])
srl = ltp.srl(hidden)

def list_to_string(a):
    return ''.join(a)

# calculate the verb/A0/A1 start position in "text_a" (character offset)
def start_pos(segment, word_idx):
    n = 0
    for i in range(0, word_idx):
        n = n + len(segment[i])
    return n

for i,s in enumerate(srl[0]):
    if len(s)!=0:
        print(f'verb: {seg[0][i]} arv: ({[[ar[0], list_to_string([str(seg[0][k]) for k in range(ar[1],ar[2]+1)])] for ar in srl[0][i]]})')
        # add start position and length
        print(f'verb: {seg[0][i], start_pos(seg[0],i),len(seg[0][i])} arv: ({[[ar[0], [(seg[0][k], start_pos(seg[0],k),len(seg[0][k])) for k in range(ar[1], ar[2] + 1)]] for ar in srl[0][i]]})')
I load the dataset "train.csv"; df_train.loc[5,'text_a'] is as follows.
复星医药公告,控股子公司复星医药产业收到国家药品监督管理局关于同意其获许可的马来酸阿伐曲泊帕片开展临床试验的通知书。该药品用于成人慢性免疫性血小板减少症适应症(ITP),复星医药产业拟于近期条件具备后于**境内开展该新药针对该适应症的III期临床试验。
The results are as follows.
The word segmentation result:
[['复星', '医药', '公告', ',', '控股', '子公司', '复星医药', '产业', '收到', '国家', '药品', '监督', '管理局', '关于', '同意', '其', '获', '许可', '的', '马来酸阿伐曲泊帕片', '开展', '临床', '试验', '的', '通知书', '。', '该药品', '用于', '成人', '慢性免疫性', '血小板减少症', '适应症', '(', 'ITP', ')', ',', '复星医药', '产业', '拟于', '近期', '条件', '具备', '后于', '**', '境内', '开展', '该', '新药', '针对', '该', '适应症', '的', 'III期', '临床', '试验', '。']]
The semantic role labeling result:
[[(8, [('A0', 4, 7), ('A1', 9, 24)]), (14, [('A1', 15, 22)]), (16, [('A1', 17, 17)]), (20, [('A1', 21, 22)]), (27, [('A2', 28, 34)]), (45, [('A0', 36, 37), ('ARGM-TMP', 38, 42), ('ARGM-LOC', 43, 44), ('A1', 46, 54)])]]
verb: 收到 arv: ([['A0', '控股子公司复星医药产业'], ['A1', '国家药品监督管理局关于同意其获许可的马来酸阿伐曲泊帕片开展临床试验的通知书']])
verb: ('收到', 18, 2) arv: ([['A0', [('控股', 7, 2), ('子公司', 9, 3), ('复星医药', 12, 4), ('产业', 16, 2)]], ['A1', [('国家', 20, 2), ('药品', 22, 2), ('监督', 24, 2), ('管理局', 26, 3), ('关于', 29, 2), ('同意', 31, 2), ('其', 33, 1), ('获', 34, 1), ('许可', 35, 2), ('的', 37, 1), ('马来酸阿伐曲泊帕片', 38, 9), ('开展', 47, 2), ('临床', 49, 2), ('试验', 51, 2), ('的', 53, 1), ('通知书', 54, 3)]]])
verb: 同意 arv: ([['A1', '其获许可的马来酸阿伐曲泊帕片开展临床试验']])
verb: ('同意', 31, 2) arv: ([['A1', [('其', 33, 1), ('获', 34, 1), ('许可', 35, 2), ('的', 37, 1), ('马来酸阿伐曲泊帕片', 38, 9), ('开展', 47, 2), ('临床', 49, 2), ('试验', 51, 2)]]])
verb: 获 arv: ([['A1', '许可']])
verb: ('获', 34, 1) arv: ([['A1', [('许可', 35, 2)]]])
verb: 开展 arv: ([['A1', '临床试验']])
verb: ('开展', 47, 2) arv: ([['A1', [('临床', 49, 2), ('试验', 51, 2)]]])
verb: 用于 arv: ([['A2', '成人慢性免疫性血小板减少症适应症(ITP)']])
verb: ('用于', 61, 2) arv: ([['A2', [('成人', 63, 2), ('慢性免疫性', 65, 5), ('血小板减少症', 70, 6), ('适应症', 76, 3), ('(', 79, 1), ('ITP', 80, 3), (')', 83, 1)]]])
verb: 开展 arv: ([['A0', '复星医药产业'], ['ARGM-TMP', '拟于近期条件具备后于'], ['ARGM-LOC', '**境内'], ['A1', '该新药针对该适应症的III期临床试验']])
verb: ('开展', 105, 2) arv: ([['A0', [('复星医药', 85, 4), ('产业', 89, 2)]], ['ARGM-TMP', [('拟于', 91, 2), ('近期', 93, 2), ('条件', 95, 2), ('具备', 97, 2), ('后于', 99, 2)]], ['ARGM-LOC', [('**', 101, 2), ('境内', 103, 2)]], ['A1', [('该', 107, 1), ('新药', 108, 2), ('针对', 110, 2), ('该', 112, 1), ('适应症', 113, 3), ('的', 116, 1), ('III期', 117, 4), ('临床', 121, 2), ('试验', 123, 2)]]])
The verbs in the "train.csv" file:
[(18, 2), (31, 2), (34, 1), (47, 2), (61, 2), (104, 2)]
The verbs ('收到', 18, 2), ('同意', 31, 2), ('获', 34, 1), ('开展', 47, 2) and ('用于', 61, 2) have exactly the same start positions as in "train.csv".
But the positions are no longer consistent after the English "ITP". For example, the verb is ('开展', 105, 2) here, but (104, 2) in train.csv.
The same applies to A0 and A1.
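As a reading aid for the output above, here is a small sketch that decodes two of the semantic role entries by hand. It assumes, based only on the printed results, that each entry is (verb word index, [(role, start word index, end word index), ...]) with an inclusive end index. The helper names `char_start` and `decode` are made up for this sketch.

```python
# First 25 words of the word segmentation result shown above.
words = ['复星', '医药', '公告', ',', '控股', '子公司', '复星医药', '产业', '收到',
         '国家', '药品', '监督', '管理局', '关于', '同意', '其', '获', '许可', '的',
         '马来酸阿伐曲泊帕片', '开展', '临床', '试验', '的', '通知书']
# Two entries copied from the semantic role labeling result shown above.
srl = [(16, [('A1', 17, 17)]), (20, [('A1', 21, 22)])]

def char_start(words, idx):
    """Character offset of word idx: total length of all preceding words."""
    return sum(len(w) for w in words[:idx])

def decode(words, srl):
    """Turn word-index spans into (text, character start, character length)."""
    rows = []
    for verb_idx, args in srl:
        verb = (words[verb_idx], char_start(words, verb_idx), len(words[verb_idx]))
        roles = []
        for role, start, end in args:
            text = ''.join(words[start:end + 1])   # end index is inclusive
            roles.append((role, text, char_start(words, start), len(text)))
        rows.append((verb, roles))
    return rows

for verb, roles in decode(words, srl):
    print(verb, roles)
```

These are character offsets, which is why they agree with train.csv only as long as nothing before the word is tokenized into fewer or more pieces than it has characters.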
Another example, using the first record of train.csv: df_train.loc[0,'text_a']
The "text_a" is:
中泰化学披露三季报,公司2020年前三季度营业收入649.7亿元,同比增长0.71%;净利润亏损4.46亿元,上年同期(调整后)为盈利4.33亿元。第三季度净利润为亏损1.39亿元。
The word segmentation result:
[['中泰', '化学', '披露', '三季报', ',', '公司', '2020年', '前', '三季度', '营业', '收入', '649.7亿', '元', ',', '同比', '增长', '0.71%', ';', '净利润', '亏损', '4.46亿', '元', ',', '上年', '同期', '(', '调整', '后', ')', '为', '盈利', '4.33亿', '元', '。', '第三', '季度', '净利润', '为', '亏损', '1.39亿', '元', '。']]
The semantic role labeling result:
[[(2, [('A0', 0, 1), ('A1', 3, 3)]), (15, [('ARGM-ADV', 14, 14), ('A1', 16, 16)]), (19, [('A1', 18, 18), ('ARGM-EXT', 20, 21)]), (29, [('A0', 23, 28), ('A1', 30, 32)]), (37, [('ARGM-TMP', 34, 35), ('A0', 36, 36), ('A1', 38, 40)])]]
verb: 披露 arv: ([['A0', '中泰化学'], ['A1', '三季报']])
verb: ('披露', 4, 2) arv: ([['A0', [('中泰', 0, 2), ('化学', 2, 2)]], ['A1', [('三季报', 6, 3)]]])
verb: 增长 arv: ([['ARGM-ADV', '同比'], ['A1', '0.71%']])
verb: ('增长', 35, 2) arv: ([['ARGM-ADV', [('同比', 33, 2)]], ['A1', [('0.71%', 37, 5)]]])
verb: 亏损 arv: ([['A1', '净利润'], ['ARGM-EXT', '4.46亿元']])
verb: ('亏损', 46, 2) arv: ([['A1', [('净利润', 43, 3)]], ['ARGM-EXT', [('4.46亿', 48, 5), ('元', 53, 1)]]])
verb: 为 arv: ([['A0', '上年同期(调整后)'], ['A1', '盈利4.33亿元']])
verb: ('为', 64, 1) arv: ([['A0', [('上年', 55, 2), ('同期', 57, 2), ('(', 59, 1), ('调整', 60, 2), ('后', 62, 1), (')', 63, 1)]], ['A1', [('盈利', 65, 2), ('4.33亿', 67, 5), ('元', 72, 1)]]])
verb: 为 arv: ([['ARGM-TMP', '第三季度'], ['A0', '净利润'], ['A1', '亏损1.39亿元']])
verb: ('为', 81, 1) arv: ([['ARGM-TMP', [('第三', 74, 2), ('季度', 76, 2)]], ['A0', [('净利润', 78, 3)]], ['A1', [('亏损', 82, 2), ('1.39亿', 84, 5), ('元', 89, 1)]]])
The verbs in the "train.csv" file:
[(4, 2), (31, 2), (41, 2), (58, 1), (74, 1)]
The verb ('披露', 4, 2) has exactly the same start position as in "train.csv".
But the positions are no longer consistent after the numbers "2020" and "649.7". The verb ('增长', 35, 2) has a different start position from (31, 2) in train.csv.
The same applies to A0 and A1.
I guess it's because of the character encoding?
I hope the author can give an answer, thanks!
The word segmentation produced by LTP is different from the tokens produced by the pretrained model's tokenizer. The semantic role positions, like ([[(8, [('A0', 4, 7), ('A1', 9, 24)]), (14, [('A1', 15, 22)]), (16, [('A1', 17, 17)]), (20, [('A1', 21, 22)]), (27, [('A2', 28, 34)]), (45, [('A0', 36, 37), ('ARGM-TMP', 38, 42), ('ARGM-LOC', 43, 44), ('A1', 46, 54)])]]), represent positions in the tokens produced by the pretrained model's tokenizer, which may be a little confusing.
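To make the difference concrete, here is a rough sketch that lists, for every LTP word, both its character offset and its token offset. The names `toy_tokenize` and `offset_table` and the sample words are made up for illustration. `toy_tokenize` only imitates the idea that a number or English word is not split character by character; in practice one would presumably pass the pretrained tokenizer's `tokenize` method instead, and its token counts may differ from this toy.

```python
import re

def toy_tokenize(text):
    """Hypothetical stand-in for tokenizer.tokenize: ASCII letter or digit
    runs stay whole, every other character becomes one token."""
    return re.findall(r"[A-Za-z]+|[0-9]+|.", text)

def offset_table(words, tokenize):
    """Return one (word, char_start, token_start, token_len) row per word."""
    rows, char_pos, tok_pos = [], 0, 0
    for w in words:
        n_tok = len(tokenize(w))      # tokens this word contributes
        rows.append((w, char_pos, tok_pos, n_tok))
        char_pos += len(w)            # advance by characters
        tok_pos += n_tok              # advance by tokens
    return rows

# Made-up segmentation result, for illustration only.
words = ["公司", "2020年", "收入", "增长"]
table = offset_table(words, toy_tokenize)
for row in table:
    print(row)

# Words whose character offset no longer equals their token offset.
shifted = [w for w, c, t, _ in table if c != t]
print(shifted)
```

The two offsets agree up to the first word containing a number and drift apart after it, which is the same pattern as in the examples above.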