thunlp / hmeae
Source code for EMNLP-IJCNLP 2019 paper "HMEAE: Hierarchical Modular Event Argument Extraction".
License: MIT License
Only stanford-corenlp-full-2018-10-05 is supported; if you use another version, it doesn't work.
Thank you for releasing the code for HMEAE.
My question about the TAC KBP 2016 dataset is:
Thank you and looking forward to your reply.
C:\Anaconda3\envs\tensorflow\python.exe F:/HMEAE-master/train.py --mode DMCNN
WARNING:tensorflow:From F:/HMEAE-master/train.py:26: The name tf.app.run is deprecated. Please use tf.compat.v1.app.run instead.
--File Extraction Finish--
--Entity Extraction Finish--
Traceback (most recent call last):
File "F:/HMEAE-master/train.py", line 26, in <module>
tf.app.run()
File "C:\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\platform\app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "C:\Anaconda3\envs\tensorflow\lib\site-packages\absl\app.py", line 299, in run
_run_main(main, args)
File "C:\Anaconda3\envs\tensorflow\lib\site-packages\absl\app.py", line 250, in _run_main
sys.exit(main(argv))
File "F:/HMEAE-master/train.py", line 15, in main
extractor.Extract()
File "F:\HMEAE-master\utils.py", line 388, in Extract
self.Event_Extract()
File "F:\HMEAE-master\utils.py", line 144, in Event_Extract
nlp = StanfordCoreNLPv2(constant.corenlp_path)
File "F:\HMEAE-master\utils.py", line 438, in __init__
super(StanfordCoreNLPv2,self).__init__(path)
File "C:\Anaconda3\envs\tensorflow\lib\site-packages\stanfordcorenlp\corenlp.py", line 46, in __init__
if not subprocess.call(['java', '-version'], stdout=subprocess.PIPE, stderr=subprocess.STDOUT) == 0:
File "C:\Anaconda3\envs\tensorflow\lib\subprocess.py", line 287, in call
with Popen(*popenargs, **kwargs) as p:
File "C:\Anaconda3\envs\tensorflow\lib\subprocess.py", line 729, in __init__
restore_signals, start_new_session)
File "C:\Anaconda3\envs\tensorflow\lib\subprocess.py", line 1017, in _execute_child
startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified.
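The FileNotFoundError above is raised while stanfordcorenlp spawns `java -version`; [WinError 2] means Windows could not find a `java` executable on PATH. As a quick sanity check before constructing the wrapper (a minimal sketch of my own, not part of HMEAE), you can probe for the JDK the same way the library does:

```python
import shutil
import subprocess

def java_available() -> bool:
    """Return True if a `java` executable is reachable on PATH.

    Mirrors the check stanfordcorenlp performs internally, but fails
    gracefully instead of raising FileNotFoundError on Windows.
    """
    if shutil.which("java") is None:  # nothing named `java` on PATH
        return False
    # Same probe the library runs: `java -version` must exit with code 0.
    return subprocess.call(
        ["java", "-version"],
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
    ) == 0

print(java_available())
```

If this prints False, install a JDK/JRE and add its bin directory to PATH before rerunning train.py.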
Traceback (most recent call last):
File "D:/pythonProject/HMEAE-master/train.py", line 28, in <module>
tf.app.run()
File "D:\anaconda2020\lib\site-packages\tensorflow\python\platform\app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "D:\anaconda2020\lib\site-packages\absl\app.py", line 303, in run
_run_main(main, args)
File "D:\anaconda2020\lib\site-packages\absl\app.py", line 251, in _run_main
sys.exit(main(argv))
File "D:/pythonProject/HMEAE-master/train.py", line 17, in main
extractor.Extract()
File "D:\pythonProject\HMEAE-master\utils.py", line 388, in Extract
self.Event_Extract()
File "D:\pythonProject\HMEAE-master\utils.py", line 183, in Event_Extract
tokens,offsets = nlp.word_tokenize(sent,True)
File "D:\anaconda2020\lib\site-packages\stanfordcorenlp\corenlp.py", line 173, in word_tokenize
r_dict = self._request('ssplit,tokenize', sentence)
File "D:\anaconda2020\lib\site-packages\stanfordcorenlp\corenlp.py", line 239, in _request
r_dict = json.loads(r.text)
File "D:\anaconda2020\lib\json\__init__.py", line 357, in loads
return _default_decoder.decode(s)
File "D:\anaconda2020\lib\json\decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "D:\anaconda2020\lib\json\decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
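This JSONDecodeError means the local CoreNLP server answered with something that is not JSON, typically an empty body or an error page (for example when a request times out or the annotators fail to load). A hedged sketch of a guard around the parse (the function name is my own, not from the repo) that surfaces the raw response instead of a bare decode error:

```python
import json

def parse_corenlp_response(text: str) -> dict:
    """Parse a CoreNLP server response, raising a readable error if it
    is not JSON (e.g. an HTML error page or an empty body)."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        snippet = text[:200] if text else "<empty response>"
        raise ValueError(
            f"CoreNLP did not return JSON ({exc}); "
            f"raw response starts with: {snippet!r}"
        ) from exc

# A well-formed response parses normally:
print(parse_corenlp_response('{"sentences": []}'))
```

Seeing the raw response usually makes the underlying cause (server not started, wrong port, out-of-memory) obvious.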
Thank you very much for sharing the code. I have a question about the f_score computation and would appreciate your clarification.
Let me illustrate with an example. Suppose the gold labels are as follows, where the first number is the event-type label and the second is the role label:
(2,1),(2,3),(0,0),(5,3)
and the predictions are:
(2,1),(2,4),(0,5),(5,0)
Following the computation in the code, I get TP=1, FN=3, FP=1, so R=1/4 and P=1/2.
When I reproduced other papers in this area, the computation was R=1/3: the gold positives that need to be predicted are (2,1),(2,3),(5,3), so there are 3 positive samples and R=1/3. Likewise P=1/3: three labels were predicted and only one is correct, so P is 1/3.
My understanding is that the number of positive samples in the test set should be fixed, so FN+TP should be constant.
This seems to differ from your computation. Am I calculating something wrong? Looking forward to your reply, thanks!
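For concreteness, the alternative scheme the commenter describes can be sketched as follows, assuming role label 0 means "not an argument" and that gold and predicted labels are aligned per candidate. This implements the commenter's counting (TP+FN equals the number of gold positives), not necessarily the repo's func/f_score:

```python
from fractions import Fraction

NONE_ROLE = 0  # assumption: role label 0 means "not an argument"

def prf(gold, pred):
    """Precision/recall over aligned (event_type, role) pairs, counting
    each gold positive exactly once, so TP + FN is fixed by the test set."""
    tp = fp = fn = 0
    for (g_evt, g_role), (p_evt, p_role) in zip(gold, pred):
        gold_pos = g_role != NONE_ROLE
        pred_pos = p_role != NONE_ROLE
        if gold_pos and pred_pos and (g_evt, g_role) == (p_evt, p_role):
            tp += 1
        else:
            if gold_pos:
                fn += 1  # a gold positive was missed or mislabeled
            if pred_pos:
                fp += 1  # a predicted positive did not match gold
    p = Fraction(tp, tp + fp) if tp + fp else Fraction(0)
    r = Fraction(tp, tp + fn) if tp + fn else Fraction(0)
    return p, r

gold = [(2, 1), (2, 3), (0, 0), (5, 3)]
pred = [(2, 1), (2, 4), (0, 5), (5, 0)]
print(prf(gold, pred))  # (Fraction(1, 3), Fraction(1, 3))
```

On the example above this yields P=1/3 and R=1/3, matching the commenter's calculation.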
Thank you very much for releasing the source code.
I noticed that the code for DMCNN and HMEAE (DMCNN) is released while DMBERT and HMEAE (BERT) are missing. Could you release the code for these two models?
Thank you.
The issue happens in utils.py line 183.
I checked the source code of corenlp: the word_tokenize() function only has two arguments, self and sent, so I don't know where the boolean argument comes from.
If possible, could you please help me solve this problem? Or provide the processed ACE data?
Thanks a lot!
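For context, utils.py subclasses the wrapper as StanfordCoreNLPv2, presumably so that word_tokenize(sent, True) also returns character offsets, which the stock stanfordcorenlp word_tokenize does not expose. A hedged sketch of what such an override could extract from the server's JSON (the sample response below is hand-written in the CoreNLP server's ssplit,tokenize output format, not captured from a real run):

```python
def tokens_and_offsets(r_dict):
    """Pull (token, (start, end)) pairs out of a CoreNLP JSON response.

    CoreNLP's ssplit,tokenize annotators return, per sentence, a list of
    token objects carrying characterOffsetBegin/characterOffsetEnd.
    """
    tokens, offsets = [], []
    for sentence in r_dict["sentences"]:
        for tok in sentence["tokens"]:
            tokens.append(tok["word"])
            offsets.append((tok["characterOffsetBegin"],
                            tok["characterOffsetEnd"]))
    return tokens, offsets

# Hand-made response for the sentence "He left."
sample = {
    "sentences": [{
        "tokens": [
            {"word": "He", "characterOffsetBegin": 0, "characterOffsetEnd": 2},
            {"word": "left", "characterOffsetBegin": 3, "characterOffsetEnd": 7},
            {"word": ".", "characterOffsetBegin": 7, "characterOffsetEnd": 8},
        ]
    }]
}
print(tokens_and_offsets(sample))  # (['He', 'left', '.'], [(0, 2), (3, 7), (7, 8)])
```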
> --File Extraction Finish--
> --Entity Extraction Finish--
> Traceback (most recent call last):
> File "train.py", line 26, in <module>
> tf.app.run()
> File "/home/zhangmingyu/anaconda3/envs/tf_gpu/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
> _sys.exit(main(argv))
> File "train.py", line 15, in main
> extractor.Extract()
> File "/home/zhangmingyu/HMEAE-master/utils.py", line 388, in Extract
> self.Event_Extract()
> File "/home/zhangmingyu/HMEAE-master/utils.py", line 203, in Event_Extract
> entity_start = entity_offsets[0][0]
> IndexError: list index out of range
>
I don't have the ACE 2005 dataset, so I don't know the exact format of your data input. Could you please describe your input format so that I can convert my own dataset into it? Specifically, what are the concrete forms of t_data, a_data, loader.maxlen, loader.max_argument_len, and loader.wordemb?
Hi, I have glanced through your paper and found that the final output is the role type involved in the input sentence rather than the role type of the candidate entity, because you compress/encode the input embeddings into a sentence representation/embedding, then concatenate it with the role-oriented embedding before using a softmax to get the estimated role type.
Besides, what if there are no explicit argument roles in everyday text? I mean, you can't get labeled sentences in the test set, so you don't know which role types are contained in the input sentence. How could I calculate the role-oriented embedding?
Thank you for your explanation.
Thank you very much for releasing source code about this paper.
However, I noticed you used func/f_score to calculate argument detection performance, which basically considers whether predicted roles and gold roles match. The event types are ignored in the evaluation. I think something is wrong, considering the criterion is as follows:
An argument is correctly classified if its offsets, role, related trigger type and trigger’s offsets exactly match a reference argument.
There are some cases you probably missed:
Correct me if I'm wrong, and thank you again.
Thank you for releasing the source code.
I noticed that DMBERT has a special token to indicate the event type when detecting arguments.
To utilize the event type information in our model, we append a special token into each input sequence for BERT to indicate the event type.
Could you give me more details about this operation? Maybe an example would help. Taking an attack event for example, the input might look like the following:
[CLS] [Token1] [Token2] [Token3] [Token4]...[Token 128] [SEP] [ATTACK]
What is the special token? Something like [ATTACK] or #ATTACK#?
If the special token doesn't exist in Bert's vocab file, how do you initialize the representation for the token?
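On the initialization question: the paper doesn't spell this out, but a common trick when a token such as [ATTACK] is absent from BERT's vocab is to append a new row to the embedding matrix and initialize it with, e.g., the mean of the existing embeddings before fine-tuning. A toy sketch of that idea in plain Python (the tiny vocab and 2-d vectors are illustrative assumptions, not the authors' code):

```python
def add_special_token(vocab, embeddings, token):
    """Append `token` to the vocab and give it a mean-initialized vector.

    `embeddings` is a list of equal-length float vectors, one per vocab
    entry; the new token's vector is the elementwise mean of all rows,
    a common neutral starting point that fine-tuning then adjusts.
    """
    dim = len(embeddings[0])
    mean_vec = [sum(row[d] for row in embeddings) / len(embeddings)
                for d in range(dim)]
    vocab[token] = len(embeddings)
    embeddings.append(mean_vec)
    return vocab, embeddings

# Toy 3-entry vocab with 2-dimensional embeddings.
vocab = {"[CLS]": 0, "[SEP]": 1, "attack": 2}
emb = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
add_special_token(vocab, emb, "[ATTACK]")
print(vocab["[ATTACK]"], emb[3])  # 3 [1.0, 1.0]
```

With a real BERT implementation the same effect is achieved by extending the tokenizer's vocabulary and resizing the model's input embedding matrix accordingly.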
Thank you, and looking forward to your reply.
Thank you for releasing the source code for this paper.
I ran the code with the following commands, which I believe are correct:
python train.py --gpu 1 --mode DMCNN
python train.py --gpu 1 --mode HMEAE
The performance is not as high as the paper reported. I get the following for DMCNN:
test best Precision: 0.5258620689655172, test best Recall: 0.48348745046235136, test best F1: 0.5037852718513421
And performance of HMEAE(CNN) is even worse.
I guess maybe it's because of the random data split.
other_files = [file for dir in self.dirs for file in self.source_files[dir] if dir != 'nw'] + nw[40:]
random.shuffle(other_files)
random.shuffle(other_files)
test_files = nw[:40]
dev_files = other_files[:30]
train_files = other_files[30:]
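One reason scores drift between runs is that this split is reshuffled every time. A minimal sketch of how seeding could pin the split down (sorting first to remove any dependence on filesystem ordering; this is my suggestion, not the repo's code):

```python
import random

def deterministic_split(files, n_dev=30, seed=42):
    """Shuffle a copy of `files` reproducibly and split off a dev set.

    Sorting first makes the input order canonical, so the same seed
    always yields the same dev/train partition across machines.
    """
    files = sorted(files)
    rng = random.Random(seed)  # private RNG, unaffected by global state
    rng.shuffle(files)
    return files[:n_dev], files[n_dev:]

dev_a, train_a = deterministic_split([f"doc{i}" for i in range(100)])
dev_b, train_b = deterministic_split([f"doc{i}" for i in range(100)])
print(dev_a == dev_b and train_a == train_b)  # True
```

Of course, matching the paper's exact numbers still requires the authors' actual split, which is why sharing the three JSON files is the cleaner fix.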
Could you please release or email me ([email protected]) the data split you used for the experiments (I mean the three files: train.json/dev.json/test.json)? I have the ACE data and a license for it.
Thank you very much