
Comments (14)

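Spico197 commented on July 29, 2024

Hi, thanks for your interest in this project. This result does look strange. The sentence splitter for DuEE-fin is a punctuation-based tool I wrote myself. Earlier tests never showed it deleting text, so there may be a bug somewhere. Could you run this document through the sentence splitter here on its own and compare the text before and after? Thanks for the feedback!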

def sent_seg(
    text,
    special_seg_indicators=None,
    lang="zh",
    punctuations=None,
    quotation_seg_mode=True,
) -> list:
    """
    Cut texts into sentences (in Chinese).

    Args:
        text <str>: text to be split
        special_seg_indicators <list>: special segment indicators and
            their replacements ( [indicator, replacement] ). For baike data,
            this argument could be `[('###', '\n'), ('%%%', ' '), ('%%', ' ')]`
        lang <str>: language of the corpus; `zh` (Chinese) and `en` (English)
            are supported for now.
        punctuations <set>: punctuation marks to split the text on.
            Texts are not split on `;`, so you can specify such marks yourself.
        quotation_seg_mode <bool>: if True, a closing quotation is regarded as
            part of the preceding sentence.
            e.g. `我说:“翠花,上酸菜。”,她说:“欸,好嘞。”`
            the text will be split into
            ['我说:“翠花,上酸菜。”,', '她说:“欸,好嘞。”'], rather than
            ['我说:“翠花,上酸菜。', '”,她说:“欸,好嘞。”']

    Returns:
        <list>: a list of strings, which are the split sentences.
    """

from docee.
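
The quotation handling described in the docstring above can be sketched roughly as follows. This is only a minimal illustration, not the repo's `sent_seg` implementation: the function name `simple_sent_seg` and the three punctuation sets are assumptions made for this example, since the tool's real defaults are not shown in this thread.

```python
import re

# Assumed punctuation sets (hypothetical; not taken from the DocEE code).
SENT_END = "。!?!?"       # marks that end a sentence
CLOSING_QUOTES = "”’」』"   # closing quotes kept with the preceding sentence
TRAILING_COMMAS = ",,"     # at most one comma allowed right after the quote

# One sentence = non-terminal characters, one or more terminators, then
# optionally closing quotes followed by at most one comma.
_SENT_RE = re.compile(
    rf"[^{SENT_END}]*[{SENT_END}]+(?:[{CLOSING_QUOTES}]+[{TRAILING_COMMAS}]?)?"
)


def simple_sent_seg(text):
    """Split text on sentence-ending punctuation, keeping closing quotes
    attached to the sentence they close."""
    sentences = []
    last_end = 0
    for match in _SENT_RE.finditer(text):
        sentences.append(match.group())
        last_end = match.end()
    # Keep any trailing text that has no sentence-ending punctuation.
    tail = text[last_end:]
    if tail:
        sentences.append(tail)
    return [s.strip() for s in sentences if s.strip()]


print(simple_sent_seg("我说:“翠花,上酸菜。”,她说:“欸,好嘞。”"))
# ['我说:“翠花,上酸菜。”,', '她说:“欸,好嘞。”']
```

Running a document through a splitter like this and joining the pieces back together is one quick way to check that no characters are lost at the splitting stage.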

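miraitowa9 commented on July 29, 2024

Hello, I tried running this code directly on my text, and its sentence splitting works fine.
image
image
I don't know why, but after running build_data.py, the sentence splits in the "sentences" field of dueefin_train_w_tgg.json are somewhat scrambled.
image
Some sentences are split at commas, some have the digits after a percentage removed, and some are split at special symbols.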

Spico197 commented on July 29, 2024

That's very strange. Let me test it.

miraitowa9 commented on July 29, 2024

Thank you very much! I appreciate the effort!

Spico197 commented on July 29, 2024

Hi, this happens because those sentences exceed the maximum sentence length (128), so everything after that point is simply dropped.
image
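
To illustrate the truncation described above, here is a minimal sketch. It assumes a plain per-sentence character cut at `max_seq_len`; the helper name `truncate_sentences` and the toy sentence are hypothetical, and the actual logic in build_data.py is not shown in this thread.

```python
def truncate_sentences(sentences, max_seq_len=128):
    """Cut each sentence to at most max_seq_len characters.

    Returns (kept, dropped): the truncated sentences and the tail that
    was cut from each one (an empty string when nothing was cut).
    """
    kept = [s[:max_seq_len] for s in sentences]
    dropped = [s[max_seq_len:] for s in sentences]
    return kept, dropped


# Toy data: a 137-character sentence and a short one.
sentences = ["a" * 120 + "占比12.5%,同比增长3.4%。", "短句。"]
kept, dropped = truncate_sentences(sentences)

for k, d in zip(kept, dropped):
    print(len(k), repr(d))
# 128 '同比增长3.4%。'
# 3 ''
```

In the toy example the second percentage disappears entirely, which resembles the missing digits reported earlier in the thread.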

Spico197 commented on July 29, 2024

Thank you very much for finding this potential problem. It does affect the offline dev evaluation results, because arguments in the truncated part are not included. However, the dev-set results in the paper were obtained under the same setting, so they can still be compared fairly under that setting. The final performance should be judged by the online test-set results.

For all researchers who see this issue, here's what happened: @miraitowa9 found that max_seq_len is set to 128 when building DuEE-fin. This means the gold event arguments may be fewer than in the true annotations (if an argument appears only in the truncated part, it is set to null in the gold labels).
However, since all the baselines are compared under the same setting, the trends and rankings are still reasonable. For all subsequent researchers, I highly recommend submitting the test2 predictions to the online evaluation platform to get final results for a truly fair comparison.

Thanks again to miraitowa9!
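
For anyone who wants to check their own preprocessed data, the effect described above can be sketched as follows. This is a rough illustration under assumptions: arguments are matched as plain substrings of the truncated sentences, which is simpler than the span-based labels a real pipeline would use, and `find_lost_arguments` and the toy data are hypothetical.

```python
def find_lost_arguments(sentences, arguments, max_seq_len=128):
    """Return the arguments that no longer occur in any sentence after
    each sentence is cut to max_seq_len characters."""
    truncated = [s[:max_seq_len] for s in sentences]
    return [arg for arg in arguments if not any(arg in s for s in truncated)]


# Toy document: one 132-character sentence whose last argument crosses
# the 128-character boundary.
sentences = ["某某公司" + "x" * 121 + "质押500万股"]
arguments = ["某某公司", "500万股"]

print(find_lost_arguments(sentences, arguments, max_seq_len=128))
# ['500万股']
print(find_lost_arguments(sentences, arguments, max_seq_len=256))
# []
```

An argument reported as lost here would end up as null in the gold labels built with that `max_seq_len`.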

miraitowa9 commented on July 29, 2024

Thank you very much for your patient answer! After I set max_seq_len to 256, no arguments were lost anymore. May I ask whether increasing max_seq_len affects the model's prediction results later on?

Spico197 commented on July 29, 2024

For a fair comparison with other approaches, I uniformly adopted the settings used in prior work. I have not tested other settings.

miraitowa9 commented on July 29, 2024

OK, thanks!

miraitowa9 commented on July 29, 2024

Have you tried the procnet model for document-level event extraction? It also seems to perform quite well. Have you considered integrating it into your code?

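Spico197 commented on July 29, 2024

Hi, thanks for asking. I plan to maintain this repo over the long term and collect as many document-level event extraction methods as possible, but I've been quite busy lately and my time is limited. Code contributions are welcome!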

miraitowa9 commented on July 29, 2024

I think this repo of yours is really well done! That's why I'd like to recommend this code: https://github.com/xnyuwg/procnet

Spico197 commented on July 29, 2024

Thanks a lot. I've been following this work, and its performance and results are very good. I'll find time to add it. Thanks for the recommendation!

miraitowa9 commented on July 29, 2024

Great, I'm really looking forward to it!
