conll / reference-coreference-scorers

This is the reference implementation of commonly used coreference metrics.
Home Page: http://conll.github.io/reference-coreference-scorers
License: Other
The BLANC scorer is implemented in scorer.pl but not in scorer.bat. Is there a reason?
Does
perl scorer.pl blanc ...
work on Windows as well?
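For reference, the invocation pattern is the same as for the other metrics; key.conll and sys.conll here are hypothetical file names:

perl scorer.pl blanc key.conll sys.conll

Since scorer.pl is plain Perl, it should in principle run under a Windows Perl distribution such as Strawberry Perl, though I have not verified this.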
When I evaluate the TüBa-D/Z test set from the SemEval 2010 shared task, I get the following error:
Found too many repeated mentions (> 10) in the response, so refusing to score. Please fix the output.
Does anyone know how to fix it?
Thanks
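For context: the scorer refuses to run when the same mention span occurs more than ten times in the response. In CoNLL-style column format, a duplicated mention can look like the following hypothetical row, where the coreference column assigns the identical single-token span to an entity twice:

0001 0 her (1)|(1)

Deduplicating such spans in the system output before scoring should make the error go away.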
Hi, is there a way to represent and score disjoint (discontinuous) multi-word units? I get the same result when I try both options (A and B) below, so presumably both are treated as contiguous multi-word units?
Case A
took 0 4 (1
her 5 8 -
life 9 13 1)
Case B
took 0 4 (1
her 5 8 1
life 9 13 1)
Thanks!
Filip
As far as I understand, this is the new repository location for the software linked in the following paper.
Pradhan, S., Luo, X., Recasens, M., Hovy, E., Ng, V., & Strube, M. (2014). Scoring Coreference Partitions of Predicted Mentions: A Reference Implementation. In Annual Meeting of the Association for Computational Linguistics (Vol. 2, pp. 30–35). http://www.aclweb.org/anthology/P/P14/P14-2006
If this is the case, to make citations easier I would suggest adding this reference to the README file that is visible on the entry page of the repository.
Hi,
I think CEAFe precision and recall are reversed: their trend is consistently opposite to the trend of precision and recall under the B-cubed and MUC metrics.
Can you please check?
Thanks,
Joe
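For reference, and assuming the scorer follows the CEAF definitions from Luo (2005): with key entities K_i, response entities R_j, an entity similarity \phi, and an optimal one-to-one alignment g^* between key and response entities,

R = \frac{\Phi(g^*)}{\sum_i \phi(K_i, K_i)}, \quad P = \frac{\Phi(g^*)}{\sum_j \phi(R_j, R_j)}, \quad \text{where } \Phi(g^*) = \sum_i \phi(K_i, g^*(K_i))

A quick sanity check is to swap the key and response files: under these definitions, precision and recall should swap as well.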
It no longer outputs these metrics, whether they are specified explicitly or "all" metrics are requested.
I've come across a paper on the ACL website (https://www.aclweb.org/anthology/P16-1060/) which argues that the traditional metrics from the conll2012 scripts are not well suited to evaluating the coreference resolution task, and which introduces the LEA scorer (implemented in this repository). However, recent publications on this task are still mainly evaluated with the old metrics, and I can't see why. I would be grateful for an explanation.
Thanks :)
While preparing a recent shared task, I found that the metrics behave differently on the no-coreference case: MUC reports 0% F1, while the rest (blanc, ceaf, bcub) report 100%.
A very small test case is the following; the behavior can be checked by running the scorer with this file as both key and response (see the command after the example).
0001 0 A (0)
0001 1 B (1)
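A minimal invocation, assuming the rows above are wrapped in the usual #begin document ... #end document markers and saved as singletons.conll (a hypothetical name):

perl scorer.pl muc singletons.conll singletons.conll

Re-running with bcub, ceafe, or blanc in place of muc shows the discrepancy.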
The inconsistency can hurt when one wants to compute a document-level average: the 0% score produced by MUC can change the result dramatically. In addition, I think it is reasonable to score 100% when the key and response are identical.
Are there suggested practices for cross-document coreference? My thought was to treat each multi-document set as one document. Let me know if there is any support or best practice.