davidsbatista / ner-evaluation

An implementation of full named-entity evaluation metrics based on SemEval'13 Task 9 - not at the tag/token level, but considering all the tokens that are part of the named entity

License: MIT License

Jupyter Notebook 42.02% Python 57.98%
named-entity-recognition evaluation-metrics notebook-jupyter crfsuite semeval-2013 ner-evaluation ner semeval

ner-evaluation's Introduction

Named Entity Evaluation as in SemEval 2013 task 9.1

My own implementation, with lots of input from Matt Upson, of the Named-Entity Recognition evaluation metrics as defined by SemEval 2013 task 9.1.

These evaluation metrics go beyond a simple token/tag-based schema and consider different scenarios based on whether all the tokens that belong to a named entity were classified correctly, and also whether the correct entity type was assigned.

You can find a more detailed explanation in the following blog post:

Notes:

In scenarios IV and VI the entity types of the true and predicted entities do not match; in both cases we score only against the true entity, not the predicted one. One could argue that the predicted entity should also be scored as spurious, but according to the definition of spurious:

  • Spurious (SPU): the system produces a response which doesn't exist in the golden annotation;

In this case an annotation does exist, just with a different entity type, so we count it only as incorrect.
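
To make the note concrete, here is a small self-contained illustration of scenario IV (boundaries match, types differ); the Entity namedtuple mirrors the one defined in ner_eval.py, and the per-schema outcomes are written out as comments rather than computed:

from collections import namedtuple

Entity = namedtuple("Entity", "e_type start_offset end_offset")

true_entity = Entity("LOC", 3, 5)   # golden annotation
pred_entity = Entity("PER", 3, 5)   # same token span, wrong entity type

# How each evaluation schema scores this pair (scenario IV):
#   strict   -> incorrect (boundaries match, type does not)
#   ent_type -> incorrect (wrong entity type)
#   exact    -> correct   (boundaries match exactly)
#   partial  -> correct   (boundaries match; type is ignored)
# The prediction is not additionally counted as spurious, because a golden
# annotation does exist at this position, just with a different type.
print(true_entity, pred_entity)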

Example:

You can see a working example in the following notebook:

Note that in order to run that example you need to have installed:

  • sklearn
  • nltk
  • sklearn_crfsuite

For testing you will need:

  • pytest
  • coverage

These dependencies can be installed by running:

pip3 install -r requirements.txt
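
For orientation, a minimal usage sketch, assuming ner_eval.py from this repository is importable: collect_named_entities turns a BIO tag sequence into Entity namedtuples, and the notebook then passes the true and predicted entity lists to compute_metrics (see the notebook for the exact evaluation call):

from ner_eval import collect_named_entities

true_tags = ['O', 'B-LOC', 'I-LOC', 'O', 'B-PER', 'O']
pred_tags = ['O', 'B-LOC', 'I-LOC', 'O', 'B-ORG', 'O']

true_entities = collect_named_entities(true_tags)
pred_entities = collect_named_entities(pred_tags)

print(true_entities)   # e.g. [Entity(e_type='LOC', start_offset=1, end_offset=2), ...]
# the notebook passes two such lists to compute_metrics to obtain the
# overall results and the results aggregated by entity type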

Code tests and test coverage:

To run tests:

coverage run --rcfile=setup.cfg -m pytest

To produce a coverage report:

coverage report

ner-evaluation's People

Contributors

davidsbatista, dependabot[bot], ivyleavedtoadflax, j-rossi-nl, kydlaw


ner-evaluation's Issues

Some problems with evaluation_agg_entities_type

Hello, I added two lines to this dict in your ipynb:

results = {'strict': deepcopy(metrics_results),
           'ent_type': deepcopy(metrics_results),
           'partial': deepcopy(metrics_results),
           'exact': deepcopy(metrics_results),
           }

but in the end I found that the output of 'possible'/'actual' is 0 for both.

So I want to ask: why are there no evaluation_agg_entities_type[type]['partial'/'exact']['possible'/'actual'] entries in the compute_metrics function?

Incorrect entities extraction

If 2 entities of the same type are next to each other, e.g., tags = ['O', 'B-LOC', 'I-LOC', 'B-LOC', 'I-LOC', 'O']
Run

collect_entities(tags)

Expect: 2 entities.
Actual: [Entity(type='LOC', start_offset=1, end_offset=4)]
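
For reference, a self-contained sketch (not the repository's collect_named_entities) of the behaviour the issue expects, where a B- tag always closes the entity that is currently open:

from collections import namedtuple

Entity = namedtuple("Entity", "e_type start_offset end_offset")

def collect_bio_entities(tags):
    """Collect entity spans from a BIO tag sequence."""
    entities = []
    current = None  # [entity type, start offset] of the currently open entity
    for i, tag in enumerate(tags):
        if tag == 'O' or tag.startswith('B-'):
            if current is not None:  # close the open entity
                entities.append(Entity(current[0], current[1], i - 1))
                current = None
        if tag.startswith('B-'):
            current = [tag[2:], i]
        elif tag.startswith('I-') and current is None:
            current = [tag[2:], i]  # tolerate a dangling I- tag
    if current is not None:  # flush an entity that runs to the end of the sequence
        entities.append(Entity(current[0], current[1], len(tags) - 1))
    return entities

print(collect_bio_entities(['O', 'B-LOC', 'I-LOC', 'B-LOC', 'I-LOC', 'O']))
# [Entity(e_type='LOC', start_offset=1, end_offset=2),
#  Entity(e_type='LOC', start_offset=3, end_offset=4)]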

Different possible input formats

StanfordNER

  • Single-Line
    Switzerland/LOCATION ,/O Davos/PERSON 2018/O :/O Soros/PERSON accuses/O Trump/PERSON of/O wanting/O a/O `/O mafia/O state/O '/O and/O blasts/O social/O media/O ./O

  • CoNLL-like

,	O
Davos	PERSON
2018	O
:	O
Soros	PERSON
accuses	O
Trump	PERSON
of	O
wanting	O
a	O
`	O
mafia	O
state	O
'	O
and	O
blasts	O
social	O
media	O
.	O
  • xml: TODO (add example)
  • inlineXML: TODO (add example)
  • tsv: TODO (add example)
  • slashTags: TODO (add example)

spaCy
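
For the single-line slash format listed above, a small parsing sketch (splitting each token on its last '/', so tokens that themselves contain a slash are still handled):

line = ("Switzerland/LOCATION ,/O Davos/PERSON 2018/O :/O Soros/PERSON "
        "accuses/O Trump/PERSON of/O wanting/O a/O `/O mafia/O state/O '/O "
        "and/O blasts/O social/O media/O ./O")

# split on whitespace, then split each token on its last '/'
token_label_pairs = [tuple(tok.rsplit('/', 1)) for tok in line.split()]

print(token_label_pairs[:3])
# [('Switzerland', 'LOCATION'), (',', 'O'), ('Davos', 'PERSON')]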

find_overlap question

Hi,
I assumed find_overlap is supposed to find whether two ranges have any portion in common; am I wrong?

In your function, if the input is true_range = range(1, 2) and pred_range = range(2, 2), then pred_range is a subset of true_range, so shouldn't this count as a partial overlap? Are you not counting such overlaps as partial? (The range is exclusive of its upper bound.)

The function below returns set() when I feed it the above true_range and pred_range. Wouldn't it be better to check whether the minimum of the ranges' upper bounds is greater than or equal to the maximum of their lower bounds, and return True to say the two overlap? Please correct me if I am not understanding the goal of your find_overlap function correctly :)

def find_overlap(true_range, pred_range):
    """Find the overlap between two ranges.

    Return the overlapping values if present, else return an empty set().

    Examples:
    >>> find_overlap((1, 2), (2, 3))
    {2}
    >>> find_overlap((1, 2), (3, 4))
    set()
    """
    true_set = set(true_range)
    pred_set = set(pred_range)

    overlaps = true_set.intersection(pred_set)

    return overlaps
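
For comparison, a small sketch of the bounds-based check suggested above, assuming inclusive start/end offsets; it avoids building the intermediate sets:

def spans_overlap(true_start, true_end, pred_start, pred_end):
    """Two inclusive spans overlap when the smaller of the two end offsets
    is not before the larger of the two start offsets."""
    return min(true_end, pred_end) >= max(true_start, pred_start)

print(spans_overlap(1, 2, 2, 3))  # True  -- the spans share offset 2
print(spans_overlap(1, 2, 3, 4))  # False -- the spans are disjoint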

True entities considered multiple times

Thank you very much for your code and tutorial.

I think that, when looking for overlaps between predicted and true entities, we need to make sure that the same true entity hasn't been used before (i.e., that it is not already in true_which_overlapped_with_pred).
Otherwise there might be spurious entities that match an already-used true entity and therefore go uncounted.

But, at the same time, a "better" match for the same true entity might come after it has already been considered (and possibly consumed, if we don't allow it to match multiple predicted entities). So I think we'd need to keep track of the type of match for each predicted entity and, if a "better" one comes later, count the previous one as spurious.
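
A minimal, self-contained sketch of the first point (this is not the repository's code): greedy, first-come matching in which each true entity can be claimed by at most one prediction, and a prediction that finds no unclaimed true entity counts as spurious. Handling the "better match arrives later" case would need an extra pass and is left out here:

from collections import namedtuple

Entity = namedtuple("Entity", "e_type start_offset end_offset")

def overlaps(a, b):
    # inclusive offsets: spans overlap when neither ends before the other starts
    return min(a.end_offset, b.end_offset) >= max(a.start_offset, b.start_offset)

def match_predictions(true_entities, pred_entities):
    used, matches = set(), []
    for pred in pred_entities:
        partner = next((i for i, t in enumerate(true_entities)
                        if i not in used and overlaps(t, pred)), None)
        if partner is None:
            matches.append((pred, None))  # spurious: no unclaimed true span left
        else:
            used.add(partner)
            matches.append((pred, true_entities[partner]))
    return matches

gold = [Entity('LOC', 1, 2)]
preds = [Entity('LOC', 1, 2), Entity('LOC', 2, 3)]
print(match_predictions(gold, preds))
# the first prediction claims the gold span; the second finds nothing left and is spurious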

Aggregated by entity type results possible error

@davidsbatista first of all, thanks a lot for your great tutorial and your code. I think there is a minor mistake in the computation of the metrics aggregated by entity type:

In lines 123-124, you compute the metrics for the case where a predicted entity overlaps with an entity that belongs to a different entity type. You count +1 incorrect for the mispredicted entity type, which means this error affects both the actual and possible counts. In this specific case you have mispredicted an instance of entity type A where there is actually an instance of entity type B.

In my understanding, it would be more precise to count:

evaluation_agg_entities_type[true.e_type]['strict']['missed'] += 1
evaluation_agg_entities_type[pred.e_type]['strict']['spurious'] += 1

And why is that?

  • The mispredicted instance of entity type A is spurious, given the formal definition:

"the system produces a response (a prediction of type A) which doesn't exist in the golden annotation"

  • You have also missed an instance of entity type B, given the formal definition:

"a golden annotation is not captured by a system"

If you do not count +1 missed for entity type B in this part of the code, then you won't count it later on in lines 149-150 either, because you have already appended this entity to the list "true_which_overlapped_with_pred".

Maybe I'm wrong, but I feel that these small adjustments improve the notion of precision and recall. I would really like to hear your thoughts. Cheers!

Counting spurious entities

Hi,

I found an issue when counting spurious entities. In lines 317-322 of ner_eval.py, if a spurious entity is found, +1 is added for all entity types. Should it be added for only one entity type? That is:

for true in tags:
    evaluation_agg_entities_type[true]['strict']['spurious'] += 1
    evaluation_agg_entities_type[true]['ent_type']['spurious'] += 1
    evaluation_agg_entities_type[true]['partial']['spurious'] += 1
    evaluation_agg_entities_type[true]['exact']['spurious'] += 1

change to

evaluation_agg_entities_type[pred.e_type]['strict']['spurious'] += 1
evaluation_agg_entities_type[pred.e_type]['ent_type']['spurious'] += 1
evaluation_agg_entities_type[pred.e_type]['partial']['spurious'] += 1 
evaluation_agg_entities_type[pred.e_type]['exact']['spurious'] += 1

?

Thanks.
Andy.

collect_named_entities skips over entities

For example: collect_named_entities(['B-LOC', 'I-LOC']) will return [].

Fix:
ner_eval.py L149: if ent_type is not None and start_offset is not None and end_offset is None:

true_which_overlapped_with_pred does not get updated properly

Hi,

I think there are a few scenarios in which you are not updating true_which_overlapped_with_pred properly and break out of your for loops too early.

For example, in line 267 of the ner_eval.py file, you append a "true" entity to true_which_overlapped_with_pred, set found_overlap = True, and then break. In this case you are not appending the other "true" entities that also have a partial overlap. For example, if the annotated data has two spans, [2, 4] and [5, 7], and the NER model predicts one span, [3, 6], both "true" spans overlap with this single "pred" span, but you break out of the loop before the entity with span [5, 7] can be added to true_which_overlapped_with_pred. I think it is right that the metrics should not be updated, but true_which_overlapped_with_pred should be updated with both "true" spans.

I tried changing the code as follows and it started counting spans properly:

(screenshot of the proposed code change not reproduced here)

Create module and accept other formats

Hi @davidsbatista, good to meet you the other day!

So I've done a bit more work on this, but have not PRed it here yet. I created a module structure and added CI/CD here: MantisAI/nervaluate#1. Not sure if you want me to PR into this repo - I'm happy to, but it will start to move away from the codebase referred to in the blog post; I guess that is fine with good docs?

The next thing I intend to work on is accepting different formats. So far the tool accepts two lists of tags and then converts them to namedtuples.

The format that prodigy uses is JSON based (below), and very similar to the namedtuple you used originally:

prodigy_format = {
    "text": "Apple",
    "spans": [{"start": 0, "end": 5, "label": "ORG"}],
}

# current named tuple:

Entity = namedtuple("Entity", "e_type start_offset end_offset")
Entity("ORG", 0, 5)

I have been considering switching away from the namedtuple and just using the prodigy JSON format in the package. This would mean that we can rely on other converters (from CoNLL -> prodigy JSON, for example) if they exist, or publish them if they do not, and it would all tie in with the spaCy/prodigy ecosystem. What do you think? Are you strongly attached to the namedtuples, or is there something that I am overlooking?
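
A small sketch of a converter in the direction discussed here, mapping the prodigy-style span dicts shown above onto the existing Entity namedtuple (field names taken from the examples in this issue):

from collections import namedtuple

Entity = namedtuple("Entity", "e_type start_offset end_offset")

def spans_to_entities(doc):
    """Map a prodigy-style document onto a list of Entity namedtuples."""
    return [Entity(span["label"], span["start"], span["end"]) for span in doc["spans"]]

prodigy_doc = {"text": "Apple", "spans": [{"start": 0, "end": 5, "label": "ORG"}]}
print(spans_to_entities(prodigy_doc))
# [Entity(e_type='ORG', start_offset=0, end_offset=5)]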

range is wrong for only 1 token span

pred_range = range(pred.start_offset, pred.end_offset)

So if pred.start_offset = 1 and pred.end_offset = 1, the range will be empty, but it should contain one token.
To fix:
pred_range = range(pred.start_offset, pred.end_offset+1)
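
A two-line illustration of the off-by-one being reported:

# range() excludes its upper bound, so a single-token span collapses to nothing
# unless 1 is added to the end offset:
print(list(range(1, 1)))      # [] -- the empty range described in the issue
print(list(range(1, 1 + 1)))  # [1] -- the single token, after the proposed fix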

Pin scikit-learn==0.23.2

At the time of writing, pulling the repository and running the notebook raises an exception due to an incompatible scikit-learn version (currently 1.0).
The last compatible version of scikit-learn is 0.23.2.

I suggest pinning the scikit-learn version in requirements.txt to 0.23.2.
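
The corresponding pin in requirements.txt would be:

scikit-learn==0.23.2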

Question: Are Partial Matches allowed for Type matching scheme?

Are Partial Matches allowed for Type matching scheme?
In your blog post, in the precision/recall/F1 section, you wrote "Partial Match (i.e., partial and type)", so I assume it is possible for the Type matching scheme to include some partial matches. However, your code does not seem to count any partial matches for the Type matching scheme, and in the drug-labelling example Type has a partial score of 0.
