msg-systems / coreferee

Coreference resolution for English, French, German and Polish, optimised for limited training data and easily extensible for further languages

License: MIT License

Coreferee

Author: Richard Paul Hudson, Explosion AI

1. Introduction
2. Interacting with the data model
3. How it works
- 3.1 General operation and rules
- 3.2 The neural ensemble
4. Adding support for a new language
5. Adding support for a custom spaCy model
6. Version history
7. Open issues/requests for assistance

1. Introduction

1.1 The basic idea

Coreferences are situations where two or more words within a text refer to the same entity, e.g. John went home because he was tired. Resolving coreferences is an important general task within the natural language processing field.

Coreferee is a Python 3 library (tested with versions 3.6—3.10) that is used together with spaCy (tested with versions 3.0.0—3.3.0) to resolve coreferences within English, French, German and Polish texts. It is designed so that it is easy to add support for new languages. It uses a mixture of neural networks and programmed rules.

The library was originally developed at msg systems, but is now being maintained at Explosion AI. Please direct any new issues or discussions to the Explosion repository.

1.2 Getting started

1.2.1 English

Presuming you have already installed spaCy and one of the English spaCy models, install Coreferee from the command line by typing:

python3 -m pip install coreferee
python3 -m coreferee install en

Note that:

the required command may be python rather than python3 on some operating systems;
in order to use the transformer-based spaCy model en_core_web_trf with Coreferee, you will need to install the spaCy model en_core_web_lg as well (see the explanation here).

Then open a Python prompt (type python3 or python at the command line):

>>> import coreferee, spacy
>>> nlp = spacy.load('en_core_web_trf')
>>> nlp.add_pipe('coreferee')
<coreferee.manager.CorefereeBroker object at 0x000002DE8E9256D0>
>>>
>>> doc = nlp("Although he was very busy with his work, Peter had had enough of it. He and his wife decided they needed a holiday. They travelled to Spain because they loved the country very much.")
>>>
>>> doc._.coref_chains.print()
0: he(1), his(6), Peter(9), He(16), his(18)
1: work(7), it(14)
2: [He(16); wife(19)], they(21), They(26), they(31)
3: Spain(29), country(34)
>>>
>>> doc[16]._.coref_chains.print()
0: he(1), his(6), Peter(9), He(16), his(18)
2: [He(16); wife(19)], they(21), They(26), they(31)
>>>
>>> doc._.coref_chains.resolve(doc[31])
[Peter, wife]
>>>

1.2.2 French

Presuming you have already installed spaCy and one of the French spaCy models, install Coreferee from the command line by typing:

python3 -m pip install coreferee
python3 -m coreferee install fr

Note that the required command may be python rather than python3 on some operating systems.

Then open a Python prompt (type python3 or python at the command line):

>>> import coreferee, spacy
>>> nlp = spacy.load('fr_core_news_lg')
>>> nlp.add_pipe('coreferee')
<coreferee.manager.CorefereeBroker object at 0x000001F556B4FF10>
>>>
>>> doc = nlp("Même si elle était très occupée par son travail, Julie en avait marre. Alors, elle et son mari décidèrent qu'ils avaient besoin de vacances. Ils allèrent en Espagne car ils adoraient le pays")
>>>
>>> doc._.coref_chains.print()
0: elle(2), son(7), Julie(10), elle(17), son(19)
1: travail(8), en(11)
2: [elle(17); mari(20)], ils(23), Ils(29), ils(34)
3: Espagne(32), pays(37)
>>>
>>> doc[17]._.coref_chains.print()
0: elle(2), son(7), Julie(10), elle(17), son(19)
2: [elle(17); mari(20)], ils(23), Ils(29), ils(34)
>>>
>>> doc._.coref_chains.resolve(doc[34])
[Julie, mari]
>>>

1.2.3 German

Presuming you have already installed spaCy and one of the German spaCy models, install Coreferee from the command line by typing:

python3 -m pip install coreferee
python3 -m coreferee install de

Note that the required command may be python rather than python3 on some operating systems.

Then open a Python prompt (type python3 or python at the command line):

>>> import coreferee, spacy
>>> nlp = spacy.load('de_core_news_lg')
>>> nlp.add_pipe('coreferee')
<coreferee.manager.CorefereeBroker object at 0x0000026E84C63B50>
>>>
>>> doc = nlp("Weil er mit seiner Arbeit sehr beschäftigt war, hatte Peter davon genug. Er und seine Frau haben entschieden, dass ihnen ein Urlaub gut tun würde. Sie sind nach Spanien gefahren, weil ihnen das Land sehr gefiel.")
>>>
>>> doc._.coref_chains.print()
0: er(1), seiner(3), Peter(10), Er(14), seine(16)
1: Arbeit(4), davon(11)
2: [Er(14); Frau(17)], ihnen(22), Sie(29), ihnen(36)
3: Spanien(32), Land(38)
>>>
>>> doc[14]._.coref_chains.print()
0: er(1), seiner(3), Peter(10), Er(14), seine(16)
2: [Er(14); Frau(17)], ihnen(22), Sie(29), ihnen(36)
>>>
>>> doc._.coref_chains.resolve(doc[36])
[Peter, Frau]
>>>

1.2.4 Polish

Presuming you have already installed spaCy and one of the Polish spaCy models, install Coreferee from the command line by typing:

python3 -m pip install coreferee
python3 -m coreferee install pl

Note that the required command may be python rather than python3 on some operating systems.

Then open a Python prompt (type python3 or python at the command line):

>>> import coreferee, spacy
>>> nlp = spacy.load('pl_core_news_lg')
>>> nlp.add_pipe('coreferee')
<coreferee.manager.CorefereeBroker object at 0x0000027304C63B50>
>>>
>>> doc = nlp("Ponieważ bardzo zajęty był swoją pracą, Janek miał jej dość. Postanowili z jego żoną, że potrzebują wakacji. Pojechali do Hiszpanii, bo bardzo im się ten kraj podobał.")
>>>
>>> doc._.coref_chains.print()
0: był(3), swoją(4), Janek(7), Postanowili(12), jego(14)
1: pracą(5), jej(9)
2: [Postanowili(12); żoną(15)], potrzebują(18), Pojechali(21), im(27)
3: Hiszpanii(23), kraj(30)
>>>
>>> doc[12]._.coref_chains.print()
0: był(3), swoją(4), Janek(7), Postanowili(12), jego(14)
2: [Postanowili(12); żoną(15)], potrzebują(18), Pojechali(21), im(27)
>>>
>>> doc._.coref_chains.resolve(doc[27])
[Janek, żoną]
>>>

1.3 Background information

Handling coreference resolution successfully requires training corpora that have been manually annotated with coreferences. The state of the art in coreference resolution is progressing rapidly, but is largely focussed on techniques that require training corpora that are larger than what is available for most languages and software developers. The CONLL 2012 training corpus, which is most widely used, has the following restrictions:

CONLL 2012 covers English, Chinese and Arabic; there is nothing of comparable size for most other languages. For example, the corpus we used to train Coreferee for German is around a tenth of the size of CONLL 2012;
CONLL 2012 is not publicly available and has a relatively restrictive license.

Earlier versions of spaCy had an extension, Neuralcoref, that was excellent but that was never made publicly available for any language other than English. The aim of Coreferee, on the other hand, is to get coreference resolution working for a variety of languages: our focus is less on necessarily achieving the best possible precision and recall for English than on enabling the functionality to be reproduced for new languages as easily and as quickly as possible. Because training data is in such short supply for most languages and is very effort-intensive to produce, it is important to use what is available as effectively as possible.

There are three essential strategies that human readers employ to recognise coreferences within a text:

Hard grammatical rules that completely preclude entities within a text from coreferring, e.g. The house stood tall. They went on walking. Such rules play an especially important role in languages that have grammatical gender, which includes most continental European languages.
Pragmatic tendencies, e.g. a pronoun in the following sentence is more likely to refer back to a word that begins a sentence and is its grammatical subject than to a word that is in the middle of a sentence and forms part of a prepositional phrase.
Semantic restrictions, i.e. which entities can realistically do what to which entities in the world being described. For example, in the sentence The child saddled her up, a reader's experience of the world will make it clear that her must refer to a horse.

With unlimited training data, it would be possible to train a system to employ all three strategies effectively from first principles using word vectors. The features of Coreferee that allow effective learning with the limited training data that is available are:

Strategy 1) is covered by hardcoded rules for each language that the system is then not required to learn from the training data. Because detailed knowledge of the grammar of a specific natural language is a separate skill set from knowledge of machine learning, the two concerns have been fully separated in Coreferee: rules are covered in a separate module from tendencies. This means that a model for a new language can be generated by a competent Python programmer with no knowledge of machine learning or neural networks;
Because the pragmatic tendencies for strategy 2) are very complex and only partially understood by linguists, machine learning and neural networks represent the only realistic way of tackling them. In order to reduce the amount of training data required for neural networks to learn effectively, the syntactic and morphological information supplied by the spaCy models, which have typically been trained with considerably more training data than will be available for coreference resolution, is used as input to neural networks alongside the standard word vectors.
Especially with limited training data but probably even with the largest available training datasets, it is unlikely that a system will learn more than the very simplest tendencies for strategy 3). However, making word vectors available to neural networks ensures that Coreferee can make use of whatever tendencies are discernible.

Coreferee started life to assist the Holmes project, which is used for information extraction and intelligent search. Coreferee is in no way dependent on Holmes, but this original aim has led to several design decisions that may seem somewhat atypical. Several of them could easily be altered by someone with a requirement to do so:

A mention within Coreferee does not consist of a span, but rather of a single token or of a list of tokens that stand in a coordination relationship to one another.
Coreferee does not capture coreferences that are unambiguously evident from the structure of a sentence. For example, the identity of he and doctor in the sentence He was a doctor is not reported by Coreferee because it can easily be derived from a simple analysis of the copular structure of the phrase.
Repetitions of first- and second-person pronouns (I was tired. I went home) are not captured as they add no value either for information extraction or for intelligent search.
Coreferee focusses heavily on anaphors (for English: pronouns). There is only relatively limited capture of coreference between noun phrases, and it is entirely rule-based. (In turn, however, this serves the aim of working with limited training data: noun-phrase coreference is a more exacting task than anaphor resolution.)
Because search performance is much more important for Holmes than document parsing performance, Coreferee performs all analysis eagerly as each document passes through the pipe.

1.4 Facts and figures

1.4.1 Covered relevant linguistic features

ISO 639-1	Language	Anaphor expression			Agreement classes	Coordination expression
ISO 639-1	Language	Pronominal	Verbal	Prepositional	Agreement classes	Conjunctive	Comitative
en	English	My friend came in. He was happy.	-	-	Three singular (natural genders) and one plural class.	Peter and Mary	-
de	German	Mein Freund kam rein. Er war glücklich.	-	Ich benutzte das Auto und hatte damit einige Probleme.	Three singular (grammatical genders) and one plural class.	Peter und Maria	-
fr	French	Mon ami entra. Il était heureux.	-	-	Two singular (grammatical genders) and two plural (grammatical genders) classes.	Pierre et Marie	-
pl	Polish	Wszedł mój kolega. Widzieliście, jaki on był szczęśliwy?	Wszedł mój kolega. Szczęśliwy był.¹	-²	Three singular (grammatical genders) and two plural (natural genders) classes.	Piotr i Kasia	1) Piotr z Kasią przyjechali do Warszawy; 2) Widziałem Piotra i przyszli z Kasią

¹ Only subject zero anaphors are covered. Object zero anaphors, e.g. Wypiłeś wodę? Tak, wypiłem. are not in scope because they are mainly used colloquially and do not normally occur in the types of text for which Coreferee is primarily designed. Handling them would require creating or locating a detailed dictionary of verb valencies.
² Polish has a restricted use of anaphoric prepositions in some formal registers, e.g. Skończyło się to dlań smutno. Because the Polish spaCy models were trained on news texts, they do not recognise such prepositions, meaning that Coreferee cannot capture them either.

1.4.2 Model performance

ISO 639-1	Language	Training corpora	Total words in training corpora	`*_trf` models		`*_lg` models		`*_md` models		`*_sm` models
ISO 639-1	Language	Training corpora	Total words in training corpora	Anaphors in 20%	Accuracy (%)	Anaphors in 20%	Accuracy (%)	Anaphors in 20%	Accuracy (%)	Anaphors in 20%	Accuracy (%)
en	English	ParCor/ LitBank	393564	2500—2520	82—83	2480—2520	81—82	2480—2510	81	2540—2560	81—82
de	German	ParCor	164300	-	-	530—570	79—80	520—550	79—80	530—550	76—79
fr	French	DEMOCRAT	323754	-	-	1270—1280	71—72	1280—1300	68—70	1130—1140	63—64
pl	Polish	PCC	548268	-	-	1730—1790	72—76	1750—1790	70—75	-	-

Coreferee produces a range of neural-network models for each language corresponding to the various spaCy models for that language. The neural network inputs include word vectors. With _sm (small) models, both spaCy and Coreferee use context-sensitive tensors as an alternative to word vectors. _trf (transformer-based) models, on the other hand, do not use or offer word vectors at all. To remedy this problem, the model configuration files (config.cfg in the directory for each language) allow a vectors model to be specified for use when a main model does not have its own vectors. Coreferee then combines the linguistic information generated by the main model with vector information returned for the individual words in each document by the vectors model.

Because the Coreferee models are rather large (20GB-30GB for the group of models for a given language) and because many users will only be interested in one language, the group of models for a given language is installed using python3 -m coreferee install as demonstrated in the introduction. All Coreferee models are more or less the same size; a larger spaCy model does not equate to a larger Coreferee model. As the figures above demonstrate, the accuracy of Coreferee corresponds closely to the size of the underlying spaCy model, and users are urged to use the larger spaCy models. It is in any case unclear whether there is a situation in which it would make sense to use Coreferee with an _sm model as the Coreferee model would then be considerably larger than the spaCy model! As this discrepancy is especially extreme for the Polish models, Coreferee no longer supports pl_core_news_sm from version 1.1.0 onwards.

The English, German and Polish models support spaCy versions from 3.0.0 to 3.3.0, while the French models support spaCy versions from 3.1.0 to 3.2.0. Because the accuracies and number of anaphors found differ slightly depending on the spaCy version used, the table above cites ranges for each model.

Assessing and comparing the precision and recall of anaphor resolution algorithms is notoriously difficult. For one thing, two human annotators of the same data will not always agree (and, indeed, there are some cases where Coreferee and a training annotator disagree where Coreferee's interpretation seems the more plausible!). For another, the same algorithm may perform with wildly different accuracies on different test documents depending on how clearly the documents are written and how often there are competing interpretations of individual anaphors.

Because Coreferee decides where there are anaphors to resolve (as opposed to what to resolve them to) in a purely rule-based fashion and because there is not necessarily a perfect correspondence between the types of anaphor these rules are aiming to capture and the types of anaphor covered by any given training corpus, a recall measure would not be meaningful. Instead, we compare the performance between spaCy models — and, during tuning, between different hyperparameter values — by counting the total number of anaphors that the rules find within the test documents as parsed by the spaCy model being used and that are also annotated with a coreference within the training data. The accuracy then expresses the percentage of these anaphors for which the coreference annotated by the corpus author is part of the chain(s) suggested by Coreferee. In situations where the training data specifies a chain C->B->A and B is a type of coreference that Coreferee is not aiming to capture, C->A is used as a valid training reference.
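The accuracy measure described above can be sketched in a few lines. The function and parameter names are hypothetical and are not part of Coreferee's API; the sketch assumes that each anaphor found by the rules has already been paired with the token index annotated in the corpus and with the set of token indexes in the chain(s) that Coreferee suggested for it.

```python
def anaphor_accuracy(results):
    """Return the percentage of anaphors whose annotated referent appears in
    the chain(s) suggested for them.

    results: list of (annotated_referent, suggested_tokens) pairs, one per
    anaphor that the rules found and that the corpus also annotates.
    """
    if not results:
        return 0.0
    hits = sum(1 for annotated, suggested in results if annotated in suggested)
    return 100.0 * hits / len(results)


# Three anaphors (made-up data); for two of them the annotated referent is
# part of the suggested chain.
example = [(9, {1, 6, 9}), (7, {7, 14}), (29, {3, 8})]
print(round(anaphor_accuracy(example), 1))
```

Anaphors that the rules do not find never enter the list, which is why the figure is an accuracy rather than a recall.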

The corpus for each language is split up into a training corpus (around 80%) and a test corpus (around 20%) using a random procedure with a constant seed, meaning that both sets contain documents from throughout each corpus and that the same documents end up in each set on all runs. Note that the corpora were not split up in this way prior to version 1.2.0, meaning that accuracy figures obtained for earlier versions are not directly comparable with accuracy figures obtained for subsequent versions.
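A reproducible split of this kind might look roughly as follows. This is a sketch rather than Coreferee's actual training code: the function name, the seed value and the file names are assumptions made for illustration.

```python
import random


def split_corpus(documents, test_fraction=0.2, seed=0):
    """Split documents into (training, test) lists reproducibly."""
    shuffled = sorted(documents)           # fixed starting order
    random.Random(seed).shuffle(shuffled)  # same permutation on every run
    test_size = round(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


documents = ["doc_{:03d}.txt".format(i) for i in range(10)]
training, test = split_corpus(documents)
print(len(training), len(test))
```

Because the generator is seeded with a constant, calling the function again returns exactly the same two sets, while shuffling before the cut means both sets draw on documents from throughout the corpus.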

Since coreference between noun phrases is restricted to a small number of cases captured by simple rules, the model assessment figures presented here refer solely to anaphor resolution. When anaphor resolution accuracy is being assessed for a test document, noun pairs are detected and added to chains according to the standard rules, but they do not feature in the accuracy figures. On some rare occasions, however, they may have an indirect effect on accuracy by affecting the semantic considerations that determine which anaphors can be added to which chains.

Note that Total words in training corpora in the table above refers to 100% of the available data for each language, while the Anaphors in 20% columns specify the number of anaphors found in the roughly 20% of this data that is used for model assessment.

2. Interacting with the data model

Coreferee generates Chain objects where each chain is an ordered collection of Mention objects that have been analysed as referring to the same entity. Each mention holds references to one or more spaCy token indexes; a chain can have a maximum of one mention with more than one token (most often its leftmost mention). A given token index occurs in a maximum of two mentions; if it belongs to two mentions the mentions will belong to different chains and one of the mentions will contain multiple tokens. All chains that refer to a given Doc or Token object are managed on a ChainHolder object which is accessed via ._.coref_chains. Reproducing part of the example from the introduction:

>>> doc = nlp("Although he was very busy with his work, Peter had had enough of it. He and his wife decided they needed a holiday. They travelled to Spain because they loved the country very much.")
>>>
>>> doc._.coref_chains.print()
0: he(1), his(6), Peter(9), He(16), his(18)
1: work(7), it(14)
2: [He(16); wife(19)], they(21), They(26), they(31)
3: Spain(29), country(34)
>>>
>>> doc[16]._.coref_chains.print()
0: he(1), his(6), Peter(9), He(16), his(18)
2: [He(16); wife(19)], they(21), They(26), they(31)
>>>

Chains and mentions can be navigated much as if they were lists:

>>> len(doc._.coref_chains)
4
>>> doc._.coref_chains[1].pretty_representation
'1: work(7), it(14)'
>>> len(doc._.coref_chains[1])
2
>>> doc._.coref_chains[1][1]
[14]
>>> len(doc._.coref_chains[1][1])
1
>>> doc._.coref_chains[1][1][0]
14
>>>
>>> for chain in doc._.coref_chains:
...     for mention in chain:
...             print(mention)
...
[1]
[6]
[9]
[16]
[18]
[7]
[14]
[16, 19]
[21]
[26]
[31]
[29]
[34]
>>>
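Because chains and mentions behave like nested lists, a reverse lookup from token index to chain numbers can be built with the same kind of loop. The sketch below uses plain Python lists that copy the token indexes printed above, so that it runs without spaCy or Coreferee installed; the function name is hypothetical, and the same loop should work on a real doc._.coref_chains object if it iterates as shown above.

```python
from collections import defaultdict

# Token indexes of the four chains printed above, written out as plain lists.
chains = [
    [[1], [6], [9], [16], [18]],
    [[7], [14]],
    [[16, 19], [21], [26], [31]],
    [[29], [34]],
]


def chains_by_token(chains):
    """Map each token index to the numbers of the chains that mention it."""
    lookup = defaultdict(list)
    for chain_number, chain in enumerate(chains):
        for mention in chain:
            for token_index in mention:
                lookup[token_index].append(chain_number)
    return dict(lookup)


lookup = chains_by_token(chains)
print(lookup[16])  # token 16 ("He") occurs in chains 0 and 2
```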

A document with Coreferee annotations can be saved and loaded using the normal spaCy methods: the annotations survive the serialization and deserialization. To facilitate this, Coreferee does not store references to spaCy objects, but merely to token indexes. However, each class has a pretty representation designed for human consumption that contains information from the spaCy document and that is generated eagerly when the object is first instantiated. Additionally, the ChainHolder object has a print() method that prints its chains' pretty representations with one chain on each line:

>>> doc._.coref_chains
[0: [1], [6], [9], [16], [18], 1: [7], [14], 2: [16, 19], [21], [26], [31], 3: [29], [34]]
>>> doc._.coref_chains.pretty_representation
'0: he(1), his(6), Peter(9), He(16), his(18); 1: work(7), it(14); 2: [He(16); wife(19)], they(21), They(26), they(31); 3: Spain(29), country(34)'
>>> doc._.coref_chains.print()
0: he(1), his(6), Peter(9), He(16), his(18)
1: work(7), it(14)
2: [He(16); wife(19)], they(21), They(26), they(31)
3: Spain(29), country(34)
>>>
>>> doc._.coref_chains[0]
0: [1], [6], [9], [16], [18]
>>> doc._.coref_chains[0].pretty_representation
'0: he(1), his(6), Peter(9), He(16), his(18)'
>>>
>>> doc._.coref_chains[0][0]
[1]
>>> doc._.coref_chains[0][0].pretty_representation
'he(1)'
>>>

Each chain has an index number that is unique within the document. It is displayed in the representations of Chain and ChainHolder and can also be accessed directly:

>>> doc._.coref_chains[2].index
2

Each chain can also return the index number of the mention within it that is most specific: noun phrases are more specific than anaphors and proper names more specific than common nouns:

>>> doc = nlp("He went to Spain. He loved the country. He often told his friends about it.")
>>> doc._.coref_chains.print()
0: He(0), He(5), He(10), his(13)
1: Spain(3), country(8), it(16)
>>>
>>> doc._.coref_chains[1].most_specific_mention_index
0
>>> doc._.coref_chains[1][doc._.coref_chains[1].most_specific_mention_index].pretty_representation
'Spain(3)'

This information is used as the basis for the resolve() method shown in the initial example: the method traverses multiple chains to find the most specific mention or mentions within the text that describe a given anaphor or noun phrase head.
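The traversal can be illustrated with plain data. The following is a simplified sketch and not Coreferee's implementation: the chains are written as dictionaries, the most specific positions are assumed from the specificity rules stated above, and a token that belongs to no chain is simply returned unchanged.

```python
# Plain-data stand-ins for the chains of the introductory example. Each chain
# lists its mentions (as token indexes) and the position of its most specific
# mention, mirroring most_specific_mention_index.
chains = [
    {"mentions": [[1], [6], [9], [16], [18]], "most_specific": 2},   # Peter(9)
    {"mentions": [[7], [14]], "most_specific": 0},                   # work(7)
    {"mentions": [[16, 19], [21], [26], [31]], "most_specific": 0},  # [He; wife]
    {"mentions": [[29], [34]], "most_specific": 0},                  # Spain(29)
]


def resolve(chains, token_index, visited=frozenset()):
    """Return the token indexes of the most specific description of a token."""
    for chain_number, chain in enumerate(chains):
        if chain_number in visited or [token_index] not in chain["mentions"]:
            continue
        best = chain["mentions"][chain["most_specific"]]
        resolved = []
        for member in best:
            # A member such as He(16) may itself belong to another chain.
            resolved.extend(resolve(chains, member, visited | {chain_number}))
        return resolved
    return [token_index]  # not a single-token mention of any unvisited chain


print(resolve(chains, 31))  # they(31) -> [He; wife] -> Peter(9), wife(19)
```

The visited set stops the recursion from re-entering a chain it has already used, which is what allows He(16) to be followed from chain 2 into chain 0.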

Note that a mention that heads a complex proper noun phrase only refers to the head of that phrase. Some users have expressed a requirement to retrieve all the tokens in such a phrase. Although this functionality is regarded as outside the main scope of Coreferee and is hence not available via the main data model, the information can be retrieved as follows:

rules_analyzer = nlp.get_pipe('coreferee').annotator.rules_analyzer
rules_analyzer.get_propn_subtree(doc[1])

3. How it works

3.1 General operation and rules

3.1.1 Anaphor pair analysis

For each language, methods are implemented that determine:

for each token, its dependent siblings, e.g. Jane is a dependent sibling of Peter in the phrase Peter and Jane;
for each token, whether the token is an anaphor (broadly speaking for English: a third-person pronoun);
for each token, whether the token heads an independent noun phrase that an anaphor could refer to;
for any independent-noun/anaphor or anaphor/anaphor pair within a text, whether or not semantic and syntactic constraints would permit coreference between the members of the pair. For example, there are no circumstances in which they and her could ever corefer within a text. When an entity has dependent siblings, the method is called twice, once with and once without the siblings. Possible coreferents are considered up to five sentences away from each anaphor looking backwards through the text. The method returns 2 (coreference permitted), 1 (coreference unlikely but possible) or 0 (coreference impossible). Alongside the language-specific rules, there are a number of language-independent rules which can lead to a 1 rather than a 2 analysis.

Each anaphor in a document emerges from an analysis using these methods with a list of elements to which it could conceivably refer. The list for each anaphor is scored using the neural ensemble and the possible referents are ordered by decreasing likelihood. Regardless of their neural ensemble score, any pairs with the rules analysis 1 (coreference unlikely but possible) are ordered behind pairs with the rules analysis 2 (coreference permitted).
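The ordering rule in the preceding paragraph amounts to a two-level sort. The sketch below uses a hypothetical function name and made-up scores; it assumes that pairs with rules analysis 0 have already been discarded.

```python
def rank_referents(candidates):
    """Order candidates by rules analysis first and ensemble score second.

    candidates: list of (referent, rules_analysis, score) tuples, where
    rules_analysis is 2 (permitted) or 1 (unlikely but possible).
    """
    return sorted(candidates, key=lambda c: (c[1], c[2]), reverse=True)


candidates = [("house", 1, 0.91), ("Peter", 2, 0.40), ("brother", 2, 0.75)]
ranked = rank_referents(candidates)
print([referent for referent, _, _ in ranked])
```

Although "house" has the highest score, it is ranked last because its rules analysis is 1.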

Note that anaphora is understood in a broad sense that includes cataphora, i.e. pronouns that refer forwards rather than backwards like the initial pronoun in the English example in the introduction. Language-independent rules are used to determine situations in which the syntactic relationship between two elements within the same sentence permits cataphora.

Replacing the neural ensemble scoring with a naive algorithm that always selects the closest potential referent for each anaphor with rules analysis 2 (or 1 if there is no 2) yields an accuracy of around 60% as opposed to the 84% reported above. This demonstrates the respective contribution of each processing strategy to the overall result and provides a useful benchmark for any further machine learning experiments.
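For reference, the naive baseline can be written down as follows. The function name is hypothetical, and measuring closeness in tokens is an assumption; the text above does not specify the distance measure.

```python
def closest_referent(anaphor_index, candidates):
    """Pick the nearest candidate, preferring rules analysis 2 over 1.

    candidates: list of (token_index, rules_analysis) pairs.
    """
    for wanted in (2, 1):
        pool = [index for index, analysis in candidates if analysis == wanted]
        if pool:
            return min(pool, key=lambda index: abs(anaphor_index - index))
    return None  # no potential referent at all


# it(14) with two fully permitted candidates and one unlikely candidate.
print(closest_referent(14, [(7, 2), (9, 2), (12, 1)]))
```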

3.1.2 Noun pair detection

For each language the following are implemented:

a method that determines whether a noun phrase is indefinite, or, in languages that do not mark indefiniteness, whether it could be interpreted as being indefinite;
a method that determines whether a noun phrase is definite, or, in languages that do not mark definiteness, whether it could be interpreted as being definite;
a dictionary from named entity labels to common nouns that refer to members of each named entity class. For example, the English named entity class ORG maps to the nouns ['company', 'firm', 'organisation'].

This information is used in a purely rule-based fashion to determine probable coreference between pairs of noun phrases: broadly, definite noun phrases that do not contain additional new information refer back to indefinite or definite noun phrases with the same head word, and named entities are referred back to by the common nouns that describe their classes. Noun pairs can be a maximum of two sentences apart as opposed to the five sentences that apply to anaphoric references.
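The named-entity part of this rule can be sketched as a dictionary lookup combined with the sentence-distance limit. Only the ORG entry is taken from the text above; the function name and signature are hypothetical, and the sketch leaves out the definiteness checks that the real rules apply.

```python
# Only the ORG entry comes from the text; real dictionaries cover more labels.
ENTITY_NOUNS = {"ORG": ["company", "firm", "organisation"]}
MAX_SENTENCE_DISTANCE = 2


def could_refer_back(entity_label, entity_sentence, noun_lemma, noun_sentence):
    """Check whether a later common noun may refer back to a named entity."""
    distance = noun_sentence - entity_sentence
    return (
        0 <= distance <= MAX_SENTENCE_DISTANCE
        and noun_lemma in ENTITY_NOUNS.get(entity_label, [])
    )


# An organisation named in sentence 0 and "the company" in sentence 1.
print(could_refer_back("ORG", 0, "company", 1))
```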

3.1.3 Building the chains

Coreferee goes through each document in natural reading order from left to right building up chains of anaphors and independent noun phrases. For each anaphor, the highest scoring interpretation as suggested by the neural ensemble is preferred. However, because the semantic (but not the syntactic) restrictions on anaphoric reference apply between all pairs formed by members of a chain rather than merely between adjacent members, it may turn out that the highest scoring interpretation is not permissible because it would lead to a semantically inconsistent chain. The interpretation with the next highest score is then tried, and so on until no interpretations remain.

In the unusual situation that all suggested interpretations of a given anaphor have been found to be semantically impossible, it is likely that one of the interpretations of the preceding anaphors in the text was incorrect: authors do not normally use anaphors that do not refer to anything. Reading the text:

The woman looked down and saw Lesley. She stood up and greeted him.

most readers will initially understand she as referring to Lesley. Only when one reaches the end of the sentence does it become clear that Lesley must be a man and that she actually refers to the woman. A quick test shows that Coreferee is capable of handling such ambiguity:

>>> doc = nlp('The woman looked down and saw Lesley. She stood up and greeted her.')
>>> doc._.coref_chains.print()
0: woman(1), her(13)
1: Lesley(6), She(8)
>>>
>>> doc = nlp('The woman looked down and saw Lesley. She stood up and greeted him.')
>>> doc._.coref_chains.print()
0: woman(1), She(8)
1: Lesley(6), him(13)

This is achieved using a rewind: at a point in a text where no suitable interpretation can be found for an anaphor, alternative interpretations of preceding anaphors are investigated in an attempt to find an overall interpretation that fits.
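The idea of a rewind can be illustrated with a small backtracking search. This is a toy model under strong assumptions and not Coreferee's implementation: gender is the only semantic restriction considered, each anaphor is attached directly to a noun, and the ranked candidate lists are invented to mimic a reader who first prefers Lesley.

```python
GENDERS = {
    "woman": {"f"},
    "Lesley": {"f", "m"},  # the name alone does not reveal the gender
    "She": {"f"},
    "him": {"m"},
}


def consistent(assignment):
    """Every chain must leave at least one gender shared by all its members."""
    chains = {}
    for anaphor, referent in assignment.items():
        chains.setdefault(referent, {referent}).add(anaphor)
    return all(
        set.intersection(*(GENDERS[member] for member in members))
        for members in chains.values()
    )


def interpret(anaphors, position=0, assignment=None):
    """Try referents in ranked order, rewinding when an anaphor has none left."""
    assignment = assignment or {}
    if position == len(anaphors):
        return assignment
    anaphor, ranked_referents = anaphors[position]
    for referent in ranked_referents:
        trial = {**assignment, anaphor: referent}
        if consistent(trial):
            result = interpret(anaphors, position + 1, trial)
            if result is not None:
                return result
    return None  # forces the caller to try its next-ranked referent


anaphors = [("She", ["Lesley", "woman"]), ("him", ["Lesley", "woman"])]
print(interpret(anaphors))
```

She is first attached to Lesley; when him then finds no gender-consistent referent, the search goes back and attaches She to the woman instead, after which him can refer to Lesley.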

3.2 The neural ensemble

The likelihood scores for anaphoric pairs are calculated using an ensemble of five identical multilayer perceptrons, each with a rectified linear activation in the input and hidden layers and a sigmoid activation in the output layer. Each of the five networks outputs a probability between 0 and 1 for a given potential anaphoric pair and the mean of the five probabilities is used as the score for that pair.
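The scoring scheme can be sketched in plain Python. The networks below are toy stand-ins with a single hidden layer and arbitrary hand-picked weights; they only illustrate the combination of rectified linear units, a sigmoid output and averaging across five networks, not the layout or weights of the trained models.

```python
import math


def relu(values):
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer: weights holds one row per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def network_score(inputs, network):
    """Forward pass: ReLU hidden layer followed by a sigmoid output neuron."""
    hidden = relu(dense(inputs, network["hidden_w"], network["hidden_b"]))
    (logit,) = dense(hidden, network["out_w"], network["out_b"])
    return 1.0 / (1.0 + math.exp(-logit))


def ensemble_score(inputs, networks):
    """The score of a pair is the mean of the individual network outputs."""
    scores = [network_score(inputs, network) for network in networks]
    return sum(scores) / len(scores)


# Five toy networks that differ only in an arbitrary output bias.
networks = [
    {"hidden_w": [[0.5, -0.2], [0.1, 0.3]], "hidden_b": [0.0, 0.1],
     "out_w": [[1.0, -1.0]], "out_b": [bias]}
    for bias in (-0.2, -0.1, 0.0, 0.1, 0.2)
]
score = ensemble_score([1.0, 2.0], networks)
print(0.0 < score < 1.0)
```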

The inputs to each of the five networks consist of:

A feature map for each member of the pair. As the first step in training, Coreferee goes through the entire training corpus and notes all the relevant morphological and syntactic information that relevant tokens, their syntactic head tokens and their syntactic children can have. This information is stored with the neural ensemble for each model as a feature table. The feature map for a given token (or list of tokens) is a one-hot representation with respect to the feature table.
A position map for each member of the pair capturing such information as its position within its sentence and its depth within the dependency tree generated for its sentence.
Vector squeezers for each member of the pair and, where existent, for the syntactic head of each member of the pair. The input to a vector squeezer is the vector or context-sensitive tensor for the spaCy token in question. A vector squeezer consists of three neural layers and outputs a representation that is only three neurons wide and that is fed into the rest of the network within the same layer as the other, non-vector inputs.
A compatibility map capturing the relationship between the members of the pair. Alongside the distance separating them in words and in sentences, this includes the number of common features in their feature maps and the cosine similarity between their syntactic heads.
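Two of these inputs are easy to illustrate in isolation: a feature map encoded against a feature table, and the similarity figures that go into the compatibility map. The feature table, the feature strings and the function names below are invented for illustration; the real feature tables are derived from the training corpus.

```python
import math

# A toy feature table; the entries imitate spaCy tag and morphology strings.
FEATURE_TABLE = ["Gender=Fem", "Gender=Masc", "Number=Plur", "Number=Sing", "PRON"]


def feature_map(features):
    """One position per feature-table entry: 1 if the token has the feature."""
    return [1 if entry in features else 0 for entry in FEATURE_TABLE]


def common_features(map_a, map_b):
    """Number of features that both members of the pair share."""
    return sum(a & b for a, b in zip(map_a, map_b))


def cosine_similarity(vector_a, vector_b):
    dot = sum(a * b for a, b in zip(vector_a, vector_b))
    norm_a = math.sqrt(sum(a * a for a in vector_a))
    norm_b = math.sqrt(sum(b * b for b in vector_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


she = feature_map({"Gender=Fem", "Number=Sing", "PRON"})
woman = feature_map({"Gender=Fem", "Number=Sing"})
print(she, woman, common_features(she, woman))
```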

Using a vector squeezer has been consistently found to offer slightly better results than either feeding the full-width vectors into the network directly or omitting them entirely. Possible intuitions that might explain this behaviour are: the reduced width forces the network to learn and attend to a constrained number of specific semantic features relevant to coreference resolution; and the reduced width limits the attention that the network pays to the raw vectors in a situation where the training data is insufficient to make effective use of them.

Perhaps somewhat unusually, when a vector is required to represent a coordinated phrase, the mean of the vectors of the individual coordinated tokens is used rather than the mean of the vectors of all the tokens in the coordinated span.
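As a small illustration, with a hypothetical function name and made-up three-dimensional vectors, the vector for a coordinated phrase is the element-wise mean over the coordinated tokens only:

```python
def coordinated_vector(token_vectors):
    """Element-wise mean of the vectors of the coordinated tokens only."""
    count = len(token_vectors)
    return [sum(component) / count for component in zip(*token_vectors)]


# For "Peter and Jane", only the vectors of "Peter" and "Jane" are averaged;
# the vector of "and" is left out.
peter, jane = [1.0, 0.0, 2.0], [3.0, 2.0, 0.0]
print(coordinated_vector([peter, jane]))
```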

The structure shared by each of the five networks in the ensemble is shown in the attached diagram:

Cross-linguistically, four training epochs were found to offer the best results; adding more training epochs caused the accuracy to start to tail off again owing to overfitting. Training for all relevant spaCy models for a given language takes between one and two hours on a high-end laptop.

4. Adding support for a new language

One of the main design goals of Coreferee was to make it easy to add support for further languages. The prerequisites are:

you will need to know the grammar of the language you are adding well enough to make detailed decisions about which coreferences are normal, which are marginally possible and which are impossible;
you will need to be able to program in Python.

You should not need to get involved in the details of the neural ensemble; Coreferee should do that for you.

The steps involved are:

Create a directory under coreferee/lang/ with the same structure as the existing language-specific directories; it is probably easiest to copy one of them.
The file config.cfg lists the spaCy models for which you wish to generate Coreferee models. You will need to specify a separate vectors model for any of the spaCy models that lack vectors or context-dependent tensors of their own — see the English config.cfg for an example. Each config entry specifies a minimum (from_version) and maximum (to_version) spaCy model version number that the generated Coreferee model will support, as well as the spaCy model version number with which the Coreferee model is trained (train_version). During development, all three numbers will normally refer to a single version number. Later, when an updated spaCy model version is brought out, testing will be required to see whether the existing Coreferee model still supports the new spaCy model version. If so, the maximum version number can be increased; if not, a new config entry will be necessary to accommodate the new Coreferee model that will then be required.
The file rules.py in the main code directory contains an abstract class RulesAnalyzer that must be implemented by a class LanguageSpecificRulesAnalyzer within a file called language_specific_rules.py in each language-specific directory. The abstract class RulesAnalyzer contains docstrings that specify for each abstract property and method the contract to which implementing classes should adhere. Looking at the existing language-specific rules is also likely to be helpful. The method is_potential_anaphor() is normally the most work to create: here it is probably worth looking at the existing English method for languages with natural gender or at the existing German method for languages with grammatical gender. (Polish has an unusually complex gender system, so the Polish example is unlikely to be helpful even as a basis for working with other Slavonic languages.)
There are some situations where word lists can be helpful. If a list is placed in a file <name>.dat within the data directory under a language-specific directory, the contents will be automatically made available within the LanguageSpecificRulesAnalyzer for the language in question as a variable self.<name> that contains a list where each entry corresponds to a line from the file; comments with # are supported. If you use a word list, please ensure it can be published under the MIT license and give appropriate attribution within the language-specific directory in the LICENSE and, where appropriate, in a COPYING file.
Male and female names are managed on a cross-linguistic basis because there is no reason why one would not want e.g. a German female name to be recognised within an English text. Names are automatically made available to all RulesAnalyzer implementations as properties self.male_names, self.female_names, self.exclusively_male_names and self.exclusively_female_names. If you can locate a suitable names list for the language you are working on that is available under a suitable license, add the attribution to the LICENSE file under common/ and merge your names into the two files. Please tidy up the result so that the files are free of duplicates and in alphabetical order.
Create a language-specific directory under tests/ with a file test_rules_<ISO 639-1>.py to test the rules you have written in points 3) to 5). Although one of the corresponding files for one of the existing languages is likely to be the best starting point, you should also be sure to test any extra features specific to the language you are working on. The test tooling is designed to run each test against all spaCy models specified in config.cfg. At this stage in development, you will need to temporarily add a parameter add_coreferee=False to the call to get_nlps() in the setUp() method. Otherwise, all tests will fail because the test tooling will attempt to add the as-yet non-existent Coreferee model to the pipe.
Some tests may fail with one of the smaller spaCy models because it produces incorrect syntactic representations rather than because of any issue with your rule code. For such cases, a parameter excluded_nlps can be specified within a test method to prevent it from being executed with specific spaCy models.
Locate a training corpus or corpora. Again, you should make sure that the resulting models can be published under the MIT license. Add new loader class(es) for the corpus or corpora to the existing loader classes in the train/loaders.py file. Loader classes must implement the GenericLoader abstract class that is located at the top of this file. The job of a loader is to read a specific training corpus format and to create and annotate spaCy documents with coreferences marked within corpora of that format. All the data for a single training run should be placed in a single directory; if there are multiple types of training data loaded by different loaders, each loader will need to be able to recognise the data it is required to read by examining the names of the files within the directory. It is worth spending some time checking with print() statements that the loaders annotate as expected, otherwise the training step that follows has little chance of success!
You are now ready to begin training. The training command must be issued from the coreferee/ root directory. Coreferee will place a zip file into <log-dir>. Alongside the accuracy for each model, the files in the zip file show the coreference chains produced for each test document as well as a list of incorrect annotations where the Coreferee interpretation differed from the one specified by the training corpus author — information that is invaluable for debugging and rules improvement. As an example, the training command for English is:

python3 -m coreferee train --lang en --loader ParCorLoader,LitBankANNLoader --data <training-data-dir> --log <log-dir>

Measure the performance of your model against older versions of spaCy and corresponding spaCy models: create a virtual environment for each version of spaCy, and from it measure the performance against the standard test corpus using the coreferee check command, of which an example is:

python3 -m coreferee check --lang en --loader ParCorLoader,LitBankANNLoader --data <training-data-dir> --log <log-dir>

Once you are happy with your models, install them. The command must be issued from the coreferee/ root directory, otherwise Coreferee will attempt to download the models from GitHub where they are not yet present:

python3 -m coreferee install <ISO 639-1>

Before you attempt any regression tests that involve running Coreferee as part of the spaCy pipe, you must remove the add_coreferee=False parameter you added above. A setup where the parameter is present in one test file but absent in the other test file will not work because the spaCy models are loaded once per test run.
Again using one of the existing languages as a starting point, create a test_smoke_tests_<ISO 639-1>.py file in your test directory. The smoke tests are designed to make sure that the basic features of Coreferee are working properly for the language in question and should also cover any features that have posed a particular challenge while developing the rules.
Format your language_specific_rules.py using black.
Go through the documentation (README.md and SHORTREADME.md) adding information about the new language wherever the supported languages are listed in some way.
Issue a pull request. We ask that you supply us with the zip file placed into <log-dir> in point 9. Because this will contain a considerable amount of raw information from the training corpora, it will normally be preferable from a licensing viewpoint to send it out of band rather than attaching it to the pull request.
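The version-range logic described for config.cfg in the steps above can be pictured with a short sketch. The keys from_version, to_version and train_version are the ones named in the text; the section name, the model key and the exact file syntax in `EXAMPLE_CONFIG` are assumptions for illustration, so consult the English config.cfg for the real layout. `find_entry` is a hypothetical helper, not part of Coreferee.

```python
import configparser
from typing import Optional, Tuple

# Hypothetical config content (see the caveats in the paragraph above).
EXAMPLE_CONFIG = """
[lg_example]
model = core_web_lg
from_version = 3.1.0
to_version = 3.3.0
train_version = 3.1.0
"""


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '3.10.0' into (3, 10, 0) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def find_entry(config_text: str, model: str, version: str) -> Optional[str]:
    """Return the name of the first config entry covering this model version."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    wanted = parse_version(version)
    for name in parser.sections():
        entry = parser[name]
        if entry["model"] != model:
            continue
        lowest = parse_version(entry["from_version"])
        highest = parse_version(entry["to_version"])
        if lowest <= wanted <= highest:
            return name
    return None
```

With this example configuration, version 3.2.0 of the model falls inside the supported range, while 3.4.0 does not; in the latter situation either to_version would have to be raised after testing or a new entry added.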

5. Adding support for a custom spaCy model

If you are using a custom spaCy model, you should generate a corresponding custom Coreferee model. Use points 2), 8), 9) and 10) from the preceding section as a guide. If you do not have your own training data, you can use the same training data that was used to generate the standard Coreferee models.

The language-specific rules expect specific entity tags as 'magic values'. This is unfortunate but there is no obvious alternative solution because there is no way of knowing which entities a new tag might refer to. The best advice is to use the same entity tags in your custom model as are used in the standard spaCy models when referring to similar entity classes.

For many entity tags, the impact will be minimal if you cannot adhere to this, but what is crucial is that you use the PERSON and PER tags to refer to people in English and German respectively. If this is not possible, change the language-specific-rule code and reinstall Coreferee locally (python -m pip install . from the root directory).

6. Version history

6.1 Version 1.0.0

The initial open-source version.

6.2 Version 1.0.1

Fixing of a bug where already installed models were reinstalled from site-packages rather than the new model being pulled from GitHub.

6.3 Version 1.1.0

Upgrade to Python 3.9 and spaCy 3.1
Fixing of minor issues in all three rule-sets
Regeneration of all models
Improvement of the Polish examples in section 1.4.1 to make them more pragmatically correct - many thanks to Małgorzata Styś for her valuable advice on this.

6.4 Version 1.1.1

Changed the dependencies to allow Coreferee to run on the Apple M1 chipset
Sorted out a problem with the supported spaCy versions
Improved some of the tests

6.5 Version 1.1.2

Added support for French, which was kindly supplied by Pantalaymon

6.6 Version 1.1.3

Updated French rules to new version, again supplied by Pantalaymon
Fixed an endless-loop problem in language_independent_is_anaphoric_pair()

6.7 Version 1.2.0

Removed the dependencies on TensorFlow and Keras, switching to Thinc as the neural network platform. Switching to Thinc has led to serialized models that are around 30% of the size of the old models, and has also removed the old limitation whereby nlp.pipe() could not be called with n_process > 1 with forked processes.
Added matrix tests to support a variety of Python and spaCy versions, including spaCy 3.2 and spaCy 3.3.
Implemented a more representative stable-random split into train and test corpora

7. Open issues / requests for assistance

Because optimising parsing speed was not a priority in the project within which Coreferee came into being, Coreferee is written purely in Python; it would be helpful if somebody could convert it to Cython.
There are almost certainly changes to the inputs and structure of the neural ensemble that would lead to improvements in accuracy, both cross-linguistically and for specific languages. The only caveat to bear in mind when trying out changes is that it should be possible for someone who does not understand neural networks to write rules for a new language. This means that Coreferee should detect necessary differences in the neural network behaviour between languages automatically rather than requiring the trainer to configure them.
It would be useful if somebody could find a way of benchmarking Coreferee against other coreference resolution solutions, especially for English. One problem this would probably present is that using a benchmark necessitates a normative scope where a system aims to find exactly those types of coreference marked within the benchmark corpus, whereas the scope of Coreferee was determined by project requirements.


coreferee's Issues

Python 3.9 support?

Hi there, thanks a lot for publishing this -- looking forward to trying it out! I was trying to build this for Python 3.9, but it looks like the setup.py forces it to be >=3.8 and <3.9. Is there any fundamental reason why this won't work with Python 3.9?

Use coreferee with spacy >3.1.2

Hey there,
I am having problems using coreferee together with the newest spaCy version (3.1.2).
To add a Language factory, it is necessary to use decorators, as suggested by spaCy.

I am not that experienced with programming or with Python. As far as I understand, I have to write a function in which I initialize an object; it is then possible to apply the spaCy decorator @Language.factory to that function.

I figured out how to do this for different packages, except this one. For example, it works for spacy-langdetect like this:
from spacy.language import Language
from spacy_langdetect import LanguageDetector

@Language.factory("language_detector")
def init_LanguageDetector(nlp, name):
    return LanguageDetector(language_detection_function=None)

and afterwards I can easily add this new factory to the existing pipeline with:
nlp.add_pipe()

Is there a similar way to do this for coreferee? The snippets provided by the README do not work.

Thanks in advance!

Not able to install coreferee

I am trying to install coreferee using commands mentioned in the documentation:

python3 -m pip install coreferee
python3 -m coreferee install en

but executing this is showing the following errors:

ERROR: Could not find a version that satisfies the requirement coreferee (from versions: none)
ERROR: No matching distribution found for coreferee

How to resolve this issue?

pip package update

Please update the package available via pip installation (ver. 1.1.0)...
I got an error "could not find a version ....=1.1.0" and, in general, pip installed ver. 1.0.1.

Thanks a lot for a new release.

What about Noun phrases mentions?

Hello everyone,
I see that this model is trained to deal only with single words. Is its architecture not based on span representations?
Sometimes a mention is composed of several words rather than one; can anyone please explain this to me?

python -m pip install: could not find a version that satisfies the requirement

hi! Thank you for this package. When I try to install:

python -m pip install coreferee

ERROR: Could not find a version that satisfies the requirement coreferee (from versions: none)
ERROR: No matching distribution found for coreferee

It works if I clone it locally, but then I cannot get the class reference to work when I add it to the spaCy pipeline. I posted a Stack Overflow question, as I assumed it was something I was doing wrong, but it seems the problem would be solved if I could install your package correctly.

I have a vanilla python 3.7.1 environment.

Encountered an endless loop for some text

There is a "while True" loop in rules.py (line 396) and I have some text where this enters an endless loop. The text is not pretty but it is parsing.

You might have other ways of solving this, but I found that adding the following code before the assignment of referring_or_governor to its head got me out of the loop.

if referring_or_governor == referring_or_governor.head:
    break
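The reason such a check terminates the loop is that, in a spaCy parse, the root token is its own head, so climbing from head to head never reaches a natural end. The following self-contained sketch reproduces the pattern with a stand-in `Node` class; the class and the `ancestors` function are hypothetical and only mirror the shape of the loop described in this issue, not the actual code in rules.py.

```python
from typing import List, Optional


class Node:
    """Minimal stand-in for a parsed token; as in spaCy, the root is its own head."""

    def __init__(self, text: str, head: Optional["Node"] = None):
        self.text = text
        self.head = head if head is not None else self


def ancestors(node: Node) -> List[str]:
    """Climb from node towards the root, stopping once the root is reached."""
    path = []
    current = node
    while True:
        if current == current.head:  # root reached; without this the loop never ends
            break
        current = current.head
        path.append(current.text)
    return path
```

For a chain such as the -> John -> went, `ancestors` returns the two governing words and then stops at the root instead of revisiting it forever.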

Replace coreferences within the text

Is there any built-in option to replace / resolve the identified coreferences within the text?

Example in neuralcoref:

doc = nlp(u'Deepika has a dog. She loves him. The movie star has always been fond of animals')

doc._.coref_resolved

Output:

'Deepika has a dog. Deepika loves a dog. Deepika has always been fond of animals'
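One way such a replacement could be approached, independently of any particular library, is to rebuild the text token by token and substitute the referent wherever a token has been resolved. The sketch below assumes the resolution result is already available as a mapping from token index to the text of its referent(s); how that mapping is obtained from Coreferee is deliberately left out, and `resolve_text` is a hypothetical helper rather than an existing API.

```python
from typing import Dict, List, Sequence, Tuple

# Each token is (text, trailing whitespace), mirroring spaCy's
# token.text and token.whitespace_ attributes.
Token = Tuple[str, str]


def resolve_text(tokens: Sequence[Token], resolutions: Dict[int, List[str]]) -> str:
    """Rebuild the text, replacing each resolved token with its referent(s)."""
    pieces = []
    for index, (text, whitespace) in enumerate(tokens):
        if index in resolutions:
            # Several referents (e.g. for "they") are joined with "and".
            text = " and ".join(resolutions[index])
        pieces.append(text + whitespace)
    return "".join(pieces)


tokens = [("Deepika", " "), ("has", " "), ("a", " "), ("dog", ""), (".", " "),
          ("She", " "), ("loves", " "), ("him", ""), (".", "")]
# Hypothetical resolver output: token index -> text of the referent(s).
resolutions = {5: ["Deepika"], 7: ["a dog"]}
resolved = resolve_text(tokens, resolutions)
```

Keeping the original trailing whitespace for each token avoids stray spaces before punctuation in the rebuilt text.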

"zsh: illegal hardware instruction python -m coreferee install en" on MacOS 12.1 M1

Hi there

I tried installing coreferee after installing spaCy and it works well until I try the second of the install commands:

python3 -m pip install coreferee
python3 -m coreferee install en

The error message sounds to me like there's an incompatibility between the software and my hardware.

I used python 3.9.7 first, saw that someone said 3.8.0 should work but it gave me the same result there. Same with 3.6.9.

Grateful for any pointers.

Adding support for trf models without NER

Hello,

As I am approaching the end of my enterprise to add support for French, I notice that the performance of the coreference resolution is considerably hampered by the performance of the spaCy model used, as even the large model "fr_core_news_lg" is not optimal.
However, the CamemBERT-based model "fr_dep_news_trf" yields noticeably better lemmatisation, tagging, morphological features and parsing. The downside is that it does not come with named entity recognition in its pipeline.

I was wondering if it is possible to use the NER component of a smaller model, in the same way that the English Coreferee model does with the word vectors.
I see that this was not done for the German de_dep_news_trf (which is in a similar situation to the French model), so I figured there was maybe a reason I should not attempt to use this strategy for French?

If it is possible, how should I go about it?

retokenize can change the index values

I have a program where I need to call retokenize to combine specific tokens. This changes the indices of tokens, so the coreferences no longer point to the correct tokens. I have created a workaround that adjusts these index values, but I thought I'd let you know about the problem. I still have an issue where I have clearly not adjusted everything correctly, because resolve is returning None instead of a value.
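The index shift described here can be modelled without spaCy: when each merged span collapses into a single token, every later token moves left by the number of tokens that disappeared. The sketch below builds such an old-to-new index map. It assumes the merged spans are non-overlapping and given as half-open (start, end) pairs of original token indexes; `build_index_map` is a hypothetical helper, not part of Coreferee or spaCy.

```python
from typing import Dict, List, Tuple


def build_index_map(token_count: int, merged_spans: List[Tuple[int, int]]) -> Dict[int, int]:
    """Map old token indexes to new ones after merging each [start, end) span."""
    span_ends = {start: end for start, end in merged_spans}
    mapping: Dict[int, int] = {}
    old_index = 0
    new_index = 0
    while old_index < token_count:
        # A merged span consumes several old tokens; otherwise just one.
        end = span_ends.get(old_index, old_index + 1)
        for covered in range(old_index, end):
            mapping[covered] = new_index
        old_index = end
        new_index += 1
    return mapping
```

Any stored token index, such as the members of a coreference chain, would then be translated through the map before being used against the retokenized document.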

Model could not be loaded for config entry 'trf_3_0_0'

Model could not be loaded for config entry 'trf_3_0_0' If models exist for language 'en', load them with the command 'python -m coreferee install en'.
Traceback (most recent call last):
  File "/Users/madhav/.pyenv/versions/3.8.0/lib/python3.8/site-packages/coreferee/manager.py", line 79, in get_annotator
    importlib.import_module(model_package_name)
  File "/Users/madhav/.pyenv/versions/3.8.0/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'coreferee_model_en.trf_3_0_0'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "ree.py", line 3, in <module>
    nlp.add_pipe('coreferee')
  File "/Users/madhav/.pyenv/versions/3.8.0/lib/python3.8/site-packages/spacy/language.py", line 768, in add_pipe
    pipe_component = self.create_pipe(
  File "/Users/madhav/.pyenv/versions/3.8.0/lib/python3.8/site-packages/spacy/language.py", line 659, in create_pipe
    resolved = registry.resolve(cfg, validate=validate)
  File "/Users/madhav/.pyenv/versions/3.8.0/lib/python3.8/site-packages/thinc/config.py", line 722, in resolve
    resolved, _ = cls._make(
  File "/Users/madhav/.pyenv/versions/3.8.0/lib/python3.8/site-packages/thinc/config.py", line 771, in _make
    filled, _, resolved = cls._fill(
  File "/Users/madhav/.pyenv/versions/3.8.0/lib/python3.8/site-packages/thinc/config.py", line 843, in _fill
    getter_result = getter(*args, **kwargs)
  File "/Users/madhav/.pyenv/versions/3.8.0/lib/python3.8/site-packages/coreferee/manager.py", line 103, in __init__
    self.annotator = CorefereeManager().get_annotator(nlp)
  File "/Users/madhav/.pyenv/versions/3.8.0/lib/python3.8/site-packages/coreferee/manager.py", line 85, in get_annotator
    raise ModelNotSupportedError(''.join((nlp.meta['lang'], '_', nlp.meta['name'],
coreferee.errors.ModelNotSupportedError: en_core_web_trf version 3.0.0

Resolved text

Are there any plans for adding a "resolved_text"-type of attribute (i.e., the version of the doc/doc text with resolved spans)?

Persistent Class TypeError

With most documents of longer than a few sentences (news articles), I am getting a recurrent error:

text = """
France retains its centuries-long status as a global centre of art, science, and philosophy, says Ljubomir Geric. He also notes it
hosts the world's fifth-largest number of UNESCO World Heritage Sites and is the leading tourist destination, 
receiving over 89 million foreign visitors in 2018. France is a developed country with the world's 
seventh-largest economy by nominal GDP, and the ninth-largest by PPP. In terms of aggregate household wealth, 
it ranks fourth in the world. France performs well in international rankings of education, health care, 
life expectancy, and human development. It remains a great power in global affairs, being one of the five 
permanent members of the United Nations Security Council (UNSC) and an official nuclear-weapon state. France is a 
founding and leading member of the European Union (EU) and the Eurozone, and a member of the Group of 7, North 
Atlantic Treaty Organization (NATO), Organisation for Economic Co-operation and Development (OECD), and the 
World Trade Organization (TWO).
"""
doc = nlp(text)

Unexpected error annotating document, skipping ....
<class 'TypeError'>
Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.
  File "/opt/conda/envs/nlp/lib/python3.8/site-packages/coreferee/manager.py", line 110, in __call__
    self.annotator.annotate(doc)
  File "/opt/conda/envs/nlp/lib/python3.8/site-packages/coreferee/annotation.py", line 270, in annotate
    self.tendencies_analyzer.score(doc, self.keras_ensemble)
  File "/opt/conda/envs/nlp/lib/python3.8/site-packages/coreferee/tendencies.py", line 390, in score
    keras_inputs, scoring_necessary = self.prepare_keras_data([doc])
  File "/opt/conda/envs/nlp/lib/python3.8/site-packages/coreferee/tendencies.py", line 326, in prepare_keras_data
    self.get_vectors(potential_referred, doc)
  File "/opt/conda/envs/nlp/lib/python3.8/site-packages/coreferee/tendencies.py", line 263, in get_vectors
    this_object_vector = np.mean( np.array([t.vector for t in tokens]), axis=0)
  File "cupy/core/core.pyx", line 1188, in cupy.core.core.ndarray.__array__

My numpy is 1.19.5. tensorflow 2.4.2. I wonder if this is what your issue about versions being too permissive was about? I'll try to replicate those more restricted installs.

Cannot install dependency while pip install coreferee

When doing pip install coreferee, I get the following stack trace when I get to wrapt. This is on RHEL 7.6. Are there any additional requirements for being able to pip install? As an additional note, I can pip install 1.13.3 of wrapt, but as you see it is attempting to use 1.12.1. Don't you need 1.13 with Python 3.9, or should I not be using Python 3.9?

Collecting wrapt~=1.12.1
Downloading wrapt-1.12.1.tar.gz (27 kB)
ERROR: Command errored out with exit status 1:
command: /home/gregory.werner/opt/python-3.9.10/bin/python3 -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-enq9faee/wrapt_779f91a727f145c49bac8be4aea62efd/setup.py'"'"'; __file__='"'"'/tmp/pip-install-enq9faee/wrapt_779f91a727f145c49bac8be4aea62efd/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-w5mkwcfp
cwd: /tmp/pip-install-enq9faee/wrapt_779f91a727f145c49bac8be4aea62efd/
Complete output (11 lines):
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/gregory.werner/opt/python-3.9.10/lib/python3.9/site-packages/setuptools/__init__.py", line 18, in <module>
    from setuptools.dist import Distribution
  File "/home/gregory.werner/opt/python-3.9.10/lib/python3.9/site-packages/setuptools/dist.py", line 38, in <module>
    from setuptools import windows_support
  File "/home/gregory.werner/opt/python-3.9.10/lib/python3.9/site-packages/setuptools/windows_support.py", line 2, in <module>
    import ctypes
  File "/home/gregory.werner/opt/python-3.9.10/lib/python3.9/ctypes/__init__.py", line 8, in <module>
    from _ctypes import Union, Structure, Array
ModuleNotFoundError: No module named '_ctypes'

coreferee doesn't seem to be available on PyPi (pip)?

pip install coreferee

ERROR: Could not find a version that satisfies the requirement coreferee (from versions: none)
ERROR: No matching distribution found for coreferee

coreferee.errors.ModelNotSupportedError: en_core_web_md version 3.1.0

When executing the following code -

nlp=spacy.load('en_core_web_md')
nlp.add_pipe('coreferee')

I am getting the following error -

coreferee.errors.ModelNotSupportedError: en_core_web_md version 3.1.0

Any idea why this is happening? And what can be done in order to resolve this?

Multithreading regression failure

Ignore. Posted in the wrong place.

coreferee install de raises ImportError

Hello,
I am having some trouble installing the additional German data. I receive an ImportError when running:
python3 -m coreferee install de
The error log is:
ImportError: cannot import name 'get_config' from 'tensorflow.python.eager.context'

I am running this in a venv with Python 3.8.5.

How to cite coreferee

Awesome project and appreciate the detailed readme!

I'd like to use coreferee in an academic project and, therefore, would like to know how best to cite it. I am not sure if I missed this somewhere, but it would be helpful if you can add this information to the readme file.

spaCy 3.1, 3.2 models not supported

I have tried to use this library with spaCy 3.1 and 3.2 models. The setup config says spacy>=3.1.0,<3.2.0 is supported, yet I get the ModelNotSupported error for both en_core_web_lg version 3.1.0 and en_core_web_trf version 3.1.0 models (both are installed). Coreferee en is also installed.

import coreferee
import spacy

nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("coreferee")

Info about spaCy

spaCy version: 3.1.4
Platform: Windows-10-10.0.19041-SP0
Python version: 3.8.8
Pipelines: en_core_web_lg (3.1.0), en_core_web_trf (3.1.0)

Error trace

---------------------------------------------------------------------------
ModelNotSupportedError                    Traceback (most recent call last)
Input In [2], in <module>
      3 import spacy
      5 nlp = spacy.load("en_core_web_lg")
----> 6 nlp.add_pipe("coreferee")

File ~\.virtualenvs\lecontra-qSg6Vfcc\lib\site-packages\spacy\language.py:787, in Language.add_pipe(self, factory_name, name, before, after, first, last, source, config, raw_config, validate)
    779     if not self.has_factory(factory_name):
    780         err = Errors.E002.format(
    781             name=factory_name,
    782             opts=", ".join(self.factory_names),
   (...)
    785             lang_code=self.lang,
    786         )
--> 787     pipe_component = self.create_pipe(
    788         factory_name,
    789         name=name,
    790         config=config,
    791         raw_config=raw_config,
    792         validate=validate,
    793     )
    794 pipe_index = self._get_pipe_index(before, after, first, last)
    795 self._pipe_meta[name] = self.get_factory_meta(factory_name)

File ~\.virtualenvs\lecontra-qSg6Vfcc\lib\site-packages\spacy\language.py:670, in Language.create_pipe(self, factory_name, name, config, raw_config, validate)
    667 cfg = {factory_name: config}
    668 # We're calling the internal _fill here to avoid constructing the
    669 # registered functions twice
--> 670 resolved = registry.resolve(cfg, validate=validate)
    671 filled = registry.fill({"cfg": cfg[factory_name]}, validate=validate)["cfg"]
    672 filled = Config(filled)

File ~\.virtualenvs\lecontra-qSg6Vfcc\lib\site-packages\thinc\config.py:729, in registry.resolve(cls, config, schema, overrides, validate)
    720 @classmethod
    721 def resolve(
    722     cls,
   (...)
    727     validate: bool = True,
    728 ) -> Dict[str, Any]:
--> 729     resolved, _ = cls._make(
    730         config, schema=schema, overrides=overrides, validate=validate, resolve=True
    731     )
    732     return resolved

File ~\.virtualenvs\lecontra-qSg6Vfcc\lib\site-packages\thinc\config.py:778, in registry._make(cls, config, schema, overrides, resolve, validate)
    776 if not is_interpolated:
    777     config = Config(orig_config).interpolate()
--> 778 filled, _, resolved = cls._fill(
    779     config, schema, validate=validate, overrides=overrides, resolve=resolve
    780 )
    781 filled = Config(filled, section_order=section_order)
    782 # Check that overrides didn't include invalid properties not in config

File ~\.virtualenvs\lecontra-qSg6Vfcc\lib\site-packages\thinc\config.py:850, in registry._fill(cls, config, schema, validate, resolve, parent, overrides)
    847     getter = cls.get(reg_name, func_name)
    848     # We don't want to try/except this and raise our own error
    849     # here, because we want the traceback if the function fails.
--> 850     getter_result = getter(*args, **kwargs)
    851 else:
    852     # We're not resolving and calling the function, so replace
    853     # the getter_result with a Promise class
    854     getter_result = Promise(
    855         registry=reg_name, name=func_name, args=args, kwargs=kwargs
    856     )

File ~\.virtualenvs\lecontra-qSg6Vfcc\lib\site-packages\coreferee\manager.py:103, in CorefereeBroker.__init__(self, nlp, name)
    101 self.nlp = nlp
    102 self.pid = os.getpid()
--> 103 self.annotator = CorefereeManager().get_annotator(nlp)

File ~\.virtualenvs\lecontra-qSg6Vfcc\lib\site-packages\coreferee\manager.py:95, in CorefereeManager.get_annotator(nlp)
     93         keras_ensemble = keras.models.load_model(absolute_keras_model_filename)
     94         return Annotator(nlp, vectors_nlp, feature_table, keras_ensemble)
---> 95 raise ModelNotSupportedError(''.join((nlp.meta['lang'], '_', nlp.meta['name'],
     96     ' version ', nlp.meta['version'])))

ModelNotSupportedError: en_core_web_lg version 3.1.0

PLEASE ADDRESS ANY NEW ISSUES TO https://github.com/explosion/coreferee

Cannot install on Apple Silicon

I am trying to install this library on macOS 11 on an Apple Silicon Mac. The requirements seem to conflict no matter what I try.

The following are all the things I've tried. It would be great if you provided a way to install from source because I got M1 tensorflow and spacy to work just fine. I just don't know how to install coreferee and the respective models with the libraries that I do have available.

INFO: pip is looking at multiple versions of coreferee to determine which version is compatible with other requirements. This could take a while.
ERROR: Could not find a version that satisfies the requirement h5py~=3.1.0 (from tensorflow-macos) (from versions: 2.2.1, 2.3.0b1, 2.3.0, 2.3.1, 2.4.0b1, 2.4.0, 2.5.0, 2.6.0, 2.7.0rc2, 2.7.0, 2.7.1, 2.8.0rc1, 2.8.0, 2.9.0rc1, 2.9.0, 2.10.0, 3.0.0rc1, 3.0.0, 3.1.0, 3.2.0, 3.2.1, 3.3.0, 3.4.0, 3.5.0, 3.6.0)
ERROR: No matching distribution found for h5py~=3.1.0

I tried building h5py from source using the home-brew version using
HDF5_DIR=/opt/homebrew/Cellar/hdf5/1.12.1 pip3.9 install --no-binary=h5py h5py==3.1.0
because there was no prebuilt binary available. That worked.

But installing coreferee doesn't seem to see this version, and I wind up with the error above.
I very much depend on this library. It would be great if we could get this working again.
How can the issue be resolved?

I think it's a problem with several libraries including tensorflow.

EDIT: I tried installing the intel python3.9 hoping Rosetta would force everything to work under x86_64 emulation mode. Everything installed except for:
python3 -m coreferee install en, which yielded "zsh: illegal hardware instruction".

EDIT2: I noticed that 1.1.1 was explicitly for resolving Apple Silicon issues. Why am I experiencing these installation problems then? Is there something subtle that I am missing?

I'm out of ideas.

EDIT3: These are the official TensorFlow instructions for Apple Silicon (link), but the procedure requires Conda.
I'd still like to know what is going wrong if you were successful in installing on Apple Silicon for v. 1.1.1.
Spacy seems to be having trouble with transformers too.

EDIT4: It took a while, but I got things working by not using pip for coreferee itself and instead cloning the repo and installing the dependencies separately, using whatever is recommended for Mac:


curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
bash ./Miniforge3-MacOSX-arm64.sh
conda config --set auto_activate_base false
conda create --name my_env python=3.9
conda activate my_env
conda install rust 
export CARGO_BUILD_TARGET="aarch64-apple-darwin"
conda install -c conda-forge spacy=3.1.4

python -m spacy download en_core_web_trf
python -m spacy download en_core_web_lg

conda install -c apple tensorflow-deps
pip install tensorflow-macos
pip install tensorflow-metal
git clone https://github.com/msg-systems/coreferee.git
mv ./coreferee ./coreferee_container
cd ./coreferee_container
python -m coreferee install en
cd ..

I hope this process becomes simpler. Also, hopefully you'll update to spaCy 3.2.0.

OLD:
I get this error on macOS 11.3 on an M1 (ARM) machine when I try installing:

pip3 install coreferee
python3 -m coreferee install en

ERROR: Cannot install coreferee==1.1.0, coreferee==1.1.1 and coreferee==1.1.2 because these package versions have conflicting dependencies.

The conflict is caused by:
    coreferee 1.1.2 depends on tensorflow-macos~=2.6.0; platform_system == "Darwin"
    coreferee 1.1.1 depends on tensorflow-macos~=2.6.0; platform_system == "Darwin"
    coreferee 1.1.0 depends on tensorflow~=2.5.0

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/user_guide/#fixing-conflicting-dependencies
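
For readers unfamiliar with the ~= operator in the messages above: under PEP 440, a "compatible release" requirement such as tensorflow-macos~=2.6.0 is equivalent to >=2.6.0 combined with ==2.6.*. The sketch below illustrates that rule for plain numeric versions only (no rc or beta suffixes). The function name is made up for this example; real tools should rely on pip or the packaging library instead.

```python
def _parse(version):
    """Split a plain numeric version such as '3.1.0' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def satisfies_compatible_release(candidate, spec):
    """Return True if candidate satisfies '~=spec' (numeric versions only)."""
    spec_parts = _parse(spec)
    if len(spec_parts) < 2:
        raise ValueError("~= needs at least two version segments")
    cand_parts = _parse(candidate)
    # Pad with zeros so that '3.1' and '3.1.0' compare as equal.
    width = max(len(cand_parts), len(spec_parts))
    cand_padded = cand_parts + (0,) * (width - len(cand_parts))
    spec_padded = spec_parts + (0,) * (width - len(spec_parts))
    # All segments except the last one in the spec must match exactly.
    prefix = spec_parts[:-1]
    return cand_padded >= spec_padded and cand_padded[:len(prefix)] == prefix


if __name__ == "__main__":
    available = ["3.0.0", "3.1.0", "3.2.0", "3.6.0"]
    print([v for v in available if satisfies_compatible_release(v, "3.1.0")])
    # prints ['3.1.0']
```

This is why a requirement like h5py~=3.1.0 rules out 3.2.0 and later even though they are newer.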

How can I resolve this?

Also, I am using python 3.10, but the same thing happens with 3.9.
I am also hoping Spacy 3.2 support is around the corner.

EDIT: I specified an older version of coreferee (1.1.0), but now I get:

ERROR: Could not find a version that satisfies the requirement tensorflow~=2.5.0 (from coreferee) (from versions: none)
ERROR: No matching distribution found for tensorflow~=2.5.0

It seems a lot of these package requirements don't exist. I really depend on this project. How can we get it working?

EDIT:
It looks like the issue is h5py. There's no pre-built wheel for it on ARM Macs.
It's unclear how to install it from source for this purpose.

Documentation website?

Hi! I stumbled upon this repo after looking up tools for corefs, which ended up not working or were completely confusing. This one seems approachable. Is there a documentation website for coreferee?

Support for Python 3.7 for Google Colab use

Google Colab only supports Python 3.7. Would it be possible to release a version of coreferee that supports it?
The other big coreference library compatible with spaCy that I am aware of is neuralcoref, and that only supports spaCy 2. They're working on a new version for spaCy 3.0, but nothing has been heard about it for a while, so right now there's a niche for a spaCy 3 and Colab-friendly neural coreference library.

ModelNotSupportedError: en_core_web_lg version 3.1.0 error

Python version: 3.9.7
Spacy: Version: 3.1.0
Model: en-core-web-lg Version: 3.1.0

I'm trying the following code:

python -m coreferee install en_core_web_lg
import coreferee
import spacy

nlp = spacy.load('en_core_web_lg')
nlp.add_pipe('coreferee')

I've tried using different models but all of them keep giving me the same error:
ModelNotSupportedError: en_core_web_lg version 3.1.0

Any help would be appreciated.
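
When reporting this kind of error, it helps to list the exact installed versions of spaCy, coreferee and the spaCy model, since a mismatch between them is one plausible cause (an assumption, not confirmed in this thread). The sketch below uses only the standard library (Python 3.8+). The helper name and the package list are illustrative.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


if __name__ == "__main__":
    # Distribution names are assumptions; adjust them to the model in use.
    report = installed_versions(["spacy", "coreferee", "en-core-web-lg"])
    for name, version in report.items():
        print(f"{name}: {version or 'not installed'}")
```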

ModelNotSupportedError: de_core_news_lg version 3.1.0

Hi, I get this error message (using Spacy 3.1.3):

nlp=spacy.load('de_core_news_lg')
nlp.add_pipe('coreferee')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Robert\anaconda3\lib\site-packages\spacy\language.py", line 776, in add_pipe
    pipe_component = self.create_pipe(
  File "C:\Users\Robert\anaconda3\lib\site-packages\spacy\language.py", line 660, in create_pipe
    resolved = registry.resolve(cfg, validate=validate)
  File "C:\Users\Robert\anaconda3\lib\site-packages\thinc\config.py", line 729, in resolve
    resolved, _ = cls._make(
  File "C:\Users\Robert\anaconda3\lib\site-packages\thinc\config.py", line 778, in _make
    filled, _, resolved = cls._fill(
  File "C:\Users\Robert\anaconda3\lib\site-packages\thinc\config.py", line 850, in _fill
    getter_result = getter(*args, **kwargs)
  File "C:\Users\Robert\anaconda3\lib\site-packages\coreferee\manager.py", line 103, in __init__
    self.annotator = CorefereeManager().get_annotator(nlp)
  File "C:\Users\Robert\anaconda3\lib\site-packages\coreferee\manager.py", line 95, in get_annotator
    raise ModelNotSupportedError(''.join((nlp.meta['lang'], '_', nlp.meta['name'],
coreferee.errors.ModelNotSupportedError: de_core_news_lg version 3.1.0

Thank you!

pip install does not work (can not find package)

$ python --version
Python 3.9.5

$ pip --version
pip 21.1.1 from /lib/python3.9/site-packages/pip (python 3.9)

$ pip install coreferee
ERROR: Could not find a version that satisfies the requirement coreferee (from versions: none)
ERROR: No matching distribution found for coreferee

Access to the score of the chain?

Hi,

I've seen some cases in which coreference resolution fails, adding a wrong coreference and missing the correct one. So I wonder whether there is any way to access the score of the coreference resolution, and whether there are other candidate resolutions that the network does not choose but that could be selected with rules.

Thanks.

PS: This is the case in which it fails:

... This seems to be the case for climate change. All of the available evidence shows that it is real, and it has consequences. ...

In this case, it links each "it" (shown in bold in the original) to "All" and not to "climate change".

Best regards

coreferee.errors.ModelNotSupportedError: en_core_web_trf version 3.1.0

I tried to use this library but got the following error:

coreferee.errors.ModelNotSupportedError: en_core_web_trf version 3.1.0

Here are my environment details:

Python 3.8.5
spaCy 3.1.1

I'm not sure about the coreferee version, but it should be the latest version, v1.1.0.
I also tried Python 3.9.6 with the master branch of the repo and installed the package inline, but still got the same error.
Any idea what I am doing wrong?

Is spaCy version >= 3.2.0 supported?

Thank you for your work.

Word lists from Wordnet

Hello,

First of all, thank you for this project, which is very well documented.
I am currently working on adding support for French using the DEMOCRAT corpus.

As I will need word lists, I see in the license file that the lists in the data directories are derived from WordNet. So I got my hands on WOLF, a similar lexical resource for French (also derived from WordNet).

So I was wondering if you could give more details on how to produce those .dat files. I am specifically interested in the person_words.dat, exclusively_person_words.dat and animal_words.dat files (the criteria used to filter, which WordNet files were used, and so on).
If possible, it would be great if you could share the code used to produce them.

illegal hardware instruction on macOS

Hi, I followed the instructions from the readme, e.g., used a Py 3.9.5 environment, spacy = 3.1.2 and then installed coreferee. However, when running python -m coreferee install en I get

python -m coreferee install en
[1]    22471 illegal hardware instruction  python -m coreferee install en

Note that this error occurs also when executing import spacy but only after installing coreferee. Prior to the installation of coreferee, import spacy just works fine, so I suspect that the cause of this issue has to do with coreferee or perhaps one of its dependent libraries. For instance, I noticed that during coreferee's installation some libraries installed by spacy previously are replaced with older versions, see:

[...]
  Attempting uninstall: typing-extensions
    Found existing installation: typing-extensions 3.10.0.2
    Uninstalling typing-extensions-3.10.0.2:
      Successfully uninstalled typing-extensions-3.10.0.2
  Attempting uninstall: six
    Found existing installation: six 1.16.0
    Uninstalling six-1.16.0:
      Successfully uninstalled six-1.16.0
  Attempting uninstall: numpy
    Found existing installation: numpy 1.21.2
    Uninstalling numpy-1.21.2:
      Successfully uninstalled numpy-1.21.2
Successfully installed absl-py-0.14.1 astunparse-1.6.3 cachetools-4.2.4 coreferee-1.1.0 flatbuffers-1.12 gast-0.4.0 google-auth-2.3.0 google-auth-oauthlib-0.4.6 google-pasta-0.2.0 grpcio-1.34.1 h5py-3.1.0 keras-2.4.3 keras-nightly-2.5.0.dev2021032900 keras-preprocessing-1.1.2 markdown-3.3.4 numpy-1.19.5 oauthlib-3.1.1 opt-einsum-3.3.0 protobuf-3.18.1 pyasn1-0.4.8 pyasn1-modules-0.2.8 requests-oauthlib-1.3.0 rsa-4.7.2 scipy-1.7.1 six-1.15.0 tensorboard-2.7.0 tensorboard-data-server-0.6.1 tensorboard-plugin-wit-1.8.0 tensorflow-2.5.1 tensorflow-estimator-2.5.0 termcolor-1.1.0 typing-extensions-3.7.4.3 werkzeug-2.0.2 wrapt-1.12.1

pip install coreferee fails ...

pip install coreferee returns the following error:

ERROR: Could not find a version that satisfies the requirement coreferee (from versions: none)
ERROR: No matching distribution found for coreferee

Other info:

spaCy version    3.0.6                         
Platform         Linux-5.4.0-72-generic-x86_64-with-glibc2.27
Python version   3.9.2

Dependency requirements are too loose

The current dependency requirements are too loose. A fresh install picks up the newest versions of the dependencies wherever >= is used for the version number, and the result fails with various errors.

Alternatively, the code is not compatible with the newest versions of the dependencies.

I managed to run the example using the lowest allowed versions of the dependencies (replacing >= with ==):
h5py==2.10.0 keras==2.4.3 numpy==1.19.2 spacy==3.0.5 tensorflow==2.4.1

Not able to add spaCy pipe

I tried python -m coreferee install en as well as a pip install of the zip file, but I am still getting this error

Failed copying input tensor

Hi, I obtained Spanish-language data as *.conll files, converted it to the *.ann format and, after changing the rules for my language, tried to train coreferee with it. There are approximately 3,000 files, but when I try to train the model I get an error:

Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/creangel/info/image/coreferee/__main__.py", line 51, in <module>
    TrainingManager(
  File "/home/creangel/info/image/coreferee/training/train.py", line 409, in train_models
    self.train_model(config_entry_name, config_entry, temp_log_file)
  File "/home/creangel/info/image/coreferee/training/train.py", line 378, in train_model
    keras_ensemble = self.generate_keras_ensemble(
  File "/home/creangel/info/image/coreferee/training/train.py", line 219, in generate_keras_ensemble
    keras_history = model_generator.train_keras_model(training_docs, tendencies_analyzer,
  File "/home/creangel/info/image/coreferee/training/model.py", line 288, in train_keras_model
    #print('keras_inputs: ',keras_inputs)
  File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1134, in fit
    data_handler = data_adapter.get_data_handler(
  File "/usr/local/lib/python3.8/dist-packages/keras/engine/data_adapter.py", line 1383, in get_data_handler
    return DataHandler(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/keras/engine/data_adapter.py", line 1138, in __init__
    self._adapter = adapter_cls(
  File "/usr/local/lib/python3.8/dist-packages/keras/engine/data_adapter.py", line 230, in __init__
    x, y, sample_weights = _process_tensorlike((x, y, sample_weights))
  File "/usr/local/lib/python3.8/dist-packages/keras/engine/data_adapter.py", line 1031, in _process_tensorlike
    inputs = tf.nest.map_structure(_convert_numpy_and_scipy, inputs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/nest.py", line 869, in map_structure
    structure[0], [func(*x) for x in entries],
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/nest.py", line 869, in <listcomp>
    structure[0], [func(*x) for x in entries],
  File "/usr/local/lib/python3.8/dist-packages/keras/engine/data_adapter.py", line 1026, in _convert_numpy_and_scipy
    return tf.convert_to_tensor(x, dtype=dtype)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
    return target(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py", line 1430, in convert_to_tensor_v2_with_dispatch
    return convert_to_tensor_v2(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py", line 1436, in convert_to_tensor_v2
    return convert_to_tensor(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/profiler/trace.py", line 163, in wrapped
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py", line 1566, in convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/tensor_conversion_registry.py", line 52, in _default_conversion_function
    return constant_op.constant(value, dtype, name=name)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/constant_op.py", line 271, in constant
    return _constant_impl(value, dtype, shape, name, verify_shape=False,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/constant_op.py", line 283, in _constant_impl
    return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/constant_op.py", line 308, in _constant_eager_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/constant_op.py", line 106, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)

tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.

I think the problem is memory. I have a GPU with 8 GB.

Note: when I try with 100 files from my data, training coreferee works very well, but with 200 or 3,000 files I get the error.

Does anyone know about this error and what the solution is? Thanks.

Doesn't work with new trf model

Got: ModelNotSupportedError: en_core_web_trf version 3.1.0

Doesn't recognize I, me, myself?

Hey there, amazing work! Three things are apparent:

When given a longer text, if the name "Florian" appears in different sentences the chains are split and "Florian" repeated. So the chains are per sentence, correct?
In a text with "I", "myself" and "me" the coreferences are completely ignored. Correct?
If a first name and last name, e.g. "Florian Schmitt" is used, the chain is only based on "Schmitt" and ignores "Florian"

Am I doing something wrong, or are these findings as designed?

Format file for train

Hi, I want to train a model for Spanish. I read the step-by-step instructions on the main page, and step 8 mentions a corpus. What format does this corpus have? Could someone give me an example of a corpus for training Coreferee in any language?

I read the file loaders.py, specifically the load() method at line 310, and saw that two files are needed: a *.txt file and a *.ann file (a file of annotations?).

How can I get these files, or how can I create them? Thanks

Why only one word?

Why is each mention only one word? Like this,

What if I want cluster 0 to be: Andy Pena(1), he(9)?

How to clean up GPU memory after this runs?

The method is using all available GPUs and I notice that the GPU memory isn't cleaned up after it runs.

Is there any way to:

Specify what GPUs should be used (and also the max memory to consume)?
Clear out the GPU memory after this runs?

Thanks much. The results are pretty impressive.
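
Since the thread shows that coreferee depends on TensorFlow, one option that may help with the first question is to set environment variables before TensorFlow is first imported. CUDA_VISIBLE_DEVICES limits which GPUs the process can see, and TF_FORCE_GPU_ALLOW_GROWTH asks TensorFlow to allocate GPU memory on demand rather than reserving it all up front. The helper name restrict_gpus is made up for this sketch. Whether this suits a given coreferee setup is an assumption to verify.

```python
import os


def restrict_gpus(device_ids, allow_growth=True):
    """Limit visible GPUs and optionally enable on-demand memory allocation.

    Must be called before TensorFlow (and therefore coreferee) is imported,
    because these variables are read when the GPU runtime initialises.
    Hypothetical helper for illustration; not part of coreferee.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in device_ids)
    if allow_growth:
        os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"


if __name__ == "__main__":
    restrict_gpus([0])  # expose only the first GPU to this process
    # Import spacy and coreferee only after this point.
    print(os.environ["CUDA_VISIBLE_DEVICES"])
```

As far as I know, TensorFlow does not hand GPU memory back while the process is alive, so running the pipeline in a separate worker process that exits afterwards is the usual way to free it.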

Scores for each co-reference

Hi,

I was wondering whether it is possible to retrieve an output score for the references found.

Thanks in advance
