
A fast dictionary-based approach for semantic annotation with approximate string matching algorithms

License: MIT License

annotations dictionary abbreviations-detection semantic-annotation terminology named-entity-recognition fuzzy-matching phrasematcher medical-informatics


IAMsystem

A Java implementation of the IAMsystem algorithm, a fast dictionary-based approach for semantic annotation, a.k.a. entity linking.

Installation

Add the dependency to your pom.xml:

<dependency>
 	<groupId>fr.erias</groupId>
	<artifactId>IAMsystem</artifactId>
	<version>2.2.0</version>
</dependency>

Usage

You provide a list of keywords you want to detect in a document; you can add and combine abbreviations, normalization methods (lemmatization, stemming), and approximate string matching algorithms, and the IAMsystem algorithm performs the semantic annotation.

See the documentation for configuration details. Although the documentation is for the Python implementation, the Java implementation offers the same functionality and uses the same variable names.

Quick example

Matcher matcher = new MatcherBuilder()
		.keywords("North America", "South America")
		.stopwords("and")
		.abbreviations("amer", "America")
		.levenshtein(5, 1, Algorithm.TRANSPOSITION)
		.w(2)
		.build();
List<IAnnotation> annots = matcher.annot("Northh and south Amer.");
for (IAnnotation annot : annots) {
	System.out.println(annot);
}
// Northh Amer	0 6;17 21	North America
// south Amer	11 21	South America

Algorithm

The algorithm was developed in the context of a PhD thesis. It proposes a solution to quickly annotate documents using a large dictionary (> 300K keywords) and fuzzy matching algorithms. No string distance algorithm is implemented in this package; it imports and leverages external libraries. Its algorithmic complexity is O(n log(m)), with n the number of tokens in a document and m the size of the dictionary. The formalization of the algorithm is available in this paper.

It has participated in several semantic annotation competitions in the medical field, where it obtained satisfactory results, for example the best results in the CodiEsp shared task. A dictionary-based model can achieve performance close to a transformer-based model when the task is simple or when the training set is small. Its main advantage is its speed, which allows a baseline to be generated quickly.

How it works

Like FlashText and spaCy's PhraseMatcher, it stores a terminology in a tree data structure (called a trie) for low memory usage and fast lookup.
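As a minimal illustration of the idea (a sketch, not the library's actual classes), each trie node maps a normalized token to a child node, so looking up a keyword of k tokens costs k map lookups:

import java.util.HashMap;
import java.util.Map;

// Trie sketch: keywords sharing a prefix ("North America", "North Dakota")
// share nodes, which keeps memory low and lookups fast.
class TrieNode {
	final Map<String, TrieNode> children = new HashMap<>();
	String keyword; // non-null when a keyword ends at this node

	void add(String[] tokens, String keyword) {
		TrieNode node = this;
		for (String token : tokens) {
			node = node.children.computeIfAbsent(token, t -> new TrieNode());
		}
		node.keyword = keyword;
	}
}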

IAMsystem handles fuzzy string matching at the token level of an n-gram term. For example, it can detect the term "insuffisance cardiaque aigue" in a document containing "ins cardiaqu aigue".

Approximate string matching

By default, IAMsystem performs exact matching only; several approximate string matching algorithms are also available.

You can also add your own fuzzy matching algorithm. Examples of token matching with different algorithms:

| Token in document | Token(s) in terminology | Approximate string matching algorithm |
|---|---|---|
| amocssicilllline | amoxicilline | Soundex |
| amoicilline | amoxicilline | Levenshtein (edit distance of 1) |
| amoxicil. | amoxicilline | Truncation |
| amoxicillinesssss | amoxicilline | ClosestSubString |
| bp | blood pressure | Abbreviations |
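For instance, the abbreviation row of this table can be reproduced with the same builder API as the quick example above (a sketch reusing only methods shown there):

Matcher matcher = new MatcherBuilder()
		.keywords("blood pressure")
		.abbreviations("bp", "blood pressure") // short form, long form, as in the quick example
		.build();
List<IAnnotation> annots = matcher.annot("the patient's bp was normal");
// expected: one annotation linking "bp" to the keyword "blood pressure"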

References

The performance (recall, precision, F-measure) of IAMsystem was evaluated on two information extraction tasks of the CLEF eHealth initiative. IAMsystem's papers:

Organizers' papers:

  • Névéol A, Robert A, Grippo F, Morgand C, Orsi C, Pelikan L, et al. CLEF eHealth 2018 Multilingual Information Extraction Task Overview: ICD10 Coding of Death Certificates in French, Hungarian and Italian. In: CLEF. 2018. http://ceur-ws.org/Vol-2125/invited_paper_18.pdf

  • Miranda-Escalada A, Gonzalez-Agirre A, Armengol-Estapé J, Krallinger M. Overview of automatic clinical coding: annotations, guidelines, and solutions for non-English clinical cases at CodiEsp track of eHealth CLEF 2020. CEUR-WS. 2020; http://ceur-ws.org/Vol-2696/paper_263.pdf

  • Miranda-Escalada A, Farré-Maduell E, Lima-López S, Estrada D, Gascó L, Krallinger M. Mention detection, normalization & classification of species, pathogens, humans and food in clinical documents: Overview of LivingNER shared task and resources. Procesamiento del Lenguaje Natural. 2022.

Release notes:

| Version | Description |
|---|---|
| 0.0.1 | First publication of the algorithm (November 2018). |
| 1.0.0 | First major modification: change the output object of the detector (December 2020); add TermDetector. |
| 1.2.0 | Re-implement the trie and add a cache mechanism to improve performance. |
| 1.3.0 | Add support for the Apache Commons StringEncoder library; add Truncation and ClosestSubString algorithms. |
| 2.1.0 | Complete rewrite of the library to be in sync with the Python implementation and its documentation. The core algorithm, a.k.a. the matching strategy, changed to the Window strategy, which allows detection of discontinuous sequences of tokens in a document. The strategy used in previous versions (<2.1.0) is called the NoOverlap strategy. |
| 2.2.0 | Fix issue 18: create multiple annotations when a keyword is repeated in the same window. |

Acknowledgement

This annotation tool is part of the Drugs Systematized Assessment in real-liFe Environment (DRUGS-SAFE) research platform, which is funded by the French Medicines Agency (Agence Nationale de Sécurité du Médicament et des Produits de Santé, ANSM). This platform aims to provide an integrated system allowing the concomitant monitoring of drug use and safety in France.

Citation

@article{cossin_iam_2018,
	title = {{IAM} at {CLEF} {eHealth} 2018: {Concept} {Annotation} and {Coding} in {French} {Death} {Certificates}},
	shorttitle = {{IAM} at {CLEF} {eHealth} 2018},
	url = {http://arxiv.org/abs/1807.03674},
	urldate = {2018-07-11},
	journal = {arXiv:1807.03674 [cs]},
	author = {Cossin, Sébastien and Jouhet, Vianney and Mougin, Fleur and Diallo, Gayo and Thiessard, Frantz},
	month = jul,
	year = {2018},
	note = {arXiv: 1807.03674},
	keywords = {Computer Science - Computation and Language},
}


iamsystem's Issues

Improve performance of defaultTokenizerNormalizer

The TokenizerNormalizer class is a very important one in the pipeline. It handles the tokenization and the normalization of a document and of the terms of a terminology. Users can customize the tokenization and normalization process.
After tokenization (out: String[] tokens) and normalization (out: String normalized), the class computes the position (start, end) of each token (int[][] tokenStartEndInSentence) by iterating over each character.

This process is not efficient for the defaultTokenizerNormalizer, which uses a whiteSpaceTokenizer: the best way is to compute the positions (start, end) during normalization/tokenization, avoiding iterating twice over the characters of a document.
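A minimal sketch of the single-pass idea for a whitespace tokenizer (illustrative only; the names do not match the library's classes):

import java.util.ArrayList;
import java.util.List;

// Sketch: tokenize on whitespace and record each token's (start, end)
// offsets in the same pass, instead of iterating over the characters again.
static List<int[]> tokenizeWithOffsets(String text, List<String> tokens) {
	List<int[]> offsets = new ArrayList<>();
	int i = 0;
	while (i < text.length()) {
		while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
		int start = i;
		while (i < text.length() && !Character.isWhitespace(text.charAt(i))) i++;
		if (i > start) {
			tokens.add(text.substring(start, i));
			offsets.add(new int[] {start, i});
		}
	}
	return offsets;
}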

Overlapping terms in a document are not detected

For example:
Term1: cancer du poumon
Term2: poumon gauche
Document: "cancer du poumon gauche"
cancer du poumon is detected but poumon gauche is not.
The current behavior of the algorithm is to restart after the last detected token (gauche in this example, since poumon is the last detected token). This behavior guarantees that no overlapping terms are detected in the document.
Another possible behavior is to search for terms at each token of the document, whatever the detection result for the previous token. In some cases this behavior would be desirable.
A parameter should be added to select the desired behavior.

Version 2.0.0

Roadmap to version 2.0.0

I've made a release of the algorithm in Python: https://github.com/scossin/iamsystem_python
Several changes have been made in the Python implementation of the algorithm:

  • window parameter
    It allows the detection of discontinuous keywords.
  • Ignore token order
    It allows a keyword to be detected regardless of the order in which its tokens appear in the document.
  • start/end format
    In IAMsystem <2.0.0, the end value is the index of the last character. In NLP tasks and in the Brat format, the index of the last character + 1 is often expected, which makes it easier to take the substring [start:end].
  • Overlapping terms
    In IAMsystem versions <2.0.0, overlapping terms are automatically removed; this cannot be configured.
  • Better documentation
    Better documentation should explain how to set up the algorithm.
  • Improve the terminology: reuse the same terminology as in the Python implementation
    Using the same terminology will facilitate the understanding of both implementations.

Add a sentence-splitter

Add an easy way to plug a sentence splitter into the pipeline.
The CTcode class, which stores the output of the detection, should have a sentence number.
The TNoutput class, which stores the normalization output, should contain the output of the sentence splitter.

InvalidSentenceLength with some letters

Input text: İbrahim Koral Önal, Levent Özçakar

ITokenizerNormalizer tokenizerNormalizer = TokenizerNormalizer.getDefaultTokenizerNormalizer();
tokenizerNormalizer.tokenizeNormalize("İ");

throws an InvalidSentenceLength because the letter İ is normalized to two characters, "i " (i and a space).
The same happens for the letter Ö.
A simple fix is to remove the trailing space.
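The root cause can be reproduced in plain Java, independently of the library's normalizer:

// 'İ' (U+0130) lowercases to two characters outside the Turkish locale:
// 'i' (U+0069) followed by a combining dot above (U+0307). A normalizer
// that replaces the combining mark with a space therefore outputs "i ".
String s = "İ";
System.out.println(s.length());                                    // prints 1
System.out.println(s.toLowerCase(java.util.Locale.ROOT).length()); // prints 2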

Normalization issue of words containing œ (in French)

TermDetector detector = new TermDetector();
Term term = new Term("oedeme","I50");
detector.addTerm(term);
System.out.println(detector.detect("œdeme").getCTcodes().size());

This is not a normalization issue per se, since normalization requires a one-to-one mapping (character -> normalized character) and 'œ' maps to two characters. A way to handle it is with abbreviations:
A way to handle it is with abbreviations:

TermDetector detector = new TermDetector();
Term term = new Term("oedeme","I50");
Abbreviations abb = new Abbreviations();
abb.addAbbreviation("oedeme", "œdeme");
detector.addTerm(term);
detector.addFuzzyAlgorithm(abb);
System.out.println(detector.detect("œdeme").getCTcodes().size());

A general fix could be to add a method in the Abbreviations class that automatically adds word synonyms: it would take as input the unique tokens of the terminology and, if a token contains 'œ', create a long form by replacing 'œ' with 'oe'.
This method would also handle similar cases.
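A sketch of that method (addOeSynonyms is a hypothetical name; the argument order mirrors the addAbbreviation call above):

// Sketch of the proposed fix: for every unique terminology token containing
// 'œ', register the 'oe' long form, mirroring addAbbreviation("oedeme", "œdeme").
public void addOeSynonyms(Iterable<String> uniqueTokens, Abbreviations abb) {
	for (String token : uniqueTokens) {
		if (token.contains("œ")) {
			abb.addAbbreviation(token.replace("œ", "oe"), token);
		}
	}
}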

Wouldn't it be better to pass a file path instead of an InputStream to the addTerminology method of the Terminology class?

In the Terminology class (IAMsystem/src/main/java/fr/erias/IAMsystem/terminology/Terminology.java), something like this:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

/**
 * Create a terminology object from a CSV file
 * @param fileName the path of the CSV file
 * @param sep the separator of the CSV file (ex: "\t")
 * @param colLabel the ith column containing the libnormal (normalized label of the term)
 * @param colCode the ith column containing the terminology code
 * @param normalizer a {@link INormalizer} to normalize the terms of the terminology
 * @throws IOException file access error
 */
public Terminology(String fileName, String sep, int colLabel, int colCode, INormalizer normalizer) throws IOException {
	File file = new File(fileName);
	InputStream in = new FileInputStream(file);
	BufferedReader br = new BufferedReader(new InputStreamReader(in, "UTF-8"));
	String line = null;
	while ((line = br.readLine()) != null) {
		String[] columns = line.split(sep);
		String label = columns[colLabel];
		String code = columns[colCode];
		addTerm(label, code, normalizer);
	}
	br.close();
	logger.info("terminology size : " + terms.size());
}

might be easier to use, wouldn't it?
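Call-site usage would then look like this (a sketch; terms.tsv, the column indexes, and the normalizer instance are placeholders):

// load a tab-separated file with the term label in column 0 and its code in column 1
Terminology terminology = new Terminology("terms.tsv", "\t", 0, 1, normalizer);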

New feature: ignore some tokens in the Levenshtein distance method

The Levenshtein distance method is great for detecting typos, but it can also generate noise by incorrectly matching a correct word to another one. For example, "maladie" is matched to "malaçie" in French; "maladie" is a word this method should ignore.

I added a set of tokens2ignore to the Levenshtein distance class: fd726cf
Tokens in this set are ignored by the Levenshtein distance method.
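The guard is conceptually simple; a sketch of the idea behind commit fd726cf (not its exact code, and searchIndex is a hypothetical stand-in for the actual lookup):

import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Sketch: tokens declared correct are never sent to the edit-distance search.
private final Set<String> tokens2ignore = new HashSet<>(Arrays.asList("maladie"));

public Set<String> getSynonyms(String token) {
	if (tokens2ignore.contains(token)) {
		return Collections.emptySet(); // skip fuzzy matching for this token
	}
	return searchIndex(token); // hypothetical: the usual Levenshtein lookup
}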

ITokenizerNormalizer anti-pattern

I introduced an anti-pattern in the first versions of IAMsystem; it has become pervasive in the code and annoying.
Whenever we need to access the normalizer or the tokenizer, or to check whether a token is a stopword, we need to call ITokenizerNormalizer this way:

tokenizerNormalizer.getNormalizer().getStopwords().isStopWord(lastToken)
tokenizerNormalizer.getNormalizer().normalize(sentence);
tokenizerNormalizer.getTokenizer().tokenize(sentence);

We should be able to use this class in a simpler way:

tokenizerNormalizer.isStopWord(lastToken)
tokenizerNormalizer.normalize(sentence)
tokenizerNormalizer.tokenize(sentence)

The interface ITokenizerNormalizer should extend INormalizer and ITokenizer; INormalizer should extend IStopwords.
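A sketch of the proposed hierarchy (return types are assumed from the calls above):

// Proposed interface hierarchy: the delegation chain disappears because
// ITokenizerNormalizer inherits the methods directly.
public interface IStopwords {
	boolean isStopWord(String token);
}

public interface INormalizer extends IStopwords {
	String normalize(String sentence);
}

public interface ITokenizer {
	String[] tokenize(String sentence);
}

public interface ITokenizerNormalizer extends INormalizer, ITokenizer {
	// inherits isStopWord(..), normalize(..) and tokenize(..), allowing
	// tokenizerNormalizer.isStopWord(lastToken), .normalize(sentence), .tokenize(sentence)
}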

Abbreviations.addAbbreviation does not normalize the abbreviation

In IAMsystem/src/main/java/fr/erias/IAMsystem/synonym/Abbreviations.java (lines 40-50):

/**
	 * Add abbreviations
	 * @param term (ex : 'insuf')
	 * @param abbreviation (ex : 'insuffisance')
	 * @param tokenizerNormalizer a {@link ITokenizerNormalizer}
	 */
	public void addAbbreviation(String term, String abbreviation, ITokenizerNormalizer tokenizerNormalizer) {
		String normalizedTerm = tokenizerNormalizer.getNormalizer().getNormalizedSentence(term);
		String[] tokensArray = tokenizerNormalizer.getTokenizer().tokenize(normalizedTerm);
		addAbbreviation(tokensArray, abbreviation);
	}

should be replaced by

/**
	 * Add abbreviations
	 * @param term (ex : 'insuf')
	 * @param abbreviation (ex : 'insuffisance')
	 * @param tokenizerNormalizer a {@link ITokenizerNormalizer}
	 */
	public void addAbbreviation(String term, String abbreviation, ITokenizerNormalizer tokenizerNormalizer) {
		String normalizedAbbreviation = tokenizerNormalizer.getNormalizer().getNormalizedSentence(abbreviation);
		String normalizedTerm = tokenizerNormalizer.getNormalizer().getNormalizedSentence(term);
		String[] tokensArray = tokenizerNormalizer.getTokenizer().tokenize(normalizedTerm);
		addAbbreviation(tokensArray, normalizedAbbreviation);
	}

Otherwise an uppercase abbreviation would not be detected. For instance, if one calls .addAbbreviation("accident vasculaire cérébral", "AVC", ITokNorm), it is not (at the present time) detected in the sentence 'le patient souffre d'un avc', since the registered abbreviation is the uppercase form 'AVC'.

Suggestion: instantiate the Levenshtein index with maxEdits and minNchar directly in the TermDetector class

At the present time, the .addLevenshteinIndex method in the TermDetector class only implements the default version of the Levenshtein synonym checking.

I propose to add the following overloaded method to the TermDetector class, in
IAMsystem/src/main/java/fr/erias/IAMsystem/detect/TermDetector.java

/**
 * Create a Lucene index
 * @param terminology {@link Terminology}
 * @param maxEdits the maximum edit distance allowed
 * @param minNchar the minimum number of characters a token must have to be fuzzy-matched
 * @throws IOException if the index cannot be created
 */
public void addLevenshteinIndex(Terminology terminology, int maxEdits, int minNchar) throws IOException {
	IndexBigramLucene.IndexLuceneUniqueTokensBigram(terminology, this.tokenizerNormalizer); // create the index; do it only once
	LevenshteinTypoLucene levenshteinTypoLucene = new LevenshteinTypoLucene(); // open the index
	levenshteinTypoLucene.setMaxEdits(maxEdits);
	levenshteinTypoLucene.setMinNchar(minNchar);
	this.levenshtein = levenshteinTypoLucene;
}

IAM at CLEF EHEALTH 2020 - Paper question

Hi Sébastien,

I have a query about your interesting IAM paper which I was hoping you might be able to help me with. I tried to email you at
[email protected] but my emails bounced; perhaps that address is no longer in use.

In the paper you mention that: “The second dictionary (run2) was the combination of the first dictionary and the normalized labels of the ICD10-CM terminology. It contained a total of 94,386 terms.”
Could you please share some more details about how you constructed the dictionary of labels from the ICD10-CM terminology as I would like to be able to replicate your method for the CodiEsp English dataset.

Ideally if you are able to share any relevant code or scripts used to generate the Spanish ICD10 dictionary that would be much appreciated and would help me implement it for the English ICD terminology.

Kind regards,
Joe


PS: if you would prefer to correspond via email, mine is:

joseph.boyle[at]mre.medical.canon
(where [at] = @)
