rockt / chemspot

ChemSpot is a named entity recognition tool for identifying mentions of chemicals in natural language texts, including trivial names, drugs, abbreviations, molecular formulas and IUPAC entities. Since the different classes of relevant entities have rather different naming characteristics, ChemSpot uses a hybrid approach combining a Conditional Random Field with a dictionary. ChemSpot is released under the Common Public License 1.0.

Home Page: https://www.informatik.hu-berlin.de/forschung/gebiete/wbi/resources/chemspot/chemspot/

License: Other

Java 99.45% Scala 0.55%

chemspot's Introduction

ChemSpot

ChemSpot 2.0 is a set of tools for named entity recognition and classification of chemicals in natural language texts, including trivial names, abbreviations, molecular formulas and IUPAC entities. Since the different classes of relevant entities have rather different naming characteristics, ChemSpot combines a Conditional Random Field and a dictionary with pattern-based recognition, a classifier model and several methods for consolidating the resulting annotations. ChemSpot also performs named entity normalization by assigning identifiers from several chemical databases. It achieves an F1 measure of 79.0% on the SCAI corpus.

ChemSpot is released under the Common Public License 1.0 (see LICENSE).

The warning message "Couldn't open cc.mallet.util.MalletLogger resources/logging.properties file." can be ignored.

Running ChemSpot:

Extract chemspot.zip into a directory

unzip chemspot.zip

To tag a sample text file, run

java -Xmx16G -jar chemspot.jar -t sample.txt -o predict.txt

To update the dictionary, run

java -Xmx5G -jar chemspot.jar -u

If you would like to reduce memory consumption and do not need ChemSpot to assign identifiers to chemicals, you can run it without the ids file. Note however that this will completely disable named entity normalization.

java -Xmx12G -jar chemspot.jar -t sample.txt -o predict.txt -i ""

If you would like to further reduce the memory footprint, you can run ChemSpot without the dictionary or multi-class model as well. Note however that this will result in worse NER performance.

java -Xmx7G -jar chemspot.jar -t sample.txt -o predict.txt -i "" -d ""
java -Xmx9G -jar chemspot.jar -t sample.txt -o predict.txt -i "" -M ""

Parameters

arguments:
- -m path to a CRF model file (internal default model file will be used if not provided)
- -s path to an OpenNLP sentence model file (internal default model file will be used if not provided)
- -d path to a zipped set of brics dictionary automata (parameter defaults to 'dict.zip' if not provided)
- -i path to a zipped tab-separated text file representing a map of terms to ids (parameter defaults to 'ids.zip' if not provided)
- -M path to a multi-class model file (parameter defaults to 'multiclass.bin' if not provided)
flags:
- -e if this flag is set, the performance of ChemSpot on an IOB gold-standard corpus (cf. -c) is evaluated
- -u if this flag is set, ChemSpot will update the dictionary and ids file
- -T number of threads to create when processing a document collection
input control:
- -c path to a directory containing corpora in IOB format
- -g path to a directory containing gzipped text files
- -t path to a text file
- -f path to a directory of text files
output control:
- -o path to output file
- -I if this flag is set, the output will be converted into the IOB format

Using ChemSpot in your Code

ChemSpot tagger = ChemSpotFactory.createChemSpot("dict.zip", "ids.zip", "multiclass.bin");
String text = "The abilities of LHRH and a potent LHRH agonist ([D-Ser-(But),6, " +
  "des-Gly-NH210]LHRH ethylamide) inhibit FSH responses by rat " +
  "granulosa cells and Sertoli cells in vitro have been compared.";

for (Mention mention : tagger.tag(text)) {
  System.out.printf("%d\t%d\t%s\t%s\t%s,\t%s%n", 
    mention.getStart(), mention.getEnd(), mention.getText(), 
    mention.getCHID(), mention.getSource(), mention.getType().toString());
}

Reproducing our Results

Download the SCAI corpus (chemicals-test-corpus-27-04-2009-v3.iob.gz) and put it in the same directory.
To reproduce our results, run

java -Xmx16G -jar chemspot.jar -c chemicals-test-corpus-27-04-2009-v3.iob.gz -o predict.txt -e

Acknowledgements

We would like to thank Daniel Lowe and Philippe Thomas for many valuable suggestions.

chemspot's Issues

eumed-light.jar with newer version of scala

eumed-light.jar has been compiled with an old version of Scala (2.9.2). I need a version compiled with Scala 2.10,

or the source code so that I can compile it myself.

Move match expansion to separate component

Develop a new component for the expansion of matches of chemicals

Improve match expansion

In addition to #15, improve match expansion in order to not expand matches for terms such as "non-cholesterol"

Fix application parameter settings

Certain parameter combinations for the ChemSpot main application produce a strange and somewhat arbitrary behavior. This should be changed so that all parameters work as expected.

Maven deployment

ChemSpot can be installed via Maven, but it would also be nice to automatically create a runnable jar, copy all required files and optionally tar/compress them.

Improve normalization

Find more (offline) sources to retrieve IDs for chemicals from and integrate them into ChemSpot

Brics error when initializing ChemSpot

When calling ChemSpot from within a different Java project, the dictionary fails to load. Perhaps this problem is related to #22?

Failed initializing ChemSpot.
org.apache.uima.resource.ResourceInitializationException: Initialization of annotator class "de.berlin.hu.uima.ae.tagger.brics.BricsTagger" failed. (Descriptor: jar:file:/media/Data/workspaces/wbi/prototype/lib/chemspot.jar!/desc/ae/tagger/BricsTaggerAE.xml)
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.initializeAnalysisComponent(PrimitiveAnalysisEngine_impl.java:254)
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.initialize(PrimitiveAnalysisEngine_impl.java:158)
at org.uimafit.factory.AnalysisEngineFactory.createPrimitive(AnalysisEngineFactory.java:403)
at de.berlin.hu.chemspot.ChemSpot.<init>(ChemSpot.java:118)
at ChemSpotRunner.main(ChemSpotRunner.java:10)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
Caused by: org.apache.uima.resource.ResourceInitializationException
at de.berlin.hu.uima.ae.tagger.brics.BricsTagger.initialize(BricsTagger.java:59)
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.initializeAnalysisComponent(PrimitiveAnalysisEngine_impl.java:252)
... 9 more
Caused by: java.util.zip.ZipException: error in opening zip file
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.<init>(ZipFile.java:214)
at java.util.zip.ZipFile.<init>(ZipFile.java:144)
at java.util.zip.ZipFile.<init>(ZipFile.java:115)
at de.berlin.hu.uima.ae.tagger.brics.BricsMatcher.<init>(BricsMatcher.java:42)
at de.berlin.hu.uima.ae.tagger.brics.BricsTagger.initialize(BricsTagger.java:55)
... 10 more
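
The innermost cause in this trace is a java.util.zip.ZipException raised while opening the dictionary archive, which usually means the path does not point to a valid zip file from the calling project's working directory. The following is a minimal sketch of a pre-flight check that could be run before initializing ChemSpot. The class ZipCheck and its method are hypothetical helpers, not part of ChemSpot, and 'dict.zip' is only the default name mentioned in the parameter list.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipFile;

/** Hypothetical pre-flight check; not part of ChemSpot. */
final class ZipCheck {
    private ZipCheck() {}

    /** Returns true if the path is an existing regular file that java.util.zip can open. */
    static boolean isReadableZip(Path path) {
        if (!Files.isRegularFile(path)) {
            return false; // missing file or directory
        }
        try (ZipFile zip = new ZipFile(path.toFile())) {
            return true; // the archive could be opened
        } catch (IOException e) { // ZipException is a subclass of IOException
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: the dictionary path is passed as the first argument.
        Path dict = Paths.get(args.length > 0 ? args[0] : "dict.zip");
        System.out.println(dict.toAbsolutePath() + " readable as zip: " + isReadableZip(dict));
    }
}
```

Printing the absolute path makes it easy to see which directory a relative name such as 'dict.zip' is resolved against when ChemSpot is embedded in another project.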

Tagging text is slow

934a481

    public List<Mention> tag(String text) throws UIMAException {
        JCas jcas = JCasFactory.createJCas(typeSystem);
        jcas.setDocumentText(text);
        PubmedDocument pd = new PubmedDocument(jcas);
        pd.setBegin(0);
        pd.setEnd(text.length());
        pd.setPmid("");
        pd.addToIndexes(jcas);
        return tag(jcas);
    }

This is slow because a new JCas is initialized every time we tag a string. Instead, keep one pre-initialized JCas and reset it each time this method is called.
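
The reuse idea can be sketched independently of UIMA. In the sketch below, ScratchDoc is a hypothetical stand-in for an expensive container such as a JCas, and ReusingTagger is likewise illustrative, not ChemSpot's actual class. In real UIMA code the corresponding step would presumably be a call to JCas.reset() before setting the new document text.

```java
/** Hypothetical stand-in for an expensive-to-create container such as a JCas. */
final class ScratchDoc {
    static int created = 0;      // counts how many containers were built
    private String text = "";

    ScratchDoc() { created++; }

    void reset() { text = ""; }  // empties the container for the next document
    void setText(String t) { text = t; }
    int length() { return text.length(); }
}

/** Illustrative tagger that holds one pre-initialized container and resets it per call. */
final class ReusingTagger {
    private final ScratchDoc doc = new ScratchDoc(); // created once, up front

    /** Synchronized because the shared container is not safe for concurrent use. */
    synchronized int tag(String text) {
        doc.reset();             // clear the previous document
        doc.setText(text);
        return doc.length();     // placeholder for the real tagging work
    }

    public static void main(String[] args) {
        ReusingTagger tagger = new ReusingTagger();
        tagger.tag("aspirin");
        tagger.tag("LHRH agonist");
        System.out.println("Containers created: " + ScratchDoc.created);
    }
}
```

The trade-off is that a single shared container serializes calls. With several threads (cf. -T), one container per thread would be an alternative design.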

Wrap ChemSpot in a U-Compare compatible UIMA component

Chemspot REST interface

Hi,
FYI, I've quickly developed a wrapper around ChemSpot that offers it as a REST service. This way you don't need 16 GB of memory every time you need to tag a new document.

You can find it here: https://bitbucket.org/lfoppiano/chemspot-web
Regards
Luca

Tagging text from command-line does not work

java -jar -Xmx9G chemspot.jar -m crf_model.bin -s sentence_model.bin.gz -d dict.zip -i ids.zip -t sample.txt -o predict.iob

Exception in thread "main" java.io.IOException: There are no corpora defined.
at de.berlin.hu.chemspot.App.promptForCorpus(App.java:146)
at de.berlin.hu.chemspot.App.main(App.java:270)

Reduce size of LINNAEUS automaton

Suppress "Couldn't open edu.umass.cs.mallet.base.util.MalletLogger resources/logging.properties file" error

Grab the missing 'logging.properties' file at

https://github.com/clulab/banner/blob/master/src/main/java/edu/umass/cs/mallet/base/util/resources/logging.properties

Open the 'ChemSpot/chemspot-2.0/chemspot.jar' file using an archive manager (I used Archive Manager on Ubuntu) and add the 'logging.properties' file at the following location:

cc.mallet.util.resources.logging.properties

[i.e., "/cc/mallet/util/resources/logging.properties"]

Improve recognition of short terms

A lot of false positives are produced by short terms like "IOP", "BMP", "CIA" or "SAM". Find a way to deal with these matches properly (and maybe separately, in a new component?).

Unable to access jarfile chemspot.jar

Hi, I have tried to follow the commands with some sample text but received the following error on Ubuntu 16.04:

Unable to access jarfile chemspot.jar

Suppress messages of other libraries (MALLET, OPSIN etc)

Use the following snippet:

PrintStream oldErr = System.err;
PrintStream newErr = new PrintStream(new ByteArrayOutputStream());
System.setErr(newErr);

// do your work

System.setErr(oldErr);
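
A slightly more defensive variant of the same idea wraps the work in try/finally so that System.err is restored even if the work throws. This is a minimal sketch. QuietErr and capture are hypothetical names, not part of ChemSpot, MALLET or OPSIN.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

/** Hypothetical helper; runs a task with System.err redirected into a buffer. */
final class QuietErr {
    private QuietErr() {}

    /** Runs the task, returns whatever it wrote to System.err, and restores the old stream. */
    static String capture(Runnable task) {
        PrintStream oldErr = System.err;
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream newErr = new PrintStream(buffer, true);
        System.setErr(newErr);
        try {
            task.run();
        } finally {
            newErr.flush();
            System.setErr(oldErr); // restore even if the task throws
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        String swallowed = capture(() -> System.err.println("noisy library message"));
        System.out.println("Suppressed " + swallowed.trim().length() + " characters of stderr output");
    }
}
```

Returning the captured text instead of discarding it keeps the messages available for debugging. Note that libraries which hold their own reference to the original stream, or which log to System.out, are not affected by this redirection.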

Use other databases of Jochem as well

change scala version to 2.11.8

ChemSpot would be wonderful if we could use it with Scala 2.11.8 in our NLP pipeline. However, we have encountered a problem: the class file for scala.ScalaObject is not found.
The reason is that de.berlin.hu.enumed.EntityTagger uses scala.ScalaObject, and we cannot change compiled code.
Is there any chance of getting the source code of the Maven dependency below?

eumed
eumed-rg
1.0.0

If we can get the source code, we could update the ChemSpot code and use it with state-of-the-art libraries.

Thanks.

Upgrade to LINNAEUS 2.0

Generic drug tagger constructor

DrugTagger(String, String, String)

Add constructor for Drug Tagger with more generic input, such as an InputStream.
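
One way such a constructor could look is sketched below, under the assumption that the tagger only needs to read its resources once at construction time. TermTagger is a hypothetical stand-in, not ChemSpot's DrugTagger, and the one-term-per-line format is invented purely for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a tagger that loads a term list; not ChemSpot's DrugTagger. */
final class TermTagger {
    private final List<String> terms = new ArrayList<>();

    /** Generic constructor: reads one term per line from any stream (file, classpath, network). */
    TermTagger(InputStream in) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            String term = line.trim();
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
    }

    /** Path-based convenience factory that delegates to the stream constructor and closes the file. */
    static TermTagger fromPath(String path) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            return new TermTagger(in);
        }
    }

    boolean contains(String term) { return terms.contains(term); }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("aspirin\nibuprofen\n".getBytes(StandardCharsets.UTF_8));
        TermTagger tagger = new TermTagger(in);
        System.out.println("knows aspirin: " + tagger.contains("aspirin"));
    }
}
```

With a stream-based constructor, resources bundled inside a jar can be passed in via getResourceAsStream, while existing path-based callers keep working through the factory method.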