
huner's Introduction

HUNER

We recently published HunFlair, a reimplementation of HUNER inside the Flair framework. By using language models, HunFlair considerably outperforms HUNER. In addition, as part of Flair, HunFlair is easy to install and has no dependency on Docker. We recommend that all HUNER users migrate to HunFlair.

HUNER is a state-of-the-art NER model for biomedical entities. It comes with models for genes/proteins, chemicals, diseases, species and cell lines.

The code is based on the great LSTM-CRF NER tagger implementation glample/tagger by Guillaume Lample.

Content

Section Description
Installation How to install HUNER
Usage How to use HUNER
Models Available pretrained models
Corpora The HUNER Corpora

Installation

  1. Install docker
  2. Clone this repository to $dir
  3. Download the pretrained model you want to use from here, place it into $dir/models/$model_name and untar it using tar xzf $model_name
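The steps above can be sketched as shell commands. This is a hedged illustration only: the repository URL is the one referenced in the Corpora section, and gene_all is used as an example model name; the actual download link comes from the Models section.

```shell
# 1. Install Docker first (see the Docker documentation for your platform).

# 2. Clone this repository to $dir.
dir=$HOME/huner
git clone https://github.com/hu-ner/huner.git "$dir"

# 3. Download the pretrained model you want (e.g. gene_all), place it
#    into $dir/models/$model_name and untar it.
model_name=gene_all
mkdir -p "$dir/models/$model_name"
cd "$dir/models/$model_name"
# ... download the model tarball here, then:
tar xzf "$model_name"
```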

Usage

Tagging

To tokenize, sentence-split and tag a file INPUT.TXT:

  1. Start the HUNER server from $dir using ./start_server.sh $model_name. The model must reside in the directory $dir/models/$model_name.
  2. Tag text with python client.py INPUT.TXT OUTPUT.CONLL --name $model_name.

The output will then be written to OUTPUT.CONLL in the CoNLL 2003 format.

The options for client.py are:

  • --assume_tokenized: The input is already pre-tokenized and the tokens are separated by whitespace
  • --assume_sentence_splitted: The input is already split into sentences and each line of the input contains one sentence
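For downstream processing, the tagged OUTPUT.CONLL can be read back with a few lines of Python. This is a minimal sketch: it assumes the token is in the first column and the NER tag in the last, with blank lines separating sentences; check your actual output for the exact column layout.

```python
def read_conll(lines):
    """Parse CoNLL-style lines into sentences of (token, tag) pairs.

    Assumes token in the first column, NER tag in the last column,
    and blank lines between sentences.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split()
        current.append((cols[0], cols[-1]))
    if current:
        sentences.append(current)
    return sentences

example = ["IL-2 B-Gene", "expression O", "", "p53 B-Gene"]
print(read_conll(example))
```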

Fine-tuning on a new corpus

The steps to fine-tune a base-model $base_model (e.g. gene_all) on a new corpus $corpus are:

  1. Copy the chosen base-model to a new directory, because the weight files will be updated during fine-tuning:
cp -r $dir/models/$base_model $dir/models/$fine_tuned_model
  2. Convert your corpus to the CoNLL format and split it into train, dev and test portions. If you don't want to use dev or test data, you can simply provide the training data as dev or test. Note, however, that without dev data results will probably suffer, because early stopping cannot be performed.
  3. Fine-tune the model:
./train.sh $fine_tuned_model $corpus_train $corpus_dev $corpus_test

After successful training, $fine_tuned_model will contain the fine-tuned model and can be used exactly like the models provided by us.
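The corpus-conversion step can be illustrated with a small sketch that renders tokenized, tagged sentences in a CoNLL-style layout. The two-column token/tag format here is an assumption for illustration; the conversion scripts in ner_scripts define the authoritative format.

```python
def to_conll(sentences):
    """Render sentences of (token, tag) pairs in a CoNLL-style layout:
    one token per line, a blank line between sentences."""
    blocks = ["\n".join(f"{tok} {tag}" for tok, tag in sent) for sent in sentences]
    return "\n\n".join(blocks) + "\n"

corpus = [
    [("BRCA1", "B-Gene"), ("mutations", "O"), ("cause", "O")],
    [("p53", "B-Gene")],
]
print(to_conll(corpus))
```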

Retraining a base-model from scratch (without fine-tuning)

To train a model from scratch without initializing it from a base-model, proceed as follows:

  1. Convert your corpus to the CoNLL format and split it into train, dev and test portions. If you don't want to use dev or test data, you can simply provide the training data as dev or test. Note, however, that without dev data results will probably suffer, because early stopping cannot be performed.
  2. Train the model:
./train_no_finetune.sh $corpus_train $corpus_dev $corpus_test

After successful training, the model can be found in a newly created directory in models/. The directory name reflects the chosen hyper-parameters and usually reads like tag_scheme=iob,lower=False,zeros=False,char_dim=25....
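The directory name is simply the hyper-parameters serialized as key=value pairs. A small sketch of how such a name can be built and parsed back (the parameter names below are the examples from the text, not the full set):

```python
def params_to_dirname(params):
    # Join hyper-parameters into a "k1=v1,k2=v2,..." directory name.
    return ",".join(f"{k}={v}" for k, v in params.items())

def dirname_to_params(name):
    # Invert the mapping; values come back as strings.
    return dict(item.split("=", 1) for item in name.split(","))

params = {"tag_scheme": "iob", "lower": "False", "zeros": "False", "char_dim": "25"}
name = params_to_dirname(params)
print(name)  # tag_scheme=iob,lower=False,zeros=False,char_dim=25
```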

Models

Model Test sets P / R / F1 (%) CRAFT P / R / F1 (%)
cellline_all 70.40 / 65.37 / 67.76 -
chemical_all 83.34 / 80.26 / 81.71 53.56 / 35.85 / 42.95
disease_all 75.01 / 77.71 / 76.20 -
gene_all 72.33 / 76.28 / 73.97 59.67 / 65.98 / 62.66
species_all 77.88 / 74.86 / 73.33 98.51 / 73.83 / 84.40

Corpora

For details and instructions on the HUNER corpora please refer to https://github.com/hu-ner/huner/tree/master/ner_scripts and the corresponding readme.

Citation

Please use the following bibtex entry:

@article{weber2019huner,
  title={HUNER: Improving Biomedical NER with Pretraining},
  author={Weber, Leon and M{\"u}nchmeyer, Jannes and Rockt{\"a}schel, Tim and Habibi, Maryam and Leser, Ulf},
  journal={Bioinformatics},
  year={2019}
}

huner's People

Contributors: leonweber, nakumgaurav, yetinam


huner's Issues

HTTP Connection Error on running client.py

I followed the steps outlined in the readme to annotate a sample text file:

./start_server.sh gene_all
python3 client.py test.txt output.conll --name gene_all

The docker container is created and the server runs, but on running client.py I get the HTTP Connection error:

requests.exceptions.ConnectionError: HTTPConnectionPool(host='172.17.0.2', port=5000): Max retries exceeded with url: /tag (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x1088b5750>: Failed to establish a new connection: [Errno 60] Operation timed out'))

Here is my system info:

OS: Mac OS Catalina 10.15.1
Running zsh shell
Docker version 19.03.5, build 633a0ea

Any help would be appreciated!

Craft to CoNLL error for conversion

The craft_to_conll script raises an index-out-of-range error in the merge-sentences function when it uses the OpenNLP wrapper, at the following line:

spaces = self.sentence_offsets[id+1]-self.sentence_offsets[id]-len(self.sentences[id])

How does this work? Is this an oversight?
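For reference, the quoted expression computes the amount of whitespace between consecutive sentences from their document offsets. A hedged reconstruction with made-up values (the attribute names mirror the snippet):

```python
# Two sentences in a document: "Hello world.  Second one."
sentences = ["Hello world.", "Second one."]
sentence_offsets = [0, 14]  # start offset of each sentence in the document

sid = 0
# Gap between the end of sentence `sid` and the start of sentence `sid + 1`.
spaces = sentence_offsets[sid + 1] - sentence_offsets[sid] - len(sentences[sid])
print(spaces)  # 2

# An IndexError here means `sid + 1` ran past the last sentence, i.e. the
# iteration must stop at the second-to-last sentence.
```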

Can't download biomedical abstracts of HUNER

Dear HUNER developers,

I am trying to process the biomedical NER datasets used in HUNER. While downloading the datasets, some request errors occur. I'm wondering if you could share the datasets directly, maybe via Google Drive?

Thanks for your help!

Inclusion of cell type information in Hunflair

First of all, thank you for this resource. This is more of a curiosity: CRAFT v4 contains cell type annotations as well. Did you try named entity detection on them with HunFlair? If yes, how did it work out?

Multiple Entity Types Predicted per Mention

Hi, it seems that some mentions are being predicted as belonging to multiple entity types:

[screenshot: mentions tagged with multiple entity types]

I am assuming this happens since HUNER trains a separate model for each entity type. Is there a way to get the confidence scores of the predictions for each entity type, so as to select one out of the many categories? Thanks!

Request timeout for BioInfer dataset

When running python download_files.py, I get the following error:

Traceback (most recent call last):
  File "/mnt/home/lotrecks/.local/lib/python3.7/site-packages/urllib3/connection.py", line 175, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/mnt/home/lotrecks/.local/lib/python3.7/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/mnt/home/lotrecks/.local/lib/python3.7/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/home/lotrecks/.local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 710, in urlopen
    chunked=chunked,
  File "/mnt/home/lotrecks/.local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 398, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/mnt/home/lotrecks/.local/lib/python3.7/site-packages/urllib3/connection.py", line 244, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/opt/software/Python/3.7.2-GCCcore-6.4.0/lib/python3.7/http/client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/opt/software/Python/3.7.2-GCCcore-6.4.0/lib/python3.7/http/client.py", line 1275, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/opt/software/Python/3.7.2-GCCcore-6.4.0/lib/python3.7/http/client.py", line 1224, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/opt/software/Python/3.7.2-GCCcore-6.4.0/lib/python3.7/http/client.py", line 1016, in _send_output
    self.send(msg)
  File "/opt/software/Python/3.7.2-GCCcore-6.4.0/lib/python3.7/http/client.py", line 956, in send
    self.connect()
  File "/mnt/home/lotrecks/.local/lib/python3.7/site-packages/urllib3/connection.py", line 205, in connect
    conn = self._new_conn()
  File "/mnt/home/lotrecks/.local/lib/python3.7/site-packages/urllib3/connection.py", line 187, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x2ab4f4514588>: Failed to establish a new connection: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/home/lotrecks/.local/lib/python3.7/site-packages/requests/adapters.py", line 499, in send
    timeout=timeout,
  File "/mnt/home/lotrecks/.local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 788, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/mnt/home/lotrecks/.local/lib/python3.7/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='mars.cs.utu.fi', port=80): Max retries exceeded with url: /BioInfer/files/BioInfer_corpus_1.1.1.zip (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x2ab4f4514588>: Failed to establish a new connection: [Errno 110] Connection timed out'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "download_files.py", line 44, in <module>
    resp = requests.get('http://mars.cs.utu.fi/BioInfer/files/BioInfer_corpus_1.1.1.zip')
  File "/mnt/home/lotrecks/.local/lib/python3.7/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/mnt/home/lotrecks/.local/lib/python3.7/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/mnt/home/lotrecks/.local/lib/python3.7/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/mnt/home/lotrecks/.local/lib/python3.7/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/mnt/home/lotrecks/.local/lib/python3.7/site-packages/requests/adapters.py", line 565, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='mars.cs.utu.fi', port=80): Max retries exceeded with url: /BioInfer/files/BioInfer_corpus_1.1.1.zip (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x2ab4f4514588>: Failed to establish a new connection: [Errno 110] Connection timed out'))

I've tried accessing the link directly in my browser, and I get a timeout error there as well.

Do you know of another place to get the data in a format that works with your bioinfer_to_conll.py script? I've only been able to find the dataset on huggingface, and while I could do the format conversion myself, it would save a lot of time to use code that's already been written.

Thanks!
Serena

?? symbols in annotations

Hi,

I am annotating a bunch of articles (ACS SynBio) using huner. For the most part, the results make sense but for some files, there are "?" symbols in annotations. You can have a look at a sample paragraph and its annotations:

The columns were maintained at 40 °C; 0.1% formic acid in acetonitrile (eluent A) and in water (eluent B) were used as eluents with the flow rate of 0.5 mL min –1 as follows: 0–1.5 min, 95% A, 5% B; 1.5–7.0 min, decrease 16.9% min –1 A, increase 16.9% min –1 B; 7.0–9.0 min, 2% A, 98% B; 9.0–9.1 min, increase 15.5% s –1 A, decrease 15.5% s –1 B; 9.1–11.5 min, 95% A, 5% B. Injection volume of 2 μm was used for the analytes. Absorbance was measured with a UV/vis detector at 600 nm. Analysis of Orthogonal Activators The functionality of each sTF and orthogonality in S. cerevisiae was tested by fluorescence measurements. The experiments were initiated by pre-cultivating S. cerevisiae cells (strains marked with asterisk in the column “Analysis of orthogonal activators” in the file) at 30 °C on YPD plates for 24 h. Four ml of SCD-UL medium in 24-well plate was inoculated to initial optical density of 0.2 (OD 600 ) by the pre-culture. Three parallel replicates were cultivated for each strain. Cells were cultivated 18 h at 28 °C, 800 rpm. Fluorescence was measured as described in the “ ” section. In addition to the fluorescence measurements, transcription analysis was performed for the subset of the strains. Strains containing sTF and Venus expression cassettes either with 2 or 8 sTF binding sites (strains H4623, ...

the annotations for which (from "acetontrile" in first line to "H4623") are:

acetonitrile
2 ? ?m
? ? ? ? ? ?
H4623

Also, some useful entities have "?" symbols between their tokens, such as "E. ?? coli membrane". It is not a significant issue, since I can skip the entities that contain such erroneous symbols (or parse them with a regex), but it would be helpful to know why it happens. Thanks!

Path names in convert_craft.sh

convert_craft.sh should arguably contain relative path names instead of ~/code/huner/ner_scripts/scripts/craft_to_conll.py

Entity normalization

Hi,
I wanted to ask whether HUNER assigns an ID to the identified entities (e.g. an ENTREZ or UniProt ID for genes), or whether it only returns the found term and its offsets?
Thank you!
Lea

Error converting many datasets

I am facing errors converting most of the datasets. For instance, JNLPBA gives the following error:

Converting JNLPBA
Traceback (most recent call last):
  File "scripts/jnlpba_to_conll.py", line 25, in <module>
    f_in.__next__()
AttributeError: 'file' object has no attribute '__next__'

It would be nice if you could fix the docker version of convert_corpora. Thanks!

Error in NER scripts

Hi,

I get the below error while converting. Where are the converted files being saved?

File "scripts/biosemantics_to_conll.py", line 68, in <module>
  with open(args.output, 'w') as f_out:
PermissionError: [Errno 13] Permission denied: '/biosemantics.conll'

How to make sure that there is no overlapping between train set, dev set and test set?

Hi, I read the HUNER paper and I'm interested in the dataset split method that you used. You mentioned in the paper that

"To avoid knowledge leaks in the gold standard settings, we adjusted the splits in such a way that there is no overlap between the train, development and test splits across corpora for the same entity type. That is, we ensure that a sentence is contained either only in the train or development or test portion of all corpora, even when it is contained by multiple corpora. This is especially important, as some corpora are based on the same documents."

I'm wondering: when combining the corresponding corpora for each entity type, is it guaranteed that there are no duplicates and no overlap between the dataset splits? Did you ensure that when splitting each corpus according to the split ids? I ask because I couldn't find the code that concatenates all the corpora after the per-corpus split for each entity type.
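A minimal sketch of the constraint described in the quoted paragraph: assign each unique sentence to exactly one split, so that a sentence appearing in several corpora always lands in the same portion. This is an illustration only, not the authors' actual code; the paper uses fixed split ids rather than the round-robin shown here.

```python
def assign_splits(corpora):
    """Assign every unique sentence to train/dev/test consistently
    across corpora. `corpora` maps corpus name -> list of sentences."""
    split_of_sentence = {}
    splits = ("train", "dev", "test")
    out = {name: {s: [] for s in splits} for name in corpora}
    counter = 0
    for name, sentences in corpora.items():
        for sent in sentences:
            if sent not in split_of_sentence:
                # First time we see this sentence anywhere: fix its split.
                split_of_sentence[sent] = splits[counter % 3]
                counter += 1
            # Every later occurrence reuses the same split.
            out[name][split_of_sentence[sent]].append(sent)
    return out

corpora = {"A": ["s1", "s2"], "B": ["s2", "s3"]}
assigned = assign_splits(corpora)
print(assigned)
```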

Local version?

I want to know whether HUNER provides a usable local version, because the Docker configuration is very troublesome.

Program error

Hello, I encountered the following problem while running sudo python3 client.py INPUT.TXT OUTPUT.CONLL --name gene_all --assume_sentence_splitted. My INPUT.TXT has been split into one sentence per line. I hope you can help me, thank you.

(base) xinzhi@xinzhi-QTK5:~/Desktop/PTO/HUNER/huner-master$ sudo python3 client.py INPUT.TXT OUTPUT.CONLL --name gene_all --assume_sentence_splitted
Traceback (most recent call last):
  File "client.py", line 130, in <module>
    tagged_line = tagger.tag(buff, split_sentences=split_sentences, tokenize=not args.assume_tokenized)[0]
  File "client.py", line 96, in tag
    results.append(response.json())
  File "/usr/lib/python3/dist-packages/requests/models.py", line 892, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib/python3/dist-packages/simplejson/__init__.py", line 518, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3/dist-packages/simplejson/decoder.py", line 370, in decode
    obj, end = self.raw_decode(s)
  File "/usr/lib/python3/dist-packages/simplejson/decoder.py", line 400, in raw_decode
    return self.scan_once(s, idx=_w(s, idx).end())
simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Mix-up between metrics in the readme and in the HUNER paper?

Hi,
The table in the readme (https://github.com/hu-ner/huner#readme) lists evaluation results for the HUNER models on the HUNER test sets. As far as I can see the values in this readme table do not match the evaluation results in table 3 in the HUNER paper (https://doi.org/10.1093/bioinformatics/btz528) and only the values for the chemical model and disease model match table 4 in the paper. Also, table 4 presents results without fine-tuning and I assume the models in this repo were fine-tuned.

Was there some kind of mix up with the tables? Which is correct and is the evaluation code or a detailed evaluation description available?

Thank you

'Model' object has no attribute 'id_to_word'

Python 3
Environment: Google Colab
Command: !python /content/huner/train.py "/content/huner/models/gene_all" --train "/content/huner/data/TRAIN.txt" --dev "/content/huner/data/DEV.txt" --test "/content/huner/data/TEST.txt"

Traceback (most recent call last):
  File "/content/huner/train.py", line 139, in <module>
    f_train, f_eval = model.build(**parameters)
  File "/content/huner/model.py", line 131, in build
    n_words = len(self.id_to_word)
AttributeError: 'Model' object has no attribute 'id_to_word'

Any ideas or tips? My goal is to fine-tune HUNER on my dataset (CoNLL 2003 format, with three columns: Word, POS, Tag).

KeyError: 'OPENNLP'

Hi, I get the following error when I run the NER script. How could this be fixed?

python3 $SCRIPT_DIR/biosemantics_to_conll.py biosemantics M,I,Y,D,B,C,F,R,G,MOA biosemantics_chemical.conll
Traceback (most recent call last):
  File "scripts/biosemantics_to_conll.py", line 16, in <module>
    opennlp_path = os.environ['OPENNLP']
  File "/usr/lib/python3.6/os.py", line 669, in __getitem__
    raise KeyError(key) from None
KeyError: 'OPENNLP'

Issue about CRAFT corpus.

We are making a comparison with HUNER on CRAFT in our new research.
Could you please provide the processed CRAFT corpus or the processing details, because the results of our experiments are inconsistent with those in the paper.

Conversion scripts not working

The line
python3 $SCRIPT_DIR/bc2gm_gene_to_conll.py $DATA_DIR/bc2gm/test/test.in $DATA_DIR/bc2gm/test/GENE.eval $GENE_DIR/bc2gm2.conll

in convert_corpora.sh produces the following error:

Converting BC2GM   4%|██▉ | 209/5000 [00:22<08:42, 9.16it/s]
Traceback (most recent call last):
  File "scripts/bc2gm_gene_to_conll.py", line 53, in <module>
    utils.write_to_conll(sentences, entities, document_ids, f_out)
  File "/Users/apple/Desktop/UCSD/SBKS/huner/ner_scripts/scripts/utils.py", line 69, in write_to_conll
    pos_tags = pos_tagger.parse(' '.join(tokens)).decode()
  File "/Users/apple/Desktop/UCSD/SBKS/huner/ner_scripts/scripts/opennlp_wrapper.py", line 35, in parse
    self.process.expect('\r\n', timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pexpect/spawnbase.py", line 344, in expect
    timeout, searchwindowsize, async_)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pexpect/spawnbase.py", line 372, in expect_list
    return exp.expect_loop(timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pexpect/expect.py", line 181, in expect_loop
    return self.timeout(e)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pexpect/expect.py", line 144, in timeout
    raise exc
pexpect.exceptions.TIMEOUT: Timeout exceeded.
<pexpect.pty_spawn.spawn object at 0x10a357f50>
command: /Users/apple/Desktop/UCSD/SBKS/apache-opennlp-1.9.2/bin/opennlp
args: ['/Users/apple/Desktop/UCSD/SBKS/apache-opennlp-1.9.2/bin/opennlp', 'POSTagger', '/Users/apple/Desktop/UCSD/SBKS/apache-opennlp-1.9.2/models/en-pos-maxent.bin']
buffer (last 100 chars): b'\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07'
before (last 100 chars): b'\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07\x07'
after: <class 'pexpect.exceptions.TIMEOUT'>
match: None
match_index: None
exitstatus: None
flag_eof: False
pid: 56959
child_fd: 7
closed: False
timeout: 30
delimiter: <class 'pexpect.exceptions.EOF'>
logfile: None
logfile_read: None
logfile_send: None
maxread: 2000
ignorecase: False
searchwindowsize: None
delaybeforesend: 0.05
delayafterclose: 0.1
delayafterterminate: 0.1
searcher: searcher_re:
  0: re.compile(b'\r\n')
What could be wrong here?

Query regarding datasets

Hi,
For my project, I am required to extract all the entities in all the datasets that HUNER uses for training. I saw that many of the datasets have train/test splits. Have you used the entire dataset to train the HUNER models, or only the train portion of each dataset?
