smitkiri / ehr-relation-extraction

NER and Relation Extraction from Electronic Health Records (EHR).

License: MIT License

Python 86.43% JavaScript 7.80% Makefile 1.11% HTML 4.66%

ehr biobert ner bilstm-crf adverse-drug-events n2c2 ehr-records named-entity-recognition relation-extraction bert-relation-extraction

ehr-relation-extraction's Issues

BertTokenizer lock file not found

When I run the RE training, I get this error:

FileNotFoundError: [Errno 2] No such file or directory: './dataset/cached_train_BertTokenizer_128_ehr-re.lock'

How can I solve it?
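
One possible cause, stated as an assumption rather than a confirmed diagnosis: the lock file is created inside the data directory, so if ./dataset does not exist relative to the current working directory, the lock cannot be created. The sketch below shows a pre-flight check; the helper names ensure_data_dir and can_create_lock are hypothetical and not part of the repository.

```python
import os
import tempfile


def ensure_data_dir(data_dir: str) -> str:
    """Create data_dir if it is missing and return its absolute path."""
    abs_dir = os.path.abspath(data_dir)
    os.makedirs(abs_dir, exist_ok=True)
    return abs_dir


def can_create_lock(data_dir: str, lock_name: str) -> bool:
    """Return True if a file named lock_name can be created inside data_dir."""
    lock_path = os.path.join(data_dir, lock_name)
    try:
        # Opening in append mode creates the file without truncating it.
        with open(lock_path, "a"):
            pass
        return True
    except FileNotFoundError:
        # The parent directory does not exist.
        return False


# Demonstration in a throwaway directory (paths are illustrative only).
with tempfile.TemporaryDirectory() as tmp:
    data_dir = os.path.join(tmp, "dataset")
    print("before:", can_create_lock(data_dir, "cached_train.lock"))
    ensure_data_dir(data_dir)
    print("after:", can_create_lock(data_dir, "cached_train.lock"))
```

If the directory already exists, it is also worth checking that the script is launched from the directory the relative path assumes.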

Dataset

Is the dataset available for download?

Error generating data using default tokenizer

Hi, thank you so much for your repository, it has been extremely helpful for me in my research work.

I would like to highlight an issue that occurs when running generate_data.py with the "default" tokenizer option (it does not occur with the scispacy tokenizer, which is the option currently used by default).

The following error is encountered:

$ python3 -m app.training.generate_data

Reading data

Train data:
Progress: [====================] 303/303

Test Data:
Progress: [==========>         ] 103/202
/home/jiayi/adverse_drug_event_extraction/app/training/ehr.py:179: UserWarning: Invalid annotation encountered: ['12 hours']
  warnings.warn("Invalid annotation encountered: " + str(line))
Progress: [==============>     ] 145/202
Traceback (most recent call last):
  File "/home/jiayi/anaconda3/envs/training/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/jiayi/anaconda3/envs/training/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/jiayi/adverse_drug_event_extraction/app/training/generate_data.py", line 214, in <module>
    main()
  File "/home/jiayi/adverse_drug_event_extraction/app/training/generate_data.py", line 175, in main
    train_dev, test = read_data(data_dir=args.input_dir,
  File "/home/jiayi/adverse_drug_event_extraction/app/training/utils.py", line 293, in read_data
    record = HealthRecord(fid, text_path=os.path.join(test_path, fid + '.txt'),
  File "/home/jiayi/adverse_drug_event_extraction/app/training/ehr.py", line 62, in __init__
    self.set_tokenizer(tokenizer)
  File "/home/jiayi/adverse_drug_event_extraction/app/training/ehr.py", line 269, in set_tokenizer
    self._compute_tokens()
  File "/home/jiayi/adverse_drug_event_extraction/app/training/ehr.py", line 226, in _compute_tokens
    raise Exception("Error computing token to char map.")
Exception: Error computing token to char map.

After some debugging, I found that this is caused by the two "###" instances in the test-set file 109724.txt. After removing both instances, everything ran fine. I am not sure why this happens, but I thought it was worth pointing out for future reference.

Additionally, after training, the model predictions were unexpected. The output was supposed to be [('ROU', 1, 1), ('DUR', 2, 2), ('DOS', 3, 3), ('ADE', 4, 4), ('FRE', 5, 6)] but the model returned [('inpatient', 0, 0), ('Pseudoaneurysm', 1, 1), ('Fevers', 2, 4), ('QTc', 5, 6)].

I ended up adopting the scispacy tokenizer which works perfectly. Cheers!
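
To illustrate the kind of failure involved, here is a minimal sketch of a token-to-character alignment. It is not the repository's _compute_tokens implementation; the function name and the left-to-right search strategy are assumptions made for illustration. The alignment fails whenever a token produced by the tokenizer cannot be found verbatim in the remaining text, which is one way an unusual sequence such as "###" could trigger the error if the tokenizer rewrites it.

```python
from typing import List, Tuple


def token_char_spans(text: str, tokens: List[str]) -> List[Tuple[int, int]]:
    """Map each token to its (start, end) character span in text.

    Tokens are searched left to right; a token that cannot be found
    after the previous token's end raises ValueError.
    """
    spans = []
    cursor = 0
    for tok in tokens:
        start = text.find(tok, cursor)
        if start == -1:
            raise ValueError(f"Cannot align token {tok!r} after offset {cursor}")
        end = start + len(tok)
        spans.append((start, end))
        cursor = end
    return spans


# A clean alignment on an invented sentence.
print(token_char_spans("take 5 mg", ["take", "5", "mg"]))
```

Running such a check per file makes it easier to see which record and which token break the alignment.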

F1 score is zero when using run_ner.py and run_re.py to fine-tune the BERT model

When I use the run_ner.py and run_re.py scripts, I get an F1 score and precision of zero for both the NER model and the RE model. I think the problem lies in the format of the training data produced by generate_data. How did you overcome that?

Missing .tsv files in generated dataset for RE

Hi,
I used generate_data.py to generate the training dataset for the RE task, but the output contains only .pkl and .txt files and no .tsv files, which I believe are necessary for training the model.

Any idea on how to fix this?

Issue with running the NER script with only the --do_predict argument

The NER model works fine when I run the NER script with the --do_train --do_predict arguments.
However, when I only want the prediction results and run the script with --do_predict alone (without --do_train) using the trained model, I get completely wrong results.
Is there something small I am missing?

Error running run_re.py - TypeError: TextInputSequence must be str

Hi there,

I was running the file run_re.py for biobert_re where I've encountered the following error:

01/21/2022 16:21:02 - WARNING - __main__ -   Process rank: -1, device: cpu, n_gpu: 0, distributed training: False, 16-bits training: False
01/21/2022 16:21:16 - INFO - utils_re -   Creating features from dataset file at /data/jiayi/n2c2/dataset_re
Traceback (most recent call last):
  File "run_re.py", line 230, in <module>
    main()
  File "run_re.py", line 103, in main
    REDataset(data_args, tokenizer=tokenizer, cache_dir=model_args.cache_dir) if training_args.do_train else None
  File "/home/jiayi/adverse_drug_event_extraction/app/bilstm_crf_re/utils_re.py", line 132, in __init__
    self.features = glue_convert_examples_to_features(
  File "/home/jiayi/adverse_drug_event_extraction/app/bilstm_crf_re/data_processor.py", line 40, in glue_convert_examples_to_features
    return _glue_convert_examples_to_features(
  File "/home/jiayi/adverse_drug_event_extraction/app/bilstm_crf_re/data_processor.py", line 74, in _glue_convert_examples_to_features
    batch_encoding = tokenizer(
  File "/home/jiayi/anaconda3/envs/re/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2418, in __call__
    return self.batch_encode_plus(
  File "/home/jiayi/anaconda3/envs/re/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2609, in batch_encode_plus
    return self._batch_encode_plus(
  File "/home/jiayi/anaconda3/envs/re/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 409, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
TypeError: TextInputSequence must be str

After further debugging, this appears to be an issue on the HuggingFace side, as noted in the comment here. To work around it, I added use_fast=False to the AutoTokenizer.from_pretrained call on line 99 of run_re.py, as shown below.

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
    cache_dir=model_args.cache_dir,
    use_fast=False
)

Hope this will be helpful to anyone else encountering the same issue.
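
This error typically indicates that something passed to the fast tokenizer is not a Python str. As a complement to use_fast=False, a small check like the following can locate offending examples before tokenization. It is a sketch only: the function name and the (text_a, text_b) pair layout are assumptions, with text_b allowed to be None for single-sentence examples.

```python
from typing import Any, List, Sequence, Tuple


def find_non_string_inputs(pairs: Sequence[Tuple[Any, Any]]) -> List[int]:
    """Return indices of (text_a, text_b) pairs that contain a non-str text.

    text_b may be None, meaning the example has only one text.
    """
    bad = []
    for i, (text_a, text_b) in enumerate(pairs):
        a_ok = isinstance(text_a, str)
        b_ok = text_b is None or isinstance(text_b, str)
        if not (a_ok and b_ok):
            bad.append(i)
    return bad


# Invented examples: index 1 has a missing text, index 2 has a float.
examples = [("drug given", None), (None, None), ("dose", 3.0), ("a", "b")]
print(find_non_string_inputs(examples))
```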

BiLSTM-CRF NERLearner - Skipping batch of size=1

Hi there, I've been trying to better understand the BiLSTM-CRF NER model, more specifically the NERLearner class in bilstm_crf_ner/model/ner_learner.py.

To run the NER model, I ran the generate_data and build_data scripts and then moved on to the train and test scripts. However, I noticed that during training (and also when running test.py), the line 'Skipping batch of size=1' is logged many times because of the following snippet (present in both the train and test functions).

https://github.com/smitkiri/ehr-relation-extraction/blob/master/bilstm_crf_ner/model/ner_learner.py#L220-L222

if inputs['word_ids'].shape[0] == 1:
    self.logger.info('Skipping batch of size=1')
    continue

Every item in my training and evaluation sets is caught by this if-statement and never reaches the rest of the loop body. I tried removing this check for evaluation, and the model did produce predictions, but with poor accuracy. I suspect the skipped batches also affected training.

UPDATE: I realised this was because I had set the batch size to 1, which is not suited to this model. My two questions below still remain!

What is this code for, and is it safe to remove it for training and evaluation?

Also, how did you obtain the results shown in the BiLSTM-CRF README file? Is there a specific script that you ran to produce them?
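
The effect described in the update can be sketched as follows. This is an illustrative batching helper, not the repository's data generator: with a batch size of 1 every batch has exactly one item and would be skipped by the check, while with a larger batch size only a trailing batch can have size 1.

```python
from typing import List


def batch_sizes(n_items: int, batch_size: int) -> List[int]:
    """Sizes of consecutive batches when n_items are split into batch_size chunks."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    full, rest = divmod(n_items, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


def skipped_items(n_items: int, batch_size: int) -> int:
    """Number of items lost if every batch of size 1 is skipped."""
    return sum(size for size in batch_sizes(n_items, batch_size) if size == 1)


# With batch_size=1 all items are skipped; with 4, only a leftover single item.
print(skipped_items(10, 1), skipped_items(9, 4), skipped_items(8, 4))
```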

Documentation for End-to-End RE task

Is there any written documentation explaining the details of the end-to-end RE task?

How to get the n2c2 data in TSV format?

Hi,

I see in the code (this file) that to generate the data for RE, one needs to have the dataset as TSV files.

However, as far as I know, the n2c2 2018 dataset is provided as BRAT files (I asked for it).

Am I wrong, or is there a script that makes the conversion from BRAT to the TSV format you use?
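
For reference, BRAT standoff .ann files store entities on T lines and relations on R lines. The sketch below parses these two line types into dictionaries. It is not the repository's conversion code, and the example annotation is invented for illustration. Discontinuous spans are collapsed to their outermost offsets, which is a simplification.

```python
from typing import Dict, Iterable, Tuple


def parse_ann(lines: Iterable[str]) -> Tuple[Dict[str, dict], Dict[str, dict]]:
    """Parse BRAT entity (T) and relation (R) lines into two dictionaries."""
    entities: Dict[str, dict] = {}
    relations: Dict[str, dict] = {}
    for raw in lines:
        line = raw.rstrip("\n")
        fields = line.split("\t")
        if line.startswith("T") and len(fields) >= 3:
            label, offsets = fields[1].split(" ", 1)
            # "10 17;20 25" (discontinuous span) becomes [10, 17, 20, 25].
            numbers = [int(n) for n in offsets.replace(";", " ").split()]
            entities[fields[0]] = {
                "label": label,
                "start": numbers[0],
                "end": numbers[-1],
                "text": fields[2],
            }
        elif line.startswith("R") and len(fields) >= 2:
            rel_type, arg1, arg2 = fields[1].split()
            relations[fields[0]] = {
                "type": rel_type,
                "arg1": arg1.split(":", 1)[1],
                "arg2": arg2.split(":", 1)[1],
            }
    return entities, relations


# Invented annotation lines in BRAT standoff format.
sample = [
    "T1\tDrug 10 17\taspirin",
    "T2\tStrength 18 23\t81 mg",
    "R1\tStrength-Drug Arg1:T2 Arg2:T1",
]
print(parse_ann(sample))
```

From these dictionaries, each relation can be written out as one tab-separated row; the exact column layout the repository expects is not specified here.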

RE task is returning zero as the metrics value

The RE task emits the following warning: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use zero_division parameter to control this behavior.
This happens on the test set but not on the evaluation set.
The NER task works fine.
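
The warning means that for at least one label the gold data contains no true samples, so recall would be a division by zero. A quick check is to compute the per-label support of the gold labels. The sketch below uses only the standard library; the function name and the label values are made up for illustration.

```python
from collections import Counter
from typing import Dict, Hashable, Optional, Sequence


def per_label_recall(
    y_true: Sequence[Hashable],
    y_pred: Sequence[Hashable],
    labels: Sequence[Hashable],
) -> Dict[Hashable, Optional[float]]:
    """Recall per label; None marks labels with no true samples (undefined recall)."""
    support = Counter(y_true)
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {
        lab: (true_pos[lab] / support[lab] if support[lab] else None)
        for lab in labels
    }


# Invented data: label 1 never occurs in y_true, so its recall is undefined.
print(per_label_recall([0, 0, 0], [0, 1, 0], labels=[0, 1]))
```

If a label is missing from the test labels entirely, the test file itself (for example an unlabelled test split) is worth inspecting before the model.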

Train BioBert_ner error, while running run_ner.py (Colab)

I have installed all the packages listed in requirements.txt on Colab, including torch==1.6.0, torchvision==0.7.0 and transformers==3.0.2.

When I run the command:
! make train-biobert-ner

I get the following error:

cd biobert_ner/ && \
python run_ner.py \
    --data_dir ./dataset/ \
    --labels ./dataset/labels.txt \
    --model_name_or_path dmis-lab/biobert-large-cased-v1.1 \
    --output_dir ./output/ \
    --max_seq_length 128 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --save_steps 4000 \
    --seed 0 \
    --do_train \
    --do_eval \
    --do_predict \
    --overwrite_output_dir
01/31/2022 11:45:17 - INFO - transformers.training_args - PyTorch: setting up devices
01/31/2022 11:45:17 - WARNING - __main__ - Process rank: -1, device: cuda:0, n_gpu: 1, distributed training: False, 16-bits training: False
01/31/2022 11:45:17 - INFO - __main__ - Training/evaluation parameters TrainingArguments(output_dir='./output/', overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=True, evaluate_during_training=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, learning_rate=5e-05, weight_decay=0.0, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Jan31_11-45-17_5c9d189f141d', logging_first_step=False, logging_steps=500, save_steps=4000, save_total_limit=None, no_cuda=False, seed=0, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=1000, past_index=-1)
01/31/2022 11:45:17 - INFO - transformers.configuration_utils - loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/dmis-lab/biobert-large-cased-v1.1/config.json from cache at /root/.cache/torch/transformers/3493610bf2342adb1bf68e2a34c59b725a710eb59df1883605e40ae7e95bf9e4.5b7a692f7cc36e826065fed1096ab38064bca502b90349c26fb1b70aae2defb6
01/31/2022 11:45:17 - INFO - transformers.configuration_utils - Model config BertConfig {
"attention_probs_dropout_prob": 0.1,
"gradient_checkpointing": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 1024,
"id2label": {
"0": "B-DRUG",
"1": "I-DRUG",
"2": "B-STR",
"3": "I-STR",
"4": "B-DUR",
"5": "I-DUR",
"6": "B-ROU",
"7": "I-ROU",
"8": "B-FOR",
"9": "I-FOR",
"10": "B-ADE",
"11": "I-ADE",
"12": "B-DOS",
"13": "I-DOS",
"14": "B-REA",
"15": "I-REA",
"16": "B-FRE",
"17": "I-FRE",
"18": "O"
},
"initializer_range": 0.02,
"intermediate_size": 4096,
"label2id": {
"B-ADE": 10,
"B-DOS": 12,
"B-DRUG": 0,
"B-DUR": 4,
"B-FOR": 8,
"B-FRE": 16,
"B-REA": 14,
"B-ROU": 6,
"B-STR": 2,
"I-ADE": 11,
"I-DOS": 13,
"I-DRUG": 1,
"I-DUR": 5,
"I-FOR": 9,
"I-FRE": 17,
"I-REA": 15,
"I-ROU": 7,
"I-STR": 3,
"O": 18
},
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 16,
"num_hidden_layers": 24,
"pad_token_id": 0,
"type_vocab_size": 2,
"vocab_size": 58996
}

01/31/2022 11:45:17 - INFO - transformers.configuration_utils - loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/dmis-lab/biobert-large-cased-v1.1/config.json from cache at /root/.cache/torch/transformers/3493610bf2342adb1bf68e2a34c59b725a710eb59df1883605e40ae7e95bf9e4.5b7a692f7cc36e826065fed1096ab38064bca502b90349c26fb1b70aae2defb6
01/31/2022 11:45:17 - INFO - transformers.configuration_utils - Model config BertConfig {
"attention_probs_dropout_prob": 0.1,
"gradient_checkpointing": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 1024,
"initializer_range": 0.02,
"intermediate_size": 4096,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 16,
"num_hidden_layers": 24,
"pad_token_id": 0,
"type_vocab_size": 2,
"vocab_size": 58996
}

01/31/2022 11:45:17 - INFO - transformers.tokenization_utils_base - Model name 'dmis-lab/biobert-large-cased-v1.1' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, TurkuNLP/bert-base-finnish-cased-v1, TurkuNLP/bert-base-finnish-uncased-v1, wietsedv/bert-base-dutch-cased). Assuming 'dmis-lab/biobert-large-cased-v1.1' is a path, a model identifier, or url to a directory containing tokenizer files.
01/31/2022 11:45:18 - INFO - transformers.tokenization_utils_base - loading file https://s3.amazonaws.com/models.huggingface.co/bert/dmis-lab/biobert-large-cased-v1.1/vocab.txt from cache at /root/.cache/torch/transformers/701732fae654e0c36bf4554c7758f748495aa3427b4084607df605f2049a89a0.b2d452d8aee26fe2e337e17013b48f3d5a81bb300c38986450d4022986348bdd
01/31/2022 11:45:18 - INFO - transformers.tokenization_utils_base - loading file https://s3.amazonaws.com/models.huggingface.co/bert/dmis-lab/biobert-large-cased-v1.1/added_tokens.json from cache at None
01/31/2022 11:45:18 - INFO - transformers.tokenization_utils_base - loading file https://s3.amazonaws.com/models.huggingface.co/bert/dmis-lab/biobert-large-cased-v1.1/special_tokens_map.json from cache at None
01/31/2022 11:45:18 - INFO - transformers.tokenization_utils_base - loading file https://s3.amazonaws.com/models.huggingface.co/bert/dmis-lab/biobert-large-cased-v1.1/tokenizer_config.json from cache at None
01/31/2022 11:45:18 - INFO - transformers.tokenization_utils_base - loading file https://s3.amazonaws.com/models.huggingface.co/bert/dmis-lab/biobert-large-cased-v1.1/tokenizer.json from cache at None
01/31/2022 11:45:18 - INFO - transformers.modeling_utils - loading weights file https://cdn.huggingface.co/dmis-lab/biobert-large-cased-v1.1/pytorch_model.bin from cache at /root/.cache/torch/transformers/8c1699719a69e0d7cccc2c016217edb876ee6732c3aa2809e15a09c70e9bc22e.2c1d459b35b7f0b1938ff35bf6334bc60282ea79ea7cf7e9656e27f726ed07c6
01/31/2022 11:45:33 - WARNING - transformers.modeling_utils - Some weights of the model checkpoint at dmis-lab/biobert-large-cased-v1.1 were not used when initializing BertForTokenClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']

This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
01/31/2022 11:45:33 - WARNING - transformers.modeling_utils - Some weights of BertForTokenClassification were not initialized from the model checkpoint at dmis-lab/biobert-large-cased-v1.1 and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
01/31/2022 11:45:33 - INFO - filelock - Lock 140349564664784 acquired on ./dataset/cached_train_dev_BertTokenizer_128.lock
01/31/2022 11:45:33 - INFO - utils_ner - Loading features from cached file ./dataset/cached_train_dev_BertTokenizer_128
01/31/2022 11:45:33 - INFO - filelock - Lock 140349564664784 released on ./dataset/cached_train_dev_BertTokenizer_128.lock
01/31/2022 11:45:33 - INFO - filelock - Lock 140349564665488 acquired on ./dataset/cached_devel_BertTokenizer_128.lock
01/31/2022 11:45:33 - INFO - utils_ner - Loading features from cached file ./dataset/cached_devel_BertTokenizer_128
01/31/2022 11:45:33 - INFO - filelock - Lock 140349564665488 released on ./dataset/cached_devel_BertTokenizer_128.lock
01/31/2022 11:45:35 - INFO - transformers.trainer - You are instantiating a Trainer but W&B is not installed. To use wandb logging, run pip install wandb; wandb login see https://docs.wandb.com/huggingface.
Traceback (most recent call last):
  File "run_ner.py", line 284, in <module>
    main()
  File "run_ner.py", line 206, in main
    model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 384, in train
    train_dataloader = self.get_train_dataloader()
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 240, in get_train_dataloader
    if self.args.local_rank == -1
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/sampler.py", line 96, in __init__
    "value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0
Makefile:53: recipe for target 'train-biobert-ner' failed
make: *** [train-biobert-ner] Error 1
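
num_samples=0 means the training dataset contains no examples. Since the log shows features being loaded from a cached file, one thing to check (an assumption, not a confirmed diagnosis) is whether the source file in ./dataset is empty, or whether a stale cached_* file was created from an empty dataset. The sketch below counts examples in a CoNLL-style file where blank lines separate sentences; the function name is hypothetical.

```python
from typing import Iterable


def count_examples(lines: Iterable[str]) -> int:
    """Count blank-line-separated examples in CoNLL-style lines."""
    count = 0
    in_example = False
    for line in lines:
        is_content = bool(line.strip()) and not line.startswith("-DOCSTART-")
        if is_content and not in_example:
            # First token line after a separator starts a new example.
            count += 1
        in_example = is_content
    return count


# Invented data: two sentences separated by a blank line.
print(count_examples(["a O\n", "b O\n", "\n", "c B-DRUG\n"]))
```

To apply it to a real file, pass an open file object, for example the train_dev.txt file in the data directory. If the count is positive, deleting the cached_* files so that features are rebuilt is a reasonable next step.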

Getting zero results on running evaluation script

Hi there, I ran Track2-evaluate-ver4.py on the gold standard test data and the test data (track 2) from n2c2, and I am getting zero scores on relations for every entity type. The output of the run is below.

****************************** TRACK 2 *******************************
                    ------- strict -------     ------ lenient -------
                    Prec.    Rec.     F(b=1)   Prec.    Rec.     F(b=1)
Drug                0.9990   0.9997   0.9993   0.9990   0.9997   0.9993
Strength            1.0000   1.0000   1.0000   1.0000   1.0000   1.0000
Duration            0.9947   1.0000   0.9974   0.9947   1.0000   0.9974
Route               1.0000   1.0000   1.0000   1.0000   1.0000   1.0000
Form                0.9998   0.9995   0.9997   1.0000   0.9998   0.9999
Ade                 0.9968   0.9968   0.9968   0.9968   0.9968   0.9968
Dosage              0.9996   0.9996   0.9996   0.9996   0.9996   0.9996
Reason              0.9945   1.0000   0.9973   0.9945   1.0000   0.9973
Frequency           0.9988   0.9995   0.9991   0.9995   1.0000   0.9998
                    ------------------------------------------------
Overall (micro)     0.9989   0.9997   0.9993   0.9990   0.9998   0.9994
Overall (macro)     0.9990   0.9997   0.9993   0.9991   0.9997   0.9994

***************************** RELATIONS ******************************
Strength -> Drug    0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
Dosage -> Drug      0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
Duration -> Drug    0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
Frequency -> Drug   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
Form -> Drug        0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
Route -> Drug       0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
Reason -> Drug      0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
ADE -> Drug         0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
                    ------------------------------------------------
Overall (micro)     0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
Overall (macro)     0.0000   0.0000   0.0000   0.0000   0.0000   0.0000

This is the output I get, and I cannot tell what is wrong. Also, when I ran run_re.py, I got the ZeroDivision error mentioned here, and I wonder whether the two are related and are affecting the training of the model.
I would really appreciate any help, or a pointer to what I am doing wrong.

`RuntimeError: CUDA out of memory` when training BioBERT NER model

I have tried to run the run_ner.py script on my local laptop and I had the following error:

Iteration:   0%|                                                                                                                             | 0/189 [00:00<?, ?it/s]
Epoch:   0%|                                                                                                                               | 0/5 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "run_ner.py", line 284, in <module>
    main()
  File "run_ner.py", line 206, in main
    model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
  File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/transformers/trainer.py", line 499, in train
    tr_loss += self._training_step(model, inputs, optimizer)
  File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/transformers/trainer.py", line 622, in _training_step
    outputs = model(**inputs)
  File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/transformers/modeling_bert.py", line 1446, in forward
    output_hidden_states=output_hidden_states,
  File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/transformers/modeling_bert.py", line 762, in forward
    output_hidden_states=output_hidden_states,
  File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/transformers/modeling_bert.py", line 439, in forward
    output_attentions,
  File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/transformers/modeling_bert.py", line 371, in forward
    hidden_states, attention_mask, head_mask, output_attentions=output_attentions,
  File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/transformers/modeling_bert.py", line 315, in forward
    hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, output_attentions,
  File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/transformers/modeling_bert.py", line 256, in forward
    context_layer = torch.matmul(attention_probs, value_layer)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 5.80 GiB total capacity; 4.70 GiB already allocated; 18.19 MiB free; 4.85 GiB reserved in total by PyTorch)

Is there any easy workaround for that? Is it compulsory to run that on a cloud GPU?
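
A common workaround is to lower --per_device_train_batch_size and raise --gradient_accumulation_steps so that the effective batch size stays the same. The arithmetic is sketched below; the helper name is hypothetical. Whether a large BERT model then fits in about 6 GiB is not guaranteed, so a shorter --max_seq_length or a smaller base model may also be needed.

```python
def accumulation_steps(target_batch: int, per_device_batch: int, n_devices: int = 1) -> int:
    """Gradient accumulation steps needed to reach target_batch.

    effective batch = per_device_batch * n_devices * accumulation steps
    """
    per_step = per_device_batch * n_devices
    if per_step < 1 or target_batch % per_step != 0:
        raise ValueError("target_batch must be a positive multiple of per_device_batch * n_devices")
    return target_batch // per_step


# Keeping an effective batch of 16 with only 2 samples per step on one GPU.
print(accumulation_steps(target_batch=16, per_device_batch=2))
```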

How to run the API + frontend locally?

Hi,

I see there are instructions to deploy the model on GCP, but there's nothing to deploy it on local computers.

Could you give me a hint?

Thanks.

NER model is not predicting any labels, even on the test set during training

Hi, thank you for this repo. Unfortunately I ran into some issues, and because there is no error message, I am unable to debug them.

When I try to generate predictions, I get blank output (the model isn't predicting anything).
I realised that during training it also failed to predict any entities, but there was no error message, just warnings.

I used the following command to prepare the dataset:

!python generate_data.py --task ner \
    --input_dir ./data/ \
    --ade_dir ./ade_corpus/ \
    --target_dir biobert_ner/dataset/ \
    --max_seq_len 512 --dev_split 0.1 \
    --tokenizer biobert-base \
    --ext txt --sep " "

Log:

Reading data

Train data:
Progress: [====================] 303/303

Test Data:
Progress: [==================> ] 191/202
/Users/lsolis/Documents/GitHub/PubMed_pipeline/ehr-relation-extraction/ehr.py:186: UserWarning: Invalid annotation encountered: ['12 hours'], File: ./data/test/106967.ann
  warnings.warn(msg)
Progress: [====================] 202/202

ADE data: Done


Data successfully saved in biobert_ner/dataset/train.txt
Variable successfully saved in biobert_ner/dataset/train.pkl
Data successfully saved in biobert_ner/dataset/train_dev.txt
Variable successfully saved in biobert_ner/dataset/train_dev.pkl
Data successfully saved in biobert_ner/dataset/devel.txt
Variable successfully saved in biobert_ner/dataset/devel.pkl
Data successfully saved in biobert_ner/dataset/test.txt
Variable successfully saved in biobert_ner/dataset/test.pkl

Generating files successful. Files generated: train.txt, train.pkl, train_dev.txt, train_dev.pkl, devel.txt, devel.pkl, test.txt, test.pkl, labels.txt

Then I ran run_ner.py:

cd biobert_ner
export SAVE_DIR=./output1
export DATA_DIR=./dataset

export MAX_LENGTH=128
export BATCH_SIZE=16
export NUM_EPOCHS=5
export SAVE_STEPS=1000
export SEED=0

python run_ner.py --data_dir ${DATA_DIR}/ --labels ${DATA_DIR}/labels.txt --model_name_or_path dmis-lab/biobert-large-cased-v1.1 --output_dir ${SAVE_DIR}/ --max_seq_length ${MAX_LENGTH} --num_train_epochs ${NUM_EPOCHS} --per_device_train_batch_size ${BATCH_SIZE} --save_steps ${SAVE_STEPS} --seed ${SEED} --do_train --do_eval --do_predict --overwrite_output_dir

At the very last stage, output was printed to the terminal. Here is a fraction of it:

05/10/2022 04:38:49 - WARNING - __main__ -   Example 1966, Example: 8 O

05/10/2022 04:38:49 - WARNING - __main__ -   Example 1966, Example: . O

05/10/2022 04:38:49 - WARNING - __main__ -   Example 1966, Example: per B-DRUG

05/10/2022 04:38:49 - WARNING - __main__ -   Example 1966, Example: 5 B-STR

Content of test_results.txt:

eval_loss = 1.8185148617116416
eval_precision = 0.0
eval_recall = 0.0
eval_f1 = 0.0

And here is a fraction of test_predictions.txt. Every prediction is "O"; not a single other label is predicted.

##9 O
##am O
##b O
##c O
##b O
##c O
##g O
##b O
##ct O
##c O
##v O
##ch O
##ch O
##c O
##d O
##w O
##lt O
##t O
##9 O
##am O
##uts O
##ymph O
##s O
##os O
##os O
##as O
##o O
##y O
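
One thing worth checking, offered as an assumption rather than a diagnosis: the data above was generated with --max_seq_len 512 while run_ner.py was run with a maximum length of 128, so long examples may be truncated. The sketch below summarises a CoNLL-style file by counting examples that exceed a token budget and tallying the gold labels, which also shows whether non-"O" labels are present at all. The function name is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize_conll(lines: Iterable[str], max_tokens: int) -> Dict[str, object]:
    """Count examples, over-long examples, and label frequencies.

    Each non-blank line is expected to end with its label; blank lines
    separate examples.
    """
    label_counts: Counter = Counter()
    lengths = []
    current = 0
    for line in lines:
        parts = line.split()
        if parts:
            label_counts[parts[-1]] += 1
            current += 1
        elif current:
            lengths.append(current)
            current = 0
    if current:
        lengths.append(current)
    return {
        "examples": len(lengths),
        "too_long": sum(1 for n in lengths if n > max_tokens),
        "labels": dict(label_counts),
    }


# Invented data: two examples, the second longer than the budget of 2 tokens.
print(summarize_conll(["a O", "b B-DRUG", "", "c O", "d O", "e O", ""], max_tokens=2))
```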

ade_corpus format

Hi,
All the ade_corpus files are in txt (rel) or csv format. Where can I find the JSON format expected by generate_data? Thanks.

BiLSTM-CRF code not working

Hi, I completed the build_data step, but when running train.py the ner_learner throws the following error:

Traceback (most recent call last):
  File "/data/storage_hpc_nishant/ade_bench/bilstm_crf/train.py", line 185, in <module>
    main()
  File "/data/storage_hpc_nishant/ade_bench/bilstm_crf/train.py", line 181, in main
    learn.fit(train, dev)
  File "/data/storage_hpc_nishant/ade_bench/bilstm_crf/model/ner_learner.py", line 186, in fit
    self.train(epoch, nbatches_train, train_generator, fine_tune=fine_tune)
  File "/data/storage_hpc_nishant/ade_bench/bilstm_crf/model/ner_learner.py", line 217, in train
    for batch_idx, (inputs, targets, sequence_lengths) in enumerate(train_generator):
  File "/data/storage_hpc_nishant/ade_bench/bilstm_crf/model/ner_learner.py", line 139, in data_generator
    "word_ids": np.asarray(word_ids)
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (5,) + inhomogeneous part.

Any pointers to what the reason could be?
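
This ValueError is what NumPy 1.24 and later raise when np.asarray receives sequences of different lengths; earlier releases only emitted a deprecation warning and built an object array. A possible workaround, sketched under the assumption that word_ids is a list of variable-length integer lists, is to pad the batch to a common length before conversion. The function name and pad value are hypothetical, and whether padding here is compatible with the rest of the model code is not verified. Pinning an older NumPy release is another option.

```python
from typing import List, Sequence


def pad_sequences(seqs: Sequence[Sequence[int]], pad_value: int = 0) -> List[List[int]]:
    """Right-pad every sequence with pad_value to the length of the longest one."""
    max_len = max((len(s) for s in seqs), default=0)
    return [list(s) + [pad_value] * (max_len - len(s)) for s in seqs]


# A ragged batch becomes rectangular and can then be converted to an array.
print(pad_sequences([[1, 2, 3], [4]]))
```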

Pretrained models

Hi, hope you're fine

Do you plan to release pre-trained models?

Thanks

empty data files generated for the RE task

Hi,
I'm facing an issue where every time I generate data for the RE task I get empty files, even though the script runs successfully.

I'm running this command:

python generate_data.py \
--task re \
--input_dir dataset_split \
--target_dir datasetRE/ \
--max_seq_len 512 \
--dev_split 0.1 \
--tokenizer biobert-base \
--ext tsv \
--sep tab

ADE Dataset on NER Task

I'm curious how you dealt with de-identified data. Did you leave the placeholders as they are, or replace them with something? The [Hospital6 29] notation is one example.

List index out of range while running re

WARNING:__main__:Process rank: -1, device: cpu, n_gpu: 0, distributed training: False, 16-bits training: False
Traceback (most recent call last):
  File "run_re.py", line 230, in <module>
    main()
  File "run_re.py", line 103, in main
    REDataset(data_args, tokenizer=tokenizer, cache_dir=model_args.cache_dir) if training_args.do_train else None
  File "/content/ehr-relation-extraction/biobert_re/utils_re.py", line 129, in __init__
    examples = self.processor.get_train_examples(args.data_dir)
  File "/content/ehr-relation-extraction/biobert_re/data_processor.py", line 116, in get_train_examples
    return self._create_examples(self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
  File "/content/ehr-relation-extraction/biobert_re/data_processor.py", line 139, in _create_examples
    label = None if set_type == "test" else line[1]
IndexError: list index out of range
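
The IndexError on line[1] suggests that at least one row of train.tsv has fewer than two tab-separated columns, for example an empty line or a file written with a different separator. The sketch below reports such rows using only the csv module; the function name is hypothetical.

```python
import csv
import io
from typing import List


def short_rows(tsv_text: str, min_cols: int = 2) -> List[int]:
    """Return 1-based row numbers that have fewer than min_cols columns."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [i for i, row in enumerate(reader, start=1) if len(row) < min_cols]


# Invented content: row 2 is empty and row 3 has a single column.
print(short_rows("sentence one\t1\n\nonly_one_column\nsentence two\t0\n"))
```

For a real file, read its contents with open(path, encoding="utf-8").read() and pass the text to the function.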

How to use the relation extraction model for inference?

Everything is in the title.

After running run_re.py, how can the model be used for RE predictions on new data?

Does the new data need to have its entities annotated?

IndexError: list index out of range

Hi Smit,

I trained the NER model using your repo plus some custom data. When I run predictions on the custom dataset, it works fine on most files, but about 20% of the files produce the following error:

Prediction: 100%
1/1 [00:00<00:00, 4.22it/s]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-33-ad82028bb83c> in <module>()
----> 1 ner_predictions = get_ner_predictions(ehr_record=text, model_name="biobert")
      2 text_ner = ner_predictions.get_entities()

2 frames
/content/ehr.py in get_char_idx(self, token_idx)
    319             raise AttributeError("Tokenizer not set.")
    320 
--> 321         char_idx = self.token_to_char_map[token_idx]
    322 
    323         return char_idx

IndexError: list index out of range

Would you know where the issue might be, please?
Thank you!
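
The traceback shows that token_idx is not a valid index into token_to_char_map, so for the failing files the predicted token positions and the record's token map disagree in length. The root cause is not visible from the traceback alone. As a debugging aid only, a bounds-checked lookup like the following reports the mismatch explicitly; the helper is hypothetical and does not fix the underlying problem.

```python
from typing import Sequence, Tuple


def safe_char_idx(token_to_char_map: Sequence[Tuple[int, int]], token_idx: int) -> Tuple[int, int]:
    """Look up a token's character span, with an explicit message when out of range."""
    n_tokens = len(token_to_char_map)
    if not 0 <= token_idx < n_tokens:
        raise IndexError(
            f"token_idx {token_idx} is outside the token map of length {n_tokens}"
        )
    return token_to_char_map[token_idx]


# Invented map with two tokens.
print(safe_char_idx([(0, 4), (5, 6)], 1))
```

Logging the record length and the largest predicted token index for each failing file may show whether the affected records are the longest ones.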

Rename the variable "words" in `read_examples_from_file()` and `InputExample` in utils_ner.py

The current implementation of biobert_ner.utils_ner.read_examples_from_file() uses the variable name words to store all the tokens in a document. This is misleading, especially in the case of word-piece tokenizers like BERT where individual words can be split into multiple tokens. A better variable name would be tokens since that is precisely what we are reading.

The variable name should be changed in read_examples_from_file() and also in the InputExample class.
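
A sketch of what the renamed class could look like is shown below. The guid and labels fields are assumptions about the current class layout, so treat this as an illustration of the naming only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InputExample:
    """A single example for token classification (field layout assumed)."""

    guid: str
    tokens: List[str]  # renamed from "words": entries may be word pieces
    labels: Optional[List[str]] = None


# Word-piece tokens of an invented phrase, with one label per token.
example = InputExample(guid="train-1", tokens=["war", "##farin", "5", "mg"],
                       labels=["B-DRUG", "I-DRUG", "B-STR", "I-STR"])
print(len(example.tokens) == len(example.labels))
```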
