smitkiri / ehr-relation-extraction
NER and Relation Extraction from Electronic Health Records (EHR).
Home Page: http://ehr-info.ml
License: MIT License
When I run the RE training, I get this error:
FileNotFoundError: [Errno 2] No such file or directory: './dataset/cached_train_BertTokenizer_128_ehr-re.lock'
How can I solve it?
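One possible cause, offered as an assumption rather than a confirmed diagnosis: the lock file is written next to the cached features, so the error can appear when the ./dataset directory does not exist relative to the directory the script is started from. A minimal sketch that creates the parent directory first (ensure_parent_dir is a hypothetical helper, not part of the repository):

```python
import os


def ensure_parent_dir(path):
    """Create the directory that should contain `path` if it is missing.

    Hypothetical helper for illustration; returns the absolute parent path.
    """
    parent = os.path.dirname(os.path.abspath(path))
    os.makedirs(parent, exist_ok=True)
    return parent


# Example call (path copied from the error message above):
# ensure_parent_dir("./dataset/cached_train_BertTokenizer_128_ehr-re.lock")
```

If the directory exists but is empty, the training files presumably still need to be produced with generate_data.py before training can start.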
Is the dataset available for download?
Hi, thank you so much for your repository; it has been extremely helpful for my research work.
I would like to highlight an issue that occurs when running generate_data.py with the "default" tokenizer (it does not occur with the scispacy tokenizer, which is the current default).
The following error is encountered:
$ python3 -m app.training.generate_data
Reading data
Train data:
Progress: [====================] 303/303
Test Data:
Progress: [==========> ] 103/202
/home/jiayi/adverse_drug_event_extraction/app/training/ehr.py:179: UserWarning: Invalid annotation encountered: ['12 hours']
warnings.warn("Invalid annotation encountered: " + str(line))
Progress: [==============> ] 145/202
Traceback (most recent call last):
File "/home/jiayi/anaconda3/envs/training/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/jiayi/anaconda3/envs/training/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/jiayi/adverse_drug_event_extraction/app/training/generate_data.py", line 214, in <module>
main()
File "/home/jiayi/adverse_drug_event_extraction/app/training/generate_data.py", line 175, in main
train_dev, test = read_data(data_dir=args.input_dir,
File "/home/jiayi/adverse_drug_event_extraction/app/training/utils.py", line 293, in read_data
record = HealthRecord(fid, text_path=os.path.join(test_path, fid + '.txt'),
File "/home/jiayi/adverse_drug_event_extraction/app/training/ehr.py", line 62, in __init__
self.set_tokenizer(tokenizer)
File "/home/jiayi/adverse_drug_event_extraction/app/training/ehr.py", line 269, in set_tokenizer
self._compute_tokens()
File "/home/jiayi/adverse_drug_event_extraction/app/training/ehr.py", line 226, in _compute_tokens
raise Exception("Error computing token to char map.")
Exception: Error computing token to char map.
After some debugging I have noticed that this is due to the 2 "###" instances in the test set for the 109724.txt file. After removing the 2 instances observed, everything was running fine. Not too sure why this is happening but thought it would be interesting to point it out for future use. Additionally, after training the model the model predictions were unexpected as well. Output was supposed to be [('ROU', 1, 1), ('DUR', 2, 2), ('DOS', 3, 3), ('ADE', 4, 4), ('FRE', 5, 6)] but it returned [('inpatient', 0, 0), ('Pseudoaneurysm', 1, 1), ('Fevers', 2, 4), ('QTc', 5, 6)].
I ended up adopting the scispacy tokenizer which works perfectly. Cheers!
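For anyone who wants to check their own copy of the data before running generate_data.py, here is a minimal sketch that locates every "###" run in a piece of text. It only reports positions; whether "###" is really what breaks the token-to-character map is the poster's observation, not something verified here. find_marker is a hypothetical helper name.

```python
def find_marker(text, marker="###"):
    """Return (line_number, column) pairs for every occurrence of `marker`.

    Line numbers start at 1, columns at 0. Occurrences do not overlap.
    """
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        col = line.find(marker)
        while col != -1:
            hits.append((line_no, col))
            # Continue searching after the end of the current match.
            col = line.find(marker, col + len(marker))
    return hits
```

Reading a record such as 109724.txt into a string and passing it to find_marker should list the two positions mentioned above, assuming the file matches the poster's copy.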
When I use the run_ner.py and run_re.py scripts, the F1 score and precision are zero for both the NER model and the RE model. I think the problem lies in the format of the training data generated by generate_data, right? How did you overcome that?
Hi,
I used the generate_data.py file to generate the training dataset for the RE task, but the output contains only .pkl and .txt files and no .tsv files, which I guess are necessary for training the model.
Any idea on how to fix this?
The NER model works fine when I run the NER script with the --do_train and --do_predict arguments.
However, when I only want the prediction results and run the script with just --do_predict (without --do_train) using the trained model, I get completely wrong results.
Is there something small I am missing?
Hi there,
I was running run_re.py for biobert_re when I encountered the following error:
01/21/2022 16:21:02 - WARNING - __main__ - Process rank: -1, device: cpu, n_gpu: 0, distributed training: False, 16-bits training: False
01/21/2022 16:21:16 - INFO - utils_re - Creating features from dataset file at /data/jiayi/n2c2/dataset_re
Traceback (most recent call last):
File "run_re.py", line 230, in <module>
main()
File "run_re.py", line 103, in main
REDataset(data_args, tokenizer=tokenizer, cache_dir=model_args.cache_dir) if training_args.do_train else None
File "/home/jiayi/adverse_drug_event_extraction/app/bilstm_crf_re/utils_re.py", line 132, in __init__
self.features = glue_convert_examples_to_features(
File "/home/jiayi/adverse_drug_event_extraction/app/bilstm_crf_re/data_processor.py", line 40, in glue_convert_examples_to_features
return _glue_convert_examples_to_features(
File "/home/jiayi/adverse_drug_event_extraction/app/bilstm_crf_re/data_processor.py", line 74, in _glue_convert_examples_to_features
batch_encoding = tokenizer(
File "/home/jiayi/anaconda3/envs/re/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2418, in __call__
return self.batch_encode_plus(
File "/home/jiayi/anaconda3/envs/re/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2609, in batch_encode_plus
return self._batch_encode_plus(
File "/home/jiayi/anaconda3/envs/re/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 409, in _batch_encode_plus
encodings = self._tokenizer.encode_batch(
TypeError: TextInputSequence must be str
Upon further debugging, this seems to be an issue in HuggingFace, as seen in the comment here. To fix the issue, I simply added a use_fast=False parameter at line 99 of run_re.py, as seen below.
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
    cache_dir=model_args.cache_dir,
    use_fast=False
)
Hope this will be helpful to anyone else encountering the same issue.
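As a complementary check, this TypeError is also commonly reported when a fast tokenizer receives a value that is not a Python str, for example None or a float NaN coming from an empty cell in the input file. Whether that applies to this dataset is an assumption. A minimal sketch that lists such values before they reach the tokenizer (find_non_string_inputs is a hypothetical helper):

```python
def find_non_string_inputs(texts):
    """Return (index, type_name) for every item in `texts` that is not a str."""
    return [
        (index, type(value).__name__)
        for index, value in enumerate(texts)
        if not isinstance(value, str)
    ]
```

An empty result suggests the inputs themselves are fine, in which case switching to the slow tokenizer as described above remains the reported fix.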
Hi there, I've been trying to better understand the BiLSTM-CRF NER model, more specifically the NERLearner class in bilstm_crf_ner/model/ner_learner.py.
To run the NER model, I ran the generate_data and build_data scripts, and then moved on to the train and test scripts. However, I noticed that when training (and also when running test.py), the line 'Skipping batch of size=1' is logged many times because of the following snippet of code (in both the train and test functions).
if inputs['word_ids'].shape[0] == 1:
    self.logger.info('Skipping batch of size=1')
    continue
Every item in my training and evaluation sets is caught by this if-statement and never reaches the rest of the code. I tried removing this check for evaluation, and the model then produced some predictions, though not with great accuracy; I suspect the skipped batches also affect model performance during training.
UPDATE: I realised this was because the batch size was set to 1, which is not suited to this model. My two questions below still remain!
What is this code for, and is it okay to remove it for training and evaluation?
Also, how did you derive the results shown in the BiLSTM-CRF README file? Is there a specific script that you ran to obtain them?
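To make the effect described in the update concrete, here is a small sketch that counts how many batches a size-1 check like the one above would skip. It assumes sequential batching where the last batch keeps the remainder, which may differ from the repository's actual data generator; both function names are hypothetical.

```python
def batch_sizes(n_items, batch_size):
    """Sizes of consecutive batches when the last batch keeps the remainder."""
    full, rest = divmod(n_items, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


def count_skipped(n_items, batch_size):
    """Number of batches a `shape[0] == 1` check would skip."""
    return sum(1 for size in batch_sizes(n_items, batch_size) if size == 1)
```

With a batch size of 1, every batch is skipped (count_skipped(5, 1) is 5). With a larger batch size, at most the final leftover batch is skipped (count_skipped(5, 2) is 1).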
Is there any written documentation explaining the details of the end-to-end RE task?
Hi,
I see in the code (this file) that to generate the data for RE, one needs to have the dataset as TSV files.
However, as far as I know, the n2c2 2018 dataset is provided as BRAT files (I asked for it).
Am I wrong, or is there a script that makes the conversion from BRAT to the TSV format you use?
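For reference, a minimal sketch of reading entities and relations from a BRAT standoff .ann file. It follows the general BRAT layout (tab-separated ID, type with offsets or arguments, then the covered text) and does not reproduce the TSV layout this repository expects, which is not described here. parse_brat is a hypothetical helper, and the sample annotation in the usage note is made up for illustration.

```python
def parse_brat(ann_text):
    """Parse BRAT standoff text into (entities, relations).

    entities:  {id: (label, [(start, end), ...], covered_text)}
    relations: [(label, arg1_id, arg2_id), ...]
    Lines that do not look like entity or relation annotations are skipped.
    """
    entities, relations = {}, []
    for line in ann_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # no tab: not a well-formed annotation line
        ann_id, body = fields[0], fields[1]
        if ann_id.startswith("T") and len(fields) >= 3:
            label, spans = body.split(" ", 1)
            # Discontinuous entities separate their spans with ';'.
            offsets = [tuple(int(x) for x in span.split()) for span in spans.split(";")]
            entities[ann_id] = (label, offsets, fields[2])
        elif ann_id.startswith("R"):
            parts = body.split()
            if len(parts) == 3:
                label, arg1, arg2 = parts
                relations.append((label, arg1.split(":", 1)[1], arg2.split(":", 1)[1]))
    return entities, relations
```

For example, parsing the made-up text "T1\tDrug 10 17\taspirin\nR1\tStrength-Drug Arg1:T2 Arg2:T1" yields one Drug entity and one Strength-Drug relation pointing from T2 to T1.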
The RE task gives this warning: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use zero_division parameter to control this behavior.
This is observed for the test set and not for the evaluation set.
The NER task works completely fine.
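The warning text says that some label has no true samples in the set being scored. A minimal sketch to find which labels those are, given the gold and predicted label lists (how to obtain those lists from run_re.py is not shown and would depend on the script; labels_without_true_samples is a hypothetical helper):

```python
def labels_without_true_samples(y_true, y_pred):
    """Labels that were predicted at least once but never occur in y_true."""
    return sorted(set(y_pred) - set(y_true))
```

If this returns a non-empty list for the test set, it may be worth checking whether the test file actually carries gold labels or only placeholder values.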
I have installed all the packages listed in the requirements.txt file on Colab, including torch==1.6.0, torchvision==0.7.0, and transformers==3.0.2.
When I run the command:
! make train-biobert-ner
I get the following error:
cd biobert_ner/ &&
python run_ner.py
--data_dir ./dataset/
--labels ./dataset/labels.txt
--model_name_or_path dmis-lab/biobert-large-cased-v1.1
--output_dir ./output/
--max_seq_length 128
--num_train_epochs 1
--per_device_train_batch_size 8
--save_steps 4000
--seed 0
--do_train
--do_eval
--do_predict
--overwrite_output_dir
01/31/2022 11:45:17 - INFO - transformers.training_args - PyTorch: setting up devices
01/31/2022 11:45:17 - WARNING - __main__ - Process rank: -1, device: cuda:0, n_gpu: 1, distributed training: False, 16-bits training: False
01/31/2022 11:45:17 - INFO - __main__ - Training/evaluation parameters TrainingArguments(output_dir='./output/', overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=True, evaluate_during_training=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, learning_rate=5e-05, weight_decay=0.0, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Jan31_11-45-17_5c9d189f141d', logging_first_step=False, logging_steps=500, save_steps=4000, save_total_limit=None, no_cuda=False, seed=0, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=1000, past_index=-1)
01/31/2022 11:45:17 - INFO - transformers.configuration_utils - loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/dmis-lab/biobert-large-cased-v1.1/config.json from cache at /root/.cache/torch/transformers/3493610bf2342adb1bf68e2a34c59b725a710eb59df1883605e40ae7e95bf9e4.5b7a692f7cc36e826065fed1096ab38064bca502b90349c26fb1b70aae2defb6
01/31/2022 11:45:17 - INFO - transformers.configuration_utils - Model config BertConfig {
"attention_probs_dropout_prob": 0.1,
"gradient_checkpointing": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 1024,
"id2label": {
"0": "B-DRUG",
"1": "I-DRUG",
"2": "B-STR",
"3": "I-STR",
"4": "B-DUR",
"5": "I-DUR",
"6": "B-ROU",
"7": "I-ROU",
"8": "B-FOR",
"9": "I-FOR",
"10": "B-ADE",
"11": "I-ADE",
"12": "B-DOS",
"13": "I-DOS",
"14": "B-REA",
"15": "I-REA",
"16": "B-FRE",
"17": "I-FRE",
"18": "O"
},
"initializer_range": 0.02,
"intermediate_size": 4096,
"label2id": {
"B-ADE": 10,
"B-DOS": 12,
"B-DRUG": 0,
"B-DUR": 4,
"B-FOR": 8,
"B-FRE": 16,
"B-REA": 14,
"B-ROU": 6,
"B-STR": 2,
"I-ADE": 11,
"I-DOS": 13,
"I-DRUG": 1,
"I-DUR": 5,
"I-FOR": 9,
"I-FRE": 17,
"I-REA": 15,
"I-ROU": 7,
"I-STR": 3,
"O": 18
},
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 16,
"num_hidden_layers": 24,
"pad_token_id": 0,
"type_vocab_size": 2,
"vocab_size": 58996
}
01/31/2022 11:45:17 - INFO - transformers.configuration_utils - loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/dmis-lab/biobert-large-cased-v1.1/config.json from cache at /root/.cache/torch/transformers/3493610bf2342adb1bf68e2a34c59b725a710eb59df1883605e40ae7e95bf9e4.5b7a692f7cc36e826065fed1096ab38064bca502b90349c26fb1b70aae2defb6
01/31/2022 11:45:17 - INFO - transformers.configuration_utils - Model config BertConfig {
"attention_probs_dropout_prob": 0.1,
"gradient_checkpointing": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 1024,
"initializer_range": 0.02,
"intermediate_size": 4096,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 16,
"num_hidden_layers": 24,
"pad_token_id": 0,
"type_vocab_size": 2,
"vocab_size": 58996
}
01/31/2022 11:45:17 - INFO - transformers.tokenization_utils_base - Model name 'dmis-lab/biobert-large-cased-v1.1' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, TurkuNLP/bert-base-finnish-cased-v1, TurkuNLP/bert-base-finnish-uncased-v1, wietsedv/bert-base-dutch-cased). Assuming 'dmis-lab/biobert-large-cased-v1.1' is a path, a model identifier, or url to a directory containing tokenizer files.
01/31/2022 11:45:18 - INFO - transformers.tokenization_utils_base - loading file https://s3.amazonaws.com/models.huggingface.co/bert/dmis-lab/biobert-large-cased-v1.1/vocab.txt from cache at /root/.cache/torch/transformers/701732fae654e0c36bf4554c7758f748495aa3427b4084607df605f2049a89a0.b2d452d8aee26fe2e337e17013b48f3d5a81bb300c38986450d4022986348bdd
01/31/2022 11:45:18 - INFO - transformers.tokenization_utils_base - loading file https://s3.amazonaws.com/models.huggingface.co/bert/dmis-lab/biobert-large-cased-v1.1/added_tokens.json from cache at None
01/31/2022 11:45:18 - INFO - transformers.tokenization_utils_base - loading file https://s3.amazonaws.com/models.huggingface.co/bert/dmis-lab/biobert-large-cased-v1.1/special_tokens_map.json from cache at None
01/31/2022 11:45:18 - INFO - transformers.tokenization_utils_base - loading file https://s3.amazonaws.com/models.huggingface.co/bert/dmis-lab/biobert-large-cased-v1.1/tokenizer_config.json from cache at None
01/31/2022 11:45:18 - INFO - transformers.tokenization_utils_base - loading file https://s3.amazonaws.com/models.huggingface.co/bert/dmis-lab/biobert-large-cased-v1.1/tokenizer.json from cache at None
01/31/2022 11:45:18 - INFO - transformers.modeling_utils - loading weights file https://cdn.huggingface.co/dmis-lab/biobert-large-cased-v1.1/pytorch_model.bin from cache at /root/.cache/torch/transformers/8c1699719a69e0d7cccc2c016217edb876ee6732c3aa2809e15a09c70e9bc22e.2c1d459b35b7f0b1938ff35bf6334bc60282ea79ea7cf7e9656e27f726ed07c6
01/31/2022 11:45:33 - WARNING - transformers.modeling_utils - Some weights of the model checkpoint at dmis-lab/biobert-large-cased-v1.1 were not used when initializing BertForTokenClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
pip install wandb; wandb login
see https://docs.wandb.com/huggingface.
Hi there, I ran the Track2-evaluate-ver4.py file on the gold standard test data and the test data (track 2) from n2c2, and I am getting zero scores on Relations for every entity. Please find the output of the execution below.
****************************** TRACK 2 *******************************
                    ------- strict -------    ------ lenient -------
                    Prec.   Rec.    F(b=1)    Prec.   Rec.    F(b=1)
Drug                0.9990  0.9997  0.9993    0.9990  0.9997  0.9993
Strength            1.0000  1.0000  1.0000    1.0000  1.0000  1.0000
Duration            0.9947  1.0000  0.9974    0.9947  1.0000  0.9974
Route               1.0000  1.0000  1.0000    1.0000  1.0000  1.0000
Form                0.9998  0.9995  0.9997    1.0000  0.9998  0.9999
Ade                 0.9968  0.9968  0.9968    0.9968  0.9968  0.9968
Dosage              0.9996  0.9996  0.9996    0.9996  0.9996  0.9996
Reason              0.9945  1.0000  0.9973    0.9945  1.0000  0.9973
Frequency           0.9988  0.9995  0.9991    0.9995  1.0000  0.9998
                    ------------------------------------------------
Overall (micro)     0.9989  0.9997  0.9993    0.9990  0.9998  0.9994
Overall (macro)     0.9990  0.9997  0.9993    0.9991  0.9997  0.9994
***************************** RELATIONS ******************************
Strength -> Drug    0.0000  0.0000  0.0000    0.0000  0.0000  0.0000
Dosage -> Drug      0.0000  0.0000  0.0000    0.0000  0.0000  0.0000
Duration -> Drug    0.0000  0.0000  0.0000    0.0000  0.0000  0.0000
Frequency -> Drug   0.0000  0.0000  0.0000    0.0000  0.0000  0.0000
Form -> Drug        0.0000  0.0000  0.0000    0.0000  0.0000  0.0000
Route -> Drug       0.0000  0.0000  0.0000    0.0000  0.0000  0.0000
Reason -> Drug      0.0000  0.0000  0.0000    0.0000  0.0000  0.0000
ADE -> Drug         0.0000  0.0000  0.0000    0.0000  0.0000  0.0000
                    ------------------------------------------------
Overall (micro)     0.0000  0.0000  0.0000    0.0000  0.0000  0.0000
Overall (macro)     0.0000  0.0000  0.0000    0.0000  0.0000  0.0000
This is the output I am getting, and I am confused about what is wrong here. Also, when I ran run_re.py, I got a ZeroDivision error as mentioned here, and now I am wondering whether the two are related and are causing an error in the training of the model.
It would mean a lot if you could help me with this or point out if I am doing something wrong.
I tried to run the run_ner.py script on my local laptop and got the following error:
Iteration: 0%| | 0/189 [00:00<?, ?it/s]
Epoch: 0%| | 0/5 [00:00<?, ?it/s]
Traceback (most recent call last):
File "run_ner.py", line 284, in <module>
main()
File "run_ner.py", line 206, in main
model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/transformers/trainer.py", line 499, in train
tr_loss += self._training_step(model, inputs, optimizer)
File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/transformers/trainer.py", line 622, in _training_step
outputs = model(**inputs)
File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/transformers/modeling_bert.py", line 1446, in forward
output_hidden_states=output_hidden_states,
File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/transformers/modeling_bert.py", line 762, in forward
output_hidden_states=output_hidden_states,
File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/transformers/modeling_bert.py", line 439, in forward
output_attentions,
File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/transformers/modeling_bert.py", line 371, in forward
hidden_states, attention_mask, head_mask, output_attentions=output_attentions,
File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/transformers/modeling_bert.py", line 315, in forward
hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, output_attentions,
File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/transformers/modeling_bert.py", line 256, in forward
context_layer = torch.matmul(attention_probs, value_layer)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 5.80 GiB total capacity; 4.70 GiB already allocated; 18.19 MiB free; 4.85 GiB reserved in total by PyTorch)
Is there any easy workaround for that? Is it compulsory to run that on a cloud GPU?
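One commonly used workaround is to lower the per-device batch size and compensate with gradient accumulation, so the effective batch size stays the same while fewer activations are held in memory at once. The TrainingArguments printed in an earlier log on this page list both per_device_train_batch_size and gradient_accumulation_steps, so the corresponding command-line flags are assumed to be available in this transformers version. A minimal sketch of the arithmetic (accumulation_steps is a hypothetical helper):

```python
def accumulation_steps(target_batch_size, per_device_batch_size, n_devices=1):
    """Gradient accumulation steps needed to reach `target_batch_size`.

    Raises ValueError if the target is not a multiple of the per-step size.
    """
    per_step = per_device_batch_size * n_devices
    if per_step <= 0 or target_batch_size % per_step != 0:
        raise ValueError("target_batch_size must be a positive multiple of the per-step batch size")
    return target_batch_size // per_step
```

For example, keeping an effective batch size of 16 with 4 samples per step needs 4 accumulation steps. This only reduces activation memory; the weights of a large model such as biobert-large may still not fit on a 5.8 GiB GPU, in which case a smaller --max_seq_length or a smaller base model are further options to try.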
Hi,
I see there are instructions for deploying the model on GCP, but nothing about deploying it on a local computer.
Could you give me a hint?
Thanks.
Hi, thank you for this repo. Unfortunately, I ran into some issues, and because there is no error message I am unable to debug them.
When I try to generate predictions, I get blank output (the model isn't predicting anything).
I realised that during training it also failed to predict any entities, but there was no error message, just warnings.
I used the following script to prepare the dataset:
!python generate_data.py --task ner \
--input_dir ./data/ \
--ade_dir ./ade_corpus/ \
--target_dir biobert_ner/dataset/ \
--max_seq_len 512 --dev_split 0.1 \
--tokenizer biobert-base \
--ext txt --sep " " \
Log:
Reading data
Train data:
Progress: [====================] 303/303
Test Data:
Progress: [==================> ] 191/202
/Users/lsolis/Documents/GitHub/PubMed_pipeline/ehr-relation-extraction/ehr.py:186: UserWarning: Invalid annotation encountered: ['12 hours'], File: ./data/test/106967.ann
warnings.warn(msg)
Progress: [====================] 202/202
ADE data: Done
Data successfully saved in biobert_ner/dataset/train.txt
Variable successfully saved in biobert_ner/dataset/train.pkl
Data successfully saved in biobert_ner/dataset/train_dev.txt
Variable successfully saved in biobert_ner/dataset/train_dev.pkl
Data successfully saved in biobert_ner/dataset/devel.txt
Variable successfully saved in biobert_ner/dataset/devel.pkl
Data successfully saved in biobert_ner/dataset/test.txt
Variable successfully saved in biobert_ner/dataset/test.pkl
Generating files successful. Files generated: train.txt, train.pkl, train_dev.txt, train_dev.pkl, devel.txt, devel.pkl, test.txt, test.pkl, labels.txt
Then run_ner.py
cd biobert_ner
export SAVE_DIR=./output1
export DATA_DIR=./dataset
export MAX_LENGTH=128
export BATCH_SIZE=16
export NUM_EPOCHS=5
export SAVE_STEPS=1000
export SEED=0
python run_ner.py --data_dir ${DATA_DIR}/ --labels ${DATA_DIR}/labels.txt --model_name_or_path dmis-lab/biobert-large-cased-v1.1 --output_dir ${SAVE_DIR}/ --max_seq_length ${MAX_LENGTH} --num_train_epochs ${NUM_EPOCHS} --per_device_train_batch_size ${BATCH_SIZE} --save_steps ${SAVE_STEPS} --seed ${SEED} --do_train --do_eval --do_predict --overwrite_output_dir
At the very last stage there was a printout in the terminal; here is a fraction of it:
05/10/2022 04:38:49 - WARNING - __main__ - Example 1966, Example: 8 O
05/10/2022 04:38:49 - WARNING - __main__ - Example 1966, Example: . O
05/10/2022 04:38:49 - WARNING - __main__ - Example 1966, Example: per B-DRUG
05/10/2022 04:38:49 - WARNING - __main__ - Example 1966, Example: 5 B-STR
Content of test_results.txt:
eval_loss = 1.8185148617116416
eval_precision = 0.0
eval_recall = 0.0
eval_f1 = 0.0
And a fraction of test_predictions.txt. Every label is "O"; not a single other label is predicted:
##9 O
##am O
##b O
##c O
##b O
##c O
##g O
##b O
##ct O
##c O
##v O
##ch O
##ch O
##c O
##d O
##w O
##lt O
##t O
##9 O
##am O
##uts O
##ymph O
##s O
##os O
##os O
##as O
##o O
##y O
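A quick sanity check before retraining is to count the labels in the generated files, to confirm that non-"O" labels are present in train.txt and devel.txt at all. The sketch below assumes one "token label" pair per line separated by a single space, as suggested by the --sep " " option above, with blank lines between sentences. label_counts is a hypothetical helper.

```python
from collections import Counter


def label_counts(lines, sep=" "):
    """Count the last `sep`-separated field (the label) of each non-blank line."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # blank line: sentence boundary
        parts = line.rsplit(sep, 1)
        if len(parts) == 2:
            counts[parts[1]] += 1
    return counts
```

Passing an open file handle for biobert_ner/dataset/train.txt to label_counts gives the label distribution. If it shows only "O", the problem is likely in data generation rather than in training.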
Hi,
all of the ade_corpus files are in txt (rel) or csv format. Where can I find the JSON format expected by generate-data? Thanks.
Hi, I completed build_data, but when running train.py the ner_learner throws the following error:
Traceback (most recent call last):
File "/data/storage_hpc_nishant/ade_bench/bilstm_crf/train.py", line 185, in <module>
main()
File "/data/storage_hpc_nishant/ade_bench/bilstm_crf/train.py", line 181, in main
learn.fit(train, dev)
File "/data/storage_hpc_nishant/ade_bench/bilstm_crf/model/ner_learner.py", line 186, in fit
self.train(epoch, nbatches_train, train_generator, fine_tune=fine_tune)
File "/data/storage_hpc_nishant/ade_bench/bilstm_crf/model/ner_learner.py", line 217, in train
for batch_idx, (inputs, targets, sequence_lengths) in enumerate(train_generator):
File "/data/storage_hpc_nishant/ade_bench/bilstm_crf/model/ner_learner.py", line 139, in data_generator
"word_ids": np.asarray(word_ids)
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (5,) + inhomogeneous part.
Any pointers as to what the reason could be?
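The message indicates that np.asarray received a list of 5 sequences with different lengths. Recent NumPy releases raise this ValueError for such ragged input, whereas older releases built an object array with a deprecation warning. One option, stated here as an assumption about what the model can accept, is to pad the sequences to a common length first; pinning an older NumPy version is another. A minimal padding sketch in plain Python (pad_sequences and the pad value 0 are illustrative choices, not the repository's):

```python
def pad_sequences(sequences, pad_value=0):
    """Right-pad every sequence with `pad_value` to the length of the longest."""
    max_len = max((len(seq) for seq in sequences), default=0)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
```

The padded result is rectangular, so np.asarray(pad_sequences(word_ids)) would no longer hit the inhomogeneous-shape error. Whether padding at this point is compatible with the rest of data_generator would need to be checked against the code.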
Hi, hope you're fine
Do you plan to release pre-trained models?
Thanks
Hi,
I'm facing an issue where every time I generate data for RE I get empty files, even though the script runs successfully.
I'm running this command:
python generate_data.py \
--task re \
--input_dir dataset_split \
--target_dir datasetRE/ \
--max_seq_len 512 \
--dev_split 0.1 \
--tokenizer biobert-base \
--ext tsv \
--sep tab \
I'm curious about how you dealt with de-identified data: did you leave the placeholders as they are or replace them with something? The [Hospital6 29] notation, for example.
WARNING:__main__:Process rank: -1, device: cpu, n_gpu: 0, distributed training: False, 16-bits training: False
Traceback (most recent call last):
File "run_re.py", line 230, in <module>
main()
File "run_re.py", line 103, in main
REDataset(data_args, tokenizer=tokenizer, cache_dir=model_args.cache_dir) if training_args.do_train else None
File "/content/ehr-relation-extraction/biobert_re/utils_re.py", line 129, in __init__
examples = self.processor.get_train_examples(args.data_dir)
File "/content/ehr-relation-extraction/biobert_re/data_processor.py", line 116, in get_train_examples
return self._create_examples(self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
File "/content/ehr-relation-extraction/biobert_re/data_processor.py", line 139, in _create_examples
label = None if set_type == "test" else line[1]
IndexError: list index out of range
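The traceback shows line[1] failing, which means at least one row of train.tsv has fewer than two tab-separated columns (for example a blank line, or a file written with a different separator). A minimal sketch for locating such rows, assuming the file is plain tab-separated text (find_short_rows is a hypothetical helper):

```python
import csv
import io


def find_short_rows(tsv_text, min_columns=2):
    """Return (row_number, row) for rows with fewer than `min_columns` fields.

    Row numbers start at 1. Blank lines are reported as empty rows.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return [
        (row_number, row)
        for row_number, row in enumerate(reader, start=1)
        if len(row) < min_columns
    ]
```

Reading train.tsv into a string and passing it to find_short_rows lists the offending rows. An entirely empty file returns nothing here but would also be worth ruling out.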
Everything is in the title.
After having run run_re.py, how can I use the model for RE predictions on new data?
Does the new data need to have its entities annotated?
Hi Smit,
I trained the NER model using your repo plus some custom data. When I run predictions on the custom dataset, it works fine on most files, but about 20% of the files produce the following error:
Prediction: 100%
1/1 [00:00<00:00, 4.22it/s]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-33-ad82028bb83c> in <module>()
----> 1 ner_predictions = get_ner_predictions(ehr_record=text, model_name="biobert")
2 text_ner = ner_predictions.get_entities()
2 frames
/content/ehr.py in get_char_idx(self, token_idx)
319 raise AttributeError("Tokenizer not set.")
320
--> 321 char_idx = self.token_to_char_map[token_idx]
322
323 return char_idx
IndexError: list index out of range
Would you know where the issue might be, please?
Thank you!
The current implementation of biobert_ner.utils_ner.read_examples_from_file() uses the variable name words to store all the tokens in a document. This is misleading, especially with word-piece tokenizers like BERT's, where individual words can be split into multiple tokens. A better variable name would be tokens, since that is precisely what we are reading.
The variable names should be changed in read_examples_from_file() and also in the InputExample class.