Giter Site home page Giter Site logo

amzn / amazon-weak-ner-needle Goto Github PK

View Code? Open in Web Editor NEW
100.0 5.0 30.0 106 KB

Named Entity Recognition with Small Strongly Labeled and Large Weakly Labeled Data

License: MIT No Attribution

Python 75.55% Jupyter Notebook 14.47% Shell 9.97%
bert ner biomedical biomedical-named-entity-recognition weakly-supervised-learning weak-supervision distant-supervision

amazon-weak-ner-needle's Introduction

Named Entity Recognition with Small Strongly Labeled and Large Weakly Labeled Data

arXiv

This is the code base for weakly supervised NER.

We provide a three stage framework:

  • Stage I: Domain continual pre-training;
  • Stage II: Noise-aware weakly supervised pre-training;
  • Stage III: Fine-tuning.

In this code base, we actually provide basic building blocks which allow arbitrary combination of different stages. We also provide examples scripts for reproducing our results in BioMedical NER.

See details in arXiv.

Performance Benchmark

BioMedical NER

Method (F1) BC5CDR-chem BC5CDR-disease NCBI-disease
BERT 89.99 79.92 85.87
bioBERT 92.85 84.70 89.13
PubMedBERT 93.33 85.62 87.82
Ours 94.17 90.69 92.28

See more in bio_script/README.md

Dependency

pytorch==1.6.0
transformers==3.3.1
allennlp==1.1.0
flashtool==0.0.10
ray==0.8.7

Install requirements

pip install -r requirements.txt

(If the allennlp and transformers are incompatible, install allennlp first and then update transformers. Since we only use some small functions of allennlp, it should works fine. )

File Structure:

├── bert-ner          #  Python Code for Training NER models
│   └── ...
└── bio_script        #  Shell Scripts for Training BioMedical NER models
    └── ...

Usage

See examples in bio_script

Hyperparameter Explaination

Here we explain hyperparameters used the scripts in ./bio_script.

Training Scripts:

Scripts

  • roberta_mlm_pretrain.sh
  • weak_weighted_selftrain.sh
  • finetune.sh

Hyperparameter

  • GPUID: Choose the GPU for training. It can also be specified by xxx.sh 0,1,2,3.
  • MASTER_PORT: automatically constructed (avoid conflicts) for distributed training.
  • DISTRIBUTE_GPU: use distributed training or not
  • PROJECT_ROOT: automatically detected, the root path of the project folder.
  • DATA_DIR: Directory of the training data, where it contains train.txt test.txt dev.txt labels.txt weak_train.txt (weak data) aug_train.txt (optional).
  • USE_DA: if augment training data by augmentation, i.e., combine train.txt + aug_train.txt in DATA_DIR for training.
  • BERT_MODEL: the model backbone, e.g., roberta-large. See transformers for details.
  • BERT_CKP: see BERT_MODEL_PATH.
  • BERT_MODEL_PATH: the path of the model checkpoint that you want to load as the initialization. Usually used with BERT_CKP.
  • LOSSFUNC: nll the normal loss function, corrected_nll noise-aware risk (i.e., add weighted log-unlikelihood regularization: wei*nll + (1-wei)*null ).
  • MAX_WEIGHT: The maximum weight of a sample in the loss.
  • MAX_LENGTH: max sentence length.
  • BATCH_SIZE: batch size per GPU.
  • NUM_EPOCHS: number of training epoches.
  • LR: learning rate.
  • WARMUP: learning rate warmup steps.
  • SAVE_STEPS: the frequency of saving models.
  • EVAL_STEPS: the frequency of testing on validation.
  • SEED: radnom seed.
  • OUTPUT_DIR: the directory for saving model and code. Some parameters will be automatically appended to the path.
    • roberta_mlm_pretrain.sh: It's better to manually check where you want to save the model.]
    • finetune.sh: It will be save in ${BERT_MODEL_PATH}/finetune_xxxx.
    • weak_weighted_selftrain.sh: It will be save in ${BERT_MODEL_PATH}/selftrain/${FBA_RULE}_xxxx (see FBA_RULE below)

There are some addition parameters need to be set for weakly supervised learning (weak_weighted_selftrain.sh).

Profiling Script

Scripts

  • profile.sh

Profiling scripts also use the same entry as the training script: bert-ner/run_ner.py but only do evaluation.

Hyperparameter Basically the same as training script.

  • PROFILE_FILE: can be train,dev,test or a specific path to a txt data. E.g., using Weak by

    PROFILE_FILE=weak_train_100.txt PROFILE_FILE=$DATA_DIR/$PROFILE_FILE

  • OUTPUT_DIR: It will be saved in OUTPUT_DIR=${BERT_MODEL_PATH}/predict/profile

Weakly Supervised Data Refinement Script

Scripts

  • profile2refinedweakdata.sh

Hyperparameter

  • BERT_CKP: see BERT_MODEL_PATH.
  • BERT_MODEL_PATH: the path of the model checkpoint that you want to load as the initialization. Usually used with BERT_CKP.
  • WEI_RULE: rule for generating weight for each weak sample.
    • uni: all are 1
    • avgaccu: confidence estimate for new labels generated by all_overwrite
    • avgaccu_weak_non_O_promote: confidence estimate for new labels generated by non_O_overwrite
  • PRED_RULE: rule for generating new weak labels.
    • non_O_overwrite: non-entity ('O') is overwrited by prediction
    • all_overwrite: all use prediction, i.e., self-training
    • no: use original weak labels
    • non_O_overwrite_all_overwrite_over_accu_xx: non_O_overwrite + if confidence is higher than xx all tokens use prediction as new labels

The generated data will be saved in ${BERT_MODEL_PATH}/predict/weak_${PRED_RULE}-WEI_${WEI_RULE} WEAK_RULE specified in weak_weighted_selftrain.sh is essential the name of folder weak_${PRED_RULE}-WEI_${WEI_RULE}.

More Rounds of Training, Try Different Combination

  1. To do training with weakly supervised data from any model checkpoint directory:
  • i) Set BERT_CKP appropriately;
  • ii) Create profile data, e.g., run ./bio_script/profile.sh for dev set and weak set
  • iii) Generate data with weak labels from profile data, e.g., run ./bio_script/profile2refinedweakdata.sh. You can use different rules to generate weights for each sample (WEI_RULE) and different rules to refine weak labels (PRED_RULE). See more details in ./ber-ner/profile2refinedweakdata.py
  • iv) Do training with ./bio_script/weak_weighted_selftrain.sh.
  1. To do fine-tuning with human labeled data from any model checkpoint directory:
  • i) Set BERT_CKP appropriately;
  • ii) Run ./bio_script/finetune.sh.

Reference

@inproceedings{Jiang2021NamedER,
  title={Named Entity Recognition with Small Strongly Labeled and Large Weakly Labeled Data},
  author={Haoming Jiang and Danqing Zhang and Tianyue Cao and Bing Yin and T. Zhao},
  booktitle={ACL/IJCNLP},
  year={2021}
}

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.

amazon-weak-ner-needle's People

Contributors

amazon-auto avatar dependabot[bot] avatar hmjianggatech avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

amazon-weak-ner-needle's Issues

【实体类型众多】时的解决方案

很感谢您的工作。
我已经复现了医药领域的代码。而当我把您的工作复现在我的工作场景时,我发现我需要同时针对50个实体类型各自训练一个模型。请问您可以提供一个更好的解决方案吗?


Thanks for your sharing your work!
In my field, there are 50 types of entity, so I need to train 50 models for each type. Could you offer a better solution?

details about biomedical dictionary

Thanks for your sharing your work.

I loved reading your paper. But I have a question about the two files.

  • bio_script/data/chem_dict.txt
  • bio_script/data/disease_dict.txt

Could you share the way of collecting a chemical and disease dictionary?

CUDA OUT OF MEMORY

Hello,
When I run "roberta_mlm_pretrain.sh",it always occur the error that CUDA out of memory(I have 4 TITAN RTX GPUs,which means the memory is enough).Is it because the data file all_text.txt is too big?
PS:The terminal show the info:INFO - main - Creating features from dataset file at ../bio_script/tasks/unlabeled/all_text.txt,and then it occurs the error.
Thank you in advance.

No prediction results at the end of training

At the end of training, it shows:
Prediction: 100%
but there is no results printed, like p, r, f1. Only the eval results are printed.

How can I make it print the testing results? Thanks.

More details about running pipline

@HMJiangGatech Thanks for your sharing. To make sure I don't miss any steps, I list the running pipline as follows:

  1. stage1

    sh roberta_mlm_pretrain.sh

  2. stage2

    sh supervised.sh:
    to train the crf model so that it has the ability to predict missing labels for weak data
    sh profile.sh:
    load the supervised trained crf model to predict labels for weak and dev data
    sh profile2refinedweakdata.sh:
    assign weights for weak data
    sh weak_weighted_selftrain.sh:
    weak supervised training

  3. stage3

    sh finetune.sh:
    use true data to fine tune bert

Am I right? Thanks.

Size of weakly labeled data in paper/Annotate.ipynb and the number of sentences of 2021 PubMed Baseline don't match

Hello,

I have tried to annotate all pubmed sentences and struggled with large number of pubmed sentences and memory issue.
I changed download_pubmed.sh code to retrieve 2021 pubmed baseline since we could not find 2020 baseline.

URL=ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline
for i in $(seq -f "%04g" 1 1015); do
  GZFILE=pubmed21n${i}.xml.gz
  echo $URL/$GZFILE
  wget $URL/$GZFILE
  gzip -d $GZFILE
  XMLFILE=pubmed21n${i}.xml
...

And then we encounter this issue: We have retrieved 2021 pubmed baseline only including 1-1015 files, and we assume data was accumulated from 2020. So we guess our retrieved data may have the same number of lines with yours.
But we have quite large number of lines for all_text and (un)labeled_lines, compared to your outputs of Annotate notebook.

Could you please give me some advice for the different number of pubmed sentences and expected effect of those large number of sentences?

Thank you in advance.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.