chnsh / bert-ner-conll
This repository tries to implement BERT for NER, following the paper, using the transformers library.
License: MIT License
The BERT tokenizer uses the WordPiece method, so the actual input can exceed 128 tokens; the max sequence length should therefore be raised to 256.
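To see why WordPiece inflates the sequence length, here is a minimal sketch of the greedy longest-match-first splitting it performs. The mini-vocabulary is purely hypothetical for illustration; the real BERT vocabulary has ~30k entries, but the mechanism is the same: one input word can become several sub-tokens, so 128 words may not fit in a 128-token budget.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first WordPiece split; '[UNK]' if no piece matches."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = '##' + sub  # continuation pieces carry the '##' prefix
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:
            return ['[UNK]']
        pieces.append(cur)
        start = end
    return pieces

# Hypothetical mini-vocabulary: 'REOFFER' splits into three sub-tokens
vocab = {'RE', '##OFF', '##ER', 'SPREAD'}
print(wordpiece('REOFFER', vocab))  # ['RE', '##OFF', '##ER']
print(wordpiece('SPREAD', vocab))   # ['SPREAD']
```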
In order to reproduce the CoNLL score reported in the BERT paper (92.4 for bert-base and 92.8 for bert-large), one trick is to apply a truecaser to article titles (all-upper-case sentences) as a preprocessing step on the CoNLL train/dev/test sets. This can be done with the following method.
# https://github.com/daltonfury42/truecase
# pip install truecase
import truecase
import re

# original tokens
# ['FULL', 'FEES', '1.875', 'REOFFER', '99.32', 'SPREAD', '+20', 'BP']

def truecase_sentence(tokens):
    word_lst = [(w, idx) for idx, w in enumerate(tokens) if all(c.isalpha() for c in w)]
    lst = [w for w, _ in word_lst if re.match(r'\b[A-Z\.\-]+\b', w)]

    # only truecase if every alphabetic token is upper-case (i.e. a title line)
    if len(lst) and len(lst) == len(word_lst):
        parts = truecase.get_true_case(' '.join(lst)).split()

        # the truecaser has its own tokenization;
        # skip if the number of words doesn't match
        if len(parts) != len(word_lst):
            return tokens

        for (w, idx), nw in zip(word_lst, parts):
            tokens[idx] = nw
    return tokens

# truecased tokens
# ['Full', 'fees', '1.875', 'Reoffer', '99.32', 'spread', '+20', 'BP']
Also, I found it useful to use a very small learning rate (5e-6), a large batch size (128), and a high number of epochs (>40).
With these configurations and preprocessing, I was able to reach 92.8 with bert-large.
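Collected in one place, the settings above can be written as a configuration fragment. The dictionary keys are illustrative names, not tied to any specific trainer API; only the values come from the discussion above.

```python
# Hypothetical training configuration mirroring the settings discussed above;
# key names are illustrative, values are from the thread.
config = {
    'model_name': 'bert-large-cased',
    'max_seq_length': 256,    # WordPiece can push inputs past 128 tokens
    'learning_rate': 5e-6,    # very small learning rate
    'train_batch_size': 128,  # large batch size
    'num_epochs': 50,         # high epoch count (>40)
}
print(config)
```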
In https://github.com/chnsh/BERT-NER-CoNLL/blob/master/model.py#L25, shouldn't it be labels[mask.nonzero().squeeze(1)] instead of labels[mask]? If you do labels[mask] and mask is a 0/1 integer tensor, you are indexing with the values 0 and 1, i.e. selecting the 1st or 2nd element of labels rather than masking. The same applies to embedding[mask] on L18.
My requirement here is: given a sentence (sequence), I would like to extract the entities present in it without classifying them by type. I see that BERT has BertForTokenClassification for NER, which does the classification.
Can somebody give me an idea of how to do entity extraction/identification using BERT?
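One common approach, sketched below under the assumption that you are training on BIO-tagged data: collapse the typed tags (B-PER, I-LOC, ...) to untyped B/I/O labels before training, then train BertForTokenClassification with num_labels=3 so the model only learns entity boundaries, not types. The helper name is hypothetical.

```python
# Hypothetical helper: collapse typed BIO tags (e.g. 'B-PER', 'I-LOC')
# to untyped entity/non-entity labels for pure entity extraction.
def collapse_to_binary(tags):
    return [tag if tag == 'O' else tag.split('-')[0] for tag in tags]

print(collapse_to_binary(['B-PER', 'I-PER', 'O', 'B-LOC']))
# ['B', 'I', 'O', 'B']
```

With the collapsed label set {'B', 'I', 'O'}, any token-classification head (including BertForTokenClassification with num_labels=3) now performs identification only.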
Thank you @chnsh for this great job. Do you have an implementation of the same solution in TensorFlow instead of PyTorch? I'm using TFBertForTokenClassification and I can't figure out how to add a CRF on top of BERT.