chnsh / bert-ner-conll
This repository tries to implement BERT for NER, following the paper, using the transformers library.
License: MIT License
The BERT tokenizer uses the WordPiece method, so the actual input can exceed 128 tokens; the max sequence length should therefore be raised to 256.
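To see why WordPiece inflates the sequence length, here is a minimal sketch of the greedy longest-match-first splitting it performs. The mini-vocabulary is purely hypothetical for illustration; the real BERT vocabulary has ~30k entries, but the mechanism is the same: one input word can become several sub-tokens, so 128 words may not fit in a 128-token budget.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first WordPiece split; '[UNK]' if no piece matches."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = '##' + sub  # continuation pieces carry the '##' prefix
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:
            return ['[UNK]']
        pieces.append(cur)
        start = end
    return pieces

# Hypothetical mini-vocabulary: 'REOFFER' splits into three sub-tokens
vocab = {'RE', '##OFF', '##ER', 'SPREAD'}
print(wordpiece('REOFFER', vocab))  # ['RE', '##OFF', '##ER']
print(wordpiece('SPREAD', vocab))   # ['SPREAD']
```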
In order to reproduce the CoNLL score reported in the BERT paper (92.4 for bert-base and 92.8 for bert-large), one trick is to apply a truecaser to article titles (all-upper-case sentences) as a preprocessing step on the CoNLL train/dev/test sets. This can be done with the following method.
# https://github.com/daltonfury42/truecase
# pip install truecase
import truecase
import re

# original tokens
# ['FULL', 'FEES', '1.875', 'REOFFER', '99.32', 'SPREAD', '+20', 'BP']

def truecase_sentence(tokens):
    word_lst = [(w, idx) for idx, w in enumerate(tokens) if all(c.isalpha() for c in w)]
    lst = [w for w, _ in word_lst if re.match(r'\b[A-Z\.\-]+\b', w)]

    # only truecase if every alphabetic token is upper-case (i.e. a title line)
    if len(lst) and len(lst) == len(word_lst):
        parts = truecase.get_true_case(' '.join(lst)).split()

        # the truecaser has its own tokenization;
        # skip if the number of words doesn't match
        if len(parts) != len(word_lst):
            return tokens

        for (w, idx), nw in zip(word_lst, parts):
            tokens[idx] = nw
    return tokens

# truecased tokens
# ['Full', 'fees', '1.875', 'Reoffer', '99.32', 'spread', '+20', 'BP']
Also, I found it useful to use a very small learning rate (5e-6), a large batch size (128), and a high number of epochs (>40).
With these configurations and preprocessing, I was able to reach 92.8 with bert-large.
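Collected in one place, the settings above can be written as a configuration fragment. The dictionary keys are illustrative names, not tied to any specific trainer API; only the values come from the discussion above.

```python
# Hypothetical training configuration mirroring the settings discussed above;
# key names are illustrative, values are from the thread.
config = {
    'model_name': 'bert-large-cased',
    'max_seq_length': 256,    # WordPiece can push inputs past 128 tokens
    'learning_rate': 5e-6,    # very small learning rate
    'train_batch_size': 128,  # large batch size
    'num_epochs': 50,         # high epoch count (>40)
}
print(config)
```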
In https://github.com/chnsh/BERT-NER-CoNLL/blob/master/model.py#L25, shouldn't it be labels[mask.nonzero().squeeze(1)] instead of labels[mask]? If you do labels[mask] and mask is a 0/1 integer tensor, you are indexing with the values 0 and 1, i.e. selecting the 1st or 2nd element of labels rather than masking. The same applies to embedding[mask] on L18.
My requirement here is: given a sentence (sequence), I would like to extract the entities present in it without classifying them by type. I see that BERT has BertForTokenClassification for NER, which does the classification.
Can somebody give me an idea of how to do entity extraction/identification using BERT?
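One common approach, sketched below under the assumption that you are training on BIO-tagged data: collapse the typed tags (B-PER, I-LOC, ...) to untyped B/I/O labels before training, then train BertForTokenClassification with num_labels=3 so the model only learns entity boundaries, not types. The helper name is hypothetical.

```python
# Hypothetical helper: collapse typed BIO tags (e.g. 'B-PER', 'I-LOC')
# to untyped entity/non-entity labels for pure entity extraction.
def collapse_to_binary(tags):
    return [tag if tag == 'O' else tag.split('-')[0] for tag in tags]

print(collapse_to_binary(['B-PER', 'I-PER', 'O', 'B-LOC']))
# ['B', 'I', 'O', 'B']
```

With the collapsed label set {'B', 'I', 'O'}, any token-classification head (including BertForTokenClassification with num_labels=3) now performs identification only.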
Thank you @chnsh for this great job. Do you have an implementation of the same solution in TensorFlow instead of PyTorch? I'm using TFBertForTokenClassification and I can't figure out how to add a CRF on top of BERT.