Comments (7)
Hi @phosseini,
We are working on a custom vocabulary with the BERT-large version. It might take some time (maybe a couple of months) to find a good number of pre-training steps.
Thanks.
Yes, the WordPiece vocab is exactly the same as the original BERT's, for several reasons. First, we wanted to start from the pre-trained BERT released by Google, which requires us to use the same WordPiece vocab. Second, because the WordPiece vocab is based on subword units, any new word in a biomedical corpus can still be broken into subwords with proper embeddings (which may be tuned during fine-tuning). We could try building our own vocab from biomedical corpora, but that would lose compatibility with the original pre-trained BERT.
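For concreteness, here is a small illustration (added for this write-up, not from the original thread) of how the original WordPiece vocab handles an unseen biomedical term, using the HuggingFace `transformers` tokenizer; the exact sub-word splits depend on the vocab file:

```python
from transformers import BertTokenizer

# Original BERT WordPiece vocabulary (cased variant).
tok = BertTokenizer.from_pretrained("bert-base-cased")

# A common English word is typically kept as a single whole-word token,
# while an out-of-vocabulary biomedical term is split into several
# '##'-prefixed sub-word pieces, each of which already has a
# pre-trained embedding that can be tuned during fine-tuning.
print(tok.tokenize("protein"))
print(tok.tokenize("cyclooxygenase"))
```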
Hi @jhyuklee:
A random idea I had: would it be possible to use a custom vocabulary without redoing the full BERT pre-training? One way to transfer the model onto a different vocabulary might proceed as follows (see the sketch after this list):
- Train a new WordPiece vocab.
- Use a technique similar to model distillation to learn to replicate the intermediate representation of the existing BERT, but using the new tokenizer. In other words, minimize the MSE between the old BERT's output after the first layer and the new BERT's output after the first layer. This paper uses a similar technique.
- Do this for a number of iterations, training only the first layer while keeping all other layers fixed.
The benefit is avoiding most of the expensive BERT pre-training: only the first layer would be trained from scratch, rather than the whole model. Thoughts?
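A minimal PyTorch sketch of this idea, assuming the HuggingFace `transformers` API. It is not from the thread and makes two loud assumptions: `biomed-vocab.txt` is a hypothetical new WordPiece vocab file (e.g. trained with the `tokenizers` library), and the first-layer states are mean-pooled before the MSE, since the two tokenizers produce different sequence lengths so a token-by-token loss is not defined:

```python
import torch
from transformers import BertModel, BertTokenizer

# Teacher: the original pre-trained BERT with its original WordPiece vocab.
teacher_tok = BertTokenizer.from_pretrained("bert-base-cased")
teacher = BertModel.from_pretrained("bert-base-cased")
teacher.eval()

# Student: same architecture, but its embedding matrix is resized to a
# hypothetical new vocab ("biomed-vocab.txt" is an assumed WordPiece
# vocab trained on biomedical text).
student_tok = BertTokenizer("biomed-vocab.txt")
student = BertModel.from_pretrained("bert-base-cased")
student.resize_token_embeddings(len(student_tok))

# Train only the embeddings and the first encoder layer; freeze the rest.
for p in student.parameters():
    p.requires_grad = False
for p in student.embeddings.parameters():
    p.requires_grad = True
for p in student.encoder.layer[0].parameters():
    p.requires_grad = True

opt = torch.optim.Adam(
    [p for p in student.parameters() if p.requires_grad], lr=1e-4)
mse = torch.nn.MSELoss()

def first_layer_repr(model, tok, text):
    enc = tok(text, return_tensors="pt")
    out = model(**enc, output_hidden_states=True)
    # hidden_states[1] is the output of the first encoder layer.
    # Mean-pool over tokens because the two tokenizers yield
    # different sequence lengths.
    return out.hidden_states[1].mean(dim=1)

corpus = ["aspirin irreversibly inhibits cyclooxygenase"]  # stand-in corpus
for text in corpus:
    with torch.no_grad():
        target = first_layer_repr(teacher, teacher_tok, text)
    loss = mse(first_layer_repr(student, student_tok, text), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In practice one would batch the corpus and mask padding before pooling; the point here is just the frozen-teacher, first-layer-only training loop.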
Got it! Thanks for the quick and helpful reply 👍
I can understand why keeping compatibility with the original BERT is important. Personally, I would like to have a custom dictionary, since I think there might be some interesting opportunities for fine-tuning: a lot of medical jargon (like drug names and chemicals) has a somewhat unique internal structure that is lost during subword tokenization. But it'd be rude to ask you to train that! Feel free to close, and thank you for this great contribution!
We have a plan to use a custom dictionary, but it will require many more GPU hours to pre-train such a model compared to starting from the pre-trained BERT. We'll share it if it works. Thank you for your interest; I'll close the issue.
> We have a plan to use a custom dictionary, but it will require many more GPU hours to pre-train such a model compared to starting from the pre-trained BERT. We'll share it if it works. Thank you for your interest; I'll close the issue.
I wonder if there's any update on using the custom dictionary and if it's a work in progress or on your TODO list?
Hi @jhyuklee,
I saw you updated your weights with a custom vocabulary. I was wondering if you had any information on how you trained that model. Anything along the lines of:
- How long did it take to train your model?
- I assume you still used the 8xV100 machine for training?
- Did you use the original BERT datasets (Wikipedia + BooksCorpus) in your pre-training?
Thank you!
Related Issues (20)
- total time required for training HOT 3
- Files for BioBERT tokenizer HOT 4
- Regarding Relation Extraction (RE), does it mean it's classifying whether the two marked entities have the defined relations? HOT 1
- Problem with loading model HOT 1
- Does the pre-training corpus include Chinese? HOT 1
- How to use pre-trained BioBERT like distilBERT HOT 1
- Is there a Chinese-version BioBERT pre-trained model? HOT 1
- BIOBERT corpus HOT 5
- Biobert custom vocab HOT 1
- KeyError when running NER on pretrained BioBERT model HOT 1
- License HOT 1
- I can't open the five links for fine-tuning BioBERT HOT 1
- Question: Which part of PMC is used?
- using pretrained biobert matrix
- How do you pre-process the PMC articles?
- Any plan to have updated pretrained model? HOT 1
- The pre-trained weights seem to be unavailable in the provided Google Drive links.
- Using HuggingFace transformers library
- Failed to find any matching files for biobert-pretrained/biobert_v1.1_pubmed/biobert_model.ckpt HOT 7
- Can't download the pre-trained weight from "release" section HOT 2