
wietsedv / bertje


BERTje is a Dutch pre-trained BERT model developed at the University of Groningen. (EMNLP Findings 2020) "What’s so special about BERT’s layers? A closer look at the NLP pipeline in monolingual and multilingual models"

Home Page: https://aclanthology.org/2020.findings-emnlp.389/

License: Apache License 2.0

Python 93.73% Perl 5.95% Shell 0.32%

bertje's Introduction

BERTje: A Dutch BERT model

Wietse de Vries · Andreas van Cranenburgh · Arianna Bisazza · Tommaso Caselli · Gertjan van Noord · Malvina Nissim

Model description

BERTje is a Dutch pre-trained BERT model developed at the University of Groningen.

For details, check out our paper on arXiv, the model on the 🤗 Hugging Face model hub and related work on Semantic Scholar.

Publications with BERTje

Transformers

You can play with BERTje without any training using the following snippet (or use the hosted version on the Hugging Face model hub):

from transformers import pipeline

pipe = pipeline('fill-mask', model='GroNLP/bert-base-dutch-cased')
for res in pipe('Ik wou dat ik een [MASK] was.'):
    print(res['sequence'])
    
# [CLS] Ik wou dat ik een kind was. [SEP]
# [CLS] Ik wou dat ik een mens was. [SEP]
# [CLS] Ik wou dat ik een vrouw was. [SEP]
# [CLS] Ik wou dat ik een man was. [SEP]
# [CLS] Ik wou dat ik een vriend was. [SEP]

If you want to actually train your own model based on BERTje, you can load the tokenizer and model with this snippet:

from transformers import AutoTokenizer, AutoModel, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")
model = AutoModel.from_pretrained("GroNLP/bert-base-dutch-cased")  # PyTorch
model = TFAutoModel.from_pretrained("GroNLP/bert-base-dutch-cased")  # TensorFlow

That's all! Check out the Transformers documentation for further instructions.

WARNING: The vocabulary size of BERTje changed in 2021. If you use an older fine-tuned model and experience problems with the GroNLP/bert-base-dutch-cased tokenizer, use the following tokenizer:

tokenizer = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased", revision="v1")  # v1 is the old vocabulary

Benchmarks

The arXiv paper lists benchmarks. Here are a couple of comparisons between BERTje, multilingual BERT, BERT-NL and RobBERT that were done after writing the paper. Unlike some other comparisons, the fine-tuning procedures for these benchmarks are identical for each pre-trained model. You may be able to achieve higher scores for individual models by optimizing fine-tuning procedures.

More experimental results will be added to this page when they are finished. Technical details about how these models were fine-tuned will be published later, as well as downloadable fine-tuned checkpoints.

All of the tested models are base-sized (12 layers) with cased tokenization.

Headers in the tables below link to the original data sources. Scores link to the model page that corresponds to that specific fine-tuned model. These tables will be updated when more simple fine-tuned models are made available.

Named Entity Recognition

Model      CoNLL-2002   SoNaR-1   spaCy UD LassySmall
BERTje       90.24        84.93        86.10
mBERT        88.61        84.19        86.77
BERT-NL      85.05        80.45        81.62
RobBERT      84.72        81.98        79.84

Part-of-speech tagging

Model      UDv2.5 LassySmall
BERTje       96.48
mBERT        96.20
BERT-NL      96.10
RobBERT      95.91

Download

The recommended download method is using the Transformers library. The model is available at the model hub.

You can manually download the model files here: https://huggingface.co/GroNLP/bert-base-dutch-cased/tree/main
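
If you prefer a programmatic download, a minimal sketch using the huggingface_hub package (a separate library, not part of this repository) could look like this:

# Sketch: download all model files to a local cache directory with huggingface_hub.
# huggingface_hub is an assumption here (pip install huggingface_hub); the README
# itself only points to the Transformers library and the model hub page.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="GroNLP/bert-base-dutch-cased")
print(local_dir)  # path to the downloaded config, vocabulary and weight files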

Thanks to Hugging Face for hosting the model files!

Code

The main code used for pretraining data preparation, fine-tuning and probing is given in the appropriate directories. Do not expect the code to be fully functional, complete or documented, since this is research code that was written and collected over the course of multiple months. Nevertheless, the code can be useful for reference.

Acknowledgements

Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).

Citation

Please use the following citation if you use BERTje or our fine-tuned models:

@misc{devries2019bertje,
	title = {{BERTje}: {A} {Dutch} {BERT} {Model}},
	shorttitle = {{BERTje}},
	author = {de Vries, Wietse  and  van Cranenburgh, Andreas  and  Bisazza, Arianna  and  Caselli, Tommaso  and  Noord, Gertjan van  and  Nissim, Malvina},
	year = {2019},
	month = dec,
	howpublished = {arXiv:1912.09582},
	url = {http://arxiv.org/abs/1912.09582},
}

Use the following citation if you use anything from the probing classifiers:

@inproceedings{devries2020bertlayers,
	title = {What's so special about {BERT}'s layers? {A} closer look at the {NLP} pipeline in monolingual and multilingual models},
	author = {de Vries, Wietse  and  van Cranenburgh, Andreas  and  Nissim, Malvina},
	year = {2020},
	booktitle = {Findings of EMNLP},
	pages = {4339--4350},
	url = {https://www.aclweb.org/anthology/2020.findings-emnlp.389},
}

bertje's People

Contributors

andreasvc · wietsedv


bertje's Issues

Example with newest transformers version

I am not too familiar with the intricacies of Transformers, but it seems the example in the README is outdated.

I did manage to load the model using this:

from transformers import AutoTokenizer, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("wietsedv/bert-base-dutch-cased")
model = TFAutoModel.from_pretrained("wietsedv/bert-base-dutch-cased")

I believe TFAutoModel here is specific to the TensorFlow backend.

Prepare_cor method missing

# save_data(prepare_cor(args.in_path), args.out_path, 'cor')

Hello! I'm trying to run your code and I wanted to work with the coref data from SoNaR.
I saw this commented-out code for preparing the coreference data from SoNaR and I was wondering if you still have this method somewhere, as it would be very useful for me.

Thank you!

A question about the training data: did you use DBNL?

Hi! Could you please tell me if the line "Books: a collection of contemporary and historical fiction novels (4.4GB)" in the paper (section 2.1: Data) refers to DBNL or to some other dataset? Many thanks! :)

"bert-base-dutch-cased" not recognized on huggingface transformers

Although it is also described in the transformers documentation, the model does not seem to be available there.

OSError: Model name 'bert-base-dutch-cased' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased). We assumed 'bert-base-dutch-cased' was a path or url to a configuration file named config.json or a directory containing such a file but couldn't find any such file at this path or url.

bert-base-dutch-uncased

Hello,

Thanks for sharing this work! Not being hindered by too much knowledge about BERT models: would it be difficult (for you) to train a bert-base-dutch-uncased model (more similar to the English counterpart), or is it in some way trivial for me to map an uncased tokenizer to the cased tokenizer?

The use case I have is post-processing automatic speech recognition output. We are used to making caseful vocabularies in Dutch ASR (in a way that contrasts with the DEFAULT APPROACH IN ENGLISH), but with a BERT model as a backend to ASR, I believe the case disambiguation should, in the end, be solved there. Hence I am looking, at the input side, for a caseless Dutch tokenizer.

Thanks!

TF checkpoints

Hi,

Are there any TF checkpoints available, rather than only the PyTorch ones?

Evaluation of model

Hi, I've been trying to use BERTScore (found here). I tried with both the PyTorch and TF models, and I even tried tokenizing the reference and predicted texts using AutoTokenizer, but it keeps giving an error. Could you please help me solve this issue? I have pasted my code below.

from transformers import AutoTokenizer, TFAutoModel
from bert_score import score

modeltf = TFAutoModel.from_pretrained("GroNLP/bert-base-dutch-cased")
custom_tokenizer = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased", revision="v1")
tokenized_reference = custom_tokenizer(truth, return_tensors='pt', padding=True, truncation=True)
tokenized_generated = custom_tokenizer(pred, return_tensors='pt', padding=True, truncation=True)
P, R, F1 = score(tokenized_generated,tokenized_reference,model_type=modeltf)

And the error that gets generated is:

KeyError                                  Traceback (most recent call last)
[<ipython-input-69-5d5bcd498207>](https://localhost:8080/#) in <cell line: 1>()
----> 1 P, R, F2 = score(tokenized_generated,tokenized_reference,model_type=modeltf)

[/usr/local/lib/python3.10/dist-packages/bert_score/score.py](https://localhost:8080/#) in score(cands, refs, model_type, num_layers, verbose, idf, device, batch_size, nthreads, all_layers, lang, return_hash, rescale_with_baseline, baseline_path, use_fast_tokenizer)
     93         model_type = lang2model[lang]
     94     if num_layers is None:
---> 95         num_layers = model2layers[model_type]
     96 
     97     tokenizer = get_tokenizer(model_type, use_fast_tokenizer)

KeyError: <transformers.models.bert.modeling_tf_bert.TFBertModel object at 0x7c76c0efddb0>

The versions I'm using for this are:

bert-score==0.3.13
tensorflow==2.13.0
transformers==4.34.0
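
For reference, bert_score's score() function expects plain strings and a model identifier rather than pre-tokenized tensors or a model object. A minimal sketch of that calling convention, assuming pred and truth are lists of strings (this is a generic bert-score usage pattern, not advice from the BERTje authors):

# Sketch: pass raw strings and the model name to bert_score; num_layers is needed
# because GroNLP/bert-base-dutch-cased is not in bert_score's built-in layer table.
from bert_score import score

P, R, F1 = score(
    pred,   # candidate sentences, list of str (hypothetical variable from the issue)
    truth,  # reference sentences, list of str (hypothetical variable from the issue)
    model_type="GroNLP/bert-base-dutch-cased",
    num_layers=12,
)
print(F1.mean())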

NER Labels

When trying to parse a Dutch sentence using the transformers NER pipeline, the only output labels that appear are label_0 and label_1. Is this correct, and what do these labels mean?
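
For context, GroNLP/bert-base-dutch-cased is the plain pre-trained model without a NER head, so a token-classification pipeline built on it gets a freshly initialized head whose two default classes appear as label_0 and label_1. A minimal sketch of attaching an explicit label set before fine-tuning (the label names below are made-up examples, not part of this repository):

# Sketch: token-classification head with explicit labels on top of BERTje.
# The head is randomly initialized and only becomes useful after fine-tuning.
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]  # hypothetical label set
id2label = {i: label for i, label in enumerate(labels)}
label2id = {label: i for i, label in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "GroNLP/bert-base-dutch-cased",
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
)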

Reaching out

Hi there,

My name is Merrin and I am an AI student at the UvA. For my bachelor thesis, I would be interested in using BERTje. Unfortunately, I am having some issues that I cannot seem to work through. I would really like to ask you some questions about the use of BERTje. I know this isn't the conventional way of reaching out, but if you have the time it would be nice to get in contact.

Training Data for BERTje

Hello,

thank you very much for BERTje!

We would like to do some analysis of BERT models in different languages. Would it be possible to release the data you used for pre-training the model? Especially the sources without citations:

  • Books: a collection of contemporary and historical fiction novels
  • Web news: all articles of 4 Dutch news websites from January 1, 2015 to October 1, 2019 (1.6GB)

Thank you very much!

Training data availability

Hey!

I wanted to inquire whether the training data for Bertje is available anywhere, I didn't see it in the repo.
Thanks for any help!

Using BERTje for sentiment classification

Hi Wietse!

I am trying to classify given texts (usually about 100 words) as either positive or negative. How would I go about doing that with BERTje?

I tried the following, based off of the fill-mask example that is shared in the README and on Hugging Face.

model = pipeline("sentiment-analysis", model='GroNLP/bert-base-dutch-cased')
negative_dutch_text = 'Dat is heel vervelend om te horen! Ik ben ook heel boos hierover. Wat een rotzooi.' 
model(negative_dutch_text)

For every sentence this outputs LABEL_0 with a score of around 0.55. I would expect this example to be strongly negative. In what way are my expectations off? How would I go about using BERTje to classify texts as positive or negative?

Thanks a lot!
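
For comparison, the base checkpoint ships without a sentiment head, so the pipeline call above attaches a freshly initialized classifier, which explains the near-constant LABEL_0 output. A minimal sketch of the kind of model one would fine-tune first (the label mapping is an illustrative assumption, and a labeled Dutch sentiment dataset is required):

# Sketch: BERTje with a 2-class sequence-classification head; the head is randomly
# initialized and must be fine-tuned on labeled data before its predictions mean anything.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "GroNLP/bert-base-dutch-cased",
    num_labels=2,
    id2label={0: "negative", 1: "positive"},  # example mapping, not from the repo
    label2id={"negative": 0, "positive": 1},
)
# After fine-tuning (e.g. with the Trainer API), a sentiment pipeline built from
# this model and tokenizer gives meaningful positive/negative predictions.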

Pretraining on a different domain

Hi Wietse,

Thanks for BERTje and the great amount of work.
My question is: how hard would it be to pretrain BERTje on, for example, Dutch legal documents? To my understanding, fine-tuning is what you do when you have a downstream task (for example classification). Would it be beneficial to pretrain BERTje on these legal documents first, or is it sufficient to fine-tune only for the downstream task? I can imagine that the sentences and tokens could be different for a different domain, and that it would therefore help to also do some pretraining with this set of documents.
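
As a rough illustration of what further pretraining on domain text involves, here is a generic masked-language-modeling sketch with the Hugging Face Trainer; this is not the pretraining code used for BERTje, and the corpus file name and hyperparameters are placeholders:

# Sketch: continue masked-language-model training of BERTje on a domain corpus.
# Generic transformers/datasets recipe; "legal_corpus.txt" is a placeholder file.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")
model = AutoModelForMaskedLM.from_pretrained("GroNLP/bert-base-dutch-cased")

dataset = load_dataset("text", data_files={"train": "legal_corpus.txt"})
tokenized = dataset["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="bertje-domain", num_train_epochs=1,
                         per_device_train_batch_size=8)
Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()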

BERTje for SRL

Thanks for sharing the BERTje model!

We want to try to use BERTje for Semantic Role Labeling, which you mention as one of the tasks in the paper. Could you share the code for fine-tuning the model, and possibly also the fine-tuned model?

Pretraining problem

Hello,

I want to do extra pretraining of the BERTje model on domain-specific texts, and I use the pretraining code from the original BERT repository.
I downloaded the model from the Hugging Face model hub and I need to use the .ckpt files.
I cannot download the model via code because I don't have internet access from where I work, so I have a local folder with the bert-base-dutch-cased model.

When I try to run the pretraining code I get this error:

2021-01-22 10:17:42.271665: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at save_restore_v2_ops.cc:205 : Out of range: Read less bytes than requested
2021-01-22 10:17:42.271697: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at save_restore_v2_ops.cc:205 : Out of range: Read less bytes than requested
2021-01-22 10:17:42.271698: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at save_restore_v2_ops.cc:205 : Out of range: Read less bytes than requested
2021-01-22 10:17:42.271725: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at save_restore_v2_ops.cc:205 : Out of range: Read less bytes than requested
2021-01-22 10:17:42.271734: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at save_restore_v2_ops.cc:205 : Out of range: Read less bytes than requested
2021-01-22 10:17:42.271737: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at save_restore_v2_ops.cc:205 : Out of range: Read less bytes than requested
2021-01-22 10:17:42.271750: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at save_restore_v2_ops.cc:205 : Out of range: Read less bytes than requested
2021-01-22 10:17:42.271773: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at save_restore_v2_ops.cc:205 : Out of range: Read less bytes than requested
2021-01-22 10:17:42.271787: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at save_restore_v2_ops.cc:205 : Out of range: Read less bytes than requested
2021-01-22 10:17:42.271797: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at save_restore_v2_ops.cc:205 : Out of range: Read less bytes than requested
INFO:tensorflow:training_loop marked as finished
I0122 10:17:42.276760 139671864993600 error_handling.py:115] training_loop marked as finished
WARNING:tensorflow:Reraising captured error
W0122 10:17:42.276864 139671864993600 error_handling.py:149] Reraising captured error
Traceback (most recent call last):
  File "/home/amber/Documents/bert/env/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1375, in _do_call
    return fn(*args)

I can get the pretraining working with the original BERT checkpoints.

The command I use is:

python run_pretraining.py --bert_config_file="bert-base-dutch-cased/config.json" --input_file="tf_examples.tfrecord" --init_checkpoint="bert-base-dutch-cased/model.ckpt" --output_dir="output_dir" --max_seq_length=16 --max_predictions_per_seq=20 --do_train=True --do_eval=True --train_batch_size=1 --eval_batch_size=1 --learning_rate=1e-4 --num_train_steps=20 --num_warmup_steps=20 --save_checkpoints_setps=20 --iterations_per_loop=20 --max_level_steps=20

Do you maybe know what is going wrong?

Important tokens missing in vocabulary?

Hello,

Thanks for creating BERTje!
I'm working on a NER application that will classify named entities in Dutch text documents, so BERTje is really useful to me.

When trying to apply BERTje, I found that the tokeniser doesn't know some basic tokens, e.g. '@' for email addresses, but also lower-case single characters like 'o' or 'h', which are needed, for instance, to tokenise names that aren't frequent (or nonsensical, which happens a lot with internet domains). Interestingly enough, upper-case single characters do seem to be available.

Example:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-dutch-cased", force_download=True)

tokens = tokenizer.tokenize("Dit is een email adres: test@hjk.nl. Zou dat werken?")

print(tokens)

The result is:

['Dit', 'is', 'een', 'email', 'adres', ':', 'test', '[UNK]', '[UNK]', '.', '[UNK]', '.', 'Zou', 'dat', 'werken', '?']

As you can see, "@hjk" and even "nl" are not tokenised. This seems incorrect to me; there are many situations in text where an @ sign is used. Furthermore, there can very easily be non-frequent names that need single-character tokens in order to be tokenised by a subword vocabulary.

Am I missing something? If these important tokens are really missing, I can imagine that your NER benchmark results (reported in the README) can also be (much) better.
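
If such tokens are really needed downstream, one generic workaround (an assumption on my part, not something recommended in this repository) is to add them to the tokenizer and resize the embedding matrix before fine-tuning:

# Sketch: add a missing token (e.g. '@') and resize the embedding matrix.
# The new embedding row is randomly initialized and needs fine-tuning to be useful.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")
model = AutoModel.from_pretrained("GroNLP/bert-base-dutch-cased")

num_added = tokenizer.add_tokens(["@"])
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))

print(tokenizer.tokenize("Dit is een email adres: test@hjk.nl. Zou dat werken?"))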

Using BERTje word embeddings

Hi Wietse!

I am relatively new to using BERT models, and I was wondering if it is possible to access the word embeddings directly, so I can make them usable in other frameworks. In my specific use case, I want to use them as the embedding model in Top2Vec.

Is this possible and if yes, how can I do this?

thanks in advance!
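
For what it's worth, the static input embeddings can be read directly from the model, and contextual embeddings come from the encoder output; a minimal sketch, independent of Top2Vec:

# Sketch: two ways to get embeddings out of BERTje.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")
model = AutoModel.from_pretrained("GroNLP/bert-base-dutch-cased")

# 1) Static (context-independent) input embeddings: one 768-dim vector per vocabulary entry.
static_embeddings = model.get_input_embeddings().weight  # shape: (vocab_size, 768)

# 2) Contextual token embeddings for a sentence, taken from the last encoder layer.
inputs = tokenizer("Dit is een voorbeeldzin.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
token_vectors = outputs.last_hidden_state  # shape: (1, num_tokens, 768)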

Question

Hi,

My name is Nirav, currently pursuing Ph.D. at TU Delft on NL and NE extraction from dutch historical text.

I found your model on Hugging Face and really wanted to try it out in my own experiments. However, I don't know if this model can be used with spaCy on an Apple M1 chip.

Do you have any instructions or steps written specifically for using BERTje with spaCy on an Apple M1 machine?
Also, I was wondering if you have time to talk about the NLP pipeline in spaCy with the BERTje model.

Let me know.

Thank you.

Pre-training code

Thank you for creating BERTje and making it available!
In the paper I read you adjusted the pre-training tasks (Sentence Order Prediction and masking tokens instead of individual word pieces), would you be willing to share your pre-training code? I would like to continue pre-training on my own corpus.

Semantic similarity

I'm working on a Dutch bible project and am therefore interested in semantic similarity.
The only models I found that support semantic similarity in Dutch are multilingual models.

  • sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens
  • sentence-transformers/distilbert-multilingual-nli-stsb-quora-ranking

My plan for now is:

  • Find some model that supports Dutch
  • Train it on sentence similarity (how, and where do I get a decent dataset?)
  • There are some parallel bible translations that can be used as a start but there are no similarity scores
  • Evaluate the results

Are there any plans to make a sentence similarity model with BERTje? I'm also looking for datasets to train such a model on.
The RobBERT model also does not have a variant trained on sentence similarity.
Any suggestions that could help me?
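
One generic baseline (not an official BERTje recipe) is mean pooling of BERTje's token representations into sentence vectors and comparing them with cosine similarity; a rough sketch:

# Sketch: crude sentence embeddings via mean pooling over BERTje's last hidden layer.
# Without similarity fine-tuning (e.g. sentence-transformers style training), the
# resulting scores are only a rough baseline.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")
model = AutoModel.from_pretrained("GroNLP/bert-base-dutch-cased")

def embed(sentences):
    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state      # (batch, tokens, 768)
    mask = enc["attention_mask"].unsqueeze(-1)       # ignore padding positions
    return (hidden * mask).sum(1) / mask.sum(1)      # mean over real tokens only

a, b = embed(["In den beginne schiep God de hemel en de aarde.",
              "In het begin maakte God de hemel en de aarde."])
print(torch.cosine_similarity(a, b, dim=0))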

OS error when using 'bert-base-dutch-cased'

Hi! I'm using the model for my thesis and where it used to work when opening it from transformers with the suggested code, it now gives the following error:

OSError: Model name 'bert-base-dutch-cased' was not found in tokenizers model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, TurkuNLP/bert-base-finnish-cased-v1, TurkuNLP/bert-base-finnish-uncased-v1, wietsedv/bert-base-dutch-cased). We assumed 'bert-base-dutch-cased' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.txt'] but couldn't find such vocabulary files at this path or url.

It does work when I use 'wietsedv/bert-base-dutch-cased' which is in the model name list according to the error. Possibly the name has changed in transformers?

help :)

I don't know whether you are open to providing some help. I am getting the following error in a named entity recognition fine-tuning task that I am running on Google Colab.
This is my config:

data:
  name: "getuigenissen-ner"
  input: "/content/getuigenissen"
  num_labels: 25
 
model:
  shortname: "bertje"
  name: "wietsedv/bert-base-dutch-cased"
  type: "bert"
 
train:
  max_epochs: 200

And this is the error I get when starting the fine-tuning with: python main.py data/getuigenissen-ner

/content/bertje/finetuning/v2
2020-12-10 13:47:09.458303: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
importing config from "configs/default.yaml"
importing config from "configs/data/getuigenissen-ner.yaml"
data:
  cache: cache/{}-{}
  cfgs: [data/udlassy-pos, data/lassysmall-pos, data/conll2002-ner, data/sonar-ner,
    data/udlassy-ner, data/110kdbrd, data/110kdbrd-2, data/twisty, data/twisty2, data/twisty3,
    data/twisty-merge-4, data/twisty4-merge-4]
  clip_start: false
  dev: true
  input: /content/getuigenissen
  logs: logs/{}-{}
  merge: null
  name: getuigenissen-ner
  num_labels: 25
  num_sents: 1
  output: output/{}-{}
  token_level: true
  verify: false
eval: {batch_size: 64}
force: false
model:
  cfgs: [models/bertje, models/multi, models/bertnl, models/robbert]
  checkpoint: -1
  device: cuda
  do_export: true
  do_train: true
  lower_case: false
  name: wietsedv/bert-base-dutch-cased
  shortname: bertje
  type: bert
optimizer: {adam_epsilon: 1.0e-08, learning_rate: 5.0e-05, max_grad_norm: 1.0, warmup_steps: 512,
  weight_decay: 0.05}
summary: {groups: false, method: accuracy, probs: false, type: dev}
train: {attention_dropout: 0.2, batch_size: 6, eval_steps: 0.25, gradient_accumulation_steps: 4,
  hidden_dropout: 0.3, logging_steps: 0.1, max_epochs: 200, max_grad_norm: 1.0, seed: 42323}
verbose: true
Loading tokenizer "wietsedv/bert-base-dutch-cased"
Downloading: 100% 241k/241k [00:00<00:00, 19.8MB/s]
 ➤ Loading data from train.tsv
   dataset has 26 labels
  0% 0/1402 [00:00<?, ?it/s]Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
100% 1402/1402 [00:05<00:00, 260.09it/s]
 ➤ Cached data in cache/getuigenissen-ner-bertje/train.tsv.pkl
Train data: 1402 examples, 26 labels: ['O', 'b-activiteit', 'b-bedrag', 'b-beroep', 'b-beschrijving', 'b-citaat', 'b-emotie', 'b-geo', 'b-leeftijd', 'b-misdrijf', 'b-object', 'b-persoon', 'b-tijd', 'i-activiteit', 'i-bedrag', 'i-beroep', 'i-beschrijving', 'i-citaat', 'i-emotie', 'i-geo', 'i-leeftijd', 'i-misdrijf', 'i-object', 'i-persoon', 'i-tijd', 'o']
 ➤ Loading data from dev.tsv
   dataset has 25 labels
100% 589/589 [00:02<00:00, 245.37it/s]
 ➤ Cached data in cache/getuigenissen-ner-bertje/dev.tsv.pkl
Dev data: 589 examples
Loading model "wietsedv/bert-base-dutch-cased"
Downloading: 100% 433/433 [00:00<00:00, 618kB/s]
Downloading: 100% 439M/439M [00:04<00:00, 88.5MB/s]
Some weights of the model checkpoint at wietsedv/bert-base-dutch-cased were not used when initializing BertForTokenClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at wietsedv/bert-base-dutch-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Start training
Global step intervals: Logging=5 Eval=14
Starting at epoch 0
 > Start epoch 0/200
Batch:   0% 0/234 [00:00<?, ?it/s]THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCTensorMath.cu line=29 error=710 : device-side assert triggered
Batch:   0% 0/234 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "main.py", line 387, in <module>
    main()
  File "main.py", line 364, in main
    train(model, train_dataset, dev_dataset, state)
  File "main.py", line 165, in train
    loss.backward()
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuda runtime error (710) : device-side assert triggered at /pytorch/aten/src/THC/generic/THCTensorMath.cu:29
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [18,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [19,0,0] Assertion `t >= 0 && t < n_classes` failed.
RobBERT

License for pre-trained model

Hi!

First off, thanks a lot for making and sharing this model.

Second, is there a license that can be applied to the pre-trained model itself?

Thanks,
Josh

sentencepiece model

Many thanks for releasing these resources!
I'm trying to make an R wrapper around Pytorch as explained at https://huggingface.co/transformers/torchscript.html to connect to it from the C++ side such that it can be used as an R package. I have some questions

  1. would it be possible to also release the sentencepiece model which you indicate in the paper you have created
  2. if yes, does the sentencepiece model give the same token ids as the wordpiece model for which you have provided the vocabulary at https://bertje.s3.eu-central-1.amazonaws.com/v1/vocab.txt (how did you convert the sentencepiece to wordpiece?)

Reference to RobBERT

This reference shows a comparison with RobBERT:

https://arxiv.org/pdf/2001.06286.pdf

What are the advantages of "bertje" (maybe it is smaller/simpler/cheaper to run) ?

FYI, there is a discussion on LinkedIn (sign-up required) at:

https://www.linkedin.com/feed/update/urn:li:activity:6631077105952178176?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A6631077105952178176%2C6631183002305077250%29&replyUrn=urn%3Ali%3Acomment%3A%28activity%3A6631077105952178176%2C6631187459520675841%29

Fine-tuning BERTje for custom NER

Hello, I'd like to fine-tune BERTje for custom named-entity recognition in Dutch (for example, to recognize street names). Is this possible by initializing BertForTokenClassification with 'bert-base-dutch-cased'? And also, do you think this approach is viable? How many annotated training examples would be roughly needed to obtain a reasonable performance? Is this approach possible with 200 annotated sentences for every entity type?

Ideally, the BERTje model fine-tuned on CoNLL-2002/SoNaR-1 would be even better in terms of transfer learning. But I see you're planning to release these fine-tuned models in the future, so looking forward to that.

prepare-ud.py

I'm looking into this model in order to fine-tune a NER task on 18th-19th century Dutch texts. While I was preparing my data (I'm on Windows for data preparation; fine-tuning will happen on Google Colab) and looking at the structure you require the data to be in as input to your finetuning script, I ran the prepare-ud script.
That gave me the following error on Windows; I had to spin up an Ubuntu machine, where the code did work.
Just putting this information here so you are aware of the encoding issue.

$ python finetuning/prepare/prepare-ud.py -i "C:\Users\Jan\Dropbox\Work\RForgeBNOSAC\OpenSource\UD_Dutch-LassySmall" -o "data"
C:\Users\Jan\Dropbox\Work\RForgeBNOSAC\OpenSource\UD_Dutch-LassySmall
data
 > Preparing NER data
Traceback (most recent call last):
  File "finetuning/prepare/prepare-ud.py", line 104, in <module>
    main()
  File "finetuning/prepare/prepare-ud.py", line 100, in main
    save_data(prepare_ud(args.in_path), args.out_path)
  File "finetuning/prepare/prepare-ud.py", line 36, in prepare_ud
    train = read_conllu(train_path)
  File "finetuning/prepare/prepare-ud.py", line 9, in read_conllu
    for line in f:
  File "C:\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 1287: character maps to <undefined>
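
For reference, the traceback shows the file being read with Windows' default cp1252 codec; a likely workaround, sketched here as an assumption rather than a verified patch to this repository, is to open the CoNLL-U files with an explicit UTF-8 encoding:

# Sketch: read a CoNLL-U file with an explicit UTF-8 encoding instead of relying on
# the platform default (cp1252 on Windows), which is what the traceback points at.
def read_conllu(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")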

deploy on serverless architecture

Does anyone have experience with deploying this to a serverless architecture?
AWS Lambda has size restrictions (a maximum of 512MB of temp storage), which is a problem.
Are there other options? BERT can also be GPU-resource heavy if it has to run full-time.
