
berts's People

Contributors

jbaiter, stefan-it


berts's Issues

German ELECTRA?

Hi guys, tremendous work, I love using your German pretrained models. I was wondering if there is any plan to release a pretrained version of ELECTRA. It seems like a really good approach.
Thanks!
Luca :)

Info about the German model

Hey,

Is the following

The source data for the model consists of a recent Wikipedia dump, EU Bookshop corpus, Open Subtitles, CommonCrawl, ParaCrawl and News Crawl. This results in a dataset with a size of 16GB and 2,350,234,427 tokens.

For sentence splitting, we use spacy. Our preprocessing steps (sentence piece model for vocab generation) follow those used for training SciBERT. The model is trained with an initial sequence length of 512 subwords and was performed for 1.5M steps.

the only information available about the German BERT model? Or is there a paper about it?

I am interested in the News Crawl. Do you know which "news" sources it contains (I assume you used a language classifier to keep German news only)? Is it really an archive of online news sources like spiegel.de, WELT, or n-tv? Getting hold of such German news article text is currently also one of my goals.

Thank you!

PS:

What is the difference between these models?

dbmdz/bert-base-german-uncased and

bert-base-german-dbmdz-uncased

Bert-ita-xxl

Hi, could you please specify in what proportions Wikipedia, the OPUS corpora and the OSCAR corpus were used for training bert-ita-xxl? Thanks

German BERT Dataset sampling

Hi,
did you sample each dataset (Wikipedia, Common Crawl, Subtitles, etc.) equally during German BERT training?
OpenAI uses unequal sampling, which may lead to better results, as stated in the GPT-3 paper:

Note that during training, datasets are not sampled in proportion to their size, but rather datasets we view as higher-quality are sampled more frequently, such that CommonCrawl and Books2 datasets are sampled less than once during training, but the other datasets are sampled 2-3 times. This essentially accepts a small amount of overfitting in exchange for higher quality training data.

If so, which parameters did you use?

GPT-3-Table

BERT-ita-xxl - Question about corpus

Greetings dear dbmdz team

One question that I would like to ask: as I see from the Hugging Face page of the model, "The source data for the Italian BERT model consists of a recent Wikipedia dump and various texts from the OPUS corpora collection... For the XXL Italian models, we use the same training data from OPUS and extend it with data from the Italian part of the OSCAR corpus". Were these datasets originally written in Italian, or were they English texts that were translated?

I'm asking this because my medical corpus was originally written in English, and I used the Google Translate API to translate it. So I would like to estimate the bias introduced by this operation.

Many thanks,
Cheers

max sequence length

How can I use the Turkish BERT sentiment cased model to calculate sentiment scores for sentences that exceed the 512-token sequence length?
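
A minimal sketch of one way to handle texts longer than 512 tokens, assuming a Hugging Face sequence-classification checkpoint (the model path below is a placeholder, not an official name): split the input into overlapping windows, score each window, and average the probabilities.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "path/to/turkish-sentiment-model"  # placeholder model path
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def long_text_sentiment(text, max_length=512, stride=128):
    # Tokenize into overlapping windows so no part of the text is dropped.
    enc = tokenizer(
        text,
        max_length=max_length,
        stride=stride,
        truncation=True,
        return_overflowing_tokens=True,
        padding=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(input_ids=enc["input_ids"],
                       attention_mask=enc["attention_mask"]).logits
    # Average the per-window class probabilities as one simple aggregation.
    return torch.softmax(logits, dim=-1).mean(dim=0)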

bert-ita-xxl-cased training set

Hi, I don't quite understand how large the training set of bert-ita-xxl-cased is.
The Hugging Face page reports the size of the "training corpus" as 13B tokens; is that the size of the training set or of the entire dataset used?
Thanks

Potential publishing of TF checkpoints

Awesome work in creating another German BERT model trained on rather scientific texts, dbmdz team!
I would like to use your model with bert-as-service and would need the TF checkpoints for that. Do you have them lying around somewhere by any chance?

wrong dimension of bert-base-italian-xxl vocabularies

Hi, thanks again for these models! I was trying to use the bert-base-italian-xxl models, but I noticed that there is a discrepancy between the vocabulary size in the config.json file (32102) and the actual size of the vocabulary (31102). Is it possible that the wrong vocabulary was uploaded?
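
A quick way to check the reported discrepancy, as a minimal sketch:

from transformers import AutoConfig, AutoTokenizer

name = "dbmdz/bert-base-italian-xxl-cased"
config = AutoConfig.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)

# Compare the vocabulary size declared in the config with the tokenizer's
# actual vocabulary size.
print("config.vocab_size:   ", config.vocab_size)
print("tokenizer vocabulary:", len(tokenizer))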

Advice on evaluating BERT models every n steps

Hi there, thank you for all of the helpful advice on training Transformer models!

In your recent paper German’s Next Language Model, you compare ELECTRA and BERT at various checkpoints. I have tried to do the same thing on my own data. For ELECTRA it is working (saving every 50k steps to GCS bucket) but for BERT the checkpoints are being overwritten. I have tried setting a value for save_checkpoint_steps but this still seems to just keep the 5 most-recent checkpoints. May I ask how you were able to keep the older checkpoints from being overwritten?

I am using the official BERT repository from Google: https://github.com/google-research/bert

Thanks!
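
For reference, a minimal sketch of one possible workaround, assuming the RunConfig built in run_pretraining.py accepts the same keyword argument as tf.estimator.RunConfig; keep_checkpoint_max defaults to 5, which matches the behaviour described above.

import tensorflow as tf

# Raising (or disabling) keep_checkpoint_max stops older checkpoints from
# being deleted; save_checkpoints_steps controls how often they are written.
run_config = tf.estimator.RunConfig(
    model_dir="gs://your-bucket/bert-pretraining",  # placeholder bucket path
    save_checkpoints_steps=50000,
    keep_checkpoint_max=None,  # None keeps every checkpoint that is saved
)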

dbmdz/bert-base-italian-xxl-cased TFBertModel not working at all

Hi, I'm trying to use dbmdz/bert-base-italian-xxl-cased for creating a Keras model for a classification task.
I've followed the documentation but I continue to receive the following error:

tensorflow.python.framework.errors_impl.InvalidArgumentError:  indices[5,0] = 102 is not in [0, 2)
	 [[node functional_1/bert/embeddings/token_type_embeddings/embedding_lookup (defined at /anaconda3/envs/profanity-detector/lib/python3.7/site-packages/transformers/modeling_tf_bert.py:186) ]] [Op:__inference_train_function_29179]

This is the model:

import tensorflow as tf
from transformers import TFBertModel, BertTokenizer

import constants  # project module providing MAX_SEQ_LENGTH and CLASSES

bert_model = TFBertModel.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
tokenizer = BertTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased")

input_ids = tf.keras.layers.Input(shape=(constants.MAX_SEQ_LENGTH,), dtype=tf.int32)
token_type_ids = tf.keras.layers.Input(shape=(constants.MAX_SEQ_LENGTH,), dtype=tf.int32)
attention_mask = tf.keras.layers.Input(shape=(constants.MAX_SEQ_LENGTH,), dtype=tf.int32)

seq_output, _ = bert_model({
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask
})

pooling = tf.keras.layers.GlobalAveragePooling1D()(seq_output)
dropout = tf.keras.layers.Dropout(0.2)(pooling)
output = tf.keras.layers.Dense(constants.CLASSES, activation="softmax")(dropout)

model = tf.keras.Model(
        inputs=[input_ids, token_type_ids, attention_mask],
        outputs=[output]
)

model.compile(optimizer=tf.optimizers.Adam(lr=0.00001), loss='sparse_categorical_crossentropy', metrics=['accuracy'])

My dataset is tokenized by this method:

    def map_to_dict(self, input_ids, attention_masks, token_type_ids, labels):
        return {
            "input_ids": input_ids,
            "token_type_ids": token_type_ids,
            "attention_mask": attention_masks,
        }, labels

    def tokenize_sequences(self, tokenizer, max_sequence_length, data, labels):
        try:
            token_ids = []
            token_type_ids = []
            attention_mask = []

            for sentence in data:
                bert_input = tokenizer.encode_plus(
                    sentence,
                    add_special_tokens=True,  # add [CLS], [SEP]
                    max_length=max_sequence_length,  # max length of the text that can go to BERT
                    truncation=True,
                    pad_to_max_length=True,  # add [PAD] tokens
                    return_attention_mask=True  # add attention mask to not focus on pad tokens
                )

                token_ids.append(bert_input["input_ids"])
                token_type_ids.append(bert_input["token_type_ids"])
                attention_mask.append(bert_input["attention_mask"])

            return tf.data.Dataset.from_tensor_slices((token_ids, token_type_ids, attention_mask, labels)).map(self.map_to_dict)
        except Exception as e:
            stacktrace = traceback.format_exc()

            logger.error("{}".format(stacktrace))

            raise e

ds_train_encoded = tokenize_sequences(tokenizer, 512, X_train, y_train).shuffle(10000).batch(6)

X_train examples:

["Questo video è davvero bellissimo", "La qualità del video non è proprio il massimo"......]

y_train examples:

[[1], [0]...]

I continue to receive the error described before.

tensorflow.python.framework.errors_impl.InvalidArgumentError:  indices[5,0] = 102 is not in [0, 2)
	 [[node functional_1/bert/embeddings/token_type_embeddings/embedding_lookup (defined at /anaconda3/envs/profanity-detector/lib/python3.7/site-packages/transformers/modeling_tf_bert.py:186) ]] [Op:__inference_train_function_29179]

If I try to use TFBertForSequenceClassification everything works fine (for this reason I'm excluding tokenization problems).

Can you please provide a solution, or point me to a well-documented guide for using the TFBertModel class with a Keras model? (I cannot find one.)

Thank you
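
For what it's worth, a minimal sketch of an alternative call pattern that may avoid the dict-routing problem, assuming a recent transformers release where the model output exposes last_hidden_state; this is only a sketch, not a confirmed fix.

import tensorflow as tf
from transformers import TFBertModel

bert_model = TFBertModel.from_pretrained("dbmdz/bert-base-italian-xxl-cased")

input_ids = tf.keras.layers.Input(shape=(512,), dtype=tf.int32, name="input_ids")
token_type_ids = tf.keras.layers.Input(shape=(512,), dtype=tf.int32, name="token_type_ids")
attention_mask = tf.keras.layers.Input(shape=(512,), dtype=tf.int32, name="attention_mask")

# Explicit keyword arguments prevent input_ids from being routed into the
# token_type_ids slot, which is what the "not in [0, 2)" error suggests.
outputs = bert_model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    token_type_ids=token_type_ids,
)
seq_output = outputs.last_hidden_state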

TensorFlow checkpoint for Italian BERT

Hello there,
First of all, thank you very much for your awesome work! Such repositories are of great help.
I was wondering if it is possible to get access to TensorFlow checkpoints for the Italian pre-trained version of BERT (or if some are already available). I'm currently looking for a good pre-trained Italian model for research.

Thanks in advance!
Best,
Federico

Bert for QA task

Hi there,

First, thanks for making and sharing these German Bert models.

I am wondering if it is possible to fine-tune the models on question answering tasks. If so, what is the procedure to follow?

Thanks in advance.
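
A minimal sketch of how such a fine-tune might start, assuming the standard transformers question-answering head; the dataset mentioned in the comment is only an example.

from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_name = "dbmdz/bert-base-german-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
# From here the usual SQuAD-style fine-tuning recipe applies, e.g. the
# run_qa.py example script in the transformers repository, trained on a
# German QA dataset such as GermanQuAD.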

Italian BERT

Dear dbmdz team

I would like to use one of the Italian BERT models that you pre-trained to create a model for a specific domain (medical language).
I would like to ask a few things that are not entirely clear to me:

  • Have these models been trained with Whole Word Masking?
  • I assume that for these models, 10% of the [MASK] positions are left unchanged and 10% are replaced with a random token; is that correct?
  • Were these models also trained on Next Sentence Prediction?

Sorry if these may be trivial questions. Thanks a lot

Issue with TensorFlow

Hello, I am using the bert-base-german-cased model with TensorFlow, but in the embedding layer I get this error:

TypeError Traceback (most recent call last)
Cell In[57], line 1
----> 1 TFBertEmbeddings = bert(input_ids,attention_mask = attention_mask)[1]

File ~/work/myenv/lib/python3.11/site-packages/tf_keras/src/utils/traceback_utils.py:70, in filter_traceback.<locals>.error_handler(*args, **kwargs)
67 filtered_tb = _process_traceback_frames(e.__traceback__)
68 # To get the full stack trace, call:
69 # tf.debugging.disable_traceback_filtering()
---> 70 raise e.with_traceback(filtered_tb) from None
71 finally:
72 del filtered_tb

File ~/work/myenv/lib/python3.11/site-packages/transformers/modeling_tf_utils.py:428, in unpack_inputs.<locals>.run_call_with_unpacked_inputs(self, *args, **kwargs)
425 config = self.config
427 unpacked_inputs = input_processing(func, config, **fn_args_and_kwargs)
--> 428 return func(self, **unpacked_inputs)

File ~/work/myenv/lib/python3.11/site-packages/transformers/models/bert/modeling_tf_bert.py:1234, in TFBertModel.call(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict, training)
1190 @unpack_inputs
1191 @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
1192 @add_code_sample_docstrings(
(...)
1212 training: Optional[bool] = False,
1213 ) -> Union[TFBaseModelOutputWithPoolingAndCrossAttentions, Tuple[tf.Tensor]]:
1214 r"""
1215 encoder_hidden_states (tf.Tensor of shape (batch_size, sequence_length, hidden_size), optional):
1216 Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if
(...)
1232 past_key_values). Set to False during training, True during generation
1233 """
-> 1234 outputs = self.bert(
1235 input_ids=input_ids,
1236 attention_mask=attention_mask,
1237 token_type_ids=token_type_ids,
1238 position_ids=position_ids,
1239 head_mask=head_mask,
1240 inputs_embeds=inputs_embeds,
1241 encoder_hidden_states=encoder_hidden_states,
1242 encoder_attention_mask=encoder_attention_mask,
1243 past_key_values=past_key_values,
1244 use_cache=use_cache,
1245 output_attentions=output_attentions,
1246 output_hidden_states=output_hidden_states,
1247 return_dict=return_dict,
1248 training=training,
1249 )
1250 return outputs

File ~/work/myenv/lib/python3.11/site-packages/transformers/modeling_tf_utils.py:428, in unpack_inputs.<locals>.run_call_with_unpacked_inputs(self, *args, **kwargs)
425 config = self.config
427 unpacked_inputs = input_processing(func, config, **fn_args_and_kwargs)
--> 428 return func(self, **unpacked_inputs)

File ~/work/myenv/lib/python3.11/site-packages/transformers/models/bert/modeling_tf_bert.py:912, in TFBertMainLayer.call(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict, training)
909 if token_type_ids is None:
910 token_type_ids = tf.fill(dims=input_shape, value=0)
--> 912 embedding_output = self.embeddings(
913 input_ids=input_ids,
914 position_ids=position_ids,
915 token_type_ids=token_type_ids,
916 inputs_embeds=inputs_embeds,
917 past_key_values_length=past_key_values_length,
918 training=training,
919 )
921 # We create a 3D attention mask from a 2D tensor mask.
922 # Sizes are [batch_size, 1, 1, to_seq_length]
923 # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]
924 # this attention mask is more simple than the triangular masking of causal attention
925 # used in OpenAI GPT, we just need to prepare the broadcast dimension here.
926 attention_mask_shape = shape_list(attention_mask)

File ~/work/myenv/lib/python3.11/site-packages/transformers/models/bert/modeling_tf_bert.py:206, in TFBertEmbeddings.call(self, input_ids, position_ids, token_type_ids, inputs_embeds, past_key_values_length, training)
203 raise ValueError("Need to provide either input_ids or input_embeds.")
205 if input_ids is not None:
--> 206 check_embeddings_within_bounds(input_ids, self.config.vocab_size)
207 inputs_embeds = tf.gather(params=self.weight, indices=input_ids)
209 input_shape = shape_list(inputs_embeds)[:-1]

File ~/work/myenv/lib/python3.11/site-packages/transformers/tf_utils.py:163, in check_embeddings_within_bounds(tensor, embed_dim, tensor_name)
153 def check_embeddings_within_bounds(tensor: tf.Tensor, embed_dim: int, tensor_name: str = "input_ids") -> None:
154 """
155 tf.gather, on which TF embedding layers are based, won't check positive out of bound indices on GPU, returning
156 zeros instead. This function adds a check against that dangerous silent behavior.
(...)
161 tensor_name (str, optional): The name of the tensor to use in the error message.
162 """
--> 163 tf.debugging.assert_less(
164 tensor,
165 tf.cast(embed_dim, dtype=tensor.dtype),
166 message=(
167 f"The maximum value of {tensor_name} ({tf.math.reduce_max(tensor)}) must be smaller than the embedding "
168 f"layer's input dimension ({embed_dim}). The likely cause is some problem at tokenization time."
169 ),
170 )

File ~/work/myenv/lib/python3.11/site-packages/keras/src/layers/core/tf_op_layer.py:119, in KerasOpDispatcher.handle(self, op, args, kwargs)
114 """Handle the specified operation with the specified arguments."""
115 if any(
116 isinstance(x, keras_tensor.KerasTensor)
117 for x in tf.nest.flatten([args, kwargs])
118 ):
--> 119 return TFOpLambda(op)(*args, **kwargs)
120 else:
121 return self.NOT_SUPPORTED

File ~/work/myenv/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py:70, in filter_traceback.<locals>.error_handler(*args, **kwargs)
67 filtered_tb = _process_traceback_frames(e.__traceback__)
68 # To get the full stack trace, call:
69 # tf.debugging.disable_traceback_filtering()
---> 70 raise e.with_traceback(filtered_tb) from None
71 finally:
72 del filtered_tb

TypeError: Exception encountered when calling layer 'embeddings' (type TFBertEmbeddings).

Could not build a TypeSpec for name: "tf.debugging.assert_less_2/assert_less/Assert/Assert"
op: "Assert"
input: "tf.debugging.assert_less_2/assert_less/All"
input: "tf.debugging.assert_less_2/assert_less/Assert/Assert/data_0"
input: "tf.debugging.assert_less_2/assert_less/Assert/Assert/data_1"
input: "tf.debugging.assert_less_2/assert_less/Assert/Assert/data_2"
input: "Placeholder"
input: "tf.debugging.assert_less_2/assert_less/Assert/Assert/data_4"
input: "tf.debugging.assert_less_2/assert_less/y"
attr {
key: "summarize"
value {
i: 3
}
}
attr {
key: "T"
value {
list {
type: DT_STRING
type: DT_STRING
type: DT_STRING
type: DT_INT32
type: DT_STRING
type: DT_INT32
}
}
}
of unsupported type <class 'tensorflow.python.framework.ops.Operation'>.

Call arguments received by layer 'embeddings' (type TFBertEmbeddings):
• input_ids=<KerasTensor: shape=(None, 400) dtype=int32 (created by layer 'input_ids')>
• position_ids=None
• token_type_ids=<KerasTensor: shape=(None, 400) dtype=int32 (created by layer 'tf.fill_3')>
• inputs_embeds=None
• past_key_values_length=0
• training=False

I have tried everything but have not been able to solve the issue.
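
One common workaround, sketched below under the assumption that the failure comes from calling the model on symbolic KerasTensor inputs: call the transformers model inside a subclassed Keras model so the internal bounds check runs on real tensors.

import tensorflow as tf
from transformers import TFBertModel

class BertEncoder(tf.keras.Model):
    def __init__(self, model_name="dbmdz/bert-base-german-cased"):
        super().__init__()
        self.bert = TFBertModel.from_pretrained(model_name)

    def call(self, inputs, training=False):
        outputs = self.bert(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            training=training,
        )
        # Pooled [CLS] representation, analogous to bert(...)[1] above.
        return outputs.pooler_output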

Release TF Model for BERTurk for TF-only environments

Hi, thanks a lot for releasing the Turkish BERT model -- your work is just amazing. I'm raising this issue per your statement here. This is especially needed in TF-only environments; Rasa is one such example. With the HFTransformersNLP component added in v1.8, we can now use Rasa with Hugging Face Transformers models. As I read from the Rasa source code here, they load weights specifically with the TFBertModel class, which is expectedly unable to load the missing tf_model.h5 in dbmdz/bert-base-turkish-cased. I tried to update the Rasa source code to use AutoTokenizer and AutoModel; however, it seems that this might require a breaking change in several core Rasa features. I think that a TF model might also be useful in other TF-only environments. So, is it possible for you to release it? I can willingly and happily collaborate on this if help is needed.
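
Until an official tf_model.h5 is published, a minimal sketch of a local conversion (assuming both torch and TensorFlow are installed) that produces the file TFBertModel expects:

from transformers import TFBertModel

# from_pt=True converts the published PyTorch weights on the fly.
model = TFBertModel.from_pretrained("dbmdz/bert-base-turkish-cased", from_pt=True)
model.save_pretrained("./bert-base-turkish-cased-tf")  # writes tf_model.h5 and config.json

Pointing the TF-only consumer at that local directory instead of the hub name may then be a workable interim solution.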

Handling of German special characters

Hi, I am wondering how the tokenizer or the German model will treat input words with special characters like "ß", "ö", "ä", "ü".

I have some input sentences in Latin-1 where the special characters are normalized, like "ß" -> "ss" or "ö" -> "oe". Will training with this data be effective, or do I have to convert the special characters back to "ß", "ö", "ä", "ü" again?

Thanks
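
A quick check, as a sketch, of how the cased German tokenizer splits the original umlaut forms versus the transliterated ones:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-cased")

# Print the word-piece split for each spelling variant side by side.
for word in ["Straße", "Strasse", "schön", "schoen", "Übung", "Uebung"]:
    print(word, "->", tokenizer.tokenize(word))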

Reference paper for Italian BERT XXL?

Hi DBMDZ team, this is not really an issue but I don't know how else I can get in touch with you.
My team and I are using the dbmdz/bert-base-italian-xxl-cased model from Hugging Face for a research project, and we are now in the phase of writing a paper about our initial results. We would be really glad to cite your work, on which ours is based, but I cannot find any reference paper. Can you point it out for me?
By the way, thanks very much for your amazing work.

M.

TensorFlow checkpoints for Turkish

Hi,
I need access to the TensorFlow checkpoints since I want to use the weights in an additional layer in my proposed architecture.
Sincerely yours

Tokenizer results in blank token for extended UTF-8 characters

There is a question mark character in one of the Universal Dependencies datasets which gets wiped out by the tokenizer for the Italian BERT & ELECTRA models:

https://github.com/UniversalDependencies/UD_Italian-PoSTWITA

warning: big file
https://raw.githubusercontent.com/UniversalDependencies/UD_Italian-PoSTWITA/master/it_postwita-ud-train.conllu

search for "ewww" in the training file

It looks like this if I copy and paste it:

ewww 󾓺 — in viaggio Roma

according to emacs describe-char, it is character 0xFE4FA

Anyway, hopefully that's enough background to figure out which character is causing the problem. If I run the following sentences through the tokenizer with tokenizer.tokenize(sentence) I get the following:

ewww 🐈 — in viaggio Roma   # another random character
ewww 󾓺 — in viaggio Roma    # to test, maybe need to check that this is the weird character, not just a box
ewww — in viaggio Roma
# i printed the word pieces & their IDs
(['e', '##www', '[UNK]', '—', 'in', 'viaggio', 'Roma'], [126, 18224, 101, 986, 139, 2395, 2097])
(['e', '##www', '—', 'in', 'viaggio', 'Roma'], [126, 18224, 986, 139, 2395, 2097])
(['e', '##www', '—', 'in', 'viaggio', 'Roma'], [126, 18224, 986, 139, 2395, 2097])

The missing word causes confusion for me when trying to correlate the Bert embeddings with the words they represent. Can the tokenizer be fixed to treat that character (or any other strange character) as [UNK] as well?
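
A minimal sketch to reproduce the behaviour described above, using chr(0xFE4FA) directly instead of pasting the character:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-cased")

# Build the sentence programmatically so the private-use character survives
# copy/paste, then inspect the resulting word pieces and their IDs.
sentence = "ewww " + chr(0xFE4FA) + " — in viaggio Roma"
tokens = tokenizer.tokenize(sentence)
print(tokens)
print(tokenizer.convert_tokens_to_ids(tokens))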

Text Corpus to train and Open Source RoBERTa Model

Hi,
I read about your German BERT model on Hugging Face. I would like to train a RoBERTa model.
Since I also want to give the work back to the community as open source, and could reference you:

Is it possible to use your German text corpus? You write:

recent Wikipedia dump, EU Bookshop corpus, Open Subtitles, CommonCrawl, ParaCrawl and News Crawl. This results in a dataset with a size of 16GB and 2,350,234,427 tokens.

Citations

Would it be possible to provide a BibTeX entry for citing your model in papers and theses?
Even if it is not an official paper, there are a few models on Hugging Face with a BibTeX entry to at least acknowledge their work.
It's okay to cite the repo URL, but at least some more information, like the authors, would be nice :)

How to use Italian models with TFBertModel (headless)?

Hello,
for starters thanks for creating the Italian models.

I'm pretty new to the whole transformers / BERT architecture, so forgive me if this question is dumb. Anyway, I have a dataset on which I'm trying to do multi-class NLP classification. Each row is a sentence of variable length plus the target class in a separate attribute.

I started by using:

TFBertModel.from_pretrained('bert-base-multilingual-cased')

Since I wasn't entirely happy with the results, I tried changing the argument from the multilingual model to the dbmdz Italian one, but received this error:

AssertionError: Error retrieving file /root/.cache/torch/transformers/076c0d3e6f9148ae7c8bc48e2818d4ff03ec5bc68115b361cfa8b1795b4c9683.h5

I had also tried using AutoModel, but that doesn't seem to fit my case as it's not headless.

Any suggestion is gonna be much appreciated.
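
The missing-.h5 error usually means only PyTorch weights are published for that model; as a sketch, loading with from_pt=True (which requires torch to be installed) converts them on the fly and yields the same headless encoder class used above.

from transformers import TFBertModel

# Converts the PyTorch weights to TF at load time; no tf_model.h5 is needed.
bert = TFBertModel.from_pretrained("dbmdz/bert-base-italian-cased", from_pt=True)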

German BERT on older pytorch-transformers

Hi, we're using the deepset BERT model together with AllenNLP 0.9.0 (i.e. the older pytorch-transformers library). Is there an easy way to make the dbmdz models (cased & uncased) usable in that environment? A non-easy way would also do. Thanks a lot!
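
One possible approach, sketched under the assumption that pytorch-transformers 1.x can load any BERT checkpoint from a local directory: download config.json, pytorch_model.bin and vocab.txt for the dbmdz model from the Hugging Face hub and point the old library at that directory.

from pytorch_transformers import BertModel, BertTokenizer

# local_dir contains config.json, pytorch_model.bin and vocab.txt downloaded
# from the dbmdz/bert-base-german-cased model page.
local_dir = "./bert-base-german-dbmdz-cased"
tokenizer = BertTokenizer.from_pretrained(local_dir)
model = BertModel.from_pretrained(local_dir)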

Publishing of TF checkpoints for distilbert-base-german-cased

Hi dbmdz team,

it's me again^^
I just saw that there is a PyTorch model for distilbert-base-german-cased in Hugging Face's repo. After my last test with the bigger model, we, the IKON team at FU Berlin's HCC lab, would be super excited to use these models in our application. Did you also run this distillation experiment by any chance and have the TF checkpoints lying around?

Error(s) in loading state_dict for ElectraForTokenClassification

size mismatch for classifier.weight: copying a param with shape torch.Size([8, 1024]) from checkpoint, the shape in current model is torch.Size([9, 1024]).
Hello, I want to know why your config.json file only has 8 labels for the CoNLL-2003 dataset; I think it should have 9 labels.
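
If the goal is to fine-tune with a different label set, one possible sketch is to re-initialise the classification head; the model identifier below is an assumption, and ignore_mismatched_sizes requires a reasonably recent transformers version.

from transformers import ElectraForTokenClassification

model = ElectraForTokenClassification.from_pretrained(
    "dbmdz/electra-large-discriminator-finetuned-conll03-english",  # assumed model id
    num_labels=9,
    ignore_mismatched_sizes=True,  # drop the released 8-way head and start fresh
)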

Keyword extraction project

Hello, I am doing a project for my graduation thesis. I am trying to find the words that best represent a given text, for example the top 3 keywords. I would like to use your model for this; could you help me with how to proceed? Thanks.
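
A minimal sketch of one simple approach, assuming the Turkish BERT encoder and a list of candidate words: embed the document and each candidate, then rank the candidates by cosine similarity to the document embedding.

import torch
from transformers import AutoTokenizer, AutoModel

name = "dbmdz/bert-base-turkish-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def embed(text):
    # Mean-pooled token embeddings as a simple text representation.
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state
    return hidden.mean(dim=1).squeeze(0)

def top_keywords(document, candidates, k=3):
    doc_vec = embed(document)
    scores = {w: torch.cosine_similarity(doc_vec, embed(w), dim=0).item()
              for w in candidates}
    return sorted(scores, key=scores.get, reverse=True)[:k]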

Training on NER

Hi.
How can I train the XXL Italian model for a downstream NER task?
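
A minimal sketch of the starting point, assuming a standard token-classification fine-tune; the label count is only an example.

from transformers import AutoTokenizer, AutoModelForTokenClassification

name = "dbmdz/bert-base-italian-xxl-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(name, num_labels=9)  # example label count
# From here the usual NER recipe applies, e.g. the run_ner.py example script
# in the transformers repository.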

Publishing of TF checkpoints for italian models

Thank you very much for creating these great BERT models in Italian! I've read that you plan to release the TF checkpoints, which I would greatly appreciate. May I ask when you are planning to release them? Thanks again

Domain of the corpora used for pretraining

Hello!

First of all, thanks for this wonderful collection of pretrained models. I wonder what the domain of the corpora used for pretraining BERTurk, DistilBERTurk, ConvBERTurk, and ELECTRA is. I would like to cite these models in a scientific publication and give an idea of the domain knowledge made available to the model during pretraining.

License for models

Hi!

First off, thanks for making and sharing these models.

Second, is there a license that can be applied to the pre-trained models themselves?

Thanks,
Josh

BERT Turkish Uncased Model

Hi,
I have one question.
For how many epochs was the BERTurk uncased model trained?
Thanks in advance for your answer.

creation of vocab

Hey guys,

great to see another public BERT model for German 👍

Could you say something about how you managed to handle the 16GB of text data in SentencePiece for vocab creation? I get "bad_alloc" all the time when I try to process all of my data. I am aware of VOC_SIZE, INPUT_SENTENCE_SIZE, SHUFFLE_INPUT_SENTENCE, etc., but still ... I don't want to use a subset of my data. What's your approach?
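
For comparison, a minimal sketch of the SentencePiece settings that usually keep memory under control on corpora of this size; whether sampling can be avoided entirely depends on available RAM, and train_extremely_large_corpus is only available in newer sentencepiece releases.

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",                  # one sentence per line
    model_prefix="german_vocab",
    vocab_size=32000,
    input_sentence_size=10000000,        # sample at most 10M sentences
    shuffle_input_sentence=True,
    train_extremely_large_corpus=True,   # may be needed for very large inputs
)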

DistilBERTurk TF checkpoints

Hi guys.

Could you guys please release TF checkpoints for DistilBERTurk?
I have my own Turkish ALBERT, though with less-than-desired performance because I used only the Wiki dump and some PDFs as the original training dataset.
I'd like to use your DistilBERT in my intent & slot prediction TF pipeline to compare the accuracies of these models.
