getalp / flaubert
Unsupervised Language Model Pre-training for French
License: Other
The example in the readme for running with the Hugging Face library reads:
# Choose among ['flaubert-small-cased', 'flaubert-base-uncased', 'flaubert-base-cased', 'flaubert-large-cased']
modelname = 'flaubert-base-cased'
but these don't work for me. Instead, what worked was:
# Choose among ['flaubert/flaubert_small_cased', 'flaubert/flaubert_base_uncased', 'flaubert/flaubert_base_cased', 'flaubert/flaubert_large_cased']
modelname = 'flaubert/flaubert_base_cased'
Would it be possible to pre-train Flaubert (as described here https://github.com/getalp/Flaubert#3-pre-training-flaubert) in Google Colab? If so, would you be able to give me some advice on converting the code you supply so that it works in Colab?
Thank you in advance for your time. :)
Hi, I trained my own BERT with your code.
Do you provide code for converting an XLM model to a Hugging Face model, i.e. from checkpoint.pth to pytorch_model.bin and config.json? Also, is there a class for the tokenizer?
Thank you!
Hello,
I am using FlauBERT with Hugging Face transformers but I am getting negative vectors, and I do not know why.
Do you have any idea why? When using your previous setup (without the transformers library), the vectors were non-negative.
tensor([[[-0.7107, -1.4737, 0.9828, ..., -3.1127, 0.3499, 0.3747],
[-1.7784, -2.2899, -0.3564, ..., -0.7091, -0.7323, 2.5189],
[-1.3886, -0.6179, -1.2058, ..., -0.5359, -0.5513, 0.4187],
...,
[ 1.0148, -2.3575, -0.9193, ..., -2.2937, 0.5353, 0.6768],
[ 0.9082, -2.2049, -1.2324, ..., -2.1818, 0.5374, 0.8427],
[ 0.9287, -1.7471, -1.7460, ..., -2.1425, 0.5514, 0.7766]],
[[-0.9680, -1.0419, 0.6869, ..., -2.0340, 0.5130, -0.0712],
[-0.9206, -1.6208, -0.1860, ..., 0.6649, -1.0081, -0.8190],
[-1.1192, -1.6612, -0.2082, ..., -0.2693, -1.8960, -0.6337],
...,
[ 0.1435, -2.2177, -0.5040, ..., -1.8233, 0.8659, 0.0530],
[ 0.0633, -2.1492, -0.5994, ..., -1.7436, 0.7928, 0.1566],
[ 0.0480, -1.8401, -0.7709, ..., -1.6504, 0.7304, 0.0832]],
[[-0.5683, -2.7551, 1.0667, ..., -3.7187, 1.6681, 0.8379],
[-2.0996, -0.8701, -1.2148, ..., -0.0379, -2.5241, 1.9351],
[-3.7073, 0.3279, 0.8807, ..., -1.3985, -2.0611, 0.6002],
...,
[ 1.3348, -3.6816, -1.0271, ..., -1.6550, 0.8394, 1.0457],
[ 0.8955, -3.6031, -1.0443, ..., -1.2973, 0.9316, 1.3558],
[ 1.0691, -3.5659, -1.1623, ..., -1.3626, 1.1174, 1.3523]],
...,
[[-0.9926, -1.0781, 0.5332, ..., -1.9912, 0.3508, -0.1276],
[-0.9452, 1.4486, -0.4952, ..., -0.2910, 0.1014, 0.7436],
[-2.1879, -0.1735, 1.2844, ..., -1.4701, -0.3949, -1.3691],
...,
[-0.0380, -1.8997, -0.2480, ..., -1.6570, 0.9166, 0.3879],
[-0.0185, -1.7595, -0.4073, ..., -1.6960, 0.7043, 0.4340],
[ 0.0889, -1.6022, -0.6427, ..., -1.7312, 0.6539, 0.3327]],
[[-0.7026, -2.8673, 0.7728, ..., -3.2812, 1.2357, 1.0315],
[-0.7316, 1.7105, -0.2076, ..., 0.0276, 0.4884, 0.5200],
[-2.2086, -0.0808, 1.7520, ..., -0.2885, 0.3539, -0.3582],
...,
[ 1.2960, -4.0501, -1.3244, ..., -1.6143, 0.6550, 1.8709],
[ 0.9846, -3.8275, -1.3272, ..., -1.1076, 0.6045, 1.9635],
[ 1.0963, -3.6684, -1.3650, ..., -1.1385, 0.6555, 1.7605]],
[[-0.8365, -2.3553, 0.7874, ..., -3.7243, 1.6223, 0.8726],
[-0.2939, -2.7249, 0.2032, ..., -2.4267, -1.7648, 1.4637],
[ 0.1375, 0.9276, 0.0405, ..., -0.4353, -1.1491, 1.0354],
...,
[ 1.6176, -3.3541, -1.4429, ..., -2.5446, 0.8072, 1.7173],
[ 1.2283, -3.1744, -1.2734, ..., -2.0969, 0.7972, 1.9605],
[ 1.4354, -2.9385, -1.1689, ..., -2.0097, 1.0124, 1.8885]]],
grad_fn=<MulBackward0>)
torch.Size([260, 324, 768])
--fin--
My code:
import numpy as np
import torch
from transformers import FlaubertModel, FlaubertTokenizer

def Flaubert_Model(texte):  # texte is a column of sentences
    # You could choose among ['flaubert-base-cased', 'flaubert-base-uncased', 'flaubert-large-cased']
    modelname = 'flaubert-base-cased'
    flaubert, log = FlaubertModel.from_pretrained(modelname, output_loading_info=True)
    # do_lowercase=False if using the 'cased' model, otherwise it should be set to True
    flaubert_tokenizer = FlaubertTokenizer.from_pretrained(modelname, do_lowercase=False)
    tokenized = texte.apply(lambda x: flaubert_tokenizer.encode(x, add_special_tokens=True))
    max_len = 0
    for i in tokenized.values:
        if len(i) > max_len:
            max_len = len(i)
    padded = np.array([i + [0] * (max_len - len(i)) for i in tokenized.values])
    print(padded)
    # Using the model
    token_ids = torch.tensor(padded)
    last_layer = flaubert(token_ids)[0]
    print(last_layer)
    print(last_layer.shape)
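As a side note on the snippet above (an addition of mine, not part of the original question): with zero-padding like this, it is usually worth passing an attention mask so that padded positions do not influence the contextual vectors; the Hugging Face FlauBERT model accepts one.
# minimal sketch, reusing 'padded', 'token_ids' and 'flaubert' from the function above
attention_mask = torch.tensor(np.where(padded != 0, 1, 0))
last_layer = flaubert(token_ids, attention_mask=attention_mask)[0]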
Hi,
When getting data from get-data-xnli.sh, I notice that most of the dataset is not in French. Hence, I wonder how you used it in practice?
I am currently looking for some NLI-like and STS-like datasets in French. That would be great for fine-tuning FlauBERT!
As a suggestion, translating the English versions of NLI and STS to French could be a good option for fine-tuning FlauBERT on such tasks.
Hello hello! Thanks for sharing the model!
With CamemBERT it is quite easy to guess a word from its context; is there a working example for FlauBERT?
Thanks in advance!
from fairseq.models.roberta import CamembertModel
camembert = CamembertModel.from_pretrained('./camembert-base/')
camembert.eval()
masked_line = 'Le camembert est <mask> :)'
camembert.fill_mask(masked_line, topk=3)
# [('Le camembert est délicieux :)', 0.4909118115901947, ' délicieux'),
#  ('Le camembert est excellent :)', 0.10556942224502563, ' excellent'),
#  ('Le camembert est succulent :)', 0.03453322499990463, ' succulent')]
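For FlauBERT through the Hugging Face library, something along these lines might work (a sketch, not an official example: FlauBERT/XLM does not use the literal <mask> string, so the mask token is read from the tokenizer, and depending on your transformers version the keyword argument may be topk rather than top_k):
from transformers import pipeline

# fill-mask pipeline with the namespaced FlauBERT checkpoint
fill_mask = pipeline('fill-mask', model='flaubert/flaubert_base_cased')
masked_line = 'Le camembert est ' + fill_mask.tokenizer.mask_token + ' :)'
print(fill_mask(masked_line, top_k=3))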
I have a query regarding your training corpus.
The News Crawl corpora that you use are both shuffled and de-duplicated. However, other models like BERT, RoBERTa, etc. use a non-shuffled corpus where each document is also demarcated with an empty line. With this un-shuffled form, when you create pre-training instances, you end up with contiguous sentences in segment A and segment B. But in your case, the segments will contain non-contiguous sentences, right?
So my question is: what is your opinion on having non-contiguous sentences in the segments? Does it hurt the performance of MLM, or of downstream tasks?
I am getting this error when training FlauBERT.
INFO - 02/12/20 19:50:47 - 0:00:11 - Number of parameters (model): 92501715
INFO - 02/12/20 19:50:51 - 0:00:15 - Before setting SingleTrainer variables.
INFO - 02/12/20 19:50:51 - 0:00:15 - After setting SingleTrainer variables.
INFO - 02/12/20 19:50:52 - 0:00:15 - Found 0 memories.
INFO - 02/12/20 19:50:52 - 0:00:16 - Found 12 FFN.
INFO - 02/12/20 19:50:53 - 0:00:16 - Found 197 parameters in model.
INFO - 02/12/20 19:50:53 - 0:00:17 - Optimizers: model
Traceback (most recent call last):
File "train.py", line 391, in
main(params)
File "train.py", line 266, in main
trainer = SingleTrainer(model, data, params)
File "/home/ge/ke/eXP/Flaubert/xlm/trainer.py", line 857, in __init__
super().__init__(data, params)
File "/home/ge/ke/eXP/Flaubert/xlm/trainer.py", line 81, in __init__
self.init_amp()
File "/home/gekelodjoe/eXP/Flaubert/xlm/trainer.py", line 205, in init_amp
models, optimizers = apex.amp.initialize(
AttributeError: module 'apex' has no attribute 'amp'
Do you have any idea how I can solve it?
Transformers version: 2.3.0
PyTorch version: 1.4
When running readme.md code
import torch
from transformers import XLMModel, XLMTokenizer
modelname="xlm_bert_fra_base_lower" # Or absolute path to where you put the folder
# Load model
flaubert, log = XLMModel.from_pretrained(modelname, output_loading_info=True)
# check import was successful, the dictionary should have empty lists as values
print(log)
# Load tokenizer
flaubert_tokenizer = XLMTokenizer.from_pretrained(modelname, do_lowercase_and_remove_accent=False)
sentence="Le chat mange une pomme."
sentence_lower = sentence.lower()
token_ids = torch.tensor([flaubert_tokenizer.encode(sentence_lower)])
last_layer = flaubert(token_ids)[0]
print(last_layer.shape)
Output
OSError: Model name 'xlm_bert_fra_base_lower' was not found in model name list
(xlm-mlm-en-2048, xlm-mlm-ende-1024, xlm-mlm-enfr-1024, xlm-mlm-enro-1024, xlm-mlm-tlm-xnli15-1024, xlm-mlm-xnli15-1024, xlm-clm-enfr-1024, xlm-clm-ende-1024,
xlm-mlm-17-1280, xlm-mlm-100-1280). We assumed 'xlm_bert_fra_base_lower'
was a path or url to a configuration file named config.json
or a directory containing such a file but couldn't find any such file at this path or url.
I tried downloading the model as a tar file (lower and normal), extracting it, and putting its absolute folder path as modelname, but I keep getting the same error.
I may be missing something obvious, but I don't see a config.json file in the archive.
What is wrong?
The script failed with :
No compiler is provided in this environment. Perhaps you are running on a JRE rather than a JDK?
I managed to fix it with :
sudo update-alternatives --config java
and selecting /usr/lib/jvm/java-11-openjdk-amd64/bin/java
as default
I was running the token classification example here: https://github.com/huggingface/transformers/blob/master/examples/token-classification/run.sh with flaubert-large-cased as the model name.
I tried to use the model downloaded from https://huggingface.co/flaubert, but after 100 epochs the results are very bad; the model did not learn anything.
I don't understand what the problem is with the flaubert-large-cased model.
Note that the flaubert-base-cased model gives good results on the NER task.
Do you have an idea how to fix this problem, please?
Replace -outdir with --outdir on line 10!
Regards
Hello,
I currently have a problem regarding the fine-tuning of FlauBERT on FLUE. I have a model that I re-trained on custom data, so I have new weights for it, stored as a .json file and .bin files. However, when I want to fine-tune this model on FLUE tasks, the scripts ask for the vocab and codes files from pretraining, which I don't have when using the Hugging Face library. I see that there is a module to go from XLM to Hugging Face, but not the opposite. Is it possible to transform a model in .json and .bin format to get the vocab, codes, and .pth files?
Or maybe there is a clever workaround for this problem?
Many thanks in advance
Hello,
I would like to use FlauBERT to investigate semantic ambiguity in eighteenth-century French. To do so, I would like to extract the word embeddings of a specific keyword from a large group of different sentences in which the keyword was used and visualize those word embeddings. I have been able to extract and visualize word embeddings using BERT (after having looked at various online tutorials), but I cannot figure out how to revise the code I've been using so that it works with FlauBERT. I've inserted the code I have used for extracting and visualizing word embeddings using BERT. Could you possibly give me some suggestions as to how I might modify this code so that it works with FlauBERT?
I also wonder whether I should further train FlauBERT on a corpus of eighteenth-century French texts?
Thank you in advance for your time.
!pip install torch
!pip install transformers
from transformers import BertTokenizer, BertModel
import pandas as pd
import numpy as np
import nltk
import torch
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut
from sklearn import model_selection
import matplotlib.pyplot as plt
%matplotlib inline
text_data = pd.read_csv("https://raw.githubusercontent.com/name/file.csv")
texts = text_data['Sentence'].to_list()
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states=True)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
def bert_text_preparation(text, tokenizer):
    marked_text = "[CLS] " + text + " [SEP]"
    tokenized_text = tokenizer.tokenize(marked_text)
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
    segments_ids = [1] * len(indexed_tokens)
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])
    return tokenized_text, tokens_tensor, segments_tensors
def get_bert_embeddings(tokens_tensor, segments_tensors, model):
    with torch.no_grad():
        outputs = model(tokens_tensor, segments_tensors)
        hidden_states = outputs[2][1:]
    token_embeddings = hidden_states[-1]
    token_embeddings = torch.squeeze(token_embeddings, dim=0)
    list_token_embeddings = [token_embed.tolist() for token_embed in token_embeddings]
    return list_token_embeddings
target_word_embeddings = []
for text in texts:
    tokenized_text, tokens_tensor, segments_tensors = bert_text_preparation(text, tokenizer)
    list_token_embeddings = get_bert_embeddings(tokens_tensor, segments_tensors, model)
    word_index = tokenized_text.index('keyword')
    word_embedding = list_token_embeddings[word_index]
    target_word_embeddings.append(word_embedding)
keyword_pca = PCA(n_components=2).fit_transform(target_word_embeddings)
keyword_pca.shape
scatter_x = keyword_pca[:,0]
scatter_y = keyword_pca[:,1]
fig, ax = plt.subplots(figsize=(10, 7))
for g in np.unique(text_data.Place_holder):
    ix = np.where(text_data.Place_holder == g)
    ax.scatter(scatter_x[ix], scatter_y[ix], s=100)
plt.show()
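A rough FlauBERT adaptation of the extraction step could look like this (a sketch under assumptions, not a verified recipe: it swaps in the FlauBERT classes, drops the manual [CLS]/[SEP] tokens and segment tensors, which FlauBERT does not use, and assumes the keyword stays a single BPE piece ending in '</w>'; if the tokenizer splits it, you would average the vectors of its pieces):
import torch
from transformers import FlaubertModel, FlaubertTokenizer

modelname = 'flaubert/flaubert_base_cased'
flaubert = FlaubertModel.from_pretrained(modelname)
flaubert_tokenizer = FlaubertTokenizer.from_pretrained(modelname, do_lowercase=False)

def get_flaubert_embedding(text, keyword):
    tokens = flaubert_tokenizer.tokenize(text)
    token_ids = torch.tensor([flaubert_tokenizer.encode(text)])  # encode() adds <s> ... </s>
    with torch.no_grad():
        last_layer = flaubert(token_ids)[0]  # (1, seq_len, 768)
    word_index = tokens.index(keyword + '</w>') + 1  # +1 skips the <s> added by encode()
    return last_layer[0, word_index, :].numpy()

The vectors returned this way can then be fed to the same PCA and plotting code as above.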
Hi! Thank you for your open source initiative, and for creating FLUE. I was wondering if you could provide us with an update about when the code and data will be released for the following tasks, which are listed as 'coming soon' on the website:
5.1. Verb Sense Disambiguation
6. Named Entity Recognition
7. Question Answering
We are developing our own language model and would like to evaluate it across all FLUE tasks. Thank you!
Hello,
could you replace the first line of the script :
#! bin/bash
by
#!/usr/bin/env bash
Thanks
Bonjour,
Dans la documentation de Flaubert, il est fait mention de la possibilité de paraphraser.
Auriez-vous un exemple de code à utiliser pour, par exemple, d'une phrase en générer une dizaines de similaires ?
Merci
Line 104 in 3fa2d85
The corpus variable is referenced as $cp, and should be changed to $corpus.
transformers version: 2.5.1
Model I am using: FlauBERT
The problem arises when trying to produce features with the model: the output that is generated causes the system to run out of memory.
import torch
from transformers import FlaubertModel, FlaubertTokenizer
# Choose among ['flaubert/flaubert_small_cased', 'flaubert/flaubert_base_uncased',
# 'flaubert/flaubert_base_cased', 'flaubert/flaubert_large_cased']
modelname = 'flaubert/flaubert_base_cased'
# Load pretrained model and tokenizer
flaubert, log = FlaubertModel.from_pretrained(modelname, output_loading_info=True)
flaubert_tokenizer = FlaubertTokenizer.from_pretrained(modelname, do_lowercase=False)
# do_lowercase=False if using cased models, True if using uncased ones
sentence = "Le chat mange une pomme."
token_ids = torch.tensor([flaubert_tokenizer.encode(sentence)])
last_layer = flaubert(token_ids)[0]
print(last_layer.shape)
# torch.Size([1, 8, 768]) -> (batch size x number of tokens x embedding dimension)
# The BERT [CLS] token corresponds to the first hidden state of the last layer
cls_embedding = last_layer[:, 0, :]
def get_flaubert_layer(texte):
    modelname = "flaubert-base-uncased"
    path = './flau/flaubert-base-unc/'
    flaubert = FlaubertModel.from_pretrained(path)
    flaubert_tokenizer = FlaubertTokenizer.from_pretrained(path)
    tokenized = texte.apply(lambda x: flaubert_tokenizer.encode(x, add_special_tokens=True, max_length=512))
    max_len = 0
    for i in tokenized.values:
        if len(i) > max_len:
            max_len = len(i)
    padded = np.array([i + [0] * (max_len - len(i)) for i in tokenized.values])
    token_ids = torch.tensor(padded)
    with torch.no_grad():
        last_layer = flaubert(token_ids)[0][:, 0, :].numpy()
    return last_layer, modelname
The tasks I am working on is:
Steps to reproduce the behavior:
# Reading the file
filename = "corpus"
sentences = pd.read_excel(os.path.join(root, filename + ".xlsx"), sheet_name= 0)
data_id = sentences.identifiant
print("Total phrases: ", len(data_id))
data = sentences.sent
label = sentences.etiquette
emb, mdlname = get_flaubert_layer(data) # corpus is dataframe of approximately 40 000 lines
Apparently this line produces something huge that takes a lot of memory:
last_layer = flaubert(token_ids)[0][:,0,:].numpy()
I would have expected it to run, but I think the fact that I pass the whole dataset to the model is causing the system to break. So I wanted to know whether it is possible to tell the model to process the dataset maybe 500 or 1000 lines at a time, so as to not pass the whole dataset at once. I know there is a batch_size parameter that can be used, but I am not training a model, merely using it to produce embeddings as input for other classifiers.
Do you perhaps know how to modify the batch size so the whole dataset is not processed at once? I am not really familiar with this type of architecture. In the example, only a single sentence is used, but in my case I load a whole dataset (a dataframe).
My expectation is to have the model process all the sentences and then produce the vectors I need for the classification task.
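Something along these lines might serve as a workaround (a sketch, with the batch size an arbitrary assumption): run the frozen model over the padded matrix in chunks, so the whole dataframe never goes through the model at once, then concatenate the per-batch vectors.
import numpy as np
import torch

def embed_in_batches(flaubert, padded, batch_size=64):
    embeddings = []
    for start in range(0, len(padded), batch_size):
        batch = torch.tensor(padded[start:start + batch_size])
        with torch.no_grad():
            # keep only the first-token vector of the last layer, as in the original code
            embeddings.append(flaubert(batch)[0][:, 0, :].numpy())
    return np.concatenate(embeddings, axis=0)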
Hello. :)
I would like to use the "--reload_model" option with your train.py command to further train one of your pretrained FlauBERT models.
Upon trying to run train.py with the "--reload_model" option I got an error message saying that there was a "size mismatch" between the pretrained FlauBERT model and the adapted model I was trying to train.
The error message referred to a "shape torch.Size([67542]) from checkpoint". This was for the flaubert_base_uncased model. I assume that the number 67542 is the vocabulary size of flaubert-base-uncased.
In order to use the "--reload_model" option with your pretrained FlauBERT models, do I need to ensure that the vocabulary of my training data is identical to that of the pretrained model? If so, do you think that I could manage that simply by concatenating the "vocab" file of the pretrained model with my training data?
Thank you in advance for your help!
File "extract_split_cls.py", line 14, in
from tools.clean_text import cleaner
ModuleNotFoundError: No module named 'tools'
Hello, I get this error message when trying to execute the extract_split_cls.sh script. I tried conda install tools, but that package does not seem to exist. What is the name of the package I need to install? Thank you!
Hi,
Is it possible to have a bash file that does the sentence encoding of each line of a given document in an optimised manner?
For instance, if we run something like the following:
bash encode_doc.sh $DATA_PATH/doc.txt
we would get sentence embeddings for each line of the document $DATA_PATH/doc.txt.
The assumption is that each line of the document corresponds to a single sentence.
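In the meantime, a minimal Python sketch of what such a script could do (the file path, model choice, and first-token pooling are all assumptions on my part):
import torch
from transformers import FlaubertModel, FlaubertTokenizer

modelname = 'flaubert/flaubert_base_cased'
model = FlaubertModel.from_pretrained(modelname)
tokenizer = FlaubertTokenizer.from_pretrained(modelname, do_lowercase=False)

embeddings = []
with open('doc.txt', encoding='utf-8') as f:
    for line in f:  # one sentence per line
        token_ids = torch.tensor([tokenizer.encode(line.strip())])
        with torch.no_grad():
            # keep the first-token vector of the last layer as the sentence embedding
            embeddings.append(model(token_ids)[0][:, 0, :].squeeze(0))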
Cheers and Good job! :D
Hello,
Is it normal that the fastBPE folder is empty? That's why I got an error in prepare-data-pawsx.sh.
Hi,
First of all, thanks a lot for this set of tools, a boon for French language modelling!
Reading your paper, I've been wondering about some dataset choices:
Thanks in advance!
Hello, I am trying to learn BPE codes on my training set and I am getting this error:
fast: fastBPE/fastBPE.hpp:458: void fastBPE::readCodes(const char*, std::unordered_map<std::pair<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char> >, unsigned int, fastBPE::pair_hash>&, std::unordered_map<std::__cxx11::basic_string<char>, std::pair<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char> > >&): Assertion `codes.find(pair) == codes.end()' failed.
tools/create_pretraining_data.sh: line 38: 5617 Aborted $FASTBPE applybpe $OUT_PATH/train.$lg $DATA_DIR/$lg.train $OUT_PATH/codes
Loading codes from /home/getalp/kelodjoe/eXP/Flaubert/data/processed/fr_corpuslabel/BPE/10k/codes ...
fast: fastBPE/fastBPE.hpp:458: void fastBPE::readCodes(const char*, std::unordered_map<std::pair<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char> >, unsigned int, fastBPE::pair_hash>&, std::unordered_map<std::__cxx11::basic_string<char>, std::pair<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char> > >&): Assertion `codes.find(pair) == codes.end()' failed.
tools/create_pretraining_data.sh: line 39: 5618 Aborted $FASTBPE applybpe $OUT_PATH/valid.$lg $DATA_DIR/$lg.valid $OUT_PATH/codes
Loading codes from /home/getalp/kelodjoe/eXP/Flaubert/data/processed/fr_corpuslabel/BPE/10k/codes ...
fast: fastBPE/fastBPE.hpp:458: void fastBPE::readCodes(const char*, std::unordered_map<std::pair<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char> >, unsigned int, fastBPE::pair_hash>&, std::unordered_map<std::__cxx11::basic_string<char>, std::pair<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char> > >&): Assertion `codes.find(pair) == codes.end()' failed.
tools/create_pretraining_data.sh: line 40: 5619 Aborted $FASTBPE applybpe $OUT_PATH/test.$lg $DATA_DIR/$lg.test $OUT_PATH/codes
cat: /home/getalp/kelodjoe/eXP/Flaubert/data/processed/fr_corpuslabel/BPE/10k/train.fr: No such file or directory
Read 0 words (0 unique) from text file.
Traceback (most recent call last):
File "preprocess.py", line 30, in
assert os.path.isfile(txt_path)
AssertionError
Traceback (most recent call last):
File "preprocess.py", line 30, in
assert os.path.isfile(txt_path)
AssertionError
Traceback (most recent call last):
File "preprocess.py", line 30, in
assert os.path.isfile(txt_path)
AssertionError
Line 30 of preprocess.py looks like this:
Do you have any idea how I can resolve it?
Hello!
I am using FlauBERT to generate word embeddings as part of a study on word sense disambiguation (WSD).
The FlauBERT tokenizer does not recognize a significant number of words in my corpus and, as a result, segments them. For example, the FlauBERT tokenizer does not recognize the archaic orthography of some verbs. It also does not recognize plural forms of a number of other words in the corpus. As a result, the tokenizer segments a number of words in the corpus, some of which are significant to my research.
I understand that I could further train FlauBERT on a corpus of eighteenth-century French in order to create a new model and a new tokenizer specifically for eighteenth-century French. However, compiling a corpus of eighteenth-century French that is large and heterogeneous enough to be useful would be challenging (and perhaps not even possible).
As an alternative to training a new model, I thought I might lemmatize my corpus before running it through FlauBERT. Stanford's Stanza NLP package (stanfordnlp.github.io/stanza/) recognizes the archaic orthography of the verbs in my corpus and turns them into the infinitive form, a form FlauBERT recognizes. Similarly, Stanza also changes the plural forms of other words into singular forms, forms FlauBERT also recognizes. Thus, if I were to lemmatize my corpus in Stanza, the FlauBERT tokenizer would then be able to recognize substantially more words in my corpus.
Would lemmatizing my corpus in this way adversely affect my FlauBERT results and a WSD analysis in particular? In general, does lemmatization have a negative effect on BERT results and WSD analyses more particularly?
Given that FlauBERT is not trained on lemmatized text, I imagine that lemmatizing the corpus would indeed negatively affect the results of the analysis. As an alternative to training FlauBERT on a corpus of eighteenth-century French (which may not be possible), could I instead train it on a corpus of lemmatized French and then use this new model for a WSD analysis on my corpus of lemmatized eighteenth-century French? Would that work?
I'm not sure if this is the right place for these sorts of questions!
Thank you in advance for your time.
I am using Huggingface datasets for FLUE benchmark, and the CLS datasets are grouped together instead of being separated into different categories of "Books", "DVD", and "Music".
What's the best way of splitting the CLS datasets into different subsets?
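One possibility (a sketch under assumptions: the 'category' column and 'books' value below are placeholders, so inspect the dataset's features first and adapt the names to the actual schema) is to load the FLUE CLS configuration from the Hub and filter it:
from datasets import load_dataset

cls_train = load_dataset('flue', 'CLS', split='train')
print(cls_train.features)  # check which field, if any, encodes Books / DVD / Music
books = cls_train.filter(lambda example: example['category'] == 'books')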
I want to train FlauBERT again but I am encountering a new error:
Traceback (most recent call last):
File "train.py", line 391, in
main(params)
File "train.py", line 309, in main
trainer.mlm_step(lang1, lang2, params.lambda_mlm)
File "/data1/home/getalp/kelodjoe/eXP/Flaubert/xlm/trainer.py", line 781, in mlm_step
self.optimize(loss)
File "/data1/home/getalp/kelodjoe/eXP/Flaubert/xlm/trainer.py", line 250, in optimize
scaled_loss.backward()
File "/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/contextlib.py", line 88, in exit
next(self.gen)
File "/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/handle.py", line 123, in scale_loss
optimizer._post_amp_backward(loss_scaler)
File "/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/_process_optimizer.py", line 241, in post_backward_no_master_weights
post_backward_models_are_masters(scaler, params, stashed_grads)
File "/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/_process_optimizer.py", line 120, in post_backward_models_are_masters
scale_override=grads_have_scale/out_scale)
File "/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/scaler.py", line 117, in unscale
1./scale)
File "/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/multi_tensor_apply/multi_tensor_apply.py", line 30, in call
*args)
RuntimeError: CUDA error: invalid device function (multi_tensor_apply at csrc/multi_tensor_apply.cuh:108)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f7ea6c64193 in /home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: void multi_tensor_apply<2, ScaleFunctor<float, float>, float>(int, int, at::Tensor const&, std::vector<std::vector<at::Tensor, std::allocatorat::Tensor >, std::allocator<std::vector<at::Tensor, std::allocatorat::Tensor > > > const&, ScaleFunctor<float, float>, float) + 0x183f (0x7f7ea0dd379f in /home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so)
frame #2: multi_tensor_scale_cuda(int, at::Tensor, std::vector<std::vector<at::Tensor, std::allocatorat::Tensor >, std::allocator<std::vector<at::Tensor, std::allocatorat::Tensor > > >, float) + 0x1679 (0x7f7ea0dcff39 in /home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so)
frame #3: + 0x200cc (0x7f7ea0dc30cc in /home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so)
frame #4: + 0x1a634 (0x7f7ea0dbd634 in /home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so)
frame #54: __libc_start_main + 0xf1 (0x7f7eb244b2e1 in /lib/x86_64-linux-gnu/libc.so.6)
Segmentation fault
The same traceback, frames, and segmentation fault appeared a second time in the output, interleaved with the above (apparently from a second process).
Do you have any idea ?
I am trying to fine-tune the model I obtained after training. I have some questions:
python flue/extract_split_cls.py --indir $DATA_DIR/raw/cls-acl10-unprocessed \
    --outdir $DATA_DIR/processed \
    --do_lower true \
    --use_hugging_face true
When looking at the dataset in the raw directory, I observed that the script "get-data-cls.sh" actually also splits the dataset.
Since I already have my own data, should I write a new script which divides my data to reproduce the output of the first script, and then use the results as input for the second command, "extract_split_cls.py", to obtain the actual data to use for fine-tuning?
Since I have different data, I suppose I should modify the script extract_split_cls.py for my own use, or is that not necessary?
Output of the second script "extract_split_cls.py":
Hi,
I am trying to reproduce the dataset you used for FlauBERT.
How did you use EnronSent, which seems to be in English?
Thank you in advance.
It would be a good thing to have this pretrained model in the official transformers model repository.
Hi all and thanks for open-sourcing Flaubert :)
I'm going through the README, trying to find out which dataset FlauBERT was trained on, but I'm not able to find the information. There's a mention of Gutenberg data in the pre-processing example: is that the data you used?
Thanks a lot in advance!
Hi,
Hope you are all well !
Is it possible to use FlauBERT for sentiment analysis of tweets written in French? If so, how can we do that?
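A possible starting point (a sketch, not an official recipe: the model name and label count are assumptions, and the classification head still has to be fine-tuned on labelled French tweets before the outputs mean anything):
import torch
from transformers import FlaubertForSequenceClassification, FlaubertTokenizer

modelname = 'flaubert/flaubert_base_cased'
tokenizer = FlaubertTokenizer.from_pretrained(modelname, do_lowercase=False)
model = FlaubertForSequenceClassification.from_pretrained(modelname, num_labels=2)

# with older transformers versions, use tokenizer.encode_plus(...) instead of calling the tokenizer directly
inputs = tokenizer("J'adore ce tweet !", return_tensors='pt')
logits = model(**inputs)[0]  # meaningless until the head is fine-tuned
print(torch.softmax(logits, dim=-1))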
Vive la France ! :-)
Cheers,
X
Hi, I have tried to load FlauBERT as described:
import torch
from transformers import FlaubertModel, FlaubertTokenizer
Unfortunately it returns an ImportError:
from transformers import FlaubertModel, FlaubertTokenizer
ImportError: cannot import name 'FlaubertModel'
Hello, I tried to retrain FlauBERT from the last checkpoint on new data, but it seems the size of my input does not correspond to what the system expects. Do you perhaps know how I can solve this?
Traceback (most recent call last):
File "train.py", line 391, in
main(params)
File "train.py", line 309, in main
trainer.mlm_step(lang1, lang2, params.lambda_mlm)
File "/home/getalp/kelodjoe/eXP/Flaubert/xlm/trainer.py", line 786, in mlm_step
self.optimize(loss)
File "/home/getalp/kelodjoe/eXP/Flaubert/xlm/trainer.py", line 261, in optimize
optimizer.step()
File "/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/initialize.py", line 242, in new_step
output = old_step(*args, **kwargs)
File "/home/getalp/kelodjoe/eXP/Flaubert/xlm/optim.py", line 136, in step
super().step(closure)
File "/home/getalp/kelodjoe/eXP/Flaubert/xlm/optim.py", line 72, in step
exp_avg.mul_(beta1).add_(1 - beta1, grad)
File "/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/wrap.py", line 101, in wrapper
return orig_fn(arg0, *args, **kwargs)
RuntimeError: The size of tensor a (68729) must match the size of tensor b (13681) at non-singleton dimension 0
Another question: I would like to use the system for the classification task in the FLUE benchmark, but it is not clear on your page where it is. I understand the fine-tuning part, but after that how are we supposed to test the fine-tuned model? Should we use our own code, or is there code for the classification task available?
Best regards,
Hi,
My goal here is to do clustering on sentences. For this purpose, I chose to use similarities between sentence embeddings for all my sentences. Unfortunately, CamemBERT does not seem great for that task, and fine-tuning FlauBERT could be a solution.
So thanks to @formiel, I managed to fine-tune FlauBERT on an NLI dataset.
My question is about that fine-tuning. What is the output exactly? I only got a few files in the dump_path:
I wonder whether, after fine-tuning FlauBERT, I could use it to make new sentence embeddings (like FlauBERT before fine-tuning). So where is the new FlauBERT model trained on the NLI dataset? And how do I use it to make embeddings?
Thanks in advance
Hi. If I need to use FlauBERT for French-to-English translation, what is the approach?
Good afternoon,
I tried to follow your instructions to train FlauBERT on my own corpus in order to get a model to use for my classification task, but I am having trouble understanding the procedure.
You said we should use this line to train on our preprocessed data :
/Flaubert$ python train.py --exp_name flaubert_base_lower --dump_path ./dumped/ --data_path ./own_data/data/ --lgs 'fr' --clm_steps '' --mlm_steps 'fr' --emb_dim 768 --n_layers 12 --n_heads 12 --dropout 0.1 --attention_dropout 0.1 --gelu_activation true --batch_size 16 --bptt 512 --optimizer "adam_inverse_sqrt,lr=0.0006,warmup_updates=24000,beta1=0.9,beta2=0.98,weight_decay=0.01,eps=0.000001" --epoch_size 300000 --max_epoch 100000 --validation_metrics _valid_fr_mlm_ppl --stopping_criterion _valid_fr_mlm_ppl,20 --fp16 true --accumulate_gradients 16 --word_mask_keep_rand '0.8,0.1,0.1' --word_pred '0.15'
I tried it after cloning Flaubert and installing all the necessary libraries, but I am getting this error:
FAISS library was not found.
FAISS not available. Switching to standard nearest neighbors search implementation.
./own_data/data/train.fr.pth not found
./own_data/data/valid.fr.pth not found
./own_data/data/test.fr.pth not found
Traceback (most recent call last):
File "train.py", line 387, in
check_data_params(params)
File "/ho/ge/ke/eXP/Flaubert/xlm/data/loader.py", line 302, in check_data_params
assert all([all([os.path.isfile(p) for p in paths.values()]) for paths in params.mono_dataset.values()])
AssertionError
Does this mean I have to split my own data into three corpora after preprocessing it (train, valid, and test)? Should I run your preprocessing script on my own data before executing the command?
Hi !
I would like to fine-tune FlauBERT on a FLUE task with the Hugging Face library. I downloaded the PAWS data and used the code you gave in your GitHub repo, but I get this error message I can't get past:
Any idea what to do?
Thanks for this project by the way, I'm looking forward to using it!
Have a good day,
Lisa
Hi, I got this error when running run_flue.py:
from transformers import flue_compute_metrics as compute_metrics
Import Error: cannot import name 'flue_compute_metrics'
I already installed the requirements and updated the transformers directory.