

flaubert's Issues

Model names

The example in the readme for running with the Hugging Face library reads:

# Choose among ['flaubert-small-cased', 'flaubert-base-uncased', 'flaubert-base-cased', 'flaubert-large-cased']
modelname = 'flaubert-base-cased' 

but these don't work for me. Instead, what worked was:

# Choose among ['flaubert/flaubert_small_cased', 'flaubert/flaubert_base_uncased', 'flaubert/flaubert_base_cased', 'flaubert/flaubert_large_cased']
modelname = 'flaubert/flaubert_base_cased'
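
For reference, here is a minimal loading check with the namespaced identifiers (a sketch; it assumes a transformers version recent enough to resolve the flaubert/* names on the model hub):

from transformers import FlaubertModel, FlaubertTokenizer

modelname = 'flaubert/flaubert_base_cased'
flaubert, log = FlaubertModel.from_pretrained(modelname, output_loading_info=True)
flaubert_tokenizer = FlaubertTokenizer.from_pretrained(modelname, do_lowercase=False)
print(log)  # empty lists as values mean every weight was loaded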

Convert from xlm to Hugging Face

Hi, I trained my own BERT with your code.
Do you provide code for converting an XLM model to a Hugging Face model, i.e. from checkpoint.pth to pytorch_model.bin and config.json? And also a class for the tokenizer?
Thank you!

*** Error in `python3': free(): invalid pointer: 0x000055ae110d8c20 *** (Aborted)

Hello,

I am using FlauBERT with Hugging Face transformers, but I am getting negative values in the output vectors and I do not know why.

Do you have any idea why? When using your previous installation (without transformers), the vectors were non-negative.

tensor([[[-0.7107, -1.4737, 0.9828, ..., -3.1127, 0.3499, 0.3747],
[-1.7784, -2.2899, -0.3564, ..., -0.7091, -0.7323, 2.5189],
[-1.3886, -0.6179, -1.2058, ..., -0.5359, -0.5513, 0.4187],
...,
[ 1.0148, -2.3575, -0.9193, ..., -2.2937, 0.5353, 0.6768],
[ 0.9082, -2.2049, -1.2324, ..., -2.1818, 0.5374, 0.8427],
[ 0.9287, -1.7471, -1.7460, ..., -2.1425, 0.5514, 0.7766]],

    [[-0.9680, -1.0419,  0.6869,  ..., -2.0340,  0.5130, -0.0712],
     [-0.9206, -1.6208, -0.1860,  ...,  0.6649, -1.0081, -0.8190],
     [-1.1192, -1.6612, -0.2082,  ..., -0.2693, -1.8960, -0.6337],
     ...,
     [ 0.1435, -2.2177, -0.5040,  ..., -1.8233,  0.8659,  0.0530],
     [ 0.0633, -2.1492, -0.5994,  ..., -1.7436,  0.7928,  0.1566],
     [ 0.0480, -1.8401, -0.7709,  ..., -1.6504,  0.7304,  0.0832]],

    [[-0.5683, -2.7551,  1.0667,  ..., -3.7187,  1.6681,  0.8379],
     [-2.0996, -0.8701, -1.2148,  ..., -0.0379, -2.5241,  1.9351],
     [-3.7073,  0.3279,  0.8807,  ..., -1.3985, -2.0611,  0.6002],
     ...,
     [ 1.3348, -3.6816, -1.0271,  ..., -1.6550,  0.8394,  1.0457],
     [ 0.8955, -3.6031, -1.0443,  ..., -1.2973,  0.9316,  1.3558],
     [ 1.0691, -3.5659, -1.1623,  ..., -1.3626,  1.1174,  1.3523]],

    ...,

    [[-0.9926, -1.0781,  0.5332,  ..., -1.9912,  0.3508, -0.1276],
     [-0.9452,  1.4486, -0.4952,  ..., -0.2910,  0.1014,  0.7436],
     [-2.1879, -0.1735,  1.2844,  ..., -1.4701, -0.3949, -1.3691],
     ...,
     [-0.0380, -1.8997, -0.2480,  ..., -1.6570,  0.9166,  0.3879],
     [-0.0185, -1.7595, -0.4073,  ..., -1.6960,  0.7043,  0.4340],
     [ 0.0889, -1.6022, -0.6427,  ..., -1.7312,  0.6539,  0.3327]],

    [[-0.7026, -2.8673,  0.7728,  ..., -3.2812,  1.2357,  1.0315],
     [-0.7316,  1.7105, -0.2076,  ...,  0.0276,  0.4884,  0.5200],
     [-2.2086, -0.0808,  1.7520,  ..., -0.2885,  0.3539, -0.3582],
     ...,
     [ 1.2960, -4.0501, -1.3244,  ..., -1.6143,  0.6550,  1.8709],
     [ 0.9846, -3.8275, -1.3272,  ..., -1.1076,  0.6045,  1.9635],
     [ 1.0963, -3.6684, -1.3650,  ..., -1.1385,  0.6555,  1.7605]],

    [[-0.8365, -2.3553,  0.7874,  ..., -3.7243,  1.6223,  0.8726],
     [-0.2939, -2.7249,  0.2032,  ..., -2.4267, -1.7648,  1.4637],
     [ 0.1375,  0.9276,  0.0405,  ..., -0.4353, -1.1491,  1.0354],
     ...,
     [ 1.6176, -3.3541, -1.4429,  ..., -2.5446,  0.8072,  1.7173],
     [ 1.2283, -3.1744, -1.2734,  ..., -2.0969,  0.7972,  1.9605],
     [ 1.4354, -2.9385, -1.1689,  ..., -2.0097,  1.0124,  1.8885]]],
   grad_fn=<MulBackward0>)

torch.Size([260, 324, 768])
--fin--

My code:

import numpy as np
import torch
from transformers import FlaubertModel, FlaubertTokenizer

def Flaubert_Model(texte):  # texte is a column (pandas Series) of sentences

    # You could choose among ['flaubert-base-cased', 'flaubert-base-uncased', 'flaubert-large-cased']
    modelname = 'flaubert-base-cased'

    flaubert, log = FlaubertModel.from_pretrained(modelname, output_loading_info=True)

    flaubert_tokenizer = FlaubertTokenizer.from_pretrained(modelname, do_lowercase=False)
    # do_lowercase=False when using a 'cased' model; set it to True for 'uncased' ones

    tokenized = texte.apply(lambda x: flaubert_tokenizer.encode(x, add_special_tokens=True))

    # Pad every sequence to the length of the longest one
    max_len = 0
    for i in tokenized.values:
        if len(i) > max_len:
            max_len = len(i)

    padded = np.array([i + [0] * (max_len - len(i)) for i in tokenized.values])
    print(padded)

    # Using the model
    token_ids = torch.tensor(padded)
    last_layer = flaubert(token_ids)[0]

    print(last_layer)
    print(last_layer.shape)

NLI-like and STS-like datasets in French

Hi,

When getting data from get-data-xnli.sh, I noticed that most of the dataset is not in French, so I wonder how you used it in practice?

I am currently looking for some NLI-like and STS-like datasets in French. They would be great for fine-tuning Flaubert!

As a suggestion, translating the English versions of NLI and STS into French could be a good option for fine-tuning Flaubert on such tasks.

Filling masks

Hello hello! Thanks for sharing the model!

With Camembert it is quite easy to guess a word from its context; is there a working example for doing this with Flaubert?

Thanks in advance!

from fairseq.models.roberta import CamembertModel
camembert = CamembertModel.from_pretrained('./camembert-base/')
camembert.eval()
masked_line = 'Le camembert est <mask> :)'
camembert.fill_mask(masked_line, topk=3)
# [('Le camembert est délicieux :)', 0.4909118115901947, ' délicieux'),
#  ('Le camembert est excellent :)', 0.10556942224502563, ' excellent'),
#  ('Le camembert est succulent :)', 0.03453322499990463, ' succulent')]
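
There is no fill_mask helper shown for FlauBERT here, but below is a hedged sketch of the closest equivalent with the Hugging Face transformers API (the model id, whether the fill-mask pipeline accepts FlauBERT, and the top_k argument name depend on the installed transformers version and are assumptions):

from transformers import pipeline

# FlauBERT's mask token differs from CamemBERT's <mask>, so ask the tokenizer for it.
fill_mask = pipeline('fill-mask', model='flaubert/flaubert_base_cased')
masked_line = 'Le camembert est %s :)' % fill_mask.tokenizer.mask_token
print(fill_mask(masked_line, top_k=3))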

Pretraining with News Crawls by WMT 19

I have a query regarding your training corpus.

The News Crawl corpora that you use are both shuffled and de-duplicated. However, the corpora used by other models like BERT, RoBERTa etc. use a non-shuffled corpus where each document within the corpus is also demarcated with an empty line. Now with this un-shuffled form, when you create pre-training instances, you will end up with contiguous sentences in segment A and segment B. But in your case, the segments will contain non-contiguous sentences right?

So my question is what is your opinion on having non-contiguous sentences in the segments? Does it hurt the performance of MLM, or downstream tasks?

AttributeError: module 'apex' has no attribute 'amp'

I am getting this error when training Flaubert.

INFO - 02/12/20 19:50:47 - 0:00:11 - Number of parameters (model): 92501715
INFO - 02/12/20 19:50:51 - 0:00:15 - Before setting SingleTrainer variables.
INFO - 02/12/20 19:50:51 - 0:00:15 - After setting SingleTrainer variables.
INFO - 02/12/20 19:50:52 - 0:00:15 - Found 0 memories.
INFO - 02/12/20 19:50:52 - 0:00:16 - Found 12 FFN.
INFO - 02/12/20 19:50:53 - 0:00:16 - Found 197 parameters in model.
INFO - 02/12/20 19:50:53 - 0:00:17 - Optimizers: model
Traceback (most recent call last):
  File "train.py", line 391, in <module>
    main(params)
  File "train.py", line 266, in main
    trainer = SingleTrainer(model, data, params)
  File "/home/ge/ke/eXP/Flaubert/xlm/trainer.py", line 857, in __init__
    super().__init__(data, params)
  File "/home/ge/ke/eXP/Flaubert/xlm/trainer.py", line 81, in __init__
    self.init_amp()
  File "/home/gekelodjoe/eXP/Flaubert/xlm/trainer.py", line 205, in init_amp
    models, optimizers = apex.amp.initialize(
AttributeError: module 'apex' has no attribute 'amp'

Do you have any idea how I can solve it?
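
A hedged diagnostic, not an official answer: this error usually means the unrelated "apex" package from PyPI is installed instead of NVIDIA apex built from source with AMP support. A quick check before reinstalling:

import apex
print(apex.__file__)   # should point to an NVIDIA apex install that ships an amp/ submodule
from apex import amp   # fails with the wrong "apex" package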

Model name not found or config.json missing

Transformers version: 2.3.0
PyTorch version: 1.4

When running the readme.md code:

import torch
from transformers import XLMModel, XLMTokenizer
modelname="xlm_bert_fra_base_lower" # Or absolute path to where you put the folder

# Load model
flaubert, log = XLMModel.from_pretrained(modelname, output_loading_info=True)
# check import was successful, the dictionary should have empty lists as values
print(log)

# Load tokenizer
flaubert_tokenizer = XLMTokenizer.from_pretrained(modelname, do_lowercase_and_remove_accent=False)

sentence="Le chat mange une pomme."
sentence_lower = sentence.lower()

token_ids = torch.tensor([flaubert_tokenizer.encode(sentence_lower)])
last_layer = flaubert(token_ids)[0]
print(last_layer.shape)

Output

OSError: Model name 'xlm_bert_fra_base_lower' was not found in model name list 
(xlm-mlm-en-2048, xlm-mlm-ende-1024, xlm-mlm-enfr-1024, xlm-mlm-enro-1024, xlm-mlm-tlm-xnli15-1024, xlm-mlm-xnli15-1024, xlm-clm-enfr-1024, xlm-clm-ende-1024,
 xlm-mlm-17-1280, xlm-mlm-100-1280). We assumed 'xlm_bert_fra_base_lower' 
was a path or url to a configuration file named config.json 
or a directory containing such a file but couldn't find any such file at this path or url.

I tried downloading the model as a tar file (lower and normal), extracting it and putting its absolute folder path as modelname, but I keep getting the same error.

I may be missing something obvious, but I don't see a config.json file in the archive.

[screenshot: contents of the extracted archive (Screen Shot 2020-01-21 at 15 59 09)]

What is wrong?

1.get_toolkit.sh

The script failed with:
"No compiler is provided in this environment. Perhaps you are running on a JRE rather than a JDK?"
I managed to fix it with:
sudo update-alternatives --config java
and selecting /usr/lib/jvm/java-11-openjdk-amd64/bin/java as the default.

weird results on new task using flaubert-large-cased model

I was running the token classification example here: https://github.com/huggingface/transformers/blob/master/examples/token-classification/run.sh with flaubert-large-cased as the model name.

I tried to use the model downloaded from https://huggingface.co/flaubert, but after 100 epochs the results are very bad; the model did not learn anything.
I don't understand what the problem is with the flaubert-large-cased model.
Note that the flaubert-base-cased model gives good results on the NER task.

Do you have an idea how I can fix this problem, please?

From pytorch model (with hugging_face library) to XLM model

Hello,

I currently have a problem regarding the fine-tuning of Flaubert on FLUE. I have a model that I re-trained on custom data, so I have new weights for it, and I currently have it as a .json file and .bin files. However, when I want to fine-tune this model on the FLUE tasks, the scripts ask for the vocab and codes files from pretraining, which I don't have when using the Hugging Face library. I see that there is a module to go from XLM to Hugging Face, but not the opposite. Is it possible to transform a model in .json and .bin format to get the vocab, codes and .pth files?

Or maybe there is a clever workaround for this problem?

Many thanks in advance

Extracting word embeddings from FlauBERT

Hello,

I would like to use FlauBERT to investigate semantic ambiguity in eighteenth-century French. To do so, I would like to extract the word embeddings of a specific keyword from a large group of different sentences in which the keyword was used and visualize those word embeddings. I have been able to extract and visualize word embeddings using BERT (after having looked at various online tutorials), but I cannot figure out how to revise the code I've been using so that it works with FlauBERT. I've inserted the code I have used for extracting and visualizing word embeddings using BERT. Could you possibly give me some suggestions as to how I might modify this code so that it works with FlauBERT?

I also wonder whether I should further train FlauBERT on a corpus of eighteenth-century French texts?

Thank you in advance for your time.

!pip install torch
!pip install transformers

from transformers import BertTokenizer, BertModel
import pandas as pd
import numpy as np
import nltk
import torch
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut
from sklearn import model_selection
import matplotlib.pyplot as plt
%matplotlib inline

text_data = pd.read_csv("https://raw.githubusercontent.com/name/file.csv")
texts = text_data['Sentence'].to_list()

model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True,
                                  )
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def bert_text_preparation(text, tokenizer):
    
    marked_text = "[CLS] " + text + " [SEP]"
    tokenized_text = tokenizer.tokenize(marked_text)
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
    segments_ids = [1]*len(indexed_tokens)

    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])

    return tokenized_text, tokens_tensor, segments_tensors

def get_bert_embeddings(tokens_tensor, segments_tensors, model):

    with torch.no_grad():
        outputs = model(tokens_tensor, segments_tensors)
        hidden_states = outputs[2][1:]

    token_embeddings = hidden_states[-1]
    token_embeddings = torch.squeeze(token_embeddings, dim=0)
    list_token_embeddings = [token_embed.tolist() for token_embed in token_embeddings]

    return list_token_embeddings

target_word_embeddings = []

for text in texts:
    tokenized_text, tokens_tensor, segments_tensors = bert_text_preparation(text, tokenizer)
    list_token_embeddings = get_bert_embeddings(tokens_tensor, segments_tensors, model)
    
    word_index = tokenized_text.index('keyword')
    word_embedding = list_token_embeddings[word_index]

    target_word_embeddings.append(word_embedding)

keyword_pca = PCA(n_components=2).fit_transform(target_word_embeddings)
keyword_pca.shape

scatter_x = keyword_pca[:,0]
scatter_y = keyword_pca[:,1]
fig, ax = plt.subplots(figsize=(10, 7))
for g in np.unique(text_data.Place_holder):
    ix = np.where(text_data.Place_holder == g)
    ax.scatter(scatter_x[ix], scatter_y[ix],s=100)
plt.show()
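
A hedged sketch of how the preparation step above might be adapted for FlauBERT (the model id and tokenizer behaviour are assumptions; note that BPE may split the keyword into several sub-tokens, so the simple tokenized_text.index('keyword') lookup has to be adjusted):

import torch
from transformers import FlaubertModel, FlaubertTokenizer

modelname = 'flaubert/flaubert_base_cased'
flaubert_tokenizer = FlaubertTokenizer.from_pretrained(modelname, do_lowercase=False)
flaubert_model = FlaubertModel.from_pretrained(modelname)
flaubert_model.eval()

def flaubert_word_embeddings(text):
    # The FlauBERT tokenizer adds its own special tokens (<s> ... </s>),
    # so no manual "[CLS] ... [SEP]" wrapping is needed.
    token_ids = flaubert_tokenizer.encode(text, add_special_tokens=True)
    tokens = flaubert_tokenizer.convert_ids_to_tokens(token_ids)
    with torch.no_grad():
        outputs = flaubert_model(torch.tensor([token_ids]))
    token_embeddings = outputs[0].squeeze(0)  # last hidden layer: (seq_len, 768)
    return tokens, [emb.tolist() for emb in token_embeddings]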

Update about the coming soon FLUE tasks

Hi! Thank you for your open source initiative, and for creating FLUE. I was wondering if you could provide us with an update about when the code and data will be released for the following tasks, which are listed as 'coming soon' on the website:

5.1. Verb Sense Disambiguation
6. Named Entity Recognition
7. Question Answering

We are developing our own language model and would like to evaluate it across all FLUE tasks. Thank you!

Paraphrase example

Hello,
The Flaubert documentation mentions the possibility of paraphrasing.
Would you have an example of the code to use to, for instance, generate around ten similar sentences from a single sentence?

Thanks

RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 237414383616 bytes. Error code 12 (Cannot allocate memory)

Environment info

  • transformers version: 2.5.1
  • Platform: linux
  • Python version: 3.7
  • PyTorch version (GPU?): 1.4
  • Tensorflow version (GPU?):
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help

Model I am using (FlauBert):

The problem arises when trying to produce features with the model: the output that is generated causes the system to run out of memory.

  • the official example script (I did not change much; it is pretty close to the original):
import torch
from transformers import FlaubertModel, FlaubertTokenizer
# Choose among ['flaubert/flaubert_small_cased', 'flaubert/flaubert_base_uncased', 
#               'flaubert/flaubert_base_cased', 'flaubert/flaubert_large_cased']
modelname = 'flaubert/flaubert_base_cased' 

# Load pretrained model and tokenizer
flaubert, log = FlaubertModel.from_pretrained(modelname, output_loading_info=True)
flaubert_tokenizer = FlaubertTokenizer.from_pretrained(modelname, do_lowercase=False)
# do_lowercase=False if using cased models, True if using uncased ones

sentence = "Le chat mange une pomme."
token_ids = torch.tensor([flaubert_tokenizer.encode(sentence)])

last_layer = flaubert(token_ids)[0]
print(last_layer.shape)
# torch.Size([1, 8, 768])  -> (batch size x number of tokens x embedding dimension)

# The BERT [CLS] token corresponds to the first hidden state of the last layer
cls_embedding = last_layer[:, 0, :]
  • My own modified scripts: (give details below)
def get_flaubert_layer(texte):

	modelname = "flaubert-base-uncased"
	path = './flau/flaubert-base-unc/'

	flaubert = FlaubertModel.from_pretrained(path)
	flaubert_tokenizer = FlaubertTokenizer.from_pretrained(path)
	tokenized = texte.apply((lambda x: flaubert_tokenizer.encode(x, add_special_tokens=True, max_length=512)))
	max_len = 0
	for i in tokenized.values:
		if len(i) > max_len:
			max_len = len(i)
	padded = np.array([i + [0] * (max_len - len(i)) for i in tokenized.values])
	token_ids = torch.tensor(padded)
	with torch.no_grad():
		last_layer = flaubert(token_ids)[0][:,0,:].numpy()
		
	return last_layer, modelname

The task I am working on is:

  • Producing vectors/features from a language model and pass it to others classifiers

To reproduce

Steps to reproduce the behavior:

  1. Install the transformers library along with scikit-learn, pandas, numpy and pytorch
  2. Run the following lines of code
# Reading the file 
filename = "corpus"
sentences = pd.read_excel(os.path.join(root, filename + ".xlsx"), sheet_name= 0)
data_id = sentences.identifiant
print("Total phrases: ", len(data_id))
data = sentences.sent
label = sentences.etiquette
emb, mdlname = get_flaubert_layer(data)  # corpus is dataframe of approximately 40 000 lines

Apparently this line produces something huge which takes a lot of memory:
last_layer = flaubert(token_ids)[0][:,0,:].numpy()

I would have expected it to run, but I think the fact that I pass the whole dataset to the model is causing the system to break. So I wanted to know if it is possible to tell the model to process the dataset maybe 500 or 1000 lines at a time, so as not to pass the whole dataset at once. I know there is the batch_size parameter, but I am not training a model, merely using it to produce embeddings as input for other classifiers.
Do you perhaps know how to modify the batch size so the whole dataset is not processed at once? I am not really familiar with this type of architecture. In the example, only a single sentence is encoded, but in my case I load a whole dataset (a dataframe).

My expectation is for the model to process all the sentences and then produce the vectors I need for the classification task.
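
One way to avoid building a single huge batch, sketched under the same setup as the script above (the chunk size and local path are illustrative, not a recommendation from the authors):

import numpy as np
import torch
from transformers import FlaubertModel, FlaubertTokenizer

def get_flaubert_layer_batched(texte, path='./flau/flaubert-base-unc/', chunk_size=100):
    flaubert = FlaubertModel.from_pretrained(path)
    flaubert_tokenizer = FlaubertTokenizer.from_pretrained(path)
    flaubert.eval()

    sentences = list(texte)
    features = []
    for start in range(0, len(sentences), chunk_size):
        chunk = sentences[start:start + chunk_size]
        encoded = [flaubert_tokenizer.encode(s, add_special_tokens=True, max_length=512) for s in chunk]
        max_len = max(len(ids) for ids in encoded)
        padded = np.array([ids + [0] * (max_len - len(ids)) for ids in encoded])
        with torch.no_grad():
            # first position of the last hidden layer, as in the original script
            cls_vectors = flaubert(torch.tensor(padded))[0][:, 0, :].numpy()
        features.append(cls_vectors)
    return np.concatenate(features, axis=0)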

Continued training of FlauBERT (with --reload_model) -- Question about vocab size

Hello. :)

I would like to use the "--reload_model" option with your train.py command to further train one of your pretrained FlauBERT models.

Upon trying to run train.py with the "--reload_model" option I got an error message saying that there was a "size mismatch" between the pretrained FlauBERT model and the adapted model I was trying to train.

The error message referred to a "shape torch.Size([67542]) from checkpoint". This was for the flaubert_base_uncased model. I assume that the number 67542 is the vocabulary size of flaubert-base-uncased.

In order to use the "--reload_model" option with your pretrained FlauBERT models, do I need to ensure that the vocabulary of my training data is identical to that of the pretrained model? If so, do you think that I could manage that simply by concatenating the "vocab" file of the pretrained model with my training data?

Thank you in advance for your help!

Import error in extract_split_cls.py - No module name tools

File "extract_split_cls.py", line 14, in
from tools.clean_text import cleaner
ModuleNotFoundError: No module named 'tools'

Hello, I get this error message when trying to execute the extract_split_cls.sh script. I tried conda install tools, but that package does not seem to exist. What is the name of the package I need to install? Thank you!
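
A hedged guess rather than an official answer: 'tools' looks like a directory inside the Flaubert repository, not an installable package, so the script has to be run from the repository root (or the repository root has to be on the Python path). The path below is illustrative:

import sys
sys.path.insert(0, '/path/to/Flaubert')  # illustrative repository root
from tools.clean_text import cleaner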

Can we encode a document from command line?

Hi,

Is it possible to have a bash script that performs sentence encoding for each line of a given document in an optimised manner?

Hence, if we compute something like the following:

bash encode_doc.sh $DATA_PATH/doc.txt 

We could then get sentence embeddings for each line of the document $DATA_PATH/doc.txt.
The assumption is that each line of the document corresponds to a single sentence.

Cheers and Good job! :D

fastBPE

Hello,

Is it normal that the fastBPE folder is empty? I think that is why I get an error in prepare-data-pawsx.sh.

Datasets: EnronSent, Le Monde, PCT

Hi,

First of all, thanks a lot for this set of tools, a boon for French language modelling!

Reading your paper, I've been wondering about some dataset choices:

  • the EnronSent corpus seems to be only in English; why did you include it?
  • I guess it's probably a no-no, but is there any chance the two datasets from Le Monde and PCT might be released in some form or other?

Thanks in advance!

fast: fastBPE/fastBPE.hpp:458: void fastBPE::readCodes(const char*, std::unordered_map<std::pair<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char>

Hello, I am trying to learn BPE codes on my training set and I am getting this error:

Loading codes from /home/getalp/kelodjoe/eXP/Flaubert/data/processed/fr_corpuslabel/BPE/10k/codes ...
fast: fastBPE/fastBPE.hpp:458: void fastBPE::readCodes(const char*, std::unordered_map<std::pair<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char> >, unsigned int, fastBPE::pair_hash>&, std::unordered_map<std::__cxx11::basic_string<char>, std::pair<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char> > >&): Assertion `codes.find(pair) == codes.end()' failed.
tools/create_pretraining_data.sh: line 38: 5617 Aborted  $FASTBPE applybpe $OUT_PATH/train.$lg $DATA_DIR/$lg.train $OUT_PATH/codes
Loading codes from /home/getalp/kelodjoe/eXP/Flaubert/data/processed/fr_corpuslabel/BPE/10k/codes ...
fast: fastBPE/fastBPE.hpp:458: void fastBPE::readCodes(...): Assertion `codes.find(pair) == codes.end()' failed.
tools/create_pretraining_data.sh: line 39: 5618 Aborted  $FASTBPE applybpe $OUT_PATH/valid.$lg $DATA_DIR/$lg.valid $OUT_PATH/codes
Loading codes from /home/getalp/kelodjoe/eXP/Flaubert/data/processed/fr_corpuslabel/BPE/10k/codes ...
fast: fastBPE/fastBPE.hpp:458: void fastBPE::readCodes(...): Assertion `codes.find(pair) == codes.end()' failed.
tools/create_pretraining_data.sh: line 40: 5619 Aborted  $FASTBPE applybpe $OUT_PATH/test.$lg $DATA_DIR/$lg.test $OUT_PATH/codes
cat: /home/getalp/kelodjoe/eXP/Flaubert/data/processed/fr_corpuslabel/BPE/10k/train.fr: No such file or directory
Read 0 words (0 unique) from text file.
Traceback (most recent call last):
  File "preprocess.py", line 30, in <module>
    assert os.path.isfile(txt_path)
AssertionError
(the same preprocess.py traceback is printed twice more, for the valid and test splits)

Line 30 of preprocess.py looks like this:

[screenshot of preprocess.py, line 30: assert os.path.isfile(txt_path)]

Do you have any idea how I can resolve it?

Will lemmatization negatively affect FlauBERT word embeddings?

Hello!

I am using FlauBERT to generate word embeddings as part of a study on word sense disambiguation (WSD).

The FlauBERT tokenizer does not recognize a significant number of words in my corpus and, as a result, segments them. For example, the FlauBERT tokenizer does not recognize the archaic orthography of some verbs. It also does not recognize plural forms of a number of other words in the corpus. As a result, the tokenizer segments a number of words in the corpus, some of which are significant to my research.

I understand that I could further train FlauBERT on a corpus of eighteenth-century French in order to create a new model and a new tokenizer specifically for eighteenth-century French. However, compiling a corpus of eighteenth-century French that is large and heterogeneous enough to be useful would be challenging (and perhaps not even possible).

As an alternative to training a new model, I thought I might lemmatize my corpus before running it through FlauBERT. Stanford's Stanza NLP package (stanfordnlp.github.io/stanza/) recognizes the archaic orthography of the verbs in my corpus and turns them into the infinitive form, a form FlauBERT recognizes. Similarly, Stanza also changes the plural forms of other words into singular forms, forms FlauBERT also recognizes. Thus, if I were to lemmatize my corpus in Stanza, the FlauBERT tokenizer would then be able to recognize substantially more words in my corpus.

Would lemmatizing my corpus in this way adversely affect my FlauBERT results and a WSD analysis in particular? In general, does lemmatization have a negative effect on BERT results and WSD analyses more particularly?

Given that FlauBERT is not trained on lemmatized text, I imagine that lemmatizing the corpus would indeed negatively affect the results of the analysis. As an alternative to training FlauBERT on a corpus of eighteenth-century French (which may not be possible), could I instead train it on a corpus of lemmatized French and then use this new model for a WSD analysis on my corpus of lemmatized eighteenth-century French? Would that work?

I'm not sure if this is the right place for these sorts of questions!

Thank you in advance for your time.

Different Categories for CLS dataset for French

I am using Hugging Face datasets for the FLUE benchmark, and the CLS datasets are grouped together instead of being separated into the different categories "Books", "DVD", and "Music".

What's the best way of splitting the CLS datasets into different subsets?

Erreur de segmentation (segmentation fault): std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, float) + 0x1679

I want to train FlauBERT again but I am encountering a new error:

Traceback (most recent call last):
  File "train.py", line 391, in <module>
    main(params)
  File "train.py", line 309, in main
    trainer.mlm_step(lang1, lang2, params.lambda_mlm)
  File "/data1/home/getalp/kelodjoe/eXP/Flaubert/xlm/trainer.py", line 781, in mlm_step
    self.optimize(loss)
  File "/data1/home/getalp/kelodjoe/eXP/Flaubert/xlm/trainer.py", line 250, in optimize
    scaled_loss.backward()
  File "/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/handle.py", line 123, in scale_loss
    optimizer._post_amp_backward(loss_scaler)
  File "/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/_process_optimizer.py", line 241, in post_backward_no_master_weights
    post_backward_models_are_masters(scaler, params, stashed_grads)
  File "/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/_process_optimizer.py", line 120, in post_backward_models_are_masters
    scale_override=grads_have_scale/out_scale)
  File "/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/scaler.py", line 117, in unscale
    1./scale)
  File "/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/multi_tensor_apply/multi_tensor_apply.py", line 30, in __call__
    *args)
RuntimeError: CUDA error: invalid device function (multi_tensor_apply at csrc/multi_tensor_apply.cuh:108)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f7ea6c64193 in /home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: void multi_tensor_apply<2, ScaleFunctor<float, float>, float>(int, int, at::Tensor const&, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, ScaleFunctor<float, float>, float) + 0x183f (0x7f7ea0dd379f in /home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so)
frame #2: multi_tensor_scale_cuda(int, at::Tensor, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, float) + 0x1679 (0x7f7ea0dcff39 in /home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so)
frame #3: + 0x200cc (0x7f7ea0dc30cc in /home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so)
frame #4: + 0x1a634 (0x7f7ea0dbd634 in /home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so)

frame #54: __libc_start_main + 0xf1 (0x7f7eb244b2e1 in /lib/x86_64-linux-gnu/libc.so.6)

(the same traceback and stack frames are printed a second time before the process aborts)

Erreur de segmentation (segmentation fault)

Do you have any idea?

Questions related to the finetuning process of Flaubert on a dataset ?

I am trying to finetune the model I obtained after training. I have some questions:

  1. Firstly, the command below splits the original dataset, which is obtained with "get-data-cls.sh":

python flue/extract_split_cls.py --indir $DATA_DIR/raw/cls-acl10-unprocessed \
    --outdir $DATA_DIR/processed \
    --do_lower true \
    --use_hugging_face true

So when looking at the dataset in the raw directory, I noticed that the script "get-data-cls.sh" actually also splits the dataset:

[screenshot: contents of the raw data directory after get-data-cls.sh]

  2. Since I already have my own data, should I write a new script that splits my data to reproduce the output of the first script, and then use the results as input for the second command "extract_split_cls.py" to obtain the actual data to use for finetuning?

  3. Since I have different data, I suppose I should modify the script extract_split_cls.py for my own use, or is that not necessary?

Output for the second script "extract_split_cls.py"

[screenshot: output files of extract_split_cls.py]

  4. There is also a "parse.py" script. At which stage is it used in the overall flow?


  5. About the finetuning, what is the purpose of the config file? How is it obtained?
    =>
    config='flue/examples/cls_books_lr5e6_hf_base_uncased.cfg'
    source $config

Enronsent usage

Hi,

I am trying to reproduce the dataset you used for FlauBERT.
How did you use EnronSent, which seems to be in English?

Thank you in advance.

Pre-training data for FlauBERT

Hi all and thanks for open-sourcing Flaubert :)

I'm going through the README, trying to find out which dataset FlauBERT was trained on, but I'm not able to find the information. There's a mention of Gutenberg data in the pre-processing example: is that the data you used?

Thanks a lot in advance!

tweet sentiment analysis in french

Hi,

Hope you are all well !

Is it possible with Flaubert to do sentiment analysis of tweets written in French? If so, how can we do that?

Vive la France ! :-)

Cheers,
X
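
There is no tweet-specific recipe in this repository, but below is a hedged sketch of the usual approach with the transformers sequence-classification head (the model id, labels and sentences are illustrative; a labelled French tweet dataset and a proper fine-tuning loop are still required):

import torch
from transformers import FlaubertForSequenceClassification, FlaubertTokenizer

modelname = 'flaubert/flaubert_base_cased'
tokenizer = FlaubertTokenizer.from_pretrained(modelname, do_lowercase=False)
model = FlaubertForSequenceClassification.from_pretrained(modelname, num_labels=2)

batch = tokenizer(["J'adore ce film !", "Quel service catastrophique..."],
                  padding=True, return_tensors='pt')
labels = torch.tensor([1, 0])  # 1 = positive, 0 = negative (illustrative)
outputs = model(**batch, labels=labels)
print(outputs[0])  # training loss to backpropagate in a fine-tuning loop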

Import Error

Hi, I have tried to download Flaubert as described:

import torch
from transformers import FlaubertModel, FlaubertTokenizer

Unfortunately it returns an ImportError:

  from transformers import FlaubertModel, FlaubertTokenizer
ImportError: cannot import name 'FlaubertModel'
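
A hedged first check (the exact minimum version is an assumption): the Flaubert classes only exist in sufficiently recent transformers releases (roughly 2.4.0 and later), so an older install raises exactly this ImportError.

import transformers
print(transformers.__version__)
# If the version is too old: pip install --upgrade transformers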

RuntimeError: The size of tensor a (68729) must match the size of tensor b (13681) at non-singleton dimension 0

Hello, I tried to retrain Flaubert from the last checkpoint on new data, but it seems the size of my input does not correspond to what the system expects. Do you know perhaps how I can solve it?

Traceback (most recent call last):
  File "train.py", line 391, in <module>
    main(params)
  File "train.py", line 309, in main
    trainer.mlm_step(lang1, lang2, params.lambda_mlm)
  File "/home/getalp/kelodjoe/eXP/Flaubert/xlm/trainer.py", line 786, in mlm_step
    self.optimize(loss)
  File "/home/getalp/kelodjoe/eXP/Flaubert/xlm/trainer.py", line 261, in optimize
    optimizer.step()
  File "/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/_initialize.py", line 242, in new_step
    output = old_step(*args, **kwargs)
  File "/home/getalp/kelodjoe/eXP/Flaubert/xlm/optim.py", line 136, in step
    super().step(closure)
  File "/home/getalp/kelodjoe/eXP/Flaubert/xlm/optim.py", line 72, in step
    exp_avg.mul_(beta1).add_(1 - beta1, grad)
  File "/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/wrap.py", line 101, in wrapper
    return orig_fn(arg0, *args, **kwargs)
RuntimeError: The size of tensor a (68729) must match the size of tensor b (13681) at non-singleton dimension 0

Another question: I would like to use the system for the classification task in the FLUE benchmark, but it is not clear from your page where that is. I understand the finetuning part, but after that, how are we supposed to test the finetuned model? Should we write our own code, or is code for the classification task available?
Best regards,

Using flauBERT for similarities between sentences

Hi,

My goal here is to do clustering on sentences. For this purpose, I chose to use similarities between sentence embeddings for all my sentences. Unfortunately, camemBERT seems not great for that task, and fine-tuning flauBERT could be a solution.

So thanks to @formiel, I managed to fine-tune flauBERT on an NLI dataset.
My question is about that fine-tuning. What exactly is the output? I only got a few files in the dump_path:

  • train.log ==> logs of the training
  • params.pkl ==> parameters of the training
  • test.pred.0 ==> prediction of the test dataset after first epoch
  • valid.pred.0 ==> valid classification of the test dataset after first epoch
  • test.pred.1 ==> etc

I wonder if, after fine-tuning flauBERT, I could use it to produce new sentence embeddings (as with flauBERT before fine-tuning). So where is the new flauBERT model trained on the NLI dataset? And how do I use it to make embeddings?

Thanks in advance
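
Not an answer about the dump_path contents, but a hedged sketch of how a FlauBERT checkpoint (pretrained, or a fine-tuned one saved with config.json and pytorch_model.bin) can be turned into sentence embeddings for clustering, using simple mean pooling:

import torch
from transformers import FlaubertModel, FlaubertTokenizer

modelname = 'flaubert/flaubert_base_cased'  # or the directory of a fine-tuned checkpoint
tokenizer = FlaubertTokenizer.from_pretrained(modelname, do_lowercase=False)
model = FlaubertModel.from_pretrained(modelname)
model.eval()

def sentence_embedding(sentence):
    token_ids = torch.tensor([tokenizer.encode(sentence)])
    with torch.no_grad():
        last_layer = model(token_ids)[0]      # (1, seq_len, 768)
    return last_layer.mean(dim=1).squeeze(0)  # average over tokens

a = sentence_embedding('Le chat mange une pomme.')
b = sentence_embedding('Un chat dévore une pomme.')
print(torch.nn.functional.cosine_similarity(a, b, dim=0))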

How can I train flaubert on a different corpus (not gutenberg, wiki) but for another domain ?

Good afternoon,

I tried to follow your instructions to train Flaubert on my own corpus in order to get a model to use for my classification task, but I am having trouble understanding the procedure.

You said we should use this line to train on our preprocessed data:

/Flaubert$ python train.py --exp_name flaubert_base_lower --dump_path ./dumped/ --data_path ./own_data/data/ --lgs 'fr' --clm_steps '' --mlm_steps 'fr' --emb_dim 768 --n_layers 12 --n_heads 12 --dropout 0.1 --attention_dropout 0.1 --gelu_activation true --batch_size 16 --bptt 512 --optimizer "adam_inverse_sqrt,lr=0.0006,warmup_updates=24000,beta1=0.9,beta2=0.98,weight_decay=0.01,eps=0.000001" --epoch_size 300000 --max_epoch 100000 --validation_metrics _valid_fr_mlm_ppl --stopping_criterion _valid_fr_mlm_ppl,20 --fp16 true --accumulate_gradients 16 --word_mask_keep_rand '0.8,0.1,0.1' --word_pred '0.15'

I tried it after cloning flaubert and installing all the necessary libraries, but I am getting this error:

FAISS library was not found.
FAISS not available. Switching to standard nearest neighbors search implementation.
./own_data/data/train.fr.pth not found
./own_data/data/valid.fr.pth not found
./own_data/data/test.fr.pth not found
Traceback (most recent call last):
  File "train.py", line 387, in <module>
    check_data_params(params)
  File "/ho/ge/ke/eXP/Flaubert/xlm/data/loader.py", line 302, in check_data_params
    assert all([all([os.path.isfile(p) for p in paths.values()]) for paths in params.mono_dataset.values()])
AssertionError

Does this mean I have to split my own data into three corpora after preprocessing it (train, valid and test)? Should I run your preprocessing script on my data before executing the command?

Finetuning on FLUE

Hi !
I would like to finetune Flaubert on the FLUE tasks with the Hugging Face library. I downloaded the PAWS data and used the code you provided on your GitHub repo, but I get this error message I can't get past:
[screenshot of the error message]

Any idea of what to do?

Thanks for this project by the way, I'm looking forward to using it!

Have a good day,
Lisa

bug in run_flue.py

Hi, I got this error when running run_flue.py:
from transformers import flue_compute_metrics as compute_metrics
ImportError: cannot import name 'flue_compute_metrics'

I have already installed the requirements and updated the transformers directory.
