
bertin-t5x's Introduction

bertin-t5x

BERTIN Project T5X training files

  1. Clone this repo and cd into it
  2. Clone https://github.com/google-research/t5x inside it under the name t5x_repo and install it in editable mode
  3. Symlink t5x_repo/t5x to t5x in the cloned folder of this repo
  4. Install the dependencies: JAX for TPU and seqio (the latter from its repository)
  5. Run run.sh

Lists of checkpoints can be found:

If you encounter segmentation faults when writing checkpoints to the buckets, the reason might be tensorstore version 0.1.18. As a temporary fix, try pinning tensorstore to version 0.1.14 (pip install tensorstore==0.1.14).

bertin-t5x's People

Contributors

versae

bertin-t5x's Issues

How do I add the 100 extra_ids to my own tokenizer so that I can load it with extra_ids=0?

@versae I've seen that in the mT5 tokenizer, the 100 extra IDs needed for the sentinel tokens already come included in the tokenizer:
'gs://t5-data/vocabs/mc4.250000.100extra/sentencepiece.model'

How do I train my own SentencePiece tokenizer with the 100 extra IDs already included, just like the original one above?

I am training a variant of mT5 that covers languages the common mT5 does not cover. I went on to train a unigram SentencePiece tokenizer with a vocab_size of 250000, but I got confused about how to get the 100 extra_ids, similar to the official mT5 tokenizer.

How does the t5x pretraining script know how to load the appropriate "train" and "validation" HF dataset splits?

@versae Hey there, I am also pretraining T5_1_1 base using a custom generated dataset hosted on Hugging Face. Your repo has been a huge help in making HF datasets compatible.

After making a task.py just like yours, I have started pretraining a base T5_1_1, but I have a doubt: how does the t5x training script know how to load the train and validation subsets?

Also, one weird thing I noticed is that training is surprisingly fast. My dataset is 20GB with 60M+ samples, running on 2 x A6000 (96GB of VRAM in total) with 64GB of RAM, and after 2 hours training had already completed 10k steps, which is confusing for a model of that size.

It crossed my mind that the model might be training on the "validation" samples rather than the actual training data.

How to choose the optimal number of training steps, given a custom training dataset?

@versae I have a doubt about how many training steps to train the model for, given a custom training dataset.

Currently I am training T5_1_1 on Hindi, with a dataset of 20GB and 60M+ samples, but when training for 500k steps at a batch_size of 64, the trainer says it is training for 250 epochs.

(I am not sure how the trainer's math for estimating epochs works; with 60M+ samples at batch_size 64, how could it reach 250 epochs in 500k steps?)
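The arithmetic can be checked directly: at batch size 64, 500k steps covers about 32M examples, roughly half of a 60M-sample dataset, so a report of 250 epochs implies the trainer sees a much smaller effective dataset (numbers below are taken from the question; the "implied" figure is just the inverse calculation):

```python
# Numbers from the question above.
steps = 500_000
batch_size = 64
dataset_examples = 60_000_000

examples_seen = steps * batch_size          # 32,000,000 examples
epochs = examples_seen / dataset_examples   # ~0.53 epochs, not 250

# Conversely, 250 reported epochs would imply this effective dataset size:
implied_examples_per_epoch = examples_seen // 250   # 128,000 examples
```

An implied ~128k examples per epoch is a strong hint to check which split the trainer is actually reading. Note also that after span-corruption preprocessing and packing, "examples" are fixed-length sequences rather than raw documents, so the trainer's count need not match the raw sample count.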

Could you please tell me how many epochs it is ideal and recommended to train the model for? (I have seen people in the t5x repo reporting bad downstream-task performance after the model was trained for too long.)
