
bertin-t5x's Introduction

bertin-t5x

BERTIN Project T5X training files

  1. Clone this repo and cd into it
  2. Clone https://github.com/google-research/t5x inside it under the name t5x_repo and install it in editable mode
  3. Symlink t5x_repo/t5x to t5x in the cloned folder of this repo
  4. Install the dependencies: JAX for TPU and seqio (the latter from its repository)
  5. Run run.sh

Lists of checkpoints can be found:

If you encounter segmentation faults when writing checkpoints to the buckets, the reason might be tensorstore version 0.1.18. As a temporary fix, try pinning tensorstore to version 0.1.14 (pip install tensorstore==0.1.14).

bertin-t5x's People

Contributors

versae

bertin-t5x's Issues

How do I add the 100 extra_ids to my own tokenizer so that I can load it with extra_ids=0?

@versae I've seen that in the mT5 tokenizer, the 100 extra IDs needed for the sentinel tokens already come included in the tokenizer:
'gs://t5-data/vocabs/mc4.250000.100extra/sentencepiece.model'

How do I train my own SentencePiece tokenizer with the 100 extra IDs already included, just like the original one above?

I am training a variant of mT5 that covers languages the common mT5 does not cover. I went on to train a unigram SentencePiece tokenizer with a vocab_size of 250000, but I got confused about how to get the 100 extra_ids, similar to the official mT5 tokenizer.

How does the t5x pretraining script know how to load the appropriate "train" and "validation" HF dataset splits?

@versae Hey there, I am also pretraining T5_1_1 base using a custom generated dataset hosted on Hugging Face. Your repo has been a huge help in making HF datasets compatible.

After making a task.py just like yours, I have started pretraining a base T5_1_1, but I have a doubt: how does the t5x training script know how to load the train and validation subsets?

Also, one weird thing I noticed is that training is surprisingly fast. My dataset is 20GB with 60M+ samples, running on 2 x A6000 (96GB of VRAM in total) with 64GB of RAM, and after 2 hours training had already completed 10k steps, which is confusing for a model of that size.

It crossed my mind that the model might be training on the "validation" samples rather than the actual training data.

How to choose the optimal number of training steps, given a custom training dataset?

@versae I have a doubt about how many training steps to train the model for, given a custom training dataset.

Currently I am training T5_1_1 on Hindi, with a dataset of 20GB and 60M+ samples, but when training for 500k steps at a batch_size of 64, the trainer says it is training for 250 epochs.

(I am not sure how the trainer's math for estimating epochs works; with 60M+ samples at batch_size 64, how could it reach 250 epochs in 500k steps?)
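The arithmetic can be checked directly: at batch size 64, 500k steps covers about 32M examples, roughly half of a 60M-sample dataset, so a report of 250 epochs implies the trainer sees a much smaller effective dataset (numbers below are taken from the question; the "implied" figure is just the inverse calculation):

```python
# Numbers from the question above.
steps = 500_000
batch_size = 64
dataset_examples = 60_000_000

examples_seen = steps * batch_size          # 32,000,000 examples
epochs = examples_seen / dataset_examples   # ~0.53 epochs, not 250

# Conversely, 250 reported epochs would imply this effective dataset size:
implied_examples_per_epoch = examples_seen // 250   # 128,000 examples
```

An implied ~128k examples per epoch is a strong hint to check which split the trainer is actually reading. Note also that after span-corruption preprocessing and packing, "examples" are fixed-length sequences rather than raw documents, so the trainer's count need not match the raw sample count.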

Could you please tell me how many epochs it is ideal and recommended to train the model for? (I have seen people in the t5x repo reporting bad downstream-task performance after the model was trained for too long.)
