Multi-Domain Expert Learning
License: Apache License 2.0
When using the trainer script, the last HF Hub update of the model repo fails with the message:
remote: Sorry, your push was rejected during YAML metadata verification:
remote: - Error: "model-index[0].results[0].dataset.config" must be a string
remote: Please find the documentation at:
remote: https://huggingface.co/docs/hub/model-cards#model-card-metadata
To https://huggingface.co/Multi-Domain-Expert-Layers/expert-uspto
 ! [remote rejected] main -> main (pre-receive hook declined)
error: failed to push some refs to 'https://huggingface.co/Multi-Domain-Expert-Layers/expert-uspto'
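The error means the `dataset.config` field inside the model card's `model-index` metadata is some non-string value (e.g. a number or null). A minimal sketch of normalizing the already-parsed front-matter dict before pushing (the function name and dict shapes here mirror the Hub's model-index schema; the helper itself is hypothetical):

```python
def fix_model_index(meta: dict) -> dict:
    """Coerce model-index[*].results[*].dataset.config to a string,
    which the Hub's YAML metadata verification requires."""
    for entry in meta.get("model-index", []):
        for result in entry.get("results", []):
            dataset = result.get("dataset", {})
            if "config" in dataset and not isinstance(dataset["config"], str):
                dataset["config"] = str(dataset["config"])
    return meta
```

Running this over the card metadata (and re-serializing the YAML front matter) before the final push should let the pre-receive hook pass.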
It's currently difficult to work on the box because it has a single user with a single environment, which frequently gets broken.
To solve this, we should either create separate user accounts or install Docker.
When training an expert model on a mix of domain and pile data, we would like to log the validation loss for the domain data and the pile data separately.
The goal is to do a perplexity calculation on a few models:
The model in (1) can be created using the script in this PR. The list of experts is:
The models in (2, 3) should be prepared in the following issue.
The evaluation should be done on the evaluation fold of each expert's dataset, but excluding the Pile part of it. The datasets are at MDEL HF. The calculation of the perplexity can be done with the Hugging Face Trainer's evaluate() method (see example here).
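Since Trainer.evaluate() returns the mean cross-entropy loss over the eval set, perplexity is just its exponential. A minimal sketch (the helper name is ours; the `eval_loss` key is what Trainer.evaluate() returns by default):

```python
import math

def perplexity_from_eval(metrics: dict) -> float:
    """Convert the mean eval loss returned by Trainer.evaluate()
    into perplexity: ppl = exp(mean cross-entropy loss).

    Usage sketch (names illustrative):
        metrics = trainer.evaluate(eval_dataset=domain_eval_split)
        ppl = perplexity_from_eval(metrics)
    """
    return math.exp(metrics["eval_loss"])
```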
The deliverables of this issue should be:
Train 3B experts again, with the following variations:
data size: 500K, 1M, 10M, and, if there are enough examples, 50M (basically max data)
Both should train layers 16-20 and 16-25. Save checkpoints at 50k and 100k steps.
Domains to train:
We need a way to create a config file for each dataset that is being uploaded via the upload script, so that the trainer will track the metrics split by domain data/pile data.
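One way to do this is to have the upload script emit a small per-dataset config recording which eval splits are domain data and which are pile data, so the trainer can log the two losses separately. A sketch under an entirely hypothetical schema (the field names and split names are illustrative, not an existing MDEL convention):

```python
import json

def make_dataset_config(name: str, domain_split: str, pile_split: str) -> str:
    """Emit a per-dataset config (hypothetical schema) that the trainer
    could read to log domain and pile validation losses separately."""
    config = {
        "dataset_name": name,
        "eval_splits": {
            "domain": domain_split,  # e.g. "validation_domain"
            "pile": pile_split,      # e.g. "validation_pile"
        },
    }
    return json.dumps(config, indent=2)
```

The upload script would write this JSON next to the dataset, and the trainer would look it up by dataset name at startup.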
Investigate why the training occasionally crashes.
80B tokens each of language data from vi, en, fi, hi, ja. For some of these languages we won't have enough data, so we will need to do multiple passes on the data.
And we will do about 80B tokens of code, including multilingual code already gathered by
@Taishi Nakamura
That leaves 20B tokens, which we can reserve for instruction data, math, science, etc. - basically our expert data.
Integrate MDEL with various evaluation frameworks
{
"pipe-parallel-size": 1,
"model-parallel-size": 1,
"num-layers": 16,
"hidden-size": 2048,
"num-attention-heads": 8,
"seq-length": 2048,
"max-position-embeddings": 2048,
"pos-emb": "rotary",
"rotary-pct": 0.25,
"no-weight-tying": true,
"gpt-j-residual": true,
"output-layer-parallelism": "column",
"scaled-upper-triang-masked-softmax-fusion": false,
"bias-gelu-fusion": false,
"init_method": "small_init",
"output_layer_init_method": "wang_init",
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.00025,
"betas": [0.9, 0.95],
"eps": 1.0e-8
}
},
"min_lr": 0.000025,
"zero_optimization": {
"stage": 0,
"allgather_partitions": true,
"allgather_bucket_size": 500000000,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 500000000,
"contiguous_gradients": true,
"cpu_offload": false
},
"fp16": {
"enabled": true,
"type": "bfloat16",
"auto_cast": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 12,
"hysteresis": 2,
"min_loss_scale": 1
},
"fp32_allreduce": true,
"train_micro_batch_size_per_gpu": 4,
"gradient-accumulation-steps": 4,
"data-path": "data/debug_text_document",
"data-impl": "mmap",
"num_workers": 1,
"checkpoint-activations": true,
"checkpoint-num-layers": 1,
"partition-activations": true,
"synchronize-each-layer": true,
"gradient_clipping": 1.0,
"weight-decay": 0.1,
"hidden-dropout": 0,
"attention-dropout": 0,
"train-iters": 143000,
"lr-decay-iters": 143000,
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"checkpoint-factor": 1000,
"extra-save-iters": [0,1,2,4,8,16,32,64,128,256,512],
"eval-interval": 143000,
"eval-iters": 10,
"log-interval": 10,
"steps_per_print": 10,
"wall_clock_breakdown": true,
"tokenizer-type": "HFGPT2Tokenizer"
}
I tried this config but I get the following error:
Traceback (most recent call last):
  File "train.py", line 27, in <module>
    pretrain(neox_args=neox_args)
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/training.py", line 226, in pretrain
    iteration = train(
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/training.py", line 782, in train
    loss_dict, skipped_iter = train_step(
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/training.py", line 688, in train_step
    reduced_loss = train_step_pipe(
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/training.py", line 738, in train_step_pipe
    loss = model.train_batch(data_iter=data_iterator)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/runtime/pipe/engine.py", line 346, in train_batch
    self._exec_schedule(sched)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/runtime/pipe/engine.py", line 1376, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/runtime/pipe/engine.py", line 658, in _exec_forward_pass
    outputs = super().forward(inputs)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/runtime/engine.py", line 1842, in forward
    loss = self.module(*inputs, **kwargs)
  File "/dccstor/mayankgpfs/conda/envs/laion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/runtime/pipe/module.py", line 364, in forward
    x = exec_range_func(start_idx, end_idx)(*x)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/runtime/pipe/module.py", line 337, in exec_func
    inputs = layer(inputs)
  File "/dccstor/mayankgpfs/conda/envs/laion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/model/word_embeddings.py", line 181, in forward
    embeddings = super().forward(input_ids, position_ids)
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/model/word_embeddings.py", line 136, in forward
    words_embeddings = self.word_embeddings(input_ids)
  File "/dccstor/mayankgpfs/conda/envs/laion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/mpu/layers.py", line 196, in forward
    output_parallel = F.embedding(
  File "/dccstor/mayankgpfs/conda/envs/laion/lib/python3.8/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.HalfTensor instead (while checking arguments for embedding)
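A plausible cause (an assumption, not verified against this setup) is the fp16 auto-cast converting the pipeline's inputs, including the token ids, to half precision. F.embedding only accepts integer indices, so the ids must stay (or be cast back to) an integer dtype before the embedding lookup. A minimal CPU reproduction of the error and the fix:

```python
import torch

# Illustrative reproduction: embedding lookups require Long/Int indices.
emb = torch.nn.Embedding(10, 4)

bad_ids = torch.tensor([1, 2, 3]).half()  # what an over-eager auto_cast produces
good_ids = bad_ids.long()                 # restore the integer dtype

out = emb(good_ids)  # works; emb(bad_ids) raises the RuntimeError above
```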
We need to evaluate the merged experts against a 1B Pythia model trained on all the data together.
To keep it fair, we would need to get the exact same 8000 random train examples for each of the 7 datasets we used in the other experiments. Then we merge the 6 experts with basic averaging and run the same eval from the 7 datasets on that model.
This will give us a comparison of:
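To guarantee every run sees the exact same 8000 examples, the selection should be driven by a fixed seed rather than ad-hoc shuffling. A minimal sketch (the helper is ours; with `datasets`, the equivalent would be a seeded shuffle-and-select):

```python
import random

def sample_indices(dataset_size: int, k: int = 8000, seed: int = 42) -> list:
    """Reproducibly pick the same k train-example indices on every run,
    so the merged model and each expert are evaluated on identical data."""
    rng = random.Random(seed)  # local RNG: independent of global random state
    return rng.sample(range(dataset_size), k)
```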
If most training scripts are homogeneous except for the data_path args/config (I assume they are, as they started from the same seed LM), then we could write a script that trains the expert models sequentially and keeps the GPUs busy.
We would like to create a script for creating a merged model by averaging expert weights.
The script would take as input:
The averaged model would be uploaded to the MDEL HF repo. Its model card should contain the names of the experts it was created from.
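The core of such a script is a uniform average over the experts' state dicts. A minimal sketch (assumes all checkpoints share identical keys and shapes, which holds if they were branched from the same seed LM):

```python
import torch

def average_state_dicts(state_dicts):
    """Uniformly average a list of expert state dicts, key by key.
    Assumes identical keys and tensor shapes across all experts."""
    avg = {}
    for key in state_dicts[0]:
        # Stack each expert's tensor and take the element-wise mean.
        stacked = torch.stack([sd[key].float() for sd in state_dicts])
        avg[key] = stacked.mean(dim=0)
    return avg
```

The resulting dict can be loaded into a fresh copy of the base architecture with `load_state_dict` and then pushed to the Hub.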
Our analysis in #53 has shown that the expert models we had previously trained actually have a higher perplexity than the base model.
Here are some issues that may have caused this:
The expert models were trained with an old version of the trainer, so we don't know which wandb run they belong to or what the pile/domain data losses were during training. Re-doing the training of one of the experts should help clarify.
Train a 1b on the 300K minipile instructions? Train 5 variants:
There is an effort to train LLAMA 1.5b on SUMMIT and we would like to train a few MDEL experts based on this model to test on SUMMIT.
The LLAMA is being trained using a DeeperSpeed version that works with SUMMIT:
https://github.com/EleutherAI/DeeperSpeed/tree/v2.0-summit
The purpose of this ticket is to create a minimal script or a step-by-step guide to start a training run of a GPT-NeoX model on SUMMIT.
Guidelines from Sampo:
A potential way of implementing this flexibly is to mount the fork as a submodule in the MDEL repo. Example here.
We would like to create a script for creating a merged model by using the C-BTM method.
The script would take as input:
List of experts models from the [MDEL HF repo](https://huggingface.co/Multi-Domain-Expert-Layers).
Name of the output model
The averaged model would be uploaded to the MDEL HF repo. Its model card should contain the names of the experts it was created from.
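Note that C-BTM proper ensembles expert *outputs* weighted by per-domain relevance at inference time; as a simplified parameter-space approximation, the script could merge experts with per-expert weights instead of a uniform mean. A hedged sketch (the weighting scheme and helper name are illustrative, not the C-BTM algorithm itself):

```python
import torch

def weighted_average_state_dicts(state_dicts, weights):
    """Merge expert state dicts with normalized per-expert weights
    (e.g. domain-relevance scores), instead of a uniform average."""
    total = sum(weights)
    norm = [w / total for w in weights]  # weights sum to 1 after this
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float()
                          for w, sd in zip(norm, state_dicts))
    return merged
```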