
mdel's People

Contributors

bentherien, digitous, henk717, huu4ontocord, jordiclive, mrcabbage972, mrseeker, nicolomonti, nourfahmy, p1ayer-1, stillerman, thedarkzeno, vmay-chegg


mdel's Issues

Set up separate environments on the Redmond.ai box

It's currently difficult to work on the box because it has a single user with a single environment, which frequently gets broken.

To solve it, we should either create separate users or install docker.

Create minimal example of training on LUMI

Guidelines from Sampo:

  1. Single pure torch (no custom CUDA kernels) codebase
  2. Start with small model (< 10B params)
  3. Use this fork of Megatron-DeepSpeed

A potential way of implementing this flexibly is to add the fork as a git submodule in the MDEL repo. Example here.

The purpose of this ticket is to create a minimal script or a step-by-step guide for starting a training run of a GPT-NeoX model on LUMI.
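A minimal sketch of what guideline 1 could look like in practice: a single-file, pure-PyTorch causal LM training loop with no custom kernels. The model, sizes, and random data below are placeholders, not the actual LUMI setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyGPT(nn.Module):
    """Toy decoder-only LM built from stock PyTorch modules only (no fused kernels)."""
    def __init__(self, vocab=50304, d_model=512, n_layers=8, n_heads=8, seq_len=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(seq_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab, bias=False)

    def forward(self, ids):
        pos = torch.arange(ids.size(1), device=ids.device)
        h = self.embed(ids) + self.pos(pos)
        causal = nn.Transformer.generate_square_subsequent_mask(ids.size(1)).to(ids.device)
        return self.lm_head(self.blocks(h, mask=causal))

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyGPT().to(device)
opt = torch.optim.AdamW(model.parameters(), lr=2.5e-4, betas=(0.9, 0.95))

for step in range(10):                                   # replace with a real dataloader
    ids = torch.randint(0, 50304, (4, 128), device=device)
    logits = model(ids[:, :-1])                          # predict token t+1 from tokens <= t
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), ids[:, 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
```

This can later be wrapped in DDP / Slurm on LUMI without touching the model code, which is the point of keeping it pure torch.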

input_ids cast to fp16 in DeeperSpeed bug

{
  "pipe-parallel-size": 1,
  "model-parallel-size": 1,

  "num-layers": 16,
  "hidden-size": 2048,
  "num-attention-heads": 8,
  "seq-length": 2048,
  "max-position-embeddings": 2048,
  "pos-emb": "rotary",
  "rotary-pct": 0.25,
  "no-weight-tying": true,
  "gpt-j-residual": true,
  "output-layer-parallelism": "column",
  
  "scaled-upper-triang-masked-softmax-fusion": false,
  "bias-gelu-fusion": false,

  "init_method": "small_init",
  "output_layer_init_method": "wang_init",

  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00025,
      "betas": [0.9, 0.95],
      "eps": 1.0e-8
    }
  },
  "min_lr": 0.000025,

  "zero_optimization": {
    "stage": 0,
    "allgather_partitions": true,
    "allgather_bucket_size": 500000000,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": true,
    "cpu_offload": false
  }, 

  "fp16": {
    "enabled": true,
    "type": "bfloat16",
    "auto_cast": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 12,
    "hysteresis": 2,
    "min_loss_scale": 1
  }, 

  "fp32_allreduce": true,

  "train_micro_batch_size_per_gpu": 4,
  "gradient-accumulation-steps": 4,

  "data-path": "data/debug_text_document",
  "data-impl": "mmap",
  "num_workers": 1,

  "checkpoint-activations": true,
  "checkpoint-num-layers": 1,
  "partition-activations": true,
  "synchronize-each-layer": true,

  "gradient_clipping": 1.0,
  "weight-decay": 0.1,
  "hidden-dropout": 0,
  "attention-dropout": 0,

  "train-iters": 143000,
  "lr-decay-iters": 143000,
  "distributed-backend": "nccl",
  "lr-decay-style": "cosine",
  "warmup": 0.01,
  "checkpoint-factor": 1000,
  "extra-save-iters": [0,1,2,4,8,16,32,64,128,256,512],
  "eval-interval": 143000,
  "eval-iters": 10,

  "log-interval": 10,
  "steps_per_print": 10,
  "wall_clock_breakdown": true,

  "tokenizer-type": "HFGPT2Tokenizer"
}

Tried this config but I see the error:

Traceback (most recent call last):
  File "train.py", line 27, in <module>
    pretrain(neox_args=neox_args)
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/training.py", line 226, in pretrain
    iteration = train(
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/training.py", line 782, in train
    loss_dict, skipped_iter = train_step(
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/training.py", line 688, in train_step
    reduced_loss = train_step_pipe(
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/training.py", line 738, in train_step_pipe
    loss = model.train_batch(data_iter=data_iterator)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/runtime/pipe/engine.py", line 346, in train_batch
    self._exec_schedule(sched)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/runtime/pipe/engine.py", line 1376, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/runtime/pipe/engine.py", line 658, in _exec_forward_pass
    outputs = super().forward(inputs)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/runtime/engine.py", line 1842, in forward
    loss = self.module(*inputs, **kwargs)
  File "/dccstor/mayankgpfs/conda/envs/laion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/runtime/pipe/module.py", line 364, in forward
    x = exec_range_func(start_idx, end_idx)(*x)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/runtime/pipe/module.py", line 337, in exec_func
    inputs = layer(inputs)
  File "/dccstor/mayankgpfs/conda/envs/laion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/model/word_embeddings.py", line 181, in forward
    embeddings = super().forward(input_ids, position_ids)
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/model/word_embeddings.py", line 136, in forward
    words_embeddings = self.word_embeddings(input_ids)
  File "/dccstor/mayankgpfs/conda/envs/laion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/mpu/layers.py", line 196, in forward
    output_parallel = F.embedding(
  File "/dccstor/mayankgpfs/conda/envs/laion/lib/python3.8/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.HalfTensor instead (while checking arguments for embedding)
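One plausible reading of this (an assumption, not a confirmed diagnosis) is that the fp16 / auto_cast settings cast the whole pipeline input tuple, including the integer input_ids, to half precision before it reaches the embedding. A standalone sketch of the failure mode and a defensive guard:

```python
import torch
import torch.nn.functional as F

# Standalone reproduction of the dtype error (a hypothetical sketch, not the actual
# gpt-neox code path): F.embedding requires Long/Int indices, so casting the pipeline
# inputs to half precision breaks the word-embedding lookup.
vocab, hidden = 50304, 2048
weight = torch.randn(vocab, hidden)
input_ids = torch.randint(0, vocab, (4, 128))

F.embedding(input_ids, weight)                 # fine: indices are int64
try:
    F.embedding(input_ids.half(), weight)      # same failure mode as the traceback above
except RuntimeError as err:
    print(err)

# A defensive guard (assuming the upstream cast cannot simply be disabled) restores an
# integer dtype before the lookup. Note the clamp: large vocab ids do not survive an
# fp16 round-trip exactly, so preventing the cast in the first place is the cleaner fix.
ids = input_ids.half()
if ids.dtype not in (torch.long, torch.int):
    ids = ids.long().clamp_(0, vocab - 1)
out = F.embedding(ids, weight)
```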

Automatic Training Scripts for All Expert Models

If most of the training setup is identical across experts except for the data_path args/config (I assume it is, since they all started from the same seed LM), then we could write a script that trains the expert models sequentially and keeps the GPUs busy.
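A minimal sketch of such a driver, assuming a shared config where only the data path and output directory change per expert; the entry point, flags, and domain names are placeholders:

```python
import subprocess

DOMAINS = ["arxiv", "github", "pubmed_central", "uspto", "freelaw"]  # assumed names

for domain in DOMAINS:
    subprocess.run(
        [
            "python", "train.py",                      # assumed entry point
            "--config", "configs/expert_base.yml",     # shared hyperparameters
            "--data_path", f"data/{domain}",           # the only per-expert difference
            "--output_dir", f"checkpoints/expert-{domain}",
        ],
        check=True,   # stop the queue if one run fails, rather than burning GPU time
    )
```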

Train 2nd batch of expert models

Train 3B experts again, with the following variations:

Data size: 500K, 1M, 10M, and, if there are enough examples, 50M (basically the max data).

Data mix (see the interleaving sketch at the end of this issue):

  1. 80% pile / 20% expert
  2. Expert only

Both should be trained with layers 16-20 and with layers 16-25. Save checkpoints at 50k and 100k steps.

Domains to train:

  1. ArXiv
  2. Github
  3. Pubmed Central
  4. Pubmed Abstracts
  5. Philpapers
  6. Freelaw
  7. USPTO
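For the 80% pile / 20% expert mix, a minimal sketch of interleaving the two streams by sampling probability rather than concatenating them, so the ratio holds at any data size (dataset names are assumptions):

```python
from datasets import load_dataset, interleave_datasets

# Assumed dataset repos; streaming keeps the 50M-example variants out of RAM.
pile = load_dataset("Multi-Domain-Expert-Layers/pile_subset", split="train", streaming=True)
expert = load_dataset("Multi-Domain-Expert-Layers/arxiv", split="train", streaming=True)

# Sample from the two streams with 80/20 probability and a fixed seed for reproducibility.
mixed = interleave_datasets([pile, expert], probabilities=[0.8, 0.2], seed=42)
```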

Add script for merging expert models via weight averaging

We would like to create a script for creating a merged model by averaging expert weights.

The script would take as input:

  1. List of expert models from the MDEL HF repo.
  2. Name of the output model

The averaged model would be uploaded to the MDEL HF repo. Its model card should list the experts it was created from.
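A minimal sketch of what such an averaging script could look like, assuming all experts share the same architecture; the expert and output repo names are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM

def average_experts(expert_repos, output_repo):
    # Load every expert's parameters (assumes they all share the same architecture).
    state_dicts = [AutoModelForCausalLM.from_pretrained(r).state_dict() for r in expert_repos]

    # Element-wise mean of each parameter across the experts.
    avg_state = {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }

    merged = AutoModelForCausalLM.from_pretrained(expert_repos[0])
    merged.load_state_dict(avg_state)
    merged.push_to_hub(output_repo)  # model card listing the source experts still needs writing

average_experts(
    ["Multi-Domain-Expert-Layers/expert-arxiv",      # assumed expert repo names
     "Multi-Domain-Expert-Layers/expert-uspto"],
    "Multi-Domain-Expert-Layers/expert-merged-avg",  # assumed output name
)
```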

Expert merging: c-BTM

We would like to create a script that produces a merged model using the c-BTM method.

The script would take as input:

  1. List of expert models from the [MDEL HF repo](https://huggingface.co/Multi-Domain-Expert-Layers).
  2. Name of the output model

The merged model would be uploaded to the MDEL HF repo. Its model card should list the experts it was created from.
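A rough sketch of c-BTM-style merging as I understand the method (inference-time ensembling of expert next-token distributions, weighted by how close the context is to each expert's cluster center). The model names, the context embedder, and the cluster centers are placeholders, not project decisions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

EXPERTS = ["Multi-Domain-Expert-Layers/expert-arxiv",     # assumed repo names
           "Multi-Domain-Expert-Layers/expert-uspto"]
tok = AutoTokenizer.from_pretrained(EXPERTS[0])
models = [AutoModelForCausalLM.from_pretrained(name).eval() for name in EXPERTS]
centers = torch.randn(len(EXPERTS), 768)                  # placeholder cluster centers

def embed_context(text):
    return torch.randn(768)                               # placeholder context embedder

@torch.no_grad()
def cbtm_next_token_probs(text, temperature=0.1):
    ids = tok(text, return_tensors="pt").input_ids
    # Weight each expert by how close the context embedding is to its cluster center.
    dists = torch.cdist(embed_context(text)[None], centers)[0]
    weights = torch.softmax(-dists / temperature, dim=0)
    mixed = None
    for w, model in zip(weights, models):
        p = w * torch.softmax(model(ids).logits[0, -1], dim=-1)
        mixed = p if mixed is None else mixed + p
    return mixed                                          # ensembled next-token distribution
```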

Investigate Expert Models Having High Perplexity

Our analysis in #53 has shown that the expert models we had previously trained actually have a higher perplexity than the base model.

Here are some issues that may have caused this:

  • no warmup
  • LR too high
  • too few steps
  • mixing in pile data
  • too many gradient accumulation steps
  • measurement error

The expert models were trained with an old version of the trainer, so we don't know which wandb run they belong to or what the pile/domain losses were during training. Re-running the training of one of the experts should help clarify this.

Evaluate a merged expert model's perplexity

The goal is to do a perplexity calculation on a few models:

  1. A model that is a weighted average of a few expert models
  2. A baseline model which is fine-tuned on the union of the experts' datasets
  3. Same as (2), but fine-tuning only the layers that were tuned for the experts (layers 9-13)

The model in (1) can be created using the script in this PR. The list of experts is:

  1. ArXiv
  2. Pubmed Central
  3. Pubmed Abstracts
  4. USPTO
  5. Philpapers
  6. Github
  7. Freelaw

The models in (2) and (3) should be prepared in the following issue.

The evaluation should be done on the evaluation fold of each expert's dataset, excluding the Pile part of it. The datasets are at MDEL HF. The perplexity can be computed with the Hugging Face Trainer's evaluate() method (see example here; a sketch follows the deliverables list below).

The deliverables of this issue should be:

  1. The perplexity numbers for each model
  2. A script to reproduce the result
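A minimal sketch of the perplexity measurement via the Trainer's evaluate() method; the merged model and dataset names are placeholders:

```python
import math
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

model_name = "Multi-Domain-Expert-Layers/expert-merged-avg"   # assumed merged model
tok = AutoTokenizer.from_pretrained(model_name)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Assumed dataset repo and split; the Pile portion would be filtered out beforehand.
eval_ds = load_dataset("Multi-Domain-Expert-Layers/arxiv", split="validation")
eval_ds = eval_ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=1024),
                      batched=True, remove_columns=eval_ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="eval_out", per_device_eval_batch_size=4),
    eval_dataset=eval_ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),  # causal LM: labels = inputs
)
metrics = trainer.evaluate()
print("perplexity:", math.exp(metrics["eval_loss"]))   # perplexity = exp(mean cross-entropy)
```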

Fix HF Hub Upload Error

When using the trainer script, the last HF Hub update of the model repo fails with the message:
remote: Sorry, your push was rejected during YAML metadata verification:
remote: - Error: "model-index[0].results[0].dataset.config" must be a string
remote: -------------------------------------------------------------------------
remote: Please find the documentation at:
remote: https://huggingface.co/docs/hub/model-cards#model-card-metadata
remote: -------------------------------------------------------------------------
To https://huggingface.co/Multi-Domain-Expert-Layers/expert-uspto
 ! [remote rejected] main -> main (pre-receive hook declined)
error: failed to push some refs to 'https://huggingface.co/Multi-Domain-Expert-Layers/expert-uspto'
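The validator is rejecting the generated model card because model-index[0].results[0].dataset.config is not a string in the YAML front matter. One possible workaround (a hedged sketch, assuming the README has standard front matter; the local path is a placeholder) is to coerce that field to a string before pushing:

```python
import pathlib
import yaml

readme = pathlib.Path("expert-uspto/README.md")            # assumed local checkout
text = readme.read_text()
front, body = text.split("---\n", 2)[1:]                   # YAML front matter between the first two '---'
meta = yaml.safe_load(front)

# Coerce dataset.config to a string wherever the model-index results carry it.
for result in meta.get("model-index", [{}])[0].get("results", []):
    cfg = result.get("dataset", {}).get("config")
    if cfg is not None and not isinstance(cfg, str):
        result["dataset"]["config"] = str(cfg)

readme.write_text("---\n" + yaml.safe_dump(meta, sort_keys=False) + "---\n" + body)
```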

Dataset generation open issues

  1. We are currently using shard 1 only; it doesn't have enough data for some domains, so we should preprocess a bigger part of the Pile to get good coverage for them.
  2. Need to publish subset names
  3. Not clear whether to balance domain/pile by num samples or num tokens
  4. Report val loss split by domain/pile data

Get all relevant data for StarCoder into LUMI

80B tokens each of language data from vi, en, fi, hi, ja. For some of these languages we won't have enough data, so we will need to do multiple passes over the data.

And we will do about 80B tokens of code, including multilingual code already gathered by @Taishi Nakamura.

That leaves 20B tokens which we can reserve for instruction data, math, science, etc. - basically our expert data.
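For reference, the token budget these numbers imply (the ~500B total is my inference, not a stated target):

```python
# Back-of-the-envelope token budget implied by the numbers above.
language = 5 * 80e9   # vi, en, fi, hi, ja at ~80B tokens each
code = 80e9           # code, including the multilingual code already gathered
experts = 20e9        # instruction data, math, science, etc. (expert data)
total = language + code + experts
print(f"total = {total / 1e9:.0f}B tokens")   # 500B
```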

Train baseline models for evaluation

We need to evaluate the merged experts against a 1B Pythia model trained on everything together, in two configurations:

  1. Trained with all layers on the 6 datasets we have.
  2. Trained with just the upper layers.

To keep it fair, we would need to draw the exact same 8000 random train examples for each of the 7 datasets we used in the other experiments (see the sampling sketch after the list below). Then we merge the 6 experts with basic averaging and run the same eval from the 7 datasets on that model.

This will give us a comparison of:

  1. training all layers on the same tokens and data
  2. training only some layers on the same tokens and data
  3. merging different experts trained with the same compute
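A minimal sketch of drawing the same 8000 random train examples per dataset with a fixed seed, so the baselines and the experts see identical data; the dataset repo names are assumptions:

```python
from datasets import load_dataset, concatenate_datasets

DOMAINS = ["arxiv", "github", "pubmed_central", "pubmed_abstracts",
           "philpapers", "freelaw", "uspto"]   # assumed subset names

subsets = []
for name in DOMAINS:
    ds = load_dataset(f"Multi-Domain-Expert-Layers/{name}", split="train")  # assumed repo layout
    subsets.append(ds.shuffle(seed=42).select(range(8000)))                 # same seed => same examples

baseline_train = concatenate_datasets(subsets)   # training set for baselines (1) and (2)
```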
