
mdel's People

Contributors

bentherien, digitous, henk717, huu4ontocord, jordiclive, mrcabbage972, mrseeker, nicolomonti, nourfahmy, p1ayer-1, stillerman, thedarkzeno, vmay-chegg


mdel's Issues

Set up separate environments on the Redmond.ai box

It's currently difficult to work on the box because it has a single user with a single environment, which frequently gets broken.

To solve it, we should either create separate users or install docker.

Create minimal example of training on LUMI

Guidelines from Sampo:

  1. Single pure torch (no custom CUDA kernels) codebase
  2. Start with small model (< 10B params)
  3. Use this fork of Megatron-DeepSpeed

A potential way of implementing this flexibly is to add the fork as a git submodule in the MDEL repo. Example here.

The purpose of this ticket is to create a minimal script or a step-by-step guide for starting a training run of a GPT-NeoX model on LUMI.
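A minimal sketch of what guideline 1 could look like in practice: a single-file, pure-PyTorch causal LM training loop with no custom kernels. The model, sizes, and random data below are placeholders, not the actual LUMI setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyGPT(nn.Module):
    """Toy decoder-only LM built from stock PyTorch modules only (no fused kernels)."""
    def __init__(self, vocab=50304, d_model=512, n_layers=8, n_heads=8, seq_len=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(seq_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab, bias=False)

    def forward(self, ids):
        pos = torch.arange(ids.size(1), device=ids.device)
        h = self.embed(ids) + self.pos(pos)
        causal = nn.Transformer.generate_square_subsequent_mask(ids.size(1)).to(ids.device)
        return self.lm_head(self.blocks(h, mask=causal))

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyGPT().to(device)
opt = torch.optim.AdamW(model.parameters(), lr=2.5e-4, betas=(0.9, 0.95))

for step in range(10):                                   # replace with a real dataloader
    ids = torch.randint(0, 50304, (4, 128), device=device)
    logits = model(ids[:, :-1])                          # predict token t+1 from tokens <= t
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), ids[:, 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
```

This can later be wrapped in DDP / Slurm on LUMI without touching the model code, which is the point of keeping it pure torch.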

input_ids cast to fp16 in DeeperSpeed bug

{
  "pipe-parallel-size": 1,
  "model-parallel-size": 1,

  "num-layers": 16,
  "hidden-size": 2048,
  "num-attention-heads": 8,
  "seq-length": 2048,
  "max-position-embeddings": 2048,
  "pos-emb": "rotary",
  "rotary-pct": 0.25,
  "no-weight-tying": true,
  "gpt-j-residual": true,
  "output-layer-parallelism": "column",
  
  "scaled-upper-triang-masked-softmax-fusion": false,
  "bias-gelu-fusion": false,

  "init_method": "small_init",
  "output_layer_init_method": "wang_init",

  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00025,
      "betas": [0.9, 0.95],
      "eps": 1.0e-8
    }
  },
  "min_lr": 0.000025,

  "zero_optimization": {
    "stage": 0,
    "allgather_partitions": true,
    "allgather_bucket_size": 500000000,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": true,
    "cpu_offload": false
  }, 

  "fp16": {
    "enabled": true,
    "type": "bfloat16",
    "auto_cast": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 12,
    "hysteresis": 2,
    "min_loss_scale": 1
  }, 

  "fp32_allreduce": true,

  "train_micro_batch_size_per_gpu": 4,
  "gradient-accumulation-steps": 4,

  "data-path": "data/debug_text_document",
  "data-impl": "mmap",
  "num_workers": 1,

  "checkpoint-activations": true,
  "checkpoint-num-layers": 1,
  "partition-activations": true,
  "synchronize-each-layer": true,

  "gradient_clipping": 1.0,
  "weight-decay": 0.1,
  "hidden-dropout": 0,
  "attention-dropout": 0,

  "train-iters": 143000,
  "lr-decay-iters": 143000,
  "distributed-backend": "nccl",
  "lr-decay-style": "cosine",
  "warmup": 0.01,
  "checkpoint-factor": 1000,
  "extra-save-iters": [0,1,2,4,8,16,32,64,128,256,512],
  "eval-interval": 143000,
  "eval-iters": 10,

  "log-interval": 10,
  "steps_per_print": 10,
  "wall_clock_breakdown": true,

  "tokenizer-type": "HFGPT2Tokenizer"
}

Tried this config but I see the error:

Traceback (most recent call last):
  File "train.py", line 27, in <module>
    pretrain(neox_args=neox_args)
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/training.py", line 226, in pretrain
    iteration = train(
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/training.py", line 782, in train
    loss_dict, skipped_iter = train_step(
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/training.py", line 688, in train_step
    reduced_loss = train_step_pipe(
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/training.py", line 738, in train_step_pipe
    loss = model.train_batch(data_iter=data_iterator)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/runtime/pipe/engine.py", line 346, in train_batch
    self._exec_schedule(sched)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/runtime/pipe/engine.py", line 1376, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/runtime/pipe/engine.py", line 658, in _exec_forward_pass
    outputs = super().forward(inputs)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/runtime/engine.py", line 1842, in forward
    loss = self.module(*inputs, **kwargs)
  File "/dccstor/mayankgpfs/conda/envs/laion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/runtime/pipe/module.py", line 364, in forward
    x = exec_range_func(start_idx, end_idx)(*x)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/runtime/pipe/module.py", line 337, in exec_func
    inputs = layer(inputs)
  File "/dccstor/mayankgpfs/conda/envs/laion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/model/word_embeddings.py", line 181, in forward
    embeddings = super().forward(input_ids, position_ids)
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/model/word_embeddings.py", line 136, in forward
    words_embeddings = self.word_embeddings(input_ids)
  File "/dccstor/mayankgpfs/conda/envs/laion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/mpu/layers.py", line 196, in forward
    output_parallel = F.embedding(
  File "/dccstor/mayankgpfs/conda/envs/laion/lib/python3.8/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.HalfTensor instead (while checking arguments for embedding)
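One plausible reading of this (an assumption, not a confirmed diagnosis) is that the fp16 / auto_cast settings cast the whole pipeline input tuple, including the integer input_ids, to half precision before it reaches the embedding. A standalone sketch of the failure mode and a defensive guard:

```python
import torch
import torch.nn.functional as F

# Standalone reproduction of the dtype error (a hypothetical sketch, not the actual
# gpt-neox code path): F.embedding requires Long/Int indices, so casting the pipeline
# inputs to half precision breaks the word-embedding lookup.
vocab, hidden = 50304, 2048
weight = torch.randn(vocab, hidden)
input_ids = torch.randint(0, vocab, (4, 128))

F.embedding(input_ids, weight)                 # fine: indices are int64
try:
    F.embedding(input_ids.half(), weight)      # same failure mode as the traceback above
except RuntimeError as err:
    print(err)

# A defensive guard (assuming the upstream cast cannot simply be disabled) restores an
# integer dtype before the lookup. Note the clamp: large vocab ids do not survive an
# fp16 round-trip exactly, so preventing the cast in the first place is the cleaner fix.
ids = input_ids.half()
if ids.dtype not in (torch.long, torch.int):
    ids = ids.long().clamp_(0, vocab - 1)
out = F.embedding(ids, weight)
```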

Automatic Training Scripts for All Expert Models

If most of the training setup is identical across experts except for the data_path args/config (I assume it is, since they all started from the same seed LM), then we could write a script that trains the expert models sequentially and keeps the GPUs busy.
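A minimal sketch of such a driver, assuming a shared config where only the data path and output directory change per expert; the entry point, flags, and domain names are placeholders:

```python
import subprocess

DOMAINS = ["arxiv", "github", "pubmed_central", "uspto", "freelaw"]  # assumed names

for domain in DOMAINS:
    subprocess.run(
        [
            "python", "train.py",                      # assumed entry point
            "--config", "configs/expert_base.yml",     # shared hyperparameters
            "--data_path", f"data/{domain}",           # the only per-expert difference
            "--output_dir", f"checkpoints/expert-{domain}",
        ],
        check=True,   # stop the queue if one run fails, rather than burning GPU time
    )
```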

Train 2nd batch of expert models

Train 3B experts again, with the following variations:

Data size: 500K, 1M, 10M, and, if there are enough examples, 50M (basically the max data).

Data mix (see the interleaving sketch at the end of this issue):

  1. 80% pile / 20% expert
  2. Expert only

Both should be trained with layers 16-20 and with layers 16-25. Save checkpoints at 50k and 100k steps.

Domains to train:

  1. ArXiv
  2. Github
  3. Pubmed Central
  4. Pubmed Abstracts
  5. Philpapers
  6. Freelaw
  7. USPTO
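For the 80% pile / 20% expert mix, a minimal sketch of interleaving the two streams by sampling probability rather than concatenating them, so the ratio holds at any data size (dataset names are assumptions):

```python
from datasets import load_dataset, interleave_datasets

# Assumed dataset repos; streaming keeps the 50M-example variants out of RAM.
pile = load_dataset("Multi-Domain-Expert-Layers/pile_subset", split="train", streaming=True)
expert = load_dataset("Multi-Domain-Expert-Layers/arxiv", split="train", streaming=True)

# Sample from the two streams with 80/20 probability and a fixed seed for reproducibility.
mixed = interleave_datasets([pile, expert], probabilities=[0.8, 0.2], seed=42)
```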

Add script for merging expert models via weight averaging

We would like to create a script for creating a merged model by averaging expert weights.

The script would take as input:

  1. List of expert models from the MDEL HF repo.
  2. Name of the output model

The averaged model would be uploaded to the MDEL HF repo. Its model card should list the experts it was created from.
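A minimal sketch of what such an averaging script could look like, assuming all experts share the same architecture; the expert and output repo names are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM

def average_experts(expert_repos, output_repo):
    # Load every expert's parameters (assumes they all share the same architecture).
    state_dicts = [AutoModelForCausalLM.from_pretrained(r).state_dict() for r in expert_repos]

    # Element-wise mean of each parameter across the experts.
    avg_state = {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }

    merged = AutoModelForCausalLM.from_pretrained(expert_repos[0])
    merged.load_state_dict(avg_state)
    merged.push_to_hub(output_repo)  # model card listing the source experts still needs writing

average_experts(
    ["Multi-Domain-Expert-Layers/expert-arxiv",      # assumed expert repo names
     "Multi-Domain-Expert-Layers/expert-uspto"],
    "Multi-Domain-Expert-Layers/expert-merged-avg",  # assumed output name
)
```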

Expert merging: c-BTM

We would like to create a script that produces a merged model using the c-BTM method.

The script would take as input:

  1. List of expert models from the [MDEL HF repo](https://huggingface.co/Multi-Domain-Expert-Layers).
  2. Name of the output model

The merged model would be uploaded to the MDEL HF repo. Its model card should list the experts it was created from.
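A rough sketch of c-BTM-style merging as I understand the method (inference-time ensembling of expert next-token distributions, weighted by how close the context is to each expert's cluster center). The model names, the context embedder, and the cluster centers are placeholders, not project decisions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

EXPERTS = ["Multi-Domain-Expert-Layers/expert-arxiv",     # assumed repo names
           "Multi-Domain-Expert-Layers/expert-uspto"]
tok = AutoTokenizer.from_pretrained(EXPERTS[0])
models = [AutoModelForCausalLM.from_pretrained(name).eval() for name in EXPERTS]
centers = torch.randn(len(EXPERTS), 768)                  # placeholder cluster centers

def embed_context(text):
    return torch.randn(768)                               # placeholder context embedder

@torch.no_grad()
def cbtm_next_token_probs(text, temperature=0.1):
    ids = tok(text, return_tensors="pt").input_ids
    # Weight each expert by how close the context embedding is to its cluster center.
    dists = torch.cdist(embed_context(text)[None], centers)[0]
    weights = torch.softmax(-dists / temperature, dim=0)
    mixed = None
    for w, model in zip(weights, models):
        p = w * torch.softmax(model(ids).logits[0, -1], dim=-1)
        mixed = p if mixed is None else mixed + p
    return mixed                                          # ensembled next-token distribution
```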

Investigate Expert Models Having High Perplexity

Our analysis in #53 has shown that the expert models we had previously trained actually have a higher perplexity than the base model.

Here are some issues that may have caused this:

  • no warmup
  • LR too high
  • too few steps
  • mixing in pile data
  • too many gradient accumulation steps
  • measurement error

The expert models were trained with an old version of the trainer, so we don't know which wandb run they belong to or what the pile/domain losses were during training. Re-running the training of one of the experts should help clarify this.

Evaluate a merged expert model's perplexity

The goal is to do a perplexity calculation on a few models:

  1. A model that is a weighted average of a few expert models
  2. A baseline model which is fine-tuned on the union of the experts' datasets
  3. Same as (2), but fine-tuning only the layers that were tuned for the experts (layers 9-13)

The model in (1) can be created using the script in this PR. The list of experts is:

  1. ArXiv
  2. Pubmed Central
  3. Pubmed Abstracts
  4. USPTO
  5. Philpapers
  6. Github
  7. Freelaw

The models in (2) and (3) should be prepared in the following issue.

The evaluation should be done on the evaluation fold of each expert's dataset, excluding the Pile part of it. The datasets are at MDEL HF. The perplexity can be computed with the Hugging Face Trainer's evaluate() method (see example here; a sketch follows the deliverables list below).

The deliverables of this issue should be:

  1. The perplexity numbers for each model
  2. A script to reproduce the result
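A minimal sketch of the perplexity measurement via the Trainer's evaluate() method; the merged model and dataset names are placeholders:

```python
import math
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

model_name = "Multi-Domain-Expert-Layers/expert-merged-avg"   # assumed merged model
tok = AutoTokenizer.from_pretrained(model_name)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Assumed dataset repo and split; the Pile portion would be filtered out beforehand.
eval_ds = load_dataset("Multi-Domain-Expert-Layers/arxiv", split="validation")
eval_ds = eval_ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=1024),
                      batched=True, remove_columns=eval_ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="eval_out", per_device_eval_batch_size=4),
    eval_dataset=eval_ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),  # causal LM: labels = inputs
)
metrics = trainer.evaluate()
print("perplexity:", math.exp(metrics["eval_loss"]))   # perplexity = exp(mean cross-entropy)
```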

Fix HF Hub Upload Error

When using the trainer script, the last HF Hub update of the model repo fails with the message:
remote: Sorry, your push was rejected during YAML metadata verification:
remote: - Error: "model-index[0].results[0].dataset.config" must be a string
remote: -------------------------------------------------------------------------
remote: Please find the documentation at:
remote: https://huggingface.co/docs/hub/model-cards#model-card-metadata
remote: -------------------------------------------------------------------------
To https://huggingface.co/Multi-Domain-Expert-Layers/expert-uspto
 ! [remote rejected] main -> main (pre-receive hook declined)
error: failed to push some refs to 'https://huggingface.co/Multi-Domain-Expert-Layers/expert-uspto'
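The validator is rejecting the generated model card because model-index[0].results[0].dataset.config is not a string in the YAML front matter. One possible workaround (a hedged sketch, assuming the README has standard front matter; the local path is a placeholder) is to coerce that field to a string before pushing:

```python
import pathlib
import yaml

readme = pathlib.Path("expert-uspto/README.md")            # assumed local checkout
text = readme.read_text()
front, body = text.split("---\n", 2)[1:]                   # YAML front matter between the first two '---'
meta = yaml.safe_load(front)

# Coerce dataset.config to a string wherever the model-index results carry it.
for result in meta.get("model-index", [{}])[0].get("results", []):
    cfg = result.get("dataset", {}).get("config")
    if cfg is not None and not isinstance(cfg, str):
        result["dataset"]["config"] = str(cfg)

readme.write_text("---\n" + yaml.safe_dump(meta, sort_keys=False) + "---\n" + body)
```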

Dataset generation open issues

  1. We are currently using shard 1 only; it doesn't have enough data for some domains, so we should preprocess a bigger part of the Pile to get good coverage for them.
  2. Need to publish subset names
  3. Not clear whether to balance domain/pile by num samples or num tokens
  4. Report val loss split by domain/pile data

Get all relevant data for StarCoder into LUMI

80B tokens each of language data from vi, en, fi, hi, ja. For some of these languages we won't have enough data, so we will need to do multiple passes over the data.

And we will do about 80B tokens of code, including multilingual code already gathered by @Taishi Nakamura.

That leaves 20B tokens which we can reserve for instruction data, math, science, etc. - basically our expert data.
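For reference, the token budget these numbers imply (the ~500B total is my inference, not a stated target):

```python
# Back-of-the-envelope token budget implied by the numbers above.
language = 5 * 80e9   # vi, en, fi, hi, ja at ~80B tokens each
code = 80e9           # code, including the multilingual code already gathered
experts = 20e9        # instruction data, math, science, etc. (expert data)
total = language + code + experts
print(f"total = {total / 1e9:.0f}B tokens")   # 500B
```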

Train baseline models for evaluation

We need to evaluate the merged experts against a 1B Pythia model trained on everything together, in two configurations:

  1. Trained with all layers on the 6 datasets we have.
  2. Trained with just the upper layers.

To keep it fair, we would need to draw the exact same 8000 random train examples for each of the 7 datasets we used in the other experiments (see the sampling sketch after the list below). Then we merge the 6 experts with basic averaging and run the same eval from the 7 datasets on that model.

This will give us a comparison of:

  1. training all layers on the same tokens and data
  2. training only some layers on the same tokens and data
  3. merging different experts trained with the same compute
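A minimal sketch of drawing the same 8000 random train examples per dataset with a fixed seed, so the baselines and the experts see identical data; the dataset repo names are assumptions:

```python
from datasets import load_dataset, concatenate_datasets

DOMAINS = ["arxiv", "github", "pubmed_central", "pubmed_abstracts",
           "philpapers", "freelaw", "uspto"]   # assumed subset names

subsets = []
for name in DOMAINS:
    ds = load_dataset(f"Multi-Domain-Expert-Layers/{name}", split="train")  # assumed repo layout
    subsets.append(ds.shuffle(seed=42).select(range(8000)))                 # same seed => same examples

baseline_train = concatenate_datasets(subsets)   # training set for baselines (1) and (2)
```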
