examples's Issues

Default num_canonical_nodes to an even multiple of num_physical_nodes

Not sure of the exact problematic math, but get_partitions will error out if num_canonical_nodes / num_physical_nodes is not a whole number. This could be resolved by making the default conditional, i.e.:

pn = num_physical_nodes
num_canonical_nodes = num_canonical_nodes or 120 // pn * pn + pn

Example I saw when attempting to train the 350M GPT example on 6 nodes:

get_partitions(
    num_samples=364672,
    num_canonical_nodes=128,
    num_physical_nodes=6,
    ranks_per_node=4,
    workers_per_rank=1,
    batch_size=6
)
# =>ValueError: cannot reshape array of size 364672 into shape (6)
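For what it's worth, a quick sanity check of the proposed default (a sketch only; the helper name is made up, and the constant 120 is taken from the snippet above):

def default_num_canonical_nodes(num_physical_nodes: int) -> int:
    pn = num_physical_nodes
    return 120 // pn * pn + pn  # smallest multiple of pn strictly greater than 120

for pn in (1, 2, 4, 6, 8):
    ncn = default_num_canonical_nodes(pn)
    assert ncn % pn == 0  # always an even multiple of num_physical_nodes
    print(pn, ncn)        # e.g. 6 -> 126, so 128 would no longer be forced onto 6 nodes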

Error when training with Mosaic-Bert

I have used the Docker image from the README and installed the dependencies from requirements.txt. One difference is that I'm running with Singularity instead of Docker.

I face the following error only when executing main.py with mosaic-bert-base-uncased.yaml (hf_bert works fine).

Here is the error I see after tokenization. I would appreciate any guidance you can give me. Thanks for the amazing work!

Traceback (most recent call last):
File "", line 21, in _fwd_kernel
KeyError: ('2-.-0-.-0-d82511111ad128294e9d31a6ac684238-7929002797455b30efce6e41eddc6b57-3aa563e00c5c695dd945e23b09a86848-d962222789c30252d492a16cca3bf467-ff946bd4b3b4a4cbdf8cedc6e1c658e0-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.bfloat16, torch.bfloat16, torch.bfloat16, torch.float32, torch.bfloat16, torch.float32, torch.float32, 'fp32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), ('matrix', False, 64, True, True, True, 128, 128), (True, True, True, True, True, True, True, (False,), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (False, False), (True, False), (True, False), (True, False), (True, False), (False, False), (False, False)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/u/user/examples/examples/benchmarks/bert/main.py", line 269, in
main(cfg)
File "/u/user/examples/examples/benchmarks/bert/main.py", line 256, in main
trainer.fit()
File "/u/user/.local/lib/python3.10/site-packages/composer/trainer/trainer.py", line 1766, in fit
self._train_loop()
File "/u/user/.local/lib/python3.10/site-packages/composer/trainer/trainer.py", line 1940, in _train_loop
total_loss_dict = self._train_batch(use_grad_scaling)
File "/u/user/.local/lib/python3.10/site-packages/composer/trainer/trainer.py", line 2115, in _train_batch
optimizer.step(closure=lambda **kwargs: self._train_microbatches(
File "/usr/lib/python3/dist-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
return wrapped(*args, **kwargs)
File "/usr/lib/python3/dist-packages/torch/optim/optimizer.py", line 140, in wrapper
out = func(*args, **kwargs)
File "/usr/lib/python3/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/u/user/.local/lib/python3.10/site-packages/composer/optim/decoupled_weight_decay.py", line 288, in step
loss = closure()
File "/u/user/.local/lib/python3.10/site-packages/composer/trainer/trainer.py", line 2115, in
optimizer.step(closure=lambda **kwargs: self._train_microbatches(
File "/u/user/.local/lib/python3.10/site-packages/composer/trainer/trainer.py", line 2213, in _train_microbatches
microbatch_loss_dict = self._train_microbatch(use_grad_scaling, current_batch_size, is_final_microbatch)
File "/u/user/.local/lib/python3.10/site-packages/composer/trainer/trainer.py", line 2276, in _train_microbatch
self.state.outputs = self.state.model(self.state.batch)
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/u/user/.local/lib/python3.10/site-packages/composer/models/huggingface.py", line 314, in forward
output = self.model(**batch) # type: ignore (thirdparty)
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/u/user/examples/examples/benchmarks/bert/src/bert_layers.py", line 858, in forward
outputs = self.bert(
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/u/user/examples/examples/benchmarks/bert/src/bert_layers.py", line 677, in forward
encoder_outputs = self.encoder(
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/u/user/examples/examples/benchmarks/bert/src/bert_layers.py", line 533, in forward
hidden_states = layer_module(hidden_states,
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/u/user/examples/examples/benchmarks/bert/src/bert_layers.py", line 395, in forward
attention_output = self.attention(hidden_states, cu_seqlens, seqlen,
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/u/user/examples/examples/benchmarks/bert/src/bert_layers.py", line 307, in forward
self_output = self.self(input_tensor, cu_seqlens, max_s, indices,
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/u/user/examples/examples/benchmarks/bert/src/bert_layers.py", line 241, in forward
attention = flash_attn_qkvpacked_func(qkv, bias)
File "/u/user/examples/examples/benchmarks/bert/src/flash_attn_triton.py", line 1021, in forward
o, lse, ctx.softmax_scale = _flash_attn_forward(
File "/u/user/examples/examples/benchmarks/bert/src/flash_attn_triton.py", line 826, in _flash_attn_forward
_fwd_kernel[grid]( # type: ignore
File "/u/user/.local/lib/python3.10/site-packages/triton/runtime/jit.py", line 106, in launcher
return self.run(*args, grid=grid, **kwargs)
File "/u/user/.local/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 86, in run
return self.fn.run(*args, num_warps=config.num_warps, num_stages=config.num_stages, **kwargs, **config.kwargs)
File "/u/user/.local/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 200, in run
return self.fn.run(*args, **kwargs)
File "", line 41, in _fwd_kernel
File "/u/user/.local/lib/python3.10/site-packages/triton/compiler.py", line 1239, in compile
so = _build(fn.name, src_path, tmpdir)
File "/u/user/.local/lib/python3.10/site-packages/triton/compiler.py", line 1169, in _build
ret = subprocess.check_call(cc_cmd)
File "/usr/lib/python3.10/subprocess.py", line 364, in check_call
retcode = call(*popenargs, **kwargs)
File "/usr/lib/python3.10/subprocess.py", line 345, in call
with Popen(*popenargs, **kwargs) as p:
File "/usr/lib/python3.10/subprocess.py", line 971, in init
self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/lib/python3.10/subprocess.py", line 1863, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '/sw/spack/sys11-2023-03/apps/linux-rhel8-x86_64/gcc-8.5.0/gcc-11.4.0-yycklku/bin/gcc'

Error in ResNet ImageNet examples

While training ResNet with the "mild" ImageNet recipe, I realized that the call to config.update(recipe_config) doesn't actually work. See these lines.

recipe_config is an OmegaConf dictionary config with keys that include dots in the name:

{
    'model.loss_name': 'binary_cross_entropy', 
    'train_dataset.crop_size': 176, 
    'eval_dataset.resize_size': 232, 
    'max_duration': '36ep'
}

When you call config.update(recipe_config), it adds these keys to the config object directly, instead of updating the config.train_dataset nested dictionary. This means the max_duration is actually 36ep because it's a top-level key, but the crop sizes will not be changed and the model loss is still cross entropy instead of the binary variant.

You can fix it like this:

from omegaconf import OmegaConf

for key, value in recipe_config.items():
    OmegaConf.update(config, key, value)

OmegaConf.update respects the dots in the key names.
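For reference, a small self-contained sketch of the fix (the recipe keys are copied from the dict above; the base config values are placeholders, and the only assumption is that omegaconf is installed):

from omegaconf import OmegaConf

config = OmegaConf.create({
    'model': {'loss_name': 'cross_entropy'},
    'train_dataset': {'crop_size': 224},
    'max_duration': '90ep',
})
recipe_config = {
    'model.loss_name': 'binary_cross_entropy',
    'train_dataset.crop_size': 176,
    'max_duration': '36ep',
}

# OmegaConf.update treats each dotted key as a nested path, so the nested
# fields get overwritten instead of new top-level keys being created.
for key, value in recipe_config.items():
    OmegaConf.update(config, key, value)

print(config.model.loss_name)          # binary_cross_entropy
print(config.train_dataset.crop_size)  # 176
print(config.max_duration)             # 36ep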

RuntimeError: Triton Error [CUDA]: invalid argument

Getting the following error when running the mosaic-bert recipe. It only happens with bf16; fp32 works.

Traceback (most recent call last):
  File "<string>", line 21, in _bwd_kernel
KeyError: ('2-.-0-.-0-d82511111ad128294e9d31a6ac684238-7929002797455b30efce6e41eddc6b57-3aa563e00c5c695dd945e23b09a86848-d962222789c30252d492a16cca3bf467-ff946bd4b3b4a4cbdf8cedc6e1c658e0-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.bfloat16, torch.bfloat16, torch.bfloat16, torch.float32, torch.bfloat16, torch.float32, torch.bfloat16, torch.bfloat16, torch.float32, torch.float32, 'fp32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), ('matrix', False, 64, False, True, True, True, 128, 128), (True, True, True, True, True, True, True, True, True, True, (False,), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (False, False), (True, False), (True, False), (True, False), (True, False), (False, False), (False, False)))
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/coc/scratch/sreddy65/examples/examples/bert/main.py", line 141, in <module>
    main(cfg)
  File "/coc/scratch/sreddy65/examples/examples/bert/main.py", line 128, in main
    trainer.fit()
  File "/nethome/sreddy65/miniconda3/envs/mosaic/lib/python3.10/site-packages/composer/trainer/trainer.py", line 1787, in fit
    self._train_loop()
  File "/nethome/sreddy65/miniconda3/envs/mosaic/lib/python3.10/site-packages/composer/trainer/trainer.py", line 1950, in _train_loop
    total_loss_dict = self._train_batch(use_grad_scaling)
  File "/nethome/sreddy65/miniconda3/envs/mosaic/lib/python3.10/site-packages/composer/trainer/trainer.py", line 2126, in _train_batch
    optimizer.step(closure=lambda **kwargs: self._train_microbatches(
  File "/nethome/sreddy65/miniconda3/envs/mosaic/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
    return wrapped(*args, **kwargs)
  File "/nethome/sreddy65/miniconda3/envs/mosaic/lib/python3.10/site-packages/torch/optim/optimizer.py", line 140, in wrapper
    out = func(*args, **kwargs)
  File "/nethome/sreddy65/miniconda3/envs/mosaic/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/nethome/sreddy65/miniconda3/envs/mosaic/lib/python3.10/site-packages/composer/optim/decoupled_weight_decay.py", line 289, in step
    loss = closure()
  File "/nethome/sreddy65/miniconda3/envs/mosaic/lib/python3.10/site-packages/composer/trainer/trainer.py", line 2126, in <lambda>
    optimizer.step(closure=lambda **kwargs: self._train_microbatches(
  File "/nethome/sreddy65/miniconda3/envs/mosaic/lib/python3.10/site-packages/composer/trainer/trainer.py", line 2209, in _train_microbatches
    microbatch_loss_dict = self._train_microbatch(use_grad_scaling, current_batch_size, is_final_microbatch)
  File "/nethome/sreddy65/miniconda3/envs/mosaic/lib/python3.10/site-packages/composer/trainer/trainer.py", line 2305, in _train_microbatch
    microbatch_loss.backward(create_graph=self._backwards_create_graph)
  File "/nethome/sreddy65/miniconda3/envs/mosaic/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/nethome/sreddy65/miniconda3/envs/mosaic/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/nethome/sreddy65/miniconda3/envs/mosaic/lib/python3.10/site-packages/torch/autograd/function.py", line 267, in apply
    return user_fn(self, *args)
  File "/nethome/sreddy65/examples/examples/bert/src/flash_attn_triton.py", line 1041, in backward
    _flash_attn_backward(do,
  File "/nethome/sreddy65/examples/examples/bert/src/flash_attn_triton.py", line 949, in _flash_attn_backward
    _bwd_kernel[grid](  # type: ignore
  File "/nethome/sreddy65/miniconda3/envs/mosaic/lib/python3.10/site-packages/triton/runtime/jit.py", line 106, in launcher
    return self.run(*args, grid=grid, **kwargs)
  File "/nethome/sreddy65/miniconda3/envs/mosaic/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 73, in run
    timings = {config: self._bench(*args, config=config, **kwargs)
  File "/nethome/sreddy65/miniconda3/envs/mosaic/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 73, in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs)
  File "/nethome/sreddy65/miniconda3/envs/mosaic/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 63, in _bench
    return do_bench(kernel_call)
  File "/nethome/sreddy65/miniconda3/envs/mosaic/lib/python3.10/site-packages/triton/testing.py", line 136, in do_bench
    fn()
  File "/nethome/sreddy65/miniconda3/envs/mosaic/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 62, in kernel_call
    self.fn.run(*args, num_warps=config.num_warps, num_stages=config.num_stages, **current)
  File "/nethome/sreddy65/miniconda3/envs/mosaic/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 200, in run
    return self.fn.run(*args, **kwargs)
  File "<string>", line 43, in _bwd_kernel
RuntimeError: Triton Error [CUDA]: invalid argument

FSDP for encoder

Hi,

Is there a code example available for FSDP with an encoder-only model (BERT/RoBERTa)?

Thanks

Please bring code features from MPT-7b back to MPT-1b for use of MPT-1b with SFTTrainer.

What I want to do:

model = MosaicGPT.from_pretrained(
    "mosaicml/mpt-1b-redpajama-200b",
    trust_remote_code=True,
    attn_impl='torch'
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=tokenized_train_data["train"],
    eval_dataset=tokenized_val_data["validation"],
    dataset_text_field="text",
    args=training_args,
    neftune_noise_alpha=5 #the only one important thing for me
)

Yet it fails with various missing features in the MPT-1b implementation:

and potentially others.

Please help the community use MPT-1b by either:
a) retraining MPT-7b at 1B-parameter scale with the MPT-7b code base, or
b) updating the MPT-1b codebase (which diverges a bit from MPT-7b in terms of architecture).

Confusion regarding conflicting information in model card of "mosaic-bert" on Hugging Face

I have been referring to the model card of "mosaic-bert" on Hugging Face, and I noticed some conflicting information that has left me confused. In the model card, it is mentioned that the vocabulary size was increased to a multiple of 8 and 64 (from 30,522 to 30,528 tokens). This suggests that either a new vocabulary was fitted or additional tokens were added to the standard bert-base tokenizer.

However, later in the model card, it states that "The tokenizer for this model is simply the Hugging Face bert-base-uncased tokenizer" with the original 30,522 tokens. This inconsistency has created confusion for me, as it seems contradictory to the previous statement regarding the increased vocabulary size.

Additionally, the respective blog mentions that one can use their own custom vocabulary and tokenizer for a specific domain. I am intrigued by this possibility but unsure of how it can be accomplished. I would appreciate further clarification on how to employ a custom vocabulary and tokenizer with the "mosaic-bert" model.

Thank you for your attention to this matter. I look forward to your response and clarification regarding these issues.
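For anyone else wondering about the custom-vocabulary part: this is not an official recipe, but a minimal sketch of training a domain-specific WordPiece vocabulary with the tokenizers library and padding the embedding size up to a multiple of 64, as the model card describes. The corpus file name and output directory are placeholders.

import os
from tokenizers import BertWordPieceTokenizer

# Train a WordPiece vocabulary on your own corpus (placeholder file name).
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(files=['my_domain_corpus.txt'],
                vocab_size=30522,
                special_tokens=['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]'])
os.makedirs('./my-domain-tokenizer', exist_ok=True)
tokenizer.save_model('./my-domain-tokenizer')

# The tokenizer keeps its true vocab size; only the model's embedding matrix is
# padded up to a multiple of 64 for GPU efficiency (e.g. 30,522 -> 30,528).
vocab_size = tokenizer.get_vocab_size()
padded_vocab_size = ((vocab_size + 63) // 64) * 64
print(vocab_size, padded_vocab_size)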

Can't save a trained model as a HuggingFace model

After training finished, I tried to use the utility function write_huggingface_pretrained_from_composer_checkpoint to save a HF model from a checkpoint. However, it throws a KeyError saying it can't find the huggingface field inside the checkpoint.

It seems the saved binaries have their integrations field empty. I didn't change any configs inside the Trainer. What might have gone wrong?

File /opt/conda/lib/python3.10/site-packages/composer/models/huggingface.py:522, in get_hf_config_from_composer_state_dict(state_dict, config_overrides)
    519 if config_overrides is None:
    520     config_overrides = {}
--> 522 hf_config_dict = state_dict['state']['integrations']['huggingface']['model']['config']['content']
    523 # Update the config with any extra args needed
    524 hf_config_dict.update(config_overrides)

KeyError: 'huggingface'

MosaicBERT: Convert composer weights to HF

Hi,

we could successfully pretrain various MosaicBERT models, and evaluations with Composer-based fine-tuning look really good :)

However, when using the conversion script llm-foundry/scripts/inference/convert_composer_to_hf.py, the converted HF model seems to be randomly initialized and the MLM predictions look completely random.

I used the conversion script from the llm-foundry repository like this:

$ python3 /mnt/llm-foundry/scripts/inference/convert_composer_to_hf.py --composer_path ep111-ba125000-rank0.pt --hf_output_path ./converted-3 --output_precision fp32

It then shows that various weights are not correctly initialized:

HF checkpoint folder successfully created at ./converted-3.
Loading model from ./converted-3
If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
Some weights of BertLMHeadModel were not initialized from the model checkpoint at ./converted-3 and are newly initialized:
['bert.encoder.layer.7.attention.self.key.bias', 'bert.encoder.layer.11.output.LayerNorm.weight', 'bert.encoder.layer.7.attention.self.query.weight', 'bert.encoder.layer.10.output.LayerNorm.bias', 'bert.encoder.layer.4.output.dense.bias', ... (dozens more encoder attention, intermediate, output dense, and LayerNorm weights and biases across layers 0-11) ...]
[...]
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Is there any special conversion script or hints for converting a MosaicBERT Composer checkpoint? 🤔

Any help is highly appreciated!

Explain composer logs emitted during training + Replicate Benchmark Results

โ“ Question

Hello, I am training an mpt-3B model on AWS SageMaker using an ml.p4d.24xlarge instance and trying to replicate the results displayed in this table: link.

Specifically, I am focusing on replicating the result for the mpt-3b model with the following configuration: max_seq_len: 2048, global_train_batch_size=320, device_train_microbatch_size=5, and 8 a100_40gb GPUs. According to the table, it should be able to process 39 sequences per second. Since I process 320 sequences within one batch, the batch should ideally finish within 8.2 seconds. However, when I run it, it takes around 10 seconds (screenshot attached).

I am also looking for an explanation of the logs emitted by the composer before the start of every batch. I have checked the documentation but couldn't find anything specific. I am particularly interested in understanding the meaning of the following logs:

  • Train memory/allocated_mem: 6.8051
  • Train memory/active_mem: 6.8051
  • Train memory/inactive_mem: 1.9065
  • Train memory/reserved_mem: 14.6740
  • Train memory/alloc_retries: 0
  • Train loss/train/total: 11.6525
  • Train metrics/train/LanguageCrossEntropy: 11.6525
  • Train metrics/train/LanguagePerplexity: 114977.1562
  • Train time/train: 0.0081
  • Train time/val: 0.0000
  • Train time/total: 0.0081
  • Train lr-DecoupledAdamW/group0: 0.0000
  • Train time/remaining_estimate: 0.0225

Lastly, I would like to know if there is an easy way to calculate TFLOP/s using the above logs.
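Not an official formula, but a rough back-of-the-envelope sketch of how seq/s and model TFLOP/s can be estimated from the quantities above (the 6 * params * tokens approximation ignores attention FLOPs; the batch time is taken from the observation above and all numbers are illustrative):

n_params = 3e9                  # ~3B parameters (mpt-3b)
max_seq_len = 2048
global_train_batch_size = 320
batch_time_s = 10.0             # observed wall-clock seconds per batch

tokens_per_batch = global_train_batch_size * max_seq_len
seqs_per_s = global_train_batch_size / batch_time_s
flops_per_batch = 6 * n_params * tokens_per_batch   # fwd + bwd, dense layers only
tflops_total = flops_per_batch / batch_time_s / 1e12
tflops_per_gpu = tflops_total / 8                    # 8x A100-40GB

print(f'{seqs_per_s:.1f} seq/s, ~{tflops_per_gpu:.0f} model TFLOP/s per GPU')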

Here is the bash command that I am running:

composer train/train.py \
  train/yamls/pretrain/mpt-3b.yaml \
  data_local=my-copy-c4 \
  train_loader.dataset.split=train_small \
  eval_loader.dataset.split=val_small \
  max_duration=10ba \
  eval_interval=30ba \
  save_folder=mpt-3b \
  max_seq_len=2048 \
  global_train_batch_size=320 \
  device_train_microbatch_size=5

[screenshot: Composer training logs showing roughly 10 seconds per batch]

FlashAttention Triton error on the MosaicBERT models other than base

When I try to run MosaicBERT like this:

import transformers
from transformers import AutoModelForMaskedLM, BertTokenizer, pipeline

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
config = transformers.BertConfig.from_pretrained('mosaicml/mosaic-bert-base-seqlen-2048')
# It's essential to set the attention_probs_dropout_prob to 0.1
#config.alibi_starting_size = 2048 # maximum sequence length updated to 2048 from config default of 512
mlm = AutoModelForMaskedLM.from_pretrained('mosaicml/mosaic-bert-base-seqlen-2048', trust_remote_code=True, config=config)

mlm.to("cuda")
classifier = pipeline('fill-mask', model=mlm, tokenizer=tokenizer, device="cuda")

classifier("I [MASK] to the store yesterday.")

I get this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/YNAB/ML/ai_categorize/.conda-env/lib/python3.9/site-packages/triton/compiler/code_generator.py:1124, in ast_to_ttir(fn, signature, specialization, constants, debug, arch)
   1123 try:
-> 1124     generator.visit(fn.parse())
   1125 except CompilationError as e:

File ~/YNAB/ML/ai_categorize/.conda-env/lib/python3.9/site-packages/triton/compiler/code_generator.py:1017, in CodeGenerator.visit(self, node)
   1016     last_loc = self.builder.get_loc()
-> 1017 ret = super().visit(node)
   1018 # Reset the location to the last one before the visit

File ~/YNAB/ML/ai_categorize/.conda-env/lib/python3.9/ast.py:407, in NodeVisitor.visit(self, node)
    406 visitor = getattr(self, method, self.generic_visit)
--> 407 return visitor(node)

File ~/YNAB/ML/ai_categorize/.conda-env/lib/python3.9/site-packages/triton/compiler/code_generator.py:293, in CodeGenerator.visit_Module(self, node)
    292 def visit_Module(self, node):
--> 293     ast.NodeVisitor.generic_visit(self, node)

File ~/YNAB/ML/ai_categorize/.conda-env/lib/python3.9/ast.py:415, in NodeVisitor.generic_visit(self, node)
    414         if isinstance(item, AST):
--> 415             self.visit(item)
    416 elif isinstance(value, AST):
...
                            other=0.0)
        qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)
        qk += tl.dot(q, k, trans_b=True)
                        ^
TypeError("dot() got an unexpected keyword argument 'trans_b'")

This appears to have been fixed a few days ago by @jacobfulano in the mosaic-bert-base repo:
https://huggingface.co/mosaicml/mosaic-bert-base/blob/ed2a544063a892b78823cba2858d1e098c0e6012/config.json

It looks like that removes FlashAttention? Does that mean that the speed increase from FA is also removed?

Here's how I can fix it in the meantime, in case someone else Googles and stumbles across this:

import transformers
from transformers import AutoModelForMaskedLM, BertTokenizer, pipeline

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
config = transformers.BertConfig.from_pretrained('mosaicml/mosaic-bert-base')
# It's essential to set the attention_probs_dropout_prob to 0.1, which mosaic-bert-base does. So we just update the alibi_starting_size:
config.alibi_starting_size = 2048  # maximum sequence length updated to 2048 from the config default of 512
mlm = AutoModelForMaskedLM.from_pretrained('mosaicml/mosaic-bert-base-seqlen-2048', trust_remote_code=True, config=config)
mlm.to("cuda")
classifier = pipeline('fill-mask', model=mlm, tokenizer=tokenizer, device="cuda")
classifier("I [MASK] to the store yesterday.")

Training Time estimation on single GPU A100 80G

Hi,

I am pretraining the BERT model in FP32 and BF16. My estimated training time for FP32 with 128 sequence length and 256 batch size is 160 hrs with MosaicBERT. Is this expected? I don't see much of a reduction compared to Hugging Face BERT.

1 out of N runs starts successfully, others fail immediately

Hi,
I am trying to reproduce the BERT pretraining example by following the steps here, without any custom modifications: https://github.com/mosaicml/examples/tree/main/examples/bert
I am trying to run it on 4xA100 80GB, inside a conda environment. A really weird and annoying problem I am facing is that only 1 lucky run out of N attempts actually starts training. All other (N-1) attempts fail with this error:

Training using config:
data_local: /ssdpool/eldar/bert_c4
data_remote: null
max_seq_len: 128
tokenizer_name: bert-base-uncased
mlm_probability: 0.15
run_name: hf-bert-base-uncased
model:
  name: hf_bert
  use_pretrained: false
  pretrained_model_name: ${tokenizer_name}
  tokenizer_name: ${tokenizer_name}
  model_config:
    num_attention_heads: 12
    num_hidden_layers: 12
    max_position_embedding: 512
    attention_probs_dropout_prob: 0.1
train_loader:
  name: text
  dataset:
    local: ${data_local}
    remote: ${data_remote}
    split: train
    tokenizer_name: ${tokenizer_name}
    max_seq_len: ${max_seq_len}
    shuffle: true
    mlm_probability: ${mlm_probability}
  drop_last: true
  num_workers: 8
eval_loader:
  name: text
  dataset:
    local: ${data_local}
    remote: ${data_remote}
    split: val
    tokenizer_name: ${tokenizer_name}
    max_seq_len: ${max_seq_len}
    shuffle: false
    mlm_probability: 0.15
  drop_last: false
  num_workers: 8
scheduler:
  name: linear_decay_with_warmup
  t_warmup: 0.06dur
  alpha_f: 0.02
optimizer:
  name: decoupled_adamw
  lr: 0.0005
  betas:
  - 0.9
  - 0.98
  eps: 1.0e-06
  weight_decay: 1.0e-05
max_duration: 286720000sp
eval_interval: 2000ba
global_train_batch_size: 4096
seed: 17
device_eval_batch_size: 128
device_train_microbatch_size: 128
precision: bf16
progress_bar: false
log_to_console: true
console_log_interval: 1ba
callbacks:
  speed_monitor:
    window_size: 500
  lr_monitor: {}

Initializing model...
n_params=1.0951e+08
Building train loader...
Building eval loader...

ERROR:composer.cli.launcher:Rank 1 crashed with exit code 1.
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
/home/ekurtic/miniconda3/envs/mosaicml/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/home/ekurtic/miniconda3/envs/mosaicml/lib/python3.9/multiprocessing/resource_tracker.py:229: UserWarning: resource_tracker: '/eb22f1_next_epoch': [Errno 2] No such file or directory: '/eb22f1_next_epoch'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/home/ekurtic/miniconda3/envs/mosaicml/lib/python3.9/multiprocessing/resource_tracker.py:229: UserWarning: resource_tracker: '/d016e8_next_epoch': [Errno 2] No such file or directory: '/d016e8_next_epoch'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/home/ekurtic/miniconda3/envs/mosaicml/lib/python3.9/multiprocessing/resource_tracker.py:229: UserWarning: resource_tracker: '/eb22f1_barrier': [Errno 2] No such file or directory: '/eb22f1_barrier'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/home/ekurtic/miniconda3/envs/mosaicml/lib/python3.9/multiprocessing/resource_tracker.py:229: UserWarning: resource_tracker: '/d016e8_barrier': [Errno 2] No such file or directory: '/d016e8_barrier'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/home/ekurtic/miniconda3/envs/mosaicml/lib/python3.9/multiprocessing/resource_tracker.py:229: UserWarning: resource_tracker: '/d016e8_shard_states': [Errno 2] No such file or directory: '/d016e8_shard_states'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
/home/ekurtic/miniconda3/envs/mosaicml/lib/python3.9/multiprocessing/resource_tracker.py:229: UserWarning: resource_tracker: '/eb22f1_shard_states': [Errno 2] No such file or directory: '/eb22f1_shard_states'
  warnings.warn('resource_tracker: %r: %s' % (name, e))
Global rank 1 (PID 4033709) exited with code 1
----------Begin global rank 1 STDOUT----------
Training using config:
[... identical to the config printed above, except device_train_microbatch_size: 512 ...]

Initializing model...
n_params=1.0951e+08
Building train loader...
Building eval loader...

----------End global rank 1 STDOUT----------
----------Begin global rank 1 STDERR----------
Traceback (most recent call last):
  File "/home/ekurtic/github/eldarkurtic/mosaicml/examples/examples/bert/main.py", line 141, in <module>
    main(cfg)
  File "/home/ekurtic/github/eldarkurtic/mosaicml/examples/examples/bert/main.py", line 91, in main
    trainer = Trainer(
  File "/home/ekurtic/miniconda3/envs/mosaicml/lib/python3.9/site-packages/composer/trainer/trainer.py", line 975, in __init__
    dist.initialize_dist(device, dist_timeout)
  File "/home/ekurtic/miniconda3/envs/mosaicml/lib/python3.9/site-packages/composer/utils/dist.py", line 426, in initialize_dist
    dist.init_process_group(device_obj.dist_backend, timeout=timeout_timedelta)
  File "/home/ekurtic/miniconda3/envs/mosaicml/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 761, in init_process_group
    default_pg = _new_process_group_helper(
  File "/home/ekurtic/miniconda3/envs/mosaicml/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 909, in _new_process_group_helper
    pg = _create_process_group_wrapper(
  File "/home/ekurtic/miniconda3/envs/mosaicml/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3232, in _create_process_group_wrapper
    helper_pg = ProcessGroupGloo(store, rank, world_size, timeout=timeout)
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:799] connect [10.36.193.84]:1927: Connection refused
Exception ignored in: <function StreamingDataset.__del__ at 0x1482b4fc11f0>
Traceback (most recent call last):
  File "/home/ekurtic/miniconda3/envs/mosaicml/lib/python3.9/site-packages/streaming/base/dataset.py", line 851, in __del__
    wait_for_local_leader(world)
  File "/home/ekurtic/miniconda3/envs/mosaicml/lib/python3.9/site-packages/streaming/base/util.py", line 62, in wait_for_local_leader
    wait_for_file_to_exist(dir_path,
  File "/home/ekurtic/miniconda3/envs/mosaicml/lib/python3.9/site-packages/streaming/base/util.py", line 49, in wait_for_file_to_exist
    raise RuntimeError(f'{err_msg}, bailing out: ' + f'{timeout:.3f} < {dt:.3f} sec.')
RuntimeError: Waiting for local rank 0, bailing out: 60.000 < 60.033 sec.
Exception ignored in: <function StreamingDataset.__del__ at 0x1482b4fc11f0>
Traceback (most recent call last):
  File "/home/ekurtic/miniconda3/envs/mosaicml/lib/python3.9/site-packages/streaming/base/dataset.py", line 851, in __del__
    wait_for_local_leader(world)
  File "/home/ekurtic/miniconda3/envs/mosaicml/lib/python3.9/site-packages/streaming/base/util.py", line 62, in wait_for_local_leader
    wait_for_file_to_exist(dir_path,
  File "/home/ekurtic/miniconda3/envs/mosaicml/lib/python3.9/site-packages/streaming/base/util.py", line 49, in wait_for_file_to_exist
    raise RuntimeError(f'{err_msg}, bailing out: ' + f'{timeout:.3f} < {dt:.3f} sec.')
RuntimeError: Waiting for local rank 0, bailing out: 60.000 < 60.034 sec.
/home/ekurtic/miniconda3/envs/mosaicml/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

----------End global rank 1 STDERR----------
ERROR:composer.cli.launcher:Global rank 0 (PID 4033708) exited with code -15

And the bash script I am using to run this:

export HF_DATASETS_CACHE=/ssdpool/eldar/hf_cache
export CUDA_VISIBLE_DEVICES=4,5,6,7
export MASTER_PORT=12345
export WORLD_SIZE=4
export TORCH_DISTRIBUTED_DEBUG=DETAIL

composer main.py yamls/main/hf-bert-base-uncased.yaml

What am I doing wrong here, any ideas? Why is Gloo complaining in the error messages above? Shouldn't we be using NCCL as the backend for DDP runs?

Finetuning on windows machine

Hi,

I pretrained a mosaic-bert version on a Linux server. As a next step I would like to finetune the model locally on a Windows 11 machine. I tried to run your finetuning script, but unfortunately it does not seem to work on a Windows machine.

Another try was to convert the model to the Hugging Face format. I tried this approach using the model "mosaicml/mosaic-bert-base" on Hugging Face first, after allowing for "BertForSequenceClassification" in the config file. This works, but it seems that the training does not converge, in contrast to the original Hugging Face model "bert-base-uncased". Do you have any explanations or recommendations for that?

This is the code:

#%% Imports
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import get_scheduler
import torch
import evaluate

#%% Begin by loading some random dataset (i.e. the Yelp Reviews dataset):
dataset = load_dataset("yelp_review_full")

#%% Tokenize
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

#%% As we are only interested in the comparison between hf bert and mosaic bert, finetune on a smaller dataset
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)

#%% Load model & train
model = AutoModelForSequenceClassification.from_pretrained('original-mosaic-bert', trust_remote_code=True, num_labels=5)  # Finetune the mosaic-bert model, but losses do not converge
# model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)  # This finetunes the original Hugging Face model with converging losses
optimizer = AdamW(model.parameters(), lr=5e-5)
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        print(loss)
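As an aside, the evaluate import in the script above is never used; a short evaluation pass (a sketch continuing the same script, using the accuracy metric) could be appended like this:

# Evaluate on the small eval split with the accuracy metric.
metric = evaluate.load("accuracy")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)
    predictions = outputs.logits.argmax(dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])
print(metric.compute())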

--concat_tokens flag in BERT pretraining

Hi,

would you recommend setting the --concat_tokens flag for BERT pretraining? Did you observe any difference in your experiments?

Furthermore, would you recommend splitting documents with more than 512 tokens into two or more documents instead of using truncation?
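For illustration (this is independent of the repo's data-conversion scripts and just shows the idea): concatenation packs tokenized documents end to end and re-chunks them into fixed-length windows, so documents longer than the window get split across samples rather than truncated.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
max_seq_len = 512

def concat_and_chunk(docs):
    buffer = []
    for doc in docs:
        buffer.extend(tokenizer(doc, add_special_tokens=False)['input_ids'])
        buffer.append(tokenizer.sep_token_id)  # keep a document boundary marker
    # Emit only full windows; a real pipeline would carry the remainder over.
    return [buffer[i:i + max_seq_len]
            for i in range(0, len(buffer) - max_seq_len + 1, max_seq_len)]

chunks = concat_and_chunk(['a short document', 'a much longer document ' * 200])
print(len(chunks), [len(c) for c in chunks])  # every chunk is exactly max_seq_len tokens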

Thank you very much in advance!

config class for bert is not consistent

Hey, I am trying to pull the model from the Hugging Face repo using
AutoModelForMaskedLM.from_pretrained('mosaicml/mosaic-bert-base-seqlen-2048', trust_remote_code=True, revision='b7a0389')
(with the revision param and without). Either way, I am getting the same error:
ValueError: The model class you are passing has a `config_class` attribute that is not consistent with the config class you passed (model has <class 'transformers.models.bert.configuration_bert.BertConfig'> and you passed <class 'transformers_modules.mosaicml.mosaic-bert-base-seqlen-2048.b7a0389deadf7a7261a3e5e7ea0680d8ba12232f.configuration_bert.BertConfig'>). Fix one of those so they match!
Do you have any suggestion as to why this might be the case?

When I do this: BertModel.from_pretrained('mosaicml/mosaic-bert-base-seqlen-2048'), it seems to work correctly, although I am not sure if FlashAttention will work correctly, given this statement in the model card: "This model requires that trust_remote_code=True be passed to the from_pretrained method. This is because we train using FlashAttention (Dao et al. 2022), which is not part of the transformers library and depends on Triton and some custom PyTorch code." And the BertModel class doesn't have a trust_remote_code parameter.

Train BERT on own data

Hi,

is there any documentation on how to proceed if one would like to train the mosaic-bert model on one's own data?
If I understand correctly, the only thing you mention to that end is:

Alternatively, feel free to substitute our dataloader with one of your own in the script main.py.

Can you be more precise?
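To make that concrete, here is a minimal standalone MLM dataloader sketch (this is not the repo's StreamingDataset pipeline; the file name and text column are placeholders) that could be adapted and swapped into main.py:

from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
raw = load_dataset('text', data_files={'train': 'my_corpus.txt'})

def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, max_length=128)

tokenized = raw['train'].map(tokenize, batched=True, remove_columns=['text'])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
train_loader = DataLoader(tokenized, batch_size=64, shuffle=True, collate_fn=collator)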

Thank you very much in advance

Can't install the requirements for mosaicml

Hello. I'll start using DNABERT_2 and they mention mosaicml.

I'm trying to install it but I get the following error:

$ pip install -r requirements.txt  

Defaulting to user installation because normal site-packages is not writeable
ERROR: Could not find a version that satisfies the requirement einops==0.5.0 (from versions: 0.1.0, 0.2.0, 0.3.0, 0.3.1, 0.3.2, 0.4.0, 0.4.1)
ERROR: No matching distribution found for einops==0.5.0
$ pip list | grep einops
einops                   0.4.1

If I just update einops, it stays at the same version and gives the same error:
pip install -U einops==0.5.0

I've downloaded the .tar.gz here and the README.md tells me to install from GitHub:

$ pip install https://github.com/arogozhnikov/einops/archive/master.zip

ERROR: Could not find a version that satisfies the requirement hatchling>=1.10.0 (from versions: 0.8.0, 0.8.1, 0.8.2, 0.9.0, 0.10.0, 0.11.0, 0.11.1, 0.11.2, 0.11.3, 0.12.0, 0.13.0, 0.14.0, 0.15.0, 0.16.0, 0.17.0, 0.18.0, 0.19.0, 0.20.0, 0.20.1, 0.21.0, 0.21.1, 0.22.0, 0.23.0, 0.24.0, 0.25.0, 0.25.1)
  ERROR: No matching distribution found for hatchling>=1.10.0

And the same happens if I try to update hatchling. I've also downloaded the .tar.gz here but there's not a lot of info.

Any ideas? I appreciate any insights.

I'm using:
pip 21.3.1
Python 3.6.9
Ubuntu 18.04.3 LTS

Accessing model after pre-training

Following the instructions laid out in the README, I have conducted pre-training using the Mosaic BERT model architecture and have pointed the save_folder to a directory in an S3 bucket. It appears as though the checkpointing has resulted in a .pt file being stored as desired, but I was wondering if there was a recommended method for loading the model back into SageMaker to perform inference and to confirm that the model is performing as expected. Any recommendations or suggestions would be immensely appreciated!
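Not an official answer, but one possible route is the same utility mentioned in an issue above, write_huggingface_pretrained_from_composer_checkpoint, which recent Composer versions ship in composer.models.huggingface (this is an assumption about the module path; the paths below are placeholders, and the checkpoint is assumed to have been downloaded from S3 first):

from composer.models.huggingface import write_huggingface_pretrained_from_composer_checkpoint
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Convert the Composer .pt checkpoint into a plain Hugging Face folder.
write_huggingface_pretrained_from_composer_checkpoint(
    'checkpoints/ep1-ba10000-rank0.pt',  # hypothetical local copy of the checkpoint saved to S3
    './hf-mosaic-bert',
)

# Load it back for inference (MosaicBERT's custom classes may additionally
# require trust_remote_code=True or the repo's src/ on the Python path).
model = AutoModelForMaskedLM.from_pretrained('./hf-mosaic-bert', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')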

MosaicML LLM: 'key_padding_mask' is NoneType when setting "attn_impl: torch"

Hi Team,

Everything worked fine with "attn_impl: flash". But when I tried to train the LLM models without the FlashAttention by setting "attn_impl: torch" in the yamls, the following error occurs.

Traceback (most recent call last):
File "/workspace/examples_latest/examples/llm/main.py", line 215, in
main(cfg)
File "/workspace/examples_latest/examples/llm/main.py", line 204, in main
trainer.fit()
File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 1787, in fit
self._train_loop()
File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 1950, in _train_loop
total_loss_dict = self._train_batch(use_grad_scaling)
File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 2126, in _train_batch
optimizer.step(closure=lambda **kwargs: self._train_microbatches(
File "/usr/lib/python3/dist-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
return wrapped(*args, **kwargs)
File "/usr/lib/python3/dist-packages/torch/optim/optimizer.py", line 140, in wrapper
out = func(*args, **kwargs)
File "/usr/lib/python3/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/usr/lib/python3/dist-packages/composer/optim/decoupled_weight_decay.py", line 289, in step
loss = closure()
File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 2126, in
optimizer.step(closure=lambda **kwargs: self._train_microbatches(
File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 2209, in _train_microbatches
microbatch_loss_dict = self._train_microbatch(use_grad_scaling, current_batch_size, is_final_microbatch)
File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 2255, in _train_microbatch
self.state.outputs = self.state.model(self.state.batch)
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/examples_latest/examples/llm/src/models/mosaic_gpt.py", line 255, in forward
return self.model(input_ids=input_ids,
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/lib/python3/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 2727, in forward
output = self._fsdp_wrapped_module(*args, **kwargs)
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/lib/python3/dist-packages/torch/distributed/fsdp/flatten_params_wrapper.py", line 165, in forward
return self.module(*inputs, **kwinputs)
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/examples_latest/examples/llm/src/models/mosaic_gpt.py", line 158, in forward
x = block(x, mod_key_padding_mask, attn_mask)
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/lib/python3/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 2727, in forward
output = self._fsdp_wrapped_module(*args, **kwargs)
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/lib/python3/dist-packages/torch/distributed/fsdp/flatten_params_wrapper.py", line 165, in forward
return self.module(*inputs, **kwinputs)
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/examples_latest/examples/llm/src/models/layers/gpt_blocks.py", line 53, in forward
b, _ = self.causal_attn(a, key_padding_mask, attn_mask)
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/examples_latest/examples/llm/src/models/layers/attention.py", line 39, in forward
key_padding_mask=~key_padding_mask,
TypeError: bad operand type for unary ~: 'NoneType'

Inquiry about Mosaic-BERT and BERT-Base Sequence Lengths

I have been exploring the Mosaic-BERT model and I noticed that it is trained on a sequence length of 128. It's my understanding that this length can be easily extrapolated during inference time due to Attention with Linear Biases. However, in one of your blog posts, you compared the Mosaic-BERT model with the Hugging Face BERT base model, and I'm unclear about the sequence length used for training the BERT-Base model.

Specifically, I would like to know if the BERT-Base model, which is used as a benchmark for the mosaic-bert model for example in the appended figure, is trained with a sequence length of 128 or 512? If it is trained with a sequence length of 128, I would like to inquire about the necessary steps to obtain a Mosaic-BERT model that matches the performance of the BERT-Base model with a sequence length of 512.

Thank you for your attention to this matter. I look forward to your response and clarification.
[figure: BertComparisonMNLI (MosaicBERT vs. BERT-Base comparison)]

Matmul error when using output_all_encoded_layers = True, and pooler

Hi,

First off thanks for this great contribution!

There seems to be an issue with the handling of the encoder_outputs at the pooler level when passing output_all_encoded_layers = True.

encoder_outputs = self.encoder(
    embedding_output,
    attention_mask,
    output_all_encoded_layers=output_all_encoded_layers,
    subset_mask=subset_mask)
if masked_tokens_mask is None:
    sequence_output = encoder_outputs[-1]
    pooled_output = self.pooler(
        sequence_output) if self.pooler is not None else None
else:
    # TD [2022-03-01]: the indexing here is very tricky.
    attention_mask_bool = attention_mask.bool()

because when doing that, I'm getting:

File ~/.conda/envs/mimibert/lib/python3.11/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/PatientTrajectoryForecasting/utils/bert_layers_mosa.py:567, in BertPooler.forward(self, hidden_states, pool)
    561 def forward(self,
    562             hidden_states: torch.Tensor,
    563             pool: Optional[bool] = True) -> torch.Tensor:
    564     # We "pool" the model by simply taking the hidden state corresponding
    565     # to the first token.
    566     first_token_tensor = hidden_states[:, 0] if pool else hidden_states
--> 567     pooled_output = self.dense(first_token_tensor)
    568     pooled_output = self.activation(pooled_output)
    569     return pooled_output

File ~/.conda/envs/mimibert/lib/python3.11/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/.conda/envs/mimibert/lib/python3.11/site-packages/torch/nn/modules/linear.py:114, in Linear.forward(self, input)
    113 def forward(self, input: Tensor) -> Tensor:
--> 114     return F.linear(input, self.weight, self.bias)

RuntimeError: mat1 and mat2 shapes cannot be multiplied (1x54784 and 768x768)

I believe the issue is that the padding function is not applied to the hidden states before they are appended to the list at the BERT encoder level:

all_encoder_layers = []
if subset_mask is None:
    for layer_module in self.layer:
        hidden_states = layer_module(hidden_states,
                                     cu_seqlens,
                                     seqlen,
                                     None,
                                     indices,
                                     attn_mask=attention_mask,
                                     bias=alibi_attn_mask)
        if output_all_encoded_layers:
            all_encoder_layers.append(hidden_states)
    # Pad inputs and mask. It will insert back zero-padded tokens.
    # Assume ntokens is total number of tokens (padded and non-padded)
    # and ntokens_unpad is total number of non-padded tokens.
    # Then padding performs the following de-compression:
    # hidden_states[ntokens_unpad,hidden] -> hidden_states[ntokens,hidden]
    hidden_states = bert_padding_module.pad_input(
        hidden_states, indices, batch, seqlen)
else:

(Edit: yep, this works, but I haven't checked whether anything else depends on the old behaviour.)

            all_encoder_layers.append(bert_padding_module.pad_input(
                hidden_states, indices, batch, seqlen))

The same thing should probably be done when the subset_mask is not None...
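For completeness, here is a hedged sketch of how the same padded append might look in the subset_mask branch, assuming that branch loops over layers with the same variables in scope (untested; variable names are copied from the snippet above, not from the actual else-branch):

# Hypothetical mirror of the fix above for the subset_mask branch.
# Assumes hidden_states, indices, batch and seqlen are in scope exactly as in
# the subset_mask-is-None branch.
if output_all_encoded_layers:
    all_encoder_layers.append(
        bert_padding_module.pad_input(hidden_states, indices, batch, seqlen))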

Thanks again for your contribution to the community!

Loss spike when training mosaic-bert (fp32)

I am training with the mosaic-bert-base-uncased.yaml recipe on 8xA40s, with data created using the Mosaic-provided C4 script. I consistently get a loss spike, after which the loss stays stuck, around 10k–15k steps into training. The only change is using fp32 instead of bfloat16.

The loss spikes at roughly 10k steps, as shown in the screenshots below.
[Screenshots: training loss curves from 2023-03-14 showing the spike at ~10k steps]

Regression testing

We should periodically run full training for all of the benchmarks to detect accuracy regressions. This will be more challenging than regular testing since we'll have to:

  • Launch multi-GPU jobs
  • Monitor the results
  • Track previous results as a reference

This should actually be pretty good MCLI dogfooding.

MosaicBERT: pretraining configuration for models > 128 seq. length

Hi MosaicML team,

many thanks for releasing the code and models for your MosaicBERT! I highly appreciate the effort that you put in modernizing the BERT architecture.

I am interested in pretraining MosaicBERT so I have some questions :)

  • I am interested in the pretraining configuration for the model with 512 sequence length (see the sketch after this list for the kind of overrides I have in mind). Do you also have hardware recommendations and an approximate pretraining time for MosaicBERT at 512 seq. length? Did you use the phase 1 + phase 2 "trick", i.e. pretraining at 128 seq. length and then continuing for fewer steps at 512, so that the 128-seq.-length MosaicBERT could be "recycled"?
  • I'm also interested in which implementation is recommended, e.g. a tagged/specific commit or the upcoming #440 PR.
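For concreteness, this is the kind of phase-2 override sketch I have in mind (only the key names are taken from the yamls in this repo; every value is a placeholder, not a recommended recipe):

# Hypothetical phase-2 overrides for a 512-sequence-length continuation run.
# All values below are placeholders.
max_seq_len: 512
max_duration: 20000ba          # placeholder: far fewer steps than phase 1
global_train_batch_size: 512   # placeholder: adjust to fit memory at 512 tokens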

Many thanks in advance!

Stefan

ValueError: Value bf16 not found in Precision

Hi, I am trying to replicate the BERT training script and I run into this issue of precision bf16 not being available. When I switch it to fp32, training works. May I know if I am missing something?
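For what it's worth, if the installed Composer release spells the bf16 value differently (more recent versions use amp_bf16, if I'm not mistaken), a quick check of the accepted names is possible (assuming Precision is importable from composer.core, as in recent releases):

# Print the precision names accepted by the installed Composer version.
from composer.core import Precision
print([p.value for p in Precision])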

Finetuning script broken?

Hey,

As fine-tuning after exporting to transformers is not possible, I tried the fine-tuning script that you provide.
As a first step to test your fine-tuning framework, I tried to run the function 'test_classification_script()' from 'tests/test_classification.py'.
To do so, I used a Linux server running Ubuntu with 4 x NVIDIA Tesla P100 (16 GB).
For the setup, I followed all the steps that you recommend here, i.e.:

I have installed CUDA release 11.7, as the following nvcc output shows:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0

To test your finetuning script, I simply did the following in the console:

>>> python 
>>> from tests.test_classification import test_classification_script 
>>> test_classification_script()

Here is the complete output:

Training using config:
tokenizer_name: prajjwal1/bert-tiny
max_seq_len: 32
run_name: test
model:
  name: mosaic_bert
  num_labels: 2
  pretrained_model_name: ${tokenizer_name}
  tokenizer_name: ${tokenizer_name}
train_loader:
  split: train
  tokenizer_name: ${tokenizer_name}
  max_seq_len: ${max_seq_len}
  shuffle: true
  drop_last: true
  num_workers: 4
eval_loader:
  split: validation
  tokenizer_name: ${tokenizer_name}
  max_seq_len: ${max_seq_len}
  shuffle: false
  drop_last: false
  num_workers: 4
scheduler:
  name: linear_decay_with_warmup
  t_warmup: 0.5dur
  alpha_f: 0.02
optimizer:
  name: decoupled_adamw
  lr: 0.0002
  betas:
  - 0.9
  - 0.95
  eps: 1.0e-08
  weight_decay: 0.0
max_duration: 8ba
eval_interval: 8ba
eval_subset_num_batches: 2
global_train_batch_size: 4
seed: 17
device_eval_batch_size: 4
device_train_microbatch_size: 2
precision: fp32
progress_bar: false
log_to_console: false
console_log_interval: 1ba
callbacks:
  speed_monitor:
    window_size: 4
  lr_monitor: {}

Initializing model...
n_params=4.4515e+06
Building train loader...
Found cached dataset glue (...)
Loading cached processed dataset at .../cache-qnli-prajjwal1,bert-tiny-tokenization-train.arrow
Building eval loader...
Found cached dataset glue (...)
Loading cached processed dataset at .../huggingface/datasets/glue/qnli/1.0.0.../cache-qnli-prajjwal1,bert-tiny-tokenization-validation.arrow
/usr/lib/python3/dist-packages/composer/callbacks/speed_monitor.py:120: UserWarning: gpu_flop count not found for None with precision: fp32; MFU cannot be calculated and reported. gpu_flops_available can be manuallyoverridden by setting gpu_flops_available in SpeedMonitor.
  warnings.warn(
Logging config...
tokenizer_name: prajjwal1/bert-tiny
max_seq_len: 32
run_name: test
model:
  name: mosaic_bert
  num_labels: 2
  pretrained_model_name: ${tokenizer_name}
  tokenizer_name: ${tokenizer_name}
train_loader:
  split: train
  tokenizer_name: ${tokenizer_name}
  max_seq_len: ${max_seq_len}
  shuffle: true
  drop_last: true
  num_workers: 4
eval_loader:
  split: validation
  tokenizer_name: ${tokenizer_name}
  max_seq_len: ${max_seq_len}
  shuffle: false
  drop_last: false
  num_workers: 4
scheduler:
  name: linear_decay_with_warmup
  t_warmup: 0.5dur
  alpha_f: 0.02
optimizer:
  name: decoupled_adamw
  lr: 0.0002
  betas:
  - 0.9
  - 0.95
  eps: 1.0e-08
  weight_decay: 0.0
max_duration: 8ba
eval_interval: 8ba
eval_subset_num_batches: 2
global_train_batch_size: 4
seed: 17
device_eval_batch_size: 4
device_train_microbatch_size: 2
precision: fp32
progress_bar: false
log_to_console: false
console_log_interval: 1ba
callbacks:
  speed_monitor:
    window_size: 4
  lr_monitor: {}
n_gpus: 1
device_train_batch_size: 4

Starting training...
Traceback (most recent call last):
  File "<string>", line 21, in _fwd_kernel
KeyError: ('2-.-0-.-0-d82511111ad128294e9d31a6ac684238-7929002797455b30efce6e41eddc6b57-3aa563e00c5c695dd945e23b09a86848-d962222789c30252d492a16cca3bf467-ff946bd4b3b4a4cbdf8cedc6e1c658e0-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float16, torch.float16, torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 'fp32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), ('matrix', False, 64, True, True, True, 128, 128), (True, True, True, True, True, True, True, (False,), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (False, False), (True, False), (True, False), (True, False), (True, False), (False, False), (False, False)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/examples/examples/benchmarks/bert/tests/test_classification.py", line 14, in test_classification_script
    main(config)
  File "/examples/examples/benchmarks/bert/sequence_classification.py", line 317, in main
    trainer.fit()
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 1766, in fit
    self._train_loop()
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 1940, in _train_loop
    total_loss_dict = self._train_batch(use_grad_scaling)
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 2115, in _train_batch
    optimizer.step(closure=lambda **kwargs: self._train_microbatches(
  File "/usr/lib/python3/dist-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/torch/optim/optimizer.py", line 140, in wrapper
    out = func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/composer/optim/decoupled_weight_decay.py", line 288, in step
    loss = closure()
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 2115, in <lambda>
    optimizer.step(closure=lambda **kwargs: self._train_microbatches(
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 2213, in _train_microbatches
    microbatch_loss_dict = self._train_microbatch(use_grad_scaling, current_batch_size, is_final_microbatch)
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 2276, in _train_microbatch
    self.state.outputs = self.state.model(self.state.batch)
  File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/lib/python3/dist-packages/composer/models/huggingface.py", line 314, in forward
    output = self.model(**batch)  # type: ignore (thirdparty)
  File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/examples/examples/benchmarks/bert/src/bert_layers.py", line 1009, in forward
    outputs = self.bert(
  File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/examples/examples/benchmarks/bert/src/bert_layers.py", line 677, in forward
    encoder_outputs = self.encoder(
  File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/examples/examples/benchmarks/bert/src/bert_layers.py", line 514, in forward
    hidden_states = layer_module(hidden_states,
  File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/examples/examples/benchmarks/bert/src/bert_layers.py", line 395, in forward
    attention_output = self.attention(hidden_states, cu_seqlens, seqlen,
  File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/examples/examples/benchmarks/bert/src/bert_layers.py", line 307, in forward
    self_output = self.self(input_tensor, cu_seqlens, max_s, indices,
  File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/examples/examples/benchmarks/bert/src/bert_layers.py", line 237, in forward
    attention = flash_attn_qkvpacked_func(qkv, bias)
  File "/examples/examples/benchmarks/bert/src/flash_attn_triton.py", line 1021, in forward
    o, lse, ctx.softmax_scale = _flash_attn_forward(
  File "/examples/examples/benchmarks/bert/src/flash_attn_triton.py", line 826, in _flash_attn_forward
    _fwd_kernel[grid](  # type: ignore
  File "/usr/lib/python3/dist-packages/triton/runtime/jit.py", line 106, in launcher
    return self.run(*args, grid=grid, **kwargs)
  File "/usr/lib/python3/dist-packages/triton/runtime/autotuner.py", line 86, in run
    return self.fn.run(*args, num_warps=config.num_warps, num_stages=config.num_stages, **kwargs, **config.kwargs)
  File "/usr/lib/python3/dist-packages/triton/runtime/autotuner.py", line 200, in run
    return self.fn.run(*args, **kwargs)
  File "<string>", line 41, in _fwd_kernel
  File "/usr/lib/python3/dist-packages/triton/compiler.py", line 1268, in compile
    return CompiledKernel(name, so_cache_manager._make_path(so_name), fn_cache_manager.cache_dir, device)
  File "/usr/lib/python3/dist-packages/triton/compiler.py", line 1301, in __init__
    mod, func, n_regs, n_spills = _triton.code_gen.load_binary(metadata["name"], self.asm["cubin"], self.shared, device)
RuntimeError: CUDA: Error- invalid source

Note that in the output above I replaced the paths containing my personal information with (...).

Also note that the commands

  • composer sequence_classification.py yamls/test/sequence_classification.yaml
  • composer sequence_classification.py yamls/test/sequence_classification.yaml model.name=mosaic_bert

yield the same error message.

Did I do something wrong, or is this an error in the code? I would be incredibly grateful for any guidance, as I urgently need to fine-tune my model but am currently blocked by the issue above.
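For reference, one hedged diagnostic: the Triton flash-attention kernels typically target much newer GPU architectures than the Pascal-era P100 (compute capability 6.0), so printing the compute capability of the visible devices may already explain the "CUDA: Error- invalid source" failure:

import torch

# List each visible GPU with its compute capability; Triton flash-attention
# kernels are generally built for architectures well beyond sm_60.
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i))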

Thank you very much!

Error using PIL

Hi!

I am trying to train my own BERT model using your code. As a first step, I tested the setup with your default configuration on the C4 dataset.

I get an error when I try to run the following command for testing pre-training on the Huggingface model:
composer main.py yamls/test/main.yaml

The error message is as follows:
ImportError: cannot import name '_imaging' from 'PIL' (/usr/lib/python3/dist-packages/PIL/__init__.py)

I tried updating pip and uninstalling and reinstalling pillow, but this did not help. I followed all the steps that you mention before the pre-training command. I am using an 8xA100 80GB node.
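For reference, a quick check of which PIL installation the interpreter actually imports (the error path /usr/lib/python3/dist-packages suggests the distro-packaged PIL rather than the pip-installed pillow wheel; purely a hedged diagnostic):

import PIL

# If this prints a path under /usr/lib/python3/dist-packages, the interpreter
# is picking up the system-packaged PIL instead of the pillow installed by pip.
print(PIL.__version__, PIL.__file__)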

I would appreciate any hints that might help! :)

Confusing comment in the deeplabv3.yaml

In the DeepLab training example, the config file deeplab/yamls/deeplabv3.yaml contains:

batch_size: 128                    # Training dataloader batch size per device     (line 26)
batch_size: 128                    # Evaluation dataloader batch size per device   (line 39)

But according to main.py (lines 64-66), it should be the total batch size rather than the per-device batch size:

if dist.get_world_size():
    train_batch_size //= dist.get_world_size()
    eval_batch_size //= dist.get_world_size()
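A hedged suggestion for how the two comments could read so that they match what main.py actually does (keys and values unchanged; not meant to be loaded as-is because of the duplicate key):

# deeplab/yamls/deeplabv3.yaml -- suggested comment wording only

# line 26 (training dataloader)
batch_size: 128   # Total training batch size; main.py divides it by the world size

# line 39 (evaluation dataloader)
batch_size: 128   # Total evaluation batch size; main.py divides it by the world size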

How to add a custom key to config file?

Hi,

I'm currently working with the resnet50 training recipe. However, I'm aiming to adapt Mosaic to my custom MobileNetV2 model and need to incorporate a custom parameter into the model. I've made the following modifications:

In yamls/mobilenetv2.yaml:

# Model
model:
  name: mobilenetv2           # Name of the ResNet model to train either resnet{18, 34, 50, 101, 152}
  loss_name: binary_cross_entropy # Name of the loss function either 'cross_entropy' or 'binary_cross_entropy'
  num_classes: 1000        # Number of classes in the classification task
  compress_rate: [0.34]*8

And in model.py:

from mobilenetv2 import mobilenetv2


def build_composer_resnet(model_name: str = 'resnet50',
                          loss_name: str = 'cross_entropy',
                          num_classes: int = 1000,
                          compress_rate: list = [0.]*8):
    """Helper function to build a Composer ResNet model.

    Args:
        model_name (str, optional): Name of the ResNet model to use, either
            ['resnet18', 'resnet34', 'resnet50', 'resnet101', 'resnet152']. Default: ``'resnet50'``.
        loss_name (str, optional): Name of the loss function to use, either ['cross_entropy', 'binary_cross_entropy'].
            Default: ``'cross_entropy'``.
        num_classes (int, optional): Number of classes in the classification task. Default: ``1000``.
    """
    model = mobilenetv2(compress_rate=compress_rate, num_classes=num_classes)

However, upon running:

composer main.py yamls/mobilenetv2.yaml recipe_name=hot

I encountered the following error:

Building train dataloader
Built train dataloader

Building evaluation dataloader
Built evaluation dataloader

Building Composer model
Traceback (most recent call last):
  File "/home/van-tien.pham/momo/main.py", line 243, in <module>
    main(config)
  File "/home/van-tien.pham/momo/main.py", line 113, in main
    compress_rate=config.model.compress_rate,
  File "/home/van-tien.pham/anaconda3/envs/mosaic/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 355, in __getattr__
    self._format_and_raise(
  File "/home/van-tien.pham/anaconda3/envs/mosaic/lib/python3.9/site-packages/omegaconf/base.py", line 231, in _format_and_raise
    format_and_raise(
  File "/home/van-tien.pham/anaconda3/envs/mosaic/lib/python3.9/site-packages/omegaconf/_utils.py", line 899, in format_and_raise
    _raise(ex, cause)
  File "/home/van-tien.pham/anaconda3/envs/mosaic/lib/python3.9/site-packages/omegaconf/_utils.py", line 797, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace
  File "/home/van-tien.pham/anaconda3/envs/mosaic/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 351, in __getattr__
    return self._get_impl(
  File "/home/van-tien.pham/anaconda3/envs/mosaic/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 442, in _get_impl
    node = self._get_child(
  File "/home/van-tien.pham/anaconda3/envs/mosaic/lib/python3.9/site-packages/omegaconf/basecontainer.py", line 73, in _get_child
    child = self._get_node(
  File "/home/van-tien.pham/anaconda3/envs/mosaic/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 480, in _get_node
    raise ConfigKeyError(f"Missing key {key!s}")
omegaconf.errors.ConfigAttributeError: Missing key compress_rate
    full_key: model.compress_rate
    object_type=dict
ERROR:composer.cli.launcher:Rank 1 crashed with exit code 1.
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
Global rank 1 (PID 28118) exited with code 1
----------Begin global rank 1 STDOUT----------
Building train dataloader
Built train dataloader

Building evaluation dataloader
Built evaluation dataloader

Building Composer model

----------End global rank 1 STDOUT----------
----------Begin global rank 1 STDERR----------
Traceback (most recent call last):
  File "/home/van-tien.pham/momo/main.py", line 243, in <module>
    main(config)
  File "/home/van-tien.pham/momo/main.py", line 113, in main
    compress_rate=config.model.compress_rate,
  File "/home/van-tien.pham/anaconda3/envs/mosaic/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 355, in __getattr__
    self._format_and_raise(
  File "/home/van-tien.pham/anaconda3/envs/mosaic/lib/python3.9/site-packages/omegaconf/base.py", line 231, in _format_and_raise
    format_and_raise(
  File "/home/van-tien.pham/anaconda3/envs/mosaic/lib/python3.9/site-packages/omegaconf/_utils.py", line 899, in format_and_raise
    _raise(ex, cause)
  File "/home/van-tien.pham/anaconda3/envs/mosaic/lib/python3.9/site-packages/omegaconf/_utils.py", line 797, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace
  File "/home/van-tien.pham/anaconda3/envs/mosaic/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 351, in __getattr__
    return self._get_impl(
  File "/home/van-tien.pham/anaconda3/envs/mosaic/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 442, in _get_impl
    node = self._get_child(
  File "/home/van-tien.pham/anaconda3/envs/mosaic/lib/python3.9/site-packages/omegaconf/basecontainer.py", line 73, in _get_child
    child = self._get_node(
  File "/home/van-tien.pham/anaconda3/envs/mosaic/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 480, in _get_node
    raise ConfigKeyError(f"Missing key {key!s}")
omegaconf.errors.ConfigAttributeError: Missing key compress_rate
    full_key: model.compress_rate
    object_type=dict

----------End global rank 1 STDERR----------
ERROR:composer.cli.launcher:Global rank 0 (PID 28117) exited with code -15

How can I resolve this error omegaconf.errors.ConfigAttributeError: Missing key compress_rate?
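For reference, two hedged observations: plain YAML has no [0.34]*8 list-multiplication shorthand, so the list would need to be written out element by element; and on the main.py side the custom key could be read defensively so that a config without it does not raise. A minimal sketch (the call mirrors the traceback above; the default value is a placeholder):

# Hypothetical change in main.py: fall back to a default when the yaml that is
# actually loaded does not define model.compress_rate.
if 'compress_rate' in config.model:
    compress_rate = list(config.model.compress_rate)
else:
    compress_rate = [0.0] * 8  # placeholder default

model = build_composer_resnet(
    model_name=config.model.name,
    loss_name=config.model.loss_name,
    num_classes=config.model.num_classes,
    compress_rate=compress_rate,
)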

Thank you in advance for your help!

CUDA out of memory

Hi,

I tried to replicate the Mosaic-BERT training on the C4 dataset, following your guidelines step by step. The dataset preparation worked well. However, during BERT training with main.py I got a CUDA out-of-memory error. I did not change any hyperparameters in the respective yaml (mosaic-bert-base-uncased.yaml), except for the data path. I trained the model on an 8x80GB A100 node.

Here is the trace:

Traceback (most recent call last):
File "/examples/examples/benchmarks/bert/main.py", line 269, in
main(cfg)
File "/examples/examples/benchmarks/bert/main.py", line 256, in main
trainer.fit()
File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 1766, in fit
self._train_loop()
File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 1993, in _train_loop
self._run_evaluators(Event.BATCH_END)
File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 2071, in _run_evaluators
self._eval_loop(
File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 2724, in _eval_loop
self._original_model.update_metric(
File "/usr/lib/python3/dist-packages/composer/models/huggingface.py", line 395, in update_metric
metric.update(outputs, self.labels) # pyright: ignore [reportGeneralTypeIssues]
File "/usr/lib/python3/dist-packages/torchmetrics/metric.py", line 399, in wrapped_func
raise err
File "/usr/lib/python3/dist-packages/torchmetrics/metric.py", line 389, in wrapped_func
update(*args, **kwargs)
File "/usr/lib/python3/dist-packages/composer/metrics/nlp.py", line 123, in update
losses = self.loss_fn(logits, target)
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/lib/python3/dist-packages/torch/nn/modules/loss.py", line 1174, in forward
return F.cross_entropy(input, target, weight=self.weight,
File "/usr/lib/python3/dist-packages/torch/nn/functional.py", line 3026, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 59.62 GiB (GPU 0; 79.15 GiB total capacity; 68.06 GiB already allocated; 6.73 GiB free; 71.12 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:composer.cli.launcher:Rank 0 crashed with exit code 1.
ERROR:composer.cli.launcher:Global rank 0 (PID 1861369) exited with code 1

Furthermore, the training up to that point took quite long:

[sample=8192000/286720000]:
Train time/batch: 1999
Train time/sample: 8187904
Train time/batch_in_epoch: 1999
Train time/sample_in_epoch: 8187904
Train time/token: 1048051712
Train time/token_in_epoch: 1048051712
Train trainer/device_train_microbatch_size: 128
Train loss/train/total: 3.7229
Train metrics/train/LanguageCrossEntropy: 3.7233
Train metrics/train/MaskedAccuracy: 0.3915
Train throughput/batches_per_sec: 0.2469
Train throughput/samples_per_sec: 1011.2352
Train throughput/device/batches_per_sec: 0.2469
Train throughput/device/samples_per_sec: 1011.2352
Train throughput/tokens_per_sec: 129438.1079
Train throughput/device/tokens_per_sec: 129438.1079
Train time/train: 2.2606
Train time/val: 0.0000
Train time/total: 2.2606
Train lr-DecoupledAdamW/group0: 0.0002

I am a bit confused as you said that a key feature of mosaic-BERT is its training speed. Do you have any idea what I am doing wrong?

Thank you in advance for your help!

Update: I saw that the out-of-memory issue occurs when the model is evaluated (after 2000 batches by default). I already tried reducing global_train_batch_size from 4096 to 2048, without success.
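For anyone else hitting this, a hedged sketch of the overrides that could shrink the evaluation step's memory footprint (key names as used in the yamls in this repo; the values are illustrative only, not a validated recipe):

# Illustrative overrides only.
device_eval_batch_size: 64       # smaller per-device evaluation batch
eval_subset_num_batches: 100     # evaluate on a subset instead of the full validation split
eval_interval: 2000ba            # the default interval, shown for context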

Pre-commit checks

Should include:

  • Linting
  • Formatting
  • Running local tests (both doctests and standalone test functions)

In a manner as similar as possible to what's in Composer.
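A minimal sketch of what the config could look like (hook repos and revs are illustrative placeholders, not Composer's exact setup):

# .pre-commit-config.yaml -- illustrative sketch only
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
  - repo: https://github.com/PyCQA/isort
    rev: 5.12.0
    hooks:
      - id: isort
  - repo: local
    hooks:
      - id: pytest-local
        name: run local tests
        entry: pytest --doctest-modules
        language: system
        pass_filenames: false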

Out of scope for now:

  • Convergence runs / anything that has to happen on a GPU

MosaicBert: Training stops after first evaluation pass with Flash Attention 2

I am trying to pretrain BERT on 8 x A100 with the provided scripts, but with Flash Attention 2. My data is on a remote AWS S3 bucket in the recommended streaming-dataset format. Training starts normally and proceeds until it reaches evaluation, which happens every 2000 steps by default. Partway through the evaluation pass (the progress counter goes 1/7561 ... and stalls around 4800/7561), GPU memory and utilization drop back to zero and the run gets stuck. I had set attention_probs_dropout_prob to 0.1, since that is supported, and pretrained_model_name is 'bert-base-uncased'.
Any thoughts on this would be useful; I am not even able to see an error message.
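One hedged suggestion for surfacing where the hang happens: relaunching with the standard PyTorch/NCCL debug variables (these are generic PyTorch flags, not specific to this repo; use whatever yaml you are already running):

NCCL_DEBUG=INFO TORCH_DISTRIBUTED_DEBUG=DETAIL composer main.py <your-yaml>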

Update Readme

It's pretty good, but we should add more description and CI badges at the top.
