
llm-workshop's Introduction

hey there

About Me:

I'm Sourab Mangrulkar, an Applied Scientist and Machine Learning Engineer from India 🇮🇳.

  • 🔭 I'm currently working as an Applied Scientist at Amazon.
  • 🌱 Exploring Natural Language Processing, Computer Vision and Distributed Training at Scale. Always up for meaningful collaboration.
  • 😄 Pronouns: He/His/Him.
  • ⚡ Painting 🎨, sketching ✏️ and poetry 📝 are my favourite hobbies. Recently, I've started reading up on stocks and financial markets.
  • 📫 How to reach me: LinkedIn

📝 Research:

✍️ Blog Posts:

💬 Talks and Presentations

llm-workshop's People

Contributors

arpieb, pacman100, singl3, timxx


llm-workshop's Issues

Chat assistant training: CUDA out of memory

Hi, I am getting a CUDA out of memory error when I try to run the chat_assistant training's run_fsdp.sh script on a 34b model. Changing the model from 7b to 34b is the only change I made.

Local edits

I only edited chat_assistant/training/run_fsdp.sh to replace the 7b model with a 34b model. Screenshot:
[screenshot]

Stack trace

  File "/home/ubuntu/steven/DHS-LLM-Workshop/chat_assistant/training/train.py", line 252, in <module>
    main(args)
  File "/home/ubuntu/steven/DHS-LLM-Workshop/chat_assistant/training/train.py", line 223, in main
    trainer.train()
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/transformers/trainer.py", line 1556, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/transformers/trainer.py", line 1872, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/transformers/trainer.py", line 2748, in training_step
    self.accelerator.backward(loss)
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/accelerate/accelerator.py", line 1986, in backward
    loss.backward(**kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/autograd/function.py", line 288, in apply
    return user_fn(self, *args)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 288, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.29 GiB. GPU 1 has a total capacty of 79.19 GiB of which 1.26 GiB is free. Including non-PyTorch memory, this process has 77.93 GiB memory in use. Of the allocated memory 74.62 GiB is allocated by PyTorch, and 1.24 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):

Hardware / GPU info

I was running this job on a machine with 8 H100 GPUs. Below is a screenshot of the nvidia-smi -l 1 output; you can see that within about 5 seconds the GPU memory usage spiked from around 42 GiB on all 8 devices to 80 GiB on all 8 devices, and then the process crashed.

[nvidia-smi screenshots]

Full logs:
full_logs_dhs.txt

Let me know if there's any additional information I can provide to be helpful. Thanks in advance!
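For what it's worth, the OOM message itself points at two knobs worth trying first; below is a minimal sketch with illustrative values only (not the repo's run_fsdp.sh settings):

```python
# Two memory-pressure mitigations suggested by the error above; illustrative
# values, to be adapted to the actual FSDP setup in run_fsdp.sh.
import os

# 1) Reduce allocator fragmentation, as the OOM message recommends. Must be
#    set before the first CUDA allocation (top of train.py, or exported in
#    the shell before launching).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

# 2) Trade compute for memory with activation (gradient) checkpointing, and
#    keep the micro-batch as small as possible.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out-34b",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # preserves the effective batch size
    gradient_checkpointing=True,     # recompute activations in the backward pass
    bf16=True,
)
```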

Finetune 70B model on one node

Thanks for your educational blog post and this repo.

Could you please provide your scripts to finetune the 70B model in this repo?

BTW, when I run your 7B fine-tuning script with FSDP and batch size 1, it consumes ~60 GB of GPU memory. Is this normal?
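As a rough sanity check (assuming full fine-tuning, AdamW, and bf16 mixed precision, with FSDP fully sharding parameters, gradients, and optimizer states), the persistent state alone for a 7B model is sizeable before activations are counted:

```python
# Back-of-the-envelope memory for full fine-tuning a 7B model with AdamW in
# bf16 mixed precision (no LoRA/quantization assumed).
params = 7e9
bytes_per_param = 2 + 2 + 4 + 4 + 4   # bf16 weights + bf16 grads + fp32 master + Adam m + Adam v

total_gib = params * bytes_per_param / 2**30
for n_gpus in (1, 2, 4, 8):
    # FSDP (FULL_SHARD) splits this state across ranks; activations, temporary
    # all-gather buffers and fragmentation come on top.
    print(f"{n_gpus} GPU(s): ~{total_gib / n_gpus:.0f} GiB per GPU of sharded state")
```

Under those assumptions, ~60 GB per GPU is not surprising on a small GPU count once activations and all-gather buffers are added; LoRA or CPU offload would bring it down substantially.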

No version mentioned for any library used in the project

I tried to follow https://huggingface.co/blog/personal-copilot

But everything is breaking, probably because the required dependency versions are not mentioned. Isn't it good practice to pin the versions so that the software keeps working? How can I know which dependency versions are compatible as of today?

Can you please make sure to update requirements.txt? Also, please update the Python dependency versions in the Colab link as well (referring to the PEFT one).
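Until pinned requirements land, one way to capture a working environment is to record the installed versions directly; the package list below is an assumption about what the scripts use, not taken from the repo:

```python
# Print "package==version" lines for the libraries these scripts typically
# rely on, ready to paste into requirements.txt. Package list is a guess.
from importlib.metadata import version, PackageNotFoundError

for pkg in ["torch", "transformers", "peft", "accelerate", "trl",
            "bitsandbytes", "datasets", "flash-attn"]:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"# {pkg} not installed")
```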

Incorrectness in Flash Attention

We completely ignore attention_mask here: https://github.com/pacman100/DHS-LLM-Workshop/blob/53672e1b774da7798fb10a50ef8ca5b2750c5608/personal_copilot/training/starcoder_flash_attn_monkey_patch.py#L60

If the input had padding, then this is incorrect (not by a major amount I think but that might depend on how much padding the input has).
We need to maintain cu_seqlens and use the packed version of flash here.
But the current implementation is easier to write and maintain, I guess?

Can we add a note regarding the incorrect behaviour?
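For reference, the bookkeeping the issue is asking for is small; here is a sketch of deriving cu_seqlens from the attention mask for the varlen/packed flash-attention kernels (an illustration of the idea, not the repo's code):

```python
# Derive cu_seqlens/max_seqlen from a 0/1 attention_mask so a varlen
# flash-attention call can respect real sequence boundaries instead of
# attending across padding.
import torch
import torch.nn.functional as F

def cu_seqlens_from_mask(attention_mask: torch.Tensor):
    # attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    seqlens = attention_mask.sum(dim=-1, dtype=torch.int32)
    cu_seqlens = F.pad(torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0))
    max_seqlen = int(seqlens.max())
    return cu_seqlens, max_seqlen
```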

How to load a model trained with Accelerate + FSDP

Hi,

I fine-tuned with Accelerate using FSDP, but I do not know how to load the checkpoint for inference. The checkpoint output is as below:

checkpoint-100

  • optimizer_0
    • __0_0.distcp
    • __1_0.distcp
    • __2_0.distcp
  • pytorch_model_fsdp_0
    • .metadata
    • __0_0.distcp
    • __1_0.distcp
    • __2_0.distcp
  • rng_state_0.pth
  • rng_state_1.pth
  • rng_state_2.pth
  • scheduler.pt
  • trainer_state.json

Looking forward to your reply
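One possible route, assuming PyTorch ≥ 2.2 (where torch.distributed.checkpoint.format_utils.dcp_to_torch_save is available) and that pytorch_model_fsdp_0/ holds the sharded model state, is to consolidate the shards into a single file and load it into the base model. A sketch with placeholder paths and model id:

```python
# Consolidate a sharded (DCP) FSDP checkpoint into one torch file and load it
# for inference. Paths and the base model id are placeholders; the key layout
# of the saved state dict may differ depending on how it was written.
import torch
from torch.distributed.checkpoint.format_utils import dcp_to_torch_save
from transformers import AutoModelForCausalLM

dcp_to_torch_save("checkpoint-100/pytorch_model_fsdp_0", "consolidated.pt")

state = torch.load("consolidated.pt", map_location="cpu")
state = state.get("model", state)   # unwrap if the weights were saved under a "model" key

model = AutoModelForCausalLM.from_pretrained("<base-model-id>", torch_dtype=torch.bfloat16)
model.load_state_dict(state, strict=False)
```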

Error on save_steps using FSDP

I am currently using the FSDP (Fully Sharded Data Parallel) approach with the Llama 2 70B model. Training starts, but I encounter an error when attempting to save the checkpoint at each save step. I have set save_steps to 50.

System: 1 node with 2 A100 80 GB GPUs

Here are the supporting screenshots:
[screenshots: MicrosoftTeams-image (1), MicrosoftTeams-image (2)]

@pacman100

Segmentation fault during training

python train.py \
    --model_path "bigcode/starcoderbase-1b" \
    --dataset_name "smangrul/hf-stack-v1" \
    --subset "data" \
    --data_column "content" \
    --split "train" \
    --seq_length 2048 \
    --max_steps 2000 \
    --batch_size 1 \
    --gradient_accumulation_steps 1 \
    --learning_rate 5e-5 \
    --lr_scheduler_type "cosine" \
    --weight_decay 0.01 \
    --num_warmup_steps 30 \
    --eval_freq 100 \
    --save_freq 500 \
    --log_freq 25 \
    --use_reentrant False \
    --num_workers 4 \
    --bf16 \
    --no_fp16 \
    --output_dir "starcoderbase1b-personal-copilot-A100-40GB-colab" \
    --fim_rate 0.5 \
    --fim_spm_rate 0.5 \
    --use_flash_attn

Error:
/u01/liuys/anaconda3/envs/code/lib/python3.10/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
/u01/liuys/anaconda3/envs/code/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
Segmentation fault (core dumped)

Packing = True

Quick Question:

You have kept packing=True.
Shouldn't we be computing the loss only for the response part? With packing set to true, we lose the ability to restrict the loss to the response template.
So, is it OK to train (compute loss) right from the first token (including the prompt) in SFT?

@pacman100
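If the loss should be restricted to the response tokens, one option (at the cost of giving up packing) is TRL's completion-only collator. Below is a sketch with a made-up response marker and a toy dataset; exact SFTTrainer arguments vary across TRL versions:

```python
# Mask everything before the response template so the loss is computed only on
# the assistant's reply. Requires packing=False; the "### Assistant:" marker
# and the toy dataset are assumptions for illustration.
from datasets import Dataset
from transformers import AutoTokenizer
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

train_dataset = Dataset.from_dict(
    {"text": ["### Human: Hi there ### Assistant: Hello! How can I help?"]}
)

tokenizer = AutoTokenizer.from_pretrained("<model-id>")   # placeholder
collator = DataCollatorForCompletionOnlyLM("### Assistant:", tokenizer=tokenizer)

trainer = SFTTrainer(
    model="<model-id>",               # placeholder
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    data_collator=collator,
    packing=False,                    # completion-only masking needs unpacked examples
)
```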

Enhancements for Efficient Utilization and Optimization in Fine-tuning Llama 2 70B Example

Hi @pacman100 ,

Firstly, thank you for the well-detailed article! I am writing to provide some feedback and seek clarification.

  1. Optimizer Selection:

    • The blog post demonstrates the use of a particular optimizer, "paged_adamw_32bit". However, upon altering this to "adamw_torch", I encountered an Out Of Memory (OOM) issue. Could you elucidate on the critical role the default optimizer plays in the successful execution of the example provided? Any insight into why the memory issue arises with "adamw_torch" would be highly valuable.
  2. GPU Utilization:

    • In attempting to replicate the described setup on 2 nodes x 8 H100s, I observed a relatively low GPU utilization of around 20%, with the GPUs drawing only about 200 W. Is there any recommendation on how to raise GPU utilization, to potentially speed up training and make full use of the available compute?

Your guidance will be immensely beneficial!

Thank you for your time.
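On point 1, a plausible explanation is simply where the optimizer state lives: bitsandbytes' paged optimizers keep AdamW state in paged (unified) memory that can spill to CPU RAM under pressure, while adamw_torch keeps the same state resident on the GPUs. Rough numbers for a 70B model, illustrative only:

```python
# AdamW keeps two fp32 moments per parameter (~8 bytes/param). For a 70B model
# that state alone is large; with adamw_torch it must fit on the GPUs, while
# paged_adamw_32bit can page it out to CPU RAM, which is likely why switching
# optimizers tips the run into OOM.
params = 70e9
adam_state_gib = params * 8 / 2**30
gpus = 16   # 2 nodes x 8 H100s
print(f"AdamW state: ~{adam_state_gib:.0f} GiB total, "
      f"~{adam_state_gib / gpus:.0f} GiB per GPU if fully sharded")
```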

Flash Attention for fine-tuning

How can we use flash attention v2 for fine-tuning with huggingface models?

Does this path only work for pre-training (or extended pre-training)?

All the discussions mentioned below are for pre-training (or extended pre-training).

I would like to fine-tune the bigcode/starcoder 15.5-billion-parameter model with a 2k context length on an A100-80GB.
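For recent transformers releases (roughly 4.36 and later), FlashAttention-2 can be enabled at load time for supported architectures and works for fine-tuning as well as (extended) pre-training; a minimal sketch:

```python
# Enable FlashAttention-2 when loading a supported model for fine-tuning.
# Requires the flash-attn package and a half-precision dtype.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder",                     # example; architecture must have FA2 support
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```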

Using device_map='auto' when launching with accelerate

In https://github.com/pacman100/DHS-LLM-Workshop/blob/main/chat_assistant/training/utils.py#L182C9-L182C19, what is the reason for setting device_map = 'auto'?
When I run it with accelerate (with FSDP) I get this error:

ValueError: You can't train a model that has been loaded with `device_map='auto'` in any distributed mode. Please rerun your script specifying `--num_processes=1` or by launching with `python {{myscript.py}}`.

Also, huggingface/accelerate#1840 states that it is not compatible with DistributedDataParallel.
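The quoted error describes the usual resolution: under an accelerate/FSDP launch the model should be loaded without device_map, letting the launcher and the FSDP wrapper handle placement; device_map='auto' is intended for single-process inference where layers are spread over the available devices. A sketch:

```python
# Load without device_map when training under accelerate + FSDP; the FSDP
# wrapper shards and moves parameters itself. Model id is a placeholder.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "<model-id>",
    torch_dtype=torch.bfloat16,
    # no device_map here; reserve device_map="auto" for single-process inference
)
```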

Problem training with FSDP

When I try to train a model with FSDP, I get the following error:

*** TypeError: isinstance() arg 2 must be a type, a tuple of types, or a union

It happens on this specific line:
trainer.model = trainer.model_wrapped = FSDP(trainer.model, **kwargs)

and after a bit of debugging it looks like it has something to do with auto_wrap_policy. I am not really sure how to solve this. Do you have any suggestions? It was working fine until a few days ago.
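That isinstance() failure usually means the set passed as transformer_layer_cls ended up containing None (for example, a configured layer-class name that no longer resolves after a library update). Building the wrap policy explicitly makes the intent visible; a sketch assuming a Llama-family model:

```python
# Explicit FSDP auto-wrap policy for a Llama-style model. The classes in
# transformer_layer_cls must be actual classes; a None in that set produces
# the isinstance() error seen above.
import functools
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={LlamaDecoderLayer},
)
# then: FSDP(model, auto_wrap_policy=auto_wrap_policy, ...)
```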

Fine Tuning with LoRA failed during train step

Below is the notebook link from your blog - https://huggingface.co/blog/personal-copilot
https://colab.research.google.com/drive/1Tz9KKgacppA4S6H4eo_sw43qEaC9lFLs?usp=sharing

!git pull
!python train.py \
    --model_name_or_path "bigcode/starcoder" \
    --dataset_name "smangrul/hf-stack-v1" \
    --subset "data" \
    --data_column "content" \
    --splits "train" \
    --seq_length 2048 \
    --max_steps 2000 \
    --batch_size 4 \
    --gradient_accumulation_steps 4 \
    --learning_rate 5e-4 \
    --lr_scheduler_type "cosine" \
    --weight_decay 0.01 \
    --num_warmup_steps 30 \
    --eval_freq 100 \
    --save_freq 100 \
    --log_freq 25 \
    --num_workers 4 \
    --bf16 \
    --no_fp16 \
    --output_dir "peft-lora-starcoder15B-v2-personal-copilot-A100-40GB-colab" \
    --fim_rate 0.5 \
    --fim_spm_rate 0.5 \
    --use_peft_lora \
    --lora_r 32 \
    --lora_alpha 64 \
    --lora_dropout 0.0 \
    --lora_target_modules "c_proj,c_attn,q_attn,c_fc,c_proj" \
    --use_flash_attn \
    --use_4bit_qunatization \
    --use_nested_quant \
    --bnb_4bit_compute_dtype "bfloat16"

I am stuck at this step.

Below is the error

Already up to date.
2024-05-09 20:44:58.617684: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-09 20:44:58.617733: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-09 20:44:58.619695: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-05-09 20:44:58.630452: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-05-09 20:45:00.111432: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Traceback (most recent call last):
  File "/content/DHS-LLM-Workshop/personal_copilot/training/train.py", line 494, in <module>
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/usr/local/lib/python3.10/dist-packages/transformers/hf_argparser.py", line 348, in parse_args_into_dataclasses
    raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
ValueError: Some specified arguments are not used by the HfArgumentParser: ['--subset', 'data', '--data_column', 'content', '--seq_length', '2048', '--batch_size', '4', '--num_warmup_steps', '30', '--eval_freq', '100', '--save_freq', '100', '--log_freq', '25', '--num_workers', '4', '--no_fp16', '--use_4bit_qunatization']

train.py: error: ambiguous option: --split could match --splits, --split_batches

I uploaded my private GitHub repo as a dataset to a private Hugging Face dataset. Below is the error I receive when I try to train using the PEFT method:

2024-05-05 18:43:36.206142: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-05 18:43:36.206194: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-05 18:43:36.207621: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-05-05 18:43:36.214881: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-05-05 18:43:37.321192: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
usage: train.py [-h] --model_name_or_path MODEL_NAME_OR_PATH [--lora_alpha LORA_ALPHA]
[--lora_dropout LORA_DROPOUT] [--lora_r LORA_R]
[--lora_target_modules LORA_TARGET_MODULES]
[--use_nested_quant [USE_NESTED_QUANT]]
[--bnb_4bit_compute_dtype BNB_4BIT_COMPUTE_DTYPE]
[--bnb_4bit_quant_type BNB_4BIT_QUANT_TYPE] [--use_flash_attn [USE_FLASH_ATTN]]
[--use_peft_lora [USE_PEFT_LORA]]
[--use_8bit_qunatization [USE_8BIT_QUNATIZATION]]
[--use_4bit_quantization [USE_4BIT_QUANTIZATION]]
[--use_reentrant [USE_REENTRANT]] [--use_unsloth [USE_UNSLOTH]]
[--use_loftq [USE_LOFTQ]] [--use_loftq_callback [USE_LOFTQ_CALLBACK]]
[--dataset_name DATASET_NAME] [--dataset_text_field DATASET_TEXT_FIELD]
[--max_seq_length MAX_SEQ_LENGTH] [--test_size TEST_SIZE] [--fim_rate FIM_RATE]
[--fim_spm_rate FIM_SPM_RATE] [--splits SPLITS] --output_dir OUTPUT_DIR
[--overwrite_output_dir [OVERWRITE_OUTPUT_DIR]] [--do_train [DO_TRAIN]]
[--do_eval [DO_EVAL]] [--do_predict [DO_PREDICT]]
[--eval_strategy {no,steps,epoch}] [--prediction_loss_only [PREDICTION_LOSS_ONLY]]
[--per_device_train_batch_size PER_DEVICE_TRAIN_BATCH_SIZE]
[--per_device_eval_batch_size PER_DEVICE_EVAL_BATCH_SIZE]
[--per_gpu_train_batch_size PER_GPU_TRAIN_BATCH_SIZE]
[--per_gpu_eval_batch_size PER_GPU_EVAL_BATCH_SIZE]
[--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS]
[--eval_accumulation_steps EVAL_ACCUMULATION_STEPS] [--eval_delay EVAL_DELAY]
[--learning_rate LEARNING_RATE] [--weight_decay WEIGHT_DECAY]
[--adam_beta1 ADAM_BETA1] [--adam_beta2 ADAM_BETA2] [--adam_epsilon ADAM_EPSILON]
[--max_grad_norm MAX_GRAD_NORM] [--num_train_epochs NUM_TRAIN_EPOCHS]
[--max_steps MAX_STEPS]
[--lr_scheduler_type {linear,cosine,cosine_with_restarts,polynomial,constant,constant_with_warmup,inverse_sqrt,reduce_lr_on_plateau,cosine_with_min_lr,warmup_stable_decay}]
[--lr_scheduler_kwargs LR_SCHEDULER_KWARGS] [--warmup_ratio WARMUP_RATIO]
[--warmup_steps WARMUP_STEPS]
[--log_level {detail,debug,info,warning,error,critical,passive}]
[--log_level_replica {detail,debug,info,warning,error,critical,passive}]
[--log_on_each_node [LOG_ON_EACH_NODE]] [--no_log_on_each_node]
[--logging_dir LOGGING_DIR] [--logging_strategy {no,steps,epoch}]
[--logging_first_step [LOGGING_FIRST_STEP]] [--logging_steps LOGGING_STEPS]
[--logging_nan_inf_filter [LOGGING_NAN_INF_FILTER]] [--no_logging_nan_inf_filter]
[--save_strategy {no,steps,epoch}] [--save_steps SAVE_STEPS]
[--save_total_limit SAVE_TOTAL_LIMIT] [--save_safetensors [SAVE_SAFETENSORS]]
[--no_save_safetensors] [--save_on_each_node [SAVE_ON_EACH_NODE]]
[--save_only_model [SAVE_ONLY_MODEL]]
[--restore_callback_states_from_checkpoint [RESTORE_CALLBACK_STATES_FROM_CHECKPOINT]]
[--no_cuda [NO_CUDA]] [--use_cpu [USE_CPU]] [--use_mps_device [USE_MPS_DEVICE]]
[--seed SEED] [--data_seed DATA_SEED] [--jit_mode_eval [JIT_MODE_EVAL]]
[--use_ipex [USE_IPEX]] [--bf16 [BF16]] [--fp16 [FP16]]
[--fp16_opt_level FP16_OPT_LEVEL] [--half_precision_backend {auto,apex,cpu_amp}]
[--bf16_full_eval [BF16_FULL_EVAL]] [--fp16_full_eval [FP16_FULL_EVAL]]
[--tf32 TF32] [--local_rank LOCAL_RANK]
[--ddp_backend {nccl,gloo,mpi,ccl,hccl,cncl}] [--tpu_num_cores TPU_NUM_CORES]
[--tpu_metrics_debug [TPU_METRICS_DEBUG]] [--debug DEBUG [DEBUG ...]]
[--dataloader_drop_last [DATALOADER_DROP_LAST]] [--eval_steps EVAL_STEPS]
[--dataloader_num_workers DATALOADER_NUM_WORKERS]
[--dataloader_prefetch_factor DATALOADER_PREFETCH_FACTOR]
[--past_index PAST_INDEX] [--run_name RUN_NAME] [--disable_tqdm DISABLE_TQDM]
[--remove_unused_columns [REMOVE_UNUSED_COLUMNS]] [--no_remove_unused_columns]
[--label_names LABEL_NAMES [LABEL_NAMES ...]]
[--load_best_model_at_end [LOAD_BEST_MODEL_AT_END]]
[--metric_for_best_model METRIC_FOR_BEST_MODEL]
[--greater_is_better GREATER_IS_BETTER] [--ignore_data_skip [IGNORE_DATA_SKIP]]
[--fsdp FSDP] [--fsdp_min_num_params FSDP_MIN_NUM_PARAMS]
[--fsdp_config FSDP_CONFIG]
[--fsdp_transformer_layer_cls_to_wrap FSDP_TRANSFORMER_LAYER_CLS_TO_WRAP]
[--accelerator_config ACCELERATOR_CONFIG] [--deepspeed DEEPSPEED]
[--label_smoothing_factor LABEL_SMOOTHING_FACTOR]
[--optim {adamw_hf,adamw_torch,adamw_torch_fused,adamw_torch_xla,adamw_torch_npu_fused,adamw_apex_fused,adafactor,adamw_anyprecision,sgd,adagrad,adamw_bnb_8bit,adamw_8bit,lion_8bit,lion_32bit,paged_adamw_32bit,paged_adamw_8bit,paged_lion_32bit,paged_lion_8bit,rmsprop,rmsprop_bnb,rmsprop_bnb_8bit,rmsprop_bnb_32bit,galore_adamw,galore_adamw_8bit,galore_adafactor,galore_adamw_layerwise,galore_adamw_8bit_layerwise,galore_adafactor_layerwise}]
[--optim_args OPTIM_ARGS] [--adafactor [ADAFACTOR]]
[--group_by_length [GROUP_BY_LENGTH]] [--length_column_name LENGTH_COLUMN_NAME]
[--report_to REPORT_TO] [--ddp_find_unused_parameters DDP_FIND_UNUSED_PARAMETERS]
[--ddp_bucket_cap_mb DDP_BUCKET_CAP_MB]
[--ddp_broadcast_buffers DDP_BROADCAST_BUFFERS]
[--dataloader_pin_memory [DATALOADER_PIN_MEMORY]] [--no_dataloader_pin_memory]
[--dataloader_persistent_workers [DATALOADER_PERSISTENT_WORKERS]]
[--skip_memory_metrics [SKIP_MEMORY_METRICS]] [--no_skip_memory_metrics]
[--use_legacy_prediction_loop [USE_LEGACY_PREDICTION_LOOP]]
[--push_to_hub [PUSH_TO_HUB]] [--resume_from_checkpoint RESUME_FROM_CHECKPOINT]
[--hub_model_id HUB_MODEL_ID]
[--hub_strategy {end,every_save,checkpoint,all_checkpoints}]
[--hub_token HUB_TOKEN] [--hub_private_repo [HUB_PRIVATE_REPO]]
[--hub_always_push [HUB_ALWAYS_PUSH]]
[--gradient_checkpointing [GRADIENT_CHECKPOINTING]]
[--gradient_checkpointing_kwargs GRADIENT_CHECKPOINTING_KWARGS]
[--include_inputs_for_metrics [INCLUDE_INPUTS_FOR_METRICS]]
[--eval_do_concat_batches [EVAL_DO_CONCAT_BATCHES]] [--no_eval_do_concat_batches]
[--fp16_backend {auto,apex,cpu_amp}] [--evaluation_strategy {no,steps,epoch}]
[--push_to_hub_model_id PUSH_TO_HUB_MODEL_ID]
[--push_to_hub_organization PUSH_TO_HUB_ORGANIZATION]
[--push_to_hub_token PUSH_TO_HUB_TOKEN] [--mp_parameters MP_PARAMETERS]
[--auto_find_batch_size [AUTO_FIND_BATCH_SIZE]]
[--full_determinism [FULL_DETERMINISM]] [--torchdynamo TORCHDYNAMO]
[--ray_scope RAY_SCOPE] [--ddp_timeout DDP_TIMEOUT]
[--torch_compile [TORCH_COMPILE]] [--torch_compile_backend TORCH_COMPILE_BACKEND]
[--torch_compile_mode TORCH_COMPILE_MODE] [--dispatch_batches DISPATCH_BATCHES]
[--split_batches SPLIT_BATCHES]
[--include_tokens_per_second [INCLUDE_TOKENS_PER_SECOND]]
[--include_num_input_tokens_seen [INCLUDE_NUM_INPUT_TOKENS_SEEN]]
[--neftune_noise_alpha NEFTUNE_NOISE_ALPHA]
[--optim_target_modules OPTIM_TARGET_MODULES]
train.py: error: ambiguous option: --split could match --splits, --split_batches

Eval seems to run forever

Hello author,

Thanks for your tutorial.

I am using the dataset hf-codegen-v2 which has 370k rows.

The validation set has about 1,850 examples and the batch size is 4. The other params are the same as the ones in run_peft.sh.

The training speed is normal but the eval loop is running forever.

Below is the log for evaluation:
{'eval_loss': 0.1817416101694107, 'eval_runtime': 6666.0306, 'eval_samples_per_second': 8.072, 'eval_steps_per_second': 2.018, 'epoch': 0.5}

I am not sure if this is normal.

Any help will be appreciated!

ZD
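For reference, the eval log above implies far more examples than the expected ~1,850, which would by itself explain the runtime; a quick check of the reported numbers:

```python
# Multiply the reported throughput by the runtime to see how many examples the
# Trainer actually evaluated.
eval_runtime = 6666.0306      # seconds
samples_per_second = 8.072
steps_per_second = 2.018

print(samples_per_second * eval_runtime)   # ~53,800 examples, not ~1,850
print(steps_per_second * eval_runtime)     # ~13,450 eval steps at batch size 4
```

Whether that comes from the split genuinely being larger or from the packed/constant-length dataset expanding the eval examples would need to be checked against the dataset preparation code.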
