Comments (3)
Hi @not-lain, that's probably not a bug. The dataset has only 200 rows. Could you try with a bigger dataset?
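For intuition on why a tiny dataset can look like "logging stopped": the Trainer only logs every `logging_steps` optimizer steps, and a 200-row dataset may yield fewer steps per epoch than that interval. The numbers below (batch size, accumulation, logging interval) are assumptions for illustration, not values from this issue:

```python
import math

# Back-of-envelope: how many optimizer steps does one epoch of a
# 200-row dataset produce? All hyperparameters here are hypothetical.
dataset_rows = 200
per_device_batch = 8   # assumed per-device batch size
grad_accum = 4         # assumed gradient accumulation steps
logging_steps = 10     # assumed; the Trainer default is 500

# One optimizer step consumes per_device_batch * grad_accum rows.
steps_per_epoch = math.ceil(dataset_rows / (per_device_batch * grad_accum))
print(steps_per_epoch)  # 7
# With only 7 steps per epoch and logging every 10 steps, most epochs
# produce no new log line at all, which can be mistaken for a hang.
```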
from transformers.
Hi @SunMarc, I have tried with both not-lain/docci and not-lain/docci-small, and the same behavior persisted after the first log.
@SunMarc after some fiddling, it seems this is not related to the Trainer API but rather to my training script:
model.text_model.train()
model.config.use_cache = False
model.text_model.transformer.gradient_checkpointing_enable()
torch.autograd.set_detect_anomaly(True)
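Alongside `set_detect_anomaly`, another way to localize where a NaN first appears is to guard each step's loss before running backward. The sketch below is framework-agnostic plain Python: `compute_loss` is a hypothetical stand-in for the real forward pass (in the actual script it would be something like `model(**batch).loss`), so treat this as an illustrative pattern, not the script's code:

```python
import math

def guarded_step(compute_loss, batch, skip_log):
    """Run one training step, skipping batches whose loss is not finite.

    compute_loss is a hypothetical stand-in for the real forward pass;
    skip_log collects offending batches for later inspection.
    """
    loss = compute_loss(batch)
    if not math.isfinite(loss):
        skip_log.append(batch)  # remember which batch produced NaN/inf
        return None             # skip backward() for this batch
    # the real script would call loss.backward() and optimizer.step() here
    return loss

# Simulate three steps where the second "forward pass" returns NaN.
skips = []
losses = [0.5, float("nan"), 0.3]
results = [guarded_step(lambda b, l=l: l, i, skips)
           for i, l in enumerate(losses)]
print(results)  # [0.5, None, 0.3]
print(skips)    # [1] -> batch index 1 produced the non-finite loss
```

Logging the offending batch index often reveals a pattern (e.g. a specific sample or sequence length) that a global anomaly trace does not.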
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[28], line 10
      1 from transformers import Trainer
      3 trainer = Trainer(
      4     model=model,
      5     train_dataset=data['train'],
    (...)
      8     args=args
      9 )
---> 10 trainer.train()

File ~/.local/lib/python3.11/site-packages/transformers/trainer.py:1885, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1883     hf_hub_utils.enable_progress_bars()
   1884 else:
-> 1885     return inner_training_loop(
   1886         args=args,
   1887         resume_from_checkpoint=resume_from_checkpoint,
   1888         trial=trial,
   1889         ignore_keys_for_eval=ignore_keys_for_eval,
   1890     )

File ~/.local/lib/python3.11/site-packages/transformers/trainer.py:2216, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   2213 self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
   2215 with self.accelerator.accumulate(model):
-> 2216     tr_loss_step = self.training_step(model, inputs)
   2218 if (
   2219     args.logging_nan_inf_filter
   2220     and not is_torch_xla_available()
   2221     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
   2222 ):
   2223     # if loss is nan or inf simply add the average of previous logged losses
   2224     tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File ~/.local/lib/python3.11/site-packages/transformers/trainer.py:3250, in Trainer.training_step(***failed resolving arguments***)
   3248     scaled_loss.backward()
   3249 else:
-> 3250     self.accelerator.backward(loss)
   3252 return loss.detach() / self.args.gradient_accumulation_steps

File ~/.local/lib/python3.11/site-packages/accelerate/accelerator.py:2134, in Accelerator.backward(self, loss, **kwargs)
   2132     self.lomo_backward(loss, learning_rate)
   2133 else:
-> 2134     loss.backward(**kwargs)

File ~/.local/lib/python3.11/site-packages/torch/_tensor.py:525, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
    515 if has_torch_function_unary(self):
    516     return handle_torch_function(
    517         Tensor.backward,
    518         (self,),
    (...)
    523         inputs=inputs,
    524     )
--> 525 torch.autograd.backward(
    526     self, gradient, retain_graph, create_graph, inputs=inputs
    527 )

File ~/.local/lib/python3.11/site-packages/torch/autograd/__init__.py:267, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    262     retain_graph = create_graph
    264 # The reason we repeat the same comment below is that
    265 # some Python versions print out the first line of a multi-line function
    266 # calls in the traceback and some print out the last line
--> 267 _engine_run_backward(
    268     tensors,
    269     grad_tensors_,
    270     retain_graph,
    271     create_graph,
    272     inputs,
    273     allow_unreachable=True,
    274     accumulate_grad=True,
    275 )

File ~/.local/lib/python3.11/site-packages/torch/autograd/graph.py:744, in _engine_run_backward(t_outputs, *args, **kwargs)
    742     unregister_hooks = _register_logging_hooks_on_whole_graph(t_outputs)
    743 try:
--> 744     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    745         t_outputs, *args, **kwargs
    746     )
    747 finally:
    748     if attach_logging_hooks:

RuntimeError: Function 'LogSoftmaxBackward0' returned nan values in its 0th output.
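For context on that final error: `LogSoftmaxBackward0` returning NaN is typically a numerical-stability symptom, e.g. non-finite logits reaching the loss (common with fp16 overflow). The stdlib sketch below illustrates why a naive log-softmax blows up on large logits while the max-shifted form stays finite; it is an illustration of the general technique, not this model's actual code (PyTorch's fused kernel produces inf/NaN under IEEE semantics rather than raising, as Python's `math` does):

```python
import math

def naive_log_softmax(xs):
    # Unstable: exp(1000) exceeds float max, so this overflows.
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [math.log(e / total) for e in exps]

def stable_log_softmax(xs):
    # Shift by the max logit first; the largest exponent is then exp(0) = 1.
    m = max(xs)
    log_sum = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_sum for x in xs]

logits = [1000.0, 999.0, 998.0]
try:
    naive_log_softmax(logits)
except OverflowError:
    print("naive: overflow")  # exp(1000) is not representable
print([round(v, 4) for v in stable_log_softmax(logits)])
# [-0.4076, -1.4076, -2.4076]
```

The stable form is what framework log-softmax kernels implement internally; NaNs in its backward usually mean the *inputs* were already inf/NaN, which is why checking the logits (or the fp16 loss scale) upstream is the usual fix.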
Since this is not related to the Trainer API, I'm closing this one. Thanks for the support!