loss = 0 after first log with trainer API (transformers issue, CLOSED)

not-lain commented on July 2, 2024

loss = 0 after first log with trainer API

Comments (3)

SunMarc commented on July 2, 2024

Hi @not-lain, that's probably not a bug. The dataset has only 200 rows. Could you try with a bigger dataset?

not-lain commented on July 2, 2024

Hi @SunMarc, I have tried with both not-lain/docci and not-lain/docci-small, and the same behavior persists after the first log.

not-lain commented on July 2, 2024

@SunMarc after some fiddling, it seems that this is not related to the trainer API but rather to my training script.

before
after applying the following, training fails with the traceback below:

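import torch

model.text_model.train()  # put the text model in training mode
model.config.use_cache = False  # turn off the generation cache during training
model.text_model.transformer.gradient_checkpointing_enable()  # trade extra compute for lower memory use
torch.autograd.set_detect_anomaly(True)  # raise an error when a backward function returns NaN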

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[28], line 10
      1 from transformers import Trainer
      3 trainer = Trainer(
      4         model=model,
      5         train_dataset=data['train'],
   (...)
      8         args=args
      9         )
---> 10 trainer.train()

File ~/.local/lib/python3.11/site-packages/transformers/trainer.py:1885, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1883         hf_hub_utils.enable_progress_bars()
   1884 else:
-> 1885     return inner_training_loop(
   1886         args=args,
   1887         resume_from_checkpoint=resume_from_checkpoint,
   1888         trial=trial,
   1889         ignore_keys_for_eval=ignore_keys_for_eval,
   1890     )

File ~/.local/lib/python3.11/site-packages/transformers/trainer.py:2216, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   2213     self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
   2215 with self.accelerator.accumulate(model):
-> 2216     tr_loss_step = self.training_step(model, inputs)
   2218 if (
   2219     args.logging_nan_inf_filter
   2220     and not is_torch_xla_available()
   2221     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
   2222 ):
   2223     # if loss is nan or inf simply add the average of previous logged losses
   2224     tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File ~/.local/lib/python3.11/site-packages/transformers/trainer.py:3250, in Trainer.training_step(***failed resolving arguments***)
   3248         scaled_loss.backward()
   3249 else:
-> 3250     self.accelerator.backward(loss)
   3252 return loss.detach() / self.args.gradient_accumulation_steps

File ~/.local/lib/python3.11/site-packages/accelerate/accelerator.py:2134, in Accelerator.backward(self, loss, **kwargs)
   2132     self.lomo_backward(loss, learning_rate)
   2133 else:
-> 2134     loss.backward(**kwargs)

File ~/.local/lib/python3.11/site-packages/torch/_tensor.py:525, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
    515 if has_torch_function_unary(self):
    516     return handle_torch_function(
    517         Tensor.backward,
    518         (self,),
   (...)
    523         inputs=inputs,
    524     )
--> 525 torch.autograd.backward(
    526     self, gradient, retain_graph, create_graph, inputs=inputs
    527 )

File ~/.local/lib/python3.11/site-packages/torch/autograd/__init__.py:267, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    262     retain_graph = create_graph
    264 # The reason we repeat the same comment below is that
    265 # some Python versions print out the first line of a multi-line function
    266 # calls in the traceback and some print out the last line
--> 267 _engine_run_backward(
    268     tensors,
    269     grad_tensors_,
    270     retain_graph,
    271     create_graph,
    272     inputs,
    273     allow_unreachable=True,
    274     accumulate_grad=True,
    275 )

File ~/.local/lib/python3.11/site-packages/torch/autograd/graph.py:744, in _engine_run_backward(t_outputs, *args, **kwargs)
    742     unregister_hooks = _register_logging_hooks_on_whole_graph(t_outputs)
    743 try:
--> 744     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    745         t_outputs, *args, **kwargs
    746     )  # Calls into the C++ engine to run the backward pass
    747 finally:
    748     if attach_logging_hooks:

RuntimeError: Function 'LogSoftmaxBackward0' returned nan values in its 0th output.

Since this is not related to the trainer API, I'm closing this one. Thanks for the support!
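For readers who hit the same `LogSoftmaxBackward0` error, the sketch below illustrates one way a log-softmax can end up with NaN values: every logit in a row being `-inf` (for example, a fully masked row). This is only an assumption about a possible cause; the thread does not say what was actually wrong in the training script. The helper names `log_softmax` and `has_nan` are hypothetical and written in plain Python so the arithmetic is easy to follow.

```python
import math


def log_softmax(logits):
    """Log-softmax over a list of floats, using the usual max-subtraction trick."""
    m = max(logits)
    # If every logit is -inf, then m is -inf and (-inf) - (-inf) evaluates to nan.
    shifted = [x - m for x in logits]
    log_sum = math.log(sum(math.exp(s) for s in shifted))
    return [s - log_sum for s in shifted]


def has_nan(values):
    """Return True if any entry is NaN."""
    return any(math.isnan(v) for v in values)


finite = log_softmax([1.0, 2.0, 3.0])
partly_masked = log_softmax([0.0, float("-inf")])
fully_masked = log_softmax([float("-inf"), float("-inf")])

# Only the fully masked row contains NaN.
print(has_nan(finite), has_nan(partly_masked), has_nan(fully_masked))
```

A check like `has_nan` applied to the logits or labels of a single batch can help narrow down where non-finite values first appear, before they reach the backward pass.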
