Comments (14)

mkolodny commented on August 30, 2024 3

@prigoyal I ended up getting Tensorboard working in Colab with a couple edits:

  1. Load the notebook extension with %load_ext tensorboard
  2. Edit the config in the training command from config.TENSORBOARD_SETUP.USE_TENSORBOARD=true to config.HOOKS.TENSORBOARD_SETUP.USE_TENSORBOARD=true
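
For anyone adapting an older training command, the key rename in step 2 can be sketched as a small helper. This is a hypothetical illustration, not part of VISSL: `migrate_override`, `migrate_overrides`, `OLD_PREFIX` and `NEW_PREFIX` are names made up here, and only the two config keys come from this thread.

```python
# Hypothetical helper (not part of VISSL): rewrite the older Tensorboard
# override key to the config.HOOKS.* form described in step 2.
OLD_PREFIX = "config.TENSORBOARD_SETUP."
NEW_PREFIX = "config.HOOKS.TENSORBOARD_SETUP."


def migrate_override(override: str) -> str:
    """Return the override with the old Tensorboard prefix replaced, if present."""
    if override.startswith(OLD_PREFIX):
        return NEW_PREFIX + override[len(OLD_PREFIX):]
    return override


def migrate_overrides(overrides):
    """Apply migrate_override to each override, leaving other keys untouched."""
    return [migrate_override(o) for o in overrides]


if __name__ == "__main__":
    old = ["config.TENSORBOARD_SETUP.USE_TENSORBOARD=true"]
    print(migrate_overrides(old))
```

Keys that already use the HOOKS prefix pass through unchanged, so the helper can be applied to a command more than once.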

from vissl.

pcanas commented on August 30, 2024 2

Hi @prigoyal, I am following the tutorial "Understanding VISSL Training and YAML Config" and I have the exact same issue as @Tylersuard.
I have tried what you suggested but was not able to solve it. I have not changed any code from the original notebook.
Thank you!

mkolodny commented on August 30, 2024 1

I'm seeing the same issue in two different notebooks - Train SimCLR on 1 gpu.ipynb and Understanding VISSL Training and YAML Config.

It looks like the models aren't being trained. After running python3 run_distributed_engines.py ..., the tensorboard directory is the last info that's logged. Then the command exits.

If I remove the config.TENSORBOARD_SETUP.USE_TENSORBOARD=true line from the command, running the command gets a little farther. Then I see the error:

INFO 2021-04-07 03:45:31,394 util.py: 241: Broadcasting checkpoint loaded from 
Traceback (most recent call last):
  File "run_distributed_engines.py", line 194, in <module>
    hydra_main(overrides=overrides)
  File "run_distributed_engines.py", line 179, in hydra_main
    hook_generator=default_hook_generator,
  File "run_distributed_engines.py", line 123, in launch_distributed
    hook_generator=hook_generator,
  File "run_distributed_engines.py", line 166, in _distributed_worker
    process_main(cfg, dist_run_id, local_rank=local_rank, node_id=node_id)
  File "run_distributed_engines.py", line 159, in process_main
    hook_generator=hook_generator,
  File "/usr/local/lib/python3.7/dist-packages/vissl/engines/train.py", line 102, in train_main
    trainer.train()
  File "/usr/local/lib/python3.7/dist-packages/vissl/trainer/trainer_main.py", line 155, in train
    self.task.prepare(pin_memory=self.cfg.DATA.PIN_MEMORY)
  File "/usr/local/lib/python3.7/dist-packages/vissl/trainer/train_task.py", line 634, in prepare
    self.base_model = self._build_model()
  File "/usr/local/lib/python3.7/dist-packages/vissl/trainer/train_task.py", line 462, in _build_model
    model = self._restore_model_weights(model)
  File "/usr/local/lib/python3.7/dist-packages/vissl/trainer/train_task.py", line 399, in _restore_model_weights
    append_prefix=append_prefix,
  File "/usr/local/lib/python3.7/dist-packages/vissl/utils/checkpoint.py", line 404, in init_model_from_weights
    state_dict_key_name in state_dict.keys()
AttributeError: 'NoneType' object has no attribute 'keys'
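
The last frame shows `.keys()` being called on a `state_dict` that is `None`, which suggests no checkpoint was actually loaded. A defensive check along the following lines would surface that earlier with a clearer message. This is only a sketch under assumptions: `require_state_dict` is a hypothetical name, `weights.torch` is a placeholder path, and this is neither the fix referenced in #248 nor VISSL's actual code.

```python
from typing import Any, Mapping, Optional


def require_state_dict(
    state_dict: Optional[Mapping[str, Any]], source: str
) -> Mapping[str, Any]:
    """Fail early with a readable message when checkpoint loading returned None."""
    if state_dict is None:
        raise ValueError(
            f"No checkpoint could be loaded from {source!r}; "
            "check the weights path and the library used to read the file."
        )
    return state_dict


if __name__ == "__main__":
    # Placeholder checkpoint contents, for illustration only.
    loaded = {"model_state_dict": {}}
    checked = require_state_dict(loaded, "weights.torch")
    print("model_state_dict" in checked.keys())
```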

pcanas commented on August 30, 2024 1

Hi @prigoyal, I was finally able to run the training successfully. Here are my notes in case they help you debug the issue further:

  • The proposed alternative of updating the fvcore dependency by running pip install --progress-bar off --upgrade iopath did not work for me.
  • What worked for me is the fix you proposed in #248: changing line https://github.com/facebookresearch/vissl/blob/master/vissl/trainer/train_task.py#L464
  • Everything above only works if config.TENSORBOARD_SETUP.USE_TENSORBOARD is set to false. Otherwise (even with the fix), the script gets stuck at the point mentioned by @Tylersuard, so there is definitely an issue there.

Hope it helps! And thank you for your support!

mkolodny commented on August 30, 2024 1

> thank you for pointing this out. VISSL has evolved since the v0.1.5 package release. We will update the tutorials to reflect this change. :)

You're welcome! I'm happy I could help :) And thanks for updating the tutorials so quickly

prigoyal commented on August 30, 2024

Hi @Tylersuard, thank you for reaching out. This is quite weird, and from the logs above I can't spot anything obvious. One possibility: the Colab server probably didn't start or got disconnected. Would it be possible to retry the workflow and verify? (Maybe try a few different tutorials.)

Alternatively, are you able to build VISSL from source and use that: https://github.com/facebookresearch/vissl/blob/stable/INSTALL.md#install-from-source-in-pip-environment ? If that works on your machine, we can determine whether it's a VISSL issue or simply a Colab issue :)

prigoyal commented on August 30, 2024

thank you @mkolodny, the AttributeError: 'NoneType' object has no attribute 'keys' comes from the fvcore dependency. Please look at #248 (comment); an alternative is to run pip install --progress-bar off --upgrade iopath, which should update the dependency. If the error doesn't go away, do let me know. We will need to report this to the fvcore team.

prigoyal commented on August 30, 2024

> Hi @prigoyal I am following the tutorial "Understanding VISSL Training and YAML Config" and I have the exact same issue as @Tylersuard.
> I have tried what you suggested but I was not able to solve it. I have not changed any code from the original notebook.
> Thank you!

thank you @pcanas and @mkolodny, my guess is that the stall could be coming from the tensorboard logging. For debugging, can you try setting config.TENSORBOARD_SETUP.USE_TENSORBOARD=false and see if that works? This will allow us to narrow down the issue and look into it further.
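
To tell a stall apart from a crash while toggling that flag, the training command can be run under a timeout. A minimal sketch, assuming plain Python `subprocess`; `run_with_timeout` is a hypothetical helper, and the command in the example is a stand-in rather than the real VISSL invocation.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it did not finish in time."""
    try:
        # subprocess.run kills the child and raises TimeoutExpired on timeout.
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode


if __name__ == "__main__":
    # Stand-in command; replace with the actual training command and overrides.
    result = run_with_timeout([sys.executable, "-c", "print('done')"], 30)
    print("stalled" if result is None else f"exit code {result}")
```

A `None` result points to a hang, while a non-zero exit code points to a crash such as the traceback reported earlier in this thread. Only the direct child process is killed on timeout, so any worker processes it spawned may need separate cleanup.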

prigoyal commented on August 30, 2024

thank you @pcanas,

  1. We will release new conda/pip packages of VISSL in June, which should help anyone installing VISSL from packages.
  2. Great to hear.
  3. I'll investigate the tensorboard issue.

mkolodny commented on August 30, 2024

Thanks for offering to investigate the tensorboard issue @prigoyal

In case it's helpful to anyone, here's a version of the "Understanding VISSL Training and YAML Config" colab notebook that I was able to get working without tensorboard:

Understanding VISSL Training and YAML Config

(I also had to install the master branch of ClassyVision - that install is included in the linked colab notebook)

prigoyal commented on August 30, 2024

> Thanks for offering to investigate the tensorboard issue @prigoyal
>
> In case it's helpful to anyone, here's a version of the "Understanding VISSL Training and YAML Config" colab notebook that I was able to get working without tensorboard:
>
> Understanding VISSL Training and YAML Config
>
> (I also had to install the master branch of ClassyVision - that install is included in the linked colab notebook)

thank you for providing the notebook. It seems like you are training with VISSL master, in which case we do require installing Classy Vision master: https://github.com/facebookresearch/vissl/blob/master/INSTALL.md#step-4-install-vissl

prigoyal commented on August 30, 2024

> @prigoyal I ended up getting Tensorboard working in Colab with a couple edits:

thank you @mkolodny, that is super amazing to hear.

>   1. Load the notebook extension with %load_ext tensorboard

I went ahead and updated all the Colab notebook tutorials (updated the master tutorials).

>   2. Edit the config in the training command from config.TENSORBOARD_SETUP.USE_TENSORBOARD=true to config.HOOKS.TENSORBOARD_SETUP.USE_TENSORBOARD=true

thank you for pointing this out. VISSL has evolved since the v0.1.5 package release. We will update the tutorials to reflect this change. :)

prigoyal commented on August 30, 2024

closing this task as the actionable items have been completed. Summarizing below:

  1. Load the notebook extension with %load_ext tensorboard -> tutorials updated
  2. Edit the config in the training command from config.TENSORBOARD_SETUP.USE_TENSORBOARD=true to config.HOOKS.TENSORBOARD_SETUP.USE_TENSORBOARD=true -> tutorials updated
  3. For NoneType error, see #257 (comment) and this will be included in next VISSL package release.

thank you @mkolodny and everyone. Please feel free to open a new issue or reopen this issue in case of any follow-ups / questions :)

lewfish commented on August 30, 2024

> closing this task as the actionable items have been completed. Summarizing below:
>
>   1. Load the notebook extension with %load_ext tensorboard -> tutorials updated
>   2. Edit the config in the training command from config.TENSORBOARD_SETUP.USE_TENSORBOARD=true to config.HOOKS.TENSORBOARD_SETUP.USE_TENSORBOARD=true -> tutorials updated
>   3. For NoneType error, see #257 (comment) and this will be included in next VISSL package release.
>
> thank you @mkolodny and everyone. Please feel free to open a new issue or reopen this issue in case of any follow-ups / questions :)

I'm running the Colab tutorials on master and I'm still having the problems described in this issue. The comment above says the tutorials have been updated, but where? On GitHub, the tutorials show as last updated 2 months ago.
