facebookresearch / vissl

VISSL is FAIR's library of extensible, modular and scalable components for SOTA Self-Supervised Learning with images.

Home Page: https://vissl.ai

License: MIT License

Shell 0.12% Python 9.19% Dockerfile 0.01% Jupyter Notebook 90.54% JavaScript 0.11% CSS 0.04%

vissl's Introduction


What's New

Below we share, in reverse chronological order, the updates and new releases in VISSL. All VISSL releases are available here.

Introduction

VISSL is a computer VIsion library for state-of-the-art Self-Supervised Learning research with PyTorch. VISSL aims to accelerate the research cycle in self-supervised learning: from designing a new self-supervised task to evaluating the learned representations.

Installation

See INSTALL.md.

Getting Started

Install VISSL by following the installation instructions. After installation, please see Getting Started with VISSL and the Colab Notebook to learn about basic usage.

Documentation

Learn more about VISSL in our documentation, and see the projects/ folder for some projects built on top of VISSL.

Tutorials

Get started with VISSL by trying one of the Colab tutorial notebooks.

Model Zoo and Baselines

We provide a large set of baseline results and trained models available for download in the VISSL Model Zoo.

Contributors

VISSL is written and maintained by Facebook AI Research.

Development

We welcome new contributions to VISSL and we will be actively maintaining this library! Please refer to CONTRIBUTING.md for full instructions on how to run the code, tests and linter, and submit your pull requests.

License

VISSL is released under the MIT license.

Citing VISSL

If you find VISSL useful in your research or wish to refer to the baseline results published in the Model Zoo, please use the following BibTeX entry.

@misc{goyal2021vissl,
  author =       {Priya Goyal and Quentin Duval and Jeremy Reizenstein and Matthew Leavitt and Min Xu and
                  Benjamin Lefaudeux and Mannat Singh and Vinicius Reis and Mathilde Caron and Piotr Bojanowski and
                  Armand Joulin and Ishan Misra},
  title =        {VISSL},
  howpublished = {\url{https://github.com/facebookresearch/vissl}},
  year =         {2021}
}

vissl's People

Contributors

akainth015, amyreese, blazejdolicki, blefaudeux, bottler, cjrd, cynthia, datumbox, dmitryvinn, facebook-github-bot, growlix, igorsugak, iseessel, itamaro, jingli9111, leitian, leszfb, mannatsingh, mayalene, min-xu-ai, olivierdehaene, pixelb, pranavsinghps1, prigoyal, quentinduval, r-barnes, rgeirhos, soulitzer, wpc, xcastilla


vissl's Issues

ImportError in APEX package while following the Installation guide

πŸ“š VISSL Documentation

I have tried all the different installation options mentioned in the Installation guide, but they all converge to the same issue when I try to import the apex package:

File "/home/[...]/.local/lib/python3.7/site-packages/apex/__init__.py", line 13, in <module>
    from pyramid.session import UnencryptedCookieSessionFactoryConfig
ImportError: cannot import name 'UnencryptedCookieSessionFactoryConfig' from 'pyramid.session' (unknown location)

I tried looking for similar issues but did not find a clear answer. I was also wondering whether the apex library is used at all, since I cannot see it being used in the tutorials.
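
For context, a guess on my part rather than a confirmed diagnosis: the pyramid.session import suggests that pip installed the unrelated apex package from PyPI instead of NVIDIA's apex. If so, something along these lines should fix it:

    pip uninstall -y apex
    # then install NVIDIA apex from source, per the VISSL installation guide
    git clone https://github.com/NVIDIA/apex
    cd apex
    pip install -v --no-cache-dir ./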

Updating the LR at every step

πŸ“š VISSL Documentation

Hi,

I am trying to do a linear warmup with a cosine schedule. I followed the instructions here. While I specified update_interval=step, it seems the learning rate update only happens every epoch when I look at the TB plots.
[Screenshot: TensorBoard LR plot showing the learning rate updating only once per epoch]

This is most likely me not understanding how to use the scheduling args, so any documentation related to this would help. I looked here and on the default config but could not find an answer.

Here is my current optimization config

  OPTIMIZER:
      name: adam
      lr: 0.0001
      weight_decay: 0
      num_epochs: 50
      clip_grad_norm: 1
      head_optimizer_params:
        use_different_lr: False
        use_different_wd: False
      param_schedulers:
        lr:
          auto_lr_scaling:
            auto_scale: false
            base_value: 0.3
            base_lr_batch_size: 256
          name: composite
          schedulers:
            - name: linear
              start_value: 1e-5
              end_value: 1e-4
            - name: cosine
              start_value: 1e-4
              end_value: 0.
              # wave_type: half
              # restart_interval_length: 0.5
              wave_type: full
              is_adaptive: True
              restart_interval_length: 0.334
          interval_scaling: [rescaled, fixed]
          update_interval: step
          lengths: [0.05, 0.95]                 # 100ep # how to split between warmup and cos. anneal?

(Here is what I am trying to do, for reference.)
[Screenshot: target schedule with linear warmup followed by cosine annealing]

Values for mean and std in Normalize Data Transform

In all the example config files described in the Tutorials, there is a Normalize Data Transform, with the following mean and std:

mean: [0.485, 0.456, 0.406]
std: [0.229, 0.224, 0.225]

I was wondering what is the reason for these values. Are they specific for the Imagenet_1k dataset? Should I calculate my own values if my plan is to use a custom dataset?

Thank you for your help!
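
As far as I can tell, those values are the standard ImageNet statistics used across torchvision models. In case it helps to make the question concrete, here is roughly how I would compute my own per-channel statistics for a custom dataset (a sketch; "my_dataset_root" is a placeholder path):

    import torch
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    # resize so images can be batched; pixel values are in [0, 1] after ToTensor
    dataset = datasets.ImageFolder(
        "my_dataset_root",
        transform=transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()]),
    )
    loader = DataLoader(dataset, batch_size=64, num_workers=4)

    n_pixels = 0
    channel_sum = torch.zeros(3)
    channel_sq_sum = torch.zeros(3)
    for images, _ in loader:
        # images: (B, 3, H, W) -> accumulate per-channel sums over all pixels
        b, _, h, w = images.shape
        n_pixels += b * h * w
        channel_sum += images.sum(dim=[0, 2, 3])
        channel_sq_sum += (images ** 2).sum(dim=[0, 2, 3])

    mean = channel_sum / n_pixels
    std = (channel_sq_sum / n_pixels - mean ** 2).sqrt()  # Var[x] = E[x^2] - E[x]^2
    print(mean, std)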

Custom Dataset

I would like to train SimCLR. How do I prepare a custom dataset, and how should it be formatted?
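
For what it's worth, my current understanding (which may be wrong) is that the disk_folder data source expects the standard torchvision ImageFolder layout:

    my_dataset/
        train/
            class_a/img1.jpg
            class_b/img2.jpg
        val/
            class_a/img3.jpg
            class_b/img4.jpg

with the dataset then registered in configs/config/dataset_catalog.json, roughly like this (paths are placeholders):

    {
        "my_dataset_folder": {
            "train": ["/path/to/my_dataset/train", "<unused>"],
            "val": ["/path/to/my_dataset/val", "<unused>"]
        }
    }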

WandB support

πŸš€ Feature

Now that we can use Weights and Biases on the Facebook cluster, it would be really neat if there was support for it within VISSL.

Motivation & Examples

WandB is like TensorBoard on steroids, and provides a more user-friendly interface.


It should be usable exactly like the current TensorBoard logger within VISSL.

Note

If this is something the VISSL team would consider adding, I can try and submit a PR :)
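
To make the request concrete, a minimal sketch of the kind of logging I have in mind, using only the public wandb API; the actual hook wiring into VISSL is left out, and the project name and values here are placeholders:

    import wandb

    # placeholder project/config; in VISSL these would come from the training task
    wandb.init(project="vissl-experiments", config={"lr": 0.3, "batch_size": 256})

    for iteration in range(100):
        loss = 1.0 / (iteration + 1)  # dummy metric standing in for the training loss
        wandb.log({"train/loss": loss}, step=iteration)

    wandb.finish()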

Which sinkhorn implementation

We are trying to figure out which Sinkhorn implementation is used in the SwAV model.
It looks like you are using the standard Sinkhorn implementation with an additional trick for numerical stabilisation, but it is difficult to read the code and to match it with the original implementation. For example, in SwAV the kernel is computed with exp(C/epsilon), while in all other implementations it is exp(-C/epsilon). Why?
Compared to the implementation of stabilized Sinkhorn from https://pythonot.github.io, your code looks very different.
We compared the outputs of the Sinkhorn implementations in SwAV and POT and were not able to produce the same results.
Can you share some insight on this?
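
For reference, here is our own re-implementation sketch of the Sinkhorn-Knopp iteration as we read it in the SwAV paper; it is not the VISSL code. Our current guess for the sign difference: the scores are prototype similarities, i.e. negative costs, so exp(scores/epsilon) plays the role of exp(-C/epsilon).

    import torch

    def sinkhorn(scores, epsilon=0.05, num_iters=3):
        # scores: (batch, prototypes) similarity logits
        Q = torch.exp(scores / epsilon).t()  # (prototypes, batch)
        K, B = Q.shape
        Q /= Q.sum()  # make Q a joint distribution
        for _ in range(num_iters):
            Q /= Q.sum(dim=1, keepdim=True)  # rows -> uniform over prototypes
            Q /= K
            Q /= Q.sum(dim=0, keepdim=True)  # columns -> uniform over samples
            Q /= B
        Q *= B  # columns now sum to 1: one soft assignment per sample
        return Q.t()

    assignments = sinkhorn(torch.randn(8, 16))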

Loading Trained Models

Hello,
I followed the given tutorials and managed to train a model on a custom dataset.
The training seemed to work but I can't figure out how to use the trained model.
I tried building the model as follows:

import yaml
from vissl.utils.hydra_config import AttrDict
cfg = yaml.load(open("path_to_config_yaml"), Loader=yaml.FullLoader)["config"]
cfg = AttrDict(cfg)

from vissl.models import build_model
model = build_model(cfg.MODEL, cfg.OPTIMIZER)

Where path_to_config_yaml is the path to the same config as the one used in training (configs/config/quick_1gpu_resnet50_simclr.yaml).
The following error occurred:

AttributeError: AttrDict object has no attribute FEATURE_EVAL_SETTINGS.

Any ideas on how to solve this? Otherwise, is there a tutorial which explains how to load trained models? I have read How to Load Pretrained Models but couldn't really understand.

If more information is needed, please comment and I'll provide it.
Thanks!

Proper way to clip gradient norm

❓ How to do something using VISSL

Using the Adam optimizer, I want to know what is the best way to clip the gradient norm after calling backward (using torch.nn.utils.clip_grad_norm_). I looked into the standard_train_step.py file, but there does not seem to be any gradient clipping in place. All I could find were LARC specific clipping arguments in the config files.

For my personal use I had to design a custom training step, so I integrated a line with gradient clipping. Under this approach, what is the best/recommended way to pass a grad_clip argument from the .yaml file such that it is accessible through the task object passed into the train_step function? I am asking because there already seems to be a default argument for that: when debugging and printing vars(args).keys(), there is a clip_grad_norm key, but I haven't seen it in the config files. Is it possible to set it, and if so, how?

Thanks for the amazing support so far,
Lucas
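
For reference, the line I integrated, shown here in a self-contained toy example (the model and max_norm value are placeholders):

    import torch

    model = torch.nn.Linear(8, 2)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    x = torch.randn(4, 8)
    y = torch.randint(0, 2, (4,))
    loss = torch.nn.functional.cross_entropy(model(x), y)

    optimizer.zero_grad()
    loss.backward()
    # clip the total gradient norm after backward, before the optimizer step
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()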

How can I use vissl as a package?

I understand VISSL provides an easy way to train models via scripts and YAML configurations.
I am looking to integrate the features of VISSL into my project. I am looking to use it like:

from vissl import swav

model = swav(config)

## do other stuff

Could you please help me with the same? Specifically, when I use it as a package on my local machine.
Thank You!
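
For context, the closest pattern I have found so far, based on other issues in this tracker, builds the model from a merged config instead of a swav(config) constructor. A sketch, with the config path as an example rather than a guaranteed location:

    from omegaconf import OmegaConf
    from vissl.utils.hydra_config import AttrDict
    from vissl.models import build_model

    # merge a pretraining config over VISSL's defaults, then build the model
    config = OmegaConf.load("configs/config/pretrain/swav/swav_8node_resnet.yaml")
    defaults = OmegaConf.load("vissl/config/defaults.yaml")
    cfg = AttrDict(OmegaConf.merge(defaults, config))

    model = build_model(cfg.config.MODEL, cfg.config.OPTIMIZER)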

Implement BYOL

🌟 New SSL approach addition

Approach description

BYOL https://arxiv.org/abs/2006.07733

Open source status

  • the model implementation is available: (give details)
  • the model weights are available: (give details)
  • who are the authors: (mention them, if possible by @gh-username)

Open cpus_per_task option for SLURM training

πŸš€ Feature

It would be great to open the cpus_per_task option for SLURM training to the hydra config. Right now it is inferred from the number of processes per node i.e. the number of GPUs per node with a baseline value of 8 CPU per process. See: https://github.com/facebookresearch/vissl/blob/master/vissl/utils/distributed_launcher.py#L259

Motivation & Examples

On the cluster I am using, we have 3 CPUs per GPU on our octo-GPU nodes and 10 CPUs per GPU on our quadri-GPU nodes. We need to be able to set this value accordingly.

ConViT: documentation and model

We landed the ConViT code and want to accomplish the next few steps:

  • Documentation
    • documentation in docs/source/ssl_approaches @growlix
    • optional: add doc to projects/ folder @growlix
  • Model zoo
    • a script to train and evaluate @growlix
    • train a model and evaluate @vedanuj
    • Add the model to AWS and to the model zoo (@prigoyal can help coordinate this)

How to load a pretrained/finetuned VISSL model for inference?

❓ How to load a pretrained/finetuned VISSL model for inference?

Preface: I am aware of #235 and of https://github.com/facebookresearch/vissl/blob/master/tutorials/Using_a_pretrained_model_for_inference.ipynb

First of all, thanks a lot for making VISSL available - it's an awesome tool.

However, I am struggling with using the models I've trained for simple inference. Specifically, I am trying to score images with a VISSL model that I've finetuned on my own simple dataset with 4 classes. During training, the model achieved a high TOP-1 accuracy on the validation set. Consequently, I'd assume that when using the model for inference and scoring images from the validation set, I should see the same accuracy. Strangely enough, the model predictions are rubbish, with the model almost always predicting one class. My guess is that I am doing something wrong when loading and preparing the model for inference. I'll provide the technical details below:

Training

I've finetuned a torchvision ResNet50 model, following the official tutorial https://vissl.ai/tutorials/Benchmark_Full_Finetuning_on_ImageNet_1K. Specifically, I've executed the following run command:

python run_distributed_engines.py \
    hydra.verbose=true \
    config=eval_resnet_8gpu_transfer_in1k_semi_sup_fulltune_mod \
    config.DATA.TRAIN.DATA_SOURCES=[disk_folder] \
    config.DATA.TRAIN.LABEL_SOURCES=[disk_folder] \
    config.DATA.TRAIN.DATASET_NAMES=[mydata] \
    config.DATA.TRAIN.COPY_TO_LOCAL_DISK=False \
    config.DATA.TRAIN.BATCHSIZE_PER_REPLICA=32 \
    config.DATA.TEST.DATA_SOURCES=[disk_folder] \
    config.DATA.TEST.LABEL_SOURCES=[disk_folder] \
    config.DATA.TEST.DATASET_NAMES=[mydata] \
    config.DATA.TEST.BATCHSIZE_PER_REPLICA=32 \
    config.DISTRIBUTED.NUM_NODES=1 \
    config.DISTRIBUTED.NUM_PROC_PER_NODE=1 \
    config.CHECKPOINT.DIR="./checkpoints_finetune" \
    config.MODEL.WEIGHTS_INIT.PARAMS_FILE="resnet50-19c8e357.pth" \
    config.MODEL.WEIGHTS_INIT.APPEND_PREFIX="trunk._feature_blocks." \
    config.MODEL.WEIGHTS_INIT.STATE_DICT_KEY_NAME=""

using a slightly modified yaml config compared to the base eval_resnet_8gpu_transfer_in1k_semi_sup_fulltune. The modifications only concern the HEAD:

  MODEL:
    TRUNK:
      NAME: resnet
      TRUNK_PARAMS:
        RESNETS:
          DEPTH: 50
    HEAD:
      PARAMS: [
        ["mlp", {"dims": [2048, 4]}],
      ]

The TOP-1 accuracy during training reaches 98% on the training data and about 95% on the validation data. I've double-checked that the model is indeed loading the intended data, that the targets are correctly used, and that the model predictions during validation reflect on average the 95% accuracy, by running the above command in pdb (python -m pdb run_distributed_engines.py ...), setting breakpoints in standard_train_step in vissl/trainer/train_steps/standard_train_step.py, and inspecting the contents of sample and model_output. Everything looks plausible and consistent.

Inference

I have tried to load the model in "inference" mode following the suggestions in #235 and the tutorial https://github.com/facebookresearch/vissl/blob/master/tutorials/Using_a_pretrained_model_for_inference.ipynb. (Note the transformation pipeline, which should reproduce one-to-one the transformations used during training in the testing phase; see eval_resnet_8gpu_transfer_in1k_semi_sup_fulltune.)

from omegaconf import OmegaConf
from vissl.utils.hydra_config import AttrDict
from vissl.models import build_model
from classy_vision.generic.util import load_checkpoint
from vissl.utils.checkpoint import init_model_from_weights
from PIL import Image
import torchvision.transforms as transforms

config = OmegaConf.load("configs/config/eval_resnet_8gpu_transfer_in1k_semi_sup_fulltune_mod.yaml")
default_config = OmegaConf.load("vissl/config/defaults.yaml")

cfg = OmegaConf.merge(default_config, config)

cfg = AttrDict(cfg)
cfg.config.MODEL.WEIGHTS_INIT.PARAMS_FILE = "checkpoints_finetune/model_final_checkpoint_phase138.torch"
cfg.config.MODEL.FEATURE_EVAL_SETTINGS.EXTRACT_TRUNK_FEATURES_ONLY = True
cfg.config.MODEL.FEATURE_EVAL_SETTINGS.SHOULD_FLATTEN_FEATS = False
cfg.config.MODEL.FEATURE_EVAL_SETTINGS.LINEAR_EVAL_FEAT_POOL_OPS_MAP = [["res5avg", ["Identity", []]]]

model = build_model(cfg.config.MODEL, cfg.config.OPTIMIZER)
weights = load_checkpoint(checkpoint_path=cfg.config.MODEL.WEIGHTS_INIT.PARAMS_FILE)

init_model_from_weights(
    config=cfg.config,
    model=model,
    state_dict=weights,
    state_dict_key_name="classy_state_dict",
    skip_layers=[],  # Use this if you do not want to load all layers
)
pipeline = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

I've then tried to score images from the validation set with this model, expecting that in 95% of cases the correct target class is predicted:

for i in range(4):
    print("Validation set target class ", i)
    img_dir = "mydata/val/{}".format(i)
    for img_name in sorted(os.listdir(img_dir))[:5]:
        img_fname = os.path.join(img_dir, img_name)
        image = Image.open(img_fname).convert("RGB")
        x = pipeline(image)
        features = model(x.unsqueeze(0))
        _, pred = features[0].float().topk(1,largest=True, sorted=True)
        print(img_fname, features, pred[0])

But the predictions are all over the place:

Validation set target class  0
mydata/val/0/26b47d05f7a17b09fdca68f01ef42740.jpg [tensor([-8.2779, -0.6585,  4.4609,  5.0466], grad_fn=<AddBackward0>)] tensor(3)
mydata/val/0/43eb8a990f72e5dd084dd926b233a5dc.jpg [tensor([-8.3000, -0.5172,  4.3451,  5.1021], grad_fn=<AddBackward0>)] tensor(3)
mydata/val/0/49b1838adc0b4d16b5f5a282c4a13333.jpg [tensor([-8.1877, -0.7219,  5.0734,  4.4780], grad_fn=<AddBackward0>)] tensor(2)
mydata/val/0/8af96ab34dee1163c1c23910b5e3c37e.jpg [tensor([-8.0564, -0.7757,  4.5512,  4.9206], grad_fn=<AddBackward0>)] tensor(3)
mydata/val/0/de2bae914295d209ca4b5b1772e8b89e.jpg [tensor([-8.3365, -0.6826,  4.4628,  5.1643], grad_fn=<AddBackward0>)] tensor(3)
Validation set target class  1
mydata/val/1/006d0afb20f6f92a742978f1a65e8ecc.jpg [tensor([-8.5201, -0.2442,  4.7602,  4.6363], grad_fn=<AddBackward0>)] tensor(2)
mydata/val/1/0278d05c83a725304fa506d26f15f332.jpg [tensor([-8.6195, -0.2458,  4.6237,  4.9525], grad_fn=<AddBackward0>)] tensor(3)
mydata/val/1/076c219c1f2ec1859ff3c3cd6a4fce0f.jpg [tensor([-8.2978, -0.4898,  4.9301,  4.5161], grad_fn=<AddBackward0>)] tensor(2)
mydata/val/1/0af6840fd3ae8ec0f43f70fd0f9b80d2.jpg [tensor([-8.4313, -0.8287,  4.7313,  5.1053], grad_fn=<AddBackward0>)] tensor(3)
mydata/val/1/0bc4bddf1f3689def5df97d557a2de3a.jpg [tensor([-8.4514, -0.3272,  4.6578,  4.7634], grad_fn=<AddBackward0>)] tensor(3)
Validation set target class  2
mydata/val/2/0019ba30aa56fc050113076326ee3ec3.jpg [tensor([-8.2777, -0.5599,  4.8195,  4.5890], grad_fn=<AddBackward0>)] tensor(2)
mydata/val/2/00221133dfde2e3196690a0e4f6e6114.jpg [tensor([-8.3178, -0.4367,  4.4543,  4.9198], grad_fn=<AddBackward0>)] tensor(3)
mydata/val/2/0023c8338336625f71209b7a80a6b093.jpg [tensor([-8.3114, -0.6324,  4.8827,  4.7317], grad_fn=<AddBackward0>)] tensor(2)
mydata/val/2/0042f8e6f0c7ec5d2d4ae8f467ba3365.jpg [tensor([-8.4777, -0.7538,  5.0929,  4.8214], grad_fn=<AddBackward0>)] tensor(2)
mydata/val/2/0051db26771cd7c3f91019751a2006ff.jpg [tensor([-8.2487, -0.8269,  4.4676,  5.2313], grad_fn=<AddBackward0>)] tensor(3)
Validation set target class  3
mydata/val/3/001fcdf186182ee139e9c7aa710e5b50.jpg [tensor([-8.4400, -0.7982,  4.2155,  5.6232], grad_fn=<AddBackward0>)] tensor(3)
mydata/val/3/00afccfd48cb0155ee0a9f74553601ca.jpg [tensor([-8.3494, -0.7743,  4.4209,  5.3379], grad_fn=<AddBackward0>)] tensor(3)
mydata/val/3/01a68c73059c25c045c5101a72f314ab.jpg [tensor([-8.1762, -0.6270,  4.3527,  5.0999], grad_fn=<AddBackward0>)] tensor(3)
mydata/val/3/02e9cd3870cae126e00573bbbb24874a.jpg [tensor([-8.5710, -0.5589,  4.0382,  5.7221], grad_fn=<AddBackward0>)] tensor(3)
mydata/val/3/03946f596354dd4a01b5f0ee47ae2a8a.jpg [tensor([-8.3258, -0.5837,  4.4099,  5.0943], grad_fn=<AddBackward0>)] tensor(3)

Based on the comments in defaults.yaml (in the FEATURE_EVAL_SETTINGS section), I've tried different config setups, such as

cfg.config.MODEL.FEATURE_EVAL_SETTINGS.EVAL_MODE_ON = True
cfg.config.MODEL.FEATURE_EVAL_SETTINGS.FREEZE_TRUNK_ONLY = False
cfg.config.MODEL.FEATURE_EVAL_SETTINGS.FREEZE_TRUNK_AND_HEAD = True
cfg.config.MODEL.FEATURE_EVAL_SETTINGS.EVAL_TRUNK_AND_HEAD = True

but the results basically remained the same.
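
One thing I have not ruled out (purely my assumption, untested): whether the model needs to be put into eval mode before scoring, so that BatchNorm layers use their running statistics instead of per-batch statistics, i.e. something like:

    import torch

    model.eval()  # switch BatchNorm/dropout to inference behaviour
    with torch.no_grad():
        features = model(x.unsqueeze(0))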

I very much suspect that I am messing something up somewhere while loading and preparing the trained model for inference. Could you please point me in the right direction? I can provide more technical details if required.

Changing model in Feature Extraction tutorial did not work for me

Hello everyone!
First of all, this is a really exciting toolbox - thank you for publishing it!

I am fairly new to working with config files so probably the mistake is on my side:

I tried changing the ResNet50 model from your feature extraction Colab tutorial to the SimCLR one. To do that, I changed the download link for the weights but left the rest the same.

Later in the shell command I changed the name of the checkpoint so it looks like this:

!python3 run_distributed_engines.py \
    hydra.verbose=true \
    config=extract_resnet_in1k_8gpu \
    +config/trunk_only=rn50_layers \
    config.DATA.TRAIN.DATA_SOURCES=[disk_folder] \
    config.DATA.TRAIN.LABEL_SOURCES=[disk_folder] \
    config.DATA.TRAIN.DATASET_NAMES=[dummy_data_folder] \
    config.DATA.TRAIN.BATCHSIZE_PER_REPLICA=2 \
    config.DATA.TEST.DATA_SOURCES=[disk_folder] \
    config.DATA.TEST.LABEL_SOURCES=[disk_folder] \
    config.DATA.TEST.DATASET_NAMES=[dummy_data_folder] \
    config.DATA.TEST.BATCHSIZE_PER_REPLICA=2 \
    config.DISTRIBUTED.NUM_NODES=1 \
    config.DISTRIBUTED.NUM_PROC_PER_NODE=1 \
    config.CHECKPOINT.DIR="./checkpoints" \
    config.MODEL.WEIGHTS_INIT.PARAMS_FILE="/content/model_final_checkpoint_phase999.torch" \
    config.MODEL.WEIGHTS_INIT.APPEND_PREFIX="trunk.base_model._feature_blocks." \
    config.MODEL.WEIGHTS_INIT.STATE_DICT_KEY_NAME=""

This results in allocation errors on Colab and I don't know how to get around them.

INFO 2021-03-25 11:12:35,260 util.py: 282: Loaded checkpoint from /content/model_final_checkpoint_phase999.torch INFO 2021-03-25 11:12:35,260 util.py: 241: Broadcasting checkpoint loaded from /content/model_final_checkpoint_phase999.torch tcmalloc: large alloc 1246027776 bytes == 0x55e7fac8c000 @ 0x7fa802cfa615 0x55e647c6706c 0x55e647d46eba 0x55e647c6dacc 0x7fa7fe078a84 0x7fa7fe0807a4 0x7fa7fe056a30 0x7fa7ed436075 0x7fa7ed4325da 0x7fa7ed436a89 0x7fa7fe056f7e 0x7fa7fdc95c99 0x55e647c6ac38 0x55e647cde63d 0x55e647cd8e0d 0x55e647c6b77a 0x55e647cd9a45 0x55e647cd8b0e 0x55e647c6b77a 0x55e647cdde50 0x55e647cd8b0e 0x55e647c6b77a 0x55e647cd9a45 0x55e647cd8b0e 0x55e647c6b77a 0x55e647cda86a 0x55e647c6b69a 0x55e647cd9c9e 0x55e647c6b69a 0x55e647cd9c9e 0x55e647cd8b0e

Is Colab too weak for the "rn50_w4_in1k_simclr_1000ep" model? Or is there something else I have to consider? After all, the only thing I changed was the download file for the checkpoints...

I hope you can help me!

Thank you very much :)

Model not training in Colab

Problem: command exits after a few seconds without training
No checkpoints output

Command:

!python3 run_distributed_engines.py \
    hydra.verbose=true \
    config=supervised_1gpu_resnet_example \
    config.DATA.TRAIN.DATA_SOURCES=[disk_folder] \
    config.DATA.TRAIN.LABEL_SOURCES=[disk_folder] \
    config.DATA.TRAIN.DATASET_NAMES=[dummy_data_folder] \
    config.DATA.TRAIN.DATA_PATHS=[/content/dummy_data/train] \
    config.DATA.TRAIN.BATCHSIZE_PER_REPLICA=2 \
    config.DATA.TEST.DATA_SOURCES=[disk_folder] \
    config.DATA.TEST.LABEL_SOURCES=[disk_folder] \
    config.DATA.TEST.DATASET_NAMES=[dummy_data_folder] \
    config.DATA.TEST.DATA_PATHS=[/content/dummy_data/val] \
    config.DATA.TEST.BATCHSIZE_PER_REPLICA=2 \
    config.DISTRIBUTED.NUM_NODES=1 \
    config.DISTRIBUTED.NUM_PROC_PER_NODE=1 \
    config.OPTIMIZER.num_epochs=2 \
    config.OPTIMIZER.param_schedulers.lr.values=[0.01,0.001] \
    config.OPTIMIZER.param_schedulers.lr.milestones=[1] \
    config.TENSORBOARD_SETUP.USE_TENSORBOARD=true \
    config.CHECKPOINT.DIR="./checkpoints"

Output:

overrides: ['hydra.verbose=true', 'config=supervised_1gpu_resnet_example', 'config.DATA.TRAIN.DATA_SOURCES=[disk_folder]', 'config.DATA.TRAIN.LABEL_SOURCES=[disk_folder]', 'config.DATA.TRAIN.DATASET_NAMES=[dummy_data_folder]', 'config.DATA.TRAIN.DATA_PATHS=[/content/dummy_data/train]', 'config.DATA.TRAIN.BATCHSIZE_PER_REPLICA=2', 'config.DATA.TEST.DATA_SOURCES=[disk_folder]', 'config.DATA.TEST.LABEL_SOURCES=[disk_folder]', 'config.DATA.TEST.DATASET_NAMES=[dummy_data_folder]', 'config.DATA.TEST.DATA_PATHS=[/content/dummy_data/val]', 'config.DATA.TEST.BATCHSIZE_PER_REPLICA=2', 'config.DISTRIBUTED.NUM_NODES=1', 'config.DISTRIBUTED.NUM_PROC_PER_NODE=1', 'config.OPTIMIZER.num_epochs=2', 'config.OPTIMIZER.param_schedulers.lr.values=[0.01,0.001]', 'config.OPTIMIZER.param_schedulers.lr.milestones=[1]', 'config.TENSORBOARD_SETUP.USE_TENSORBOARD=true', 'config.CHECKPOINT.DIR=./checkpoints', 'hydra.verbose=true']

INFO 2021-03-28 03:48:24,957 init.py: 32: Provided Config has latest version: 1
INFO 2021-03-28 03:48:24,958 run_distributed_engines.py: 163: Spawning process for node_id: 0, local_rank: 0, dist_rank: 0, dist_run_id: localhost:42573
INFO 2021-03-28 03:48:24,958 train.py: 66: Env set for rank: 0, dist_rank: 0
INFO 2021-03-28 03:48:24,958 env.py: 41: CLICOLOR: 1
INFO 2021-03-28 03:48:24,958 env.py: 41: CLOUDSDK_CONFIG: /content/.config
INFO 2021-03-28 03:48:24,959 env.py: 41: CLOUDSDK_PYTHON: python3
INFO 2021-03-28 03:48:24,959 env.py: 41: COLAB_GPU: 1
INFO 2021-03-28 03:48:24,959 env.py: 41: CUDA_VERSION: 11.0.3
INFO 2021-03-28 03:48:24,959 env.py: 41: CUDNN_VERSION: 8.0.4.30
INFO 2021-03-28 03:48:24,959 env.py: 41: DATALAB_SETTINGS_OVERRIDES: {"kernelManagerProxyPort":6000,"kernelManagerProxyHost":"172.28.0.3","jupyterArgs":["--ip="172.28.0.2""],"debugAdapterMultiplexerPath":"/usr/local/bin/dap_multiplexer"}
INFO 2021-03-28 03:48:24,959 env.py: 41: DEBIAN_FRONTEND: noninteractive
INFO 2021-03-28 03:48:24,959 env.py: 41: ENV: /root/.bashrc
INFO 2021-03-28 03:48:24,959 env.py: 41: GCE_METADATA_TIMEOUT: 0
INFO 2021-03-28 03:48:24,959 env.py: 41: GCS_READ_CACHE_BLOCK_SIZE_MB: 16
INFO 2021-03-28 03:48:24,959 env.py: 41: GIT_PAGER: cat
INFO 2021-03-28 03:48:24,959 env.py: 41: GLIBCPP_FORCE_NEW: 1
INFO 2021-03-28 03:48:24,959 env.py: 41: GLIBCXX_FORCE_NEW: 1
INFO 2021-03-28 03:48:24,959 env.py: 41: HOME: /root
INFO 2021-03-28 03:48:24,959 env.py: 41: HOSTNAME: 392565ebe3a4
INFO 2021-03-28 03:48:24,960 env.py: 41: JPY_PARENT_PID: 58
INFO 2021-03-28 03:48:24,960 env.py: 41: LANG: en_US.UTF-8
INFO 2021-03-28 03:48:24,960 env.py: 41: LAST_FORCED_REBUILD: 20210316
INFO 2021-03-28 03:48:24,960 env.py: 41: LD_LIBRARY_PATH: /usr/lib64-nvidia
INFO 2021-03-28 03:48:24,960 env.py: 41: LD_PRELOAD: /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4
INFO 2021-03-28 03:48:24,960 env.py: 41: LIBRARY_PATH: /usr/local/cuda/lib64/stubs
INFO 2021-03-28 03:48:24,960 env.py: 41: LOCAL_RANK: 0
INFO 2021-03-28 03:48:24,960 env.py: 41: MPLBACKEND: module://ipykernel.pylab.backend_inline
INFO 2021-03-28 03:48:24,960 env.py: 41: NCCL_VERSION: 2.7.8
INFO 2021-03-28 03:48:24,960 env.py: 41: NO_GCE_CHECK: True
INFO 2021-03-28 03:48:24,960 env.py: 41: NVIDIA_DRIVER_CAPABILITIES: compute,utility
INFO 2021-03-28 03:48:24,960 env.py: 41: NVIDIA_REQUIRE_CUDA: cuda>=11.0 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=440,driver<441 brand=tesla,driver>=450,driver<451
INFO 2021-03-28 03:48:24,960 env.py: 41: NVIDIA_VISIBLE_DEVICES: all
INFO 2021-03-28 03:48:24,960 env.py: 41: OLDPWD: /
INFO 2021-03-28 03:48:24,960 env.py: 41: PAGER: cat
INFO 2021-03-28 03:48:24,961 env.py: 41: PATH: /usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools/node/bin:/tools/google-cloud-sdk/bin:/opt/bin
INFO 2021-03-28 03:48:24,961 env.py: 41: PWD: /content
INFO 2021-03-28 03:48:24,961 env.py: 41: PYDEVD_USE_FRAME_EVAL: NO
INFO 2021-03-28 03:48:24,961 env.py: 41: PYTHONPATH: /env/python
INFO 2021-03-28 03:48:24,961 env.py: 41: PYTHONWARNINGS: ignore:::pip._internal.cli.base_command
INFO 2021-03-28 03:48:24,961 env.py: 41: RANK: 0
INFO 2021-03-28 03:48:24,961 env.py: 41: SHELL: /bin/bash
INFO 2021-03-28 03:48:24,961 env.py: 41: SHLVL: 1
INFO 2021-03-28 03:48:24,961 env.py: 41: TBE_CREDS_ADDR: 172.28.0.1:8008
INFO 2021-03-28 03:48:24,961 env.py: 41: TERM: xterm-color
INFO 2021-03-28 03:48:24,961 env.py: 41: TF_FORCE_GPU_ALLOW_GROWTH: true
INFO 2021-03-28 03:48:24,961 env.py: 41: WORLD_SIZE: 1
INFO 2021-03-28 03:48:24,961 env.py: 41: _: /usr/bin/python3
INFO 2021-03-28 03:48:24,961 env.py: 41: __EGL_VENDOR_LIBRARY_DIRS: /usr/lib64-nvidia:/usr/share/glvnd/egl_vendor.d/
INFO 2021-03-28 03:48:24,962 misc.py: 86: Set start method of multiprocessing to fork
INFO 2021-03-28 03:48:24,962 train.py: 77: Setting seed....
INFO 2021-03-28 03:48:24,962 misc.py: 99: MACHINE SEED: 0
INFO 2021-03-28 03:48:24,980 hydra_config.py: 140: Training with config:
INFO 2021-03-28 03:48:24,986 hydra_config.py: 144: {'CHECKPOINT': {'APPEND_DISTR_RUN_ID': False,
'AUTO_RESUME': True,
'BACKEND': 'disk',
'CHECKPOINT_FREQUENCY': 1,
'CHECKPOINT_ITER_FREQUENCY': -1,
'DIR': './checkpoints',
'LATEST_CHECKPOINT_RESUME_FILE_NUM': 1,
'OVERWRITE_EXISTING': False,
'USE_SYMLINK_CHECKPOINT_FOR_RESUME': False},
'CLUSTERFIT': {'CLUSTER_BACKEND': 'faiss',
'FEATURES': {'DATASET_NAME': '',
'DATA_PARTITION': 'TRAIN',
'LAYER_NAME': ''},
'NUM_CLUSTERS': 16000,
'N_ITER': 50},
'DATA': {'DDP_BUCKET_CAP_MB': 25,
'ENABLE_ASYNC_GPU_COPY': True,
'NUM_DATALOADER_WORKERS': 5,
'PIN_MEMORY': True,
'TEST': {'BATCHSIZE_PER_REPLICA': 2,
'COLLATE_FUNCTION': 'default_collate',
'COLLATE_FUNCTION_PARAMS': {},
'COPY_DESTINATION_DIR': '',
'COPY_TO_LOCAL_DISK': False,
'DATASET_NAMES': ['dummy_data_folder'],
'DATA_LIMIT': -1,
'DATA_PATHS': ['/content/dummy_data/val'],
'DATA_SOURCES': ['disk_folder'],
'DEFAULT_GRAY_IMG_SIZE': 224,
'DROP_LAST': False,
'ENABLE_QUEUE_DATASET': False,
'INPUT_KEY_NAMES': ['data'],
'LABEL_PATHS': [],
'LABEL_SOURCES': ['disk_folder'],
'LABEL_TYPE': 'standard',
'MMAP_MODE': True,
'TARGET_KEY_NAMES': ['label'],
'TRANSFORMS': [{'name': 'Resize', 'size': 256},
{'name': 'CenterCrop', 'size': 224},
{'name': 'ToTensor'},
{'mean': [0.485, 0.456, 0.406],
'name': 'Normalize',
'std': [0.229, 0.224, 0.225]}],
'USE_STATEFUL_DISTRIBUTED_SAMPLER': False},
'TRAIN': {'BATCHSIZE_PER_REPLICA': 2,
'COLLATE_FUNCTION': 'default_collate',
'COLLATE_FUNCTION_PARAMS': {},
'COPY_DESTINATION_DIR': '',
'COPY_TO_LOCAL_DISK': False,
'DATASET_NAMES': ['dummy_data_folder'],
'DATA_LIMIT': -1,
'DATA_PATHS': ['/content/dummy_data/train'],
'DATA_SOURCES': ['disk_folder'],
'DEFAULT_GRAY_IMG_SIZE': 224,
'DROP_LAST': False,
'ENABLE_QUEUE_DATASET': False,
'INPUT_KEY_NAMES': ['data'],
'LABEL_PATHS': [],
'LABEL_SOURCES': ['disk_folder'],
'LABEL_TYPE': 'standard',
'MMAP_MODE': True,
'TARGET_KEY_NAMES': ['label'],
'TRANSFORMS': [{'name': 'RandomResizedCrop', 'size': 224},
{'name': 'RandomHorizontalFlip'},
{'brightness': 0.4,
'contrast': 0.4,
'hue': 0.4,
'name': 'ColorJitter',
'saturation': 0.4},
{'name': 'ToTensor'},
{'mean': [0.485, 0.456, 0.406],
'name': 'Normalize',
'std': [0.229, 0.224, 0.225]}],
'USE_STATEFUL_DISTRIBUTED_SAMPLER': False}},
'DISTRIBUTED': {'BACKEND': 'nccl',
'BROADCAST_BUFFERS': True,
'INIT_METHOD': 'tcp',
'MANUAL_GRADIENT_REDUCTION': False,
'NCCL_DEBUG': False,
'NCCL_SOCKET_NTHREADS': '',
'NUM_NODES': 1,
'NUM_PROC_PER_NODE': 1,
'RUN_ID': 'auto'},
'IMG_RETRIEVAL': {'DATASET_PATH': '',
'EVAL_BINARY_PATH': '',
'EVAL_DATASET_NAME': 'Paris',
'FEATS_PROCESSING_TYPE': '',
'GEM_POOL_POWER': 4.0,
'N_PCA': 512,
'RESIZE_IMG': 1024,
'SHOULD_TRAIN_PCA_OR_WHITENING': True,
'SPATIAL_LEVELS': 3,
'TEMP_DIR': '/tmp/instance_retrieval/',
'TRAIN_DATASET_NAME': 'Oxford',
'WHITEN_IMG_LIST': ''},
'LOG_FREQUENCY': 100,
'LOSS': {'CrossEntropyLoss': {'ignore_index': -1},
'bce_logits_multiple_output_single_target': {'normalize_output': False,
'reduction': 'none',
'world_size': 1},
'cross_entropy_multiple_output_single_target': {'ignore_index': -1,
'normalize_output': False,
'reduction': 'mean',
'temperature': 1.0,
'weight': None},
'deepclusterv2_loss': {'BATCHSIZE_PER_REPLICA': 256,
'DROP_LAST': True,
'kmeans_iters': 10,
'memory_params': {'crops_for_mb': [0],
'embedding_dim': 128},
'num_clusters': [3000, 3000, 3000],
'num_crops': 2,
'num_train_samples': -1,
'temperature': 0.1},
'moco_loss': {'embedding_dim': 128,
'momentum': 0.999,
'queue_size': 65536,
'temperature': 0.2},
'multicrop_simclr_info_nce_loss': {'buffer_params': {'effective_batch_size': 4096,
'embedding_dim': 128,
'world_size': 64},
'num_crops': 2,
'temperature': 0.1},
'name': 'cross_entropy_multiple_output_single_target',
'nce_loss_with_memory': {'loss_type': 'nce',
'loss_weights': [1.0],
'memory_params': {'embedding_dim': 128,
'memory_size': -1,
'momentum': 0.5,
'norm_init': True,
'update_mem_on_forward': True},
'negative_sampling_params': {'num_negatives': 16000,
'type': 'random'},
'norm_constant': -1,
'norm_embedding': True,
'num_train_samples': -1,
'temperature': 0.07,
'update_mem_with_emb_index': -100},
'simclr_info_nce_loss': {'buffer_params': {'effective_batch_size': 4096,
'embedding_dim': 128,
'world_size': 64},
'temperature': 0.1},
'swav_loss': {'crops_for_assign': [0, 1],
'embedding_dim': 128,
'epsilon': 0.05,
'normalize_last_layer': True,
'num_crops': 2,
'num_iters': 3,
'num_prototypes': [3000],
'output_dir': '',
'queue': {'local_queue_length': 0,
'queue_length': 0,
'start_iter': 0},
'temp_hard_assignment_iters': 0,
'temperature': 0.1,
'use_double_precision': False},
'swav_momentum_loss': {'crops_for_assign': [0, 1],
'embedding_dim': 128,
'epsilon': 0.05,
'momentum': 0.99,
'momentum_eval_mode_iter_start': 0,
'normalize_last_layer': True,
'num_crops': 2,
'num_iters': 3,
'num_prototypes': [3000],
'queue': {'local_queue_length': 0,
'queue_length': 0,
'start_iter': 0},
'temperature': 0.1,
'use_double_precision': False}},
'MACHINE': {'DEVICE': 'gpu'},
'METERS': {'accuracy_list_meter': {'meter_names': [],
'num_meters': 1,
'topk_values': [1, 5]},
'enable_training_meter': True,
'mean_ap_list_meter': {'max_cpu_capacity': -1,
'meter_names': [],
'num_classes': 9605,
'num_meters': 1},
'name': 'accuracy_list_meter'},
'MODEL': {'ACTIVATION_CHECKPOINTING': {'NUM_ACTIVATION_CHECKPOINTING_SPLITS': 2,
'USE_ACTIVATION_CHECKPOINTING': False},
'AMP_PARAMS': {'AMP_ARGS': {'opt_level': 'O1'},
'AMP_TYPE': 'apex',
'USE_AMP': False},
'CUDA_CACHE': {'CLEAR_CUDA_CACHE': False, 'CLEAR_FREQ': 100},
'FEATURE_EVAL_SETTINGS': {'EVAL_MODE_ON': False,
'EVAL_TRUNK_AND_HEAD': False,
'EXTRACT_TRUNK_FEATURES_ONLY': False,
'FREEZE_TRUNK_AND_HEAD': False,
'FREEZE_TRUNK_ONLY': False,
'LINEAR_EVAL_FEAT_POOL_OPS_MAP': [],
'SHOULD_FLATTEN_FEATS': True},
'HEAD': {'BATCHNORM_EPS': 1e-05,
'BATCHNORM_MOMENTUM': 0.1,
'PARAMS': [['mlp', {'dims': [2048, 1000]}]],
'PARAMS_MULTIPLIER': 1.0},
'INPUT_TYPE': 'rgb',
'MODEL_COMPLEXITY': {'COMPUTE_COMPLEXITY': False,
'INPUT_SHAPE': [3, 224, 224]},
'MULTI_INPUT_HEAD_MAPPING': [],
'NON_TRAINABLE_PARAMS': [],
'SINGLE_PASS_EVERY_CROP': False,
'SYNC_BN_CONFIG': {'CONVERT_BN_TO_SYNC_BN': False,
'GROUP_SIZE': -1,
'SYNC_BN_TYPE': 'pytorch'},
'TEMP_FROZEN_PARAMS_ITER_MAP': [],
'TRUNK': {'NAME': 'resnet',
'TRUNK_PARAMS': {'EFFICIENT_NETS': {},
'REGNET': {},
'RESNETS': {'DEPTH': 50,
'GROUPS': 1,
'LAYER4_STRIDE': 2,
'NORM': 'BatchNorm',
'WIDTH_MULTIPLIER': 1,
'WIDTH_PER_GROUP': 64,
'ZERO_INIT_RESIDUAL': False}}},
'WEIGHTS_INIT': {'APPEND_PREFIX': '',
'PARAMS_FILE': '',
'REMOVE_PREFIX': '',
'SKIP_LAYERS': ['num_batches_tracked'],
'STATE_DICT_KEY_NAME': 'classy_state_dict'}},
'MONITOR_PERF_STATS': False,
'MULTI_PROCESSING_METHOD': 'fork',
'NEAREST_NEIGHBOR': {'L2_NORM_FEATS': False, 'SIGMA': 0.1, 'TOPK': 200},
'OPTIMIZER': {'head_optimizer_params': {'use_different_lr': False,
'use_different_wd': False,
'weight_decay': 0.0001},
'larc_config': {'clip': False,
'eps': 1e-08,
'trust_coefficient': 0.001},
'momentum': 0.9,
'name': 'sgd',
'nesterov': True,
'num_epochs': 2,
'param_schedulers': {'lr': {'auto_lr_scaling': {'auto_scale': True,
'base_lr_batch_size': 256,
'base_value': 0.1},
'end_value': 0.0,
'interval_scaling': [],
'lengths': [],
'milestones': [1],
'name': 'multistep',
'schedulers': [],
'start_value': 0.1,
'update_interval': 'epoch',
'value': 0.1,
'values': [0.00078125, 7.813e-05]},
'lr_head': {'auto_lr_scaling': {'auto_scale': True,
'base_lr_batch_size': 256,
'base_value': 0.1},
'end_value': 0.0,
'interval_scaling': [],
'lengths': [],
'milestones': [1],
'name': 'multistep',
'schedulers': [],
'start_value': 0.1,
'update_interval': 'epoch',
'value': 0.1,
'values': [0.00078125,
7.813e-05]}},
'regularize_bias': True,
'regularize_bn': False,
'use_larc': False,
'weight_decay': 0.0001},
'PERF_STAT_FREQUENCY': -1,
'ROLLING_BTIME_FREQ': -1,
'SEED_VALUE': 0,
'SVM': {'cls_list': [],
'costs': {'base': -1.0,
'costs_list': [0.1, 0.01],
'power_range': [4, 20]},
'cross_val_folds': 3,
'dual': True,
'force_retrain': False,
'loss': 'squared_hinge',
'low_shot': {'dataset_name': 'voc',
'k_values': [1, 2, 4, 8, 16, 32, 64, 96],
'sample_inds': [1, 2, 3, 4, 5]},
'max_iter': 2000,
'normalize': True,
'penalty': 'l2'},
'TENSORBOARD_SETUP': {'EXPERIMENT_LOG_DIR': 'tensorboard',
'FLUSH_EVERY_N_MIN': 5,
'LOG_DIR': '.',
'LOG_PARAMS': True,
'LOG_PARAMS_EVERY_N_ITERS': 310,
'LOG_PARAMS_GRADIENTS': True,
'USE_TENSORBOARD': True},
'TEST_EVERY_NUM_EPOCH': 1,
'TEST_MODEL': True,
'TEST_ONLY': False,
'TRAINER': {'TASK_NAME': 'self_supervision_task',
'TRAIN_STEP_NAME': 'standard_train_step'},
'VERBOSE': True}
INFO 2021-03-28 03:48:25,689 train.py: 89: System config:


sys.platform linux
Python 3.7.10 (default, Feb 20 2021, 21:17:23) [GCC 7.5.0]
numpy 1.19.5
Pillow 7.0.0
vissl 0.1.5 @/usr/local/lib/python3.7/dist-packages/vissl
GPU available True
GPU 0 Tesla P100-PCIE-16GB
CUDA_HOME /usr/local/cuda
torchvision 0.6.1+cu101 @/usr/local/lib/python3.7/dist-packages/torchvision
hydra 1.0.6 @/usr/local/lib/python3.7/dist-packages/hydra
classy_vision 0.6.0.dev @/usr/local/lib/python3.7/dist-packages/classy_vision
tensorboard 1.15.0
apex 0.1 @/usr/local/lib/python3.7/dist-packages/apex
cv2 4.1.2
PyTorch 1.8.1+cu102 @/usr/local/lib/python3.7/dist-packages/torch
PyTorch debug build False


PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 10.2
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70
  • CuDNN 7.6.5
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

CPU info:


Architecture x86_64
CPU op-mode(s) 32-bit, 64-bit
Byte Order Little Endian
CPU(s) 4
On-line CPU(s) list 0-3
Thread(s) per core 2
Core(s) per socket 2
Socket(s) 1
NUMA node(s) 1
Vendor ID GenuineIntel
CPU family 6
Model 79
Model name Intel(R) Xeon(R) CPU @ 2.20GHz
Stepping 0
CPU MHz 2199.998
BogoMIPS 4399.99
Hypervisor vendor KVM
Virtualization type full
L1d cache 32K
L1i cache 32K
L2 cache 256K
L3 cache 56320K
NUMA node0 CPU(s) 0-3


INFO 2021-03-28 03:48:25,689 tensorboard.py: 46: Tensorboard dir: ./checkpoints/tb_logs

Pretrained RegNetY Models

πŸš€ Feature

Release RegNetY backbone family from SEER pretrained on 1B Instagram pictures.

Motivation & Examples

I could not find these models in the Model Zoo. Do you have plans to release them, or did I miss them? The pretrained RegNets seem like great few-shot learners; maybe they will be great for downstream tasks with limited data.

Training SimCLR on 1-gpu with VISSL tutorial throws an error on training

When running the "Training SimCLR on 1-gpu with VISSL" tutorial, I tried to kick off the training using the following command:

!python3 run_distributed_engines.py \
    hydra.verbose=true \
    config=quick_1gpu_resnet50_simclr \
    config.DATA.TRAIN.DATA_SOURCES=[synthetic] \
    config.CHECKPOINT.DIR="./checkpoints" \
    +config.HOOKS.TENSORBOARD_SETUP.USE_TENSORBOARD=true

After outputting various configuration settings, the code encountered an error while building the model:

INFO 2021-05-04 02:13:10,704 train_task.py: 419: Building model....
INFO 2021-05-04 02:13:10,705 resnext.py:  63: ResNeXT trunk, supports activation checkpointing. Deactivated
INFO 2021-05-04 02:13:10,705 resnext.py:  83: Building model: ResNeXt50-1x64d-w1-BatchNorm2d
INFO 2021-05-04 02:13:11,316 model_helpers.py: 138: Using SyncBN group size: 1
INFO 2021-05-04 02:13:11,316 model_helpers.py: 153: Converting BN layers to PyTorch SyncBN
WARNING 2021-05-04 02:13:11,317 model_helpers.py: 159: Process groups not supported with PyTorch SyncBN currently. Traning will be slow. Please consider installing Apex for SyncBN.
INFO 2021-05-04 02:13:11,322 train_task.py: 378: Initializing model from: 
INFO 2021-05-04 02:13:11,323 util.py: 241: Broadcasting checkpoint loaded from 
Traceback (most recent call last):
  File "run_distributed_engines.py", line 194, in <module>
    hydra_main(overrides=overrides)
  File "run_distributed_engines.py", line 179, in hydra_main
    hook_generator=default_hook_generator,
  File "run_distributed_engines.py", line 123, in launch_distributed
    hook_generator=hook_generator,
  File "run_distributed_engines.py", line 166, in _distributed_worker
    process_main(cfg, dist_run_id, local_rank=local_rank, node_id=node_id)
  File "run_distributed_engines.py", line 159, in process_main
    hook_generator=hook_generator,
  File "/usr/local/lib/python3.7/dist-packages/vissl/engines/train.py", line 102, in train_main
    trainer.train()
  File "/usr/local/lib/python3.7/dist-packages/vissl/trainer/trainer_main.py", line 155, in train
    self.task.prepare(pin_memory=self.cfg.DATA.PIN_MEMORY)
  File "/usr/local/lib/python3.7/dist-packages/vissl/trainer/train_task.py", line 634, in prepare
    self.base_model = self._build_model()
  File "/usr/local/lib/python3.7/dist-packages/vissl/trainer/train_task.py", line 462, in _build_model
    model = self._restore_model_weights(model)
  File "/usr/local/lib/python3.7/dist-packages/vissl/trainer/train_task.py", line 399, in _restore_model_weights
    append_prefix=append_prefix,
  File "/usr/local/lib/python3.7/dist-packages/vissl/utils/checkpoint.py", line 404, in init_model_from_weights
    state_dict_key_name in state_dict.keys()
AttributeError: 'NoneType' object has no attribute 'keys'

Upon inspection, I noticed that even though I didn't specify a path to pre-trained model weights, the code still tried to call the _restore_model_weights function, ultimately causing the NoneType error in init_model_from_weights.

Here's the output excerpt showing the settings for WEIGHTS_INIT key:

           'WEIGHTS_INIT': {'APPEND_PREFIX': '',
                            'PARAMS_FILE': '',
                            'REMOVE_PREFIX': '',
                            'SKIP_LAYERS': ['num_batches_tracked'],
                            'STATE_DICT_KEY_NAME': 'classy_state_dict'}},
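
What I would have expected, as a naive sketch on my side rather than a tested patch, is a guard that skips restoration when no params file is given, e.g. in _build_model:

    # hypothetical guard around the existing _restore_model_weights call
    if self.config.MODEL.WEIGHTS_INIT.PARAMS_FILE:
        model = self._restore_model_weights(model)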

Can the framework be adapted for other tasks?

πŸ“š VISSL Documentation


Not sure if this is the right place to ask this (apologies for that).

Is the VISSL framework a generic framework? Meaning, can we make use of our own models (meant for speech/audio) and leverage the VISSL library for SSL?

Currently, I don't see in the documentation any example of how to adapt the toolbox to non-computer-vision work.

Unable to run sample RotNet pre-trained config

Instructions To Reproduce the πŸ› Bug:

  1. what changes you made (git diff) or what code you wrote
None. This was simply a trial to run pretrained RotNet with a dummy dataset. Some of the other SSL methods worked fine, some didn't. Please see the exception stacktrace.
  2. what exact command you ran:

python3 run_distributed_engines.py \
    hydra.verbose=true \
    config=pretrain/rotnet/rotnet_8gpu_resnet \
    config.DATA.TRAIN.DATA_SOURCES=[disk_folder] \
    config.DATA.TRAIN.LABEL_SOURCES=[disk_folder] \
    config.DATA.TRAIN.DATASET_NAMES=[dummy_data_folder] \
    config.DATA.TRAIN.DATA_PATHS=[dummy_data/train] \
    config.DATA.TRAIN.BATCHSIZE_PER_REPLICA=2 \
    config.DATA.TEST.DATA_SOURCES=[disk_folder] \
    config.DATA.TEST.LABEL_SOURCES=[disk_folder] \
    config.DATA.TEST.DATASET_NAMES=[dummy_data_folder] \
    config.DATA.TEST.DATA_PATHS=[dummy_data/val] \
    config.DATA.TEST.BATCHSIZE_PER_REPLICA=2 \
    config.DISTRIBUTED.NUM_NODES=1 \
    config.DISTRIBUTED.NUM_PROC_PER_NODE=1 \
    config.OPTIMIZER.num_epochs=2 \
    config.OPTIMIZER.param_schedulers.lr.values=[0.01,0.001] \
    config.OPTIMIZER.param_schedulers.lr.milestones=[1] \
    config.CHECKPOINT.DIR="./checkpoints"

  3. what you observed (including full logs):
** fvcore version of PathManager will be deprecated soon. **
** Please migrate to the version in iopath repo. **
https://github.com/facebookresearch/iopath 

####### overrides: ['hydra.verbose=true', 'config=pretrain/rotnet/rotnet_8gpu_resnet', 'config.DATA.TRAIN.DATA_SOURCES=[disk_folder]', 'config.DATA.TRAIN.LABEL_SOURCES=[disk_folder]', 'config.DATA.TRAIN.DATASET_NAMES=[dummy_data_folder]', 'config.DATA.TRAIN.DATA_PATHS=[dummy_data/train]', 'config.DATA.TRAIN.BATCHSIZE_PER_REPLICA=2', 'config.DATA.TEST.DATA_SOURCES=[disk_folder]', 'config.DATA.TEST.LABEL_SOURCES=[disk_folder]', 'config.DATA.TEST.DATASET_NAMES=[dummy_data_folder]', 'config.DATA.TEST.DATA_PATHS=[dummy_data/val]', 'config.DATA.TEST.BATCHSIZE_PER_REPLICA=2', 'config.DISTRIBUTED.NUM_NODES=1', 'config.DISTRIBUTED.NUM_PROC_PER_NODE=1', 'config.OPTIMIZER.num_epochs=2', 'config.OPTIMIZER.param_schedulers.lr.values=[0.01,0.001]', 'config.OPTIMIZER.param_schedulers.lr.milestones=[1]', 'config.CHECKPOINT.DIR=./checkpoints', 'hydra.verbose=true']
INFO 2021-04-09 05:46:13,602 __init__.py:  34: Provided Config has latest version: 1
INFO 2021-04-09 05:46:13,603 run_distributed_engines.py: 163: Spawning process for node_id: 0, local_rank: 0, dist_rank: 0, dist_run_id: localhost:56173
INFO 2021-04-09 05:46:13,603 train.py:  67: Env set for rank: 0, dist_rank: 0
INFO 2021-04-09 05:46:13,603 env.py:  38: CONDA_DEFAULT_ENV:	vissl_2
INFO 2021-04-09 05:46:13,603 env.py:  38: CONDA_EXE:	/home/ec2-user/miniconda3/bin/conda
INFO 2021-04-09 05:46:13,603 env.py:  38: CONDA_PREFIX:	/home/ec2-user/miniconda3/envs/vissl_2
INFO 2021-04-09 05:46:13,603 env.py:  38: CONDA_PREFIX_1:	/home/ec2-user/miniconda3
INFO 2021-04-09 05:46:13,603 env.py:  38: CONDA_PROMPT_MODIFIER:	(vissl_2) 
INFO 2021-04-09 05:46:13,603 env.py:  38: CONDA_PYTHON_EXE:	/home/ec2-user/miniconda3/bin/python
INFO 2021-04-09 05:46:13,603 env.py:  38: CONDA_SHLVL:	2
INFO 2021-04-09 05:46:13,603 env.py:  38: HISTCONTROL:	ignoredups
INFO 2021-04-09 05:46:13,603 env.py:  38: HISTSIZE:	1000
INFO 2021-04-09 05:46:13,603 env.py:  38: HOME:	/home/ec2-user
INFO 2021-04-09 05:46:13,603 env.py:  38: HOSTNAME:	ip-10-0-6-212.vpc.internal
INFO 2021-04-09 05:46:13,603 env.py:  38: LANG:	en_US.UTF-8
INFO 2021-04-09 05:46:13,603 env.py:  38: LESSOPEN:	||/usr/bin/lesspipe.sh %s
INFO 2021-04-09 05:46:13,603 env.py:  38: LOCAL_RANK:	0
INFO 2021-04-09 05:46:13,604 env.py:  38: LOGNAME:	ec2-user
INFO 2021-04-09 05:46:13,604 env.py:  38: LS_COLORS:	rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:
INFO 2021-04-09 05:46:13,604 env.py:  38: MAIL:	/var/spool/mail/ec2-user
INFO 2021-04-09 05:46:13,604 env.py:  38: OLDPWD:	/home/ec2-user/vissl/configs
INFO 2021-04-09 05:46:13,604 env.py:  38: PATH:	/usr/local/cuda-11.2/bin:/usr/local/cuda-11.2/bin:/home/ec2-user/miniconda3/envs/vissl_2/bin:/home/ec2-user/miniconda3/condabin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/aws/bin:/home/ec2-user/miniconda3/bin:/home/ec2-user/.local/bin:/home/ec2-user/bin:/opt/aws/bin:/home/ec2-user/miniconda3/bin:/home/ec2-user/.local/bin:/home/ec2-user/bin
INFO 2021-04-09 05:46:13,604 env.py:  38: PWD:	/home/ec2-user/vissl
INFO 2021-04-09 05:46:13,604 env.py:  38: RANK:	0
INFO 2021-04-09 05:46:13,604 env.py:  38: SELINUX_LEVEL_REQUESTED:	
INFO 2021-04-09 05:46:13,604 env.py:  38: SELINUX_ROLE_REQUESTED:	
INFO 2021-04-09 05:46:13,604 env.py:  38: SELINUX_USE_CURRENT_RANGE:	
INFO 2021-04-09 05:46:13,604 env.py:  38: SHELL:	/bin/bash
INFO 2021-04-09 05:46:13,604 env.py:  38: SHLVL:	2
INFO 2021-04-09 05:46:13,604 env.py:  38: SSH_CLIENT:	207.207.163.8 60744 22
INFO 2021-04-09 05:46:13,604 env.py:  38: SSH_CONNECTION:	207.207.163.8 60744 10.0.6.212 22
INFO 2021-04-09 05:46:13,604 env.py:  38: SSH_TTY:	/dev/pts/0
INFO 2021-04-09 05:46:13,604 env.py:  38: TERM:	screen
INFO 2021-04-09 05:46:13,604 env.py:  38: TMUX:	/tmp/tmux-1000/default,12776,0
INFO 2021-04-09 05:46:13,604 env.py:  38: TMUX_PANE:	%0
INFO 2021-04-09 05:46:13,604 env.py:  38: USER:	ec2-user
INFO 2021-04-09 05:46:13,604 env.py:  38: WORLD_SIZE:	1
INFO 2021-04-09 05:46:13,604 env.py:  38: XDG_RUNTIME_DIR:	/run/user/1000
INFO 2021-04-09 05:46:13,604 env.py:  38: XDG_SESSION_ID:	53
INFO 2021-04-09 05:46:13,604 env.py:  38: _:	/home/ec2-user/miniconda3/envs/vissl_2/bin/python3
INFO 2021-04-09 05:46:13,604 env.py:  38: _CE_CONDA:	
INFO 2021-04-09 05:46:13,604 env.py:  38: _CE_M:	
INFO 2021-04-09 05:46:13,604 misc.py: 133: Set start method of multiprocessing to forkserver
INFO 2021-04-09 05:46:13,604 train.py:  78: Setting seed....
INFO 2021-04-09 05:46:13,604 misc.py: 146: MACHINE SEED: 1
INFO 2021-04-09 05:46:13,742 hydra_config.py:  63: Training with config:
INFO 2021-04-09 05:46:13,747 hydra_config.py:  67: {'CHECKPOINT': {'APPEND_DISTR_RUN_ID': False,
                'AUTO_RESUME': True,
                'BACKEND': 'disk',
                'CHECKPOINT_FREQUENCY': 1,
                'CHECKPOINT_ITER_FREQUENCY': -1,
                'DIR': './checkpoints',
                'LATEST_CHECKPOINT_RESUME_FILE_NUM': 1,
                'OVERWRITE_EXISTING': False,
                'USE_SYMLINK_CHECKPOINT_FOR_RESUME': False},
 'CLUSTERFIT': {'CLUSTER_BACKEND': 'faiss',
                'FEATURES': {'DATASET_NAME': '',
                             'DATA_PARTITION': 'TRAIN',
                             'LAYER_NAME': ''},
                'NUM_CLUSTERS': 16000,
                'N_ITER': 50},
 'DATA': {'DDP_BUCKET_CAP_MB': 25,
          'ENABLE_ASYNC_GPU_COPY': True,
          'NUM_DATALOADER_WORKERS': 1,
          'PIN_MEMORY': True,
          'TEST': {'BATCHSIZE_PER_REPLICA': 2,
                   'COLLATE_FUNCTION': 'default_collate',
                   'COLLATE_FUNCTION_PARAMS': {},
                   'COPY_DESTINATION_DIR': '',
                   'COPY_TO_LOCAL_DISK': False,
                   'DATASET_NAMES': ['dummy_data_folder'],
                   'DATA_LIMIT': -1,
                   'DATA_LIMIT_SAMPLING': {'IS_BALANCED': False,
                                           'SEED': 0,
                                           'SKIP_NUM_SAMPLES': 0},
                   'DATA_PATHS': ['dummy_data/val'],
                   'DATA_SOURCES': ['disk_folder'],
                   'DEFAULT_GRAY_IMG_SIZE': 224,
                   'DROP_LAST': False,
                   'ENABLE_QUEUE_DATASET': False,
                   'INPUT_KEY_NAMES': ['data'],
                   'LABEL_PATHS': [],
                   'LABEL_SOURCES': ['disk_folder'],
                   'LABEL_TYPE': 'standard',
                   'MMAP_MODE': True,
                   'NEW_IMG_PATH_PREFIX': '',
                   'REMOVE_IMG_PATH_PREFIX': '',
                   'TARGET_KEY_NAMES': ['label'],
                   'TRANSFORMS': [{'name': 'ImgRotatePil'},
                                  {'name': 'Resize', 'size': 256},
                                  {'name': 'CenterCrop', 'size': 224},
                                  {'name': 'ToTensor'},
                                  {'mean': [0.485, 0.456, 0.406],
                                   'name': 'Normalize',
                                   'std': [0.229, 0.224, 0.225]}],
                   'USE_STATEFUL_DISTRIBUTED_SAMPLER': False},
          'TRAIN': {'BATCHSIZE_PER_REPLICA': 2,
                    'COLLATE_FUNCTION': 'default_collate',
                    'COLLATE_FUNCTION_PARAMS': {},
                    'COPY_DESTINATION_DIR': '',
                    'COPY_TO_LOCAL_DISK': False,
                    'DATASET_NAMES': ['dummy_data_folder'],
                    'DATA_LIMIT': -1,
                    'DATA_LIMIT_SAMPLING': {'IS_BALANCED': False,
                                            'SEED': 0,
                                            'SKIP_NUM_SAMPLES': 0},
                    'DATA_PATHS': ['dummy_data/train'],
                    'DATA_SOURCES': ['disk_folder'],
                    'DEFAULT_GRAY_IMG_SIZE': 224,
                    'DROP_LAST': False,
                    'ENABLE_QUEUE_DATASET': False,
                    'INPUT_KEY_NAMES': ['data'],
                    'LABEL_PATHS': [],
                    'LABEL_SOURCES': ['disk_folder'],
                    'LABEL_TYPE': 'standard',
                    'MMAP_MODE': True,
                    'NEW_IMG_PATH_PREFIX': '',
                    'REMOVE_IMG_PATH_PREFIX': '',
                    'TARGET_KEY_NAMES': ['label'],
                    'TRANSFORMS': [{'name': 'ImgRotatePil'},
                                   {'name': 'RandomResizedCrop', 'size': 224},
                                   {'name': 'RandomHorizontalFlip'},
                                   {'name': 'ToTensor'},
                                   {'mean': [0.485, 0.456, 0.406],
                                    'name': 'Normalize',
                                    'std': [0.229, 0.224, 0.225]}],
                    'USE_STATEFUL_DISTRIBUTED_SAMPLER': False}},
 'DISTRIBUTED': {'BACKEND': 'nccl',
                 'BROADCAST_BUFFERS': True,
                 'INIT_METHOD': 'tcp',
                 'MANUAL_GRADIENT_REDUCTION': False,
                 'NCCL_DEBUG': False,
                 'NCCL_SOCKET_NTHREADS': '',
                 'NUM_NODES': 1,
                 'NUM_PROC_PER_NODE': 1,
                 'RUN_ID': 'auto'},
 'HOOKS': {'LOG_GPU_STATS': True,
           'MEMORY_SUMMARY': {'LOG_ITERATION_NUM': 0,
                              'PRINT_MEMORY_SUMMARY': True},
           'MODEL_COMPLEXITY': {'COMPUTE_COMPLEXITY': False,
                                'INPUT_SHAPE': [3, 224, 224]},
           'PERF_STATS': {'MONITOR_PERF_STATS': False,
                          'PERF_STAT_FREQUENCY': -1,
                          'ROLLING_BTIME_FREQ': -1},
           'TENSORBOARD_SETUP': {'EXPERIMENT_LOG_DIR': 'tensorboard',
                                 'FLUSH_EVERY_N_MIN': 5,
                                 'LOG_DIR': '.',
                                 'LOG_PARAMS': True,
                                 'LOG_PARAMS_EVERY_N_ITERS': 310,
                                 'LOG_PARAMS_GRADIENTS': True,
                                 'USE_TENSORBOARD': False}},
 'IMG_RETRIEVAL': {'DATASET_PATH': '',
                   'EVAL_BINARY_PATH': '',
                   'EVAL_DATASET_NAME': 'Paris',
                   'FEATS_PROCESSING_TYPE': '',
                   'GEM_POOL_POWER': 4.0,
                   'N_PCA': 512,
                   'RESIZE_IMG': 1024,
                   'SHOULD_TRAIN_PCA_OR_WHITENING': True,
                   'SPATIAL_LEVELS': 3,
                   'TEMP_DIR': '/tmp/instance_retrieval/',
                   'TRAIN_DATASET_NAME': 'Oxford',
                   'WHITEN_IMG_LIST': ''},
 'LOG_FREQUENCY': 100,
 'LOSS': {'CrossEntropyLoss': {'ignore_index': -1},
          'bce_logits_multiple_output_single_target': {'normalize_output': False,
                                                       'reduction': 'none',
                                                       'world_size': 1},
          'cross_entropy_multiple_output_single_target': {'ignore_index': -1,
                                                          'normalize_output': False,
                                                          'reduction': 'mean',
                                                          'temperature': 1.0,
                                                          'weight': None},
          'deepclusterv2_loss': {'BATCHSIZE_PER_REPLICA': 256,
                                 'DROP_LAST': True,
                                 'kmeans_iters': 10,
                                 'memory_params': {'crops_for_mb': [0],
                                                   'embedding_dim': 128},
                                 'num_clusters': [3000, 3000, 3000],
                                 'num_crops': 2,
                                 'num_train_samples': -1,
                                 'temperature': 0.1},
          'ignore_index': -1,
          'moco_loss': {'embedding_dim': 128,
                        'momentum': 0.999,
                        'queue_size': 65536,
                        'temperature': 0.2},
          'multicrop_simclr_info_nce_loss': {'buffer_params': {'effective_batch_size': 4096,
                                                               'embedding_dim': 128,
                                                               'world_size': 64},
                                             'num_crops': 2,
                                             'temperature': 0.1},
          'name': 'cross_entropy_multiple_output_single_target',
          'nce_loss_with_memory': {'loss_type': 'nce',
                                   'loss_weights': [1.0],
                                   'memory_params': {'embedding_dim': 128,
                                                     'memory_size': -1,
                                                     'momentum': 0.5,
                                                     'norm_init': True,
                                                     'update_mem_on_forward': True},
                                   'negative_sampling_params': {'num_negatives': 16000,
                                                                'type': 'random'},
                                   'norm_constant': -1,
                                   'norm_embedding': True,
                                   'num_train_samples': -1,
                                   'temperature': 0.07,
                                   'update_mem_with_emb_index': -100},
          'simclr_info_nce_loss': {'buffer_params': {'effective_batch_size': 4096,
                                                     'embedding_dim': 128,
                                                     'world_size': 64},
                                   'temperature': 0.1},
          'swav_loss': {'crops_for_assign': [0, 1],
                        'embedding_dim': 128,
                        'epsilon': 0.05,
                        'normalize_last_layer': True,
                        'num_crops': 2,
                        'num_iters': 3,
                        'num_prototypes': [3000],
                        'output_dir': '.',
                        'queue': {'local_queue_length': 0,
                                  'queue_length': 0,
                                  'start_iter': 0},
                        'temp_hard_assignment_iters': 0,
                        'temperature': 0.1,
                        'use_double_precision': False},
          'swav_momentum_loss': {'crops_for_assign': [0, 1],
                                 'embedding_dim': 128,
                                 'epsilon': 0.05,
                                 'momentum': 0.99,
                                 'momentum_eval_mode_iter_start': 0,
                                 'normalize_last_layer': True,
                                 'num_crops': 2,
                                 'num_iters': 3,
                                 'num_prototypes': [3000],
                                 'queue': {'local_queue_length': 0,
                                           'queue_length': 0,
                                           'start_iter': 0},
                                 'temperature': 0.1,
                                 'use_double_precision': False}},
 'MACHINE': {'DEVICE': 'gpu'},
 'METERS': {'accuracy_list_meter': {'meter_names': [],
                                    'num_meters': 1,
                                    'topk_values': [1]},
            'enable_training_meter': True,
            'mean_ap_list_meter': {'max_cpu_capacity': -1,
                                   'meter_names': [],
                                   'num_classes': 9605,
                                   'num_meters': 1},
            'name': 'accuracy_list_meter'},
 'MODEL': {'ACTIVATION_CHECKPOINTING': {'NUM_ACTIVATION_CHECKPOINTING_SPLITS': 2,
                                        'USE_ACTIVATION_CHECKPOINTING': False},
           'AMP_PARAMS': {'AMP_ARGS': {'opt_level': 'O1'},
                          'AMP_TYPE': 'apex',
                          'USE_AMP': False},
           'CUDA_CACHE': {'CLEAR_CUDA_CACHE': False, 'CLEAR_FREQ': 100},
           'FEATURE_EVAL_SETTINGS': {'EVAL_MODE_ON': False,
                                     'EVAL_TRUNK_AND_HEAD': False,
                                     'EXTRACT_TRUNK_FEATURES_ONLY': False,
                                     'FREEZE_TRUNK_AND_HEAD': False,
                                     'FREEZE_TRUNK_ONLY': False,
                                     'LINEAR_EVAL_FEAT_POOL_OPS_MAP': [],
                                     'SHOULD_FLATTEN_FEATS': True},
           'FSDP_CONFIG': {'flatten_parameters': True,
                           'fp32_reduce_scatter': False,
                           'mixed_precision': True},
           'GRAD_CLIP': {'MAX_NORM': 1, 'NORM_TYPE': 2, 'USE_GRAD_CLIP': False},
           'HEAD': {'BATCHNORM_EPS': 1e-05,
                    'BATCHNORM_MOMENTUM': 0.1,
                    'INPLACE_RELU': True,
                    'PARAMS': [['mlp', {'dims': [2048, 4]}]],
                    'PARAMS_MULTIPLIER': 1.0},
           'INPUT_TYPE': 'rgb',
           'MULTI_INPUT_HEAD_MAPPING': [],
           'NON_TRAINABLE_PARAMS': [],
           'SHARDED_DDP_SETUP': {'reduce_buffer_size': -1},
           'SINGLE_PASS_EVERY_CROP': False,
           'SYNC_BN_CONFIG': {'CONVERT_BN_TO_SYNC_BN': False,
                              'GROUP_SIZE': -1,
                              'SYNC_BN_TYPE': 'pytorch'},
           'TEMP_FROZEN_PARAMS_ITER_MAP': [],
           'TRUNK': {'NAME': 'resnet',
                     'TRUNK_PARAMS': {'EFFICIENT_NETS': {},
                                      'REGNET': {},
                                      'RESNETS': {'DEPTH': 50,
                                                  'GROUPNORM_GROUPS': 32,
                                                  'GROUPS': 1,
                                                  'LAYER4_STRIDE': 2,
                                                  'NORM': 'BatchNorm',
                                                  'STANDARDIZE_CONVOLUTIONS': False,
                                                  'WIDTH_MULTIPLIER': 1,
                                                  'WIDTH_PER_GROUP': 64,
                                                  'ZERO_INIT_RESIDUAL': False},
                                      'VISION_TRANSFORMERS': {'ATTENTION_DROPOUT_RATE': 0,
                                                              'CLASSIFIER': 'token',
                                                              'DROPOUT_RATE': 0,
                                                              'DROP_PATH_RATE': 0,
                                                              'HIDDEN_DIM': 768,
                                                              'IMAGE_SIZE': 224,
                                                              'MLP_DIM': 3072,
                                                              'NUM_HEADS': 12,
                                                              'NUM_LAYERS': 12,
                                                              'PATCH_SIZE': 16,
                                                              'QKV_BIAS': False,
                                                              'QK_SCALE': False,
                                                              'name': None}}},
           'WEIGHTS_INIT': {'APPEND_PREFIX': '',
                            'PARAMS_FILE': '',
                            'REMOVE_PREFIX': '',
                            'SKIP_LAYERS': ['num_batches_tracked'],
                            'STATE_DICT_KEY_NAME': 'classy_state_dict'}},
 'MULTI_PROCESSING_METHOD': 'forkserver',
 'NEAREST_NEIGHBOR': {'L2_NORM_FEATS': False, 'SIGMA': 0.1, 'TOPK': 200},
 'OPTIMIZER': {'betas': [0.9, 0.999],
               'construct_single_param_group_only': False,
               'head_optimizer_params': {'use_different_lr': False,
                                         'use_different_wd': False,
                                         'weight_decay': 0.0001},
               'larc_config': {'clip': False,
                               'eps': 1e-08,
                               'trust_coefficient': 0.001},
               'momentum': 0.9,
               'name': 'sgd',
               'nesterov': False,
               'non_regularized_parameters': [],
               'num_epochs': 2,
               'param_schedulers': {'lr': {'auto_lr_scaling': {'auto_scale': True,
                                                               'base_lr_batch_size': 1,
                                                               'base_value': 0.1},
                                           'end_value': 0.0,
                                           'interval_scaling': [],
                                           'lengths': [],
                                           'milestones': [1],
                                           'name': 'multistep',
                                           'schedulers': [],
                                           'start_value': 0.1,
                                           'update_interval': 'epoch',
                                           'value': 0.1,
                                           'values': [0.2, 0.02]},
                                    'lr_head': {'auto_lr_scaling': {'auto_scale': True,
                                                                    'base_lr_batch_size': 1,
                                                                    'base_value': 0.1},
                                                'end_value': 0.0,
                                                'interval_scaling': [],
                                                'lengths': [],
                                                'milestones': [1],
                                                'name': 'multistep',
                                                'schedulers': [],
                                                'start_value': 0.1,
                                                'update_interval': 'epoch',
                                                'value': 0.1,
                                                'values': [0.2, 0.02]}},
               'regularize_bias': True,
               'regularize_bn': False,
               'use_larc': False,
               'use_zero': False,
               'weight_decay': 0.0001},
 'SEED_VALUE': 1,
 'SLURM': {'COMMENT': 'vissl job',
           'CONSTRAINT': '',
           'LOG_FOLDER': '.',
           'MEM_GB': 250,
           'NAME': 'vissl',
           'PARTITION': 'learnfair',
           'PORT_ID': 40050,
           'TIME_HOURS': 72,
           'USE_SLURM': False},
 'SVM': {'cls_list': [],
         'costs': {'base': -1.0,
                   'costs_list': [0.1, 0.01],
                   'power_range': [4, 20]},
         'cross_val_folds': 3,
         'dual': True,
         'force_retrain': False,
         'loss': 'squared_hinge',
         'low_shot': {'dataset_name': 'voc',
                      'k_values': [1, 2, 4, 8, 16, 32, 64, 96],
                      'sample_inds': [1, 2, 3, 4, 5]},
         'max_iter': 2000,
         'normalize': True,
         'penalty': 'l2'},
 'TEST_EVERY_NUM_EPOCH': 5,
 'TEST_MODEL': True,
 'TEST_ONLY': False,
 'TRAINER': {'TASK_NAME': 'self_supervision_task',
             'TRAIN_STEP_NAME': 'standard_train_step'},
 'VERBOSE': False}
INFO 2021-04-09 05:46:14,206 train.py:  90: System config:
-------------------  -------------------------------------------------------------------------------------------
sys.platform         linux
Python               3.7.10 (default, Feb 26 2021, 18:47:35) [GCC 7.3.0]
numpy                1.19.5
Pillow               8.2.0
vissl                0.1.5 @/home/ec2-user/vissl/vissl
GPU available        True
GPU 0,1,2,3          Tesla T4
CUDA_HOME            /usr/local/cuda-11.2
torchvision          0.9.1+cu102 @/home/ec2-user/miniconda3/envs/vissl_2/lib/python3.7/site-packages/torchvision
hydra                1.0.6 @/home/ec2-user/miniconda3/envs/vissl_2/lib/python3.7/site-packages/hydra
classy_vision        0.6.0.dev @/home/ec2-user/miniconda3/envs/vissl_2/lib/python3.7/site-packages/classy_vision
apex                 0.1 @/home/ec2-user/miniconda3/envs/vissl_2/lib/python3.7/site-packages/apex
cv2                  4.5.1
PyTorch              1.8.1+cu102 @/home/ec2-user/miniconda3/envs/vissl_2/lib/python3.7/site-packages/torch
PyTorch debug build  False
-------------------  -------------------------------------------------------------------------------------------
PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.2
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70
  - CuDNN 7.6.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

CPU info:
-------------------  ----------------------------------------------
Architecture         x86_64
CPU op-mode(s)       32-bit, 64-bit
Byte Order           Little Endian
CPU(s)               48
On-line CPU(s) list  0-47
Thread(s) per core   2
Core(s) per socket   24
Socket(s)            1
NUMA node(s)         1
Vendor ID            GenuineIntel
CPU family           6
Model                85
Model name           Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Stepping             7
CPU MHz              2998.569
BogoMIPS             4999.99
Hypervisor vendor    KVM
Virtualization type  full
L1d cache            32K
L1i cache            32K
L2 cache             1024K
L3 cache             36608K
NUMA node0 CPU(s)    0-47
-------------------  ----------------------------------------------
INFO 2021-04-09 05:46:14,207 train_task.py: 194: Not using Automatic Mixed Precision
INFO 2021-04-09 05:46:14,207 trainer_main.py: 109: Using Distributed init method: tcp://localhost:56173, world_size: 1, rank: 0
INFO 2021-04-09 05:46:14,209 distributed_c10d.py: 187: Added key: store_based_barrier_key:1 to store for rank: 0
INFO 2021-04-09 05:46:14,209 trainer_main.py: 130: | initialized host ip-10-0-6-212.vpc.internal as rank 0 (0)
INFO 2021-04-09 05:46:14,209 img_rotate_pil.py:  56: ImgRotatePil | Using num_angles: 4
INFO 2021-04-09 05:46:14,209 img_rotate_pil.py:  58: ImgRotatePil | Using num_rotations_per_img: 1
INFO 2021-04-09 05:46:14,210 ssl_dataset.py: 153: Rank: 0 split: TEST Data files:
['dummy_data/val']
INFO 2021-04-09 05:46:14,210 ssl_dataset.py: 156: Rank: 0 split: TEST Label files:
['dummy_data/val']
INFO 2021-04-09 05:46:14,210 disk_dataset.py:  83: Loaded 10 samples from folder dummy_data/val
INFO 2021-04-09 05:46:14,211 img_rotate_pil.py:  56: ImgRotatePil | Using num_angles: 4
INFO 2021-04-09 05:46:14,211 img_rotate_pil.py:  58: ImgRotatePil | Using num_rotations_per_img: 1
INFO 2021-04-09 05:46:14,211 ssl_dataset.py: 153: Rank: 0 split: TRAIN Data files:
['dummy_data/train']
INFO 2021-04-09 05:46:14,211 ssl_dataset.py: 156: Rank: 0 split: TRAIN Label files:
['dummy_data/train']
INFO 2021-04-09 05:46:14,211 disk_dataset.py:  83: Loaded 10 samples from folder dummy_data/train
INFO 2021-04-09 05:46:14,211 misc.py: 133: Set start method of multiprocessing to forkserver
INFO 2021-04-09 05:46:14,211 __init__.py: 109: Created the Distributed Sampler....
INFO 2021-04-09 05:46:14,211 __init__.py:  90: Distributed Sampler config:
{'num_replicas': 1, 'rank': 0, 'epoch': 0, 'num_samples': 10, 'total_size': 10, 'shuffle': True, 'seed': 0}
INFO 2021-04-09 05:46:14,212 __init__.py: 173: Wrapping the dataloader to async device copies
INFO 2021-04-09 05:46:17,220 misc.py: 133: Set start method of multiprocessing to forkserver
INFO 2021-04-09 05:46:17,220 __init__.py: 109: Created the Distributed Sampler....
INFO 2021-04-09 05:46:17,220 __init__.py:  90: Distributed Sampler config:
{'num_replicas': 1, 'rank': 0, 'epoch': 0, 'num_samples': 10, 'total_size': 10, 'shuffle': True, 'seed': 0}
INFO 2021-04-09 05:46:17,220 __init__.py: 173: Wrapping the dataloader to async device copies
INFO 2021-04-09 05:46:17,220 train_task.py: 422: Building model....
INFO 2021-04-09 05:46:17,220 resnext.py:  63: ResNeXT trunk, supports activation checkpointing. Deactivated
INFO 2021-04-09 05:46:17,221 resnext.py:  83: Building model: ResNeXt50-1x64d-w1-BatchNorm2d
INFO 2021-04-09 05:46:17,751 train_task.py: 596: Broadcast model BN buffers from master on every forward pass
INFO 2021-04-09 05:46:17,751 classification_task.py: 377: Synchronized Batch Normalization is disabled
INFO 2021-04-09 05:46:17,751 train_task.py: 342: Building loss...
INFO 2021-04-09 05:46:17,788 optimizer_helper.py: 254: 
Trainable params: 161, 
Non-Trainable params: 0, 
Trunk Regularized Parameters: 53, 
Trunk Unregularized Parameters 106, 
Head Regularized Parameters: 2, 
Head Unregularized Parameters: 0 
Remaining Regularized Parameters: 0 
Remaining Unregularized Parameters: 0
INFO 2021-04-09 05:46:17,789 trainer_main.py: 246: Training 2 epochs. One epoch = 5 iterations
INFO 2021-04-09 05:46:17,789 trainer_main.py: 248: Total 10 iterations for training
INFO 2021-04-09 05:46:17,789 trainer_main.py: 249: Total 10 samples in one epoch
INFO 2021-04-09 05:46:18,039 logger.py:  80: Fri Apr  9 05:46:17 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1B.0 Off |                    0 |
| N/A   24C    P0    25W /  70W |   1166MiB / 15109MiB |      8%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:00:1C.0 Off |                    0 |
| N/A   24C    P8     9W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            On   | 00000000:00:1D.0 Off |                    0 |
| N/A   23C    P8     9W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   24C    P8     9W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     36191      C   python3                          1163MiB |
+-----------------------------------------------------------------------------+

INFO 2021-04-09 05:46:18,040 trainer_main.py: 166: Model is:
 Classy <class 'vissl.models.base_ssl_model.BaseSSLMultiInputOutputModel'>:
BaseSSLMultiInputOutputModel(
  (_heads): ModuleDict()
  (trunk): ResNeXt(
    (_feature_blocks): ModuleDict(
      (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv1_relu): ReLU(inplace=True)
      (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
      (layer1): Sequential(
        (0): Bottleneck(
          (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
          (downsample): Sequential(
            (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          )
        )
        (1): Bottleneck(
          (conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
        )
        (2): Bottleneck(
          (conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
        )
      )
      (layer2): Sequential(
        (0): Bottleneck(
          (conv1): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
          (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
          (downsample): Sequential(
            (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
            (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          )
        )
        (1): Bottleneck(
          (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
        )
        (2): Bottleneck(
          (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
        )
        (3): Bottleneck(
          (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
        )
      )
      (layer3): Sequential(
        (0): Bottleneck(
          (conv1): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
          (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
          (downsample): Sequential(
            (0): Conv2d(512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False)
            (1): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          )
        )
        (1): Bottleneck(
          (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
        )
        (2): Bottleneck(
          (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
        )
        (3): Bottleneck(
          (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
        )
        (4): Bottleneck(
          (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
        )
        (5): Bottleneck(
          (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
        )
      )
      (layer4): Sequential(
        (0): Bottleneck(
          (conv1): Conv2d(1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(<SUPPORTED_L4_STRIDE.two: 2>, <SUPPORTED_L4_STRIDE.two: 2>), padding=(1, 1), bias=False)
          (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
          (downsample): Sequential(
            (0): Conv2d(1024, 2048, kernel_size=(1, 1), stride=(<SUPPORTED_L4_STRIDE.two: 2>, <SUPPORTED_L4_STRIDE.two: 2>), bias=False)
            (1): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          )
        )
        (1): Bottleneck(
          (conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
        )
        (2): Bottleneck(
          (conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
        )
      )
      (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
      (flatten): Flatten()
    )
  )
  (heads): ModuleList(
    (0): MLP(
      (clf): Sequential(
        (0): Linear(in_features=2048, out_features=4, bias=True)
      )
    )
  )
)
INFO 2021-04-09 05:46:18,040 trainer_main.py: 167: Loss is: CrossEntropyMultipleOutputSingleTargetLoss(
  (_losses): ModuleList()
)
INFO 2021-04-09 05:46:18,041 trainer_main.py: 168: Starting training....
INFO 2021-04-09 05:46:18,041 __init__.py:  90: Distributed Sampler config:
{'num_replicas': 1, 'rank': 0, 'epoch': 0, 'num_samples': 10, 'total_size': 10, 'shuffle': True, 'seed': 0}
** fvcore version of PathManager will be deprecated soon. **
** Please migrate to the version in iopath repo. **
https://github.com/facebookresearch/iopath 

Traceback (most recent call last):
  File "run_distributed_engines.py", line 194, in <module>
    hydra_main(overrides=overrides)
  File "run_distributed_engines.py", line 179, in hydra_main
    hook_generator=default_hook_generator,
  File "run_distributed_engines.py", line 123, in launch_distributed
    hook_generator=hook_generator,
  File "run_distributed_engines.py", line 166, in _distributed_worker
    process_main(cfg, dist_run_id, local_rank=local_rank, node_id=node_id)
  File "run_distributed_engines.py", line 159, in process_main
    hook_generator=hook_generator,
  File "/home/ec2-user/vissl/vissl/engines/train.py", line 103, in train_main
    trainer.train()
  File "/home/ec2-user/vissl/vissl/trainer/trainer_main.py", line 171, in train
    self._advance_phase(task)  # advances task.phase_idx
  File "/home/ec2-user/vissl/vissl/trainer/trainer_main.py", line 291, in _advance_phase
    phase_type, epoch=task.phase_idx, compute_start_iter=compute_start_iter
  File "/home/ec2-user/vissl/vissl/trainer/train_task.py", line 506, in recreate_data_iterator
    self.data_iterator = iter(self.dataloaders[phase_type])
  File "/home/ec2-user/miniconda3/envs/vissl_2/lib/python3.7/site-packages/classy_vision/dataset/dataloader_async_gpu_wrapper.py", line 40, in __iter__
    self.preload()
  File "/home/ec2-user/miniconda3/envs/vissl_2/lib/python3.7/site-packages/classy_vision/dataset/dataloader_async_gpu_wrapper.py", line 46, in preload
    self.cache_next = next(self._iter)
  File "/home/ec2-user/miniconda3/envs/vissl_2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/home/ec2-user/miniconda3/envs/vissl_2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1199, in _next_data
    return self._process_data(data)
  File "/home/ec2-user/miniconda3/envs/vissl_2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
    data.reraise()
  File "/home/ec2-user/miniconda3/envs/vissl_2/lib/python3.7/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/ec2-user/miniconda3/envs/vissl_2/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/ec2-user/miniconda3/envs/vissl_2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ec2-user/miniconda3/envs/vissl_2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ec2-user/vissl/vissl/data/ssl_dataset.py", line 355, in __getitem__
    item = self.transform(item)
  File "/home/ec2-user/miniconda3/envs/vissl_2/lib/python3.7/site-packages/torchvision/transforms/transforms.py", line 60, in __call__
    img = t(img)
  File "/home/ec2-user/vissl/vissl/data/ssl_transforms/__init__.py", line 144, in __call__
    output = self.transform(sample["data"][idx])
  File "/home/ec2-user/vissl/vissl/data/ssl_transforms/img_rotate_pil.py", line 39, in __call__
    img = TF.rotate(image, self.angles[label])
  File "/home/ec2-user/miniconda3/envs/vissl_2/lib/python3.7/site-packages/torchvision/transforms/functional.py", line 949, in rotate
    raise TypeError("Argument angle should be int or float")
TypeError: Argument angle should be int or float
  1. Please simplify the steps as much as possible so they do not require additional resources to run, such as a private dataset.

The dummy dataset was obtained similar to existing tutorials:

https://colab.research.google.com/drive/1CCuZ50BN99JcOB6VEPytVi_i2tSMd7A3#scrollTo=KPGCiTsXZeW3
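
For context on the traceback above: torchvision 0.9's TF.rotate type-checks its angle argument and accepts only a plain Python int or float. A hedged guess at the cause, not a confirmed diagnosis: if the transform stores its angles in a numpy array, self.angles[label] yields a numpy scalar, which fails that isinstance check. A minimal workaround sketch under that assumption (a hypothetical one-line edit to img_rotate_pil.py, not the maintainers' fix):

# In ImgRotatePil.__call__ (sketch): cast the angle to a plain Python float
# so torchvision's isinstance(angle, (int, float)) check passes.
img = TF.rotate(image, float(self.angles[label]))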

Expected behavior:

If there is no obvious error in the "what you observed" section provided above,
please tell us the expected behavior.

Environment:

Provide your environment information using the following command:

-------------------  -------------------------------------------------------------------------------------------
sys.platform         linux
Python               3.7.10 (default, Feb 26 2021, 18:47:35) [GCC 7.3.0]
numpy                1.19.5
Pillow               8.2.0
vissl                0.1.5 @/home/ec2-user/vissl/vissl
GPU available        True
GPU 0,1,2,3          Tesla T4
CUDA_HOME            /usr/local/cuda-11.2
torchvision          0.9.1+cu102 @/home/ec2-user/miniconda3/envs/vissl_2/lib/python3.7/site-packages/torchvision
hydra                1.0.6 @/home/ec2-user/miniconda3/envs/vissl_2/lib/python3.7/site-packages/hydra
classy_vision        0.6.0.dev @/home/ec2-user/miniconda3/envs/vissl_2/lib/python3.7/site-packages/classy_vision
apex                 0.1 @/home/ec2-user/miniconda3/envs/vissl_2/lib/python3.7/site-packages/apex
cv2                  4.5.1
PyTorch              1.8.1+cu102 @/home/ec2-user/miniconda3/envs/vissl_2/lib/python3.7/site-packages/torch
PyTorch debug build  False
-------------------  -------------------------------------------------------------------------------------------
PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.2
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70
  - CuDNN 7.6.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

CPU info:
-------------------  ----------------------------------------------
Architecture         x86_64
CPU op-mode(s)       32-bit, 64-bit
Byte Order           Little Endian
CPU(s)               48
On-line CPU(s) list  0-47
Thread(s) per core   2
Core(s) per socket   24
Socket(s)            1
NUMA node(s)         1
Vendor ID            GenuineIntel
CPU family           6
Model                85
Model name           Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Stepping             7
CPU MHz              2887.996
BogoMIPS             4999.99
Hypervisor vendor    KVM
Virtualization type  full
L1d cache            32K
L1i cache            32K
L2 cache             1024K
L3 cache             36608K
NUMA node0 CPU(s)    0-47
-------------------  ----------------------------------------------

When to expect Triage

VISSL devs and contributors aim to triage issues asap; however, as a general guideline, we ask users to expect triaging within 1-2 weeks.

Scaling of composed lr scheduling w/ linear warmup

param_schedulers["schedulers"][idx]["start_value"] = start_value

I was trying to figure out why the linear warmup component of my composed schedulers was disappearing; it led me here, and now I have a couple of questions:

It's not clear to me why the linear scheduler start value (param_schedulers["schedulers"][idx]["start_value"], which I'll refer to as linear_start_val for brevity) is being replaced instead of scaled. Is there a reason it's not scaled such that

linear_start_val = (linear_start_val / base_lr) * ((batch_size_per_node * num_nodes) / base_lr_batch_size)?

And why is linear warmup removed if only using a single node?
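
To make the proposal concrete, here is a small sketch of the proportional scaling being asked about. All names are illustrative (this is the questioner's proposed behavior, not VISSL's current one), and the numbers are made-up examples:

def rescale_warmup_start(linear_start_val, base_lr, batch_size_per_node,
                         num_nodes, base_lr_batch_size):
    # linear scaling rule used for the base LR
    scale_factor = (batch_size_per_node * num_nodes) / base_lr_batch_size
    scaled_lr = base_lr * scale_factor
    # keep the warmup ratio linear_start_val / base_lr intact
    return (linear_start_val / base_lr) * scaled_lr  # == linear_start_val * scale_factor

# e.g. warmup starting at 0.025 towards base LR 0.1, with 256 images per node
# on 4 nodes and a reference batch size of 256: the start value becomes 0.1.
print(rescale_warmup_start(0.025, 0.1, 256, 4, 256))  # 0.1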

How to run VISSL on video with a Data Transform that chooses a random frame of the video

❓ How to run an algorithm on video with a Data Transform that chooses a random frame of the video

Hi, I have been using the library for a while now and I am familiar with its main features.

I know that currently it only supports datasets containing images, but I would like to explore the possibility of using videos and I think I could reuse most of the code from the library while making two changes:

  1. Instead of loading an image, it should load a video.
  2. The Data Transformation to apply would be selecting a random frame from that video.

From what I have seen in the "Add new Data Transforms" documentation, point 2 should be feasible by adding a new class for my transform (see the sketch below). But I want to get some feedback on how feasible it would be to change the library's source code to load videos. Could you give me some feedback on this possibility, and some guidance on where the data loading actually happens, so that I can check?

Thank you for your help!
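
To make point 2 concrete, a minimal sketch of such a transform, following the registration pattern from the "Add new Data Transforms" documentation. The class name ImgPickRandomFrame is hypothetical, as is the assumption that a (future) video data source yields a list of PIL frames per sample; the image-only loading in point 1 is the part that would actually require source changes:

import random

from classy_vision.dataset.transforms import ClassyTransform, register_transform


@register_transform("ImgPickRandomFrame")  # hypothetical transform name
class ImgPickRandomFrame(ClassyTransform):
    # Pick one random frame from a decoded video, assuming the sample is a
    # list of PIL.Image frames; downstream image transforms apply unchanged.
    def __call__(self, frames):
        return random.choice(frames)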

No module named 'classy_vision'

Tried to run the example code described here inside a running Docker container built from ./docker/* and ended up with a ModuleNotFoundError.

Instructions To Reproduce the 🐛 Bug:

cd vissl/docker
image=cu101 ./build_docker.sh
docker run -it --shm-size=8gb --env="DISPLAY" vissl:1.0-cu101
python tools/run_distributed_engines.py config=pretrain/swav/swav_8node_resnet \
    config.DISTRIBUTED.NUM_PROC_PER_NODE=1 config.DISTRIBUTED.NUM_NODES=1

results in

** fvcore version of PathManager will be deprecated soon. **
** Please migrate to the version in iopath repo. **
https://github.com/facebookresearch/iopath

Traceback (most recent call last):
  File "tools/run_distributed_engines.py", line 13, in <module>
    from vissl.utils.distributed_launcher import (
  File "/home/vissluser/vissl/vissl/utils/distributed_launcher.py", line 15, in <module>
    from vissl.data.dataset_catalog import get_data_files
  File "/home/vissluser/vissl/vissl/data/__init__.py", line 8, in <module>
    from classy_vision.dataset import DataloaderAsyncGPUWrapper
ModuleNotFoundError: No module named 'classy_vision'

Environment:

Ubuntu 20.04.

License question

This GitHub repo (and several other Facebook SSL repos) uses a Creative Commons Attribution-NonCommercial 4.0 International Public License. I understand that I cannot sell this or modified copies of this code, and that I would need to license adapted versions of this code under the same license. But does this license mean that I am not allowed to run this code for commercial purposes? As an example, would I be able to use VISSL to pretrain a model on my own data and then use those weights in a commercial project? Taking it a step further, would I be able to take the pretrained weights distributed with VISSL, fine-tune them on another dataset, and then use the resulting weights for commercial purposes?

Implement Barlow Twins

🌟 New SSL approach addition

Approach description

Implement Barlow Twins (arxiv link).


Pseudocode (lightly adapted here into runnable PyTorch: lambda is renamed lambda_ since lambda is a reserved word, and the off-diagonal weighting is written as an explicit sum so it propagates through autograd):

# f: encoder network (callable)
# lambda_: weight on the off-diagonal terms
# N: batch size
# D: dimensionality of the representation
import torch

def off_diagonal(m):
    # flattened view of the off-diagonal elements of a square matrix
    n = m.size(0)
    return m.flatten()[:-1].view(n - 1, n + 1)[:, 1:].flatten()

for x in loader: # load a batch with N samples
    # two randomly augmented versions of x
    y_a, y_b = augment(x)

    # compute representations
    z_a = f(y_a) # NxD
    z_b = f(y_b) # NxD

    # normalize repr. along the batch dimension
    z_a_norm = (z_a - z_a.mean(0)) / z_a.std(0) # NxD
    z_b_norm = (z_b - z_b.mean(0)) / z_b.std(0) # NxD

    # cross-correlation matrix
    c = torch.mm(z_a_norm.T, z_b_norm) / N # DxD

    # loss: squared distance of c to the identity matrix, with the
    # off-diagonal terms weighted by lambda_
    c_diff = (c - torch.eye(D, device=c.device)).pow(2) # DxD
    loss = c_diff.diagonal().sum() + lambda_ * off_diagonal(c_diff).sum()

    # optimization step
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Open source status

The model implementation is not yet available. However, it will be open sourced at: https://github.com/facebookresearch/barlowtwins

  • the model implementation is available
  • the model weights are available
  • who are the authors: Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, Stéphane Deny

Hardware Used and Time Taken to Produce Models in Model Zoo

📚 VISSL Documentation

Hi,

I was wondering which hardware you used and how much time it took to train the models listed in the Model Zoo. These stats would be beneficial for a comprehensive comparison of the time complexity of the methods. Though these details can be found in the corresponding papers, different papers use different hardware, and documenting these details on similar hardware would be helpful.

Grateful,
Muhammad Maaz

Quick start tutorial failing on model creation

Following the steps in https://vissl.readthedocs.io/en/v0.1.5/getting_started.html, I get an error from the model build process, where vissl tries to load weights from a file that does not exist. Checking the default config file, I see MODEL.WEIGHTS_INIT.PARAMS_FILE is set to "", i.e., an empty string. However, using the PathManager from my fvcore installation (which came with installing vissl via pip), I see that PathManager.exists("") somehow evaluates to True. The line if PathManager.exists(init_weights_path): (line 380 in vissl/trainer/train_task.py) therefore fails to prevent the model from trying to load weights from an empty filename, and the code errors out.
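
A minimal sketch of the guard this suggests (a hypothetical helper, not the project's actual fix): treat an empty PARAMS_FILE as "train from scratch" before ever consulting PathManager:

from typing import Optional

from fvcore.common.file_io import PathManager  # matches the fvcore install above


def resolve_init_weights_path(params_file: str) -> Optional[str]:
    # An empty PARAMS_FILE means "no weights to initialize from"; checking
    # truthiness first avoids the PathManager.exists("") == True pitfall.
    if not params_file:
        return None
    return params_file if PathManager.exists(params_file) else None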

Instructions To Reproduce the Issue:

  1. Install vissl as pip install --user vissl in the pytorch:ngc-20.10 docker image
  2. Set up dataset config as described for imagenet_1k dataset (files stored in /tmp/)
  3. Run the following:
python3 run_distributed_engines.py \
     hydra.verbose=true \
     config.DATA.TRAIN.DATASET_NAMES=[imagenet1k_tmp] \
     config.DATA.TRAIN.DATA_SOURCES=[disk_folder] \
     config.DATA.TRAIN.DATA_PATHS=["/tmp/train"] \
     config=quick_1gpu_resnet50_simclr \
     config.CHECKPOINT.DIR="/global/cscratch1/sd/pharring/vissl/checkpoints" \
     config.CHECKPOINT.AUTO_RESUME=true \
     config.TENSORBOARD_SETUP.USE_TENSORBOARD=true
  4. With hydra.verbose=true, there is lots of output, so I will not include all the irrelevant details. Initially, there are some warnings from PathManager saying:
[PathManager] Attempting to register prefix 'http://' from the following call stack:
  File "run_distributed_engines.py", line 166, in _distributed_worker
    process_main(cfg, dist_run_id, local_rank=local_rank, node_id=node_id)
  File "run_distributed_engines.py", line 159, in process_main
    hook_generator=hook_generator,
  File "/global/homes/p/pharring/.local/cori/pytorch_ngc_20.10/lib/python3.6/site-packages/vissl/engines/train.py", line 60, in train_main
    set_env_vars(local_rank, node_id, cfg)
  File "/global/homes/p/pharring/.local/cori/pytorch_ngc_20.10/lib/python3.6/site-packages/vissl/utils/env.py", line 31, in set_env_vars
    PathManager.register_handler(HTTPURLHandler(), allow_override=True)
  File "/global/homes/p/pharring/.local/cori/pytorch_ngc_20.10/lib/python3.6/site-packages/fvcore/common/file_io.py", line 288, in register_handler
    + "".join(traceback.format_stack(limit=5))

[PathManager] Prefix 'http://' is already registered by <class 'iopath.common.file_io.HTTPURLHandler'>. We will override the old handler. To avoid such conflicts, create a project-specific PathManager instead.

This same chatter appears for ftp:// and https://, then a bunch of printouts of env variables and build settings, then the error with traceback:

INFO 2021-03-24 22:34:05,852 train_task.py: 378: Initializing model from: 
INFO 2021-03-24 22:34:05,852 util.py: 241: Broadcasting checkpoint loaded from 
Traceback (most recent call last):
  File "run_distributed_engines.py", line 194, in <module>
    hydra_main(overrides=overrides)
  File "run_distributed_engines.py", line 179, in hydra_main
    hook_generator=default_hook_generator,
  File "run_distributed_engines.py", line 123, in launch_distributed
    hook_generator=hook_generator,
  File "run_distributed_engines.py", line 166, in _distributed_worker
    process_main(cfg, dist_run_id, local_rank=local_rank, node_id=node_id)
  File "run_distributed_engines.py", line 159, in process_main
    hook_generator=hook_generator,
  File "/global/homes/p/pharring/.local/cori/pytorch_ngc_20.10/lib/python3.6/site-packages/vissl/engines/train.py", line 102, in train_main
    trainer.train()
  File "/global/homes/p/pharring/.local/cori/pytorch_ngc_20.10/lib/python3.6/site-packages/vissl/trainer/trainer_main.py", line 155, in train
    self.task.prepare(pin_memory=self.cfg.DATA.PIN_MEMORY)
  File "/global/homes/p/pharring/.local/cori/pytorch_ngc_20.10/lib/python3.6/site-packages/vissl/trainer/train_task.py", line 634, in prepare
    self.base_model = self._build_model()
  File "/global/homes/p/pharring/.local/cori/pytorch_ngc_20.10/lib/python3.6/site-packages/vissl/trainer/train_task.py", line 462, in _build_model
    model = self._restore_model_weights(model)
  File "/global/homes/p/pharring/.local/cori/pytorch_ngc_20.10/lib/python3.6/site-packages/vissl/trainer/train_task.py", line 399, in _restore_model_weights
    append_prefix=append_prefix,
  File "/global/homes/p/pharring/.local/cori/pytorch_ngc_20.10/lib/python3.6/site-packages/vissl/utils/checkpoint.py", line 404, in init_model_from_weights
    state_dict_key_name in state_dict.keys()
AttributeError: 'NoneType' object has no attribute 'keys'

Expected behavior:

The model should build successfully, without trying to load any weights upon initialization. I could be doing something wrong, but I am having trouble seeing where, as I followed the quick start steps closely!

Environment:

Full environment shown below:

--------------------  -----------------------------------------------------------------------------------------------------------
sys.platform          linux
Python                3.6.10 |Anaconda, Inc.| (default, May  8 2020, 02:54:21) [GCC 7.3.0]
numpy                 1.19.1
Pillow                8.0.1
vissl                 0.1.5 @/global/homes/p/pharring/.local/cori/pytorch_ngc_20.10/lib/python3.6/site-packages/vissl
GPU available         True
GPU 0,1               Tesla V100-SXM2-16GB
CUDA_HOME             /usr/local/cuda
TORCH_CUDA_ARCH_LIST  5.2 6.0 6.1 7.0 7.5 8.0 8.6+PTX
torchvision           0.8.0a0 @/opt/conda/lib/python3.6/site-packages/torchvision
hydra                 1.0.6 @/global/homes/p/pharring/.local/cori/pytorch_ngc_20.10/lib/python3.6/site-packages/hydra
classy_vision         0.6.0.dev @/global/homes/p/pharring/.local/cori/pytorch_ngc_20.10/lib/python3.6/site-packages/classy_vision
tensorboard           1.15.0
apex                  0.1 @/opt/conda/lib/python3.6/site-packages/apex
cv2                   3.4.1
PyTorch               1.7.0a0+7036e91 @/opt/conda/lib/python3.6/site-packages/torch
PyTorch debug build   False
--------------------  -----------------------------------------------------------------------------------------------------------
PyTorch built with:
  - GCC 7.5
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2019.0.1 Product Build 20180928 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.5.0 (Git Hash N/A)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_86,code=compute_86
  - CuDNN 8.0.4
  - Magma 2.5.2
  - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, FORCE_FALLBACK_CUDA_MPI=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=ON, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

CPU info:
-------------------  ----------------------------------------
Architecture         x86_64
CPU op-mode(s)       32-bit, 64-bit
Byte Order           Little Endian
CPU(s)               80
On-line CPU(s) list  0-79
Thread(s) per core   2
Core(s) per socket   20
Socket(s)            2
NUMA node(s)         2
Vendor ID            GenuineIntel
CPU family           6
Model                85
Model name           Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Stepping             4
CPU MHz              2400.000
CPU max MHz          3700.0000
CPU min MHz          1000.0000
BogoMIPS             4800.00
Virtualization       VT-x
L1d cache            32K
L1i cache            32K
L2 cache             1024K
L3 cache             28160K
NUMA node0 CPU(s)    0-19,40-59
NUMA node1 CPU(s)    20-39,60-79
-------------------  ----------------------------------------

Reproducing the Model Zoo SimCLR top-1 accuracy

I'm attempting to reproduce the SimCLR numbers reported in the Model Zoo README with the help of the pre-trained models.

VISSL seems to be fully functional on my machine (i.e. I can run training scripts). The steps I've taken are as follows:

  1. Downloaded ImageNet-2012 train and validation from https://image-net.org/challenges/LSVRC/2012/2012-downloads.php
  2. Ran this script to structure the validation set according to the VISSL documentation (the downloaded validation images don't come with subfolders like the training set does, for some reason).
  3. I downloaded the "RN50 - 100 epoch" model from here and put it in a folder called "model_zoo".
  4. The model zoo for SimCLR points to this json file to reproduce the numbers. I'm not sure how to run this with VISSL, so I decided to manually launch a training run via the following configuration: benchmark/linear_image_classification/imagenet1k/eval_resnet_8gpu_transfer_in1k_linear.yaml
  5. I only have one GPU available, so I launch the following command:
    python tools/run_distributed_engines.py config=benchmark/linear_image_classification/imagenet1k/eval_resnet_8gpu_transfer_in1k_linear.yaml config.DISTRIBUTED.NUM_PROC_PER_NODE=1 config.MODEL.SYNC_BN_CONFIG.SYNC_BN_TYPE=pytorch

This seems to run fine. My question is: am I doing this correctly, and will I achieve a top-1 accuracy close to the 64.4 reported in the table?
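
For reference, here is a hedged sketch of how the downloaded checkpoint would typically be attached to that command via an extra Hydra override; MODEL.WEIGHTS_INIT.PARAMS_FILE is the config key referenced elsewhere in VISSL for weight initialization, but the local filename below is hypothetical:

python tools/run_distributed_engines.py \
  config=benchmark/linear_image_classification/imagenet1k/eval_resnet_8gpu_transfer_in1k_linear.yaml \
  config.DISTRIBUTED.NUM_PROC_PER_NODE=1 \
  config.MODEL.SYNC_BN_CONFIG.SYNC_BN_TYPE=pytorch \
  config.MODEL.WEIGHTS_INIT.PARAMS_FILE=./model_zoo/resnet50_simclr_100ep.torch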

Pre-release announcement: Implementing vision transformers in vissl

🌟 Vision transformers are coming to vissl

Approach description

Transformers have recently demonstrated promising results for vision tasks in supervised contexts (Dosovitskiy et al., 2020; Touvron et al., 2020). We are currently working with the ClassyVision team to carry their vision transformers forward into vissl.

Open source status

The implementation is currently underway, and we plan to release it within the next week. @vedanuj and I (@growlix) are working with @mannatsingh and @anmolkalia from the ClassyVision team.

[feature] Dataset Catalog Environment Variable for Local Development

πŸš€ Feature

Add the ability to use a different dataset_catalog.json through an environment variable and/or the hydra config.

Motivation & Examples

The immediate use case is local development: we could have a dataset_catalog_local.json that is gitignored for ease of local development. This will be helpful for managing git when we need to change the dataset catalog.

Other use cases include having different dataset catalogs for different environments, etc.

The idea would be to set an environment variable called VISSL_DATASET_CATALOG_PATH. Then, in get_json_data_catalog_file, we look for the environment variable and fall back to the default dataset_catalog.json.

We could potentially also put the catalog path in the YAML config. We might be able to do both with the following hierarchy: YAML config => environment variable => default, as sketched below.
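
A minimal sketch of that lookup order, assuming get_json_data_catalog_file can receive the parsed config (treated as a plain dict here); the DATASET_CATALOG_PATH config key and the default path below are hypothetical:

import os

DEFAULT_CATALOG_PATH = "configs/config/dataset_catalog.json"  # assumed default location

def get_json_data_catalog_file(cfg=None):
    # 1. An explicit path in the YAML config wins (hypothetical key).
    if cfg is not None and cfg.get("DATASET_CATALOG_PATH"):
        return cfg["DATASET_CATALOG_PATH"]
    # 2. Fall back to the environment variable proposed in this issue.
    env_path = os.environ.get("VISSL_DATASET_CATALOG_PATH")
    if env_path:
        return env_path
    # 3. Finally, the default catalog shipped with the repo.
    return DEFAULT_CATALOG_PATH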

Example Colab file in the top-level README is no longer working

πŸ“š VISSL Documentation

I'm trying to execute the VISSL example notebook Using a pretrained model for inference from the main README.md and running into execution errors deep in the VISSL code (see screenshot below). It seems the schema changed, since I cannot access the attribute when dotting into the config object either.

[screenshot: execution error raised deep in the VISSL code]

Homepage - https://vissl.ai/

Hi

Would it make sense to update the main project homepage to make it clear that it's a computer vision / image-related library? It's not really mentioned on that page, especially in the main tag line.

Tony

Wrong package apex gets installed while following the documentation

πŸ“š VISSL Documentation

When installing VISSL using pip in a venv, the documentation states:

pip install apex -f https://dl.fbaipublicfiles.com/vissl/packaging/apexwheels/py38_cu101_pyt151/download.html

which installs the PyPI package apex (https://pypi.org/project/apex/). Instead, the order of the arguments should be reversed:

pip install -f https://dl.fbaipublicfiles.com/vissl/packaging/apexwheels/py38_cu101_pyt151/download.html apex

System: Ubuntu 20.04.2 LTS, pip 20.0.2

Error in conda package installation

Hi,

I'm trying to install the package following the instructions

conda create -n vissl python=3.8
conda activate vissl
conda install -c pytorch pytorch=1.7.1 torchvision cudatoolkit=10.2
conda install -c vissl -c iopath -c conda-forge -c pytorch -c defaults apex vissl

However the last step seems to be failing : I'm getting

(vissl) lucaspc@devfair0224:~/repos$ conda install -c vissl -c iopath -c conda-forge -c pytorch -c defaults apex vissl

Solving environment: failed

InvalidVersionSpecError: Invalid version spec: =2.7

I'm not sure where the error is coming from; there may be some version conflict.

Thanks,
Lucas

U-Net Integration

❓ How to do something using VISSL

Hello. Are there any plans to add support for U-Net in the near future? Unfortunately it is not on the list of the provided trunks.

If this is not planned, how would you suggest going about it? Our research project requires the use of U-Net for analysis and comparison. We would love to use VISSL to help with the implementation, but this one piece is missing.

Thank you in advance.

Attribute error: NoneType has no 'keys'

I have been using VISSL for a couple of weeks and just ran into a problem when setting it up on a new machine (this didn't happen on the old machine). Every time I want to start a pre-training run I get the following error:

File "/opt/conda/envs/vissl/lib/python3.8/site-packages/vissl/utils/checkpoint.py", line 404, in init_model_from_weights
state_dict_key_name in state_dict.keys()
AttributeError: 'NoneType' object has no attribute 'keys'

This also happens when just trying to replicate your tutorials, such as Training SimCLR on 1-gpu with VISSL.

Thank you.


--------------------  ----------------------------------------------------------------
sys.platform          linux
Python                3.8.5 (default, Sep 4 2020, 07:30:14) [GCC 7.3.0]
numpy                 1.19.2
Pillow                8.1.2
vissl                 0.1.5 @/opt/conda/envs/vissl/lib/python3.8/site-packages/vissl
GPU available         True
GPU 0,1,2,3           Tesla V100-SXM2-16GB
CUDA_HOME             /usr/local/cuda
torchvision           0.8.2 @/opt/conda/envs/vissl/lib/python3.8/site-packages/torchvision
hydra                 1.0.6 @/opt/conda/envs/vissl/lib/python3.8/site-packages/hydra
classy_vision         0.6.0.dev @/opt/conda/envs/vissl/lib/python3.8/site-packages/classy_vision
tensorboard           2.4.1
apex                  0.1 @/opt/conda/envs/vissl/lib/python3.8/site-packages/apex
PyTorch               1.7.1 @/opt/conda/envs/vissl/lib/python3.8/site-packages/torch
PyTorch debug build   False
--------------------  ----------------------------------------------------------------


PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 10.1
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  • CuDNN 7.6.3
  • Magma 2.5.2
  • Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

CPU info:


-------------------  ----------------------------------------
Architecture         x86_64
CPU op-mode(s)       32-bit, 64-bit
Byte Order           Little Endian
CPU(s)               32
On-line CPU(s) list  0-31
Thread(s) per core   2
Core(s) per socket   16
Socket(s)            1
NUMA node(s)         1
Vendor ID            GenuineIntel
CPU family           6
Model                79
Model name           Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping             1
CPU MHz              2698.234
CPU max MHz          3000.0000
CPU min MHz          1200.0000
BogoMIPS             4600.00
Hypervisor vendor    Xen
Virtualization type  full
L1d cache            32K
L1i cache            32K
L2 cache             256K
L3 cache             46080K
NUMA node0 CPU(s)    0-31
-------------------  ----------------------------------------


Utilities to read configurations and load VISSL pre-trained models

πŸš€ Feature

Following the issue raised in #235, I propose to add a few helper functions which would make using VISSL even more straightforward:

  • a function to load a configuration in one shot, handling everything done in run_distributed_engines.py: merging with defaults.yaml, inferring missing elements, and so on
  • a function in vissl.utils.checkpoint to help initialize a model built with build_model from a checkpoint easily

Motivation & Examples

Issue #235 has shown that using a pre-trained model in inference mode is not straightforward and could be improved. Ultimately, it would be nice to have the following features:

# Loading a configuration:
config = vissl.config.load("configs/config/pretrain/...")

# Building a model
model = vissl.models.build_model(config.MODEL, config.OPTIMIZER)

# Initializing the model
vissl.utils.checkpoint.load_weights_from_checkpoint(model, config.MODEL.WEIGHTS_INIT.PARAMS_FILE)
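
As a starting point, a hedged sketch of how the proposed configuration loader could be implemented on top of OmegaConf; the defaults.yaml path and the merge order are assumptions based on the description above, not VISSL's actual internals:

from omegaconf import OmegaConf

def load_config(config_path, defaults_path="vissl/config/defaults.yaml"):
    """Merge a user config on top of VISSL's defaults in one shot."""
    defaults = OmegaConf.load(defaults_path)    # assumed location of defaults.yaml
    user_cfg = OmegaConf.load(config_path)
    return OmegaConf.merge(defaults, user_cfg)  # user values override defaults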

Performance of reproduced MoCo-v2

Thanks for your wonderful codebase! It makes conducting various experiments much more convenient.

But I have a question about the reported MoCo-v2 numbers in your MODEL_ZOO. It achieves 66.4% top-1 accuracy for linear classification, while the official implementation of MoCo-v2 reaches 67.7%. Considering that the std of MoCo-v2 is very small (around 0.1%), the 1.3% gap between the implementations might suggest some misalignments (e.g. hyperparameters). Do you have any idea?

Register custom model / loops outside of the VISSL directory

πŸš€ Feature

I would like to be able to integrate custom code, such as custom model heads / trunks or custom training loops, without having to put it directly in the vissl directory.

Motivation & Examples

Like many users, I like to download my packages via pip or conda. Right now I have to manually copy my new models to the vissl directory within my environment. The main issue I have with this is that I have to manually keep track of my added files should I do a vissl update.

Note

Something similar is already established for datasets, where I can register a new one outside of the main codebase. If something like this existed for models, that would be great; a generic sketch of the mechanism follows below.
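
For illustration, a generic registry-decorator sketch of the mechanism being requested; the registry dict and the register_model_trunk name are illustrative, not VISSL's actual API. A user-owned package could call the decorator at import time, so the custom code would live entirely outside the vissl directory:

MODEL_TRUNK_REGISTRY = {}

def register_model_trunk(name):
    """Decorator that records a trunk class under a lookup name."""
    def wrapper(cls):
        MODEL_TRUNK_REGISTRY[name] = cls
        return cls
    return wrapper

# In a user's own package, outside the vissl source tree:
@register_model_trunk("my_custom_trunk")
class MyCustomTrunk:
    def forward(self, x):
        ...  # custom trunk logic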
