
barlowtwins's Introduction

Barlow Twins: Self-Supervised Learning via Redundancy Reduction

[figure: Barlow Twins method diagram]

PyTorch implementation of Barlow Twins.

@article{zbontar2021barlow,
  title={Barlow Twins: Self-Supervised Learning via Redundancy Reduction},
  author={Zbontar, Jure and Jing, Li and Misra, Ishan and LeCun, Yann and Deny, St{\'e}phane},
  journal={arXiv preprint arXiv:2103.03230},
  year={2021}
}

Pretrained Model

| epochs | batch size | acc1 | acc5 | download |
|--------|------------|------|------|----------|
| 1000 | 2048 | 73.5% | 91.0% | ResNet-50 / full checkpoint / train logs / val logs |

You can choose to download either the weights of the pretrained ResNet-50 network or the full checkpoint, which also contains the weights of the projector network and the state of the optimizer.

The pretrained model is also available on PyTorch Hub.

import torch
model = torch.hub.load('facebookresearch/barlowtwins:main', 'resnet50')
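Continuing from the snippet above, a quick sanity check of the loaded backbone might look like this (a sketch; the random tensor is just a stand-in for a batch of normalized 224×224 images):

    model.eval()
    x = torch.randn(2, 3, 224, 224)   # stand-in for a normalized image batch
    with torch.no_grad():
        out = model(x)
    print(out.shape)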

Barlow Twins Training

Install PyTorch and download ImageNet by following the instructions in the requirements section of the PyTorch ImageNet training example. The code has been developed for PyTorch version 1.7.1 and torchvision version 0.8.2, but it should work with other versions just as well.

Our best model is obtained by running the following command:

python main.py /path/to/imagenet/

Training time is approximately 7 days on 16 V100 GPUs.

Evaluation: Linear Classification

Train a linear probe on the representations learned by Barlow Twins. Freeze the weights of the ResNet and use the entire ImageNet training set.

python evaluate.py /path/to/imagenet/ /path/to/checkpoint/resnet50.pth --lr-classifier 0.3

Evaluation: Semi-supervised Learning

Train a linear probe on the representations learned by Barlow Twins. Fine-tune the weights of the ResNet and use a subset of the ImageNet training set.

python evaluate.py /path/to/imagenet/ /path/to/checkpoint/resnet50.pth --weights finetune --train-perc 1 --epochs 20 --lr-backbone 0.005 --lr-classifier 0.5 --weight-decay 0 --checkpoint-dir ./checkpoint/semisup/

Community Updates

Let us know about all the cool stuff you are able to do with Barlow Twins so that we can advertise it here!

License

This project is released under MIT License, which allows commercial use. See LICENSE for details.


barlowtwins's Issues

Barlow Twins loss on identical vector

Hello, I really enjoyed reading the paper and thought about the intentions of the loss.

However, I was wondering whether setting the target matrix to the identity matrix is valid.

As far as I understand, each element of the cross-correlation matrix is an inner product over the batch between a pair of feature dimensions. The Barlow Twins loss aims for a correlation of 1 on the diagonal and 0 (no correlation) on the off-diagonal elements.

So, if two identical representation vectors were fed to the loss, I thought it should give a loss of zero, but it didn't.

For the sake of simplicity, let's say we have 2 pairs of representation vectors with identical values (that's 4 vectors). However, when I took two identical 1d vectors for the 2 data points, applied batch norm, and computed the Barlow Twins loss on them, I got 1 on the diagonal but not 0 on the off-diagonal elements.

The same thing happens when the batch size is 1 (though batch norm then normalizes the values to zero).

I'm not sure how the loss learns invariance and redundancy reduction with the identity matrix as its target, especially through the redundancy term. Can you please elaborate on how the representation vectors learn within the redundancy term?

Here's a simple example I tried, following the code implementation.

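A minimal sketch of this kind of experiment (a hypothetical reconstruction, since the original screenshot is not reproduced here; it reuses the off_diagonal helper from main.py):

    import torch
    from torch import nn

    def off_diagonal(x):
        # as in main.py: flattened view of the off-diagonal elements of a square matrix
        n, m = x.shape
        assert n == m
        return x.flatten()[:-1].view(n - 1, n + 1)[:, 1:].flatten()

    torch.manual_seed(0)
    N, D = 2, 4                        # batch of 2, feature dimension 4
    z = torch.randn(N, D)
    bn = nn.BatchNorm1d(D, affine=False)

    # feed the SAME tensor as both "views"
    c = bn(z).T @ bn(z) / N
    print(torch.diagonal(c))           # ~1 on the diagonal ...
    print(off_diagonal(c))             # ... but generally nonzero off the diagonal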

Thank you!

Issue loading checkpoint.pth file

Thank you very much for providing your code on GitHub. There is a problem when trying to load the checkpoint.pth file. If I load it using your latest code, it gives the error AttributeError: Can't get attribute 'exclude_bias_and_norm' on <module '__main__'>.

One can easily fix it by adding the following function to the file (not nested anywhere):

def exclude_bias_and_norm(p):
    return p.ndim == 1
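With that function defined at the top level of the loading script, unpickling should go through. A minimal sketch (the checkpoint path is a placeholder):

    import torch

    def exclude_bias_and_norm(p):
        return p.ndim == 1

    # the pickled checkpoint references exclude_bias_and_norm by name,
    # so it must be resolvable in __main__ when torch.load runs
    ckpt = torch.load('checkpoint.pth', map_location='cpu')
    print(ckpt.keys())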

Commit 21149b45bda50e579f166a4e261f281924b7c208 from August 5th actually introduced this "bug", as you very likely kept the originally saved model but changed the code.

It is a very minor thing, but I thought I'd point it out. The problem stems from how PyTorch serializes models; hence, I hope that by opening this issue, other researchers hitting the same error might see this easy fix.

Here you can find a Jupyter notebook detailing the problem and the fix:
BugReport.zip

Evaluating a checkpoint

I am trying to evaluate a checkpoint during training, but when I use the evaluation script I get the following error:

    AttributeError: Can't get attribute 'exclude_bias_and_norm' on <module '__mp_main__' from '/home/ubuntu/Projects/barlowtwins/evaluate.py'>

Can you please have a look?

configs of training with batch size 256

Can you provide the config to train with batch size 256? How should lambd and the loss scale be set? And is LARS still necessary when training with small batch sizes?

Possible bug on the loss computation

I could be wrong, but reviewing the code, I think that there shouldn't be a pow_(2) on the line

    off_diag = off_diagonal(c).pow_(2).sum()

My reason is that the power of 2 is already applied in-place and element-wise on the correlation matrix in this line:

    on_diag = torch.diagonal(c).add_(-1).pow_(2).sum()

So presently we are computing the 4th power of the off-diagonal terms, but the original paper considers only their square:
https://arxiv.org/pdf/2103.03230.pdf

use barlowtwins predict but blocked and can not be killed

Hi, thank you guys for sharing this amazing project.
Training went very easily, but when I use the model to test or predict, the process blocks and cannot be killed.
here is my predictor code:

    import torch
    import argparse
    from PIL import Image
    from torch import nn, optim
    from main import BarlowTwins, Transform

    """
    BarlowTwins Predictor
    """

    def load_model(args):
        args.ngpus_per_node = torch.cuda.device_count()
        args.rank = 0
        args.dist_url = 'tcp://localhost:58472'
        args.world_size = args.ngpus_per_node
        gpu = 0
        model = BarlowTwins(args).cuda(gpu)
        model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
        torch.distributed.init_process_group(
            backend='nccl', init_method=args.dist_url,
            world_size=args.world_size, rank=args.rank)
        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[gpu])
        best_cp = torch.load("model/barlowtwins/bt_face.pth")
        # model.module.backbone.state_dict(best_cp)
        model.load_state_dict(best_cp)
        model.eval()
        return model

    def predict(t1, t2, model, device):
        model.to(device)
        with torch.no_grad():
            t1 = t1.to(torch.device("cuda:1" if torch.cuda.is_available() else "cpu"))
            t2 = t2.to(torch.device("cuda:2" if torch.cuda.is_available() else "cpu"))
            out_data = model(t1, t2)
        return out_data

    def predict_img(img_url, args):
        img = Image.open(img_url)
        bt_trans = Transform()
        t1, t2 = bt_trans(img)
        t1 = t1.unsqueeze(0)
        t2 = t2.unsqueeze(0)
        model = load_model(args)
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        output = predict(t1, t2, model, device)
        return output

    if __name__ == "__main__":
        parser = argparse.ArgumentParser(description='Barlow Twins Predict')
        parser.add_argument('--workers', default=8, type=int, metavar='N',
                            help='number of data loader workers')
        parser.add_argument('--projector', default='8192-8192-8192', type=str,
                            metavar='MLP', help='projector MLP')
        parser.add_argument("-d", default="data/test/1.jpg")
        args = parser.parse_args()
        img_url = str(args.d)
        result = predict_img(img_url, args)
        print(result)

Would you please tell me which part is wrong? I'm very much hoping for your answer!

Bug Issues: loaded state dict has a different number of parameter groups

Sorry to disturb you; I encountered a bug. I tried to use the model you released and verify it on the classification task:

    python evaluate.py /path/to/imagenet/ /path/to/checkpoint/resnet50.pth --lr-classifier 0.1

When I load your released trained model using load_state_dict, it raises the following error:

    File "/home/amy/.local/lib/python3.7/site-packages/torch/optim/optimizer.py", line 118, in load_state_dict
        raise ValueError("loaded state dict has a different number of "
    ValueError: loaded state dict has a different number of parameter groups

However, when I try to use the model for semi-supervised learning, it runs successfully:

    python evaluate.py /path/to/imagenet/ /path/to/checkpoint/resnet50.pth --weights finetune --train-perc 1 --epochs 20 --lr-backbone 0.002 --lr-classifier 0.5 --weight-decay 0

Improvement over filtering bias and bn

An improvement can be made in main.py: instead of recomputing the filtered parameters in LARS at every step, one can pass two functions, weight_decay_filter and lars_adaptation_filter, that always return True (i.e. always filter) to the parameter group that should be excluded.

def exclude_param(p: torch.nn.parameter.Parameter):
    return True
parameters = [
        {"params": param_weights},
        {
            "params": param_biases,
            "weight_decay_filter": exclude_param,
            "lars_adaptation_filter": exclude_param,
        },
    ]

then

optimizer = LARS(parameters, lr=0, weight_decay=args.weight_decay)

In LARS.step, the weights group will then not be filtered, as its weight_decay_filter and lars_adaptation_filter are left as None, unlike those of the param_biases group.

A question on the BT loss with Batch Norm layers

Hi, thanks a lot for the very clear implementation; the paper is so easy to read! I had a quick question about the Barlow Twins loss.

Since Barlow Twins relies on the statistics of a batch of data, if there are batch norm layers in the encoder network, is it possible that the parameters of the BN layers be updated/affected more than any other parameters in the network to optimize the BT loss? Did you experiment without any batchnorm layers in the encoder to see if that affects the learned representations?

Are negative samples required?

From the discussion section of the paper:

"Another common point between the two losses is that they both rely on batch statistics to measure this variability. However, the InfoNCE objective maximizes the variability of the representation by maximizing the pairwise distance between all pairs of samples, whereas our method does so by decorrelating the components of the representation vectors."

This confuses me.

My understanding is that the cross-correlation matrix only considers the decorrelation between the two augmentations of the same image. As such, we don't need the negative samples present in the batch. Then how and why are batch statistics used in decorrelating the representation vectors? Could you please explain whether the negative samples are indeed considered in the loss function or not?

Start index for each epoch

Could anyone explain why each epoch does not start at index 0, but rather at epoch * len(loader)?
Lines 123 to 125:

    for epoch in range(start_epoch, args.epochs):
        sampler.set_epoch(epoch)
        for step, ((y1, y2), _) in enumerate(loader, start=epoch * len(loader)):
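For illustration, here is a toy version of what that start offset produces; my understanding (an assumption, worth verifying against main.py) is that this makes step a global batch counter across the whole run, which the learning-rate schedule consumes:

    # toy loader with 3 batches per epoch
    loader = [0, 1, 2]
    for epoch in range(2):
        for step, batch in enumerate(loader, start=epoch * len(loader)):
            print(epoch, step)   # epoch 0 -> steps 0, 1, 2 ; epoch 1 -> steps 3, 4, 5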

BT loss value on val set every N training epochs without classifier

Hi,

Thanks for making it painless for others to use and build upon your work.

I'm curious to know what the loss function value looks like during the training process on the val set of images. In my case of using spectrogram images on a resnet-18 backbone, the network trains well but the val loss value is very noisy and shows no clear trend. For example, I'm not sure what to make of this:

[plot: per-epoch validation loss, noisy with no clear trend]

In the original experiments, was the BT loss value on the val set tracked every N epochs (without any classifier involved)? As far as I can tell, the loss reported in the 'val logs' txt file is that of the classifier trained on top of the embeddings generated using a single training checkpoint.

The loss curve on the val set should be expected to follow typical behavior, right? Any pointers as to what might be causing a noisy val loss?

Discrepancy in the semi-supervised results for 10% data

Hi,

I was able to reproduce your linear probe results for BarlowTwins:
{"epoch": 99, "acc1": 73.266, "acc5": 91.094, "best_acc1": 73.29, "best_acc5": 91.108}.

I have also computed the semi-supervised results. Although the results for 1% data match, the results for 10% data are slightly different. I used the same hyperparameters for 10% as for 1% data, since you only mentioned the parameters for 1%; maybe that is the difference. If the parameters differ, can you please clarify them for the 10% setting?

1% data: {"epoch": 19, "acc1": 55.09, "acc5": 79.894, "best_acc1": 55.102, "best_acc5": 79.894}
10% data: {"epoch": 19, "acc1": 68.89, "acc5": 88.966, "best_acc1": 68.89, "best_acc5": 88.966}

Implementation question

I am wondering what the purpose of the following code block is.

barlowtwins/main.py

Lines 90 to 96 in a655214

    param_weights = []
    param_biases = []
    for param in model.parameters():
        if param.ndim == 1:
            param_biases.append(param)
        else:
            param_weights.append(param)
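For context, these two lists then become separate optimizer parameter groups, so that weight decay and the LARS adaptation can be skipped for one-dimensional parameters (biases and batch-norm weights); roughly, paraphrasing the lines that follow in main.py:

    parameters = [{'params': param_weights}, {'params': param_biases}]
    optimizer = LARS(parameters, lr=0, weight_decay=args.weight_decay,
                     weight_decay_filter=True,
                     lars_adaptation_filter=True)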

Why do we average out correlation matrices from different GPUs? Is this mathematically valid?

Thanks for this great work!

I am a bit confused about the computation of the Barlow Twins loss in the multi-GPU setting. If I understand it correctly, each batch is split into smaller minibatches, and these are then processed on separate GPUs. Each GPU computes the cross-correlation matrix corresponding to its minibatch; the cross-correlations between samples on different GPUs are not computed. It is not clear to me why the different cross-correlation matrices are averaged across GPUs. This creates a mean correlation matrix, which is then used for the loss computation.

Why not compute the loss for each correlation matrix separately and only average the final losses? Or, even better, why not compute the full cross-correlation matrix (i.e. gather all embedding vectors onto one device and compute the cross-correlation there)?

I fail to see why summing up correlation matrices is a valid mathematical operation, or is it just an implementation "hack" that makes things easier? I guess that since all cross-correlation matrices ideally converge towards identity matrices (as forced by the loss function), averaging them does not strictly break the convergence; is that the case?

I am not very experienced with distributed deep learning so there may be technical things I don't understand. Thanks for your help.

barlowtwins/main.py

Lines 206 to 223 in a655214

    def forward(self, y1, y2):
        z1 = self.projector(self.backbone(y1))
        z2 = self.projector(self.backbone(y2))

        # empirical cross-correlation matrix
        c = self.bn(z1).T @ self.bn(z2)

        # sum the cross-correlation matrix between all gpus
        c.div_(self.args.batch_size)
        torch.distributed.all_reduce(c)

        on_diag = torch.diagonal(c).add_(-1).pow_(2).sum()
        off_diag = off_diagonal(c).pow_(2).sum()
        loss = on_diag + self.args.lambd * off_diag
        return loss
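For what it's worth, here is a small numerical sketch of why summing partial matrices works (assuming, as with SyncBatchNorm, that the normalization statistics are global): the full-batch cross-correlation is a sum over samples, so summing per-GPU partial sums reproduces it exactly:

    import torch

    torch.manual_seed(0)
    N, D, shards = 8, 4, 2
    z1, z2 = torch.randn(N, D), torch.randn(N, D)   # stand-ins for globally batch-normalized features

    full = z1.T @ z2 / N                            # full-batch cross-correlation

    # each "GPU" holds one shard and computes a partial sum
    parts = sum(z1[i::shards].T @ z2[i::shards] for i in range(shards)) / N
    print(torch.allclose(full, parts))              # True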

Quality of Embeddings

Hello,

Thanks for publishing your amazing work. Have you done any investigations into the quality of the embeddings for downstream tasks, rather than the representations? I think the SimCLR paper did some, showing that the representations were better; but since the embeddings are nicely disentangled in your work, I was wondering if Barlow Twins' embeddings might be better, especially when disentanglement is needed.

Greetings

Possible to release projector weights too?

Hi,

Thanks for releasing the pre-trained weights for the ResNet-50 backbone! Would it be possible to also release the weights of the pre-trained projector associated with the pre-trained ResNet-50?

This would be extra helpful for downstream tasks where there is considerably more unlabeled than labeled data, making it easy to fine-tune the whole model (backbone encoder + projector) by minimizing the BT loss on the unlabeled samples prior to training a classifier on the labeled subset.

Will BarlowTwins overfit on the training data?

Hi, team members of Barlow Twins. I have a question about the framework of this algorithm.
When the output dimension of the projection layers grows, the total number of trainable parameters grows hugely. Does any overfitting occur?
Also, in the paper, why are the transfer learning results not as good as those of supervised transfer learning? Could the reason be that Barlow Twins over-focuses on the features of the training data?

Look forward to your reply, thank you.

Shuffling BN and Single GPU

If we train using a single GPU, will we face the issue of information leakage through BN, as mentioned in the MoCo paper?

Thanks

loss is NaN for some batches/steps

EDIT - After more investigation, the issue is likely with my input data/features. Marking as closed. Sorry about that!

Hi,

Thanks for making it super easy to apply this architecture to other datasets.

I'm working with spectrogram images containing 10 channels and a ResNet-18 backbone, currently using the Adam optimizer with single-GPU training. My loss computation results in NaNs for some batches early on (shown below) and then for the whole epoch later (not shown), and I'm not sure why. I was wondering if you had any thoughts on why that might be the case. A snapshot of the output logs (with on_diag and off_diag printed for each batch/step) is given below:

    tensor(nan, device='cuda:0', grad_fn=) tensor(nan, device='cuda:0', grad_fn=)
    tensor(0.0051, device='cuda:0', grad_fn=) tensor(124.6055, device='cuda:0', grad_fn=)
    tensor(0.0047, device='cuda:0', grad_fn=) tensor(116.0811, device='cuda:0', grad_fn=)
    tensor(0.0038, device='cuda:0', grad_fn=) tensor(116.7065, device='cuda:0', grad_fn=)
    tensor(nan, device='cuda:0', grad_fn=) tensor(nan, device='cuda:0', grad_fn=)
    tensor(0.0080, device='cuda:0', grad_fn=) tensor(107.8073, device='cuda:0', grad_fn=)
    tensor(nan, device='cuda:0', grad_fn=) tensor(nan, device='cuda:0', grad_fn=)
    tensor(0.0051, device='cuda:0', grad_fn=) tensor(115.6731, device='cuda:0', grad_fn=)
    tensor(0.0027, device='cuda:0', grad_fn=) tensor(132.3802, device='cuda:0', grad_fn=)
    tensor(nan, device='cuda:0', grad_fn=) tensor(nan, device='cuda:0', grad_fn=)
    tensor(0.0864, device='cuda:0', grad_fn=) tensor(94.7531, device='cuda:0', grad_fn=)
    tensor(nan, device='cuda:0', grad_fn=) tensor(nan, device='cuda:0', grad_fn=)
    {"epoch": 1, "batch": 0, "learning_rate": 0.0, "loss": NaN, "time": "0:01:28.300381"}
    tensor(0.0051, device='cuda:0', grad_fn=) tensor(116.8903, device='cuda:0', grad_fn=)
    tensor(0.0040, device='cuda:0', grad_fn=) tensor(117.0997, device='cuda:0', grad_fn=)
    tensor(0.0031, device='cuda:0', grad_fn=) tensor(128.7704, device='cuda:0', grad_fn=)
    tensor(0.0024, device='cuda:0', grad_fn=) tensor(120.1153, device='cuda:0', grad_fn=)
    tensor(0.0047, device='cuda:0', grad_fn=) tensor(112.5160, device='cuda:0', grad_fn=)
    tensor(0.0065, device='cuda:0', grad_fn=) tensor(116.6054, device='cuda:0', grad_fn=)
    tensor(0.0031, device='cuda:0', grad_fn=) tensor(120.2410, device='cuda:0', grad_fn=)

transforms.Solarize probability is zero for transforms but not transforms_prime

I was viewing the transforms class and I saw this

        self.transform = transforms.Compose([
            transforms.RandomResizedCrop(224, interpolation=Image.BICUBIC),
            transforms.RandomHorizontalFlip(p=0.5),
            transforms.RandomApply(
                [transforms.ColorJitter(brightness=0.4, contrast=0.4,
                                        saturation=0.2, hue=0.1)],
                p=0.8
            ),
            transforms.RandomGrayscale(p=0.2),
            GaussianBlur(p=1.0),
            Solarization(p=0.0),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
        ])
        self.transform_prime = transforms.Compose([
            transforms.RandomResizedCrop(224, interpolation=Image.BICUBIC),
            transforms.RandomHorizontalFlip(p=0.5),
            transforms.RandomApply(
                [transforms.ColorJitter(brightness=0.4, contrast=0.4,
                                        saturation=0.2, hue=0.1)],
                p=0.8
            ),
            transforms.RandomGrayscale(p=0.2),
            GaussianBlur(p=0.1),
            Solarization(p=0.2),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
        ])

Notice that the solarization probability is zero in the first augmentation pipeline but not in the second. Is there a difference between the two (hence their different names)?

Is this expected?

Error in saving resnet50.pth

    torch.save(model.module.backbone.state_dict(),

Hi,

Since module is not a layer in BarlowTwins' definition, L151 should be:

    torch.save(model.backbone.state_dict(), args.checkpoint_dir / 'resnet50.pth')

Regards

Typo in the architecture image

In the architecture image, the output names are not written correctly: YA -> ZB and YB -> ZA. They should be the other way round, according to the paper:

The two batches of distorted views YA and YB are then fed to a function fθ, typically a deep network with trainable parameters θ,producing batches of representations ZA and ZB respectively.

The same image is also used in the paper, and it contains the same typos.

DDP Wrapper in evaluate.py

Hi,

I noticed that in evaluate.py, the resnet50 backbone was not wrapped in DDP. Can you maybe explain the reason behind it?

Batchnorm1d

Sorry, I figured it out.

After the BatchNorm1d normalization, it is necessary to divide the results by the batch size to obtain unit-norm vectors.

Projector Network

Hello Barlow Twins Team!

First off, great work on the paper and on providing a reference implementation of the concept! I have a couple of questions regarding the projector network and its usage. I know the paper discusses the projector network a little, but I'm still not clear on its intention/purpose. Could someone provide more detail on its use and purpose? If there are other external resources that speak to this, I definitely wouldn't mind reading them. Also, after training the model, do we use the projector network on further downstream tasks, or do we just use the ResNet-50 backbone? Thanks in advance for taking the time to address these questions.

how to perform correlation for 4d tensor.

Thanks for the amazing work!!

For my work the projector outputs two tensors of size [128, 3, 64, 64]; how can I compute the correlation matrix between these? One way is to flatten them into [128, -1] format, but that loses the spatial information, so I didn't want to do that. Is there any other way?
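One possibility (my own suggestion, not something from the repo): fold the spatial dimensions into the batch dimension, so the correlation is computed across channels and every spatial position acts as an extra sample:

    import torch

    z1 = torch.randn(128, 3, 64, 64)
    z2 = torch.randn(128, 3, 64, 64)

    # [N, C, H, W] -> [N*H*W, C]: spatial positions become extra batch samples
    a = z1.permute(0, 2, 3, 1).reshape(-1, 3)
    b = z2.permute(0, 2, 3, 1).reshape(-1, 3)

    # normalize along the (expanded) batch dimension, then correlate channels
    a = (a - a.mean(0)) / a.std(0)
    b = (b - b.mean(0)) / b.std(0)
    c = a.T @ b / a.shape[0]   # 3x3 cross-correlation over channels
    print(c.shape)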

Is torch.distributed.all_reduce working as expected?

This line https://github.com/facebookresearch/barlowtwins/blob/main/main.py#L208 uses torch.distributed.all_reduce to sum the correlation matrices across all GPUs. However, as far as I know, this op is not intended for forward computations that are later followed by a backward pass. Instead, for a correctly differentiable distributed all-reduce, the official PyTorch documentation recommends using torch.distributed.nn.*: https://pytorch.org/docs/stable/distributed.html#autograd-enabled-communication-primitives
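If I read the docs right, the suggested swap would look roughly like this (a sketch; it must run inside an initialized process group, in place of the existing all_reduce call):

    import torch.distributed.nn

    # autograd-aware all-reduce: gradients flow back through the collective,
    # unlike torch.distributed.all_reduce
    c = torch.distributed.nn.all_reduce(c)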

Augmentation Distribution

Thanks for sharing!

I can see that you chose different distributions in transform and transform_prime, namely

GaussianBlur(p=0.1) vs (p=1.0)
Solarization(p=0.2) vs (p=0.0)

In the paper I could not find a clear motivation or ablation for this. Is there an intuition or experience you can share?
Kind regards.

About the last normalization layer

Thanks for the great work! Before computing the cross-correlation matrix, can we L2-normalize the representations along the batch dimension instead of using BatchNorm1d? Then the pseudocode becomes:

    # normalize repr. along the batch dimension
    z_a_norm = torch.nn.functional.normalize(z_a, dim=0)  # NxD
    z_b_norm = torch.nn.functional.normalize(z_b, dim=0)  # NxD

    # cross-correlation matrix
    c = torch.mm(z_a_norm.T, z_b_norm)  # DxD

Loss implementation

Hi, Jure Zbontar, great work!

I am trying to re-implement the code, but I found some inconsistency in the loss implementations (apart from the scale-loss).

In the paper, Eq. 1 shows that the redundancy-reduction term is computed on the cross-correlation matrix, and in the current implementation in this repo it is also computed on the cross-correlation matrix c:
    on_diag = torch.diagonal(c).add_(-1).pow_(2).sum().mul(self.args.scale_loss)
    off_diag = off_diagonal(c).pow_(2).sum().mul(self.args.scale_loss)
    loss = on_diag + self.args.lambd * off_diag

But in the pseudocode of Algorithm 1 in the paper, it seems that the loss is computed on c_diff rather than on c. Could you please elaborate on this? Thanks a lot!
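For reference, the c_diff formulation of Algorithm 1 reads roughly as below; as far as I can tell, the two are algebraically the same, since (c - I).pow(2) gives (c_ii - 1)^2 on the diagonal and c_ij^2 off it, so scaling the off-diagonal terms by lambda before summing is exactly on_diag + lambd * off_diag (up to the scale_loss factor):

    import torch

    # toy shapes; z_a_norm / z_b_norm stand in for batch-normalized embeddings
    N, D, lambd = 8, 4, 5e-3
    z_a_norm, z_b_norm = torch.randn(N, D), torch.randn(N, D)

    c = z_a_norm.T @ z_b_norm / N                      # DxD cross-correlation
    c_diff = (c - torch.eye(D)).pow(2)                 # squared deviation from identity
    c_diff[~torch.eye(D, dtype=torch.bool)] *= lambd   # scale off-diagonal terms
    loss = c_diff.sum()
    print(loss)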

Training on mutiple nodes

Thanks for your great work. If there are two machines (each with 8 V100 GPUs) connected via Ethernet, without Slurm management, how can we run the code with your stated 16-V100 configuration?

L2-norm problem

I have seen the L2-norm operator in the paper, i.e. in formula (2), but I have not seen it in your code. Could you tell me why? Thanks!
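For context, the normalization in question (formula (2) of the paper, reproduced from memory, so worth double-checking) divides each feature by its L2 norm along the batch dimension:

    C_{ij} = \frac{\sum_b z^A_{b,i} \, z^B_{b,j}}{\sqrt{\sum_b \left(z^A_{b,i}\right)^2} \; \sqrt{\sum_b \left(z^B_{b,j}\right)^2}}

Presumably this role is played in the code by the BatchNorm1d layer applied to the embeddings before the matrix product, which subtracts the per-feature mean and divides by the per-feature standard deviation over the batch.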

Applications on one-dimensional signal datasets

Hi, thank you for your work; it inspires me a lot.
I want to apply BT to my dataset of one-dimensional pulse signals instead of images. The data length is 500; do you think it needs to be changed to 224? I did the following: I changed the number of input channels in ResNet-50 from 3 to 1 and Conv2d to Conv1d, but the output dimension is still 2048. Good results can be obtained with supervised networks, but the loss of the self-supervised learning process does not drop significantly, and the downstream classification tasks are not as good as with supervised networks. I also raised the MLP dimension to 8192, but that didn't work either.
What do you think might be the reason? Or can you give me some advice? Thanks again!

Question about Fig. 4 in the paper

Hi. Thanks for the great work!
I'm trying to reproduce your paper's results in Fig. 4 (the effect of the dimensionality of the last layer of the projector network on performance).
I have two questions about this:

  • Could you tell me what hyperparameters you used in these experiments?
  • You ran main.py with --projector 8192-8192-16384, not --projector 16384-16384-16384, right?

How to reproduce VOC07 linear classification result?

I used vissl to reproduce the VOC07 linear classification result using the pre-trained model linked in this repo and the following config. The accuracy I got is 86.831%, which is quite a bit higher than the 86.2% reported in your paper. However, for multi-crop SwAV, I am able to reproduce the 88.9% figure using this config. Can you kindly comment on the settings you used? What are the differences from the config above? Thanks.

```yaml
# @package _global_
config:
  DATA:
    NUM_DATALOADER_WORKERS: 5
    TRAIN:
      DATA_SOURCES: [disk_filelist]
      LABEL_SOURCES: [disk_filelist]
      DATASET_NAMES: [voc2007]
      BATCHSIZE_PER_REPLICA: 32
      TRANSFORMS:
        - name: Resize
          size: 256
        - name: CenterCrop
          size: [224, 224]
        - name: ToTensor
        - name: Normalize
          mean: [0.485, 0.456, 0.406]
          std: [0.229, 0.224, 0.225]
      MMAP_MODE: False
      COPY_TO_LOCAL_DISK: True
      COPY_DESTINATION_DIR: /tmp/voc2007/
    TEST:
      DATA_SOURCES: [disk_filelist]
      LABEL_SOURCES: [disk_filelist]
      DATASET_NAMES: [voc2007]
      BATCHSIZE_PER_REPLICA: 32
      TRANSFORMS:
        - name: Resize
          size: 256
        - name: CenterCrop
          size: [224, 224]
        - name: ToTensor
        - name: Normalize
          mean: [0.485, 0.456, 0.406]
          std: [0.229, 0.224, 0.225]
      MMAP_MODE: False
      COPY_TO_LOCAL_DISK: True
      COPY_DESTINATION_DIR: /tmp/voc2007/
  MODEL:
    WEIGHTS_INIT:
      PARAMS_FILE: "specify path"
      STATE_DICT_KEY_NAME: classy_state_dict
      # STATE_DICT_KEY_NAME: model_state_dict
    FEATURE_EVAL_SETTINGS:
      EVAL_MODE_ON: True
      FREEZE_TRUNK_ONLY: True
      EXTRACT_TRUNK_FEATURES_ONLY: True
      SHOULD_FLATTEN_FEATS: True
      LINEAR_EVAL_FEAT_POOL_OPS_MAP: [
        ["res5", ["AvgPool2d", [[6, 6], 1, 0]]],
        ["res5avg", ["Identity", []]],
      ]
    TRUNK:
      NAME: resnet
      RESNETS:
        DEPTH: 50
  DISTRIBUTED:
    NUM_NODES: 1
    NUM_PROC_PER_NODE: 8
  MACHINE:
    DEVICE: gpu
  CHECKPOINT:
    DIR: .
  SVM:
    costs:
      costs_list: [0.01, 0.1, 1.0, 2, 5, 10, 15, 20, 50, 100, 1000]
    normalize: True
    loss: squared_hinge
    penalty: l2
    dual: True
    max_iter: 2000
    cross_val_folds: 3
    force_retrain: False

```

I am attaching the following screenshot from your paper as a reference: [screenshot: table of VOC07 linear classification results]
