
barlowtwins's Introduction

Barlow Twins: Self-Supervised Learning via Redundancy Reduction

[figure: Barlow Twins method diagram]

PyTorch implementation of Barlow Twins.

@article{zbontar2021barlow,
  title={Barlow Twins: Self-Supervised Learning via Redundancy Reduction},
  author={Zbontar, Jure and Jing, Li and Misra, Ishan and LeCun, Yann and Deny, St{\'e}phane},
  journal={arXiv preprint arXiv:2103.03230},
  year={2021}
}

Pretrained Model

| epochs | batch size | acc1 | acc5 | download |
|--------|------------|------|------|----------|
| 1000 | 2048 | 73.5% | 91.0% | ResNet-50 / full checkpoint / train logs / val logs |

You can choose to download either the weights of the pretrained ResNet-50 network or the full checkpoint, which also contains the weights of the projector network and the state of the optimizer.

The pretrained model is also available on PyTorch Hub.

import torch
model = torch.hub.load('facebookresearch/barlowtwins:main', 'resnet50')
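Continuing from the snippet above, a quick sanity check of the loaded backbone might look like this (a sketch; the random tensor is just a stand-in for a batch of normalized 224×224 images):

    model.eval()
    x = torch.randn(2, 3, 224, 224)   # stand-in for a normalized image batch
    with torch.no_grad():
        out = model(x)
    print(out.shape)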

Barlow Twins Training

Install PyTorch and download ImageNet by following the instructions in the requirements section of the PyTorch ImageNet training example. The code has been developed for PyTorch version 1.7.1 and torchvision version 0.8.2, but it should work with other versions just as well.

Our best model is obtained by running the following command:

python main.py /path/to/imagenet/

Training time is approximately 7 days on 16 V100 GPUs.

Evaluation: Linear Classification

Train a linear probe on the representations learned by Barlow Twins. Freeze the weights of the ResNet and use the entire ImageNet training set.

python evaluate.py /path/to/imagenet/ /path/to/checkpoint/resnet50.pth --lr-classifier 0.3

Evaluation: Semi-supervised Learning

Train a linear probe on the representations learned by Barlow Twins. Fine-tune the weights of the ResNet and use a subset of the ImageNet training set.

python evaluate.py /path/to/imagenet/ /path/to/checkpoint/resnet50.pth --weights finetune --train-perc 1 --epochs 20 --lr-backbone 0.005 --lr-classifier 0.5 --weight-decay 0 --checkpoint-dir ./checkpoint/semisup/

Community Updates

Let us know about all the cool stuff you are able to do with Barlow Twins so that we can advertise it here!

License

This project is released under MIT License, which allows commercial use. See LICENSE for details.


barlowtwins's Issues

Barlow Twins loss on identical vector

Hello, I really enjoyed reading the paper and thought about the intentions of the loss.

However, I was wondering whether setting the target matrix to the identity matrix is valid.

As far as I understand, each element of the cross-correlation matrix is an inner product over the batch between a pair of feature dimensions. The Barlow Twins loss aims for a correlation of 1 on the diagonal and 0 (no correlation) on the off-diagonal elements.

So, if two identical representation vectors were fed to the loss, I thought it should give a loss of zero, but it didn't.

For the sake of simplicity, let's say we have 2 pairs of representation vectors with identical values (that's 4 vectors). However, when I took two identical 1d vectors for the 2 data points, applied batch norm, and computed the Barlow Twins loss on them, I got 1 on the diagonal but not 0 on the off-diagonal elements.

The same thing happens when the batch size is 1 (though batch norm then normalizes the values to zero).

I'm not sure how the loss learns invariance and redundancy reduction with the identity matrix as its target, especially through the redundancy term. Can you please elaborate on how the representation vectors learn within the redundancy term?

Here's a simple example I tried, following the code implementation.

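A minimal sketch of this kind of experiment (a hypothetical reconstruction, since the original screenshot is not reproduced here; it reuses the off_diagonal helper from main.py):

    import torch
    from torch import nn

    def off_diagonal(x):
        # as in main.py: flattened view of the off-diagonal elements of a square matrix
        n, m = x.shape
        assert n == m
        return x.flatten()[:-1].view(n - 1, n + 1)[:, 1:].flatten()

    torch.manual_seed(0)
    N, D = 2, 4                        # batch of 2, feature dimension 4
    z = torch.randn(N, D)
    bn = nn.BatchNorm1d(D, affine=False)

    # feed the SAME tensor as both "views"
    c = bn(z).T @ bn(z) / N
    print(torch.diagonal(c))           # ~1 on the diagonal ...
    print(off_diagonal(c))             # ... but generally nonzero off the diagonal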

Thank you!

Issue loading checkpoint.pth file

Thank you very much for providing your code on GitHub. There is a problem when trying to load the checkpoint.pth file. If I load it using your latest code, it gives the error AttributeError: Can't get attribute 'exclude_bias_and_norm' on <module '__main__'>.

One can easily fix it by adding the following function to the file (not nested anywhere):

def exclude_bias_and_norm(p):
    return p.ndim == 1
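With that function defined at the top level of the loading script, unpickling should go through. A minimal sketch (the checkpoint path is a placeholder):

    import torch

    def exclude_bias_and_norm(p):
        return p.ndim == 1

    # the pickled checkpoint references exclude_bias_and_norm by name,
    # so it must be resolvable in __main__ when torch.load runs
    ckpt = torch.load('checkpoint.pth', map_location='cpu')
    print(ckpt.keys())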

Commit 21149b45bda50e579f166a4e261f281924b7c208 from August 5th actually introduced this "bug", as you very likely kept the originally saved model but changed the code.

It is a very minor thing, but I thought I'd point it out. The problem stems from how PyTorch serializes models; hence, I hope that by opening this issue, other researchers hitting the same error might see this easy fix.

Here you can find a Jupyter notebook detailing the problem and the fix:
BugReport.zip

Evaluating a checkpoint

I am trying to evaluate a checkpoint during training, but when I use the evaluation script I get the following error:

    AttributeError: Can't get attribute 'exclude_bias_and_norm' on <module '__mp_main__' from '/home/ubuntu/Projects/barlowtwins/evaluate.py'>

Can you please have a look?

configs of training with batch size 256

Can you provide the config to train with batch size 256? How should lambd and the loss scale be set? And is LARS still necessary when training with small batch sizes?

Possible bug on the loss computation

I could be wrong, but reviewing the code, I think that there shouldn't be a pow_(2) on the line

    off_diag = off_diagonal(c).pow_(2).sum()

My reason is that the power of 2 is already applied in-place and element-wise on the correlation matrix in this line:

    on_diag = torch.diagonal(c).add_(-1).pow_(2).sum()

So presently we are computing the 4th power of the off-diagonal terms, but the original paper considers only their square:
https://arxiv.org/pdf/2103.03230.pdf

use barlowtwins predict but blocked and can not be killed

Hi, thank you guys for sharing this amazing project.
Training went very easily, but when I use the model to test or predict, the process blocks and cannot be killed.
here is my predictor code:

    import torch
    import argparse
    from PIL import Image
    from torch import nn, optim
    from main import BarlowTwins, Transform

    """
    BarlowTwins Predictor
    """

    def load_model(args):
        args.ngpus_per_node = torch.cuda.device_count()
        args.rank = 0
        args.dist_url = 'tcp://localhost:58472'
        args.world_size = args.ngpus_per_node
        gpu = 0
        model = BarlowTwins(args).cuda(gpu)
        model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
        torch.distributed.init_process_group(
            backend='nccl', init_method=args.dist_url,
            world_size=args.world_size, rank=args.rank)
        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[gpu])
        best_cp = torch.load("model/barlowtwins/bt_face.pth")
        # model.module.backbone.state_dict(best_cp)
        model.load_state_dict(best_cp)
        model.eval()
        return model

    def predict(t1, t2, model, device):
        model.to(device)
        with torch.no_grad():
            t1 = t1.to(torch.device("cuda:1" if torch.cuda.is_available() else "cpu"))
            t2 = t2.to(torch.device("cuda:2" if torch.cuda.is_available() else "cpu"))
            out_data = model(t1, t2)
        return out_data

    def predict_img(img_url, args):
        img = Image.open(img_url)
        bt_trans = Transform()
        t1, t2 = bt_trans(img)
        t1 = t1.unsqueeze(0)
        t2 = t2.unsqueeze(0)
        model = load_model(args)
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        output = predict(t1, t2, model, device)
        return output

    if __name__ == "__main__":
        parser = argparse.ArgumentParser(description='Barlow Twins Predict')
        parser.add_argument('--workers', default=8, type=int, metavar='N',
                            help='number of data loader workers')
        parser.add_argument('--projector', default='8192-8192-8192', type=str,
                            metavar='MLP', help='projector MLP')
        parser.add_argument("-d", default="data/test/1.jpg")
        args = parser.parse_args()
        img_url = str(args.d)
        result = predict_img(img_url, args)
        print(result)

Would you please tell me which part is wrong? I'm very much hoping for your answer!

Bug Issues: loaded state dict has a different number of parameter groups

Sorry to disturb you; I encountered a bug. I tried to use the model you released and verify it on the classification task:

    python evaluate.py /path/to/imagenet/ /path/to/checkpoint/resnet50.pth --lr-classifier 0.1

When I load your released trained model using load_state_dict, it raises the following error:

    File "/home/amy/.local/lib/python3.7/site-packages/torch/optim/optimizer.py", line 118, in load_state_dict
        raise ValueError("loaded state dict has a different number of "
    ValueError: loaded state dict has a different number of parameter groups

However, when I try to use the model for semi-supervised learning, it runs successfully:

    python evaluate.py /path/to/imagenet/ /path/to/checkpoint/resnet50.pth --weights finetune --train-perc 1 --epochs 20 --lr-backbone 0.002 --lr-classifier 0.5 --weight-decay 0

Improvement over filtering bias and bn

An improvement can be made in main.py: instead of recomputing the filtered parameters in LARS at every step, one can pass two functions, weight_decay_filter and lars_adaptation_filter, that always return True (i.e. always filter) to the parameter group that should be excluded.

def exclude_param(p: torch.nn.parameter.Parameter):
    return True
parameters = [
        {"params": param_weights},
        {
            "params": param_biases,
            "weight_decay_filter": exclude_param,
            "lars_adaptation_filter": exclude_param,
        },
    ]

then

optimizer = LARS(parameters, lr=0, weight_decay=args.weight_decay)

In LARS.step, the weights group will then not be filtered, as its weight_decay_filter and lars_adaptation_filter are left as None, unlike those of the param_biases group.

A question on the BT loss with Batch Norm layers

Hi, thanks a lot for the very clear implementation; the paper is so easy to read! I had a quick question about the Barlow Twins loss.

Since Barlow Twins relies on the statistics of a batch of data, if there are batch norm layers in the encoder network, is it possible that the parameters of the BN layers be updated/affected more than any other parameters in the network to optimize the BT loss? Did you experiment without any batchnorm layers in the encoder to see if that affects the learned representations?

Are negative samples required?

From the discussion section of the paper:

"Another common point between the two losses is that they both rely on batch statistics to measure this variability. However, the InfoNCE objective maximizes the variability of the representation by maximizing the pairwise distance between all pairs of samples, whereas our method does so by decorrelating the components of the representation vectors."

This confuses me.

My understanding is that the cross-correlation matrix only considers the decorrelation between the two augmentations of the same image. As such, we don't need the negative samples present in the batch. Then how and why are batch statistics used in decorrelating the representation vectors? Could you please explain whether the negative samples are indeed considered in the loss function or not?

Start index for each epoch

Could anyone explain why each epoch does not start at index 0, but rather at epoch * len(loader)?
Lines 123 to 125:

    for epoch in range(start_epoch, args.epochs):
        sampler.set_epoch(epoch)
        for step, ((y1, y2), _) in enumerate(loader, start=epoch * len(loader)):
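For illustration, here is a toy version of what that start offset produces; my understanding (an assumption, worth verifying against main.py) is that this makes step a global batch counter across the whole run, which the learning-rate schedule consumes:

    # toy loader with 3 batches per epoch
    loader = [0, 1, 2]
    for epoch in range(2):
        for step, batch in enumerate(loader, start=epoch * len(loader)):
            print(epoch, step)   # epoch 0 -> steps 0, 1, 2 ; epoch 1 -> steps 3, 4, 5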

BT loss value on val set every N training epochs without classifier

Hi,

Thanks for making it painless for others to use and build upon your work.

I'm curious to know what the loss function value looks like during the training process on the val set of images. In my case of using spectrogram images on a resnet-18 backbone, the network trains well but the val loss value is very noisy and shows no clear trend. For example, I'm not sure what to make of this:

[plot: per-epoch validation loss, noisy with no clear trend]

In the original experiments, was the BT loss value on the val set tracked every N epochs (without any classifier involved)? As far as I can tell, the loss reported in the 'val logs' txt file is that of the classifier trained on top of the embeddings generated using a single training checkpoint.

The loss curve on the val set should be expected to follow typical behavior, right? Any pointers as to what might be causing a noisy val loss?

Discrepancy in the semi-supervised results for 10% data

Hi,

I was able to reproduce your linear probe results for BarlowTwins:
{"epoch": 99, "acc1": 73.266, "acc5": 91.094, "best_acc1": 73.29, "best_acc5": 91.108}.

I have also computed the semi-supervised results. Although the results for 1% data match, the results for 10% data are slightly different. I used the same hyperparameters for 10% as for 1% data, since you only mentioned the parameters for 1%; maybe that is the difference. If the parameters differ, can you please clarify them for the 10% setting?

1% data: {"epoch": 19, "acc1": 55.09, "acc5": 79.894, "best_acc1": 55.102, "best_acc5": 79.894}
10% data: {"epoch": 19, "acc1": 68.89, "acc5": 88.966, "best_acc1": 68.89, "best_acc5": 88.966}

Implementation question

I am wondering what the purpose of the following code block is.

barlowtwins/main.py

Lines 90 to 96 in a655214

    param_weights = []
    param_biases = []
    for param in model.parameters():
        if param.ndim == 1:
            param_biases.append(param)
        else:
            param_weights.append(param)
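For context, these two lists then become separate optimizer parameter groups, so that weight decay and the LARS adaptation can be skipped for one-dimensional parameters (biases and batch-norm weights); roughly, paraphrasing the lines that follow in main.py:

    parameters = [{'params': param_weights}, {'params': param_biases}]
    optimizer = LARS(parameters, lr=0, weight_decay=args.weight_decay,
                     weight_decay_filter=True,
                     lars_adaptation_filter=True)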

Why do we average out correlation matrices from different GPUs? Is this mathematically valid?

Thanks for this great work!

I am a bit confused about the computation of the Barlow Twins loss in the multi-GPU setting. If I understand it correctly, each batch is split into smaller minibatches, and these are then processed on separate GPUs. Each GPU computes the cross-correlation matrix corresponding to its minibatch; the cross-correlations between samples on different GPUs are not computed. It is not clear to me why the different cross-correlation matrices are averaged across GPUs. This creates a mean correlation matrix, which is then used for the loss computation.

Why not compute the loss for each correlation matrix separately and only average the final losses? Or, even better, why not compute the full cross-correlation matrix (i.e. gather all embedding vectors onto one device and compute the cross-correlation there)?

I fail to see why summing up correlation matrices is a valid mathematical operation, or is it just an implementation "hack" that makes things easier? I guess that since all cross-correlation matrices ideally converge towards identity matrices (as forced by the loss function), averaging them does not strictly break the convergence; is that the case?

I am not very experienced with distributed deep learning so there may be technical things I don't understand. Thanks for your help.

barlowtwins/main.py

Lines 206 to 223 in a655214

    def forward(self, y1, y2):
        z1 = self.projector(self.backbone(y1))
        z2 = self.projector(self.backbone(y2))

        # empirical cross-correlation matrix
        c = self.bn(z1).T @ self.bn(z2)

        # sum the cross-correlation matrix between all gpus
        c.div_(self.args.batch_size)
        torch.distributed.all_reduce(c)

        on_diag = torch.diagonal(c).add_(-1).pow_(2).sum()
        off_diag = off_diagonal(c).pow_(2).sum()
        loss = on_diag + self.args.lambd * off_diag
        return loss
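For what it's worth, here is a small numerical sketch of why summing partial matrices works (assuming, as with SyncBatchNorm, that the normalization statistics are global): the full-batch cross-correlation is a sum over samples, so summing per-GPU partial sums reproduces it exactly:

    import torch

    torch.manual_seed(0)
    N, D, shards = 8, 4, 2
    z1, z2 = torch.randn(N, D), torch.randn(N, D)   # stand-ins for globally batch-normalized features

    full = z1.T @ z2 / N                            # full-batch cross-correlation

    # each "GPU" holds one shard and computes a partial sum
    parts = sum(z1[i::shards].T @ z2[i::shards] for i in range(shards)) / N
    print(torch.allclose(full, parts))              # True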

Quality of Embeddings

Hello,

Thanks for publishing your amazing work. Have you done any investigations into the quality of the embeddings for downstream tasks, rather than the representations? I think the SimCLR paper did some, showing that the representations were better; but since the embeddings are nicely disentangled in your work, I was wondering if Barlow Twins' embeddings might be better, especially when disentanglement is needed.

Greetings

Possible to release projector weights too?

Hi,

Thanks for releasing the pre-trained weights for the ResNet-50 backbone! Would it be possible to also release the weights of the pre-trained projector associated with the pre-trained ResNet-50?

This would be extra helpful for downstream tasks where there is considerably more unlabeled than labeled data, making it easy to fine-tune the whole model (backbone encoder + projector) by minimizing the BT loss on the unlabeled samples prior to training a classifier on the labeled subset.

Will BarlowTwins overfit on the training data?

Hi, team members of Barlow Twins. I have a question about the framework of this algorithm.
When the output dimension of the projection layers grows, the total number of trainable parameters grows hugely. Does any overfitting occur?
Also, in the paper, why are the transfer learning results not as good as those of supervised transfer learning? Could the reason be that Barlow Twins over-focuses on the features of the training data?

Look forward to your reply, thank you.

Shuffling BN and Single GPU

If we train using a single GPU, will we face the issue of information leakage through BN, as mentioned in the MoCo paper?

Thanks

loss is NaN for some batches/steps

EDIT - After more investigation, the issue is likely with my input data/features. Marking as closed. Sorry about that!

Hi,

Thanks for making it super easy to apply this architecture to other datasets.

I'm working with spectrogram images containing 10 channels and a ResNet-18 backbone, currently using the Adam optimizer with single-GPU training. My loss computation results in NaNs for some batches early on (shown below) and then for the whole epoch later (not shown), and I'm not sure why. I was wondering if you had any thoughts on why that might be the case. A snapshot of the output logs (with on_diag and off_diag printed for each batch/step) is given below:

    tensor(nan, device='cuda:0', grad_fn=) tensor(nan, device='cuda:0', grad_fn=)
    tensor(0.0051, device='cuda:0', grad_fn=) tensor(124.6055, device='cuda:0', grad_fn=)
    tensor(0.0047, device='cuda:0', grad_fn=) tensor(116.0811, device='cuda:0', grad_fn=)
    tensor(0.0038, device='cuda:0', grad_fn=) tensor(116.7065, device='cuda:0', grad_fn=)
    tensor(nan, device='cuda:0', grad_fn=) tensor(nan, device='cuda:0', grad_fn=)
    tensor(0.0080, device='cuda:0', grad_fn=) tensor(107.8073, device='cuda:0', grad_fn=)
    tensor(nan, device='cuda:0', grad_fn=) tensor(nan, device='cuda:0', grad_fn=)
    tensor(0.0051, device='cuda:0', grad_fn=) tensor(115.6731, device='cuda:0', grad_fn=)
    tensor(0.0027, device='cuda:0', grad_fn=) tensor(132.3802, device='cuda:0', grad_fn=)
    tensor(nan, device='cuda:0', grad_fn=) tensor(nan, device='cuda:0', grad_fn=)
    tensor(0.0864, device='cuda:0', grad_fn=) tensor(94.7531, device='cuda:0', grad_fn=)
    tensor(nan, device='cuda:0', grad_fn=) tensor(nan, device='cuda:0', grad_fn=)
    {"epoch": 1, "batch": 0, "learning_rate": 0.0, "loss": NaN, "time": "0:01:28.300381"}
    tensor(0.0051, device='cuda:0', grad_fn=) tensor(116.8903, device='cuda:0', grad_fn=)
    tensor(0.0040, device='cuda:0', grad_fn=) tensor(117.0997, device='cuda:0', grad_fn=)
    tensor(0.0031, device='cuda:0', grad_fn=) tensor(128.7704, device='cuda:0', grad_fn=)
    tensor(0.0024, device='cuda:0', grad_fn=) tensor(120.1153, device='cuda:0', grad_fn=)
    tensor(0.0047, device='cuda:0', grad_fn=) tensor(112.5160, device='cuda:0', grad_fn=)
    tensor(0.0065, device='cuda:0', grad_fn=) tensor(116.6054, device='cuda:0', grad_fn=)
    tensor(0.0031, device='cuda:0', grad_fn=) tensor(120.2410, device='cuda:0', grad_fn=)

transforms.Solarize probability is zero for transforms but not transforms_prime

I was viewing the transforms class and I saw this

        self.transform = transforms.Compose([
            transforms.RandomResizedCrop(224, interpolation=Image.BICUBIC),
            transforms.RandomHorizontalFlip(p=0.5),
            transforms.RandomApply(
                [transforms.ColorJitter(brightness=0.4, contrast=0.4,
                                        saturation=0.2, hue=0.1)],
                p=0.8
            ),
            transforms.RandomGrayscale(p=0.2),
            GaussianBlur(p=1.0),
            Solarization(p=0.0),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
        ])
        self.transform_prime = transforms.Compose([
            transforms.RandomResizedCrop(224, interpolation=Image.BICUBIC),
            transforms.RandomHorizontalFlip(p=0.5),
            transforms.RandomApply(
                [transforms.ColorJitter(brightness=0.4, contrast=0.4,
                                        saturation=0.2, hue=0.1)],
                p=0.8
            ),
            transforms.RandomGrayscale(p=0.2),
            GaussianBlur(p=0.1),
            Solarization(p=0.2),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
        ])

Notice that the solarization probability is zero in the first augmentation pipeline but not in the second. Is there a difference between the two (hence their different names)?

Is this expected?

Error in saving resnet50.pth

    torch.save(model.module.backbone.state_dict(),

Hi,

Since module is not a layer in BarlowTwins' definition, L151 should be:

    torch.save(model.backbone.state_dict(), args.checkpoint_dir / 'resnet50.pth')

Regards

Typo in the architecture image

In the architecture image, the output names are not written correctly: YA -> ZB and YB -> ZA. They should be the other way round, according to the paper:

The two batches of distorted views YA and YB are then fed to a function fθ, typically a deep network with trainable parameters θ,producing batches of representations ZA and ZB respectively.

The same image is also used in the paper, and it contains the same typos.

DDP Wrapper in evaluate.py

Hi,

I noticed that in evaluate.py, the resnet50 backbone was not wrapped in DDP. Can you maybe explain the reason behind it?

Batchnorm1d

Sorry, I figured it out.

After the BatchNorm1d normalization, it is necessary to divide the results by the batch size to obtain unit-norm vectors.

Projector Network

Hello Barlow Twins Team!

First off, great work on the paper and on providing a reference implementation of the concept! I have a couple of questions regarding the projector network and its usage. I know the paper discusses the projector network a little, but I'm still not clear on its intention/purpose. Could someone provide more detail on its use and purpose? If there are other external resources that speak to this, I definitely wouldn't mind reading them. Also, after training the model, do we use the projector network on further downstream tasks, or do we just use the ResNet-50 backbone? Thanks in advance for taking the time to address these questions.

how to perform correlation for 4d tensor.

Thanks for the amazing work!!

For my work the projector outputs two tensors of size [128, 3, 64, 64]; how can I compute the correlation matrix between these? One way is to flatten them into [128, -1] format, but that loses the spatial information, so I didn't want to do that. Is there any other way?
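One possibility (my own suggestion, not something from the repo): fold the spatial dimensions into the batch dimension, so the correlation is computed across channels and every spatial position acts as an extra sample:

    import torch

    z1 = torch.randn(128, 3, 64, 64)
    z2 = torch.randn(128, 3, 64, 64)

    # [N, C, H, W] -> [N*H*W, C]: spatial positions become extra batch samples
    a = z1.permute(0, 2, 3, 1).reshape(-1, 3)
    b = z2.permute(0, 2, 3, 1).reshape(-1, 3)

    # normalize along the (expanded) batch dimension, then correlate channels
    a = (a - a.mean(0)) / a.std(0)
    b = (b - b.mean(0)) / b.std(0)
    c = a.T @ b / a.shape[0]   # 3x3 cross-correlation over channels
    print(c.shape)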

Is torch.distributed.all_reduce working as expected?

This line https://github.com/facebookresearch/barlowtwins/blob/main/main.py#L208 uses torch.distributed.all_reduce to sum the correlation matrices across all GPUs. However, as far as I know, this op is not intended for forward computations that are later followed by a backward pass. Instead, for a correctly differentiable distributed all-reduce, the official PyTorch documentation recommends using torch.distributed.nn.*: https://pytorch.org/docs/stable/distributed.html#autograd-enabled-communication-primitives
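If I read the docs right, the suggested swap would look roughly like this (a sketch; it must run inside an initialized process group, in place of the existing all_reduce call):

    import torch.distributed.nn

    # autograd-aware all-reduce: gradients flow back through the collective,
    # unlike torch.distributed.all_reduce
    c = torch.distributed.nn.all_reduce(c)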

Augmentation Distribution

Thanks for sharing!

I can see that you chose different distributions in transform and transform_prime, namely

GaussianBlur(p=0.1) vs (p=1.0)
Solarization(p=0.2) vs (p=0.0)

In the paper I could not find a clear motivation or ablation for this. Is there an intuition or experience you can share?
Kind regards.

About the last normalization layer

Thanks for the great work! Before computing the cross-correlation matrix, can we L2-normalize the representations along the batch dimension instead of using BatchNorm1d? Then the pseudocode becomes:

    # normalize repr. along the batch dimension
    z_a_norm = torch.nn.functional.normalize(z_a, dim=0)  # NxD
    z_b_norm = torch.nn.functional.normalize(z_b, dim=0)  # NxD

    # cross-correlation matrix
    c = torch.mm(z_a_norm.T, z_b_norm)  # DxD

Loss implementation

Hi, Jure Zbontar, great work!

I am trying to re-implement the code, but I found some inconsistency in the loss implementations (apart from the scale-loss).

In the paper, Eq. 1 shows that the redundancy-reduction term is computed on the cross-correlation matrix, and in the current implementation in this repo it is also computed on the cross-correlation matrix c:
    on_diag = torch.diagonal(c).add_(-1).pow_(2).sum().mul(self.args.scale_loss)
    off_diag = off_diagonal(c).pow_(2).sum().mul(self.args.scale_loss)
    loss = on_diag + self.args.lambd * off_diag

But in the pseudocode of Algorithm 1 in the paper, it seems that the loss is computed on c_diff rather than on c. Could you please elaborate on this? Thanks a lot!
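For reference, the c_diff formulation of Algorithm 1 reads roughly as below; as far as I can tell, the two are algebraically the same, since (c - I).pow(2) gives (c_ii - 1)^2 on the diagonal and c_ij^2 off it, so scaling the off-diagonal terms by lambda before summing is exactly on_diag + lambd * off_diag (up to the scale_loss factor):

    import torch

    # toy shapes; z_a_norm / z_b_norm stand in for batch-normalized embeddings
    N, D, lambd = 8, 4, 5e-3
    z_a_norm, z_b_norm = torch.randn(N, D), torch.randn(N, D)

    c = z_a_norm.T @ z_b_norm / N                      # DxD cross-correlation
    c_diff = (c - torch.eye(D)).pow(2)                 # squared deviation from identity
    c_diff[~torch.eye(D, dtype=torch.bool)] *= lambd   # scale off-diagonal terms
    loss = c_diff.sum()
    print(loss)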

Training on mutiple nodes

Thanks for your great work. If there are two machines (each with 8 V100 GPUs) connected via Ethernet, without Slurm management, how can we run the code with your stated 16-V100 configuration?

L2-norm problem

I have seen the L2-norm operator in the paper, i.e. in formula (2), but I have not seen it in your code. Could you tell me why? Thanks!
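For context, the normalization in question (formula (2) of the paper, reproduced from memory, so worth double-checking) divides each feature by its L2 norm along the batch dimension:

    C_{ij} = \frac{\sum_b z^A_{b,i} \, z^B_{b,j}}{\sqrt{\sum_b \left(z^A_{b,i}\right)^2} \; \sqrt{\sum_b \left(z^B_{b,j}\right)^2}}

Presumably this role is played in the code by the BatchNorm1d layer applied to the embeddings before the matrix product, which subtracts the per-feature mean and divides by the per-feature standard deviation over the batch.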

Applications on one-dimensional signal datasets

Hi, thank you for your work; it inspires me a lot.
I want to apply BT to my dataset of one-dimensional pulse signals instead of images. The data length is 500; do you think it needs to be changed to 224? I did the following: I changed the number of input channels in ResNet-50 from 3 to 1 and Conv2d to Conv1d, but the output dimension is still 2048. Good results can be obtained with supervised networks, but the loss of the self-supervised learning process does not drop significantly, and the downstream classification tasks are not as good as with supervised networks. I also raised the MLP dimension to 8192, but that didn't work either.
What do you think might be the reason? Or can you give me some advice? Thanks again!

Question about Fig. 4 in the paper

Hi. Thanks for the great work!
I'm trying to reproduce your paper's results in Fig. 4 (the effect of the dimensionality of the last layer of the projector network on performance).
I have two questions about this:

  • Could you tell me what hyperparameters you used in these experiments?
  • You ran main.py with --projector 8192-8192-16384, not --projector 16384-16384-16384, right?

How to reproduce VOC07 linear classification result?

I used vissl to reproduce the VOC07 linear classification result using the pre-trained model linked in this repo and the following config. The accuracy I got is 86.831%, which is quite a bit higher than the 86.2% reported in your paper. However, for multi-crop SwAV, I am able to reproduce the 88.9% figure using this config. Can you kindly comment on the settings you used? What are the differences from the config above? Thanks.

```yaml
# @package _global_
config:
  DATA:
    NUM_DATALOADER_WORKERS: 5
    TRAIN:
      DATA_SOURCES: [disk_filelist]
      LABEL_SOURCES: [disk_filelist]
      DATASET_NAMES: [voc2007]
      BATCHSIZE_PER_REPLICA: 32
      TRANSFORMS:
        - name: Resize
          size: 256
        - name: CenterCrop
          size: [224, 224]
        - name: ToTensor
        - name: Normalize
          mean: [0.485, 0.456, 0.406]
          std: [0.229, 0.224, 0.225]
      MMAP_MODE: False
      COPY_TO_LOCAL_DISK: True
      COPY_DESTINATION_DIR: /tmp/voc2007/
    TEST:
      DATA_SOURCES: [disk_filelist]
      LABEL_SOURCES: [disk_filelist]
      DATASET_NAMES: [voc2007]
      BATCHSIZE_PER_REPLICA: 32
      TRANSFORMS:
        - name: Resize
          size: 256
        - name: CenterCrop
          size: [224, 224]
        - name: ToTensor
        - name: Normalize
          mean: [0.485, 0.456, 0.406]
          std: [0.229, 0.224, 0.225]
      MMAP_MODE: False
      COPY_TO_LOCAL_DISK: True
      COPY_DESTINATION_DIR: /tmp/voc2007/
  MODEL:
    WEIGHTS_INIT:
      PARAMS_FILE: "specify path"
      STATE_DICT_KEY_NAME: classy_state_dict
      # STATE_DICT_KEY_NAME: model_state_dict
    FEATURE_EVAL_SETTINGS:
      EVAL_MODE_ON: True
      FREEZE_TRUNK_ONLY: True
      EXTRACT_TRUNK_FEATURES_ONLY: True
      SHOULD_FLATTEN_FEATS: True
      LINEAR_EVAL_FEAT_POOL_OPS_MAP: [
        ["res5", ["AvgPool2d", [[6, 6], 1, 0]]],
        ["res5avg", ["Identity", []]],
      ]
    TRUNK:
      NAME: resnet
      RESNETS:
        DEPTH: 50
  DISTRIBUTED:
    NUM_NODES: 1
    NUM_PROC_PER_NODE: 8
  MACHINE:
    DEVICE: gpu
  CHECKPOINT:
    DIR: .
  SVM:
    costs:
      costs_list: [0.01, 0.1, 1.0, 2, 5, 10, 15, 20, 50, 100, 1000]
    normalize: True
    loss: squared_hinge
    penalty: l2
    dual: True
    max_iter: 2000
    cross_val_folds: 3
    force_retrain: False

```

I am attaching the following screenshot from your paper as a reference: [screenshot: table of VOC07 linear classification results]
