patrickhua / simsiam

A PyTorch implementation of the paper 'Exploring Simple Siamese Representation Learning'

License: MIT License

Python 100.00%

simsiam's Issues

Why are training and testing so slow

Why are training and testing so slow?

SyncBatchNorm

To the best of my knowledge, SyncBatchNorm is supported only with DDP (DistributedDataParallel), not with DataParallel.

First of all, I want to thank you for this great project! I am a PhD student in the field of deep learning and would really like to include your implementation in my experiments. Unfortunately, what stops me from doing so is that you have not provided a license yet. Would it be possible for you to add a license to this project, such as the MIT license? I would greatly appreciate that and would, of course, properly cite your work.

pretrain weights

Can you share pretrained weights for SimSiam?

Two solved problems during evaluation:

1. RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same.
In linear_eval.py, lines 50 and 59 should add .to(args.device):
model = get_backbone(args.backbone).to(args.device)

2. RuntimeError: main thread is not in main loop
Tcl_AsyncDelete: async handler deleted by the wrong thread
In plot_logger.py, add:
import matplotlib
matplotlib.use('Agg')
before:
import matplotlib.pyplot as plt

Some small mistakes

Thanks for your work. I think the "nn.BatchNorm1d(hidden_dim)" in line 42 of models/simsiam.py should be changed to "nn.BatchNorm1d(out_dim)".

A strange error in the transform class: 'tuple' object is not callable

Traceback (most recent call last):
  File "main.py", line 94, in <module>
    main(args=get_args())
  File "main.py", line 68, in main
    for idx, ((images1, images2), _) in enumerate(p_bar):
  File "/mnt/users/miniconda3/envs/simsiam/lib/python3.8/site-packages/tqdm/std.py", line 1129, in __iter__
    for obj in iterable:
  File "/mnt/users/miniconda3/envs/simsiam/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/mnt/users/miniconda3/envs/simsiam/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
    data = self._dataset_fetcher.fetch(index) # may raise StopIteration
  File "/mnt/users/miniconda3/envs/simsiam/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/mnt/users/miniconda3/envs/simsiam/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/mnt/users/miniconda3/envs/simsiam/lib/python3.8/site-packages/torch/utils/data/dataset.py", line 272, in __getitem__
    return self.dataset[self.indices[idx]]
  File "/mnt/users/miniconda3/envs/simsiam/lib/python3.8/site-packages/torchvision/datasets/cifar.py", line 120, in __getitem__
    img = self.transform(img)
  File "/mnt/users/03_simsiam/augmentations/simsiam_aug.py", line 25, in __call__
    x1 = self.transform(x)
  File "/mnt/users/miniconda3/envs/simsiam/lib/python3.8/site-packages/torchvision/transforms/transforms.py", line 68, in __call__
    img = t(img)
TypeError: 'tuple' object is not callable

I hit this TypeError when I ran the command:
python main.py --debug --dataset cifar10 --data_dir my/data/folder/ --output_dir ./outputs
I printed the type of x at the line 'x1 = self.transform(x)', which may be causing this error:
<class 'PIL.Image.Image'>

linear_eval.py typo?

Hey,
I guess there is a typo in:

SimSiam/linear_eval.py

Line 43 in fe7963f

dataset=train_set,

It is supposed to be test_set, right?

For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

I want to train on a custom dataset, but I got multiple errors. I tried to solve them, but one error remains:
For debugging, consider passing CUDA_LAUNCH_BLOCKING=1.
I couldn't solve this, I ran the repository in Colab.

BYOL momentum update function is defined, but not used in the training script

SimSiam/models/byol.py

Line 84 in fca181e

def update_moving_average(self, global_step, max_steps):

I am trying to apply different contrastive learning algorithms to some other fields. I found your implementation is really elegant and useful. However, I found that the BYOL model will not call the momentum update function.

So, is it a bug, or have I missed something?

SWAV

Please implement SwAV.

Looking for PhD positions ...

Previously I was working on my graduate school applications.

I know there are several bugs in the code, and I feel guilty about not having been able to fix them. Now that I've finished all my applications, I will make the repo better!

Does anyone know of labs or professors that are looking for PhD students? My research interest is self-supervised learning (apparently). If you happen to know of such a position, please contact me!

AttributeError and NotImplementedError

Hello,

While running SimCLR, I get the error "AttributeError: Can not find momentum in namespace. Please write momentum in your config file(xxx.yaml)!"

Also, running BYOL produces the error
" File "/content/gdrive/My Drive/Competitor/SimSiam/models/byol.py", line 75, in __init__
raise NotImplementedError('Please put update_moving_average to training')
NotImplementedError: Please put update_moving_average to training"

Are there any workarounds for these issues?

Thank you.

For SimSiam, should evaluation be done after the backbone or after the encoder?

I checked linear_eval.py and noticed that only the backbone is imported during evaluation; this should be correct for SimCLR. However, I think the learned representation in SimSiam should be the p after the encoder, which includes the backbone and projector.
This might be the reason that others claim performance doesn't meet the original paper.

Implementation of Stopgrad

May I ask you a question: how did you implement stop-grad? I looked through the code but didn't find it. Thank you!
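For reference, the pseudocode in the SimSiam paper expresses the stop-gradient with `z.detach()`. The following is a minimal sketch of that idea, not this repository's code: the tensor shapes are arbitrary, and `predictor` is a hypothetical single linear layer standing in for the prediction MLP.

```python
import torch
import torch.nn.functional as F

def D(p, z):
    # Negative cosine similarity. z.detach() is the stop-gradient:
    # z is treated as a constant target, so no gradient flows into it.
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

torch.manual_seed(0)
# Stand-ins for the projector outputs of the two augmented views.
z1 = torch.randn(4, 8, requires_grad=True)
z2 = torch.randn(4, 8, requires_grad=True)
predictor = torch.nn.Linear(8, 8)  # hypothetical stand-in for the prediction MLP

# One half of the symmetrized loss: predict view 2's target from view 1.
p1 = predictor(z1)
loss_one_side = D(p1, z2)
loss_one_side.backward()

print(z1.grad is not None)  # gradient reaches z1 through the predictor
print(z2.grad is None)      # z2 was only used as a detached target
```

The full loss in the paper is symmetrized as D(p1, z2) / 2 + D(p2, z1) / 2, so each view serves once as the prediction input and once as the detached target.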

Normalization for different datasets not implemented?

First of all, thanks for providing this pytorch implementation.

If we look into the augmentations for each of the models (SimSiam, BYOL, etc.), it seems that they use the ImageNet dataset's mean and std dev, regardless of whether you're training on CIFAR10, CIFAR100, or another dataset:

SimSiam/augmentations/simsiam_aug.py

Line 8 in 01d7e78

imagenet_mean_std = [[0.485, 0.456, 0.406],[0.229, 0.224, 0.225]]

Is my understanding correct and should this implementation be corrected?
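If dataset-specific normalization is wanted, the per-channel statistics can be computed from the training set itself. The sketch below only illustrates the calculation: `channel_mean_std` is a hypothetical helper, and `toy_images` is made-up data standing in for real images whose pixels have been scaled to [0, 1].

```python
from statistics import fmean, pstdev

def channel_mean_std(images):
    """Return per-channel (means, stds) over all pixels of all images.

    images: list of images, each a list of (r, g, b) pixels in [0, 1].
    """
    means, stds = [], []
    for c in range(3):
        values = [pixel[c] for img in images for pixel in img]
        means.append(fmean(values))
        stds.append(pstdev(values))  # population std, as used for normalization
    return means, stds

# Toy stand-in for a dataset: two images with two pixels each.
toy_images = [
    [(0.0, 0.5, 1.0), (1.0, 0.5, 0.0)],
    [(0.5, 0.5, 0.5), (0.5, 0.5, 0.5)],
]
mean, std = channel_mean_std(toy_images)
print(mean, std)
```

The resulting lists could then replace imagenet_mean_std in the augmentation for that dataset, assuming the transform accepts them in the same [mean, std] layout.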

A mistake caused by batch norm layers

Initially I used z1, z2 = encoder(torch.cat([x1, x2])).chunk(2) to replace the two separate forward passes in SimSiam. However, I realized that the output does not match the original implementation in the paper. Here's a simplified example:

import torch, torchvision
x1 = torch.randn((2,3,224,224))
x2 = torch.randn_like(x1)
encoder = torchvision.models.resnet50()

z1, z2 = encoder(x1), encoder(x2)
print(z1,z2)
z1, z2 = encoder(torch.cat([x1, x2])).chunk(2)
print(z1,z2)

This gives different outputs for z1 and z2.

Then I switched the BN layers to inference mode (running statistics instead of batch statistics) using eval():

encoder.eval()
z1, z2 = encoder(x1), encoder(x2)
print(z1,z2)
z1, z2 = encoder(torch.cat([x1, x2])).chunk(2)
print(z1,z2)

The outputs are the same now!
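The difference comes from training-mode batch norm normalizing each input with the statistics of whatever batch it arrives in. A small pure-Python sketch of that effect, using a hypothetical scalar `batch_norm` without affine parameters:

```python
from statistics import fmean, pstdev

def batch_norm(batch, eps=1e-5):
    """Training-mode BN on a 1-D batch of scalars (no affine parameters)."""
    mu = fmean(batch)
    var = pstdev(batch) ** 2  # biased (population) variance, as BN uses
    return [(x - mu) / (var + eps) ** 0.5 for x in batch]

x1 = [0.0, 2.0]
x2 = [10.0, 14.0]

separate = batch_norm(x1) + batch_norm(x2)  # two forward passes
joint = batch_norm(x1 + x2)                 # one pass over the concatenation

print(separate)
print(joint)
```

Each view is centered on its own mean in the separate passes, while the joint pass centers both on the shared mean, so the two results differ. In eval() mode the layer uses fixed running statistics instead, which is why the outputs no longer depend on how the batch is composed.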

Can't achieve the accuracy in the paper with cifar10

I use kNN classification as a monitor during training. As shown in Figure D.1 in the paper, the accuracy is about 60% at the beginning and finally reaches 90%. I can't reach this accuracy; I only get a very low accuracy with the parameters mentioned in the paper.

If anyone can achieve the results in the paper, thank you very much for sharing some experimental details.

--resume not implemented

Need to resume training.
Followed example in linear_eval.py

but getting:

Resuming model from outputs/custom_small_resnet18/simsiam-custom_small-epoch300.pth
Epoch 0/500:   0%|          | 0/1263 [00:02<?, ?it/s]
Training:   0%|          | 0/500 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "main.py", line 139, in <module>
    main(args=get_args())
  File "main.py", line 101, in main
    loss = model.forward(images1.to(args.device), images2.to(args.device))
TypeError: forward() takes 2 positional arguments but 3 were given

How do I properly load a checkpoint and resume training?
Thanks !

Loss collapse

I am trying to pretrain the SimSiam model on the MS COCO dataset, but the loss collapses to -1 very quickly. What are the possible reasons, and are there any suggestions for fixing it?

Attribute not found for ResNet

Hi,

I installed the required version of torch and torchvision but still got "torch.nn.modules.module.ModuleAttributeError: 'ResNet' object has no attribute 'output_dim'".

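Model collapse

I am using SimSiam with ResNet18 as the backbone on CIFAR-10, with batch size 128, base lr 0.06, and a 50-epoch warmup. By epoch 30 the loss has dropped to -1 and the accuracy is only 10%, whereas the accuracy was around 30% during the first five epochs. After that the loss keeps decreasing but the accuracy drops. What could be the cause?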

How can I use SimSiam for classification tasks?

Can I add a classification loss to each branch, or to one of the branches? Will the result be good?
If p and z have different dimensions, how can I change them to have the same dimension and then calculate the D(p, z) loss?

package yaml in python3.x is called pyyaml

As title :D

Default settings of gaussian blur described in the paper is unclear

Color augmentation is ColorJitter with {brightness, contrast, saturation, hue} strength of {0.4, 0.4, 0.4, 0.1} with an applying probability of 0.8, and RandomGrayscale with an applying probability of 0.2. Blurring augmentation [8] has a Gaussian kernel with std in [0.1, 2.0].

They didn't state the probability of applying Gaussian blur. It just doesn't make sense to apply Gaussian blur to both augmentations: in training the model would only see blurred images, but in testing the blur is removed. This will definitely hurt the generalization ability of the model. I will use the default Gaussian blur probability from SimCLR instead!

Will the current performance on each dataset be published?

It would be nice if you could publish the results of your implementation on datasets such as CIFAR or STL10. This would bring more focus and attention to the work.

No matching distribution found for torch==1.7.0

When I run pip install -r .\requirements.txt, I get the error message: No matching distribution found for torch==1.7.0. Should I update torch to 1.7.1?

A solved problem during parallel training.

When I used multiple GPUs for training, an error occurred.

Traceback (most recent call last):
  File "main.py", line 73, in <module>
    main(args=get_args())
  File "main.py", line 51, in main
    loss = model.forward(images1.to(args.device), images2.to(args.device))
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
AssertionError: Caught AssertionError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data1/fengry/vcm/comfea/MySimSiam-0.1.0/model.py", line 94, in forward
    z1, z2 = f(x1), f(x2)
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torchvision/models/resnet.py", line 220, in forward
    return self._forward_impl(x)
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torchvision/models/resnet.py", line 204, in _forward_impl
    x = self.bn1(x)
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 519, in forward
    world_size = torch.distributed.get_world_size(process_group)
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 625, in get_world_size
    return _get_group_size(group)
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 220, in _get_group_size
    _check_default_pg()
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 210, in _check_default_pg
    assert _default_pg is not None, \
AssertionError: Default process group is not initialized

Then, in main.py, I added
torch.distributed.init_process_group('gloo', init_method='file:///tmp/somefile', rank=0, world_size=1)
before
if torch.cuda.device_count() > 1: model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
and added
loss = loss.mean()
before
loss.backward()

Everything goes well now.

When I run main.py on my own data, I find that the loss is more than 0.9 but the accuracy is only 0.7. Do you have any good suggestions?

simsiam

loss < 0?
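For context, the SimSiam paper defines the loss as a negative cosine similarity, so negative values are expected and -1 is its minimum. A small pure-Python sketch of that range (`neg_cosine` is a hypothetical helper written for illustration, not a function from this repository):

```python
import math

def neg_cosine(p, z):
    """Negative cosine similarity between two vectors, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(p, z))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_z = math.sqrt(sum(b * b for b in z))
    return -dot / (norm_p * norm_z)

aligned = neg_cosine([1.0, 2.0], [2.0, 4.0])      # same direction: about -1
orthogonal = neg_cosine([1.0, 0.0], [0.0, 1.0])   # unrelated directions: 0
opposite = neg_cosine([1.0, 2.0], [-1.0, -2.0])   # opposite direction: about +1

print(aligned, orthogonal, opposite)
```

A loss that approaches -1 therefore means the predictions and targets are almost perfectly aligned. That is the goal of training, but if it happens very early while kNN or linear accuracy stays near chance, it may instead indicate that all outputs have collapsed to a constant vector.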

performance on simclr

Hi @PatrickHua, thank you for your implementation.
I ran the SimCLR code with the default parameters, except that I set the momentum to 0.9, since it is missing from simclr_cifar.yaml.
After 100 epochs, I got 55.57% accuracy. That seems much lower than in the paper. What's yours? Is there something wrong with my settings?

Strange errors when running cifar_experiment.sh

The OS is Ubuntu 18.04. The environment is in the conda environment as indicated with all required dependencies in requirements.txt installed.

The script in the debug mode runs well. However, when I ran:

sh configs/cifar_experiment.sh

A strange error happened during the evaluation time:

Training: 100%|██████████| 800/800 [6:17:14<00:00, 28.29s/it, epoch=799, loss_avg=-.878]

Evaluating:   0%|          | 0/30 [00:00<?, ?it/s]Model saved to outputs/cifar10_experiment/simsiam-cifar10-epoch800.pth
Files already downloaded and verified

Evaluating:   0%|          | 0/30 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "main.py", line 116, in <module>
    main(args=get_args())
  File "main.py", line 113, in main
    linear_eval(args, backbone)
  File "/home/yl764/SimSiam/SimSiam/linear_eval.py", line 109, in main
    feature = model(images.to(args.device))
  File "/home/yl764/miniconda3/envs/simsiam/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/yl764/miniconda3/envs/simsiam/lib/python3.8/site-packages/torchvision/models/resnet.py", line 220, in forward
    return self._forward_impl(x)
  File "/home/yl764/miniconda3/envs/simsiam/lib/python3.8/site-packages/torchvision/models/resnet.py", line 203, in _forward_impl
    x = self.conv1(x)
  File "/home/yl764/miniconda3/envs/simsiam/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/yl764/miniconda3/envs/simsiam/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 423, in forward
    return self._conv_forward(input, self.weight)
  File "/home/yl764/miniconda3/envs/simsiam/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 419, in _conv_forward
    return F.conv2d(input, weight, self.bias, self.stride,
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

Thanks!

What is the difference between 'num_epochs' and 'stop_at_epoch'?

Why aren't these two parameters equivalent?

Is "DistributedSampler" necessary?

Hello,
I find that there is no "DistributedSampler" in the code.
Is it a normal setting?
I think this setting would make the model see the same data twice (or as many times as there are GPUs) in a single epoch, because shuffle is True.
I'm not sure if this is normal.
Thank you very much.
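Whether this matters depends on how the GPUs are driven. With single-process nn.DataParallel, one DataLoader produces each batch and the batch is split across GPUs, so no sample is repeated. With DistributedDataParallel, every process has its own DataLoader, and a sampler is needed to give each rank a different shard. The sketch below illustrates the sharding idea in pure Python; `shard_indices` is a hypothetical helper and only approximates what torch.utils.data.DistributedSampler does.

```python
import random

def shard_indices(dataset_len, world_size, rank, epoch, seed=0):
    """Return the sample indices that one rank should visit in one epoch."""
    # Every rank shuffles with the same seed, so all ranks agree on the order.
    rng = random.Random(seed + epoch)
    indices = list(range(dataset_len))
    rng.shuffle(indices)
    # Pad by repeating leading indices so the total divides evenly across ranks.
    total = -(-dataset_len // world_size) * world_size
    indices += indices[: total - dataset_len]
    # Each rank takes a strided slice, so the shards do not overlap.
    return indices[rank:total:world_size]

shards = [shard_indices(10, world_size=2, rank=r, epoch=0) for r in range(2)]
print(shards)
```

Without such a partition, each DDP process with shuffle=True would iterate over the full dataset independently, which is the duplication described above.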

Backbone setting

SimSiam uses a backbone such as ResNet50.
Does the ResNet50 backbone, as part of the encoder, include the final fc layer, or does it stop at the global average pooling layer?
I found that the whole network is used as the encoder in the models/simsiam.py file.

Loss close to -1 at the beginning of training

Has anyone met the problem that the loss is close to -1 at the beginning of training? BTW, the training data does not come from a traditional classification dataset like ImageNet or CIFAR.

Close

About the transfer learning experiment section of the code

Thanks for your work. Could you please tell me the code for the transfer learning experiment section?
Looking forward to your reply.

There should be a stop gradient in the simsiam model

Hello,

I was using your implementation of SimSiam for contrastive learning. I noticed that the model that you have created has a few problems:

The "stop_gradient" part of the network is absent from your implementation. This model is effectively training both paths.

Could you please clarify how and where you are taking care of it?

Consumes a lot more GPU memory than standard?

Hi, I used a ResNet34 backbone to train on (1, 128, 128) images with a batch size of 128. The total allocated memory is >35GB. According to the post, a ResNet50 on (3,256,256) images with a batch size of 96 only consumes 10GB. I am wondering if anyone else experiences the same issue and if there is any clue as to why this network takes such a lot of memory.

Are cifar_resnet_1.py and cifar_resnet_2.py the same?

Dear author:

 Are the following cifar_resnet_1.py and cifar_resnet_2.py the same? If not, what's the difference between them?

   + https://github.com/PatrickHua/SimSiam/blob/main/models/backbones/cifar_resnet_1.py
   + https://github.com/PatrickHua/SimSiam/blob/main/models/backbones/cifar_resnet_2.py

And are they the same as the network used in MoCo_cifar_10_demo ?
    + https://colab.research.google.com/github/facebookresearch/moco/blob/colab-notebook/colab/moco_cifar10_demo.ipynb

The code error in logger

Hi all, I am doing research for SimSiam with this implementation.
The framework of this backbone is really good and easy to extend, so I tried to run SimSiam on the ImageNet and CUB200 datasets.
However, to reproduce the experimental results in the paper, I had to run the model with resnet50/CUB200 with batch size =256, and I got an OOM problem.
Then I ran the model on two GPUs, but the code complains that there are some errors.

Can anyone help me out with this issue?
The code works fine when I only use one GPU
Thanks.
The command I used was:
CUDA_VISICE_DEVICES=0,1 python /root/SimSiam/main.py --data_dir /root/CUB_200_2011/ --log_dir ../logs/ -c configs/simsiam_cub200.yaml --ckpt_dir ~/.cache --hide_progress