patrickhua / simsiam
A PyTorch implementation of the paper 'Exploring Simple Siamese Representation Learning'
License: MIT License
Why are training and testing so slow?
To the best of my knowledge, SyncBatchNorm is only supported with DDP (DistributedDataParallel), not DataParallel.
First of all, I want to thank you for this great project! I am a PhD student in the field of deep learning and would really like to include your implementation in my experiments. Unfortunately, what stops me from doing so is that you have not provided a license yet. Would it be possible for you to add a license to this project, such as the MIT license? I would greatly appreciate that and would, of course, properly cite your work.
Can you share pretrain weights for SimSiam?
1. RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same.
The code at lines 50 and 59 of linear_eval.py should add .to(device):
model = get_backbone(args.backbone).to(args.device)
2. RuntimeError: main thread is not in main loop
Tcl_AsyncDelete: async handler deleted by the wrong thread
In plot_logger.py, add:
import matplotlib
matplotlib.use('Agg')
before:
import matplotlib.pyplot as plt
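For the first error above, here is a minimal sketch of keeping a model and its inputs on the same device. `TinyBackbone` is a hypothetical stand-in for whatever `get_backbone(args.backbone)` returns, and the tensor shapes are made up for illustration.

```python
import torch
import torch.nn as nn

# Fall back to CPU so the sketch also runs on machines without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class TinyBackbone(nn.Module):
    """Hypothetical stand-in for the repo's backbone."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        return self.pool(self.conv(x)).flatten(1)


model = TinyBackbone().to(device)      # weights are moved to `device`
images = torch.randn(4, 3, 32, 32)     # freshly created tensors live on the CPU
features = model(images.to(device))    # inputs must be moved to the same device

print(features.shape, features.device)
```

The error message names both sides of the mismatch, so it shows which of the two (input or weights) was left behind on the CPU.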
Thanks for your work. I think the "nn.BatchNorm1d(hidden_dim)" in line 42 of models/simsiam.py should be changed to "nn.BatchNorm1d(out_dim)".
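To illustrate the point, here is a sketch of a three-layer projection head in which every BatchNorm1d matches the width of the Linear layer before it. The function name and the dimensions are assumptions for illustration, not the repo's actual code. When hidden_dim and out_dim are equal, the two spellings build the same layer, which may be why the mismatch goes unnoticed.

```python
import torch
import torch.nn as nn


def projection_mlp(in_dim, hidden_dim, out_dim):
    """Projection head; each BatchNorm1d matches the Linear layer before it."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim), nn.BatchNorm1d(hidden_dim), nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, hidden_dim), nn.BatchNorm1d(hidden_dim), nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim), nn.BatchNorm1d(out_dim),  # out_dim, not hidden_dim
    )


# Deliberately different widths so a wrong BatchNorm size could not hide.
head = projection_mlp(in_dim=512, hidden_dim=256, out_dim=128)
z = head(torch.randn(8, 512))
print(z.shape)
```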
Traceback (most recent call last):
File "main.py", line 94, in <module>
main(args=get_args())
File "main.py", line 68, in main
for idx, ((images1, images2), _) in enumerate(p_bar):
File "/mnt/users/miniconda3/envs/simsiam/lib/python3.8/site-packages/tqdm/std.py", line 1129, in __iter__
for obj in iterable:
File "/mnt/users/miniconda3/envs/simsiam/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
data = self._next_data()
File "/mnt/users/miniconda3/envs/simsiam/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/mnt/users/miniconda3/envs/simsiam/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/mnt/users/miniconda3/envs/simsiam/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/mnt/users/miniconda3/envs/simsiam/lib/python3.8/site-packages/torch/utils/data/dataset.py", line 272, in __getitem__
return self.dataset[self.indices[idx]]
File "/mnt/users/miniconda3/envs/simsiam/lib/python3.8/site-packages/torchvision/datasets/cifar.py", line 120, in __getitem__
img = self.transform(img)
File "/mnt/users/03_simsiam/augmentations/simsiam_aug.py", line 25, in __call__
x1 = self.transform(x)
File "/mnt/users/miniconda3/envs/simsiam/lib/python3.8/site-packages/torchvision/transforms/transforms.py", line 68, in __call__
img = t(img)
TypeError: 'tuple' object is not callable
I hit this TypeError when running the command:
python main.py --debug --dataset cifar10 --data_dir my/data/folder/ --output_dir ./outputs
I printed the type of x at the line 'x1 = self.transform(x)', which may be where the error comes from:
<class 'PIL.Image.Image'>
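The traceback says that one step inside the transform pipeline is a tuple rather than a callable. The input x being a PIL image is what a transform expects. One common way to end up with a tuple is a stray trailing comma after a transform. Whether that is the cause here cannot be confirmed from the traceback alone. The sketch below reproduces the failure mode with plain functions, and all names in it are hypothetical.

```python
# Plain functions stand in for transform objects; all names are hypothetical.
def to_upper(s):
    return s.upper()


def add_bang(s):
    return s + "!"


good_step = to_upper
bad_step = to_upper,   # stray trailing comma: this is the tuple (to_upper,)


def compose(steps, x):
    """Apply each step in order, like a Compose-style pipeline."""
    for step in steps:
        x = step(x)
    return x


print(compose([good_step, add_bang], "ok"))

try:
    compose([bad_step, add_bang], "ok")
except TypeError as err:
    error_message = str(err)
    print(error_message)
```

Printing the type of every step in the pipeline, rather than the type of the input, is a quick way to find the offending entry.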
I want to train on a custom dataset, but I ran into multiple errors while trying to get it working. One of them is:
For debugging, consider passing CUDA_LAUNCH_BLOCKING=1.
I couldn't solve this. I ran the repository in Colab.
Line 84 in fca181e
I am trying to apply different contrastive learning algorithms to some other fields. I found your implementation really elegant and useful. However, I found that the BYOL model never calls the momentum update function.
So is this a bug, or have I missed something?
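For reference, here is a minimal sketch of a BYOL-style momentum (exponential moving average) update, which the training loop would call after each optimizer step. The function name follows the `update_moving_average` mentioned in the repo's error message. The signature, the value of `tau`, and the toy `nn.Linear` networks are assumptions.

```python
import copy

import torch
import torch.nn as nn


@torch.no_grad()
def update_moving_average(online, target, tau=0.99):
    """target <- tau * target + (1 - tau) * online, parameter by parameter."""
    for p_online, p_target in zip(online.parameters(), target.parameters()):
        p_target.mul_(tau).add_(p_online, alpha=1.0 - tau)


online = nn.Linear(4, 2)
target = copy.deepcopy(online)
for p in target.parameters():
    p.requires_grad_(False)   # the target network is never trained by backprop

# Pretend one optimizer step changed the online weights.
with torch.no_grad():
    online.weight.add_(1.0)

before = target.weight.clone()
update_moving_average(online, target, tau=0.99)
print((target.weight - before).abs().max())
```

If this update is never called, the target network stays frozen at its initial weights.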
Please implement SwAV.
Previously I was working on my graduate school application.
I know there are several bugs in the code, and I feel guilty for not being able to fix them. Now that I've finished all my applications, I will make the repo better!
Does anyone know of labs or professors that are looking for PhD students? My research interest is self-supervised learning (apparently). If you happen to know of such a position, please contact me!
Hello,
While running SimCLR, I get the error "AttributeError: Can not find momentum in namespace. Please write momentum in your config file(xxx.yaml)!"
Also, running BYOL produces the error
" File "/content/gdrive/My Drive/Competitor/SimSiam/models/byol.py", line 75, in __init__
raise NotImplementedError('Please put update_moving_average to training')
NotImplementedError: Please put update_moving_average to training"
Are there any workarounds for these issues?
Thank you.
I checked linear_eval.py and noticed that only the backbone is imported during evaluation; this should be correct for SimCLR. However, I think the learned representation in SimSiam should be the p after the encoder, which includes the backbone and projector.
This might be the reason that others claim performance doesn't meet the original paper.
May I ask you a question: how did you implement stop-grad? I looked through the code but couldn't find it. Thank you!
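For context, the paper's pseudocode expresses stop-gradient by calling `z.detach()` inside the loss. Below is a minimal sketch of that idea, not necessarily how this repo implements it. The random tensors are hypothetical stand-ins for the predictor outputs (p) and projector outputs (z) of two views.

```python
import torch
import torch.nn.functional as F


def neg_cosine(p, z):
    """D(p, z): negative cosine similarity, with z treated as a constant."""
    z = z.detach()   # stop-gradient: no gradient flows back through z
    return -F.cosine_similarity(p, z, dim=-1).mean()


# Hypothetical outputs for two augmented views.
p1 = torch.randn(8, 16, requires_grad=True)
p2 = torch.randn(8, 16, requires_grad=True)
z1 = torch.randn(8, 16, requires_grad=True)
z2 = torch.randn(8, 16, requires_grad=True)

# Symmetrized loss: each view's prediction is compared with the other view's target.
loss = neg_cosine(p1, z2) / 2 + neg_cosine(p2, z1) / 2
loss.backward()

print(p1.grad is not None, z1.grad is None)
```

Because of `detach()`, only the `p` branches receive gradients, so in a codebase the stop-gradient often shows up as a single `.detach()` call in the loss function rather than as a separate module.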
First of all, thanks for providing this pytorch implementation.
If we look into the augmentations for each of the models (SimSiam, BYOL, etc.), it seems that they use the ImageNet dataset's mean and std dev, regardless of whether you're training on CIFAR10, CIFAR100, or others (see SimSiam/augmentations/simsiam_aug.py, line 8 in 01d7e78).
Is my understanding correct and should this implementation be corrected?
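If dataset-specific statistics are wanted, they can be computed directly from the training images. The sketch below uses random data as a stand-in, and `channel_stats` is a hypothetical helper. It makes no claim about whether switching statistics changes the results.

```python
import torch


def channel_stats(images):
    """Per-channel mean and std for a float tensor shaped (N, C, H, W) in [0, 1]."""
    mean = images.mean(dim=(0, 2, 3))
    std = images.std(dim=(0, 2, 3))
    return mean, std


# Random stand-in data; a real run would stack the training images instead.
fake_images = torch.rand(64, 3, 32, 32)
mean, std = channel_stats(fake_images)
print(mean, std)
```

The resulting two triples would replace the mean and std passed to the normalization transform.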
Initially I used
z1, z2 = encoder(torch.cat([x1, x2])).chunk(2)
to replace the two separate forward passes in SimSiam. However, I realized that the outputs do not match those of the original implementation in the paper. Here's a simplified example:
import torch, torchvision
x1 = torch.randn((2,3,224,224))
x2 = torch.randn_like(x1)
encoder = torchvision.models.resnet50()
z1, z2 = encoder(x1), encoder(x2)
print(z1,z2)
z1, z2 = encoder(torch.cat([x1, x2])).chunk(2)
print(z1,z2)
This gives different outputs for z1 and z2.
Then I put the BatchNorm layers in evaluation mode using eval():
encoder.eval()
z1, z2 = encoder(x1), encoder(x2)
print(z1,z2)
z1, z2 = encoder(torch.cat([x1, x2])).chunk(2)
print(z1,z2)
The outputs are the same now!
I use kNN classification as a monitor during training. As shown in Figure D.1 of the paper, the accuracy is about 60% at the beginning and eventually reaches about 90%. I can't reach this accuracy; I only get a very low accuracy with the parameters mentioned in the paper.
If anyone has reproduced the results in the paper, I would be very grateful if you could share some experimental details.
Need to resume training.
Followed example in linear_eval.py
but getting:
Resuming model from outputs/custom_small_resnet18/simsiam-custom_small-epoch300.pth
Epoch 0/500: 0%| | 0/1263 [00:02<?, ?it/s]
Training: 0%| | 0/500 [00:02<?, ?it/s]
Traceback (most recent call last):
File "main.py", line 139, in <module>
main(args=get_args())
File "main.py", line 101, in main
loss = model.forward(images1.to(args.device), images2.to(args.device))
TypeError: forward() takes 2 positional arguments but 3 were given
How do I properly load a checkpoint and resume training?
Thanks !
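The error suggests that the object being trained after loading accepts a single input, as a bare backbone would, while the training loop passes two views. linear_eval.py restores only the backbone, so following it would produce this mismatch; that reading is a guess from the traceback. Here is a generic sketch of saving and restoring the full training state. The checkpoint keys and the `nn.Linear` stand-in model are assumptions, not the repo's actual format.

```python
import io

import torch
import torch.nn as nn

model = nn.Linear(4, 2)   # stand-in for the full SimSiam model, not just the backbone
optimizer = torch.optim.SGD(model.parameters(), lr=0.03, momentum=0.9)

# Save everything needed to continue training; the key names are assumptions.
buffer = io.BytesIO()     # a file path works the same way
torch.save(
    {"epoch": 300, "state_dict": model.state_dict(), "optimizer": optimizer.state_dict()},
    buffer,
)

# Resume: rebuild the same objects, then restore their state.
buffer.seek(0)
checkpoint = torch.load(buffer, map_location="cpu")
resumed_model = nn.Linear(4, 2)
resumed_optimizer = torch.optim.SGD(resumed_model.parameters(), lr=0.03, momentum=0.9)
resumed_model.load_state_dict(checkpoint["state_dict"])
resumed_optimizer.load_state_dict(checkpoint["optimizer"])
start_epoch = checkpoint["epoch"]
print(start_epoch)
```

The learning-rate schedule would also need to be advanced to `start_epoch` so that training continues from the right point.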
I am trying to pretrain the SimSiam model on the MS COCO dataset, but the loss collapses to -1 very quickly. What are the possible reasons, and do you have any suggestions for solving this?
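A loss that reaches -1 almost immediately is consistent with collapsed (constant) outputs. The paper checks for collapse by measuring the per-dimension standard deviation of the l2-normalized outputs, which sits near 1/sqrt(d) for well-spread outputs and near zero under collapse. The sketch below uses synthetic tensors in place of real model outputs, and `output_std` is a hypothetical helper name.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)


def output_std(z):
    """Mean per-dimension std of the l2-normalized outputs z, shaped (N, d)."""
    z = F.normalize(z, dim=1)
    return z.std(dim=0).mean().item()


d = 128
healthy = torch.randn(512, d)                                  # roughly isotropic outputs
collapsed = torch.ones(512, d) + 1e-4 * torch.randn(512, d)    # nearly constant outputs

print(output_std(healthy), 1 / math.sqrt(d))
print(output_std(collapsed))
```

Logging this value alongside the loss during training makes it easier to tell a good minimum from a collapsed one.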
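I am using SimSiam with ResNet18 as the backbone on CIFAR-10, with batch size 128, base lr 0.06, and 50 warmup epochs. By epoch 30 the loss has dropped to -1 and the accuracy is only 10%, whereas in the first five epochs the accuracy was around 30%. After that the loss keeps decreasing but the accuracy goes down. What could be the reason?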
As title :D
Color augmentation is ColorJitter with {brightness, contrast, saturation, hue} strength of {0.4, 0.4, 0.4, 0.1} with an applying probability of 0.8, and RandomGrayscale with an applying probability of 0.2. Blurring augmentation [8] has a Gaussian kernel with std in [0.1, 2.0].
They didn't state the probability of applying Gaussian blur. It just doesn't make sense to apply Gaussian blur to both augmentations, because then the model only sees blurred images during training, while at test time the blur is removed. This will definitely hurt the generalization ability of the model. I will use the default Gaussian blur probability from SimCLR instead!
It would be nice if you could publish your implementation's results for datasets such as CIFAR or STL10. This would bring more attention to the work.
When I run pip install -r .\requirements.txt, I get an error message saying: No matching distribution found for torch==1.7.0. Should I update torch to 1.7.1?
When I used multiple GPUs for training, an error occurred:
Traceback (most recent call last):
File "main.py", line 73, in <module>
main(args=get_args())
File "main.py", line 51, in main
loss = model.forward(images1.to(args.device), images2.to(args.device))
File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/_utils.py", line 428, in reraise
raise self.exc_type(msg)
AssertionError: Caught AssertionError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/data1/fengry/vcm/comfea/MySimSiam-0.1.0/model.py", line 94, in forward
z1, z2 = f(x1), f(x2)
File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
input = module(input)
File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torchvision/models/resnet.py", line 220, in forward
return self._forward_impl(x)
File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torchvision/models/resnet.py", line 204, in _forward_impl
x = self.bn1(x)
File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 519, in forward
world_size = torch.distributed.get_world_size(process_group)
File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 625, in get_world_size
return _get_group_size(group)
File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 220, in _get_group_size
_check_default_pg()
File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 210, in _check_default_pg
assert _default_pg is not None, \
AssertionError: Default process group is not initialized
Then, in main.py, I added
torch.distributed.init_process_group('gloo', init_method='file:///tmp/somefile', rank=0, world_size=1)
before
if torch.cuda.device_count() > 1: model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
and added
loss = loss.mean()
before
loss.backward()
Everything works now.
When I run main.py on my own data, the loss is more than 0.9 but the accuracy is only 0.7. Do you have any suggestions?
loss < 0?
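The SimSiam loss is a negative cosine similarity, so it lies in [-1, 1] and is expected to go below zero as the two views' outputs align. Here is a small pure-Python sketch with made-up vectors:

```python
import math


def neg_cosine(p, z):
    """Negative cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(p, z))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_z = math.sqrt(sum(b * b for b in z))
    return -dot / (norm_p * norm_z)


print(neg_cosine([1.0, 2.0], [2.0, 4.0]))     # same direction: close to -1
print(neg_cosine([1.0, 0.0], [0.0, 3.0]))     # orthogonal: 0
print(neg_cosine([1.0, 2.0], [-1.0, -2.0]))   # opposite direction: close to +1
```

A value near -1 therefore means the predictions and targets point in nearly the same direction. It does not by itself say whether the representation is useful.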
Hi, @PatrickHua , thank you for your implementation.
I ran the code of simclr with default parameters, except for setting the momentum as 0.9, since it's missing in the simclr_cifar.yaml.
After 100 epochs, I got an accuracy of 55.57%, which seems much lower than the one in the paper. What's yours? Is there something wrong with my settings?
The OS is Ubuntu 18.04. I am using a conda environment with all the required dependencies from requirements.txt installed.
The script in the debug mode runs well. However, when I ran:
sh configs/cifar_experiment.sh
A strange error happened during the evaluation time:
Training: 100%|██████████| 800/800 [6:17:14<00:00, 28.29s/it, epoch=799, loss_avg=-.878]
Evaluating: 0%| | 0/30 [00:00<?, ?it/s]Model saved to outputs/cifar10_experiment/simsiam-cifar10-epoch800.pth
Files already downloaded and verified
Evaluating: 0%| | 0/30 [00:00<?, ?it/s]
Traceback (most recent call last):
File "main.py", line 116, in <module>
main(args=get_args())
File "main.py", line 113, in main
linear_eval(args, backbone)
File "/home/yl764/SimSiam/SimSiam/linear_eval.py", line 109, in main
feature = model(images.to(args.device))
File "/home/yl764/miniconda3/envs/simsiam/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/yl764/miniconda3/envs/simsiam/lib/python3.8/site-packages/torchvision/models/resnet.py", line 220, in forward
return self._forward_impl(x)
File "/home/yl764/miniconda3/envs/simsiam/lib/python3.8/site-packages/torchvision/models/resnet.py", line 203, in _forward_impl
x = self.conv1(x)
File "/home/yl764/miniconda3/envs/simsiam/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/yl764/miniconda3/envs/simsiam/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 423, in forward
return self._conv_forward(input, self.weight)
File "/home/yl764/miniconda3/envs/simsiam/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 419, in _conv_forward
return F.conv2d(input, weight, self.bias, self.stride,
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
Thanks!
Why aren't these two parameters equivalent?
Hello,
I find that there is no "DistributedSampler" in the code.
Is it a normal setting?
I think this setting would make the model see the same data twice (or as many times as there are GPUs) in a single epoch, because shuffle is True.
I'm not sure if this is normal.
Thank you very much.
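For context, a sampler that shards the dataset is needed when every GPU runs its own process and data loader, as with DistributedDataParallel. With nn.DataParallel, a single process loads each batch and splits it across the GPUs, so no sample is repeated within an epoch. The sketch below shows how `DistributedSampler` would split indices if the script were moved to DDP. The toy dataset and `world_size` are assumptions.

```python
from torch.utils.data import DistributedSampler

dataset = list(range(10))   # any object with __len__ is enough for index sharding
world_size = 2              # hypothetical number of processes, one per GPU

# Passing num_replicas and rank explicitly avoids needing an initialized process group.
shards = [
    list(DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=False))
    for rank in range(world_size)
]
print(shards)
```

With shuffle=True, calling `sampler.set_epoch(epoch)` at the start of each epoch gives each epoch a different ordering.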
SimSiam uses a backbone such as ResNet50.
Does the ResNet50 backbone, when used as an encoder, include the final fc layer, or does it stop at the global average pooling layer?
I found that the whole network is used as the encoder in models/simsiam.py.
Has anyone run into the problem of the loss being close to -1 at the very beginning of training? By the way, the training data does not come from a traditional classification dataset like ImageNet or CIFAR.
Close
Thanks for your work. Could you please share the code for the transfer learning experiments?
Looking forward to your reply.
Hello,
I was using your implementation of SimSiam for contrastive learning. I noticed that the model that you have created has a few problems:
Could you please clarify how and where you are taking care of it?
Hi, I used a ResNet34 backbone to train on (1, 128, 128) images with a batch size of 128. The total allocated memory is >35GB. According to the post, a ResNet50 on (3,256,256) images with a batch size of 96 only consumes 10GB. I am wondering if anyone else experiences the same issue and if there is any clue as to why this network takes such a lot of memory.
Dear author:
Are the following cifar_resnet_1.py and cifar_resnet_2.py the same? If not, what's the difference between them?
+ https://github.com/PatrickHua/SimSiam/blob/main/models/backbones/cifar_resnet_1.py
+ https://github.com/PatrickHua/SimSiam/blob/main/models/backbones/cifar_resnet_2.py
And are they the same as the network used in MoCo_cifar_10_demo ?
+ https://colab.research.google.com/github/facebookresearch/moco/blob/colab-notebook/colab/moco_cifar10_demo.ipynb
Hi all, I am doing research for SimSiam with this implementation.
The framework of this codebase is really good and easy to extend, so I tried to run SimSiam on the ImageNet and CUB200 datasets.
However, to reproduce the experimental results in the paper, I had to run the model with resnet50/CUB200 with batch size =256, and I got an OOM problem.
Then I ran the model on two GPUs, but the code complains that there are some errors.
Can anyone help me out with this issue?
The code works fine when I only use one GPU
Thanks.
The command I used was:
CUDA_VISICE_DEVICES=0,1 python /root/SimSiam/main.py --data_dir /root/CUB_200_2011/ --log_dir ../logs/ -c configs/simsiam_cub200.yaml --ckpt_dir ~/.cache --hide_progress