hobbitlong / cmc
[ECCV 2020] "Contrastive Multiview Coding", also contains implementations for MoCo and InstDis
License: BSD 2-Clause "Simplified" License
Does the torchvision version affect the experimental setting, and which version should be used for the experiments?
Hi @HobbitLong, thanks for your great work and for sharing the code. I guess ImageNet-100 is not a conventional subset, so I wonder if you could share the class list, since we also don't have enough resources to run on the full ImageNet.
Hi, thank you for sharing the code! I am curious about the effect of the data augmentation, concretely the RandomResizedCrop in train_moco_ins.py.
In your code, the minimum crop scale is 0.2 for most configurations but 0.08 for the full ImageNet dataset with ResNet. However, other papers, such as non-parametric instance discrimination, also set this parameter to 0.2 when using ResNet as the backbone. So I am curious about the choice of 0.08 (the torchvision default). Does this smaller scale work better on full ImageNet? Have you compared 0.08 and 0.2 on ImageNet with a ResNet backbone?
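To make the two settings concrete, here is a minimal pure-Python sketch of how the minimum scale bounds the sampled crop area. It assumes the sampling scheme documented for torchvision's RandomResizedCrop (area fraction drawn uniformly from the scale range, aspect ratio drawn log-uniformly from [3/4, 4/3]). The helpers sample_crop_size and min_area_fraction are hypothetical names for illustration, and the fallback is simplified to the whole image.

```python
import math
import random


def sample_crop_size(width, height, scale=(0.2, 1.0), ratio=(3 / 4, 4 / 3),
                     rng=random, max_tries=10):
    """Sample a (w, h) crop size; hypothetical helper, not the torchvision code."""
    area = width * height
    log_ratio = (math.log(ratio[0]), math.log(ratio[1]))
    for _ in range(max_tries):
        target_area = area * rng.uniform(scale[0], scale[1])
        aspect = math.exp(rng.uniform(log_ratio[0], log_ratio[1]))
        w = int(round(math.sqrt(target_area * aspect)))
        h = int(round(math.sqrt(target_area / aspect)))
        if 0 < w <= width and 0 < h <= height:
            return w, h
    # Simplified fallback (an assumption): use the whole image.
    return width, height


def min_area_fraction(scale_min, n=2000, seed=0):
    """Smallest crop-area fraction seen over n samples on a 224x224 image."""
    rng = random.Random(seed)
    fractions = []
    for _ in range(n):
        w, h = sample_crop_size(224, 224, scale=(scale_min, 1.0), rng=rng)
        fractions.append(w * h / (224 * 224))
    return min(fractions)
```

With a minimum scale of 0.08 the smallest crops cover well under a fifth of the image, while 0.2 keeps every crop at roughly a fifth or more. Which of the two works better for a given backbone is an empirical question this sketch does not answer.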
Line 189 in 0f72b18
This one-time estimation is problematic, especially if the dictionary is not random noise. Computing Z as a moving average of this would give a more reasonable result.
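The suggestion can be sketched as follows. This is an illustration only: it assumes Z is estimated from one batch as the mean of exp(score / T) times the dataset size (the referenced line is not shown here), and RunningZ with its momentum value is a hypothetical helper, not part of the repo.

```python
import math


def estimate_z(scores, n_data, temperature=0.07):
    """One-batch estimate: mean of exp(score / T) times the dataset size."""
    exps = [math.exp(s / temperature) for s in scores]
    return n_data * sum(exps) / len(exps)


class RunningZ:
    """Exponential moving average of per-batch Z estimates (hypothetical helper)."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.value = None

    def update(self, batch_estimate):
        # The first call seeds the average; later calls blend in new estimates.
        if self.value is None:
            self.value = batch_estimate
        else:
            self.value = (self.momentum * self.value
                          + (1 - self.momentum) * batch_estimate)
        return self.value
```

Calling update(estimate_z(...)) once per batch would let Z track the dictionary as it moves away from its initial state, instead of freezing the value seen on the first batch.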
Hi,
thanks for your repo.
It would be nice if you could also provide the code / input pipeline you used to run the NYU RGB-D experiments (similar to #4). It is not entirely clear to me how you added the different modalities.
Best,
Hi,
Would you please share the subset of ImageNet (ImageNet100) you used?
I want to train the MoCo model and compare it with your results!
Thanks!
Could you please release the code for using other views, instead of only "l" and "ab", in the CMC training process?
Lines 35 to 46 in 58d06e9
Hi, I have a question about using softmax instead of NCE loss.
In that function, every label is set to zero, including for the critic value of the positive sample, which has index 0 in the batch.
I want to know the reason. My take is that the label should be [1, 0, 0, 0, ...], shouldn't it?
Dear authors,
I just read your paper recently, and I think it is really interesting and significant.
So I want to see some details about the method by running the code.
I mainly focus on graph representation learning, recommendation, and ML.
I am not familiar with the image dataset and processing.
Could you provide the datasets needed to run the code, or tell me where I can download the ImageNet-100 and STL-100 datasets?
Thanks.
Xu Chen
Hi @HobbitLong, I am trying to implement CMC on CIFAR-10 with a shallow ResNet. However, the accuracy only reaches 60%~70%. I have tried tuning the batch size from 64 to 512 and the learning rate from 0.01 to 0.12. In addition, I also tuned nce_k from 8192 to 65536. Unfortunately, the accuracy has not improved. Do you have any suggestions for tuning parameters on small datasets like CIFAR-10? Thank you very much.
Hi there
Thanks a lot for this great repo!
I am trying out MoCo on my own dataset (I also added additional augmentations). Training appears to have converged, but the max value I get for ins_prob is about 13.35, and the lowest value I get for loss is about 0.2422.
I am wondering what metrics you got when training on Imagenet? Am not sure what a "good" score should look like.
Here are screenshots from training progress in tensorboard (ignore the multiple lines at the start of training).
Thanks,
Liam
Hi @HobbitLong, could you please provide the classes you use for the ImageNet-100 dataset? Thanks. Is ImageNet-100 the same as ImageNet except for the number of classes?
Hi, thank you for your code. Would you please provide a download link of ImageNet-trained resnet50/resnet101 weights?
I enjoyed reading the paper and thanks for uploading the code.
Quick question - would it be possible to also upload the scripts to run the STL-10 eval?
Thanks!
Line 30 in 58d06e9
I think the purpose of using NCE is to avoid the expensive summation over the entire vector in softmax. But in your implementation there is still a summation over the entire log_D0, which confuses me. I'd appreciate it if you could explain this.
I'm new to this field, so please point out any misunderstanding on my part.
Hi,
Is it possible to get download link for AlexNet weights trained on ImageNet?
Thanks for sharing the code and ResNet weights.
Are there any ten-crop results? As far as I know, some methods improve a lot with ten-crop evaluation, while others improve only a little. I wonder how much improvement CMC can get from ten-crop.
Hi
Thank you for sharing this great work with us.
I saw that you have spawn in the code, so I am wondering whether you plan to release the code supporting DistributedDataParallel. In particular, I am curious how you sync the memory bank for L and ab, e.g., in self.register_buffer('memory_ab'), during training.
Thank you :)
Would you please share the MoCo pre-trained weights?
It appears to me that shuffle-bn has no effect, when run on a single GPU.
Example:
import torch
import torch.nn as nn
(B,C,H,W) = 4,3,2,2
model1 = nn.Sequential(nn.BatchNorm2d(C))
model2 = nn.Sequential(nn.BatchNorm2d(C))
print("Before:")
print(" model1 stats: ", model1[0].running_mean, model1[0].running_var)
print(" model2 stats: ", model2[0].running_mean, model2[0].running_var)
shuffle_ids = torch.randperm(B).long()
x1 = torch.randn(B,C,H,W)*3+1
x2 = x1[shuffle_ids]
model1(x1)
model2(x2)
print("After:")
print(" model1 stats: ", model1[0].running_mean, model1[0].running_var)
print(" model2 stats: ", model2[0].running_mean, model2[0].running_var)
Before:
model1 stats: tensor([0., 0., 0.]) tensor([1., 1., 1.])
model2 stats: tensor([0., 0., 0.]) tensor([1., 1., 1.])
After:
model1 stats: tensor([0.2285, 0.1523, 0.1447]) tensor([1.6193, 1.4863, 1.6332])
model2 stats: tensor([0.2285, 0.1523, 0.1447]) tensor([1.6193, 1.4863, 1.6332])
I guess another approach is necessary on single-GPU. Any thoughts?
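One possible explanation, offered as a sketch rather than a statement about the repo: batch statistics do not depend on the order of samples within a batch, so shuffling only changes anything when the batch is split into per-device chunks that each compute their own statistics. The pure-Python example below emulates that split. The helper chunk_means and the two-device split are illustrative assumptions.

```python
def chunk_means(values, num_chunks):
    """Mean of each contiguous chunk, emulating per-device BatchNorm statistics."""
    size = len(values) // num_chunks
    chunks = [values[i * size:(i + 1) * size] for i in range(num_chunks)]
    return [sum(c) / len(c) for c in chunks]


batch = [1.0, 2.0, 3.0, 4.0]
shuffled = [batch[i] for i in (2, 0, 3, 1)]  # fixed permutation for reproducibility

# Single device: one chunk, so the statistic ignores the permutation.
single_before = chunk_means(batch, 1)
single_after = chunk_means(shuffled, 1)

# Two devices: each chunk holds different samples after shuffling.
multi_before = chunk_means(batch, 2)
multi_after = chunk_means(shuffled, 2)
```

Under this reading, a single-GPU variant would need to emulate the split explicitly, for example by normalizing sub-groups of the batch separately.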
Thanks for releasing this code.
Lines 372 to 397 in 58d06e9
If I understand the inner workings of eval_moco_ins.py correctly, the code seems to train the downstream task (a single FC layer) using augmented images (train_transform == 'CJ').
This augmentation not only slows down training of the downstream task but also seems to defeat the purpose of the evaluation ("Then we freeze the features and train a supervised linear classifier", as stated in the MoCo paper).
Wouldn't it be right to save the center-cropped, average-pooled features and train the FC layer on those fixed features?
Hi @HobbitLong,
Thanks for your nice paper and public code!
I have reproduced the results of MoCo and InsDIS on ImageNet100 following your steps.
I got 67.44 for MoCo and 66.02 for InsDIS, which are worse than the expected 73.4 and 69.1.
Could you please help me with this?
Best
Mengyuan
Hi, it seems that you are using the dot product between vectors from two views as a proxy for unknown distribution denoted as pd in your paper here. In other words, your hθ is the dot product. Theoretically any hθ can work so it's all good.
But doesn't it force the two representations to be similar? I understand the two representations should have high mutual information. But it is not the same as having the two vectors in similar directions.
Obviously it worked out pretty well. But do you think having a parameterized NCEAverage loss would have allowed for more representations with not so similar directions but still having high MI?
Thank you again!
Hi @HobbitLong,
Thanks for such a clean and readable code.
I am interested in using the pre-trained weights that you were kind enough to provide. I downloaded the pre-trained weights CMC_resnet50v2.pth and MoCo_softmax_16384_epoch200.pth. Then, I ran the linear evaluation code with the following commands, but couldn't reproduce the accuracies. The accuracies at the final, 60th, epoch for CMC and MoCo are 62.0% and 57.3% respectively. The accuracies should be 64.1% (from the CMC paper) and 59.4% (from readme).
CUDA_VISIBLE_DEVICES=9 python LinearProbing.py --dataset imagenet \
--data_folder /datasets/imagenet_nfs1 \
--save_path ./output/cmc_linear \
--tb_path ./output/cmc_linear \
--model_path ./pretrained/CMC_resnet50v2.pth \
--model resnet50v2 --learning_rate 30 --layer 6
CUDA_VISIBLE_DEVICES=8 python eval_moco_ins.py --dataset imagenet \
--data_folder /datasets/imagenet_nfs1 \
--save_path ./output/moco_linear \
--tb_path ./output/moco_linear \
--model_path ./pretrained/MoCo_softmax_16384_epoch200.pth \
--model resnet50 --learning_rate 30 --layer 6
Have I missed something? Do I need to change the default hyperparameters to get the reported numbers?
Thanks
Hi there,
This is a bit of a meta-question.
I noticed that your code uses the original AlexNet parameters i.e. with convolutions 96,256,384,384,256 vs. the one weird trick paper 64,192,384,256,256 that is the standard in the official PyTorch implementation.
In comparison, Feng et al. at CVPR 2019 use the smaller version of AlexNet in their code.
I was wondering if there was a standard for which version of AlexNet should be used in the self-supervised literature, and if it even makes a difference?
Thanks
Hi Yonglong,
Thanks a lot for the great work and for sharing the code. I am trying to reproduce the results of MoCo on ImageNet-1k with ResNet-50. Did you reproduce the results in Kaiming's paper on the full ImageNet? Would you kindly share with me the specific configurations for reproducing MoCo-ResNet-50?
Thanks a lot!
Hi,
I want to use CMC in my own experiment, but the loss behaves strangely. Within each epoch, the loss decays as normal (e.g., from 20 to 11). But at the start of the next epoch, the loss is nearly the same as at the beginning (the loss is 20 again). I wonder if this is 'normal' in CMC.
Thanks.
Hi,
Thanks for your code. I just wonder how to visualize the AB channel of images in the code as shown in your paper. I could visualize L channel using TensorboardX, but that doesn't work for AB channel.
How can I prevent an element in the queue from coming from the same sample as the query, especially when the dataloader's "shuffle" parameter is True?
Thank you
Line 178 in 783bf95
Hi,
Thanks for releasing your code. I want to check something that is puzzling me.
Does 'resnet50v2' represent 'ResNet-50' in Table 2 in the paper?
Does 'resnet50v3' represent 'ResNet-50 x2' in Table 2 in the paper?
If so, I want to know if you have trained 'resnet50v1' on ImageNet. Could you please share the results?
Thanks.
I saw your note and it seems rather unusual to use such a large learning rate:
Note: When training linear classifiers on top of ResNets, it's important to use large learning rate, e.g., 30~50.
Is there something I'm missing? I can't imagine how you get stable gradient descent with such high learning rates.
Hi, thanks for open-sourcing the code. I wanted to know when you will add support for using ResNet as a backbone.
Why initialize memory value this way? Thanks!
stdv = 1. / math.sqrt(inputSize / 3)
self.register_buffer('memory', torch.rand(self.queueSize, inputSize).mul_(2 * stdv).add_(-stdv))
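One way to read this initialization, given as a worked check rather than the authors' stated rationale: a uniform variable on [-a, a] has variance a²/3, so with a = 1 / sqrt(inputSize / 3) each element has variance 1 / inputSize and each row has an expected squared norm of 1. The pure-Python sketch below checks this numerically. The helper names, the row count, and the seed are illustrative assumptions.

```python
import math
import random


def init_memory_row(input_size, rng):
    """Draw one row uniformly from [-stdv, stdv], as in the quoted snippet."""
    stdv = 1.0 / math.sqrt(input_size / 3)
    return [rng.random() * 2 * stdv - stdv for _ in range(input_size)]


def mean_squared_norm(input_size, rows=2000, seed=0):
    """Average squared L2 norm over many freshly initialized rows."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(rows):
        row = init_memory_row(input_size, rng)
        total += sum(v * v for v in row)
    return total / rows
```

For example, mean_squared_norm(128) should come out close to 1, which would put the initial memory entries on roughly the same scale as L2-normalized features.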
Hey there, thanks for the well-documented code!
Quick question: Am I correctly assuming that in order to evaluate the model on the 1,000-class ImageNet validation dataset one has to train the linear classifier first (using LinearProbing.py)? If so, would it be possible to release pre-trained weights for the classifier as well, such that one can use classifier.load_state_dict(checkpoint['classifier'])?
Hi! Thanks for your code!
I have some questions about your implementation. I notice that for negative samples we use the memory bank because 4096 is too large for one batch. But for positive samples, why still use the memory bank rather than the features computed from the current batch? Is there any harm in doing so?
Thanks
Hello,
Thanks for making this code available. I am trying to run the pretrained AlexNet model (downloaded from the Dropbox link) with the following command:
python LinearProbing.py --dataset imagenet --data_folder /share/ctn/users/jwl2182/imagenet_data --save_path . --model_path /home/jwl2182/CMC/CMC_alexnet.pth --model alexnet --learning_rate 0.1 --layer 5 --tb_path /home/jwl2182/CMC/tb --gpu 0
But I get the following error. Any ideas what might be happening?
RuntimeError: Error(s) in loading state_dict for MyAlexNetCMC:
Unexpected key(s) in state_dict: "encoder.module.l_to_ab.conv_block_1.1.num_batches_tracked", "encoder.module.l_to_ab.conv_block_2.1.num_batches_tracked", "encoder.module.l_to_ab.conv_block_3.1.num_batches_tracked", "encoder.module.l_to_ab.conv_block_4.1.num_batches_tracked", "encoder.module.l_to_ab.conv_block_5.1.num_batches_tracked", "encoder.module.l_to_ab.fc6.1.num_batches_tracked", "encoder.module.l_to_ab.fc7.1.num_batches_tracked", "encoder.module.ab_to_l.conv_block_1.1.num_batches_tracked", "encoder.module.ab_to_l.conv_block_2.1.num_batches_tracked", "encoder.module.ab_to_l.conv_block_3.1.num_batches_tracked", "encoder.module.ab_to_l.conv_block_4.1.num_batches_tracked", "encoder.module.ab_to_l.conv_block_5.1.num_batches_tracked", "encoder.module.ab_to_l.fc6.1.num_batches_tracked", "encoder.module.ab_to_l.fc7.1.num_batches_tracked".
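Errors like this typically appear when a checkpoint saved with a newer PyTorch, whose BatchNorm layers carry a num_batches_tracked buffer, is loaded with an older version that lacks it. Assuming that is the cause here, upgrading PyTorch is probably the simplest fix. As an alternative, the sketch below filters those entries out of the state dict before loading. The helper strip_num_batches_tracked is hypothetical, and the toy dictionary stands in for a real checkpoint.

```python
def strip_num_batches_tracked(state_dict):
    """Return a copy of the state dict without BatchNorm 'num_batches_tracked' entries."""
    return {
        key: value
        for key, value in state_dict.items()
        if not key.endswith("num_batches_tracked")
    }


# Toy stand-in for a checkpoint; real values would be tensors.
checkpoint = {
    "encoder.module.l_to_ab.conv_block_1.1.weight": [1.0],
    "encoder.module.l_to_ab.conv_block_1.1.num_batches_tracked": 0,
}
cleaned = strip_num_batches_tracked(checkpoint)
```

The cleaned dictionary could then be passed to load_state_dict in place of the original one.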
Could you tell me the details about how the imagenet100 subset was created?
Hi,
I see your code for NCESoftmaxLoss as follows:
#########
import torch
import torch.nn as nn


class NCESoftmaxLoss(nn.Module):
    """Softmax cross-entropy loss (a.k.a., info-NCE loss in CPC paper)"""
    def __init__(self):
        super(NCESoftmaxLoss, self).__init__()
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, x):
        bsz = x.shape[0]
        x = x.squeeze()
        # Target class index 0 for every row in the batch
        label = torch.zeros([bsz]).cuda().long()
        loss = self.criterion(x, label)
        return loss
###########
The label for this loss is label = torch.zeros([bsz]).cuda().long(), but according to Eq. 2 in your paper, you have one positive for each sample.
Is something missing here?
Thanks.
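A likely reading of the quoted snippet: nn.CrossEntropyLoss takes class indices as targets rather than one-hot vectors, so a label of 0 for every row means "column 0 holds the positive", which is the same as the one-hot target [1, 0, 0, ...]. The pure-Python sketch below illustrates the equivalence for a single row. The function names and the example scores are illustrative assumptions.

```python
import math


def cross_entropy_index(logits, target_index):
    """Cross-entropy given the index of the correct class."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target_index]


def cross_entropy_one_hot(logits, one_hot):
    """Cross-entropy given a one-hot target vector."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    log_probs = [v - log_sum_exp for v in logits]
    return -sum(t * lp for t, lp in zip(one_hot, log_probs))


# One row of critic scores: the positive sits at index 0, the rest are negatives.
scores = [2.0, 0.5, -1.0, 0.1]
loss_from_index = cross_entropy_index(scores, 0)
loss_from_one_hot = cross_entropy_one_hot(scores, [1, 0, 0, 0])
```

Both calls give the same loss, so under this reading each sample still has exactly one positive.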
https://github.com/HobbitLong/CMC/blob/master/NCE/alias_multinomial.py#L8
Hi!
While reading your code, I noticed that the for loops in the initialization function of AliasMethod cause a lot of computation.
However, the only place that instantiates the class (https://github.com/HobbitLong/CMC/blob/master/NCE/NCEAverage.py#L13) passes torch.ones, which results in an all-ones self.prob and an all-zeros self.alias in AliasMethod.
What could go wrong if I just set them to ones and zeros instead of running the for loops when initializing AliasMethod?
Thanks for sharing the code :) (and RepDistiller too!)
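The observation can be checked with a small alias-table construction. This is an illustrative pure-Python sketch of the standard alias method, not the repo's AliasMethod class, and build_alias_table is a hypothetical name.

```python
def build_alias_table(weights):
    """Build (prob, alias) tables for alias-method sampling (illustrative sketch)."""
    total = sum(weights)
    k = len(weights)
    prob = [k * w / total for w in weights]
    alias = [0] * k
    smaller = [i for i, p in enumerate(prob) if p < 1.0]
    larger = [i for i, p in enumerate(prob) if p >= 1.0]

    # Pair each under-full bucket with an over-full one.
    while smaller and larger:
        small = smaller.pop()
        large = larger.pop()
        alias[small] = large
        prob[large] = prob[large] - (1.0 - prob[small])
        if prob[large] < 1.0:
            smaller.append(large)
        else:
            larger.append(large)

    # Leftover buckets are full up to rounding error.
    for i in smaller + larger:
        prob[i] = 1.0
    return prob, alias


uniform_prob, uniform_alias = build_alias_table([1.0] * 8)
skewed_prob, skewed_alias = build_alias_table([3.0, 1.0])
```

For equal weights the pairing loop never runs, so the tables come out as all ones and all zeros. Hard-coding them should therefore be equivalent as long as the class is only ever constructed with uniform weights, while non-uniform weights still need the full construction.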
Hi
Thank you for sharing this project with us! I am curious why you changed the default AdaptiveAvgPool2d of ResNet to AvgPool2d. How does this change affect the performance?
Your AvgPool2d layer:
https://github.com/HobbitLong/CMC/blob/master/models/resnet.py#L124
PyTorch's AdaptiveAvgPool2d layer:
https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py#L153
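As a rough illustration of when the two layers differ, the pure-Python sketch below compares a fixed-window average pool with a global (adaptive, 1x1 output) average pool. The kernel size of 7 and stride of 1 are assumptions for illustration (the linked line is not reproduced here), and all function names are hypothetical.

```python
def avg_pool2d(feature_map, kernel, stride=1):
    """Fixed-window average pooling over a 2D list (no padding)."""
    h, w = len(feature_map), len(feature_map[0])
    out = []
    for top in range(0, h - kernel + 1, stride):
        row = []
        for left in range(0, w - kernel + 1, stride):
            window = [feature_map[top + i][left + j]
                      for i in range(kernel) for j in range(kernel)]
            row.append(sum(window) / len(window))
        out.append(row)
    return out


def global_avg_pool2d(feature_map):
    """Adaptive average pooling to a 1x1 output: the mean of all entries."""
    values = [v for row in feature_map for v in row]
    return [[sum(values) / len(values)]]


def make_map(size):
    """Toy square feature map filled with 0, 1, 2, ..."""
    return [[float(r * size + c) for c in range(size)] for r in range(size)]


fixed_7 = avg_pool2d(make_map(7), kernel=7)
adaptive_7 = global_avg_pool2d(make_map(7))
fixed_8 = avg_pool2d(make_map(8), kernel=7)   # 2x2 output
adaptive_8 = global_avg_pool2d(make_map(8))   # still 1x1
```

On a 7x7 map the two agree, so with standard 224x224 inputs the change should not matter under these assumptions. On larger maps the fixed window produces more than one output position, which changes the feature dimension seen by later layers.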
Hi, Thanks a lot for sharing this great code.
I have a question about data augmentation and the memory bank. If we use data augmentation, the features in the memory bank are not updated to reflect it, especially for the positive examples, which we take from the memory bank.
Have you thought about it?
Hi,
I'm not sure how to get the ImageNet-100 dataset online, since I can't find any download link.
Could you please share the link?
Thank you.
Hi @HobbitLong, I am trying to reproduce MoCo v2 on ImageNet-1k. Have you tried replacing the linear projection head with an MLP? Do you think it is necessary to add a batch normalization layer or a bias to the fully connected layer? I kept all the hyperparameters the same as in the paper but could only get about 61.4 accuracy with 4 GPUs and a batch size of 256.
Would you kindly share with me the specific configurations based on your codebase for reproducing MoCov2-ResNet-50?
Thanks a lot!
Hi,
Thanks for open-sourcing your work. I have been trying to use CMC on my custom toy dataset, which has 2 views (image (3D) and sensor view (3D)). I am able to run the model successfully, but Z for view 1 and view 2 is being set to 119973150195712.
I made sure to apply L2 normalization to the final features from each of the AlexNet halves, but I'm really not sure why the Z values are being initialized to such a high value. I kept nce_m, nce_k and nce_t the same as in your code.
Could you please help me with this? Thank you
After 126 epochs of training, the loss still seems huge, and the probs for "L" and "ab" are only about 0.007. We set the learning rate and batch size to 6e-2 and 1024 (8 Tesla V100 GPUs).
Train: [126][930/1252] BT 0.827 (0.953) DT 0.001 (0.234) loss 6.161 (6.071) l_p 0.007 (0.007) ab_p 0.006 (0.006)
torch.Size([1024, 16385, 1])
Train: [126][940/1252] BT 0.630 (0.951) DT 0.001 (0.232) loss 5.945 (6.071) l_p 0.007 (0.007) ab_p 0.006 (0.006)
torch.Size([1024, 16385, 1])
I don't know what's wrong with our experiment setting. Could you share the curves of training loss and probs of 'L' and 'ab'?
Hi @HobbitLong, thank you for releasing the code. I wanted to ask a few questions regarding the implementation of NCEAverage.py. I understand some of them might be pretty basic questions but hopefully the answers will also help others to understand the code + implementation better.
* T=0.07 and why do out_l and out_ab need to be divided by T?
* Is there any advantage of starting out with unit vectors (on average) by implementing stdv = 1. / math.sqrt(inputSize / 3) here? I say this because out_l and out_ab need to be normalized anyway as is done here.
* Is this correct that you use a moving average (MA) to update weight_l and weight_ab (instead of just copying the values directly) because the model itself is learning and the values l and ab can be noisy? Using a MA reduces variance.
* As a follow up, how would this implementation be possible if you were not using memory banks? Is this an incidental advantage of using a memory bank?
* [Resolved] Why did you not use a gradient descent based method to implement NCE? Was it done to reduce the overload of all things that needed to be learnt?
* [Resolved] Lastly, since NCEAverage has no parameters or nn layers, I believe you don't need with torch.no_grad() here.
Thank you again.
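The temperature and moving-average questions above can be illustrated with a small pure-Python sketch. It is a sketch under assumptions, not the repo's NCEAverage code: the function names, the momentum value of 0.5, and the example similarities are all hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Softmax over scores divided by a temperature."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def momentum_update(old, new, momentum=0.5):
    """Blend a stored feature with a fresh one, then rescale to unit length."""
    mixed = [momentum * o + (1 - momentum) * n for o, n in zip(old, new)]
    norm = math.sqrt(sum(v * v for v in mixed))
    return [v / norm for v in mixed]


# Cosine similarities lie in [-1, 1], so raw differences between them are small.
sims = [0.9, 0.5, 0.1]
flat = softmax(sims, temperature=1.0)
sharp = softmax(sims, temperature=0.07)

updated = momentum_update([1.0, 0.0], [0.0, 1.0], momentum=0.5)
```

With T=1 the best match gets less than half of the probability mass, while with T=0.07 it gets nearly all of it, which is one common motivation for dividing similarities of normalized features by a small temperature. The momentum update shows how a stored entry can drift toward new features while staying on the unit sphere.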
When I evaluated the result on ImageNet (not the subset), I got the following error:
THCudaCheckWarn FAIL file=/opt/conda/conda-bld/pytorch_1524586445097/work/aten/src/THC/THCStream.cpp line=50 error=59 : device-side assert triggered
Does anyone have any thoughts about this issue?
I enjoyed reading the paper. Thanks for open sourcing the code.
Please let me know if I can train the CMC model [resnet50 variant] by loading a pretrained ResNet-50 trained on ImageNet.
Also, if I want to train on a custom dataset with a custom number of classes, what changes to the hyperparameters would you suggest?