hongyi-zhang / fixup
A Re-implementation of Fixed-update Initialization
License: BSD 3-Clause "New" or "Revised" License
Hi,
In your implementation of the transformer's multihead attention module, I notice you use L^(-1/6) to scale down the initialization of the first weight layer. Since the multihead attention module has only two weight layers, shouldn't the scale-down coefficient be L^(-1/2), according to the derivation in your paper?
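For context, the rescaling rule in the Fixup paper is L^(-1/(2m - 2)) for a residual branch with m weight layers, where L is the number of residual branches. A quick sketch (the function name is mine, not from the repo):

```python
def fixup_branch_scale(L, m):
    """Fixup rescales the weight layers inside each residual branch by
    L^(-1/(2m - 2)), where L is the number of residual branches and m is
    the number of weight layers in one branch (Zhang et al., 2019)."""
    return L ** (-1.0 / (2 * m - 2))

# A two-layer branch (m = 2) gets L^(-1/2); an L^(-1/6) factor instead
# corresponds to treating the branch as having m = 4 weight layers.
scale_two_layer = fixup_branch_scale(64, 2)   # 64 ** -0.5
scale_four_layer = fixup_branch_scale(64, 4)  # 64 ** (-1/6)
```

So the question comes down to whether the attention branch is being counted as two weight layers or four (e.g., Q/K/V projections plus the output projection).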
Hi, thanks for releasing this high-quality codebase! It is really helpful to understand your excellent paper.
I am trying to re-implement it for other purposes, and I noticed that you assign a 10x smaller initial learning rate to the bias and multiplier terms. This seems to be necessary: in my implementation, setting their learning rate to 0.1 causes divergence somewhere in the middle of training.
I'm just wondering whether this is the reason that you decided to scale down the learning rate for bias and multiplier terms. And if not, is there any other insight?
Thank you in advance for any clarification.
Best regards
Ruizhe
Hey, I ran the example command given in the README, but I only got 35% accuracy after 200 epochs. Could you please provide the hyperparameter values needed to replicate the results given in the paper?
In your code:
def mixup_data(x, y, alpha=1.0, use_cuda=True, per_sample=False):
    '''Compute the mixup data. Return mixed inputs, pairs of targets, and lambda'''
    batch_size = x.size()[0]
    if use_cuda:
        index = torch.randperm(batch_size).cuda()
    else:
        index = torch.randperm(batch_size)
    if alpha > 0. and not per_sample:
        lam = torch.zeros(y.size()).fill_(np.random.beta(alpha, alpha)).cuda()
        mixed_x = lam.view(-1, 1, 1, 1) * x + (1 - lam.view(-1, 1, 1, 1)) * x[index, :]
    elif alpha > 0.:
        lam = torch.Tensor(np.random.beta(alpha, alpha, size=y.size())).cuda()
        mixed_x = lam.view(-1, 1, 1, 1) * x + (1 - lam.view(-1, 1, 1, 1)) * x[index, :]
    else:
        lam = torch.ones(y.size()).cuda()
        mixed_x = x
    y_a, y_b = y, y[index]
    return mixed_x, y_a, y_b, lam
So when alpha is 0 (the else branch), it will only pair up the labels, not mix x. Since lam is all ones, mixup_criterion will only compute the loss between the prediction and y_a, and as long as alpha is not positive, the loss will not change at all.
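For reference, here is a minimal numpy sketch of what per-batch mixup is usually meant to do (this is my own illustration, not the repo's code): one lambda drawn from Beta(alpha, alpha) mixes both the inputs and the one-hot targets with a shuffled copy of the batch.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha, rng=None):
    # Minimal per-batch mixup sketch (not the repo's exact code):
    # draw one lambda ~ Beta(alpha, alpha) for the whole batch and mix
    # both inputs and one-hot targets with a shuffled copy of the batch.
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha) if alpha > 0 else 1.0  # lam = 1 disables mixing
    index = rng.permutation(len(x))
    mixed_x = lam * x + (1 - lam) * x[index]
    mixed_y = lam * y_onehot + (1 - lam) * y_onehot[index]
    return mixed_x, mixed_y, lam
```

With alpha = 0 this degenerates to lam = 1, i.e., plain training on (x, y), which matches the behavior described above.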
Hi, thanks for the implementation! I'd love to try out the imagenet version of Fixup Resnet; I noticed you wrote this in the README:
A re-implementation is available. However, I have not been able to test it. If you try it out, please feel free to contact me --- your feedback is very welcome!
Anything I can help? What's the obstacle? I'm very interested in seeing the results on imagenet and would like to help in any ways I can.
Again, thanks for the wonderful work!
I have tried to run ImageNet with your Fixup ResNet model, but it did not converge at all. I did not use mixup, and all hyper-parameters were left at their defaults.
When I run CUDA_VISIBLE_DEVICES=0 python cifar_train.py -a fixup_resnet32 --sess benchmark_a0d5e4lr01 --seed 11111 --alpha 0. --decay 5e-4, I get the following error:
stty: standard input: Inappropriate ioctl for device
Traceback (most recent call last):
  File "cifar_train.py", line 20, in <module>
    from utils import progress_bar, mixup_data, mixup_criterion
  File "/home/s1926784/Fixup/cifar/utils.py", line 86, in <module>
    _, term_width = os.popen('stty size', 'r').read().split()
ValueError: not enough values to unpack (expected 2, got 0)
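This failure happens because `stty size` prints nothing when stdin is not attached to a terminal (e.g., under a job scheduler or in a notebook), so the split yields zero values. A common workaround (my suggestion, not the repo's code) is the standard library's shutil.get_terminal_size, which degrades gracefully:

```python
import shutil

# shutil.get_terminal_size queries the real terminal when one exists,
# but falls back to the COLUMNS/LINES environment variables and then to
# the given default when stdin/stdout is not a TTY, avoiding the
# ValueError raised by parsing empty `stty size` output.
term_width, term_height = shutil.get_terminal_size(fallback=(80, 24))
```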
Hi, I couldn't quite attain the accuracy numbers you report for WideResNet-40-10 and ResNet-110 on CIFAR-10, but I would be very interested in evaluating these well-performing models for different metrics. If you could make the model parameters publicly available or share them directly, I would be very grateful.
Best,
Julian
Hi, I wanted to reproduce the results in your fixup paper. (Figure 3: Depth of residual networks versus test accuracy at the first epoch for various methods on CIFAR-10 with the default BatchNorm learning rate. )
The figure shows that fixup resnet can achieve 50% test accuracy on cifar10 when the depth < 1000. But I can only get about 40% after multiple runs.
I use Google Colab's P100 GPU and PyTorch 1.3.0 for the experiments. Here is my code (I used the same script as in the README):
!rm -rf *
!git clone https://github.com/hongyi-zhang/Fixup.git
!mv Fixup/cifar/* .
!rm -rf Fixup
!python cifar_train.py -a fixup_resnet32 --sess benchmark_a0d5e4lr01 --seed 11111 --alpha 0. --decay 5e-4 --n_epoch=1
Here is my colab link: https://colab.research.google.com/drive/10aj0-vEGHlqxZ95oS5RLaCwIdQgDepjM
Could you help me look into this? Thanks!
Hi there,
The code seems to train well, thanks for the implementation!
I wonder, though, whether you tried adding a multiplier / bias per channel (like BatchNorm) instead of single scalars?
Also, I see you use the BasicBlock of ResNet, even for the bigger ResNets. Did you try the Bottleneck block?
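To make the first question concrete, here is a hypothetical numpy sketch of the per-channel variant (the class name and shapes are my own illustration, analogous to BatchNorm's affine parameters, rather than Fixup's single scalar multiplier and bias):

```python
import numpy as np

class ChannelScaleBias:
    # Hypothetical per-channel multiplier/bias, analogous to the affine
    # part of BatchNorm, instead of Fixup's single scalar per branch.
    def __init__(self, num_channels):
        # Shapes broadcast over (N, C, H, W) inputs.
        self.scale = np.ones((1, num_channels, 1, 1))   # one multiplier per channel
        self.bias = np.zeros((1, num_channels, 1, 1))   # one bias per channel

    def __call__(self, x):  # x: (N, C, H, W)
        return self.scale * x + self.bias
```

At initialization this is the identity, matching how Fixup initializes its multipliers to 1 and biases to 0.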
I am very sorry, but I cannot use PyTorch. Do you have a TensorFlow implementation of Fixup? Thanks.