
fixup's People

Contributors

hongyi-zhang, tjingrant


fixup's Issues

About the scale down in multihead attention

Hi,

In your implementation of the Transformer's multihead attention module, I notice you use L^(-1/6) to scale down the initialization of the first weight layer. Since the multihead attention module has only two weight layers, shouldn't the scale-down coefficient be L^(-1/2), according to the derivation in your paper?
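The scaling rule under discussion can be sketched as follows (a minimal sketch, not the repo's code): the paper scales the weights inside each residual branch by L^(-1/(2m-2)), where L is the number of residual branches and m the number of weight layers per branch.

```python
# Sketch of the depth-dependent Fixup scale factor; `L` and `m` here are
# generic parameters, not values taken from the repo.
def fixup_scale(L: int, m: int) -> float:
    """Scale factor L ** (-1 / (2*m - 2)) for a branch with m weight layers."""
    return L ** (-1.0 / (2 * m - 2))

# With m = 2 weight layers per branch the factor is L ** (-1/2), as the
# issue suggests; L ** (-1/6) would instead correspond to m = 4.
print(fixup_scale(4, 2))   # 4 ** -0.5 = 0.5
print(fixup_scale(64, 4))  # 64 ** (-1/6) = 0.5
```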

Smaller base learning rate for bias and scale

Hi, thanks for releasing this high-quality codebase! It is really helpful to understand your excellent paper.

I am trying to re-implement it for other purposes, and I just noticed that you assign a 10x smaller initial learning rate for bias and multiplier terms. It seems to be necessary, since in my implementation, setting their learning rate to 0.1 will cause divergence somewhere in the middle of the training.

I'm just wondering whether this is the reason that you decided to scale down the learning rate for bias and multiplier terms. And if not, is there any other insight?

Thank you in advance for any clarification.

Best regards
Ruizhe
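The learning-rate split discussed above can be sketched with PyTorch parameter groups (a hypothetical example: `model`, the 0.1 base rate, and the dimension-based split are assumptions, not the repo's exact code).

```python
import torch

# Toy stand-in for the real network; in Fixup, biases and multipliers are
# the scalar/1-D parameters, while weights are higher-dimensional.
model = torch.nn.Sequential(
    torch.nn.Linear(8, 8),
    torch.nn.Linear(8, 2),
)
base_lr = 0.1
scalars = [p for p in model.parameters() if p.dim() <= 1]  # biases / scales
weights = [p for p in model.parameters() if p.dim() > 1]

# Two parameter groups: biases and multipliers get a 10x smaller rate.
optimizer = torch.optim.SGD(
    [{"params": weights, "lr": base_lr},
     {"params": scalars, "lr": base_lr * 0.1}],
    momentum=0.9,
)
print(optimizer.param_groups[1]["lr"])
```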

Failed to reproduce the CIFAR-10 accuracy

Hey, I ran the example command given in the README but I only got 35% accuracy after 200 epochs. Could you please provide the hyperparameter values needed to replicate the results given in the paper?

Mix_up: alpha below 1 should not work...

In your code:

```python
def mixup_data(x, y, alpha=1.0, use_cuda=True, per_sample=False):
    '''Compute the mixup data. Return mixed inputs, pairs of targets, and lambda'''
    batch_size = x.size()[0]
    if use_cuda:
        index = torch.randperm(batch_size).cuda()
    else:
        index = torch.randperm(batch_size)

    if alpha > 0. and not per_sample:
        lam = torch.zeros(y.size()).fill_(np.random.beta(alpha, alpha)).cuda()
        mixed_x = lam.view(-1, 1, 1, 1) * x + (1 - lam.view(-1, 1, 1, 1)) * x[index, :]
    elif alpha > 0.:
        lam = torch.Tensor(np.random.beta(alpha, alpha, size=y.size())).cuda()
        mixed_x = lam.view(-1, 1, 1, 1) * x + (1 - lam.view(-1, 1, 1, 1)) * x[index, :]
    else:
        lam = torch.ones(y.size()).cuda()
        mixed_x = x

    y_a, y_b = y, y[index]
    return mixed_x, y_a, y_b, lam
```

So when alpha is below 1, it will only mix the labels, not x. Since lam is one, mixup_criterion will only compute the loss between x and y_a. Also, no matter how you change alpha, as long as it stays below 1, the loss will not change.
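For reference, mixup_criterion is referenced above but not shown; the standard mixup loss (a sketch, not necessarily the repo's exact code) interpolates the two per-target losses with the same lambda used to mix the inputs. Note also that Beta(alpha, alpha) with alpha < 1 is U-shaped, so lam tends to lie near 0 or 1 rather than being exactly 1.

```python
import numpy as np

# Standard mixup loss: interpolate the losses against both targets with
# the same mixing coefficient lam used for the inputs.
def mixup_criterion(criterion, pred, y_a, y_b, lam):
    return lam * criterion(pred, y_a) + (1 - lam) * criterion(pred, y_b)

# Beta(0.2, 0.2) samples concentrate near the endpoints 0 and 1.
lam = np.random.beta(0.2, 0.2)
print(lam)
```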

ImageNet?

Hi, thanks for the implementation! I'd love to try out the imagenet version of Fixup Resnet; I noticed you wrote this in the README:

A re-implementation is available. However, I have not been able to test it. If you try it out, please feel free to contact me --- your feedback is very welcome!

Anything I can help with? What's the obstacle? I'm very interested in seeing the results on ImageNet and would like to help in any way I can.

Again, thanks for the wonderful work!

Imagenet test

I have tried to run ImageNet with your Fixup ResNet model, but it did not converge at all. I did not use the mixup method, and all hyper-parameters were left at their defaults.

not enough values to unpack (expected 2, got 0)

When I run `CUDA_VISIBLE_DEVICES=0 python cifar_train.py -a fixup_resnet32 --sess benchmark_a0d5e4lr01 --seed 11111 --alpha 0. --decay 5e-4`, I get the following errors:

```
stty: standard input: Inappropriate ioctl for device
Traceback (most recent call last):
  File "cifar_train.py", line 20, in <module>
    from utils import progress_bar, mixup_data, mixup_criterion
  File "/home/s1926784/Fixup/cifar/utils.py", line 86, in <module>
    _, term_width = os.popen('stty size', 'r').read().split()
ValueError: not enough values to unpack (expected 2, got 0)
```
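The failure happens because `stty size` returns nothing when standard input is not a terminal. One possible fix (a sketch, not the repo's code) is to replace the `os.popen('stty size', ...)` call in utils.py with the standard library's terminal-size helper, which falls back to a default when no TTY is attached:

```python
import shutil

# shutil.get_terminal_size never raises here: it returns the fallback
# (columns, lines) when the terminal size cannot be determined.
term_width = shutil.get_terminal_size(fallback=(80, 24)).columns
print(term_width)
```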

Trained models on CIFAR-10

Hi, I couldn't quite attain the accuracy numbers you report for WideResNet-40-10 and ResNet-110 on CIFAR-10, but I would be very interested in evaluating these well-performing models for different metrics. If you could make the model parameters publicly available or share them directly, I would be very grateful.
Best,
Julian

training at increasing depth

Hi, I wanted to reproduce the results in your fixup paper. (Figure 3: Depth of residual networks versus test accuracy at the first epoch for various methods on CIFAR-10 with the default BatchNorm learning rate. )

The figure shows that fixup resnet can achieve 50% test accuracy on cifar10 when the depth < 1000. But I can only get about 40% after multiple runs.

I used Google Colab's P100 GPU with PyTorch 1.3.0 for the experiments. Here is my code (the same script as in the README):

```shell
rm -rf *
git clone https://github.com/hongyi-zhang/Fixup.git
mv Fixup/cifar/* .
rm -rf Fixup
python cifar_train.py -a fixup_resnet32 --sess benchmark_a0d5e4lr01 --seed 11111 --alpha 0. --decay 5e-4 --n_epoch=1
```

Here is my colab link: https://colab.research.google.com/drive/10aj0-vEGHlqxZ95oS5RLaCwIdQgDepjM

Could you help me look into this? Thanks!
