
fixup's People

Contributors

hongyi-zhang, tjingrant


fixup's Issues

About the scale down in multihead attention

Hi,

In your implementation of the Transformer's multihead attention module, I notice you use L^(-1/6) to scale down the initialization of the first weight layer. Since the multihead attention module has only two weight layers, shouldn't the scale-down coefficient be L^(-1/2), according to the derivation in your paper?
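The scaling rule under discussion can be sketched as follows (a minimal sketch, not the repo's code): the paper scales the weights inside each residual branch by L^(-1/(2m-2)), where L is the number of residual branches and m the number of weight layers per branch.

```python
# Sketch of the depth-dependent Fixup scale factor; `L` and `m` here are
# generic parameters, not values taken from the repo.
def fixup_scale(L: int, m: int) -> float:
    """Scale factor L ** (-1 / (2*m - 2)) for a branch with m weight layers."""
    return L ** (-1.0 / (2 * m - 2))

# With m = 2 weight layers per branch the factor is L ** (-1/2), as the
# issue suggests; L ** (-1/6) would instead correspond to m = 4.
print(fixup_scale(4, 2))   # 4 ** -0.5 = 0.5
print(fixup_scale(64, 4))  # 64 ** (-1/6) = 0.5
```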

Smaller base learning rate for bias and scale

Hi, thanks for releasing this high-quality codebase! It is really helpful to understand your excellent paper.

I am trying to re-implement it for other purposes, and I just noticed that you assign a 10x smaller initial learning rate for bias and multiplier terms. It seems to be necessary, since in my implementation, setting their learning rate to 0.1 will cause divergence somewhere in the middle of the training.

I'm just wondering whether this is the reason that you decided to scale down the learning rate for bias and multiplier terms. And if not, is there any other insight?

Thank you in advance for any clarification.

Best regards
Ruizhe
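The learning-rate split discussed above can be sketched with PyTorch parameter groups (a hypothetical example: `model`, the 0.1 base rate, and the dimension-based split are assumptions, not the repo's exact code).

```python
import torch

# Toy stand-in for the real network; in Fixup, biases and multipliers are
# the scalar/1-D parameters, while weights are higher-dimensional.
model = torch.nn.Sequential(
    torch.nn.Linear(8, 8),
    torch.nn.Linear(8, 2),
)
base_lr = 0.1
scalars = [p for p in model.parameters() if p.dim() <= 1]  # biases / scales
weights = [p for p in model.parameters() if p.dim() > 1]

# Two parameter groups: biases and multipliers get a 10x smaller rate.
optimizer = torch.optim.SGD(
    [{"params": weights, "lr": base_lr},
     {"params": scalars, "lr": base_lr * 0.1}],
    momentum=0.9,
)
print(optimizer.param_groups[1]["lr"])
```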

Failed to reproduce the CIFAR-10 accuracy

Hey, I ran the example command given in the README but I only got 35% accuracy after 200 epochs. Could you please provide the hyperparameter values needed to replicate the results given in the paper?

Mix_up: alpha below 1 should not work...

In your code:

```python
def mixup_data(x, y, alpha=1.0, use_cuda=True, per_sample=False):
    '''Compute the mixup data. Return mixed inputs, pairs of targets, and lambda'''
    batch_size = x.size()[0]
    if use_cuda:
        index = torch.randperm(batch_size).cuda()
    else:
        index = torch.randperm(batch_size)

    if alpha > 0. and not per_sample:
        lam = torch.zeros(y.size()).fill_(np.random.beta(alpha, alpha)).cuda()
        mixed_x = lam.view(-1, 1, 1, 1) * x + (1 - lam.view(-1, 1, 1, 1)) * x[index, :]
    elif alpha > 0.:
        lam = torch.Tensor(np.random.beta(alpha, alpha, size=y.size())).cuda()
        mixed_x = lam.view(-1, 1, 1, 1) * x + (1 - lam.view(-1, 1, 1, 1)) * x[index, :]
    else:
        lam = torch.ones(y.size()).cuda()
        mixed_x = x

    y_a, y_b = y, y[index]
    return mixed_x, y_a, y_b, lam
```

So when alpha is below 1, it will only mix the labels, not x. Since lam is one, mixup_criterion will only compute the loss between x and y_a. Also, no matter how you change alpha, as long as it stays below 1, the loss will not change.
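For reference, mixup_criterion is referenced above but not shown; the standard mixup loss (a sketch, not necessarily the repo's exact code) interpolates the two per-target losses with the same lambda used to mix the inputs. Note also that Beta(alpha, alpha) with alpha < 1 is U-shaped, so lam tends to lie near 0 or 1 rather than being exactly 1.

```python
import numpy as np

# Standard mixup loss: interpolate the losses against both targets with
# the same mixing coefficient lam used for the inputs.
def mixup_criterion(criterion, pred, y_a, y_b, lam):
    return lam * criterion(pred, y_a) + (1 - lam) * criterion(pred, y_b)

# Beta(0.2, 0.2) samples concentrate near the endpoints 0 and 1.
lam = np.random.beta(0.2, 0.2)
print(lam)
```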

ImageNet?

Hi, thanks for the implementation! I'd love to try out the imagenet version of Fixup Resnet; I noticed you wrote this in the README:

A re-implementation is available. However, I have not been able to test it. If you try it out, please feel free to contact me --- your feedback is very welcome!

Anything I can help with? What's the obstacle? I'm very interested in seeing the results on ImageNet and would like to help in any way I can.

Again, thanks for the wonderful work!

Imagenet test

I have tried to run ImageNet with your Fixup ResNet model, but it did not converge at all. I did not use the mixup method, and all hyper-parameters were left at their defaults.

not enough values to unpack (expected 2, got 0)

When I run `CUDA_VISIBLE_DEVICES=0 python cifar_train.py -a fixup_resnet32 --sess benchmark_a0d5e4lr01 --seed 11111 --alpha 0. --decay 5e-4`, I get the following errors:

```
stty: standard input: Inappropriate ioctl for device
Traceback (most recent call last):
  File "cifar_train.py", line 20, in <module>
    from utils import progress_bar, mixup_data, mixup_criterion
  File "/home/s1926784/Fixup/cifar/utils.py", line 86, in <module>
    _, term_width = os.popen('stty size', 'r').read().split()
ValueError: not enough values to unpack (expected 2, got 0)
```
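The failure happens because `stty size` returns nothing when standard input is not a terminal. One possible fix (a sketch, not the repo's code) is to replace the `os.popen('stty size', ...)` call in utils.py with the standard library's terminal-size helper, which falls back to a default when no TTY is attached:

```python
import shutil

# shutil.get_terminal_size never raises here: it returns the fallback
# (columns, lines) when the terminal size cannot be determined.
term_width = shutil.get_terminal_size(fallback=(80, 24)).columns
print(term_width)
```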

Trained models on CIFAR-10

Hi, I couldn't quite attain the accuracy numbers you report for WideResNet-40-10 and ResNet-110 on CIFAR-10, but I would be very interested in evaluating these well-performing models for different metrics. If you could make the model parameters publicly available or share them directly, I would be very grateful.
Best,
Julian

training at increasing depth

Hi, I wanted to reproduce the results in your fixup paper. (Figure 3: Depth of residual networks versus test accuracy at the first epoch for various methods on CIFAR-10 with the default BatchNorm learning rate. )

The figure shows that fixup resnet can achieve 50% test accuracy on cifar10 when the depth < 1000. But I can only get about 40% after multiple runs.

I used Google Colab's P100 GPU with PyTorch 1.3.0 for the experiments. Here is my code (the same script as in the README):

```shell
rm -rf *
git clone https://github.com/hongyi-zhang/Fixup.git
mv Fixup/cifar/* .
rm -rf Fixup
python cifar_train.py -a fixup_resnet32 --sess benchmark_a0d5e4lr01 --seed 11111 --alpha 0. --decay 5e-4 --n_epoch=1
```

Here is my colab link: https://colab.research.google.com/drive/10aj0-vEGHlqxZ95oS5RLaCwIdQgDepjM

Could you help me look into this? Thanks!
