
adabound's Issues

Can this deal with complex numbers?

Hi authors,

I intended to use this method on complex-valued parameters, and it failed with an error message like:

File "optimizer.py", line 701, in step step_size.div_(denom).clamp_(lower_bound, upper_bound).mul_( RuntimeError: "clamp_scalar_cpu" not implemented for 'ComplexFloat'

I'm wondering if it would be possible to support complex numbers? Thanks.

Ni
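
For reference, a minimal reproduction, plus one possible workaround (the real-view trick is my own assumption, not something AdaBound provides):

    import torch

    # Minimal reproduction: in-place clamp is not implemented for complex tensors.
    t = torch.ones(3, dtype=torch.cfloat)
    try:
        t.clamp_(0.1, 1.0)
    except RuntimeError as e:
        print(e)  # clamp not implemented for 'ComplexFloat'

    # Possible workaround (an assumption on my part): store the parameter as an
    # ordinary real tensor of shape (..., 2) so any optimizer can clamp it, and
    # view it as complex only where the model needs complex arithmetic.
    w = torch.zeros(3, 2, requires_grad=True)
    w_complex = torch.view_as_complex(w)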

When did the optimizer switch to SGD?

I set the initial lr=0.0001 and final_lr=0.1,
but I still don't know when the optimizer becomes SGD.
Do I need to raise my learning rate to the final learning rate manually?
thanks!
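
For what it's worth, there appears to be no discrete switch: the clipping bounds tighten around final_lr as training proceeds. A small sketch of the schedule, assuming the bound formulas in adabound.py (gamma defaults to 1e-3):

    # Bound schedule at step t, as in adabound.py (assumption: default gamma=1e-3).
    # Both bounds converge to final_lr, so the move toward SGD-like updates
    # is gradual rather than a one-shot switch.
    final_lr, gamma = 0.1, 1e-3
    for t in [1, 100, 1000, 10000]:
        lower = final_lr * (1 - 1 / (gamma * t + 1))
        upper = final_lr * (1 + 1 / (gamma * t))
        print(f"step {t}: bounds [{lower:.6f}, {upper:.6f}]")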

Learning rate changing

Hi, thanks a lot for sharing your excellent work.

I wonder, if I want to change the learning rate as epochs increase, how do I set the parameters lr
and final_lr in AdaBound? Or is there any need to change the learning rate as epochs increase at all?

Looking for your reply, thanks a lot.
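
A minimal sketch of what I mean, pairing AdaBound with a standard PyTorch scheduler (the model and the schedule values here are placeholders, not recommendations):

    import torch
    import adabound

    model = torch.nn.Linear(10, 1)  # placeholder model
    optimizer = adabound.AdaBound(model.parameters(), lr=1e-3, final_lr=0.1)
    # Any torch.optim.lr_scheduler drives group['lr'] as usual; AdaBound
    # rescales final_lr by group['lr'] / base_lr internally.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

    for epoch in range(90):
        # ... one epoch of forward/backward/optimizer.step() goes here ...
        scheduler.step()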

Why the Python 3.6 requirement?

I don't see any reason why this code would not run on a lower version of Python.
Could you explain why there is such a requirement?

AttributeError: no attribute 'base_lrs'

Thank you very much for sharing this impressive work. I am somehow receiving the following error:

    for group, base_lr in zip(self.param_groups, self.base_lrs):
AttributeError: 'AdaBound' object has no attribute 'base_lrs'
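
A hypothetical stopgap, assuming the attribute was simply lost (for example, an optimizer object restored from a checkpoint): rebuild it from the current param groups, the same way AdaBound.__init__ does.

    import torch
    import adabound

    params = [torch.nn.Parameter(torch.zeros(1))]
    optimizer = adabound.AdaBound(params, lr=1e-3, final_lr=0.1)

    # Hypothetical stopgap (an assumption about the cause, not a confirmed fix):
    # recreate base_lrs from the current param groups if it is missing.
    if not hasattr(optimizer, 'base_lrs'):
        optimizer.base_lrs = [group['lr'] for group in optimizer.param_groups]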

Doesn't work properly with a higher lr

I'm new to deep learning, and I found that the project works well with SGD but something goes wrong with AdaBound.

When I start with lr=1e-3, it prints the following and crashes:
invalid argument 2: non-empty 3D or 4D (batch mode) tensor expected for input, but got: [1 x 64 x 0 x 27] at /pytorch/aten/src/THCUNN/generic/SpatialAdaptiveMaxPooling.cu:24

But it seems to work fine if I set lr to 1e-4 or lower, which confuses me a lot.
Any ideas?

python=3.6
pytorch=1.0.1 / 0.4
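
The zero-sized dimension in the pooling input suggests training diverged at the larger lr (this is an assumption; a detection head, for instance, could end up with no valid regions). One common thing to try is gradient clipping before each step, roughly like this:

    import torch

    model = torch.nn.Linear(10, 1)  # placeholder for the actual network
    loss = model(torch.randn(4, 10)).pow(2).mean()
    loss.backward()
    # Clip the global gradient norm before optimizer.step();
    # max_norm=1.0 is only an illustrative value.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)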

Please update the pip package

Thanks for your nice work, but I find that the pip package doesn't include AdaBoundW. Could you please update it?

Be careful when using adaptive gradient methods

[figure camp.png: loss curves for AdaBound, Adam, and SGD on the toy problem below]

I tested three methods on a very simple problem and got the result shown above.

The code is below:

import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import adabound

class Net(nn.Module):

    def __init__(self, dim):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(dim, 2 * dim)
        self.relu = nn.ReLU(inplace=True)
        self.fc2 = nn.Linear(2 * dim, dim)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

DIM = 30
epochs = 1000
xini = torch.ones(1, DIM) * 100  # fixed input
opti = torch.zeros(1, DIM)       # target: all zeros

lr = 0.01
net = Net(DIM)
objfun = nn.MSELoss()

loss_adab = []
loss_adam = []
loss_sgd = []
for epoch in range(epochs):
    if epoch % 100 == 0:
        lr /= 10  # decay lr by 10x every 100 epochs (including epoch 0)

    # Note: re-creating the optimizer each epoch resets its internal state
    # (moment estimates); only the learning rate carries over.
    optimizer = adabound.AdaBound(net.parameters(), lr)
    out = net(xini)
    los = objfun(out, opti)
    loss_adab.append(los.detach().numpy())

    optimizer.zero_grad()
    los.backward()
    optimizer.step()

lr = 0.01
net = Net(DIM)
objfun = nn.MSELoss()

for epoch in range(epochs):
    if epoch % 100 == 0:
        lr /= 10

    optimizer = torch.optim.Adam(net.parameters(), lr)
    out = net(xini)
    los = objfun(out, opti)
    loss_adam.append(los.detach().numpy())

    optimizer.zero_grad()
    los.backward()
    optimizer.step()

lr = 0.001
net = Net(DIM)
objfun = nn.MSELoss()

for epoch in range(epochs):
    if epoch % 100 == 0:
        lr /= 10

    optimizer = torch.optim.SGD(net.parameters(), lr, momentum=0.9)
    out = net(xini)
    los = objfun(out, opti)
    loss_sgd.append(los.detach().numpy())

    optimizer.zero_grad()
    los.backward()
    optimizer.step()

plt.figure()
plt.plot(loss_adab, label='adabound')
plt.plot(loss_adam, label='adam')
plt.plot(loss_sgd, label='SGD')
plt.yscale('log')
plt.xlabel('epochs')
plt.ylabel('loss (log scale)')
plt.legend()
plt.savefig('camp.png', dpi=600)
plt.show()

lr_scheduler affects the actual learning rate

# Applies bounds on actual learning rate
# lr_scheduler cannot affect final_lr, this is a workaround to apply lr decay
final_lr = group['final_lr'] * group['lr'] / base_lr

However, lr_scheduler may change param_group['lr'] during training, so final_lr, lower_bound, and upper_bound will also be affected.

Should I avoid using lr_scheduler and let AdaBound adapt the parameters on its own to transition from Adam to SGD?

Thank you very much!
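
A small check of that scaling, assuming the pip adabound package (the printed effective final_lr shrinks together with the scheduled lr):

    import torch
    import adabound

    param = torch.nn.Parameter(torch.zeros(1))
    opt = adabound.AdaBound([param], lr=1e-3, final_lr=0.1)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=1, gamma=0.1)

    for _ in range(3):
        param.grad = torch.ones(1)
        opt.step()
        sched.step()
        g = opt.param_groups[0]
        # Effective bound center used inside step(): final_lr * lr / base_lr
        print(g['lr'], g['final_lr'] * g['lr'] / opt.base_lrs[0])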

LSTM hyperparameters for language modeling

Greetings,

Thanks for your great paper. I am wondering about the hyperparameters you used for the language modeling experiments. Could you provide that information?

Thank you!

PyTorch 1.6 warning

/home/xxxx/.local/lib/python3.7/site-packages/adabound/adabound.py:94: UserWarning: This overload of add_ is deprecated:
        add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
        add_(Tensor other, *, Number alpha) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:766.)
  exp_avg.mul_(beta1).add_(1 - beta1, grad)
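
The warning itself points at the fix; a minimal sketch of the updated call (the result should be identical):

    import torch

    beta1 = 0.9
    exp_avg, grad = torch.zeros(3), torch.ones(3)

    # Deprecated overload flagged by PyTorch 1.6:
    #   exp_avg.mul_(beta1).add_(1 - beta1, grad)
    # Current signature, same arithmetic:
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)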
