Comments (4)
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader
from adabound import AdaBound  # pip install adabound; used for the AdaBound runs below
# generate M data points roughly forming a line (noise added)
M = 50
theta_true = torch.Tensor([[0.5], [2]])
X = 10 * torch.rand(M, 2) - 5
X[:, 1] = 1.0
y = torch.mm(X, theta_true) + 0.3 * torch.randn(M, 1)
def mse(t1, t2):
    diff = t1 - t2
    return torch.sum(diff * diff) / diff.numel()
# simple linear model (replaced further down by nn.Linear)
def model(x, theta):
    return x @ theta
def cost_func(theta, X, y):
    pred = torch.mm(X, theta)
    diff = pred - y
    loss = (diff ** 2).sum(0) / X.shape[0]
    return loss
Define the dataset, data loader, loss function, and model
batch_size = 1
num_epochs = 100
loss_fn = F.mse_loss
train_ds = TensorDataset(X, y)
train_dl = DataLoader(train_ds, batch_size, shuffle=True)
model = nn.Linear(2, 1, bias=False)
Define a utility function to train the model
def fit(num_epochs, loss_fn, opt):
    # start every optimizer from the same initial weights
    model.weight.data[0][0].fill_(2.00)
    model.weight.data[0][1].fill_(4.00)
    Loss = []
    Theta = np.zeros(shape=(1, 2, num_epochs))
    for epoch in range(num_epochs):
        for xb, yb in train_dl:
            pred = model(xb)
            loss = loss_fn(pred, yb)
            loss.backward()
            opt.step()
            opt.zero_grad()
        # record the full-batch loss and the current weights once per epoch
        Loss.append(loss_fn(model(X), y).item())
        Theta[:, 0, epoch] = model.weight.detach().numpy()[0][0]
        Theta[:, 1, epoch] = model.weight.detach().numpy()[0][1]
    Loss = np.array(Loss)
    return Theta, Loss
ADAM_t, ADAM = fit(num_epochs, loss_fn, torch.optim.Adam(model.parameters(), lr=1e-2) )
SGD_t, SGD = fit(num_epochs, loss_fn, torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0) )
SGDM_t, SGDM = fit(num_epochs, loss_fn, torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9) )
ADAB_t, ADAB = fit(num_epochs, loss_fn, AdaBound(model.parameters(), lr=1e-2, final_lr=0.1) )
theta_0_vals = np.linspace(-2, 4, 100)
theta_1_vals = np.linspace(0, 4, 100)
J = np.zeros((len(theta_0_vals), len(theta_1_vals)))
for i, theta_0 in enumerate(theta_0_vals):
    for j, theta_1 in enumerate(theta_1_vals):
        J[i, j] = cost_func(torch.Tensor([[theta_0], [theta_1]]), X, y).item()
xc, yc = np.meshgrid(theta_0_vals, theta_1_vals)
# J is indexed [theta_0, theta_1] while contour expects [theta_1, theta_0], hence the transpose
contours = plt.contour(xc, yc, J.T, 20)
plot_vals = range(0, num_epochs)
plt.plot(ADAM_t[0, 0, plot_vals], ADAM_t[0, 1, plot_vals], '-.', lw=2, label='Adam')
plt.plot(SGD_t[0, 0, plot_vals], SGD_t[0, 1, plot_vals], '-.', lw=2, label='SGD')
plt.plot(SGDM_t[0, 0, plot_vals], SGDM_t[0, 1, plot_vals], '-.', lw=2, label='SGD+momentum')
plt.plot(ADAB_t[0, 0, plot_vals], ADAB_t[0, 1, plot_vals], '-.', lw=2, label='AdaBound')
plt.scatter(theta_true[0].numpy(), theta_true[1].numpy(), marker='*', color='red', lw=2, label='global minimum')
plt.legend(loc='lower left')
plt.figure()
plt.subplot(211)
plt.plot(range(ADAB.shape[0]), ADAB, '-.', lw=2, label='AdaBound')
plt.plot(range(ADAM.shape[0]), ADAM, '-.', lw=2, label='Adam')
plt.plot(range(SGD.shape[0]), SGD, '-.', lw=2, label='SGD')
plt.plot(range(SGDM.shape[0]), SGDM, '-.', lw=2, label='SGD+momentum')
plt.legend(loc='upper right')
plt.subplot(212)
# zoom in on the last 20 epochs to compare final convergence
plt.plot(range(ADAB.shape[0]), ADAB, '-.', lw=2, label='AdaBound')
plt.plot(range(ADAM.shape[0]), ADAM, '-.', lw=2, label='Adam')
plt.plot(range(SGD.shape[0]), SGD, '-.', lw=2, label='SGD')
plt.plot(range(SGDM.shape[0]), SGDM, '-.', lw=2, label='SGD+momentum')
plt.xlim((80, 100))
plt.ylim((0, 0.3))
plt.legend(loc='upper right')
plt.tight_layout()
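As a possible extension (my own sketch, not part of the original script): since the comments below focus on batch-size sensitivity, the same fit can be rerun with a few different batch sizes and the final losses compared.

# hypothetical extension: probe how the AdaBound run reacts to the batch size
for bs in (1, 5, 10, 25):
    train_dl = DataLoader(train_ds, bs, shuffle=True)  # fit() reads train_dl from module scope
    _, loss_hist = fit(num_epochs, loss_fn, AdaBound(model.parameters(), lr=1e-2, final_lr=0.1))
    print(f"batch_size={bs}: final loss {loss_hist[-1]:.4f}")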
When I used AdaBound to train a ShuffleNet V2 model with a tiny batch size (5-10), I ran into the same problem. This optimizer might not converge.
By the way: when I used "adabound.AdaBound([{'params': part of model's params, 'lr': 0...}])" to prevent some parameters from being updated during training, I got an error message saying "cannot use lr = 0". But I can use "torch.optim.Adam([{'params': part of model's params, 'lr': 0...}])" for the same purpose.
Is this a BUG?
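For reference, a minimal sketch of the two constructions being compared; the two-layer net is only a placeholder for the actual model, and the requires_grad workaround is just a suggestion, not something from the AdaBound authors:

import torch
import torch.nn as nn
from adabound import AdaBound

# placeholder model standing in for ShuffleNet V2
net = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 2))

# works with Adam: parameters in the lr = 0 group are simply never updated
opt_adam = torch.optim.Adam([
    {'params': net[0].parameters(), 'lr': 0.0},
    {'params': net[1].parameters(), 'lr': 1e-3},
])

# the same parameter-group construction with AdaBound is reported above to fail with "cannot use lr = 0";
# a possible workaround is to freeze those parameters and hand only the rest to the optimizer
for p in net[0].parameters():
    p.requires_grad_(False)
opt_adabound = AdaBound((p for p in net.parameters() if p.requires_grad), lr=1e-3, final_lr=0.1)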
That's very interesting. We hadn't paid attention to the impact of batch size before.
Thanks for providing the new aspect to explore! 😄
Would you please provide more details of the experiments? Such as the hyperparameters of each optimizer, the scale of the dataset, etc.
Hi,
I have been training with AdaBound on a custom dataset and I faced similar issues with low batch sizes.
The only doubt I have: in the README you provide a comparison graph of the different optimizers, and I don't understand the abrupt change at epoch 150. I guess that is where the optimizer switches to SGD, but why at that point? Does that mean that if I train a dataset for 1000 epochs, it will make a similar change at epoch 750?
Thank you for the help
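For context on the "switch": as I read the paper and the reference implementation, AdaBound does not flip to SGD at a fixed epoch; the per-parameter step size is clipped between bounds that tighten toward final_lr as a function of the optimizer step count t (controlled by gamma), so the transition would not automatically move to epoch 750 for a 1000-epoch run. A small sketch of those bound curves, assuming the published formulas and the default gamma = 1e-3:

import numpy as np
import matplotlib.pyplot as plt

final_lr, gamma = 0.1, 1e-3      # AdaBound defaults
t = np.arange(1, 20001)          # optimizer steps, not epochs
lower = final_lr * (1 - 1 / (gamma * t + 1))   # lower clipping bound
upper = final_lr * (1 + 1 / (gamma * t))       # upper clipping bound

plt.plot(t, lower, label='lower bound')
plt.plot(t, upper, label='upper bound')
plt.xlabel('step t')
plt.ylabel('bound on the clipped step size')
plt.legend()
plt.show()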