kozistr / pytorch_optimizer

optimizer & lr scheduler & loss function collections in PyTorch

Home Page: https://pytorch-optimizers.readthedocs.io/en/latest/

License: Apache License 2.0

Topics: optimizer, pytorch, ranger, chebyshev, adamp, radam, madgrad, adabound, adabelief, sam

pytorch_optimizer's Introduction

pytorch-optimizer


pytorch-optimizer is a collection of optimizers, learning rate schedulers, and loss functions for PyTorch. The algorithms are re-implemented from the original papers, with speed and memory tweaks and extra plug-ins. It also includes useful and practical optimization ideas.
Currently, 64 optimizers (+ bitsandbytes), 11 lr schedulers, and 13 loss functions are supported!

Highly inspired by pytorch-optimizer.

Getting Started

For more, see the documentation.

Most optimizers are under the MIT or Apache 2.0 license, but a few, such as Fromage and Nero, are licensed under CC BY-NC-SA 4.0, which is non-commercial. Please double-check the license before using them in your work.

Installation

$ pip3 install pytorch-optimizer

From v2.12.0, you can install and import bitsandbytes optimizers. Please check the requirements before installing it.

From v3.0.0, Python 3.7 support is dropped. However, you can still use this package with Python 3.7 by installing it with the --ignore-requires-python option.

$ pip install "pytorch-optimizer[bitsandbytes]"

Simple Usage

from pytorch_optimizer import AdamP

model = YourModel()
optimizer = AdamP(model.parameters())

# or you can use the optimizer loader by simply passing the name of the optimizer.

from pytorch_optimizer import load_optimizer

optimizer = load_optimizer(optimizer='adamp')(model.parameters())

# if you install the `bitsandbytes` package, you can use `8-bit` optimizers from `pytorch-optimizer`.

from pytorch_optimizer import load_optimizer

opt = load_optimizer(optimizer='bnb_adamw8bit')
optimizer = opt(model.parameters())

Also, you can load the optimizer via torch.hub.

import torch

model = YourModel()
opt = torch.hub.load('kozistr/pytorch_optimizer', 'adamp')
optimizer = opt(model.parameters())

If you want to build the optimizer with parameters & configs, there's the create_optimizer() API.

from pytorch_optimizer import create_optimizer

optimizer = create_optimizer(
    model,
    'adamp',
    lr=1e-3,
    weight_decay=1e-3,
    use_gc=True,
    use_lookahead=True,
)

Supported Optimizers

You can check the supported optimizers with the code below.

from pytorch_optimizer import get_supported_optimizers

supported_optimizers = get_supported_optimizers()
Optimizer Description Official Code Paper Citation
AdaBelief Adapting Step-sizes by the Belief in Observed Gradients github https://arxiv.org/abs/2010.07468 cite
AdaBound Adaptive Gradient Methods with Dynamic Bound of Learning Rate github https://openreview.net/forum?id=Bkg3g2R9FX cite
AdaHessian An Adaptive Second Order Optimizer for Machine Learning github https://arxiv.org/abs/2006.00719 cite
AdamD Improved bias-correction in Adam https://arxiv.org/abs/2110.10828 cite
AdamP Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights github https://arxiv.org/abs/2006.08217 cite
diffGrad An Optimization Method for Convolutional Neural Networks github https://arxiv.org/abs/1909.11015v3 cite
MADGRAD A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization github https://arxiv.org/abs/2101.11075 cite
RAdam On the Variance of the Adaptive Learning Rate and Beyond github https://arxiv.org/abs/1908.03265 cite
Ranger a synergistic optimizer combining RAdam and LookAhead, and now GC in one optimizer github https://bit.ly/3zyspC3 cite
Ranger21 a synergistic deep learning optimizer github https://arxiv.org/abs/2106.13731 cite
Lamb Large Batch Optimization for Deep Learning github https://arxiv.org/abs/1904.00962 cite
Shampoo Preconditioned Stochastic Tensor Optimization github https://arxiv.org/abs/1802.09568 cite
Nero Learning by Turning: Neural Architecture Aware Optimisation github https://arxiv.org/abs/2102.07227 cite
Adan Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models github https://arxiv.org/abs/2208.06677 cite
Adai Disentangling the Effects of Adaptive Learning Rate and Momentum github https://arxiv.org/abs/2006.15815 cite
SAM Sharpness-Aware Minimization github https://arxiv.org/abs/2010.01412 cite
ASAM Adaptive Sharpness-Aware Minimization github https://arxiv.org/abs/2102.11600 cite
GSAM Surrogate Gap Guided Sharpness-Aware Minimization github https://openreview.net/pdf?id=edONMAnhLu- cite
D-Adaptation Learning-Rate-Free Learning by D-Adaptation github https://arxiv.org/abs/2301.07733 cite
AdaFactor Adaptive Learning Rates with Sublinear Memory Cost github https://arxiv.org/abs/1804.04235 cite
Apollo An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization github https://arxiv.org/abs/2009.13586 cite
NovoGrad Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks github https://arxiv.org/abs/1905.11286 cite
Lion Symbolic Discovery of Optimization Algorithms github https://arxiv.org/abs/2302.06675 cite
Ali-G Adaptive Learning Rates for Interpolation with Gradients github https://arxiv.org/abs/1906.05661 cite
SM3 Memory-Efficient Adaptive Optimization github https://arxiv.org/abs/1901.11150 cite
AdaNorm Adaptive Gradient Norm Correction based Optimizer for CNNs github https://arxiv.org/abs/2210.06364 cite
RotoGrad Gradient Homogenization in Multitask Learning github https://openreview.net/pdf?id=T8wHz4rnuGL cite
A2Grad Optimal Adaptive and Accelerated Stochastic Gradient Descent github https://arxiv.org/abs/1810.00553 cite
AccSGD Accelerating Stochastic Gradient Descent For Least Squares Regression github https://arxiv.org/abs/1704.08227 cite
SGDW Decoupled Weight Decay Regularization github https://arxiv.org/abs/1711.05101 cite
ASGD Adaptive Gradient Descent without Descent github https://arxiv.org/abs/1910.09529 cite
Yogi Adaptive Methods for Nonconvex Optimization NIPS 2018 cite
SWATS Improving Generalization Performance by Switching from Adam to SGD https://arxiv.org/abs/1712.07628 cite
Fromage On the distance between two neural networks and the stability of learning github https://arxiv.org/abs/2002.03432 cite
MSVAG Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients github https://arxiv.org/abs/1705.07774 cite
AdaMod An Adaptive and Momental Bound Method for Stochastic Learning github https://arxiv.org/abs/1910.12249 cite
AggMo Aggregated Momentum: Stability Through Passive Damping github https://arxiv.org/abs/1804.00325 cite
QHAdam Quasi-hyperbolic momentum and Adam for deep learning github https://arxiv.org/abs/1810.06801 cite
PID A PID Controller Approach for Stochastic Optimization of Deep Networks github CVPR 18 cite
Gravity a Kinematic Approach on Optimization in Deep Learning github https://arxiv.org/abs/2101.09192 cite
AdaSmooth An Adaptive Learning Rate Method based on Effective Ratio https://arxiv.org/abs/2204.00825v1 cite
SRMM Stochastic regularized majorization-minimization with weakly convex and multi-convex surrogates github https://arxiv.org/abs/2201.01652 cite
AvaGrad Domain-independent Dominance of Adaptive Methods github https://arxiv.org/abs/1912.01823 cite
PCGrad Gradient Surgery for Multi-Task Learning github https://arxiv.org/abs/2001.06782 cite
AMSGrad On the Convergence of Adam and Beyond https://openreview.net/pdf?id=ryQu7f-RZ cite
Lookahead k steps forward, 1 step back github https://arxiv.org/abs/1907.08610 cite
PNM Manipulating Stochastic Gradient Noise to Improve Generalization github https://arxiv.org/abs/2103.17182 cite
GC Gradient Centralization github https://arxiv.org/abs/2004.01461 cite
AGC Adaptive Gradient Clipping github https://arxiv.org/abs/2102.06171 cite
Stable WD Understanding and Scheduling Weight Decay github https://arxiv.org/abs/2011.11152 cite
Softplus T Calibrating the Adaptive Learning Rate to Improve Convergence of ADAM https://arxiv.org/abs/1908.00700 cite
Un-tuned w/u On the adequacy of untuned warmup for adaptive optimization https://arxiv.org/abs/1910.04209 cite
Norm Loss An efficient yet effective regularization method for deep neural networks https://arxiv.org/abs/2103.06583 cite
AdaShift Decorrelation and Convergence of Adaptive Learning Rate Methods github https://arxiv.org/abs/1810.00143v4 cite
AdaDelta An Adaptive Learning Rate Method https://arxiv.org/abs/1212.5701v1 cite
Amos An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale github https://arxiv.org/abs/2210.11693 cite
SignSGD Compressed Optimisation for Non-Convex Problems github https://arxiv.org/abs/1802.04434 cite
Sophia A Scalable Stochastic Second-order Optimizer for Language Model Pre-training github https://arxiv.org/abs/2305.14342 cite
Prodigy An Expeditiously Adaptive Parameter-Free Learner github https://arxiv.org/abs/2306.06101 cite
PAdam Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks github https://arxiv.org/abs/1806.06763 cite
LOMO Full Parameter Fine-tuning for Large Language Models with Limited Resources github https://arxiv.org/abs/2306.09782 cite
Tiger A Tight-fisted Optimizer, an optimizer that is extremely budget-conscious github cite
CAME Confidence-guided Adaptive Memory Efficient Optimization github https://aclanthology.org/2023.acl-long.243/ cite
WSAM Sharpness-Aware Minimization Revisited: Weighted Sharpness as a Regularization Term github https://arxiv.org/abs/2305.15817 cite
Aida A DNN Optimizer that Improves over AdaBelief by Suppression of the Adaptive Stepsize Range github https://arxiv.org/abs/2203.13273 cite
GaLore Memory-Efficient LLM Training by Gradient Low-Rank Projection github https://arxiv.org/abs/2403.03507 cite
Adalite Adalite optimizer github https://github.com/VatsaDev/adalite cite

Supported LR Scheduler

You can check the supported learning rate schedulers with the code below.

from pytorch_optimizer import get_supported_lr_schedulers

supported_lr_schedulers = get_supported_lr_schedulers()
LR Scheduler Description Official Code Paper Citation
Explore-Exploit Wide-minima Density Hypothesis and the Explore-Exploit Learning Rate Schedule https://arxiv.org/abs/2003.03977 cite
Chebyshev Acceleration via Fractal Learning Rate Schedules https://arxiv.org/abs/2103.01338 cite
REX Revisiting Budgeted Training with an Improved Schedule github https://arxiv.org/abs/2107.04197 cite

Supported Loss Function

You can check the supported loss functions with the code below.

from pytorch_optimizer import get_supported_loss_functions

supported_loss_functions = get_supported_loss_functions()
Loss Functions Description Official Code Paper Citation
Label Smoothing Rethinking the Inception Architecture for Computer Vision https://arxiv.org/abs/1512.00567 cite
Focal Focal Loss for Dense Object Detection https://arxiv.org/abs/1708.02002 cite
Focal Cosine Data-Efficient Deep Learning Method for Image Classification Using Data Augmentation, Focal Cosine Loss, and Ensemble https://arxiv.org/abs/2007.07805 cite
LDAM Learning Imbalanced Datasets with Label-Distribution-Aware Margin Loss github https://arxiv.org/abs/1906.07413 cite
Jaccard (IOU) IoU Loss for 2D/3D Object Detection https://arxiv.org/abs/1908.03851 cite
Bi-Tempered Robust Bi-Tempered Logistic Loss Based on Bregman Divergences https://arxiv.org/abs/1906.03361 cite
Tversky Tversky loss function for image segmentation using 3D fully convolutional deep networks https://arxiv.org/abs/1706.05721 cite
Lovasz Hinge A tractable surrogate for the optimization of the intersection-over-union measure in neural networks github https://arxiv.org/abs/1705.08790 cite

Useful Resources

Several optimization ideas to regularize & stabilize training. Most of these ideas are applied in the Ranger21 optimizer.

Also, most of the figures are taken from the Ranger21 paper.

Adaptive Gradient Clipping · Gradient Centralization · Softplus Transformation
Gradient Normalization · Norm Loss · Positive-Negative Momentum
Linear learning rate warmup · Stable weight decay · Explore-exploit learning rate schedule
Lookahead · Chebyshev learning rate schedule · (Adaptive) Sharpness-Aware Minimization
On the Convergence of Adam and Beyond · Improved bias-correction in Adam · Adaptive Gradient Norm Correction

Adaptive Gradient Clipping

This idea was originally proposed in the NFNet (Normalizer-Free Networks) paper. AGC (Adaptive Gradient Clipping) clips gradients based on the unit-wise ratio of gradient norms to parameter norms.
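
To make the idea concrete, here is a minimal sketch of unit-wise AGC. It is illustrative only (not this library's implementation), and the function names and default thresholds are chosen for the example.

import torch

def unit_wise_norm(x: torch.Tensor) -> torch.Tensor:
    # norm per output unit (first dimension); whole-tensor norm for biases/scalars
    if x.ndim <= 1:
        return x.norm()
    return x.norm(dim=tuple(range(1, x.ndim)), keepdim=True)

def agc_(param: torch.Tensor, clip_factor: float = 1e-2, eps: float = 1e-3) -> None:
    # rescale the gradient wherever, unit-wise, ||g|| > clip_factor * max(||p||, eps)
    if param.grad is None:
        return
    max_norm = unit_wise_norm(param.detach()).clamp_(min=eps) * clip_factor
    grad_norm = unit_wise_norm(param.grad)
    clipped = param.grad * (max_norm / grad_norm.clamp(min=1e-6))
    param.grad.copy_(torch.where(grad_norm > max_norm, clipped, param.grad))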

Gradient Centralization


Gradient Centralization (GC) operates directly on gradients by centralizing the gradient to have zero mean.
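
A minimal sketch of the operation (illustrative, not the library's code): for any parameter with more than one dimension, subtract the mean over all dimensions except the first, so each output unit's gradient slice has zero mean.

import torch

def centralize_gradient(grad: torch.Tensor) -> torch.Tensor:
    # GC is usually applied only to conv/linear weight gradients (ndim > 1); others pass through
    if grad.ndim > 1:
        return grad - grad.mean(dim=tuple(range(1, grad.ndim)), keepdim=True)
    return grad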

Softplus Transformation

Running the final variance denominator through the softplus function lifts extremely tiny values to keep them viable.
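
A sketch of the idea (variable names and the beta value are illustrative): instead of adding a small epsilon to the square root of the second-moment estimate, the denominator is passed through softplus, which behaves like the identity for large inputs but floors very small ones.

import torch
import torch.nn.functional as F

exp_avg_sq = torch.rand(8)   # stand-in for an Adam-style second-moment estimate
beta_softplus = 50.0         # larger beta -> closer to the identity for large values

plain_denom = exp_avg_sq.sqrt() + 1e-8                              # usual Adam-style denominator
softplus_denom = F.softplus(exp_avg_sq.sqrt(), beta=beta_softplus)  # lifted, never collapses to ~0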

Gradient Normalization

Norm Loss


Positive-Negative Momentum


Linear learning rate warmup


Stable weight decay


Explore-exploit learning rate schedule


Lookahead

k steps forward, 1 step back. Lookahead keeps an exponential moving average of the weights, and this average is substituted for the current weights every k lookahead steps (5 by default).
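
A short usage sketch of the Lookahead wrapper in this package; the k and alpha keyword names follow the common Lookahead interface and are assumptions about the exact signature.

import torch
from pytorch_optimizer import AdamP, Lookahead

model = torch.nn.Linear(10, 2)
base_optimizer = AdamP(model.parameters(), lr=1e-3)
# k: number of fast steps between syncs of the slow (EMA) weights, alpha: interpolation factor
optimizer = Lookahead(base_optimizer, k=5, alpha=0.5)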

Chebyshev learning rate schedule

Acceleration via Fractal Learning Rate Schedules.

(Adaptive) Sharpness-Aware Minimization

Sharpness-Aware Minimization (SAM) simultaneously minimizes loss value and loss sharpness.
In particular, it seeks parameters that lie in neighborhoods having uniformly low loss.
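
A sketch of the usual two-step SAM training loop. The constructor arguments and the first_step/second_step method names follow the widely used reference implementation and are assumptions about this package's exact API.

import torch
from pytorch_optimizer import SAM

model = torch.nn.Linear(10, 2)
criterion = torch.nn.MSELoss()
optimizer = SAM(model.parameters(), base_optimizer=torch.optim.SGD, lr=0.1, rho=0.05)

def train_step(x: torch.Tensor, y: torch.Tensor) -> None:
    # first forward-backward: move the weights to the (approximate) worst point in the rho-neighborhood
    criterion(model(x), y).backward()
    optimizer.first_step(zero_grad=True)

    # second forward-backward: update the original weights with the gradient taken at the perturbed point
    criterion(model(x), y).backward()
    optimizer.second_step(zero_grad=True)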

On the Convergence of Adam and Beyond

Convergence issues can be fixed by endowing such algorithms with 'long-term memory' of past gradients.
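
Concretely, the AMSGrad fix keeps an element-wise maximum of all past second-moment estimates and divides by that instead, so the effective step size can never grow back once it has shrunk. A minimal sketch of that single change (illustrative, not the library's code):

import torch

beta2, eps = 0.999, 1e-8
exp_avg_sq = torch.zeros(4)       # running second moment v_t
max_exp_avg_sq = torch.zeros(4)   # 'long-term memory': element-wise max over all past v_t

def amsgrad_denominator(grad: torch.Tensor) -> torch.Tensor:
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1.0 - beta2)
    torch.maximum(max_exp_avg_sq, exp_avg_sq, out=max_exp_avg_sq)  # never let the denominator shrink
    return max_exp_avg_sq.sqrt().add(eps)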

Improved bias-correction in Adam

With the default bias-correction, Adam may actually make larger than requested gradient updates early in training.

Adaptive Gradient Norm Correction

Corrects the norm of the gradient in each iteration based on the adaptive training history of gradient norms.

Frequently asked questions

See the FAQ in the documentation.

Citation

Please cite the original authors of the optimization algorithms; you can easily find the citations in the tables above. If you use this software, please cite it as below, or use the "Cite this repository" button on GitHub.

@software{Kim_pytorch_optimizer_optimizer_2021,
    author = {Kim, Hyeongchan},
    month = jan,
    title = {{pytorch_optimizer: optimizer & lr scheduler & loss function collections in PyTorch}},
    url = {https://github.com/kozistr/pytorch_optimizer},
    version = {2.12.0},
    year = {2021}
}

Maintainer

Hyeongchan Kim / @kozistr


pytorch_optimizer's Issues

Updated Shampoo uber slow performance

I just swapped out the Nero optimizer in my Lightning AI loop and gave the new Shampoo a try. There is something going on with it, as this card is typically able to do 2 iterations per second on almost anything. The old Shampoo was not fast either, but achieving half the iterations per second was expected for a second-order optimizer.

Adafactor: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Traceback (most recent call last):
  File "C:\Users\dowon\miniconda3\envs\st5\lib\site-packages\torch\optim\optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "C:\Users\dowon\miniconda3\envs\st5\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\dowon\miniconda3\envs\st5\lib\site-packages\pytorch_optimizer\optimizer\adafactor.py", line 166, in step
    exp_avg_sq_row.mul_(beta2_t).add_(update.mean(dim=-1), alpha=1.0 - beta2_t)

Key locals at the failing frame: factored = True, grad_shape = torch.Size([392, 256]), beta2_t = 0.0, lr = 1e-06;
state['exp_avg'], update, and state['RMS'] live on cuda:0, while state['exp_avg_sq_row'] and state['exp_avg_sq_col'] are CPU tensors.
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
wandb: Waiting for W&B process to finish... (failed 1). Press Ctrl-C to abort syncing.
wandb:  View run 20230422_185120 at: https://wandb.ai/bingsu/speecht5_tts/runs/00i5z5r2
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: .\wandb\run-20230422_185122-00i5z5r2\logs
Windows 11
RTX 3070

python: 3.10.10
torch: 2.0.0+cu118
lightning: 2.0.1.post0
pytorch_optimizer: 2.6.0

How the optimizer is created:

        optimizer = create_optimizer(
            self,
            "adafactor",
            lr=self.cfg.train.lr,
            weight_decay=self.cfg.train.weight_decay,
            warmup_init=True,
        )

Lightning trainer settings:

    trainer = L.Trainer(
        precision=cfg.train.precision,  # "16-mixed"
        logger=wandb_logger,
        callbacks=callbacks,
        fast_dev_run=cfg.train.get("fast_dev_run", False),
        max_epochs=cfg.train.max_epochs,
        gradient_clip_val=cfg.train.get("gradient_clip_val", None),
    )

I always enjoy using this library, thank you.
This was my first time trying AdaFactor, and a device error occurred on exp_avg_sq_row and exp_avg_sq_col.

Please let me know if you need any more information.

AliG: Training Neural Networks for and by Interpolation

Training Neural Networks for and by Interpolation
https://arxiv.org/abs/1906.05661

In modern supervised learning, many deep neural networks are able to interpolate the data: the empirical
loss can be driven to near zero on all samples simultaneously. In this work, we explicitly exploit this
interpolation property for the design of a new optimization algorithm for deep learning, which we term
Adaptive Learning-rates for Interpolation with Gradients (ALI-G). ALI-G retains the two main advantages
of Stochastic Gradient Descent (SGD), which are (i) a low computational cost per iteration and (ii) good
generalization performance in practice. At each iteration, ALI-G exploits the interpolation property to
compute an adaptive learning-rate in closed form. In addition, ALI-G clips the learning-rate to a maximal
value, which we prove to be helpful for non-convex problems. Crucially, in contrast to the learning-rate of
SGD, the maximal learning-rate of ALI-G does not require a decay schedule, which makes it considerably
easier to tune. We provide convergence guarantees of ALI-G in various stochastic settings. Notably, we tackle
the realistic case where the interpolation property is satisfied up to some tolerance. We provide experiments
on a variety of architectures and tasks: (i) learning a differentiable neural computer; (ii) training a wide
residual network on the SVHN data set; (iii) training a Bi-LSTM on the SNLI data set; and (iv) training wide
residual networks and densely connected networks on the CIFAR data sets. ALI-G produces state-of-the-art
results among adaptive methods, and even yields comparable performance with SGD, which requires manually
tuned learning-rate schedules. Furthermore, ALI-G is simple to implement in any standard deep learning
framework and can be used as a drop-in replacement in existing code.

Reference implementation: https://github.com/oval-group/ali-g
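As a hedged illustration of the update described above (not the reference implementation), the closed-form ALI-G step size is the current loss divided by the squared gradient norm plus a small delta, clipped at a maximal learning rate; the names below are illustrative:

import torch

def alig_step(params, loss, max_lr=0.1, delta=1e-8):
    # closed-form ALI-G step size: loss / (||grad||^2 + delta), clipped at max_lr
    grads = torch.autograd.grad(loss, params)
    grad_sq_norm = sum(g.pow(2).sum() for g in grads)
    step_size = torch.clamp(loss.detach() / (grad_sq_norm + delta), max=max_lr)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.add_(g, alpha=-step_size.item())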

Lookahead is not a subclass of torch.optim.Optimizer

Describe the bug

Lookahead is not a subclass of torch.optim.Optimizer. This is a problem e.g. when using PyTorch Lightning which expects your optimiser to be a subclass of Optimizer.

To Reproduce

from pytorch_optimizer import Lookahead
from torch.optim import Optimizer, Adam
from torch import nn

opt = Lookahead(Adam(nn.Linear(2, 3).parameters()))
assert isinstance(opt, Optimizer)

will get you an AssertionError.

Additional context

I checked a lot of the other optimizers, and they subclass pytorch_optimizer.BaseOptimizer as well as torch.optim.Optimizer. I will submit a PR shortly.

Ranger21 causing loss to spike and model never converges

It is very weird to me, but Ranger21 is causing my model (a Wav2Vec2-based transformer) to have spikes in loss, and it never converges, while some other optimizers from this repo, like RAdam, work fine. I wonder if you know any reason for this, or anything I should have done differently to use the Ranger21 optimizer properly.

Currently I'm using it with a constant LR of 1e-4 (though Ranger21 internally does its own warm-up and warm-down). Batch size is 8 with gradient accumulation of 4, making the effective batch size 32; epochs is set to 100, but the same thing happens at 50 or even 5. I am training with the feature extractor frozen, so only the logits layer is updated.

Can I also check that the num_iterations parameter of Ranger21 refers to the number of optimization steps, not epochs? I am currently calculating the steps with this formula: epochs * len(dataset) / gradient_accumulation_steps (see the sketch below).
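For reference, a minimal sketch of that calculation under the assumption that num_iterations does mean optimization steps (train_dataloader, grad_accum_steps, and epochs are placeholder names, not this library's API):

import math

# optimizer steps per epoch = batches per epoch / gradient accumulation steps
steps_per_epoch = math.ceil(len(train_dataloader) / grad_accum_steps)
num_iterations = steps_per_epoch * epochs  # value passed to Ranger21(num_iterations=...)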

Below is a comparison between Ranger21 and AdamW; orange is AdamW. "wer" is the word error rate in %; Ranger21 made the model reach 100% WER, which means the model got everything wrong. Towards the end, the loss with Ranger21 seems to be stabilizing, but it is actually stuck at a pretty high value and slowly increasing. The graph is not obvious because the spikes make the scale very small.

[image: loss and WER curves comparing Ranger21 and AdamW]

RotoGrad: Gradient Homogenization in Multitask Learning

I just noticed PCGrad in the list of optimizer modifiers. I had used this with success in the past on a very specific task. It may be useful.

RotoGrad: Gradient Homogenization in Multitask Learning
Paper: https://openreview.net/pdf?id=T8wHz4rnuGL

Multitask learning is being increasingly adopted in application domains like computer vision and reinforcement learning. However, optimally exploiting its advantages remains a major challenge due to the effect of negative transfer. Previous works have tracked down this issue to the disparities in gradient magnitudes and directions across tasks when optimizing the shared network parameters. While recent work has acknowledged that negative transfer is a two-fold problem, existing approaches fall short. These methods only focus on either homogenizing the gradient magnitude across tasks or greedily changing the gradient directions, overlooking future conflicts. In this work, we introduce RotoGrad, an algorithm that tackles negative transfer as a whole: it jointly homogenizes gradient magnitudes and directions, while ensuring training convergence. We show that RotoGrad outperforms competing methods in complex problems, including multi-label classification in CelebA and computer vision tasks in the NYUv2 dataset.

A Pytorch implementation can be found in https://github.com/adrianjav/rotograd.

Question about using Ranger21 with Hugging Face Trainer

First of all, sorry if this issue is not directly related to this repo, but I thought that you, being an expert on PyTorch optimizers, would know more about this than most.

I am not sure if you are familiar with the Hugging Face Trainer (https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/trainer#transformers.Trainer), but it allows a custom optimizer while also requiring a learning rate scheduler to be passed in, like

from transformers.optimization import Adafactor, AdafactorSchedule

optimizer = Adafactor(model.parameters(), scale_parameter=True, relative_step=True, warmup_init=True, lr=None)
lr_scheduler = AdafactorSchedule(optimizer)
trainer = Trainer(..., optimizers=(optimizer, lr_scheduler))

In this case, I can replace optimizer with Ranger21, but since Ranger21 has built-in scheduling, how does this interact with the extra lr_scheduler? And what should I be passing in for that?
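One possible approach, sketched here as an assumption rather than a verified recipe: since Ranger21 schedules the learning rate internally, a constant (no-op) scheduler such as transformers' get_constant_schedule could be passed just to satisfy the Trainer API (model and total_steps below are placeholders):

from transformers import Trainer
from transformers.optimization import get_constant_schedule
from pytorch_optimizer import Ranger21

optimizer = Ranger21(model.parameters(), num_iterations=total_steps, lr=1e-4)
lr_scheduler = get_constant_schedule(optimizer)  # no-op schedule; Ranger21 warms up/down itself
trainer = Trainer(..., optimizers=(optimizer, lr_scheduler))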

Scalable Second Order Optimization for Deep Learning

I have been using your optimizers, and I stumbled upon improvements to Shampoo. Just in case you haven't seen them already.

Paper preprints: https://arxiv.org/abs/2002.09018

@misc{anil2021scalable,
      title={Scalable Second Order Optimization for Deep Learning},
      author={Rohan Anil and Vineet Gupta and Tomer Koren and Kevin Regan and Yoram Singer},
      year={2021},
      eprint={2002.09018},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

The PyTorch implementation is slower than the JAX implementation, as it is written with readability rather than speed in mind. https://github.com/google-research/google-research/tree/master/scalable_shampoo

Improvement to SAM: SAM as an Optimal Relaxation of Bayes

SAM as an Optimal Relaxation of Bayes

Sharpness-aware minimization (SAM) and related adversarial deep-learning methods can drastically improve generalization, but their underlying mechanisms are not yet fully understood. Here, we establish SAM as a relaxation of the Bayes objective where the expected negative-loss is replaced by the optimal convex lower bound, obtained by using the so-called Fenchel biconjugate. The connection enables a new Adam-like extension of SAM to automatically obtain reasonable uncertainty estimates, while sometimes also improving its accuracy. By connecting adversarial and Bayesian methods, our work opens a new path to robustness.

Preprint: https://arxiv.org/abs/2210.01620


get_chebyshev_schedule not working

When I call get_chebyshev_schedule with any integer value, e.g. get_chebyshev_schedule(5), it errors on return steps[perm]: IndexError: index 3 is out of bounds for axis 0 with size 3

Am I using this wrongly?

Gradient Descent: The Ultimate Optimizer

https://arxiv.org/abs/1909.13371

Working with any gradient-based machine learning algorithm involves the tedious
task of tuning the optimizer’s hyperparameters, such as its step size. Recent work
has shown how the step size can itself be optimized alongside the model parameters
by manually deriving expressions for “hypergradients” ahead of time.
We show how to automatically compute hypergradients with a simple and elegant
modification to backpropagation. This allows us to easily apply the method to
other optimizers and hyperparameters (e.g. momentum coefficients). We can even
recursively apply the method to its own hyper-hyperparameters, and so on ad infinitum. As these towers of optimizers grow taller, they become less sensitive to the initial choice of hyperparameters. We present experiments validating this for MLPs,
CNNs, and RNNs. Finally, we provide a simple PyTorch implementation of this
algorithm (see people.csail.mit.edu/kach/gradient-descent-the-ultimate-optimizer).

Reference implementation: https://github.com/kach/gradient-descent-the-ultimate-optimizer

I had been using this to great effect on some small tasks, but the problem is that it is not very framework-friendly (clearly not a plug-and-play optimizer) and it requires engineering around how it works. It would be great if you could figure out how to make it more plug-and-play.
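For what it's worth, here is a minimal, self-contained sketch of the hypergradient idea from the abstract (not the paper's API): treat the step size as a differentiable tensor, keep the autograd graph through one SGD update, and take the gradient of the next-step loss with respect to the step size.

import torch

w = torch.randn(3, requires_grad=True)       # model parameters
lr = torch.tensor(0.01, requires_grad=True)  # step size, itself differentiable
hyper_lr = 1e-4                              # learning rate for the step size

def loss_fn(w):
    return (w ** 2).sum()

for _ in range(100):
    loss = loss_fn(w)
    (g,) = torch.autograd.grad(loss, w, create_graph=True)
    w_next = w - lr * g                              # keep the graph so lr gets a hypergradient
    (hyper_g,) = torch.autograd.grad(loss_fn(w_next), lr)
    with torch.no_grad():
        lr -= hyper_lr * hyper_g                     # update the step size with its hypergradient
        w.copy_(w_next)                              # commit the parameter update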

Aida optimizer

A new optimizer that improves upon AdaBelief, Aida, is proposed in this paper. The optimizer was tested against SGD, Adam, AdamW, RAdam, Fromage, and AdaBelief on ResNet, Transformer, and LSTM models.

ipex failed for Adan from pytorch_optimizer

Hi,

ipex does not work, at least for Adan from the pytorch_optimizer package.

Here is a toy example:

import torch
import torch.nn as nn
import numpy as np
import intel_extension_for_pytorch as ipex
from pytorch_optimizer import Adan

input_size = 1
output_size = 1

# hyper-parameters
num_epochs = 1
learning_rate = 0.001

# toy dataset
x = np.random.randn(10, input_size).astype(np.float32)
y = np.random.randn(10, output_size).astype(np.float32)

# linear regression model
model = nn.Linear(input_size, output_size)

# loss and optimizer
criterion = nn.MSELoss()
optimizer = Adan(model.parameters(), lr=learning_rate)  

# ipex
model, optimizer = ipex.optimize(model, optimizer=optimizer)

# train the model
for epoch in range(num_epochs):
    inputs = torch.from_numpy(x)
    targets = torch.from_numpy(y)
    
    # forward pass
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    
    # backward and optimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Result:

AttributeError: 'Adan' object has no attribute 'use_gc'

I use Python 3.11.4, PyTorch 2.0.1 on cpu and ipex 2.0.100+cpu.

Thanks for any help.

SophiaH bug

File "/opt/conda/envs/dsd/lib/python3.10/site-packages/pytorch_optimizer/optimizer/sophia.py", line 92, in step
self.compute_hutchinson_hessian(
File "/opt/conda/envs/dsd/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/envs/dsd/lib/python3.10/site-packages/pytorch_optimizer/base/optimizer.py", line 100, in compute_hutchinson_hessian
h_zs = torch.autograd.grad(grads, params, grad_outputs=zs, retain_graph=i < num_samples - 1)
File "/opt/conda/envs/dsd/lib/python3.10/site-packages/torch/autograd/init.py", line 303, in grad
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

Ranger sign inversion

Describe the bug

From my experiments it seems like the sign for Ranger is inverted. All other optimizers (including Ranger21) have steps in the opposite direction of Ranger.

Note that I'm testing context-free step directions/magnitudes using a 'perfect' gradient (scaled by 4), so if Ranger somehow reverses course when gradients from different directions are accumulated, that would be missed by my test.
Hyperparameters: {'betas': (0.003344506587403595, 0.9685357345548955), 'lr': 0.4616639698903086} (found through hyperparameter search, also done for the other optimizers) and evaluated on the Ackley (dim=2) function.

(I didn't want to create a PR before discussing if this might be intended)

To Reproduce

  • OS : Linux
  • PyTorch version : 2
  • Python version : 3.11

Log

Ranger: [image]

For comparison, SGD: [image]

Plans for pytorch_optimizer v3

In pytorch-optimizer v3, loss functions will be added, so, finally, optimizers, lr schedulers, and loss functions are all in one package.

Feature

  • support at least 60 optimizers
  • support at least 10 objectives
  • support bitsandbytes (& 4-bit optimizers)

Refactor

  • Organize utils

Docs

  • Organize documentation
  • Support contribution guide (implementation, test, etc...)
  • Add issue templates
  • Migrate to mkdocs
  • Create Q&A page
  • Benchmark on ImageNet

Test

  • Organize test cases

Entropy-MCMC: Sampling from flat basins with ease

Abstract:

Bayesian deep learning counts on the quality of posterior distribution estimation. However, the posterior of deep neural networks is highly multi-modal in nature, with local modes exhibiting varying generalization performance. Given a practical budget, targeting at the original posterior can lead to suboptimal performance, as some samples may become trapped in "bad" modes and suffer from overfitting. Leveraging the observation that "good" modes with low generalization error often reside in flat basins of the energy landscape, we propose to bias sampling on the posterior toward these flat regions. Specifically, we introduce an auxiliary guiding variable, the stationary distribution of which resembles a smoothed posterior free from sharp modes, to lead the MCMC sampler to flat basins. By integrating this guiding variable with the model parameter, we create a simple joint distribution that enables efficient sampling with minimal computational overhead. We prove the convergence of our method and further show that it converges faster than several existing flatness-aware methods in the strongly convex setting. Empirical results demonstrate that our method can successfully sample from flat basins of the posterior, and outperforms all compared baselines on multiple benchmarks including classification, calibration, and out-of-distribution detection.

Paper: https://www.semanticscholar.org/paper/Entropy-MCMC%3A-Sampling-from-Flat-Basins-with-Ease-Li-Zhang/fd95de3f24fc4f955a6fe5719d38d1d06136e0cd
Code: https://github.com/lblaoke/EMCMC/tree/master

VeLO: Training Versatile Learned Optimizers by Scaling Up

https://arxiv.org/abs/2211.09760

While deep learning models have replaced hand-designed features across many domains,
these models are still trained with hand-designed optimizers. In this work, we leverage the same
scaling approach behind the success of deep learning to learn versatile optimizers. We train an
optimizer for deep learning which is itself a small neural network that ingests gradients and
outputs parameter updates. Meta-trained with approximately four thousand TPU-months of
compute on a wide variety of optimization tasks, our optimizer not only exhibits compelling
performance, but optimizes in interesting and unexpected ways. It requires no hyperparameter
tuning, instead automatically adapting to the specifics of the problem being optimized. We open
source our learned optimizer, meta-training code, the associated train and test data, and an
extensive optimizer benchmark suite with baselines at velo-code.github.io.

https://github.com/google/learned_optimization/tree/main/learned_optimization/research/general_lopt

Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training

Title: Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training

Abstract: Given the massive cost of language model pre-training, a non-trivial improvement of the optimization
algorithm would lead to a material reduction on the time and cost of training. Adam and its variants have
been state-of-the-art for years, and more sophisticated second-order (Hessian-based) optimizers often incur
too much per-step overhead. In this paper, we propose Sophia, Second-order Clipped Stochastic Optimization,
a simple scalable second-order optimizer that uses a light-weight estimate of the diagonal Hessian as the
pre-conditioner. The update is the moving average of the gradients divided by the moving average of the
estimated Hessian, followed by element-wise clipping. The clipping controls the worst-case update size
and tames the negative impact of non-convexity and rapid change of Hessian along the trajectory. Sophia
only estimates the diagonal Hessian every handful of iterations, which has negligible average per-step time
and memory overhead. On language modeling with GPT-2 models of sizes ranging from 125M to 770M,
Sophia achieves a 2x speed-up compared with Adam in the number of steps, total compute, and wall-clock
time.
Theoretically, we show that Sophia adapts to the curvature in different components of the parameters,
which can be highly heterogeneous for language modeling tasks. Our run-time bound does not depend on the
condition number of the loss.

Explainer: https://twitter.com/tengyuma/status/1661412995430219786
Paper: https://arxiv.org/pdf/2305.14342.pdf

Gradient preconditioner: [image]
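A hedged sketch of the update rule described in the abstract (per-parameter tensors; the function name, the hyperparameter values, and the way hessian_ema is produced, e.g. a Hutchinson estimate refreshed every k steps, are assumptions):

import torch

def sophia_update(param, grad, exp_avg, hessian_ema, lr=1e-4, beta1=0.965, rho=0.04, eps=1e-12):
    exp_avg.mul_(beta1).add_(grad, alpha=1.0 - beta1)                         # EMA of gradients
    ratio = (exp_avg / (rho * hessian_ema).clamp_min(eps)).clamp_(-1.0, 1.0)  # element-wise clip
    param.add_(ratio, alpha=-lr)                                              # preconditioned step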

Trying to use SAM optimizer for Random Sampling Image Classification

I am trying to use the SAM optimizer. When I call the backward function twice in train_epoch() (the second forward-backward pass), it gives me an error; otherwise it works fine.

Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [512, 100]], which is output 0 of AsStridedBackward0, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

def train_epoch(models, criterion, optimizers, dataloaders):
    models.train()
    global iters
    for data in tqdm(dataloaders['train'], leave=False, total=len(dataloaders['train'])):
        with torch.cuda.device(CUDA_VISIBLE_DEVICES):
            inputs = data[0].cuda()
            labels = data[1].cuda()
        iters += 1
        optimizers.zero_grad()
        # pdb.set_trace()
        scores, _, features = models(inputs)

        target_loss = criterion(scores, labels)
        m_backbone_loss = torch.sum(target_loss) / target_loss.size(0)
        loss = m_backbone_loss

        # ----------------- SAM Optimizer -------------------
        # first forward-backward pass
        criterion(models(inputs)[0], labels)
        loss.backward(retain_graph=True)
        optimizers.first_step(zero_grad=True)

        # second forward-backward pass
        criterion(models(inputs)[0], labels)
        loss.backward(retain_graph=True)
        optimizers.second_step(zero_grad=True)
    # return loss
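For reference, a hedged sketch of the usual SAM two-pass pattern, reusing the names from the snippet above: each backward call should follow a fresh forward pass and a freshly computed loss (so retain_graph is not needed). The .mean() call assumes the criterion returns per-sample losses, as in the snippet above.

# first forward-backward pass
loss = criterion(models(inputs)[0], labels).mean()
loss.backward()
optimizers.first_step(zero_grad=True)

# second forward-backward pass, on the perturbed weights
criterion(models(inputs)[0], labels).mean().backward()
optimizers.second_step(zero_grad=True)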

EvoLved Sign Momentum [Lion] optimizer

Symbolic Discovery of Optimization Algorithms
https://arxiv.org/pdf/2302.06675.pdf

We present a method to formulate algorithm discovery as program search, and apply it to discover optimization algorithms for deep neural network training. We leverage efficient search techniques to explore an
infinite and sparse program space. To bridge the large generalization gap between proxy and target tasks, we
also introduce program selection and simplification strategies. Our method discovers a simple and effective
optimization algorithm, Lion (EvoLved Sign Momentum). It is more memory-efficient than Adam as it only
keeps track of the momentum. Different from adaptive optimizers, its update has the same magnitude for
each parameter calculated through the sign operation. We compare Lion with widely used optimizers, such
as Adam and Adafactor, for training a variety of models on different tasks. On image classification, Lion
boosts the accuracy of ViT by up to 2% on ImageNet and saves up to 5x the pre-training compute on JFT. On
vision-language contrastive learning, we achieve 88.3% zero-shot and 91.1% fine-tuning accuracy on ImageNet,
surpassing the previous best results by 2% and 0.1%, respectively. On diffusion models, Lion outperforms
Adam by achieving a better FID score and reducing the training compute by up to 2.3x. For autoregressive,
masked language modeling, and fine-tuning, Lion exhibits a similar or better performance compared to
Adam. Our analysis of Lion reveals that its performance gain grows with the training batch size. It also
requires a smaller learning rate than Adam due to the larger norm of the update produced by the sign
function. Additionally, we examine the limitations of Lion and identify scenarios where its improvements
are small or not statistically significant. The implementation of Lion is publicly available.

Reference implementation: https://github.com/google/automl/tree/master/lion
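A hedged sketch of the Lion update described above (a single momentum buffer; the sign of an interpolated momentum gives an update with the same magnitude for every coordinate; the names and default values are illustrative):

import torch

def lion_update(param, grad, exp_avg, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    update = exp_avg.mul(beta1).add_(grad, alpha=1.0 - beta1).sign_()  # sign of interpolated momentum
    if weight_decay > 0.0:
        param.mul_(1.0 - lr * weight_decay)                            # decoupled weight decay
    param.add_(update, alpha=-lr)
    exp_avg.mul_(beta2).add_(grad, alpha=1.0 - beta2)                  # momentum (the only state kept)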

Implement optimizers

  • implement Accelerated SGD optimizer
  • implement Adaptive SGD optimizer
  • implement A2Grad optimizer (3 variants)
  • implement Yogi optimizer
  • implement AdaMod optimizer
  • implement PID optimizer
  • implement AggMo optimizer
  • implement QHAdam optimizer
  • implement QHM optimizer
  • implement SGDW optimizer
  • implement SWATS optimizer
  • implement Fromage optimizer
  • implement MSVAG optimizer

support torch 2.0

The torch requirement is currently pinned to >=1.10,<2, which causes a problem when installing alongside torch 2.0.
When I forced the install and tried it, nothing went wrong.

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank adaptation (LoRA), add a trainable low-rank matrix to the frozen pre-trained weight in each layer, reducing trainable parameters and optimizer states. However, such approaches typically underperform training with full-rank weights in both pre-training and fine-tuning stages since they limit the parameter search to a low-rank subspace and alter the training dynamics, and further, may require full-rank warm start. In this work, we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA. Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies.

Paper: https://arxiv.org/abs/2403.03507
Code: https://github.com/jiaweizzhao/GaLore/tree/master
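A hedged sketch of the core gradient low-rank projection idea (not the GaLore API; in the paper the projector is refreshed only every T steps and the optimizer state lives in the projected space):

import torch

def make_projector(grad, rank=4):
    # top-r left singular vectors of the gradient matrix define the projector
    U, _, _ = torch.linalg.svd(grad, full_matrices=False)
    return U[:, :rank]

def project(P, grad):
    return P.T @ grad            # (r x n) low-rank gradient fed to the optimizer

def project_back(P, low_rank_update):
    return P @ low_rank_update   # (m x n) full-size update applied to the weight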

Ranger21 has undocumented required arguments

Describe the bug

If you try to use the Ranger21 optimizer (with default settings), you'll get an error:

  File "/src/aigen/aigen/train.py", line 145, in select_optimizer
    optimizer = Ranger21(
TypeError: Ranger21.__init__() missing 1 required positional argument: 'num_iterations'

"num_iterations" is an undocumented argument.

To Reproduce

  • transformers version: 4.35.2
  • Platform: Linux-6.5.9-arch2-1-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.19.4
  • Safetensors version: 0.4.0
  • Accelerate version: 0.24.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: True
  • Using distributed or parallel set-up in script?: False

Expected behavior

I would expect num_iterations to be clearly documented, and to have a default value.

Additional context

I was originally going to fix this myself and create a pull request. However, looking into the Ranger21 code, I don't actually know how "num_iterations" was intended to be used. There are two other undocumented arguments: "num_warm_up_iterations" and "num_warm_down_iterations". I don't understand why this wasn't left to the scheduler, and indeed, whether I use 1 for num_iterations or my run's total number of steps, the model does not learn at all: loss remains flat and weights do not update.

This is not a major issue and I'm going to use a different optimizer for now. Just wanted to make sure maintainers were aware.

SophiaH in https://github.com/booydar/LM-RMT

#params = 151111638
#non emb params = 41066400
| epoch 1 step 50 | 50 batches | lr 0.06 | ms/batch 1378.43 | loss 7.85 | ppl 2570.784
| epoch 1 step 100 | 100 batches | lr 0.06 | ms/batch 968.61 | loss 7.49 | ppl 1787.593
| epoch 1 step 150 | 150 batches | lr 0.06 | ms/batch 971.58 | loss 7.48 | ppl 1769.387
| epoch 1 step 200 | 200 batches | lr 0.06 | ms/batch 969.84 | loss 7.47 | ppl 1760.055
| epoch 1 step 250 | 250 batches | lr 0.06 | ms/batch 973.37 | loss 7.46 | ppl 1738.300
| epoch 1 step 300 | 300 batches | lr 0.06 | ms/batch 970.12 | loss 7.48 | ppl 1772.002
| epoch 1 step 350 | 350 batches | lr 0.06 | ms/batch 970.52 | loss 7.47 | ppl 1751.793
| epoch 1 step 400 | 400 batches | lr 0.06 | ms/batch 973.12 | loss 7.47 | ppl 1755.161
| epoch 1 step 450 | 450 batches | lr 0.06 | ms/batch 970.79 | loss 7.46 | ppl 1736.315
| epoch 1 step 500 | 500 batches | lr 0.06 | ms/batch 974.13 | loss 7.48 | ppl 1765.010
| epoch 1 step 550 | 550 batches | lr 0.06 | ms/batch 973.86 | loss 7.48 | ppl 1778.569
Traceback (most recent call last):
File "/home/notebook/code/personal/80306170/AGI/LM-RMT/pytorch/train.py", line 620, in
train()
File "/home/notebook/code/personal/80306170/AGI/LM-RMT/pytorch/train.py", line 540, in train
optimizer.step()
File "/opt/conda/envs/dsd/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
return wrapped(*args, **kwargs)
File "/opt/conda/envs/dsd/lib/python3.10/site-packages/torch/optim/optimizer.py", line 280, in wrapper
out = func(*args, **kwargs)
File "/opt/conda/envs/dsd/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/envs/dsd/lib/python3.10/site-packages/pytorch_optimizer/optimizer/sophia.py", line 92, in step
self.compute_hutchinson_hessian(
File "/opt/conda/envs/dsd/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/envs/dsd/lib/python3.10/site-packages/pytorch_optimizer/base/optimizer.py", line 100, in compute_hutchinson_hessian
h_zs = torch.autograd.grad(grads, params, grad_outputs=zs, retain_graph=i < num_samples - 1)
File "/opt/conda/envs/dsd/lib/python3.10/site-packages/torch/autograd/init.py", line 303, in grad
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: res[i].defined() INTERNAL ASSERT FAILED at "../torch/csrc/autograd/functions/tensor.cpp":142, please report a bug to PyTorch.

Can't install `pytorch-optimizer>1.12` in python <= 3.7

Versions newer than 1.12 cannot be installed on Python 3.7 or lower.

Google Colab currently runs Python 3.7, so running !pip install pytorch-optimizer in Colab gives:

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pytorch-optimizer
  Downloading pytorch_optimizer-1.1.2-py3-none-any.whl (56 kB)
     |████████████████████████████████| 56 kB 1.9 MB/s 
Requirement already satisfied: numpy<2.0.0,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from pytorch-optimizer) (1.21.6)
Requirement already satisfied: torch<2.0.0,>=1.11.0 in /usr/local/lib/python3.7/dist-packages (from pytorch-optimizer) (1.12.1+cu113)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.7/dist-packages (from torch<2.0.0,>=1.11.0->pytorch-optimizer) (4.1.1)
Installing collected packages: pytorch-optimizer
Successfully installed pytorch-optimizer-1.1.2

So version 1.1.2 gets installed.

# pyproject.toml
[tool.poetry.dependencies]
python = "^3.8"
numpy = "^1.22.4"
torch = "^1.10"

Ranger21 does not work

Below is the trace when I try to use Ranger21; other optimizers work as they should.

c:\users\g\appdata\local\programs\python\python38\lib\site-packages\pytorch_optimizer\ranger21.py in __init__(self, params, lr, beta0, betas, use_softplus, beta_softplus, num_iterations, num_warm_up_iterations, num_warm_down_iterations, warm_down_min_lr, agc_clipping_value, agc_eps, centralize_gradients, normalize_gradients, lookahead_merge_time, lookahead_blending_alpha, weight_decay, norm_loss_factor, eps)
114 # warmup iterations
115 self.num_warm_up_iterations: int = (
--> 116 self.build_warm_up_iterations(num_iterations, betas[1])
117 if num_warm_up_iterations is None
118 else num_warm_up_iterations

c:\users\g\appdata\local\programs\python\python38\lib\site-packages\pytorch_optimizer\ranger21.py in build_warm_up_iterations(total_iterations, beta2, warm_up_pct)
150 def build_warm_up_iterations(total_iterations: int, beta2: float, warm_up_pct: float = 0.22) -> int:
151 warm_up_iterations: int = math.ceil(2.0 / (1.0 - beta2)) # default un-tuned linear warmup
--> 152 beta_pct: float = warm_up_iterations / total_iterations
153 if beta_pct > 0.45:
154 return int(warm_up_pct * total_iterations)

TypeError: unsupported operand type(s) for /: 'int' and 'NoneType'
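For anyone else hitting this, a minimal sketch of the constructor call (total_steps is a placeholder you compute from your own training loop; Ranger21 uses it to build its internal warm-up/warm-down schedule):

from pytorch_optimizer import Ranger21

optimizer = Ranger21(model.parameters(), num_iterations=total_steps, lr=1e-3)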

Lion NaN on optimization

Under very degenerate conditions (the architecture where I see it happen is not very well behaved), the results are great when it works, but I need to restart many times until it does. I haven't been able to isolate the issue, but it seems to me that the following lines may require a few epsilons.

#73  p.div_(neuron_norm(p))
...
#86  grad_normed = grad / (state['exp_avg_sq'] / bias_correction).sqrt()
....
#93  p.div_(neuron_norm(p))

What tips me off is that if I do a pass (a single epoch) with any other optimizer and then hook Nero in, there is absolutely no issue.

Empty Docs Sections

In the Docs, every section on every page is now empty. I tried different browsers and VPNs, but that didn't solve the issue. Everything worked fine a few weeks ago.
[images: screenshots of documentation pages rendering with empty sections]

Versions of codes that work with half precision models

Hi
I just discovered your repo and I would like to try it to fine-tune my ParlAI BlenderBot2 model (see https://github.com/facebookresearch/ParlAI). However, I am running the model in FP16 precision to make better use of my GPU. ParlAI ships versions of a few optimizers that can handle FP16 models, and I have tried installing a couple of other optimizers that also work with FP16 models by casting the state parameters and gradients to FP32 within the optimizer, computing the new state parameters with FP32 accuracy, and recasting them back to FP16 when updating the model. If you had a version of your library that did this automatically, it would greatly simplify its use with FP16-precision models.
Thanks!

P.S.
It looks like adabelief, radam, and diffgrad do something like this, but not in a consistent way.
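For illustration, a hedged sketch of the pattern described above (illustrative names, not this library's API): upcast the FP16 parameter and gradient, keep the optimizer state in FP32, and write the updated parameter back in FP16.

import torch

def step_in_fp32(p_fp16, grad_fp16, exp_avg_fp32, lr=1e-3, beta=0.9):
    p = p_fp16.float()                                   # FP32 master copy of the parameter
    g = grad_fp16.float()                                # FP32 copy of the gradient
    exp_avg_fp32.mul_(beta).add_(g, alpha=1.0 - beta)    # optimizer state stays in FP32
    p.add_(exp_avg_fp32, alpha=-lr)                      # update with FP32 accuracy
    p_fp16.copy_(p.half())                               # recast back to FP16 for the model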
