
sail-sg / adan

Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models

License: Apache License 2.0

Python 87.85% Shell 0.85% Cuda 7.11% C++ 4.19%
adan bert-model convnext deep-learning fairseq mae optimizer resnet timm vit

adan's Introduction

This is the official PyTorch implementation of Adan. See the paper here. If you find Adan helpful or inspiring for your projects, please cite the paper and star this repository. Thanks!

@article{xie2024adan,
  title={Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models},
  author={Xie, Xingyu and Zhou, Pan and Li, Huan and Lin, Zhouchen and Yan, Shuicheng},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2024},
  publisher={IEEE}
}

Supported Projects

News

  • 🔥🔥🔥 Results on large language models, like MoE and GPT2, are released.
  • FusedAdan, with a smaller memory footprint, is released.

Installation

python3 -m pip install git+https://github.com/sail-sg/Adan.git

FusedAdan is installed by default. If you want to use the original Adan, please install it by:

git clone https://github.com/sail-sg/Adan.git
cd Adan
python3 setup.py install --unfused

Usage

To make Adan easy to use, we first give brief instructions, then some general experimental tips, and finally more details (e.g., specific commands and hyper-parameters) for each experiment in the paper.

1) Two steps to use Adan

Step 1. Add the following Adan-dependent hyper-parameters to the config:

parser.add_argument('--max-grad-norm', type=float, default=0.0, help='if the l2 norm is larger than this hyper-parameter, then we clip the gradient (default: 0.0, no gradient clip)')
parser.add_argument('--weight-decay', type=float, default=0.02, help='weight decay, similar to the one used in AdamW (default: 0.02)')
parser.add_argument('--opt-eps', default=None, type=float, metavar='EPSILON', help='optimizer epsilon to avoid the bad case where second-order moment is zero (default: None, use opt default 1e-8 in adan)')
parser.add_argument('--opt-betas', default=None, type=float, nargs='+', metavar='BETA', help='optimizer betas in Adan (default: None, use opt default [0.98, 0.92, 0.99] in Adan)')
parser.add_argument('--no-prox', action='store_true', default=False, help='whether to perform weight decay like AdamW (default=False)')

opt-betas: To keep consistent with our usage habits, the $\beta$'s in the paper are actually the $(1-\beta)$'s in the code.
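The convention above can be sketched as a small conversion helper. This is only an illustration: `paper_to_code_betas` is a hypothetical name, not part of the adan package, and it assumes the mapping is exactly `1 - beta` for each of the three values.

```python
def paper_to_code_betas(paper_betas):
    """Map the paper's (beta1, beta2, beta3) to the code's convention: 1 - beta."""
    if len(paper_betas) != 3:
        raise ValueError(f"Adan expects 3 betas, got {len(paper_betas)}")
    # Round to suppress floating-point noise such as 0.020000000000000018.
    return tuple(round(1.0 - b, 12) for b in paper_betas)


# Under this convention, the code defaults [0.98, 0.92, 0.99] correspond to
# paper-style values (0.02, 0.08, 0.01).
print(paper_to_code_betas((0.02, 0.08, 0.01)))
```

Because the mapping is its own inverse, the same helper also converts code-style values back to paper-style ones.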

foreach (bool): If True, Adan will use the torch._foreach implementation. It is faster but uses slightly more memory.

no-prox: It determines the update rule of parameters with weight decay. By default, Adan updates the parameters in the way presented in Algorithm 1 in the paper:

$$\boldsymbol{\theta}_{k+1} = ( 1+\lambda \eta)^{-1} \left[\boldsymbol{\theta}_k - \boldsymbol{\eta}_k \circ (\mathbf{m}_k+(1-{\color{blue}\beta_2})\mathbf{v}_k)\right]$$

Alternatively, one can update the parameters as AdamW does (no_prox=True):

$$\boldsymbol{\theta}_{k+1} = ( 1-\lambda \eta)\boldsymbol{\theta}_k - \boldsymbol{\eta}_k \circ (\mathbf{m}_k+(1-{\color{blue}\beta_2})\mathbf{v}_k).$$
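The two update rules differ only in how weight decay enters. A minimal scalar sketch follows, assuming the term $\boldsymbol{\eta}_k \circ (\mathbf{m}_k+(1-\beta_2)\mathbf{v}_k)$ has already been computed and is passed in as `update`; the function name `apply_update` is hypothetical and not taken from the repository.

```python
def apply_update(theta, update, lr, weight_decay, no_prox=False):
    """Apply one parameter update with weight decay.

    `update` stands for eta_k * (m_k + (1 - beta2) * v_k), computed elsewhere.
    """
    if no_prox:
        # AdamW-style decoupled decay: shrink theta, then subtract the update.
        return (1.0 - lr * weight_decay) * theta - update
    # Default (proximal) rule: subtract the update, then divide by 1 + lambda * eta.
    return (theta - update) / (1.0 + lr * weight_decay)


theta, update, lr, wd = 1.0, 0.1, 0.1, 0.5
print(apply_update(theta, update, lr, wd))                # proximal rule
print(apply_update(theta, update, lr, wd, no_prox=True))  # AdamW-style rule
```

For small values of `lr * weight_decay` the two rules give nearly the same result; they diverge as that product grows.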

Step 2. Create the Adan optimizer as follows. In this step, we can directly replace the vanilla optimizer by using the following command:

from adan import Adan
optimizer = Adan(param, lr=args.lr, weight_decay=args.weight_decay, betas=args.opt_betas, eps = args.opt_eps, max_grad_norm=args.max_grad_norm, no_prox=args.no_prox)
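Since `--opt-eps` and `--opt-betas` default to None in Step 1, one option is to forward them only when they are set, so that the optimizer's own defaults apply otherwise. The sketch below assumes that behaviour is desired; `build_parser`, `build_adan_kwargs`, and the `--lr` argument with its default are hypothetical and only mirror the keyword names used in the call above.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument('--lr', type=float, default=1e-3)  # hypothetical default
    parser.add_argument('--max-grad-norm', type=float, default=0.0)
    parser.add_argument('--weight-decay', type=float, default=0.02)
    parser.add_argument('--opt-eps', type=float, default=None)
    parser.add_argument('--opt-betas', type=float, nargs='+', default=None)
    parser.add_argument('--no-prox', action='store_true', default=False)
    return parser


def build_adan_kwargs(args):
    """Collect keyword arguments for Adan, skipping options left at None."""
    kwargs = dict(lr=args.lr, weight_decay=args.weight_decay,
                  max_grad_norm=args.max_grad_norm, no_prox=args.no_prox)
    if args.opt_eps is not None:
        kwargs['eps'] = args.opt_eps
    if args.opt_betas is not None:
        kwargs['betas'] = tuple(args.opt_betas)
    return kwargs


args = build_parser().parse_args(['--opt-betas', '0.98', '0.92', '0.99'])
adan_kwargs = build_adan_kwargs(args)
print(adan_kwargs)
# With the package installed: optimizer = Adan(model.parameters(), **adan_kwargs)
```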

2) Tips for Experiments

  • To make Adan simple, in all experiments except Table 12 in the paper, we do not use the restart strategy in Adan. But Table 12 shows that the restart strategy can further slightly improve the performance of Adan.
  • Adan often tolerates a large peak learning rate that causes other optimizers, e.g., Adam and AdamW, to fail. For example, in all experiments except for the MAE pre-training and LSTM, the learning rate used by Adan is 5-10 times larger than that in Adam/AdamW.
  • Adan is relatively robust to beta1, beta2, and beta3, especially for beta2. If you want better performance, you can first tune beta3 and then beta1.
  • Adan has a slightly higher GPU memory cost than Adam/AdamW on a single node. However, this problem can be solved using the ZeroRedundancyOptimizer, which shares optimizer states across distributed data-parallel processes to reduce per-process memory footprint. Specifically, when using the ZeroRedundancyOptimizer on more than two GPUs, Adan and Adam consume almost the same amount of memory.

3) More detailed steps and results

Please refer to the following links for detailed steps. In these detailed steps, we even include the docker images for reproducibility.

Results for Various Tasks

Results on Large Language Models

Mixture of Experts (MoE)

To investigate the efficacy of the Adan optimizer for LLMs, we conducted pre-training experiments using MoE models. The experiments utilized the RedPajama-v2 dataset with three configurations, each consisting of 8 experts: 8x0.1B (totaling 0.5B trainable parameters), 8x0.3B (2B trainable parameters), and 8x0.6B (4B trainable parameters). These models were trained on sampled data comprising 10B, 30B, 100B, or 300B tokens, in the combinations listed in the table below.

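| Model Size | 8x0.1B | 8x0.1B | 8x0.1B | 8x0.3B | 8x0.3B | 8x0.3B | 8x0.6B |
|---|---|---|---|---|---|---|---|
| Token Size | 10B | 30B | 100B | 30B | 100B | 300B | 300B |
| AdamW | 2.722 | 2.550 | 2.427 | 2.362 | 2.218 | 2.070 | 2.023 |
| Adan | 2.697 | 2.513 | 2.404 | 2.349 | 2.206 | 2.045 | 2.010 |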

GPT2-345m

We provide the config and log for GPT2-345m pre-trained on the dataset that comes from BigCode and evaluated on the HumanEval dataset by zero-shot learning. HumanEval is used to measure functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems, assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions. We set Temperature = 0.8 during evaluation.

| Model | Steps | pass@1 | pass@10 | pass@100 | Download |
|---|---|---|---|---|---|
| GPT2-345m (Adam) | 300k | 0.0840 | 0.209 | 0.360 | log&config |
| GPT2-345m (Adan) | 150k | 0.0843 | 0.221 | 0.377 | log&config |

Adan obtains comparable results at only half the cost (150k vs. 300k steps).
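For readers unfamiliar with the pass@k columns, the sketch below shows the unbiased estimator commonly used with HumanEval, computed per problem and then averaged over the benchmark. The README does not state which estimator or how many samples per problem were used, so this is an assumption, and the counts in the example are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples per problem, c of which pass."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Every size-k subset contains at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts for one problem: 200 samples, 17 of them correct.
print(pass_at_k(200, 17, 1), pass_at_k(200, 17, 10), pass_at_k(200, 17, 100))
```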

Results on vision tasks

For convenience, we provide the configs and log files for the experiments on ImageNet-1k.

| Model | Epoch | Training Setting | Acc. (%) | Config | Batch Size | Download |
|---|---|---|---|---|---|---|
| ViT-S | 150 | I | 80.1 | config | 2048 | log/model |
| ViT-S | 150 | II | 79.6 | config | 2048 | log/model |
| ViT-S | 300 | I | 81.1 | config | 2048 | log/model |
| ViT-S | 300 | II | 80.7 | config | 2048 | log/model |
| ViT-B | 150 | II | 81.7 | config | 2048 | log/model |
| ViT-B | 300 | II | 82.6 | config | 2048 | log/model |
| ResNet-50 | 100 | I | 78.1 | config | 2048 | log/model |
| ResNet-50 | 200 | I | 79.7 | config | 2048 | log/model |
| ResNet-50 | 300 | I | 80.2 | config | 2048 | log/model |
| ResNet-101 | 100 | I | 80.0 | config | 2048 | log/model |
| ResNet-101 | 200 | I | 81.6 | config | 2048 | log/model |
| ResNet-101 | 300 | I | 81.9 | config | 2048 | log/model |
| ConvNext-tiny | 150 | II | 81.7 | config | 2048 | log/model |
| ConvNext-tiny | 300 | II | 82.4 | config | 2048 | log/model |
| MAE-small | 800+100 | --- | 83.8 | config | 4096/2048 | log-pretrain/log-finetune/model |
| MAE-Large | 800+50 | --- | 85.9 | config | 4096/2048 | log-pretrain/log-finetune/model |

Results on NLP tasks

BERT-base

We give the configs and log files of the BERT-base model pre-trained on the Bookcorpus and Wikipedia datasets and fine-tuned on GLUE tasks. Note that we provide the config, log file, and detailed instructions for BERT-base in the folder ./NLP/BERT.

| Pretraining | Config | Batch Size | Log | Model |
|---|---|---|---|---|
| Adan | config | 256 | log | model |

| Fine-tuning on GLUE-Task | Metric | Result | Config |
|---|---|---|---|
| CoLA | Matthews corr. | 64.6 | config |
| SST-2 | Accuracy | 93.2 | config |
| STS-B | Pearson corr. | 89.3 | config |
| QQP | Accuracy | 91.2 | config |
| MNLI | Matched acc./Mismatched acc. | 85.7/85.6 | config |
| QNLI | Accuracy | 91.3 | config |
| RTE | Accuracy | 73.3 | config |

For fine-tuning on the GLUE tasks, see the total batch size in the corresponding config files.

Transformer-XL-base

We provide the config and log for Transformer-XL-base trained on the WikiText-103 dataset. The total batch size for this experiment is 60*4.

| | Steps | Test PPL | Download |
|---|---|---|---|
| Baseline (Adam) | 200k | 24.2 | log&config |
| Transformer-XL-base | 50k | 26.2 | log&config |
| Transformer-XL-base | 100k | 24.2 | log&config |
| Transformer-XL-base | 200k | 23.5 | log&config |

Results on Diffusion Models

We show the results of the text-to-3D task supported by the DreamFusion Project. More visualization results can be found here. Examples generated from the text prompt "Sydney opera house, aerial view" with Adam and Adan:

opera-adan.mp4
opera-adam.mp4

Memory and Efficiency

A brief comparison of peak memory and wall-clock time for the optimizers is as follows. The reported time is the total time of 200 calls to optimizer.step(). We further compare Adam and FusedAdan in detail on GPT-2. See more results here.

| Model | Model Size (MB) | Adam Peak (MB) | Adan Peak (MB) | FusedAdan Peak (MB) | Adam Time (ms) | Adan Time (ms) | FusedAdan Time (ms) |
|---|---|---|---|---|---|---|---|
| ResNet-50 | 25 | 7142 | 7195 | 7176 | 9.0 | 4.2 | 1.9 |
| ResNet-101 | 44 | 10055 | 10215 | 10160 | 17.5 | 7.0 | 3.4 |
| ViT-B | 86 | 9755 | 9758 | 9758 | 8.9 | 12.3 | 4.3 |
| Swin-B | 87 | 16118 | 16202 | 16173 | 17.9 | 12.8 | 4.9 |
| ConvNext-B | 88 | 17353 | 17389 | 17377 | 19.1 | 15.6 | 5.0 |
| Swin-L | 196 | 24299 | 24316 | 24310 | 17.5 | 28.1 | 10.1 |
| ConvNext-L | 197 | 26025 | 26055 | 26044 | 18.6 | 31.1 | 10.2 |
| ViT-L | 304 | 25652 | 25658 | 25656 | 18.0 | 43.2 | 15.1 |
| GPT-2 | 758 | 25096 | 25406 | 25100 | 49.9 | 107.7 | 37.4 |
| GPT-2 | 1313 | 34357 | 38595 | 34363 | 81.8 | 186.0 | 64.4 |
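A measurement of this kind, the total time of 200 optimizer steps, could be sketched as follows. This is not the benchmark script used for the table; `time_steps` is a hypothetical helper, and the lambda is a stand-in for `optimizer.step`. On a GPU one would typically also call `torch.cuda.synchronize()` before reading the clock so that queued kernels are included.

```python
import time


def time_steps(step_fn, n_steps=200):
    """Return the total wall-clock time, in milliseconds, of n_steps calls."""
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    return (time.perf_counter() - start) * 1000.0


calls = []
elapsed_ms = time_steps(lambda: calls.append(1))  # stand-in for optimizer.step
print(f"{len(calls)} steps took {elapsed_ms:.3f} ms")
```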

adan's People

Contributors

alexwellchen, bonlime, janebert, panzhous, xingyuxie

adan's Issues

Restarting strategy

Hey, the repository does not implement the momentum restarting strategy from what I can tell.

If this is something you still have available, would you be so kind to add it in here? It would be super great to optimize Adan training further. :)

block: [0,0,0], thread: [96,0,0] Assertion `input_val >= zero && input_val <= one` failed.

Hi, I tried Adan on a keypoints task and got an error like this:

../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [0,0,0], thread: [96,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [0,0,0], thread: [97,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [2,0,0], thread: [32,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [2,0,0], thread: [33,0,0] Assertion `input_val >= zero && input_val <= one` fathread: [51,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [3,0,0], thread: [52,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [3,0,0], thread: [53,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [3,0,0], thread: [54,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [3,0,0], thread: [55,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [3,0,0], thread: [56,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [3,0,0], thread: [57,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [3,0,0], thread: [58,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [3,0,0], thread: [59,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [3,0,0], thread: [60,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [3,0,0], thread: [61,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [3,0,0], thread: [62,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [3,0,0], thread: [63,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [0,0,0], thread: [0,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [0,0,0], thread: [1,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [0,0,0], thread: [2,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [0,0,0], thread: [3,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [0,0,0], thread: [4,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [0,0,0], thread: [5,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [0,0,0], thread: [6,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [0,0,0], thread: [7,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [0,0,0], thread: [8,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [0,0,0], thread: [9,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [0,0,0], thread: [94,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:129: operator(): block: [0,0,0], thread: [95,0,0] Assertion `input_val >= zero && input_val <= one` failed.
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

my config:

BASE_LR: 0.05 # maybe 0.012?
STEPS: (40000, 65000, 70000, 85000) # step point need to carefully check
WARMUP_FACTOR: 0.001
# WARMUP_ITERS: 1200
WARMUP_ITERS: 3500
MAX_ITER: 900000
# LR_SCHEDULER_NAME: "WarmupCosineLR"
LR_SCHEDULER_NAME: "WarmupMultiStepLR"
WEIGHT_DECAY: 0.02
MOMENTUM: 0.9
BACKBONE_MULTIPLIER: 0.9
OPTIMIZER: "Adan"

This is on detectron2, with the config run on 8 GPUs.

Why does this happen?

Settings for instruction-tuning

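Hi, Adan is an excellent optimizer; thank you for your work.

However, when I recently tried instruction tuning with Adan, the loss curve looked good, but downstream performance (GSM-8k) fell short of expectations.
With the same data processing and evaluation, AdamW scores about 9.63, while Adan only reaches about 5.08.

AdamW hyper-parameters: weight_decay 0.01, lr 2e-5
Adan hyper-parameters: weight_decay 0.02; following the repo's advice, I tried lr 2e-4 and 1e-4, and GSM8k was low for both
Both LR schedulers warm up to the peak over the first 3% of training and then decay to 0

Training loss curve with AdamW:
image

Training loss curve with Adan:
image

Code used: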

from adan import Adan
optimizer = Adan(model.parameters(), lr=args.lr, weight_decay=0.02, foreach=True, fused=True)

Are there any recommended hyper-parameter settings for instruction tuning?

\epsilon not implemented as in the paper

Hi there,
$\epsilon$ is within the square root in the paper (L6 in Algorithm 1), but in the code, it is outside of the square root. Could you expand on the reason for this?
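To illustrate why the placement matters, the sketch below compares the two scalings as the issue describes them. It ignores bias correction and everything else in the optimizer, and the function names are hypothetical; it only shows how the effective step size differs when the second-moment estimate `n` is close to zero.

```python
import math


def step_size_eps_inside(lr, n, eps=1e-8):
    """Paper-style scaling as described in the issue: lr / sqrt(n + eps)."""
    return lr / math.sqrt(n + eps)


def step_size_eps_outside(lr, n, eps=1e-8):
    """Code-style scaling as described in the issue: lr / (sqrt(n) + eps)."""
    return lr / (math.sqrt(n) + eps)


# The two agree for moderate n and differ sharply as n approaches zero.
for n in (1.0, 1e-4, 0.0):
    print(n, step_size_eps_inside(1e-3, n), step_size_eps_outside(1e-3, n))
```

With eps inside the square root, the denominator is bounded below by sqrt(eps) rather than eps, so the same nominal eps caps the step size much more tightly.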

Some questions about learning rate.

Thank you for your brilliant work.

I want to ask some questions about Adan's learning rate.

Does Adan use learning rate decay in the paper?
Is the Adan optimizer sensitive to the initial learning rate?
How to set the learning rate compared with adam under the same task conditions?

Thank you!

HumanEval shall not be used for training.

HumanEval is an evaluation dataset; you shouldn't train on it and then evaluate on exactly the same dataset.

Instead, you can use the github part in the Pile, or other coding source data for training. Before training, make sure the training set doesn't contain HumanEval to avoid probable data leakage.

Deepspeed Integration

Hi~ Thanks for your excellent work. The Adan optimizer has achieved great success in my various experiments.
However, I would really appreciate any suggestions for integrating Adan with DeepSpeed.
I tried using the ds_config with adamw and simply replacing adamw with adan (of course, I adjusted the learning rate and weight decay correspondingly), but it's pretty slow.
Thank you in advance.

RuntimeError: The detected CUDA version (12.2) mismatches the version that was used to compile PyTorch (11.8).

Hi authors,

I am trying to install Adan with the documented command, "python3 -m pip install git+https://github.com/sail-sg/Adan.git", but the installation fails with the error below. I checked that torch is installed and working. Do you have any suggestions for installing it? I have no idea how to fix this error.

Building wheels for collected packages: adan
  Building wheel for adan (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [54 lines of output]
      running bdist_wheel
      /root/miniconda3/envs/neurips24/lib/python3.8/site-packages/torch/utils/cpp_extension.py:476: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
        warnings.warn(msg.format('we could not find ninja.'))
      running build
      running build_py
      creating build
      creating build/lib.linux-x86_64-cpython-38
      copying adan.py -> build/lib.linux-x86_64-cpython-38
      running build_ext
      Traceback (most recent call last):
        File "", line 2, in
        File "", line 34, in
        File "/tmp/pip-req-build-wcs6gasc/setup.py", line 20, in
          setup(
        File "/root/miniconda3/envs/xxx/lib/python3.8/site-packages/setuptools/init.py", line 103, in setup
          return distutils.core.setup(**attrs)
        File "/root/miniconda3/envs/xxx/lib/python3.8/site-packages/setuptools/_distutils/core.py", line 185, in setup
          return run_commands(dist)
        ...
          raise RuntimeError(CUDA_MISMATCH_MESSAGE.format(cuda_str_version, torch.version.cuda))
      RuntimeError: The detected CUDA version (12.2) mismatches the version that was used to compile PyTorch (11.8). Please make sure to use the same CUDA versions.
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for adan
  Running setup.py clean for adan
Failed to build adan
ERROR: Could not build wheels for adan, which is required to install pyproject.toml-based projects

I checked torch, and it works:
image

ValueError: not enough values to unpack (expected 3, got 2)

Hello! Thank you for your work. I have run into a problem that I don't know how to solve:
Traceback (most recent call last):
  File "/home/anaconda/envs/main/lib/python3.8/site-packages/torch/optim/optimizer.py", line 113, in wrapper
    return func(*args, **kwargs)
  File "/home/anaconda/envs/main/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/main/adan.py", line 121, in step
    beta1, beta2, beta3 = group['betas']
ValueError: not enough values to unpack (expected 3, got 2)
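The traceback suggests that a two-element, Adam-style `betas` reached a line that unpacks three values. A minimal sketch of a check that would surface this earlier with a clearer message is shown below; `unpack_adan_betas` is a hypothetical helper, not part of the repository, and the two-element values are only illustrative.

```python
def unpack_adan_betas(betas):
    """Unpack Adan's three betas, with a clearer error for Adam-style pairs."""
    if len(betas) != 3:
        raise ValueError(
            f"Adan expects betas=(beta1, beta2, beta3), "
            f"got {len(betas)} values: {betas!r}"
        )
    beta1, beta2, beta3 = betas
    return beta1, beta2, beta3


print(unpack_adan_betas((0.98, 0.92, 0.99)))
try:
    unpack_adan_betas((0.9, 0.999))  # illustrative Adam-style pair
except ValueError as err:
    print(err)
```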

`torch._foreach...` implementation

Hi, very interesting work!
The only problem I see is that your optimizer is slower than sgd/adamw, which may discourage some people from using it. Do you plan to add an implementation using the torch._foreach... functions? Examples can be seen in torch.optim. This would significantly speed up your optimizer with essentially no drawbacks.

If you're interested, I could take a look and implement this myself, but it would be in 1-2 weeks, when I'm less busy.

Embedding tensors/weight update unsupported

Hello!

I think I found a bug in the Adan optimizer, which affects embedding tables.

I implemented the Adan optimizer in TensorFlow 2. You can find the implementation here.

I wanted to keep the implementation as close to the original code as possible. However, there are different approaches for updating "sparse" tensors in TensorFlow and PyTorch. An example of a "sparse" tensor is an embedding matrix. Pytorch treats "sparse" data as if it was dense. TensorFlow has two functions for making updates - _resource_apply_dense for dense and _resource_apply_sparse for "sparse".

I decided to test the correctness of my implementation using the following logic:

  1. Define a function to optimize. In case of "dense" optimization, it's simple linear regression, in case of "sparse" - make all embeddings equal to 1 (see tf_adan/test_adan_*.py)
  2. Generate random input data and initial weights matrix.
  3. Optimize the weights matrix using the official implementation and mine. The optimizers have the same hparams.
  4. Compare loss history and weights after optimization. If they are equal - my implementation is correct.

I noticed that the loss history and the weights after optimization are the same for dense parameters. However, my implementation shows a better loss for embedding params, and the weights after optimization aren't the same. It's especially noticeable in cases when the batch contains only a few of the possible categories. For example, categorical features have 2k unique values, while the batch size equals 100:

source

image


I think the source of the bug is the following:

  1. For "new" gradients, i.e., categorical values gradients, for which we haven't made an update before, we replace the previous gradient with the current gradient. This logic is implemented here:

Adan/adan.py

Line 130 in d864647

if 'pre_grad' not in state or group['step'] == 1:

As I understand, pre_grad for all "new" gradients on step>1 won't be replaced with the current gradient.

  2. The other reason is that gradient params (exp_avg, exp_avg_sq, exp_avg_diff) are updated regardless of the presence of the category in the batch. That means that for categories

I'm unsure if it's a bug in your implementation or in mine. I also tested Adam optimizer in tf and torch, see:

https://github.com/DenisVorotyntsev/Adan/blob/02e66241a98958152315ae5358ee6f364f092f8b/tf_adan/utils.py#L37

Losses for Adam optimizers in tf/torch are almost the same.


What do you think? Looking forward to your thoughts.

About the convergence trend comparison with Adamw in ViT-H

Hi,
Thank you very much for your brilliant work on Adan!
According to Figure 1 of your paper, Adan should reach a lower loss (both train and test) than AdamW. However, I got a higher training loss with Adan than with AdamW on ViT-H:

Steps Adamw_train_loss Adan_train_loss
200 6.9077 6.9077
400 6.9074 6.9075
600 6.9068 6.9073
800 6.9061 6.907
1000 6.905 6.9064
1200 6.9036 6.9056
1400 6.9014 6.9044
1600 6.899 6.9028
1800 6.8953 6.9003
2000 6.8911 6.8971
2200 6.8848 6.8929
2400 6.8789 6.8893
2600 6.8699 6.8843
2800 6.8626 6.8805
3000 6.8528 6.8744
3200 6.8402 6.868
3400 6.8293 6.862
3600 6.8172 6.8547
3800 6.7989 6.8465
4000 6.7913 6.8405

I used the same HPs as AdamW and only changed the betas from (0.9, 0.999) to (0.9, 0.92, 0.999).
I only trained for a few steps to see the trend, but the loss gap from AdamW seems quite big. Should I change other HPs to make better use of Adan? How can I get a lower loss than AdamW?
I noticed that Adan prefers a large batch size in vision tasks; should we use a larger batch size?
Or should I train for more steps to see the trend?
Thank you!

Typo in the paper

Looking at arxiv version. In Appendix C in the last two lines of the Eq. 10 and the first line of the following update rule: \theta in the last term should have index k-1 instead of k.

(Not sure if this is the appropriate place to report paper typos; please tell me if there is a more suitable one.)

Concrete weight decay configuration for GPT-2 pretraining

Dear authors:

According to the README.md of this amazing project, the weight_decay param should be 0.02, while in the configuration file attached in #32, the WD seems to be 0.05. Also, only beta3 is explicitly specified in the aforementioned configuration file, I can only inspect from https://github.com/sail-sg/Adan/blob/main/gpt2/README.md that

beta1 = 0.98
beta2 = 0.92

However, weight_decay=0.02 together with the other hyperparams above yields an inferior val loss curve compared with that of the AdamW baseline (https://github.com/karpathy/nanoGPT/blob/master/config/train_gpt2.py). Do you have any suggestions about the hyperparams I mentioned? Thanks!

Is there a TensorFlow/Keras implementation?

Is there a TensorFlow/Keras implementation of Adan? If no official version, do you know of any third-party implementation? Or alternatively, how many lines would you expect an implementation to have? (If not much I may do it myself and ask for your review if you have time.)

Some questions in step function

Thank you for your impressive work. I have some questions in your adan.py about step function.
In line 179-180, that is:

for p, copy_grad in zip(group['params'], copy_grads):
    self.state[p]['pre_grad'] = copy_grad

It seems that you want to save the corresponding pre_grad. But I have the following bug:
image
I think this is because the former contains all parameters, while the latter only contains parameters with gradient. So I made the following changes:

for p, copy_grad in zip(params_with_grad, copy_grads):
    self.state[p]['pre_grad'] = copy_grad

With this modification, the code runs normally. What problem do you think I have run into, and is this modification correct? @XingyuXie
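The mismatch described above can be reproduced without PyTorch. The sketch below uses plain strings and floats as hypothetical stand-ins for parameters and gradients; it only demonstrates why zipping the full parameter list against a gradient list built from the filtered parameters misaligns the pairs.

```python
# Hypothetical stand-ins: parameter names and the gradients they received.
params = ["w1", "frozen", "w2"]
grads = {"w1": 0.1, "w2": 0.3}  # "frozen" got no gradient this step

params_with_grad = [p for p in params if p in grads]
copy_grads = [grads[p] for p in params_with_grad]

# Zipping against the full list pairs "frozen" with w2's gradient.
misaligned = dict(zip(params, copy_grads))
# Zipping against the filtered list keeps every pair aligned.
aligned = dict(zip(params_with_grad, copy_grads))

print(misaligned)
print(aligned)
```

Note that `zip` stops at the shorter sequence without raising, so the misalignment is silent unless shapes happen to differ.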

Step 2 of Usage

Step 2 of Usage in the documentation says

from adam import Adan

I was wondering if you meant

from adan import Adan

Suggestions for applying to visual dense prediction tasks.

Hi~ Thanks for your exciting work. I would like to know how Adan performs on visual dense prediction tasks. I notice you mention that Adan is suitable for large batch sizes, so I wonder whether it would also work well for visual dense prediction tasks, which usually cannot use a large batch size. I have tried Adan on several tasks, but the results are similar or even inferior to its sgd/adamw counterparts. I have followed the best practices you mention in the paper and repo, and was wondering if you have done similar experiments or have suggestions for tuning the parameters.

Thanks!

Best.

`no_prox` Flag

Hi there,

I'm just wondering about the no_prox setting.

First of all, does it stand for "approximation"?

In the paper, Algorithm 1, line 7 corresponds to no_prox=True
why is the default setting in this repo False? Why do you include this option at all?

Were the experiments in the paper done as the algorithm states, or with no_prox=True?

Again, I really appreciate the work! Am just struggling with this detail.

OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.

How can I install without the CUDA_HOME environment variable? For example, https://github.com/mapillary/inplace_abn doesn't ask for CUDA_HOME.

xxx@xxx:~$ python3 -m pip install git+https://github.com/sail-sg/Adan.git
Defaulting to user installation because normal site-packages is not writeable
Collecting git+https://github.com/sail-sg/Adan.git
  Cloning https://github.com/sail-sg/Adan.git to /tmp/pip-req-build-zs78qhzq
  Running command git clone --filter=blob:none --quiet https://github.com/sail-sg/Adan.git /tmp/pip-req-build-zs78qhzq
  Resolved https://github.com/sail-sg/Adan.git to commit 8f559205f67e565b3bea09554354d69000bd819c
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [12 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-req-build-zs78qhzq/setup.py", line 5, in <module>
          cuda_extension = CUDAExtension(
        File "/home/xxx/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1047, in CUDAExtension
          library_dirs += library_paths(cuda=True)
        File "/home/xxx/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1179, in library_paths
          if (not os.path.exists(_join_cuda_home(lib_dir)) and
        File "/home/xxx/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 2230, in _join_cuda_home
          raise EnvironmentError('CUDA_HOME environment variable is not set. '
      OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

GPU type and GPU nums and total training time on Transformer-XL, GPT-2

Hi! Thank you for sharing your code.

I would like to know the following for each of the Transformer-XL and GPT-2 settings:

  • which GPU did you use?
  • how many GPUs are used for training
  • total training time

I looked at the logs, but I couldn't figure out the exact numbers:
https://github.com/sail-sg/Adan/tree/main/gpt2#results-and-logs-on-gpt2-345m
https://github.com/sail-sg/Adan/blob/main/gpt2/pretrain.sh
https://github.com/sail-sg/Adan/tree/main/NLP/Transformer-XL/exp_results

Thank you!

Beta values are not same

According to your paper, you used Adan with β1 = 0.02, β2 = 0.01, and β3 = 0.01 when fine-tuning BERT. But in your config file, they are all 0.9x, like here. Which is right?

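How to set the learning rate for Adan

Hello, have you studied using Adan to train diffusion models? How should the learning rate be set, and can it be the same as the learning rate used with AdamW?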

Install Error

Hey guys, I ran into some problems when installing FusedAdan.
The output is below. It reports that I don't have nvcc, but I actually do. Please help me.

(MDT) root@ubuntu20:~/Adan# pip install .
Processing /root/Adan
Preparing metadata (setup.py) ... done
Requirement already satisfied: torch in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from adan==0.0.2) (2.2.1+cu118)
Requirement already satisfied: filelock in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (3.9.0)
Requirement already satisfied: typing-extensions>=4.8.0 in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (4.8.0)
Requirement already satisfied: sympy in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (1.12)
Requirement already satisfied: networkx in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (3.2.1)
Requirement already satisfied: jinja2 in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (3.1.2)
Requirement already satisfied: fsspec in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (2024.2.0)
Requirement already satisfied: nvidia-cuda-nvrtc-cu11==11.8.89 in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (11.8.89)
Requirement already satisfied: nvidia-cuda-runtime-cu11==11.8.89 in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (11.8.89)
Requirement already satisfied: nvidia-cuda-cupti-cu11==11.8.87 in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (11.8.87)
Requirement already satisfied: nvidia-cudnn-cu11==8.7.0.84 in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (8.7.0.84)
Requirement already satisfied: nvidia-cublas-cu11==11.11.3.6 in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (11.11.3.6)
Requirement already satisfied: nvidia-cufft-cu11==10.9.0.58 in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (10.9.0.58)
Requirement already satisfied: nvidia-curand-cu11==10.3.0.86 in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (10.3.0.86)
Requirement already satisfied: nvidia-cusolver-cu11==11.4.1.48 in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (11.4.1.48)
Requirement already satisfied: nvidia-cusparse-cu11==11.7.5.86 in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (11.7.5.86)
Requirement already satisfied: nvidia-nccl-cu11==2.19.3 in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (2.19.3)
Requirement already satisfied: nvidia-nvtx-cu11==11.8.86 in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (11.8.86)
Requirement already satisfied: triton==2.2.0 in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from torch->adan==0.0.2) (2.2.0)
Requirement already satisfied: MarkupSafe>=2.0 in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from jinja2->torch->adan==0.0.2) (2.1.3)
Requirement already satisfied: mpmath>=0.19 in /home/vipuser/anaconda3/envs/MDT/lib/python3.10/site-packages (from sympy->torch->adan==0.0.2) (1.3.0)
Building wheels for collected packages: adan
Building wheel for adan (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [7 lines of output]
running bdist_wheel
running build
running build_py
creating build/lib.linux-x86_64-cpython-310
copying adan.py -> build/lib.linux-x86_64-cpython-310
running build_ext
error: [Errno 2] No such file or directory: ':/usr/local/cuda-11.8/bin/nvcc'
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for adan
Running setup.py clean for adan
Failed to build adan
ERROR: Could not build wheels for adan, which is required to install pyproject.toml-based projects
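
The path in the error above, `':/usr/local/cuda-11.8/bin/nvcc'`, begins with a colon. One plausible reading (an assumption, not confirmed in this thread) is that `CUDA_HOME` was exported with a stray path separator, for example by appending to an empty variable, so the build looked for nvcc under a directory that does not exist. The helper below is a hypothetical diagnostic sketch, not part of Adan or PyTorch; it only inspects the environment variable and checks whether `bin/nvcc` exists beneath it.

```python
import os


def diagnose_cuda_home(env=None):
    """Return a short message describing the state of CUDA_HOME.

    `env` defaults to os.environ; a plain dict can be passed for testing.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_HOME")
    if not raw:
        return "CUDA_HOME is not set"
    # CUDA_HOME should be a single directory, not a PATH-like list.
    if os.pathsep in raw:
        return f"CUDA_HOME contains a path separator ({os.pathsep!r}): {raw!r}"
    nvcc = os.path.join(raw, "bin", "nvcc")
    if not os.path.isfile(nvcc):
        return f"nvcc not found at {nvcc}"
    return f"ok: {nvcc}"


if __name__ == "__main__":
    print(diagnose_cuda_home())
```

If the message reports a path separator, re-exporting `CUDA_HOME` as a single directory (the CUDA install root) before running `pip install .` again may be worth trying.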
