
Comments (4)

fmassa commented on August 15, 2024

Hi,

The only thing I can think of for now is that we used PyTorch 1.7.0 and torchvision 0.8.1, but this shouldn't explain the big drop in accuracy that you are seeing (I couldn't find anything in the PyTorch release notes indicating that a bug fixed in 1.7.0 could have affected the accuracies).

I'm going to run a few more trainings with the released code, and I'll keep you updated on what I get.


fmassa commented on August 15, 2024

Hi,

Can you paste your environment here?

Here are the logs for DeiT small with the latest version of the code that I have (which should be fairly similar to the version that we released, maybe with a few arguments removed)

* Acc@1 79.828 Acc@5 95.076 loss 0.882
Max accuracy: 79.84%

The command-line arguments that we used are as follows:

Namespace(aa='rand-m9-mstd0.5-inc1', batch_size=64, clip_grad=None, color_jitter=0.4, comment='', cooldown_epochs=10, cutmix=1.0, cutmix_minmax=None, data_path='/datasets01_101/imagenet_full_size/061417/', data_set='IMNET', decay_epochs=30, decay_rate=0.1, device='cuda', dist_backend='nccl', dist_url='file:///checkpoint/fmassa/experiments/0f45078640694b86abbf9c85fef17611_init', distributed=True, drop=0.0, drop_block=None, drop_path=0.1, epochs=300, eval=False, gpu=0, inat_category='name', input_size=224, job_dir=PosixPath('/checkpoint/fmassa/experiments/%j'), lr=0.0005, lr_noise=None, lr_noise_pct=0.67, lr_noise_std=1.0, min_lr=1e-05, mixup=0.8, mixup_mode='batch', mixup_prob=1.0, mixup_switch_prob=0.5, model='deit_small_patch16_224', model_ema=True, model_ema_decay=0.99996, model_ema_force_cpu=False, momentum=0.9, ngpus=8, nodes=2, num_workers=10, opt='adamw', opt_betas=None, opt_eps=1e-08, output_dir=PosixPath('/checkpoint/fmassa/experiments/34020965'), partition='learnfair', patience_epochs=10, pin_mem=True, rank=0, recount=1, remode='pixel', repeated_aug=True, reprob=0.25, resplit=False, resume='', sched='cosine', seed=0, smoothing=0.1, start_epoch=0, timeout=2800, train_interpolation='bicubic', use_volta32=False, warmup_epochs=5, warmup_lr=1e-06, weight_decay=0.05, world_size=16)

In this run we used 16 GPUs (and a 4x smaller per-GPU batch size). Can you double-check if you see anything different? Otherwise, I can try running this again on 4 GPUs with the version of the code that we open-sourced to see what I get.


chrisjuniorli commented on August 15, 2024

My environment:

Python 3.7
torch 1.6
torchvision 0.7
timm 0.3.2
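
For reference, a quick way to capture these versions programmatically (a minimal sketch; it covers only the packages mentioned in this thread):

```python
# Print the versions of the packages discussed in this thread.
import torch
import torchvision
import timm

print(f"torch {torch.__version__}")
print(f"torchvision {torchvision.__version__}")
print(f"timm {timm.__version__}")
```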

Here are the arguments I used:

Namespace(aa='rand-m9-mstd0.5-inc1', batch_size=256, clip_grad=None, color_jitter=0.4, cooldown_epochs=10, cutmix=1.0, cutmix_minmax=None, data_path='../data/Imagenet/', data_set='IMNET', decay_epochs=30, decay_rate=0.1, device='cuda', dist_backend='nccl', dist_url='env://', distributed=True, drop=0.0, drop_block=None, drop_path=0.1, epochs=300, eval=False, gpu=0, inat_category='name', input_size=224, lr=0.0005, lr_noise=None, lr_noise_pct=0.67, lr_noise_std=1.0, min_lr=1e-05, mixup=0.8, mixup_mode='batch', mixup_prob=1.0, mixup_switch_prob=0.5, model='deit_small_patch16_224', model_ema=True, model_ema_decay=0.99996, model_ema_force_cpu=False, momentum=0.9, num_workers=10, opt='adamw', opt_betas=None, opt_eps=1e-08, output_dir='', patience_epochs=10, pin_mem=True, rank=0, recount=1, remode='pixel', repeated_aug=True, reprob=0.25, resplit=False, resume='', sched='cosine', seed=0, smoothing=0.1, start_epoch=0, train_interpolation='bicubic', warmup_epochs=5, warmup_lr=1e-06, weight_decay=0.05, world_size=4)

The main difference here is the batch size, where I use 256 per GPU with 4 GPUs. I guess I can try scaling the lr up by 4x to match the batch size I used here and see if the performance improves. Any ideas from your side?
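
As an aside, if the released main.py applies the usual linear LR scaling rule (base lr multiplied by effective batch size over a reference of 512, which the DeiT code appears to do), then both runs would already land on the same learning rate: 64 × 16 and 256 × 4 are both an effective batch of 1024. A minimal sketch of that arithmetic (the reference batch of 512 is an assumption on my part):

```python
# Sketch of linear LR scaling; the reference batch of 512 is an assumption.
def scaled_lr(base_lr: float, batch_per_gpu: int, world_size: int,
              reference_batch: int = 512) -> float:
    """Scale the base learning rate linearly with the effective batch size."""
    return base_lr * batch_per_gpu * world_size / reference_batch

# fmassa's run: 64 per GPU x 16 GPUs -> effective batch 1024
print(scaled_lr(5e-4, 64, 16))   # 0.001
# chrisjuniorli's run: 256 per GPU x 4 GPUs -> also effective batch 1024
print(scaled_lr(5e-4, 256, 4))   # 0.001
```

If that scaling is in place, manually multiplying the lr by 4x would double-count the adjustment.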


fmassa commented on August 15, 2024

Hi @chrisjuniorli

I've just got the results from training on 4 and 16 GPUs, with the default commands. They match the reported results.
For 4 GPUs

python run_with_submitit.py --model deit_small_patch16_224 --batch-size 256 --nodes 1 --ngpus 4 --use_volta32

gives

* Acc@1 79.860 Acc@5 94.950 loss 0.885

and for 16 GPUs

python run_with_submitit.py --model deit_small_patch16_224

I got

* Acc@1 79.790 Acc@5 94.880 loss 0.883

To facilitate comparison and reproducibility, I've posted the training logs for both runs at https://gist.github.com/fmassa/0dbd0184a0adb904ef42277b487d8b53

Also, here is the output of `conda list` from the environment I used:
alembic                   1.4.3                     <pip>
appdirs                   1.4.4                     <pip>
astor                     0.8.1                     <pip>
attrs                     20.3.0                    <pip>
black                     20.8b1                    <pip>
blas                      1.0                         mkl
ca-certificates           2020.12.8            h06a4308_0
certifi                   2020.12.5        py37h06a4308_0
click                     7.1.2                     <pip>
cloudpickle               1.6.0                     <pip>
contextlib2               0.6.0.post1               <pip>
cudatoolkit               10.1.243             h6bb024c_0
dataclasses               0.6                       <pip>
dumbo                     0.1.1                     <pip>
flake8                    3.8.4                     <pip>
freetype                  2.10.4               h5ab3b9f_0
future                    0.18.2                    <pip>
git-archive-all           1.22.0                    <pip>
importlib-metadata        3.3.0                     <pip>
intel-openmp              2020.2                      254
jpeg                      9b                   h024ee3a_2
lcms2                     2.11                 h396b838_0
ld_impl_linux-64          2.33.1               h53a641e_7
libedit                   3.1.20191231         h14c3975_1
libffi                    3.3                  he6710b0_2
libgcc-ng                 9.1.0                hdf63c60_0
libpng                    1.6.37               hbc83047_0
libstdcxx-ng              9.1.0                hdf63c60_0
libtiff                   4.1.0                h2733197_1
libuv                     1.40.0               h7b6447c_0
lz4-c                     1.9.2                heb0550a_3
Mako                      1.1.3                     <pip>
MarkupSafe                1.1.1                     <pip>
mccabe                    0.6.1                     <pip>
mkl                       2020.2                      256
mkl-service               2.3.0            py37he8ac12f_0
mkl_fft                   1.2.0            py37h23d657b_0
mkl_random                1.1.1            py37h0573a6f_0
mypy-extensions           0.4.3                     <pip>
ncurses                   6.2                  he6710b0_1
ninja                     1.10.2           py37hff7bd54_0
numpy                     1.19.2           py37h54aff64_0
numpy-base                1.19.2           py37hfa32c7d_0
olefile                   0.46                       py_0
openssl                   1.1.1i               h27cfd23_0
pathspec                  0.8.1                     <pip>
pillow                    8.0.1            py37he98fc37_0
pip                       20.3.3           py37h06a4308_0
pycodestyle               2.6.0                     <pip>
pyflakes                  2.2.0                     <pip>
python                    3.7.9                h7579374_0
python-dateutil           2.8.1                     <pip>
python-editor             1.0.4                     <pip>
pytorch                   1.7.0           py3.7_cuda10.1.243_cudnn7.6.3_0    pytorch
readline                  8.0                  h7b6447c_0
regex                     2020.11.13                <pip>
setuptools                51.0.0           py37h06a4308_2
six                       1.15.0           py37h06a4308_0
SQLAlchemy                1.3.21                    <pip>
sqlite                    3.33.0               h62c20be_0
submitit                  1.1.5                     <pip>
tabulate                  0.8.7                     <pip>
timm                      0.3.2                     <pip>
tk                        8.6.10               hbc83047_0
toml                      0.10.2                    <pip>
torchvision               0.8.1                py37_cu101    pytorch
typed-ast                 1.4.1                     <pip>
typing_extensions         3.7.4.3                    py_0
wheel                     0.36.2             pyhd3eb1b0_0
xz                        5.2.5                h7b6447c_0
zipp                      3.4.0                     <pip>
zlib                      1.2.11               h7b6447c_3
zstd                      1.4.5                h9ceee32_0
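
Since version mismatches are the main suspect in this thread, a small script like the one below can diff two such `conda list` dumps (a sketch under assumptions; the file names are hypothetical placeholders):

```python
# Diff two `conda list` text dumps to spot version mismatches.
def parse_conda_list(path: str) -> dict:
    """Map package name -> version from a `conda list` dump."""
    pkgs = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2 and not line.startswith("#"):
                pkgs[parts[0]] = parts[1]
    return pkgs

mine = parse_conda_list("env_mine.txt")      # hypothetical file
theirs = parse_conda_list("env_theirs.txt")  # hypothetical file
for name in sorted(set(mine) | set(theirs)):
    a, b = mine.get(name, "missing"), theirs.get(name, "missing")
    if a != b:
        print(f"{name}: {a} vs {b}")
```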

Given that I was able to match the reported accuracies with the released codebase, I'm closing this issue, but let us know if you have any further questions.

