
Comments (10)

prigoyal commented on August 30, 2024

thank you @iseessel for the debugging and for all the data points above. We expect the results to reproduce between the 8-gpu and 1-gpu runs once all the differences you spotted (SyncBN etc.) are accounted for.
With that said, I will take this task on from here, as it's important for our research to understand this further. Thank you for your efforts; all of the data points above are helpful.

spurra commented on August 30, 2024

Thanks for the response, @prigoyal! I'll let you know what numbers I get.

prigoyal commented on August 30, 2024

thank you for reporting @spurra, let me try to rerun the benchmark and get back to you. We should be able to repro exactly. I'll try both the original setting and your setting.

doulemint commented on August 30, 2024

@spurra Thank you for your detailed reports and procedure summary; thanks to them I was also able to reproduce this benchmark using one GPU.

iseessel commented on August 30, 2024

@prigoyal Sorry to cross wires here -- I've actually been looking into this -- see below and lmk if you have anything to add.

Hi @spurra + @doulemint,

I was able to reproduce your 1-gpu numbers, as well as our reported 8-gpu numbers. See below for the full results; these represent the best reported train/test accuracy for each layer. As a side note, I believe the lr appears to be zero because we round before logging to tensorboard: https://github.com/facebookresearch/vissl/blob/main/vissl/hooks/tensorboard_hook.py#L265
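
The rounding point is easy to see in isolation. A minimal sketch (not VISSL's actual hook code, and the rounding precision here is an assumption):

```python
# A small lr rounded before logging displays as 0.0 in TensorBoard,
# even though the optimizer still uses the true value.
lr = 3.2e-05           # e.g. lr near the end of a decaying schedule
logged = round(lr, 4)  # hypothetical rounding precision before logging
print(lr, logged)      # -> 3.2e-05 0.0
```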

1 GPU:
rn50_in1k_simclr_100ep_eval_resnet_8gpu_transfer_in1k_linear_eval_resnet_1gpu_transfer_in1k_linear_14_10_21
None [ rn50_in1k_simclr_100ep_eval_resnet_8gpu_transfer_in1k_linear_eval_resnet_1gpu_transfer_in1k_linear_14_10_21 ] :
 - train.top_1.res5 : 0.628983 (50)
 - train.top_5.res5 : 0.843128 (52)
 - test.top_1.res5 : 0.62368 (47)
 - test.top_5.res5 : 0.85202 (43)

8 GPU: 
rn50_in1k_simclr_100ep_eval_resnet_8gpu_transfer_in1k_linear_eval_resnet_8gpu_transfer_in1k_linear_14_10_21
None [ rn50_in1k_simclr_100ep_eval_resnet_8gpu_transfer_in1k_linear_eval_resnet_8gpu_transfer_in1k_linear_14_10_21 ] :
 - train.top_1.res5 : 0.652946 (52)
 - train.top_5.res5 : 0.8587659999999999 (48)
 - test.top_1.res5 : 0.6443799999999998 (35)
 - test.top_5.res5 : 0.86064 (47)

I don't believe the results are guaranteed to be the same across the 8-gpu and 1-gpu schemes here. Note that the lr scaling is based on the ImageNet-in-1-hour paper: https://arxiv.org/pdf/1706.02677.pdf. There are some differences between the paper's setup and these experiments: the paper tests global batch sizes of 256+, whereas here the global batch size is 32. Note also that the ImageNet-in-1-hour paper does not calculate BN statistics across all workers, whereas these transfer experiments do (by setting CONVERT_BN_TO_SYNC_BN: True). Since the training uses SYNC_BN, I think it would be interesting to increase your batch size, depending on your GPU memory constraints.
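
For reference, here is a minimal PyTorch sketch of what the SyncBN conversion amounts to; this uses plain torch.nn.SyncBatchNorm rather than VISSL's internals:

```python
import torch

# With BN converted to SyncBN, BN statistics are computed over the
# global batch across all workers: 8 gpus x 32 images = 256 samples
# per BN update, vs. only 32 samples on a single gpu.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3),
    torch.nn.BatchNorm2d(8),
)
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
print(model)  # BatchNorm2d layers are now SyncBatchNorm
```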

So imo, reproducing the 8-gpu numbers in a 1-gpu scheme is an area of research. You could start by tuning some of the hyperparameters -- batch size, LR, and weight decay values could be a good place to start.

(@prigoyal lmk if you disagree / If I've mischaracterized something).

prigoyal commented on August 30, 2024

Hi @spurra, thank you for reaching out. Yes, I would expect you to reproduce the numbers. The important thing is to ensure the learning rate is scaled properly as the number of GPUs changes. For this, VISSL provides https://github.com/facebookresearch/vissl/blob/master/configs/config/benchmark/linear_image_classification/imagenet1k/eval_resnet_8gpu_transfer_in1k_linear.yaml#L103 to automatically adjust the LR.
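
Roughly, that option applies the linear scaling rule from the ImageNet-in-1-hour paper mentioned above. A sketch of the arithmetic (the base values here are assumptions for illustration; check the linked yaml for the actual ones):

```python
# Linear lr scaling: the configured lr is defined at a reference
# global batch size and rescaled to the current run's global batch.
BASE_LR = 0.01     # assumed reference lr from the config
BASE_BATCH = 256   # assumed reference global batch size

def scaled_lr(batch_per_gpu: int, num_gpus: int) -> float:
    global_batch = batch_per_gpu * num_gpus
    return BASE_LR * global_batch / BASE_BATCH

print(scaled_lr(32, 8))  # 8 gpus -> global batch 256 -> 0.01
print(scaled_lr(32, 1))  # 1 gpu  -> global batch  32 -> 0.00125
```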

Please let us know if the numbers don't repro :) and we will look into it.

spurra commented on August 30, 2024

I finished running the experiment. This is the output of the log hook:

INFO 2021-05-12 09:38:02,635 log_hooks.py: 446: Rank: 0, name: test_accuracy_list_meter, value: {'top_1': {'conv1': 14.628, 'res2': 28.27, 'res3': 39.088, 'res4': 56.391999999999996, 'res5': 62.246}, 'top_5': {'conv1': 29.92, 'res2': 48.455999999999996, 'res3': 61.07, 'res4': 78.32000000000001, 'res5': 85.15599999999999}}

I assume the top-1 accuracy of res5 is the relevant number; I'm not quite sure what the others indicate.

I achieve 62.246 vs. the 64.4 reported for the RN50 model trained for 100 epochs. For completeness, I'm uploading the train_config.yaml file produced by the code, which can be found here: https://gist.github.com/spurra/d5b89caccbd614522eb19e6bc3a9e2d9

Is this performance discrepancy within expected range?

EDIT: Also on a side note, it seems like the learning rate is 0 for the last few epochs. Is this desired?
[Screenshot: TensorBoard plot showing the learning rate at 0 for the final epochs]

EDIT2: I'm a little confused by your comment regarding adjusting the batch size as the number of GPUs changes. You state that https://github.com/facebookresearch/vissl/blob/master/configs/config/benchmark/linear_image_classification/imagenet1k/eval_resnet_8gpu_transfer_in1k_linear.yaml#L103 does this in VISSL. However, the note in the documentation here suggests that I have to do this manually. Also, the documentation on the configs found here suggests this just applies the linear scaling rule, which, if I understand correctly, scales the learning rate according to the batch size, not the number of GPUs. Could you please clarify whether I need to normalize the SimCLR loss by the total batch size, or is this already taken care of? Thanks!

prigoyal commented on August 30, 2024

> EDIT2: I'm a little confused by your comment regarding adjusting the batch size as the number of GPUs changes. You state that https://github.com/facebookresearch/vissl/blob/master/configs/config/benchmark/linear_image_classification/imagenet1k/eval_resnet_8gpu_transfer_in1k_linear.yaml#L103 does this in VISSL. However, the note in the documentation here suggests that I have to do this manually. Also, the documentation on the configs found here suggests this just applies the linear scaling rule, which, if I understand correctly, scales the learning rate according to the batch size, not the number of GPUs. Could you please clarify whether I need to normalize the SimCLR loss by the total batch size, or is this already taken care of? Thanks!

Correct. Only the learning rate will be scaled, not the batch size. You should keep the per-gpu batch size consistent, and the learning rate will then be auto-scaled.
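
Concretely (the config key name below is from VISSL's config schema; the value is an assumption):

```python
# Keep the per-replica batch size (DATA.TRAIN.BATCHSIZE_PER_REPLICA in
# the VISSL config) identical across runs; only the lr is re-scaled as
# the gpu count, and hence the global batch size, changes.
BATCHSIZE_PER_REPLICA = 32

def global_batch(num_gpus: int) -> int:
    return BATCHSIZE_PER_REPLICA * num_gpus

assert global_batch(8) == 256  # the 8-gpu reference run
assert global_batch(1) == 32   # the 1-gpu run; lr scales by 32/256
```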

prigoyal commented on August 30, 2024

Based on the responses above and the report from @doulemint, it looks like this benchmark reproduces the numbers. Please feel free to reopen the task if that's still not the case.

Also, as a follow-up, we will look into clarifying the learning-rate scaling feature in VISSL in our docs, code comments, etc. :)

spurra commented on August 30, 2024

@iseessel Thanks for testing this!
@prigoyal Please let me know once you have some updates on this, as I find it a very interesting area.
