Comments (14)
Is there any update on this?
from gradientaccumulator.
Hello, @Bidski! :]
> Is there any update on this?

I made an attempt at it before Christmas, but adding proper support for it with my current approach turned out to be more challenging than I had time for.

However, if you are able to test this feature very soon, I could make an attempt today. Also, what is your use case?
Hi @andreped
I won't be able to test until Monday, but I would be interested in seeing this working
> I won't be able to test until Monday, but I would be interested in seeing this working
No worries, that just means that I could make an attempt during the weekend instead. Will keep you updated on the feature. Stay tuned :]
Will make an attempt at this now, @Bidski and @innat.
@innat, please provide a gist, if you had one.
EDIT: I just noticed that I once made a test script, which is already part of the tests. Does this script reproduce your issue properly and represent a valid use case?
Just made an attempt now, and I keep running into the same issues. Essentially, this means that the multi-GPU strategy does not work with the `train_step` overload approach.
I'm quite preoccupied with finalizing my PhD work and I don't see that I have time to debug this further, as I myself don't use multiple GPUs simultaneously in my work. Perhaps anyone else could make an attempt? Are you interested in making an attempt, @innat?
I tried using the class provided here which should be useful for handling resources across replicas, and then looking here to see how it is used in a custom pipeline. But I warn you, it is quite the rabbit hole...
Couldn't help myself... Got a little further, @innat.
I added a GAModelWrapperV2 that is compatible with `tf.distribute.MirroredStrategy()`. At least, when running a simple test, it does not crash immediately (as the previous implementation did with the same strategy), and memory seems to be allocated across multiple GPUs.
However, the distribution of "mini"-batches is not done optimally. I assume that you want to split a batch into `k` smaller batches, distribute these across `k` GPUs, then catch the gradients from each GPU and accumulate these before the update. When doing it "optimally", I keep getting this error, but hopefully I can find a solution to it:
```
RuntimeError: Method requires being in cross-replica context, use get_replica_context().merge_call()
```
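The accumulate-then-update scheme described above (split a batch into `k` mini-batches, gather each replica's gradient, and average before a single update) can be sanity-checked without TensorFlow. Below is a framework-agnostic NumPy sketch on a toy linear-regression loss; `grad`, `X`, `y`, and the batch sizes are illustrative names for this sketch, not part of GradientAccumulator:

```python
import numpy as np

# Toy linear-regression loss: L(w) = mean((X @ w - y)**2)
# Gradient: dL/dw = 2/n * X.T @ (X @ w - y)
def grad(w, X, y):
    n = len(y)
    return 2.0 / n * X.T @ (X @ w - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = np.zeros(3)

# Full-batch gradient over all 8 samples.
g_full = grad(w, X, y)

# Split the batch into k = 4 mini-batches of 2 (one per "replica"),
# accumulate each replica's gradient, and average before the update.
k = 4
g_accum = np.zeros_like(w)
for Xs, ys in zip(np.split(X, k), np.split(y, k)):
    g_accum += grad(w, Xs, ys)
g_accum /= k

# Equal-sized mini-batches: the averaged accumulated gradient matches
# the full-batch gradient exactly.
assert np.allclose(g_full, g_accum)
```

Because the mini-batches are equal-sized, averaging the `k` per-replica mean gradients reproduces the full-batch gradient exactly, which is the invariant a multi-GPU accumulator needs to preserve.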
Any progress can be observed from the multi-gpu branch here. See here for the new model wrapper and here for a simple test script.
EDIT: After further inspection, I'm not really sure the computation actually runs on both GPUs. It might be that it "worked" just because only one GPU was used. I don't have time to debug this further, but perhaps someone else has time?
> EDIT: I just observed that I made a test script once, that is already part of tests. Should this script reproduce your issue properly and represent a valid use case?
I think, according to What should be in scope and what should be outside?, you should move the optimizer (lines 55-56) inside of the scope? Otherwise, I think that test should cover my use case. I also use `tf.distribute.ReductionToOneDevice` with `tf.distribute.MirroredStrategy`, but I'm not sure if this would have any significant impact on how this wrapper will operate.
Was just having a look at the tensorflow source code to see the interaction between `Model.fit()` and `Model.train_step()`, and I came across the `steps_per_execution` argument for `Model.compile` and this bit of code in `Model.make_train_function`.

This almost seems like a setup for gradient accumulation, if I'm not entirely delusional? Maybe a combination of `steps_per_execution == accum_steps` and a custom `train_step` function might be the way to go?
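For what it's worth, `steps_per_execution` mainly controls how many train steps are compiled into a single `tf.function` call for performance; each step still applies its own optimizer update, so on its own it is not gradient accumulation. A toy plain-Python sketch of the difference (function names are made up for this illustration, not TensorFlow API):

```python
# Toy illustration (plain Python, no TensorFlow): why steps_per_execution
# alone is not gradient accumulation.

def run_with_steps_per_execution(grads, lr=0.25):
    # steps_per_execution batches N train steps into one call, but each
    # step still applies its own weight update immediately.
    w = 0.0
    for g in grads:
        w -= lr * g          # one update per step
    return w

def run_with_accumulation(grads, lr=0.25):
    # Gradient accumulation defers the update: gradients are summed and
    # averaged, and a single update is applied at the end.
    w = 0.0
    g_accum = sum(grads) / len(grads)
    w -= lr * g_accum        # one update per accum_steps steps
    return w

grads = [1.0, 2.0, 3.0, 4.0]
print(run_with_steps_per_execution(grads))  # -2.5 (four separate updates)
print(run_with_accumulation(grads))         # -0.625 (one averaged update)
```

So a custom `train_step` that buffers gradients would still be needed on top of `steps_per_execution` to get true accumulation semantics.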
> > EDIT: I just observed that I made a test script once, that is already part of tests. Should this script reproduce your issue properly and represent a valid use case?
>
> I think, according to What should be in scope and what should be outside?, you should move the optimizer (lines 55-56) inside of the scope? Otherwise, I think that test should cover my use case. I also use `tf.distribute.ReductionToOneDevice` with `tf.distribute.MirroredStrategy`, but I'm not sure if this would have any significant impact on how this wrapper will operate
Made a new and improved test script here:
https://github.com/andreped/GradientAccumulator/blob/d2eeee307eefd11182342045622a7fce03319ba5/tests/test_multi_gpu_benchmark.py
I have also tried splitting the mini-batch into reduced mini-batches and distributing these, but it does not seem to be working:
https://github.com/andreped/GradientAccumulator/blob/multi-gpu/gradient_accumulator/GAModelWrapperV2.py#L62
If you have time, you can try to explore this further, but I am quite limited with time for the next week or so. Note that all these changes have been made on a separate branch, multi-gpu.
> Was just having a look at the tensorflow source code to see the interaction between `Model.fit()` and `Model.train_step()` and I came across the `steps_per_execution` argument for `Model.compile` and this bit of code in `Model.make_train_function`.
I often have a hard time understanding the docs. I'm not sure `steps_per_execution` is what we want, but you are free to explore that further. I'm happy to follow up on any attempt you make. That said, I think we might run into the same multi-GPU issue as we are observing here anyway, but I'm not sure.
@Bidski Just mentioning that a new release has been added which adds experimental support for optimizer wrapping, similar to what was common in TF1. All optimizers are supported; however, dynamic optimizers such as Adam show strange behaviour (results too far away from regular batch training). SGD works great, though.

With this new approach, it should be possible to add multi-GPU support much more easily. I will update you tomorrow, when I have made a new attempt at it with this new approach.
The latest release can be found here.
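For anyone curious what the optimizer-wrapping approach looks like in principle: below is a toy, framework-free Python sketch (not the actual GradientAccumulator implementation) of wrapping an optimizer so gradients are buffered for `accum_steps` calls and then applied once, averaged. All class and variable names here are made up for the sketch:

```python
class ToySGD:
    """Minimal stand-in for a real optimizer (illustrative only)."""
    def __init__(self, lr=0.5):
        self.lr = lr

    def apply_gradients(self, grads_and_vars):
        for g, v in grads_and_vars:
            v["value"] -= self.lr * g

class AccumOptimizerWrapper:
    """Wraps any optimizer: buffers gradients for `accum_steps` calls,
    then applies the averaged gradient in a single update."""
    def __init__(self, optimizer, accum_steps):
        self.optimizer = optimizer
        self.accum_steps = accum_steps
        self.step = 0
        self.buffer = {}

    def apply_gradients(self, grads_and_vars):
        grads_and_vars = list(grads_and_vars)
        # Accumulate gradients per variable instead of applying them.
        for g, v in grads_and_vars:
            self.buffer[id(v)] = self.buffer.get(id(v), 0.0) + g
        self.step += 1
        # Every accum_steps calls, apply the averaged gradient once.
        if self.step % self.accum_steps == 0:
            averaged = [(self.buffer[id(v)] / self.accum_steps, v)
                        for _, v in grads_and_vars]
            self.optimizer.apply_gradients(averaged)
            self.buffer.clear()

w = {"value": 1.0}
opt = AccumOptimizerWrapper(ToySGD(lr=0.5), accum_steps=4)
for g in [1.0, 2.0, 3.0, 4.0]:
    opt.apply_gradients([(g, w)])
# One update with the averaged gradient 2.5: w = 1.0 - 0.5 * 2.5 = -0.25
print(w["value"])  # -0.25
```

The appeal over the `train_step` overload is that the wrapper only intercepts `apply_gradients`, so the training loop (and, in principle, the distribution strategy) stays untouched.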
Seems like multi-GPU is not working as intended, even with the OptimizerWrapper. It seems to work just fine with one GPU, though.

If anyone wishes to debug this further, see this notebook I made public on Kaggle, which enables you to run tests with two GPUs for free: https://www.kaggle.com/code/andreped/grad-accum-multi-gpu?scriptVersionId=117764939
Optimizer wrapper is finally working with multi-GPU training!
Fixed in 47a51f3.
@Bidski Just letting you know that multi-gpu support has been officially added in the latest release v0.5.0.
Should work out-of-the-box with the optimizer wrapper. For the model wrapper, I have added experimental support, which only works for the SGD optimizer. Nonetheless, I believe the optimizer wrapper should fit your needs.
Let me know how it works and if you experience any issues using it :]