
Comments (12)

RSchmirler commented on May 30, 2024

I also stumbled across this while trying to add GA to my Transformer training with Adam using the OptimizerWrapper. The results do not match.

Investigating further, I found an additional problem: when training my model, using the OptimizerWrapper with accum_steps=1 and using the optimizer directly do not lead to the same results.

Trying it with SGD even leads to an error when training with the OptimizerWrapper.
There seems to be some problem within resource_apply_sparse.

If you are interested, I can set up a public Colab for you to reproduce this.
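Something along these lines is what I have in mind for the reproduction (a minimal sketch; the GradientAccumulateOptimizer keyword names are assumptions and may differ between gradient-accumulator versions, and an Embedding layer is used because it produces the sparse gradient updates where resource_apply_sparse comes into play):

import tensorflow as tf
from gradient_accumulator import GradientAccumulateOptimizer

# hypothetical minimal reproduction; the wrapper's keyword names are assumptions
# an Embedding layer is used so the optimizer receives sparse gradient updates
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=100, output_dim=8, input_length=16),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1),
])

opt = GradientAccumulateOptimizer(optimizer=tf.keras.optimizers.SGD(1e-2), accum_steps=4)
model.compile(optimizer=opt, loss="mse")

x = tf.random.uniform((32, 16), maxval=100, dtype=tf.int32)
y = tf.random.normal((32, 1))
model.fit(x, y, batch_size=4, epochs=1)  # this is where the error shows up for me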


RSchmirler commented on May 30, 2024

Hey, @andreped

Thanks for the suggestions! I'm not looking for multi-GPU support, but HuggingFace TF models come with empty model.inputs/outputs, so I am not sure how this would work with the model wrapper. That is why I tried the optimizer wrapper instead.

The model I am currently using is ESM2:

from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained("facebook/esm2_t6_8M_UR50D", num_labels=1)

Sure, I will write the unit test tomorrow and you can have a look. Most likely I will have to add a small dataset for it as well.
For now, I have put up the notebook.


RSchmirler commented on May 30, 2024

These models take a tokenized sequence as input (basically a dict of tensors) and produce multiple outputs.

This seems to work for me:

import tensorflow as tf
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from gradient_accumulator import GradientAccumulateModel
from transformers import TFAutoModelForSequenceClassification, AutoTokenizer

model = TFAutoModelForSequenceClassification.from_pretrained("facebook/esm2_t6_8M_UR50D", num_labels=1)
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")

# the model expects a dict of tokenized inputs
input_ids = Input(shape=(None,), dtype="int32", name="input_ids")
attention_mask = Input(shape=(None,), dtype="int32", name="attention_mask")
some_input = {"input_ids": input_ids, "attention_mask": attention_mask}

# rebuild as a functional model with explicit inputs/outputs
x = model(some_input)
new_model = Model(inputs=some_input, outputs=x)

# then add gradient accumulation using the model wrapper
new_model = GradientAccumulateModel(accum_steps=10, inputs=new_model.input, outputs=new_model.output)

new_model(tokenizer("Some Sequence", return_tensors="tf"))

It produces the same outputs as model(tokenizer("Some Sequence", return_tensors="tf"))

I will test it for my training pipeline and report back here


RSchmirler commented on May 30, 2024

Sure! I will see if I am able to reproduce my results with the GA model and, if so, create an explanatory notebook for using it with HuggingFace TF models.


RSchmirler commented on May 30, 2024

I tried it out and results look good! 🎉
Equivalent results on CPU for the following setups (sketched right after this list):

  • the original model without GA
  • the model with GA, accum_steps=1, and the original batch size
  • models with GA, accum_steps>1, and batch_size = original_batch_size / accum_steps
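
Concretely, the compared setups look roughly like this (a toy sketch, not my actual ESM2 pipeline; the model, data, and batch sizes are placeholders):

import tensorflow as tf
from gradient_accumulator import GradientAccumulateModel

def build_model():
    tf.keras.utils.set_random_seed(0)  # identical init for a fair comparison
    inp = tf.keras.Input(shape=(4,))
    out = tf.keras.layers.Dense(1)(inp)
    return tf.keras.Model(inp, out)

def with_ga(accum_steps):
    base = build_model()
    return GradientAccumulateModel(accum_steps=accum_steps, inputs=base.input, outputs=base.output)

x = tf.random.normal((128, 4), seed=1)
y = tf.random.normal((128, 1), seed=2)
ORIGINAL_BATCH_SIZE = 32

runs = {
    "no GA": (build_model(), ORIGINAL_BATCH_SIZE),
    "GA, accum_steps=1": (with_ga(1), ORIGINAL_BATCH_SIZE),
    "GA, accum_steps=4": (with_ga(4), ORIGINAL_BATCH_SIZE // 4),  # same effective batch size
}
for name, (m, bs) in runs.items():
    m.compile(optimizer=tf.keras.optimizers.Adam(1e-2), loss="mse")
    m.fit(x, y, batch_size=bs, epochs=2, verbose=0)
    print(name, float(m.evaluate(x, y, verbose=0)))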

About reproducibility on GPU, I am not entirely sure yet. Different mini-batch sizes seem to diverge a little bit, but the overall training results look comparable (like running with a different random state).
My guess is that the GA implementation works fine and this originates somewhere in my training pipeline 😅.

One more thing that came up: it is necessary to define a loss in model.compile (which you normally would not do for these models, because a fitting loss is already defined internally for each model subtype).

I added a short notebook as well.

Thanks for the help!


andreped commented on May 30, 2024

@0syrys's PR has now been merged: b85180c

This PR adds a notebook which demonstrates how to use gradient accumulation with the discussed HuggingFace TF model. The notebook is available here.


tno123 commented on May 30, 2024

The optimizer wrapper is working, but we need to test it with different optimizers.


andreped commented on May 30, 2024

Hello, @0syrys!

I'm normally running unit tests to assess this here.

Hence, if you want to, you could open a PR that adds a specific unit test for this (see here for how to make such a test). Just remember to add the unit test to this CI here. I can review the PR after you have made it. Contributions are always very welcome! :]
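
A rough idea of what such a unit test could look like (just a sketch; the GradientAccumulateOptimizer keyword names and the tolerances are assumptions, not the project's actual test code):

import numpy as np
import tensorflow as tf
from gradient_accumulator import GradientAccumulateOptimizer

def _train(opt, seed=42):
    # tiny deterministic run so both configurations see identical data and init
    tf.keras.utils.set_random_seed(seed)
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer=opt, loss="mse")
    x = np.random.RandomState(0).rand(32, 4).astype("float32")
    y = np.random.RandomState(1).rand(32, 1).astype("float32")
    model.fit(x, y, batch_size=4, epochs=2, verbose=0)
    return model.get_weights()

def test_optimizer_wrapper_matches_plain_optimizer():
    plain = _train(tf.keras.optimizers.SGD(1e-2))
    wrapped = _train(GradientAccumulateOptimizer(optimizer=tf.keras.optimizers.SGD(1e-2), accum_steps=1))
    for w1, w2 in zip(plain, wrapped):
        np.testing.assert_allclose(w1, w2, rtol=1e-5, atol=1e-6)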

But for your application, I would recommend using the GradientAccumulateModel wrapper instead, which is what I use in my work. See here for how to use it.

I assume the reason why you would want to use the OptimizerWrapper is that you want multi-GPU training support. Note that this is currently not working, as observed yesterday; see this thread. Note that I have also shared a Kaggle notebook. Kaggle offers two GPUs for free, whereas Colab only offers one.

If you want to debug this further, do so in this Kaggle notebook:
https://www.kaggle.com/code/andreped/grad-accum-multi-gpu?scriptVersionId=117764939


andreped commented on May 30, 2024

Huggingface TF Models come with empty model.inputs/outputs

What? Are these actual keras models? Stored in which format? SavedModel or HDF5? Perhaps they were converted from another format, which is why inputs/outputs are not defined? I would expect all keras models to have defined inputs/outputs. If not, I believe you could define that yourself. For instance:

import tensorflow as tf
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from gradient_accumulator import GradientAccumulateModel
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained("facebook/esm2_t6_8M_UR50D", num_labels=1)

# assuming it is an RGB input. Don't need to know the actual input shape
some_input = Input(shape=(None, None, 3))
x = model(some_input)

# assuming the model produces a single output
new_model = Model(inputs=some_input, outputs=x)

# delete old model
del model

# then add gradient accumulation using the model wrapper
new_model = GradientAccumulateModel(accum_steps=10, inputs=new_model.input, outputs=new_model.output)

I have not tried whether this works, but perhaps you could give it a go?

For more information, see here.

I can make jupyter notebooks in the future, to make it even easier to understand how to use it and whatnot.


andreped commented on May 30, 2024

Yeah, that makes sense. Had no idea which input it took, but glad it seems to be working for you :]

Would you be interested in making a Jupyter notebook to demonstrate how to use GA with HuggingFace's GPT, if you have time? You can make a PR to the notebooks/ directory I have.


andreped commented on May 30, 2024

I tried it out and results look good! 🎉
Equivalent results on CPU for:

Great to hear! :D

My guess is, the GA implementation works fine and this originates somewhere in my training pipeline

Note that in order to make experiments reproducible and the comparison fair, you should set the seed properly. In TF/Python that is surprisingly tricky to do. I implemented a simple reset method that does that properly, and run it between experiments; see here.
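
For reference, something along these lines is the general idea (a sketch, not the exact method from the repo):

import os
import random

import numpy as np
import tensorflow as tf

def reset(seed=123):
    # reseed every source of randomness a TF/Python training run typically touches
    os.environ["PYTHONHASHSEED"] = str(seed)
    os.environ["TF_DETERMINISTIC_OPS"] = "1"  # best-effort determinism on GPU
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    tf.keras.backend.clear_session()  # drop leftover graph state between experiments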

About reproducibility on GPU I am not entirely sure yet.

Note that gradient accumulation is only an approximation of regular batch training. In theory, you should get exactly the same results, but due to machine epsilon, memory fragmentation, and other fun stuff happening with dynamic optimizers, you should expect some minor differences in the results. However, it should approximate regular batch training extremely well. I have never had problems using it in any of my projects. Just note that if your network has batch normalization you will have issues, as the implementation is not compatible with Keras' implementation of BN.

One more thing that came up
It is necessary to define a loss in model.compile (which you normally would not do for these models, because a fitting loss is already defined internally for each model subtype)

I assume you mean that the model itself has a custom loss layer. If that's the case, simply set model.compile(loss=None), and it should work just fine. I just did that in a recent paper. Refer to the sample code for how I did that here.
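
In code that boils down to something like this (a sketch; new_model is the wrapped HuggingFace model from the earlier snippet, and train_dataset is a placeholder tf.data.Dataset yielding tokenized inputs with labels):

import tensorflow as tf

# loss=None lets the model's internally computed loss be used instead of an external one
new_model.compile(optimizer=tf.keras.optimizers.Adam(1e-5), loss=None)
new_model.fit(train_dataset, epochs=3)  # train_dataset is a placeholder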

Thanks for the help!

Anytime! If you have any other questions, please, let me know. Happy to help :]


andreped commented on May 30, 2024

Fixed in a4ca2c4.

Hence, @0syrys, we should now have similar behaviour between optimizers when using the optimizer wrapper.

Further benchmarks are needed to address this, but at least the current unit tests confirm that it is good enough for now.

