
Comments (12)

RSchmirler commented on May 30, 2024

I also stumbled across this while trying to add GA to my Transformer training with Adam using the OptimizerWrapper. The results do not match.

Investigating further, I found an additional problem: when training my model, using the OptimizerWrapper with accum_steps=1 and using the optimizer directly do not lead to the same results.

Trying it with SGD even leads to an error when training with the OptimizerWrapper.
There seems to be some problem within resource_apply_sparse.

If you are interested, I can set up a public Colab for you to reproduce this.
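Something along these lines is what I have in mind for the reproduction (a minimal sketch; the GradientAccumulateOptimizer keyword names are assumptions and may differ between gradient-accumulator versions, and an Embedding layer is used because it produces the sparse gradient updates where resource_apply_sparse comes into play):

import tensorflow as tf
from gradient_accumulator import GradientAccumulateOptimizer

# hypothetical minimal reproduction; the wrapper's keyword names are assumptions
# an Embedding layer is used so the optimizer receives sparse gradient updates
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=100, output_dim=8, input_length=16),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1),
])

opt = GradientAccumulateOptimizer(optimizer=tf.keras.optimizers.SGD(1e-2), accum_steps=4)
model.compile(optimizer=opt, loss="mse")

x = tf.random.uniform((32, 16), maxval=100, dtype=tf.int32)
y = tf.random.normal((32, 1))
model.fit(x, y, batch_size=4, epochs=1)  # this is where the error shows up for me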


RSchmirler commented on May 30, 2024

Hey, @andreped

Thanks for the suggestions! I'm not looking for multi-GPU support, but HuggingFace TF models come with empty model.inputs/outputs, so I am not sure how this would work with the model wrapper. That is why I tried the optimizer wrapper instead.

The model I am currently using is ESM2:

from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained("facebook/esm2_t6_8M_UR50D", num_labels=1)

Sure, I will write the unit test tomorrow and you can have a look. Most likely I will have to add a small dataset for it as well.
For now, I have put up the notebook.


RSchmirler commented on May 30, 2024

These models take a tokenized sequence as input (basically a dict of tensors) and produce multiple outputs.

This seems to work for me:

import tensorflow as tf
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from gradient_accumulator import GradientAccumulateModel
from transformers import TFAutoModelForSequenceClassification, AutoTokenizer

model = TFAutoModelForSequenceClassification.from_pretrained("facebook/esm2_t6_8M_UR50D", num_labels=1)
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")

# the model expects a dict of tokenized inputs
input_ids = Input(shape=(None,), dtype="int32", name="input_ids")
attention_mask = Input(shape=(None,), dtype="int32", name="attention_mask")
some_input = {"input_ids": input_ids, "attention_mask": attention_mask}

# rebuild as a functional model with explicit inputs/outputs
x = model(some_input)
new_model = Model(inputs=some_input, outputs=x)

# then add gradient accumulation using the model wrapper
new_model = GradientAccumulateModel(accum_steps=10, inputs=new_model.input, outputs=new_model.output)

new_model(tokenizer("Some Sequence", return_tensors="tf"))

It produces the same outputs as model(tokenizer("Some Sequence", return_tensors="tf"))

I will test it for my training pipeline and report back here


RSchmirler commented on May 30, 2024

Sure! I will see if I am able to reproduce my results with the GA model and, if so, create an explanatory notebook for using it with HuggingFace TF models.


RSchmirler commented on May 30, 2024

I tried it out and results look good! 🎉
Equivalent results on CPU for the following setups (sketched right after this list):

  • the original model without GA
  • the model with GA, accum_steps=1, and the original batch size
  • models with GA, accum_steps>1, and batch_size = original_batch_size / accum_steps
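
Concretely, the compared setups look roughly like this (a toy sketch, not my actual ESM2 pipeline; the model, data, and batch sizes are placeholders):

import tensorflow as tf
from gradient_accumulator import GradientAccumulateModel

def build_model():
    tf.keras.utils.set_random_seed(0)  # identical init for a fair comparison
    inp = tf.keras.Input(shape=(4,))
    out = tf.keras.layers.Dense(1)(inp)
    return tf.keras.Model(inp, out)

def with_ga(accum_steps):
    base = build_model()
    return GradientAccumulateModel(accum_steps=accum_steps, inputs=base.input, outputs=base.output)

x = tf.random.normal((128, 4), seed=1)
y = tf.random.normal((128, 1), seed=2)
ORIGINAL_BATCH_SIZE = 32

runs = {
    "no GA": (build_model(), ORIGINAL_BATCH_SIZE),
    "GA, accum_steps=1": (with_ga(1), ORIGINAL_BATCH_SIZE),
    "GA, accum_steps=4": (with_ga(4), ORIGINAL_BATCH_SIZE // 4),  # same effective batch size
}
for name, (m, bs) in runs.items():
    m.compile(optimizer=tf.keras.optimizers.Adam(1e-2), loss="mse")
    m.fit(x, y, batch_size=bs, epochs=2, verbose=0)
    print(name, float(m.evaluate(x, y, verbose=0)))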

About reproducibility on GPU, I am not entirely sure yet. Different mini-batch sizes seem to diverge a little bit, but the overall training results look comparable (like running with a different random state).
My guess is that the GA implementation works fine and this originates somewhere in my training pipeline 😅.

One more thing that came up: it is necessary to define a loss in model.compile (which you normally would not do for these models, because a fitting loss is already defined internally for each model subtype).

I added a short notebook as well.

Thanks for the help!


andreped commented on May 30, 2024

@0syrys's PR has now been merged: b85180c

This PR adds a notebook which demonstrates how to use gradient accumulation with the discussed HuggingFace TF model. The notebook is available here.


tno123 commented on May 30, 2024

The optimizer wrapper is working, but we need to test it with different optimizers.


andreped commented on May 30, 2024

Hello, @0syrys!

I'm normally running unit tests to assess this here.

Hence, if you want to, you could open a PR that adds a specific unit test for this (see here for how to make such a test). Just remember to add the unit test to this CI here. I can review the PR after you have made it. Contributions are always very welcome! :]
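
A rough idea of what such a unit test could look like (just a sketch; the GradientAccumulateOptimizer keyword names and the tolerances are assumptions, not the project's actual test code):

import numpy as np
import tensorflow as tf
from gradient_accumulator import GradientAccumulateOptimizer

def _train(opt, seed=42):
    # tiny deterministic run so both configurations see identical data and init
    tf.keras.utils.set_random_seed(seed)
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer=opt, loss="mse")
    x = np.random.RandomState(0).rand(32, 4).astype("float32")
    y = np.random.RandomState(1).rand(32, 1).astype("float32")
    model.fit(x, y, batch_size=4, epochs=2, verbose=0)
    return model.get_weights()

def test_optimizer_wrapper_matches_plain_optimizer():
    plain = _train(tf.keras.optimizers.SGD(1e-2))
    wrapped = _train(GradientAccumulateOptimizer(optimizer=tf.keras.optimizers.SGD(1e-2), accum_steps=1))
    for w1, w2 in zip(plain, wrapped):
        np.testing.assert_allclose(w1, w2, rtol=1e-5, atol=1e-6)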

But for your application, I would recommend using the GradientAccumulateModel wrapper instead, which is what I use in my work. See here for how to use it.

I assume the reason why you would want to use the OptimizerWrapper is that you want multi-GPU training support. Note that this is currently not working, as observed yesterday; see this thread. Note that I have also shared a Kaggle notebook. Kaggle offers two GPUs for free, whereas Colab only offers one.

If you want to debug this further, do so in this Kaggle notebook:
https://www.kaggle.com/code/andreped/grad-accum-multi-gpu?scriptVersionId=117764939


andreped commented on May 30, 2024

Huggingface TF Models come with empty model.inputs/outputs

What? Are these actual keras models? Stored in which format? SavedModel or HDF5? Perhaps they were converted from another format, which is why inputs/outputs are not defined? I would expect all keras models to have defined inputs/outputs. If not, I believe you could define that yourself. For instance:

import tensorflow as tf
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from gradient_accumulator import GradientAccumulateModel
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained("facebook/esm2_t6_8M_UR50D", num_labels=1)

# assuming it is an RGB input. Don't need to know the actual input shape
some_input = Input(shape=(None, None, 3))
x = model(some_input)

# assuming the model produces a single output
new_model = Model(inputs=some_input, outputs=x)

# delete old model
del model

# then add gradient accumulation using the model wrapper
new_model = GradientAccumulateModel(accum_steps=10, inputs=new_model.input, outputs=new_model.output)

I have not tried whether this works, but perhaps you could give it a go?

For more information, see here.

I can make jupyter notebooks in the future, to make it even easier to understand how to use it and whatnot.


andreped commented on May 30, 2024

Yeah, that makes sense. Had no idea which input it took, but glad it seems to be working for you :]

Would you be interested in making a Jupyter notebook to demonstrate how to use GA with HuggingFace's GPT, if you have time? You can make a PR to the notebooks/ directory I have.


andreped commented on May 30, 2024

I tried it out and results look good! 🎉
Equivalent results on CPU for:

Great to hear! :D

My guess is, the GA implementation works fine and this originates somewhere in my training pipeline

Note that in order to make experiments reproducible and the comparison fair, you should set the seed properly. In TF/Python that is surprisingly tricky to do. I implemented a simple reset method that does that properly, and run it between experiments; see here.
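
For reference, something along these lines is the general idea (a sketch, not the exact method from the repo):

import os
import random

import numpy as np
import tensorflow as tf

def reset(seed=123):
    # reseed every source of randomness a TF/Python training run typically touches
    os.environ["PYTHONHASHSEED"] = str(seed)
    os.environ["TF_DETERMINISTIC_OPS"] = "1"  # best-effort determinism on GPU
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    tf.keras.backend.clear_session()  # drop leftover graph state between experiments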

About reproducibility on GPU I am not entirely sure yet.

Note that gradient accumulation is only an approximation of regular batch training. In theory, you should get exactly the same results, but due to machine epsilon, memory fragmentation, and other fun stuff happening with dynamic optimizers, you should expect some minor differences in the results. However, it should approximate regular batch training extremely well. I have never had problems using it in any of my projects. Just note that if your network has batch normalization you will have issues, as the implementation is not compatible with Keras' implementation of BN.

One more thing that came up
It is necessary to define a loss in model.compile (which you normally would not do for these models, because a fitting loss is already defined internally for each model subtype)

I assume you mean that the model itself has a custom loss layer. If that's the case, simply set model.compile(loss=None), and it should work just fine. I just did that in a recent paper. Refer to the sample code for how I did that here.
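
In code that boils down to something like this (a sketch; new_model is the wrapped HuggingFace model from the earlier snippet, and train_dataset is a placeholder tf.data.Dataset yielding tokenized inputs with labels):

import tensorflow as tf

# loss=None lets the model's internally computed loss be used instead of an external one
new_model.compile(optimizer=tf.keras.optimizers.Adam(1e-5), loss=None)
new_model.fit(train_dataset, epochs=3)  # train_dataset is a placeholder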

Thanks for the help!

Anytime! If you have any other questions, please, let me know. Happy to help :]


andreped commented on May 30, 2024

Fixed in a4ca2c4.

Hence, @0syrys, we should now have similar behaviour between optimizers when using the optimizer wrapper.

Further benchmarks are needed to address this, but at least the current unit tests confirm that it is good enough for now.

