Comments (12)
I also stumbled across this while trying to add GA to my Transformer training with Adam using the OptimizerWrapper. The results do not match.
Investigating further, I found an additional problem: when training my model, the OptimizerWrapper with accum_steps=1 and the optimizer used directly do not lead to the same results.
Trying it with SGD even leads to an error when training with the OptimizerWrapper.
There seems to be some problem within resource_apply_sparse.
If you are interested, I can set up a public Colab for you to reproduce this.
from gradientaccumulator.
Hey, @andreped
thanks for the suggestions! I'm not looking for multi GPU support but Huggingface TF Models come with empty model.inputs/outputs so I am not sure how this would work with the Model Wrapper. Therefore I tried the Optimizer.
The model I am currently using is ESM2:
from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained("facebook/esm2_t6_8M_UR50D", num_labels=1)
Sure, I will write the unit test tomorrow and you can have a look. Most likely I will have to add a small dataset for it as well.
For now, I put up the notebook
from gradientaccumulator.
These models take a tokenized sequence as input, which is basically a dict of tensors, and produce multiple outputs.
This seems to work for me:
import tensorflow as tf
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from gradient_accumulator import GradientAccumulateModel
from transformers import TFAutoModelForSequenceClassification, AutoTokenizer
model = TFAutoModelForSequenceClassification.from_pretrained("facebook/esm2_t6_8M_UR50D", num_labels=1)
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
input_ids = tf.keras.Input(shape=(None,), dtype='int32', name="input_ids")
attention_mask = tf.keras.Input(shape=(None,), dtype='int32', name="attention_mask")
some_input={'input_ids': input_ids, 'attention_mask': attention_mask}
x = model(some_input)
new_model = Model(inputs=some_input, outputs=x)
# then add gradient accumulation using the model wrapper
new_model = GradientAccumulateModel(accum_steps=10, inputs=new_model.input, outputs=new_model.output)
new_model(tokenizer("Some Sequence", return_tensors="tf"))
It produces the same outputs as model(tokenizer("Some Sequence", return_tensors="tf"))
I will test it for my training pipeline and report back here
from gradientaccumulator.
Sure! I will see if I am able to reproduce my results with the GA model and if so create an explanatory notebook for using it with HuggingFace TF Models
from gradientaccumulator.
I tried it out and results look good! 🎉
Equivalent results on CPU for:
- original model without GA
- model with GA and accum=1 and original_batchsize
- models with GA and accum>1 and batchsize=original_batchsize/accum
About reproducibility on GPU I am not entirely sure yet. Different mini-batch sizes seem to diverge a little bit, but the overall training results look comparable (like running with a different random state).
My guess is, the GA implementation works fine and this originates somewhere in my training pipeline 😅.
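Why the three configurations listed above are expected to agree can be sketched in plain Python. This is only an illustration on a made-up 1-D linear model with an MSE loss, not the library's implementation; the data, `mse_grad`, and `accum_steps` here are hypothetical. Averaging the gradients of equally sized mini-batches gives the full-batch gradient up to floating-point rounding:

```python
import math


def mse_grad(w, xs, ys):
    """Gradient of the mean squared error for the 1-D model y_hat = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


# Hypothetical toy data: 8 samples, i.e. original_batchsize = 8.
xs = [0.5, -1.0, 2.0, 1.5, -0.5, 3.0, 0.25, -2.0]
ys = [1.0, -2.1, 3.9, 3.2, -0.8, 6.1, 0.4, -4.2]
w = 0.3

accum_steps = 4
mini_batch_size = len(xs) // accum_steps  # original_batchsize / accum

# Gradient computed on the full batch in one go.
full_grad = mse_grad(w, xs, ys)

# Gradient accumulated over equally sized mini-batches, then averaged.
accum_grad = 0.0
for step in range(accum_steps):
    lo, hi = step * mini_batch_size, (step + 1) * mini_batch_size
    accum_grad += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accum_steps

# Equal up to rounding; bitwise equality is not guaranteed.
print(math.isclose(full_grad, accum_grad, rel_tol=1e-9))
```

The comparison uses a tolerance on purpose: the order of the floating-point additions differs between the two paths, which is one plausible source of small divergences.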
One more thing that came up
It is necessary to define a loss in model.compile (which you normally would not do for these models, because a fitting loss is already defined internally for each model subtype)
Added a short Notebook as well
Thanks for the help!
from gradientaccumulator.
@0syrys's PR has now been merged: b85180c
This PR adds a notebook which demonstrates how to use gradient accumulation with the discussed HuggingFace TF model. The notebook is available here.
from gradientaccumulator.
Optimizer wrapper's working, but we need to test on different optimizers.
from gradientaccumulator.
Hello, @0syrys!
I'm normally running unit tests to assess this here.
Hence, if you want to, you could add a PR which adds a specific unit test for this (see here for how to make such a test). Just remember to add the unit test to this CI here. I can review the PR after you have made it. Contributions are always very welcome! :]
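The invariance such a unit test would check can be sketched without TensorFlow. The classes below are hypothetical stand-ins (a toy SGD and a toy accumulating wrapper), not the OptimizerWrapper from this repository; they only show the shape of the assertion: with accum_steps=1, the wrapped and the plain optimizer should yield identical weights.

```python
class SGD:
    """Toy SGD acting on a list of float weights (illustration only)."""

    def __init__(self, lr):
        self.lr = lr

    def apply_gradients(self, weights, grads):
        return [w - self.lr * g for w, g in zip(weights, grads)]


class AccumulatingOptimizer:
    """Hypothetical wrapper: averages gradients over accum_steps calls,
    then forwards the averaged gradient to the wrapped optimizer."""

    def __init__(self, optimizer, accum_steps):
        self.optimizer = optimizer
        self.accum_steps = accum_steps
        self._sums = None
        self._count = 0

    def apply_gradients(self, weights, grads):
        if self._sums is None:
            self._sums = [0.0] * len(grads)
        self._sums = [s + g for s, g in zip(self._sums, grads)]
        self._count += 1
        if self._count < self.accum_steps:
            return weights  # still accumulating, no update yet
        mean = [s / self.accum_steps for s in self._sums]
        self._sums, self._count = None, 0
        return self.optimizer.apply_gradients(weights, mean)


def test_accum_steps_one_matches_plain_optimizer():
    grads_per_step = [[0.5, -1.0], [0.25, 2.0], [-0.75, 0.1]]
    plain = wrapped = [1.0, -2.0]
    opt = SGD(lr=0.1)
    wrap = AccumulatingOptimizer(SGD(lr=0.1), accum_steps=1)
    for grads in grads_per_step:
        plain = opt.apply_gradients(plain, grads)
        wrapped = wrap.apply_gradients(wrapped, grads)
    assert plain == wrapped


test_accum_steps_one_matches_plain_optimizer()
```

A real test would presumably train two identically initialized Keras models instead, one with each optimizer, and compare their weights or metrics.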
But for your application, I would recommend using the GradientAccumulateModel wrapper instead, which is what I use in my work. See here for how to use it.
I assume the reason why you would want to use the OptimizerWrapper is that you want to have multi-GPU training support. Note that this is not working, which was observed yesterday. See this thread. Note that I have also shared a Kaggle notebook. Kaggle offers two GPUs for free, whereas Colab only offers one.
If you want to debug this further, do so in this Kaggle notebook:
https://www.kaggle.com/code/andreped/grad-accum-multi-gpu?scriptVersionId=117764939
from gradientaccumulator.
Huggingface TF Models come with empty model.inputs/outputs
What? Are these actual keras models? Stored in which format? SavedModel or HDF5? Perhaps they were converted from another format, which is why inputs/outputs are not defined? I would expect all keras models to have defined inputs/outputs. If not, I believe you could define that yourself. For instance:
import tensorflow as tf
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from gradient_accumulator import GradientAccumulateModel
from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained("facebook/esm2_t6_8M_UR50D", num_labels=1)
# assuming it is an RGB input. Don't need to know the actual input shape
some_input = Input(shape=(None, None, 3))
x = model(some_input)
# assuming the model produces a single output
new_model = Model(inputs=some_input, outputs=x)
# delete old model
del model
# then add gradient accumulation using the model wrapper
new_model = GradientAccumulateModel(accum_steps=10, inputs=new_model.input, outputs=new_model.output)
I have not tried whether this works, but perhaps you could give it a go?
For more information, see here.
I can make jupyter notebooks in the future, to make it even easier to understand how to use it and whatnot.
from gradientaccumulator.
Yeah, that makes sense. Had no idea which input it took, but glad it seems to be working for you :]
It would be of interest to make a Jupyter notebook demonstrating how to use GA with HuggingFace's GPT, if you have time. You can make a PR to the notebooks/ directory I have.
from gradientaccumulator.
I tried it out and results look good! 🎉
Equivalent results on CPU for:
Great to hear! :D
My guess is, the GA implementation works fine and this originates somewhere in my training pipeline
Note that in order to make experiments reproducible and the comparison fair, you should set the seed properly. In TF/Python that is surprisingly tricky to do. I implemented a simple reset method that does that properly, and run that between experiments, see here.
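For readers without access to the linked code, a minimal sketch of such a reset could look like the following. This is an assumption about what a reset typically involves, not the implementation referred to above; `reset_seeds` and its default seed are hypothetical names. NumPy and TensorFlow are seeded only if they are installed:

```python
import random


def reset_seeds(seed=42):
    """Re-seed the common random number generators between experiments."""
    random.seed(seed)  # Python's built-in RNG
    try:
        import numpy as np
        np.random.seed(seed)  # NumPy's global RNG
    except ImportError:
        pass
    try:
        import tensorflow as tf
        tf.keras.backend.clear_session()  # drop state left over from earlier models
        tf.random.set_seed(seed)  # TensorFlow's global seed
    except ImportError:
        pass


# Calling the reset before each run makes the Python RNG repeat itself.
reset_seeds(123)
first = random.random()
reset_seeds(123)
second = random.random()
print(first == second)
```

Seeding alone may not be enough on GPU, where some ops are non-deterministic, so small differences between runs can remain.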
About reproducibility on GPU I am not entirely sure yet.
Note that gradient accumulation is, in practice, only an approximation of regular batch training. In theory, you should get exactly the same results, but due to machine epsilon, memory fragmentation, and other fun stuff happening with dynamic optimizers, you should expect some minor differences in the results. However, it should approximate it extremely well. I have never had problems using it in any of my projects. Just note that if your network has batch normalization you will have issues, as the implementation is not compatible with Keras' implementation of BN.
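One way to see why batch normalization is a special case: BN normalizes with statistics of the batch it currently sees, so splitting a batch into mini-batches changes those statistics even though the averaged gradients of ordinary layers are unaffected. A rough sketch with made-up numbers (this only illustrates the statistics, not Keras' BN layer):

```python
from statistics import fmean, pvariance

# Hypothetical activations of one feature for a full batch of 8 samples.
batch = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0]

# Statistics BN would use when it sees the full batch at once.
full_mean = fmean(batch)
full_var = pvariance(batch)

# Statistics BN would use when the batch is split into 2 mini-batches.
mini_batches = [batch[:4], batch[4:]]
mini_stats = [(fmean(b), pvariance(b)) for b in mini_batches]

print("full batch:", full_mean, full_var)
for i, (mean, var) in enumerate(mini_stats):
    print("mini-batch", i, ":", mean, var)
```

Each mini-batch is normalized with its own mean and variance, so the forward pass under accumulation is no longer the same computation as the full-batch forward pass.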
One more thing that came up
It is necessary to define a loss in model.compile (which you normally would not do for these models, because a fitting loss is already defined internally for each model subtype)
I assume you mean that the model itself has a custom loss layer. If that's the case, simply set model.compile(loss=None), and it should work just fine. I just did that in a recent paper. Refer to the sample code for how I did that here.
Thanks for the help!
Anytime! If you have any other questions, please, let me know. Happy to help :]
from gradientaccumulator.
Fixed in a4ca2c4.
Hence, @0syrys, we should now have similar behaviour between optimizers when using the optimizer wrapper.
Further benchmarks are needed to address this, but at least the current unit tests confirm that it is good enough for now.
from gradientaccumulator.