andreped / gradientaccumulator

:dart: Accumulated Gradients for TensorFlow 2

Home Page: https://gradientaccumulator.readthedocs.io/

License: MIT License

Python 79.39% Jupyter Notebook 19.31% Makefile 0.74% Shell 0.57%
tensorflow accumulated-gradients tensorflow2 tf2 batch-size memory-constraints gradient-accumulation distributed-training float16 mixed-precision

gradientaccumulator's People

Contributors

andreped, dbouget, jpdefrutos, mhoibo, rschmirler, tno123


gradientaccumulator's Issues

TensorFlow 2.2 support?

Now that tensorflow-addons has been removed, we could likely support v2.2 again.

Will need to update the test CI to check.

macOS tests fail

When running the CIs, it was observed that the macOS jobs failed.

Below is the verbose pytest output, as observed here:

ImportError while importing test module '/Users/runner/work/GradientAccumulator/GradientAccumulator/tests/test_train_mnist.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
../../../hostedtoolcache/Python/3.8.14/x64/lib/python3.8/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_train_mnist.py:2: in <module>
    import tensorflow_datasets as tfds
../../../hostedtoolcache/Python/3.8.14/x64/lib/python3.8/site-packages/tensorflow_datasets/__init__.py:43: in <module>
    import tensorflow_datasets.core.logging as _tfds_logging
../../../hostedtoolcache/Python/3.8.14/x64/lib/python3.8/site-packages/tensorflow_datasets/core/__init__.py:22: in <module>
    from tensorflow_datasets.core import community
../../../hostedtoolcache/Python/3.8.14/x64/lib/python3.8/site-packages/tensorflow_datasets/core/community/__init__.py:18: in <module>
    from tensorflow_datasets.core.community.huggingface_wrapper import mock_builtin_to_use_gfile
../../../hostedtoolcache/Python/3.8.14/x64/lib/python3.8/site-packages/tensorflow_datasets/core/community/huggingface_wrapper.py:31: in <module>
    from tensorflow_datasets.core import dataset_builder
../../../hostedtoolcache/Python/3.8.14/x64/lib/python3.8/site-packages/tensorflow_datasets/core/dataset_builder.py:34: in <module>
    from tensorflow_datasets.core import dataset_info
../../../hostedtoolcache/Python/3.8.14/x64/lib/python3.8/site-packages/tensorflow_datasets/core/dataset_info.py:47: in <module>
    from tensorflow_datasets.core import file_adapters
../../../hostedtoolcache/Python/3.8.14/x64/lib/python3.8/site-packages/tensorflow_datasets/core/file_adapters.py:29: in <module>
    from array_record.python import array_record_module
E   ImportError: dlopen(/Users/runner/hostedtoolcache/Python/3.8.14/x64/lib/python3.8/site-packages/array_record/python/array_record_module.so, 2): no suitable image found.  Did find:
E   	/Users/runner/hostedtoolcache/Python/3.8.14/x64/lib/python3.8/site-packages/array_record/python/array_record_module.so: unknown file type, first eight bytes: 0x7F 0x45 0x4C 0x46 0x02 0x01 0x01 0x03
E   	/Users/runner/hostedtoolcache/Python/3.8.14/x64/lib/python3.8/site-packages/array_record/python/array_record_module.so: unknown file type, first eight bytes: 0x7F 0x45 0x4C 0x46 0x02 0x01 0x01 0x03

Optimizer unit tests fail for tf 2.11 in codecov CI

If I revert back to tf 2.8 it works fine.

From TensorFlow 2.11 onwards, the Optimizer class has been reworked, and a legacy Optimizer class was therefore introduced.

However, after running the same Optimizer wrapper unit tests with tf 2.8 and 2.11, the latter results in an AssertionError, meaning that the results were different.

For now I have reverted to tf 2.8, but this means that the Optimizer wrapper solution is flawed.

Add documentation

Right now, the only real documentation is in the README.

It would be better if we made proper documentation and hosted it on PyPI, like most other libraries do.

For that, Sphinx seems like the way to go.

Alternatively, we could use GitHub Pages for this (see here). However, we would then need to make a GradientAccumulator organization, which I find extremely overkill.

TF-addons deprecated

As tensorflow-addons has been deprecated, we should likely remove it.

Currently, we only use it for type hints anyway.

Add optimizer wrapper solution

It was observed that model wrapping, where we overload the train_step, is not compatible with distributed training strategies (multi-GPU training), as discussed in this thread.

What should work better is to wrap the optimizer instead, as we should have better control of the gradients across replicas and can define the correct behaviour.
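For reference, a minimal sketch of how such an optimizer wrapper could be used, assuming it takes the wrapped optimizer and an accum_steps argument:

import tensorflow as tf
from gradient_accumulator import GradientAccumulateOptimizer

# Hypothetical usage: wrap any Keras optimizer so gradients are accumulated
# over accum_steps micro-batches before the update is applied.
opt = GradientAccumulateOptimizer(
    optimizer=tf.keras.optimizers.SGD(learning_rate=1e-2),
    accum_steps=4,
)

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
model.compile(optimizer=opt, loss="mse")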

tensorflow-datasets fails in the CIs

Due to protobuf issues, lots of libraries are failing to work as intended.

The fix should be done in tensorflow-datasets directly or in protobuf, but for now I have made a PR in tfds to address this issue, see tensorflow/datasets#4865.

Error logs:

==================================== ERRORS ====================================
__________________ ERROR collecting tests/test_train_mnist.py __________________
ImportError while importing test module '/home/runner/work/GradientAccumulator/GradientAccumulator/tests/test_train_mnist.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/importlib/__init__.py:126: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_train_mnist.py:2: in <module>
    import tensorflow_datasets as tfds
/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/tensorflow_datasets/__init__.py:43: in <module>
    import tensorflow_datasets.core.logging as _tfds_logging
/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/tensorflow_datasets/core/__init__.py:22: in <module>
    from tensorflow_datasets.core import community
/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/tensorflow_datasets/core/community/__init__.py:18: in <module>
    from tensorflow_datasets.core.community.huggingface_wrapper import mock_builtin_to_use_gfile
/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/tensorflow_datasets/core/community/huggingface_wrapper.py:31: in <module>
    from tensorflow_datasets.core import dataset_builder
/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/tensorflow_datasets/core/dataset_builder.py:33: in <module>
    from tensorflow_datasets.core import dataset_info
/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/tensorflow_datasets/core/dataset_info.py:49: in <module>
    from tensorflow_datasets.core import splits as splits_lib
/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/tensorflow_datasets/core/splits.py:34: in <module>
    from tensorflow_datasets.core import proto as proto_lib
/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/tensorflow_datasets/core/proto/__init__.py:18: in <module>
    from tensorflow_datasets.core.proto import dataset_info_generated_pb2 as dataset_info_pb2  # pylint: disable=line-too-long
/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/tensorflow_datasets/core/proto/dataset_info_generated_pb2.py:31: in <module>
    from tensorflow_metadata.proto.v0 import schema_pb2 as tensorflow__metadata_dot_proto_dot_v0_dot_schema__pb2
/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/tensorflow_metadata/proto/__init__.py:16: in <module>
    from tensorflow_metadata.proto.v0 import anomalies_pb2
/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/tensorflow_metadata/proto/v0/anomalies_pb2.py:5: in <module>
    from google.protobuf.internal import builder as _builder
E   ImportError: cannot import name 'builder' from 'google.protobuf.internal' (/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/google/protobuf/internal/__init__.py)

Saving models to .h5 in tf>=2.8 fails

WARNING:tensorflow:Found duplicated Variables in Model's weights. This is usually caused by Variables being shared by Layers in the Model. These Variables will be treated as separate Variables when the Model is restored. To avoid this, please save with save_format="tf".
Traceback (most recent call last):
  File "train.py", line 230, in <module>
    model.fit(
  File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.8/dist-packages/h5py/_hl/group.py", line 149, in create_dataset
    dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/h5py/_hl/dataset.py", line 142, in make_new_dset
    dset_id = h5d.create(parent.id, name, tid, sid, dcpl=dcpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5d.pyx", line 87, in h5py.h5d.create
ValueError: Unable to create dataset (name already exists)
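As the warning itself suggests, a possible workaround (a sketch only, not verified against the wrapper) is to save in the TensorFlow SavedModel format instead of HDF5:

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])

# Saving in the SavedModel format instead of HDF5 avoids the duplicated
# Variables problem that the warning above describes.
model.save("model_savedmodel", save_format="tf")
loaded = tf.keras.models.load_model("model_savedmodel")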

Unable to compile model after loading model with GradientAccumulateOptimizer wrapper

See this unit test to reproduce.

Basically, if we pass compile=True to load_model here, we need to provide the optimizer as a custom_object. This should not be necessary.

Perhaps some serialization is missing?

Error log:

ValueError: Unknown config_item: 'SGD'. Please ensure you are using a `keras.utils.custom_object_scope` and that
this object is included in the scope. See https://www.tensorflow.org/guide/keras/save_and_serialize#registering_the_custom_object for details.
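Until the serialization is fixed, a hypothetical workaround is to pass the wrapper explicitly as a custom object when loading (sketch only; "trained_model.h5" is a placeholder path):

from tensorflow.keras.models import load_model
from gradient_accumulator import GradientAccumulateOptimizer

# Hypothetical workaround: provide the optimizer wrapper as a custom object
# so that compile=True can reconstruct it; this should ideally not be needed.
model = load_model(
    "trained_model.h5",
    compile=True,
    custom_objects={"GradientAccumulateOptimizer": GradientAccumulateOptimizer},
)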

Working with subclassed models

Right now, if we have a model which is subclassed from tf.keras.Model, we cannot use this wrapper.

Just subclassing from GAModelWrapper will not help, as we need to call __init__ for this class after self.trainable_variables are available.
Also, we can't move the __init__ call to after the model has been configured, as this will cause an error: the underlying tf.keras.Model.__init__ needs to be called first!

To prevent this awkward situation, it makes sense to add:

def reinit_grad_accum(self):
    # Recreate the accumulator variables once self.trainable_variables exist.
    self.gradient_accumulation = [
        tf.Variable(tf.zeros_like(v, dtype=tf.float32), trainable=False, name="accum_" + str(i))
        for i, v in enumerate(self.trainable_variables)
    ]

This function can then be called later, as desired, to make sure everything works fine.

Benchmark unit test comparing custom and original BN layers not working

Seems like the current CI results in an AssertionError, meaning that the final performance between otherwise identical experiments with Keras BN or custom BN is not the same.

Might be that a reset is needed, that the initializations differ somehow, or that there are other details in the original Keras implementation which we are not replicating.

Download PyPI badge not working

For whatever reason, the PyPI downloads badge is not working properly. Seems like it is unable to find the package.

I have used this before with other packages and it just worked.

Maybe it is the hyphen in the package name causing this?

Add mixed precision unit tests

To verify that mixed precision works, it would be great if we could run one test CI with mixed precision.

It should work on CPU-only runners, as long as we remember to enable mixed precision.

However, training should be rather slow, as float16 is not suitable for CPU compute.
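A minimal sketch of what enabling mixed precision in such a CI test could look like (standard Keras API, independent of this package):

from tensorflow.keras import mixed_precision

# Enable mixed precision globally; this also works on CPU-only runners,
# just slowly, since float16 compute is not optimized for CPUs.
mixed_precision.set_global_policy("mixed_float16")

print(mixed_precision.global_policy())  # mixed_float16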

Multi-input/output support

AFAIK, the current implementation is not compatible with multi-input/output networks, as we overload the train_step and only handle x and y there.

It might be that x and y can represent sets of inputs and outputs, but we should check this further.

We should add a CI unit test to verify if this implementation works in this scenario as well.
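A toy multi-input/multi-output model that such a CI test could be built around (hypothetical example, not taken from the test suite):

import tensorflow as tf

# Two inputs, two outputs; Keras passes x and y as tuples/dicts in train_step.
inp1 = tf.keras.Input(shape=(8,), name="inp1")
inp2 = tf.keras.Input(shape=(4,), name="inp2")
x = tf.keras.layers.Concatenate()([inp1, inp2])
out1 = tf.keras.layers.Dense(1, name="out1")(x)
out2 = tf.keras.layers.Dense(3, activation="softmax", name="out2")(x)
model = tf.keras.Model(inputs=[inp1, inp2], outputs=[out1, out2])
model.compile(optimizer="adam",
              loss={"out1": "mse", "out2": "sparse_categorical_crossentropy"})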

Performance benchmarks

The current implementation produces identical results to regular batch training, but some runtime degradation is expected, as a mini-batch is processed as multiple micro-batches.

However, the implementation could be sped up further by using the @tf.function decorator on the train_step method; the same applies to the test_step, which currently is not modified.
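A sketch of what decorating the overloaded steps could look like (not the current implementation; whether this actually speeds things up should be benchmarked, since Keras already traces the outer train function):

import tensorflow as tf

class DecoratedModel(tf.keras.Model):
    @tf.function  # trace once and reuse the compiled graph
    def train_step(self, data):
        return super().train_step(data)

    @tf.function
    def test_step(self, data):
        return super().test_step(data)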

Multi-GPU strategy support

The current implementation is not compatible with multi-GPU training.

However, it is possible, as shown in TensorFlowTTS.

We should try to incorporate some of the ideas demonstrated there into our package.

Codecov fails to capture overridden Optimizer methods running in the background

When we wrap the Optimizer to add gradient accumulation support, I noticed that codecov fails to see that the methods are in use.

I believe this happens as there is no real trace in the code itself that those methods are being used.

This happens within the parent class, and it is not easy to pinpoint why.

Hence, for codecov, we should likely just hide those methods, unless there is a way to get codecov to see them.

Model conversion weights size mismatch

Purpose: converting a trained model's precision from float16 to float32.

Issue: the set_weights method reports a size mismatch unless the model is recreated by wrapping it inside GAModelWrapper; the conversion works fine when wrapped.

Bug: Redirect imports not working

The accumulate methods live in gradient_accumulate/accumulators.py. However, it is annoying to have to import them via from gradient_accumulate.accumulators import GradientAccumulateOptimizer. It would be better if we could just do from gradient_accumulate import GradientAccumulateOptimizer.

My idea was to add the redirect in the __init__.py by adding from .accumulators import GradientAccumulateOptimizer. However, that does not seem to be enough, as we still get errors when running the CIs on the cloud. I've also tried adding from gradient_accumulator.accumulators import GradientAccumulateOptimizer, but with the same result...

Annoyingly, it all seems to work fine locally.
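For reference, the kind of re-export that was attempted in __init__.py (hypothetical sketch; module and class names as used elsewhere in this repository):

# gradient_accumulator/__init__.py
# Re-export so users can write:
#   from gradient_accumulator import GradientAccumulateModel, GradientAccumulateOptimizer
from .accumulators import GradientAccumulateModel, GradientAccumulateOptimizer

__all__ = ["GradientAccumulateModel", "GradientAccumulateOptimizer"]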

This is the verbose pytest output for one of the last builds (see the last job here):

==================================== ERRORS ====================================
__________________ ERROR collecting tests/test_train_mnist.py __________________
ImportError while importing test module '/home/runner/work/GradientAccumulator/GradientAccumulator/tests/test_train_mnist.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/opt/hostedtoolcache/Python/3.7.15/x64/lib/python3.7/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_train_mnist.py:4: in <module>
    from gradient_accumulator import GradientAccumulateModel
E   ImportError: cannot import name 'GradientAccumulateModel' from 'gradient_accumulator' (/opt/hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/gradient_accumulator/__init__.py)
------------------------------- Captured stderr --------------------------------
2023-01-29 21:16:11.551860: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.7.15/x64/lib
2023-01-29 21:16:11.551909: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
=============================== warnings summary ===============================
../../../../../opt/hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/flatbuffers/compat.py:19
  /opt/hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/flatbuffers/compat.py:19: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
    import imp

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
ERROR tests/test_train_mnist.py
!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!
========================= 1 warning, 1 error in 2.56s ==========================

Slight offset in result with/without GA

As observed in Issue #2 (comment), there is a slight discrepancy between using GA and regular batch training.

This can be observed by running the benchmark. To reproduce the issue, set up the virtualenv as described here, and run:

python benchmark.py --epochs 3 --accum_ops -1 --batchsize 256 --accum_steps 1
python benchmark.py --epochs 3 --accum_ops -1 --batchsize 1 --accum_steps 256

You will observe that the training logs show slight discrepancies in loss and accuracy, which should be identical at the end of an epoch for this particular example.

More in-detail results and discussion can be found here: #2 (comment)

Potential solution: it might be that the very final update is not performed using GA. We should verify whether that is the case.

Add TensorFlow 1 support?

As this implementation is only compatible with TF2, one could consider adding TF1 support.

However, this is tricky, as the train_step overloading approach was introduced in TF 2.2; therefore, one would need a different solution for older versions. This also means that this implementation does not work with TF 2.0-2.1, which is suboptimal.

Perhaps, we could try to get the optimizer wrapper solution working?

The original TF1-compatible implementation can be found here:
https://github.com/andreped/H2G-Net/blob/main/src/utils/accum_optimizers.py#L139

There already exists a similar solution for TF2, but it does not produce the same results as regular batch training:
https://github.com/andreped/GradientAccumulator/blob/main/gradient_accumulator/accumulator.py#L7

Compatibility with Batch Normalization

The current implementation of GA is not compatible with BN, as BN is updated for every single micro-batch and not when the full batch is finished.

We need a way to accumulate the mean/std parameters in BN, similarly to what is done with the gradients.

Perhaps it is possible to wrap the BN layer to add this accumulation?

Model conversion from mixed precision to full precision fails

Error collected:

File "/home/dbouget/Code/Private/neuro-segmentation/venv/lib/python3.8/site-packages/keras/engine/base_layer.py", line 1864, in set_weights
ย ย ย  raise ValueError(
ValueError: Layer gradient_accumulate_model weight shape (3, 3, 3, 1, 16) is not compatible with provided weight shape ().

Code ran:

mixed_model = load_model(mixed_precision_model_filepath, compile=False)
full_model = network.create()

full_model = GradientAccumulateModel(
    accum_steps=accumulated_gradients,
    mixed_precision=use_mixed_precision,
    use_agc=adaptive_gradient_clipping,
    clip_factor=0.01, eps=1e-3,
    inputs=full_model.input,
    outputs=full_model.output)

full_model.set_weights(mixed_model.get_weights())

Details: I've printed the weights' shapes for the mixed-precision model (top) and the full-precision model (bottom). The difference is in bold (though hard to see): a variable with empty shape () and dtype int32 sits between the encoder and decoder paths for the mixed-precision model, but at the end for the full-precision model.

[(3, 3, 3, 1, 16), (16,), (3, 3, 3, 16, 16), (16,), (3, 3, 3, 1, 32), (32,), (3, 3, 3, 48, 32), (32,), (3, 3, 3, 32, 32), (32,), (3, 3, 3, 1, 128), (128,), (3, 3, 3, 160, 128), (128,), (3, 3, 3, 128, 128), (128,), (3, 3, 3, 1, 256), (256,), (3, 3, 3, 384, 256), (256,), (3, 3, 3, 256, 256), (256,), (3, 3, 3, 256, 256), (256,), (3, 3, 3, 256, 256), (256,), (3, 3, 3, 256, 256), (256,), (1, 1, 1, 256, 128), (128,), (1, 1, 1, 256, 128), (128,), (1, 1, 1, 256, 1), (1,), (3, 3, 3, 512, 256), (256,), (3, 3, 3, 256, 256), (256,), (3, 3, 3, 128, 256), (128,), (1, 1, 1, 128, 64), (64,), (1, 1, 1, 128, 64), (64,), (1, 1, 1, 128, 1), (1,), (3, 3, 3, 256, 128), (128,), (3, 3, 3, 128, 128), (128,), (3, 3, 3, 32, 128), (32,), (1, 1, 1, 32, 16), (16,), (1, 1, 1, 32, 16), (16,), (1, 1, 1, 32, 1), (1,), (3, 3, 3, 64, 32), (32,), (3, 3, 3, 32, 32), (32,), (3, 3, 3, 16, 32), (16,), (1, 1, 1, 16, 8), (8,), (1, 1, 1, 16, 8), (8,), (1, 1, 1, 16, 1), (1,), (3, 3, 3, 32, 16), (16,), (3, 3, 3, 16, 16), (16,), (1, 1, 1, 16, 2), (2,), (1, 1, 1, 32, 2), (2,), (1, 1, 1, 128, 2), (2,), (1, 1, 1, 256, 2), (2,), **(),** (3, 3, 3, 1, 16), (16,), (3, 3, 3, 16, 16), (16,), (3, 3, 3, 1, 32), (32,), (3, 3, 3, 48, 32), (32,), (3, 3, 3, 32, 32), (32,), (3, 3, 3, 1, 128), (128,), (3, 3, 3, 160, 128), (128,), (3, 3, 3, 128, 128), (128,), (3, 3, 3, 1, 256), (256,), (3, 3, 3, 384, 256), (256,), (3, 3, 3, 256, 256), (256,), (3, 3, 3, 256, 256), (256,), (3, 3, 3, 256, 256), (256,), (3, 3, 3, 256, 256), (256,), (1, 1, 1, 256, 128), (128,), (1, 1, 1, 256, 128), (128,), (1, 1, 1, 256, 1), (1,), (3, 3, 3, 512, 256), (256,), (3, 3, 3, 256, 256), (256,), (3, 3, 3, 128, 256), (128,), (1, 1, 1, 128, 64), (64,), (1, 1, 1, 128, 64), (64,), (1, 1, 1, 128, 1), (1,), (3, 3, 3, 256, 128), (128,), (3, 3, 3, 128, 128), (128,), (3, 3, 3, 32, 128), (32,), (1, 1, 1, 32, 16), (16,), (1, 1, 1, 32, 16), (16,), (1, 1, 1, 32, 1), (1,), (3, 3, 3, 64, 32), (32,), (3, 3, 3, 32, 32), (32,), (3, 3, 3, 16, 32), (16,), (1, 1, 1, 16, 8), (8,), (1, 1, 1, 16, 8), (8,), (1, 1, 1, 16, 1), (1,), (3, 3, 3, 32, 16), (16,), (3, 3, 3, 16, 16), (16,), (1, 1, 1, 16, 2), (2,), (1, 1, 1, 32, 2), (2,), (1, 1, 1, 128, 2), (2,), (1, 1, 1, 256, 2), (2,)]

[(3, 3, 3, 1, 16), (16,), (3, 3, 3, 16, 16), (16,), (3, 3, 3, 1, 32), (32,), (3, 3, 3, 48, 32), (32,), (3, 3, 3, 32, 32), (32,), (3, 3, 3, 1, 128), (128,), (3, 3, 3, 160, 128), (128,), (3, 3, 3, 128, 128), (128,), (3, 3, 3, 1, 256), (256,), (3, 3, 3, 384, 256), (256,), (3, 3, 3, 256, 256), (256,), (3, 3, 3, 256, 256), (256,), (3, 3, 3, 256, 256), (256,), (3, 3, 3, 256, 256), (256,), (1, 1, 1, 256, 128), (128,), (1, 1, 1, 256, 128), (128,), (1, 1, 1, 256, 1), (1,), (3, 3, 3, 512, 256), (256,), (3, 3, 3, 256, 256), (256,), (3, 3, 3, 128, 256), (128,), (1, 1, 1, 128, 64), (64,), (1, 1, 1, 128, 64), (64,), (1, 1, 1, 128, 1), (1,), (3, 3, 3, 256, 128), (128,), (3, 3, 3, 128, 128), (128,), (3, 3, 3, 32, 128), (32,), (1, 1, 1, 32, 16), (16,), (1, 1, 1, 32, 16), (16,), (1, 1, 1, 32, 1), (1,), (3, 3, 3, 64, 32), (32,), (3, 3, 3, 32, 32), (32,), (3, 3, 3, 16, 32), (16,), (1, 1, 1, 16, 8), (8,), (1, 1, 1, 16, 8), (8,), (1, 1, 1, 16, 1), (1,), (3, 3, 3, 32, 16), (16,), (3, 3, 3, 16, 16), (16,), (1, 1, 1, 16, 2), (2,), (1, 1, 1, 32, 2), (2,), (1, 1, 1, 128, 2), (2,), (1, 1, 1, 256, 2), (2,), (3, 3, 3, 1, 16), (16,), (3, 3, 3, 16, 16), (16,), (3, 3, 3, 1, 32), (32,), (3, 3, 3, 48, 32), (32,), (3, 3, 3, 32, 32), (32,), (3, 3, 3, 1, 128), (128,), (3, 3, 3, 160, 128), (128,), (3, 3, 3, 128, 128), (128,), (3, 3, 3, 1, 256), (256,), (3, 3, 3, 384, 256), (256,), (3, 3, 3, 256, 256), (256,), (3, 3, 3, 256, 256), (256,), (3, 3, 3, 256, 256), (256,), (3, 3, 3, 256, 256), (256,), (1, 1, 1, 256, 128), (128,), (1, 1, 1, 256, 128), (128,), (1, 1, 1, 256, 1), (1,), (3, 3, 3, 512, 256), (256,), (3, 3, 3, 256, 256), (256,), (3, 3, 3, 128, 256), (128,), (1, 1, 1, 128, 64), (64,), (1, 1, 1, 128, 64), (64,), (1, 1, 1, 128, 1), (1,), (3, 3, 3, 256, 128), (128,), (3, 3, 3, 128, 128), (128,), (3, 3, 3, 32, 128), (32,), (1, 1, 1, 32, 16), (16,), (1, 1, 1, 32, 16), (16,), (1, 1, 1, 32, 1), (1,), (3, 3, 3, 64, 32), (32,), (3, 3, 3, 32, 32), (32,), (3, 3, 3, 16, 32), (16,), (1, 1, 1, 16, 8), (8,), (1, 1, 1, 16, 8), (8,), (1, 1, 1, 16, 1), (1,), (3, 3, 3, 32, 16), (16,), (3, 3, 3, 16, 16), (16,), (1, 1, 1, 16, 2), (2,), (1, 1, 1, 32, 2), (2,), (1, 1, 1, 128, 2), (2,), (1, 1, 1, 256, 2), (2,), **()**]

Alternative: of course, it remains possible to perform the model conversion simply by doing the following, but it prints a lot of warnings, as already pointed out in issue #25.

full_model.load_weights(mixed_precision_model_filepath)

TensorFlow import issues in CI

In a recent CI, it was observed that tensorflow failed to import due to protobuf.

This failed on both Python 3.8 and 3.9, with tensorflow==2.8.0 on Ubuntu 20.04.

It might be that we have to introduce strict version pinning of protobuf during TensorFlow installation.

2023-04-12 09:52:22.545216: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.8.16/x64/lib
2023-04-12 09:52:22.545242: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/runner/work/GradientAccumulator/GradientAccumulator/gradient_accumulator/__init__.py", line 1, in <module>
    from .accumulators import GradientAccumulateOptimizer
  File "/home/runner/work/GradientAccumulator/GradientAccumulator/gradient_accumulator/accumulators.py", line 1, in <module>
    import tensorflow as tf
  File "/opt/hostedtoolcache/Python/3.8.[16](https://github.com/andreped/GradientAccumulator/actions/runs/4676953980/jobs/8283906385#step:7:17)/x64/lib/python3.8/site-packages/tensorflow/__init__.py", line 37, in <module>
    from tensorflow.python.tools import module_util as _module_util
  File "/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/tensorflow/python/__init__.py", line 37, in <module>
    from tensorflow.python.eager import context
  File "/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/tensorflow/python/eager/context.py", line 29, in <module>
    from tensorflow.core.framework import function_pb2
  File "/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/tensorflow/core/framework/function_pb2.py", line 16, in <module>
    from tensorflow.core.framework import attr_value_pb2 as tensorflow_dot_core_dot_framework_dot_attr__value__pb2
  File "/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/tensorflow/core/framework/attr_value_pb2.py", line 16, in <module>
    from tensorflow.core.framework import tensor_pb2 as tensorflow_dot_core_dot_framework_dot_tensor__pb2
  File "/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/tensorflow/core/framework/tensor_pb2.py", line 16, in <module>
    from tensorflow.core.framework import resource_handle_pb2 as tensorflow_dot_core_dot_framework_dot_resource__handle__pb2
  File "/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/tensorflow/core/framework/resource_handle_pb2.py", line 16, in <module>
    from tensorflow.core.framework import tensor_shape_pb2 as tensorflow_dot_core_dot_framework_dot_tensor__shape__pb2
  File "/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/tensorflow/core/framework/tensor_shape_pb2.py", line 36, in <module>
    _descriptor.FieldDescriptor(
  File "/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/google/protobuf/descriptor.py", line 561, in __new__
    _message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

Protobuf issues in TensorFlow 2.12

Due to a dependency update in tensorflow-datasets, which does not use strict enough versioning, protobuf seems to be broken.

I managed to fix it for tensorflow < 2.12 by force-reinstalling protobuf with a specific version like so:

pip install "protobuf<=3.20" --force-reinstall

However, for whatever reason, this does not seem to work for the latest tensorflow version.

This was observed on ubuntu-20.04 with Python 3.9.

pytest-cov with multiprocessing results in occasional deadlock

I need to use multiprocessing for the mixed precision unit test. After adding this test, the cov CI seemed to take much longer than expected. This is likely due to a deadlock issue when spawning or killing the process.

Note that all tests in pytest run in the same process. Because TensorFlow sets various things globally, we occasionally need separate processes to properly test in a fresh environment. Mixed precision is one of those situations: after enabling it, I don't see a way to disable it again.

For more information regarding how to use multiprocessing in pytest-cov, see here.

Optimizer invariance

We should add a unit test to verify whether GA behaves the same across different optimizers.

Issue with 3D layers

Traceback (most recent call last):
  File "train.py", line 229, in <module>
    model.fit(
  File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/func_graph.py", line 1147, in autograph_handler
    raise e.ag_error_metadata.to_exception(e)
ValueError: in user code:

File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1021, in train_function  *
    return step_function(self, iterator)
File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1010, in step_function  **
    outputs = model.distribute_strategy.run(run_step, args=(data,))
File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1000, in run_step  **
    outputs = model.train_step(data)
File "/usr/local/lib/python3.8/dist-packages/gradient_accumulator/GAModelWrapper.py", line 55, in train_step
    gradients = agc.adaptive_clip_grad(self.trainable_variables, gradients, clip_factor=self.clip_factor, eps=self.eps)
File "/usr/local/lib/python3.8/dist-packages/gradient_accumulator/agc.py", line 38, in adaptive_clip_grad
    p_norm = unitwise_norm(params)
File "/usr/local/lib/python3.8/dist-packages/gradient_accumulator/agc.py", line 31, in unitwise_norm
    raise ValueError(f"Got a parameter with shape not in [1, 2, 4]! {x}")

ValueError: Got a parameter with shape not in [1, 2, 4]! <AutoCastVariable 'conv3d/kernel:0' shape=(3, 3, 3, 1, 16) dtype=float32 dtype_to_cast_to=float32>

Overflow/underflow issues

When training deep neural networks in general, overflow/underflow can occur.

When we are accumulating gradients, it is even more likely to occur, and it can have a greater impact on training, as the overflow/underflow can be more severe.

This is also relevant for mixed precision, as discussed in the TensorFlow docs.

For mixed precision, TensorFlow uses dynamic loss scaling to reduce the chance of underflow during backpropagation.

As we are also scaling the loss, and thus the gradients, during gradient accumulation, we likely need to do this scaling more carefully, such that we don't run into underflow. There is a discussion about this in this thread. We should try to incorporate some of those suggestions to further improve the robustness of our method.
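One possible direction (a sketch only, not what the package currently does) is to divide each micro-batch gradient by the number of accumulation steps before adding it to the accumulator, so the running sum stays in the same range as a single-batch gradient:

import tensorflow as tf

def accumulate_scaled(accumulators, gradients, accum_steps):
    # Dividing before accumulating keeps the running sum in roughly the same
    # range as a single-batch gradient, reducing overflow risk in float16.
    for acc, grad in zip(accumulators, gradients):
        if grad is not None:
            acc.assign_add(grad / tf.cast(accum_steps, grad.dtype))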

GradientAccumulator wrapper not working as expected

In gradient accumulation, we try to update the weights only after a given number of iterations (k batches), in an ensemble-like manner: for instance, by averaging the gradients calculated over k batches and only then updating the weights, simulating regular batch training.

After running the benchmark described here, using:

  1. batch_size=32, accum_steps=1, epochs=3
  2. batch_size=8, accum_steps=4, epochs=12

We do not get the same results. It seems like the weights are updated for every batch even though we use accum_steps > 1.

Both the original wrapper implementation GradientAccumulator and the Adam-based wrapper AdamAccumulate suffer from this.

Are we actually able to control when the weights are updated from within the optimizer, or can we only compute the gradients and enforce the update ourselves?

Obviously, we can write our own training loop, but the whole point is to have a simple wrapper class which handles all this for us.

Python support 3.11-12?

Currently, we only support Python 3.6-3.10.

We could support 3.11-3.12, but for Windows I believe there does not yet exist a precompiled TensorFlow wheel.

Nonetheless, we could still add these two versions to the CIs and see if they work.

GAModelWrapper fails if gradients are not defined

When using GA in a project, I observed that when gradients failed to be defined in some part of the network, it would raise a ValueError, whereas without GA it would only give a warning.

We should try to match the behaviour of the non-overloaded train_step in Model.fit().

Traceback (most recent call last):
  File "main.py", line 103, in <module>
    main()
  File "main.py", line 97, in main
    trainer(ret)
  File "/home/andrep/workspace/bcgrade/milai/train.py", line 205, in trainer
    model.fit(
  File "/home/andrep/workspace/bcgrade/venv/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/tmp/__autograph_generated_fileisn3gvrb.py", line 15, in tf__train_function
    retval_ = ag__.converted_call(ag__.ld(step_function), (ag__.ld(self), ag__.ld(iterator)), None, fscope)
  File "/home/andrep/workspace/bcgrade/venv/lib/python3.8/site-packages/gradient_accumulator/GAModelWrapper.py", line 65, in train_step
    self.gradient_accumulation[i].assign_add(gradients[i])
ValueError: in user code:

    File "/home/andrep/workspace/bcgrade/venv/lib/python3.8/site-packages/keras/engine/training.py", line 1051, in train_function  *
        return step_function(self, iterator)
    File "/home/andrep/workspace/bcgrade/venv/lib/python3.8/site-packages/keras/engine/training.py", line 1040, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/home/andrep/workspace/bcgrade/venv/lib/python3.8/site-packages/keras/engine/training.py", line 1030, in run_step  **
        outputs = model.train_step(data)
    File "/home/andrep/workspace/bcgrade/venv/lib/python3.8/site-packages/gradient_accumulator/GAModelWrapper.py", line 65, in train_step
        self.gradient_accumulation[i].assign_add(gradients[i])

    ValueError: None values not supported.

resource_apply_sparse method in Optimizer wrapper is broken

As it was observed that the resource_apply_sparse method in the AccumulateOptimizerWrapper was never used, a unit test including the Keras Embedding layer, which is commonly used in NLP, was added. This resulted in the error seen below.

Note that if we remove the optimizer wrapper it works just fine; the same applies to removing the Embedding layer. Hence, the resource_apply_sparse method in the optimizer wrapper is likely broken.
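A minimal model exercising sparse gradient updates, similar in spirit to the added unit test (hypothetical sketch, not the test itself):

import tensorflow as tf

# Embedding layers produce IndexedSlices gradients, which exercise the
# _resource_apply_sparse path of the optimizer.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=100, output_dim=8, input_length=4),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="sgd", loss="mse")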

Error log:

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/usr/local/lib/python3.8/site-packages/keras/utils/traceback_utils.py:67: in error_handler
    raise e.with_traceback(filtered_tb) from None
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

args = (<tensorflow.python.data.ops.iterator_ops.OwnedIterator object at 0x1331ce6d0>,), kwargs = {}

    def autograph_handler(*args, **kwargs):
      """Calls a converted version of original_func."""
      # TODO(mdan): Push this block higher in tf.function's call stack.
      try:
        return autograph.converted_call(
            original_func,
            args,
            kwargs,
            options=autograph.ConversionOptions(
                recursive=True,
                optional_features=autograph_options,
                user_requested=True,
            ))
      except Exception as e:  # pylint:disable=broad-except
        if hasattr(e, "ag_error_metadata"):
>         raise e.ag_error_metadata.to_exception(e)
E         tensorflow.python.autograph.impl.api.StagingError: in user code:
E         
E             File "/usr/local/lib/python3.8/site-packages/keras/engine/training.py", line 1051, in train_function  *
E                 return step_function(self, iterator)
E             File "/Users/andreped/workspace/GradientAccumulator/gradient_accumulator/accumulators.py", line 222, in _apply  *
E                 train_op = self.optimizer._resource_apply_sparse(
E             File "/usr/local/lib/python3.8/site-packages/keras/optimizers/optimizer_v2/gradient_descent.py", line 175, in _resource_apply_sparse
E                 momentum_var = self.get_slot(var, "momentum")
E             File "/usr/local/lib/python3.8/site-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 961, in get_slot
E                 slot_dict = self._slots[var_key]
E         
E             KeyError: 'embedding/embeddings_13'

/usr/local/lib/python3.8/site-packages/tensorflow/python/framework/func_graph.py:1127: StagingError

Poor convergence with gradient accumulation

It was observed that training a model with high class imbalance resulted in poor convergence.

The model did not converge at all. It was trained with batch size 2 and accum_steps 8, testing learning rates 1e-3 and 5e-4 with the Adam optimizer.

Bug: Older TF version incompatible with the optimizer wrapper

This seems to be related to tf-addons and not GradientAccumulateOptimizer. Hence, if we remove the tf-addons dependency, it should not be an issue. However, we use tf-addons for fancy type hints. Worth it?

This is the error log I get for tensorflow==2.2.0:

Run python -c "from gradient_accumulator import GradientAccumulateModel, GradientAccumulateOptimizer"
/opt/hostedtoolcache/Python/3.6.15/x64/lib/python3.6/site-packages/tensorflow_addons/utils/ensure_tf_install.py:67: UserWarning: Tensorflow Addons supports using Python ops for all Tensorflow versions above or equal to 2.4.0 and strictly below 2.7.0 (nightly versions are not supported).
 The versions of TensorFlow you are currently using is 2.2.0 and is not supported.
Some things might work, some things might not.
If you were to encounter a bug, do not file an issue.
If you want to make sure you're using a tested and supported configuration, either change the TensorFlow version or the TensorFlow Addons's version.
You can find the compatibility matrix in TensorFlow Addon's readme:
https://github.com/tensorflow/addons
  UserWarning,
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/runner/work/GradientAccumulator/GradientAccumulator/gradient_accumulator/__init__.py", line 1, in <module>
    from .accumulators import GradientAccumulateOptimizer
  File "/home/runner/work/GradientAccumulator/GradientAccumulator/gradient_accumulator/accumulators.py", line 3, in <module>
    from tensorflow_addons.utils import types
  File "/opt/hostedtoolcache/Python/3.6.15/x64/lib/python3.6/site-packages/tensorflow_addons/__init__.py", line 21, in <module>
    from tensorflow_addons import activations
  File "/opt/hostedtoolcache/Python/3.6.15/x64/lib/python3.6/site-packages/tensorflow_addons/activations/__init__.py", line 17, in <module>
    from tensorflow_addons.activations.gelu import gelu
  File "/opt/hostedtoolcache/Python/3.6.15/x64/lib/python3.6/site-packages/tensorflow_addons/activations/gelu.py", line 19, in <module>
    from tensorflow_addons.utils.types import TensorLike
  File "/opt/hostedtoolcache/Python/3.6.15/x64/lib/python3.6/site-packages/tensorflow_addons/utils/types.py", line 26, in <module>
    from tensorflow.python.keras.engine import keras_tensor
ImportError: cannot import name 'keras_tensor'

Better compatibility with batch normalization

As you all know, gradient accumulation is not directly compatible with batch normalization, as batch normalization will be updated for every single forward step and we cannot control that externally.

In order to get the same behaviour as for gradient updates, we will likely need to implement a custom batch normalization layer which does this internally, as overloading the original batch norm step seems challenging (due to its extreme complexity).

Total params is twice as large

When loading a trained model that has been wrapped using GAModelWrapper, we get:

Model: "ga_model_wrapper"
_______________________________________________________________________________
[... Layer (type) info stuff ...]
================================================
[...]
================================================
Total params: 171,700,165
Trainable params: 85,850,082
Non-trainable params: 85,850,083
_______________________________________________________________________________

The problem seems to be that the model contains double the number of params, which likely comes from all layers having been renamed, such that two copies of the same model exist in the graph.

This might also impact GPU memory allocation.

The expected behaviour is for the layer/weight names to stay the same.

AGC and mixed precision not compatible?

When training a 3D U-Net like architecture, with adaptive gradient clipping and mixed precision enabled, it was observed that the loss was increasing instead of decreasing.

The network trained fine with AGC enabled and mixed precision disabled, or vice versa. Hence, these two features do not seem to be compatible.

Different behaviour between SGD and Adam optimizers with OptimizerWrapper

There is a simple test script to reproduce this issue here.

Simply swap SGD with Adam and see for yourself.

This might be due to when the MEAN reduction is performed. It might be performed too late in the computation (see here). That is, if this is done too late, it might be that dynamic optimizers such as Adam update the learning rate too frequently. However, I'm not sure.

Mixed precision on TPU

Currently, we support using mixed_float16 dtype, which can be used with NVIDIA dedicated GPUs. However, for those interested in using TPUs on Google Colab, we would need to add support for mixed_bfloat16 as well.

Currently, mixed precision may be enabled by setting mixed_precision=True as an argument to the GAModelWrapper. However, an idea could be to have a dtype argument instead, where dtype=None means normal float32 training, and dtype=mixed_float16 or dtype=mixed_bfloat16 enables mixed precision on GPU and TPU, respectively.
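For the TPU case, the standard Keras policy is mixed_bfloat16; a sketch of what the proposed dtype argument would map to under the hood:

from tensorflow.keras import mixed_precision

# GPU: float16 compute with float32 variables (relies on loss scaling)
mixed_precision.set_global_policy("mixed_float16")

# TPU: bfloat16 compute with float32 variables (no loss scaling needed)
mixed_precision.set_global_policy("mixed_bfloat16")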

Citation policy

Should add citation policy.

Probably a good idea to publish it on Zenodo, similar to what was done for the livermask project.

Mixed precision

As mixed precision is quite popular, it would be great if GAModelWrapper were compatible with it.

Requires some minor modifications to the train_step method, most likely.

Typo ACG

We have consistently misspelled agc in the source code as acg. AGC stands for Adaptive Gradient Clipping, that is, we have flipped the c and g.

Swap setup.py with TOML format for installation

When installing gradient-accumulator as part of a requirements.txt file in a separate library, I observed the prompt below.

We should likely move over to pyproject.toml instead or enable the --use-pep517 option as mentioned below:

DEPRECATION: gradient-accumulator is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at https://github.com/pypa/pip/issues/8559
