tf-metal-experiments

TensorFlow Metal Backend on Apple Silicon Experiments (just for fun)

Setup

This has been tested on M1-series Apple Silicon SoCs only.

TensorFlow 2.x

  1. Follow the official instructions from Apple (https://developer.apple.com/metal/tensorflow-plugin/).
  2. Test that your Metal GPU is working by running tf.config.list_physical_devices("GPU"); you should see 1 GPU present (it is not named). Later, when you actually use the GPU, there will be a more informative printout that says Metal device set to: Apple M1 Max or similar (see the snippet after this list).
  3. Now you should be ready to run any TF code that doesn't require external libraries.
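A minimal check, assuming tensorflow-macos and tensorflow-metal are installed per step 1:

import tensorflow as tf

# Expect one unnamed GPU PhysicalDevice from the Metal plugin.
print(tf.config.list_physical_devices("GPU"))

# Running any op on the GPU triggers the more informative
# "Metal device set to: Apple M1 Max" style log message.
x = tf.random.normal((1024, 1024))
print(tf.reduce_sum(tf.matmul(x, x)))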

HuggingFace Transformers library

If you want to play around with Transformer models (with the TF Metal backend, of course), you will need to install the HuggingFace Transformers library.

  1. Install the regex library (I don't know why it has to be like this, but yeah): python3 -m pip install --upgrade regex --no-use-pep517. You might need to run xcode-select --install if the above command doesn't work.
  2. pip install transformers ipywidgets
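A quick sanity check that Transformers works against the TF backend (distilbert-base-uncased is the same model used in the benchmarks below):

from transformers import AutoTokenizer, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = TFAutoModel.from_pretrained("distilbert-base-uncased")

inputs = tokenizer("Hello from the Metal backend!", return_tensors="tf")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, 768)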

Experiments and Benchmarks

After some trial and error, here are some initial benchmarks for what should be approximately the best capability of the M1 Max.

  • For all the cases here, increasing the batch size does not seem to increase throughput.
  • High Power Mode enabled + plugged into the charger (this does not seem to affect the benchmarks anyway).

Power draw also doesn't seem to be able to go much higher than ~40W:

  • Power draw from the GPU (averaged over 1 second) can be measured with sudo powermetrics --samplers gpu_power -i1000 -n1.
  • I decided to report peak power as observed via asitop (see: tlkh/asitop)
Model        GPU         BatchSize  Throughput   Peak Power  Memory
ResNet50     M1 Max 32c  128        140 img/sec  42W         21 GB
MobileNetV2  M1 Max 32c  128        352 img/sec  37W         13 GB
DistilBERT   M1 Max 32c  64         120 seq/sec  35W         9 GB
BERTLarge    M1 Max 32c  16         19 seq/sec   36W         14 GB

The benchmark scripts used are included in this repo.

python train_benchmark.py --type cnn --model resnet50
python train_benchmark.py --type cnn --model mobilenetv2
python train_benchmark.py --type transformer --model distilbert-base-uncased
python train_benchmark.py --type transformer --model bert-large-uncased --bs 16

Reference Benchmarks from RTX 3090

Same Batch Size as M1:

Model        GPU   BatchSize  Throughput     Power
ResNet50     3090  128        1100 img/sec   360W
MobileNetV2  3090  128        2001 img/sec   340W
DistilBERT   3090  64         1065 seq/sec   360W
BERTLarge    3090  16         131 seq/sec    335W

Larger Batch Size:

Model        GPU   BatchSize  Throughput     Power
ResNet50     3090  256        1185 img/sec   370W
MobileNetV2  3090  256        2197 img/sec   350W
DistilBERT   3090  256        1340 seq/sec   380W
BERTLarge    3090  64         193 seq/sec    365W

For the 3090, the same script is used, but with additional optimizations added that leverage hardware (Tensor Cores) and software (the XLA compiler) not present/working on the M1. The length of an epoch is also increased, as the 3090 is sometimes so fast that training finishes in seconds, and the overhead of starting/ending the run leads to poorer measurements.

Note: the 3090 is running at a 400W power limit. The CPU is a 5600X.

# config for NVIDIA Tensor Core GPU
# run with more steps, XLA and FP16 (enable tensor core aka mixed precision)
python train_benchmark.py --type cnn --model resnet50 --xla --fp16 --steps 100
python train_benchmark.py --type cnn --model mobilenetv2 --xla --fp16 --steps 100
python train_benchmark.py --type transformer --model distilbert-base-uncased --xla --fp16 --steps 100
python train_benchmark.py --type transformer --model bert-large-uncased --bs 16 --xla --fp16 --steps 30
# If no Tensor Core, remove --fp16 flag

Measuring Achievable TFLOPS

We can use TF to write a matrix multiplication benchmark to estimate the maximum compute performance we can get out of an M1 Max. It seems we can get upwards of 8 TFLOPS for large enough problem sizes.

The plot can be generated using tflops_sweep.py.

Note that FP64 and FP16 performance appears to be non-existent (the code automatically falls back to the CPU if FP64 or FP16 is specified as the data type).
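For reference, a minimal sketch of the kind of measurement tflops_sweep.py performs (illustrative code, not the repo's exact script):

import time
import tensorflow as tf

def matmul_tflops(n, dtype=tf.float32, iters=10):
    a = tf.random.normal((n, n), dtype=dtype)
    b = tf.random.normal((n, n), dtype=dtype)
    tf.matmul(a, b)  # warm-up so setup/placement cost is excluded
    start = time.time()
    for _ in range(iters):
        c = tf.matmul(a, b)
    _ = c.numpy()  # block until the GPU work is done
    elapsed = time.time() - start
    # An n x n matmul is ~2*n^3 FLOPs (one multiply + one add per term).
    return 2 * n**3 * iters / elapsed / 1e12

print(matmul_tflops(4096), "TFLOPS")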


tf-metal-experiments's Issues

Graph execution error

python train_benchmark.py --type cnn --model resnet50

yields the following:

2023-01-05 12:38:02.090126: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:418 : NOT_FOUND: could not find registered platform with id: 0x1051fc880
2023-01-05 12:38:02.090152: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:418 : NOT_FOUND: could not find registered platform with id: 0x1051fc880
2023-01-05 12:38:02.168153: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:418 : NOT_FOUND: could not find registered platform with id: 0x1051fc880
2023-01-05 12:38:02.168185: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:418 : NOT_FOUND: could not find registered platform with id: 0x1051fc880
2023-01-05 12:38:02.168205: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:418 : NOT_FOUND: could not find registered platform with id: 0x1051fc880
2023-01-05 12:38:02.168225: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:418 : NOT_FOUND: could not find registered platform with id: 0x1051fc880
Traceback (most recent call last):
  File "/Users/mac.user/git/tf-metal-experiments/train_benchmark.py", line 45, in <module>
    _ = model.fit(x=dataset_x, y=dataset_y, batch_size=args.bs, epochs=1, verbose=1)
  File "/Users/mac.user/Library/Python/3.9/lib/python/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/Users/mac.user/Library/Python/3.9/lib/python/site-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.NotFoundError: Graph execution error:

Detected at node 'StatefulPartitionedCall_212' defined at (most recent call last):
  File "/Users/mac.user/git/tf-metal-experiments/train_benchmark.py", line 45, in <module>
    _ = model.fit(x=dataset_x, y=dataset_y, batch_size=args.bs, epochs=1, verbose=1)
  File "/Users/mac.user/Library/Python/3.9/lib/python/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
    return fn(*args, **kwargs)
  File "/Users/mac.user/Library/Python/3.9/lib/python/site-packages/keras/engine/training.py", line 1650, in fit
    tmp_logs = self.train_function(iterator)
  File "/Users/mac.user/Library/Python/3.9/lib/python/site-packages/keras/engine/training.py", line 1249, in train_function
    return step_function(self, iterator)
  File "/Users/mac.user/Library/Python/3.9/lib/python/site-packages/keras/engine/training.py", line 1233, in step_function
    outputs = model.distribute_strategy.run(run_step, args=(data,))
  File "/Users/mac.user/Library/Python/3.9/lib/python/site-packages/keras/engine/training.py", line 1222, in run_step
    outputs = model.train_step(data)
  File "/Users/mac.user/Library/Python/3.9/lib/python/site-packages/keras/engine/training.py", line 1027, in train_step
    self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
  File "/Users/mac.user/Library/Python/3.9/lib/python/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 527, in minimize
    self.apply_gradients(grads_and_vars)
  File "/Users/mac.user/Library/Python/3.9/lib/python/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1140, in apply_gradients
    return super().apply_gradients(grads_and_vars, name=name)
  File "/Users/mac.user/Library/Python/3.9/lib/python/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 634, in apply_gradients
    iteration = self._internal_apply_gradients(grads_and_vars)
  File "/Users/mac.user/Library/Python/3.9/lib/python/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1166, in _internal_apply_gradients
    return tf.__internal__.distribute.interim.maybe_merge_call(
  File "/Users/mac.user/Library/Python/3.9/lib/python/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1216, in _distributed_apply_gradients_fn
    distribution.extended.update(
  File "/Users/mac.user/Library/Python/3.9/lib/python/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1211, in apply_grad_to_update_var
    return self._update_step_xla(grad, var, id(self._var_key(var)))
Node: 'StatefulPartitionedCall_212'
could not find registered platform with id: 0x1051fc880
 [[{{node StatefulPartitionedCall_212}}]] [Op:__inference_train_function_16598]
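For readers hitting this: a plausible workaround (an assumption, not something confirmed in this thread) is that the newer Keras optimizers XLA-compile their update step (_update_step_xla in the traceback), which tensorflow-metal cannot serve; switching to a legacy optimizer typically avoids it:

import tensorflow as tf

# Hypothetical minimal model; the key line is the legacy optimizer,
# whose update step is not XLA-jitted and so avoids the
# "could not find registered platform" NOT_FOUND error on Metal.
model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
model.compile(optimizer=tf.keras.optimizers.legacy.Adam(1e-3),
              loss="sparse_categorical_crossentropy")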

Issues from "Big Sur" to "Monterey"

Hi all,

I just updated my M1 to macOS Monterey, and my TensorFlow broke (problems with memory allocation and malloc).

Then I reinstalled tensorflow-metal, and now it only trains on the GPU.

In my experiments I got better results when training on small batches, and therefore it was a lot faster for me to train in 'any' mode (I think it uses the CPU and Neural Engine).
This was the code I used:

from tensorflow.python.compiler.mlcompute import mlcompute
mlcompute.set_mlc_device(device_name='any')

Now, with tf-metal, how is it possible to train on the CPU and/or Neural Engine?

Thanks in advance
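As far as I know, tensorflow-metal exposes no mlcompute-style device selector and the Neural Engine is not visible to TensorFlow at all; pinning work to the CPU can still be done with an explicit device scope (a sketch, not an official recipe):

import tensorflow as tf

# Force ops onto the CPU instead of the Metal GPU.
with tf.device("/CPU:0"):
    x = tf.random.normal((1024, 1024))
    y = tf.matmul(x, x)
print(y.device)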

smaller batch sizes

Thank you!
Because of unified memory, I wonder if TensorFlow training on the M1 Max would be less badly impacted by smaller batch sizes than training on the RTX 3090 is.
I would love to see comparisons of ResNet and MobileNet at batch size 16, on the M1 Max and RTX 3090.
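Assuming the --bs flag works the same way as in the BERTLarge example above, the requested runs would be:

python train_benchmark.py --type cnn --model resnet50 --bs 16
python train_benchmark.py --type cnn --model mobilenetv2 --bs 16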

Disable Eager Execution

I can't find in the code whether eager mode is disabled. It is enabled by default in TF2.

This can make a huge difference in performance, as eager mode is more of a debug mode and is generally not used for performance benchmarking.

Is eager mode disabled in these benchmarks? If not, I would be curious how the results change, as I have at times noted >2x differences in performance between the modes.
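For what it's worth (not something the repo states): Keras model.fit traces the training step into a tf.function and runs it as a graph by default, so eager mode mostly matters for custom training loops. The relevant switches look like this:

import tensorflow as tf

print(tf.executing_eagerly())  # True by default in TF2

# Per-model: run fit/evaluate/predict eagerly for debugging
# (graph execution, run_eagerly=False, is the default).
# model.compile(..., run_eagerly=True)

# Global TF1-style switch; must be called before any ops run:
# tf.compat.v1.disable_eager_execution()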

Low results for 3090?

Ross Wightman reports 2400-2600 img/sec for different variations of ResNet-50 (at approx. 100 ms per batch, and thus batch size 256). I'm surprised to see an almost threefold difference here.

TFLOPS is off by a factor of 2 in conv_benchmark.py?

On line 20, you are computing the TFLOPS as:
conv_flops = MN * MN * CK * CK * HW * HW

Wouldn't it actually be 2x this, since each of those points is an add and a mul? I see the "*2" in tflops_sweep.py.

That brings the computed speed from 4.8 TFLOPS to 9.6 TFLOPS, a lot closer to the 10.4 TFLOPS theoretical max.

(Though since it's a 3x3 kernel, it might be a Winograd conv.)
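To make the factor-of-2 convention concrete, here is an illustrative count (variable names are mine, not conv_benchmark.py's): each output element of a convolution accumulates C_in * K * K multiply-add pairs, and the usual convention counts a multiply and an add as 2 FLOPs.

def conv_flops(h, w, c_in, c_out, k):
    # Standard FLOP count for a KxK conv producing an h x w x c_out map.
    macs = h * w * c_out * c_in * k * k  # multiply-accumulate operations
    return 2 * macs                      # 1 mul + 1 add per MAC

# e.g. one 3x3, 64->64 channel conv on a 56x56 feature map:
print(conv_flops(56, 56, 64, 64, 3) / 1e9, "GFLOPs")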

Can the M1 Max use all of its memory in training???

The M1 Max has 64 GB of RAM.

Q1.
For anyone who wants to train models on the M1 Max, the biggest reason for choosing it is the huge VRAM, not the throughput.
I would like to know how much memory can actually be used when training a model on the M1 Max (60 GB? 55 GB???).

Q2.
Have you had any problems with tensorflow-metal???
Do you have any plans to extend this post with notes on problems and solutions for M1 training with TF (some code, or compatibility with major ML/deep learning packages)?
It is just my suggestion, but I think many people would want this experience shared.

Thank you!
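One way to watch memory usage during training, assuming tensorflow-metal supports the standard memory-info API (unverified on the Metal plugin):

import tensorflow as tf

# Reports bytes currently allocated on the device and the peak so far.
info = tf.config.experimental.get_memory_info("GPU:0")
print(info["current"] / 1e9, "GB in use;", info["peak"] / 1e9, "GB peak")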

Adding a PyTorch benchmark?

PyTorch just released early support for the Metal Performance Shaders (MPS) device (see the official blog). I think this issue might be off-topic, as the project name is literally tf-metal-experiments, but it would be nice to have baseline comparisons between TensorFlow and PyTorch performance.

But note that (at least as of right now) a noticeable performance gain seems to require an M1 Ultra chip and a large enough batch size, and a lot of operations are not yet supported (see pytorch/pytorch#77764).
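For reference, selecting the MPS backend in PyTorch looks like this (a minimal sketch, assuming a PyTorch build with MPS support is installed):

import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
x = torch.randn(1024, 1024, device=device)
print((x @ x).device)  # mps:0 when the Metal backend is active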

Did anyone try the M1 Pro?

Hi, I am hesitating over whether to pay extra for the Max to get the best ML performance.

Did anyone run the same tests on an M1 Pro?

M1 Max Thermal Throttling

Hi @tlkh

Thank you so much for creating and running these benchmarks. I would be interested in whether thermal throttling affects training speed (ms/step) after a few minutes.

Could you share your training logs from running e.g. bm_rn50.py with benchmark_epochs = 20?

Adding A100 and other configs to the comparison, for fun

Summary tables (more details below and in comments):

Without optimizations (-):

Model  M1 7c  M1 32c  A100  V100  P100  T4   K80  Quadro P5000  Quadro M4000  Quadro RTX 4000
RN50   10     135     611   347   211   134  F    131           F             F
MNV2   23     352     269   187   125   193  94   181           F             F
DBERT  15     120     761   187   149   94   47   109           39            129
BERTL  1      18      136   31    16    15   4    17            F             F

With optimizations (+):

Model  M1 7c  M1 32c  A100  V100  P100  T4  K80  Quadro P5000  Quadro M4000  Quadro RTX 4000
RN50   10     135     1147  v100  252   na  na   na            na            na
MNV2   23     352     1870  v100  128   na  na   na            na            na
DBERT  15     120     1909  v100  209   na  na   na            na            na
BERTL  1      18      309   v100  23    na  na   na            na            na

Really good and useful work here --- thank you! I can potentially fill in the remaining values if you think this would be of interest. Note: I simply copy-pasted the content of your .py files and ran them in a notebook. Also note: accuracy will improve by using float32, and as the OP indicates the M1 can only use float32, so the comparison should be without optimization, IMO. FWIW, in my understanding the RTX 3090 and A100 are somewhat similar in benchmarks like these.


M1 on an MBA (7-core GPU, 8GB RAM). Completely froze my laptop! Even the trackpad stopped responding... first time ever I've noticed this kind of slowdown/lag on this MBA, but results are incoming! Obviously it's taking forever...

Model        GPU    BatchSize  Throughput    Power  Memory
ResNet50     M1 7c  64         10.3 img/sec  ?      ?
MobileNetV2  M1 7c  128        22.7 img/sec  ?      ?
DistilBERT   M1 7c  64         15.2 seq/sec  ?      ?
BERTLarge    M1 7c  32         0.6 seq/sec   ?      ?

With optimization:

Model        GPU        BatchSize  Throughput      Power  Memory
ResNet50     A100 40GB  64         1147.4 img/sec  ?      ?
MobileNetV2  A100 40GB  128        1869.7 img/sec  ?      ?
DistilBERT   A100 40GB  64         1909.3 seq/sec  ?      ?
BERTLarge    A100 40GB  32         309.3 seq/sec   ?      ?

Without optimization:

Model        GPU        BatchSize  Throughput     Power  Memory
ResNet50     A100 40GB  64         610.8 img/sec  ?      ?
MobileNetV2  A100 40GB  128        269.4 img/sec  ?      ?
DistilBERT   A100 40GB  64         761.1 seq/sec  ?      ?
BERTLarge    A100 40GB  32         135.5 seq/sec  ?      ?

Model        GPU         BatchSize  Throughput   Power  Memory
ResNet50     M1 Max 32c  64         135 img/sec  40W    13 GB
MobileNetV2  M1 Max 32c  128        352 img/sec  37W    15 GB
DistilBERT   M1 Max 32c  64         120 seq/sec  35W    9 GB
BERTLarge    M1 Max 32c  32         18 seq/sec   36W    14 GB

Model        GPU   BatchSize  Throughput    Power
ResNet50     3090  64         957 img/sec   300W
MobileNetV2  3090  128        1927 img/sec  310W
DistilBERT   3090  64         1040 seq/sec  310W
BERTLarge    3090  32         164 seq/sec   320W
