tlkh / tf-metal-experiments
TensorFlow Metal Backend on Apple Silicon Experiments (just for fun)
License: MIT License
Hi, I am unsure whether to pay extra for the M1 Max to get the best ML performance.
Has anyone run the same tests on the M1 Pro?
pip install transfomers ipywidgets
should be pip install transformers ipywidgets
but perhaps just install everything in Miniforge, as Apple recommends?
Hi @tlkh
Thank you so much for creating and running those benchmarks. I would be interested to know whether thermal throttling affects training speed (ms/step) after a few minutes.
Could you share your training logs when running e.g. bm_rn50.py for benchmark_epochs = 20?
Hi all,
I just updated my M1 to macOS Monterey, and my TensorFlow installation broke (problems with memory allocation and malloc).
I then reinstalled tensorflow-metal, and now it only trains on the GPU.
In my experiments I got better results when training on small batches, so it was a lot faster for me to train in mode 'any' (I think it uses the CPU and Neural Engine).
This was the code I used:
from tensorflow.python.compiler.mlcompute import mlcompute
mlcompute.set_mlc_device(device_name='any')
Now, with tf-metal, how is it possible to train on the CPU and/or the Neural Engine?
Thanks in advance
I can’t find in the code whether eager mode is disabled. It is enabled by default in TF2.
This can make a huge difference in performance as eager mode is more of a debug mode, and is generally not used for performance benchmarking.
Is eager mode disabled in these benchmarks? If not, I would be curious how the results change with it on as I have noted at times >2x differences in performance between the modes.
Ross Wightman reports 2400-2600 for different variations of ResNet-50 (with approx. 100ms per batch and thus bs 256). I'm surprised to see almost threefold difference here.
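The relation between batch time and throughput quoted above can be checked with a quick calculation. This is only a sketch: the function name is made up for illustration, and the inputs are the approximate figures from the comment (batch size 256, roughly 100 ms per batch).

```python
def throughput(batch_size: int, seconds_per_batch: float) -> float:
    """Samples processed per second for a given batch size and step time."""
    if seconds_per_batch <= 0:
        raise ValueError("seconds_per_batch must be positive")
    return batch_size / seconds_per_batch


# Batch size 256 at roughly 100 ms per batch, as quoted in the comment.
images_per_sec = throughput(256, 0.100)
print(f"{images_per_sec:.0f} img/sec")
```

The same function can be applied to the ms/step values that Keras prints during model.fit, which makes it easier to compare runs that used different batch sizes.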
PyTorch just released early support for the Metal Performance Shaders (mps) device, according to the official blog. I think this issue might be off-topic as the project name is literally tf-metal-experiments, but it would be nice to have a baseline comparison between TensorFlow and PyTorch benchmark performance.
Note, however, that (at least as of right now) a noticeable performance gain seems to come with the M1 Ultra chip and a large enough batch size, and a lot of operations are not supported (see pytorch/pytorch#77764).
Thank you !
Because of unified memory, I wonder whether TensorFlow training on the M1 Max would be less badly impacted by smaller batch sizes than training on the RTX 3090...
I would love to see comparisons of ResNet and MobileNet at batch size 16 on the M1 Max and the RTX 3090.
Which CPU and RAM are in the Reference Benchmark 3090?
How much memory does the 3090 have?
python train_benchmark.py --type cnn --model resnet50
yields the following:
`2023-01-05 12:38:02.090126: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:418 : NOT_FOUND: could not find registered platform with id: 0x1051fc880
2023-01-05 12:38:02.090152: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:418 : NOT_FOUND: could not find registered platform with id: 0x1051fc880
2023-01-05 12:38:02.168153: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:418 : NOT_FOUND: could not find registered platform with id: 0x1051fc880
2023-01-05 12:38:02.168185: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:418 : NOT_FOUND: could not find registered platform with id: 0x1051fc880
2023-01-05 12:38:02.168205: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:418 : NOT_FOUND: could not find registered platform with id: 0x1051fc880
2023-01-05 12:38:02.168225: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:418 : NOT_FOUND: could not find registered platform with id: 0x1051fc880
Traceback (most recent call last):
File "/Users/mac.user/git/tf-metal-experiments/train_benchmark.py", line 45, in
_ = model.fit(x=dataset_x, y=dataset_y, batch_size=args.bs, epochs=1, verbose=1)
File "/Users/mac.user/Library/Python/3.9/lib/python/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/Users/mac.user/Library/Python/3.9/lib/python/site-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.NotFoundError: Graph execution error:
Detected at node 'StatefulPartitionedCall_212' defined at (most recent call last):
File "/Users/mac.user/git/tf-metal-experiments/train_benchmark.py", line 45, in
_ = model.fit(x=dataset_x, y=dataset_y, batch_size=args.bs, epochs=1, verbose=1)
File "/Users/mac.user/Library/Python/3.9/lib/python/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/Users/mac.user/Library/Python/3.9/lib/python/site-packages/keras/engine/training.py", line 1650, in fit
tmp_logs = self.train_function(iterator)
File "/Users/mac.user/Library/Python/3.9/lib/python/site-packages/keras/engine/training.py", line 1249, in train_function
return step_function(self, iterator)
File "/Users/mac.user/Library/Python/3.9/lib/python/site-packages/keras/engine/training.py", line 1233, in step_function
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "/Users/mac.user/Library/Python/3.9/lib/python/site-packages/keras/engine/training.py", line 1222, in run_step
outputs = model.train_step(data)
File "/Users/mac.user/Library/Python/3.9/lib/python/site-packages/keras/engine/training.py", line 1027, in train_step
self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
File "/Users/mac.user/Library/Python/3.9/lib/python/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 527, in minimize
self.apply_gradients(grads_and_vars)
File "/Users/mac.user/Library/Python/3.9/lib/python/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1140, in apply_gradients
return super().apply_gradients(grads_and_vars, name=name)
File "/Users/mac.user/Library/Python/3.9/lib/python/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 634, in apply_gradients
iteration = self._internal_apply_gradients(grads_and_vars)
File "/Users/mac.user/Library/Python/3.9/lib/python/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1166, in _internal_apply_gradients
return tf.internal.distribute.interim.maybe_merge_call(
File "/Users/mac.user/Library/Python/3.9/lib/python/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1216, in _distributed_apply_gradients_fn
distribution.extended.update(
File "/Users/mac.user/Library/Python/3.9/lib/python/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1211, in apply_grad_to_update_var
return self._update_step_xla(grad, var, id(self._var_key(var)))
Node: 'StatefulPartitionedCall_212'
could not find registered platform with id: 0x1051fc880
[[{{node StatefulPartitionedCall_212}}]] [Op:__inference_train_function_16598]`
This is something that I suspect will be of interest to people who land on this repo:
https://github.com/octoml/Apple-M1-BERT
People were able to get 2x speed improvement over TF-metal using TVM to accelerate models. Downside is that TVM doesn't (yet) support model training.
Summary tables (more details below and in comments):
Model | M1 7c | M1 32c | A100 (-) | V100 (-) | P100 (-) | T4 (-) | K80 (-) | Q P5000 (-) | Q M4000 (-) | Q RTX 4000 (-) |
---|---|---|---|---|---|---|---|---|---|---|
RN50 | 10 | 135 | 611 | 347 | 211 | 134 | F | 131 | F | F |
MNV2 | 23 | 352 | 269 | 187 | 125 | 193 | 94 | 181 | F | F |
DBERT | 15 | 120 | 761 | 187 | 149 | 94 | 47 | 109 | 39 | 129 |
BERTL | 1 | 18 | 136 | 31 | 16 | 15 | 4 | 17 | F | F |
Model | M1 7c | M1 32c | A100 (+) | V100 (+) | P100 (+) | T4 (+) | K80 (+) | Q P5000 (+) | Q M4000 (+) | Q RTX 4000 (+) |
---|---|---|---|---|---|---|---|---|---|---|
RN50 | 10 | 135 | 1147 | v100 | 252 | na | na | na | na | na |
MNV2 | 23 | 352 | 1870 | v100 | 128 | na | na | na | na | na |
DBERT | 15 | 120 | 1909 | v100 | 209 | na | na | na | na | na |
BERTL | 1 | 18 | 309 | v100 | 23 | na | na | na | na | na |
Really good and useful work here, thank you! I can potentially fill in the remaining values if you think this would be of interest. Note: I simply copy-pasted the content of your .py files and ran them in a notebook. Also note: accuracy will improve by using float32, and as the OP indicates the M1 can only use float32, so the comparison should be without optimization imo. The RTX 3090 and A100 are somewhat similar, in my understanding, in terms of benchmarks like these, fwiw.
M1 on MBA (7-core, 8GB RAM). Completely froze my laptop! Even the trackpad stopped responding... This is the first time I have ever noticed this kind of slowdown/lag on this MBA, but results incoming! Obviously it's taking forever...
Model | GPU | BatchSize | Throughput | Power | Memory |
---|---|---|---|---|---|
ResNet50 | M1 7c | 64 | 10.3 img/sec | ? | ? |
MobileNetV2 | M1 7c | 128 | 22.7 img/sec | ? | ? |
DistilBERT | M1 7c | 64 | 15.2 seq/sec | ? | ? |
BERTLarge | M1 7c | 32 | 0.6 seq/sec | ? | ? |
with optimization:
Model | GPU | BatchSize | Throughput | Power | Memory |
---|---|---|---|---|---|
ResNet50 | A100 40GB | 64 | 1147.4 img/sec | ? | ? |
MobileNetV2 | A100 40GB | 128 | 1869.7 img/sec | ? | ? |
DistilBERT | A100 40GB | 64 | 1909.3 seq/sec | ? | ? |
BERTLarge | A100 40GB | 32 | 309.3 seq/sec | ? | ? |
without optimization:
Model | GPU | BatchSize | Throughput | Power | Memory |
---|---|---|---|---|---|
ResNet50 | A100 40GB | 64 | 610.8 img/sec | ? | ? |
MobileNetV2 | A100 40GB | 128 | 269.4 img/sec | ? | ? |
DistilBERT | A100 40GB | 64 | 761.1 seq/sec | ? | ? |
BERTLarge | A100 40GB | 32 | 135.5 seq/sec | ? | ? |
Model | GPU | BatchSize | Throughput | Power | Memory |
---|---|---|---|---|---|
ResNet50 | M1 Max 32c | 64 | 135 img/sec | 40W | 13 GB |
MobileNetV2 | M1 Max 32c | 128 | 352 img/sec | 37W | 15 GB |
DistilBERT | M1 Max 32c | 64 | 120 seq/sec | 35W | 9 GB |
BERTLarge | M1 Max 32c | 32 | 18 seq/sec | 36W | 14 GB |
Model | GPU | BatchSize | Throughput | Power |
---|---|---|---|---|
ResNet50 | 3090 | 64 | 957 img/sec | 300W |
MobileNetV2 | 3090 | 128 | 1927 img/sec | 310W |
DistilBERT | 3090 | 64 | 1040 seq/sec | 310W |
BERTLarge | 3090 | 32 | 164 seq/sec | 320W |
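Since the tables report both throughput and power, a rough efficiency figure (throughput per watt) can be derived from them. The sketch below only divides the numbers copied from the M1 Max and 3090 tables; the dictionary and function names are made up for illustration, and it assumes the two power columns were measured in a comparable way, which the thread does not confirm.

```python
# (throughput, power in watts) copied from the M1 Max and 3090 tables.
# Throughput is img/sec for the CNNs and seq/sec for the BERT models.
M1_MAX = {
    "ResNet50": (135, 40),
    "MobileNetV2": (352, 37),
    "DistilBERT": (120, 35),
    "BERTLarge": (18, 36),
}
RTX_3090 = {
    "ResNet50": (957, 300),
    "MobileNetV2": (1927, 310),
    "DistilBERT": (1040, 310),
    "BERTLarge": (164, 320),
}


def throughput_per_watt(results):
    """Map each model name to throughput divided by reported power draw."""
    return {model: rate / watts for model, (rate, watts) in results.items()}


m1_eff = throughput_per_watt(M1_MAX)
rtx_eff = throughput_per_watt(RTX_3090)

for model in M1_MAX:
    print(f"{model:12s} M1 Max {m1_eff[model]:.2f}/W   3090 {rtx_eff[model]:.2f}/W")
```

This says nothing about time to train, only about how much work each chip does per watt at the batch sizes listed.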
Just out of curiosity, are you able to use the 'high power mode' to get around the apparent 40W limit?
https://www.macrumors.com/2021/10/22/high-power-mode-16-inch-macbook-pro-m1-max/
The M1 Max has 64 GB RAM.
Q1.
For anyone who wants to train models on the M1 Max, the biggest reason for choosing it is the huge amount of VRAM, not the throughput.
I would also like to know how much memory can actually be used when training a model on the M1 Max (60 GB? 55 GB?).
Q2.
Have you had any problems with tensorflow-metal?
Do you have any plans to extend this post with your experience of problems and solutions when training on the M1 with TF (some code, or compatibility with major ML/deep learning packages)?
It is just a suggestion, but I think many people would appreciate this kind of experience sharing.
Thank you!
On line 20, you are computing the TFLOPS as such:
conv_flops = MN * MN * CK * CK * HW * HW
Wouldn't it actually be 2x this since each of those points is an add and a mul? I see the "*2" in tflops_sweep.py
That brings the computed speed from 4.8 TFLOPS to 9.6 TFLOPS, a lot closer to the 10.4 TFLOPS theoretical max.
(though since it's 3x3, it might be a Winograd conv)
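To make the factor of two concrete, here is a minimal sketch of the nominal FLOP count for a direct 2D convolution, where each multiply-accumulate is counted as two floating-point operations. The function and parameter names are hypothetical and are not taken from the repo's scripts, and the layer shape in the example is made up. If the backend uses a Winograd kernel, the hardware performs fewer multiplications than this nominal count suggests.

```python
def conv2d_flops(out_h, out_w, kernel, c_in, c_out, batch=1, flops_per_mac=2):
    """Nominal FLOPs for a direct conv: one MAC per output element,
    kernel position, and input channel; each MAC is a mul plus an add."""
    macs = batch * out_h * out_w * kernel * kernel * c_in * c_out
    return flops_per_mac * macs


def tflops(flops, seconds):
    """Convert a FLOP count and a wall-clock time into TFLOPS."""
    return flops / seconds / 1e12


# Hypothetical 3x3 layer, used only to show the effect of the convention.
shape = dict(out_h=56, out_w=56, kernel=3, c_in=64, c_out=64, batch=32)
macs_only = conv2d_flops(**shape, flops_per_mac=1)  # counts a MAC as 1 op
mul_and_add = conv2d_flops(**shape)                 # counts a MAC as 2 ops

print(mul_and_add / macs_only)  # ratio between the two conventions
```

Whichever convention is used, it should match the one behind the theoretical peak being compared against, since vendor peak figures normally count a fused multiply-add as two operations.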