When building for iPad Air, this occurs: failed assertion, kernel threadgroup size limit

Error: kernel threadgroup size limit when building for iPad Air (krasin/metaldetector)

krasin commented on June 27, 2024

Interesting. What if you change the entries with MTLSizeMake(16, 16, 1) to MTLSizeMake(16, 8, 1), here: https://github.com/krasin/MetalDetector/blob/master/MetalDetector/GoogLeNetProfile.swift ?

I would also guess that the device is a bit too old to run the network. In the best case, it will be painfully slow (like 10 seconds / frame).

from metaldetector.

wangzhangup commented on June 27, 2024

@krasin I changed them to (16, 8, 1); same error.
BTW, my device is an iPad Air 2.

krasin commented on June 27, 2024

Air 2 should be fast enough (2-3 seconds / frame).

Did you change all of (16, 16, 1) entries? Does the program print anything to the debug console?

wangzhangup commented on June 27, 2024

@krasin All of them!

w=352, h=288
/BuildRoot/Library/Caches/com.apple.xbs/Sources/Metal/Metal-55.2.6.1/ToolsLayers/Debug/MTLDebugComputeCommandEncoder.mm:702: failed assertion `(threadsPerThreadgroup.width(16) * threadsPerThreadgroup.height(8) * threadsPerThreadgroup.depth(1))(128) must be <= 96. (kernel threadgroup size limit)'
(lldb)

krasin commented on June 27, 2024

Sweet! 16*8 = 128, which is larger than 96 (the limit on your device). Try to change it to (8,8,1).

Oh, there's another use of the threadgroup size = 256, see https://github.com/krasin/MetalDetector/blob/master/MetalDetector/Engine.swift#L86
Change it to (64, 1, 1).

Oh, and here: https://github.com/krasin/MetalDetector/blob/master/MetalDetector/Engine.swift#L95 (16, 16, 1) => (8, 8, 1)
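
The arithmetic behind these changes can be sketched in a few lines of Python (used here only for illustration; the project itself is Swift). The limit of 96 is taken from the assertion message, and `shrink_to_limit` is a hypothetical helper, not part of MetalDetector. In actual Metal code the limit does not have to be hard-coded: `MTLComputePipelineState` reports it per pipeline as `maxTotalThreadsPerThreadgroup`.

```python
def threads(size):
    """Total threads in a (width, height, depth) threadgroup."""
    w, h, d = size
    return w * h * d


def shrink_to_limit(size, limit):
    """Halve the largest dimension until the threadgroup fits under `limit`.

    Hypothetical helper for illustration only.
    """
    dims = list(size)
    while threads(dims) > limit:
        i = dims.index(max(dims))  # first of the largest dimensions
        if dims[i] == 1:
            raise ValueError("limit must be at least 1")
        dims[i] //= 2
    return tuple(dims)


LIMIT = 96  # taken from the assertion message for this device

for size in [(16, 16, 1), (16, 8, 1), (256, 1, 1), (128, 1, 1)]:
    print(size, threads(size), "->", shrink_to_limit(size, LIMIT))
```

With a limit of 96, the 2D sizes come out as (8, 8, 1) and the 1D sizes as (64, 1, 1), which matches the values suggested in this thread.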

krasin commented on June 27, 2024

Sorry for the mess. I didn't have an Air 2 to test with, only an iPhone 6S (on hand) and an iPhone 6 (my friend tested a bit). When you get it working, a cleanup pull request is welcome. :)

krasin commented on June 27, 2024

Also, https://github.com/krasin/MetalDetector/blob/master/MetalDetector/Engine.swift#L256 (128, 1, 1) => (64, 1, 1)

krasin commented on June 27, 2024

And here:

MetalDetector/MetalDetector/Net.swift

Line 128 in 74f961c

cell = MTLSizeMake(16, 16, 1)

krasin commented on June 27, 2024

I might have missed something; just search for MTLSizeMake.

krasin commented on June 27, 2024

(sorry, going to sleep; will be offline for a few hours)

wangzhangup commented on June 27, 2024

@krasin Thanks, your demo is great! I will try it on an iPhone 6.

krasin commented on June 27, 2024

Oh, did you get it working?

krasin commented on June 27, 2024

What time per frame does it show? It should print something like "net.forward is done within (workTime) sec"

wangzhangup commented on June 27, 2024

@krasin same error! I use Xcode 7.2 and SDK 9.2

wangzhangup commented on June 27, 2024

@krasin iPhone 6s is fine!
You do such awesome work, but no one has discovered it!
Thank you so much!

wangzhangup commented on June 27, 2024

@krasin

| Device | GPU | Cores | Chip |
|---|---|---|---|
| iPhone 6s | GT7600 | 192 (FP32) or 384 (FP16) | A9 |
| iPhone 6 | GX6450 | 128 (FP32) or 256 (FP16) | A8 |
| iPad Air 2 | GXA6850 | 256 (not official) | A8X |

None of these GPUs has its own dedicated memory, but the A9's memory bandwidth is 2x wider than the A8's.

krasin commented on June 27, 2024

Interesting. I guess I need to verify that the iPhone 6 still works. Otherwise, I don't see why the A8 would work and the A8X would fail.

wangzhangup commented on June 27, 2024

@krasin what is the file type of GoogLeNet.data?

krasin commented on June 27, 2024

It's more or less just float32 weights stored in somewhat arbitrary order, and the code in https://github.com/krasin/MetalDetector/blob/master/MetalDetector/GoogLeNet.gen.swift has all the relevant offsets and lengths.
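
As a rough illustration of that layout, here is a minimal Python sketch that pulls per-layer slices out of a flat float32 blob using an offset/length table. The `LAYOUT` table, the layer names, and `read_floats` are hypothetical. Little-endian byte order and offsets counted in floats (not bytes) are assumptions that should be checked against GoogLeNet.gen.swift.

```python
import struct

FLOAT_SIZE = 4  # bytes per float32


def read_floats(blob, offset, length):
    """Return `length` float32 values starting `offset` floats into `blob`.

    Assumes little-endian data and offsets counted in floats, not bytes.
    """
    start = offset * FLOAT_SIZE
    end = start + length * FLOAT_SIZE
    if end > len(blob):
        raise ValueError("slice runs past the end of the data")
    return list(struct.unpack("<%df" % length, blob[start:end]))


# Hypothetical layout table: layer name -> (offset, length), both in floats.
LAYOUT = {"conv_a": (0, 4), "conv_b": (4, 2)}

# Stand-in for the contents of a .data file.
blob = struct.pack("<6f", 1.0, 2.0, 3.0, 4.0, 0.5, 0.25)

weights = {name: read_floats(blob, off, n) for name, (off, n) in LAYOUT.items()}
print(weights["conv_b"])
```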

GoogLeNet.data and all files with .gen. in their names are generated by a script that takes a Caffe model and outputs binary, Metal and Swift files. I didn't open source the script, in part because it's half-baked. I more or less lost interest in developing it once TensorFlow was open-sourced. TensorFlow has a way to deploy models on various devices, including mobile, out of the box. While they currently have Android support only, iOS is in the plans (to the best of my knowledge based on their public docs). See, for example, https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/android which is very similar to this example.

wangzhangup commented on June 27, 2024

@krasin I'm not familiar with TensorFlow, but I have tested MxNet and Caffe. Caffe is faster than MxNet.

I'm building my own model on your platform to benchmark it alongside a Titan, a laptop CPU/GPU, and a TK1 CPU/GPU!
GoogLeNet takes only 300 ms per image. I think the iPhone with Metal may be opening the door to deep learning being widely used in everyday life.

krasin commented on June 27, 2024

Theoretically, if your model is not too fancy, my script should be able to generate the iOS files for it (maybe with some minor modifications).

If you already have a prototype for your network, feel free to send me your deploy.prototxt and your.caffemodel files to [email protected]. I will try to run my script, modify it a little bit, if there's a layer or two that are not supported yet, and then send it back to you. No promises, though.

krasin commented on June 27, 2024

Also, do take a look at TensorFlow. They have super nice tutorials: https://www.tensorflow.org/versions/master/tutorials/index.html

Even if you end up using Caffe, it's useful to be aware of alternatives around. I am personally super positive about the future with TensorFlow (less excited with the current state, though)

wangzhangup commented on June 27, 2024

@krasin My net includes a Deconvolution layer and a Crop layer.

krasin commented on June 27, 2024

Crop should be trivial to support.
Deconvolution will take a bit of work, but in the end it's almost the same code as for the convolution.

No python layer, I hope? :)

wangzhangup commented on June 27, 2024

@krasin

wangzhangup commented on June 27, 2024

@krasin I had never heard about the advantages of TensorFlow. Now I have a great new tool to look at.

krasin commented on June 27, 2024

I understand your position. It makes sense.

wangzhangup commented on June 27, 2024

@krasin Could I use Python to read or create the .data file?

krasin commented on June 27, 2024

I would expect numpy.fromfile to work: http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.fromfile.html

The file is technically just many float32 numbers.
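
A minimal sketch of both directions with NumPy, assuming the file is raw little-endian float32 with no header (the byte order is an assumption). `load_weights` and `save_weights` are hypothetical names, and the example writes to a temporary file rather than a real GoogLeNet.data.

```python
import os
import tempfile

import numpy as np


def load_weights(path):
    """Read a whole file as a flat array of little-endian float32 values."""
    return np.fromfile(path, dtype="<f4")


def save_weights(path, values):
    """Write values as raw little-endian float32, with no header."""
    np.asarray(values, dtype="<f4").tofile(path)


# Round trip through a temporary file standing in for a .data file.
path = os.path.join(tempfile.mkdtemp(), "example.data")
save_weights(path, [1.0, 2.0, 0.5])
restored = load_weights(path)
print(restored.shape, os.path.getsize(path))
```

Because the file carries no shape information, the flat array still has to be split up using the offsets and lengths from the generated Swift file.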

krasin commented on June 27, 2024

Potentially, I could extend my script to generate a wrapper to write the .data file given your .caffemodel. That will unlock your ability to tune your network w/o relying on my magic.

I will think about it.

wangzhangup commented on June 27, 2024

@krasin Why not write the bias into the .data file?

krasin commented on June 27, 2024

Because it's faster to have the data in the constant address space on the GPU. In fact, it's faster to put the weights there too, but the iPhone 6S has a limit of 16 KB for the constant address space size, and there are a lot of weights. Once, I addressed that by splitting each convolutional kernel into a series of smaller kernels with just enough weights to fit the constant address space. That reduced the memory bandwidth consumption and they were ~1.5x faster. The problem was that I had about 1500 kernels in the program and it took about 40 minutes to start up. Obviously, that was a no-go. I moved the weights into main memory, but kept the bias there, since it's small enough to fit.

It might be that the difference is negligible. In this case, blame my laziness. I already had it like that and had no incentive to move it into main memory: in the best case it would be the same speed, in the worst case it would be slower.
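
To give a feel for why the kernel count grows so quickly, here is a back-of-the-envelope Python sketch. It assumes the whole 16 KB budget is available for float32 weights, ignoring the bias and anything else that shares the constant address space. The layer shape is hypothetical and not taken from GoogLeNet.

```python
import math

CONSTANT_SPACE_BYTES = 16 * 1024  # limit quoted for the iPhone 6S
BYTES_PER_FLOAT32 = 4


def weights_per_kernel(budget_bytes=CONSTANT_SPACE_BYTES):
    """How many float32 weights fit in the constant address space."""
    return budget_bytes // BYTES_PER_FLOAT32


def kernels_needed(num_weights, budget_bytes=CONSTANT_SPACE_BYTES):
    """Number of sub-kernels if each holds as many weights as will fit."""
    return math.ceil(num_weights / weights_per_kernel(budget_bytes))


# Hypothetical 3x3 convolution with 64 input and 128 output channels.
num_weights = 3 * 3 * 64 * 128
print(weights_per_kernel(), num_weights, kernels_needed(num_weights))
```

Under these assumptions a single mid-sized layer already needs 18 sub-kernels, so a network with dozens of convolution layers can plausibly reach kernel counts in the hundreds or more.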

wangzhangup commented on June 27, 2024

@krasin
Building for the A8*, the previous errors have been solved. But a new fatal error occurs:
Could not create pipeline state for inception_5b_1x1_0: Error Domain=AGXMetalG4G Code=1 "Compute function exceeds spill memory limits" UserInfo={NSLocalizedDescription=Compute function exceeds spill memory limits}

krasin commented on June 27, 2024

That's a hard one. The convolution layer implementation makes a hard assumption about the amount of stack available, see

MetalDetector/MetalDetector/GoogLeNet.gen.metal

Line 3965 in a54cadb

    
           kernel void inception_5b_1x1_0(texture2d_array<half, access::read> in [[texture(0)]],

(it allocates two arrays on the stack to speed up things).

wangzhangup commented on June 27, 2024

@krasin I have run my model on the iPad Air 2, and the speed is about 20 fps.
