Comments (35)

krasin commented on June 27, 2024

Interesting. What if you change the entries with MTLSizeMake(16, 16, 1) to MTLSizeMake(16, 8, 1), here: https://github.com/krasin/MetalDetector/blob/master/MetalDetector/GoogLeNetProfile.swift ?

I would also make a guess that the device is a bit too old to run the network. In the best case, it will be painfully slow (like, 10 seconds / frame)

from metaldetector.

wangzhangup commented on June 27, 2024

@krasin I changed them to (16, 8, 1); same error.
BTW, my device is an iPad Air 2.

krasin commented on June 27, 2024

Air 2 should be fast enough (2-3 seconds / frame).

Did you change all of the (16, 16, 1) entries? Does the program print anything to the debug console?

wangzhangup commented on June 27, 2024

@krasin All of them!

w=352, h=288
/BuildRoot/Library/Caches/com.apple.xbs/Sources/Metal/Metal-55.2.6.1/ToolsLayers/Debug/MTLDebugComputeCommandEncoder.mm:702: failed assertion `(threadsPerThreadgroup.width(16) * threadsPerThreadgroup.height(8) * threadsPerThreadgroup.depth(1))(128) must be <= 96. (kernel threadgroup size limit)'
(lldb)

krasin commented on June 27, 2024

Sweet! 16*8 = 128, which is larger than 96 (the limit on your device). Try to change it to (8,8,1).

Oh, there's another use of the threadgroup size = 256, see https://github.com/krasin/MetalDetector/blob/master/MetalDetector/Engine.swift#L86
Change it to (64, 1, 1).

Oh, and here: https://github.com/krasin/MetalDetector/blob/master/MetalDetector/Engine.swift#L95 (16, 16, 1) => (8, 8, 1)
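
For reference, the arithmetic behind these substitutions can be sketched in a few lines of Python. This is only an illustration, not code from the repository: fit_threadgroup is a hypothetical helper, and the limit of 96 is simply the value reported in the assertion above. On a device, the actual limit is reported by maxTotalThreadsPerThreadgroup on the compute pipeline state.

```python
def fit_threadgroup(width, height, depth, limit):
    """Halve the larger of width/height until width*height*depth <= limit.

    Hypothetical helper: it mirrors the manual edits suggested in this
    thread and is not part of MetalDetector.
    """
    if limit < 1 or min(width, height, depth) < 1:
        raise ValueError("sizes and limit must be positive")
    while width * height * depth > limit:
        if width >= height and width > 1:
            width //= 2
        elif height > 1:
            height //= 2
        else:
            # width == height == 1, so depth alone is over the limit
            raise ValueError("depth alone exceeds the limit")
    return (width, height, depth)


if __name__ == "__main__":
    limit = 96  # value from the failed assertion in this thread
    print(fit_threadgroup(16, 16, 1, limit))  # (8, 8, 1)
    print(fit_threadgroup(256, 1, 1, limit))  # (64, 1, 1)
    print(fit_threadgroup(128, 1, 1, limit))  # (64, 1, 1)
```

Halving keeps power-of-two sizes as powers of two, which is why 16 * 8 = 128 drops to 8 * 8 = 64 rather than to something closer to 96.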

krasin commented on June 27, 2024

Sorry for the mess. I didn't have an Air 2 to test on, only an iPhone 6S (on hand) and an iPhone 6 (my friend tested a bit). When you get it working, a cleanup pull request is welcome. :)

krasin commented on June 27, 2024

Also, https://github.com/krasin/MetalDetector/blob/master/MetalDetector/Engine.swift#L256 (128, 1, 1) => (64, 1, 1)

krasin commented on June 27, 2024

And here:

cell = MTLSizeMake(16, 16, 1)

krasin commented on June 27, 2024

I might have missed something; just search for MTLSizeMake.

krasin commented on June 27, 2024

(sorry, going to sleep; will be offline for a few hours)

wangzhangup commented on June 27, 2024

@krasin thx, your demo is great! I will try to use iPhone 6

krasin commented on June 27, 2024

Oh, did you get it working?

krasin commented on June 27, 2024

What time per frame does it show? It should print something like "net.forward is done within (workTime) sec"

wangzhangup commented on June 27, 2024

@krasin same error! I use Xcode 7.2 and SDK 9.2

wangzhangup commented on June 27, 2024

@krasin The iPhone 6s works fine!
You do such awesome work, but no one has discovered it!
Thank you so much!

wangzhangup commented on June 27, 2024

@krasin

Device      GPU      Cores                     Chip
iPhone 6s   GT7600   192 (FP32) or 384 (FP16)  A9
iPhone 6    GX6450   128 (FP32) or 256 (FP16)  A8
iPad Air 2  GXA6850  256 (not official)        A8X

None of these GPUs has its own dedicated memory, but the A9's memory bandwidth is 2x wider than the A8's.

krasin commented on June 27, 2024

Interesting. I guess I need to verify that the iPhone 6 still works. Otherwise, I don't see why the A8 would work and the A8X would fail.

wangzhangup commented on June 27, 2024

@krasin what is the file type of GoogLeNet.data?

krasin commented on June 27, 2024

It's more or less just float32 weights stored in somewhat arbitrary order, and the code in https://github.com/krasin/MetalDetector/blob/master/MetalDetector/GoogLeNet.gen.swift has all the relevant offsets and lengths.

GoogLeNet.data and all files with .gen. in their names are generated by a script that takes a Caffe model and outputs binary, Metal and Swift files. I didn't open source the script, in part because it's half-baked. I more or less lost interest in developing it once TensorFlow was open-sourced. TensorFlow has a way to deploy models to various devices, including mobile, out of the box. While they currently have Android support only, iOS is in the plans (to the best of my knowledge, based on their public docs). See, for example, https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/android which is very similar to this example.

wangzhangup commented on June 27, 2024

@krasin I'm not familiar with TensorFlow, but I have tested MXNet and Caffe. Caffe is faster than MXNet.

I'm building my own model into your platform to benchmark it. Titan, laptop CPU/GPU, TK1 CPU/GPU!
GoogLeNet only takes 300 ms per image. I think the iPhone with Metal may be opening the door to deep learning being widely used in everyday life.

krasin commented on June 27, 2024

Theoretically, if your model is not too fancy, my script should be able to generate the iOS files for it (maybe with some minor modifications).

If you already have a prototype for your network, feel free to send me your deploy.prototxt and your.caffemodel files to [email protected]. I will try to run my script, modify it a little bit, if there's a layer or two that are not supported yet, and then send it back to you. No promises, though.

krasin commented on June 27, 2024

Also, do take a look at TensorFlow. They have super nice tutorials: https://www.tensorflow.org/versions/master/tutorials/index.html

Even if you end up using Caffe, it's useful to be aware of alternatives around. I am personally super positive about the future with TensorFlow (less excited with the current state, though)

wangzhangup commented on June 27, 2024

@krasin My net includes a Deconvolution layer and a Crop layer.

krasin commented on June 27, 2024

Crop should be trivial to support.
Deconvolution will take a bit of work, but in the end it's almost the same code as for the convolution.

No python layer, I hope? :)

wangzhangup commented on June 27, 2024

@krasin

wangzhangup commented on June 27, 2024

@krasin I had never heard about the advantages of TensorFlow. Now I have a great new tool to try.

krasin commented on June 27, 2024

I understand your position. It makes sense.

wangzhangup commented on June 27, 2024

@krasin Could I use Python to read or create the .data file?

krasin commented on June 27, 2024

I would expect numpy.fromfile to work: http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.fromfile.html

The file is technically just many float32 numbers.
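
A minimal sketch of the same idea without numpy, using only the standard library. It assumes the file is raw little-endian float32 with no header, and that offsets are counted in float32 elements rather than bytes; both are assumptions, so check them against the offsets and lengths in GoogLeNet.gen.swift. read_floats is a hypothetical helper name.

```python
import array
import os
import struct
import sys
import tempfile


def read_floats(path, offset, count):
    """Return `count` float32 values starting at element index `offset`."""
    values = array.array("f")  # 'f' is a 4-byte C float on common platforms
    with open(path, "rb") as f:
        f.seek(offset * values.itemsize)
        values.fromfile(f, count)  # raises EOFError if the file is too short
    if sys.byteorder != "little":
        values.byteswap()  # assumption: the file is little-endian
    return values.tolist()


if __name__ == "__main__":
    # Stand-in for GoogLeNet.data: four little-endian float32 values.
    fd, path = tempfile.mkstemp(suffix=".data")
    with os.fdopen(fd, "wb") as f:
        f.write(struct.pack("<4f", 0.5, 1.5, -2.0, 3.25))
    print(read_floats(path, offset=1, count=2))  # [1.5, -2.0]
    os.remove(path)
```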

krasin commented on June 27, 2024

Potentially, I could extend my script to generate a wrapper to write the .data file given your .caffemodel. That will unlock your ability to tune your network w/o relying on my magic.

I will think about it.
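
Such a writer could be quite small. The sketch below is not the unpublished generator script; it only shows one plausible layout, assuming each blob is appended as little-endian float32 and an (offset, length) table, counted in float32 elements, is kept alongside. The function name and the blob names are made up for illustration.

```python
import struct


def pack_blobs(blobs):
    """Concatenate float blobs and record where each one landed.

    blobs: list of (name, list_of_floats) pairs.
    Returns (data_bytes, table), where table maps each name to
    (offset, length), both counted in float32 elements.
    """
    chunks = []
    table = {}
    offset = 0
    for name, values in blobs:
        chunks.append(struct.pack("<%df" % len(values), *values))
        table[name] = (offset, len(values))
        offset += len(values)
    return b"".join(chunks), table


if __name__ == "__main__":
    # Hypothetical blobs; a real generator would take them from the .caffemodel.
    data, table = pack_blobs([
        ("conv1_weights", [1.0, 2.0, 3.0]),
        ("conv1_bias", [0.5]),
    ])
    print(len(data))            # 16 bytes: four float32 values
    print(table["conv1_bias"])  # (3, 1)
```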

wangzhangup commented on June 27, 2024

@krasin Why not write the bias into the .data file?

krasin commented on June 27, 2024

Because it's faster to have the data in the constant address space on the GPU. In fact, it's faster to put the weights there too, but the iPhone 6S has a limit of 16 KB for the constant address space, and there are a lot of weights. Once, I addressed that by splitting each convolutional kernel into a series of smaller kernels with just enough weights to fit the constant address space. That reduced the memory bandwidth consumption and they were ~1.5x faster. The problem was that I had about 1500 kernels in the program and it took about 40 minutes to start up. Obviously, that was a no-go. I moved the weights into the main memory, but kept the bias, since it's small enough to fit.

It might be that the difference is negligible. In this case, blame my laziness. I already had it like that and had no incentive to move it into the main memory: in the best case it would be the same speed, in the worst case it would be slower.
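
The size argument can be checked with some quick arithmetic. The 16 KB limit is the figure quoted above; the layer shape is hypothetical, and values are assumed to be stored as float32 (4 bytes each), so treat this as a rough sketch rather than a description of the actual kernels.

```python
CONSTANT_SPACE_BYTES = 16 * 1024  # limit quoted in this thread for iPhone 6S
FLOAT32_BYTES = 4                 # assumption: values stored as float32


def fits_in_constant_space(num_floats, limit=CONSTANT_SPACE_BYTES):
    """True if num_floats float32 values fit within the constant space limit."""
    return num_floats * FLOAT32_BYTES <= limit


def kernels_needed(num_floats, limit=CONSTANT_SPACE_BYTES):
    """Number of limit-sized pieces needed to hold num_floats values."""
    total_bytes = num_floats * FLOAT32_BYTES
    return (total_bytes + limit - 1) // limit  # integer ceiling division


if __name__ == "__main__":
    # Hypothetical 1x1 convolution: 256 input channels, 128 output channels.
    weights = 256 * 128  # 32768 floats = 128 KB
    bias = 128           # 128 floats = 512 bytes
    print(fits_in_constant_space(weights))  # False
    print(kernels_needed(weights))          # 8
    print(fits_in_constant_space(bias))     # True
```

Even a modest layer needs several pieces under these assumptions, which is consistent with the kernel count growing into the hundreds across a whole network.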

wangzhangup commented on June 27, 2024

@krasin
Building for A8* devices, the previous errors have been solved, but a new fatal error occurs:
Could not create pipeline state for inception_5b_1x1_0: Error Domain=AGXMetalG4G Code=1 "Compute function exceeds spill memory limits" UserInfo={NSLocalizedDescription=Compute function exceeds spill memory limits}

krasin commented on June 27, 2024

That's a hard one. The convolution layer implementation makes a hard assumption about the amount of stack available, see

kernel void inception_5b_1x1_0(texture2d_array<half, access::read> in [[texture(0)]],
(it allocates two arrays on the stack to speed up things).

wangzhangup commented on June 27, 2024

@krasin I have run my model on the iPad Air 2, and the speed is about 20 fps.
