Interesting. What if you change the entries with MTLSizeMake(16, 16, 1) to MTLSizeMake(16, 8, 1), here: https://github.com/krasin/MetalDetector/blob/master/MetalDetector/GoogLeNetProfile.swift ?
I would also guess that the device is a bit too old to run the network. In the best case, it will be painfully slow (like 10 seconds per frame).
from metaldetector.
@krasin Changed to (16, 8, 1); same error.
BTW, my device is an iPad Air 2.
Air 2 should be fast enough (2-3 seconds / frame).
Did you change all of (16, 16, 1) entries? Does the program print anything to the debug console?
@krasin All of them!
w=352, h=288
/BuildRoot/Library/Caches/com.apple.xbs/Sources/Metal/Metal-55.2.6.1/ToolsLayers/Debug/MTLDebugComputeCommandEncoder.mm:702: failed assertion `(threadsPerThreadgroup.width(16) * threadsPerThreadgroup.height(8) * threadsPerThreadgroup.depth(1))(128) must be <= 96. (kernel threadgroup size limit)'
(lldb)
Sweet! 16*8 = 128, which is larger than 96 (the limit on your device). Try changing it to (8, 8, 1).
Oh, there's another use of the threadgroup size = 256, see https://github.com/krasin/MetalDetector/blob/master/MetalDetector/Engine.swift#L86
Change it to (64, 1, 1).
Oh, and here: https://github.com/krasin/MetalDetector/blob/master/MetalDetector/Engine.swift#L95 (16, 16, 1) => (8, 8, 1)
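The arithmetic behind all of these fixes is the same: the product of the three threadgroup dimensions must not exceed the device's limit (96 on this iPad Air 2, per the assertion above). Here is a small Python sketch of that rule; the helper is my own illustration, not code from the repo, and in the app you would read the limit from Metal's `MTLComputePipelineState.maxTotalThreadsPerThreadgroup` instead of hard-coding it:

```python
def pick_threadgroup(max_threads, prefer=(16, 16, 1)):
    """Shrink a preferred threadgroup size until it fits the device limit.

    Halves the larger of width/height until width * height * depth
    is <= max_threads (the device's kernel threadgroup size limit).
    """
    w, h, d = prefer
    while w * h * d > max_threads:
        if w >= h:
            w //= 2
        else:
            h //= 2
    return (w, h, d)

# On a device with a limit of 96: (16, 16, 1) -> (8, 8, 1),
# and the 1-D dispatches (128, 1, 1) and (256, 1, 1) -> (64, 1, 1).
```

This reproduces every replacement suggested in this thread, which is why querying the limit at runtime would beat hard-coded sizes.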
Sorry for the mess. I didn't have an Air 2 to test with, only an iPhone 6S at hand and an iPhone 6 (a friend tested a bit). When you get it working, a cleanup pull request is welcome. :)
Also, https://github.com/krasin/MetalDetector/blob/master/MetalDetector/Engine.swift#L256 (128, 1, 1) => (64, 1, 1)
And here: MetalDetector/MetalDetector/Net.swift, line 128 (commit 74f961c).
I might have missed something; just search for MTLSizeMake.
(sorry, going to sleep; will be offline for a few hours)
@krasin Thanks, your demo is great! I will try to use an iPhone 6.
Oh, did you get it working?
What time per frame does it show? It should print something like "net.forward is done within (workTime) sec"
@krasin Same error! I use Xcode 7.2 and SDK 9.2.
from metaldetector.
@krasin iPhone 6s is fine!
You do such awesome work, but no one has discovered it!
Thank you so much!
Device | GPU | Cores | Chip
---|---|---|---
iPhone 6s | GT7600 | 192 (FP32) or 384 (FP16) | A9
iPhone 6 | GX6450 | 128 (FP32) or 256 (FP16) | A8
iPad Air 2 | GXA6850 | 256 (not official) | A8X
None of these GPUs has its own memory, but A9's memory bandwidth is twice as wide as A8's.
Interesting. I guess I need to verify whether iPhone 6 still works. Otherwise, I don't see a reason for A8 to work and A8X to fail.
@krasin what is the file type of GoogLeNet.data?
It's more or less just float32 weights stored in somewhat arbitrary order, and the code in https://github.com/krasin/MetalDetector/blob/master/MetalDetector/GoogLeNet.gen.swift has all the relevant offsets and lengths.
GoogLeNet.data and all files with .gen. in their names are generated by a script that takes a Caffe model and outputs binary, Metal, and Swift files. I didn't open-source the script, in part because it's half-baked. I more or less lost interest in developing it once TensorFlow was open-sourced. TensorFlow has a way to deploy models on various devices, including mobile, out of the box. While they currently have Android support only, iOS is in the plans (to the best of my knowledge, based on their public docs). See, for example, https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/android which is very similar to this example.
@krasin I'm not familiar with TensorFlow, but I have tested MXNet and Caffe; Caffe is faster than MXNet.
I'm building my own model on your platform to benchmark it against a Titan, laptop CPU/GPU, and TK1 CPU/GPU!
GoogLeNet takes only 300 ms per image. I think the iPhone with Metal may be opening the door to deep learning being widely used in everyday life.
Theoretically, if your model is not too fancy, my script should be able to generate the iOS files for it (maybe with some minor modifications).
If you already have a prototype for your network, feel free to send me your deploy.prototxt and your.caffemodel files to [email protected]. I will try to run my script, modify it a little bit, if there's a layer or two that are not supported yet, and then send it back to you. No promises, though.
Also, do take a look at TensorFlow. They have super nice tutorials: https://www.tensorflow.org/versions/master/tutorials/index.html
Even if you end up using Caffe, it's useful to be aware of the alternatives around. I am personally super positive about the future with TensorFlow (less excited about its current state, though).
@krasin My net includes a Deconvolution layer and a Crop layer.
Crop should be trivial to support.
Deconvolution will take a bit of work, but in the end it's almost the same code as for the convolution.
No python layer, I hope? :)
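The "almost the same code" claim about deconvolution can be illustrated in 1-D: a stride-s deconvolution (transposed convolution) is zero-stuffing the input by s, padding, and then running an ordinary convolution with the flipped kernel. A numpy sketch of that reduction (my own illustration, not code from the repo):

```python
import numpy as np

def conv1d(x, k):
    # plain 'valid' cross-correlation
    n = len(x) - len(k) + 1
    return np.array([np.dot(x[i:i + len(k)], k) for i in range(n)])

def deconv1d(x, k, stride=2):
    # zero-stuff the input by the stride ...
    up = np.zeros(len(x) * stride - (stride - 1))
    up[::stride] = x
    # ... pad so every input sample sees the full kernel ...
    up = np.pad(up, len(k) - 1)
    # ... and reuse the ordinary convolution with the flipped kernel.
    return conv1d(up, k[::-1])
```

The same trick carries over to the 2-D Metal kernels: the deconvolution kernel is the convolution kernel plus the zero-stuffing indexing.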
@krasin I had never heard about TensorFlow's advantages. Now I have a great new tool.
I understand your position. It makes sense.
@krasin Could I use Python to read or create the .data file?
I would expect numpy.fromfile to work: http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.fromfile.html
The file is technically just many float32 numbers.
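A minimal sketch of that round trip with numpy (the file name and the (offset, length) pair below are placeholders of mine; the real offsets and lengths live in GoogLeNet.gen.swift):

```python
import numpy as np

# Create a tiny stand-in for GoogLeNet.data: the real file is likewise
# nothing more than raw float32 values back to back.
np.arange(8, dtype=np.float32).tofile("demo.data")

# Reading: one flat float32 array ...
weights = np.fromfile("demo.data", dtype=np.float32)

# ... from which a layer's blob is a slice at a known (offset, length).
offset, length = 2, 4
layer = weights[offset:offset + length]

# Writing is the mirror operation:
layer.tofile("layer.data")
```

Since there is no header or framing in the file, getting the offsets right is entirely on the reader.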
Potentially, I could extend my script to generate a wrapper to write the .data file given your .caffemodel. That will unlock your ability to tune your network w/o relying on my magic.
I will think about it.
@krasin Why not write the bias into the .data file?
Because it's faster to have the data in the constant address space on the GPU. In fact, it's faster to put weights there too, but iPhone 6S has the limit of 16 KB for the constant address space size, and there's a lot of weights. Once, I addressed that by splitting each convolutional kernel into a series of smaller kernels with just enough weights to fit the constant address space. That reduced the memory bandwidth consumption and they were ~1.5x faster. The problem was that I had about 1500 kernels in the program and it took about 40 minutes to startup. Obviously, that was a no-go. I moved the weights into the main memory, but kept the bias, since it's small enough to fit it.
It might be that the difference is negligible; in that case, blame my laziness. I already had it like that and had no incentive to move it into the main memory: in the best case it would be the same speed, in the worst case it would be slower.
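The size argument can be made concrete with back-of-the-envelope arithmetic (the 16 KB figure is from the comment above; the layer shapes are illustrative examples of mine, not measured from GoogLeNet):

```python
FLOAT32_BYTES = 4
CONSTANT_LIMIT = 16 * 1024  # 16 KB constant address space, per the comment above

def fits_constant_space(n_values):
    """Would n_values float32 numbers fit in the constant address space?"""
    return n_values * FLOAT32_BYTES <= CONSTANT_LIMIT

# A 1024-channel bias vector is only 4 KB, so it fits comfortably.
bias_fits = fits_constant_space(1024)

# An example 3x3x256x256 convolution weight tensor is ~2.3 MB, so it does not.
weights_fit = fits_constant_space(3 * 3 * 256 * 256)
```

That asymmetry is the whole tradeoff: biases are small enough for the fast constant space, weights are not.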
@krasin
Building on A8* devices, the previous errors have been solved, but a new fatal error occurs:
Could not create pipeline state for inception_5b_1x1_0: Error Domain=AGXMetalG4G Code=1 "Compute function exceeds spill memory limits" UserInfo={NSLocalizedDescription=Compute function exceeds spill memory limits}
That's a hard one. The convolution layer implementation makes a hard assumption about the amount of stack available; see MetalDetector/MetalDetector/GoogLeNet.gen.metal, line 3965 (commit a54cadb).
@krasin I have run my model on an iPad Air 2, and the speed is about 20 fps.