pytorch / glow
Compiler for Neural Network hardware accelerators
License: Apache License 2.0
I have a quip doc on this issue, but here is the summary:
Given a 32-bit floating-point number and a 32-bit integer number, we need a general-purpose method of multiplying them using only integer operations. This is for the case where we have quantized operations running on hardware without a fast FPU or other floating-point support. The current implementation only works well for particular cases of rescaling operations -- it tends to be slightly incorrect when we are performing a multiplication of a floating-point scale with an integer register in which some quantities have been accumulated, for instance as happens during a matrix multiplication (FC) or convolution operation.
The overall idea here is that we want to maximize the precision we can maintain when transforming an fp32-i32 multiply into a product of two i32's. At present, we have a general idea that we want to:
(A) Select parameters for pre- and post-multiplication shifting, in order to account for scaling either up or down by factors of 2.
(B) Select an appropriately-precise integer approximation to the mantissa of the f32.
(C) At runtime, compute the f32-i32 product by multiplying the integer approximation to the mantissa with the i32 input, and then shifting appropriately to convert the value to the correct scale w.r.t. powers of 2.
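The steps (A)-(C) above can be sketched in plain C++. This is my own illustration, not Glow's actual code; the struct and function names are made up, and it assumes the fp32 scale is positive:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical sketch: decompose the fp32 scale as m * 2^e with m in
// [0.5, 1), keep a scaled-up integer approximation of m (step B), and
// fold all powers of 2 into a single final shift (steps A and C).
struct FixedPointScale {
  int32_t mantissa; // integer approximation of the fp32 mantissa
  int32_t exponent; // power-of-2 correction applied after the multiply
};

FixedPointScale quantizeScale(float scale) {
  // Assumes scale > 0, as is the case for quantization scales.
  int exp = 0;
  float m = std::frexp(scale, &exp); // scale = m * 2^exp, m in [0.5, 1)
  // Scale the mantissa up by 2^30 to retain precision (step B).
  int32_t q = static_cast<int32_t>(std::lround(m * (1 << 30)));
  return {q, exp - 30};
}

int32_t scaledMul(FixedPointScale s, int32_t x) {
  // Widen to 64 bits so the i32 x i32 product cannot overflow, then
  // shift back to the correct power-of-2 scale (step C).
  int64_t p = static_cast<int64_t>(s.mantissa) * x;
  int shift = -s.exponent; // usually a right shift
  return static_cast<int32_t>(shift >= 0 ? (p >> shift) : (p << -shift));
}
```

Note that the plain right shift here truncates toward negative infinity, which is exactly the rounding subtlety discussed in point (i) below.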
The above is the general theoretical approach. However, in practice, we run into several technical issues.
========= The following is wrong =========
(i) Especially for right-shifting, which is the common case (because the integer approximation to the f32 mantissa is usually going to be a "scaled-up" version of the actual mantissa, in order to achieve precision), arithmetic right-shifting is not quite equivalent to [integer] division by a power of 2. It is equivalent, as desired, for positive values; however, the truncation of right-shifting inherently rounds downward toward negative infinity, and hence is not symmetric about 0. This is undesirable. One solution to this is to add a certain quantity to negative i32 values prior to [arithmetic] right-shifts. The correct value to add is 1_[input < 0] << (shift - 1), which is effectively equivalent to dividing by the respective power of 2, adding 1/2, and then truncating (when the operation is thought of as taking place over the real numbers). This thus implements "round to nearest" behavior for negative i32 values. (Whereas we still have "round down" behavior for positive i32 values -- however, at least...
========== END WRONG PART ===========
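For reference, the common fixed-point idiom for a rounding arithmetic right shift can be sketched as follows. This is my own snippet, independent of the (retracted) text above; it adds half of the divisor before shifting for both signs:

```cpp
#include <cassert>
#include <cstdint>

// Compute floor(x / 2^shift + 1/2), i.e. round-to-nearest with ties
// rounded toward +infinity, instead of the plain arithmetic shift's
// truncation toward negative infinity. Requires shift >= 1.
int32_t roundedRightShift(int64_t x, int shift) {
  int64_t half = int64_t{1} << (shift - 1);
  return static_cast<int32_t>((x + half) >> shift);
}
```

For example, -3 >> 1 truncates -1.5 down to -2, whereas roundedRightShift(-3, 1) yields -1.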
(ii) In the course of multiplications which result from a floating-point value and an integer value which has been accumulated (possibly from other integer products), predicting the correct scale of precision on which to operate is difficult, if not impossible. For example, suppose that we wish to multiply an f32 (whose value is known at compile-time), and an i32 (whose value is not known at compile-time) which is the result of a row-column dot product during a matrix multiply operation. Even if the matrix entries in question are i8's, each such element-wise product can have up to 16 bits, and the dimensions of the matrix in question may also introduce another element of variance in the size of the integer that results (and which we wish to multiply by the f32). For instance:
Generally: f = s * (i - o)
For matrix multiply:
f_d = \sum f_l * f_r = (s_l * s_r) * (\sum (i_l - o_l) * (i_r - o_r))
and hence
i_d = (s_l * s_r / s_d) * (\sum (i_l - o_l) * (i_r - o_r)) + o_d
The floating-point value in question is then s_l * s_r / s_d, which is known at compile-time. However, even though the number of elements in the summation may be known ahead of time, the individual entries i_l and i_r (and consequently i_l - o_l and i_r - o_r) cannot be, which means the actual order of magnitude, in bits, of the resulting i32 multiplicand can only be bounded within some range. Since we only have 32 bits to work with in the first place, it is easy to see how this could become problematic if assumptions about the actual magnitude of the resultant i32 dot product are made at compile-time. It therefore appears that at least some runtime analysis may be needed.
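To make the formulas above concrete, here is an illustrative sketch (all names are hypothetical; this uses floating point for the rescale step, which is exactly what the integer-only scheme above would replace) of requantizing one row-column dot product:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative only: compute i_d = round((s_l * s_r / s_d) * acc) + o_d
// for one row-column dot product of two quantized i8 vectors, where
// acc = sum over i of (i_l - o_l) * (i_r - o_r).
int32_t quantizedDot(const std::vector<int8_t> &lhs, int32_t lhsOffset,
                     float lhsScale, const std::vector<int8_t> &rhs,
                     int32_t rhsOffset, float rhsScale, float destScale,
                     int32_t destOffset) {
  // Accumulate in i32: each i8 * i8 product needs up to 16 bits, and the
  // summation adds up to log2(N) more bits on top of that.
  int32_t acc = 0;
  for (size_t i = 0; i < lhs.size(); ++i)
    acc += (lhs[i] - lhsOffset) * (rhs[i] - rhsOffset);
  // The compile-time-known fp32 multiplicand s_l * s_r / s_d.
  float scale = lhsScale * rhsScale / destScale;
  return static_cast<int32_t>(std::round(scale * acc)) + destOffset;
}
```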
Hi folks,
I followed the README and tried to build Glow but got the following error:
FAILED: lib/Graph/CMakeFiles/Graph.dir/Graph.cpp.o
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -DGLOW_WITH_OPENCL=1 -I../include -Iinclude -I. -isystem /usr/local/opt/llvm/include -Wall -fno-exceptions -fno-
rtti -g -fno-omit-frame-pointer -O0 -fvisibility=hidden -fvisibility-inlines-hidden -std=c++14 -MD -MT lib/Graph/CMakeFiles/Graph.dir/Graph.cpp.o -MF lib/Graph/CMakeFiles/Graph.dir/Graph.cpp.o.d -o lib/
Graph/CMakeFiles/Graph.dir/Graph.cpp.o -c ../lib/Graph/Graph.cpp
../lib/Graph/Graph.cpp:871:19: error: invalid operands to binary expression ('llvm::StringRef' and 'const char *')
(namePrefix + ".initial_state").str(), Variable::VisibilityKind::Public,
~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~
My build command is cmake -G Ninja .. -DCMAKE_PREFIX_PATH=/usr/local/opt/llvm
Any idea what went wrong? Thanks!
The glow::Backend interface is currently storing a lot of state, and it needs to be refactored. We want to clearly separate the hardware abstraction layer from the different kinds of state we're storing.
Over in #1176 we want to compile multiple functions to run on different CPUs. In the future, we also want to support multiple GPUs and other accelerators, so we need a backend interface that separates state related to different functions and different execution units.
I'll elaborate on a design in the comments below.
Partitioning a Function into multiple subgraphs breaks a lot of assumptions in our current architecture. PR #1176 shows a proof-of-concept for the mechanics of partitioning, but I want to step back and figure out how this fits into the whole system.
Let's consider the interfaces we currently have:
- ExecutionEngine is configured with a particular Backend and knows how to compile a Function for it.
- EE can run a function with particular Variable-to-Tensor bindings.
- Backend compiles an IRFunction, creating a CompiledFunction.
- CompiledFunction can execute itself. It assumes Tensor bindings are performed externally (but see #1227 for an alternate design).
The main difficulty with partitioning is that it complicates the pipeline:
Function -> IRFunction -> CompiledFunction -> execute
With partitioning, we have to decide where to split a Function into a DAG of Functions, and how to appropriately propagate the dependence information.
Furthermore, some of the choices we make will be backend-specific. The following scenarios could apply:
Reproduce: compile with asan and execute runBatch on a network compiled in Infer mode.
We should check how the NN was originally compiled and throw an appropriate exception in case of incompatibility.
Upd: make an assert instead of throwing an exception.
Steps to reproduce:
[1/83] Linking CXX executable bin/emulator
FAILED: bin/emulator
: && /Applications/Xcode_9.0.1_fb.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -Wall -fno-exceptions -fno-rtti -g -fno-omit-frame-pointer -O0 -Wl,-search_paths_first -Wl,-headerpad_max_install_names tools/emulator/CMakeFiles/emulator.dir/emulator.cpp.o -o bin/emulator && :
Undefined symbols for architecture x86_64:
"___asan_init", referenced from:
_asan.module_ctor in emulator.cpp.o
"___asan_version_mismatch_check_apple_900", referenced from:
_asan.module_ctor in emulator.cpp.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
Here is the clinfo output for my system: pastebin
I've compiled glow with OpenCL support as follows:
$ cmake -G Ninja -DCMAKE_BUILD_TYPE=Debug -DGLOW_WITH_CPU=1 -DGLOW_WITH_OPENCL=1 ../glow
$ ninja all
Whenever I try to run any of the unittests or demos with the -opencl command line option, I get:
<kernel>:282:1: error: unexpected type name 'vtype': expected expression
DEFINE_OPENCL_UNARY_DATA_PARALLEL_KERNEL_WITH_IMM_OPERAND(splat, float, SRC)
^
<kernel>:240:19: note: expanded from macro 'DEFINE_OPENCL_UNARY_DATA_PARALLEL_KERNEL_WITH_IMM_OPERAND'
vtype SRC = vtype(val); \
^
<kernel>:283:1: error: unexpected type name 'vtype': expected expression
DEFINE_OPENCL_UNARY_DATA_PARALLEL_KERNEL_WITH_IMM_OPERAND(splat_u, ulong, SRC)
^
<kernel>:240:19: note: expanded from macro 'DEFINE_OPENCL_UNARY_DATA_PARALLEL_KERNEL_WITH_IMM_OPERAND'
vtype SRC = vtype(val); \
^
<kernel>:334:59: error: __kernel function cannot have argument whose type is, or contains, type size_t
cl_uint32_t batch, size_t numSlice,
^
<kernel>:335:40: error: __kernel function cannot have argument whose type is, or contains, type size_ta
size_t sliceSize) {
^
/glow/lib/Backends/OpenCL/OpenCL.cpp:189: failed assertion `err == CL_SUCCESS && "clBuildProgram Failed."'
I'm far from conversant in OpenCL but I made the changes that the compiler highlights
diff --git a/lib/Backends/OpenCL/kernels.cl b/lib/Backends/OpenCL/kernels.cl
index f51a1267..17b96644 100644
--- a/lib/Backends/OpenCL/kernels.cl
+++ b/lib/Backends/OpenCL/kernels.cl
@@ -238,7 +238,7 @@ size_t getNCHW(ShapeNCHW s, cl_uint32_t n, cl_uint32_t c, cl_uint32_t h,
vstore8(VAL, i * 2, dest); \
} \
{ \
- vtype SRC = vtype(val); \
+ vtype SRC = (vtype)val; \
vtype VAL = body; \
vstore8(VAL, i * 2 + 1, dest); \
} \
@@ -332,8 +332,8 @@ __kernel void batchedreduceaddK(__global float *dest, __global float *batch,
}
__kernel void batchedreduceaddW(__global void *mem, cl_uint32_t dest,
- cl_uint32_t batch, size_t numSlice,
- size_t sliceSize) {
+ cl_uint32_t batch, cl_uint32_t numSlice,
+ cl_uint32_t sliceSize) {
batchedreduceaddK(&mem[dest], &mem[batch], numSlice, sliceSize);
}
Now, OCLTest, MLTest, and ptb all work fine with the OpenCL backend, but operatorTest, JITTest, and mnist all fail certain tests.
[ RUN ] OpenCL/Operator.simplePredication/0
glow/lib/Backends/OpenCL/OpenCL.cpp:197: failed assertion `err == CL_SUCCESS && "Unable to set parameter"'
[ RUN ] OpenCL/BackendCorrectnessTest.poolMaxGradTest/0
glow/lib/Backends/OpenCL/OpenCL.cpp:197: failed assertion `err == CL_SUCCESS && "Unable to set parameter"'
$ ./bin/mnist -opencl
Loading the mnist database.
Loaded 50000 images.
Training.
Training - epoch #0
glow/lib/Backends/OpenCL/OpenCL.cpp:197: failed assertion `err == CL_SUCCESS && "Unable to set parameter"'
How can I resolve this?
Currently, we profile all nodes in a graph and capture both min/max as well as the histogram.
Quantization scale and offset are calculated based on the global min and global max and do not take histogram into consideration.
This does not work very well when there are outliers.
There was logic for calculating scale and offset based on the histogram, but it was removed in b5d9385 because it did not work properly (e.g., for a simple tensor with {1, 2} values).
Besides the performance of Glow on ResNet50 etc. compared with TensorFlow 1.7, will there be more benchmarks, such as comparing Glow with NNVM/TVM on CPU/GPU?
Backends should be able to implement their own type of tensor objects.
A good example for using this feature would be having tensors backed by OpenCL resources for the OpenCL backend (and similarly for other backends).
In order to support this, the Tensor class can be made more suitable for deriving (adding virtual qualifiers, etc.), or we can create a Tensor interface class for other implementations to derive from.
Additional features for this Tensor interface should include support for lock/unlock (map/unmap) or some other forms of synchronization mechanism.
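A rough sketch of what such an interface could look like (the names are hypothetical, not an actual Glow API): backends derive from the interface and implement map/unmap around their own storage, e.g. OpenCL buffers.

```cpp
#include <cstddef>

// Hypothetical backend-owned tensor interface with map/unmap
// synchronization, as described above.
class DeviceTensor {
public:
  virtual ~DeviceTensor() = default;
  /// Make the data visible to the host; pairs with unmap().
  virtual void *map() = 0;
  virtual void unmap() = 0;
  virtual size_t sizeInBytes() const = 0;
};

// Trivial host-memory implementation, for illustration only.
class HostTensor final : public DeviceTensor {
  void *data_;
  size_t size_;

public:
  HostTensor(void *data, size_t size) : data_(data), size_(size) {}
  void *map() override { return data_; }
  void unmap() override {}
  size_t sizeInBytes() const override { return size_; }
};
```

An OpenCL implementation would instead hold a cl_mem handle and call clEnqueueMapBuffer/clEnqueueUnmapMemObject in map/unmap.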
Our graph does not model memory dependencies explicitly.
As a result, if a variable is written and read on different data paths, the scheduler may be unlucky and we read the wrong version of the variable.
Essentially, if you have:
A
/ \
write read
The scheduler can produce either:
write A
read A
or
read A
write A
Obviously this does not produce the same output.
Glow's intended semantics is that the write should happen last, but with our current algorithm it is possible to fool the scheduler into not doing that.
I'm going to send a PR to show how we can fool the scheduler.
This is part of the Backend refactoring in #1227.
The goal is to have the Backend sub-classes represent the hardware abstraction layer. The instances hold no mutable state, and all actions are performed through const virtual functions.
Proposed interface:
class Backend {
  virtual bool transformPreLowering(Function *F, CompilationMode mode) const;
  virtual bool transformPostLowering(Function *F, CompilationMode mode) const;
  virtual bool isOpSupported(Kinded::Kind opKind, ElemKind elementTy) const;
  virtual bool shouldLower(const Node *N) const;
  virtual bool shouldShareBuffers() const;
  /// Compile an IR function, consuming it in the process.
  /// Returns a new object representing the compiled function.
  virtual std::unique_ptr<CompiledFunction>
  compileFunction(std::unique_ptr<IRFunction>) const;
};
class CompiledFunction {
  virtual void execute() = 0;
};
Refactoring steps:
1. Change transformPreLowering and friends to be const.
2. Add a CompiledFunction sub-class per backend which holds the result of compiling a function.
3. Move mutable state out of the Backend sub-classes into a locally defined class that only exists during compileFunction().
4. Remove the IRFunction pointer from backend instances and the createBackend() function.
At this stage, the CompiledFunction sub-classes will still reference the original module, and their execute method can modify variables in the module.
One thing I noticed: the dependencies are not correct in the cmake files. In effect, if you run ninja test without building glow first, it will fail because the tests don't depend on glow being built. This is problematic, as we may update glow, forget to rebuild, run the tests, and think that everything is okay when we would actually be running the old code.
And currently the CPU JIT version is far slower than the Interpreter version; that's weird.
A thing I noticed a while ago was that the initialization algorithm of a variable is currently embedded in the representation of the Variable class itself. Given that in practice people may want to use quite a variety of initialization kinds, it may be a little unscalable to modify the core Variable class for each new initialization algorithm, and to clutter Nodes.cpp with initialization algorithms that may best be put somewhere else. Maybe it's worth thinking about how the initialization code can be kept separate from Variable itself. This is just a thought; I'll look into how other libraries and frameworks deal with this.
Either way, adding a few more initialization algorithms may be worthwhile for the future.
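One possible shape for this separation, sketched below with a hypothetical API (not a concrete proposal for Glow's signatures): represent each initialization algorithm as a callback that fills a buffer, so new algorithms never touch the Variable class.

```cpp
#include <cmath>
#include <functional>
#include <random>
#include <vector>

// Hypothetical: an initializer is just a function that fills a buffer.
// A Variable would store (or be handed) one of these instead of an
// enum of built-in initialization kinds.
using Initializer = std::function<void(std::vector<float> &)>;

Initializer makeXavier(size_t fanIn, unsigned seed) {
  return [fanIn, seed](std::vector<float> &data) {
    std::mt19937 gen(seed);
    // Xavier/Glorot-style uniform init in [-sqrt(3/fanIn), sqrt(3/fanIn)].
    float bound = std::sqrt(3.0f / static_cast<float>(fanIn));
    std::uniform_real_distribution<float> dist(-bound, bound);
    for (float &v : data)
      v = dist(gen);
  };
}
```

New algorithms (He init, orthogonal init, ...) would then be free functions in a separate file rather than additions to the core class.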
Every master commit triggers a coverage build, and the results can be found here: https://fb-glow-assets.s3.amazonaws.com/coverage/coverage-master/index.html
It would be great to improve test coverage in certain low-coverage places.
This is a good first issue for getting some familiarity with the glow code base; contributions are appreciated.
Convolution is the most important operator for Computer Vision NNs.
The simplest implementation (our Interpreter backend) can be found here:
Convolution is a 2D operation (it always works with a 2D image and a 2D filter, and produces a 2D result). Right now all our Convolution implementations assume that the pad is the same for both dimensions (image height and width). But for some models (very rarely) that's not true. In the general case, there can be 4 different pads: left, right, top, and bottom. We need to change "pad" to be a vector instead of a single number (for ConvolutionNode and ConvolutionInstr), and change all the implementations: Interpreter (float and quantized) and CPU (float and quantized). Later, we also need to support uneven paddings in ConvolutionGradNode (which is the backward pass for Convolution).
Note that our Convolution for the CPU backend is highly optimized, and in theory complicating its code may lead to losing performance. In practice I highly doubt that this would happen, but it is worth checking performance after the change just to be sure. One can run tests/images/run.sh -cpu -iterations=100 with and without the change, and make sure that the average performance numbers are within 1%.
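For reference, the shape arithmetic with per-side pads can be sketched as follows (a hypothetical helper, not Glow code): along each spatial dimension, the two pads for that dimension simply add into the effective input size.

```cpp
#include <cassert>

// Output size along one spatial dimension with independent near/far pads
// (top/bottom for height, left/right for width). With a single symmetric
// pad, padNear == padFar == pad, recovering the current formula.
int convOutDim(int in, int filter, int stride, int padNear, int padFar) {
  return (in + padNear + padFar - filter) / stride + 1;
}
```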
Here is ONNX specification of Conv operator:
https://github.com/onnx/onnx/blob/master/docs/Operators.md#conv
One can notice that not only is padding represented as an array there, but stride and filterSize are represented as arrays as well. Changing these is lower priority, and there's no immediate need; it's just a good thing to have.
Another direction of future work: support non-square padding in PoolMax and PoolAvg operators.
LLVM takes quite a while to compile; does the entire library need to be built?
It looks like these are the dependencies, and I'm wondering if it's possible to reduce the compilation time:
grep -r "include \"llvm" include/ src/ | sed 's/.*#/#/g' | sort | uniq
#include "llvm/Support/ErrorHandling.h"
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/Hashing.h"
#include "llvm/ADT/PointerUnion.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/Casting.h"
#include "llvm/Support/CommandLine.h"
#include "llvm/Support/Debug.h"
#include "llvm/Support/NativeFormatting.h"
#include "llvm/Support/raw_ostream.h"
As of today our API takes the CompilationMode for both the lowering and optimization phases (glow::lower and glow::optimize).
Except for one thing in the lowering, which I will come back to later, this argument does not change the behavior of these functions, and thus I was wondering if we should pass it at all.
My way of thinking is: if we don't use it, let's not pass it, and if/when we need it we can add it.
Now going back to the only user of this argument in lowering.
The lowering of batch normalization is the only user of that argument, but I believe we could avoid that use and that would be a better design. Basically, we lower BN differently between inference and training, whereas I think the right thing would be to have a different representation when we differentiate BN. In essence, what I am suggesting for this use case is to introduce a new node for mean and variance normalization (i.e., for what we special case with the training mode during lowering.) In other words, differentiation for BN would produce BNGrad(BN(meanVarNorm)) and that would completely eliminate the need for a special lowering.
All in all, as of today, I don't see a reason why we pass this mode.
What do people think?
Working with accelerators, we are going to need support for 16-bit floating point types.
Type | binary16 | bfloat | binary32 |
---|---|---|---|
k, storage width in bits | 16 | 16 | 32 |
w, exponent field width in bits | 5 | 8 | 8 |
t, trailing significand field width in bits | 10 | 7 | 23 |
We should at a minimum add these types to glow::ElemKind so we can represent low-precision tensors in these types. Ideally, also add interpreter support.
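As the table shows, bfloat shares binary32's 8-bit exponent field, which makes conversion particularly cheap: keep the top 16 bits. A small sketch of that conversion (my own, not Glow code; this truncates rather than rounds):

```cpp
#include <cstdint>
#include <cstring>

// bfloat16 is the top half of a binary32: 1 sign bit, 8 exponent bits,
// 7 trailing significand bits. Truncating conversion drops the low
// 16 mantissa bits.
uint16_t floatToBFloat16(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  return static_cast<uint16_t>(bits >> 16);
}

float bfloat16ToFloat(uint16_t b) {
  uint32_t bits = static_cast<uint32_t>(b) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}
```

binary16 (5 exponent bits) needs a real re-biasing conversion, with overflow/underflow handling, since its exponent range differs from binary32's.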
I noticed that the image-classifier returns different results if invoked with one file at a time versus all of the files at once, and I was wondering if this is expected.
To reproduce:
$ ./bin/image-classifier tests/images/imagenet/* -image_mode=0to1 -m=densenet121/model.onnx
$ ./bin/image-classifier tests/images/imagenet/cat_285.png -image_mode=0to1 -m=densenet121/model.onnx
Result:
With all the files at once:
Model: densenet121/model.onnx
File: tests/images/imagenet/cat_285.png Result:281
File: tests/images/imagenet/dog_207.png Result:207
File: tests/images/imagenet/zebra_340.png Result:340
With one at a time:
Model: densenet121/model.onnx
File: tests/images/imagenet/cat_285.png Result:674
Model: densenet121/model.onnx
File: tests/images/imagenet/dog_207.png Result:674
Model: densenet121/model.onnx
File: tests/images/imagenet/zebra_340.png Result:674
Also when there is only one file given, the process takes much longer to produce a result.
Is this expected?
It would be useful to have more attributes on tensor objects.
Specifically, whether a tensor is an input/output for the function, whether it's constant or mutable during inference, etc.
Alternatively this could be added as a utility function that backends can invoke.
Currently the script that downloads datasets downloads all of them at once. We should add command line options to control which ones are fetched. E.g., when I only want MNIST, I shouldn't have to download the other large datasets.
An autoencoder is a very basic neural network that takes in some input, encodes it into a lower-dimensional representation, and then decodes it back into something that should ideally be identical to the input. We should try to implement such an autoencoder with Glow. An extension of this would be to make a variational autoencoder.
Putting this up as a "wishlist" issue: since we already have clang-tidy, let's also integrate Include-What-You-Use. LLVM has a lot of different headers, so having a tool to tell us what to include where will be healthy for the future and avoid things like cyclic dependencies or other badness that can arise from C++'s lack of a module system. It also avoids the kind of thing where you remove one header and the whole house of cards of mismanaged includes breaks down. I managed to integrate it with my last LLVM project. It's excruciatingly slow, but it would be great to have as a separate build rule to run overnight or before lunch once in a while.
Glow should provide reference documentation for its intermediate representations, both the graph IR and the linear IR.
Examples from similar projects:
We represent a broadcast as a series of concats. For example, if we want to broadcast a tensor A with dims = {1} to a tensor B with dims = {10}, B is a ConcatNode with A as its input, repeated 10 times.
This works nicely because we do not need a BroadcastNode. Additionally, we IRGen this ConcatNode with repeated inputs into a single InsertTensorInst with some count value, i.e. the InsertTensorInst's number of inputs does not scale with the broadcast size.
However, this scaling issue does occur at the Node level. If the broadcast size is large, for example if the destination tensor B has dims = {2048}, then B will be a ConcatNode with 2048 inputs. Not great.
To improve this, one option I am considering is to add a vector member to ConcatNode that represents how many times to concatenate each input. For example, instead of the above ConcatNode B having A as an input 2048 times, it has A as an input once, as well as a new "count" member vector {2048}, signaling that A should be inserted 2048 times.
If we wanted to also concat some other node C 10 times at the end, then this "count" member vector would be {2048, 10}, and B would have A and C as inputs once each.
What does everyone think?
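A rough sketch of the proposed "count" representation (hypothetical names and simplified types, just to make the shape of the data concrete):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical: each Concat input carries a repeat count, so broadcasting
// A to {2048} stores A once with count 2048 instead of 2048 inputs.
struct ConcatWithCounts {
  std::vector<int> inputs;      // stand-in for the NodeValue inputs
  std::vector<unsigned> counts; // counts[i] = times inputs[i] repeats

  // Arity of the equivalent "expanded" ConcatNode.
  size_t expandedArity() const {
    size_t n = 0;
    for (unsigned c : counts)
      n += c;
    return n;
  }
};
```

The {A x 2048, C x 10} example above would be inputs = {A, C} with counts = {2048, 10}, expanding to arity 2058.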
I was looking at the llvm IR produced by the image classifier using tests/images/run.sh and I noticed that we issue thousands of this kernel:
define internal void @libjit_stacked_kernel.1575(float* noalias) {
entry:
br label %loop
loop: ; preds = %loop, %entry
%1 = phi i64 [ , %entry ], [ %nextvar, %loop ]
%2 = call float @libjit_splat_kernel_f(i64 %1, float SomeNumber, float* null, float* null)
%buffer.element.addr = getelementptr float, float* %0, i64 %1
store float %2, float* %buffer.element.addr
%nextvar = add nuw nsw i64 %1, SomeOtherNumber
%loopcond = icmp ult i64 %nextvar, AnotherNumber
br i1 %loopcond, label %loop, label %afterloop, !llvm.loop !
afterloop: ; preds = %loop
ret void
}
Would be nice to reuse them if at all possible (haven't checked but the code seems duplicated quite a lot), so that we reduce our memory footprint and our compile time.
Was browsing the codebase and noticed that we are overloading std::to_string for Glow's Type class. Not to be that guy, but this is undefined behavior as per the standard (you are only allowed to explicitly specialize standard templates for user-defined types, not add overloads in namespace std).
Not sure if we care? If we do, we should probably move that to Type::to_string or something.
Source: https://stackoverflow.com/questions/14402990/should-you-overload-swap-in-the-std-namespace
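One possible shape for the fix (a sketch, not a patch; the Type body here is a stand-in): define the function in Glow's own namespace, where argument-dependent lookup will still find it for unqualified calls.

```cpp
#include <string>

namespace glow {
// Stand-in for Glow's Type class.
struct Type {
  const char *name;
};

// A free function in namespace glow instead of an overload in namespace
// std: well-defined, and ADL finds it for unqualified to_string(t) calls.
std::string to_string(const Type &ty) { return std::string(ty.name); }
} // namespace glow
```

Alternatively, a Type::toString() member avoids the lookup question entirely.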
Hi, I am a graphic designer and I want to contribute to your nice project by proposing an icon/logo design.
If you like it and want to use it, I will gladly send you the files as a gift, for free. Here is what I came up with:
Design idea: the letter G is very visible as the initial of Glow; it also shows gears, which represent a machine or engine. Another element is a sun that glows, which literally represents "glow".
I hope you like it and that it makes you want to use it. If there's any modification you would like, please let me know and I will edit it.
Thanks and best regards!
-Tobaloidee
We should be able to build the Glow compiler with Buck.
You can get more details on Buck here.
I ran a build with LeakSanitizer enabled and it pointed to a few leaks.
cmake -DGLOW_USE_SANITIZER=Address ~/src/glow
export ASAN_OPTIONS=detect_leaks=1
ninja graphTest
./tests/graphTest
The leaks are of allocations inside OpenCL, but we should make sure we're not misusing the API.
e.g.,
Indirect leak of 36 byte(s) in 1 object(s) allocated from:
#0 0x103024c33 in wrap_malloc (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x56c33)
#1 0x7fff7e6f38b2 in _Block_copy (libsystem_blocks.dylib:x86_64+0x8b2)
#2 0x7fff7e6f3b3d in _Block_object_assign (libsystem_blocks.dylib:x86_64+0xb3d)
#3 0x7fff7e6f38ef in _Block_copy (libsystem_blocks.dylib:x86_64+0x8ef)
#4 0x7fff7e66bfd3 in _dispatch_Block_copy (libdispatch.dylib:x86_64+0x1fd3)
#5 0x7fff7e683a07 in _dispatch_source_set_cancel_handler (libdispatch.dylib:x86_64+0x19a07)
#6 0x103023f9e in wrap_dispatch_source_set_cancel_handler (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x55f9e)
#7 0x7fff7867db1a in (anonymous namespace)::RunElsewhere::instance() (SkyLight:x86_64+0x23b1a)
#8 0x7fff787ba521 in __SLSInitialize_block_invoke (SkyLight:x86_64+0x160521)
#9 0x7fff7e66bd4f in _dispatch_client_callout (libdispatch.dylib:x86_64+0x1d4f)
#10 0x7fff7e66bd02 in dispatch_once_f (libdispatch.dylib:x86_64+0x1d02)
#11 0x7fff78828d08 in SLSMainConnectionID (SkyLight:x86_64+0x1ced08)
#12 0x7fff604e5c9a in gfxLoadPluginData (libGFXShared.dylib:x86_64+0xc9a)
#13 0x7fff5f30308a (OpenCL:x86_64+0x2008a)
#14 0x7fff7e9309bd in __pthread_once_handler (libsystem_pthread.dylib:x86_64+0x39bd)
#15 0x7fff7e925eff in _os_once (libsystem_platform.dylib:x86_64+0xeff)
#16 0x7fff7e930958 in pthread_once (libsystem_pthread.dylib:x86_64+0x3958)
#17 0x7fff5f302e67 (OpenCL:x86_64+0x1fe67)
#18 0x7fff5f3034e1 in clGetDeviceIDs (OpenCL:x86_64+0x204e1)
#19 0x1003bee98 in glow::OCLBackend::OCLBackend() OpenCL.cpp:89
#20 0x1003bebda in glow::createOCLBackend() OpenCL.cpp:87
#21 0x10002395b in getConvNodeSize(glow::BackendKind) graphTest.cpp:408
#22 0x100024477 in Graph_disableUnrollingGroupConv_Test::TestBody() graphTest.cpp:438
#23 0x100268e1f in testing::Test::Run() gtest.cc
#24 0x10026c688 in testing::TestInfo::Run() gtest.cc:2654
#25 0x10026dc94 in testing::TestCase::Run() gtest.cc:2772
#26 0x10028e1c7 in testing::internal::UnitTestImpl::RunAllTests() gtest.cc:4677
#27 0x10028cfb5 in testing::UnitTest::Run() gtest.cc
#28 0x1002c7b3f in main gtest.h:2237
#29 0x7fff7e6a5114 in start (libdyld.dylib:x86_64+0x1114)
I was looking at the QuantizationProfile node and the related QuantizationProfile instruction, and it looks like we only provide interpreter support.
So here are my questions:
If we expect every backend to come up with its own support, we should make an extra effort to document the expectations of this operator.
Glow contains a variety of validation tests:
Model: resnet50/predict_net.pb
File: tests/images/imagenet/cat_285.png Result:285
The goal here is to introduce a script which allows running (3) on CI.
A few things to consider before implementing:
Important scenarios include AOT examples, which should be validated on the CI in a similar way as we propose to do for (3). AOT examples also require downloading weights and models.
Support ONNXIFI interface
[ 28%] Building CXX object src/glow/Base/CMakeFiles/Base.dir/Train.cpp.o
In file included from /Users/bwasti/Glow/src/glow/Base/Train.cpp:3:
In file included from /Users/bwasti/Glow/include/glow/Base/Train.h:4:
/Users/bwasti/Glow/include/glow/Base/Tensor.h:158:26: error: no member named 'drop_front' in
'llvm::ArrayRef<unsigned long>'
auto dim = t->dims().drop_front();
~~~~~~~~~ ^
/Users/bwasti/Glow/include/glow/Base/Tensor.h:173:35: error: no member named 'drop_front' in
'llvm::ArrayRef<unsigned long>'
auto onceSliceDim = t->dims().drop_front();
~~~~~~~~~ ^
/Users/bwasti/Glow/include/glow/Base/Tensor.h:175:35: error: no member named 'drop_front' in
'llvm::ArrayRef<unsigned long>'
assert(onceSliceDim == dims().drop_front() && "Invalid slice size");
~~~~~~ ^
/usr/include/assert.h:93:25: note: expanded from macro 'assert'
(__builtin_expect(!(e), 0) ? __assert_rtn(__func__, __FILE__, __LINE__, #e) : (void)0)
^
In file included from /Users/bwasti/Glow/src/glow/Base/Train.cpp:3:
In file included from /Users/bwasti/Glow/include/glow/Base/Train.h:4:
/Users/bwasti/Glow/include/glow/Base/Tensor.h:357:32: error: no member named 'drop_front' in
'llvm::ArrayRef<unsigned long>'
Tensor slice(elemTy, sizes.drop_front());
~~~~~ ^
4 errors generated.
make[2]: *** [src/glow/Base/CMakeFiles/Base.dir/Train.cpp.o] Error 1
make[1]: *** [src/glow/Base/CMakeFiles/Base.dir/all] Error 2
make: *** [all] Error 2
The definition seems to exist in external/llvm/include/llvm/ADT/ArrayRef.h, so I'm not really sure where the error is coming from.
$ strings external/llvm/lib/* 2>/dev/null | c++filt -n | grep "llvm::ArrayRef<unsigned long>" | grep drop_front
llvm::ArrayRef<unsigned long>::drop_front(unsigned long) const
Currently, there are two end-to-end tests for quantization, which make sure that we can:
Note that profiling is always done with the Interpreter backend, while quantization and inference are run on a specific backend.
See here for the reference.
@ZchiPitt we've been talking about this. Created this issue just to make sure it's not lost.
First of all, NVIDIA's CUDA OpenCL implementation doesn't support size_t.
Besides, when compiling the executable as a 32-bit application, uint64_t is about to become another issue, because in the kernel we have:
typedef struct {
cl_uint64_t n; // Number of samples
cl_uint64_t h; // Height
cl_uint64_t w; // Width
cl_uint64_t c; // Number of channels
} ShapeNHWC;
but in C++
struct ShapeNCHW {
  size_t n; // Number of samples
  size_t c; // Number of channels
  size_t h; // Height
  size_t w; // Width
};
There is an inconsistency between the OpenCL kernel and the C++ code regarding size_t vs. uint64_t.
We may need to define a fixed coding style for size_t and uint64_t.
The latest version of Glow compiler provides a certain level of quantization support.
Quantization process consists of the following steps:
Instrument an original floating-point graph with the special Quantization Profile node
Run inference on the instrumented graph many times with good representative inputs (see dump_profile flag).
During this step Glow automatically captures the distribution of all activations in the compute graph. The distribution contains the min/max value seen at the given output of the graph node, as well as detailed frequencies per floating-point range within the [min, max] interval. There are 2000 floating-point ranges kept as part of the Quantization Profile node.
As the result of the profiling procedure, Glow generates Scale and Offset parameters for all activations and dumps them to a file (note that currently Glow uses linear quantization). The Scale is a positive fp32 number, while the Offset is an int32. Note that Glow does not use the distribution of floating-point numbers but relies only on the values of min and max (this is a separate issue and out of scope here).
Transform the computation graph based on the captured profile and the specific execution backend. Note that not all nodes are quantized, as not all backends support every quantized op. See CPU backend op support here for an example.
Perform computation on the quantized graph. See the Interpreter implementation here.
There are backends that are tied to a specific quantization scheme, e.g., symmetric quantization.
The difference between symmetric and asymmetric quantization is that the "offset" parameter equals 0 in symmetric linear quantization. In that case dequantization follows the formula fp32_number = scale * quantized_number (versus fp32_number = scale * (quantized_number - offset) for the asymmetric case).
The issue could be tackled in two steps:
Please discuss this issue here before implementation.
cc: @qcolombet
Our pooling nodes and instructions are named PoolAvg and PoolMax. Everywhere else in ML they are called AvgPool and MaxPool, respectively. It would be great to do a mechanical rename.
Over in #1065, we have an example of a unit test that fails intermittently depending on the pseudo-random numbers delivered by nextRand()
. This function uses a static
PRNG, so while it is seeded deterministically, reordering tests will give them different pseudo-random numbers.
The PRNG state is also shared between threads, so in a multithreaded environment the pseudo-random numbers will be non-deterministic.
One way of solving this problem would be to make the PRNG state a member of the Module
class and change the initXavier()
and randomize()
methods of Tensor
to take an explicit PRNG state parameter.
Also, we don't take into account what is returned from readPngImage(). E.g., we should wrap readPngImage() calls in an assert: GLOW_ASSERT(!readPngImage() ... )
The glow::differentiate()
function takes a mutable reference to a TrainingConfig
struct, but it doesn't seem to modify it.
Could that argument be changed to a const reference instead?
The picture below depicts a concat node that joins nodes on dimension number 1. The IR that we generate for this code is inefficient for two reasons. First, we can't optimize the operator that writes the result because the result is scattered across the second dimension (dim 0 being the first). Second, we emit a sequence of insert_tensor instructions that traverse the tensor several times, invalidating the cache. A much better way would be to represent this as a dim-0 concat followed by a transpose.
Design question: I am not sure if this should be the canonical representation, the only representation or simply a target specific optimization.
As found in #1209, our shareBuffers optimization fails to merge chains of copies.
The problem is illustrated with the test case in #1237.
Copy/pasting the analysis made in #1209:
In a nutshell, our buffer optimization only looks at merging one segment at a time. In the case of the lowering of the load node, we have for the first time potentially more than one copy on a chain to a use (e.g., with the lowering of reshape(load)) .
Now, to illustrate the problem with a chain of copies, let us consider:
1: a = copy b
2: c = copy a
3: = c
We have the following segments:
a = (1, 2); b = (-, 1); c = (2, 3)
First we merge c and a:
1: a = copy b
2: a = copy a
3: = a
Now the segments look like:
a = (1, 2), (2, 3); b = (-,1)
When considering merging a with b, we only consider updating the segment of a that encloses the segment of b, thus (1,2).
The updated code looks like:
1: b = copy b
2: a = copy b
3: = a
We don't update the old c segment and end up with a copy in 2 now.
Interestingly, if you delete the useless copies as you go, you actually end up with incorrect code, e.g., for the previous example:
1: b = copy b
---2: a = copy b---- deleted
3: = a <--- a never defined
This is because when we update the instruction, we don't preserve the last definition of the outgoing variable, and since we don't update the uses that are not in the segment, they potentially reference an undefined variable.
It looks like it works just by accident but maybe it was intended this way?
Long story short we need to lift the one-segment-at-a-time limitation (maybe by merging the segments?).
ONNX implementation of squeezenet uses GlobalAveragePool. Here's the spec.
GlobalAveragePool can be represented as a simple AveragePool with a kernel size equal to the size of the image.
One way to support it is to "lower" it at the Graph-creation stage, i.e., do something similar to ChannelShuffle (commits: 1 2).
Alternatively, we could create GlobalAveragePoolNode and lower it at normal lowering stage, but I don't see any benefits of doing so.
Also, AveragePool in the ONNX spec supports averaging over an arbitrarily-shaped N-dimensional box, whereas currently we support only square boxes. This may or may not be an issue for lowering GlobalAveragePool.
Hi,
I'm trying to run the ResNet-50 example in examples/bundles. My examples/bundles/resnet50/Makefile has
LOADER?=~/build_Debug/bin/image-classifier
GLOW_SRC?=~/glow
Upon executing make run, I get
File ‘resnet50/predict_net.pbtxt’ already there; not retrieving.
File ‘resnet50/predict_net.pb’ already there; not retrieving.
File ‘resnet50/init_net.pb’ already there; not retrieving.
Model: resnet50_profile
File: /home/ubuntu/glow/tests/images/imagenet/cat_285.png Result:285
File: /home/ubuntu/glow/tests/images/imagenet/dog_207.png Result:207
File: /home/ubuntu/glow/tests/images/imagenet/zebra_340.png Result:340
Segmentation fault (core dumped)
Makefile:37: recipe for target 'build/resnet50.o' failed
make: *** [build/resnet50.o] Error 139
Then, without quantization (QUANTIZE=NO make run), I get
File ‘resnet50/predict_net.pbtxt’ already there; not retrieving.
File ‘resnet50/predict_net.pb’ already there; not retrieving.
File ‘resnet50/init_net.pb’ already there; not retrieving.
Segmentation fault (core dumped)
Makefile:42: recipe for target 'build/resnet50.o' failed
make: *** [build/resnet50.o] Error 139
Device: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
OS: Ubuntu 16.04.4 LTS
Some of the Graph optimizations assume that certain nodes have variables as their inputs (e.g., batch normalization, convolution) and that these variables are used only in those nodes.
Although this currently matches what the importers emit, it creates unwelcome coupling. In essence, usages of cast<Variable> are sprinkled throughout the code base.
We should either:
I am tempted to go with #2 because:
A. Only the optimizer, the importer, and the test cases have these assumptions baked in (for instance, the CPU optimization for convolution, optimizeCPUConv, is fine without them; the same holds for the lowering and the interpreter).
B. It is not unlikely that constant Variables with the same value would be uniqued and shared across different uses, and that is never checked.
C. With the introduction of the load node, those optimizations will need to be taught how to look through loads anyway.