
glow's Issues

Implement Optimal fp32-i32 Multiplication in Integer Instructions

I have a quip doc on this issue, but here is the summary:

Given a 32-bit floating-point number and a 32-bit integer, we need a general-purpose method of multiplying them using only integer operations. This is for the case where we have quantized operations running on hardware without a fast FPU or other floating-point support. The current implementation only works well for particular cases of rescaling operations -- it tends to be slightly incorrect when we multiply a floating-point scale with an integer register in which some quantities have been accumulated, for instance during a matrix multiplication (FC) or convolution operation.

The overall idea here is that we want to maximize the precision we can maintain when transforming an fp32-i32 multiply into a product of two i32's. At present, we have a general idea that we want to:

(A) Select parameters for pre- and post-multiplication shifting, in order to account for scaling either up or down by factors of 2.
(B) Select an appropriately-precise integer approximation to the mantissa of the f32.
(C) At runtime, compute the f32-i32 product by multiplying the integer approximation to the mantissa with the i32 input, and then shifting appropriately to convert the value to the correct scale w.r.t. powers of 2.
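For concreteness, here is a minimal sketch of steps (A)-(C) in C++. This is illustrative, not Glow's actual implementation: the 30-bit mantissa width, the assumption of a positive compile-time-known scale, and all names are mine.

#include <cassert>
#include <cmath>
#include <cstdint>

struct ScaleParams {
  int32_t mantissa; // integer approximation of the fp32 mantissa (step B)
  int32_t shift;    // post-multiplication right-shift amount (step A)
};

ScaleParams decomposeScale(float scale) {
  assert(scale > 0.0f);
  int exp;
  float m = std::frexp(scale, &exp); // scale == m * 2^exp, m in [0.5, 1)
  ScaleParams p;
  p.mantissa = static_cast<int32_t>(std::lround(m * (1 << 30)));
  p.shift = 30 - exp; // assumed to land in [0, 62] for realistic scales
  return p;
}

int32_t scaledMul(int32_t x, ScaleParams p) { // step C
  // Widen to 64 bits so the i32 * i32 product cannot overflow, then shift
  // back down to the correct power-of-2 scale.
  int64_t prod = static_cast<int64_t>(x) * p.mantissa;
  return static_cast<int32_t>(prod >> p.shift);
}

Note the final arithmetic right-shift in scaledMul: its rounding behavior is exactly what issue (i) below is about.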

The above is the general theoretical approach. However, in practice, we run into several technical issues.

========= The following is wrong =========
(i) Especially for right-shifting, which is the common case (because the integer approximation of the f32 mantissa is usually a "scaled-up" version of the actual mantissa, in order to preserve precision), arithmetic right-shifting is not quite equivalent to [integer] division by a power of 2. It is equivalent, as desired, for positive values; however, the truncation of right-shifting inherently rounds downward toward negative infinity, and hence is not symmetric about 0. This is undesirable. One solution is to add a certain quantity to negative i32 values prior to [arithmetic] right-shifts. The correct value to add is 1_[input < 0] << (shift - 1), which is effectively equivalent to dividing by the respective power of 2, adding 1/2, and then truncating (when the operation is thought of as taking place over the real numbers). This thus implements "round to nearest" behavior for negative i32 values. (Whereas we still have "round down" behavior for positive i32 values -- however, at least...
========== END WRONG PART ===========

(ii) When multiplying a floating-point value by an integer value that has been accumulated (possibly from other integer products), predicting the correct scale of precision at which to operate is difficult, if not impossible. For example, suppose we wish to multiply an f32 (whose value is known at compile-time) by an i32 (whose value is not known at compile-time) which is the result of a row-column dot product during a matrix multiply operation. Even if the matrix entries in question are i8's, each element-wise product can take up to 16 bits, and the dimensions of the matrix introduce another source of variance in the size of the resulting integer (which we wish to multiply by the f32). For instance:

Generally: f = s * (i - o)
For matrix multiply:
f_d = \sum f_l * f_r = (s_l * s_r) * (\sum (i_l - o_l) * (i_r - o_r))
and hence
i_d = (s_l * s_r / s_d) * (\sum (i_l - o_l) * (i_r - o_r)) + o_d

The floating-point value in question is then s_l * s_r / s_d, which is known at compile-time. However -- even though the number of elements in the summation may be known ahead of time -- the individual entries i_l and i_r (and consequently i_l - o_l and i_r - o_r) cannot be, so the actual order of magnitude, in bits, of the resulting i32 multiplicand can only be bounded within some range. Since we only have 32 bits to work with in the first place, it is easy to see how this becomes problematic if assumptions about the actual magnitude of the resultant i32 dot product are made at compile-time. It therefore appears that at least some runtime analysis may be needed.

Build failure on macOS

Hi folks,
I followed the README and tried to build Glow but got the following error:

FAILED: lib/Graph/CMakeFiles/Graph.dir/Graph.cpp.o
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -DGLOW_WITH_OPENCL=1 -I../include -Iinclude -I. -isystem /usr/local/opt/llvm/include -Wall -fno-exceptions -fno-rtti -g -fno-omit-frame-pointer -O0 -fvisibility=hidden -fvisibility-inlines-hidden -std=c++14 -MD -MT lib/Graph/CMakeFiles/Graph.dir/Graph.cpp.o -MF lib/Graph/CMakeFiles/Graph.dir/Graph.cpp.o.d -o lib/Graph/CMakeFiles/Graph.dir/Graph.cpp.o -c ../lib/Graph/Graph.cpp
../lib/Graph/Graph.cpp:871:19: error: invalid operands to binary expression ('llvm::StringRef' and 'const char *')
      (namePrefix + ".initial_state").str(), Variable::VisibilityKind::Public,
       ~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~

My build command is cmake -G Ninja .. -DCMAKE_PREFIX_PATH=/usr/local/opt/llvm

Any idea what went wrong? Thanks!
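One guess: llvm::StringRef does not define an operator+ that takes a const char *, which matches the error. A hedged sketch of a workaround, assuming namePrefix is a StringRef, is to route the concatenation through llvm::Twine:

#include "llvm/ADT/Twine.h"

// Twine supports mixed StringRef / string-literal concatenation and can
// materialize the result as a std::string.
auto name = (llvm::Twine(namePrefix) + ".initial_state").str();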

Refactor Backend interface

The glow::Backend interface currently stores a lot of state, and it needs to be refactored. We want to cleanly separate the hardware abstraction layer from the different kinds of state we're storing.

Over in #1176 we want to compile multiple functions to run on different CPUs. In the future, we also want to support multiple GPUs and other accelerators, so we need a backend interface that separates state related to different functions and different execution units.

I'll elaborate on a design in the comments below.

Graph partitioning for multi-device execution

Partitioning a Function into multiple subgraphs breaks a lot of assumptions in our current architecture. PR #1176 shows a proof-of-concept for the mechanics of partitioning, but I want to step back and figure out how this fits into the whole system.

Let's consider the interfaces we currently have:

  • ExecutionEngine is configured with a particular Backend and knows how to compile a Function for it.
  • Once compiled, EE can run a function with particular Variable-to-Tensor bindings.
  • Internally to EE, Backend compiles an IRFunction, creating a CompiledFunction.
  • CompiledFunction can execute itself. It assumes Tensor bindings are performed externally (but see #1227 for an alternate design).

The main difficulty with partitioning is that it complicates the pipeline:
Function->IRFunction->CompiledFunction->execute
With partitioning, we have to decide where to split a Function into a DAG of Functions, and how to appropriately propagate the dependence information.
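As a strawman, a partitioned Function might be represented as a DAG along these lines (a hedged sketch; this is not an existing Glow API):

#include <vector>

struct DAGNode {
  Function *function;              // the sub-graph this partition executes
  std::vector<DAGNode *> parents;  // partitions whose outputs we consume
  std::vector<DAGNode *> children; // partitions that consume our outputs
};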

Furthermore, some of the choices we make will be backend-specific. The following scenarios could apply:

  • CPU execution in which each part is generated sequentially, and a higher-level Executor manages scheduling of parts onto threads
  • CPU execution in which synchronization is compiled into the graph
  • Accelerator execution where synchronization generates inter-card DMA operations

ASAN build is broken

Steps to reproduce:

  • cmake -G Ninja -DGLOW_USE_SANITIZER=Address ../glow
  • ninja all
[1/83] Linking CXX executable bin/emulator
FAILED: bin/emulator 
: && /Applications/Xcode_9.0.1_fb.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++  -Wall -fno-exceptions -fno-rtti -g -fno-omit-frame-pointer -O0 -Wl,-search_paths_first -Wl,-headerpad_max_install_names  tools/emulator/CMakeFiles/emulator.dir/emulator.cpp.o  -o bin/emulator   && :
Undefined symbols for architecture x86_64:
  "___asan_init", referenced from:
      _asan.module_ctor in emulator.cpp.o
  "___asan_version_mismatch_check_apple_900", referenced from:
      _asan.module_ctor in emulator.cpp.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)

[OpenCL] clBuildProgram Failed. __kernel function cannot have argument whose type is, or contains, type size_t.

Here is the clinfo output for my system pastebin

I've compiled glow with OpenCL support as follows

$ cmake -G Ninja -DCMAKE_BUILD_TYPE=Debug -DGLOW_WITH_CPU=1 -DGLOW_WITH_OPENCL=1  ../glow
$ ninja all

Whenever I try to run any of the unit tests or demos with the -opencl command-line option, I get:

<kernel>:282:1: error: unexpected type name 'vtype': expected expression
DEFINE_OPENCL_UNARY_DATA_PARALLEL_KERNEL_WITH_IMM_OPERAND(splat, float, SRC)
^
<kernel>:240:19: note: expanded from macro 'DEFINE_OPENCL_UNARY_DATA_PARALLEL_KERNEL_WITH_IMM_OPERAND'
      vtype SRC = vtype(val);                                                  \
                  ^
<kernel>:283:1: error: unexpected type name 'vtype': expected expression
DEFINE_OPENCL_UNARY_DATA_PARALLEL_KERNEL_WITH_IMM_OPERAND(splat_u, ulong, SRC)
^
<kernel>:240:19: note: expanded from macro 'DEFINE_OPENCL_UNARY_DATA_PARALLEL_KERNEL_WITH_IMM_OPERAND'
      vtype SRC = vtype(val);                                                  \
                  ^
<kernel>:334:59: error: __kernel function cannot have argument whose type is, or contains, type size_t
                                cl_uint32_t batch, size_t numSlice,
                                                          ^
<kernel>:335:40: error: __kernel function cannot have argument whose type is, or contains, type size_t
                                size_t sliceSize) {
                                       ^

/glow/lib/Backends/OpenCL/OpenCL.cpp:189: failed assertion `err == CL_SUCCESS && "clBuildProgram Failed."'

I'm far from conversant in OpenCL, but I made the changes that the compiler highlights:

diff --git a/lib/Backends/OpenCL/kernels.cl b/lib/Backends/OpenCL/kernels.cl
index f51a1267..17b96644 100644
--- a/lib/Backends/OpenCL/kernels.cl
+++ b/lib/Backends/OpenCL/kernels.cl
@@ -238,7 +238,7 @@ size_t getNCHW(ShapeNCHW s, cl_uint32_t n, cl_uint32_t c, cl_uint32_t h,
       vstore8(VAL, i * 2, dest);                                               \
     }                                                                          \
     {                                                                          \
-      vtype SRC = vtype(val);                                                  \
+      vtype SRC = (vtype)val;                                                  \
       vtype VAL = body;                                                        \
       vstore8(VAL, i * 2 + 1, dest);                                           \
     }                                                                          \
@@ -332,8 +332,8 @@ __kernel void batchedreduceaddK(__global float *dest, __global float *batch,
 }
 
 __kernel void batchedreduceaddW(__global void *mem, cl_uint32_t dest,
-                                cl_uint32_t batch, size_t numSlice,
-                                size_t sliceSize) {
+                                cl_uint32_t batch, cl_uint32_t numSlice,
+                                cl_uint32_t sliceSize) {
   batchedreduceaddK(&mem[dest], &mem[batch], numSlice, sliceSize);
 }

Now OCLTest, MLTest, and ptb all work fine with the OpenCL backend, but operatorTest, JITTest, and mnist all fail certain tests.

[ RUN      ] OpenCL/Operator.simplePredication/0
glow/lib/Backends/OpenCL/OpenCL.cpp:197: failed assertion `err == CL_SUCCESS && "Unable to set parameter"'
[ RUN      ] OpenCL/BackendCorrectnessTest.poolMaxGradTest/0
glow/lib/Backends/OpenCL/OpenCL.cpp:197: failed assertion `err == CL_SUCCESS && "Unable to set parameter"'
$ ./bin/mnist -opencl
Loading the mnist database.
Loaded 50000 images.
Training.
Training - epoch #0
glow/lib/Backends/OpenCL/OpenCL.cpp:197: failed assertion `err == CL_SUCCESS && "Unable to set parameter"'

How can I resolve this?

[Quantization] Calculate Scale and Offset based on the histogram

Currently, we profile all nodes in a graph and capture both min/max as well as the histogram.

Quantization scale and offset are calculated based on the global min and global max and do not take histogram into consideration.
This does not work very well when there are outliers.
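For reference, here is a minimal sketch of the current min/max-only scheme for signed int8, assuming the usual dequantization formula fp32 = scale * (quantized - offset); the details are illustrative rather than Glow's exact code:

#include <cmath>
#include <cstdint>
#include <utility>

std::pair<float, int32_t> chooseScaleOffset(float min, float max) {
  // Extend the range to include zero so that zero stays exactly
  // representable (important for padding).
  min = std::fmin(min, 0.0f);
  max = std::fmax(max, 0.0f);
  float scale = (max - min) / 255.0f; // 256 int8 buckets in [-128, 127]
  if (scale == 0.0f)
    scale = 1.0f; // degenerate all-zero tensor
  // Map 'min' onto the lowest representable code, -128.
  int32_t offset = -128 - static_cast<int32_t>(std::round(min / scale));
  return {scale, offset};
}

A single outlier inflates (max - min), and with it the scale, which is exactly why a histogram-aware choice would help.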

There was logic for calculating scale and offset based on the histogram, but it was removed in b5d9385 because it did not work properly, e.g., for a simple tensor with {1, 2} values.

Performance compared with NNVM/TVM

Beyond the published comparison of Glow with TensorFlow 1.7 on Resnet50 and similar models, will there be more benchmarks, such as comparing Glow with NNVM/TVM on CPU/GPU?

[Feature request] Backend-specific tensors

Backends should be able to implement their own type of tensor objects.
A good example for using this feature would be having tensors backed by OpenCL resources for the OpenCL backend (and similarly for other backends).

In order to support this, the Tensor class could be made more suitable for deriving from (adding virtual qualifiers, etc.), or a Tensor interface class could be created for other implementations to derive from.
Additional features of this Tensor interface should include support for lock/unlock (map/unmap) or some other form of synchronization mechanism.
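A hedged sketch of what such an interface might look like (names are illustrative):

#include <cstddef>

class TensorStorage {
public:
  virtual ~TensorStorage() = default;
  // Make the data host-visible (e.g., via clEnqueueMapBuffer for OpenCL).
  virtual void *map() = 0;
  // Release the host mapping so the device can reclaim the buffer.
  virtual void unmap() = 0;
  virtual size_t sizeInBytes() const = 0;
};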

Wrong scheduling because of the lack of memory dependencies

Our graph does not model memory dependencies explicitly. As a result, if a variable is written and read on different data paths, the scheduler may be unlucky and we may read the wrong version of the variable.

Essentially, if you have:
     A
    / \
write read

The scheduler can produce either:
write A
read A
or
read A
write A

Obviously these do not produce the same output. Glow's intended semantics are that the write should happen last, but with our current algorithm it is actually possible to fool the scheduler into not doing that.

I'm going to send a PR to show how we can fool the scheduler.

Separate HAL, temporary state, and compiled functions

This is part of the Backend refactoring in #1227.

The goal is to have the Backend sub-classes represent the hardware abstraction layer. The instances hold no mutable state, and all actions are performed through const virtual functions.

Proposed interface:

class Backend {
public:
  virtual bool transformPreLowering(Function *F, CompilationMode mode) const;
  virtual bool transformPostLowering(Function *F, CompilationMode mode) const;
  virtual bool isOpSupported(Kinded::Kind opKind, ElemKind elementTy) const;
  virtual bool shouldLower(const Node *N) const;
  virtual bool shouldShareBuffers() const;

  /// Compile an IR function, consuming it in the process.
  /// Returns a new object representing the compiled function.
  virtual std::unique_ptr<CompiledFunction>
  compileFunction(std::unique_ptr<IRFunction> F) const;
};

class CompiledFunction {
public:
  virtual void execute() = 0;
};

Refactoring steps:

  • Change transformPreLowering and friends to be const.
  • Create a CompiledFunction sub-class per backend which holds the result of compiling a function.
  • Move remaining temporary compilation state out of the Backend sub-classes into a locally defined class that only exists during compileFunction().
  • Remove the IRFunction pointer from backend instances and the createBackend() function.

At this stage, the CompiledFunction sub-classes will still reference the original module, and their execute method can modify variables in the module.

test target dependencies are incomplete

One thing I noticed: the dependencies are not correct in the CMake configuration. In effect, if you run ninja test without building glow first, it fails because the tests don't depend on glow being built. This is problematic: we may update glow, forget to rebuild, run the tests, and think everything is okay when we are actually running old code.

More initialization kinds and possibly refactor initialization mechanism

A thing I noticed a while ago is that the initialization algorithm of a variable is currently embedded in the representation of the Variable class itself. Given that in practice people may want quite a variety of initialization kinds, it may be unscalable to modify the core Variable class for each new initialization algorithm, and to clutter Nodes.cpp with initialization algorithms that are best put somewhere else. It may be worth thinking about how the initialization code can be kept separate from Variable itself; a sketch follows. This is just a thought. I'll look into how other libraries and frameworks deal with this.
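A hedged sketch of one possible separation (the interface and the initXavier signature are illustrative):

class Initializer {
public:
  virtual ~Initializer() = default;
  // Fill the tensor in place; each scheme lives in its own subclass.
  virtual void init(Tensor &t) const = 0;
};

struct XavierInit : Initializer {
  size_t fanIn;
  explicit XavierInit(size_t fanIn) : fanIn(fanIn) {}
  void init(Tensor &t) const override {
    t.initXavier(fanIn); // delegate to the existing Tensor method
  }
};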

Either way adding a few more initialization algorithms may be worthwhile for the future.

Support non-square padding for Convolution operator

Convolution is the most important operator for Computer Vision NNs.

The simplest implementation (our Interpreter backend) could be found here:

void Interpreter::fwdConvolutionInst_FloatImpl(Value *inV, Value *outV,

It has parameters, such as group, pad, stride, filterSize.

Convolution is a 2D operation (it always works with a 2D image and a 2D filter, and produces a 2D result). Right now, all of our Convolution implementations assume that the pad is the same for both dimensions (image height and width). But for some models (very rarely) that's not true. In the general case, there can be 4 different pads: left, right, top, and bottom. We need to change "pad" to be a vector instead of a single number (for ConvolutionNode and ConvolutionInstr), and change all the implementations: Interpreter (float and quantized) and CPU (float and quantized). Later, we also need to support uneven padding in ConvolutionGradNode (the backward pass for Convolution).
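For reference, the output size along each spatial dimension with independent low/high padding follows the usual formula; a small sketch (the helper name is illustrative):

// Output spatial size of a convolution along one dimension, where padLow
// and padHigh may differ (top vs. bottom, or left vs. right).
size_t convOutputSize(size_t in, size_t filter, size_t stride,
                      size_t padLow, size_t padHigh) {
  return (in + padLow + padHigh - filter) / stride + 1;
}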

Note that our Convolution for the CPU backend is highly optimized, and in theory complicating its code may cost performance. In practice, I highly doubt that this will happen, but it's worth checking performance after the change just to be sure. One can run tests/images/run.sh -cpu -iterations=100 with and without the change, and make sure that the average performance numbers are within 1%.

Here is ONNX specification of Conv operator:
https://github.com/onnx/onnx/blob/master/docs/Operators.md#conv
One can notice that not only is padding represented as an array there; stride and filterSize are represented as arrays as well. Changing these is lower priority, and there's no immediate need. It's just a good thing to have.

Another direction of future work: support non-square padding in PoolMax and PoolAvg operators.

Compilation time and LLVM dependency reduction

LLVM takes quite a while to compile, does the entire library need to be built?

It looks like these are the dependencies, and I'm wondering if it's possible to reduce the compilation time:

grep -r "include \"llvm" include/ src/ | sed 's/.*#/#/g' | sort | uniq
#include "llvm/Support/ErrorHandling.h"
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/Hashing.h"
#include "llvm/ADT/PointerUnion.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/Casting.h"
#include "llvm/Support/CommandLine.h"
#include "llvm/Support/Debug.h"
#include "llvm/Support/NativeFormatting.h"
#include "llvm/Support/raw_ostream.h"

Should lower/optimize behave differently based on the compilation mode?

As of today, our API takes the CompilationMode for both the lowering and optimization phases (glow::lower and glow::optimize).
Except for one thing in the lowering, which I will come back to later, this argument does not change the behavior of these functions, and thus I was wondering if we should pass it at all.
My way of thinking is: if we don't use it, let's not pass it, and if/when we need it we can add it.

Now going back to the only user of this argument in lowering.
The lowering of batch normalization is the only user of that argument, but I believe we could avoid that use, and that would be a better design. Basically, we lower BN differently between inference and training, whereas I think the right thing would be to have a different representation when we differentiate BN. In essence, what I am suggesting for this use case is to introduce a new node for mean and variance normalization (i.e., for what we currently special-case with the training mode during lowering). In other words, differentiation of BN would produce BNGrad(BN(meanVarNorm)), and that would completely eliminate the need for a special lowering.

All in all, as of today, I don't see a reason why we pass this mode.

What do people think?

Add support for FP16 and bfloat

Working with accelerators, we are going to need support for 16-bit floating point types.

  • FP16 is the binary16 format defined in IEEE 754-2008.
  • bfloat is the Tensorflow floating point type with the same exponent range as a float.
Type                                          binary16   bfloat   binary32
k, storage width in bits                      16         16       32
w, exponent field width in bits               5          8        8
t, trailing significand field width in bits   10         7        23

We should at a minimum add these types to glow::ElemKind so we can represent low-precision tensors in these types. Ideally, also add interpreter support.
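A hedged sketch of the ElemKind addition (the existing members are abbreviated and the new names are illustrative):

enum class ElemKind : unsigned char {
  FloatTy,    // 32-bit float (binary32)
  Float16Ty,  // proposed: IEEE 754-2008 binary16
  BFloat16Ty, // proposed: bfloat16, 8-bit exponent / 7-bit significand
  // ... existing integer and quantized kinds ...
};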

image-classifier behaves differently with one or several images for densenet121

I noticed that the image-classifier returns different results when invoked with one file at a time versus all of the same files at once, and I was wondering if this is expected.

To reproduce:
$ ./bin/image-classifier tests/images/imagenet/* -image_mode=0to1 -m=densenet121/model.onnx
$ ./bin/image-classifier tests/images/imagenet/cat_285.png -image_mode=0to1 -m=densenet121/model.onnx

Result:
With all the files at once:
Model: densenet121/model.onnx
File: tests/images/imagenet/cat_285.png Result:281
File: tests/images/imagenet/dog_207.png Result:207
File: tests/images/imagenet/zebra_340.png Result:340

With one at a time:
Model: densenet121/model.onnx
File: tests/images/imagenet/cat_285.png Result:674
Model: densenet121/model.onnx
File: tests/images/imagenet/dog_207.png Result:674
Model: densenet121/model.onnx
File: tests/images/imagenet/zebra_340.png Result:674

Also when there is only one file given, the process takes much longer to produce a result.

Is this expected?

[Feature request] Tensor attributes

It would be useful to have more attributes on tensor objects.
Specifically, whether a tensor is an input/output of the function, whether it's constant or mutable during inference, etc.
Alternatively, this could be added as a utility function that backends can invoke.
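A hedged sketch of the attributes in question (names are illustrative):

struct TensorAttributes {
  bool isInput;    // bound to a function input
  bool isOutput;   // bound to a function output
  bool isConstant; // immutable during inference (e.g., weights)
};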

Add Autoencoder example

An autoencoder is a very basic neural network that takes in some input, encodes it into a lower-dimensional representation, and then decodes it back into something that should ideally be identical to the input. We should try to implement such an autoencoder with Glow. An extension of this would be to make a variational autoencoder.

Integrate IWYU

Putting this up as a "wishlist" issue: since we already have clang-tidy, let's also integrate Include-What-You-Use. LLVM has a lot of different headers, so having a tool that tells us what to include where will be healthy for the future and avoid things like cyclic dependencies or other badness that can arise from C++'s lack of a module system. It also avoids the situation where you remove one header and the whole house of cards of mismanaged includes collapses. I managed to integrate it with my last LLVM project. It's excruciatingly slow, but it would be great to have as a separate build rule to run overnight or before lunch once in a while.

[Broadcasting] Concat inputs scale with the broadcast size.

We represent a broadcast as a series of concats. For example if we want to broadcast a tensor A with dims = {1} to a tensor B with dims = {10}, B is a ConcatNode with A repeatedly input 10 times.

This works nicely because we do not need a BroadcastNode. Additionally, we IRGen this ConcatNode with repeated inputs into a single InsertTensorInst with some count value, i.e. the InsertTensorInst does not have its input count scale with the broadcast size.

However, this scaling issue does occur at the Node level. If the broadcast size is large, for example if the destination tensor B has dims = {2048}, then B will be a ConcatNode with 2048 inputs. Not great.

To improve this, one option I am considering is to add a vector member to ConcatNode that represents how many times to concatenate each input. For example, instead of the above ConcatNode B having A input 2048 times, it has A input once, as well as a new "count" member vector {2048}, signaling that A should be inserted 2048 times.

If we wanted to also Concat some other Node C 10 times at the end, then this "count" member vector would be {2048, 10}, and B would have A and C input once each.

This would mean ConcatNode more closely matches our InsertTensorInst, which ConcatNode is IRGen'd into anyway. However, it adds some complexity to the ConcatNode.
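A hedged sketch of the shape of this proposal (member names are illustrative):

#include <vector>

class ConcatNode : public Node {
  std::vector<NodeValue> inputs_; // e.g., {A, C}: each input listed once
  std::vector<unsigned> counts_;  // e.g., {2048, 10}: repeat count per input
  unsigned dim_;                  // dimension along which to concatenate
};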

What does everyone think?

Thousands of libjit_stacked_kernelXXX doing the same thing are issued

I was looking at the LLVM IR produced by the image-classifier using tests/images/run.sh, and I noticed that we issue thousands of copies of this kernel:
define internal void @libjit_stacked_kernel.1575(float* noalias) {
entry:
br label %loop

loop: ; preds = %loop, %entry
%1 = phi i64 [ , %entry ], [ %nextvar, %loop ]
%2 = call float @libjit_splat_kernel_f(i64 %1, float SomeNumber, float* null, float* null)
%buffer.element.addr = getelementptr float, float* %0, i64 %1
store float %2, float* %buffer.element.addr
%nextvar = add nuw nsw i64 %1, SomeOtherNumber
%loopcond = icmp ult i64 %nextvar, AnotherNumber
br i1 %loopcond, label %loop, label %afterloop, !llvm.loop !

afterloop: ; preds = %loop
ret void
}

It would be nice to reuse them if at all possible (I haven't checked, but the code seems to be duplicated quite a lot), so that we reduce our memory footprint and our compile time.

icon/logo design proposal

Hi, I am a graphic designer and I want to contribute to your nice project by proposing an icon/logo design.
If you like it and want to use it, I will gladly send you the files as a gift, for free. Here is what I came up with:

Design idea: the letter G is prominent as the initial of Glow. It also shows gears, which represent a machine or engine, and a sun that glows, which literally represents Glow.

icon
[image: glow-01]

logotype
[image: glow-02]

I hope you like it and want to use it. If there's any modification you would like, please let me know and I will edit it.

Thanks and best regards!
-Tobaloidee

OpenCL backend leaks memory

I ran a build with LeakSanitizer enabled and it pointed to a few leaks.

cmake -DGLOW_USE_SANITIZER=Address ~/src/glow
export ASAN_OPTIONS=detect_leaks=1
ninja graphTest
./tests/graphTest

The leaks are of allocations inside OpenCL, but we should make sure we're not misusing the API.

e.g.,

Indirect leak of 36 byte(s) in 1 object(s) allocated from:
    #0 0x103024c33 in wrap_malloc (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x56c33)
    #1 0x7fff7e6f38b2 in _Block_copy (libsystem_blocks.dylib:x86_64+0x8b2)
    #2 0x7fff7e6f3b3d in _Block_object_assign (libsystem_blocks.dylib:x86_64+0xb3d)
    #3 0x7fff7e6f38ef in _Block_copy (libsystem_blocks.dylib:x86_64+0x8ef)
    #4 0x7fff7e66bfd3 in _dispatch_Block_copy (libdispatch.dylib:x86_64+0x1fd3)
    #5 0x7fff7e683a07 in _dispatch_source_set_cancel_handler (libdispatch.dylib:x86_64+0x19a07)
    #6 0x103023f9e in wrap_dispatch_source_set_cancel_handler (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x55f9e)
    #7 0x7fff7867db1a in (anonymous namespace)::RunElsewhere::instance() (SkyLight:x86_64+0x23b1a)
    #8 0x7fff787ba521 in __SLSInitialize_block_invoke (SkyLight:x86_64+0x160521)
    #9 0x7fff7e66bd4f in _dispatch_client_callout (libdispatch.dylib:x86_64+0x1d4f)
    #10 0x7fff7e66bd02 in dispatch_once_f (libdispatch.dylib:x86_64+0x1d02)
    #11 0x7fff78828d08 in SLSMainConnectionID (SkyLight:x86_64+0x1ced08)
    #12 0x7fff604e5c9a in gfxLoadPluginData (libGFXShared.dylib:x86_64+0xc9a)
    #13 0x7fff5f30308a  (OpenCL:x86_64+0x2008a)
    #14 0x7fff7e9309bd in __pthread_once_handler (libsystem_pthread.dylib:x86_64+0x39bd)
    #15 0x7fff7e925eff in _os_once (libsystem_platform.dylib:x86_64+0xeff)
    #16 0x7fff7e930958 in pthread_once (libsystem_pthread.dylib:x86_64+0x3958)
    #17 0x7fff5f302e67  (OpenCL:x86_64+0x1fe67)
    #18 0x7fff5f3034e1 in clGetDeviceIDs (OpenCL:x86_64+0x204e1)
    #19 0x1003bee98 in glow::OCLBackend::OCLBackend() OpenCL.cpp:89
    #20 0x1003bebda in glow::createOCLBackend() OpenCL.cpp:87
    #21 0x10002395b in getConvNodeSize(glow::BackendKind) graphTest.cpp:408
    #22 0x100024477 in Graph_disableUnrollingGroupConv_Test::TestBody() graphTest.cpp:438
    #23 0x100268e1f in testing::Test::Run() gtest.cc
    #24 0x10026c688 in testing::TestInfo::Run() gtest.cc:2654
    #25 0x10026dc94 in testing::TestCase::Run() gtest.cc:2772
    #26 0x10028e1c7 in testing::internal::UnitTestImpl::RunAllTests() gtest.cc:4677
    #27 0x10028cfb5 in testing::UnitTest::Run() gtest.cc
    #28 0x1002c7b3f in main gtest.h:2237
    #29 0x7fff7e6a5114 in start (libdyld.dylib:x86_64+0x1114)

Should we lower QuantizationProfile node?

I was looking at the QuantizationProfile node and the related QuantizationProfile instruction, and it looks like we only provide interpreter support.
So here are my questions:

  • Do we expect every backend to come up with their own support for this instruction? or,
  • Should we lower the node in something more chewable for the backends?

If we expect every backend to come up with its own support, we should make an extra effort to document the expectations of this operator.

[CI jobs] More tests need to run in CI

Glow contains a variety of validation tests:

  1. Unit tests: these run as part of the CI currently and verify inference as well as training.
  2. Larger tests that exercise training/backpropagation (cifar10, mnist, ptb) and typically take a very long time to execute. The execution time can be tuned by introducing a dynamic number of iterations if needed.
  3. Inference on well-known pre-trained models: resnet50, vgg19, squeezenet, shufflenet, etc. Model definitions in ONNX or Caffe2 format are required for these tests. You can check how those models are downloaded for ONNX and Caffe2. These tests do not assert current results against expected ones and are used in manual validation. Sample input/output looks like:
     Model: resnet50/predict_net.pb
     File: tests/images/imagenet/cat_285.png Result:285

The goal here is to introduce a script which allows running (3) on CI.
A few things to consider before implementing:

  • Measure how much time it takes to download weights/models and run validation
  • Depending on time, enable validation on every PR or as part of the daily travis cron job
  • Create a script which makes the test fail when the output mismatches the expected output (run this on CI)

Important scenarios include the AOT examples, which should be validated in CI in a similar way to what we propose for (3). The AOT examples also require downloading weights and models.

Quick compilation test on macOS Sierra 10.12.6 failing

[ 28%] Building CXX object src/glow/Base/CMakeFiles/Base.dir/Train.cpp.o
In file included from /Users/bwasti/Glow/src/glow/Base/Train.cpp:3:
In file included from /Users/bwasti/Glow/include/glow/Base/Train.h:4:
/Users/bwasti/Glow/include/glow/Base/Tensor.h:158:26: error: no member named 'drop_front' in
      'llvm::ArrayRef<unsigned long>'
    auto dim = t->dims().drop_front();
               ~~~~~~~~~ ^
/Users/bwasti/Glow/include/glow/Base/Tensor.h:173:35: error: no member named 'drop_front' in
      'llvm::ArrayRef<unsigned long>'
    auto onceSliceDim = t->dims().drop_front();
                        ~~~~~~~~~ ^
/Users/bwasti/Glow/include/glow/Base/Tensor.h:175:35: error: no member named 'drop_front' in
      'llvm::ArrayRef<unsigned long>'
    assert(onceSliceDim == dims().drop_front() && "Invalid slice size");
                           ~~~~~~ ^
/usr/include/assert.h:93:25: note: expanded from macro 'assert'
    (__builtin_expect(!(e), 0) ? __assert_rtn(__func__, __FILE__, __LINE__, #e) : (void)0)
                        ^
In file included from /Users/bwasti/Glow/src/glow/Base/Train.cpp:3:
In file included from /Users/bwasti/Glow/include/glow/Base/Train.h:4:
/Users/bwasti/Glow/include/glow/Base/Tensor.h:357:32: error: no member named 'drop_front' in
      'llvm::ArrayRef<unsigned long>'
    Tensor slice(elemTy, sizes.drop_front());
                         ~~~~~ ^
4 errors generated.
make[2]: *** [src/glow/Base/CMakeFiles/Base.dir/Train.cpp.o] Error 1
make[1]: *** [src/glow/Base/CMakeFiles/Base.dir/all] Error 2
make: *** [all] Error 2

The definition seems to exist in external/llvm/include/llvm/ADT/ArrayRef.h, so I'm not really sure where the error is coming from.

$ strings external/llvm/lib/* 2>/dev/null | c++filt -n | grep "llvm::ArrayRef<unsigned long>" | grep drop_front
llvm::ArrayRef<unsigned long>::drop_front(unsigned long) const

[Quantization] Allow running end2end quantization tests on OpenCL backend

Currently, there are two end2end tests for quantization which make sure that we can:

  • Profile and Quantize the graph
  • Execute inference for the quantized graph on a specific backend

Note that profiling is always done with the Interpreter backend, while quantization and inference are run on a specific backend.

See here for the reference.

@ZchiPitt we've been talking about this. Created this issue just to make sure it's not lost.

Are size_t & uint64_t really necessary in the OpenCL kernels.cl?

First of all, NVIDIA's CUDA OpenCL implementation doesn't support size_t in kernel arguments.
Besides, when compiling the executable as a 32-bit application, uint64_t is about to become another issue, because in the kernel we have:

typedef struct {
  cl_uint64_t n; // Number of samples
  cl_uint64_t h; // Height
  cl_uint64_t w; // Width
  cl_uint64_t c; // Number of channels
} ShapeNHWC;

but in C++

struct ShapeNCHW {
  size_t n; // Number of samples
  size_t c; // Number of Channels
  size_t h; // Height
  size_t w; // Width
};

There is an inconsistency between the OpenCL kernel and the C++ code regarding size_t/uint64_t.
We may need to define a fixed convention for when to use size_t and uint64_t.
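A hedged sketch of one such convention: use an explicitly sized type on both sides, so the host struct layout matches the kernel regardless of the host's pointer width (the typedef mirrors the existing cl_uint64_t naming):

#include <cstdint>

// Host side: fixed-width fields that mirror the kernel struct exactly.
typedef uint64_t cl_uint64_t;

struct ShapeNCHW {
  cl_uint64_t n; // Number of samples
  cl_uint64_t c; // Number of channels
  cl_uint64_t h; // Height
  cl_uint64_t w; // Width
};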

[quantization] Add symmetric quantization schema support to Glow

Current state

The latest version of Glow compiler provides a certain level of quantization support.

Quantization process consists of the following steps:

  1. Instrument an original floating-point graph with the special Quantization Profile node

  2. Run inference on the instrumented graph many times with good representative inputs (see dump_profile flag).
    During this step, Glow automatically captures the distribution of all activations in the compute graph. The distribution contains the min/max values seen at the given output of the graph node, as well as detailed frequencies per floating-point range within the [min, max] interval. There are 2000 floating-point ranges kept as part of the Quantization Profile node.
    As a result of the profiling procedure, Glow generates Scale and Offset parameters for all activations and dumps them to a file (note: currently Glow uses linear quantization). The Scale is a positive fp32 number, while the Offset is an int32. Note that Glow does not use the distribution of floating-point numbers but relies only on the values of min and max (this is a separate issue and out of scope here).

  3. Transform the computation graph based on the captured profile and the specific execution backend. Note that not all nodes are quantized, as not all backends support every quantized op. See the CPU backend op support here, for example.

  4. Perform computation of quantized graph. See Interpreter implementation here.

What needs to be enhanced

There are backends that are tied to a specific quantization schema, e.g., symmetric quantization.
The difference between symmetric and asymmetric quantization is that the "offset" parameter equals 0 in symmetric linear quantization. In this case, dequantization follows the formula fp32_number = scale * quantized_number (versus fp32_number = scale * (quantized_number - offset) in the asymmetric case).
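A hedged sketch of the symmetric scale choice for signed int8 (the details are illustrative):

#include <cmath>

// With offset fixed at 0, the scale must cover the larger magnitude of the
// two range endpoints; a one-sided range (e.g., ReLU outputs in [0, max])
// wastes the negative half of the codes, i.e., effectively one of the 8 bits.
float chooseSymmetricScale(float min, float max) {
  float amax = std::fmax(std::fabs(min), std::fabs(max));
  return amax / 127.0f; // int8 codes in [-127, 127], offset == 0
}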

Requirement for the symmetric quantization

  • Need to make sure that the accuracy loss is comparable to the asymmetric case on Resnet50. Important note: currently, activations are signed int8, and in the case of symmetric quantization, the outputs of Sigmoid/ReLU will effectively use only 7 bits out of 8 (which is undesirable).

The issue could be tackled in two steps:

  • Introduce the symmetric schema and deal with the int8 activations (accuracy loss)
  • Introduce int8 and uint8 activations. Check if a specific backend can handle int8/uint8 activations for a given Op (this way we could gradually onboard backends). This is a bigger change and would require a design discussion here.

Please discuss this issue here before implementation.
cc: @qcolombet

Rename Pool Nodes and Instrs

Our pooling nodes and instructions are named PoolAvg and PoolMax. Everywhere else in ML they are called AvgPool and MaxPool, respectively. It would be great to do a mechanical rename.

Use deterministic PRNG seeds for testing

Over in #1065, we have an example of a unit test that fails intermittently depending on the pseudo-random numbers delivered by nextRand(). This function uses a static PRNG, so while it is seeded deterministically, reordering tests will give them different pseudo-random numbers.

The PRNG state is also shared between threads, so in a multithreaded environment the pseudo-random numbers will be non-deterministic.

One way of solving this problem would be to make the PRNG state a member of the Module class and change the initXavier() and randomize() methods of Tensor to take an explicit PRNG state parameter.
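A hedged sketch of that change (the signatures are illustrative):

#include <random>

class Module {
  // Deterministically seeded, per-module PRNG: reordering tests or running
  // them on multiple threads no longer perturbs the random sequence.
  std::mt19937 prng_{0};
public:
  std::mt19937 &getPRNG() { return prng_; }
};

// The Tensor methods would then take the state explicitly, e.g.:
//   void initXavier(size_t fanIn, std::mt19937 &prng);
//   void randomize(float low, float high, std::mt19937 &prng);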

Need to canonicalize or optimize high-dim concat to concat+transpose

The picture below depicts a concat node that joins nodes on dimension number 1. The IR that we generate for this code is inefficient for two reasons. First, we can't optimize the operator that writes the result, because the result is scattered across the 2nd dimension (dim zero is the first). Second, we emit a sequence of insert_tensor instructions that process the tensor several times, invalidating the cache. A much better way would be to represent this as a dim-0 concat followed by a transpose (e.g., for 2D inputs, concat_1(A, B) == transpose(concat_0(transpose(A), transpose(B)))).

Design question: I am not sure if this should be the canonical representation, the only representation or simply a target specific optimization.

[screenshot: a graph with a concat node joining inputs on dimension 1]

shareBuffers fails to merge chain of copies

As found in #1209, our shareBuffers optimization fails to merge chains of copies.
The problem is illustrated by the test case in #1237.

Copy/pasting the analysis made in #1209:
In a nutshell, our buffer optimization only looks at merging one segment at a time. With the lowering of the load node, we have for the first time potentially more than one copy on a chain to a use (e.g., with the lowering of reshape(load)).
Now to illustrate the problem with chain of copy, let us consider:
1: a = copy b
2: c = copy a
3: = c
We have the following segments:
a = (1, 2); b = (-, 1); c = (2, 3)

First we merge c and a:
1: a = copy b
2: a = copy a
3: = a
Now the segments look like:
a = (1, 2), (2, 3); b = (-,1)

When considering merging a with b, we only consider updating the segment of a that encloses the segment of b, thus (1,2).
The updated code looks like:
1: b = copy b
2: a = copy b
3: = a
We don't update the old c segment and end up with a copy in 2 now.

Interestingly, if you delete the useless copies as you go, you actually end up with incorrect code, e.g., for the previous example:
1: b = copy b
---2: a = copy b---- deleted
3: = a <--- a never defined

This is because when we update the instruction, we don't preserve the last definition of the outgoing variable, and since we don't update the uses outside the segment, they potentially reference an undefined variable.

It looks like it works just by accident but maybe it was intended this way?

Long story short, we need to lift the one-segment-at-a-time limitation (maybe by merging the segments?).

Support GlobalAveragePool

ONNX implementation of squeezenet uses GlobalAveragePool. Here's the spec.

GlobalAveragePool can be represented as a simple AveragePool with kernel size equal to the size of the image.

One way to support it is to "lower" it at the Graph creation stage, i.e., do something similar to ChannelShuffle (commits: 1 2).
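A hedged sketch of that approach, assuming NHWC layout and the current square-kernel limitation (the helper and the createPoolAvg call are illustrative, not necessarily the real API):

// GlobalAveragePool is an AveragePool whose kernel spans the whole image.
Node *createGlobalAvgPool(Function *F, NodeValue in) {
  auto dims = in.dims();   // {N, H, W, C}
  size_t kernel = dims[1]; // H (== W for square inputs)
  return F->createPoolAvg("gap", in, kernel, /*stride=*/1, /*pad=*/0);
}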

Alternatively, we could create GlobalAveragePoolNode and lower it at normal lowering stage, but I don't see any benefits of doing so.

Also, AveragePool in the ONNX spec supports averaging across an arbitrarily-shaped N-dimensional box; currently we support only square boxes. This may or may not be an issue for lowering GlobalAveragePool.

Resnet 50 example segfault

Hi,

I'm trying to run the Resnet 50 example in examples/bundles. My examples/bundles/resnet50/Makefile has

LOADER?=~/build_Debug/bin/image-classifier
GLOW_SRC?=~/glow

Upon executing make run, I get

File ‘resnet50/predict_net.pbtxt’ already there; not retrieving.

File ‘resnet50/predict_net.pb’ already there; not retrieving.

File ‘resnet50/init_net.pb’ already there; not retrieving.

Model: resnet50_profile
 File: /home/ubuntu/glow/tests/images/imagenet/cat_285.png Result:285
 File: /home/ubuntu/glow/tests/images/imagenet/dog_207.png Result:207
 File: /home/ubuntu/glow/tests/images/imagenet/zebra_340.png Result:340
Segmentation fault (core dumped)
Makefile:37: recipe for target 'build/resnet50.o' failed
make: *** [build/resnet50.o] Error 139

Then, without quantization (QUANTIZE=NO make run), I get

File ‘resnet50/predict_net.pbtxt’ already there; not retrieving.

File ‘resnet50/predict_net.pb’ already there; not retrieving.

File ‘resnet50/init_net.pb’ already there; not retrieving.

Segmentation fault (core dumped)
Makefile:42: recipe for target 'build/resnet50.o' failed
make: *** [build/resnet50.o] Error 139

Device: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
OS: Ubuntu 16.04.4 LTS

Assumptions baked in the Graph optimizer not checked anywhere

Some of the Graph optimizations assume that certain nodes have variables as their inputs (e.g., batch normalization, convolution) and that these variables are used only in those nodes.
Although this currently matches what the importers emit, it creates an unwelcome coupling. In essence, one can spot usages of cast<Variable> sprinkled throughout the code base.
We should either:

  1. Check these assumptions in the related verify methods, or
  2. Fix the optimizer (and the hand written test cases)

I am tempted to go with #2 because:
A. Only the optimizer, importer, and test cases have these assumptions baked in (for instance, the CPU optimization for convolution, optimizeCPUConv, is fine without them; the same goes for the lowering and the interpreter).
B. It is not unlikely that constant Variables with the same value would be uniqued and shared across different uses, and that's never checked.
C. With the introduction of the load node, those optimizations will need to be taught how to look through loads anyway.
