nunoplopes / torchy

A tracing JIT compiler for PyTorch

License: MIT License

C++ 85.82% Python 13.96% Shell 0.04% PHP 0.17%

jit-compiler lazy-evaluation pytorch tracing-jit

torchy's Issues

PyTorch's redispatch functions copy input arguments

build/aten/src/ATen/RedispatchFunctions.cpp:

    at::Tensor conv1d(..., std::string padding,...) {
        static auto op = ...;
        return op.redispatch(dispatchKeySet, input, weight, bias, stride, padding, dilation, groups);
    }

The redispatch call should std::move() the arguments that are taken by value (such as the std::string padding above); currently they are copied.
We should upstream a patch that mimics the logic in our gen.py.

hash collisions in program cache don't seem relevant

Tried a better hash:

#define HASH_COMBINE(hash, ty, v) ((hash) * 31 + std::hash<ty>()(v))
  for (unsigned i = 0; i < key.num_ops; ++i) {
    hash = HASH_COMBINE(hash, uint16_t, key.ops[i].id);
    for (auto &arg : key.ops[i].args) {
      if (auto *v = get_if<int64_t>(&arg)) {
        hash = HASH_COMBINE(hash, int64_t, *v);
      } else if (auto *v = get_if<vector<long>>(&arg)) {
        for (auto n : *v) {
          hash = HASH_COMBINE(hash, long, n);
        }
      }
    }
  }
#undef HASH_COMBINE

This reduced collisions to almost zero, but made no measurable performance difference.
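For experimenting with the combining scheme outside the C++ code base, the same multiply-by-31 idea can be sketched in Python. This is only an illustration: the `(op_id, args)` tuple layout and the names `hash_combine` and `hash_trace` are assumptions made for the example, and values are folded in directly rather than through `std::hash`.

```python
MASK64 = (1 << 64) - 1  # emulate 64-bit unsigned wraparound


def hash_combine(h, value):
    """Fold one integer into the running hash: h * 31 + value (mod 2^64)."""
    return (h * 31 + (value & MASK64)) & MASK64


def hash_trace(ops):
    """Hash a trace given as a list of (op_id, args) pairs.

    Hypothetical layout: each arg is either an int or a list of ints;
    other argument kinds are ignored, as in the C++ snippet above.
    """
    h = 0
    for op_id, args in ops:
        h = hash_combine(h, op_id)
        for arg in args:
            if isinstance(arg, int):
                h = hash_combine(h, arg)
            elif isinstance(arg, (list, tuple)):
                for n in arg:
                    h = hash_combine(h, n)
    return h
```

Because the combine step is order-sensitive, traces that differ only in argument order hash differently, which is what brings the collision count down.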

Check why shallow_copy_from is called on a wrong object

It required this patch:

diff --git a/c10/core/TensorImpl.cpp b/c10/core/TensorImpl.cpp
index a0c7673641..3c027a4a17 100644
--- a/c10/core/TensorImpl.cpp
+++ b/c10/core/TensorImpl.cpp
@@ -480,9 +480,12 @@ void TensorImpl::copy_tensor_metadata_except_version_counter(
     const TensorImpl* src_impl,
     TensorImpl* dest_impl,
     bool allow_tensor_metadata_change) {
-  dest_impl->storage_ = src_impl->storage_;
-  dest_impl->sizes_and_strides_ = src_impl->sizes_and_strides_;
-  dest_impl->storage_offset_ = src_impl->storage_offset_;
+  dest_impl->storage_ = src_impl->storage();
+  dest_impl->sizes_and_strides_.set_sizes(src_impl->sizes());
+  auto strides = src_impl->strides();
+  memcpy(dest_impl->sizes_and_strides_.strides_data(), strides.begin(),
+         sizeof(int64_t) * strides.size());
+  dest_impl->storage_offset_ = src_impl->storage_offset();
   dest_impl->data_type_ = src_impl->data_type_;
   dest_impl->device_opt_ = src_impl->device_opt_;
   dest_impl->key_set_ = src_impl->key_set_;

Ideally the patch wouldn't be needed, since shallow_copy_from would only be called between Torchy objects. So why does pickle use different objects?

This is the backtrace, captured while executing a TorchVision model:

(gdb) bt
#0  c10::TensorImpl::shallow_copy_from (this=0x555558e79400, impl=...)
    at ../c10/core/TensorImpl.h:1270
#1  0x00007fffbc5610d6 in torch::autograd::VariableHooks::set_data (
    this=<optimized out>, self=..., new_data=...)
    at ../torch/csrc/autograd/variable.cpp:440
#2  0x00007fffc2eeac63 in THPVariable_set_data (self=0x7fff7d5b7840,
    data=0x7fff7d5b4980, unused=<optimized out>)
    at ../torch/csrc/autograd/python_variable.cpp:316
#3  0x00005555556cf597 in _PyObject_GenericSetAttrWithDict ()
    at /tmp/build/80754af9/python_1599203911753/work/Objects/object.c:1366
#4  0x00005555556cf687 in PyObject_GenericSetAttr (value=0x7fff7d5b4980,
    name=<optimized out>, obj=0x7fff7d5b7840)
    at /tmp/build/80754af9/python_1599203911753/work/Objects/object.c:1416
#5  PyObject_SetAttr ()
    at /tmp/build/80754af9/python_1599203911753/work/Objects/object.c:1045
#6  0x00005555557156b7 in _PyEval_EvalFrameDefault ()
    at /tmp/build/80754af9/python_1599203911753/work/Python/ceval.c:2372
#7  0x00005555556df86b in function_code_fastcall (globals=<optimized out>,
    nargs=2, args=<optimized out>, co=<optimized out>)
    at /tmp/build/80754af9/python_1599203911753/work/Objects/call.c:283
#8  _PyFunction_Vectorcall.localalias.355 ()
    at /tmp/build/80754af9/python_1599203911753/work/Objects/call.c:410
#9  0x00005555556dfe79 in _PyObject_Vectorcall (kwnames=0x0, nargsf=2,
    args=0x7fffffffb610, callable=0x7fff848710d0)
    at /tmp/build/80754af9/python_1599203911753/work/Include/cpython/abstract.h:127
#10 method_vectorcall ()
    at /tmp/build/80754af9/python_1599203911753/work/Objects/classobject.c:89
#11 0x00005555555d22d6 in _PyObject_Vectorcall (kwnames=0x0, nargsf=1,
    args=0x7fffffffb6b0, callable=0x7fff7f353a40)
    at /tmp/build/80754af9/python_1599203911753/work/Include/cpython/abstract.h:127
#12 _PyObject_FastCall ()
    at /tmp/build/80754af9/python_1599203911753/work/Include/cpython/abstract.h:147
#13 object_vacall (base=<optimized out>, callable=0x7fff7f353a40,
    vargs=<optimized out>)
    at /tmp/build/80754af9/python_1599203911753/work/Objects/call.c:1186
#14 0x0000555555691e1e in PyObject_CallFunctionObjArgs (
    callable=<optimized out>)
    at /tmp/build/80754af9/python_1599203911753/work/Objects/call.c:1259
#15 0x00007fff84762615 in _Pickle_FastCall (obj=0x7fff7d5b0770,
    func=0x7fff7f353a40)
    at /usr/local/src/conda/python-3.8.5/Modules/_pickle.c:362
#16 load_build.isra.38 ()
    at /usr/local/src/conda/python-3.8.5/Modules/_pickle.c:6707
#17 load () at /usr/local/src/conda/python-3.8.5/Modules/_pickle.c:6961
#18 0x00005555556c3e6a in method_vectorcall_NOARGS ()

PyTorch's dispatcher callback mechanism seems broken

This supposedly no-op patch made a big difference:

diff --git a/tensor.cpp b/tensor.cpp
index 6415571..c7339bd 100644
--- a/tensor.cpp
+++ b/tensor.cpp
@@ -786,12 +786,18 @@ bool register_in_place(const Tensor &t0, TorchOp op, DispatchKeySet ks,

 #include "autogen/dispatch_wrappers.h"

+tuple<at::Tensor,at::Tensor,at::Tensor> wrap_native_layer_norm(c10::DispatchKeySet dispatchKeySet, const at::Tensor & input, at::IntArrayRef normalized_shape, const c10::optional<at::Tensor> & weight, const c10::optional<at::Tensor> & bias, double eps) {
+  dispatchKeySet = dispatchKeySet & DispatchKeySet(DispatchKeySet::FULL_AFTER, DISPATCHKEY);
+  return at::redispatch::native_layer_norm(dispatchKeySet, input, normalized_shape, weight, bias, eps);
+}
+
 TORCH_LIBRARY_IMPL(_, DISPATCHKEY_NO_NS, m) {
   m.fallback(torch::CppFunction::makeFallthrough());
 }

 TORCH_LIBRARY_IMPL(aten, DISPATCHKEY_NO_NS, m) {
 #include "autogen/torch_library_table.h"
+m.impl("native_layer_norm", wrap_native_layer_norm);
 }

 TORCH_LIBRARY_IMPL(_, AUTOGRADDISPATCHKEY_NO_NS, m) {

The input tensor has 2 dispatch keys set: CPU & our own.

Without the patch, we get redispatched (through the fallback mechanism) to at::native::math_native_layer_norm (registered in aten/src/ATen/RegisterCompositeImplicitAutograd.cpp, i.e., under the 'CompositeImplicitAutograd' key).
With the patch, we get redispatched to at::native::layer_norm_cpu instead.

Why the difference? I've no idea 😅 Though the fallback mechanism should be equivalent to the redispatch above (AFAIU).

(repro with benchmarks/inference-huggingface/sentiment-analysis.py)

isinstance(t, *Tensor) broken

File lib/python3.8/site-packages/torchvision/transforms/functional.py exposed a bug with isinstance. It has this code:

    img = img.permute((2, 0, 1)).contiguous()
    if isinstance(img, torch.ByteTensor):
        return img.to(dtype=default_float_dtype).div(255)
    else:
        return img

With Torchy, the isinstance check always returns False, and the code crashes later on.

Simple repro:

x = torch.zeros([4, 3], dtype=torch.uint8)
print(isinstance(x, torch.ByteTensor))
print(x.dtype)

I've no idea where ByteTensor is defined, or how zeros() ends up creating a ByteTensor object.
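One possible explanation, stated here as an assumption rather than something verified against the PyTorch sources: if the legacy tensor types customize isinstance through a metaclass that compares the tensor's dtype and backend, then a tensor carrying an extra (Torchy) dispatch key would never match. The sketch below models that mechanism in plain Python; `LegacyTypeMeta`, `FakeTensor` and the `backend` strings are all hypothetical names.

```python
from dataclasses import dataclass


class LegacyTypeMeta(type):
    """Metaclass that decides isinstance() from tensor metadata, not from the class."""

    def __instancecheck__(cls, obj):
        return (getattr(obj, "dtype", None) == cls.dtype
                and getattr(obj, "backend", None) == cls.backend)


class ByteTensor(metaclass=LegacyTypeMeta):
    # Hypothetical stand-in for a legacy type such as torch.ByteTensor
    dtype = "uint8"
    backend = "CPU"


@dataclass
class FakeTensor:
    # Hypothetical stand-in for a tensor; 'backend' models the dispatch key
    dtype: str
    backend: str


cpu_tensor = FakeTensor("uint8", "CPU")
torchy_tensor = FakeTensor("uint8", "Torchy")
```

In this model, `isinstance(cpu_tensor, ByteTensor)` is True while `isinstance(torchy_tensor, ByteTensor)` is False even though the dtype matches, which mirrors the symptom in the repro.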

Trace cache: add shape, dtype, etc for key equality (but not the hash)

The TorchScript backend uses: t.scalar_type(), t.device(), t.sizes(), t.strides(), t.requires_grad(), t.is_contiguous()
In theory it can specialize traces for this data, so we need to take them into account when looking up traces in the cache.
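A minimal sketch of that idea, assuming a cache keyed by an object whose hash covers only the op sequence while equality also compares the per-input metadata. The names `TensorMeta` and `TraceKey` and the string encodings of dtype and device are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorMeta:
    """Metadata a backend might specialize on (hypothetical encoding)."""
    dtype: str
    device: str
    sizes: tuple
    strides: tuple
    requires_grad: bool
    contiguous: bool


class TraceKey:
    def __init__(self, ops, inputs):
        self.ops = tuple(ops)        # op ids and constant arguments
        self.inputs = tuple(inputs)  # one TensorMeta per trace input

    def __hash__(self):
        # Cheap hash: the ops only, so metadata never has to be hashed
        return hash(self.ops)

    def __eq__(self, other):
        # Full comparison: the ops and the input metadata must both match
        return (isinstance(other, TraceKey)
                and self.ops == other.ops
                and self.inputs == other.inputs)


small = TensorMeta("float32", "cpu", (1, 2), (2, 1), False, True)
large = TensorMeta("float32", "cpu", (4, 2), (2, 1), False, True)

cache = {
    TraceKey(["add", "mul"], [small]): "program specialized for [1, 2]",
    TraceKey(["add", "mul"], [large]): "program specialized for [4, 2]",
}
```

Both keys land in the same hash bucket but stay distinct entries, so a lookup only returns a compiled program whose specialization matches the inputs.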

arange.start_out: bad shape data

$ python benchmarks/inference-huggingface/sentiment-analysis.py --torchy
...
%21 = <Long> arange.start_out 0, 5, 1, %21 #refs=2 #output

The shape information is missing because the out tensor always has 'shape=[0]', regardless of the real output shape.
Need to confirm whether this arange.out is being dispatched from arange.

For in-place ops, we can't change the shape information ahead of executing the op, as that would change the tensor itself. Maybe it is OK for "out" tensors?

super slow (10+ hours) shape inference

[3456/3464] shape fused_moving_avg_obs_fake_quant
[3457/3464] shape rnn_tanh_cell
[3458/3464] shape rnn_relu_cell
[3459/3464] shape _embedding_bag_sparse_backward
bash: line 1: 130795 Terminated              ./infer_shapes _embedding_bag_sparse_backward > shapes/_embedding_bag_sparse_backward.txt 2> /dev/null
[3460/3464] shape _embedding_bag_dense_backward
bash: line 1:  1622 Terminated              ./infer_shapes _embedding_bag_dense_backward > shapes/_embedding_bag_dense_backward.txt 2> /dev/null
[3461/3464] shape cudnn_convolution_add_relu
bash: line 1: 130682 Terminated              ./infer_shapes cudnn_convolution_add_relu > shapes/cudnn_convolution_add_relu.txt 2> /dev/null
[3462/3464] shape batch_norm_backward_elemt
bash: line 1:  6466 Terminated              ./infer_shapes batch_norm_backward_elemt > shapes/batch_norm_backward_elemt.txt 2> /dev/null
[3463/3464] shape _embedding_bag_backward
bash: line 1: 130733 Terminated              ./infer_shapes _embedding_bag_backward > shapes/_embedding_bag_backward.txt 2> /dev/null

(killed after 10 hours)

Detection of dead ops breaks when ops return aliased storage

Example:

x = torch.zeros([1,2])
y = torch.ones([2])

w = x.view([2])
w.add_(y)
w = None

print(x)

If we ignore that view returns a tensor whose storage is aliased with x, we would mark view & add_ as dead. But these operations are needed: besides their direct effect on w, they also change x.

We need a list of ops that alias storage in order to re-enable dead op detection.
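A rough sketch of alias-aware dead-op detection, under several assumptions: the trace is a list of ops whose arguments are either indices of earlier ops or names of trace inputs, `VIEW_OPS` is a small hypothetical (and incomplete) list of ops whose result shares storage with the first argument, and in-place ops are recognized by the trailing underscore naming convention.

```python
from dataclasses import dataclass

# Hypothetical, incomplete list of ops whose result aliases their first argument
VIEW_OPS = {"view", "permute", "transpose"}


@dataclass
class Op:
    name: str
    args: list      # ints refer to earlier ops; strings name trace inputs
    has_refs: bool  # True if user code still holds a reference to the result


def storage_root(trace, value):
    """Follow view / in-place chains back to the value that owns the storage."""
    while isinstance(value, int):
        op = trace[value]
        if op.name in VIEW_OPS or op.name.endswith("_"):
            value = op.args[0]
        else:
            break
    return value


def live_ops(trace):
    """Return the indices of ops that must be executed."""
    # Storage reachable from a referenced result is observable
    observable = {storage_root(trace, i)
                  for i, op in enumerate(trace) if op.has_refs}
    needed = set()
    for i in reversed(range(len(trace))):
        op = trace[i]
        root = storage_root(trace, i)
        # An in-place op is live if it writes into a trace input or into
        # storage that something live reads later in the trace
        writes_observable = op.name.endswith("_") and (
            isinstance(root, str) or root in observable)
        if op.has_refs or writes_observable or i in needed:
            needed.add(i)
            for arg in op.args:
                observable.add(storage_root(trace, arg))
                if isinstance(arg, int):
                    needed.add(arg)
    return sorted(needed)


# The example above: w = x.view([2]); w.add_(y); w = None
example = [Op("view", ["x"], False), Op("add_", [0, "y"], False)]
```

For `example`, both ops are kept because add_ writes into storage owned by the input x. The same two ops applied to a fresh, unreferenced tensor would still be reported as dead.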

Tensor reference in trace input leads to copies in make_variable

For example, when running zeros(), underneath PyTorch first creates a new tensor and then calls zero_.
So we end up with a trace like:

%0 = <Float> zero_ in<0> #refs E/I=1/2 #output shape=[1, 2]

Inputs:
in<0>: tensor(Float : [1, 2])

We now have a reference to the tensor that is returned by zeros.

Now let's look at the code in torch/csrc/autograd/generated/variable_factories.h:

inline at::Tensor zeros(at::IntArrayRef size, at::TensorOptions options = {}) {
  at::AutoDispatchBelowADInplaceOrView guard;
  return autograd::make_variable(at::zeros(size, at::TensorOptions(options).requires_grad(c10::nullopt)), /*requires_grad=*/options.requires_grad());
}

And now in torch/csrc/autograd/variable.h:

inline Variable make_variable(at::Tensor data, bool requires_grad = false, bool allow_tensor_metadata_change = true) {
  if (data.defined()) {
    if (data.getIntrusivePtr().use_count() == 1 && data.getIntrusivePtr()->unique_version()) {
      // reuse tensor
    } else {
      auto data_impl_copy = data.getIntrusivePtr()->shallow_copy_and_detach(...);
      return Variable(data_impl_copy); // <-- missing std::move here btw
    }
  }
  return Variable();
}

Because the trace holds that extra reference, use_count() is no longer 1, so this function is forced to create an unnecessary copy of the tensor.
Can this be fixed?

shape inference doesn't use multiple Scalars

It's important for reshape-style operations to try Scalars other than 0. Arange also supports floats, for example.
The concern is the increased running time.
Doing it manually for now.

PyTorch's ParsedArgs::tensor() copies the returned tensor

e.g.: pytorch/torch/csrc/autograd/python_variable.cpp

static PyObject* THPVariable_make_subclass(PyObject* _ignored, PyObject* args, PyObject* kwargs) {
...
  auto data = r.tensor(1).detach();

That r.tensor(1) copies the tensor and then detaches. Unnecessary copy.

PythonArgs::optionalTensor should be changed to return a null/non-null ptr instead of optional

Globals destruction order may cause deadlock with ~Trace

If Python exits and there's something in the trace, Torchy will hold pointers to PyTorch tensors in Trace::inputs.
The call trace for ~Trace looks like:

#0  futex_abstimed_wait_cancelable()
#1  __pthread_cond_wait_common()
#2  __pthread_cond_timedwait()
#3  PyCOND_TIMEDWAIT()
#4  take_gil()
#5  PyEval_AcquireThread()
#6  pybind11::gil_scoped_acquire::gil_scoped_acquire()
#7  std::_Function_handler<void (void*), torch::utils::tensor_from_numpy(_object*, bool)::{lambda(void*)#1}>::_M_invoke(std::_Any_data const&, void*&&) ()
#8  c10::deleteInefficientStdFunctionContext(void*)
#9  c10::StorageImpl::release_resources()
#10 c10::TensorImpl::release_resources()
#11 c10::intrusive_ptr<c10::intrusive_ptr_target, c10::UndefinedTensorImpl>::reset_()
#12 c10::intrusive_ptr<c10::intrusive_ptr_target, c10::UndefinedTensorImpl>::~intrusive_ptr()
#13 c10::IValue::destroy()
#14 c10::IValue::~IValue()
#15 std::_Destroy<c10::IValue>()
#16 std::_Destroy_aux<false>::__destroy<c10::IValue*>()
#17 std::_Destroy<c10::IValue*>()
#18 std::_Destroy<c10::IValue*, c10::IValue>()
#19 std::vector<c10::IValue, std::allocator<c10::IValue> >::~vector()
#20 Trace::~Trace()
#21 __run_exit_handlers()
#22 __GI_exit()
#23 __libc_start_main()
#24 _start () at python_1635226063427/work/Parser/parser.c:325

~Trace must run before PyTorch (and the Python interpreter) is torn down. Can we register a callback somehow, maybe in our PYBIND11_MODULE definition?
