nunoplopes / torchy
A tracing JIT compiler for PyTorch
License: MIT License
build/aten/src/ATen/RedispatchFunctions.cpp:
at::Tensor conv1d(..., std::string padding,...) {
  static auto op = ...;
  return op.redispatch(dispatchKeySet, input, weight, bias, stride, padding, dilation, groups);
}
The redispatch call should have a std::move() on some of the args (e.g. the by-value std::string padding).
We should upstream a patch that mimics the logic in our gen.py.
Tried a better hash:
#define HASH_COMBINE(hash, ty, v) ((hash) * 31 + std::hash<ty>()(v))
for (unsigned i = 0; i < key.num_ops; ++i) {
  hash = HASH_COMBINE(hash, uint16_t, key.ops[i].id);
  for (auto &arg : key.ops[i].args) {
    if (auto *v = get_if<int64_t>(&arg)) {
      hash = HASH_COMBINE(hash, int64_t, *v);
    } else if (auto *v = get_if<vector<long>>(&arg)) {
      for (auto n : *v) {
        hash = HASH_COMBINE(hash, long, n);
      }
    }
  }
}
#undef HASH_COMBINE
This reduced collisions to almost zero, but made no measurable performance difference.
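For experimenting with the combine step outside of C++, here is a minimal Python sketch of the same multiply-by-31 scheme. It assumes std::hash on integers is the identity (true for libstdc++, not guaranteed elsewhere) and emulates 64-bit wraparound with a mask. The (op id, args) tuple layout is made up for illustration and is not Torchy's real key type.

```python
from typing import List, Sequence, Tuple, Union

MASK64 = (1 << 64) - 1  # emulate unsigned 64-bit wraparound

Arg = Union[int, Sequence[int]]
Op = Tuple[int, List[Arg]]  # hypothetical layout: (op id, arguments)


def hash_combine(h: int, v: int) -> int:
    """Multiply-by-31 combine, truncated to 64 bits.

    Assumes the element hash is the identity function.
    """
    return (h * 31 + (v & MASK64)) & MASK64


def trace_hash(ops: Sequence[Op]) -> int:
    """Hash the op ids and integer arguments of a trace, in order."""
    h = 0
    for op_id, args in ops:
        h = hash_combine(h, op_id)
        for arg in args:
            if isinstance(arg, int):
                h = hash_combine(h, arg)
            else:  # a list of integers, e.g. a shape
                for n in arg:
                    h = hash_combine(h, n)
    return h
```

Note that list arguments are flattened into the same stream as scalars, so an op with args (2, [3]) hashes the same as one with args (2, 3). That is harmless as long as the key comparison is exact.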
It required this patch:
diff --git a/c10/core/TensorImpl.cpp b/c10/core/TensorImpl.cpp
index a0c7673641..3c027a4a17 100644
--- a/c10/core/TensorImpl.cpp
+++ b/c10/core/TensorImpl.cpp
@@ -480,9 +480,12 @@ void TensorImpl::copy_tensor_metadata_except_version_counter(
const TensorImpl* src_impl,
TensorImpl* dest_impl,
bool allow_tensor_metadata_change) {
- dest_impl->storage_ = src_impl->storage_;
- dest_impl->sizes_and_strides_ = src_impl->sizes_and_strides_;
- dest_impl->storage_offset_ = src_impl->storage_offset_;
+ dest_impl->storage_ = src_impl->storage();
+ dest_impl->sizes_and_strides_.set_sizes(src_impl->sizes());
+ auto strides = src_impl->strides();
+ memcpy(dest_impl->sizes_and_strides_.strides_data(), strides.begin(),
+ sizeof(int64_t) * strides.size());
+ dest_impl->storage_offset_ = src_impl->storage_offset();
dest_impl->data_type_ = src_impl->data_type_;
dest_impl->device_opt_ = src_impl->device_opt_;
dest_impl->key_set_ = src_impl->key_set_;
Ideally it wouldn't be needed, as shallow_copy_from would only be called between Torchy objects. So why does pickle use different objects?
This is the backtrace, while executing a TorchVision model:
(gdb) bt
#0 c10::TensorImpl::shallow_copy_from (this=0x555558e79400, impl=...)
at ../c10/core/TensorImpl.h:1270
#1 0x00007fffbc5610d6 in torch::autograd::VariableHooks::set_data (
this=<optimized out>, self=..., new_data=...)
at ../torch/csrc/autograd/variable.cpp:440
#2 0x00007fffc2eeac63 in THPVariable_set_data (self=0x7fff7d5b7840,
data=0x7fff7d5b4980, unused=<optimized out>)
at ../torch/csrc/autograd/python_variable.cpp:316
#3 0x00005555556cf597 in _PyObject_GenericSetAttrWithDict ()
at /tmp/build/80754af9/python_1599203911753/work/Objects/object.c:1366
#4 0x00005555556cf687 in PyObject_GenericSetAttr (value=0x7fff7d5b4980,
name=<optimized out>, obj=0x7fff7d5b7840)
at /tmp/build/80754af9/python_1599203911753/work/Objects/object.c:1416
#5 PyObject_SetAttr ()
at /tmp/build/80754af9/python_1599203911753/work/Objects/object.c:1045
#6 0x00005555557156b7 in _PyEval_EvalFrameDefault ()
at /tmp/build/80754af9/python_1599203911753/work/Python/ceval.c:2372
#7 0x00005555556df86b in function_code_fastcall (globals=<optimized out>,
nargs=2, args=<optimized out>, co=<optimized out>)
at /tmp/build/80754af9/python_1599203911753/work/Objects/call.c:283
#8 _PyFunction_Vectorcall.localalias.355 ()
at /tmp/build/80754af9/python_1599203911753/work/Objects/call.c:410
#9 0x00005555556dfe79 in _PyObject_Vectorcall (kwnames=0x0, nargsf=2,
args=0x7fffffffb610, callable=0x7fff848710d0)
at /tmp/build/80754af9/python_1599203911753/work/Include/cpython/abstract.h:127
#10 method_vectorcall ()
at /tmp/build/80754af9/python_1599203911753/work/Objects/classobject.c:89
#11 0x00005555555d22d6 in _PyObject_Vectorcall (kwnames=0x0, nargsf=1,
args=0x7fffffffb6b0, callable=0x7fff7f353a40)
at /tmp/build/80754af9/python_1599203911753/work/Include/cpython/abstract.h:127
#12 _PyObject_FastCall ()
at /tmp/build/80754af9/python_1599203911753/work/Include/cpython/abstract.h:147
#13 object_vacall (base=<optimized out>, callable=0x7fff7f353a40,
vargs=<optimized out>)
at /tmp/build/80754af9/python_1599203911753/work/Objects/call.c:1186
#14 0x0000555555691e1e in PyObject_CallFunctionObjArgs (
callable=<optimized out>)
at /tmp/build/80754af9/python_1599203911753/work/Objects/call.c:1259
#15 0x00007fff84762615 in _Pickle_FastCall (obj=0x7fff7d5b0770,
func=0x7fff7f353a40)
at /usr/local/src/conda/python-3.8.5/Modules/_pickle.c:362
#16 load_build.isra.38 ()
at /usr/local/src/conda/python-3.8.5/Modules/_pickle.c:6707
#17 load () at /usr/local/src/conda/python-3.8.5/Modules/_pickle.c:6961
#18 0x00005555556c3e6a in method_vectorcall_NOARGS ()
This supposedly no-op patch made a big difference:
diff --git a/tensor.cpp b/tensor.cpp
index 6415571..c7339bd 100644
--- a/tensor.cpp
+++ b/tensor.cpp
@@ -786,12 +786,18 @@ bool register_in_place(const Tensor &t0, TorchOp op, DispatchKeySet ks,
#include "autogen/dispatch_wrappers.h"
+tuple<at::Tensor,at::Tensor,at::Tensor> wrap_native_layer_norm(c10::DispatchKeySet dispatchKeySet, const at::Tensor & input, at::IntArrayRef normalized_shape, const c10::optional<at::Tensor> & weight, const c10::optional<at::Tensor> & bias, double eps) {
+ dispatchKeySet = dispatchKeySet & DispatchKeySet(DispatchKeySet::FULL_AFTER, DISPATCHKEY);
+ return at::redispatch::native_layer_norm(dispatchKeySet, input, normalized_shape, weight, bias, eps);
+}
+
TORCH_LIBRARY_IMPL(_, DISPATCHKEY_NO_NS, m) {
m.fallback(torch::CppFunction::makeFallthrough());
}
TORCH_LIBRARY_IMPL(aten, DISPATCHKEY_NO_NS, m) {
#include "autogen/torch_library_table.h"
+m.impl("native_layer_norm", wrap_native_layer_norm);
}
TORCH_LIBRARY_IMPL(_, AUTOGRADDISPATCHKEY_NO_NS, m) {
The input tensor has 2 dispatch keys set: CPU & our own.
Without the patch, we get redispatched (through the fallback mechanism) to at::native::math_native_layer_norm (registered in aten/src/ATen/RegisterCompositeImplicitAutograd.cpp, i.e. under the 'CompositeImplicitAutograd' key).
With the patch, we get redispatched to at::native::layer_norm_cpu instead.
Why the difference? I have no idea. The fallback mechanism should be equivalent to the redispatch above (AFAIU).
(repro with benchmarks/inference-huggingface/sentiment-analysis.py)
File lib/python3.8/site-packages/torchvision/transforms/functional.py exposed a bug with isinstance. It has this code:
img = img.permute((2, 0, 1)).contiguous()
if isinstance(img, torch.ByteTensor):
    return img.to(dtype=default_float_dtype).div(255)
else:
    return img
With Torchy the isinstance check always returns False, and then it crashes.
Simple repro:
x = torch.zeros([4, 3], dtype=torch.uint8)
print(isinstance(x, torch.ByteTensor))
print(x.dtype)
I've no idea where ByteTensor is defined or how zeros() ends up creating a ByteTensor object.
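One thing that may be worth checking (an assumption, not verified against the PyTorch sources here): isinstance does not have to reflect a real class hierarchy, because a metaclass can override __instancecheck__ and decide based on properties of the object. If torch.ByteTensor works along these lines and the check also looks at the backend or dispatch key, a tensor carrying an extra key could fail it even though its dtype is uint8. The pure-Python sketch below only illustrates the mechanism; all class names and attributes in it are hypothetical.

```python
class LegacyTypeMeta(type):
    """Hypothetical metaclass: isinstance() is decided by object properties."""

    def __instancecheck__(cls, obj) -> bool:
        return (getattr(obj, "dtype", None) == cls.dtype
                and getattr(obj, "backend", None) == cls.backend)


class ByteTensorLike(metaclass=LegacyTypeMeta):
    """Stand-in for a legacy tensor type; never instantiated."""
    dtype = "uint8"
    backend = "CPU"


class FakeTensor:
    """Stand-in for a tensor with a dtype and a single backend tag."""

    def __init__(self, dtype: str, backend: str) -> None:
        self.dtype = dtype
        self.backend = backend


cpu_bytes = FakeTensor("uint8", "CPU")
torchy_bytes = FakeTensor("uint8", "Torchy")  # same dtype, different backend
```

Here isinstance(cpu_bytes, ByteTensorLike) succeeds while isinstance(torchy_bytes, ByteTensorLike) does not, although no object is ever constructed from ByteTensorLike.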
The TorchScript backend uses: t.scalar_type(), t.device(), t.sizes(), t.strides(), t.requires_grad(), t.is_contiguous()
In theory it can specialize the compiled code on these properties, so we need to take them into account when looking up traces in the cache.
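A rough sketch of what such a cache key could look like, assuming the six properties above are recorded per trace input. TensorSpec, TraceKey and lookup_or_compile are made-up names for illustration, not Torchy's actual data structures, and the "compilation" is just a placeholder string.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class TensorSpec:
    """Hypothetical summary of the per-input properties listed above."""
    scalar_type: str
    device: str
    sizes: Tuple[int, ...]
    strides: Tuple[int, ...]
    requires_grad: bool
    is_contiguous: bool


@dataclass(frozen=True)
class TraceKey:
    ops: Tuple[str, ...]            # operations in the trace, in order
    inputs: Tuple[TensorSpec, ...]  # one spec per trace input


cache: Dict[TraceKey, str] = {}


def lookup_or_compile(key: TraceKey) -> str:
    """Return the cached program for key, compiling on a miss."""
    if key not in cache:
        cache[key] = f"compiled#{len(cache)}"  # stand-in for real compilation
    return cache[key]
```

Frozen dataclasses get __eq__ and __hash__ derived from their fields, so two traces with the same ops but different input strides end up in different cache entries.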
$ python benchmarks/inference-huggingface/sentiment-analysis.py --torchy
...
%21 = <Long> arange.start_out 0, 5, 1, %21 #refs=2 #output
Shape information is missing because the out tensor always has shape=[0], regardless of the real output shape.
We need to confirm whether this arange.out is being dispatched from arange.
For in-place ops, we can't change the shape information ahead of executing the op, since doing so changes the tensor itself. Maybe it is OK for "out" tensors?
[3456/3464] shape fused_moving_avg_obs_fake_quant
[3457/3464] shape rnn_tanh_cell
[3458/3464] shape rnn_relu_cell
[3459/3464] shape _embedding_bag_sparse_backward
bash: line 1: 130795 Terminated ./infer_shapes _embedding_bag_sparse_backward > shapes/_embedding_bag_sparse_backward.txt 2> /dev/null
[3460/3464] shape _embedding_bag_dense_backward
bash: line 1: 1622 Terminated ./infer_shapes _embedding_bag_dense_backward > shapes/_embedding_bag_dense_backward.txt 2> /dev/null
[3461/3464] shape cudnn_convolution_add_relu
bash: line 1: 130682 Terminated ./infer_shapes cudnn_convolution_add_relu > shapes/cudnn_convolution_add_relu.txt 2> /dev/null
[3462/3464] shape batch_norm_backward_elemt
bash: line 1: 6466 Terminated ./infer_shapes batch_norm_backward_elemt > shapes/batch_norm_backward_elemt.txt 2> /dev/null
[3463/3464] shape _embedding_bag_backward
bash: line 1: 130733 Terminated ./infer_shapes _embedding_bag_backward > shapes/_embedding_bag_backward.txt 2> /dev/null
(killed after 10 hours)
Example:
x = torch.zeros([1,2])
y = torch.ones([2])
w = x.view([2])
w.add_(y)
w = None
print(x)
If we ignore that view returns a tensor whose storage is aliased with x, we would mark view & add_ as dead. But these operations are needed because they change x, besides their direct impact on w.
We need a list of ops that alias storage to reenable dead op detection.
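To make the idea concrete, here is a small Python sketch of alias-aware dead op detection on a toy trace. The Op record and its alias_of / in_place flags are hypothetical; the real list of aliasing ops would have to come from PyTorch's operator metadata. The analysis is deliberately conservative: an in-place op is kept whenever anything sharing its storage is still referenced.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Op:
    name: str
    args: List[int]                 # indices of earlier ops used as operands
    has_refs: bool = False          # is the result still referenced by the program?
    alias_of: Optional[int] = None  # position in args whose storage the result shares
    in_place: bool = False          # does the op write to the storage of args[0]?


def live_ops(trace: List[Op]) -> Set[int]:
    # Map every value to the op that owns its underlying storage.
    root: List[int] = []
    for i, op in enumerate(trace):
        if op.in_place:
            root.append(root[op.args[0]])
        elif op.alias_of is not None:
            root.append(root[op.args[op.alias_of]])
        else:
            root.append(i)

    # Storages that are observable from outside the trace.
    visible = {root[i] for i, op in enumerate(trace) if op.has_refs}

    live: Set[int] = set()

    def mark(i: int) -> None:
        # An op is needed together with everything it depends on.
        if i not in live:
            live.add(i)
            for a in trace[i].args:
                mark(a)

    for i, op in enumerate(trace):
        if op.has_refs or (op.in_place and root[i] in visible):
            mark(i)
    return live


# The example above: x = zeros, y = ones, w = x.view, w.add_(y), w = None.
example = [
    Op("zeros", [], has_refs=True),
    Op("ones", [], has_refs=True),
    Op("view", [0], alias_of=0),
    Op("add_", [2, 1], in_place=True),
]
```

With the alias information, live_ops(example) keeps all four ops. If view is treated as returning fresh storage (alias_of=None), view and add_ are wrongly dropped.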
For example, when running zeros(), PyTorch underneath first creates a new tensor and then calls zero_.
So we end up with a trace like:
%0 = <Float> zero_ in<0> #refs E/I=1/2 #output shape=[1, 2]
Inputs:
in<0>: tensor(Float : [1, 2])
We now have a reference to the tensor that is returned by zeros.
Now let's look at the code in torch/csrc/autograd/generated/variable_factories.h:
inline at::Tensor zeros(at::IntArrayRef size, at::TensorOptions options = {}) {
  at::AutoDispatchBelowADInplaceOrView guard;
  return autograd::make_variable(at::zeros(size, at::TensorOptions(options).requires_grad(c10::nullopt)), /*requires_grad=*/options.requires_grad());
}
And now in torch/csrc/autograd/variable.h:
inline Variable make_variable(at::Tensor data, bool requires_grad = false, bool allow_tensor_metadata_change = true) {
  if (data.defined()) {
    if (data.getIntrusivePtr().use_count() == 1 && data.getIntrusivePtr()->unique_version()) {
      // reuse tensor
    } else {
      auto data_impl_copy = data.getIntrusivePtr()->shallow_copy_and_detach(...);
      return Variable(data_impl_copy); // <-- missing std::move here btw
    }
  }
  return Variable();
}
Because we hold that reference, use_count() is greater than 1, so we force this function to create a copy of the tensor unnecessarily.
Can this be fixed?
It's important for reshape-style operations to try Scalars other than 0. arange also supports floats, for example.
The concern is the increased running time.
Doing it manually for now.
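The running-time concern can be quantified with a quick sketch: if every scalar argument is tried with each of k probe values, an op with n scalar arguments needs k^n runs. The probe set below is an arbitrary assumption chosen for illustration, not what the shape inference tool actually uses.

```python
from itertools import product
from typing import Iterator, Sequence, Tuple, Union

Scalar = Union[int, float]

# Hypothetical probe values: a few integers plus one float for ops such as arange.
PROBES: Tuple[Scalar, ...] = (0, 1, 2, 0.5)


def scalar_combinations(num_scalar_args: int,
                        probes: Sequence[Scalar] = PROBES) -> Iterator[Tuple[Scalar, ...]]:
    """Yield every assignment of probe values to the scalar arguments."""
    return product(probes, repeat=num_scalar_args)


def num_runs(num_scalar_args: int, probes: Sequence[Scalar] = PROBES) -> int:
    """Number of runs needed to try every combination."""
    return len(probes) ** num_scalar_args
```

With four probe values, an op taking three scalars already needs 64 runs instead of 1, which multiplies with whatever is enumerated for the tensor arguments.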
e.g.: pytorch/torch/csrc/autograd/python_variable.cpp
static PyObject* THPVariable_make_subclass(PyObject* _ignored, PyObject* args, PyObject* kwargs) {
...
auto data = r.tensor(1).detach();
That r.tensor(1) copies the tensor and then detaches: an unnecessary copy.
PythonArgs::optionalTensor should be changed to return a null/non-null ptr instead of an optional.
If Python exits and there's something in the trace, Torchy will hold pointers to PyTorch tensors in Trace::inputs.
The call trace for ~Trace looks like:
#0 futex_abstimed_wait_cancelable()
#1 __pthread_cond_wait_common()
#2 __pthread_cond_timedwait()
#3 PyCOND_TIMEDWAIT()
#4 take_gil()
#5 PyEval_AcquireThread()
#6 pybind11::gil_scoped_acquire::gil_scoped_acquire()
#7 std::_Function_handler<void (void*), torch::utils::tensor_from_numpy(_object*, bool)::{lambda(void*)#1}>::_M_invoke(std::_Any_data const&, void*&&) ()
#8 c10::deleteInefficientStdFunctionContext(void*)
#9 c10::StorageImpl::release_resources()
#10 c10::TensorImpl::release_resources()
#11 c10::intrusive_ptr<c10::intrusive_ptr_target, c10::UndefinedTensorImpl>::reset_()
#12 c10::intrusive_ptr<c10::intrusive_ptr_target, c10::UndefinedTensorImpl>::~intrusive_ptr()
#13 c10::IValue::destroy()
#14 c10::IValue::~IValue()
#15 std::_Destroy<c10::IValue>()
#16 std::_Destroy_aux<false>::__destroy<c10::IValue*>()
#17 std::_Destroy<c10::IValue*>()
#18 std::_Destroy<c10::IValue*, c10::IValue>()
#19 std::vector<c10::IValue, std::allocator<c10::IValue> >::~vector()
#20 Trace::~Trace()
#21 __run_exit_handlers()
#22 __GI_exit()
#23 __libc_start_main()
#24 _start () at python_1635226063427/work/Parser/parser.c:325
~Trace must be called before PyTorch is destroyed. Do we need to register a callback somehow? Maybe in our PYBIND11_MODULE def?
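One option, sketched in Python, is the standard atexit module: on a normal interpreter shutdown its handlers run while the interpreter is still usable, so the GIL can still be taken when tensors are released. The Trace class below is only a stand-in for Torchy's C++ object, and flush is a hypothetical method name. The same registration could presumably be done from C++ inside PYBIND11_MODULE by importing atexit and passing it a bound function.

```python
import atexit
from typing import Any, List


class Trace:
    """Stand-in for Torchy's trace; `inputs` mimics Trace::inputs."""

    def __init__(self) -> None:
        self.inputs: List[Any] = []

    def flush(self) -> None:
        # A real implementation would execute the pending ops first.
        # Dropping the references here releases the tensors while Python
        # is still alive.
        self.inputs.clear()


trace = Trace()

# On normal shutdown this runs before process-level exit handlers
# (where the ~Trace call in the backtrace above comes from).
atexit.register(trace.flush)
```

The handler is idempotent, so it is fine for the destructor to run later on an already empty trace.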