Comments (14)
The behavior I observed is that the autotuner only stops after all remaining generations have finished, which is clearly not the intended functionality.
My suggestion would be to revert to abort (or, even better, to have an option that switches between the two behaviors). To work around the Python interpreter issue, we could launch the autotuner in a separate process, as @ftynse suggested.
from tensorcomprehensions.
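The separate-process workaround could be sketched roughly as follows (all names here are illustrative, not the actual TC API): the tuner runs in a child process and publishes its progress through shared memory, so the parent can terminate it mid-generation without harming its own interpreter.

```python
import multiprocessing as mp
import time

def run_autotuner(best, max_generations=1000):
    # Hypothetical stand-in for the real tuning loop; each iteration is
    # one "generation". The shared value lets the parent read the best
    # result found so far even if it kills the child mid-generation.
    for generation in range(max_generations):
        time.sleep(0.01)          # placeholder for compile + GPU benchmark
        best.value = generation   # publish intermediate progress

def tune_with_hard_stop(budget_s):
    """Run the tuner in a child process so the parent can kill it
    without taking down its own interpreter."""
    best = mp.Value("i", -1)      # -1 means "no generation finished yet"
    worker = mp.Process(target=run_autotuner, args=(best,))
    worker.start()
    worker.join(budget_s)         # in real use: wait for SIGINT/SIGTERM instead
    if worker.is_alive():
        worker.terminate()        # hard stop mid-generation; only the child dies
        worker.join()
    return best.value

if __name__ == "__main__":
    print("best generation seen:", tune_with_hard_stop(0.1))
```

The key point is that `terminate()` only kills the worker process, so the interpreter that launched the tuning survives regardless of what the tuner was doing at that instant.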
Thanks @ttheodor, let me reproduce this and look into the Python side. I'll report back. Right now, the autotuning is indeed stopped once the signal is caught, but only after the current generation has finished running.
The constraint on the Python side is that calling abort in the autotuner, as we did earlier, would kill the Python interpreter, so we throw an exception instead.
Hi @ttheodor, thank you for the insight. It should not wait for all generations to finish; perhaps something got broken in the code logic. Are you on the master branch or the dev branch?
On dev. Can you confirm that it works on your machine?
Seems to work for me as expected (abort at the end of the current generation):
[ RUN ] ATenCompilationUnitTest.LayerNorm
Generation 0 Jobs(Compiled, GPU)/total (8, 2)/8 (best/median/worst)us: 64/75/75^CAutotuning aborted.
No filepath provided, not saving cache
Generation 0 Jobs(Compiled, GPU)/total (8, 8)/8 (best/median/worst)us: 62/75/132
unknown file: Failure
C++ exception with description "Abort requested" thrown in the test body.
[ FAILED ] ATenCompilationUnitTest.LayerNorm (13082 ms)
Can you run it with --smoke_check=0 --gtest_filter='*TensorDot' --tuner_threads=8
(choose a high number of threads).
Works too, but it takes a non-negligible amount of time to finish the generation:
$ ./build/test/test_autotuner --smoke_check=0 --gtest_filter='*TensorDot' --tuner_threads=60
Note: Google Test filter = *TensorDot
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from ATenCompilationUnitTest
[ RUN ] ATenCompilationUnitTest.TensorDot
---------------------------------------------------------
--------------------- KERNEL STATS ----------------------
------------------ 100 ITERATIONS ----------------
---------------------------------------------------------
Min: 8781us, p50: 9207us, p90: 9388us, p99: 9541us, Max: 9541us
---------------------------------------------------------
---------------------------------------------------------
----------------------- TOTAL STATS --------------------
------------------ 100 ITERATIONS ----------------
---------------------------------------------------------
Min: 8970us, p50: 9254us, p90: 9448us, p99: 9602us, Max: 9602us
---------------------------------------------------------
Generation 0 Jobs(Compiled, GPU)/total (60, 0)/100^CAutotuning aborted.
No filepath provided, not saving cache
Generation 0 Jobs(Compiled, GPU)/total (100, 100)/100 (best/median/worst)us: 3178/17622/126168
unknown file: Failure
C++ exception with description "Abort requested" thrown in the test body.
[ FAILED ] ATenCompilationUnitTest.TensorDot (68657 ms)
[----------] 1 test from ATenCompilationUnitTest (68657 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (68657 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] ATenCompilationUnitTest.TensorDot
1 FAILED TEST
Hi @ttheodor, it looks like the SIGINT/SIGTERM handling is working as expected. Perhaps something is not right in your build env? Can you try again and let us know if this works.
It is working but it's still annoying:
Sending a SIGTERM signal to a process and then waiting seconds to minutes (depending on how large each generation is) is not a behavior I would expect.
Terminating a process when it receives a SIGTERM is the most intuitive thing to do. So why are we not terminating as soon as the caches are dumped on a SIGTERM?
Hi @ttheodor, we wait for the current generation to finish. Before the next generation starts, we check for the signal and don't start it if SIGINT/SIGTERM has been received. But we don't interrupt the current generation when the signal arrives. I agree that it would be nice to kill the current generation as well, but this requires deeper code changes. Feel free to take a stab at it. :)
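The control flow described above can be sketched like this (function names are illustrative): the abort flag is consulted only at generation boundaries, which is exactly why an in-flight generation always runs to completion.

```python
def tune(generations, jobs_per_generation, run_job, abort_requested, save_cache):
    """Sketch of the tuner's outer loop as described in this thread."""
    for gen in range(generations):
        # The signal is checked only here, *before* a generation starts...
        if abort_requested():
            save_cache()
            raise RuntimeError("Abort requested")
        # ...so once these jobs are dispatched, they all run to completion
        # even if SIGINT/SIGTERM arrives in the middle of the generation.
        for job in range(jobs_per_generation):
            run_job(gen, job)   # compile + benchmark; not interruptible here
    save_cache()
```

With this structure, a signal received during generation N only takes effect at the check before generation N+1, matching the observed "abort at the end of the current generation" behavior.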
I still think it can be handled by running the autotuner as a separate process and then killing that process from the Python side.
@prigoyal
Would the following behavior be acceptable?
On SIGINT: maintain the current behavior.
On SIGTERM: dump caches and abort.
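The proposed split could look roughly like this (helper names are hypothetical): SIGINT keeps the soft "finish the generation, then stop" path, while SIGTERM dumps the caches and terminates immediately.

```python
import signal
import sys

def install_tuner_signal_handlers(save_cache, request_soft_stop):
    """Hypothetical sketch of the SIGINT/SIGTERM split proposed above."""
    def on_sigint(signum, frame):
        # Soft stop: the flag is checked between generations, so the
        # current generation finishes before the tuner shuts down.
        request_soft_stop()

    def on_sigterm(signum, frame):
        # Hard stop: persist the best options found so far, then exit.
        save_cache()
        sys.exit(1)

    signal.signal(signal.SIGINT, on_sigint)
    signal.signal(signal.SIGTERM, on_sigterm)
```

This keeps Ctrl-C friendly for interactive use while giving process managers (which conventionally send SIGTERM) the prompt termination they expect.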
Hi @ttheodor, that sounds reasonable. :)
Closed by #206.