Comments (14)

thetheodor avatar thetheodor commented on August 15, 2024 1

The behavior I observed is that it only stops after all remaining generations have finished (which is obviously not the intended functionality).

My suggestion would be to revert to aborting (or, even better, to have an option that switches between the two behaviors). To work around the Python interpreter issue, we could launch the autotuner in a separate process, as @ftynse suggested.

from tensorcomprehensions.

prigoyal avatar prigoyal commented on August 15, 2024

Thanks @ttheodor, let me reproduce this and look into the Python side; I'll report back. Right now, the autotuning is indeed stopped once the signal is caught, but only after the current generation has finished running.

The constraint from the Python side is that aborting during autotuning, as we did earlier, would kill the Python interpreter, so we throw an exception instead.

prigoyal avatar prigoyal commented on August 15, 2024

Hi @ttheodor, thank you for the insight. It should not wait for all generations to finish; perhaps something got broken in the code logic. Are you on the master branch or the dev branch?

thetheodor avatar thetheodor commented on August 15, 2024

On dev. Can you confirm that it works on your machine?

ftynse avatar ftynse commented on August 15, 2024

Seems to work for me as expected (it aborts at the end of the current generation):

[ RUN      ] ATenCompilationUnitTest.LayerNorm
Generation 0    Jobs(Compiled, GPU)/total  (8, 2)/8   (best/median/worst)us: 64/75/75^CAutotuning aborted.
No filepath provided, not saving cache
Generation 0    Jobs(Compiled, GPU)/total  (8, 8)/8   (best/median/worst)us: 62/75/132
unknown file: Failure
C++ exception with description "Abort requested" thrown in the test body.
[  FAILED  ] ATenCompilationUnitTest.LayerNorm (13082 ms)

thetheodor avatar thetheodor commented on August 15, 2024

Can you run it with --smoke_check=0 --gtest_filter='*TensorDot' --tuner_threads=8 (choose a high number of threads)?

ftynse avatar ftynse commented on August 15, 2024

It works too, but it takes a non-negligible amount of time to finish the generation:

$ ./build/test/test_autotuner  --smoke_check=0 --gtest_filter='*TensorDot' --tuner_threads=60
Note: Google Test filter = *TensorDot
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from ATenCompilationUnitTest
[ RUN      ] ATenCompilationUnitTest.TensorDot

---------------------------------------------------------
--------------------- KERNEL STATS ----------------------
------------------    100 ITERATIONS    ----------------
---------------------------------------------------------
Min: 8781us, p50: 9207us, p90: 9388us, p99: 9541us, Max: 9541us
---------------------------------------------------------


---------------------------------------------------------
-----------------------  TOTAL STATS --------------------
------------------    100 ITERATIONS    ----------------
---------------------------------------------------------
Min: 8970us, p50: 9254us, p90: 9448us, p99: 9602us, Max: 9602us
---------------------------------------------------------

Generation 0    Jobs(Compiled, GPU)/total  (60, 0)/100^CAutotuning aborted.
No filepath provided, not saving cache
Generation 0    Jobs(Compiled, GPU)/total  (100, 100)/100   (best/median/worst)us: 3178/17622/126168
unknown file: Failure
C++ exception with description "Abort requested" thrown in the test body.
[  FAILED  ] ATenCompilationUnitTest.TensorDot (68657 ms)
[----------] 1 test from ATenCompilationUnitTest (68657 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (68657 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] ATenCompilationUnitTest.TensorDot

 1 FAILED TEST

prigoyal avatar prigoyal commented on August 15, 2024

Hi @ttheodor, it looks like SIGINT/SIGTERM are working as expected. Perhaps something is not right in your build environment? Can you try again and let us know if this works?

thetheodor avatar thetheodor commented on August 15, 2024

It is working, but it is still annoying: sending a SIGTERM signal to a process and then waiting seconds to minutes (depending on how large each generation is) is not a behavior I would expect.

Terminating a process when it receives a SIGTERM is the most intuitive behavior, so why do we not terminate as soon as the caches are dumped on a SIGTERM?

prigoyal avatar prigoyal commented on August 15, 2024

Hi @ttheodor, we wait for the current generation to finish. Before the next generation starts, we check for the signal and do not start the generation if a SIGINT/SIGTERM has been received, but we do not interrupt the current generation once it is running. I agree that it would be nice to kill the current generation as well, but that requires deeper code changes. Feel free to take a stab at it. :)
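For illustration, the generation-boundary check described here can be sketched roughly as follows. This is a minimal Python sketch, not the actual TC implementation (which is C++); `run_one_generation` is a hypothetical stand-in for running one full tuning generation:

```python
import signal

stop_requested = False

def on_signal(signum, frame):
    # Only record the signal here; the tuning loop notices it at the
    # next generation boundary. Calling abort() from a handler would
    # take the embedding Python interpreter down with it.
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGINT, on_signal)
signal.signal(signal.SIGTERM, on_signal)

def tune(generations, run_one_generation):
    finished = 0
    for gen in range(generations):
        if stop_requested:
            # Raise instead of aborting, so the caller can catch the
            # exception, dump caches, and keep the interpreter alive.
            raise RuntimeError("Abort requested")
        run_one_generation(gen)  # never interrupted once started
        finished += 1
    return finished
```

This makes the observed behavior explicit: a signal that arrives mid-generation is not acted on until that generation has completely finished.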

ftynse avatar ftynse commented on August 15, 2024

I still think it can be handled by running the autotuner as a separate process and then killing that process from the Python side.
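A rough sketch of that approach using the standard-library multiprocessing module (the `tune` function and its result are hypothetical stand-ins, not the actual TC API):

```python
import multiprocessing as mp

def tune(queue):
    # Stand-in for the autotuner entry point. Because it runs in a
    # child process, killing it cannot crash the parent interpreter.
    queue.put("best-mapping-options")  # placeholder result

def autotune():
    # "fork" context: POSIX-only in this sketch; avoids the child
    # re-importing __main__ as the "spawn" method would.
    ctx = mp.get_context("fork")
    queue = ctx.Queue()
    proc = ctx.Process(target=tune, args=(queue,))
    proc.start()
    try:
        result = queue.get()  # blocks until the tuner reports back
        proc.join()
        return result
    except KeyboardInterrupt:
        proc.terminate()  # SIGTERM goes to the child tuner process,
        proc.join()       # not to the parent interpreter
        raise
```

The parent stays responsive: Ctrl-C interrupts `queue.get()`, and `terminate()` kills the tuner immediately rather than waiting for the current generation.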

thetheodor avatar thetheodor commented on August 15, 2024

@prigoyal
Would the following behavior be acceptable?

On SIGINT: maintain the current behavior.
On SIGTERM: dump the caches and abort.

prigoyal avatar prigoyal commented on August 15, 2024

Hi @ttheodor, that sounds reasonable. :)

prigoyal avatar prigoyal commented on August 15, 2024

Closed by #206.
