Comments (14)

thetheodor avatar thetheodor commented on August 15, 2024 1

The behavior I observed is that it only stops after all remaining generations have finished (which is obviously not the intended functionality).

My suggestion would be to revert to aborting (or, even better, to have an option that switches between the two behaviors). To work around the Python interpreter issue, we could launch the autotuner in a separate process, as @ftynse suggested.

from tensorcomprehensions.

prigoyal avatar prigoyal commented on August 15, 2024

Thanks @ttheodor, let me reproduce this and look into the Python side; I'll report back. Right now, the autotuning is indeed stopped once the signal is caught, but only after the current generation has finished running.

The constraint from the Python side is that aborting during autotuning, as we did earlier, would kill the Python interpreter, so we throw an exception instead.

prigoyal avatar prigoyal commented on August 15, 2024

Hi @ttheodor, thank you for the insight. It should not wait for all generations to finish; perhaps something got broken in the code logic. Are you on the master branch or the dev branch?

thetheodor avatar thetheodor commented on August 15, 2024

On dev. Can you confirm that it works on your machine?

ftynse avatar ftynse commented on August 15, 2024

Seems to work for me as expected (it aborts at the end of the current generation):

[ RUN      ] ATenCompilationUnitTest.LayerNorm
Generation 0    Jobs(Compiled, GPU)/total  (8, 2)/8   (best/median/worst)us: 64/75/75^CAutotuning aborted.
No filepath provided, not saving cache
Generation 0    Jobs(Compiled, GPU)/total  (8, 8)/8   (best/median/worst)us: 62/75/132
unknown file: Failure
C++ exception with description "Abort requested" thrown in the test body.
[  FAILED  ] ATenCompilationUnitTest.LayerNorm (13082 ms)

thetheodor avatar thetheodor commented on August 15, 2024

Can you run it with --smoke_check=0 --gtest_filter='*TensorDot' --tuner_threads=8 (choose a high number of threads)?

ftynse avatar ftynse commented on August 15, 2024

It works too, but it takes a non-negligible amount of time to finish the generation:

$ ./build/test/test_autotuner  --smoke_check=0 --gtest_filter='*TensorDot' --tuner_threads=60
Note: Google Test filter = *TensorDot
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from ATenCompilationUnitTest
[ RUN      ] ATenCompilationUnitTest.TensorDot

---------------------------------------------------------
--------------------- KERNEL STATS ----------------------
------------------    100 ITERATIONS    ----------------
---------------------------------------------------------
Min: 8781us, p50: 9207us, p90: 9388us, p99: 9541us, Max: 9541us
---------------------------------------------------------


---------------------------------------------------------
-----------------------  TOTAL STATS --------------------
------------------    100 ITERATIONS    ----------------
---------------------------------------------------------
Min: 8970us, p50: 9254us, p90: 9448us, p99: 9602us, Max: 9602us
---------------------------------------------------------

Generation 0    Jobs(Compiled, GPU)/total  (60, 0)/100^CAutotuning aborted.
No filepath provided, not saving cache
Generation 0    Jobs(Compiled, GPU)/total  (100, 100)/100   (best/median/worst)us: 3178/17622/126168
unknown file: Failure
C++ exception with description "Abort requested" thrown in the test body.
[  FAILED  ] ATenCompilationUnitTest.TensorDot (68657 ms)
[----------] 1 test from ATenCompilationUnitTest (68657 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (68657 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] ATenCompilationUnitTest.TensorDot

 1 FAILED TEST

prigoyal avatar prigoyal commented on August 15, 2024

Hi @ttheodor, it looks like SIGINT/SIGTERM are working as expected. Perhaps something is not right in your build environment? Can you try again and let us know if this works?

thetheodor avatar thetheodor commented on August 15, 2024

It is working, but it is still annoying: sending a SIGTERM signal to a process and then waiting seconds to minutes (depending on how large each generation is) is not a behavior I would expect.

Terminating a process when it receives a SIGTERM is the most intuitive behavior, so why do we not terminate as soon as the caches are dumped on a SIGTERM?

prigoyal avatar prigoyal commented on August 15, 2024

Hi @ttheodor, we wait for the current generation to finish. Before the next generation starts, we check for the signal and do not start the generation if a SIGINT/SIGTERM has been received, but we do not interrupt the current generation once it is running. I agree that it would be nice to kill the current generation as well, but that requires deeper code changes. Feel free to take a stab at it. :)
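For illustration, the generation-boundary check described here can be sketched roughly as follows. This is a minimal Python sketch, not the actual TC implementation (which is C++); `run_one_generation` is a hypothetical stand-in for running one full tuning generation:

```python
import signal

stop_requested = False

def on_signal(signum, frame):
    # Only record the signal here; the tuning loop notices it at the
    # next generation boundary. Calling abort() from a handler would
    # take the embedding Python interpreter down with it.
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGINT, on_signal)
signal.signal(signal.SIGTERM, on_signal)

def tune(generations, run_one_generation):
    finished = 0
    for gen in range(generations):
        if stop_requested:
            # Raise instead of aborting, so the caller can catch the
            # exception, dump caches, and keep the interpreter alive.
            raise RuntimeError("Abort requested")
        run_one_generation(gen)  # never interrupted once started
        finished += 1
    return finished
```

This makes the observed behavior explicit: a signal that arrives mid-generation is not acted on until that generation has completely finished.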

ftynse avatar ftynse commented on August 15, 2024

I still think it can be handled by running the autotuner as a separate process and then killing that process from the Python side.
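A rough sketch of that approach using the standard-library multiprocessing module (the `tune` function and its result are hypothetical stand-ins, not the actual TC API):

```python
import multiprocessing as mp

def tune(queue):
    # Stand-in for the autotuner entry point. Because it runs in a
    # child process, killing it cannot crash the parent interpreter.
    queue.put("best-mapping-options")  # placeholder result

def autotune():
    # "fork" context: POSIX-only in this sketch; avoids the child
    # re-importing __main__ as the "spawn" method would.
    ctx = mp.get_context("fork")
    queue = ctx.Queue()
    proc = ctx.Process(target=tune, args=(queue,))
    proc.start()
    try:
        result = queue.get()  # blocks until the tuner reports back
        proc.join()
        return result
    except KeyboardInterrupt:
        proc.terminate()  # SIGTERM goes to the child tuner process,
        proc.join()       # not to the parent interpreter
        raise
```

The parent stays responsive: Ctrl-C interrupts `queue.get()`, and `terminate()` kills the tuner immediately rather than waiting for the current generation.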

thetheodor avatar thetheodor commented on August 15, 2024

@prigoyal
Would the following behavior be acceptable?

On SIGINT: maintain the current behavior.
On SIGTERM: dump the caches and abort.

prigoyal avatar prigoyal commented on August 15, 2024

Hi @ttheodor, that sounds reasonable. :)

prigoyal avatar prigoyal commented on August 15, 2024

Closed by #206.
