Comments (16)

amitmurthy commented on July 21, 2024

No, I don't think it is related to the issues mentioned.

cc @nlhepler

from clustermanagers.jl.

bjarthur commented on July 21, 2024

more data on this: the remote-process stdout does eventually appear on the local repl, but sometimes not for several tens of seconds. if i simultaneously examine the julia log files (the ones it clutters the home directory with), the stdout there is also delayed. it appears on the repl and in the log files at roughly the same time, both much delayed. so i think the fix is just a matter of flushing the i/o buffer. i tried to add flush(STDOUT) to the remote script, but got an error about serializing a pointer. i tried looking in base for a place to add a flush, but it was not obvious where to put it. thanks for any help.

amitmurthy commented on July 21, 2024

You could have a function defined that does the flush and just call it remotely. For example, flush_stdout() = flush(STDOUT) and execute a remotecall(p, flush_stdout)
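
A minimal sketch of that suggestion, written in current Julia (1.x) syntax rather than the syntax used in this thread: `STDOUT` is now `stdout`, and `remotecall_fetch` takes the function first and the worker id second. A local worker from `addprocs(1)` stands in for `ClusterManagers.addprocs_sge(1)`, which is an assumption made so the example runs without a cluster. The helper returns `nothing` explicitly so that the value sent back to the caller is trivially serializable.

```julia
using Distributed

# Assumption: a local worker stands in for ClusterManagers.addprocs_sge(1).
pids = addprocs(1)
w = first(pids)

# Define the helper on every process so each worker flushes its own stdout.
# Returning `nothing` keeps the remote call's return value serializable.
@everywhere flush_stdout() = (flush(stdout); nothing)

remotecall_fetch(println, w, "foo")          # print on the worker
result = remotecall_fetch(flush_stdout, w)   # then flush the worker's stdout

rmprocs(pids)
```

Defining the helper with `@everywhere` means the body is evaluated on the worker, so `stdout` there refers to the worker's own stream rather than the caller's.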

bjarthur commented on July 21, 2024

i tried precisely that, and just tried again. here is the error i get:

julia> using ClusterManagers

julia> ClusterManagers.addprocs_sge(1)
job id is 8855037, waiting for job to start ..............................
1-element Array{Any,1}:
 2

julia> @everywhere flush_stdout() = flush(STDOUT)

julia> remotecall_fetch(2,println,"foo")

julia> remotecall_fetch(2,flush_stdout)
Worker 2 terminated.
ERROR: ProcessExitedException()
 in remotecall_fetch at multi.jl:673
 in remotecall_fetch at multi.jl:678

julia>  From worker 2:  foo
    From worker 2:  fatal error on 2: ERROR: cannot serialize a pointer
    From worker 2:   in serialize at serialize.jl:60
julia> 

note that "foo" eventually appears on the repl, and that the call to flush() causes the error

amitmurthy commented on July 21, 2024

Strange. It does not seem to be an issue with local workers, i.e. something like addprocs(2).

Could you try with this definition?
@everywhere flush_stdout() = eval(parse("flush(STDOUT)"))

bjarthur commented on July 21, 2024

that doesn't work either. exact same error. and you're right about it working fine on local workers.

amitmurthy commented on July 21, 2024

Just to try:

Instead of the @everywhere, could you just do a

julia> remotecall_fetch(2,println,"foo")

julia> remotecall_fetch(2,()->eval(parse("flush(STDOUT)"))  )  

bjarthur commented on July 21, 2024

nope. same error. is the problem that it's referring to the STDOUT in worker 1 and not worker 2?

amitmurthy commented on July 21, 2024

That's what I initially thought, but it does not seem to be the case. At least with parse and eval that is ruled out.

I just saw flush_cstdio() in the documentation. Could you try with remotecall_fetch(2, flush_cstdio)?

amitmurthy commented on July 21, 2024

Also STDOUT seems to be of type TTY and not a regular stream as we have been assuming...

bjarthur commented on July 21, 2024

flush_cstdio has no effect: no error, no stdout on repl, no change to log files in home directory.

amitmurthy commented on July 21, 2024

Is it possible that SGE is responsible for the delay? I am not at all familiar with cluster technologies, but a Google search brought up this link - http://scicomp.stackexchange.com/questions/7804/flush-output-in-torque-scheduler . Is there a similar option for SGE?

bjarthur commented on July 21, 2024

the problem is not SGE but rather the buffering done by our high-performance file system. a flush() in julia, as described in JuliaLang/julia#6549, is not sufficient. for STDOUT to actually appear in the julia log files i also had to readall(`ls $(ENV["HOME"])`). with these two extra steps, STDOUT of worker procs now appears on the repl. thanks for all the help.
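
The two steps described above could be wrapped in one helper, sketched here in current Julia (1.x) syntax, where `readall(cmd)` has become `read(cmd, String)` and `STDOUT` is `stdout`. The names `flush_stdout` and `println_visible` are hypothetical, and a local worker stands in for an SGE worker. The directory listing is run on the calling process, which is an assumption about where it is needed; whether it has any effect on a different file system is not something this sketch can show.

```julia
using Distributed

# Assumption: a local worker stands in for ClusterManagers.addprocs_sge(1).
pids = addprocs(1)
w = first(pids)

# Flush helper defined on every process; returns nothing to the caller.
@everywhere flush_stdout() = (flush(stdout); nothing)

# Hypothetical wrapper: print on the worker, then apply both extra steps.
function println_visible(w, msg)
    remotecall_fetch(println, w, msg)       # print on the worker
    remotecall_fetch(flush_stdout, w)       # step 1: flush the worker's stdout
    read(`ls $(ENV["HOME"])`, String)       # step 2: list the home directory locally
    return nothing
end

result = println_visible(w, "foo")

rmprocs(pids)
```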

nlhepler commented on July 21, 2024

Whew, glad to see that was resolved.
Amit, I think we already use the appropriate flag (-k o on pbs and -j y on sge). Their semantics seem a little different in the documentation, though honestly the documentation has not been too illuminating.
bjarthur, what's the fs you're using? It might be worth mentioning this in the readme (and your workaround).

bjarthur commented on July 21, 2024

yesterday i got it to work by adding @fetchfrom 3 flush(eval(:STDOUT)); readall(`ls $(ENV["HOME"])`) after every println. but today the flush is throwing a serialization error, despite jeff's trick from JuliaLang/julia#6549. so i'm reopening this issue. nothing has changed that i know of.

julia> addprocs(1)
1-element Array{Any,1}:
 2

julia> using ClusterManagers

julia> ClusterManagers.addprocs_sge(1)
job id is 9010468, waiting for job to start ............................................
1-element Array{Any,1}:
 3

julia> @fetchfrom 1 flush(eval(:STDOUT))

julia> @fetchfrom 2 flush(eval(:STDOUT))

julia> remotecall_fetch(3,println,"foo")

julia> @fetchfrom 3 flush(eval(:STDOUT))
Worker 3 terminated.
ERROR: ProcessExitedException()
 in remotecall_fetch at multi.jl:673
 in remotecall_fetch at multi.jl:678

julia> workers()
1-element Array{Int64,1}:
 2

julia>  From worker 3:  foo
    From worker 3:  fatal error on 3: ERROR: cannot serialize a pointer
    From worker 3:   in serialize at serialize.jl:60
julia> 

note that the remote worker's stdout only appears on the repl (and in the log file) after it crashes.

this is on an asynchronous NFS3 file system. the sysadmin strongly discourages use of the disk for interprocess communication because of the async buffering.

i'm looking into replacing qsub with qrsh...

bjarthur commented on July 21, 2024

qrsh implementation here #11
