while stdout for a local process appears in the repl, that for a remote sge process do

stdout not redirected to repl (clustermanagers.jl, CLOSED)

juliaparallel commented on July 21, 2024

stdout not redirected to repl

from clustermanagers.jl.

Comments (16)

amitmurthy commented on July 21, 2024

No, I don't think it is related to the issues mentioned.

cc @nlhepler

bjarthur commented on July 21, 2024

more data on this: the remote-process stdout does eventually appear on the local repl, but sometimes not for several tens of seconds. if i simultaneously examine the julia log files (the ones it clutters the home directory with), the stdout there is also delayed. it appears on the repl and in the log files at roughly the same time, both much delayed. so i think the fix is just a matter of flushing the i/o buffer. i tried to add flush(STDOUT) to the remote script, but got an error about serializing a pointer. i tried looking in base for a place to add a flush, but it was not obvious where to put it. thanks for any help.

amitmurthy commented on July 21, 2024

You could have a function defined that does the flush and just call it remotely. For example, flush_stdout() = flush(STDOUT) and execute a remotecall(p, flush_stdout)

bjarthur commented on July 21, 2024

i tried precisely that, and just tried again. here is the error i get:

julia> using ClusterManagers

julia> ClusterManagers.addprocs_sge(1)
job id is 8855037, waiting for job to start ..............................
1-element Array{Any,1}:
 2

julia> @everywhere flush_stdout() = flush(STDOUT)

julia> remotecall_fetch(2,println,"foo")

julia> remotecall_fetch(2,flush_stdout)
Worker 2 terminated.
ERROR: ProcessExitedException()
 in remotecall_fetch at multi.jl:673
 in remotecall_fetch at multi.jl:678

julia>  From worker 2:  foo
    From worker 2:  fatal error on 2: ERROR: cannot serialize a pointer
    From worker 2:   in serialize at serialize.jl:60
julia>

note that "foo" eventually appears on the repl, and that the call to flush() causes the error

amitmurthy commented on July 21, 2024

Strange. It does not seem to be an issue with local workers, i.e. something like addprocs(2).

Could you try with this definition?
@everywhere flush_stdout() = eval(parse("flush(STDOUT)"))

bjarthur commented on July 21, 2024

that doesn't work either. exact same error. and you're right about it working fine on local workers.

amitmurthy commented on July 21, 2024

Just to try:

Instead of the @everywhere, could you just do a

julia> remotecall_fetch(2,println,"foo")

julia> remotecall_fetch(2,()->eval(parse("flush(STDOUT)"))  )

bjarthur commented on July 21, 2024

nope. same error. is the problem that it's referring to the STDOUT in worker 1 and not worker 2?

amitmurthy commented on July 21, 2024

That's what I initially thought, but it does not seem to be the case. At least with parse and eval that is ruled out.

I just saw flush_cstdio() in the documentation. Could you try with remotecall_fetch(2, flush_cstdio)?

amitmurthy commented on July 21, 2024

Also STDOUT seems to be of type TTY and not a regular stream as we have been assuming...

bjarthur commented on July 21, 2024

flush_cstdio has no effect: no error, no stdout on repl, no change to log files in home directory.

amitmurthy commented on July 21, 2024

Is it possible that SGE is responsible for the delay? I am not at all familiar with cluster technologies, but a Google search brought up this link - http://scicomp.stackexchange.com/questions/7804/flush-output-in-torque-scheduler . Is there a similar option for SGE?

bjarthur commented on July 21, 2024

the problem is not SGE but rather the buffering done by our high-performance file system. a flush() in julia, as described here JuliaLang/julia#6549, is not sufficient. for STDOUT to actually appear in the julia log files i also had to readall(`ls $(ENV["HOME"])`). with these two extra steps, STDOUT of worker procs now appears on the repl. thanks for all the help.

nlhepler commented on July 21, 2024

Whew, glad to see that was resolved.
Amit, I think we already use the appropriate flag (-k o on pbs and -j y on sge). Their semantics seem a little different in the documentation, though honestly the documentation has not been too illuminating.
bjarthur, what's the fs you're using? It might be worth mentioning this in the readme (and your workaround).

bjarthur commented on July 21, 2024

yesterday i got it to work by adding @fetchfrom 3 flush(eval(:STDOUT)); readall(`ls $(ENV["HOME"])`) after every println. but today the flush is throwing a serialization error, despite jeff's trick JuliaLang/julia#6549. so i'm reopening this issue. nothing has changed that i know of.

julia> addprocs(1)
1-element Array{Any,1}:
 2

julia> using ClusterManagers

julia> ClusterManagers.addprocs_sge(1)
job id is 9010468, waiting for job to start ............................................
1-element Array{Any,1}:
 3

julia> @fetchfrom 1 flush(eval(:STDOUT))

julia> @fetchfrom 2 flush(eval(:STDOUT))

julia> remotecall_fetch(3,println,"foo")

julia> @fetchfrom 3 flush(eval(:STDOUT))
Worker 3 terminated.
ERROR: ProcessExitedException()
 in remotecall_fetch at multi.jl:673
 in remotecall_fetch at multi.jl:678

julia> workers()
1-element Array{Int64,1}:
 2

julia>  From worker 3:  foo
    From worker 3:  fatal error on 3: ERROR: cannot serialize a pointer
    From worker 3:   in serialize at serialize.jl:60
julia>

note that the remote worker's stdout only appears on the repl (and in the log file) after it crashes.

this is on an asynchronous NFS3 file system. the sysadmin strongly discourages use of the disk for interprocess communication because of the async buffering.

i'm looking into replacing qsub with qrsh...

bjarthur commented on July 21, 2024

qrsh implementation here #11
