Comments (16)
No, I don't think it is related to the issues mentioned.
cc @nlhepler
from clustermanagers.jl.
more data on this: the remote-process stdout does eventually appear on the local repl, but sometimes not for several tens of seconds. if i simultaneously examine the julia log files (the ones it clutters the home directory with), the stdout there is also delayed. it appears on the repl and in the log files at roughly the same time, both much delayed. so i think the fix is just a matter of flushing the i/o buffer. i tried to add flush(STDOUT) to the remote script, but got an error about serializing a pointer. i tried looking in base for a place to add a flush, but it was not obvious where to put it. thanks for any help.
from clustermanagers.jl.
You could define a function that does the flush and just call it remotely. For example, define flush_stdout() = flush(STDOUT) and then execute remotecall(p, flush_stdout).
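For readers on current Julia, the same idea can be sketched as follows. This assumes Julia 1.x, where STDOUT is spelled stdout, the Distributed standard library must be loaded explicitly, and remotecall_fetch takes the function as its first argument. It uses a local worker purely for illustration. Whether it behaves the same on SGE-launched workers is not something this sketch checks.

```julia
using Distributed

# One local worker, for illustration only; the thread uses an SGE-launched worker.
addprocs(1)

# Define the helper on every process so that each worker flushes its own stdout.
@everywhere flush_stdout() = flush(stdout)

w = first(workers())

# Print on the worker, then ask the worker to flush its stdout.
remotecall_fetch(println, w, "foo")
remotecall_fetch(flush_stdout, w)
```

Defining the helper with @everywhere means the name stdout is resolved on the worker that runs the call, not on the process that issues it.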
from clustermanagers.jl.
i tried precisely that, and just tried again. here is the error i get:
julia> using ClusterManagers
julia> ClusterManagers.addprocs_sge(1)
job id is 8855037, waiting for job to start ..............................
1-element Array{Any,1}:
2
julia> @everywhere flush_stdout() = flush(STDOUT)
julia> remotecall_fetch(2,println,"foo")
julia> remotecall_fetch(2,flush_stdout)
Worker 2 terminated.
ERROR: ProcessExitedException()
in remotecall_fetch at multi.jl:673
in remotecall_fetch at multi.jl:678
julia> From worker 2: foo
From worker 2: fatal error on 2: ERROR: cannot serialize a pointer
From worker 2: in serialize at serialize.jl:60
julia>
note that "foo" eventually appears on the repl, and that the call to flush() causes the error
from clustermanagers.jl.
Strange. It does not seem to be an issue with local workers, i.e. something like addprocs(2).
Could you try with this definition?
@everywhere flush_stdout() = eval(parse("flush(STDOUT)"))
from clustermanagers.jl.
that doesn't work either. exact same error. and you're right about it working fine on local workers.
from clustermanagers.jl.
Just to try:
Instead of the @everywhere, could you just do a
julia> remotecall_fetch(2,println,"foo")
julia> remotecall_fetch(2,()->eval(parse("flush(STDOUT)")) )
from clustermanagers.jl.
nope. same error. is the problem that it's referring to the STDOUT in worker 1 and not worker 2?
from clustermanagers.jl.
That's what I initially thought, but it does not seem to be the case. At least with parse and eval that is ruled out.
I just saw flush_cstdio() in the documentation. Could you try with remotecall_fetch(2, flush_cstdio)?
from clustermanagers.jl.
Also STDOUT seems to be of type TTY and not a regular stream as we have been assuming...
from clustermanagers.jl.
flush_cstdio has no effect: no error, no stdout on repl, no change to log files in home directory.
from clustermanagers.jl.
Is it possible that SGE is responsible for the delay? I am not at all familiar with cluster technologies, but a Google search brought up this link - http://scicomp.stackexchange.com/questions/7804/flush-output-in-torque-scheduler . Is there a similar option for SGE?
from clustermanagers.jl.
the problem is not SGE but rather the buffering done by our high-performance file system. a flush() in julia, as described in JuliaLang/julia#6549, is not sufficient. for STDOUT to actually appear in the julia log files i also had to readall(`ls $(ENV["HOME"])`). with these two extra steps, STDOUT of worker procs now appears on the repl. thanks for all the help.
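The two steps of this workaround can be wrapped in one helper. This is a minimal sketch assuming Julia 1.x, where readall(cmd) is written read(cmd, String) and STDOUT is stdout. The name flush_and_list is hypothetical and is not part of ClusterManagers.jl. It also assumes an ls executable is on the PATH.

```julia
# Hypothetical helper: flush Julia's stdout, then list a directory,
# mirroring the two steps of the workaround described above.
function flush_and_list(dir::AbstractString = homedir())
    flush(stdout)              # step 1: push out Julia's own buffered output
    read(`ls $dir`, String)    # step 2: list the directory; the listing itself is discarded
    return nothing
end

flush_and_list()
```

To run it on a worker, the function would have to be defined there as well, for example with @everywhere from the Distributed standard library. The sketch only reproduces the two steps and does not address the serialization error reported elsewhere in this thread.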
from clustermanagers.jl.
Whew, glad to see that was resolved.
Amit, I think we already use the appropriate flag (-k o on pbs and -j y on sge). Their semantics seem a little different in the documentation, though honestly the documentation has not been too illuminating.
bjarthur, what's the fs you're using? It might be worth mentioning this in the readme (and your workaround).
from clustermanagers.jl.
yesterday i got it to work by adding @fetchfrom 3 flush(eval(:STDOUT)); readall(`ls $(ENV["HOME"])`) after every println. but today the flush is throwing a serialization error, despite jeff's trick in JuliaLang/julia#6549. so i'm reopening this issue. nothing has changed that i know of.
julia> addprocs(1)
1-element Array{Any,1}:
2
julia> using ClusterManagers
julia> ClusterManagers.addprocs_sge(1)
job id is 9010468, waiting for job to start ............................................
1-element Array{Any,1}:
3
julia> @fetchfrom 1 flush(eval(:STDOUT))
julia> @fetchfrom 2 flush(eval(:STDOUT))
julia> remotecall_fetch(3,println,"foo")
julia> @fetchfrom 3 flush(eval(:STDOUT))
Worker 3 terminated.
ERROR: ProcessExitedException()
in remotecall_fetch at multi.jl:673
in remotecall_fetch at multi.jl:678
julia> workers()
1-element Array{Int64,1}:
2
julia> From worker 3: foo
From worker 3: fatal error on 3: ERROR: cannot serialize a pointer
From worker 3: in serialize at serialize.jl:60
julia>
note that the remote worker's stdout only appears on the repl (and in the log file) after the worker crashes.
this is on an asynchronous NFS3 file system. the sysadmin strongly discourages use of the disk for interprocess communication because of the async buffering.
i'm looking into replacing qsub with qrsh...
from clustermanagers.jl.
qrsh implementation here #11
from clustermanagers.jl.