jaywonchung / pegasus

An SSH command runner with a focus on simplicity

License: MIT License
Pressing ctrl-c while Pegasus is running also sends ctrl-c to the entire foreground process group, thus also sending SIGINT to the ssh processes.
Instead of waiting for ctrl-c with the tokio ctrl-c catcher, find another way to catch the user's intent to cancel.
Possible solutions:
- Watch stdin for something random, like q<Enter>.
- A pegasus stop command. This should make sure to identify the specific pegasus process that is running on the current working directory. This will probably require a pid file.
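A minimal sketch of the pid-file idea in std-only Rust. The file name `.pegasus.pid` and the function names are assumptions for illustration, not Pegasus's actual code; a real implementation would write the file into the working directory Pegasus was started in.

```rust
use std::fs;
use std::path::Path;

// Hypothetical pid file name; the issue does not fix a name.
const PID_FILE: &str = ".pegasus.pid";

/// Called by the running Pegasus on startup: record our pid next to queue.yaml.
fn write_pid_file(dir: &Path) -> std::io::Result<()> {
    fs::write(dir.join(PID_FILE), std::process::id().to_string())
}

/// Called by `pegasus stop`: find the pid of the Pegasus instance that owns
/// this directory, so we signal exactly that process and no other.
fn read_pid_file(dir: &Path) -> Option<u32> {
    fs::read_to_string(dir.join(PID_FILE)).ok()?.trim().parse().ok()
}

fn main() -> std::io::Result<()> {
    // The real code would use the current working directory; a temp dir
    // keeps this demo self-contained.
    let dir = std::env::temp_dir();
    write_pid_file(&dir)?;
    assert_eq!(read_pid_file(&dir), Some(std::process::id()));
    // `pegasus stop` would now signal this pid instead of guessing from ps.
    Ok(())
}
```

The pid file doubles as the "one pegasus per directory" guard: if it already exists and the pid is alive, refuse to start.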
There should be at most one pegasus process per directory (if not, that means more than one pegasus process is manipulating queue.yaml and consumed.yaml).

openssh v0.9.0-rc1 is released with a new feature, native-mux.
It provides new functions like Session::connect_mux and SessionBuilder::connect_mux, which communicate with the ssh multiplex master directly through the control socket, instead of spawning a new process to talk to it. The advantages of this are more robust error reporting, better performance, and lower memory usage.
The old implementation (process-mux) checks the exit status of ssh for an indication of error, then parses its output and the output of the ssh multiplex master to return an error. This method is obviously not as robust as native-mux, which communicates with the ssh multiplex master directly through its multiplex protocol.
The better performance and lower memory usage come mostly from avoiding the creation of a new process for every command spawned on the remote, every Session::check, and every Session::request_port_forwarding.
The new release also adds the function Session::request_port_forwarding, which supports local/remote forwarding of tcp and unix socket streams.
There are also other changes to the API:
- Stdio is used for setting stdin/stdout/stderr.
- The ChildStd* types are now aliases for tokio_pipe::{PipeRead, PipeWrite}.
- Command::spawn and Command::status now conform to std::process::Command and tokio::process::Command, in which stdin, stdout, and stderr are inherited by default.
- Command::spawn is now an async method.
- RemoteChild::wait now takes self by value.
- Error is now marked #[non_exhaustive], and new variants were added.

I know this is a huge release and upgrading is going to be quite difficult, but I sincerely want you to try it out, as the new implementation requires feedback.
For a large parametrized command that generates a lot of jobs and takes days, the user will want to figure out the progress of each command generated: queued, running, or done.
Internal bookkeeping is easy: keep a file into which we dump the commands in progress, and add a mode to Pegasus that parses and displays that file.
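The bookkeeping itself can be very small. A std-only Rust sketch; `JobState`, `render_status`, and the tab-separated file layout are all hypothetical choices, not Pegasus's actual types:

```rust
use std::collections::BTreeMap;

// Hypothetical job states, matching the three states named in the issue.
#[derive(Clone, Copy, Debug, PartialEq)]
enum JobState {
    Queued,
    Running,
    Done,
}

/// Render the state map as the text that would be dumped to a status file
/// and later parsed back by a hypothetical `pegasus status` mode.
fn render_status(jobs: &BTreeMap<String, JobState>) -> String {
    jobs.iter()
        .map(|(cmd, state)| format!("{:?}\t{}", state, cmd))
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let mut jobs = BTreeMap::new();
    jobs.insert("long_job resnet50".to_string(), JobState::Running);
    jobs.insert("long_job ViT".to_string(), JobState::Queued);
    // Each scheduling event (job picked up, job finished) would rewrite
    // this file; the display mode only ever reads it.
    println!("{}", render_status(&jobs));
}
```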
```yaml
# queue.yaml
- command:
    - long_job {{ param }}
    - another_long_job {{ param }}
  param:
    - resnet50
    - ViT
```
Currently, Pegasus will consume the entire entry (virtually the entire file) from queue.yaml. However, consuming minimally would be nice. For instance, it could leave queue.yaml in the following state:
```yaml
# queue.yaml
- command:
    - another_long_job {{ param }}
  param:
    - resnet50
    - ViT
```
Then, only after the next round of get_one_job will queue.yaml be empty.
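The minimal-consumption behavior can be sketched on the parsed form of queue.yaml (std-only Rust; the `Entry` struct and this exact shape of get_one_job are assumptions based on the issue, skipping YAML (de)serialization):

```rust
// In-memory model of one queue.yaml entry.
#[derive(Clone, Debug, PartialEq)]
struct Entry {
    command: Vec<String>,
    param: Vec<String>,
}

/// Minimal consumption: take only the first command template out of the first
/// entry; drop the entry (and eventually empty the queue) only when its last
/// command has been taken.
fn get_one_job(queue: &mut Vec<Entry>) -> Option<(String, Vec<String>)> {
    let entry = queue.first_mut()?;
    let command = entry.command.remove(0);
    let params = entry.param.clone();
    if entry.command.is_empty() {
        queue.remove(0);
    }
    Some((command, params))
}

fn main() {
    let mut queue = vec![Entry {
        command: vec![
            "long_job {{ param }}".into(),
            "another_long_job {{ param }}".into(),
        ],
        param: vec!["resnet50".into(), "ViT".into()],
    }];
    let (first, _) = get_one_job(&mut queue).unwrap();
    assert_eq!(first, "long_job {{ param }}");
    // queue.yaml would now hold only `another_long_job {{ param }}`.
    assert_eq!(queue[0].command, vec!["another_long_job {{ param }}"]);
    let _ = get_one_job(&mut queue).unwrap();
    assert!(queue.is_empty()); // only now is queue.yaml empty
}
```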
Cancelling commands run by Pegasus is very difficult. You essentially have to ssh into each node, manually figure out the PIDs of the commands, and kill them.
Nested commands, so to say, make things more complicated. For instance, docker exec sh -c "python train.py" will run the following commands:
1. sh -c docker exec sh -c "python train.py"
2. docker exec sh -c "python train.py"
3. sh -c "python train.py"
4. python train.py
Only killing the fourth command, python train.py, will truly achieve cancellation. The bottom line is, it is difficult for Pegasus to infer how to properly terminate a command.
Potential solutions
- Let the user specify a cancellation command in queue.yaml, for example sudo kill $(pgrep -f 'train.py'). Then the ctrl_c handler will create a new connection to the hosts and run the designated cancellation command.
- Find the PGID of the sh process and run sudo kill -- -PGID.
- Can we pgrep -f with the entire command? Shell escaping might become a problem. (pgrep -f with every single word in the command and kill the intersection of all PIDs returned?)

Currently, when queue.yaml is empty, the scheduling loop just sleeps for three seconds and re-checks whether there are any new commands. This is not very responsive, and it's basically polling.
Crates like notify do filesystem watching, but notify is not async; its watcher takes std::sync::mpsc::{Sender, Receiver} in its constructor, and those are !Sync.
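Until proper async watching is sorted out, a cheap middle ground is to keep polling but stat queue.yaml instead of re-parsing it on every tick. A std-only sketch; the (mtime, length) stamp is my choice for robustness on coarse-timestamp filesystems, not anything Pegasus currently does:

```rust
use std::fs;
use std::path::Path;
use std::time::SystemTime;

/// Remember queue.yaml's (mtime, length) and report whether it has changed
/// since the last check, so the scheduling loop only re-parses on change.
fn changed_since(path: &Path, last_seen: &mut Option<(SystemTime, u64)>) -> bool {
    let meta = match fs::metadata(path) {
        Ok(m) => m,
        Err(_) => return false, // file missing: nothing new to parse
    };
    let stamp = match meta.modified() {
        Ok(t) => (t, meta.len()),
        Err(_) => return false, // mtime unsupported on this platform
    };
    if *last_seen == Some(stamp) {
        return false;
    }
    *last_seen = Some(stamp);
    true
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("queue-mtime-demo.yaml");
    fs::write(&path, "- command: a\n")?;
    let mut last = None;
    assert!(changed_since(&path, &mut last)); // first sight counts as a change
    assert!(!changed_since(&path, &mut last)); // unchanged since last check
    fs::write(&path, "- command: bbbb\n")?;
    assert!(changed_since(&path, &mut last)); // stamp moved: re-parse
    fs::remove_file(&path)?;
    Ok(())
}
```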
Use case:
- Add a node to hosts.yaml when you get access to more nodes, and the jobs you are currently running will automatically go there.
- Remove a node from hosts.yaml when someone asks you to vacate the node.

Removing a node from hosts.yaml does not kill its running job. However, it is guaranteed that the removed node will not be given a new job once it has been removed from hosts.yaml.
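The reload logic reduces to a set difference between two snapshots of hosts.yaml. A std-only sketch; `diff_hosts` is a hypothetical name, and hosts are modeled as plain strings:

```rust
use std::collections::HashSet;

/// Diff two snapshots of hosts.yaml. Added hosts get new SSH sessions and
/// become eligible for jobs; removed hosts simply stop receiving new jobs
/// (their running job is not killed, matching the guarantee above).
fn diff_hosts<'a>(
    old: &'a HashSet<String>,
    new: &'a HashSet<String>,
) -> (HashSet<&'a String>, HashSet<&'a String>) {
    let added = new.difference(old).collect();
    let removed = old.difference(new).collect();
    (added, removed)
}

fn main() {
    let old: HashSet<String> = ["node1".into(), "node2".into()].into_iter().collect();
    let new: HashSet<String> = ["node2".into(), "node3".into()].into_iter().collect();
    let (added, removed) = diff_hosts(&old, &new);
    // node3 was added: connect and start scheduling onto it.
    assert!(added.contains(&"node3".to_string()));
    // node1 was removed: mark it ineligible, but leave its job running.
    assert!(removed.contains(&"node1".to_string()));
}
```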
Before entering the scheduling loop, Pegasus should create and connect each SSH session to verify that it connects. If any connection fails, it should gracefully terminate all previously established connections and exit. The connected SSH session object can then be moved into its tokio task.
Currently, Pegasus doesn't handle ctrl-c very well. When the user presses ctrl-c, Pegasus terminates without cancelling the SSH session tasks. So the ssh sessions remain open, the .ssh-connection* directories stay as they are, and in order to kill the processes spawned on remote nodes, I need to walk into each node and kill them manually.
Potential solutions
- Have the tasks shut down their ssh processes. Proper error handling inside the tasks would probably serve as an okay solution.
- The native-mux feature of the coming openssh crate may make graceful shutdown easier.

Also, there is no real need to ssh into localhost.
Currently, when daemon mode is not set, the scheduling loops of both broadcast mode and queue mode just terminate after sending out the last command to a free SSH session. But the time gap between sending out the last command and the final command finishing execution is vast.
In this time window, users will want to submit more jobs. Since Pegasus appears to be up and running, it is more intuitive for users to assume that it will admit more jobs.
The current behavior of Pegasus on SSH connection failure is to panic and then abort. It would be better if the async tasks handling connections just returned, and the scheduling loop detected this (or perhaps did not even start before all connections are successfully established) and terminated Pegasus peacefully.
The Liquid templating language would allow users to implement logic like if-else in their command specification.
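For instance, something like the following might become expressible (a hypothetical queue.yaml; the {% if %} / {% else %} / {% endif %} tags are standard Liquid, but whether Pegasus renders control-flow tags this way is exactly what this issue proposes):

```yaml
# queue.yaml (hypothetical)
- command:
    - >-
      {% if param == "ViT" %}
      python train.py --model ViT --patch-size 16
      {% else %}
      python train.py --model {{ param }}
      {% endif %}
  param:
    - resnet50
    - ViT
```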