
pegasus's Introduction

Hey there, I'm Jae-Won! 😆

  • PhD student at UMich CSE, SymbioticLab. Working on energy-efficient software systems for Deep Learning!
  • Command-line enthusiast. Check out my dotfiles!
  • Fingerstyle guitar player. But I rarely get to show off!

pegasus's People

Contributors

jaywonchung, nobodyxu


pegasus's Issues

Cancelling commands

Cancelling commands run by Pegasus is very difficult. You essentially have to SSH into each node, manually figure out the PIDs of the commands, and kill them.

Nested commands, so to speak, make things more complicated. For instance, `docker exec sh -c "python train.py"` will run the following commands:

  • Run by user: sh -c docker exec sh -c "python train.py"
  • Run by user: docker exec sh -c "python train.py"
  • Run by root: sh -c "python train.py"
  • Run by root: python train.py

Only killing the fourth python train.py command will truly achieve cancellation. The bottom line is that it is difficult for Pegasus to infer how to properly terminate a command.

Potential solutions

  • We might ask the user for a cancellation command in queue.yaml, for example sudo kill $(pgrep -f 'train.py'). Then the ctrl_c handler would create a new connection to the hosts and run the designated cancellation command (see the sketch after this list).
  • Somehow figure out the PGID of the sh process and run sudo kill -- -PGID. Can we pgrep -f with the entire command? Shell escaping might become a problem. (pgrep -f with every single word in the command and kill the intersection of all PIDs returned?)
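
A minimal sketch of the first option, assuming a hypothetical `cancel` field in queue.yaml (not part of the current schema; the command names are illustrative):

```yaml
# queue.yaml (hypothetical `cancel` field; not part of the current schema)
- command:
    - docker exec trainer sh -c "python train.py"
  cancel:
    - sudo kill $(pgrep -f 'train.py')
```

The ctrl_c handler would open a fresh connection to each host and run the `cancel` entry verbatim, sidestepping the need to infer the process tree.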

Ctrl-c is the wrong key to trigger cancellation

Pressing ctrl-c while Pegasus is running sends SIGINT to the entire foreground process group, which includes the spawned ssh processes.

Instead of waiting for ctrl-c with the tokio ctrl-c catcher, find another way to catch the user's intent to cancel.

Possible solutions:

  • Listen on stdin for a designated key sequence, like q<Enter> (see the sketch after this list).
  • Create a pegasus stop command. This should make sure to identify the specific pegasus process running in the current working directory, which will probably require a pid file.
    • It's okay to assume that there is only one instance of pegasus per directory (if not, more than one pegasus process would be manipulating queue.yaml and consumed.yaml).
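
A minimal sketch of the first option, assuming tokio with the io-std and io-util features enabled; the function name is illustrative, not Pegasus's actual API:

```rust
use tokio::io::{stdin, AsyncBufReadExt, BufReader};

// Resolve when the user types `q<Enter>` on stdin instead of catching
// ctrl-c, so SIGINT never reaches the foreground process group's ssh
// processes in the first place.
async fn wait_for_quit() {
    let mut lines = BufReader::new(stdin()).lines();
    while let Ok(Some(line)) = lines.next_line().await {
        if line.trim() == "q" {
            return; // user asked to cancel; begin graceful shutdown
        }
    }
}

#[tokio::main]
async fn main() {
    wait_for_quit().await;
    println!("shutting down gracefully");
}
```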

Run the scheduling loop until all commands finish executing

Currently, when daemon mode is not set, the scheduling loop of both broadcast mode and queue mode terminates right after sending out the last command to a free SSH session. But the time gap between dispatching the last command and that command finishing execution is vast.

In this time window, users will want to submit more jobs. Since Pegasus appears to be up and running, it is natural for users to assume that it will admit more jobs.
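
A toy sketch of the desired exit condition, using tokio's JoinSet; all names are illustrative, not Pegasus's actual internals:

```rust
use tokio::task::JoinSet;

// Exit only when the queue is empty AND every in-flight job has finished,
// not when the last job has merely been dispatched.
async fn drain(mut queue: Vec<&'static str>, mut running: JoinSet<()>) {
    loop {
        if let Some(cmd) = queue.pop() {
            running.spawn(async move { println!("running: {cmd}") });
        } else if running.join_next().await.is_none() {
            break; // nothing queued and nothing running: safe to exit
        }
    }
}

#[tokio::main]
async fn main() {
    drain(vec!["job-a", "job-b"], JoinSet::new()).await;
}
```

In the real loop, the queue check would re-read queue.yaml on each iteration, so jobs submitted while earlier commands are still executing would be admitted.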

Watch filesystem events on `queue.yaml` instead of polling every 3 secs

Currently, when queue.yaml is empty, the scheduling loop just sleeps for three seconds and then re-checks for new commands. This is polling, and less responsive than it could be.

Crates like notify do filesystem watching, but not asynchronously; in particular, its watcher takes std::sync::mpsc::{Sender, Receiver} in its constructor, and those are !Sync.
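
One possible bridge, assuming notify v5+ where the watcher takes a callback rather than a std channel: the callback forwards events into a tokio channel, whose sender is Send and safe to use from notify's own thread. A hedged sketch:

```rust
use notify::{recommended_watcher, RecursiveMode, Watcher};
use std::path::Path;
use tokio::sync::mpsc;

#[tokio::main]
async fn main() -> notify::Result<()> {
    let (tx, mut rx) = mpsc::unbounded_channel();
    // notify invokes this callback on its own thread; forwarding into an
    // unbounded tokio channel makes the events consumable from async code.
    let mut watcher = recommended_watcher(move |event| {
        let _ = tx.send(event);
    })?;
    watcher.watch(Path::new("queue.yaml"), RecursiveMode::NonRecursive)?;
    while let Some(event) = rx.recv().await {
        println!("queue.yaml changed: {event:?}"); // re-run the sched check here
    }
    Ok(())
}
```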

Check SSH connection before sched loop

Before entering the scheduling loop, Pegasus should create and connect each SSH session to verify that it can connect. If any connection fails, it should gracefully terminate all previously established connections and exit. The connected SSH session objects can then be moved into their tokio tasks.
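
A sketch of that behavior with the openssh crate; the host list and function name are illustrative:

```rust
use openssh::{KnownHosts, Session};

// Connect to every host up front; on the first failure, close the sessions
// opened so far and bail out before the scheduling loop ever starts.
async fn connect_all(hosts: &[&str]) -> Result<Vec<Session>, openssh::Error> {
    let mut sessions = Vec::new();
    for host in hosts {
        match Session::connect(host, KnownHosts::Add).await {
            Ok(session) => sessions.push(session),
            Err(err) => {
                for session in sessions {
                    let _ = session.close().await;
                }
                return Err(err);
            }
        }
    }
    Ok(sessions) // each Session can now be moved into its tokio task
}

#[tokio::main]
async fn main() {
    match connect_all(&["user@host1", "user@host2"]).await {
        Ok(sessions) => println!("all {} hosts reachable", sessions.len()),
        Err(err) => eprintln!("connection check failed: {err}"),
    }
}
```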

Graceful shutdown

Currently, Pegasus doesn't handle ctrl-c very well. When the user presses ctrl-c, Pegasus terminates without cancelling the SSH session tasks. So the ssh sessions remain open, the .ssh-connection* directories stay as they are, and in order to kill the processes spawned on remote nodes, I need to log into each node and kill them manually.

Potential solutions

  • One method would be to propagate the signal to all child ssh processes. Proper error handling inside the tasks would probably serve as an okay solution.
  • Another method I'd like to explore is whether the native-mux feature of the upcoming openssh release will make graceful shutdown any easier.
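
A minimal sketch of the overall shape, assuming the openssh crate; the host is illustrative and the cancellation of in-flight tasks is elided:

```rust
use openssh::{KnownHosts, Session};
use tokio::signal;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let session = Session::connect("user@host1", KnownHosts::Add).await?;

    signal::ctrl_c().await?;

    // ...cancel in-flight tasks here and await them before exiting...

    // Explicitly closing the session tears down the ssh multiplex master
    // and its .ssh-connection* control directory instead of leaking them.
    session.close().await?;
    Ok(())
}
```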

Consuming jobs in a finer granularity

```yaml
# queue.yaml
- command:
    - long_job {{ param }}
    - another_long_job {{ param }}
  param:
    - resnet50
    - ViT
```

Currently, Pegasus consumes the entire entry (virtually the entire file) from queue.yaml. However, it would be better to consume minimally. For instance, after dispatching the first command, queue.yaml would be left in the following state:

```yaml
# queue.yaml
- command:
    - another_long_job {{ param }}
  param:
    - resnet50
    - ViT
```

Then, only in the next round of get_one_job would queue.yaml become empty.
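
A sketch of that consumption logic; the Job struct mirrors the queue.yaml shape above, and all names are illustrative rather than Pegasus's actual internals:

```rust
struct Job {
    command: Vec<String>, // command templates, e.g. "long_job {{ param }}"
    param: Vec<String>,   // e.g. ["resnet50", "ViT"]
}

// Remove only the command template being dispatched; the entry leaves the
// queue (and hence queue.yaml) only once its command list is exhausted.
fn get_one_job(queue: &mut Vec<Job>) -> Option<(String, Vec<String>)> {
    let job = queue.first_mut()?;
    let command = job.command.remove(0);
    let params = job.param.clone();
    if job.command.is_empty() {
        queue.remove(0);
    }
    Some((command, params)) // caller expands {{ param }} and rewrites queue.yaml
}

fn main() {
    let mut queue = vec![Job {
        command: vec![
            "long_job {{ param }}".into(),
            "another_long_job {{ param }}".into(),
        ],
        param: vec!["resnet50".into(), "ViT".into()],
    }];
    let (cmd, _params) = get_one_job(&mut queue).unwrap();
    println!("dispatched template: {cmd}; entries left: {}", queue.len());
}
```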

Upgrade to openssh v0.9.0 with native-mux feature?

openssh v0.9.0-rc1 has been released with a new feature: native-mux.

It provides new functions like Session::connect_mux and SessionBuilder::connect_mux, which communicate with the ssh multiplex master directly through the control socket, instead of spawning a new process to communicate with it.

The advantages of this are more robust error reporting, better performance, and lower memory usage.

The old implementation (process-mux) checks the exit status of ssh for an indication of error, then parses its output and the output of the ssh multiplex master to return an error.

This method is obviously not as robust as native-mux, which communicates with the ssh multiplex master directly through its multiplex protocol.

The better performance and lower memory usage come mostly from avoiding the creation of a new process for every command spawned on the remote host, every Session::check, and every Session::request_port_forwarding.

The new release also adds a new function, Session::request_port_forwarding, which supports local/remote forwarding of TCP and Unix socket streams.
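
A sketch of the native-mux entry point described above, assuming the crate is built with the native-mux feature; the destination is illustrative:

```rust
use openssh::{KnownHosts, Session};

#[tokio::main]
async fn main() -> Result<(), openssh::Error> {
    // Talks to the ssh multiplex master directly over its control socket,
    // instead of spawning an ssh process per operation.
    let session = Session::connect_mux("user@host1", KnownHosts::Add).await?;
    let output = session.command("echo").arg("hello").output().await?;
    println!("{}", String::from_utf8_lossy(&output.stdout));
    session.close().await?;
    Ok(())
}
```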

There are also other API changes:

  • A new type Stdio is used for setting stdin/stdout/stderr.
  • ChildStd* types are now aliases for tokio_pipe::{PipeRead, PipeWrite}.
  • Command::spawn and Command::status now conform to std::process::Command and tokio::process::Command, in which stdin, stdout, and stderr are inherited by default.
  • Command::spawn is now an async method.
  • RemoteChild::wait now takes self by value.
  • Error is now marked #[non_exhaustive], and new variants have been added.

I know this is a huge release and upgrading is going to be quite difficult, but I sincerely want you to try it out, as the new implementation needs feedback.

Find a way to display progress

For a large parametrized command that generates many jobs and takes days to run, the user will want to see the progress of each generated command: queued, running, or done.

Internal bookkeeping is easy: perhaps keep a file into which we dump the commands in progress, and add a mode to pegasus that parses and displays that file.
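
For example, the bookkeeping file might look like this (the file name and schema are hypothetical):

```yaml
# status.yaml (hypothetical; name and schema are illustrative)
- command: long_job resnet50
  state: done
- command: long_job ViT
  state: running
- command: another_long_job resnet50
  state: queued
```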

Support dynamically updating `hosts.yaml`

Use case:

  • Add a node to hosts.yaml when you get access to more nodes, and the jobs you are currently running will automatically go there.
  • Remove a node from hosts.yaml when someone asks you to vacate the node.

Removing a node from hosts.yaml does not kill its running job. However, it is guaranteed that a node will not be given any new jobs once it has been removed from hosts.yaml.
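
For illustration, assuming hosts.yaml is a simple list of hostnames (the actual schema may differ):

```yaml
# hosts.yaml
- host1
- host2
- host3   # deleting this line stops new jobs from landing on host3,
          # but does not kill the job it is already running
```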

Terminate peacefully on connection failure

The current behavior of Pegasus on SSH connection failure is to panic and then abort. It would be better if the async tasks handling connections simply returned, and the scheduling loop detected this (or perhaps did not even start until all connections were successfully established) and terminated Pegasus peacefully.
