Comments (4)
For now, I think using pssh is the way to go. Assuming you have a workers.txt with the IP addresses of all the nodes other than the head node, you can do the following. The instructions are similar to the ones in https://github.com/ray-project/ray/blob/master/doc/using-ray-on-a-large-cluster.md and should probably be added to that file.
- Stop, update, and start Ray on the head node:

```shell
# Stop Ray
ray/scripts/stop_ray.sh
# Update Ray
cd ~/ray/python
git pull
python setup.py install
# Start Ray
cd ~
ray/scripts/start_ray.sh --head --num-workers=10 --redis-port=6379
```
- Then make a script, e.g. `script.sh`, to run via parallel ssh on all the other nodes:

```shell
export PATH=/home/ubuntu/anaconda2/bin/:$PATH
ray/scripts/stop_ray.sh
cd ~/ray/python
git pull
python setup.py install
cd ~
ray/scripts/start_ray.sh --num-workers=10 --redis-address=<head-node-ip>:6379
```
- Then run it via parallel-ssh:

```shell
parallel-ssh -h workers.txt -P -I < script.sh
```

This assumes that workers.txt contains the (private) IP addresses of all the nodes other than the head node, and that you can ssh to those nodes from the head node.
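Before relying on parallel-ssh, it can be worth confirming that passwordless ssh actually works for every host in workers.txt. A minimal sketch, with the caveat that `check_workers` and the `SSH_CMD` override are made-up names for illustration, not part of Ray or pssh:

```shell
#!/bin/sh
# Try a no-op ssh to each host listed in the given file; print ok/FAIL per
# host and return nonzero if any host is unreachable. SSH_CMD is overridable
# so the loop can be exercised without real hosts.
check_workers() {
  hosts_file=$1
  ssh_cmd=${SSH_CMD:-"ssh -o BatchMode=yes -o ConnectTimeout=5"}
  failed=0
  while read -r ip; do
    if $ssh_cmd "$ip" true 2>/dev/null; then
      echo "ok   $ip"
    else
      echo "FAIL $ip"
      failed=1
    fi
  done < "$hosts_file"
  return $failed
}
```

Running `check_workers workers.txt` on the head node before the update gives a quick per-host connectivity report.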
I've been trying to use a script for doing the initial installation of Ray so you don't need to create an AMI, but I haven't quite gotten it to work.
initial_setup.sh:

```shell
sudo apt-get update
sudo apt-get install -y cmake build-essential autoconf curl libtool libboost-all-dev unzip emacs
wget https://repo.continuum.io/archive/Anaconda2-4.3.0-Linux-x86_64.sh -O ~/anaconda.sh
bash ~/anaconda.sh -b -p $HOME/anaconda
export PATH="$HOME/anaconda/bin:$PATH"
echo 'export PATH="$HOME/anaconda/bin:$PATH"' >> ~/.bashrc
git clone https://github.com/ray-project/ray.git
cd ray/python
python setup.py install
conda install -y libgcc
pip install numpy cloudpickle funcsigs colorama psutil redis
```

Run it with:

```shell
parallel-ssh -h workers.txt -P -I -t 0 < initial_setup.sh
```
The `-t 0` is to prevent it from timing out and dying. It currently dies with messages like:

```
[FAILURE] 172.31.1.198 Exited with error code 127
```
FYI @jssmith.
My guess is that the command `git clone https://github.com/ray-project/ray.git` (or something similar) is failing when it is run a second time, e.g., with the error `fatal: destination path 'ray' already exists and is not an empty directory`. So we need to make sure that the failure of one command doesn't prevent the others from succeeding.
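One way to get there is to make each step idempotent, so re-running the whole script is harmless. A sketch for the git step, assuming a hypothetical helper named `clone_or_update` (not something that exists in the Ray repo):

```shell
#!/bin/sh
# Pull if the checkout already exists, clone otherwise, so a second run of
# initial_setup.sh does not die with
# "fatal: destination path 'ray' already exists".
clone_or_update() {
  repo=$1
  dir=$2
  if [ -d "$dir/.git" ]; then
    git -C "$dir" pull --ff-only
  else
    git clone "$repo" "$dir"
  fi
}

# e.g., from $HOME on each worker:
# clone_or_update https://github.com/ray-project/ray.git ray
```

The same pattern (check the state, then act) applies to the Anaconda download and the `PATH` line in `~/.bashrc`.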
A few responses here:
On Updating
Instructions for updating are basically right. The steps should be: 1/ shut down Ray on all nodes, 2/ write an update script, 3/ run the update script on the head node, 4/ run the update script in parallel on the other nodes, 5/ start up Ray on all nodes. I agree that we should add these instructions to https://github.com/ray-project/ray/blob/master/doc/using-ray-on-a-large-cluster.md
On AMI
This should work. Some suggestions on bug fixes, then I'll comment on whether it is a good idea. You can use the `-o` and `-e` options on parallel-ssh to redirect the standard output and standard error from each host to files, which should help in debugging. Also, as written I think the script will just keep running even if individual commands result in errors (a newline behaves like `;`, not like `&&`). It really would be best to have a script that won't produce any errors along the way. That way we can just confirm that no error output was generated after running parallel-ssh and have confidence that the installation was successful.
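The newline-vs-`&&` point is easy to check directly: without `set -e`, a shell script keeps going after a failed command. A small demonstration (the function names are arbitrary):

```shell
#!/bin/sh
# A newline between commands acts like ";": the echo still runs even though
# "false" failed. With "set -e" (or chaining via &&), the script aborts
# at the first failure instead.
run_plain() {
  sh -c 'false
echo survived'
}
run_strict() {
  sh -c 'set -e
false
echo survived'
}
```

So starting initial_setup.sh with `set -e`, then fixing whichever step fails, would give the no-errors-along-the-way property, and the per-host `-e` logs make it easy to see which host and step broke.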
I still tend to prefer steering users toward creating AMIs, but this is worth considering. For one, with an AMI the user doesn't have to produce a setup script that runs smoothly end to end. If there is a need for libraries that require license approval, large files, etc., it may be easier to do the setup once, by hand, and then clone the result of this work. The larger worry I have is that whenever there are external dependencies, e.g., downloading Anaconda or other packages, speed and success become variable factors. Unless one has a good way to verify the success of the installation on each machine, this is a risky way to go. Note that these risks usually scale with the number of machines: for small clusters the AMI may have less value, but as you get to larger installations it becomes increasingly useful to bring up all of the machines in a well-defined state.
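One cheap way to verify a scripted install is mechanical: after running `parallel-ssh ... -e err/`, any non-empty file under `err/` points at a host where something went wrong. A sketch, where `check_err_dir` is a hypothetical helper and the `err/` layout (one file per host) matches what pssh's `-e` option writes:

```shell
#!/bin/sh
# Fail if any per-host stderr file captured by `parallel-ssh -e err/ ...` is
# non-empty; print the offending host's output to aid debugging.
check_err_dir() {
  dir=$1
  bad=0
  for f in "$dir"/*; do
    [ -e "$f" ] || continue   # glob matched nothing: directory is empty
    if [ -s "$f" ]; then
      echo "stderr from $(basename "$f"):"
      cat "$f"
      bad=1
    fi
  done
  return $bad
}
```

This doesn't remove the variability of external downloads, but it does turn "did the install succeed everywhere?" into a single exit code.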
Instructions for updating the version of Ray using parallel-ssh have been added. #256