Comments (11)
A cool variant of this I've just successfully used:
Edit services.py to print the command that was supposed to be run instead of actually running it, and put
import IPython
IPython.embed()
right after it. Then start Ray normally; when the interpreter hits IPython.embed(), start the process by hand (possibly under gdb). It works beautifully!
Here is the diff:
--- a/python/ray/local_scheduler/local_scheduler_services.py
+++ b/python/ray/local_scheduler/local_scheduler_services.py
@@ -117,6 +117,10 @@ def start_local_scheduler(plasma_store_name,
stdout=stdout_file, stderr=stderr_file)
time.sleep(1.0)
else:
- pid = subprocess.Popen(command, stdout=stdout_file, stderr=stderr_file)
+ print(" ".join(command))
+ import IPython
+ IPython.embed()
+ pid = None
+ # pid = subprocess.Popen(command, stdout=stdout_file, stderr=stderr_file)
time.sleep(0.1)
return local_scheduler_name, pid
diff --git a/python/ray/services.py b/python/ray/services.py
index a37c16a..e6f3ede 100644
--- a/python/ray/services.py
+++ b/python/ray/services.py
@@ -555,7 +555,7 @@ def start_local_scheduler(redis_address,
stderr_file=stderr_file,
static_resource_list=[num_cpus, num_gpus],
num_workers=num_workers)
- if cleanup:
+ if cleanup and p:
all_processes[PROCESS_TYPE_LOCAL_SCHEDULER].append(p)
record_log_files_in_redis(redis_address, node_ip_address,
[stdout_file, stderr_file])
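Outside of Ray, the same trick boils down to a few lines. This is a generic sketch (the launch helper is a made-up name, not Ray code), using the stdlib code module in place of IPython.embed():

```python
import subprocess

def launch(command, debug=False):
    """Spawn `command`, or, in debug mode, print it and drop into a REPL
    so the command can be started by hand (e.g. under gdb).

    Generic illustration of the trick in the diff above; not Ray code.
    """
    if debug:
        print(" ".join(command))  # copy-paste this into another terminal
        import code               # stdlib stand-in for IPython.embed()
        code.interact(local=dict(globals(), **locals()))
        return None               # no process was started
    return subprocess.Popen(command)
```

With debug=False it behaves like the original subprocess.Popen call; with debug=True it pauses exactly where the process would have been spawned.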
-- Philipp.
from ray.
I've been using tmux panes to debug. I wrote a script here that may be helpful to others. Originally I had this in tmuxinator, but wanted to reduce the number of dependencies.
It's a little finicky since the script doesn't really respect dependencies between the processes, but it's a little less painful than starting all of the processes by hand.
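For reference, the core of such a script can be sketched as composing tmux commands, one pane per process. The session name and process commands below are illustrative placeholders, not the actual script:

```python
# Sketch: generate the tmux invocations that open one pane per Ray process.
PROCESSES = [
    "redis-server --port 6379",
    "global_scheduler -r 127.0.0.1:6379",
    "plasma_store -s /tmp/s1 -m 1000000000",
]

def tmux_commands(session="ray-debug", processes=PROCESSES):
    # Create a detached session, then split one pane per extra process.
    cmds = ["tmux new-session -d -s %s" % session]
    for i, proc in enumerate(processes):
        if i > 0:
            cmds.append("tmux split-window -t %s" % session)
        # send-keys types the command into the pane and presses Enter.
        cmds.append("tmux send-keys -t %s '%s' Enter" % (session, proc))
    return cmds
```

Running the generated commands through subprocess (or printing them for copy-paste) gives one visible pane per process, which makes tailing their output much easier.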
Closing for now.
Updated instructions
Start Redis
rm dump.rdb; ./src/common/thirdparty/redis/src/redis-server --loadmodule src/common/redis_module/ray_redis_module.so
Start the global scheduler (and pass in the Redis address)
./src/global_scheduler/build/global_scheduler -r 127.0.0.1:6379
Start the plasma store
src/plasma/build/plasma_store -s /tmp/s1 -m 1000000000
Start the plasma manager
src/plasma/build/plasma_manager -s /tmp/s1 -m /tmp/m1 -r 127.0.0.1:6379 -h 127.0.0.1 -p 23894
Start the local scheduler
src/photon/build/photon_scheduler -s /tmp/sched1 -p /tmp/s1 -m /tmp/m1 -r 127.0.0.1:6379 -a 127.0.0.1:23894 -h 127.0.0.1
Start a worker (or run this multiple times to start multiple workers).
python lib/python/ray/workers/default_worker.py --redis-address 127.0.0.1:6379 --object-store-name /tmp/s1 --object-store-manager-name /tmp/m1 --local-scheduler-name /tmp/sched1 --node-ip-address 127.0.0.1
Connect a driver (run this in a Python interpreter).
import ray
address_info = {
"node_ip_address": "127.0.0.1",
"redis_address": "127.0.0.1:6379",
"store_socket_name": "/tmp/s1",
"manager_socket_name": "/tmp/m1",
"local_scheduler_socket_name": "/tmp/sched1"}
ray.connect(address_info, mode=ray.SCRIPT_MODE)
Some other useful things:
- Start a redis client with redis-cli and run monitor to monitor all of the commands going through the redis server.
Updating this again, since the instructions have changed.
Start Redis
rm dump.rdb; ./python/core/src/common/thirdparty/redis/src/redis-server --loadmodule python/core/src/common/redis_module/libray_redis_module.so
Start the global scheduler (and pass in the Redis address)
./python/core/src/global_scheduler/global_scheduler -r 127.0.0.1:6379
Start the plasma store
./python/core/src/plasma/plasma_store -s /tmp/s1 -m 1000000000
Start the plasma manager
./python/core/src/plasma/plasma_manager -s /tmp/s1 -m /tmp/m1 -r 127.0.0.1:6379 -h 127.0.0.1 -p 23894
Start the local scheduler
./python/core/src/photon/photon_scheduler -s /tmp/sched1 -p /tmp/s1 -m /tmp/m1 -r 127.0.0.1:6379 -a 127.0.0.1:23894 -h 127.0.0.1
Start a worker (or run this multiple times to start multiple workers).
python python/ray/workers/default_worker.py --redis-address 127.0.0.1:6379 --object-store-name /tmp/s1 --object-store-manager-name /tmp/m1 --local-scheduler-name /tmp/sched1 --node-ip-address 127.0.0.1
Connect a driver (run this in a Python interpreter).
import ray
address_info = {
"node_ip_address": "127.0.0.1",
"redis_address": "127.0.0.1:6379",
"store_socket_name": "/tmp/s1",
"manager_socket_name": "/tmp/m1",
"local_scheduler_socket_name": "/tmp/sched1"}
ray.connect(address_info, mode=ray.SCRIPT_MODE)
Some other useful things:
- Start a redis client with redis-cli and run monitor to monitor all of the commands going through the redis server.
Updated instructions.
Start Redis
rm dump.rdb; ./python/ray/core/src/common/thirdparty/redis/src/redis-server \
--loadmodule python/ray/core/src/common/redis_module/libray_redis_module.so
Start the global scheduler (and pass in the Redis address)
./python/ray/core/src/global_scheduler/global_scheduler \
-r 127.0.0.1:6379 \
-h 127.0.0.1
Start the plasma store
./python/ray/core/src/plasma/plasma_store \
-s /tmp/s1 \
-m 1000000000
Start the plasma manager
./python/ray/core/src/plasma/plasma_manager \
-s /tmp/s1 \
-m /tmp/m1 \
-r 127.0.0.1:6379 \
-h 127.0.0.1 \
-p 23894
Start the local scheduler
# Without ability to start new workers.
./python/ray/core/src/local_scheduler/local_scheduler \
-s /tmp/sched1 \
-p /tmp/s1 \
-m /tmp/m1 \
-r 127.0.0.1:6379 \
-a 127.0.0.1:23894 \
-h 127.0.0.1
# With ability to start new workers.
./python/ray/core/src/local_scheduler/local_scheduler \
-s /tmp/sched1 \
-p /tmp/s1 \
-m /tmp/m1 \
-r 127.0.0.1:6379 \
-a 127.0.0.1:23894 \
-h 127.0.0.1 \
-w "python python/ray/workers/default_worker.py --redis-address 127.0.0.1:6379 --object-store-name /tmp/s1 --object-store-manager-name /tmp/m1 --local-scheduler-name /tmp/sched1 --node-ip-address 127.0.0.1"
Start a worker (or run this multiple times to start multiple workers).
python python/ray/workers/default_worker.py \
--redis-address 127.0.0.1:6379 \
--object-store-name /tmp/s1 \
--object-store-manager-name /tmp/m1 \
--local-scheduler-name /tmp/sched1 \
--node-ip-address 127.0.0.1
Connect a driver (run this in a Python interpreter).
import ray
address_info = {
"node_ip_address": "127.0.0.1",
"redis_address": "127.0.0.1:6379",
"store_socket_name": "/tmp/s1",
"manager_socket_name": "/tmp/m1",
"local_scheduler_socket_name": "/tmp/sched1"}
ray.connect(address_info, mode=ray.SCRIPT_MODE)
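If ray.connect fails, it is worth verifying that the socket files in address_info actually exist. A small debugging helper for that (this is not part of the Ray API):

```python
import os

def missing_sockets(address_info):
    """Return the socket paths in address_info that do not exist yet.

    Debugging helper only; not part of Ray.
    """
    keys = ["store_socket_name", "manager_socket_name",
            "local_scheduler_socket_name"]
    return [address_info[k] for k in keys
            if not os.path.exists(address_info[k])]
```

An empty result means the plasma store, plasma manager, and local scheduler sockets are all in place; anything listed points at a process that has not started (or was started with a different socket name).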
Some other useful things:
- Start a redis client with redis-cli and run monitor to monitor all of the commands going through the redis server.
More recent instructions.
Note: Throughout, you will need to replace <head-node-ip> with the actual IP address of the head node.
The processes on the head node can be started as follows.
- Run the following in Python to start the Redis shards.
import ray
ray.services.start_redis("<head-node-ip>", port=6379, num_redis_shards=1, redirect_output=False, cleanup=False)
- Start a global scheduler.
./python/ray/core/src/global_scheduler/global_scheduler \
-r <head-node-ip>:6379 \
-h <head-node-ip>
- Start a plasma store
./python/ray/core/src/plasma/plasma_store \
-s /tmp/s1 \
-m 1000000000
- Start a plasma manager
./python/ray/core/src/plasma/plasma_manager \
-s /tmp/s1 \
-m /tmp/m1 \
-r <head-node-ip>:6379 \
-h <head-node-ip> \
-p 23894
- Start the local scheduler
# Without ability to start new workers.
./python/ray/core/src/local_scheduler/local_scheduler \
-s /tmp/sched1 \
-p /tmp/s1 \
-m /tmp/m1 \
-r <head-node-ip>:6379 \
-a <head-node-ip>:23894 \
-h <head-node-ip> \
-c 16,0
# With ability to start new workers.
./python/ray/core/src/local_scheduler/local_scheduler \
-s /tmp/sched1 \
-p /tmp/s1 \
-m /tmp/m1 \
-r <head-node-ip>:6379 \
-a <head-node-ip>:23894 \
-h <head-node-ip> \
-c 16,0 \
-w "python python/ray/workers/default_worker.py --redis-address <head-node-ip>:6379 --object-store-name /tmp/s1 --object-store-manager-name /tmp/m1 --local-scheduler-name /tmp/sched1 --node-ip-address <head-node-ip>"
- Start a worker (or run this multiple times to start multiple workers).
python python/ray/workers/default_worker.py \
--redis-address <head-node-ip>:6379 \
--object-store-name /tmp/s1 \
--object-store-manager-name /tmp/m1 \
--local-scheduler-name /tmp/sched1 \
--node-ip-address <head-node-ip>
- Connect a driver (run this in a Python interpreter).
import ray
ray.init(redis_address="<head-node-ip>:6379")
After that, it should be possible to start up Ray on other nodes as follows.
ray start --redis-address=<head-node-ip>:6379
If you wish to start the Redis servers by hand instead of calling ray.services.start_redis as in the previous comment, you can do the following (this starts one primary Redis server and one other Redis shard). Note that the value <head-node-ip> will need to be replaced.
./python/ray/core/src/common/thirdparty/redis/src/redis-server \
--loglevel warning \
--loadmodule ./python/ray/core/src/common/redis_module/libray_redis_module.so \
--port 6379
./python/ray/core/src/common/thirdparty/redis/src/redis-server \
--loglevel warning \
--loadmodule ./python/ray/core/src/common/redis_module/libray_redis_module.so \
--port 6380
./python/ray/core/src/common/thirdparty/redis/src/redis-cli -p 6379 set NumRedisShards 1
./python/ray/core/src/common/thirdparty/redis/src/redis-cli -p 6379 rpush RedisShards <head-node-ip>:6380
./python/ray/core/src/common/thirdparty/redis/src/redis-cli -p 6379 config set notify-keyspace-events Kl
./python/ray/core/src/common/thirdparty/redis/src/redis-cli -p 6380 config set notify-keyspace-events Kl
./python/ray/core/src/common/thirdparty/redis/src/redis-cli -p 6379 config set protected-mode no
./python/ray/core/src/common/thirdparty/redis/src/redis-cli -p 6380 config set protected-mode no
./python/ray/core/src/common/thirdparty/redis/src/redis-cli -p 6379 config set client-output-buffer-limit "normal 0 0 0 slave 268435456 67108864 60 pubsub 134217728 134217728 60"
./python/ray/core/src/common/thirdparty/redis/src/redis-cli -p 6380 config set client-output-buffer-limit "normal 0 0 0 slave 268435456 67108864 60 pubsub 134217728 134217728 60"
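The repetitive redis-cli invocations above follow the same pattern for every shard port, so they can be generated for any number of shards. A sketch (the binary path and config values are taken from the commands above):

```python
REDIS_CLI = "./python/ray/core/src/common/thirdparty/redis/src/redis-cli"

def shard_config_commands(ports):
    """Generate the per-shard redis-cli config commands shown above."""
    buffer_limit = ("normal 0 0 0 slave 268435456 67108864 60 "
                    "pubsub 134217728 134217728 60")
    cmds = []
    for port in ports:
        prefix = "%s -p %d config set " % (REDIS_CLI, port)
        cmds.append(prefix + "notify-keyspace-events Kl")
        cmds.append(prefix + "protected-mode no")
        cmds.append(prefix + 'client-output-buffer-limit "%s"' % buffer_limit)
    return cmds
```

For example, shard_config_commands([6379, 6380]) reproduces the six config-set commands above; the NumRedisShards/RedisShards keys still need to be set separately on the primary.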
@danielsuo that looks like a step in the right direction! I think the most useful thing would be to be able to start any of the Ray processes within tmux (and also within gdb), e.g., integrating something like this within services.py.
A tutorial on this would be great!