Ray Cluster Launcher

Description

An example of using the Ray Cluster Launcher to deploy a Ray cluster on on-premise servers.

Client Prerequisites

This project assumes that your client machine (e.g., a personal laptop) has SSH access to two or more on-premise servers.

Cluster Prerequisites

The on-premise servers must have the following setup:

  • Docker is installed.
  • The head node firewall exposes the ports Ray uses within the private network. See the Ray documentation for the default ports.
  • The head node has SSH access to all worker nodes.

Installation

  1. Clone this repository.
git clone https://github.com/jacksonjacobs1/ray-cluster-launcher.git
  2. Change directory to the repository.
cd ray-cluster-launcher
  3. Create a virtual environment and activate it.
python3 -m venv venv
source venv/bin/activate
  4. Install the dependencies.
pip install -r requirements.txt

Cluster Setup

Cluster setup is a user-level procedure, not a system-level one: each user must set up their own cluster. The client machine is used to launch the cluster, so passwordless SSH login must be enabled from the client machine to every cluster node.

This can be done by generating a public/private key pair on the client machine and copying the public key to the cluster nodes. See here for more details. Here is an example of how this may be done:

  1. On the client machine, generate a public/private key pair and press enter through the prompts. It's important to leave the passphrase blank.
ssh-keygen -t rsa -b 4096 -C "<insert-key-identifier-here>"
  2. Copy the public key to each node in the cluster.
ssh-copy-id -i ~/.ssh/id_rsa.pub <head-node-username>@<head-node-ip-address>
ssh-copy-id -i ~/.ssh/id_rsa.pub <worker-node-username>@<worker-node-ip-address>
  3. Test that passwordless login works.
ssh -i ~/.ssh/id_rsa <head-node-username>@<head-node-ip-address>
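
With more than a couple of nodes, testing each login by hand gets tedious. The following is a minimal sketch, not part of this repository, that checks every node in one pass. It relies on the OpenSSH option BatchMode=yes, which makes ssh fail instead of prompting for a password. The function names and the SSH_BIN override are hypothetical and exist only for illustration.

```shell
# Sketch: verify key-based SSH login to each node without prompting.
# SSH_BIN is a hypothetical override (defaults to ssh) so the functions
# can be exercised without a network.
SSH_BIN="${SSH_BIN:-ssh}"

check_passwordless() {
    # $1 is a user@host target. With BatchMode=yes, ssh exits non-zero
    # rather than asking for a password, so exit status 0 means key login works.
    "$SSH_BIN" -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

check_all_nodes() {
    # Prints one OK/FAIL line per target; returns non-zero if any target failed.
    failed=0
    for target in "$@"; do
        if check_passwordless "$target"; then
            echo "OK   $target"
        else
            echo "FAIL $target"
            failed=1
        fi
    done
    return "$failed"
}
```

Example usage, with the same placeholders as above: `check_all_nodes <head-node-username>@<head-node-ip-address> <worker-node-username>@<worker-node-ip-address>`.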

Cluster Configuration

The cluster configuration is defined in the local_cluster_config.yaml file. See here for more configuration examples.

You will need to modify the following parameters for your own on-premise cluster:

head_ip: <ip-or-hostname-of-head-node>
worker_ips: [<ip-or-hostname-of-worker-node-1>, <ip-or-hostname-of-worker-node-2>, ...]
ssh_user: <username>
min_workers: <number of workers>
max_workers: <number of workers>
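
Before launching, it can help to confirm that no placeholder was left unfilled. The sketch below assumes the placeholders are written in angle brackets, as in the listing above; the function name is hypothetical and the check is a plain text search, not a YAML validation.

```shell
# Sketch: list lines of a config file that still contain a <...> placeholder.
# Assumes placeholders use angle brackets, as in the parameter listing above.
find_unfilled_placeholders() {
    # $1 is the path to the config file.
    # Prints "line-number:text" for each match.
    # Exit status is 0 if any placeholder was found, 1 if none were.
    grep -n '<[^<>]*>' "$1"
}
```

Example usage: `find_unfilled_placeholders local_cluster_config.yaml && echo "Fill in the lines above before running ray up"`.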

Cluster Launch

To launch the cluster, run the following command:

ray up local_cluster_config.yaml

To verify that the cluster is running, attach to the head node:

ray attach local_cluster_config.yaml

Check the status of the cluster:

(base) ray@SOMAI-SERV01:~$ ray status
======== Autoscaler status: 2023-12-08 14:29:16.383004 ========
Node status
---------------------------------------------------------------
Active:
 2 local.cluster.node
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/64.0 CPU
 0.0/4.0 GPU
 0B/203.37GiB memory
 0B/37.92GiB object_store_memory

Demands:
 (no resource demands)
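
To check from a script that the expected number of nodes joined, the Active section of this output can be summed. The following is a rough sketch that assumes the output layout shown above (an "Active:" header followed by lines of the form "<count> <node-type>", ending at "Pending:"); the function name is hypothetical, and the layout may differ between Ray versions.

```shell
# Sketch: sum the node counts listed under "Active:" in `ray status` output.
# Reads the status text on stdin and prints the total (0 if none found).
count_active_nodes() {
    awk '
        /^Active:/  { in_active = 1; next }   # start of the Active section
        /^Pending:/ { in_active = 0 }         # end of the Active section
        in_active && $1 ~ /^[0-9]+$/ { total += $1 }
        END { print total + 0 }
    '
}
```

For the output shown above, piping it into `count_active_nodes` prints 2. One possible way to feed it from the client machine is `ray exec local_cluster_config.yaml 'ray status' | count_active_nodes`, assuming ray exec runs the command in the same environment as the attached session.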

Cluster Teardown

To tear down the cluster, run the following command from the client machine:

ray down local_cluster_config.yaml

The ray down command does not always terminate the worker nodes properly, as documented here. Incomplete termination can cause problems when launching subsequent clusters. To check whether the cluster has terminated properly, follow these steps:

  1. SSH into a worker node.
ssh <worker-node-username>@<worker-node-ip-address>
  2. Check for a hanging Docker container.
docker ps | grep ray_container
  3. If the container is still running, stop it.
docker stop ray_container


