grc-iit / jarvis-cd

Jarvis-cd is a unified platform for deploying various applications

Home Page: https://github.com/grc-iit/jarvis-cd

License: MIT License

Python 99.40% Shell 0.21% Dockerfile 0.40%

ci-cd deployment jarvis scripting

jarvis-cd's Introduction

Jarvis-cd is a unified platform for deploying various applications, including storage systems and benchmarks. Many applications have complex configuration spaces and are difficult to deploy across different machines.

We provide a builtin repo which contains various applications to deploy. We refer to applications as "jarvis pkgs", which can be connected to form "deployment pipelines".

0.1 Dependencies

0.1.1. Jarvis-Util

Jarvis-CD depends on jarvis-util, which contains functions to execute binaries in Python and collect their output.

git clone https://github.com/scs-lab/jarvis-util.git
cd jarvis-util
python3 -m pip install -r requirements.txt
python3 -m pip install -e .

0.1.2. Scspkg

Scspkg is a tool for building modulefiles using a CLI. It's not strictly necessary for Jarvis to function, but many of the readmes use it to provide structure to manual installations.

git clone https://github.com/scs-lab/scspkg.git
cd scspkg
python3 -m pip install -r requirements.txt
python3 -m pip install -e .
echo "module use \`scspkg module dir\`" >> ~/.bashrc

The wiki for scspkg is here.

0.2. Installation

cd /path/to/jarvis-cd
python3 -m pip install -r requirements.txt
python3 -m pip install -e .

0.3. Configuring Jarvis

0.3.1. Bootstrapping from a specific machine

Jarvis has been pre-configured on some machines. To bootstrap from one of them, run the following:

jarvis bootstrap from ares

NOTE: Jarvis must be installed from the compute nodes in Ares, NOT the master node. This is because we store configuration data in /mnt/ssd by default, which exists only on compute nodes. We do not store data in /tmp since it will eventually be destroyed.

To check the set of available machines to bootstrap from, run:

jarvis bootstrap list

0.3.2. Creating a new configuration

A configuration can be generated as follows:

jarvis init [CONFIG_DIR] [PRIVATE_DIR] [SHARED_DIR (optional)]

CONFIG_DIR: A directory where jarvis metadata for pkgs and pipelines is stored. This directory can be anywhere that the current user can access.
PRIVATE_DIR: A directory whose path is common across all machines, but whose data is stored locally on each machine. Some jarvis pkgs require certain data to be stored per-machine. OrangeFS is an example.
SHARED_DIR: A directory which is common across all machines, where each machine has the same view of data in the directory. Most jarvis pkgs require this, but on machines without a global filesystem (e.g., Chameleon Cloud), this parameter can be set later.

For a personal machine, these directories can be the same directory.

jarvis-cd's People

Contributors

Stargazers

Watchers

Forkers

lukemartinlogan jye-525 jaimecernuda hxu65 candicet233

jarvis-cd's Issues

Add timing to pkgs

We should collect the time of each pkg and print.
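
One way this could look is sketched below. It is only an illustration: `timed`, `timings`, and `report` are hypothetical names, not part of the Jarvis API, and the `time.sleep` call stands in for starting a real pkg.

```python
import time
from contextlib import contextmanager

# Hypothetical helpers for illustration; not part of the Jarvis API.
timings = {}


@contextmanager
def timed(pkg_name):
    """Record the wall-clock time spent inside the block under pkg_name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[pkg_name] = time.perf_counter() - start


def report():
    """Return one line per pkg with its elapsed time in seconds."""
    return [f"{name}: {secs:.3f}s" for name, secs in timings.items()]


with timed("ior"):
    time.sleep(0.01)  # stand-in for starting a pkg

print("\n".join(report()))
```

Recording the time in a `finally` clause means a pkg that fails still gets a timing entry, which may help when diagnosing slow failures.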

Make the resource graph collection decoupled from the pruning phase

Use Slurm for the resource graph collection.

Walkthrough API after the resource graph.

Make it so there is only one environment for the entire pipeline.

Mistakenly, I made it so that each individual node caches a copy of the environment dictionary. This makes it harder to change environment variables for the pipeline.

Document resource graph file format

We should document the resource graph file format

Add logger with info, warning, and error level

Add MongoDB package

Add clean, status, and restart API

status should give status of the deployment/undeployment.

clean should remove data/metadata associated with deployment (if any)

restart should execute stop and start.
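
The relationship between these calls can be sketched as follows. The `Service` class and its attributes are assumptions made for illustration and do not reflect the actual Jarvis pkg base class.

```python
class Service:
    """Hypothetical pkg base class; names are illustrative only."""

    def __init__(self):
        self.running = False
        self.data = []  # stand-in for data/metadata created by a deployment

    def start(self):
        self.running = True
        self.data.append("state")

    def stop(self):
        self.running = False

    def status(self):
        """Report whether the deployment is currently up."""
        return "running" if self.running else "stopped"

    def clean(self):
        """Remove data/metadata associated with the deployment, if any."""
        self.data.clear()

    def restart(self):
        """Restart is simply stop followed by start."""
        self.stop()
        self.start()
```

Defining `restart` in the base class in terms of `stop` and `start` would let individual pkgs inherit it without extra code.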

Add scspkg as a dependency of jarvis

Many of these programs are manually installed. SCSPKG makes life much easier for these kinds of packages by providing a CLI to create and handle modulefiles.

Make it so packages can define custom environment variables

Sometimes, packages require more environment variables than the default ones in order to function. Orangefs, for example, requires ORANGEFS_PATH.
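
A minimal sketch of the idea, assuming each pkg can supply a dictionary of extra variables: `build_env` is a hypothetical helper and `/opt/orangefs` is a placeholder path, neither taken from Jarvis.

```python
import os


def build_env(base_env, pkg_env):
    """Return a copy of base_env extended with pkg-specific variables."""
    env = dict(base_env)
    env.update(pkg_env)  # pkg-specific values win over the defaults
    return env


# "/opt/orangefs" is a placeholder, not a real installation path.
env = build_env(os.environ, {"ORANGEFS_PATH": "/opt/orangefs"})
print(env["ORANGEFS_PATH"])
```

Copying the base environment rather than mutating it keeps one pkg's variables from leaking into another's.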

Wrap jarvis py call behind a shell script.

Update wiki to include hostfile format

Integrate Slurm job submissions

Make a separate cache for environments

Oftentimes, I find that I need to re-use the same environment multiple times. We should make it so that environments can be cached outside of the pipeline. The API I have in mind is as follows:

jarvis env create hermes-env
jarvis env build
jarvis pipeline env copy hermes-env

jarvis env create + build will cache the current environment.
pipeline env copy will copy the environment file to the config directory.

Add RAM to resource graph

Hermes needs it for buffering

Sockets provider doesn't play nice with the "domain" parameter

We should filter the resource graph to remove the domain from sockets providers. May also be incompatible with TCP/UDP.
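
The filter could be sketched as below. The resource graph format is not documented here, so the list-of-dictionaries layout and the `provider`, `domain`, and `fabric` keys are assumptions, as are the sample values.

```python
def strip_sockets_domain(networks):
    """Return copies of the network entries, dropping 'domain' for sockets."""
    cleaned = []
    for net in networks:
        net = dict(net)  # copy so the input entries are left untouched
        if net.get("provider") == "sockets":
            net.pop("domain", None)
        cleaned.append(net)
    return cleaned


# Sample entries with made-up keys and values, for illustration only.
graph = [
    {"provider": "sockets", "domain": "eth0", "fabric": "10.0.0.0/24"},
    {"provider": "verbs", "domain": "mlx5_0", "fabric": "IB"},
]
filtered = strip_sockets_domain(graph)
print(filtered)
```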

Adding repos in Jarvis.

We need to have a repository namespace resolution:
Essentially:

create folder structure of var/jarvis/repos where all prebuilt repositories would be present.
By default, we should have a builtin repository folder in Jarvis.
The list of all repositories added to Jarvis will be stored in etc/jarvis/repos.yaml similar to link
Adding a new repo will update the file at etc/jarvis/repos.yaml
We need a Repository entity which will load and serve a repository
We need a RepositoryManager which will maintain a list of all current repositories and instantiate the Repository entity.
We need a schema finder module that will locate a given schema from all existing repositories.

Each repository:

should have repo.yaml (fixed name) : similar to link
and a schemas folder that will contain all the deployment schemas as we have now.

Add Redis package

Document sbatch on jarvis-cd

Test the resource graph in parallel in Ares

Install jarvis-util and jarvis-cd according to the home page: https://github.com/scs-lab/jarvis-cd
Try building a resource graph: https://github.com/scs-lab/jarvis-cd/wiki/2.-Resource-Graph. Just section 2.1 and 2.2

Add packages for monitoring resource utilization

We should have a Service package which spawns resource monitoring programs and stores their outputs in well-formatted CSV files. This would be extremely helpful for future evaluations

Add FIO package

Schema Loader in Jarvis

Currently a schema is loaded in the main jarvis.py. This approach is not maintainable. I suggest the following:

Build a SchemaLoader entity that gets schema name as input and returns a schema object (currently called Graph).
Search Hierarchy should be repo -> schema.

This feature will work cohesively with issue #6.

Inconsistent naming convention in the Gray-Scott example

In the README.md file under the builtin gray_scott package (PATH: builtin/builtin/gray_scott/README.md), the scspkg package created in the first line (gray-scott) does not match the name used later (gray_scott) in the Installation section.

The mismatches are marked with comments below:

scspkg create gray-scott                    #gray-scott is different from gray_scott used below
cd `scspkg pkg src gray-scott`           #gray-scott is different from gray_scott used below
git clone https://github.com/pnorbert/adiosvm
cd adiosvm/Tutorial/gs-mpiio
mkdir build
pushd build
cmake ../ -DCMAKE_BUILD_TYPE=Release
make -j8
export GRAY_SCOTT_PATH=`pwd`
scspkg env set gray_scott GRAY_SCOTT_PATH="${GRAY_SCOTT_PATH}"
scspkg env prepend gray_scott PATH "${GRAY_SCOTT_PATH}"               #gray_scott is different from gray-scott used above
module load gray_scott
spack load mpi adios2

Update resource graph documentation

The resource graph doc should show how we can query the resource graph in modules. We should also mention that pkgs can modify the resource graph dynamically for use in future modules. For example, OrangeFS spawns a mount point, so it should modify the resource graph.

Slurm issue with multi-nodes

When running IOR with more than 2 nodes on Ares with this command:
jarvis pipeline sbatch job_name=ior2ntest nnodes=2 ppn=10 output_file=./4n_ior_test.out error_file=./4n_ior_test.err

Slurm is unable to start the job. The queue status shows:

             JOBID  PARTITION   NAME        USER       ST       TIME      NODES NODELIST(REASON)
              1866   compute       ior2ntes    mtang11   PD       0:00      2 (launch failed requeued held)

The IOR pipeline is already set to the correct nprocs and ppn values:

pipeline with name ior_test
  pkg_type=pipeline
  ior with name ior
    api=POSIX
    block=32m
    dbg_port=4000
    do_dbg=False
    fpp=False
    hide_output=False
    log=None
    nprocs=20
    out=/tmp/ior.bin
    pkg_type=ior
    ppn=10
    read=True
    reinit=False
    sleep=0
    stderr=None
    stdout=None
    write=True
    xfer=1m

Document the jarvis python interface

Try the getting started example in Ares

Try going over the getting started example: https://github.com/scs-lab/jarvis-cd/wiki/1.-Getting-Started

Hierarchical Argument Parser for Jarvis.

Currently we have a flat argument parser in Jarvis. For extensibility, we need hierarchical commands. For example:

jarvis-cd repos
jarvis-cd deploy <sub-commands such as --clean>

Please refer to link

Build a demonstration of how to use Jarvis for benchmarking

Potentially make a pipeline iterator

Unit test and CI to check Jarvis functionality and build

Please add a CI which will interpret Jarvis and test various functionalities it supports.

Paths have to be very specific for repos to work

We find that external repos cannot have a trailing slash.

For example, the following results in errors:

jarvis repo add /my_repo/

While the following is correct:

jarvis repo add /my_repo
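
One possible fix is to normalize the path before it is stored. The sketch below uses only the Python standard library; `normalize_repo_path` is a hypothetical helper name, not an existing Jarvis function.

```python
import os


def normalize_repo_path(path):
    """Return an absolute path with no trailing separator."""
    # abspath also collapses redundant separators and "." components.
    return os.path.abspath(os.path.expanduser(path))


print(normalize_repo_path("/my_repo/") == normalize_repo_path("/my_repo"))
```

With this in place, both forms of the command above would refer to the same repo.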

Document how to build a resource graph on slurm machines

The current documentation shows how to use the walkthrough build for machines where slurm is not used, which is really only Ares and small benchmark machines.

There is a way to submit a slurm job to collect the resource graph and prune later. This should be documented

Add YCSB package

Make jarvis-cd support multi-user deployments

We need to make Jarvis store configurations and pipelines per-user.

Add RocksDB package

(User experience) Specify to git clone jarvis-cd as well

Hello, the tutorial would be easier to follow if the README.md told users to git clone "jarvis-cd" before changing into the jarvis-cd directory in section 0.2. Installation.

Current:

cd /path/to/jarvis-cd
python3 -m pip install -r requirements.txt
python3 -m pip install -e .

Suggested change:

git clone https://github.com/grc-iit/jarvis-cd.git
cd /path/to/jarvis-cd
python3 -m pip install -r requirements.txt
python3 -m pip install -e .

Polaris hostfile is incorrectly interpreted by Thallium

Polaris uses the $PBS_NODELIST file. This file stores hostnames, which resolve to different IP addresses, and these addresses are resolved in different orders. As a result, the hostname resolution contains IP addresses from different domains. This causes problems in Thallium, which does not seem to support networking across domains, and leads to HG_NOENTRY errors.

We need to find a way to create a hostfile containing only IP addresses from the same domain.

To get the host names

ip addr
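
Once the candidate addresses are known, keeping only those from one domain could be sketched with the standard `ipaddress` module. The addresses and the subnet below are made up for illustration, `filter_by_subnet` is a hypothetical helper, and the sketch does not perform hostname resolution itself.

```python
import ipaddress


def filter_by_subnet(addresses, subnet):
    """Keep only the addresses that fall inside the given subnet."""
    net = ipaddress.ip_network(subnet)
    return [a for a in addresses if ipaddress.ip_address(a) in net]


# Made-up addresses: two on one network, one on another.
addrs = ["10.201.1.5", "192.168.0.7", "10.201.1.9"]
same_domain = filter_by_subnet(addrs, "10.201.0.0/16")
print(same_domain)  # ['10.201.1.5', '10.201.1.9']
```

The filtered list could then be written out one address per line as the hostfile.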

Add IOR package

Add unit tests for the jarvis python API

Fix OrangeFS package

Potentially change the jarvis hostfile format to support node allocation + grouping

Right now the jarvis hostfile is just a bag of hosts. Users will probably know high-level properties of the nodes they are going to allocate, such as "I want 10 compute nodes" and "15 storage nodes". However, we currently require users to know the exact hostnames beforehand, which is not typically realistic. It is in Ares, but not anywhere else. We need to augment the jarvis hostfile to be either a high-level job submission or a detailed node list.

Think about and document how to use Jarvis for multi-tenant deployments

Let's say we want to run the following pipeline:

OrangeFS
IOR (4KB, CPU 1-2) + IOR(1MB, CPU 3-4)

The two IORs should be spawned at the same time. How do we handle this?
Let's say that IOR supported a configuration option: async. This option would allow the first IOR to be spawned and then the next to be spawned almost immediately after. This would make them multi-tenant
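
The spawning behavior described above could be sketched with `subprocess.Popen`, which returns without waiting for the child to finish. This is only an assumption about how an async option might work: `run_concurrently` is a hypothetical helper, and the commands are short Python sleeps standing in for the two IOR runs.

```python
import subprocess
import sys


def run_concurrently(commands):
    """Start every command without waiting, then collect the exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]  # all start here
    return [p.wait() for p in procs]  # block until each one finishes


# Stand-in for an IOR invocation: a child process that sleeps briefly.
cmd = [sys.executable, "-c", "import time; time.sleep(0.1)"]
codes = run_concurrently([cmd, cmd])
print(codes)
```

CPU pinning (CPU 1-2 versus CPU 3-4) is not covered by this sketch and would need to be handled separately.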

Update Jarvis wiki with updated CLI + gray-scott tutorial

Jarvis pkg configure resets certain values back to default values unintentionally

jarvis pkg configure will reset unspecified values to their defaults, when in reality it should simply keep what was originally there. In other words, it will always reset the jarvis config instead of simply updating it. We need a way to detect when an argument was and was not specified explicitly by the user.
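
One way to detect which arguments were given explicitly is to suppress defaults during parsing, so that unspecified options never appear in the result. The sketch below uses `argparse` with `--flag` style options purely for illustration; Jarvis has its own argument syntax and parser, and `update_config` is a hypothetical helper.

```python
import argparse


def update_config(saved, argv):
    """Overlay only the explicitly passed options onto the saved config."""
    # With SUPPRESS, options missing from argv are absent from the namespace.
    parser = argparse.ArgumentParser(argument_default=argparse.SUPPRESS)
    parser.add_argument("--nprocs", type=int)
    parser.add_argument("--ppn", type=int)
    given = vars(parser.parse_args(argv))
    merged = dict(saved)
    merged.update(given)
    return merged


saved = {"nprocs": 20, "ppn": 10}
print(update_config(saved, ["--ppn", "4"]))  # {'nprocs': 20, 'ppn': 4}
```

Defaults would then be applied only once, when the pkg is first created, rather than on every configure call.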

build a slurm reservation and destroy job

In this enhancement, we should add nodes in Jarvis such as

AllocateNode(type:compute/storage, number_of_nodes): output allocation_id
Users should have the option of calling allocate explicitly (or having it called as part of Start). The allocation should be deallocated on Stop or on an explicit call.
We should also pass the allocation id as part of our deployment script to look up nodes for deployment.