grc-iit / jarvis-cd

Jarvis-cd is a unified platform for deploying various applications

Home Page: https://github.com/grc-iit/jarvis-cd

License: MIT License

Python 99.40% Shell 0.21% Dockerfile 0.40%
ci-cd deployment jarvis scripting

jarvis-cd's Introduction

Jarvis-cd is a unified platform for deploying various applications, including storage systems and benchmarks. Many applications have complex configuration spaces and are difficult to deploy across different machines.

We provide a builtin repo which contains various applications to deploy. We refer to applications as "jarvis pkgs", which can be connected to form "deployment pipelines".

0.1. Dependencies

0.1.1. Jarvis-Util

Jarvis-CD depends on jarvis-util. jarvis-util contains functions to execute binaries in python and collect their output.

git clone https://github.com/scs-lab/jarvis-util.git
cd jarvis-util
python3 -m pip install -r requirements.txt
python3 -m pip install -e .

0.1.2. Scspkg

Scspkg is a tool for building modulefiles using a CLI. It's not strictly necessary for Jarvis to function, but many of the readmes use it to provide structure to manual installations.

git clone https://github.com/scs-lab/scspkg.git
cd scspkg
python3 -m pip install -r requirements.txt
python3 -m pip install -e .
echo "module use \`scspkg module dir\`" >> ~/.bashrc

The wiki for scspkg is here.

0.2. Installation

cd /path/to/jarvis-cd
python3 -m pip install -r requirements.txt
python3 -m pip install -e .

0.3. Configuring Jarvis

0.3.1. Bootstrapping from a specific machine

Jarvis has been pre-configured on some machines. To bootstrap from one of them, run the following:

jarvis bootstrap from ares

NOTE: Jarvis must be installed from the compute nodes in Ares, NOT the master node. This is because we store configuration data in /mnt/ssd by default, which exists only on compute nodes. We do not store data in /tmp, since it is eventually destroyed.

To check the set of available machines to bootstrap from, run:

jarvis bootstrap list

0.3.2. Creating a new configuration

A configuration can be generated as follows:

jarvis init [CONFIG_DIR] [PRIVATE_DIR] [SHARED_DIR (optional)]
  • CONFIG_DIR: A directory where jarvis metadata for pkgs and pipelines are stored. This directory can be anywhere that the current user can access.
  • PRIVATE_DIR: A directory which is common across all machines, but stores data locally to the machine. Some jarvis pkgs require certain data to be stored per-machine. OrangeFS is an example.
  • SHARED_DIR: A directory which is common across all machines, where each machine has the same view of data in the directory. Most jarvis pkgs require this, but on machines without a global filesystem (e.g., Chameleon Cloud), this parameter can be set later.

For a personal machine, these directories can be the same directory.

jarvis-cd's People

Contributors

lukemartinlogan, jaimecernuda, candicet233, hariharan-devarajan, mengtang-pnnl, hxu65, jye-525, waugh2010

jarvis-cd's Issues

Add clean, status, and restart API

status should give status of the deployment/undeployment.

clean should remove data/metadata associated with deployment (if any)

restart should execute stop and start.
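
A minimal sketch of what such an API could look like is given below. The class names Deployable and DummyPkg are hypothetical and are not taken from the Jarvis code base; the only point is that restart can be written once in terms of stop and start, while clean and status are left to each pkg.

```python
from abc import ABC, abstractmethod
from typing import List


class Deployable(ABC):
    """Hypothetical lifecycle interface; not the actual Jarvis pkg API."""

    @abstractmethod
    def start(self) -> None:
        """Launch the application."""

    @abstractmethod
    def stop(self) -> None:
        """Terminate the application."""

    @abstractmethod
    def clean(self) -> None:
        """Remove data and metadata left behind by the deployment, if any."""

    @abstractmethod
    def status(self) -> str:
        """Report the state of the deployment."""

    def restart(self) -> None:
        # restart is defined once, as stop followed by start.
        self.stop()
        self.start()


class DummyPkg(Deployable):
    """Toy in-memory pkg used only to exercise the interface."""

    def __init__(self) -> None:
        self.running = False
        self.data: List[str] = []
        self.calls: List[str] = []

    def start(self) -> None:
        self.running = True
        self.data.append("metadata")
        self.calls.append("start")

    def stop(self) -> None:
        self.running = False
        self.calls.append("stop")

    def clean(self) -> None:
        self.data.clear()
        self.calls.append("clean")

    def status(self) -> str:
        return "running" if self.running else "stopped"
```

Keeping restart in the base class means individual pkgs only have to implement the four primitive operations.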

Add scspkg as a dependency of jarvis

Many of these programs are manually installed. SCSPKG makes life much easier for these kinds of packages by providing a CLI to create and handle modulefiles.

Make a separate cache for environments

Oftentimes, I find that I need to re-use the same environment multiple times. We should make it so that environments can be cached outside of the pipeline. The API I have in mind is as follows:

jarvis env create hermes-env
jarvis env build
jarvis pipeline env copy hermes-env

jarvis env create + build will cache the current environment.
pipeline env copy will copy the environment file to the config directory.
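
The proposed behaviour could be sketched roughly as follows. The function names, the JSON file format, and the env.json file name are all assumptions made for illustration; the issue does not specify how Jarvis would store a cached environment.

```python
import json
import os
import shutil
from pathlib import Path


def env_build(cache_dir: Path, name: str) -> Path:
    """Snapshot the current environment variables as cache_dir/<name>.json."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    env_file = cache_dir / f"{name}.json"
    env_file.write_text(json.dumps(dict(os.environ), indent=2))
    return env_file


def pipeline_env_copy(cache_dir: Path, name: str, config_dir: Path) -> Path:
    """Copy a cached environment file into a pipeline's config directory."""
    config_dir.mkdir(parents=True, exist_ok=True)
    dest = config_dir / "env.json"  # hypothetical file name
    shutil.copyfile(cache_dir / f"{name}.json", dest)
    return dest
```

Because the cache lives outside any pipeline directory, the same snapshot can be copied into several pipelines without rebuilding it.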

Adding repos in Jarvis.

We need to have a repository namespace resolution:
Essentially:

  • create folder structure of var/jarvis/repos where all prebuilt repositories would be present.
  • By default, we should have a builtin repository folder in Jarvis.
  • The list of all repositories added in jarvis will be stored in etc/jarvis/repos.yaml similar to link
  • Adding a new repo will update the file at etc/jarvis/repos.yaml
  • We need a Repository entity which will load and serve a repository
  • We need a RepositoryManager which will maintain a list of all current repositories and instantiate the Repository entity.
  • We need a schema finder module that will locate a given schema from all existing repositories.

Each repository:

  • should have repo.yaml (fixed name) : similar to link
  • and a schemas folder that will contain all the deployment schemas as we have now.
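
The two entities described above could be sketched as follows. This is only an illustration under stated assumptions: the method names are hypothetical, repositories are searched in the order they were added, and reading or writing etc/jarvis/repos.yaml and repo.yaml is left out.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Optional


@dataclass
class Repository:
    """One repository on disk: <root>/repo.yaml plus <root>/schemas/."""
    name: str
    root: Path

    def schemas(self) -> List[str]:
        schema_dir = self.root / "schemas"
        if not schema_dir.is_dir():
            return []
        return sorted(p.name for p in schema_dir.iterdir())

    def find(self, schema: str) -> Optional[Path]:
        candidate = self.root / "schemas" / schema
        return candidate if candidate.exists() else None


class RepositoryManager:
    """Keeps the list of known repositories and locates schemas in them."""

    def __init__(self) -> None:
        self._repos: Dict[str, Repository] = {}

    def add(self, name: str, root: Path) -> Repository:
        repo = Repository(name, Path(root))
        self._repos[name] = repo
        return repo

    def names(self) -> List[str]:
        return list(self._repos)

    def find_schema(self, schema: str) -> Optional[Path]:
        # Repositories are searched in the order they were added.
        for repo in self._repos.values():
            path = repo.find(schema)
            if path is not None:
                return path
        return None
```

Adding the builtin repository first would make it the default place a schema is resolved from.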

Schema Loader in Jarvis

Currently a schema is loaded in the main jarvis.py. This approach is not maintainable. I suggest the following.

  • Build a SchemaLoader entity that gets schema name as input and returns a schema object (currently called Graph).
  • Search Hierarchy should be repo -> schema.

This feature will work cohesively with issue #6.
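
A rough sketch of the repo -> schema lookup is shown below. The "repo.schema" qualified-name syntax and the in-memory dictionaries are assumptions made for illustration; a real loader would read schemas from the repositories on disk and return a Graph object rather than a plain dict.

```python
from typing import Dict


class SchemaLoader:
    """Resolves a schema name to a schema object (a plain dict here)."""

    def __init__(self, repos: Dict[str, Dict[str, dict]]) -> None:
        # repos maps repo name -> {schema name -> schema object}
        self._repos = repos

    def load(self, name: str) -> dict:
        repo_name, sep, schema_name = name.partition(".")
        if sep:
            # Qualified name "repo.schema": look only in the named repo.
            repo = self._repos.get(repo_name, {})
            if schema_name in repo:
                return repo[schema_name]
            raise KeyError(name)
        # Unqualified name: search every repo in insertion order.
        for repo in self._repos.values():
            if name in repo:
                return repo[name]
        raise KeyError(name)
```

With this shape, jarvis.py would only need to call load and would no longer contain lookup logic of its own.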

Inconsistent naming convention in the Gray-Scott example

In the README.md file under the builtin gray_scott package (PATH: builtin/builtin/gray_scott/README.md), the scspkg package created in the first line (gray-scott) does not match the name that is used later (gray_scott) under the Installation section.

The mismatches are marked with comments below:

scspkg create gray-scott                    #gray-scott is different from gray_scott used below
cd `scspkg pkg src gray-scott`           #gray-scott is different from gray_scott used below
git clone https://github.com/pnorbert/adiosvm
cd adiosvm/Tutorial/gs-mpiio
mkdir build
pushd build
cmake ../ -DCMAKE_BUILD_TYPE=Release
make -j8
export GRAY_SCOTT_PATH=`pwd`
scspkg env set gray_scott GRAY_SCOTT_PATH="${GRAY_SCOTT_PATH}"
scspkg env prepend gray_scott PATH "${GRAY_SCOTT_PATH}"               #gray_scott is different from gray-scott used above
module load gray_scott
spack load mpi adios2

Update resource graph documentation

The resource graph doc should show how we can query the resource graph in modules. We should also mention that pkgs can modify the resource graph dynamically for use in future modules. For example, OrangeFS spawns a mount point, so it should modify the resource graph.

Slurm issue with multi-nodes

When running IOR with more than 2 nodes on Ares with this command:
jarvis pipeline sbatch job_name=ior2ntest nnodes=2 ppn=10 output_file=./4n_ior_test.out error_file=./4n_ior_test.err

Slurm is not able to start the job, and the queue shows the following status:

             JOBID  PARTITION   NAME        USER       ST       TIME      NODES NODELIST(REASON)
              1866   compute       ior2ntes    mtang11   PD       0:00      2 (launch failed requeued held)

The IOR pipeline is already set to the correct nprocs and ppn values:

pipeline with name ior_test
  pkg_type=pipeline
  ior with name ior
    api=POSIX
    block=32m
    dbg_port=4000
    do_dbg=False
    fpp=False
    hide_output=False
    log=None
    nprocs=20
    out=/tmp/ior.bin
    pkg_type=ior
    ppn=10
    read=True
    reinit=False
    sleep=0
    stderr=None
    stdout=None
    write=True
    xfer=1m

Hierarchical Argument Parser for Jarvis.

Currently we have a flat argument parser in Jarvis. For extensibility, we need hierarchical commands. For example:

  • jarvis-cd repos
  • jarvis-cd deploy <sub-commands such as --clean>

Please refer to link
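
One way to get hierarchical commands with only the standard library is argparse sub-parsers, sketched below. The sub-commands add and list and the schema argument are hypothetical; only the repos and deploy commands and the --clean flag come from the example above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="jarvis-cd")
    commands = parser.add_subparsers(dest="command", required=True)

    # jarvis-cd repos <add|list>
    repos = commands.add_parser("repos", help="manage repositories")
    repos_actions = repos.add_subparsers(dest="action", required=True)
    repos_add = repos_actions.add_parser("add", help="register a repository")
    repos_add.add_argument("path")
    repos_actions.add_parser("list", help="list registered repositories")

    # jarvis-cd deploy <schema> [--clean]
    deploy = commands.add_parser("deploy", help="deploy a schema")
    deploy.add_argument("schema")
    deploy.add_argument("--clean", action="store_true")
    return parser
```

For instance, build_parser().parse_args(["deploy", "ior", "--clean"]) yields a namespace with command set to "deploy", schema set to "ior", and clean set to True.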

Document how to build a resource graph on slurm machines

The current documentation shows how to use the walkthrough build for machines where slurm is not used, which is really only Ares and small benchmark machines.

There is a way to submit a slurm job to collect the resource graph and prune it later. This should be documented.

(User experience) Specify to git clone jarvis-cd as well

Hello, the tutorial would be easier to follow if the README.md file mentioned cloning jarvis-cd (git clone) before changing into the jarvis-cd directory in section 0.2. Installation.

Current:

cd /path/to/jarvis-cd
python3 -m pip install -r requirements.txt
python3 -m pip install -e .

Suggested change:

git clone https://github.com/grc-iit/jarvis-cd.git
cd jarvis-cd
python3 -m pip install -r requirements.txt
python3 -m pip install -e .

Polaris hostfile is incorrectly interpreted by Thallium

Polaris uses the $PBS_NODELIST file. This file stores hostnames, each of which resolves to several IP addresses, and the addresses are returned in different orders. As a result, the resolved hostfile contains IP addresses from different domains. This causes problems in Thallium, which does not seem to support networking across domains, and leads to HG_NOENTRY errors.

We need to find a way to create a hostfile containing only IP addresses from the same domain.

To get the host names

ip addr
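
One possible approach, sketched below, is to keep for each host only the address that falls inside a chosen subnet. The function name, the input mapping, and the subnet value are assumptions made for illustration; how the addresses are collected on Polaris is not covered.

```python
import ipaddress
from typing import Dict, List


def pick_same_domain(resolved: Dict[str, List[str]], subnet: str) -> List[str]:
    """Return one address per host, all taken from the same subnet.

    resolved maps each hostname to every IP address it resolves to.
    """
    network = ipaddress.ip_network(subnet)
    chosen = []
    for host, addrs in resolved.items():
        matches = [a for a in addrs if ipaddress.ip_address(a) in network]
        if not matches:
            raise ValueError(f"{host} has no address in {subnet}")
        chosen.append(matches[0])
    return chosen
```

The resulting list can be written out one address per line to form a hostfile whose entries no longer depend on the order in which each hostname resolves.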

Potentially change the jarvis hostfile format to support node allocation + grouping

Right now the jarvis hostfile is just a bag of hosts. Users will probably know high-level properties of the nodes they are going to allocate, such as "I want 10 compute nodes" and "15 storage nodes". However, we currently require users to know the exact hostnames beforehand, which is not typically realistic. It is on Ares, but not anywhere else. We need to augment the jarvis hostfile so that it can be either a high-level job submission or a detailed node list.
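
A hostfile that accepts both forms could be modelled roughly as below. The HostSpec name and its methods are hypothetical: each group holds either a node count (to be filled in after allocation) or an explicit list of hostnames.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

# A group is either "how many nodes" or "exactly these hosts".
GroupSpec = Union[int, List[str]]


@dataclass
class HostSpec:
    groups: Dict[str, GroupSpec]

    def nodes_to_allocate(self) -> int:
        """Number of nodes that still need concrete hostnames."""
        return sum(v for v in self.groups.values() if isinstance(v, int))

    def resolve(self, allocated: List[str]) -> Dict[str, List[str]]:
        """Fill count-only groups from the hostnames an allocation returned."""
        if len(allocated) < self.nodes_to_allocate():
            raise ValueError("not enough allocated nodes")
        pool = iter(allocated)
        resolved: Dict[str, List[str]] = {}
        for group, spec in self.groups.items():
            if isinstance(spec, int):
                resolved[group] = [next(pool) for _ in range(spec)]
            else:
                resolved[group] = list(spec)
        return resolved
```

On a machine like Ares every group could list explicit hostnames, while on a scheduler-managed machine the groups would carry only counts until the job starts.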

Think about and document how to use Jarvis for multi-tenant deployments

Let's say we want to run the following pipeline:

  1. OrangeFS
  2. IOR (4KB, CPU 1-2) + IOR(1MB, CPU 3-4)

The two IORs should be spawned at the same time. How do we handle this?
Let's say that IOR supported a configuration option: async. This option would allow the first IOR to be spawned and the next one to be spawned almost immediately afterwards, which would make them multi-tenant.
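
The effect of such an async option can be illustrated with plain subprocesses: start every command without waiting, then wait for all of them. This is a sketch only. The two commands are placeholders that print a message instead of running IOR, and CPU pinning is not handled.

```python
import subprocess
import sys
from typing import List, Sequence


def run_concurrently(commands: Sequence[Sequence[str]]) -> List[int]:
    """Start every command without waiting, then collect the exit codes."""
    procs = [subprocess.Popen(list(cmd)) for cmd in commands]
    return [proc.wait() for proc in procs]


# Placeholder commands standing in for the two IOR runs.
small_io = [sys.executable, "-c", "print('IOR stand-in: 4KB transfers')"]
large_io = [sys.executable, "-c", "print('IOR stand-in: 1MB transfers')"]
exit_codes = run_concurrently([small_io, large_io])
```

A pipeline would run OrangeFS to completion of its start step first and only then launch the two IOR stand-ins together.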

Build a slurm reservation and destroy job

In this enhancement, we should add nodes in Jarvis such as

  • AllocateNode(type:compute/storage, number_of_nodes): output allocation_id
  • Users should have the option of calling allocate explicitly (or having it called as part of Start). The allocation should be released on Stop or on an explicit call.
  • We should also pass the allocation id as part of our deployment script so that it can look up the nodes for deployment.
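
The allocate, look up, and deallocate cycle could be sketched as follows. NodeAllocator is a hypothetical in-memory stand-in: it hands out nodes from fixed pools and does not talk to Slurm, so it only illustrates how an allocation id would tie the steps together.

```python
import itertools
from typing import Dict, List, Tuple


class NodeAllocator:
    """In-memory stand-in for a scheduler-backed allocator."""

    def __init__(self, free_nodes: Dict[str, List[str]]) -> None:
        self._free = {kind: list(nodes) for kind, nodes in free_nodes.items()}
        self._allocations: Dict[int, Tuple[str, List[str]]] = {}
        self._ids = itertools.count(1)

    def allocate(self, node_type: str, number_of_nodes: int) -> int:
        """Reserve nodes of the given type and return an allocation id."""
        pool = self._free.setdefault(node_type, [])
        if len(pool) < number_of_nodes:
            raise RuntimeError(f"not enough free {node_type} nodes")
        nodes = [pool.pop(0) for _ in range(number_of_nodes)]
        allocation_id = next(self._ids)
        self._allocations[allocation_id] = (node_type, nodes)
        return allocation_id

    def lookup(self, allocation_id: int) -> List[str]:
        """Return the hostnames that belong to an allocation."""
        return list(self._allocations[allocation_id][1])

    def deallocate(self, allocation_id: int) -> None:
        """Release an allocation and return its nodes to the free pool."""
        node_type, nodes = self._allocations.pop(allocation_id)
        self._free[node_type].extend(nodes)
```

A deployment script would receive the allocation id, call lookup to obtain its hosts, and Stop would call deallocate.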
