dell / omnia

An open-source toolkit for deploying and managing high performance clusters for HPC, AI, and data analytics workloads.

Home Page: https://omnia-doc.readthedocs.io/en/latest/index.html

License: Apache License 2.0

Languages: Shell 0.39%, Dockerfile 0.04%, Jinja 22.46%, Python 11.01%, YAML 65.89%, Cuda 0.01%, C++ 0.01%, MiniYAML 0.20%
Topics: k8s-cluster, kubernetes, slurm-cluster, slurm, dell-emc, dellemc, ansible, ansible-playbooks, hpc-clusters, hpc

omnia's Introduction

Ansible playbook-based deployment of Slurm and Kubernetes on servers running an RPM-based Linux OS

Omnia (Latin: all or everything) is a deployment tool to turn servers with RPM-based Linux images into functioning Slurm/Kubernetes clusters.

Omnia Documentation

Omnia Documentation is hosted on Read The Docs.

Licensing

Omnia is made available under the Apache 2.0 license.

Contributing To Omnia

We encourage everyone to help us improve Omnia by contributing to the project. Contributions range from documentation updates and example use cases, through code comments and style fixes, all the way up to full feature contributions. We ask that contributors follow our established guidelines for contributing to the project.

Contributions to Omnia are made through Pull Requests (PRs) to the "devel" branch. "devel" is the bleeding-edge branch of Omnia, packed with experimental and untested features.

Omnia Community Members:

Dell Technologies

Intel Corporation

Università di Pisa

Arizona State University

Vizias

LIQID Inc.

Texas Tech University

Contributors

Our thanks go to everyone who makes Omnia possible (emoji key):

John Lockman
⚠️ 💻 📝 🤔 🚧 🧑‍🏫 🎨 👀 📢 🐛

Lucas A. Wilson
💻 🎨 🚧 🤔 📝 📖 🧑‍🏫 📆 👀 📢 🐛

Sujit Jadhav
🤔 📖 💻 👀 🚧 📆 🧑‍🏫 📢 💬 ⚠️ 🐛

Deepika K
💻 ⚠️ 🐛 🛡️ 📢 👀 🧑‍🏫

Abhishek SA
💻 🐛 📖 ⚠️ 🚧 📢 🧑‍🏫 👀

Sakshi Arora
💻 🐛 📢

Shubhangi Srivastava
💻 🚧 🐛 📢

Cassey Goveas
📖 🐛 🚧 📢

Khushboo Dholi
💻

Prasoon Kumar Sinha
🤔 📢

SajithDas
📆 📢

i3igpete
💼 📢

renzo-granados
🐛

Aditya-DP
💻

Katakam Rakesh Naga Sai
💻

araji
💻

Mike Renfro
📖

Lee Reynolds
💻 📖 ✅

blesson-james
💻 ⚠️ 🐛

avinashvishwanath
📖

abhishek-s-a
💻 📖 ⚠️

Franklin-Johnson
💻 📝

teiland7
💻 📝

VishnupriyaKrish
💻 ⚠️

Ishita Datta
📖

William Dizon
✅

bssitton-BU
🐛

John Hearns
🐛

kris buggenhout
🐛

jiad-vmware
🐛

Justin Lecher
🤔

Kavyabr23
💻 ⚠️

vedaprakashanp
⚠️ 💻

Bhagyashree-shetty
⚠️ 💻

Nihal Ranjan
⚠️ 💻 📢

ptrinesh
💻

Ikko Ashimine
💻

Lakshmi-Patneedi
💻

Jie Li
💻

Yong Chen
🎨

nvtngan
💻 🔌

tamilarasansubrama1
⚠️ 💻

shemasr
🐛 💻 ⚠️

Naresh Sharma
🐛

Jon Hass
📖 🎨

KalyanKonatham
🐛

Rahul Akolkar
🐛

srinandini-karumuri
💻

Rishabhm47
⚠️ 💻

vaishakh-pm
⚠️ 💻

shridhar-sharma
⚠️ 💻

Jaya.Dayyala
⚠️ 💻

fasongan
💻

rahuldell21
💻 ⚠️

diptiman12
💻

Supriya Parthasarathy
📆

Subhankar-Adak
💻

priti-parate
💻

Lavanya Adhikari
💻

Preeti Thankachan
⚠️

Boris Glimcher
💻 🚧 📖

Moshi Binyamini
💻 🚧

omnia's People

Contributors

abhishek-s-a, abhishek-sa1, aditya-dp, allcontributors[bot], araji, avinashvishwanath, bhagyashree-shetty, blesson-james, cgoveas, deepikakrishnaiah, franklin-johnson, glimchb, j0hnl, kavyabr23, lakshmi-patneedi, lwilson, milisha-gupta, naresh3774, nihalranjan-hpc, ptrinesh, rahuldell21, rishabhm47, sakshiarora13, shemasr, shridhar-sharma, shubhangi-dell, srinandini-karumuri, sujit-jadhav, vedaprakashanp, vishnupriyakrish

omnia's Issues

Commands for quickly switching node personality

Is your feature request related to a problem? Please describe.
We need a way to switch nodes from Kubernetes to Slurm and back

Describe the solution you'd like
Solution is a pair of scripts which move nodes from K8s->Slurm and Slurm->K8s

Describe alternatives you've considered
I considered using a playbook, but since this will be run on the master node a Bash script seemed most appropriate

Additional context
N/A
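
One possible reading of the request above is that "moving" a node means draining it from one scheduler and releasing it to the other. The sketch below rests on that assumption and is not the proposed scripts: it is written in Python rather than Bash so it can be tested, it only builds the command lists instead of running them, and the function name `switch_commands` is hypothetical. It also assumes the node is known by the same name to both Kubernetes and Slurm.

```python
def switch_commands(node, target):
    """Return the commands (as argv lists) that hand `node` to `target`.

    target is "slurm" (K8s->Slurm) or "k8s" (Slurm->K8s). Nothing is executed.
    """
    if target == "slurm":
        return [
            # Evict pods and mark the node unschedulable in Kubernetes.
            ["kubectl", "drain", node, "--ignore-daemonsets"],
            # Make the node available to Slurm again.
            ["scontrol", "update", f"NodeName={node}", "State=RESUME"],
        ]
    if target == "k8s":
        return [
            # Stop Slurm from placing new jobs on the node.
            ["scontrol", "update", f"NodeName={node}", "State=DRAIN",
             "Reason=moved_to_kubernetes"],
            # Allow Kubernetes to schedule pods on the node again.
            ["kubectl", "uncordon", node],
        ]
    raise ValueError(f"unknown target {target!r}; expected 'slurm' or 'k8s'")


if __name__ == "__main__":
    for command in switch_commands("compute005", "slurm"):
        print(" ".join(command))
```

Returning the commands instead of running them keeps the ordering reviewable; a real script would also need to wait for running jobs or pods to finish before releasing the node.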

Provide NVIDIA NGC TensorFlow example

Is your feature request related to a problem? Please describe.
Omnia should provide an example for running TensorFlow in NVIDIA NGC containers

Describe the solution you'd like
provide yaml or notebook for how to use NGC containers on omnia

Describe alternatives you've considered

Additional context
https://ngc.nvidia.com/

use versionlock for k8s and kernel

Is your feature request related to a problem? Please describe.
we need to lock the versions of
kernel
kubeadm
kubectl
kubelet

Describe the solution you'd like
yum versionlock kubelet-1.16.7 kubectl-1.16.7 kubeadm-1.16.7

Describe alternatives you've considered

Additional context
we need to settle on a kernel to lock as well
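
As a minimal sketch of the command proposed above, the following builds the `yum versionlock` argument list from a package tuple and a version, so the pinned version lives in one place. The helper name `build_versionlock_command` is hypothetical, the command is printed rather than executed, and it assumes the yum versionlock plugin is installed on the node. The kernel is left out because no kernel version has been chosen yet.

```python
import shlex

# Version and package names taken from the request above.
K8S_VERSION = "1.16.7"
K8S_PACKAGES = ("kubelet", "kubectl", "kubeadm")


def build_versionlock_command(packages, version):
    """Return the argv for `yum versionlock`, pinning each package to version."""
    if not packages:
        raise ValueError("at least one package is required")
    specs = [f"{name}-{version}" for name in packages]
    return ["yum", "versionlock", *specs]


if __name__ == "__main__":
    # Print the command for review instead of running it.
    print(shlex.join(build_versionlock_command(K8S_PACKAGES, K8S_VERSION)))
```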

Updating docs on contributing to the project

We need to very simply lay out how contributions to the project can be made. We have fairly detailed instructions in CONTRIBUTING.md, but we should be more specific about how to handle the fork/branch/PR process.

We should also include a figure which describes the process.

Install mpi-operator

Is your feature request related to a problem? Please describe.
We need to be able to run MPI jobs on the Kubernetes cluster

Describe the solution you'd like
mpi-operator from KubeFlow should do the trick: https://github.com/kubeflow/mpi-operator

Describe alternatives you've considered
We could build our own operator, but that seems unnecessary

Additional context
I think we should install the MPI operator separate from KubeFlow, as some users may want to run MPI jobs independently of using KubeFlow's other features.

cmd to join new computes

Is your feature request related to a problem? Please describe.
we need a command to join new computes after the cluster has been created.

Describe the solution you'd like
something like grow_cluster -f ./new_inventory_file

Describe alternatives you've considered
maybe just a tag in the ansible playbook?

Additional context
this could apply to different scenarios:

  1. adding new compute hardware to existing cluster
  2. adding cloud infrastructure to existing cluster
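
A rough sketch of what a `grow_cluster -f ./new_inventory_file` wrapper could look like, assuming it simply hands the new inventory to `ansible-playbook`. The playbook name `join_nodes.yml` is a placeholder, and the optional `--tags` argument covers the "tag in the playbook" alternative mentioned above. The command is built but not executed.

```python
import argparse


def parse_args(argv=None):
    """Parse grow_cluster-style arguments; -f mirrors the flag suggested above."""
    parser = argparse.ArgumentParser(
        prog="grow_cluster",
        description="Build the ansible-playbook command that joins new computes.",
    )
    parser.add_argument("-f", "--inventory-file", required=True,
                        help="inventory listing only the nodes to add")
    # Placeholder playbook name; substitute whichever playbook joins nodes.
    parser.add_argument("--playbook", default="join_nodes.yml")
    # Optional tag, for the case where joining is a tagged subset of plays.
    parser.add_argument("--tags", default=None)
    return parser.parse_args(argv)


def build_playbook_command(args):
    """Return the ansible-playbook argv without executing it."""
    command = ["ansible-playbook", "-i", args.inventory_file, args.playbook]
    if args.tags:
        command += ["--tags", args.tags]
    return command


if __name__ == "__main__":
    example = parse_args(["-f", "./new_inventory_file"])
    print(" ".join(build_playbook_command(example)))
```

Because the new nodes arrive as an ordinary inventory file, the same wrapper would apply to both scenarios listed above, whether the inventory describes on-premises hardware or cloud instances.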

Create CPU example

Is your feature request related to a problem? Please describe.
we should have an example yaml demonstrating CPU workload. Bonus if it uses MPI-Operator

Describe the solution you'd like
something similar to the tensorflow example that uses GPUs

Describe alternatives you've considered

Additional context

Support for Fedora CoreOS

Is your feature request related to a problem? Please describe.
Would be ideal to have a lightweight base OS for Kubernetes clusters.

Describe the solution you'd like
Originally wanted to use CoreOS, but CoreOS is EoL. The project has moved to Fedora CoreOS.

Describe alternatives you've considered
We could also build our own minimal spin, but that seems like unnecessary work

Additional context
Fedora CoreOS automatically updates. Will this be a concern?

Update docs to support github pages

Is your feature request related to a problem? Please describe.
We should support a GitHub page for Omnia

Describe the solution you'd like
Built-in feature. We just have to update the format of the docs to ensure it works.

kfserving-ingressgateway does not request enough resources

Describe the bug
when launching kubeflow the kfserving-ingressgateway is OOMKilled. It needs more resources than are originally requested.

To Reproduce
launch kubeflow
observe kfserving-ingressgateway OOMKilled

Expected behavior
this should not happen

Screenshots
istio-system kfserving-ingressgateway-6b469d64d-fgjqk 0/1 OOMKilled 1 19m 10.244.159.199 compute005.localdomain <none> <none>
Desktop (please complete the following information):

Smartphone (please complete the following information):

Additional context

jupyter_config.yaml containers are stale

Describe the bug
the containers described in jupyter_config.yaml are old and need to be updated

To Reproduce
Steps to reproduce the behavior:
try to start a jupyterhub container

Expected behavior
the container should start

Screenshots

Desktop (please complete the following information):

Smartphone (please complete the following information):

Additional context

OpenHPC repos

Is your feature request related to a problem? Please describe.
Use OpenHPC RPM repositories to provide HPC libraries and tools

Describe the solution you'd like
Enable the OpenHPC RPM repos in /etc/yum.repos.d/

Describe alternatives you've considered
We could rebuild the packages, but seems like duplicating work

Additional context
N/A

Use Omnia to deploy multiple clusters

Is your feature request related to a problem? Please describe.
Instead of using Omnia to install a single cluster, Omnia should be able to install multiple clusters from a single inventory of nodes.

Describe the solution you'd like
Rather than running Omnia on a cluster's master node, Omnia should be run on a separate management server. In order to accomplish this, we should:

  • Eliminate assumption of execution on master node (see #58, first bullet)
  • Provide means of handling multiple inventory files
    • Possible solution is AWX
      • AWX also provides potential for multi-inventory management (i.e., hybrid cloud)

Describe alternatives you've considered
Currently Omnia would have to be run independently from each cluster. This could be troublesome in a multi-cluster environment

Additional context
N/A

Switch to GPU Operator

Is your feature request related to a problem? Please describe.
we have a branch using nvidia-device-plugin but not GPU Operator

Describe the solution you'd like
a new branch that uses GPU Operator not nvidia-device-plugin

Describe alternatives you've considered

Additional context

allow variable for master node IP

Is your feature request related to a problem? Please describe.
Omnia currently hardcodes 10.0.0.1 and p3p1 as the master node's IP address and network interface. These should be variables in the host_inventory_file.

Describe the solution you'd like
replace hardcoded values with variables

Describe alternatives you've considered
none

Additional context
none
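
To illustrate the idea, here is a small sketch that reads the two values from an INI-style inventory and falls back to the currently hardcoded ones when they are absent. The variable names `master_ip` and `master_interface` are assumptions, as is placing them under an `[all:vars]` group. Python's `configparser` is used only as a stand-in for Ansible's own inventory parsing.

```python
import configparser

# The values that are hardcoded today, used here only as fallbacks.
DEFAULTS = {"master_ip": "10.0.0.1", "master_interface": "p3p1"}


def load_master_settings(inventory_text):
    """Return master settings, preferring values from the [all:vars] group."""
    parser = configparser.ConfigParser(
        allow_no_value=True,   # bare host lines have no "=value" part
        delimiters=("=",),
        interpolation=None,
    )
    parser.read_string(inventory_text)
    settings = dict(DEFAULTS)
    if parser.has_section("all:vars"):
        for key in DEFAULTS:
            value = parser.get("all:vars", key, fallback=None)
            if value:
                settings[key] = value
    return settings


if __name__ == "__main__":
    sample = "[master]\nnode000\n\n[all:vars]\nmaster_ip=192.168.1.10\n"
    print(load_master_settings(sample))
```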

Mark nfs-provisioner as default StorageClass

Is your feature request related to a problem? Please describe.
there is currently no default storageClass defined but we are using nfs-provisioner. We should set nfs-provisioner as default StorageClass so that applications such as KubeFlow can take advantage of the service.

Describe the solution you'd like
kubectl patch storageclasses.storage.k8s.io nfs-client -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

Describe alternatives you've considered
We could force every application to create its own volumes, but this is problematic.

Additional context
https://kubernetes.io/docs/tasks/administer-cluster/change-default-storage-class/
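
The patch above can also be generated rather than typed by hand, which avoids quoting mistakes in the JSON. This is a minimal sketch: the helper names are hypothetical, the storage class name `nfs-client` comes from the command above, and the result is returned as an argument list instead of being run.

```python
import json

DEFAULT_CLASS_ANNOTATION = "storageclass.kubernetes.io/is-default-class"


def default_class_patch(is_default=True):
    """Build the metadata patch that marks (or unmarks) a default StorageClass."""
    value = "true" if is_default else "false"
    return {"metadata": {"annotations": {DEFAULT_CLASS_ANNOTATION: value}}}


def build_patch_command(storage_class, is_default=True):
    """Return the kubectl argv, with the patch serialized as compact JSON."""
    patch = json.dumps(default_class_patch(is_default), separators=(",", ":"))
    return ["kubectl", "patch", "storageclasses.storage.k8s.io",
            storage_class, "-p", patch]


if __name__ == "__main__":
    print(build_patch_command("nfs-client"))
```

Passing `is_default=False` produces the inverse patch, which is useful if another StorageClass later needs to take over as the default.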

Auto-detect accelerators

Is your feature request related to a problem? Please describe.
How do we determine which packages (e.g., operators, device-plugins, drivers, etc.) to install on each compute node?

Describe the solution you'd like
Gather facts about each node and dynamically determine which packages to install based on accelerators in the box.

Describe alternatives you've considered
Alternative would be to have node groups for each, but this solution wouldn't allow for corner cases, such as DSS8440 with a mix of accelerators.

Additional context
This is not an immediate concern, but future nodes may be more heterogeneous in their accelerator makeup. Let's explore a more dynamic way of inspecting each node. Also simplifies inventory management.
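
A sketch of the per-node decision described above, assuming the fact-gathering step has already produced a list of accelerator types for each host. The type labels and package names are placeholders, not Omnia's actual roles. The point is that a node with a mix of accelerators gets the union of what each type needs, with no per-type node groups in the inventory.

```python
# Placeholder bundle names keyed by accelerator type; both the type labels and
# the package names are illustrative assumptions.
ACCELERATOR_BUNDLES = {
    "nvidia_gpu": ["nvidia-driver", "nvidia-device-plugin"],
    "fpga": ["fpga-runtime"],
}


def plan_installs(node_facts):
    """Map each node to the sorted, de-duplicated packages its accelerators need.

    node_facts maps a hostname to the accelerator types detected on that node.
    Types with no entry in ACCELERATOR_BUNDLES are skipped.
    """
    plan = {}
    for node, accelerators in node_facts.items():
        packages = set()
        for accelerator in accelerators:
            packages.update(ACCELERATOR_BUNDLES.get(accelerator, []))
        plan[node] = sorted(packages)
    return plan


if __name__ == "__main__":
    facts = {"compute001": ["nvidia_gpu"], "compute002": ["nvidia_gpu", "fpga"]}
    for node, packages in plan_installs(facts).items():
        print(node, packages)
```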

autoscale coreDNS

Is your feature request related to a problem? Please describe.
coreDNS needs to run on all the compute nodes. We need some mechanism to autoscale coreDNS to the total number of nodes in the cluster, or find another alternative.

Describe the solution you'd like
when initializing cluster coredns should be replicated to every node.

Describe alternatives you've considered

Additional context

Update jupyterhub to latest

Is your feature request related to a problem? Please describe.
The JupyterHub group released an updated Helm chart, v0.9.0, which ships JupyterHub v1.1.0.

Describe the solution you'd like
update the version

Describe alternatives you've considered

Additional context

Add omnia logo

Omnia needs a logo, and we've already made one! Just need to put it in the repo.

Install Local Registry

Is your feature request related to a problem? Please describe.
Omnia needs an option to install a local registry

Describe the solution you'd like
a playbook for installing a local registry such as Harbor - https://github.com/goharbor/harbor

Describe alternatives you've considered
local registry - with simple docker container

Additional context

Examples for storage connectors

Is your feature request related to a problem? Please describe.
Would like examples for other storage connectors such as HDFS, NoSQL and S3

Describe the solution you'd like
HDFS, NoSQL and S3 for cloud

Describe alternatives you've considered

Additional context

Refactor playbooks

Is your feature request related to a problem? Please describe.
In order to support more community contributions, we should make playbooks easier to read/extend.

Describe the solution you'd like
Let's follow the best practices guidelines for Ansible: https://docs.ansible.com/ansible/latest/user_guide/playbooks_best_practices.html

Describe alternatives you've considered
N/A

Additional context

  • One directory structure for each major component, i.e., kubernetes, slurm, tools, etc.
  • Eliminate shell and command plays whenever possible
    • Extra benefit: yum plays could have the version tagged and placed within vars.yml, providing a solution to #28 without forcing versionlock via yum.
  • Isolate different tasks of each component in order to avoid accidental modification

Be able to install different SDNs

Is your feature request related to a problem? Please describe.
Be able to have the choice to pick from different SDNs during install.

Describe the solution you'd like
Maybe pass a keyword argument for the ansible script which identifies the kind of SDN network to be installed. To get started, Omnia could support Calico, Cilium, Contiv-VPP, Kube-router, Weave Net. I think these are commonly used ones, supported by Kubeadm installation path.
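
One way the keyword argument could be handled, sketched under the assumption that the choice is passed to the playbook as an extra variable. The variable name `sdn` and both helper names are hypothetical. The supported list is the one proposed above, and anything else is rejected before a playbook would be started.

```python
# SDN names from the proposal above, normalized to lowercase with hyphens.
SUPPORTED_SDNS = ("calico", "cilium", "contiv-vpp", "kube-router", "weave-net")


def normalize_sdn(name):
    """Normalize a user-supplied SDN name and check it against the supported list."""
    key = name.strip().lower().replace(" ", "-").replace("_", "-")
    if key not in SUPPORTED_SDNS:
        supported = ", ".join(SUPPORTED_SDNS)
        raise ValueError(f"unsupported SDN {name!r}; choose one of: {supported}")
    return key


def sdn_extra_vars(name):
    """Return the ansible-playbook arguments that pass the SDN choice along."""
    return ["-e", f"sdn={normalize_sdn(name)}"]


if __name__ == "__main__":
    print(sdn_extra_vars("Weave Net"))
```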

Provide Object Storage

Is your feature request related to a problem? Please describe.
Provide object storage deployment

Describe the solution you'd like
a helm chart that deploys object storage with the k8s cluster

Describe alternatives you've considered
MinIO ?

Additional context

JupyterHub Install as separate playbook

Is your feature request related to a problem? Please describe.
jupyterhub is currently installed with the init tag. Only base k8s and supporting services should be started in init

Describe the solution you'd like
pull the commands out of startservices and make them a new playbook

Describe alternatives you've considered
it could stay in the current playbook

Additional context

Fix Licensing

Need to update LICENSE file to reference Dell Technologies copyright.

Also need to add appropriate license clauses to the top of all source files.

Auto Cull Users from JupyterHub

Is your feature request related to a problem? Please describe.
we should automatically remove idle users.

Describe the solution you'd like
add cull instructions in the jupyter_config.yaml to automatically remove idle users.

Describe alternatives you've considered

Additional context

kfserving-gateway does not change limits

Describe the bug
limits are not set in kfserving-gateway

To Reproduce

Expected behavior

Screenshots

Desktop (please complete the following information):

Smartphone (please complete the following information):

Additional context

Create playbook for installing tools

Is your feature request related to a problem? Please describe.
Need a way to install tools like #22

Describe the solution you'd like
Playbooks for copying tools to local filesystem on admin/service nodes

Describe alternatives you've considered
Shell script installer was considered, but doesn't scale to multiple admin nodes

Additional context
N/A

nfs-provisioner ip variable in host_inventory_file

Is your feature request related to a problem? Please describe.
the IP address for nfs-provisioner-client is hard coded in the startservices playbook and should be a variable in the inventory file or vars

Describe the solution you'd like
add variable to startservices group to allow users to point to different NFS servers

Describe alternatives you've considered
none

Additional context
none

Include dockerfiles for custom containers

Is your feature request related to a problem? Please describe.
Add dockerfiles for customized JupyterLab containers

Describe the solution you'd like
Include dockerfiles for customized JupyterLab containers. This is related to issue #42

Describe alternatives you've considered

Additional context

Create additional documentation on system setup and pre-installation prep

Is your feature request related to a problem? Please describe.
Not related to a particular feature.

Describe the solution you'd like
Documentation needs to be updated to include instructions on server/network preparation prior to running Ansible playbooks.

Describe alternatives you've considered
No alternatives. Documentation needs to be provided.

Additional context
No additional context.

make network interfaces parameter or get from inventory

Is your feature request related to a problem? Please describe.
the network interface for k8s is currently hard coded in startmaster. It should either be a user defined variable or do some magic with ansible's inventory. We have several adapters on most nodes so I would prefer to start with a user defined variable.

Describe the solution you'd like
create variable in inventory file
use variable in startmaster playbook

Describe alternatives you've considered

Additional context

Installing Omnia on a Single Node

Is your feature request related to a problem? Please describe.
If there was a way to install this on a single node that would be great!

Describe the solution you'd like
Perhaps an installation strategy with README to do it on a single node.

Describe alternatives you've considered
N/A

Additional context
N/A

add --wait to jupyterhub install

Is your feature request related to a problem? Please describe.
jupyterhub currently fails if you do not have all containers downloaded. There is a timeout and it just quits. Add the --wait flag to helm install to force it to wait until all containers have been downloaded

Describe the solution you'd like
add --wait to helm install

Describe alternatives you've considered
we could prepull images elsewhere in the playbooks but jupyterhub has a hook that will pull them all down for us

Additional context
