kubeflow / arena

A CLI for Kubeflow.

License: Apache License 2.0

Makefile 0.29% Go 79.23% Dockerfile 0.04% Shell 1.88% Python 7.60% Mustache 1.20% Java 9.58% Smarty 0.19%
kubeflow kubernetes tensorflow docker deep-learning


Arena


View the Arena documentation.

Overview

Arena is a command-line interface that lets data scientists run and monitor machine learning training jobs and check their results in an easy way. Currently it supports standalone and distributed TensorFlow training. Under the hood it is built on Kubernetes, Helm and Kubeflow, but data scientists need to know very little about Kubernetes.

Meanwhile, end users often need GPU resource and node management. Arena also provides a top command to check available GPU resources in the Kubernetes cluster.

In short, Arena's goal is to make data scientists feel as though they are working on a single machine, while actually having the power of a GPU cluster behind them.

For the Chinese version, please refer to the 中文文档.

Setup

You can follow the Installation guide.

User Guide

Arena is a command-line interface to run and monitor machine learning training jobs and check their results in an easy way. Please refer to the User Guide to manage your training jobs.

Demo

Developing

Prerequisites:

  • Go >= 1.8
mkdir -p $(go env GOPATH)/src/github.com/kubeflow
cd $(go env GOPATH)/src/github.com/kubeflow
git clone https://github.com/kubeflow/arena.git
cd arena
make

The arena binary is located in the arena/bin directory. You may want to add this directory to your $PATH.

Then you can follow the Installation guide for developers.

CPU Profiling

# set profile rate (HZ)
export PROFILE_RATE=1000

# arena {command} --pprof
arena list --pprof
INFO[0000] Dump cpu profile file into /tmp/cpu_profile

Then you can analyze the profile by following Go CPU profiling: pprof and speedscope.

Adopters

If you are interested in Arena and would like to share your experiences with others, you are warmly welcome to add your information to the ADOPTERS.md page. We will continuously discuss new requirements and feature designs with you in advance.

FAQ

Please refer to FAQ

CLI Document

Please refer to arena.md

RoadMap

See RoadMap

arena's People

Contributors

alanfokco, banbanpeppa, binblee, chenyi015, cheyang, denkensk, goodoid, gujingit, happy2048, hwk42, jackhuang007, jiaqianjing, meibenjin, ocherfas, osswangxining, oyutiano, ringtail, sakuralbj, soolaugust, syulin7, terrytangyuan, thliang01, tingshua-yts, uzuku, wsxiaozhang, xauthulei, xiaozhoux, xiechengsheng, xieydd, zhujl1991


arena's Issues

Extend arena to support for training dataset or model storage management

It would be helpful for data scientists to use a command like "arena create data imagenet-full" to create, index and manage different training datasets for different training jobs.
Then, when submitting a training job with arena, "imagenet-full" can be passed in directly as the "data" parameter.
The data is actually a pointer to a specific PVC or HDFS path, etc. It is easy to use 'data' to manage and record which training datasets to load, without needing to care about which storage backend is used for persistence.

Need `arena get` json and yaml output

JSON output:

arena get tf-git -ojson
{
    "name": "tf-git",
    "namespace": "default",
    "duration": "0s",
    "status": "FAILED",
    "trainer": "tfjob",
    "instances": [
        {
            "status": "Failed",
            "name": "tf-git-chief-0",
            "age": "0s",
            "node": "192.168.0.199"
        }
    ]
}

YAML output:

arena get tf-git -oyaml
name: tf-git
namespace: default
duration: 0s
status: FAILED
trainer: tfjob
tensorboard: ""
instances:
- status: Failed
  name: tf-git-chief-0
  age: 0s
  node: 192.168.0.199
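Output like the above is easy to consume programmatically. A minimal Go sketch of parsing the JSON form (struct and field names are illustrative, not arena's actual types):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Instance mirrors one entry of the "instances" array shown above.
type Instance struct {
	Status string `json:"status"`
	Name   string `json:"name"`
	Age    string `json:"age"`
	Node   string `json:"node"`
}

// JobInfo mirrors the top-level object of `arena get <job> -ojson`.
type JobInfo struct {
	Name      string     `json:"name"`
	Namespace string     `json:"namespace"`
	Duration  string     `json:"duration"`
	Status    string     `json:"status"`
	Trainer   string     `json:"trainer"`
	Instances []Instance `json:"instances"`
}

// parseJob decodes the JSON output into a JobInfo.
func parseJob(raw []byte) (JobInfo, error) {
	var job JobInfo
	err := json.Unmarshal(raw, &job)
	return job, err
}

func main() {
	raw := []byte(`{"name":"tf-git","status":"FAILED","trainer":"tfjob",
		"instances":[{"status":"Failed","name":"tf-git-chief-0"}]}`)
	job, err := parseJob(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(job.Name, job.Status, len(job.Instances)) // tf-git FAILED 1
}
```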

Add MPIJob

Arena should be able to submit an MPIJob whose backend is the MPI Operator.

support for distributed tensorflow estimator

Aliyun gurus, could you kindly support running a distributed TensorFlow estimator? Typical usage can be found here: estimator train and evaluate.

The key difference from the raw API is that an estimator must have a chief node (worker 0), and it is suggested to also have an evaluator node.

In our practice, we have ps, worker, chief and evaluator nodes.

Thanks in advance.

Something about horovodjob?

  1. Have you considered offering a horovodjob demo on the home page?
  2. Does the init container of mpijob (git-sync) provide the same functionality as the init container of horovodjob (rsync)?
    @cheyang

arena should support gang schedule for TFJob

When a user submits TFJob (kubeflow/tf-operator) batch jobs with arena, we should guarantee all-or-nothing scheduling to avoid wasting resources when the cluster is starved, especially for expensive GPUs.
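One common way to get this all-or-nothing behavior is to run the job's pods under a batch scheduler that understands pod groups. A hedged sketch only (the API group, kind and field names are assumptions in the style of kube-batch, not a tested manifest):

```yaml
# Illustrative sketch: gang scheduling via a PodGroup-aware scheduler.
apiVersion: scheduling.incubator.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: tf-git-gang
spec:
  minMember: 4   # schedule all 4 pods of the TFJob, or none of them
```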

evaluator needs Replicas parameter

When creating a tfjob that has an evaluator, tf-operator will check the Replicas parameter of the evaluator. If Replicas doesn't exist, it causes a runtime problem and tf-operator crashes.
Right now tfjob.yaml doesn't have such a parameter.

Evaluator:
  restartPolicy: Never
  template:

And the following is the validation code in tf-operator:

https://github.com/kubeflow/tf-operator/blob/3bf0ac83924aeea57df3e7c50bc82fc3f5546844/pkg/apis/tensorflow/validation/validation.go#L95-L97
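A minimal sketch of the workaround, assuming the evaluator spec in tfjob.yaml simply needs an explicit replica count (field names follow the snippet above; the template body is elided):

```yaml
Evaluator:
  replicas: 1        # explicit count so tf-operator's validation passes
  restartPolicy: Never
  template:
    # ...pod template as before...
```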

"arena list" status not the same as "arena get" status

Hi,

When I create a training job using arena, I found that the status of the job in 'arena list' is not the same as in 'arena get':


The command is as follow:

arena submit tf \
             --name=tf-git \
             --image=test:1.0 \
             --syncMode=git \
             --syncSource=https://github.com/chenyu85/Distribute_MNIST.git \
             "python code/Distribute_MNIST/distributed.py --job_name=ps --task_index=0 --data_dir /MNIST_data"

test:1.0 is built by myself, based on tensorflow/tensorflow:1.5.0-devel-gpu with a data file.

And the strange thing is that the logviewer shows nothing for any job:


The first time I used the logviewer, it showed the tfjob even though it had failed.

Could anyone help me? Thanks.

mpijob reports running when the worker pod is pending

# arena  list
NAME  STATUS   TRAINER  AGE  NODE
test  RUNNING  MPIJOB   16m
# arena get test
NAME  STATUS   TRAINER  AGE  INSTANCE              NODE
test  PENDING  MPIJOB   16m  test-mpijob-worker-0  N/A
test  PENDING  MPIJOB   16m  test-mpijob-worker-1  N/A
# kubectl get mpijob -o=yaml
apiVersion: v1
items:
- apiVersion: kubeflow.org/v1alpha1
  kind: MPIJob
  metadata:
 
  status: {}
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

The reason is that the status of the mpijob is empty.
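A defensive sketch of the fix (a simplified stand-in, not arena's actual code): treat an empty operator status as pending instead of letting it surface as running:

```go
package main

import "fmt"

// displayStatus maps the phase reported by the operator to what the CLI
// shows. An empty phase means the operator has not reconciled the job
// yet, so report PENDING rather than defaulting to RUNNING.
func displayStatus(phase string) string {
	if phase == "" {
		return "PENDING"
	}
	return phase
}

func main() {
	fmt.Println(displayStatus(""))        // PENDING
	fmt.Println(displayStatus("Running")) // Running
}
```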

Use arena to create a single node model development box

Besides submitting training jobs, users also want to create a development box which contains Jupyter, math libraries and frameworks, in order to develop and debug algorithms before starting to train in the cluster. It would be better to have GPU support enabled as well.

error: ValidationError(CustomResourceDefinition.spec): missing required field "scope"

This is for closed issue #8.

I think we can add a scope field valued "Namespaced" to the CustomResourceDefinition part for TFJob in tf-operator.yaml. I find that ksonnet sets it to Namespaced automatically, like below; we just add it here:

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: tfjobs.kubeflow.org
spec:
  group: kubeflow.org
  names:
    kind: TFJob
    plural: tfjobs
    singular: tfjob
  ## add scope field here
  scope: Namespaced
  validation:
    openAPIV3Schema:
      properties:
        spec:
          properties:
            tfReplicaSpecs:
              properties:
                Chief:
                  properties:
                    replicas:
                      maximum: 1
                      minimum: 1
                      type: integer
                PS:
                  properties:
                    replicas:
                      minimum: 1
                      type: integer
                Worker:
                  properties:
                    replicas:
                      minimum: 1
                      type: integer
  version: v1alpha2

This will solve the problem for arena when it creates the TFJob CRD on Kubernetes >= 1.11.0. Of course, executing the command below can also solve this problem:

kubectl create -f arena/kubernetes-artifacts/tf-operator/tf-operator.yaml --validate=false

error validating data: ValidationError(CustomResourceDefinition.spec): missing required field "scope"

Hello Alibaba gurus, I am a newbie. When I followed the installation document and ran
kubectl create -f arena/kubernetes-artifacts/tf-operator/tf-operator.yaml

I encountered an error. Below is the error log. I guess there may be a problem with my environment; could you give some advice?
configmap/tf-job-operator-config created
serviceaccount/tf-job-operator created
clusterrole.rbac.authorization.k8s.io/tf-job-operator created
clusterrolebinding.rbac.authorization.k8s.io/tf-job-operator created
clusterrole.rbac.authorization.k8s.io/tf-job-dashboard created
clusterrolebinding.rbac.authorization.k8s.io/tf-job-dashboard created
serviceaccount/tf-job-dashboard created
deployment.extensions/tf-job-dashboard created
error: error validating "arena/kubernetes-artifacts/tf-operator/tf-operator.yaml": error validating data: ValidationError(CustomResourceDefinition.spec): missing required field "scope" in io.k8s.apiextensions-apiserver.pkg.apis.apiextensions.v1beta1.CustomResourceDefinitionSpec; if you choose to ignore these errors, turn validation off with --validate=false

arena delete tfjobs error

delete command and logs:

[chaochchao1]$ arena delete tf-estimator1
ERRO[0000] Failed to delete tf-estimator1, the reason is that There is no training job found with the name tf-estimator1, please check it with arena list | grep tf-estimator1

[chaochao1]$ arena delete tf-estimator2
WARN[0000] Failed to UninstallAppsWithAppInfoFile due to exit status 123
WARN[0000] manually delete the following resource:
service "tf-estimator2-tensorboard" deleted
deployment.extensions "tf-estimator2-tensorboard" deleted
configmap "tf-estimator2-tfjob" deleted
INFO[0001] The Job tf-estimator2 has been deleted successfully
[chaochao1]$ arena list
NAME STATUS TRAINER AGE NODE
tf-estimator1 FAILED TFJOB 0s N/A
tf-estimator2 SUCCEEDED TFJOB 18m N/A
tf-estimator SUCCEEDED TFJOB 28m N/A
[chaochao1]$ arena delete tf-estimator2
ERRO[0000] Failed to delete tf-estimator2, the reason is that There is no training job found with the name tf-estimator2, please check it with arena list | grep tf-estimator2

[chaochao1]$ arena list
NAME STATUS TRAINER AGE NODE
tf-estimator1 FAILED TFJOB 0s N/A
tf-estimator2 SUCCEEDED TFJOB 18m N/A
tf-estimator SUCCEEDED TFJOB 28m N/A
[chaochao1]$ arena delete tf-estimator
WARN[0000] Failed to UninstallAppsWithAppInfoFile due to exit status 123
WARN[0000] manually delete the following resource:
service "tf-estimator-tensorboard" deleted
deployment.extensions "tf-estimator-tensorboard" deleted
configmap "tf-estimator-tfjob" deleted
INFO[0001] The Job tf-estimator has been deleted successfully
[chaochao1]$ arena list
NAME STATUS TRAINER AGE NODE
tf-estimator1 FAILED TFJOB 0s N/A
tf-estimator2 SUCCEEDED TFJOB 18m N/A
tf-estimator SUCCEEDED TFJOB 28m N/A
[chaochao1]$ arena delete tf-estimator
ERRO[0000] Failed to delete tf-estimator, the reason is that There is no training job found with the name tf-estimator, please check it with arena list | grep tf-estimator

[chaochao1]$ arena delete tf-estimator
ERRO[0000] Failed to delete tf-estimator, the reason is that There is no training job found with the name tf-estimator, please check it with arena list | grep tf-estimator

kubernetes-dashboard can't be created

In the arena-system namespace, the pod kubernetes-dashboard can't be created successfully; it seems the image could not be pulled.

kubectl describe po kubernetes-dashboard-6647d54ddc-9bqzl -n arena-system 
Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  6m                default-scheduler  Successfully assigned arena-system/kubernetes-dashboard-6647d54ddc-9bqzl to slave
  Warning  Failed     1m (x3 over 3m)   kubelet, slave     Failed to pull image "registry-vpc.cn-zhangjiakou.aliyuncs.com/tensorflow-samples/kubernetes-dashboard-amd64:v1.6.0": rpc error: code = Unknown desc = Error response from daemon: Get https://registry-vpc.cn-zhangjiakou.aliyuncs.com/v1/_ping: dial tcp 100.100.80.161:443: i/o timeout
  Warning  Failed     1m (x3 over 3m)   kubelet, slave     Error: ErrImagePull
  Normal   BackOff    54s (x5 over 3m)  kubelet, slave     Back-off pulling image "registry-vpc.cn-zhangjiakou.aliyuncs.com/tensorflow-samples/kubernetes-dashboard-amd64:v1.6.0"
  Warning  Failed     54s (x5 over 3m)  kubelet, slave     Error: ImagePullBackOff
  Normal   Pulling    41s (x4 over 6m)  kubelet, slave     pulling image "registry-vpc.cn-zhangjiakou.aliyuncs.com/tensorflow-samples/kubernetes-dashboard-amd64:v1.6.0"

Support MPI Job?

I see mpijob in the docs, but I didn't see an MPI demo.

Is it not supported yet?

Arena doesn't support multi kubeconfig file

My KUBECONFIG is configured as follows:

minikube=~/.kube/config
kubenetes=~/.kube/localconfig
hongkong=~/.kube/hongkongconfig
export KUBECONFIG=$KUBECONFIG:$minikube:$kubenetes:$hongkong

but when I execute arena list, I get the following error:

WARN[0000] Illegal kubeconfig file: :~/.kube/config:~/.kube/localconfig:~/.kube/hongkongconfig
FATA[0000] stat :~/.kube/config:~/.kube/localconfig:~/.kube/hongkongconfig: no such file or directory
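A sketch of tolerant handling (illustrative only, not arena's code): split the KUBECONFIG value on ":" and drop empty entries, instead of stat-ing the whole string as a single file path:

```go
package main

import (
	"fmt"
	"strings"
)

// kubeconfigPaths splits a KUBECONFIG-style list on ":" and drops empty
// entries (such as the leading ":" produced by `$KUBECONFIG:...` when
// KUBECONFIG was previously unset).
func kubeconfigPaths(env string) []string {
	var paths []string
	for _, p := range strings.Split(env, ":") {
		if p != "" {
			paths = append(paths, p)
		}
	}
	return paths
}

func main() {
	fmt.Println(kubeconfigPaths(":~/.kube/config:~/.kube/localconfig"))
	// [~/.kube/config ~/.kube/localconfig]
}
```

Each surviving path can then be loaded and merged, which is how kubectl itself treats a multi-entry KUBECONFIG.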

MPIjob Error

When I run the demo, I hit a problem:

Error: apiVersion "kubeflow.org/v1alpha1" in mpijob/templates/mpijob.yaml is not available

Should I deploy Kubeflow in advance? @cheyang

Arena should support serving models online with TensorFlow Serving

Arena should support serving models online with TensorFlow Serving.

arena submit serving --name=xxx ... should deploy a TensorFlow Serving instance with the needed model_server config, and automatically create the corresponding service and ingress.

arena get serving xxx should display the serving instance info, including the exposed endpoint info for gRPC clients.

create tf-operator failed

When I created tf-operator using kubectl create -f arena/kubernetes-artifacts/tf-operator/tf-operator.yaml, it failed. The log shows:
starting container process cause "exec: \opt/kubeflow/tf-operator.v2 " : stat:/opt/kubeflow/tf-operator.v2: no such file or directory

logviewer causes index out of range when mpijob is not ready

Submitting an mpijob and then running arena logviewer causes a panic:

arena logviewer mpi
panic: runtime error: index out of range

goroutine 1 [running]:
github.com/kubeflow/arena/cmd/arena/commands.(*MPIJob).GetJobDashboards(0xc420655500, 0xc42028c000, 0x10, 0x11cec8b, 0x7, 0x12e75e0, 0xc420655500)
	/go/src/github.com/kubeflow/arena/cmd/arena/commands/trainer_mpi.go:143 +0x4b5
github.com/kubeflow/arena/cmd/arena/commands.NewLogViewerCommand.func1(0xc4203d0c80, 0xc420294aa0, 0x1, 0x1)
	/go/src/github.com/kubeflow/arena/cmd/arena/commands/logviewer.go:55 +0x241
github.com/kubeflow/arena/vendor/github.com/spf13/cobra.(*Command).execute(0xc4203d0c80, 0xc420294a50, 0x1, 0x1, 0xc4203d0c80, 0xc420294a50)
	/go/src/github.com/kubeflow/arena/vendor/github.com/spf13/cobra/command.go:766 +0x2c1
github.com/kubeflow/arena/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc420286a00, 0xc4203d0c80, 0xc4203d0a00, 0xc4203d0780)
	/go/src/github.com/kubeflow/arena/vendor/github.com/spf13/cobra/command.go:852 +0x30a
github.com/kubeflow/arena/vendor/github.com/spf13/cobra.(*Command).Execute(0xc420286a00, 0x405a9c, 0xc4200a8058)
	/go/src/github.com/kubeflow/arena/vendor/github.com/spf13/cobra/command.go:800 +0x2b
main.main()
	/go/src/github.com/kubeflow/arena/cmd/arena/main.go:38 +0x38
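The trace points at an unguarded index into a pod list that is empty while the mpijob is not yet ready. A defensive sketch (the function and names are hypothetical, not arena's actual code):

```go
package main

import (
	"errors"
	"fmt"
)

// chiefPodName guards the slice access that panicked above: check that
// the job actually has pods before indexing the first one.
func chiefPodName(pods []string) (string, error) {
	if len(pods) == 0 {
		return "", errors.New("job has no running pods yet; retry once it is scheduled")
	}
	return pods[0], nil
}

func main() {
	// Not ready yet: an error instead of a runtime panic.
	if _, err := chiefPodName(nil); err != nil {
		fmt.Println("error:", err)
	}
	name, _ := chiefPodName([]string{"mpi-launcher-0"})
	fmt.Println(name) // mpi-launcher-0
}
```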

Status is running, but cannot get nvidia info

@cheyang I rebooted the system and ran the tf-job. Now it is running, but it stays in the running status forever, and I cannot get any output information about GPU usage.

On the web, I can see the tf-job is running (and the status is always running), but I cannot get the log info.

"arena top job" couldn't detect metrics

I followed the guide arena/docs/userguide/9-top-job-gpu-metric.md.

Everything works as expected until the last step: when I submit the tfjob and use "arena top job" to check, the result shows:

ERRO[0000] gpu metric is not exist in prometheus for query  {__name__=~"nvidia_gpu_duty_cycle|nvidia_gpu_memory_used_bytes|nvidia_gpu_memory_total_bytes", pod_name=~""}
INSTANCE NAME  GPU(Device Index)  GPU(Duty Cycle)  GPU(Memory MiB)  STATUS  NODE


Performance issue of arena list

The latency of arena list is more than 4s, but the CPU time used is only about 0.33s.

time arena list
NAME                                  STATUS     TRAINER  AGE  NODE
.....

real	0m4.492s
user	0m0.241s
sys	0m0.090s
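The gap between wall-clock time (about 4.5s) and CPU time (about 0.33s) suggests the command spends most of its time waiting on sequential API calls; issuing the per-trainer queries concurrently is one possible fix. A toy Go sketch (the trainer names and fetch function are stand-ins, not arena's code):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fetchAll runs one fetch per trainer concurrently and returns the
// results in the original order.
func fetchAll(trainers []string, fetch func(string) string) []string {
	results := make([]string, len(trainers))
	var wg sync.WaitGroup
	for i, t := range trainers {
		wg.Add(1)
		go func(i int, t string) {
			defer wg.Done()
			results[i] = fetch(t)
		}(i, t)
	}
	wg.Wait()
	return results
}

func main() {
	slowFetch := func(t string) string {
		time.Sleep(100 * time.Millisecond) // simulated API latency
		return t + ": ok"
	}
	start := time.Now()
	out := fetchAll([]string{"tfjob", "mpijob", "horovod"}, slowFetch)
	// Three 100ms calls overlap instead of summing to 300ms.
	fmt.Println(out, "in", time.Since(start).Round(50*time.Millisecond))
}
```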

when should I set hostIPC to false?

useHostNetwork: false
useHostPID: true
useHostIPC: true
gpuCount: 0 # user define
privileged: false

I start a service in a container and try to connect to this service using localhost:xxxx; it always fails.

But after I set useHostIPC to false, I can connect to the service using localhost:xxxx.

I want to know when I should use useHostIPC, and if I set hostIPC to false, does it affect the tfjob?

tfserving job support custom version policy

Serve a single model in a tfserving instance:

arena submit xxxx
    --modelName=xx
    --modelPath=xx
    --versionPolicy=latest:2 # means model server load the latest 2 versions of the $modelName

Serve multiple models in a tfserving instance:

arena submit xxxx
    --modelConfigFile=./my-model-config-file

The content of my-model-config-file will be passed to the tensorflow serving --model-config-file flag. Sample:

model_config_list: {
	config: {
		name: "mnist",
		base_path: "/tmp/monitored/_model",
		model_platform: "tensorflow",
		model_version_policy: {
		   all: {}
		}
	},
	config: {
		name: "inception",
		base_path: "/tmp/monitored/inception_model",
		model_platform: "tensorflow",
		model_version_policy: {
		   latest: {
		   	num_versions: 2
		   }
		}
	},
	config: {
		name: "test",
		base_path: "/tmp/monitored/test_model",
		model_platform: "tensorflow",
		model_version_policy: {
		   specific: {
		   	versions: 1
		   }
		}
	}
}

Prometheus could not pull metrics from node-gpu-exporter

When I deployed node-gpu-exporter and prometheus in a Kubernetes GPU cluster, everything went well and the Prometheus UI could list GPU metrics. However, after two days the node-gpu-exporter went down. When I exec kubectl logs -f node-gpu-exporter-XXX, it shows nothing.

[maintainance] Add copyright header in all files.

Golang:

// Copyright 2018 The Kubeflow Authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//       http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

Cannot run job in minikube deployment

I have installed arena following the installation guide in my minikube test environment. After installation, arena top node shows no worker:

NAME      IPADDRESS      ROLE    GPU(Total)  GPU(Allocated)
minikube  192.168.0.199  master  1           0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/1 (0%)  

When submitting the examples in 1-tfjob-standalone, the job stays in the PENDING state (maybe due to podAffinity issues).

Since only workers require GPU resources, can I adjust arena to support training jobs under minikube?

arena should support gang schedule for MPIJob

When a user submits MPIJob (kubeflow/mpi-operator) batch jobs with arena, we should guarantee all-or-nothing scheduling to avoid wasting resources when the cluster is starved, especially for expensive GPUs.
