kubeflow / arena

A CLI for Kubeflow.

License: Apache License 2.0

Makefile 0.29% Go 79.23% Dockerfile 0.04% Shell 1.88% Python 7.60% Mustache 1.20% Java 9.58% Smarty 0.19%
kubeflow kubernetes tensorflow docker deep-learning


Arena


View the Arena documentation.

Overview

Arena is a command-line interface that lets data scientists run and monitor machine learning training jobs and check their results in an easy way. Currently it supports standalone and distributed TensorFlow training. Under the hood it is built on Kubernetes, Helm and Kubeflow, but data scientists need to know very little about Kubernetes.

Meanwhile, end users often need GPU resource and node management. Arena also provides a top command to check available GPU resources in the Kubernetes cluster.

In short, Arena's goal is to make data scientists feel as though they are working on a single machine, while actually having the power of a GPU cluster behind them.

For the Chinese version, please refer to the 中文文档.

Setup

You can follow the Installation guide.

User Guide

Arena is a command-line interface to run and monitor machine learning training jobs and check their results in an easy way. Please refer to the User Guide to manage your training jobs.

Demo

Developing

Prerequisites:

  • Go >= 1.8
mkdir -p $(go env GOPATH)/src/github.com/kubeflow
cd $(go env GOPATH)/src/github.com/kubeflow
git clone https://github.com/kubeflow/arena.git
cd arena
make

The arena binary is located in the arena/bin directory. You may want to add this directory to your $PATH.

Then you can follow the Installation guide for developers.

CPU Profiling

# set profile rate (HZ)
export PROFILE_RATE=1000

# arena {command} --pprof
arena list --pprof
INFO[0000] Dump cpu profile file into /tmp/cpu_profile

Then you can analyze the profile by following Go CPU profiling: pprof and speedscope.

Adopters

If you are interested in Arena and would like to share your experiences with others, you are warmly welcome to add your information to the ADOPTERS.md page. We will continuously discuss new requirements and feature designs with you in advance.

FAQ

Please refer to FAQ

CLI Document

Please refer to arena.md

RoadMap

See RoadMap

arena's People

Contributors

alanfokco, banbanpeppa, binblee, chenyi015, cheyang, denkensk, goodoid, gujingit, happy2048, hwk42, jackhuang007, jiaqianjing, meibenjin, ocherfas, osswangxining, oyutiano, ringtail, sakuralbj, soolaugust, syulin7, terrytangyuan, thliang01, tingshua-yts, uzuku, wsxiaozhang, xauthulei, xiaozhoux, xiechengsheng, xieydd, zhujl1991


arena's Issues

Extend arena to support for training dataset or model storage management

It would be helpful for data scientists to use a command like "arena create data imagenet-full" to create, index and manage different training datasets for different training jobs.
Then, when submitting a training job with arena, "imagenet-full" can be passed in directly as the "data" parameter.
The data is actually a pointer to a specific PVC or HDFS path, etc. It is easy to use 'data' to manage and record which training datasets to load, without needing to care about which storage backend is used for persistence.

Need `arena get` json and yaml output

JSON output:

arena get tf-git -ojson
{
    "name": "tf-git",
    "namespace": "default",
    "duration": "0s",
    "status": "FAILED",
    "trainer": "tfjob",
    "instances": [
        {
            "status": "Failed",
            "name": "tf-git-chief-0",
            "age": "0s",
            "node": "192.168.0.199"
        }
    ]
}

YAML output:

arena get tf-git -oyaml
name: tf-git
namespace: default
duration: 0s
status: FAILED
trainer: tfjob
tensorboard: ""
instances:
- status: Failed
  name: tf-git-chief-0
  age: 0s
  node: 192.168.0.199
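Output like the above is easy to consume programmatically. A minimal Go sketch of parsing the JSON form (struct and field names are illustrative, not arena's actual types):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Instance mirrors one entry of the "instances" array shown above.
type Instance struct {
	Status string `json:"status"`
	Name   string `json:"name"`
	Age    string `json:"age"`
	Node   string `json:"node"`
}

// JobInfo mirrors the top-level object of `arena get <job> -ojson`.
type JobInfo struct {
	Name      string     `json:"name"`
	Namespace string     `json:"namespace"`
	Duration  string     `json:"duration"`
	Status    string     `json:"status"`
	Trainer   string     `json:"trainer"`
	Instances []Instance `json:"instances"`
}

// parseJob decodes the JSON output into a JobInfo.
func parseJob(raw []byte) (JobInfo, error) {
	var job JobInfo
	err := json.Unmarshal(raw, &job)
	return job, err
}

func main() {
	raw := []byte(`{"name":"tf-git","status":"FAILED","trainer":"tfjob",
		"instances":[{"status":"Failed","name":"tf-git-chief-0"}]}`)
	job, err := parseJob(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(job.Name, job.Status, len(job.Instances)) // tf-git FAILED 1
}
```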

Add MPIJob

Arena should be able to submit an MPIJob whose backend is the MPI Operator.

support for distributed tensorflow estimator

Aliyun gurus, could you kindly support running a distributed TensorFlow estimator? Typical usage can be found here: estimator train and evaluate.

The key difference from the raw API is that an estimator must have a chief node (worker 0), and it is suggested to also have an evaluator node.

In our practice, we have ps, worker, chief and evaluator nodes.

Thanks in advance.

Something about horovodjob?

  1. Have you considered offering a horovodjob demo on the home page?
  2. Does the init container of mpijob (git-sync) provide the same functionality as the init container of horovodjob (rsync)?
    @cheyang

arena should support gang schedule for TFJob

When a user submits TFJob (kubeflow/tf-operator) batch jobs with arena, we should guarantee all-or-nothing scheduling to avoid wasting resources when the cluster is starved, especially for expensive GPUs.
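One common way to get this all-or-nothing behavior is to run the job's pods under a batch scheduler that understands pod groups. A hedged sketch only (the API group, kind and field names are assumptions in the style of kube-batch, not a tested manifest):

```yaml
# Illustrative sketch: gang scheduling via a PodGroup-aware scheduler.
apiVersion: scheduling.incubator.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: tf-git-gang
spec:
  minMember: 4   # schedule all 4 pods of the TFJob, or none of them
```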

evaluator needs Replicas parameter

When creating a tfjob that has an evaluator, tf-operator will check the Replicas parameter of the evaluator. If Replicas doesn't exist, it causes a runtime problem and tf-operator crashes.
Right now tfjob.yaml doesn't have such a parameter.

Evaluator:
  restartPolicy: Never
  template:

And the following is the validation code in tf-operator:

https://github.com/kubeflow/tf-operator/blob/3bf0ac83924aeea57df3e7c50bc82fc3f5546844/pkg/apis/tensorflow/validation/validation.go#L95-L97
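A minimal sketch of the workaround, assuming the evaluator spec in tfjob.yaml simply needs an explicit replica count (field names follow the snippet above; the template body is elided):

```yaml
Evaluator:
  replicas: 1        # explicit count so tf-operator's validation passes
  restartPolicy: Never
  template:
    # ...pod template as before...
```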

"arena list" status not the same as "arena get" status

Hi,

When I create a training job using arena, I found that the status of the job in 'arena list' is not the same as in 'arena get':


The command is as follow:

arena submit tf \
             --name=tf-git \
             --image=test:1.0 \
             --syncMode=git \
             --syncSource=https://github.com/chenyu85/Distribute_MNIST.git \
             "python code/Distribute_MNIST/distributed.py --job_name=ps --task_index=0 --data_dir /MNIST_data"

test:1.0 is built by myself, based on tensorflow/tensorflow:1.5.0-devel-gpu with a data file.

And the strange thing is that the logviewer shows nothing for any job:


The first time I used the logviewer, it showed the tfjob even though it had failed.

Could anyone help me? Thanks.

mpijob reports running when the worker pod is pending

# arena  list
NAME  STATUS   TRAINER  AGE  NODE
test  RUNNING  MPIJOB   16m
# arena get test
NAME  STATUS   TRAINER  AGE  INSTANCE              NODE
test  PENDING  MPIJOB   16m  test-mpijob-worker-0  N/A
test  PENDING  MPIJOB   16m  test-mpijob-worker-1  N/A
# kubectl get mpijob -o=yaml
apiVersion: v1
items:
- apiVersion: kubeflow.org/v1alpha1
  kind: MPIJob
  metadata:
 
  status: {}
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

The reason is that the status of the mpijob is empty.
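A defensive sketch of the fix (a simplified stand-in, not arena's actual code): treat an empty operator status as pending instead of letting it surface as running:

```go
package main

import "fmt"

// displayStatus maps the phase reported by the operator to what the CLI
// shows. An empty phase means the operator has not reconciled the job
// yet, so report PENDING rather than defaulting to RUNNING.
func displayStatus(phase string) string {
	if phase == "" {
		return "PENDING"
	}
	return phase
}

func main() {
	fmt.Println(displayStatus(""))        // PENDING
	fmt.Println(displayStatus("Running")) // Running
}
```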

Use arena to create a single node model development box

Besides submitting training jobs, users also want to create a development box which contains Jupyter, math libraries and frameworks, in order to develop and debug algorithms before starting to train in the cluster. It would be better to have GPU support enabled as well.

error: ValidationError(CustomResourceDefinition.spec): missing required field "scope"

This is for closed issue #8.

I think we can add a scope field valued "Namespaced" to the CustomResourceDefinition part for TFJob in tf-operator.yaml. I find that ksonnet sets it to Namespaced automatically, like below; we just add it here:

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: tfjobs.kubeflow.org
spec:
  group: kubeflow.org
  names:
    kind: TFJob
    plural: tfjobs
    singular: tfjob
  ## add scope field here
  scope: Namespaced
  validation:
    openAPIV3Schema:
      properties:
        spec:
          properties:
            tfReplicaSpecs:
              properties:
                Chief:
                  properties:
                    replicas:
                      maximum: 1
                      minimum: 1
                      type: integer
                PS:
                  properties:
                    replicas:
                      minimum: 1
                      type: integer
                Worker:
                  properties:
                    replicas:
                      minimum: 1
                      type: integer
  version: v1alpha2

This will solve the problem for arena when it creates the TFJob CRD on Kubernetes >= 1.11.0. Of course, executing the command below can also solve this problem:

kubectl create -f arena/kubernetes-artifacts/tf-operator/tf-operator.yaml --validate=false

error validating data: ValidationError(CustomResourceDefinition.spec): missing required field "scope"

Hello Alibaba gurus, I am a newbie. When I followed the installation document and ran
kubectl create -f arena/kubernetes-artifacts/tf-operator/tf-operator.yaml

I encountered an error. Below is the error log. I guess there may be a problem with my environment; could you give some advice?
configmap/tf-job-operator-config created
serviceaccount/tf-job-operator created
clusterrole.rbac.authorization.k8s.io/tf-job-operator created
clusterrolebinding.rbac.authorization.k8s.io/tf-job-operator created
clusterrole.rbac.authorization.k8s.io/tf-job-dashboard created
clusterrolebinding.rbac.authorization.k8s.io/tf-job-dashboard created
serviceaccount/tf-job-dashboard created
deployment.extensions/tf-job-dashboard created
error: error validating "arena/kubernetes-artifacts/tf-operator/tf-operator.yaml": error validating data: ValidationError(CustomResourceDefinition.spec): missing required field "scope" in io.k8s.apiextensions-apiserver.pkg.apis.apiextensions.v1beta1.CustomResourceDefinitionSpec; if you choose to ignore these errors, turn validation off with --validate=false

arena delete tfjobs error

delete command and logs:

[chaochchao1]$ arena delete tf-estimator1
ERRO[0000] Failed to delete tf-estimator1, the reason is that There is no training job found with the name tf-estimator1, please check it with arena list | grep tf-estimator1

[chaochao1]$ arena delete tf-estimator2
WARN[0000] Failed to UninstallAppsWithAppInfoFile due to exit status 123
WARN[0000] manually delete the following resource:
service "tf-estimator2-tensorboard" deleted
deployment.extensions "tf-estimator2-tensorboard" deleted
configmap "tf-estimator2-tfjob" deleted
INFO[0001] The Job tf-estimator2 has been deleted successfully
[chaochao1]$ arena list
NAME STATUS TRAINER AGE NODE
tf-estimator1 FAILED TFJOB 0s N/A
tf-estimator2 SUCCEEDED TFJOB 18m N/A
tf-estimator SUCCEEDED TFJOB 28m N/A
[chaochao1]$ arena delete tf-estimator2
ERRO[0000] Failed to delete tf-estimator2, the reason is that There is no training job found with the name tf-estimator2, please check it with arena list | grep tf-estimator2

[chaochao1]$ arena list
NAME STATUS TRAINER AGE NODE
tf-estimator1 FAILED TFJOB 0s N/A
tf-estimator2 SUCCEEDED TFJOB 18m N/A
tf-estimator SUCCEEDED TFJOB 28m N/A
[chaochao1]$ arena delete tf-estimator
WARN[0000] Failed to UninstallAppsWithAppInfoFile due to exit status 123
WARN[0000] manually delete the following resource:
service "tf-estimator-tensorboard" deleted
deployment.extensions "tf-estimator-tensorboard" deleted
configmap "tf-estimator-tfjob" deleted
INFO[0001] The Job tf-estimator has been deleted successfully
[chaochao1]$ arena list
NAME STATUS TRAINER AGE NODE
tf-estimator1 FAILED TFJOB 0s N/A
tf-estimator2 SUCCEEDED TFJOB 18m N/A
tf-estimator SUCCEEDED TFJOB 28m N/A
[chaochao1]$ arena delete tf-estimator
ERRO[0000] Failed to delete tf-estimator, the reason is that There is no training job found with the name tf-estimator, please check it with arena list | grep tf-estimator

[chaochao1]$ arena delete tf-estimator
ERRO[0000] Failed to delete tf-estimator, the reason is that There is no training job found with the name tf-estimator, please check it with arena list | grep tf-estimator

kubernetes-dashboard can't be created

In the arena-system namespace, the pod kubernetes-dashboard can't be created successfully; it seems the image could not be pulled.

kubectl describe po kubernetes-dashboard-6647d54ddc-9bqzl -n arena-system 
Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  6m                default-scheduler  Successfully assigned arena-system/kubernetes-dashboard-6647d54ddc-9bqzl to slave
  Warning  Failed     1m (x3 over 3m)   kubelet, slave     Failed to pull image "registry-vpc.cn-zhangjiakou.aliyuncs.com/tensorflow-samples/kubernetes-dashboard-amd64:v1.6.0": rpc error: code = Unknown desc = Error response from daemon: Get https://registry-vpc.cn-zhangjiakou.aliyuncs.com/v1/_ping: dial tcp 100.100.80.161:443: i/o timeout
  Warning  Failed     1m (x3 over 3m)   kubelet, slave     Error: ErrImagePull
  Normal   BackOff    54s (x5 over 3m)  kubelet, slave     Back-off pulling image "registry-vpc.cn-zhangjiakou.aliyuncs.com/tensorflow-samples/kubernetes-dashboard-amd64:v1.6.0"
  Warning  Failed     54s (x5 over 3m)  kubelet, slave     Error: ImagePullBackOff
  Normal   Pulling    41s (x4 over 6m)  kubelet, slave     pulling image "registry-vpc.cn-zhangjiakou.aliyuncs.com/tensorflow-samples/kubernetes-dashboard-amd64:v1.6.0"

Support MPI Job?

I see mpijob in the docs, but I didn't see an MPI demo.

Is it not supported yet?

Arena doesn't support multi kubeconfig file

My KUBECONFIG is configured as follows:

minikube=~/.kube/config
kubenetes=~/.kube/localconfig
hongkong=~/.kube/hongkongconfig
export KUBECONFIG=$KUBECONFIG:$minikube:$kubenetes:$hongkong

but when I execute arena list, I get the following error:

WARN[0000] Illegal kubeconfig file: :~/.kube/config:~/.kube/localconfig:~/.kube/hongkongconfig
FATA[0000] stat :~/.kube/config:~/.kube/localconfig:~/.kube/hongkongconfig: no such file or directory
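A sketch of tolerant handling (illustrative only, not arena's code): split the KUBECONFIG value on ":" and drop empty entries, instead of stat-ing the whole string as a single file path:

```go
package main

import (
	"fmt"
	"strings"
)

// kubeconfigPaths splits a KUBECONFIG-style list on ":" and drops empty
// entries (such as the leading ":" produced by `$KUBECONFIG:...` when
// KUBECONFIG was previously unset).
func kubeconfigPaths(env string) []string {
	var paths []string
	for _, p := range strings.Split(env, ":") {
		if p != "" {
			paths = append(paths, p)
		}
	}
	return paths
}

func main() {
	fmt.Println(kubeconfigPaths(":~/.kube/config:~/.kube/localconfig"))
	// [~/.kube/config ~/.kube/localconfig]
}
```

Each surviving path can then be loaded and merged, which is how kubectl itself treats a multi-entry KUBECONFIG.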

MPIjob Error

When I run the demo, I hit a problem:

Error: apiVersion "kubeflow.org/v1alpha1" in mpijob/templates/mpijob.yaml is not available

Should I deploy Kubeflow in advance? @cheyang

Arena should support serving models online with TensorFlow Serving

Arena should support serving models online with TensorFlow Serving.

arena submit serving --name=xxx ... should deploy a TensorFlow Serving instance with the needed model_server config, and automatically create the corresponding service and ingress.

arena get serving xxx should display the serving instance info, including the exposed endpoint info for gRPC clients.

create tf-operator failed

When I created tf-operator using kubectl create -f arena/kubernetes-artifacts/tf-operator/tf-operator.yaml, it failed. The log shows:
starting container process cause "exec: \opt/kubeflow/tf-operator.v2 " : stat:/opt/kubeflow/tf-operator.v2: no such file or directory

logviewer causes index out of range when mpijob is not ready

Submitting an mpijob and then running arena logviewer causes a panic:

arena logviewer mpi
panic: runtime error: index out of range

goroutine 1 [running]:
github.com/kubeflow/arena/cmd/arena/commands.(*MPIJob).GetJobDashboards(0xc420655500, 0xc42028c000, 0x10, 0x11cec8b, 0x7, 0x12e75e0, 0xc420655500)
	/go/src/github.com/kubeflow/arena/cmd/arena/commands/trainer_mpi.go:143 +0x4b5
github.com/kubeflow/arena/cmd/arena/commands.NewLogViewerCommand.func1(0xc4203d0c80, 0xc420294aa0, 0x1, 0x1)
	/go/src/github.com/kubeflow/arena/cmd/arena/commands/logviewer.go:55 +0x241
github.com/kubeflow/arena/vendor/github.com/spf13/cobra.(*Command).execute(0xc4203d0c80, 0xc420294a50, 0x1, 0x1, 0xc4203d0c80, 0xc420294a50)
	/go/src/github.com/kubeflow/arena/vendor/github.com/spf13/cobra/command.go:766 +0x2c1
github.com/kubeflow/arena/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc420286a00, 0xc4203d0c80, 0xc4203d0a00, 0xc4203d0780)
	/go/src/github.com/kubeflow/arena/vendor/github.com/spf13/cobra/command.go:852 +0x30a
github.com/kubeflow/arena/vendor/github.com/spf13/cobra.(*Command).Execute(0xc420286a00, 0x405a9c, 0xc4200a8058)
	/go/src/github.com/kubeflow/arena/vendor/github.com/spf13/cobra/command.go:800 +0x2b
main.main()
	/go/src/github.com/kubeflow/arena/cmd/arena/main.go:38 +0x38
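The trace points at an unguarded index into a pod list that is empty while the mpijob is not yet ready. A defensive sketch (the function and names are hypothetical, not arena's actual code):

```go
package main

import (
	"errors"
	"fmt"
)

// chiefPodName guards the slice access that panicked above: check that
// the job actually has pods before indexing the first one.
func chiefPodName(pods []string) (string, error) {
	if len(pods) == 0 {
		return "", errors.New("job has no running pods yet; retry once it is scheduled")
	}
	return pods[0], nil
}

func main() {
	// Not ready yet: an error instead of a runtime panic.
	if _, err := chiefPodName(nil); err != nil {
		fmt.Println("error:", err)
	}
	name, _ := chiefPodName([]string{"mpi-launcher-0"})
	fmt.Println(name) // mpi-launcher-0
}
```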

Status is running, but cannot get nvidia info

@cheyang I rebooted the system and ran the tf-job. Now it is running, but it stays in the running status forever, and I cannot get any output information about GPU usage.

On the web, I can see the tf-job is running (and the status is always running), but I cannot get the log info.

"arena top job" couldn't detect metrics

I followed the guide arena/docs/userguide/9-top-job-gpu-metric.md.

Everything works as expected until the last step: when I submit the tfjob and use "arena top job" to check, the result shows:

ERRO[0000] gpu metric is not exist in prometheus for query  {__name__=~"nvidia_gpu_duty_cycle|nvidia_gpu_memory_used_bytes|nvidia_gpu_memory_total_bytes", pod_name=~""}
INSTANCE NAME  GPU(Device Index)  GPU(Duty Cycle)  GPU(Memory MiB)  STATUS  NODE


Performance issue of arena list

The latency of arena list is more than 4s, but the CPU time used is only about 0.33s.

time arena list
NAME                                  STATUS     TRAINER  AGE  NODE
.....

real	0m4.492s
user	0m0.241s
sys	0m0.090s
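The gap between wall-clock time (about 4.5s) and CPU time (about 0.33s) suggests the command spends most of its time waiting on sequential API calls; issuing the per-trainer queries concurrently is one possible fix. A toy Go sketch (the trainer names and fetch function are stand-ins, not arena's code):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fetchAll runs one fetch per trainer concurrently and returns the
// results in the original order.
func fetchAll(trainers []string, fetch func(string) string) []string {
	results := make([]string, len(trainers))
	var wg sync.WaitGroup
	for i, t := range trainers {
		wg.Add(1)
		go func(i int, t string) {
			defer wg.Done()
			results[i] = fetch(t)
		}(i, t)
	}
	wg.Wait()
	return results
}

func main() {
	slowFetch := func(t string) string {
		time.Sleep(100 * time.Millisecond) // simulated API latency
		return t + ": ok"
	}
	start := time.Now()
	out := fetchAll([]string{"tfjob", "mpijob", "horovod"}, slowFetch)
	// Three 100ms calls overlap instead of summing to 300ms.
	fmt.Println(out, "in", time.Since(start).Round(50*time.Millisecond))
}
```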

when should I set hostIPC to false?

useHostNetwork: false
useHostPID: true
useHostIPC: true
gpuCount: 0 # user define
privileged: false

I start a service in a container and try to connect to this service using localhost:xxxx; it always fails.

But after I set useHostIPC to false, I can connect to the service using localhost:xxxx.

I want to know when I should use useHostIPC, and if I set hostIPC to false, does it affect the tfjob?

tfserving job support custom version policy

Serve a single model in a tfserving instance:

arena submit xxxx
    --modelName=xx
    --modelPath=xx
    --versionPolicy=latest:2 # means model server load the latest 2 versions of the $modelName

Serve multiple models in a tfserving instance:

arena submit xxxx
    --modelConfigFile=./my-model-config-file

The content of my-model-config-file will be passed to the tensorflow serving --model-config-file flag. Sample:

model_config_list: {
	config: {
		name: "mnist",
		base_path: "/tmp/monitored/_model",
		model_platform: "tensorflow",
		model_version_policy: {
		   all: {}
		}
	},
	config: {
		name: "inception",
		base_path: "/tmp/monitored/inception_model",
		model_platform: "tensorflow",
		model_version_policy: {
		   latest: {
		   	num_versions: 2
		   }
		}
	},
	config: {
		name: "test",
		base_path: "/tmp/monitored/test_model",
		model_platform: "tensorflow",
		model_version_policy: {
		   specific: {
		   	versions: 1
		   }
		}
	}
}

Prometheus could not pull metrics from node-gpu-exporter

When I deployed node-gpu-exporter and prometheus in a Kubernetes GPU cluster, everything went well and the Prometheus UI could list GPU metrics. However, after two days the node-gpu-exporter went down. When I exec kubectl logs -f node-gpu-exporter-XXX, it shows nothing.

[maintainance] Add copyright header in all files.

Golang:

// Copyright 2018 The Kubeflow Authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//       http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

Cannot run job in minikube deployment

I have installed arena following the installation guide in my minikube test environment. After installation, arena top node shows no worker:

NAME      IPADDRESS      ROLE    GPU(Total)  GPU(Allocated)
minikube  192.168.0.199  master  1           0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/1 (0%)  

When submitting the examples in 1-tfjob-standalone, the job stays in the PENDING state (maybe due to podAffinity issues).

Since only workers require GPU resources, can I adjust arena to support training jobs under minikube?

arena should support gang schedule for MPIJob

When a user submits MPIJob (kubeflow/mpi-operator) batch jobs with arena, we should guarantee all-or-nothing scheduling to avoid wasting resources when the cluster is starved, especially for expensive GPUs.
