
common's Introduction


Kubeflow is the cloud-native platform for machine learning operations: pipelines, training, and deployment.


Documentation

Please refer to the official docs at kubeflow.org.

Working Groups

The Kubeflow community is organized into working groups (WGs), each with associated repositories, that focus on specific pieces of the ML platform.

Quick Links

Get Involved

Please refer to the Community page.


common's Issues

Enhance maintainability of operator common module

There are a few issues in the current project, and I propose some improvements for them.
I am working on a PR; please share your insights and feedback.

  1. ./hack/update-codegen.sh doesn't work with the current code structure. It tries to generate outputs for operator:v1, but the project only has a job_controller/api/v1 directory, which doesn't match.
    github.com/kubeflow/common/client github.com/kubeflow/common \
    operator:v1 \

Either update-codegen.sh or the code structure needs to be changed.

  2. There is no ./hack/verify-codegen.sh verification step in the CI.

  3. The code-generator version referenced in the project doesn't work. The minimum version that supports go modules is 0.15.x.
    https://github.com/kubeflow/common/blob/88a0cd071cf162120aa3ea0c3cb9cc64d69ef3f5/hack/update-codegen.sh#L25

  4. Upgrade the Kubernetes dependency.
    The common project uses the expectations mechanism from k8s.io/kubernetes, which gives it a direct dependency on that repository:

    Expectations controller.ControllerExpectationsInterface

    We should consider removing the dependency on k8s.io/kubernetes because it is not primarily intended to be consumed as a module; only the published subcomponents are. Reference: #48

Another challenge is that most new training operators use kubebuilder v2 or the latest operator-sdk to generate APIs. These frameworks depend on the latest k8s.io/code-generator and k8s.io/client-go, which conflict with the versions that k8s.io/[email protected] uses. There are some breaking changes users have to resolve.

It would be better to upgrade all k8s.io/xx dependencies to 1.15+ and make sure our common package is compatible with the newer client-go.

  5. The code structure is not that clear. The apis package currently lives under job_controller, and I think we plan to keep ControllerInterface and the common code under that package. As far as I know, some users only consume APIs like CleanupPolicy or RestartPolicy. It would be great to separate the API definitions and the job controller implementation into different folders.

For test_job, it seems this is only used for testing, so I think it's fine to put its apis and client directly under pkg. I am also OK with a separate top-level directory like test_job to host the APIs. Currently we have this code structure for test_job:

├── client
│   ├── clientset
│   │   └── versioned
│   ├── informers
│   │   └── externalversions
│   └── listers
│          └── test_job
├── test_job
│   ├── v1

Based on the discussion above, could we create a project structure like this:


├── LICENSE
├── OWNERS
├── README.md
├── go.mod
├── go.sum
├── hack
│   ├── boilerplate
│   ├── scripts
│   ├── update-codegen.sh
│   └── verify-codegen.sh
├── pkg
│   ├── apis
│   │   └── common
│   │       └── v1              ---> common API 
│   │   └── test_job
│   │       └── v1             ----> test Job API
│   ├── client                   ----> test Job client
│   │   ├── clientset
│   │   │   └── versioned
│   │   ├── informers
│   │   │   └── externalversions
│   │   └── listers
│   │       └── test_job
│   ├── job_controller
│   │   ├── job.go
│   │   ├── job_controller.go
│   │   ├── job_test.go
│   │   ├── pod.go
│   │   ├── pod_control.go
│   │   ├── pod_control_test.go
│   │   ├── pod_test.go
│   │   ├── service.go
│   │   ├── service_control.go
│   │   ├── service_control_test.go
│   │   ├── service_ref_manager.go
│   │   ├── service_ref_manager_test.go
│   │   ├── status.go
│   │   ├── status_test.go
│   │   ├── test_job_controller.go
│   │   ├── util.go
│   │   └── util_test.go
│   ├── test_util                          ----> I doubt some of these utils are still being used; needs more cleanup.
│   │   ├── const.go..
│   └── util
│       ├── k8sutil
│       ├── logger.go
│       ├── signals
│       ├── status.go
│       ├── status_test.go
│       ├── train
│       └── util.go


Other operators can then import the API and controller packages like the following:

import (
	"github.com/kubeflow/common/pkg/apis/common/v1"
	"github.com/kubeflow/common/pkg/job_controller"
)

Please leave comments; I will file the PR if this makes sense.

/cc @terrytangyuan @gaocegege @jian-he @richardsliu

Deprecate glog and use logger in commonutil instead

glog has been deprecated in the Kubernetes community for a few reasons.
I notice it is still being used in the common project; it would be better to clean up all references and migrate to the existing logging solution.

/improvement

Make training container port optional

Instead of using a fixed port to communicate with each other, some frameworks let each worker pick a random port and use auto-discovery against a consensus store to find the full topology.

Currently, common forces users to implement GetDefaultContainerPortName and GetDefaultContainerPortNumber and uses them in the following logic.

if port.Name == jc.Controller.GetDefaultContainerPortName(){

If the job doesn't define a port, this returns an error.

To support the case described above, we can make the port optional.
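A minimal sketch of the idea, assuming a hypothetical helper next to the existing port lookup in the job_controller package (the function name and wiring are illustrative, not the current API): when the default container declares no matching port, return nil instead of an error so callers can skip port-dependent setup.

package job_controller

import (
	apiv1 "github.com/kubeflow/common/pkg/apis/common/v1"
)

// getDefaultPortIfPresent is a hypothetical helper: it returns the default
// container's port if one is declared, and nil (no error) otherwise, making
// the port optional for frameworks that discover ports at runtime.
func getDefaultPortIfPresent(spec *apiv1.ReplicaSpec, containerName, portName string) *int32 {
	for _, container := range spec.Template.Spec.Containers {
		if container.Name != containerName {
			continue
		}
		for _, port := range container.Ports {
			if port.Name == portName {
				p := port.ContainerPort
				return &p
			}
		}
	}
	// No matching port found: treat the port as optional instead of erroring.
	return nil
}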

Project cannot compile with kubeflow/common v0.3.3 and client-go v0.19

I used operator-sdk (v1.5.0) and this repo to scaffold a project.
go.mod:

require (
	github.com/go-logr/logr v0.3.0
	github.com/kubeflow/common v0.3.3
	github.com/onsi/ginkgo v1.14.1
	github.com/onsi/gomega v1.10.2
	k8s.io/api v0.19.2
	k8s.io/apimachinery v0.19.2
	k8s.io/client-go v0.19.2
	sigs.k8s.io/controller-runtime v0.7.2
)

But I hit errors when I try to compile:

/Users/xxxx/test-operator/bin/controller-gen object:headerFile="hack/boilerplate.go.txt" paths="./..."
go fmt ./...
go vet ./...
# volcano.sh/apis/pkg/client/clientset/versioned/typed/bus/v1alpha1
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/bus/v1alpha1/command.go:72:5: not enough arguments in call to c.client.Get().Namespace(c.ns).Resource("commands").Name(name).VersionedParams(&options, scheme.ParameterCodec).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/bus/v1alpha1/command.go:89:5: not enough arguments in call to c.client.Get().Namespace(c.ns).Resource("commands").VersionedParams(&opts, scheme.ParameterCodec).Timeout(timeout).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/bus/v1alpha1/command.go:106:8: not enough arguments in call to c.client.Get().Namespace(c.ns).Resource("commands").VersionedParams(&opts, scheme.ParameterCodec).Timeout(timeout).Watch
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/bus/v1alpha1/command.go:116:5: not enough arguments in call to c.client.Post().Namespace(c.ns).Resource("commands").Body(command).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/bus/v1alpha1/command.go:129:5: not enough arguments in call to c.client.Put().Namespace(c.ns).Resource("commands").Name(command.ObjectMeta.Name).Body(command).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/bus/v1alpha1/command.go:141:5: not enough arguments in call to c.client.Delete().Namespace(c.ns).Resource("commands").Name(name).Body(options).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/bus/v1alpha1/command.go:157:5: not enough arguments in call to c.client.Delete().Namespace(c.ns).Resource("commands").VersionedParams(&listOptions, scheme.ParameterCodec).Timeout(timeout).Body(options).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/bus/v1alpha1/command.go:170:5: not enough arguments in call to c.client.Patch(pt).Namespace(c.ns).Resource("commands").SubResource(subresources...).Name(name).Body(data).Do
        have ()
        want (context.Context)
# volcano.sh/apis/pkg/client/clientset/versioned/typed/batch/v1alpha1
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/batch/v1alpha1/job.go:73:5: not enough arguments in call to c.client.Get().Namespace(c.ns).Resource("jobs").Name(name).VersionedParams(&options, scheme.ParameterCodec).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/batch/v1alpha1/job.go:90:5: not enough arguments in call to c.client.Get().Namespace(c.ns).Resource("jobs").VersionedParams(&opts, scheme.ParameterCodec).Timeout(timeout).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/batch/v1alpha1/job.go:107:8: not enough arguments in call to c.client.Get().Namespace(c.ns).Resource("jobs").VersionedParams(&opts, scheme.ParameterCodec).Timeout(timeout).Watch
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/batch/v1alpha1/job.go:117:5: not enough arguments in call to c.client.Post().Namespace(c.ns).Resource("jobs").Body(job).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/batch/v1alpha1/job.go:130:5: not enough arguments in call to c.client.Put().Namespace(c.ns).Resource("jobs").Name(job.ObjectMeta.Name).Body(job).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/batch/v1alpha1/job.go:146:5: not enough arguments in call to c.client.Put().Namespace(c.ns).Resource("jobs").Name(job.ObjectMeta.Name).SubResource("status").Body(job).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/batch/v1alpha1/job.go:158:5: not enough arguments in call to c.client.Delete().Namespace(c.ns).Resource("jobs").Name(name).Body(options).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/batch/v1alpha1/job.go:174:5: not enough arguments in call to c.client.Delete().Namespace(c.ns).Resource("jobs").VersionedParams(&listOptions, scheme.ParameterCodec).Timeout(timeout).Body(options).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/batch/v1alpha1/job.go:187:5: not enough arguments in call to c.client.Patch(pt).Namespace(c.ns).Resource("jobs").SubResource(subresources...).Name(name).Body(data).Do
        have ()
        want (context.Context)
# volcano.sh/apis/pkg/client/clientset/versioned/typed/scheduling/v1beta1
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/scheduling/v1beta1/podgroup.go:73:5: not enough arguments in call to c.client.Get().Namespace(c.ns).Resource("podgroups").Name(name).VersionedParams(&options, scheme.ParameterCodec).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/scheduling/v1beta1/podgroup.go:90:5: not enough arguments in call to c.client.Get().Namespace(c.ns).Resource("podgroups").VersionedParams(&opts, scheme.ParameterCodec).Timeout(timeout).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/scheduling/v1beta1/podgroup.go:107:8: not enough arguments in call to c.client.Get().Namespace(c.ns).Resource("podgroups").VersionedParams(&opts, scheme.ParameterCodec).Timeout(timeout).Watch
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/scheduling/v1beta1/podgroup.go:117:5: not enough arguments in call to c.client.Post().Namespace(c.ns).Resource("podgroups").Body(podGroup).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/scheduling/v1beta1/podgroup.go:130:5: not enough arguments in call to c.client.Put().Namespace(c.ns).Resource("podgroups").Name(podGroup.ObjectMeta.Name).Body(podGroup).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/scheduling/v1beta1/podgroup.go:146:5: not enough arguments in call to c.client.Put().Namespace(c.ns).Resource("podgroups").Name(podGroup.ObjectMeta.Name).SubResource("status").Body(podGroup).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/scheduling/v1beta1/podgroup.go:158:5: not enough arguments in call to c.client.Delete().Namespace(c.ns).Resource("podgroups").Name(name).Body(options).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/scheduling/v1beta1/podgroup.go:174:5: not enough arguments in call to c.client.Delete().Namespace(c.ns).Resource("podgroups").VersionedParams(&listOptions, scheme.ParameterCodec).Timeout(timeout).Body(options).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/scheduling/v1beta1/podgroup.go:187:5: not enough arguments in call to c.client.Patch(pt).Namespace(c.ns).Resource("podgroups").SubResource(subresources...).Name(name).Body(data).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/scheduling/v1beta1/queue.go:70:5: not enough arguments in call to c.client.Get().Resource("queues").Name(name).VersionedParams(&options, scheme.ParameterCodec).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/scheduling/v1beta1/queue.go:70:5: too many errors
# github.com/kubeflow/common/pkg/controller.v1/control
../../../go/pkg/mod/github.com/kubeflow/[email protected]/pkg/controller.v1/control/pod_control.go:113:55: not enough arguments in call to r.KubeClient.CoreV1().Pods(namespace).Patch
        have (string, types.PatchType, []byte)
        want (context.Context, string, types.PatchType, []byte, "k8s.io/apimachinery/pkg/apis/meta/v1".PatchOptions, ...string)
../../../go/pkg/mod/github.com/kubeflow/[email protected]/pkg/controller.v1/control/pod_control.go:149:64: not enough arguments in call to r.KubeClient.CoreV1().Pods(namespace).Create
        have (*"k8s.io/api/core/v1".Pod)
        want (context.Context, *"k8s.io/api/core/v1".Pod, "k8s.io/apimachinery/pkg/apis/meta/v1".CreateOptions)
../../../go/pkg/mod/github.com/kubeflow/[email protected]/pkg/controller.v1/control/pod_control.go:170:55: not enough arguments in call to r.KubeClient.CoreV1().Pods(namespace).Get
        have (string, "k8s.io/apimachinery/pkg/apis/meta/v1".GetOptions)
        want (context.Context, string, "k8s.io/apimachinery/pkg/apis/meta/v1".GetOptions)
../../../go/pkg/mod/github.com/kubeflow/[email protected]/pkg/controller.v1/control/pod_control.go:182:56: not enough arguments in call to r.KubeClient.CoreV1().Pods(namespace).Delete
        have (string, nil)
        want (context.Context, string, "k8s.io/apimachinery/pkg/apis/meta/v1".DeleteOptions)
../../../go/pkg/mod/github.com/kubeflow/[email protected]/pkg/controller.v1/control/service_control.go:68:59: not enough arguments in call to r.KubeClient.CoreV1().Services(namespace).Patch
        have (string, types.PatchType, []byte)
        want (context.Context, string, types.PatchType, []byte, "k8s.io/apimachinery/pkg/apis/meta/v1".PatchOptions, ...string)
../../../go/pkg/mod/github.com/kubeflow/[email protected]/pkg/controller.v1/control/service_control.go:93:69: not enough arguments in call to r.KubeClient.CoreV1().Services(namespace).Create
        have (*"k8s.io/api/core/v1".Service)
        want (context.Context, *"k8s.io/api/core/v1".Service, "k8s.io/apimachinery/pkg/apis/meta/v1".CreateOptions)
../../../go/pkg/mod/github.com/kubeflow/[email protected]/pkg/controller.v1/control/service_control.go:116:63: not enough arguments in call to r.KubeClient.CoreV1().Services(namespace).Get
        have (string, "k8s.io/apimachinery/pkg/apis/meta/v1".GetOptions)
        want (context.Context, string, "k8s.io/apimachinery/pkg/apis/meta/v1".GetOptions)
../../../go/pkg/mod/github.com/kubeflow/[email protected]/pkg/controller.v1/control/service_control.go:128:60: not enough arguments in call to r.KubeClient.CoreV1().Services(namespace).Delete
        have (string, nil)
        want (context.Context, string, "k8s.io/apimachinery/pkg/apis/meta/v1".DeleteOptions)
# github.com/kubeflow/common/pkg/util/k8sutil
../../../go/pkg/mod/github.com/kubeflow/[email protected]/pkg/util/k8sutil/client.go:77:19: not enough arguments in call to r.DoRaw
        have ()
        want (context.Context)
../../../go/pkg/mod/github.com/kubeflow/[email protected]/pkg/util/k8sutil/client.go:91:19: not enough arguments in call to r.DoRaw
        have ()
        want (context.Context)

How can I solve this problem?

PodGroup sync loop is missing

Type of replicaType is not consistent in common project

Is there a reason why the type of replicaType is not consistent across the common project?

e.g.

SetClusterSpec(job interface{}, podTemplate *v1.PodTemplateSpec, rtype, index string) error

vs

IsMasterRole(replicas map[ReplicaType]*ReplicaSpec, rtype ReplicaType, index int) bool

I think it would be a good idea to define replicaType as commonv1.ReplicaType everywhere.
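As a hedged sketch of what the unified signatures could look like (assuming the types from the common v1 API package), both methods would take commonv1.ReplicaType instead of a bare string:

package sketch

import (
	commonv1 "github.com/kubeflow/common/pkg/apis/common/v1"
	corev1 "k8s.io/api/core/v1"
)

// controllerFragment shows only the two methods under discussion with a
// consistent rtype parameter; it is not the full ControllerInterface.
type controllerFragment interface {
	SetClusterSpec(job interface{}, podTemplate *corev1.PodTemplateSpec, rtype commonv1.ReplicaType, index string) error
	IsMasterRole(replicas map[commonv1.ReplicaType]*commonv1.ReplicaSpec, rtype commonv1.ReplicaType, index int) bool
}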

Proposal for exposing generic prometheus metrics in common operator

Proposal

Add generic metrics (jobs/pods/...) to the common operator so they can be directly enabled and used by operators built on top of it.

Motivation

To track job-level metrics, we currently need to add Prometheus metric code inside each job operator. For example, to know how many TFJobs were created in the last hour, we need to add a Counter inside tf-operator. This request is very common and is needed across operators. As we're moving shared code to the common operator, we could also add the metric-related code there, so it can be used by all operators built on top of the common one.

Details

For metric definition and registration, we will add a new metrics folder and define all metrics there. Some preliminary metrics include the number of jobs/pods/services created, durations of various operations, etc.
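A hedged sketch of what such definitions could look like, using the standard Prometheus Go client (metric names, labels, and function names here are illustrative, not a final naming scheme):

package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Preliminary counters; promauto registers them with the default registry.
var (
	createdJobsCount = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "kubeflow_common_jobs_created_total",
		Help: "Total number of jobs created, partitioned by job kind and namespace.",
	}, []string{"kind", "namespace"})

	createdPodsCount = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "kubeflow_common_pods_created_total",
		Help: "Total number of pods created on behalf of jobs.",
	}, []string{"kind", "namespace"})
)

// CreatedJobsCounterInc would be called from each operator's job creation path.
func CreatedJobsCounterInc(namespace, kind string) {
	createdJobsCount.WithLabelValues(kind, namespace).Inc()
}

// CreatedPodsCounterInc would be called from job_controller/pod.go when a pod is created.
func CreatedPodsCounterInc(namespace, kind string) {
	createdPodsCount.WithLabelValues(kind, namespace).Inc()
}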

For metrics updating:

  • For pods/services, we can directly add the related metric code inside job_controller/pod.go and job_controller/service.go.
  • For jobs, to track the numbers, we may need to watch the creation events, similar to controller watches.

As the common project is still under active development, some of the details discussed above may change later. Comments would be very much appreciated, @jlewi @richardsliu @gaocegege @jian-he.

Make training container port support multiple ports

Right now the common repo only supports a single port for workers to communicate with each other, but some scenarios need another port for profiling or other use cases.
Currently the GetPortFromJob func only returns a single port for the job.

func (jc *JobController) GetPortFromJob(spec *apiv1.ReplicaSpec) (*int32, error) {

By making the port configuration flexible, users can configure multiple ports, each with a name and number, for different roles.
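A hedged sketch of one possible shape (GetPortsFromJob is a hypothetical name, assumed to live in the same package that defines JobController and GetPortFromJob): return every named port of the default container instead of a single one.

package job_controller

import (
	"fmt"

	apiv1 "github.com/kubeflow/common/pkg/apis/common/v1"
)

// GetPortsFromJob is a hypothetical multi-port variant of GetPortFromJob: it
// returns every named port declared on the default container.
func (jc *JobController) GetPortsFromJob(spec *apiv1.ReplicaSpec) (map[string]int32, error) {
	containerName := jc.Controller.GetDefaultContainerName()
	for _, container := range spec.Template.Spec.Containers {
		if container.Name != containerName {
			continue
		}
		ports := make(map[string]int32, len(container.Ports))
		for _, port := range container.Ports {
			ports[port.Name] = port.ContainerPort
		}
		return ports, nil
	}
	return nil, fmt.Errorf("default container %q not found in replica spec", containerName)
}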
/assign @xieydd
/cc @gaocegege @terrytangyuan

Bump Kubernetes dependency to 1.16

I've seen this issue when I use common in other operators. The problem is that the volcano change brings in Kubernetes 1.16 and has a direct dependency on it, while at the same time we still replace the k8s packages with 1.15.10.

The build works, but if we decide to use 1.16 in other projects, the CRD client.go will hit incompatibility issues like the one below.

../../../../pkg/mod/github.com/kubeflow/[email protected]/pkg/util/k8sutil/client.go:50:32: undefined: serializer.DirectCodecFactory

Let's bump the version to 1.16 and make sure the dependencies are consistent.

Prow does not have the ability to list repo collaborators

/assign @jlewi

{
  component:  "hook"   
  error:  "status code 403 not one of [204 404 302], body: {"message":"Must have push access to view repository collaborators.","documentation_url":"https://developer.github.com/v3/repos/collaborators/#check-if-a-user-is-a-collaborator"}"   
  event-GUID:  "ca394c90-559d-11e9-945f-ee71934d44cc"   
  event-type:  "pull_request_review"   
  level:  "error"   
  msg:  "Failed to check if author is a collaborator."   
  org:  "kubeflow"   
  plugin:  "lgtm"   
  pr:  2   
  repo:  "common"   
  review:  221946237   
  reviewer:  "terrytangyuan"   
  url:  "https://github.com/kubeflow/common/pull/2#pullrequestreview-221946237"   
 }

[Release 1.1] training operator release

Opening this issue to track releasing the training operators for Kubeflow 1.1. I would like to add area OWNERS for @johnugeorge @terrytangyuan @Jeffwan.

It seems we want these items done before the 1.1 release. We may not finish all of them, but let's try.

Failed to fetch common module

The following command doesn't work.

$ go get github.com/kubeflow/common
go: finding github.com/kubeflow/common latest
go get: github.com/kubeflow/[email protected] requires
	k8s.io/[email protected] requires
	k8s.io/[email protected]: reading k8s.io/kubectl/go.mod at revision v0.0.0: unknown revision v0.0.0

I think this is because of #62.

I am using the stable version here for now and will fix this when I get time.

$ go get github.com/kubeflow/[email protected]
go: finding github.com/kubeflow/common v0.0.0-20200327002023-0b3a4c3fca85
go: downloading github.com/kubeflow/common v0.0.0-20200327002023-0b3a4c3fca85
go: extracting github.com/kubeflow/common v0.0.0-20200327002023-0b3a4c3fca85

Job controller doesn't reconcile scale down changes

If the user changes the replica count to a smaller value, the controller won't delete the now-redundant pods and services.

In the reconcile logic, we always size the slices based on the value of replicas.

podSlices := make([][]*v1.Pod, replicas)

Say we have pods with index 0, 1, 2 and the user changes replicas from 3 to 2. When we iterate to the pod with index 2, GetPodSlices only prints a warning log for this case.

if index < 0 || index >= replicas {

ReconcilePods never gets a chance to act on these pods, so the pod with index 2 is left behind; the loop only ever creates new pods:

err = jc.createNewPod(job, rt, strconv.Itoa(index), spec, masterRole, replicas)
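A rough sketch of one way to handle this, as a hypothetical helper inside the package that defines JobController (the helper itself does not exist today; apiv1.ReplicaIndexLabel is assumed to be the replica-index label key, and DeletePod follows the existing ControllerInterface): delete any pod whose replica index is at or beyond the desired replica count before building the pod slices.

package job_controller

import (
	"strconv"

	apiv1 "github.com/kubeflow/common/pkg/apis/common/v1"
	corev1 "k8s.io/api/core/v1"
)

// deleteOutOfRangePods is a hypothetical helper: pods whose replica-index label
// is >= the desired replica count are deleted via the ControllerInterface.
func (jc *JobController) deleteOutOfRangePods(job interface{}, pods []*corev1.Pod, replicas int) error {
	for _, pod := range pods {
		index, err := strconv.Atoi(pod.Labels[apiv1.ReplicaIndexLabel])
		if err != nil {
			// Skip pods without a parsable replica-index label.
			continue
		}
		if index >= replicas {
			if err := jc.Controller.DeletePod(job, pod); err != nil {
				return err
			}
		}
	}
	return nil
}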

Proposal: add priority and queue in scheduling for the common operator

Problem

1. Currently kube-batch has a PodGroupSpec that includes scheduling-policy fields such as MinAvailable, Queue, and PriorityClassName, but the Kubeflow operators don't provide these parameters to kube-batch today.

2. MPI-operator and tf-operator don't use the common operator, and pytorch-operator and mxnet-operator use the tf-operator/pkg/common package.

Proposed Solution

1. Supplement these attributes in RunPolicy.SchedulingPolicy. When Kubeflow is used together with kube-batch, Kubeflow can pass these parameters to kube-batch.

// SchedulingPolicy encapsulates various scheduling policies of the distributed training
// job, for example `minAvailable` for gang-scheduling.
type SchedulingPolicy struct {
    MinAvailable *int32 `json:"minAvailable,omitempty"`

    // PriorityClassName is the name of a k8s PriorityClass resource (kubectl get priorityclass).
    PriorityClassName *string `json:"priorityClassName,omitempty"`
  
    Queue *string `json:"queue,omitempty"`
}

2. All operators use the common operator, because tf, pytorch, and mxnet are similar. The bad news is that MPI may need more changes.

Advantages

Unify runPolicy and the imported packages across all operators.

Frameworks Support

pytorch

mxnet

mpi

tensorflow

Rough API Spec (pytorch-operator)

apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-gloo"
spec:
  priorityClassName: high
  queue: default
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: gcr.io/<your_project>/pytorch_dist_mnist:latest
              args: ["--backend", "gloo"]
              # Comment out the below resources to use the CPU.
              resources: 
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers: 
            - name: pytorch
              image: gcr.io/<your_project>/pytorch_dist_mnist:latest
              args: ["--backend", "gloo"]
              # Comment out the below resources to use the CPU.
              resources: 
                limits:
                  nvidia.com/gpu: 1

[feature] Add reconciler.v1 package along with controller.v1

The common repo offers a controller.v1 package, which is designed for the low-level controller mode in kubebuilder. As we are working on a unified controller, it seems a reconciler.v1 package would help future developers code in the high-level reconciler mode.

I would like to take a stab at this if developers think such a reconciler.v1 package would be helpful.

There will be some code overlap between the controller and reconciler packages, which we can extract into a separate package so that it can be reused.
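A minimal sketch of what a reconciler.v1 entry point could look like, assuming controller-runtime's high-level reconciler mode (the type and package names here are placeholders, not a committed API):

package reconciler

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// JobReconciler is a placeholder for the shared reconciler that the proposed
// reconciler.v1 package would provide; operators would embed it and plug in
// their own job type.
type JobReconciler struct {
	client.Client
}

// Reconcile sketches the high-level flow; the shared expectation, pod, and
// service helpers would be extracted from controller.v1 and reused here.
func (r *JobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// 1. Fetch the job object named by req.NamespacedName.
	// 2. Reconcile pods and services for each replica type.
	// 3. Update the job status through the shared status helpers.
	return ctrl.Result{}, nil
}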

Delete direct dependency on kubernetes

go.mod currently has a dependency on kubernetes v1.11.2.
This makes it hard to import job_controller/pkg/api because of compatibility issues with newer versions of client-go.

Proposal: Add Error JobConditionType to reflect controller error (Resource Quota Error) into the status

Problem: There is currently no JobConditionType constant that indicates an error happened in the controller. This causes problems, e.g. when pod creation fails because it exceeds the resource quota. Take TFJob as an example (https://github.com/kubeflow/tf-operator/blob/5adee6f30c86484897db33188af591d5976d1cd2/pkg/control/pod_control.go#L138): if pod creation exceeds the resource quota, the Create func returns an error, but this information is only exposed to the higher-level jobs as an event.

Solution: Add an Error JobConditionType to the kubeflow common API, and surface pod creation errors (or other kinds of errors if necessary) in the JobStatus, e.g. in the updateStatusSingle func for TFJob (https://github.com/kubeflow/tf-operator/blob/5adee6f30c86484897db33188af591d5976d1cd2/pkg/controller.v1/tensorflow/status.go#L61). This Error status would indicate a retriable error that happened in the controller.
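A hedged sketch of the API addition, assuming it goes next to the existing JobConditionType constants in the common v1 package (the constant name is only a suggestion):

package v1

// JobError is a proposed addition to the existing JobConditionType constants:
// it means the controller hit a retriable error while managing the job, e.g.
// pod creation was rejected because it exceeds the resource quota. Details
// would go into the condition's Reason and Message fields.
const JobError JobConditionType = "Error"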

[feature] Support unit tests by providing handlers for updateStatus and DeleteJob

	// TODO(ChanYiLin): these are originally for testing, but with the common library in use,
	// we cannot replace the functions. Also need to update or remove some tests.

	// tc.updateStatusHandler = tc.UpdateJobStatusInApiServer
	// set delete handler.
	// tc.deleteTFJobHandler = tc.DeleteJob

It would be better to have such handlers so they can be replaced when running test cases.
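A hedged sketch of what such injectable handlers could look like (field and type names are illustrative, not the existing JobController API); tests would swap the handlers for fakes while production code keeps the defaults:

package sketch

import (
	apiv1 "github.com/kubeflow/common/pkg/apis/common/v1"
)

// testableJobController illustrates the idea: the status-update and job-delete
// paths go through replaceable function fields instead of being called directly.
type testableJobController struct {
	// UpdateStatusHandler would default to UpdateJobStatusInApiServer.
	UpdateStatusHandler func(job interface{}, jobStatus *apiv1.JobStatus) error
	// DeleteJobHandler would default to ControllerInterface.DeleteJob.
	DeleteJobHandler func(job interface{}) error
}

// newFakeController shows how a unit test could replace the handlers with no-ops.
func newFakeController() *testableJobController {
	return &testableJobController{
		UpdateStatusHandler: func(job interface{}, jobStatus *apiv1.JobStatus) error { return nil },
		DeleteJobHandler:    func(job interface{}) error { return nil },
	}
}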

/cc @ChanYiLin

[feature] Support gang scheduling configuration in common

/cc @k82cn

We now support gang scheduling by adding a CLI flag to enable it, but we do not expose the configuration to users, so there have been complaints that we do not support queue, minResources, and some other PodGroup fields.

I think we should support these:

type SchedulingPolicy struct {
	MinAvailable *int32 `json:"minAvailable,omitempty"`
+      Queue *string
+      MinResources *v1.ResourceList
+      PriorityClass *string
}

/assign

Unified training operator working progress

@zw0610 and I presented the all-in-one training operator proposal at last month's community meeting.

The WG Training leads have already agreed to move forward. This issue is created to track implementation progress. The desired alpha release for this new unified operator is Kubeflow 1.4.

Configuration and deployment

Description | Category | Status | Issue
Kustomize package | Required | Done |
Application CR | Required | Not Done |
Images listed in kustomization.yaml | Required | Not Done |
Upgradeability | Required | Not Done |
Separate cluster scoped and namespace scoped resources | Recommended | Not Done | N/A
Kustomize package should be deployable on its own | Recommended | Done | Need to coordinate with 1.4 release

Custom Resources

Description | Category | Status | Issue
Version stability | Required | Not Done |
Backward compatibility | Required | Not Done |
Supports status subresource | Required | Done | All jobs have status to reflect the real status
CRD schema validation | Required | Not Done |
Training operators follow kubeflow/common conventions | Required | Done | kubeflow/training-operator#1296 kubeflow/training-operator#1295 kubeflow/training-operator#1294 kubeflow/training-operator#1293

Observability

Description | Category | Status | Issue
Liveness/Readiness signals | Required | Not Done |
Prometheus metrics and Graphs | Required | Not Done |
Job Events | Required | Not Done |
Json logging | Recommended | Not Done |

CI/CD

Description | Category | Status | Issue
E2E tests | Required | Not Done |
Scalability / load testing | Required | Not Done |
Continuous building of docker images | Recommended | Not Done | kubeflow/testing#951
Continuous updating of Kustomize manifests | Recommended | Not Done | This is not valid anymore: kubeflow/manifests will fetch the repo's kustomize manifest

Docs

Description | Category | Status | Issue
API Reference docs | Required | Not Done |
Application docs | Required | Not Done |

Owners/Maintenance

Description | Category | Explanation | Status | Issue
Healthy number of committers and commits | Required | Committers are listed as approvers in OWNERS files; number to be determined by TOC based on size and scope of application | Not Done |
At least 2 different organizations are committers | Required | | Not Done |

Adoption

Description | Category | Explanation | Status | Issue
List of users running the application | Recommended | Suggest listing adopters willing to be identified publicly in ADOPTERS.md | Not Done |

[Feature Request] Provide a better common logger

Currently, kubeflow/common provides a logger util for operators to use. It uses github.com/sirupsen/logrus underneath.

https://github.com/kubeflow/common/blob/master/pkg/util/logger.go#L20

The problem I notice is that new operators built with kubebuilder have a built-in logger for the operator logic.
This makes the logs really messy.

Can we provide an interface layer like logr, with a method like SetLogger, so that each operator can use its own implementation?
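A hedged sketch of the interface layer, assuming logr (a version that provides logr.Discard); the names below are illustrative rather than a committed API. An operator's main could then call SetLogger once at startup with the logger that controller-runtime already configured (e.g. ctrl.Log or a zap logger).

package util

import (
	"github.com/go-logr/logr"
)

// log is the package-level logger the common library would write to; it
// defaults to a no-op logger so importing common never changes log output.
var log logr.Logger = logr.Discard()

// SetLogger lets each operator inject its own logr implementation
// (e.g. the zap logger wired up by kubebuilder / controller-runtime).
func SetLogger(l logr.Logger) {
	log = l
}

// LoggerForJob returns a logger scoped to a specific job.
func LoggerForJob(namespace, name string) logr.Logger {
	return log.WithValues("namespace", namespace, "job", name)
}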

Difference between kubeflow/common and kubeflow/tf-operator/pkg/common

Comparing ControllerInterface between the two projects, tf-operator's is pretty clean, while kubeflow/common adds extra methods to create Jobs/Pods/Services.

GetPodsForJob(job interface{}) ([]*v1.Pod, error)
// GetServicesForJob returns the services managed by the job. This can be achieved by selecting services using label key "job-name"
// i.e. all services created by the job will come with label "job-name" = <this_job_name>
GetServicesForJob(job interface{}) ([]*v1.Service, error)
// DeleteJob deletes the job
DeleteJob(job interface{}) error
// UpdateJobStatus updates the job status and job conditions
UpdateJobStatus(job interface{}, replicas map[ReplicaType]*ReplicaSpec, jobStatus *JobStatus) error
// UpdateJobStatusInApiServer updates the job status in API server
UpdateJobStatusInApiServer(job interface{}, jobStatus *JobStatus) error
// CreateService creates the service
CreateService(job interface{}, service *v1.Service) error
// DeleteService deletes the service
DeleteService(job interface{}, name string, namespace string) error
// CreatePod creates the pod
CreatePod(job interface{}, pod *v1.Pod) error
// DeletePod deletes the pod
DeletePod(job interface{}, pod *v1.Pod) error

Besides that, we extracted a few interfaces for higher-level abstraction, which makes sense.

SetClusterSpec(job interface{}, podTemplate *v1.PodTemplateSpec, rtype, index string) error
// Returns the default container name in pod
GetDefaultContainerName() string
// Get the default container port name
GetDefaultContainerPortName() string
// Get the default container port number
GetDefaultContainerPortNumber() int32
// Returns if this replica type with index specified is a master role.
// MasterRole pod will have "job-role=master" set in its label
IsMasterRole(replicas map[ReplicaType]*ReplicaSpec, rtype ReplicaType, index int) bool

On the JobController side, tf-operator has PodControl and ServiceControl, which are used to operate k8s objects. However, in #36, since the create/delete pod/service interfaces had been exposed, the community decided to remove PodControl and ServiceControl because they are Kubernetes controller internal code, and there was also a concern about the size of the dependency (it needs the Kubernetes code base).

Currently, the situation is a little bit odd.

  1. We cannot fully get rid of kubernetes/pkg/controller; Expectation still uses it here: https://github.com/kubeflow/common/blob/master/pkg/controller.v1/common/job_controller.go#L100. This is not part of client-go yet; see kubernetes/client-go#332.

I think we used dep to manage dependencies in the past, and I've migrated common to be go-module compatible. Ideally, we will move all operators to go modules soon, so the size of the dependencies is not that painful.

  2. The files https://github.com/kubeflow/common/blob/master/pkg/controller.v1/common/pod_control.go and https://github.com/kubeflow/common/blob/master/pkg/controller.v1/common/service_control.go are not used in kubeflow/common.

I think we only need one of the following three. The status in kubeflow/common is:

  • Create Pods/Services interface - exists and is being used
  • controller.ServiceControlInterface - deleted
  • copied ServiceControlInterface - still there but not being used; only tests use it
  3. Implementations of the following interface are very similar across the different operators, and I doubt we need each controller to repeat the same logic. Currently, only xgboost-operator is built on top of this package, and we plan to migrate tf, pytorch, and mxnet to follow the kubeflow/common fashion. I am trying to see whether any other operators built on top of common need to customize the logic of CreateService, etc. If it's all the same, it would be better to implement it inside JobController.

GetPodsForJob(job interface{}) ([]*v1.Pod, error)
// GetServicesForJob returns the services managed by the job. This can be achieved by selecting services using label key "job-name"
// i.e. all services created by the job will come with label "job-name" = <this_job_name>
GetServicesForJob(job interface{}) ([]*v1.Service, error)
// DeleteJob deletes the job
DeleteJob(job interface{}) error

CreateService(job interface{}, service *v1.Service) error
// DeleteService deletes the service
DeleteService(job interface{}, name string, namespace string) error
// CreatePod creates the pod
CreatePod(job interface{}, pod *v1.Pod) error
// DeletePod deletes the pod
DeletePod(job interface{}, pod *v1.Pod) error

PodControl/ServiceControl interfaces are duplicated with the JobController create/delete Pod/Service Interface

https://github.com/kubeflow/common/blob/master/job_controller/job_controller.go#L68-L77
https://github.com/kubeflow/common/blob/master/job_controller/job_controller.go#L111-L114

Problem
The two interfaces above are both used for the underlying create/delete Pod/Service implementations, which is duplicated.

Goal

  • We should keep only one of them,
  • and keep it in only one place, i.e. ControllerInterface.

Solution

  1. Originally I thought we could remove the Pod/ServiceControl interface, but some code relies on it, e.g. here it is used as a parameter. We could rewrite that code to remove the dependency, though.

  2. Keep the Pod/Service interface and move it to one place, i.e. the ControllerInterface. I also realized some implementations are directly copied from Kubernetes utils; I wonder why we don't just use those.

Opinions?
@richardsliu @gaocegege @terrytangyuan

Reorganize the API package

Right now the user-facing APIs are scattered across different packages; we should reorganize them into a central api package.
A proposal looks like the following:


  1. types.go
  2. [new file] interface.go: the ControllerInterface for custom operators to override
  3. [new file] controller.go: the JobController struct, which contains the ReconcileJobs entry method
  4. [new file] constants.go: constants such as the label keys job-name, replica-index, etc.

Change expectation package log level to debug

The expectation log level is Info now, which is kind of noisy: we see at least two messages per pod/service event for setting up or lowering the expectations.

It would be better to change it to debug level.

INFO[0011] need to create new service: worker-1          job=default.xgboost-dist-iris-test-train replica-type=worker uid=467fb419-962b-11ea-b670-026fb543c17c
INFO[0011] Setting expectations &expectation.ControlleeExpectations{add:1, del:0, key:"default/xgboost-dist-iris-test-train/worker/services", timestamp:time.Time{wall:0xbfa78c778e0631d8, ext:11414609468, loc:(*time.Location)(0x313bcc0)}}
INFO[0011] Update on create function xgboostjob-operator create object xgboost-dist-iris-test-train-worker-1
INFO[0011] Lowered expectations &expectation.ControlleeExpectations{add:-2, del:0, key:"default/xgboost-dist-iris-test-train/worker/pods", timestamp:time.Time{wall:0xbfa78c778a2968f0, ext:11349809202, loc:(*time.Location)(0x313bcc0)}}
INFO[0011] Controller xgboost-dist-iris-test-train created service xgboost-dist-iris-test-train-worker-1
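A minimal sketch, assuming the expectation package keeps using logrus and the helper below is purely illustrative: the bookkeeping messages move from Infof to Debugf so they only show up when debug logging is enabled.

package expectation

import (
	log "github.com/sirupsen/logrus"
)

// logExpectationChange is a hypothetical helper showing the proposed level:
// expectation bookkeeping is logged at Debug instead of Info.
func logExpectationChange(action, key string, add, del int64) {
	log.Debugf("%s expectations for %s (add: %d, del: %d)", action, key, add, del)
}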

Allow modification of controller's replica index/type labels

Per comment in kubeflow/training-operator#1171 (comment), given that each operator may have different labels, we should discuss whether we want to allow each operator's controller that implements JobController to change ReplicaIndexLabel and ReplicaTypeLabel. For example, something like the following for tf-operator:

func (jc *JobController) GetReplicaTypeLabelKey() string {
	return tfReplicaTypeLabel
}

cc @gaocegege @Jeffwan @ChanYiLin @johnugeorge

Proposal: using gang scheduling API for generic distributed training support in Kubeflow

Problem

Currently in Kubeflow, we have a controller per framework (e.g. TF-Job and PyTorch-Operator), and to support a new framework, the message we are giving is that users have to write a new controller. This is a lot of friction for data scientists, who most likely don't know Golang and K8s. Even if they do, getting a new version of a controller deployed in a corporate cluster is not easy.

Proposed Solution

However, in reality users don't actually have to write a new controller if they have a generic gang scheduling API; in fact, the TF-Job controller already exposes a restricted version of such an API that works for almost all use cases. The Google AI Platform team implemented distributed PyTorch and XGBoost jobs using the TF-Job API for the Google AI Hub. So if we create a controller for gang scheduling, it will make it easy to add support for new frameworks.

Advantages

Less effort to support a new framework (users don't need K8s or Golang expertise).
A better story for portability between Kubeflow and other platforms like Mesos: the same container can be used on other platforms without any changes.

Other infras that support some version of gang scheduling API

Frameworks Support

From my understanding, distributed training for the following frameworks can be implemented easily using just a generic gang scheduling API.

  • TensorFlow
  • Horovod
  • PyTorch
  • XGBoost
  • Julia
  • LightGBM

Rough API Spec

Almost the same as the current tf-job spec, but with more generic names and a generalized number of worker groups.

apiVersion: kubeflow.org/v1beta1
kind: GangJob
metadata:
  generateName: gangjob
  namespace: kubeflow
spec:
  replicaSpecs:
    WorkerGroup1:
      replicas: 4
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: 
            image: 
            command:
    WorkerGroup2:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: 
            image: 
            command:
    .
    .
    .
    WorkerGroupN:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: 
            image: 
            command:

Supplement more information about the tfjob into the podgroup when `enable-gang-scheduler` is set to `true` in tf-operator

When enable-gang-scheduler=true, tf-operator creates a PodGroup CR so that the gang scheduler volcano can allocate the pods. But when creating the pod group in func SyncPodGroup:

func (jc *JobController) SyncPodGroup(job metav1.Object, minAvailableReplicas int32) (*v1beta1.PodGroup, error) {

	createPodGroup := &v1beta1.PodGroup{
		ObjectMeta: metav1.ObjectMeta{
			Name: job.GetName(),
			OwnerReferences: []metav1.OwnerReference{
				*jc.GenOwnerReference(job),
			},
		},
		Spec: v1beta1.PodGroupSpec{
			MinMember: minAvailable.IntVal,
		},
	}
	createdPodGroup, err := volcanoClientSet.SchedulingV1beta1().PodGroups(job.GetNamespace()).Create(createPodGroup)

Only the pod info and minMember are set in the podgroup, which results in missing functionality as well as unpredictable bugs during allocation.

For example, since no minResources field is filled in the podgroup, the gang scheduler volcano cannot distinguish tfjobs from best-effort jobs, as both have nil minResources; this lets all tfjobs be enqueued and makes the enqueue and reserve actions lose their effect.

So in my opinion, we need to supplement more info about the tfjob into the podgroup, such as minMember, minResources, queue, and other fields, so as to make sure the gang scheduler works correctly.
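A hedged sketch of a richer PodGroup, assuming volcano's scheduling v1beta1 API (the helper and its parameters are illustrative; queue and priorityClassName would come from the job's SchedulingPolicy, and minResources from summing the replicas' resource requests):

package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"volcano.sh/apis/pkg/apis/scheduling/v1beta1"
)

// buildPodGroup fills in the fields the gang scheduler needs beyond MinMember.
func buildPodGroup(job metav1.Object, owner metav1.OwnerReference, minMember int32,
	queue, priorityClass string, minResources *corev1.ResourceList) *v1beta1.PodGroup {
	return &v1beta1.PodGroup{
		ObjectMeta: metav1.ObjectMeta{
			Name:            job.GetName(),
			Namespace:       job.GetNamespace(),
			OwnerReferences: []metav1.OwnerReference{owner},
		},
		Spec: v1beta1.PodGroupSpec{
			MinMember:         minMember,
			Queue:             queue,
			PriorityClassName: priorityClass,
			MinResources:      minResources,
		},
	}
}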
