
common's Introduction


Kubeflow is the cloud-native platform for machine learning operations: pipelines, training, and deployment.


Documentation

Please refer to the official docs at kubeflow.org.

Working Groups

The Kubeflow community is organized into working groups (WGs), each with associated repositories, that focus on specific pieces of the ML platform.

Quick Links

Get Involved

Please refer to the Community page.


common's Issues

Enhance maintainability of operator common module

There are a few issues in the current project, and I propose some improvements for them.
I am working on a PR; please share your insights and feedback.

  1. ./hack/update-codegen.sh doesn't work with the current code structure. It tries to generate outputs for operator:v1, but the project only has a job_controller/api/v1 directory, which doesn't match.
    github.com/kubeflow/common/client github.com/kubeflow/common \
    operator:v1 \

Either update-codegen.sh or the code structure needs to be changed.

  2. There is no ./hack/verify-codegen.sh verification step in the CI.

  3. The code-generator version referenced in the project doesn't work. The minimum version that supports go modules is 0.15.x.
    https://github.com/kubeflow/common/blob/88a0cd071cf162120aa3ea0c3cb9cc64d69ef3f5/hack/update-codegen.sh#L25

  4. Upgrade the Kubernetes dependency.
    The common project uses the expectations mechanism from k8s.io/kubernetes, which gives it a direct dependency on that repository:

    Expectations controller.ControllerExpectationsInterface

    We should consider removing the dependency on k8s.io/kubernetes because it is not primarily intended to be consumed as a module; only the published subcomponents are. Reference: #48

Another challenge is that most new training operators use kubebuilder v2 or the latest operator-sdk to generate APIs. These frameworks depend on the latest k8s.io/code-generator and k8s.io/client-go, which conflict with the versions that k8s.io/[email protected] uses. There are some breaking changes users have to resolve.

It would be better to upgrade all k8s.io/xx dependencies to 1.15+ and make sure our common package is compatible with the newer client-go.

  5. The code structure is not that clear. The apis package currently lives under job_controller, and I think we plan to keep ControllerInterface and the common code under that package. As far as I know, some users only consume APIs like CleanupPolicy or RestartPolicy. It would be great to separate the API definitions and the job controller implementation into different folders.

For test_job, it seems this is only used for testing, so I think it's fine to put its apis and client directly under pkg. I am also OK with a separate top-level directory like test_job to host the APIs. Currently we have this code structure for test_job:

├── client
│   ├── clientset
│   │   └── versioned
│   ├── informers
│   │   └── externalversions
│   └── listers
│          └── test_job
├── test_job
│   ├── v1

Based on the discussion above, could we create a project structure like this:


├── LICENSE
├── OWNERS
├── README.md
├── go.mod
├── go.sum
├── hack
│   ├── boilerplate
│   ├── scripts
│   ├── update-codegen.sh
│   └── verify-codegen.sh
├── pkg
│   ├── apis
│   │   └── common
│   │       └── v1              ---> common API 
│   │   └── test_job
│   │       └── v1             ----> test Job API
│   ├── client                   ----> test Job client
│   │   ├── clientset
│   │   │   └── versioned
│   │   ├── informers
│   │   │   └── externalversions
│   │   └── listers
│   │       └── test_job
│   ├── job_controller
│   │   ├── job.go
│   │   ├── job_controller.go
│   │   ├── job_test.go
│   │   ├── pod.go
│   │   ├── pod_control.go
│   │   ├── pod_control_test.go
│   │   ├── pod_test.go
│   │   ├── service.go
│   │   ├── service_control.go
│   │   ├── service_control_test.go
│   │   ├── service_ref_manager.go
│   │   ├── service_ref_manager_test.go
│   │   ├── status.go
│   │   ├── status_test.go
│   │   ├── test_job_controller.go
│   │   ├── util.go
│   │   └── util_test.go
│   ├── test_util                          ----> I doubt some of these utils are still being used; needs more cleanup.
│   │   ├── const.go..
│   └── util
│       ├── k8sutil
│       ├── logger.go
│       ├── signals
│       ├── status.go
│       ├── status_test.go
│       ├── train
│       └── util.go


Other operators can then import the API and controller packages like the following:

import (
	"github.com/kubeflow/common/pkg/apis/common/v1"
	"github.com/kubeflow/common/pkg/job_controller"
)

Please leave comments; I will file the PR if this makes sense.

/cc @terrytangyuan @gaocegege @jian-he @richardsliu

Deprecate glog and use logger in commonutil instead

glog has been deprecated in the Kubernetes community for a few reasons.
I notice it is still being used in the common project; it would be better to clean up all references and migrate to the existing logging solution.

/improvement

Make training container port optional

Instead of using a fixed port to communicate with each other, some frameworks let each worker pick a random port and use auto-discovery against a consensus store to find the full topology.

Currently, common forces users to implement GetDefaultContainerPortName and GetDefaultContainerPortNumber and uses them in the following logic.

if port.Name == jc.Controller.GetDefaultContainerPortName(){

If the job doesn't define a port, this returns an error.

To support the case described above, we can make the port optional.
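A minimal sketch of the idea, assuming a hypothetical helper next to the existing port lookup in the job_controller package (the function name and wiring are illustrative, not the current API): when the default container declares no matching port, return nil instead of an error so callers can skip port-dependent setup.

package job_controller

import (
	apiv1 "github.com/kubeflow/common/pkg/apis/common/v1"
)

// getDefaultPortIfPresent is a hypothetical helper: it returns the default
// container's port if one is declared, and nil (no error) otherwise, making
// the port optional for frameworks that discover ports at runtime.
func getDefaultPortIfPresent(spec *apiv1.ReplicaSpec, containerName, portName string) *int32 {
	for _, container := range spec.Template.Spec.Containers {
		if container.Name != containerName {
			continue
		}
		for _, port := range container.Ports {
			if port.Name == portName {
				p := port.ContainerPort
				return &p
			}
		}
	}
	// No matching port found: treat the port as optional instead of erroring.
	return nil
}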

Project cannot compile with kubeflow/common v0.3.3 and client-go v0.19

I used operator-sdk (v1.5.0) and this repo to scaffold a project.
go.mod:

require (
	github.com/go-logr/logr v0.3.0
	github.com/kubeflow/common v0.3.3
	github.com/onsi/ginkgo v1.14.1
	github.com/onsi/gomega v1.10.2
	k8s.io/api v0.19.2
	k8s.io/apimachinery v0.19.2
	k8s.io/client-go v0.19.2
	sigs.k8s.io/controller-runtime v0.7.2
)

But I hit errors when I try to compile:

/Users/xxxx/test-operator/bin/controller-gen object:headerFile="hack/boilerplate.go.txt" paths="./..."
go fmt ./...
go vet ./...
# volcano.sh/apis/pkg/client/clientset/versioned/typed/bus/v1alpha1
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/bus/v1alpha1/command.go:72:5: not enough arguments in call to c.client.Get().Namespace(c.ns).Resource("commands").Name(name).VersionedParams(&options, scheme.ParameterCodec).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/bus/v1alpha1/command.go:89:5: not enough arguments in call to c.client.Get().Namespace(c.ns).Resource("commands").VersionedParams(&opts, scheme.ParameterCodec).Timeout(timeout).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/bus/v1alpha1/command.go:106:8: not enough arguments in call to c.client.Get().Namespace(c.ns).Resource("commands").VersionedParams(&opts, scheme.ParameterCodec).Timeout(timeout).Watch
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/bus/v1alpha1/command.go:116:5: not enough arguments in call to c.client.Post().Namespace(c.ns).Resource("commands").Body(command).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/bus/v1alpha1/command.go:129:5: not enough arguments in call to c.client.Put().Namespace(c.ns).Resource("commands").Name(command.ObjectMeta.Name).Body(command).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/bus/v1alpha1/command.go:141:5: not enough arguments in call to c.client.Delete().Namespace(c.ns).Resource("commands").Name(name).Body(options).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/bus/v1alpha1/command.go:157:5: not enough arguments in call to c.client.Delete().Namespace(c.ns).Resource("commands").VersionedParams(&listOptions, scheme.ParameterCodec).Timeout(timeout).Body(options).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/bus/v1alpha1/command.go:170:5: not enough arguments in call to c.client.Patch(pt).Namespace(c.ns).Resource("commands").SubResource(subresources...).Name(name).Body(data).Do
        have ()
        want (context.Context)
# volcano.sh/apis/pkg/client/clientset/versioned/typed/batch/v1alpha1
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/batch/v1alpha1/job.go:73:5: not enough arguments in call to c.client.Get().Namespace(c.ns).Resource("jobs").Name(name).VersionedParams(&options, scheme.ParameterCodec).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/batch/v1alpha1/job.go:90:5: not enough arguments in call to c.client.Get().Namespace(c.ns).Resource("jobs").VersionedParams(&opts, scheme.ParameterCodec).Timeout(timeout).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/batch/v1alpha1/job.go:107:8: not enough arguments in call to c.client.Get().Namespace(c.ns).Resource("jobs").VersionedParams(&opts, scheme.ParameterCodec).Timeout(timeout).Watch
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/batch/v1alpha1/job.go:117:5: not enough arguments in call to c.client.Post().Namespace(c.ns).Resource("jobs").Body(job).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/batch/v1alpha1/job.go:130:5: not enough arguments in call to c.client.Put().Namespace(c.ns).Resource("jobs").Name(job.ObjectMeta.Name).Body(job).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/batch/v1alpha1/job.go:146:5: not enough arguments in call to c.client.Put().Namespace(c.ns).Resource("jobs").Name(job.ObjectMeta.Name).SubResource("status").Body(job).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/batch/v1alpha1/job.go:158:5: not enough arguments in call to c.client.Delete().Namespace(c.ns).Resource("jobs").Name(name).Body(options).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/batch/v1alpha1/job.go:174:5: not enough arguments in call to c.client.Delete().Namespace(c.ns).Resource("jobs").VersionedParams(&listOptions, scheme.ParameterCodec).Timeout(timeout).Body(options).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/batch/v1alpha1/job.go:187:5: not enough arguments in call to c.client.Patch(pt).Namespace(c.ns).Resource("jobs").SubResource(subresources...).Name(name).Body(data).Do
        have ()
        want (context.Context)
# volcano.sh/apis/pkg/client/clientset/versioned/typed/scheduling/v1beta1
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/scheduling/v1beta1/podgroup.go:73:5: not enough arguments in call to c.client.Get().Namespace(c.ns).Resource("podgroups").Name(name).VersionedParams(&options, scheme.ParameterCodec).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/scheduling/v1beta1/podgroup.go:90:5: not enough arguments in call to c.client.Get().Namespace(c.ns).Resource("podgroups").VersionedParams(&opts, scheme.ParameterCodec).Timeout(timeout).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/scheduling/v1beta1/podgroup.go:107:8: not enough arguments in call to c.client.Get().Namespace(c.ns).Resource("podgroups").VersionedParams(&opts, scheme.ParameterCodec).Timeout(timeout).Watch
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/scheduling/v1beta1/podgroup.go:117:5: not enough arguments in call to c.client.Post().Namespace(c.ns).Resource("podgroups").Body(podGroup).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/scheduling/v1beta1/podgroup.go:130:5: not enough arguments in call to c.client.Put().Namespace(c.ns).Resource("podgroups").Name(podGroup.ObjectMeta.Name).Body(podGroup).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/scheduling/v1beta1/podgroup.go:146:5: not enough arguments in call to c.client.Put().Namespace(c.ns).Resource("podgroups").Name(podGroup.ObjectMeta.Name).SubResource("status").Body(podGroup).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/scheduling/v1beta1/podgroup.go:158:5: not enough arguments in call to c.client.Delete().Namespace(c.ns).Resource("podgroups").Name(name).Body(options).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/scheduling/v1beta1/podgroup.go:174:5: not enough arguments in call to c.client.Delete().Namespace(c.ns).Resource("podgroups").VersionedParams(&listOptions, scheme.ParameterCodec).Timeout(timeout).Body(options).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/scheduling/v1beta1/podgroup.go:187:5: not enough arguments in call to c.client.Patch(pt).Namespace(c.ns).Resource("podgroups").SubResource(subresources...).Name(name).Body(data).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/scheduling/v1beta1/queue.go:70:5: not enough arguments in call to c.client.Get().Resource("queues").Name(name).VersionedParams(&options, scheme.ParameterCodec).Do
        have ()
        want (context.Context)
../../../go/pkg/mod/volcano.sh/[email protected]/pkg/client/clientset/versioned/typed/scheduling/v1beta1/queue.go:70:5: too many errors
# github.com/kubeflow/common/pkg/controller.v1/control
../../../go/pkg/mod/github.com/kubeflow/[email protected]/pkg/controller.v1/control/pod_control.go:113:55: not enough arguments in call to r.KubeClient.CoreV1().Pods(namespace).Patch
        have (string, types.PatchType, []byte)
        want (context.Context, string, types.PatchType, []byte, "k8s.io/apimachinery/pkg/apis/meta/v1".PatchOptions, ...string)
../../../go/pkg/mod/github.com/kubeflow/[email protected]/pkg/controller.v1/control/pod_control.go:149:64: not enough arguments in call to r.KubeClient.CoreV1().Pods(namespace).Create
        have (*"k8s.io/api/core/v1".Pod)
        want (context.Context, *"k8s.io/api/core/v1".Pod, "k8s.io/apimachinery/pkg/apis/meta/v1".CreateOptions)
../../../go/pkg/mod/github.com/kubeflow/[email protected]/pkg/controller.v1/control/pod_control.go:170:55: not enough arguments in call to r.KubeClient.CoreV1().Pods(namespace).Get
        have (string, "k8s.io/apimachinery/pkg/apis/meta/v1".GetOptions)
        want (context.Context, string, "k8s.io/apimachinery/pkg/apis/meta/v1".GetOptions)
../../../go/pkg/mod/github.com/kubeflow/[email protected]/pkg/controller.v1/control/pod_control.go:182:56: not enough arguments in call to r.KubeClient.CoreV1().Pods(namespace).Delete
        have (string, nil)
        want (context.Context, string, "k8s.io/apimachinery/pkg/apis/meta/v1".DeleteOptions)
../../../go/pkg/mod/github.com/kubeflow/[email protected]/pkg/controller.v1/control/service_control.go:68:59: not enough arguments in call to r.KubeClient.CoreV1().Services(namespace).Patch
        have (string, types.PatchType, []byte)
        want (context.Context, string, types.PatchType, []byte, "k8s.io/apimachinery/pkg/apis/meta/v1".PatchOptions, ...string)
../../../go/pkg/mod/github.com/kubeflow/[email protected]/pkg/controller.v1/control/service_control.go:93:69: not enough arguments in call to r.KubeClient.CoreV1().Services(namespace).Create
        have (*"k8s.io/api/core/v1".Service)
        want (context.Context, *"k8s.io/api/core/v1".Service, "k8s.io/apimachinery/pkg/apis/meta/v1".CreateOptions)
../../../go/pkg/mod/github.com/kubeflow/[email protected]/pkg/controller.v1/control/service_control.go:116:63: not enough arguments in call to r.KubeClient.CoreV1().Services(namespace).Get
        have (string, "k8s.io/apimachinery/pkg/apis/meta/v1".GetOptions)
        want (context.Context, string, "k8s.io/apimachinery/pkg/apis/meta/v1".GetOptions)
../../../go/pkg/mod/github.com/kubeflow/[email protected]/pkg/controller.v1/control/service_control.go:128:60: not enough arguments in call to r.KubeClient.CoreV1().Services(namespace).Delete
        have (string, nil)
        want (context.Context, string, "k8s.io/apimachinery/pkg/apis/meta/v1".DeleteOptions)
# github.com/kubeflow/common/pkg/util/k8sutil
../../../go/pkg/mod/github.com/kubeflow/[email protected]/pkg/util/k8sutil/client.go:77:19: not enough arguments in call to r.DoRaw
        have ()
        want (context.Context)
../../../go/pkg/mod/github.com/kubeflow/[email protected]/pkg/util/k8sutil/client.go:91:19: not enough arguments in call to r.DoRaw
        have ()
        want (context.Context)

How can I solve this problem?

PodGroup sync loop is missing

Type of replicaType is not consistent in common project

Is there a reason why the type of replicaType is not consistent across the common project?

e.g.

SetClusterSpec(job interface{}, podTemplate *v1.PodTemplateSpec, rtype, index string) error

vs

IsMasterRole(replicas map[ReplicaType]*ReplicaSpec, rtype ReplicaType, index int) bool

I think it would be a good idea to define replicaType as commonv1.ReplicaType everywhere.
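As a hedged sketch of what the unified signatures could look like (assuming the types from the common v1 API package), both methods would take commonv1.ReplicaType instead of a bare string:

package sketch

import (
	commonv1 "github.com/kubeflow/common/pkg/apis/common/v1"
	corev1 "k8s.io/api/core/v1"
)

// controllerFragment shows only the two methods under discussion with a
// consistent rtype parameter; it is not the full ControllerInterface.
type controllerFragment interface {
	SetClusterSpec(job interface{}, podTemplate *corev1.PodTemplateSpec, rtype commonv1.ReplicaType, index string) error
	IsMasterRole(replicas map[commonv1.ReplicaType]*commonv1.ReplicaSpec, rtype commonv1.ReplicaType, index int) bool
}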

Proposal for exposing generic prometheus metrics in common operator

Proposal

Add generic metrics (jobs/pods/...) to the common operator so they can be directly enabled and used by operators built on top of it.

Motivation

To track job-level metrics, we currently need to add Prometheus metric code inside each job operator. For example, to know how many TFJobs were created in the last hour, we need to add a Counter inside tf-operator. This request is very common and is needed across operators. As we're moving shared code to the common operator, we could also add the metric-related code there, so it can be used by all operators built on top of the common one.

Details

For metric definition and registration, we will add a new metrics folder and define all metrics there. Some preliminary metrics include the number of jobs/pods/services created, durations of various operations, etc.
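A hedged sketch of what such definitions could look like, using the standard Prometheus Go client (metric names, labels, and function names here are illustrative, not a final naming scheme):

package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Preliminary counters; promauto registers them with the default registry.
var (
	createdJobsCount = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "kubeflow_common_jobs_created_total",
		Help: "Total number of jobs created, partitioned by job kind and namespace.",
	}, []string{"kind", "namespace"})

	createdPodsCount = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "kubeflow_common_pods_created_total",
		Help: "Total number of pods created on behalf of jobs.",
	}, []string{"kind", "namespace"})
)

// CreatedJobsCounterInc would be called from each operator's job creation path.
func CreatedJobsCounterInc(namespace, kind string) {
	createdJobsCount.WithLabelValues(kind, namespace).Inc()
}

// CreatedPodsCounterInc would be called from job_controller/pod.go when a pod is created.
func CreatedPodsCounterInc(namespace, kind string) {
	createdPodsCount.WithLabelValues(kind, namespace).Inc()
}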

For metrics updating:

  • For pods/services, we can directly add the related metric code inside job_controller/pod.go and job_controller/service.go.
  • For jobs, to track the numbers, we may need to watch the creation events, similar to controller watches.

As the common project is still under active development, some of the details discussed above may change later. Comments would be very much appreciated, @jlewi @richardsliu @gaocegege @jian-he.

Make training container port support multiple ports

Right now the common repo only supports a single port for workers to communicate with each other, but some scenarios need another port for profiling or other use cases.
Currently the GetPortFromJob func only returns a single port for the job.

func (jc *JobController) GetPortFromJob(spec *apiv1.ReplicaSpec) (*int32, error) {

By making the port configuration flexible, users can configure multiple ports, each with a name and number, for different roles.
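A hedged sketch of one possible shape (GetPortsFromJob is a hypothetical name, assumed to live in the same package that defines JobController and GetPortFromJob): return every named port of the default container instead of a single one.

package job_controller

import (
	"fmt"

	apiv1 "github.com/kubeflow/common/pkg/apis/common/v1"
)

// GetPortsFromJob is a hypothetical multi-port variant of GetPortFromJob: it
// returns every named port declared on the default container.
func (jc *JobController) GetPortsFromJob(spec *apiv1.ReplicaSpec) (map[string]int32, error) {
	containerName := jc.Controller.GetDefaultContainerName()
	for _, container := range spec.Template.Spec.Containers {
		if container.Name != containerName {
			continue
		}
		ports := make(map[string]int32, len(container.Ports))
		for _, port := range container.Ports {
			ports[port.Name] = port.ContainerPort
		}
		return ports, nil
	}
	return nil, fmt.Errorf("default container %q not found in replica spec", containerName)
}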
/assign @xieydd
/cc @gaocegege @terrytangyuan

Bump Kubernetes dependency to 1.16

I've seen this issue when I use common in other operators. The problem is that the volcano change brings in Kubernetes 1.16 and has a direct dependency on it, while at the same time we still replace the k8s packages with 1.15.10.

The build works, but if we decide to use 1.16 in other projects, the CRD client.go will hit incompatibility issues like the one below.

../../../../pkg/mod/github.com/kubeflow/[email protected]/pkg/util/k8sutil/client.go:50:32: undefined: serializer.DirectCodecFactory

Let's bump the version to 1.16 and make sure the dependencies are consistent.

Prow does not have the ability to list repo collaborators

/assign @jlewi

{
  component:  "hook"   
  error:  "status code 403 not one of [204 404 302], body: {"message":"Must have push access to view repository collaborators.","documentation_url":"https://developer.github.com/v3/repos/collaborators/#check-if-a-user-is-a-collaborator"}"   
  event-GUID:  "ca394c90-559d-11e9-945f-ee71934d44cc"   
  event-type:  "pull_request_review"   
  level:  "error"   
  msg:  "Failed to check if author is a collaborator."   
  org:  "kubeflow"   
  plugin:  "lgtm"   
  pr:  2   
  repo:  "common"   
  review:  221946237   
  reviewer:  "terrytangyuan"   
  url:  "https://github.com/kubeflow/common/pull/2#pullrequestreview-221946237"   
 }

[Release 1.1] training operator release

Opening this issue to track releasing the training operators for Kubeflow 1.1. I would like to add area OWNERS for @johnugeorge @terrytangyuan @Jeffwan.

It seems we want these items done before the 1.1 release. We may not finish all of them, but let's try.

Failed to fetch common module

The following command doesn't work.

$ go get github.com/kubeflow/common
go: finding github.com/kubeflow/common latest
go get: github.com/kubeflow/[email protected] requires
	k8s.io/[email protected] requires
	k8s.io/[email protected]: reading k8s.io/kubectl/go.mod at revision v0.0.0: unknown revision v0.0.0

I think this is because of #62.

I am using the stable version here for now and will fix this when I get time.

$ go get github.com/kubeflow/[email protected]
go: finding github.com/kubeflow/common v0.0.0-20200327002023-0b3a4c3fca85
go: downloading github.com/kubeflow/common v0.0.0-20200327002023-0b3a4c3fca85
go: extracting github.com/kubeflow/common v0.0.0-20200327002023-0b3a4c3fca85

Job controller doesn't reconcile scale down changes

If the user changes the replica count to a smaller value, the controller won't delete the now-redundant pods and services.

In the reconcile logic, we always size the slices based on the value of replicas.

podSlices := make([][]*v1.Pod, replicas)

Say we have pods with index 0, 1, 2 and the user changes replicas from 3 to 2. When we iterate to the pod with index 2, GetPodSlices only prints a warning log for this case.

if index < 0 || index >= replicas {

ReconcilePods never gets a chance to act on these pods, so the pod with index 2 is left behind; the loop only ever creates new pods:

err = jc.createNewPod(job, rt, strconv.Itoa(index), spec, masterRole, replicas)
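A rough sketch of one way to handle this, as a hypothetical helper inside the package that defines JobController (the helper itself does not exist today; apiv1.ReplicaIndexLabel is assumed to be the replica-index label key, and DeletePod follows the existing ControllerInterface): delete any pod whose replica index is at or beyond the desired replica count before building the pod slices.

package job_controller

import (
	"strconv"

	apiv1 "github.com/kubeflow/common/pkg/apis/common/v1"
	corev1 "k8s.io/api/core/v1"
)

// deleteOutOfRangePods is a hypothetical helper: pods whose replica-index label
// is >= the desired replica count are deleted via the ControllerInterface.
func (jc *JobController) deleteOutOfRangePods(job interface{}, pods []*corev1.Pod, replicas int) error {
	for _, pod := range pods {
		index, err := strconv.Atoi(pod.Labels[apiv1.ReplicaIndexLabel])
		if err != nil {
			// Skip pods without a parsable replica-index label.
			continue
		}
		if index >= replicas {
			if err := jc.Controller.DeletePod(job, pod); err != nil {
				return err
			}
		}
	}
	return nil
}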

Proposal: add priority and queue in scheduling for the common operator

Problem

1. Currently kube-batch has a PodGroupSpec that includes scheduling-policy fields such as MinAvailable, Queue, and PriorityClassName, but the Kubeflow operators don't provide these parameters to kube-batch today.

2. MPI-operator and tf-operator don't use the common operator, and pytorch-operator and mxnet-operator use the tf-operator/pkg/common package.

Proposed Solution

1. Supplement these attributes in RunPolicy.SchedulingPolicy. When Kubeflow is used together with kube-batch, Kubeflow can pass these parameters to kube-batch.

// SchedulingPolicy encapsulates various scheduling policies of the distributed training
// job, for example `minAvailable` for gang-scheduling.
type SchedulingPolicy struct {
    MinAvailable *int32 `json:"minAvailable,omitempty"`

    // PriorityClassName is the name of a k8s PriorityClass resource (kubectl get priorityclass).
    PriorityClassName *string `json:"priorityClassName,omitempty"`
  
    Queue *string `json:"queue,omitempty"`
}

2. All operators use the common operator, because tf, pytorch, and mxnet are similar. The bad news is that MPI may need more changes.

Advantages

Unify runPolicy and the imported packages across all operators.

Frameworks Support

pytorch

mxnet

mpi

tensorflow

Rough API Spec (pytorch-operator)

apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-gloo"
spec:
  priorityClassName: high
  queue: default
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: gcr.io/<your_project>/pytorch_dist_mnist:latest
              args: ["--backend", "gloo"]
              # Comment out the below resources to use the CPU.
              resources: 
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers: 
            - name: pytorch
              image: gcr.io/<your_project>/pytorch_dist_mnist:latest
              args: ["--backend", "gloo"]
              # Comment out the below resources to use the CPU.
              resources: 
                limits:
                  nvidia.com/gpu: 1

[feature] Add reconciler.v1 package along with controller.v1

The common repo offers a controller.v1 package, which is designed for the low-level controller mode in kubebuilder. As we are working on a unified controller, it seems a reconciler.v1 package would help future developers code in the high-level reconciler mode.

I would like to take a stab at this if developers think such a reconciler.v1 package would be helpful.

There will be some code overlap between the controller and reconciler packages, which we can extract into a separate package so that it can be reused.
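A minimal sketch of what a reconciler.v1 entry point could look like, assuming controller-runtime's high-level reconciler mode (the type and package names here are placeholders, not a committed API):

package reconciler

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// JobReconciler is a placeholder for the shared reconciler that the proposed
// reconciler.v1 package would provide; operators would embed it and plug in
// their own job type.
type JobReconciler struct {
	client.Client
}

// Reconcile sketches the high-level flow; the shared expectation, pod, and
// service helpers would be extracted from controller.v1 and reused here.
func (r *JobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// 1. Fetch the job object named by req.NamespacedName.
	// 2. Reconcile pods and services for each replica type.
	// 3. Update the job status through the shared status helpers.
	return ctrl.Result{}, nil
}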

Delete direct dependency on kubernetes

go.mod currently has a dependency on kubernetes v1.11.2.
This makes it hard to import job_controller/pkg/api because of compatibility issues with newer versions of client-go.

Proposal: Add Error JobConditionType to reflect controller error (Resource Quota Error) into the status

Problem: There is currently no JobConditionType constant that indicates an error happened in the controller. This causes problems, e.g. when pod creation fails because it exceeds the resource quota. Take TFJob as an example (https://github.com/kubeflow/tf-operator/blob/5adee6f30c86484897db33188af591d5976d1cd2/pkg/control/pod_control.go#L138): if pod creation exceeds the resource quota, the Create func returns an error, but this information is only exposed to the higher-level jobs as an event.

Solution: Add an Error JobConditionType to the kubeflow common API, and surface pod creation errors (or other kinds of errors if necessary) in the JobStatus, e.g. in the updateStatusSingle func for TFJob (https://github.com/kubeflow/tf-operator/blob/5adee6f30c86484897db33188af591d5976d1cd2/pkg/controller.v1/tensorflow/status.go#L61). This Error status would indicate a retriable error that happened in the controller.
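A hedged sketch of the API addition, assuming it goes next to the existing JobConditionType constants in the common v1 package (the constant name is only a suggestion):

package v1

// JobError is a proposed addition to the existing JobConditionType constants:
// it means the controller hit a retriable error while managing the job, e.g.
// pod creation was rejected because it exceeds the resource quota. Details
// would go into the condition's Reason and Message fields.
const JobError JobConditionType = "Error"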

[feature] Support unit tests by providing handlers for updateStatus and DeleteJob

	// TODO(ChanYiLin): these are originally for testing, but with the common library in use,
	// we cannot replace the functions. Also need to update or remove some tests.

	// tc.updateStatusHandler = tc.UpdateJobStatusInApiServer
	// set delete handler.
	// tc.deleteTFJobHandler = tc.DeleteJob

It would be better to have such handlers so they can be replaced when running test cases.
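A hedged sketch of what such injectable handlers could look like (field and type names are illustrative, not the existing JobController API); tests would swap the handlers for fakes while production code keeps the defaults:

package sketch

import (
	apiv1 "github.com/kubeflow/common/pkg/apis/common/v1"
)

// testableJobController illustrates the idea: the status-update and job-delete
// paths go through replaceable function fields instead of being called directly.
type testableJobController struct {
	// UpdateStatusHandler would default to UpdateJobStatusInApiServer.
	UpdateStatusHandler func(job interface{}, jobStatus *apiv1.JobStatus) error
	// DeleteJobHandler would default to ControllerInterface.DeleteJob.
	DeleteJobHandler func(job interface{}) error
}

// newFakeController shows how a unit test could replace the handlers with no-ops.
func newFakeController() *testableJobController {
	return &testableJobController{
		UpdateStatusHandler: func(job interface{}, jobStatus *apiv1.JobStatus) error { return nil },
		DeleteJobHandler:    func(job interface{}) error { return nil },
	}
}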

/cc @ChanYiLin

[feature] Support gang scheduling configuration in common

/cc @k82cn

We now support gang scheduling by adding a CLI flag to enable it, but we do not expose the configuration to users, so there have been complaints that we do not support queue, minResources, and some other PodGroup fields.

I think we should support these:

type SchedulingPolicy struct {
	MinAvailable *int32 `json:"minAvailable,omitempty"`
+      Queue *string
+      MinResources *v1.ResourceList
+      PriorityClass *string
}

/assign

Unified training operator working progress

@zw0610 and I presented the all-in-one training operator proposal at last month's community meeting.

The WG Training leads have already agreed to move forward. This issue is created to track implementation progress. The desired alpha release for this new unified operator is Kubeflow 1.4.

Configuration and deployment

Description | Category | Status | Issue
Kustomize package | Required | Done |
Application CR | Required | Not Done |
Images listed in kustomization.yaml | Required | Not Done |
Upgradeability | Required | Not Done |
Separate cluster scoped and namespace scoped resources | Recommended | Not Done | N/A
Kustomize package should be deployable on its own | Recommended | Done | Need to coordinate with 1.4 release

Custom Resources

Description | Category | Status | Issue
Version stability | Required | Not Done |
Backward compatibility | Required | Not Done |
Supports status subresource | Required | Done | All jobs have status to reflect the real status
CRD schema validation | Required | Not Done |
Training operators follow kubeflow/common conventions | Required | Done | kubeflow/training-operator#1296 kubeflow/training-operator#1295 kubeflow/training-operator#1294 kubeflow/training-operator#1293

Observability

Description | Category | Status | Issue
Liveness/Readiness signals | Required | Not Done |
Prometheus metrics and Graphs | Required | Not Done |
Job Events | Required | Not Done |
Json logging | Recommended | Not Done |

CI/CD

Description | Category | Status | Issue
E2E tests | Required | Not Done |
Scalability / load testing | Required | Not Done |
Continuous building of docker images | Recommended | Not Done | kubeflow/testing#951
Continuous updating of Kustomize manifests | Recommended | Not Done | This is not valid anymore: kubeflow/manifests will fetch the repo's kustomize manifest

Docs

Description | Category | Status | Issue
API Reference docs | Required | Not Done |
Application docs | Required | Not Done |

Owners/Maintenance

Description | Category | Explanation | Status | Issue
Healthy number of committers and commits | Required | Committers are listed as approvers in OWNERS files; number to be determined by TOC based on size and scope of application | Not Done |
At least 2 different organizations are committers | Required | | Not Done |

Adoption

Description | Category | Explanation | Status | Issue
List of users running the application | Recommended | Suggest listing adopters willing to be identified publicly in ADOPTERS.md | Not Done |

[Feature Request] Provide a better common logger

Currently, kubeflow/common provides a logger util for operators to use. It uses github.com/sirupsen/logrus underneath.

https://github.com/kubeflow/common/blob/master/pkg/util/logger.go#L20

The problem I notice is that new operators built with kubebuilder have a built-in logger for the operator logic.
This makes the logs really messy.

Can we provide an interface layer like logr, with a method like SetLogger, so that each operator can use its own implementation?
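A hedged sketch of the interface layer, assuming logr (a version that provides logr.Discard); the names below are illustrative rather than a committed API. An operator's main could then call SetLogger once at startup with the logger that controller-runtime already configured (e.g. ctrl.Log or a zap logger).

package util

import (
	"github.com/go-logr/logr"
)

// log is the package-level logger the common library would write to; it
// defaults to a no-op logger so importing common never changes log output.
var log logr.Logger = logr.Discard()

// SetLogger lets each operator inject its own logr implementation
// (e.g. the zap logger wired up by kubebuilder / controller-runtime).
func SetLogger(l logr.Logger) {
	log = l
}

// LoggerForJob returns a logger scoped to a specific job.
func LoggerForJob(namespace, name string) logr.Logger {
	return log.WithValues("namespace", namespace, "job", name)
}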

Difference between kubeflow/common and kubeflow/tf-operator/pkg/common

Comparing ControllerInterface between the two projects, tf-operator's is pretty clean, while kubeflow/common adds extra methods to create Jobs/Pods/Services.

GetPodsForJob(job interface{}) ([]*v1.Pod, error)
// GetServicesForJob returns the services managed by the job. This can be achieved by selecting services using label key "job-name"
// i.e. all services created by the job will come with label "job-name" = <this_job_name>
GetServicesForJob(job interface{}) ([]*v1.Service, error)
// DeleteJob deletes the job
DeleteJob(job interface{}) error
// UpdateJobStatus updates the job status and job conditions
UpdateJobStatus(job interface{}, replicas map[ReplicaType]*ReplicaSpec, jobStatus *JobStatus) error
// UpdateJobStatusInApiServer updates the job status in API server
UpdateJobStatusInApiServer(job interface{}, jobStatus *JobStatus) error
// CreateService creates the service
CreateService(job interface{}, service *v1.Service) error
// DeleteService deletes the service
DeleteService(job interface{}, name string, namespace string) error
// CreatePod creates the pod
CreatePod(job interface{}, pod *v1.Pod) error
// DeletePod deletes the pod
DeletePod(job interface{}, pod *v1.Pod) error

Besides that, we extracted a few interfaces for higher-level abstraction, which makes sense.

SetClusterSpec(job interface{}, podTemplate *v1.PodTemplateSpec, rtype, index string) error
// Returns the default container name in pod
GetDefaultContainerName() string
// Get the default container port name
GetDefaultContainerPortName() string
// Get the default container port number
GetDefaultContainerPortNumber() int32
// Returns if this replica type with index specified is a master role.
// MasterRole pod will have "job-role=master" set in its label
IsMasterRole(replicas map[ReplicaType]*ReplicaSpec, rtype ReplicaType, index int) bool

On the JobController side, tf-operator has PodControl and ServiceControl, which are used to operate k8s objects. However, in #36, since the create/delete pod/service interfaces had been exposed, the community decided to remove PodControl and ServiceControl because they are Kubernetes controller internal code, and there was also a concern about the size of the dependency (it needs the Kubernetes code base).

Currently, the situation is a little bit odd.

  1. We cannot fully get rid of kubernetes/pkg/controller; Expectation still uses it here: https://github.com/kubeflow/common/blob/master/pkg/controller.v1/common/job_controller.go#L100. This is not part of client-go yet; see kubernetes/client-go#332.

I think we used dep to manage dependencies in the past, and I've migrated common to be go-module compatible. Ideally, we will move all operators to go modules soon, so the size of the dependencies is not that painful.

  2. The files https://github.com/kubeflow/common/blob/master/pkg/controller.v1/common/pod_control.go and https://github.com/kubeflow/common/blob/master/pkg/controller.v1/common/service_control.go are not used in kubeflow/common.

I think we only need one of the following three. The status in kubeflow/common is:

  • Create Pods/Services interface - exists and is being used
  • controller.ServiceControlInterface - deleted
  • copied ServiceControlInterface - still there but not being used; only tests use it
  3. Implementations of the following interface are very similar across the different operators, and I doubt we need each controller to repeat the same logic. Currently, only xgboost-operator is built on top of this package, and we plan to migrate tf, pytorch, and mxnet to follow the kubeflow/common fashion. I am trying to see whether any other operators built on top of common need to customize the logic of CreateService, etc. If it's all the same, it would be better to implement it inside JobController.

GetPodsForJob(job interface{}) ([]*v1.Pod, error)
// GetServicesForJob returns the services managed by the job. This can be achieved by selecting services using label key "job-name"
// i.e. all services created by the job will come with label "job-name" = <this_job_name>
GetServicesForJob(job interface{}) ([]*v1.Service, error)
// DeleteJob deletes the job
DeleteJob(job interface{}) error

CreateService(job interface{}, service *v1.Service) error
// DeleteService deletes the service
DeleteService(job interface{}, name string, namespace string) error
// CreatePod creates the pod
CreatePod(job interface{}, pod *v1.Pod) error
// DeletePod deletes the pod
DeletePod(job interface{}, pod *v1.Pod) error

PodControl/ServiceControl interfaces are duplicated with the JobController create/delete Pod/Service Interface

https://github.com/kubeflow/common/blob/master/job_controller/job_controller.go#L68-L77
https://github.com/kubeflow/common/blob/master/job_controller/job_controller.go#L111-L114

Problem
The two interfaces above are both used for the underlying create/delete Pod/Service implementations, which is duplicated.

Goal

  • We should keep only one of them,
  • and keep it in only one place, i.e. ControllerInterface.

Solution

  1. Originally I thought we could remove the Pod/ServiceControl interface, but some code relies on it, e.g. here it is used as a parameter. We could rewrite that code to remove the dependency, though.

  2. Keep the Pod/Service interface and move it to one place, i.e. the ControllerInterface. I also realized some implementations are directly copied from Kubernetes utils; I wonder why we don't just use those.

Opinions?
@richardsliu @gaocegege @terrytangyuan

Reorganize the API package

Right now the user-facing APIs are scattered across different packages; we should reorganize them into a central api package.
A proposal looks like the following:


  1. types.go
  2. [new file] interface.go: the ControllerInterface for custom operators to override
  3. [new file] controller.go: the JobController struct, which contains the ReconcileJobs entry method
  4. [new file] constants.go: constants such as the label keys job-name, replica-index, etc.

Change expectation package log level to debug

The expectation log level is Info now, which is kind of noisy: we see at least two messages per pod/service event for setting up or lowering the expectations.

It would be better to change it to debug level.

INFO[0011] need to create new service: worker-1          job=default.xgboost-dist-iris-test-train replica-type=worker uid=467fb419-962b-11ea-b670-026fb543c17c
INFO[0011] Setting expectations &expectation.ControlleeExpectations{add:1, del:0, key:"default/xgboost-dist-iris-test-train/worker/services", timestamp:time.Time{wall:0xbfa78c778e0631d8, ext:11414609468, loc:(*time.Location)(0x313bcc0)}}
INFO[0011] Update on create function xgboostjob-operator create object xgboost-dist-iris-test-train-worker-1
INFO[0011] Lowered expectations &expectation.ControlleeExpectations{add:-2, del:0, key:"default/xgboost-dist-iris-test-train/worker/pods", timestamp:time.Time{wall:0xbfa78c778a2968f0, ext:11349809202, loc:(*time.Location)(0x313bcc0)}}
INFO[0011] Controller xgboost-dist-iris-test-train created service xgboost-dist-iris-test-train-worker-1
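A minimal sketch, assuming the expectation package keeps using logrus and the helper below is purely illustrative: the bookkeeping messages move from Infof to Debugf so they only show up when debug logging is enabled.

package expectation

import (
	log "github.com/sirupsen/logrus"
)

// logExpectationChange is a hypothetical helper showing the proposed level:
// expectation bookkeeping is logged at Debug instead of Info.
func logExpectationChange(action, key string, add, del int64) {
	log.Debugf("%s expectations for %s (add: %d, del: %d)", action, key, add, del)
}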

Allow modification of controller's replica index/type labels

Per comment in kubeflow/training-operator#1171 (comment), given that each operator may have different labels, we should discuss whether we want to allow each operator's controller that implements JobController to change ReplicaIndexLabel and ReplicaTypeLabel. For example, something like the following for tf-operator:

func (jc *JobController) GetReplicaTypeLabelKey() string {
	return tfReplicaTypeLabel
}

cc @gaocegege @Jeffwan @ChanYiLin @johnugeorge

Proposal: using gang scheduling API for generic distributed training support in Kubeflow

Problem

Currently in Kubeflow, we have a controller per framework (e.g. TF-Job and PyTorch-Operator), and to support a new framework, the message we are giving is that users have to write a new controller. This is a lot of friction for data scientists, who most likely don't know Golang and K8s. Even if they do, getting a new version of a controller deployed in a corporate cluster is not easy.

Proposed Solution

However, in reality users don't actually have to write a new controller if they have a generic gang scheduling API; in fact, the TF-Job controller already exposes a restricted version of such an API that works for almost all use cases. The Google AI Platform team implemented distributed PyTorch and XGBoost jobs using the TF-Job API for the Google AI Hub. So if we create a controller for gang scheduling, it will make it easy to add support for new frameworks.

Advantages

Less effort to support a new framework (users don't need K8s or Golang expertise).
A better story for portability between Kubeflow and other platforms like Mesos: the same container can be used on other platforms without any changes.

Other infras that support some version of gang scheduling API

Frameworks Support

From my understanding, distributed training for the following frameworks can be implemented easily using just a generic gang scheduling API.

  • TensorFlow
  • Horovod
  • PyTorch
  • XGBoost
  • Julia
  • LightGBM

Rough API Spec

Almost the same as the current tf-job spec, but with more generic names and a generalized number of worker groups.

apiVersion: kubeflow.org/v1beta1
kind: GangJob
metadata:
  generateName: gangjob
  namespace: kubeflow
spec:
  replicaSpecs:
    WorkerGroup1:
      replicas: 4
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: 
            image: 
            command:
    WorkerGroup2:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: 
            image: 
            command:
    .
    .
    .
    WorkerGroupN:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: 
            image: 
            command:

Supplement more information about the tfjob into the podgroup when `enable-gang-scheduler` is set to `true` in tf-operator

When enable-gang-scheduler=true, tf-operator creates a PodGroup CR so that the gang scheduler volcano can allocate the pods. But when creating the pod group in func SyncPodGroup:

func (jc *JobController) SyncPodGroup(job metav1.Object, minAvailableReplicas int32) (*v1beta1.PodGroup, error) {

	createPodGroup := &v1beta1.PodGroup{
		ObjectMeta: metav1.ObjectMeta{
			Name: job.GetName(),
			OwnerReferences: []metav1.OwnerReference{
				*jc.GenOwnerReference(job),
			},
		},
		Spec: v1beta1.PodGroupSpec{
			MinMember: minAvailable.IntVal,
		},
	}
	createdPodGroup, err := volcanoClientSet.SchedulingV1beta1().PodGroups(job.GetNamespace()).Create(createPodGroup)

Only the pod info and minMember are set in the podgroup, which results in missing functionality as well as unpredictable bugs during allocation.

For example, since no minResources field is filled in the podgroup, the gang scheduler volcano cannot distinguish tfjobs from best-effort jobs, as both have nil minResources; this lets all tfjobs be enqueued and makes the enqueue and reserve actions lose their effect.

So in my opinion, we need to supplement more info about the tfjob into the podgroup, such as minMember, minResources, queue, and other fields, so as to make sure the gang scheduler works correctly.
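A hedged sketch of a richer PodGroup, assuming volcano's scheduling v1beta1 API (the helper and its parameters are illustrative; queue and priorityClassName would come from the job's SchedulingPolicy, and minResources from summing the replicas' resource requests):

package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"volcano.sh/apis/pkg/apis/scheduling/v1beta1"
)

// buildPodGroup fills in the fields the gang scheduler needs beyond MinMember.
func buildPodGroup(job metav1.Object, owner metav1.OwnerReference, minMember int32,
	queue, priorityClass string, minResources *corev1.ResourceList) *v1beta1.PodGroup {
	return &v1beta1.PodGroup{
		ObjectMeta: metav1.ObjectMeta{
			Name:            job.GetName(),
			Namespace:       job.GetNamespace(),
			OwnerReferences: []metav1.OwnerReference{owner},
		},
		Spec: v1beta1.PodGroupSpec{
			MinMember:         minMember,
			Queue:             queue,
			PriorityClassName: priorityClass,
			MinResources:      minResources,
		},
	}
}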
