Comments (24)

Jeffwan commented on August 20, 2024

Some updates here. @zw0610 and I wrote a doc; we still need some time to polish it before making it public. Then we can have some discussion in the WG meetings.

Jeffwan commented on August 20, 2024

Raised this request in the 3/24 AutoML and Training WG meeting. I will draft a proposal for deeper discussion in the following month. We can either discuss it offline or talk about it in the next community meeting.

Jeffwan commented on August 20, 2024

Here's the proposal: All-in-one training operator. Any feedback is welcome. I presented it in the 05/19 US & EU friendly meeting and @zw0610 will present it in the 06/02 CN & EU friendly meeting.

terrytangyuan commented on August 20, 2024

I also just left some specific comments in the doc. Please take a look when you get a chance.

gaocegege commented on August 20, 2024

Should we close this issue?

ChanYiLin commented on August 20, 2024

SGTM!
One of the pieces of feedback I heard from users is exactly this case:
"Before you run any job, there are already tons of pods running in the k8s cluster."

We have tried to merge multiple controllers into one process before. Using kubebuilder's SetupWithManager, one can spawn multiple controllers in one pod while still implementing each controller separately.
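
For illustration, here is a minimal controller-runtime sketch of that pattern (not the actual kubeflow code): two independently implemented reconcilers registered with one manager, so they share one process, one cache, and one client. Core types are used as the watched objects only to keep the sketch self-contained; in the training operators these would be the TFJob, PyTorchJob, etc. CRDs, and each reconciler's SetupWithManager would wrap the ctrl.NewControllerManagedBy call shown here.

```go
package main

import (
	"context"
	"os"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

// Two separately implemented controllers; in the real operators they would
// reconcile TFJob, PyTorchJob, etc. instead of Pods and Services.
type podReconciler struct{ client.Client }

func (r *podReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil
}

type serviceReconciler struct{ client.Client }

func (r *serviceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil
}

func main() {
	ctrl.SetLogger(zap.New())

	// One manager per process: a single shared cache and client.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		os.Exit(1)
	}

	// Register each controller with the same manager, so only one pod runs.
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Pod{}).
		Complete(&podReconciler{mgr.GetClient()}); err != nil {
		os.Exit(1)
	}
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Service{}).
		Complete(&serviceReconciler{mgr.GetClient()}); err != nil {
		os.Exit(1)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```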

gaocegege commented on August 20, 2024

Yeah, we also have an internal implementation of it, and Alibaba has a similar design. I think we can try to merge it back into the community.

gaocegege commented on August 20, 2024

/cc @jian-he

terrytangyuan commented on August 20, 2024

This is mainly about merging the controllers into one controller manager while we still keep the code bases for the different operators separate, right?

gaocegege commented on August 20, 2024

Personally, I think it would be better to keep all training operators in one repository. WDYT?

terrytangyuan commented on August 20, 2024

@gaocegege Personally I think it would be great if the code bases were closer together, but it might not be easy to maintain releases and versioning. Also, things like stargazers/watchers, commit history, and useful design discussions will be spread out if we start a new repo and develop from scratch.

Jeffwan commented on August 20, 2024

One repo -> one manager -> multiple controllers -> a CRD reconciler loop per controller.

The manager can host many controllers, and each controller should only take care of one CRD. Frameworks should be configured to

@gaocegege What's the plan on your side? We have some internal work going on as well and are considering bringing it into kubeflow.
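
As a rough sketch of that layout (one manager hosting a controller per framework CRD), the set of framework controllers to start could be driven by configuration. The flag name and setup functions below are assumptions for illustration only, not actual training-operator flags or APIs.

```go
package main

import (
	"flag"
	"os"
	"strings"

	ctrl "sigs.k8s.io/controller-runtime"
)

// Placeholder registration functions; in a real operator each would call
// ctrl.NewControllerManagedBy(mgr).For(<framework CRD>).Complete(<reconciler>).
func setupTFJobController(mgr ctrl.Manager) error      { return nil }
func setupPyTorchJobController(mgr ctrl.Manager) error { return nil }

func main() {
	// Hypothetical flag selecting which framework controllers this process runs.
	var enabled string
	flag.StringVar(&enabled, "enabled-frameworks", "tensorflow,pytorch",
		"comma-separated list of framework controllers to start")
	flag.Parse()

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		os.Exit(1)
	}

	// One manager hosting one controller (reconciler loop) per enabled CRD.
	setupFuncs := map[string]func(ctrl.Manager) error{
		"tensorflow": setupTFJobController,
		"pytorch":    setupPyTorchJobController,
	}
	for _, name := range strings.Split(enabled, ",") {
		if setup, ok := setupFuncs[name]; ok {
			if err := setup(mgr); err != nil {
				os.Exit(1)
			}
		}
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```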

gaocegege commented on August 20, 2024

@Jeffwan We also have some internal work on it; maybe we can discuss it.

zw0610 commented on August 20, 2024

I second @terrytangyuan's suggestion, rather than merging all operators into one repo. While keeping one manager with multiple controllers does bring many benefits like saving traffic, if the controllers work independently, the one-manager-multiple-controllers design is perpendicular to the idea of sharing the job reconciliation function across all operators to save development cost. That includes features like elastic training, error handling, event recording, etc.

Instead, maybe we can move kubeflow/common one step forward by creating an operator-sdk-equivalent executable for this common library, making it something like kubeflow-operator-sdk. The executable would generate more training-operator-specific functions like ReconcilePods and ReconcileServices, in addition to the code that operator-sdk generates, so developers would only need to fill in these functions for customization.


A summary from most operators:

| Operator | Roles | Native* Resources | Others |
| --- | --- | --- | --- |
| tf-operator | PS, Worker, Chief, Evaluator | service, pod (podgroup) | |
| mpi-operator | Launcher, Worker | service, pod, configMap, serviceAccount, role, roleBinding (podgroup) | |
| pytorch-operator | Master, Worker | service, pod (podgroup) | |
| xgboost-operator | Master, Worker | service, pod (podgroup) | |
| mxnet-operator | Scheduler, Server, Worker, TunerTracker, TunerServer, Tuner | service, pod (podgroup) | |

*While podgroup is not a Kubernetes-native resource, we consider it as one here since it is defined outside the xxx-operator scope.

If mpi-operator is able to get rid of its reliance on configMap, serviceAccount, role, and roleBinding, maybe we can apply common to all of the operators.
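
To make the "fill in these functions" idea concrete, here is a purely illustrative sketch of the kind of stubs such a tool might generate; the names and signatures are assumptions, not the existing kubeflow/common API.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ReplicaSpec mirrors the shared "replica type -> count + pod template"
// shape used by the training CRDs (PS/Worker/Chief, Master/Worker, ...).
type ReplicaSpec struct {
	Replicas int32
	Template corev1.PodTemplateSpec
}

// Generated stub: drive the pods of one replica type toward the desired
// count; framework-specific bits (e.g. building TF_CONFIG) go here.
func ReconcilePods(job metav1.Object, rtype string, spec ReplicaSpec, pods []*corev1.Pod) error {
	return nil // filled in by the operator developer
}

// Generated stub: keep the headless services that give replicas stable
// DNS names in sync with the replica spec.
func ReconcileServices(job metav1.Object, rtype string, spec ReplicaSpec, svcs []*corev1.Service) error {
	return nil // filled in by the operator developer
}
```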

zw0610 commented on August 20, 2024

But here I have another concern. The contemporary design of ReconcilePods and ReconcileServices is really puzzling, thanks to the lack of real OOP in Go. Moreover, this design still prevents developers from sharing the elastic training feature among operators. For example, when the number of worker replicas changes, the pod creation/deletion logic still needs to be implemented individually for multiple operators. However, the reason we need ReconcilePods is more about the differences in the podTemplate, like the composition of TF_CONFIG.

Is it possible to use a shared ReconcilePods method, but allow developers to register decorators for the podTemplate before it is sent to PodControl.CreatePodsWithControllerRef?

But sure, developers should still be able to 'override' the ReconcilePods method if they want.
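
A minimal sketch of that decorator idea, assuming a hypothetical registration API (not something that exists in kubeflow/common today): the shared reconciler applies every registered decorator to the pod template just before handing it to PodControl.CreatePodsWithControllerRef, so only the template mutations stay framework-specific.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// PodTemplateDecorator mutates the pod template for one replica before the
// pod is created, e.g. injecting TF_CONFIG for a TFJob worker.
type PodTemplateDecorator func(tmpl *corev1.PodTemplateSpec, replicaType string, index int)

var decorators []PodTemplateDecorator

// RegisterPodTemplateDecorator is called by each framework controller at setup time.
func RegisterPodTemplateDecorator(d PodTemplateDecorator) {
	decorators = append(decorators, d)
}

// decoratePodTemplate would run inside the shared ReconcilePods, right before
// PodControl.CreatePodsWithControllerRef is called with the decorated template.
func decoratePodTemplate(tmpl *corev1.PodTemplateSpec, replicaType string, index int) {
	for _, d := range decorators {
		d(tmpl, replicaType, index)
	}
}

// Example registration: a TensorFlow-specific decorator (TF_CONFIG value elided).
func init() {
	RegisterPodTemplateDecorator(func(tmpl *corev1.PodTemplateSpec, replicaType string, index int) {
		if len(tmpl.Spec.Containers) == 0 {
			return
		}
		tmpl.Spec.Containers[0].Env = append(tmpl.Spec.Containers[0].Env,
			corev1.EnvVar{Name: "TF_CONFIG", Value: "..."})
	})
}
```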

Jeffwan commented on August 20, 2024

> if the controllers work independently, the one-manager-multiple-controllers design is perpendicular to the idea of sharing the job reconciliation function across all operators to save development cost

I don't quite understand "perpendicular" in this context. Can you elaborate with more details?

> Instead, maybe we can move kubeflow/common one step forward by creating an operator-sdk-equivalent executable for this common library, making it something like kubeflow-operator-sdk.

I think this is a good idea. I have two things in mind.

  1. From an implementation perspective, I am not sure if there's a way to leverage existing tools like kubebuilder or operator-sdk and inject the generated code. Currently, people still need to use the low-level API, or use kubebuilder to generate a skeleton and then use kubeflow/common. A kubeflow-operator-sdk would further reduce the time to build a new DL operator from scratch.
  2. The number of frameworks is growing but is still reasonable. Building tooling for a small group may not be worth the effort. If the tool can be made flexible enough that the all-in-one operator can use it as well, then it may be worth the effort.

> Moreover, this design still prevents developers from sharing the elastic training feature among operators. For example, when the number of worker replicas changes, the pod creation/deletion logic still needs to be implemented individually for multiple operators.
> Is it possible to use a shared ReconcilePods method, but allow developers to register decorators for the podTemplate before it is sent to PodControl.CreatePodsWithControllerRef?

Em, I think we should collect these good use cases and probably try to find a different way to abstract the methods. I agree there's no guarantee we can make it compatible with future use cases, but we can still leave enough flexibility, like overriding ReconcilePods.

zw0610 commented on August 20, 2024

> if the controllers work independently, the one-manager-multiple-controllers design is perpendicular to the idea of sharing the job reconciliation function across all operators to save development cost

Hi @Jeffwan, what I mean by the comment above is that whether we use one-manager-multiple-controllers or controllers working individually is a question on a different dimension from the idea of sharing the code for how we reconcile jobs. With or without such a design, we can help developers working on operators for new frameworks with a larger shared code base.

But as you mentioned, the number of new frameworks looks limited, which I agree with.

johnugeorge commented on August 20, 2024

A few thoughts:

  1. Are all operators using the same amount of resources? Say, in one deployment PyTorch jobs are spawned more than TFJobs, while in another MXNet jobs are used more. How do we control/recommend resource usage from a single combined operator's point of view?
  2. Since we will lose isolation between operator deployments, will releases become difficult? Also, what if a CR upgrade is required for one operator while there are running jobs for the other operators?

@gaocegege Do you have this design already?

gaocegege commented on August 20, 2024

@zw0610 and @Jeffwan are working on the design. I think we can discuss it in the community call.

Jeffwan commented on August 20, 2024

> A few thoughts:
>
>   1. Are all operators using the same amount of resources? Say, in one deployment PyTorch jobs are spawned more than TFJobs, while in another MXNet jobs are used more. How do we control/recommend resource usage from a single combined operator's point of view?

Controllers will use different amounts of resources. Instead of understanding each controller's utilization, I would recommend adjusting resources based on the total number of jobs. Users who transition from multiple operators can sum up the requests/limits and use that for the new operator. (The all-in-one operator uses fewer resources because of the shared cache.) However, this won't be accurate; it would be good for us to do some load testing and give some recommended numbers.

>   2. Since we will lose isolation between operator deployments, will releases become difficult? Also, what if a CR upgrade is required for one operator while there are running jobs for the other operators?

Good point. Do you have concerns about the job state? Is anything different from upgrading an operator today?

johnugeorge commented on August 20, 2024

When any CR is upgraded, we will need an operator upgrade and a controller restart. Will upgrades be difficult for the users?

Jeffwan commented on August 20, 2024

> When any CR is upgraded, we will need an operator upgrade and a controller restart. Will upgrades be difficult for the users?

I forgot to respond to that. I think the upgrade behavior is similar to the existing controllers. The only "overhead" is that the operator owners may need more time for coordination, since all users are on the one operator.

Jeffwan commented on August 20, 2024

/close

google-oss-robot commented on August 20, 2024

@Jeffwan: Closing this issue.

In response to this:

> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
