Comments (24)
@metacosm
The fleet manager handles a fleet of Kafka instances deployed on a fleet of OpenShift Dedicated (OSD) clusters. It does things like OSD terraforming, dynamic scaling, pushing the custom resources to the specific OSD cluster that will receive a Kafka instance, etc. The fleet manager also exposes an API for a user or a UI to request the creation of Kafka instances.
In the future, we want to converge on the notion of a Kubernetes Control Plane (KCP), a Kubernetes server managing a fleet of other Kubernetes clusters and deploying custom resources accordingly. In that future, the fleet manager will do less work: it will receive user requests as a logical custom resource from KCP and do some transformation as well as a bunch of business validations.
In that context, you can see how the future fleet manager looks a lot more like an operator that interacts with the KCP API, where the KCP API is the Kubernetes server API with the concept of custom resources and all (it's a Kube without pods).
I hope it makes more sense.
from ffm-project.
Some projects will be using it by mid-year.
Currently the code is in this single repo, in the manager folder.
Note: this component does not yet cover AMS and other aspects.
One thing that I'm not clear about is how you're doing the scaling, because operators are not scalable by default… We're thinking about options there, but the typical way of doing things is more about high availability than scaling: basically only one operator can process events at a time (unless you're doing something specific, which is what we're considering, but that's not, by any means, a solved issue; it's not even solved on the Go side of things).
Your operator can be deployed multiple times, but there's only one leader that processes the events. The other instances just track what's going on to keep their local caches more or less warm, but they don't process events. So there isn't any scaling going on, just state replication, so that whenever the leader goes down, another instance can be elected and be ready to take over quickly instead of having to replay lots of events. At least, that's my current understanding.
So I think we actually need to solve that horizontal scaling issue first before we can consider improving the vertical scaling issue via reactive.
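The leader-election pattern described above can be sketched in plain Java (a minimal, hypothetical model — in a real operator the shared election state would be a Kubernetes Lease, not an in-memory reference):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: several replicas observe the same events, but only the
// elected leader reconciles; the others just keep their caches warm so they
// can take over quickly when the leader goes down.
class OperatorReplica {
    private final String name;
    private final AtomicReference<String> leader;      // shared election state (a Lease in Kubernetes)
    final List<String> cache = new ArrayList<>();      // every replica tracks state
    final List<String> reconciled = new ArrayList<>(); // only the leader acts

    OperatorReplica(String name, AtomicReference<String> leader) {
        this.name = name;
        this.leader = leader;
    }

    void onEvent(String event) {
        cache.add(event);                // state replication happens on all replicas
        if (name.equals(leader.get())) { // but only the leader processes the event
            reconciled.add(event);
        }
    }
}
```

The point of the sketch: there is no horizontal scaling here, only failover — the standby does bookkeeping, and work only moves when the leader reference changes.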
Ah interesting. Though I did not get your jump from "let's fix the horizontal scaling first" before addressing the vertical scaling.
My understanding of your point is that for an operator deployed across multiple AZs, only one will be active. But it's then pretty important that this single instance scales to demand easily, hence reactive FTW. Makes sense?
Well, to truly scale, I guess we need both, and indeed, I wasn't quite clear in my explanation…
The thing I'm not clear about, though, is how are clusters partitioned across AZs (I assume that refers to Availability Zones)? If there's an effective separation of events across AZs, i.e. no chance that a cluster event in one AZ will ever target a resource in another AZ (because of namespaces or just because clusters are completely independent to start with) then, indeed, you can run one operator per AZ.
The current pattern is that for a set of events that might affect all or part of a cluster, there should only ever be one controller dealing with these events. If you can partition your cluster and configure your operator to watch events on that "closed-world" partition, then it's fine to run several operators, each configured to deal with such isolated partitions. For example, it's perfectly fine to deploy an operator instance per namespace when that operator will only ever deal with resources in that particular namespace.
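The "closed-world" partitioning idea can be sketched as follows (a hypothetical plain-Java model; in a real operator the filter would be the watch configuration, e.g. a namespace-scoped informer):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: each operator instance is configured to watch a single
// namespace (its "closed-world" partition), so several instances can run side
// by side without ever competing for the same events.
class PartitionedOperator {
    private final String watchedNamespace; // the partition this instance owns
    final List<String> reconciled = new ArrayList<>();

    PartitionedOperator(String watchedNamespace) {
        this.watchedNamespace = watchedNamespace;
    }

    // Events outside the watched namespace are ignored, which is what makes
    // running multiple instances safe: their event sets never overlap.
    void onEvent(String namespace, String resource) {
        if (!watchedNamespace.equals(namespace)) return;
        reconciled.add(resource);
    }
}
```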
That said, I'd argue that if your controller is written in such a way that it reconciles the world on each reconciliation, then it should be possible to have several instances work concurrently. It's just a lot more complex to reason about, because you have to assume that things your controller observes might be changed by another instance at the same time. It should work OK when only dealing with Kubernetes resources (because, after all, your controller should already assume that Kubernetes resources might change from underneath it), but it's quite tricky when provisioning external resources that don't often work with the same philosophy…
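A reconcile-the-world controller in the sense above can be sketched like this (a minimal, hypothetical model: desired state is recomputed in full on every pass, so repeating a pass — or racing with another instance — converges to the same result):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of an idempotent "reconcile the world" controller: each
// pass converges the observed state toward the full desired state, so running
// a pass twice (or from two instances) is harmless.
class WorldReconciler {
    // observed state, possibly mutated concurrently by other instances
    final Map<String, Integer> observed = new ConcurrentHashMap<>();

    void reconcile(Map<String, Integer> desired) {
        // create or update everything that should exist
        desired.forEach(observed::put);
        // delete everything that should not exist
        observed.keySet().removeIf(k -> !desired.containsKey(k));
    }
}
```

This is what makes concurrent instances tolerable for Kubernetes resources; the tricky part flagged above is that external provisioning APIs are rarely this idempotent.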
AZ stands for Availability Zones: different data centers close to one another (in latency). For Kubernetes, it's just one cluster that happens to run across these 3 data centers (with some anti-affinity to spread it). So the events won't be sharded by AZ. The goal is to be able to lose one or two AZs but still remain operational.
In that case, right now, only the leader election mechanism would be recommended, i.e. only one operator instance can be actively processing the events if your operator is listening cluster-wide. It really depends on what the operator is supposed to be doing. Like I said, if one instance can have total control over a namespace and doesn't need to modify cluster-wide shared resources, then you can deploy one operator instance per namespace without issue.
@emmanuelbernard @metacosm thanks for the discussion. I've been following it but I am having some difficulties understanding: Why are we talking about operators for a fleet-manager? Are fleet managers ("an API handling fleet management logic") eventually going to be operators too?
Or is it more like there are some specific concepts in the Java Operator SDK we'd like to re-use (e.g. reconciliation)?
Spiking a model where the fleet manager is reactive but uses the executor pool for Operator SDK tasks, and seeing how simple / complicated that model is.
+1
In that case, right now, only the leader election mechanism would be recommended
On the reconciliation part, if it weren't for quarkusio/quarkus#10716, we could have re-used the Quartz extension and had leader election for free via clustered jobs. The Quartz extension needs a blocking database connection, and it is not possible to run Hibernate ORM (i.e. blocking) and Hibernate Reactive side by side.
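For reference, the blocked approach would have amounted to roughly this Quarkus configuration (a sketch, assuming a JDBC datasource is already configured; the properties are from the Quarkus Quartz extension):

```properties
# Store Quartz jobs in the database so multiple instances share state
quarkus.quartz.store-type=jdbc-cmt
# Clustered mode: instances coordinate through the job store, which gives
# leader-election-like behaviour for scheduled jobs for free
quarkus.quartz.clustered=true
```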
That's actually what I don't understand either since the current implementation is not an operator as far as I'm aware but I don't know much about the current architecture or why a switch to using JOSDK is considered… I just provided some info about the constraints as they are for scaling operators.
I'll ask a more basic question - have we decided to switch from go to java for the factorized fleet manager?
I'll ask a more basic question - have we decided to switch from go to java for the factorized fleet manager?
Java-minded teams will have a Java-based fleet manager template.
Go-minded teams will have a Go-based fleet manager template.
Is there somewhere where I can learn about the architecture because I basically have only the faintest idea of what we're talking about in terms of role and responsibilities of the fleet manager in Managed Kafka?
Are fleet managers ("an API handling fleet management logic") eventually going to be operators too?
I suspect yes when Kubernetes Control Plane comes to fruition. Because the API server endpoint will be the KCP endpoint and the API will be a form of Managed CR. And the fleet manager will receive the request very much like an operator receives a new CR or a CR change. It will keep doing the same job but the architecture might evolve.
But we are not there yet. That's why I wanted to spike the effort, to know what it would entail if or when that eventuality arises, say Q3 / Q4 this year.
I suspect yes when Kubernetes Control Plane comes to fruition.
Hmm, that's another thing I don't understand… There's already a control plane in Kubernetes so I'm sure you're referring to something different but, lacking context, this whole conversation is quite confusing to me. Are you referring to https://github.com/kcp-dev/kcp by any chance?
Thanks @pmuir for the title improvement.
Discussing the subject with @maxandersen and @n1hility, they were arguing against too much premature optimisation. We can use RESTEasy Reactive, have the routes implemented in classic blocking fashion, and use classic Hibernate ORM. Even that model brings good memory and resource efficiency. We can look at non-blocking Hibernate (Reactive) if the load turns out to be a problem.
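As a sketch, that approach would look something like the following (names such as `KafkaFleetResource` and `KafkaInstance` are illustrative, not from the actual codebase):

```java
import jakarta.inject.Inject;
import jakarta.persistence.EntityManager;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;
import java.util.List;
import io.smallrye.common.annotation.Blocking;

// Hypothetical sketch: RESTEasy Reactive for the HTTP layer, with the route
// implemented in classic blocking style and classic Hibernate ORM underneath.
@Path("/kafkas")
public class KafkaFleetResource {

    @Inject
    EntityManager em; // classic, blocking Hibernate ORM

    @GET
    @Blocking // run on a worker thread so blocking JDBC calls are safe
    public List<KafkaInstance> list() {
        return em.createQuery("from KafkaInstance", KafkaInstance.class)
                 .getResultList();
    }
}
```

Note that recent RESTEasy Reactive versions already dispatch non-reactive method signatures to a worker thread by default; `@Blocking` is shown here to make the intent explicit.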
That resolves the problem the spike was trying to address. I’m thinking that we can then close that issue and make the decision to use the architecture approach mentioned here for the Java template. How does that work for you @danielezonca and @r00ta ?
How does that work for you @danielezonca and @r00ta ?
Fine for me, this is already how we implemented the fleet manager in OpenBridge.
Fine for me too. We probably need to wait for KCP to revisit/reconsider it.
What's the timeframe for KCP?
How does that work for you @danielezonca and @r00ta ?
Fine for me, this is already how we implemented the fleet manager in OpenBridge.
@r00ta that's cool. Can I get access to the repo?
Thank you @danielezonca, will have a look.
Related to #5