Comments (24)
@metacosm
The fleet manager handles a fleet of Kafka instances deployed on a fleet of OpenShift Dedicated (OSD) clusters. It does things like OSD terraforming, dynamic scaling, pushing the custom resources to the specific OSD cluster that will receive a Kafka instance, etc. The fleet manager also exposes an API for a user or a UI to request the creation of Kafka instances.
In the future, we want to converge on the notion of a Kubernetes Control Plane (KCP), a Kubernetes server managing a fleet of other Kubernetes clusters and deploying custom resources accordingly. In that future, the fleet manager will do less work: it will receive user requests as a logical custom resource from KCP and do some transformation as well as a bunch of business validations.
In that context, you can see how the future fleet manager looks a lot more like an operator that interacts with the KCP API, where the KCP API is the Kubernetes server API with the concept of custom resources and all (it's a Kube without pods).
I hope it makes more sense.
from ffm-project.
Some projects will be using it by mid-year.
Currently the code is in this single repo, in the manager folder.
Note: this component does not yet cover AMS and other aspects.
One thing that I'm not clear about is how you're doing the scaling, because operators are not scalable by default… We're thinking about options there, but the typical way of doing things is more about high availability than scaling: basically only one operator can process events at a time (unless you're doing something specific, which is what we're considering, but that's not, by any means, a solved issue; it's not even solved on the Go side of things).
Your operator can be deployed multiple times, but there's only one leader that processes the events. The other instances just track what's going on to keep their local caches more or less warm, but they don't process events. So there isn't any scaling going on, just state replication, so that whenever the leader goes down, another instance can be elected and be ready to take over quickly instead of having to replay lots of events. At least, that's my current understanding.
So I think we actually need to solve that horizontal scaling issue first before we can consider improving the vertical scaling issue via reactive.
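The leader-election pattern described above can be sketched in plain Java (a minimal, hypothetical model — in a real operator the shared election state would be a Kubernetes Lease, not an in-memory reference):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: several replicas observe the same events, but only the
// elected leader reconciles; the others just keep their caches warm so they
// can take over quickly when the leader goes down.
class OperatorReplica {
    private final String name;
    private final AtomicReference<String> leader;      // shared election state (a Lease in Kubernetes)
    final List<String> cache = new ArrayList<>();      // every replica tracks state
    final List<String> reconciled = new ArrayList<>(); // only the leader acts

    OperatorReplica(String name, AtomicReference<String> leader) {
        this.name = name;
        this.leader = leader;
    }

    void onEvent(String event) {
        cache.add(event);                // state replication happens on all replicas
        if (name.equals(leader.get())) { // but only the leader processes the event
            reconciled.add(event);
        }
    }
}
```

The point of the sketch: there is no horizontal scaling here, only failover — the standby does bookkeeping, and work only moves when the leader reference changes.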
Ah interesting. Though I did not get your jump from "let's fix the horizontal scaling first" before addressing the vertical scaling.
My understanding of your point is that for an operator deployed across multiple AZs, only one will be active. But it's then pretty important that this single instance scales to demand easily, hence reactive FTW. Makes sense?
Well, to truly scale, I guess we need both, and indeed, I wasn't quite clear in my explanation…
The thing I'm not clear about, though, is how are clusters partitioned across AZs (I assume that refers to Availability Zones)? If there's an effective separation of events across AZs, i.e. no chance that a cluster event in one AZ will ever target a resource in another AZ (because of namespaces or just because clusters are completely independent to start with) then, indeed, you can run one operator per AZ.
The current pattern is that for a set of events that might affect all or part of a cluster, there should only ever be one controller dealing with these events. If you can partition your cluster and configure your operator to watch events on that "closed-world" partition, then it's fine to run several operators, each configured to deal with such isolated partitions. For example, it's perfectly fine to deploy an operator instance per namespace when that operator will only ever deal with resources in that particular namespace.
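The "closed-world" partitioning idea can be sketched as follows (a hypothetical plain-Java model; in a real operator the filter would be the watch configuration, e.g. a namespace-scoped informer):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: each operator instance is configured to watch a single
// namespace (its "closed-world" partition), so several instances can run side
// by side without ever competing for the same events.
class PartitionedOperator {
    private final String watchedNamespace; // the partition this instance owns
    final List<String> reconciled = new ArrayList<>();

    PartitionedOperator(String watchedNamespace) {
        this.watchedNamespace = watchedNamespace;
    }

    // Events outside the watched namespace are ignored, which is what makes
    // running multiple instances safe: their event sets never overlap.
    void onEvent(String namespace, String resource) {
        if (!watchedNamespace.equals(namespace)) return;
        reconciled.add(resource);
    }
}
```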
That said, I'd argue that if your controller is written in such a way that it reconciles the world on each reconciliation, then it should be possible to have several instances work concurrently. It's just a lot more complex to reason about, because you have to assume that things your controller observes might be changed by another instance at the same time. It should work OK when only dealing with Kubernetes resources (because, after all, your controller should already assume that Kubernetes resources might change from underneath it), but it's quite tricky when provisioning external resources that don't often work with the same philosophy…
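A reconcile-the-world controller in the sense above can be sketched like this (a minimal, hypothetical model: desired state is recomputed in full on every pass, so repeating a pass — or racing with another instance — converges to the same result):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of an idempotent "reconcile the world" controller: each
// pass converges the observed state toward the full desired state, so running
// a pass twice (or from two instances) is harmless.
class WorldReconciler {
    // observed state, possibly mutated concurrently by other instances
    final Map<String, Integer> observed = new ConcurrentHashMap<>();

    void reconcile(Map<String, Integer> desired) {
        // create or update everything that should exist
        desired.forEach(observed::put);
        // delete everything that should not exist
        observed.keySet().removeIf(k -> !desired.containsKey(k));
    }
}
```

This is what makes concurrent instances tolerable for Kubernetes resources; the tricky part flagged above is that external provisioning APIs are rarely this idempotent.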
AZ stands for Availability Zones: different data centers close to one another (in latency). For Kubernetes, it's just one cluster that happens to run across these 3 data centers (with some anti-affinity to spread it). So the events won't be sharded by AZ. The goal is to be able to lose one or two AZs but still remain operational.
In that case, right now, only the leader election mechanism would be recommended, i.e. only one operator instance can be actively processing the events if your operator is listening cluster-wide. It really depends on what the operator is supposed to be doing. Like I said, if one instance can have total control over a namespace and doesn't need to modify cluster-wide shared resources, then you can deploy one operator instance per namespace without issue.
@emmanuelbernard @metacosm thanks for the discussion. I've been following it but I am having some difficulties understanding: Why are we talking about operators for a fleet-manager? Are fleet managers ("an API handling fleet management logic") eventually going to be operators too?
Or is it more like there are some specific concepts in the Java Operator SDK we'd like to re-use (e.g. reconciliation)?
Spiking a model where the fleet manager is reactive but uses the executor pool for Operator SDK tasks, and seeing how simple / complicated that model is.
+1
In that case, right now, only the leader election mechanism would be recommended
On the reconciliation part, if it weren't for quarkusio/quarkus#10716, we could have re-used the Quartz extension and had leader election for free via clustered jobs. The Quartz extension needs a blocking database connection, and it is not possible to run Hibernate ORM (i.e. blocking) and Hibernate Reactive side by side.
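For reference, the blocked approach would have amounted to roughly this Quarkus configuration (a sketch, assuming a JDBC datasource is already configured; the properties are from the Quarkus Quartz extension):

```properties
# Store Quartz jobs in the database so multiple instances share state
quarkus.quartz.store-type=jdbc-cmt
# Clustered mode: instances coordinate through the job store, which gives
# leader-election-like behaviour for scheduled jobs for free
quarkus.quartz.clustered=true
```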
That's actually what I don't understand either since the current implementation is not an operator as far as I'm aware but I don't know much about the current architecture or why a switch to using JOSDK is considered… I just provided some info about the constraints as they are for scaling operators.
I'll ask a more basic question - have we decided to switch from go to java for the factorized fleet manager?
I'll ask a more basic question - have we decided to switch from go to java for the factorized fleet manager?
Java-minded teams will have a Java-based fleet manager template.
Go-minded teams will have a Go-based fleet manager template.
Is there somewhere where I can learn about the architecture because I basically have only the faintest idea of what we're talking about in terms of role and responsibilities of the fleet manager in Managed Kafka?
Are fleet managers ("an API handling fleet management logic") eventually going to be operators too?
I suspect yes when Kubernetes Control Plane comes to fruition. Because the API server endpoint will be the KCP endpoint and the API will be a form of Managed CR. And the fleet manager will receive the request very much like an operator receives a new CR or a CR change. It will keep doing the same job but the architecture might evolve.
But we are not there yet. That's why I wanted to spike the effort, to know what it would entail if or when that eventuality arises, say Q3 / Q4 this year.
I suspect yes when Kubernetes Control Plane comes to fruition.
Hmm, that's another thing I don't understand… There's already a control plane in Kubernetes so I'm sure you're referring to something different but, lacking context, this whole conversation is quite confusing to me. Are you referring to https://github.com/kcp-dev/kcp by any chance?
Thanks @pmuir for the title improvement.
Discussing the subject with @maxandersen and @n1hility, they were arguing against too much premature optimisation. We can use RESTEasy Reactive, have the routes implemented in classic blocking fashion, and use classic Hibernate ORM. Even that model brings good memory and resource efficiency. We can look at non-blocking Hibernate (Reactive) if the load turns out to be a problem.
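As a sketch, that approach would look something like the following (names such as `KafkaFleetResource` and `KafkaInstance` are illustrative, not from the actual codebase):

```java
import jakarta.inject.Inject;
import jakarta.persistence.EntityManager;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;
import java.util.List;
import io.smallrye.common.annotation.Blocking;

// Hypothetical sketch: RESTEasy Reactive for the HTTP layer, with the route
// implemented in classic blocking style and classic Hibernate ORM underneath.
@Path("/kafkas")
public class KafkaFleetResource {

    @Inject
    EntityManager em; // classic, blocking Hibernate ORM

    @GET
    @Blocking // run on a worker thread so blocking JDBC calls are safe
    public List<KafkaInstance> list() {
        return em.createQuery("from KafkaInstance", KafkaInstance.class)
                 .getResultList();
    }
}
```

Note that recent RESTEasy Reactive versions already dispatch non-reactive method signatures to a worker thread by default; `@Blocking` is shown here to make the intent explicit.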
That resolves the problem the spike was trying to address. I’m thinking that we can then close that issue and make the decision to use the architecture approach mentioned here for the Java template. How does that work for you @danielezonca and @r00ta ?
How does that work for you @danielezonca and @r00ta ?
Fine for me, this is already how we implemented the fleet manager in OpenBridge.
Fine for me too. We probably need to wait for KCP to revisit/reconsider it.
What's the timeframe for KCP?
How does that work for you @danielezonca and @r00ta ?
Fine for me, this is already how we implemented the fleet manager in OpenBridge.
@r00ta that's cool. Can I get access to the repo?
Thank you @danielezonca, will have a look.
Related to #5