Comments (7)

fnpanic commented on September 4, 2024

@mnaser From my point of view, all errors should be reflected in the Magnum API with a reason, so that the user has a chance either to understand the problem (a quota issue, for example) or to raise it with the operators in a straightforward way.

from magnum-cluster-api.

mnaser commented on September 4, 2024

> @mnaser From my point of view, all errors should be reflected in the Magnum API with a reason, so that the user has a chance either to understand the problem (a quota issue, for example) or to raise it with the operators in a straightforward way.

Indeed, but there's a really odd scenario here: CAPI acts as a controller, so it will keep retrying. That means the cluster is not technically "CREATE_FAILED" with the process over. If you fix the quota, the process will continue. So maybe "CREATE_BLOCKED" with a message would make sense, but that is not a native Magnum API concept.

fnpanic commented on September 4, 2024

That is good to know, because I am pretty sure we will have users who keep clusters in a failed state forever, which means the controller will retry forever and place unnecessary load on all the OpenStack APIs. I like the idea of CREATE_BLOCKED, but I am not sure it is really a good idea to extend the API. If we are not going to change the Magnum API for now, I would go with CREATE_FAILED and a reason; if the state can then be changed via the API, we would be fine as soon as the user adds quota.

okozachenko1203 commented on September 4, 2024

> That is good to know, because I am pretty sure we will have users who keep clusters in a failed state forever, which means the controller will retry forever and place unnecessary load on all the OpenStack APIs. I like the idea of CREATE_BLOCKED, but I am not sure it is really a good idea to extend the API. If we are not going to change the Magnum API for now, I would go with CREATE_FAILED and a reason; if the state can then be changed via the API, we would be fine as soon as the user adds quota.

This will cause a difference between the desired/expected state and the running state. Because there is no way to stop reconciliation in the CAPI controller, it will deploy the cluster once the quota issue is fixed. That is, even if the OpenStack COE cluster status is CREATE_FAILED, a CAPI cluster will be deployed and running, which means customers will be charged for cluster nodes, load balancers, floating IPs, etc.

I think we need to make a critical decision to fix this kind of issue: how to reconcile the difference in working principle between the K8s operator and the OpenStack conductor.

  • If we respect the OpenStack state transition logic (once a cluster has failed, there is no automatic retry and the status is not updated automatically), we have to find a way to stop reconciling any CAPI cluster whose corresponding COE cluster status is *_FAILED.
  • If we follow the K8s operator's reconciliation logic, we let m-capi update the cluster status from *_FAILED to *_IN_PROGRESS.
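
To make the two options concrete, here is a minimal sketch of how the status transition for a create operation could differ between them. It is not taken from Magnum or m-capi: `Policy`, `next_status`, `capi_ready` and `capi_blocked` are hypothetical names, and only the three status strings are existing Magnum statuses.

```python
from enum import Enum


class Policy(Enum):
    # Hypothetical labels for the two options in the list above.
    OPENSTACK_TERMINAL = "openstack-terminal"  # *_FAILED is final
    K8S_RECONCILE = "k8s-reconcile"            # *_FAILED may recover


def next_status(current: str, capi_ready: bool, capi_blocked: bool,
                policy: Policy) -> str:
    """Return the next COE cluster status for a create operation.

    capi_ready / capi_blocked are assumed inputs describing what the
    CAPI controller currently reports (ready, or stuck on e.g. quota).
    """
    if current == "CREATE_COMPLETE":
        return current
    if current == "CREATE_FAILED":
        # Option 1: a failed status never changes on its own, so the
        # CAPI cluster would have to be stopped or deleted separately.
        if policy is Policy.OPENSTACK_TERMINAL or capi_blocked:
            return current
        # Option 2: once the blocker is gone, the status goes back to
        # in progress and follows the controller again.
        return "CREATE_IN_PROGRESS"
    # current is CREATE_IN_PROGRESS
    if capi_ready:
        return "CREATE_COMPLETE"
    if capi_blocked:
        return "CREATE_FAILED"
    return "CREATE_IN_PROGRESS"
```

Under the first policy the status and the real CAPI cluster can drift apart after the quota is fixed; under the second, the status follows the controller at the cost of leaving a terminal state automatically.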

mnaser commented on September 4, 2024

Those are some good thoughts, @okozachenko1203. I've given this some thought.

The update status loop only runs for a cluster whose status is *_IN_PROGRESS, which means we would unfortunately need to start another daemon/controller if we want to transition from failed to complete.
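
As a rough illustration of why a failed cluster never gets picked up again, the selection step of such a loop can be sketched like this. This is an assumption-laden toy, not Magnum's actual periodic task: `Cluster` and `clusters_to_poll` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cluster:
    # Hypothetical stand-in for a COE cluster record.
    name: str
    status: str


def clusters_to_poll(clusters: List[Cluster]) -> List[Cluster]:
    """Keep only clusters whose status ends in _IN_PROGRESS.

    Anything in *_FAILED or *_COMPLETE is skipped, so its status is
    never refreshed by this loop, even if CAPI later succeeds.
    """
    return [c for c in clusters if c.status.endswith("_IN_PROGRESS")]
```

With this kind of filter, moving a cluster out of CREATE_FAILED has to be done by something other than the loop itself.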

Also, the other issue is that it would be pretty hard for us to "capture" all the possible ways a cluster can fail to create, which could pull us into a catch-up game of always implementing more and more checks.

Since introducing another service to reconcile seems like a tall order (but not something we are against doing), perhaps for the short term being able to update the stack status and letting the resource stay in progress might be best.

fnpanic commented on September 4, 2024

I would suggest a different route.
As soon as we enter CREATE_FAILED, we should delete the cluster in CAPI.
Then, as soon as the user deletes the cluster, it just gets deleted from the Magnum DB.

Updating the reason to something like "quota issues" will give the user a better experience and more consistency with magnum-heat.

Fail hard, fail fast.

okozachenko1203 commented on September 4, 2024

> I would suggest a different route. As soon as we enter CREATE_FAILED, we should delete the cluster in CAPI. Then, as soon as the user deletes the cluster, it just gets deleted from the Magnum DB.
>
> Updating the reason to something like "quota issues" will give the user a better experience and more consistency with magnum-heat.
>
> Fail hard, fail fast.

To decide whether a cluster has failed, we would have to define every possible failure scenario of the COE cluster as reflected by the CAPI controller, and define a checker for each of those conditions. Implementing such logic inside the OpenStack conductor would be much more painful, because it is more or less a K8s-style approach.
As a long-term solution, we will eventually have a controller reconciling outside magnum-conductor. At the current stage, we will just expose all the CAPI cluster-related events as the COE cluster's status_reason, so users have a way to check what is happening underneath and why the cluster is stuck in *_IN_PROGRESS.
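
One possible shape for that short-term approach, as a sketch only: fold the not-yet-true conditions reported on the CAPI objects into a single status_reason string. The function name `build_status_reason` is hypothetical, the input is assumed to be a list of condition dicts with `type`, `status`, `reason` and `message` keys (the fields Cluster API conditions carry), and the sample values in the usage are made up.

```python
from typing import Dict, List, Optional


def build_status_reason(conditions: List[Dict[str, str]]) -> Optional[str]:
    """Summarise every condition that is not "True" in one string.

    Returns None when all conditions are satisfied, so the caller can
    leave status_reason untouched in that case.
    """
    parts = []
    for cond in conditions:
        if cond.get("status") == "True":
            continue  # healthy condition, nothing to report
        entry = f'{cond.get("type", "Unknown")}: {cond.get("reason", "Unknown")}'
        message = cond.get("message", "")
        if message:
            entry += f" ({message})"
        parts.append(entry)
    return "; ".join(parts) if parts else None


# Usage with made-up condition data:
example = [
    {"type": "ControlPlaneReady", "status": "True"},
    {"type": "InfrastructureReady", "status": "False",
     "reason": "ExampleReason", "message": "quota exceeded"},
]
print(build_status_reason(example))
```

The cluster stays in *_IN_PROGRESS in this scheme; only the reason text changes, which avoids having to classify every failure as terminal or not.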
