At the moment, errors in the Cluster API do not bubble up into the Magnum API

Bubble errors up to Magnum API about magnum-cluster-api HOT 7 CLOSED

vexxhost commented on September 4, 2024

Bubble errors up to Magnum API

from magnum-cluster-api.

Comments (7)

fnpanic commented on September 4, 2024

@mnaser from my point of view, all errors should be reflected in the Magnum API with a reason, so the user has a chance to either understand the problem (a quota issue, for example) or raise it with the operators in a straightforward way.


mnaser commented on September 4, 2024

Indeed, but there's an awkward scenario here: CAPI behaves like a controller, so it will keep retrying. That means the cluster is not technically "CREATE_FAILED" with the process over. If you fix the quota, the process will continue. So maybe "CREATE_BLOCKED" with a message would make sense, but that is not a native Magnum API concept.


fnpanic commented on September 4, 2024

That is good to know, because I am pretty sure we will have users who keep clusters in a failed state forever, which means the controller will retry forever and place unnecessary load on all OpenStack APIs. I like the idea of CREATE_BLOCKED, but I am not sure it is really a good idea to extend the API. If we are not going to change the Magnum API for now, I would go with CREATE_FAILED and a reason; if the state can be changed via the API, we would be good as soon as the user adds quota.


okozachenko1203 commented on September 4, 2024

This will cause a difference between the desired/expected state and the running state. Because there is no way to stop reconciliation in the CAPI controller, it will deploy the cluster once the quota issue is fixed. That is, even if the openstack coe cluster status is CREATE_FAILED, there will be a CAPI cluster deployed and running, which means customers will be charged for cluster nodes, LBs, floating IPs, etc.

I think we need to decide one critical thing to fix this kind of issue: how to reconcile the difference in working principle between the K8s operator and the OpenStack conductor.

If we respect the OpenStack state transition logic (once a cluster has failed, there is no automatic retry and its status is not updated automatically), we have to find a way to stop reconciling the CAPI cluster whose corresponding COE cluster status is *_FAILED.
If we follow the K8s operator's reconciliation logic, it means we let m-capi update the cluster status from *_FAILED to *_IN_PROGRESS.
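To make the two options concrete, here is a minimal sketch of how each policy would treat a cluster status. It is only an illustration, not the magnum-cluster-api implementation: `Policy`, `split_status`, `next_status` and the `capi_ready` flag are hypothetical names, and the status strings are assumed to follow the `<OPERATION>_<STATE>` pattern used in this thread.

```python
from enum import Enum


class Policy(Enum):
    """Hypothetical switch between the two state models discussed above."""
    OPENSTACK = "openstack"    # *_FAILED is terminal, no automatic retry
    KUBERNETES = "kubernetes"  # *_FAILED may recover through reconciliation


def split_status(status: str) -> tuple[str, str]:
    """Split e.g. 'CREATE_IN_PROGRESS' into ('CREATE', 'IN_PROGRESS')."""
    for suffix in ("_IN_PROGRESS", "_FAILED", "_COMPLETE"):
        if status.endswith(suffix):
            return status[: -len(suffix)], suffix[1:]
    raise ValueError(f"unrecognised status: {status}")


def next_status(current: str, capi_ready: bool, policy: Policy) -> str:
    """Return the status a cluster would move to under the given policy."""
    operation, state = split_status(current)
    if state == "FAILED":
        if policy is Policy.OPENSTACK:
            # OpenStack model: a failed cluster stays failed.
            return current
        # K8s model: the controller keeps reconciling, so the status follows it.
        return f"{operation}_COMPLETE" if capi_ready else f"{operation}_IN_PROGRESS"
    if state == "IN_PROGRESS" and capi_ready:
        return f"{operation}_COMPLETE"
    return current
```

Under the OpenStack policy the sketch never leaves `CREATE_FAILED`, which is exactly where it diverges from a CAPI cluster that keeps reconciling in the background.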


mnaser commented on September 4, 2024

Those are some good thoughts, @okozachenko1203. I've given this some thought.

The update status loop only runs for a cluster whose status is *_IN_PROGRESS, which means we would unfortunately need to start another daemon/controller if we want to transition from failed to complete.
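The constraint can be pictured with a rough sketch, assuming clusters are plain dictionaries with a `status` key. `clusters_to_poll` is a hypothetical helper, not a function from Magnum or magnum-cluster-api:

```python
def clusters_to_poll(clusters: list[dict]) -> list[dict]:
    """Keep only the clusters a status loop of this kind would revisit."""
    return [c for c in clusters if c["status"].endswith("_IN_PROGRESS")]


clusters = [
    {"name": "a", "status": "CREATE_IN_PROGRESS"},
    {"name": "b", "status": "CREATE_FAILED"},
    {"name": "c", "status": "UPDATE_IN_PROGRESS"},
]

# Cluster "b" is never polled again, so nothing in this loop can move it
# from CREATE_FAILED to CREATE_COMPLETE, even if CAPI later succeeds.
polled = [c["name"] for c in clusters_to_poll(clusters)]
```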

Also, the other issue is that it would be pretty hard for us to "capture" all the possible ways the cluster can fail to create, which could leave us playing a catch-up game of always implementing more and more checks.

Since introducing another service to reconcile seems like a tall order (but not something we are against doing), perhaps for the short term being able to update the stack status and letting the resource stay in progress might be best.


fnpanic commented on September 4, 2024

I would suggest a different route.
As soon as we enter CREATE_FAILED, we should delete the cluster in CAPI.
As soon as the user deletes it, it will just get deleted from the Magnum DB.

Updating the reason (for example, "quota issues") will give the user a better experience and more consistency with magnum-heat.

Fail hard, fail fast.


okozachenko1203 commented on September 4, 2024

To decide whether a cluster has failed, we would have to define every possible failure scenario of the COE cluster as reflected by the CAPI controller, and define a checker for each of those conditions. It would be much more painful to implement such logic inside the OpenStack conductor, because it is more or less the K8s way of doing things.
For a long-term solution, we will eventually have a controller reconciling outside magnum-conductor. At the current stage, we will just expose all the CAPI cluster-related events as the COE cluster's status_reason, so users have a way to check what is happening underneath and why the cluster is stuck in *_IN_PROGRESS.
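As a rough illustration of the short-term approach, the sketch below folds a list of condition-like records into one status_reason string. It is an assumption-heavy sketch rather than the project's code: the records are plain dictionaries with `type`, `status`, `reason` and `message` keys, `format_condition` and `build_status_reason` are hypothetical helpers, and the 255-character cap is an arbitrary illustrative limit.

```python
from typing import Optional


def format_condition(cond: dict) -> str:
    """Render one condition as 'Type: Reason: message', skipping empty parts."""
    parts = [cond.get("type", "Unknown")]
    for key in ("reason", "message"):
        if cond.get(key):
            parts.append(cond[key])
    return ": ".join(parts)


def build_status_reason(conditions: list[dict], max_length: int = 255) -> Optional[str]:
    """Summarise every condition that is not 'True'; None when all are healthy."""
    problems = [format_condition(c) for c in conditions if c.get("status") != "True"]
    if not problems:
        return None
    # max_length is a hypothetical cap so the text fits a bounded field.
    return "; ".join(problems)[:max_length]


conditions = [
    {"type": "Ready", "status": "False", "reason": "QuotaExceeded",
     "message": "instance quota exceeded"},
    {"type": "ControlPlaneReady", "status": "True"},
]
reason = build_status_reason(conditions)
```

With a summary like this, the cluster can stay in *_IN_PROGRESS while the user still sees why it is stuck, without the conductor having to classify each failure as terminal or not.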
