As part of an effort to standardize operators observability, in order to be able to cr

Created the following PR with more details <a class="issue-link js-issue-link" data-er

[RFE] Operators Health Metric,about openshift/enhancements

jan--f commented on September 15, 2024

I very much like the idea of having a standard way of reporting operator health that has a low barrier of entry.

I don't think though that alerts are the right trigger for this.

I agree that operators already vary in how they implement health metrics. I presume that asking developers to add alert labels signifying impact on operator health will produce results that are just as varied.
Existing alerts often refer to the operands, rarely the operator itself. Unhealthy operands (i.e. firing alerts) don't necessarily reflect on the operator health.
OLM would have to query an endpoint providing the ALERTS metric. The question is where an operator's metrics will be kept. The OpenShift in-cluster stack is probably what comes to mind, but that doesn't have to be universally true.
There is currently work being done to add an agent mode for Prometheus (Prometheus already has it; prometheus-operator support is a WIP). A Prometheus in agent mode has no local querying capabilities, i.e. no local alerting, in which case this approach will no longer provide insight.

As I said, I do like this effort! Maybe as a first step it's worth discussing what operator health actually means, as opposed to the health of the operands or of a particular CR instance? A CR might not be healthy (and it might expose that through its status subresource), but the operator that reconciles changes to that CR might be running just fine. To me it's worth mapping out the difference.

from enhancements.

sradco commented on September 15, 2024

> I very much like the idea of having a standard way of reporting operator health that has a low barrier of entry.
>
> I don't think though that alerts are the right trigger for this.
>
> I agree that operators already vary in how they implement health metrics. I presume asking developers to add alert labels signifying impact on operator health will be just as varied.

@jan--f Thank you for the review.

I think that the developer who creates the alert knows whether the alert impacts the health of the whole operator. It's even something that they must explain in the "Impact" section of the alert's runbook.

I think we can consider how to implement this and what the label should be called, but it's a very important label for the alert itself, regardless of the health metric.

Critical alerts definition from the OCP docs:

> For alerting current and impending disaster situations. These alerts page an SRE. The situation should warrant waking someone in the middle of the night.
>
> Reserve critical level alerts only for reporting conditions that may lead to
> loss of data or inability to deliver service for the cluster as a whole.
> Failures of most individual components should not trigger critical level alerts,
> unless they would result in either of those conditions. Configure critical level
> alerts so they fire before the situation becomes irrecoverable. Expect users to
> be notified of a critical alert within a short period of time after it fires so
> they can respond with corrective action quickly.

Having critical alerts for an operator means, by definition, that the operator is unhealthy and should be "Red".
There may be warning alerts that also impact the operator health but don't impact the cluster as a whole or cause loss of data.

Adding a label that indicates a health issue for the operator can help with creating work pipelines in Alertmanager, flag these alerts as important to fix, and turn the operator health to "Yellow".
So I agree that the new label should be well defined and mean that the alert impacts the operator health.

The added value of the label would be the ability to calculate the operator health in a generic way.
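
As a rough illustration of such a generic calculation, here is a minimal Python sketch. The label name `operator_health_impact`, its `"true"` value, and the "Green" state are assumptions made for the example; the discussion above only names "Red" and "Yellow" and leaves the label name open.

```python
from typing import Iterable, Mapping

# Hypothetical label name; the actual name has not been decided.
HEALTH_LABEL = "operator_health_impact"


def operator_health(firing_alerts: Iterable[Mapping[str, str]]) -> str:
    """Derive one operator's health from the label sets of its firing alerts.

    Any critical alert means "Red". Otherwise, any alert carrying the
    health-impact label means "Yellow". "Green" (an assumed default) means
    no firing alert affects the operator health.
    """
    health = "Green"
    for labels in firing_alerts:
        if labels.get("severity") == "critical":
            return "Red"
        if labels.get(HEALTH_LABEL) == "true":
            health = "Yellow"
    return health
```

For example, a single firing warning alert that carries the label yields "Yellow", while a warning alert without the label leaves the operator "Green".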

> Existing alerts often refer to the operands, rarely the operator itself. Unhealthy operands (i.e. firing alerts) don't necessarily reflect on the operator health.

I think you are correct. The label should not be about whether the alert concerns the operand or the operator; it should mean that the alert impacts the operator health, that is, its ability to perform the basic functions that it should.

> OLM would have to query an endpoint providing the ALERTS metric. The question will be where an operator's metrics will be kept. The OpenShift in-cluster stack is probably what comes to mind, but that doesn't have to be universally true.

That is true. But the health label will have value even without the health metric.
All the metrics that we provide are based on the Prometheus operator. We do have the basic metric from OLM.
I look at the operator health metric like the other metrics that we provide. Is that not the correct way?
It's the only way for us to get data that is not just instantaneous but also reflects the system over time.
Since the system is built to fix itself, having an alerts-based metric would allow us to see issues that the system was unable to fix.
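
To make the ALERTS-based approach more concrete, the sketch below builds an instant query for firing alerts against the Prometheus HTTP API (`/api/v1/query`) and extracts the label sets from a decoded response. The base URL, the use of a `namespace` label to scope alerts to one operator, and the helper names are assumptions for illustration; this is not a description of how OLM works.

```python
from typing import Any, Dict, List
from urllib.parse import urlencode


def alerts_query_url(base_url: str, namespace: str) -> str:
    """Build an instant-query URL for the firing alerts in one namespace."""
    # ALERTS is the synthetic series Prometheus exposes for pending and
    # firing alerts. Scoping by a "namespace" label is an assumption here.
    promql = f'ALERTS{{alertstate="firing", namespace="{namespace}"}}'
    query_string = urlencode({"query": promql})
    root = base_url.rstrip("/")
    return f"{root}/api/v1/query?{query_string}"


def firing_alert_labels(response: Dict[str, Any]) -> List[Dict[str, str]]:
    """Return the label set of every sample in a decoded query response."""
    if response.get("status") != "success":
        error = response.get("error", "unknown error")
        raise ValueError(f"query failed: {error}")
    return [sample["metric"] for sample in response["data"]["result"]]
```

The label sets returned by `firing_alert_labels` are what a health calculation would consume. Whether the queried Prometheus is the in-cluster stack or a remote instance is exactly the open question raised above.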

> There is currently work being done, adding an agent mode for Prometheus (Prometheus already has it, prometheus-operator support is a WIP). A Prometheus in agent mode has no local querying capabilities, i.e. no local alerting, in which case this approach will no longer provide insight.

That is very good to know. Thank you.
It means that the data is being sent to the remote Prometheus, right?
Wouldn't the remote Prometheus still have the alerting capabilities?
Would the OCP UI query the remote instance?

> As I said, I do like this effort! Maybe as a first step it's worth having a discussion what operator health actually means as opposed to health of the operands or a particular CR instance? A CR might not be healthy (and it might expose that through its status subresource) but the operator that reconciles changes to that CR might be running just fine. To me it's worth mapping out the difference.

Yes. I will be happy to have a discussion about this.

sradco commented on September 15, 2024

@jan--f I would appreciate it if you could review my comments above and add others that can review this proposal.

openshift-bot commented on September 15, 2024

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

kmajcher-rh commented on September 15, 2024

/remove-lifecycle stale

openshift-bot commented on September 15, 2024

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

sradco commented on September 15, 2024

/remove-lifecycle stale

sradco commented on September 15, 2024

Created the following PR with more details #1280

openshift-bot commented on September 15, 2024

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

sradco commented on September 15, 2024

/remove-lifecycle stale

dhellmann commented on September 15, 2024

@sradco, I've seen this PR go stale a few times now. Is this on the roadmap for an upcoming release? If so, who needs to be reviewing it so we can move forward? If not, let's let it close until we're ready to work on it so we can keep the active review list cleared.

sradco commented on September 15, 2024

Hi @dhellmann, I created a PR, #1280, and asked for reviews based on @jan--f's suggestion.
I would appreciate it if others could look at it.

I believe this can bring value to Red Hat customers on the UI side and also to community operators that work with the Prometheus stack, since the labels that are added can be used for better routing in Alertmanager.

openshift-bot commented on September 15, 2024

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented on September 15, 2024

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-bot commented on September 15, 2024

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

openshift-ci commented on September 15, 2024

@openshift-bot: Closing this issue.

In response to this:

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
