Comments (16)
I very much like the idea of having a standard way of reporting operator health that has a low barrier of entry.
I don't think though that alerts are the right trigger for this.
- I agree that operators already vary in how they implement health metrics. I presume asking developers to add alert labels signifying impact on operator health will be just as varied.
- Existing alerts often refer to the operands, rarely the operator itself. Unhealthy operands (i.e. firing alerts) don't necessarily reflect on the operator health.
- OLM would have to query an endpoint providing the ALERTS metric. The question will be where an operator's metrics are kept. The OpenShift in-cluster stack is probably what comes to mind, but that doesn't have to be universally true.
- There is currently work being done to add an agent mode for Prometheus (Prometheus already has it; prometheus-operator support is a WIP). A Prometheus in agent mode has no local querying capabilities, i.e. no local alerting, in which case this approach will no longer provide insight.
As I said, I do like this effort! Maybe as a first step it's worth having a discussion about what operator health actually means, as opposed to the health of the operands or of a particular CR instance? A CR might not be healthy (and it might expose that through its status subresource), but the operator that reconciles changes to that CR might be running just fine. To me it's worth mapping out the difference.
from enhancements.
I very much like the idea of having a standard way of reporting operator health that has a low barrier of entry.
I don't think though that alerts are the right trigger for this.
- I agree that operators already vary in how they implement health metrics. I presume asking developers to add alert labels signifying impact on operator health will be just as varied.
@jan--f Thank you for the review.
I think that the developer who creates the alert knows whether the alert impacts the health of the whole operator. It's even something that they must explain in the "Impact" section of the alert's runbook.
I think we can consider how to implement this and what the label should be called, but it's a very important label for the alert itself, regardless of the health metric.
Critical alerts definition from the OCP docs:
For alerting current and impending disaster situations. These alerts page an SRE. The situation should warrant waking someone in the middle of the night.
Reserve critical level alerts only for reporting conditions that may lead to
loss of data or inability to deliver service for the cluster as a whole.
Failures of most individual components should not trigger critical level alerts,
unless they would result in either of those conditions. Configure critical level
alerts so they fire before the situation becomes irrecoverable. Expect users to
be notified of a critical alert within a short period of time after it fires so
they can respond with corrective action quickly.
Having critical alerts for an operator means by definition that the operator is unhealthy and should be "Red".
There may be warning alerts that also impact the operator health but don't impact the cluster as a whole or don't cause loss of data.
Adding a label that indicates a health issue for the operator can help with building routing pipelines in Alertmanager, signal that these alerts are important to fix, and turn the operator health to "Yellow".
So I agree that the new label should be well defined and mean that the alert impacts the operator health.
The added value of the label would be the ability to calculate the operator health in a generic way.
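As a rough sketch of that generic calculation: the function below maps the label sets of an operator's firing alerts to "Red", "Yellow", or "Green", following the rules described above (any critical alert means Red; a warning alert carrying the health label means Yellow). The label name `operator_health_impact` and its value `"true"` are hypothetical, since the name is still open for discussion, and the function is an illustration rather than a proposed implementation.

```python
from typing import Iterable, Mapping

# Hypothetical label name and value; the proposal has not settled on either.
HEALTH_LABEL = "operator_health_impact"
HEALTH_LABEL_VALUE = "true"


def operator_health(firing_alerts: Iterable[Mapping[str, str]]) -> str:
    """Return "Red", "Yellow", or "Green" for one operator.

    Each item in firing_alerts is the label set of one firing alert.
    """
    health = "Green"
    for labels in firing_alerts:
        severity = labels.get("severity")
        # A critical alert makes the operator unhealthy by definition.
        if severity == "critical":
            return "Red"
        # A warning alert only degrades health when it carries the label.
        if severity == "warning" and labels.get(HEALTH_LABEL) == HEALTH_LABEL_VALUE:
            health = "Yellow"
    return health
```

Warning alerts without the label leave the result at "Green", so existing alerts would not change an operator's reported health until their authors opt in.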
- Existing alerts often refer to the operands, rarely the operator itself. Unhealthy operands (i.e. firing alerts) don't necessarily reflect on the operator health.
I think you are correct. The label should not be about whether the alert concerns the operand or the operator; it should mean that the alert impacts the operator's health, i.e. its ability to perform its basic functions.
- OLM would have to query an endpoint providing the ALERTS metric. The question will be where an operator's metrics are kept. The OpenShift in-cluster stack is probably what comes to mind, but that doesn't have to be universally true.
That is true. But the health label will have value even without the health metric.
All the metrics that we provide are based on the Prometheus operator. We do have the basic metric from OLM.
I look at the operator health metric like the other metrics that we provide. Is that not the correct way?
It's the only way for us to get data that is not just an instantaneous snapshot but also reflects the system over time.
Since the system is built to fix itself, having an alert-based metric would allow us to see issues that the system was unable to fix.
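To illustrate what such an alert-based query could look like, here is a small helper that builds a PromQL expression over the ALERTS series that Prometheus generates for pending and firing alerts. It assumes the operator's alert rules set a `namespace` label, and the `operator_health_impact` label name is hypothetical; neither is fixed by this proposal.

```python
def firing_alerts_query(namespace: str,
                        health_label: str = "operator_health_impact") -> str:
    """Build a PromQL expression counting firing alerts for one namespace.

    The health_label default is a hypothetical name used for illustration.
    """
    # ALERTS carries each alert's own labels plus alertname and alertstate.
    selector = f'ALERTS{{alertstate="firing", namespace="{namespace}"}}'
    # Group by severity and the health label so a consumer can derive a status.
    return f"count by (severity, {health_label}) ({selector})"
```

Evaluating the resulting expression over a time range rather than at a single instant is what would show issues that persisted instead of being fixed automatically.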
- There is currently work being done to add an agent mode for Prometheus (Prometheus already has it; prometheus-operator support is a WIP). A Prometheus in agent mode has no local querying capabilities, i.e. no local alerting, in which case this approach will no longer provide insight.
That is very good to know. Thank you.
It means that the data is being sent to a remote Prometheus, right?
Wouldn't the remote Prometheus still have the alerting capabilities?
Would the OCP UI query the remote instance?
As I said, I do like this effort! Maybe as a first step it's worth having a discussion about what operator health actually means, as opposed to the health of the operands or of a particular CR instance? A CR might not be healthy (and it might expose that through its status subresource), but the operator that reconciles changes to that CR might be running just fine. To me it's worth mapping out the difference.
Yes. Will be happy to have a discussion about this.
from enhancements.
@jan--f I would appreciate it if you could review my comments above and add others that can review this proposal.
from enhancements.
Inactive enhancement proposals go stale after 28d of inactivity.
See https://github.com/openshift/enhancements#life-cycle for details.
Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.
If this proposal is safe to close now please do so with /close.
/lifecycle stale
from enhancements.
/remove-lifecycle stale
from enhancements.
/remove-lifecycle stale
from enhancements.
I created the following PR with more details: #1280
from enhancements.
/remove-lifecycle stale
from enhancements.
@sradco, I've seen this PR go stale a few times now. Is this on the roadmap for an upcoming release? If so, who needs to be reviewing it so we can move forward? If not, let's let it close until we're ready to work on it so we can keep the active review list cleared.
from enhancements.
Hi @dhellmann, I created a PR, #1280, and asked for reviews based on @jan--f's suggestion.
I would appreciate it if others could look at it.
I believe this can bring value to Red Hat customers on the UI side and also to community operators that work with the Prometheus stack, since the labels that are added can be used for better routing in Alertmanager.
from enhancements.
Stale enhancement proposals rot after 7d of inactivity.
See https://github.com/openshift/enhancements#life-cycle for details.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.
If this proposal is safe to close now please do so with /close.
/lifecycle rotten
/remove-lifecycle stale
from enhancements.
Rotten enhancement proposals close after 7d of inactivity.
See https://github.com/openshift/enhancements#life-cycle for details.
Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.
/close
from enhancements.
@openshift-bot: Closing this issue.
In response to this:
Rotten enhancement proposals close after 7d of inactivity.
See https://github.com/openshift/enhancements#life-cycle for details.
Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
from enhancements.