Comments (37)
Is there any progress on this? I ask because my cloud instance shows old and wrong alarms on almost every node, which unfortunately makes the cloud dashboard useless.
from netdata-cloud.
FYI the issue with the empty alerts is still present with 1.30
from netdata-cloud.
This is still a high priority for us.
We have identified 2 systems in our infrastructure that can cause messages to be lost and lead to incorrect alarm status in the cloud.
Now we are working on both fixing those systems and implementing a mechanism that auto-detects and fixes such inconsistencies.
from netdata-cloud.
looks good
from netdata-cloud.
Would it be possible for you to share the health-log.db
and health-log.db.old
files located under
/var/lib/netdata/health/health-log.db
/var/lib/netdata/health/health-log.db.old
(prefix that with your installation path if different)
Those files contain a longer alarm history (http://localhost:19999/api/v1/alarm_log
returns only a portion of it).
If you choose to share, you can send the files directly to [email protected]
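For anyone who wants a quick look at what the local agent reports before sending files, here is a minimal Python sketch that reads the `alarm_log` endpoint mentioned above and counts entries per status. It assumes the agent listens on the default `localhost:19999` and that each log entry is a JSON object with a `status` field. The helper names `fetch_alarm_log` and `summarize` are made up for illustration and are not part of Netdata.

```python
import json
from urllib.request import urlopen

# Endpoint taken from the comment above; adjust host/port for your setup.
ALARM_LOG_URL = "http://localhost:19999/api/v1/alarm_log"


def fetch_alarm_log(url=ALARM_LOG_URL, timeout=10):
    """Fetch and parse the agent's alarm log (assumed to be a JSON list)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def summarize(entries):
    """Count log entries per status.

    Assumption: each entry is a dict with a 'status' key such as
    'CLEAR', 'WARNING' or 'CRITICAL'. Entries without it count as 'UNKNOWN'.
    """
    counts = {}
    for entry in entries:
        status = entry.get("status", "UNKNOWN")
        counts[status] = counts.get(status, 0) + 1
    return counts
```

On a machine running the agent, `summarize(fetch_alarm_log())` gives a rough picture of how many transitions the agent has recorded in each state, which can then be compared with what the cloud shows.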
from netdata-cloud.
We have identified some inconsistencies between the agent's and the cloud's alarms.
You could try restarting the agent, but this isn't a guaranteed way of mitigating this issue.
The team is currently implementing a more robust solution that most probably will fix these inconsistencies once and for all.
from netdata-cloud.
Same issues here, very old alarms I cannot clear off the dashboard for individual nodes.
from netdata-cloud.
Great news @lvrach thanks for letting us know.
from netdata-cloud.
This issue has been mentioned on the Netdata Community. There might be relevant details there:
https://community.netdata.cloud/t/a-lot-of-old-alerts-without-any-value/780/3
from netdata-cloud.
This issue has been mentioned on the Netdata Community. There might be relevant details there:
https://community.netdata.cloud/t/metric-shows-all-data-as-empty/837/5
from netdata-cloud.
This has been ongoing for quite some time and there are no workarounds for the cloud part. Any chance this could receive some attention?
from netdata-cloud.
Hi @Slind14,
as @lvrach said, we are actively working on improving the cloud's alarm feature.
Expect most of your issues to be mitigated by the end of next week.
Rest assured that we are investing both time and effort in implementing a more robust solution that will solve these inconsistencies once and for all.
Once we have an update we will let you know and ask for your valuable feedback 😁
from netdata-cloud.
@jurgenhaas @ssm3e30 @Slind14 We recently applied an automatic fix for all nodes with this kind of problem. Can you confirm whether you are still having these issues?
from netdata-cloud.
@nicolasparada I noticed that a few days ago and was excited. It was looking good. I went through my 7 spaces just now, and in one of them there is a single host that still has an old alarm present. All others are OK.
Since that cloud update we have the new issue that after a longer period of time the cloud dashboard becomes unresponsive if it was open in the background all the time. But that's probably unrelated?
from netdata-cloud.
Great to hear. 1 in 7 is about the share we expected this attempt not to fix, but we will fix them all soon.
The other issue seems unrelated, but we will look at it 👍
Thank you.
from netdata-cloud.
This issue has been mentioned on the Netdata Community. There might be relevant details there:
https://community.netdata.cloud/t/entropy-alarm-outdated/1090/3
from netdata-cloud.
I have the same problem. Ran out of memory in one of my VMs yesterday which triggered the alarm (and unfortunately the OOM killer killed Netdata). I have since then:
- Resized the RAM on the VM
- Restarted Netdata
- Pondered why the alarm didn't go away
- Restarted Netdata (many times, after each step below and more)
- Removed the db files in /var/cache/netdata/dbengine (now the alarms refer to data that isn't available)
- Removed the node from Cloud and reattached it (as a new one) with -id=$(uuidgen)
but the alarms persist. Is there anything else I can do or is the problem entirely on the Cloud side?
from netdata-cloud.
The issue resides entirely in the Cloud.
We applied a patch earlier today that fixed most of the cases.
Can you provide us with your space id so we can have a closer look at your case?
from netdata-cloud.
FYI for us it is totally skewed as well:
from netdata-cloud.
@harrisbz the space/room/node is spaces/trustno1/rooms/general/nodes/24de4bea-3f29-433a-a531-f80eedfe2dd0
from netdata-cloud.
Files sent.
from netdata-cloud.
Files sent.
Thank you -- received!
from netdata-cloud.
Hi @mattpson
I have created a draft PR netdata/netdata#10822 that attempts to address this issue. If you can, please test this out and let me know if it fixes the problem for you so I can work on the permanent solution.
from netdata-cloud.
Can confirm that the persistent alarms disappeared sometime in the last few days (don't remember exactly when). So whatever was fixed seems to work, afaik.
from netdata-cloud.
FYI the issue with the empty alerts is still present with 1.30
A fix will be included in the nightly build and in a patch release in the next few days.
from netdata-cloud.
Thanks, @Slind14 for pitching in. Please do tell us if the patch fixes it for you. @stelfrag is vicious with bugs 💪
from netdata-cloud.
We still have those empty alerts with netdata v1.30.1-1-nightly
What is the version that should have fixed it?
from netdata-cloud.
We still have those empty alerts with
netdata v1.30.1-1-nightly
What is the version that should have fixed it?
Hi @Slind14 the version you are using should have addressed the issue. So what you are observing is:
- On the local dashboard you do not see any alarms raised
- You log in to the cloud and you see alarms reported for the agent
- Restarting the agent repeats the above behavior
If possible, please send the following files to [email protected]
(prefix your installation directory if needed)
- /var/lib/netdata/health/health-log.db
- /var/lib/netdata/health/health-log.db.old
those files will help us understand any inconsistency in the transition states that are reported to the cloud
from netdata-cloud.
I have no idea what the local dashboard is showing. It is orchestrated without the local one.
from netdata-cloud.
Any update here? There are still hundreds of empty alerts.
from netdata-cloud.
Hey @Slind14,
We are so very sorry for this experience. Did you send your files to the email provided by @stelfrag?
If you haven't, please do, so he can see in detail what is happening. Alarms are a whole area that we are actively reworking, both in the Agent and in the Cloud. We expect a lot of improvement over 2021 in both performance and features.
from netdata-cloud.
Any update here? There are still hundreds of empty alerts.
Confirmed -- files received. I will examine them for inconsistencies and get back to you
from netdata-cloud.
Hi @Slind14! Sorry for the long delay on this.
There is indeed an inconsistency in the files: a lot of alerts for mem.oom_kill and a lot for cgroup_ram_in_use.
Example (I know it's cryptic)
The red highlighted area indicates that the "alert" is still active (no new update entry) and also critical (98.88%), so it will be submitted to the cloud. There is also no record in the file of this alert being removed or changing state.
So all the corresponding cgroup_monitoring_cadvisor-XXXXX-YYYY.mem_usage
charts that generate the alerts vanish without a trace, with their last state being critical or warning.
We will keep working on this.
We will investigate adding a special option to prevent the agent from sending (Warning/Critical) alerts to the cloud when the agent-cloud-link is established (that will be off by default)
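The inconsistency described in this comment (an alert whose newest recorded transition is still warning or critical, with no later entry clearing it) can be illustrated with a short Python sketch. It works on a plain list of transition records and is not how the agent or the cloud detects the problem. The field names `chart`, `name`, `when` and `status`, and the function name `find_stuck_alerts`, are assumptions chosen for illustration.

```python
# Statuses that would still be reported as raised (assumed spelling).
ACTIVE = {"WARNING", "CRITICAL"}


def find_stuck_alerts(entries):
    """Return {(chart, name): newest_entry} for alerts left in an active state.

    Assumption: each entry is a dict with 'chart', 'name', 'when'
    (a sortable timestamp) and 'status'. An alert counts as stuck when
    its newest entry is WARNING or CRITICAL, meaning no later transition
    was ever recorded for it.
    """
    latest = {}
    for entry in entries:
        key = (entry["chart"], entry["name"])
        # Keep only the newest transition seen so far for each alert.
        if key not in latest or entry["when"] >= latest[key]["when"]:
            latest[key] = entry
    return {key: e for key, e in latest.items() if e["status"] in ACTIVE}
```

Fed with transitions from a chart that disappeared while critical, the function returns that alert, while an alert that later went back to a clear state is left out.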
from netdata-cloud.
@stelfrag thank you for taking a look.
For every alert there are 20-30 empty ones that never disappear.
from netdata-cloud.
Fixed, closing.
from netdata-cloud.
Sorry to chime in on a closed issue, but I found this while trying to find a way to clear defunct alarms. I see there hasn't been any activity in quite a while, but it appears that I am having the same issue. As far as the agents are concerned, the alarms don't exist. Trying to click and follow them in the cloud interface leads to a page stating that the chart/metric doesn't exist (although the metrics do exist and are continuing to update).
In my case, I have suddenly (over the past couple of weeks) been getting constant floods of anomaly alarms that will almost immediately clear. They vary as to the metric they are alarming on. Usually ip, ram, hardirq etc.
The problem is that at one point while there were 60 or so alarms in the cloud, I was performing maintenance on all of the children (15 or so) as well as the parent. During this time, all the agents were updated and the servers were restarted.
The restarts were for resource increases and I think that briefly there was high RAM usage and Netdata may have been killed...unfortunately this may have also happened while it was temporarily unable to write to the disk as well. The reboots should've been clean and it appeared that way. Netdata seemed to start normally on the parent and children but ever since they came back up the cloud interface is still persisting the alerts (albeit with no actual values listed for the metric in the alerts tab list) while the children and parent nodes show no alarms and seem to be otherwise running normally...
Is there any news on this or way to clear the alarms out of the cloud interface? Restarting the agents doesn't seem to do anything except change the "Triggered" time to when the agent was restarted.
from netdata-cloud.
Just a note: we cleared this issue with @car12o (thanks man), and we're working on a plan to deal with these situations.
from netdata-cloud.