Describe the bug: Sometimes alarms are incorrectly shown as active.


Incorrect active alarms in Cloud about netdata-cloud HOT 37 CLOSED

netdata commented on September 25, 2024

Incorrect active alarms in Cloud

from netdata-cloud.

Comments (37)

jurgenhaas commented on September 25, 2024

Is there any progress on this? Just wondering because my cloud instance shows old and wrong alarms on almost every node, and that unfortunately makes the cloud dashboard useless.


Slind14 commented on September 25, 2024

FYI the issue with the empty alerts is still present with 1.30


lvrach commented on September 25, 2024

This is still a high priority for us.

We have identified 2 systems in our infrastructure that can cause messages to be lost and lead to incorrect alarm status in the cloud.

We are now working on both fixing those systems and implementing a mechanism that auto-detects and fixes such inconsistencies.


Slind14 commented on September 25, 2024

looks good


stelfrag commented on September 25, 2024

Hi @Slind14 , @mattpson

Would it be possible for you to share the health-log.db and health-log.db.old files, located at:

/var/lib/netdata/health/health-log.db
/var/lib/netdata/health/health-log.db.old

(prefix that with your installation path if different)

Those files contain a longer alarm history (http://localhost:19999/api/v1/alarm_log returns only a portion of it).

If you choose to share, you can send the files directly to [email protected]
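
For anyone who wants to see what the agent itself reports before sending files, here is a minimal Python sketch that reads the alarm_log endpoint mentioned above and counts entries per status. It assumes the endpoint returns a JSON array of objects and that each object carries a `status` field; both are assumptions rather than something confirmed in this thread, and the function names are illustrative only.

```python
import json
from collections import Counter
from urllib.request import urlopen

# URL taken from the comment above; adjust host and port for your agent.
ALARM_LOG_URL = "http://localhost:19999/api/v1/alarm_log"


def fetch_alarm_log(url=ALARM_LOG_URL, timeout=5):
    """Download and parse the agent's alarm log (assumed to be a JSON array)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def count_by_status(entries):
    """Count log entries per status; the 'status' key is an assumption."""
    return Counter(entry.get("status", "UNKNOWN") for entry in entries)
```

Running `count_by_status(fetch_alarm_log())` on the node gives a quick per-status tally of what the agent has logged, which can then be compared by eye with what the Cloud displays.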


commented on September 25, 2024

We have identified some inconsistencies between the agent's and the cloud's alarms.

You could try restarting the agent, but this isn't a guaranteed way of mitigating this issue.

The team is currently implementing a more robust solution that most probably will fix these inconsistencies once and for all.


ssm3e30 commented on September 25, 2024

Same issues here, very old alarms I cannot clear off the dashboard for individual nodes.


jurgenhaas commented on September 25, 2024

Great news @lvrach thanks for letting us know.


netdata-community-bot commented on September 25, 2024

This issue has been mentioned on the Netdata Community. There might be relevant details there:

https://community.netdata.cloud/t/a-lot-of-old-alerts-without-any-value/780/3


netdata-community-bot commented on September 25, 2024

This issue has been mentioned on the Netdata Community. There might be relevant details there:

https://community.netdata.cloud/t/metric-shows-all-data-as-empty/837/5


Slind14 commented on September 25, 2024

This has been ongoing for quite some time and there are no workarounds for the cloud part. Any chance this could receive some attention?


commented on September 25, 2024

Hi @Slind14,

As @lvrach said, we are actively working on improving the cloud's alarm feature.
Expect most of your issues to be mitigated by the end of next week.

Rest assured that we are investing both time and effort in implementing a more robust solution that will solve these inconsistencies once and for all.

Once we have an update we will let you know and ask for your valuable feedback 😁


nicolasparada commented on September 25, 2024

@jurgenhaas @ssm3e30 @Slind14 We recently applied an automatic fix to all nodes with this kind of problem. Can you confirm whether you are still having these issues?


jurgenhaas commented on September 25, 2024

@nicolasparada I noticed that a few days ago and was excited. It was looking good. I went through my 7 spaces just now, and in one of them there is a single host which still has an old alarm present. All others are OK.

Since that cloud update we have the new issue that after a longer period of time the cloud dashboard becomes unresponsive if it was open in the background all the time. But that's probably unrelated?


nicolasparada commented on September 25, 2024

Great to hear. 1 in 7 is about the share we expected not to be fixed by this attempt, but we will fix them all soon.
The other issue seems unrelated, but we will look at it 👍
Thank you.


netdata-community-bot commented on September 25, 2024

This issue has been mentioned on the Netdata Community. There might be relevant details there:

https://community.netdata.cloud/t/entropy-alarm-outdated/1090/3


mattpson commented on September 25, 2024

I have the same problem. Ran out of memory in one of my VMs yesterday which triggered the alarm (and unfortunately the OOM killer killed Netdata). I have since then:

Resized the RAM on the VM
Restarted Netdata
Pondered why the alarm didn't go away
Restarted Netdata (many times, after each step below and more)
Removed the db files in /var/cache/netdata/dbengine (now the alarms refer to data that isn't available)
Removed the node from Cloud and reattached it (as a new one) with -id=$(uuidgen)

but the alarms persist. Is there anything else I can do, or is the problem entirely on the Cloud side?


commented on September 25, 2024

The issue resides entirely in the Cloud.
We applied a patch earlier today that fixed most of the cases.
Can you provide us with your space id so we can have a closer look at your case?


Slind14 commented on September 25, 2024

FYI for us it is totally skewed as well:


mattpson commented on September 25, 2024

@harrisbz the space/room/node is spaces/trustno1/rooms/general/nodes/24de4bea-3f29-433a-a531-f80eedfe2dd0


mattpson commented on September 25, 2024

Files sent.


stelfrag commented on September 25, 2024

> Files sent.

Thank you -- received!


stelfrag commented on September 25, 2024

Hi @mattpson

I have created a draft PR netdata/netdata#10822 that attempts to address this issue. If you can, please test this out and let me know if it fixes the problem for you so I can work on the permanent solution.


mattpson commented on September 25, 2024

Can confirm that the persistent alarms disappeared sometime in the last few days (I don't remember exactly when). So whatever was fixed seems to work, afaik.


stelfrag commented on September 25, 2024

> FYI the issue with the empty alerts is still present with 1.30

A fix will be included in the nightly build and in a patch release in the next few days.


odyslam commented on September 25, 2024

Thanks, @Slind14 for pitching in. Please do tell us if the patch fixes it for you. @stelfrag is vicious with bugs 💪


Slind14 commented on September 25, 2024

We still have those empty alerts with netdata v1.30.1-1-nightly. What is the version that should have fixed it?


stelfrag commented on September 25, 2024

> We still have those empty alerts with netdata v1.30.1-1-nightly. What is the version that should have fixed it?

Hi @Slind14, the version you are using should have addressed the issue. So what you are observing is:

On the local dashboard you do not see any alarms raised
You log in to the cloud and you see alarms reported for the agent
Restarting the agent repeats the above behavior

If possible, please send the following files to [email protected]
(prefix your installation directory if needed)

/var/lib/netdata/health/health-log.db
/var/lib/netdata/health/health-log.db.old

Those files will help us understand any inconsistency in the transition states that are reported to the cloud.
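
The symptom described in this comment (nothing raised locally, alarms still shown in the cloud) amounts to a set difference between the two views. A minimal sketch, assuming you have collected both lists yourself as (chart, alarm name) pairs; the function name and the sample data are hypothetical, and nothing here calls a Netdata API.

```python
def stale_cloud_alarms(local_active, cloud_active):
    """Return alarms the Cloud shows as raised but the agent does not.

    Both arguments are iterables of (chart, alarm_name) pairs. How the
    Cloud list is obtained is left open; this sketch assumes it was
    copied by hand from the Cloud alerts tab.
    """
    local = set(local_active)
    return sorted(pair for pair in set(cloud_active) if pair not in local)


# Hypothetical data, for illustration only.
local_active = [("system.ram", "ram_in_use")]
cloud_active = [
    ("system.ram", "ram_in_use"),
    ("mem.oom_kill", "oom_kill"),
]

for chart, name in stale_cloud_alarms(local_active, cloud_active):
    print(f"stale in Cloud: {name} on {chart}")
```

If the list stays non-empty after an agent restart, that matches the third observation above and suggests the stale state lives on the Cloud side.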


Slind14 commented on September 25, 2024

I have no idea what the local dashboard is showing. It is orchestrated without the local one.


Slind14 commented on September 25, 2024

Any update here? There are still hundreds of empty alerts.


odyslam commented on September 25, 2024

Hey @Slind14,

We are so very sorry for this experience. Did you send your files to the email provided by @stelfrag ?

If you haven't, please do, so that he can see in detail what is happening. Alarms are a whole area that we are actively reworking, both in the Agent and in the Cloud. We expect a lot of improvement over 2021 in both performance and features.


stelfrag commented on September 25, 2024

> Any update here? There are still hundreds of empty alerts.

Confirmed -- files received. I will examine them for inconsistencies and get back to you


stelfrag commented on September 25, 2024

Hi @Slind14! Sorry for the long delay on this.

There is indeed inconsistency in the files, a lot of alerts for mem.oom_kill and a lot of cgroup_ram_in_use

Example (I know it's cryptic)

The red highlighted area indicates that the "alert" is still active (no new update entry) and also critical (98.88%), so it will be submitted to the cloud. There is also no record in the file of this alert being removed or changing state.

So all the corresponding cgroup_monitoring_cadvisor-XXXXX-YYYY.mem_usage charts that generate the alerts vanish without a trace, with their last state being critical or warning.

We will keep working on this.

We will investigate adding a special option to prevent the agent from sending (Warning/Critical) alerts to the cloud when the agent-cloud-link is established (that will be off by default)
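
The pattern described in this comment, an alert whose last recorded state is warning or critical while its chart has disappeared, can be expressed as a small filter. The sketch below works on in-memory records, not on the actual health-log.db format (which this thread does not describe); the record keys, the function name, and the sample chart names are all hypothetical.

```python
def orphaned_alerts(transitions, existing_charts):
    """Find alerts stuck in WARNING/CRITICAL whose chart no longer exists.

    `transitions` is an iterable of dicts with the hypothetical keys
    'chart', 'name', 'when' (Unix timestamp) and 'status'.
    """
    latest = {}
    for entry in transitions:
        key = (entry["chart"], entry["name"])
        # Keep only the most recent transition per (chart, alert) pair.
        if key not in latest or entry["when"] >= latest[key]["when"]:
            latest[key] = entry
    return sorted(
        key
        for key, entry in latest.items()
        if entry["status"] in {"WARNING", "CRITICAL"}
        and entry["chart"] not in existing_charts
    )


# Hypothetical records, for illustration only.
transitions = [
    {"chart": "cgroup_a.mem_usage", "name": "cgroup_ram_in_use",
     "when": 100, "status": "WARNING"},
    {"chart": "cgroup_a.mem_usage", "name": "cgroup_ram_in_use",
     "when": 200, "status": "CRITICAL"},
    {"chart": "cgroup_b.mem_usage", "name": "cgroup_ram_in_use",
     "when": 150, "status": "CLEAR"},
]

# cgroup_a no longer exists, so its last CRITICAL state is orphaned.
print(orphaned_alerts(transitions, existing_charts={"cgroup_b.mem_usage"}))
```

Entries returned by such a filter are the ones that would keep being reported as active, since nothing ever records them clearing.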


Slind14 commented on September 25, 2024

@stelfrag thank you for taking a look.

How does this explain this:

For every alert there are 20-30 empty ones that never disappear.


dimko commented on September 25, 2024

Fixed, closing.


phiberoptick commented on September 25, 2024

Sorry to chime in on a closed issue, but I found this while trying to find a way to clear defunct alarms. I see there hasn't been any activity in quite a while, but it appears that I am having the same issue. As far as the agents are concerned, the alarms don't exist. Trying to click and follow them in the cloud interface leads to a page stating that the chart/metric doesn't exist (although the metrics do exist and are continuing to update).

In my case, I have suddenly (over the past couple of weeks) been getting constant floods of anomaly alarms that will almost immediately clear. They vary as to the metric they are alarming on. Usually ip, ram, hardirq etc.

The problem is that at one point while there were 60 or so alarms in the cloud, I was performing maintenance on all of the children (15 or so) as well as the parent. During this time, all the agents were updated and the servers were restarted.

The restarts were for resource increases and I think that briefly there was high RAM usage and Netdata may have been killed...unfortunately this may have also happened while it was temporarily unable to write to the disk as well. The reboots should've been clean and it appeared that way. Netdata seemed to start normally on the parent and children but ever since they came back up the cloud interface is still persisting the alerts (albeit with no actual values listed for the metric in the alerts tab list) while the children and parent nodes show no alarms and seem to be otherwise running normally...

Is there any news on this or way to clear the alarms out of the cloud interface? Restarting the agents doesn't seem to do anything except change the "Triggered" time to when the agent was restarted.


MrZammler commented on September 25, 2024

Just a note: we cleared this issue with @car12o (thanks man), and we're working on a plan to deal with these situations.
