[feature] Handle cases in which devices are receiving metrics but not reachable via ping #566

Open
nemesifier opened this issue Mar 15, 2024 · 1 comment
Labels: bug, enhancement

Comments

@nemesifier
Member

There can be a conflicting situation in which a device is not reachable on its management IP but is successfully sending metrics to the server.

Because of the recovery detection feature, this generates additional load on the server: as soon as metrics or checksum requests are received, the system schedules a ping check, because it believes it will be able to reach the device and set the status back to OK. The ping fails, however, so the status is never restored.

If many devices are in this situation, the monitoring queue can grow indefinitely until it consumes all available memory, at which point the server crashes.

We need to devise a way to spot these situations and set the status to "PROBLEM".

In this case, the ping check should not set the status to CRITICAL even if it cannot ping, unless no metrics were received for more than 10 minutes.

The device recovery mechanism should not be triggered if the status of the device is not critical.
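
The recovery rule above could be expressed as a small guard. This is only a sketch: the function name `should_trigger_recovery` and the lowercase status strings are hypothetical and are not taken from the existing codebase.

```python
# Hypothetical status values, used for illustration only.
STATUS_OK = "ok"
STATUS_PROBLEM = "problem"
STATUS_CRITICAL = "critical"


def should_trigger_recovery(current_status: str) -> bool:
    """Return True only when the device is currently marked critical.

    Metrics or checksum requests from a device in any other status
    would not schedule an extra ping check.
    """
    return current_status == STATUS_CRITICAL
```

With a guard like this, a device that is sending metrics but is marked PROBLEM would no longer add a ping task to the queue on every request.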

Maybe we could solve this by simply modifying the ping check so that it looks at whether the device has recently been sending monitoring metrics before deciding whether to set the status to CRITICAL or PROBLEM.
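
A minimal sketch of that decision, assuming the ping check can look up the time of the last metric received from the device. The function `status_after_ping`, its parameters, and the status strings are hypothetical names used for illustration; only the 10-minute threshold comes from the description above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Threshold taken from the proposal: more than 10 minutes without metrics.
METRICS_GRACE_PERIOD = timedelta(minutes=10)


def status_after_ping(
    ping_ok: bool,
    last_metric_at: Optional[datetime],
    now: Optional[datetime] = None,
) -> str:
    """Pick the device status after a ping check (hypothetical helper)."""
    if ping_ok:
        return "ok"
    now = now or datetime.now(timezone.utc)
    # Ping failed, but metrics arrived recently: the device is alive
    # and only unreachable on the management IP.
    if last_metric_at is not None and now - last_metric_at <= METRICS_GRACE_PERIOD:
        return "problem"
    # Ping failed and no metrics for more than 10 minutes (or never).
    return "critical"
```

For example, a failed ping with metrics received 5 minutes ago would yield "problem", while a failed ping with the last metric 11 minutes ago would yield "critical".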

@nemesifier added the bug and enhancement labels on Mar 15, 2024
@SanjayKumar-M

Hey @nemesifier, I would like to work on this!

Projects
Status: To do (general)
Development

No branches or pull requests

2 participants