Nagios wrongly reports packet loss

After you have checked for packet loss with different tools, first find out which plugin is actually performing the packet loss check. Locate that plugin and run it manually at the interval defined in Nagios, then see whether its output gives you a clue. The problem doesn't seem to be real packet loss but a faulty plugin. Once you have the plugin output, compare it with the output of the other tools (to see whether the plugin reports packet loss when the others don't). Usually the plugin is check_ping.
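The manual runs described above can be scripted. The following is a minimal sketch, not a definitive tool: it assumes the plugin lives at `/usr/lib64/nagios/plugins/check_ping` (the path varies by distribution), uses commonly seen example thresholds, and assumes a 5-minute interval. The helper names `parse_packet_loss`, `run_check` and `sample` are made up for illustration.

```python
import re
import subprocess
import time

# Assumed plugin location; adjust to wherever your distribution installs it.
CHECK_PING = "/usr/lib64/nagios/plugins/check_ping"

# check_ping normally prints a line such as:
#   PING OK - Packet loss = 0%, RTA = 0.50 ms|rta=0.500000ms;... pl=0%;20;60;0
LOSS_RE = re.compile(r"Packet loss = (\d+)%")


def parse_packet_loss(output):
    """Return the packet loss percentage found in check_ping output, or None."""
    match = LOSS_RE.search(output)
    return int(match.group(1)) if match else None


def run_check(host, packets=5):
    """Run check_ping once; return (exit_code, loss_percent, raw_output)."""
    cmd = [
        CHECK_PING,
        "-H", host,
        "-w", "100.0,20%",   # warning thresholds: round-trip ms, loss % (example values)
        "-c", "500.0,60%",   # critical thresholds (example values)
        "-p", str(packets),
    ]
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return proc.returncode, parse_packet_loss(proc.stdout), proc.stdout.strip()


def sample(host, runs=10, interval_s=300):
    """Run the check repeatedly and print one timestamped line per run."""
    for _ in range(runs):
        code, loss, raw = run_check(host)
        print(time.strftime("%H:%M:%S"), host, "exit=%d" % code, "loss=%s" % loss, raw)
        time.sleep(interval_s)

# Example (hypothetical host name): sample("vpn-gw.example.org", runs=12)
```

Running this as the same user Nagios runs as, and comparing the timestamps with the alerts Nagios raised, should show whether the plugin itself ever reports loss outside of the scheduled checks.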

Alien Life Form
Author by

Alien Life Form

Updated on September 18, 2022

Comments

  • Alien Life Form
    Alien Life Form almost 2 years

    Lately, my Nagios 3.2.3 install (CentOS 5, monitoring ~300 hosts, 1150 services) has started to occasionally report high packet loss on 50-60 hosts at a time. Problem is, it's bogus. Manual runs of ping (or Nagios's own check_ping binary) find no fault with any of the affected hosts. The only possible cures I have found so far are:

    1. run all the checks manually (they will succeed but it may act up again on next check)
    2. acknowledge and wait for the problem to go away (may take several hours)

    I suspect (but have no particular reason other than single rescheduled checks succeeding) that the problem may lie with all the checks being mass scheduled together - in which case introducing some jitter in the scheduling (how?) might help. Or it may be something completely different.
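On the "how?": Nagios 3 exposes `nagios.cfg` directives such as `service_inter_check_delay_method`, `service_interleave_factor` and `max_concurrent_checks` that control how checks are spread out. According to the Nagios documentation, the "smart" delay method spaces initial checks by the average check interval divided by the number of services. The sketch below only illustrates that arithmetic. The 5-minute check interval is an assumption (the question does not state it), and the function names are made up for illustration.

```python
def smart_inter_check_delay(avg_check_interval_s, total_services):
    """Delay between consecutive checks if they are spread evenly over one interval."""
    if total_services <= 0:
        raise ValueError("total_services must be positive")
    return avg_check_interval_s / total_services


def spread_schedule(total_services, delay_s):
    """Start offsets (seconds) for each check when spaced by delay_s."""
    return [i * delay_s for i in range(total_services)]


# 1150 services (from the question), assumed 300 s check interval.
delay = smart_inter_check_delay(300.0, 1150)
offsets = spread_schedule(1150, delay)
print("delay between checks: %.3f s" % delay)
print("last check starts at: %.1f s" % offsets[-1])
```

If the delay method is set to `n` (none), all checks are scheduled with no spacing, which would match a burst of 50-60 checks firing at once. Whether that is what happens on this install is a guess that the configuration file would confirm or rule out.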

    Ideas, anyone?

    Edit:

    For people interested in constructive debate (rather than point scoring): I am not trying to measure packet loss. Network performance is not my concern in this instance, and if it were, it would be investigated with the proper tools for the job. NAGIOS (for the unwary) is mostly used to check upness in hosts and services and to generate alerts. When it starts generating large amounts of fishy alerts, it is therefore highly annoying. I am 99.9% positive that the problem is due to either:

    1. some Nagios/Nagios-Plugin snag
    2. some system (memory, CPU, I/O, network stack) problem

    possibly caused by the burst of requests sent by the nagios scheduler. The packet losses are all above 50% - if they were real, our phones would be melting. So far I have no evidence for (2), so I am looking for "prior art" in (1). I may well be mistaken in my belief, but, if I have to reach for wireshark or similar, a suggestion on what to look for would be greatly appreciated.

    • MadHatter
      MadHatter over 11 years
      The fact that there's packet loss at time A, but not when checked again at time B, doesn't mean the first result was bogus. I'd be inclined to start by assuming that NAGIOS was telling the truth, and investigate why I was getting intermittent packet loss.
    • Alien Life Form
      Alien Life Form over 11 years
      Besides manual checks, I have other independent checks (smokeping, cacti) telling me that it's not the case. The affected hosts are on different remote networks (with different owners), yet other hosts on the same networks do not show the same loss. Several of the hosts are running loss-sensitive services (VPNs, mostly) which would drop at the reported loss rates - they don't. Everything happens in lockstep. I could go on, but the bottom line is that it is highly unlikely that Nagios is telling the truth.
    • MadHatter
      MadHatter over 11 years
      A reasonable answer. Is there any time correlation in the affected hosts? That is, if all the checks run between (say) 1000 and 1002 gave losses, but those between 1002 and 1004 didn't, the fact that hosts in both groups were on the same networks wouldn't signify. The point about other services being available definitely doesn't signify, since tests for connectivity using different transport media (eg, TCP over a VPN) have different timeouts. What do you see on the wire when the losses are occurring?
  • dunxd
    dunxd over 11 years
    I would +1 this but I don't like the snarky tone at the end. Help the asker understand if you don't think they understand, don't berate them. We all have to start somewhere.
  • Alien Life Form
    Alien Life Form over 11 years
    Are you sure YOU understand what I am trying to measure? How do you know I am concerned about packet loss? (I'm not.) Do you know what Nagios is and what it is used for? (Hint: not for measuring network performance.) How do you know that packet loss is in the order of 1%? (It is around 90% when reported.) Did you even bother to RTFQ (Read The Fine Question)? If you did, did you stop to think (assuming that activity applies in this case) before deciding that all involved facts and parties are dumb? Have a nice day.
  • Felix Frank
    Felix Frank about 10 years
    I believe that the problem described in that blog is something else altogether. Using check_http as a plugin for a check_ping command also strikes me as at least slightly evil ;)
  • lgfischer
    lgfischer about 10 years
    Sorry, I copied the wrong piece of code from my Nagios scripts. I've updated my answer ;)