High load on a nagios server -- How many service checks for a nagios server is too many?


Solution 1

You need to figure out where your bottleneck is...

I run a Nagios monitor that checks 400+ hosts with HTTP, ping, and SSH checks (along with a lot of other passive checks via NSCA).

This is on a server with two quad-core CPUs and 4 SAS disks in RAID 10.

I suspect you're hitting I/O contention, as writing to lots of RRD files is very inefficient.

You need to figure out which process is taking up your resources. (cacti, nagios or something else)

For I/O checking, I like iotop (the Ubuntu 9.04 package works on 8.04).
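A minimal sketch of using it; the -o flag is standard iotop usage, but on 8.04 you may need to grab the 9.04 .deb manually as noted above:

    sudo apt-get install iotop   # or install the 9.04 .deb on 8.04
    sudo iotop -o                # -o shows only processes actually doing I/O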

Otherwise, top should also help you find your load hog.
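If you'd rather have a one-shot list than an interactive view, ps can sort for you (GNU/procps syntax):

    ps aux --sort=-%cpu | head -n 10   # top 10 CPU consumers
    ps aux --sort=-rss  | head -n 10   # top 10 memory (RSS) consumers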

Polling Cacti once a minute is pretty aggressive (I run mine at 5-minute intervals).
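If you want to try a 5-minute interval, Cacti's poller is normally driven from cron; the path below is the Debian/Ubuntu package default, so treat it as an assumption and adjust to your install:

    # /etc/cron.d/cacti -- change */1 to */5 (path assumed)
    */5 * * * * www-data php /usr/share/cacti/site/poller.php >/dev/null 2>&1

Depending on your Cacti version, its internal poller-interval setting may also need to match the cron interval.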

One approach I've heard of for RRD write contention is to put your RRD store on a ramdisk/tmpfs (be sure to rsync it back to persistent storage every now and then).
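A rough sketch of that setup (the paths are illustrative, not taken from the question):

    # Put the RRDs on a tmpfs (contents are lost on reboot!)
    mount -t tmpfs -o size=512m tmpfs /var/lib/cacti/rra

    # Seed the tmpfs from the last on-disk copy
    rsync -a /var/lib/cacti/rra.disk/ /var/lib/cacti/rra/

    # And flush back to disk periodically, e.g. via cron:
    # */15 * * * * root rsync -a /var/lib/cacti/rra/ /var/lib/cacti/rra.disk/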

Good luck.

Solution 2

Unless it's cacti generating most of the load, you should be able to run many more checks than that on your hardware.

I'm running Nagios on a FreeBSD virtual machine under Microsoft Virtual Server on a dog-slow old PC (a 1 GHz Pentium III with a slow PATA disk). The virtual machine has only 128 MB of RAM, and performance is dire.

However, the load average is about 0.2 while running 158 checks across 42 hosts.

Solution 3

On an old Pentium III with 256 MB of RAM, I'm actively monitoring about 230 different services. The same machine also runs MRTG and HylaFAX for all our incoming faxes, and it does so quite comfortably.

Solution 4

You should be able to run a boatload of Nagios checks with that hardware. We run a similar setup with about 70 checks and Nagiosgraph - the major difference is added RAM (it's cheap, so I'd bump the box up to 2 GB).

Try running top or ps aux to see whether the CPU is overloaded, but I doubt it. You may also want to check the Nagios parallelization docs to see if your install is trying to run too many checks at once rather than spreading them out.
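The relevant knobs live in nagios.cfg; the directives below exist in Nagios 3, but the values are illustrative examples rather than recommendations:

    # nagios.cfg -- check scheduling/parallelization
    max_concurrent_checks=20            # 0 = unlimited, which can swamp a small box
    service_inter_check_delay_method=s  # 's' (smart) spreads checks over the interval
    service_interleave_factor=s         # interleave service checks across hosts
    check_result_reaper_frequency=10    # seconds between processing check results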



Comments

  • Josh, almost 2 years

    I have a nagios server running Ubuntu with a 2.0 GHz Intel processor, a RAID 10 array, and 400 MB of RAM. It monitors a total of 42 services across 8 hosts, most of which are checked using the check_http plugin every 5 minutes, some every minute. Recently the load on the nagios server has been above 4, often as high as 6. The server also runs cacti, gathering statistics every minute for 6 hosts.

    I wonder, how many services should hardware like this be able to handle? Is the load so high because I am pushing the limits of the hardware, or should this hardware be able to handle 42 service checks plus cacti? If the hardware is inadequate, should I look to add more RAM, more cores, or faster cores? What hardware / service checks are others running?

    • Admin, over 14 years
      What does RAM usage look like right now on the server? And what does CPU usage look like? If it's high, which processes are pegging it?
    • Admin, almost 13 years
      Did you solve the problem? We are experiencing the same issue; the load average is 12.
  • Josh, over 14 years
    Thanks, I'll look into it. It probably is cacti generating the load, and I'll see if there is a way to move the RRDs to tmpfs, or just add more RAM so the server can buffer them. I fear that if I run cacti every 5 minutes there could be load spikes lasting only 1 or 2 minutes which I would completely miss...
  • Josh, over 14 years
    Thanks. I wish I could accept both answers! Yours was very helpful; it indicates to me that cacti is probably the culprit.
  • Josh, over 14 years
    Very helpful information. This indicates to me that cacti is probably the culprit, not nagios. Thanks!