Kill processes if high load average
Solution 1
You can use a watchdog like Monit to watch over the processes you care about, and restart them if they consume excess resources.
Something like this would be used to monitor Apache:
check process apache with pidfile /var/run/httpd.pid
  start program = "/etc/init.d/httpd start"
  stop program = "/etc/init.d/httpd stop"
  if cpu > 40% for 2 cycles then alert
  if totalcpu > 60% for 2 cycles then alert
  if totalcpu > 80% for 5 cycles then restart
  if mem > 100 MB for 5 cycles then stop
  if loadavg(5min) greater than 10.0 for 8 cycles then stop
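So, if the Apache process itself uses more than 40% CPU for 2 cycles, or the process together with its children uses more than 60%, Monit sends an alert. If the total stays above 80% for 5 cycles, Monit restarts Apache. The last two rules stop Apache if its memory use stays above 100 MB for 5 cycles, or if the 5-minute load average stays above 10.0 for 8 cycles.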
Monit will also start up Apache if it's not running for some reason, which is a reasonable way to keep critical services up (if you don't have something like Upstart available).
This assumes that you have a set of processes that you can target for this sort of monitoring. Presumably, you suspect a particular application may be a problem.
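To make the "for N cycles" part of those rules concrete, here is a minimal Python sketch of how such a rule can be modelled: it fires only when the metric is over the limit in N consecutive polls. This is an illustration of the idea, not Monit's actual implementation; the names CycleRule and first_firing_cycle and the sample readings are made up for the example.

```python
from collections import deque


class CycleRule:
    """Fire only when a metric exceeds `limit` in `cycles` consecutive polls."""

    def __init__(self, limit, cycles):
        self.limit = limit
        # The deque keeps only the results of the last `cycles` polls.
        self.recent = deque(maxlen=cycles)

    def update(self, value):
        """Record one poll; return True if the rule fires on this poll."""
        self.recent.append(value > self.limit)
        return len(self.recent) == self.recent.maxlen and all(self.recent)


def first_firing_cycle(rule, samples):
    """Return the 1-based poll number at which `rule` first fires, or None."""
    for cycle, value in enumerate(samples, start=1):
        if rule.update(value):
            return cycle
    return None


if __name__ == "__main__":
    # Hypothetical total CPU readings (%), one per polling cycle.
    readings = [85, 90, 70, 82, 88, 91, 95, 86]
    restart = CycleRule(limit=80, cycles=5)
    print(first_firing_cycle(restart, readings))
```

With these made-up readings, the single dip to 70% resets the streak, so the restart rule does not fire until five polls in a row are above 80%. That is why a short spike will not trigger a restart or stop.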
Solution 2
When your load average rises and you can't log in via ssh, try Grey Goo, a tiny, available, and reliable remote command execution server and client designed purely for emergency situations:
https://code.google.com/p/greygoo/
Drakmail
Updated on September 18, 2022
Comments
-
Drakmail over 1 year
Not long ago, the load average on my server rose to 400 and I couldn't even log in to the server over ssh. Is there any software that can prevent this situation by automatically killing the processes that are putting a huge load on the server?
P.S. The server runs Debian 6.0.5.
-
Drakmail almost 12 years
Ideally, I want something that will kill all suspicious processes that use too much CPU or I/O, with something like a list of protected processes that will never be killed (like ssh).
-
Matthew Ife almost 12 years
I don't want to sound harsh, but that feels like a band-aid to me. You should really be figuring out what circumstances cause the load to go out of control and remedy the problem at the source.
-
Drakmail almost 12 years
Sounds good, but the problem isn't recurring now :( I have some guesses about what the cause could be, but I'm afraid that testing them might kill my server again.