kernel panic: Out of memory and no killable processes (SuSE on VMware ESXi)
Solution 1
How much memory is assigned to the SuSE instance? Given that you're running a lot of memory-hungry services on it (three RDBMSes plus memcached), it's going to need a significant amount of its 8GB of memory to run.
You'll also need to check both the memory reservation and the limit setting in ESXi for the SuSE instance - remember that a limit set too low can force the machine to swap heavily or even crash.
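As a first step you can compare what the guest actually sees against the 8GB you assigned. A minimal sketch, assuming a Linux guest with /proc mounted (the vmware-toolbox-cmd line additionally requires VMware Tools in the guest):

```shell
# Quick sanity check from inside the guest: how much RAM the VM sees
# and how much is currently free.
mem_total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
mem_free_kb=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
echo "guest sees: $((mem_total_kb / 1024)) MB total, $((mem_free_kb / 1024)) MB free"
# If VMware Tools is installed, 'vmware-toolbox-cmd stat balloon' shows how
# much memory the host has ballooned away from the guest.
```

If the guest reports far less than 8GB, or a large balloon value, the problem is on the ESXi side rather than inside SuSE.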
Solution 2
You must find the culprit that is using too much memory. You can do that with a simple script that records the output of ps
from time to time, or by using monitoring facilities like Munin.
Without watching exactly what is going on, it's not easy to know what is eating your memory and swap to the point of leaving none available, even though I'm inclined to suspect the databases first.
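A minimal sketch of such a recording script (the file name and log path are examples, not from the original post): each run appends a timestamp and the ten biggest memory consumers to a log.

```shell
#!/bin/sh
# memwatch.sh - log the top memory consumers for post-crash analysis.
LOG="memwatch.log"   # in production, something like /var/log/memwatch.log
{
  date
  # header line plus the top 10 processes sorted by resident set size
  ps -eo pid,rss,vsz,comm --sort=-rss | head -n 11
  echo
} >> "$LOG"
```

Run it from cron, e.g. `*/5 * * * * /usr/local/bin/memwatch.sh`; after the next crash, the last entries in the log should point at the likely culprit.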
Radek
Updated on September 18, 2022

Comments
-
Radek almost 2 years
We have a server (an ESXi virtual machine) at work that from time to time freezes with "kernel panic: Out of memory and no killable processes...".
The host's memory is 12GB.
Configuration of the virtual machine:
- VMware ESXi
- VM version 7
- 2 CPU
- Memory 8192 MB
- memory reservation 0, memory limit setting = unlimited
- SuSE 11.3 (64-bit), kernel 2.6.34-12
- firebird, postgresql, db2
- php5.3, PHP-FPM, LIGHTTPD, MEMCACHED, OOo
The machine is NOT heavily used, yet it crashes once a day or once every two days. Sometimes it happens over the weekend.
How can I find out what is causing the server to crash?
Extract from the vmware.log file:
Apr 03 07:21:22.266: vcpu-0| Vix: [17514025 vmxCommands.c:7612]: VMAutomation_HandleCLIHLTEvent. Do nothing.
Apr 03 07:21:22.266: vcpu-0| Msg_Hint: msg.monitorevent.halt (sent)
Apr 03 07:21:22.266: vcpu-0| The CPU has been disabled by the guest operating system. You will need to power off or reset the virtual machine at this point.
Apr 03 07:21:22.266: vcpu-0| ---------------------------------------
Apr 03 07:21:37.167: vmx| GuestRpcSendTimedOut: message to toolbox timed out.
Apr 03 07:21:37.167: vmx| GuestRpc: app toolbox's second ping timeout; assuming app is down
Apr 03 22:30:06.017: mks| MKS: Base polling period is 10000us
UPDATE I (a bit of /var/log/messages)
Extract from /var/log/messages where it all (probably) starts. I am going to remove
/opt/eduserver/bin/php
from cron and we will see if the crash happens again.

Apr 9 22:15:02 testing /usr/sbin/cron[4312]: (root) CMD (/opt/eduserver/bin/php /srv/www/htdocs/imacs/radek/trunk/lib/views/edu_scheduler/controllers/action_scheduler.php >/var/lib/edumate/imacs/radek/trunk/scheduler )
Apr 9 22:15:20 testing kernel: [115148.493482] oom_kill_process: 3 callbacks suppressed
Apr 9 22:15:20 testing kernel: [115148.493485] php invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Apr 9 22:15:20 testing kernel: [115148.493488] Pid: 4317, comm: php Not tainted 2.6.34-12-desktop #1
Apr 9 22:15:20 testing kernel: [115148.493490] Call Trace:
Apr 9 22:15:20 testing kernel: [115148.493511] [<ffffffff81005ca9>] dump_trace+0x79/0x340
Apr 9 22:15:20 testing kernel: [115148.493516] [<ffffffff8149e612>] dump_stack+0x69/0x6f
Apr 9 22:15:20 testing kernel: [115148.493522] [<ffffffff810dbae0>] dump_header.clone.1+0x70/0x1a0
Apr 9 22:15:20 testing kernel: [115148.493525] [<ffffffff810dbc8e>] oom_kill_process.clone.0+0x7e/0x150
Apr 9 22:15:20 testing kernel: [115148.493529] [<ffffffff810dc0cb>] __out_of_memory+0x10b/0x180
Apr 9 22:15:20 testing kernel: [115148.493533] [<ffffffff810dc3c8>] out_of_memory+0x88/0x190
Apr 9 22:15:20 testing kernel: [115148.493536] [<ffffffff810e073a>] __alloc_pages_nodemask+0x69a/0x6b0
Apr 9 22:15:20 testing kernel: [115148.493541] [<ffffffff810e35a4>] __do_page_cache_readahead+0x114/0x290
Apr 9 22:15:20 testing kernel: [115148.493545] [<ffffffff810e389c>] ra_submit+0x1c/0x30
Apr 9 22:15:20 testing kernel: [115148.493548] [<ffffffff810d9e9f>] filemap_fault+0x3cf/0x410
Apr 9 22:15:20 testing kernel: [115148.493553] [<ffffffff810f4fc2>] __do_fault+0x52/0x520
Apr 9 22:15:20 testing kernel: [115148.493557] [<ffffffff810f9933>] handle_mm_fault+0x1a3/0x450
Apr 9 22:15:20 testing kernel: [115148.493561] [<ffffffff814a4b34>] do_page_fault+0x194/0x450
Apr 9 22:15:20 testing kernel: [115148.493565] [<ffffffff814a1fcf>] page_fault+0x1f/0x30
Apr 9 22:15:20 testing kernel: [115148.493587] [<00007f52b7d4cce5>] 0x7f52b7d4cce5
Apr 9 22:15:20 testing kernel: [115148.493588] Mem-Info:
Apr 9 22:15:20 testing kernel: [115148.493590] Node 0 DMA per-cpu:
Apr 9 22:15:20 testing kernel: [115148.493592] CPU 0: hi: 0, btch: 1 usd: 0
Apr 9 22:15:20 testing kernel: [115148.493593] CPU 1: hi: 0, btch: 1 usd: 0
Apr 9 22:15:20 testing kernel: [115148.493595] Node 0 DMA32 per-cpu:
Apr 9 22:15:20 testing kernel: [115148.493597] CPU 0: hi: 186, btch: 31 usd: 155
Apr 9 22:15:20 testing kernel: [115148.493598] CPU 1: hi: 186, btch: 31 usd: 161
Apr 9 22:15:20 testing kernel: [115148.493600] Node 0 Normal per-cpu:
Apr 9 22:15:20 testing kernel: [115148.493601] CPU 0: hi: 186, btch: 31 usd: 173
Apr 9 22:15:20 testing kernel: [115148.493603] CPU 1: hi: 186, btch: 31 usd: 57
Apr 9 22:15:20 testing kernel: [115148.493607] active_anon:1465647 inactive_anon:288016 isolated_anon:0
Apr 9 22:15:20 testing kernel: [115148.493607] active_file:129 inactive_file:784 isolated_file:0
Apr 9 22:15:20 testing kernel: [115148.493608] unevictable:0 dirty:0 writeback:0 unstable:0
Apr 9 22:15:20 testing kernel: [115148.493609] free:11853 slab_reclaimable:4721 slab_unreclaimable:64985
Apr 9 22:15:20 testing kernel: [115148.493609] mapped:14998 shmem:15500 pagetables:161144 bounce:0
Apr 9 22:15:20 testing kernel: [115148.493611] Node 0 DMA free:15812kB min:20kB low:24kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15708kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Apr 9 22:15:20 testing kernel: [115148.493618] lowmem_reserve[]: 0 3000 8050 8050
Apr 9 22:15:20 testing kernel: [115148.493621] Node 0 DMA32 free:24432kB min:4272kB low:5340kB high:6408kB active_anon:2097640kB inactive_anon:524448kB active_file:52kB inactive_file:64kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3072160kB mlocked:0kB dirty:0kB writeback:0kB mapped:448kB shmem:360kB slab_reclaimable:1988kB slab_unreclaimable:97472kB kernel_stack:17712kB pagetables:239608kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:144 all_unreclaimable? no
Apr 9 22:15:20 testing kernel: [115148.493629] lowmem_reserve[]: 0 0 5050 5050
Apr 9 22:15:20 testing kernel: [115148.493631] Node 0 Normal free:7168kB min:7192kB low:8988kB high:10788kB active_anon:3764948kB inactive_anon:627616kB active_file:464kB inactive_file:3072kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:5171200kB mlocked:0kB dirty:0kB writeback:0kB mapped:59544kB shmem:61640kB slab_reclaimable:16896kB slab_unreclaimable:162468kB kernel_stack:28984kB pagetables:404968kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1440 all_unreclaimable? yes
Apr 9 22:15:20 testing kernel: [115148.493639] lowmem_reserve[]: 0 0 0 0
Apr 9 22:15:20 testing kernel: [115148.493641] Node 0 DMA: 3*4kB 1*8kB 1*16kB 1*32kB 2*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15812kB
Apr 9 22:15:20 testing kernel: [115148.493648] Node 0 DMA32: 272*4kB 140*8kB 31*16kB 127*32kB 84*64kB 42*128kB 11*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 24432kB
Apr 9 22:15:20 testing kernel: [115148.493655] Node 0 Normal: 840*4kB 26*8kB 1*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 7168kB
Apr 9 22:15:20 testing kernel: [115148.493662] 19767 total pagecache pages
Apr 9 22:15:20 testing kernel: [115148.493663] 3345 pages in swap cache
Apr 9 22:15:20 testing kernel: [115148.493664] Swap cache stats: add 531666, delete 528321, find 103411/104065
Apr 9 22:15:20 testing kernel: [115148.493666] Free swap = 0kB
Apr 9 22:15:20 testing kernel: [115148.493667] Total swap = 2103292kB
Apr 9 22:15:20 testing kernel: [115148.514162] 2097136 pages RAM
Apr 9 22:15:20 testing kernel: [115148.514164] 48271 pages reserved
Apr 9 22:15:20 testing kernel: [115148.514165] 106772 pages shared
Apr 9 22:15:20 testing kernel: [115148.514166] 2006923 pages non-shared
Apr 9 22:15:20 testing kernel: [115148.514169] Out of memory: kill process 3016 (cron) score 308233 or a child
Apr 9 22:15:20 testing kernel: [115148.514171] Killed process 15546 (cron) vsz:50064kB, anon-rss:272kB, file-rss:32kB
Apr 9 22:16:01 testing /usr/sbin/cron[4347]: (root) CMD (/usr/bin/ruby /root/radek/scripts/freemem.rb)
Apr 9 22:17:07 testing kernel: [115255.428734] vmtoolsd invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Apr 9 22:17:07 testing kernel: [115255.428738] Pid: 2772, comm: vmtoolsd Not tainted 2.6.34-12-desktop #1
Apr 9 22:17:08 testing kernel: [115255.428740] Call Trace:
Apr 9 22:17:08 testing kernel: [115255.428751] [<ffffffff81005ca9>] dump_trace+0x79/0x340
Apr 9 22:17:08 testing kernel: [115255.428756] [<ffffffff8149e612>] dump_stack+0x69/0x6f
Apr 9 22:17:08 testing kernel: [115255.428762] [<ffffffff810dbae0>] dump_header.clone.1+0x70/0x1a0
Apr 9 22:17:08 testing kernel: [115255.428765] [<ffffffff810dbc8e>] oom_kill_process.clone.0+0x7e/0x150
Apr 9 22:17:08 testing kernel: [115255.428769] [<ffffffff810dc0cb>] __out_of_memory+0x10b/0x180
Apr 9 22:17:08 testing kernel: [115255.428773] [<ffffffff810dc3c8>] out_of_memory+0x88/0x190
Apr 9 22:17:08 testing kernel: [115255.428777] [<ffffffff810e073a>] __alloc_pages_nodemask+0x69a/0x6b0
Apr 9 22:17:08 testing kernel: [115255.428781] [<ffffffff810e35a4>] __do_page_cache_readahead+0x114/0x290
Apr 9 22:17:08 testing kernel: [115255.428785] [<ffffffff810e389c>] ra_submit+0x1c/0x30
Apr 9 22:17:08 testing kernel: [115255.428788] [<ffffffff810d9e9f>] filemap_fault+0x3cf/0x410
Apr 9 22:17:08 testing kernel: [115255.428793] [<ffffffff810f4fc2>] __do_fault+0x52/0x520
Apr 9 22:17:08 testing kernel: [115255.428802] [<ffffffff810f9933>] handle_mm_fault+0x1a3/0x450
Apr 9 22:17:08 testing kernel: [115255.428824] [<ffffffff814a4b34>] do_page_fault+0x194/0x450
Apr 9 22:17:08 testing kernel: [115255.428828] [<ffffffff814a1fcf>] page_fault+0x1f/0x30
Apr 9 22:17:08 testing kernel: [115255.428841] [<00007f09951973c0>] 0x7f09951973c0
Apr 9 22:17:08 testing kernel: [115255.428842] Mem-Info:
Apr 9 22:17:08 testing kernel: [115255.428844] Node 0 DMA per-cpu:
Apr 9 22:17:08 testing kernel: [115255.428846] CPU 0: hi: 0, btch: 1 usd: 0
Apr 9 22:17:08 testing kernel: [115255.428847] CPU 1: hi: 0, btch: 1 usd: 0
Apr 9 22:17:08 testing kernel: [115255.428848] Node 0 DMA32 per-cpu:
Apr 9 22:17:08 testing kernel: [115255.428850] CPU 0: hi: 186, btch: 31 usd: 44
Apr 9 22:17:08 testing kernel: [115255.428852] CPU 1: hi: 186, btch: 31 usd: 174
Apr 9 22:17:08 testing kernel: [115255.428853] Node 0 Normal per-cpu:
Apr 9 22:17:08 testing kernel: [115255.428855] CPU 0: hi: 186, btch: 31 usd: 146
Apr 9 22:17:08 testing kernel: [115255.428856] CPU 1: hi: 186, btch: 31 usd: 171
Apr 9 22:17:08 testing kernel: [115255.428860] active_anon:1464570 inactive_anon:287629 isolated_anon:0
Apr 9 22:17:08 testing kernel: [115255.428861] active_file:66 inactive_file:2047 isolated_file:64
Apr 9 22:17:08 testing kernel: [115255.428862] unevictable:0 dirty:0 writeback:0 unstable:0
Apr 9 22:17:08 testing kernel: [115255.428862] free:11882 slab_reclaimable:4727 slab_unreclaimable:64987
Apr 9 22:17:08 testing kernel: [115255.428863] mapped:15715 shmem:15500 pagetables:161192 bounce:0
Apr 9 22:17:08 testing kernel: [115255.428865] Node 0 DMA free:15812kB min:20kB low:24kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15708kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Apr 9 22:17:08 testing kernel: [115255.428872] lowmem_reserve[]: 0 3000 8050 8050
Apr 9 22:17:08 testing kernel: [115255.428875] Node 0 DMA32 free:24448kB min:4272kB low:5340kB high:6408kB active_anon:2091648kB inactive_anon:522644kB active_file:176kB inactive_file:7944kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3072160kB mlocked:0kB dirty:0kB writeback:0kB mapped:3496kB shmem:360kB slab_reclaimable:2004kB slab_unreclaimable:97488kB kernel_stack:17712kB pagetables:239656kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:210 all_unreclaimable? yes
Apr 9 22:17:08 testing kernel: [115255.428882] lowmem_reserve[]: 0 0 5050 5050
Apr 9 22:17:08 testing kernel: [115255.428885] Node 0 Normal free:7268kB min:7192kB low:8988kB high:10788kB active_anon:3766632kB inactive_anon:627872kB active_file:88kB inactive_file:244kB unevictable:0kB isolated(anon):0kB isolated(file):256kB present:5171200kB mlocked:0kB dirty:0kB writeback:0kB mapped:59364kB shmem:61640kB slab_reclaimable:16904kB slab_unreclaimable:162460kB kernel_stack:29000kB pagetables:405112kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:129 all_unreclaimable? yes
Apr 9 22:17:08 testing kernel: [115255.428893] lowmem_reserve[]: 0 0 0 0
Apr 9 22:17:08 testing kernel: [115255.428895] Node 0 DMA: 3*4kB 1*8kB 1*16kB 1*32kB 2*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15812kB
Apr 9 22:17:08 testing kernel: [115255.428902] Node 0 DMA32: 278*4kB 127*8kB 33*16kB 119*32kB 81*64kB 44*128kB 6*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 24448kB
Apr 9 22:17:08 testing kernel: [115255.428909] Node 0 Normal: 881*4kB 20*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 7268kB
Apr 9 22:17:08 testing kernel: [115255.428915] 18755 total pagecache pages
Apr 9 22:17:08 testing kernel: [115255.428916] 1043 pages in swap cache
Apr 9 22:17:08 testing kernel: [115255.428918] Swap cache stats: add 531680, delete 530637, find 103628/104282
Apr 9 22:17:08 testing kernel: [115255.428919] Free swap = 0kB
Apr 9 22:17:08 testing kernel: [115255.428920] Total swap = 2103292kB
Apr 9 22:17:08 testing kernel: [115255.447686] 2097136 pages RAM
Apr 9 22:17:08 testing kernel: [115255.447688] 48271 pages reserved
Apr 9 22:17:08 testing kernel: [115255.447689] 64969 pages shared
Apr 9 22:17:08 testing kernel: [115255.447690] 2006202 pages non-shared
Apr 9 22:17:08 testing kernel: [115255.447693] Out of memory: kill process 3016 (cron) score 308364 or a child
Apr 9 22:17:08 testing kernel: [115255.447696] Killed process 15547 (cron) vsz:50064kB, anon-rss:316kB, file-rss:4kB
Apr 9 22:17:08 testing kernel: [115255.753860] db2sysc invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Apr 9 22:17:08 testing kernel: [115255.753864] Pid: 3346, comm: db2sysc Not tainted 2.6.34-12-desktop #1
-
coredump about 13 years: What crashes, the virtual machine or the host?
- VMware ESXi
-
Radek about 13 years: the crash happens quite often during times when there is no DB activity ...
-
Radek about 13 years: host's RAM = 12GB, RAM for the virtual machine = 8GB. It was 4 or 6, but we increased it when the VM started to crash because of memory. Memory reservation is 0 and the limit setting is unlimited.
-
Ewan Leith about 13 years: From the additional information you've posted, it looks like it's SuSE itself that's crashing, rather than any action from ESXi. Can you post the /var/log/messages output? It should contain error messages about memory.
-
Radek about 13 years: I posted the bit from the log file where I think it all starts. After that I can see lots of
Out of memory: kill process
until the server freezes completely.
-
Ewan Leith about 13 years: Check the php.ini file - possibly in /etc/php.ini or similar, I'm not sure on SuSE. Look for an entry for memory_limit - if it's not there, the default is 128M on PHP 5.3 (it used to be 8MB). Try setting it to 16MB (which is plenty for most apps) and see if that stops the server crashing - it will instead force PHP to kill that thread if it's consuming too much memory.
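A hypothetical sketch of checking and lowering that setting; we operate on a sample file here, since on the real server the location varies (`php --ini` prints the path of the loaded php.ini, and PHP-FPM must be restarted for the change to take effect):

```shell
# Create a sample excerpt standing in for the real php.ini.
cat > sample-php.ini <<'EOF'
; sample php.ini excerpt
memory_limit = 2048M
EOF

grep '^memory_limit' sample-php.ini    # show the current value
# Lower the limit in place (GNU sed syntax).
sed -i 's/^memory_limit = .*/memory_limit = 512M/' sample-php.ini
grep '^memory_limit' sample-php.ini    # confirm the new value
```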
-
Radek about 13 years: my memory_limit is set to
2048M
Is that too much? I think we run lots of (big) database queries. Could it be that we need lots of memory? I'll try to lower it.
-
Ewan Leith about 13 years: 2048M is a lot by PHP standards. That's per execution, so if 5 scripts execute at the same time, PHP is allowed to consume 10GB of RAM if they need it. I'd drop it down to 512M and see if that resolves your crashes without causing new issues with the scripts.
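To make the arithmetic in that comment concrete (the worker count is the illustrative figure from the comment, not a measured value):

```shell
# Worst-case PHP memory use is roughly memory_limit times the number of
# PHP processes that can run concurrently (pm.max_children for PHP-FPM).
memory_limit_mb=2048
workers=5
echo "worst case: $((memory_limit_mb * workers)) MB"   # 10240 MB, i.e. 10 GB
```

With only 8GB assigned to the VM, that worst case alone already exceeds the guest's RAM before the databases and memcached use anything.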
-
Radek about 13 years: I was told that we need that much because some of the 'stuff' we do needs lots of memory (like printing statements for 1400 students). The same 2048M setting is on another development server that 'should' be identical to mine, but that one doesn't crash.