Server overloaded, acts as if out of memory, but that's not true
Solution 1
The problem was solved by reducing the memory allocated for guests. Now there are 3 guests with 80 GB RAM each, leaving about 150 GB RAM to the host system:
# free -h
total used free shared buff/cache available
Mem: 377G 243G 29G 1,9G 104G 132G
It feels like a big memory waste, but things are stable now.
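For reference, a minimal sketch of how the guest memory can be lowered through libvirt, assuming a domain named guest1 (a placeholder); 83886080 KiB corresponds to 80 GiB, in the same XML style the guests already use:
# virsh edit guest1
  <memory unit='KiB'>83886080</memory>
  <currentMemory unit='KiB'>83886080</currentMemory>
The new <memory> limit takes effect the next time the guest is shut down and started again; on a running guest, virsh setmem can only lower the current allocation via the balloon driver.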
Solution 2
There's a lot of free memory, but these zones are totally fragmented:
Node 0 Normal: 1648026*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6592104kB
Node 1 Normal: 8390977*4kB 1181188*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 43013412kB
There are very few higher-order pages left, and in one zone none are left at all.
I can't guarantee anything, but you may want to try turning off ksmd and re-compacting memory. Compaction is only invoked automatically on higher-order page allocations and never invokes the OOM killer, so I assume the system tried to allocate memory at order 2 or 3 and got stuck.
To compact memory, run: echo 1 > /proc/sys/vm/compact_memory
There's only so much to go on in this question, but I suspect ksmd is causing the fragmentation by scanning for pages duplicated in both VMs and swapping them all around.
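As a minimal sketch of what that could look like, assuming a CentOS 6 host where KSM runs through the ksm and ksmtuned services (writing 2 to /sys/kernel/mm/ksm/run additionally un-merges the pages KSM has already shared, which temporarily raises memory usage):
# service ksmtuned stop
# service ksm stop
# echo 2 > /sys/kernel/mm/ksm/run
# echo 1 > /proc/sys/vm/compact_memory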
Solution 3
@Matthew's answer should be marked as the solution for this problem. The /proc/buddyinfo output clearly shows fragmentation (due to ksmd or other behaviour). Memory compaction is a valid solution.
We just hit the same problem on our server:
# cat /proc/buddyinfo
Node 0, zone DMA 1 0 1 0 0 1 0 0 0 1 3
Node 0, zone DMA32 4941 14025 10661 1462 1715 154 1 0 0 0 0
Node 0, zone Normal 420283 217678 3852 3 1 0 1 1 1 0 0
Node 1, zone Normal 1178429 294431 21420 340 7 2 1 2 0 0 0
This clearly shows fragmentation, since most free memory is split into lots of small blocks (large numbers on the left, zeros on the right).
Now compaction solves this:
# echo 1 >/proc/sys/vm/compact_memory
# cat /proc/buddyinfo
Node 0, zone DMA 1 0 1 0 0 1 0 0 0 1 3
Node 0, zone DMA32 485 1746 8588 3311 2076 505 98 19 3 0 0
Node 0, zone Normal 83764 22474 8597 3130 1971 1421 1090 808 556 358 95
Node 1, zone Normal 51928 36053 36093 29024 21498 13148 5719 1405 151 8 0
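As a rough way to read these tables at a glance (after the "Node N, zone XYZ" prefix, the columns are the free-block counts for orders 0 through 10), an awk one-liner along these lines sums the blocks of order 4 and above per zone; a zone reporting 0 here cannot satisfy larger contiguous allocations without compaction:
# awk '{ n = 0; for (i = 9; i <= NF; i++) n += $i; print $1, $2, $4 ": free blocks of order >= 4: " n }' /proc/buddyinfo
The field index 9 assumes the standard four-field prefix, so order 4 starts at the ninth whitespace-separated column.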
dave
Updated on September 18, 2022
Comments
-
dave almost 2 years
I have a CentOS 6.5 server running QEMU-KVM virtualization:
hardware:
- 40 CPUs
- 400 GB RAM
software:
- Kernel: 2.6.32-431.17.1.el6.x86_64
- Qemu: 0.12.1.2
- Libvirt: 0.10.2
There are 3 guests with identical hardware configuration:
- 16 CPUs
- 120 GB RAM
<memory unit='KiB'>125829120</memory>
<currentMemory unit='KiB'>125829120</currentMemory>
<vcpu placement='static'>16</vcpu>
Guests are running Apache and MySQL.
Besides the VMs, the host runs only some backup and maintenance scripts, nothing else.
Always after a few days of running, problems start to show up. The load on the guests randomly spikes up to about 150, with 10-15% steal CPU time. On the host the load is around 38-40, with about 30-40% user CPU time and 40-50% system CPU time.
The most CPU-consuming processes on the host at that moment are the QEMU processes of the virtual guests, followed by kswapd0 and kswapd1 with 100% CPU usage.
Memory usage in that moment:
- RAM total 378.48 GB
- RAM used 330.82 GB
- RAM free 47.66 GB
- SWAP total 500.24 MB
- SWAP used 497.13 MB
- SWAP free 3192 kB
plus 10-20 GB RAM in buffers.
So, from the point of view of memory usage, there shouldn't be any problem. But the heavy activity of the kswapd processes indicates a memory shortage, and the full swap points in the same direction (when I turn swap off and on, it fills up again within moments). And once in a while, the OOM killer kills some process:
Nov 20 12:42:42 wv2-f302 kernel: active_anon:79945387 inactive_anon:3660742 isolated_anon:0
Nov 20 12:42:42 wv2-f302 kernel: active_file:252 inactive_file:0 isolated_file:0
Nov 20 12:42:42 wv2-f302 kernel: unevictable:0 dirty:2 writeback:0 unstable:0
Nov 20 12:42:42 wv2-f302 kernel: free:12513746 slab_reclaimable:5001 slab_unreclaimable:1759785
Nov 20 12:42:42 wv2-f302 kernel: mapped:213 shmem:41 pagetables:188243 bounce:0
Nov 20 12:42:42 wv2-f302 kernel: Node 0 DMA free:15728kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15332kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Nov 20 12:42:42 wv2-f302 kernel: lowmem_reserve[]: 0 2965 193855 193855
Nov 20 12:42:42 wv2-f302 kernel: Node 0 DMA32 free:431968kB min:688kB low:860kB high:1032kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3037072kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Nov 20 12:42:42 wv2-f302 kernel: lowmem_reserve[]: 0 0 190890 190890
Nov 20 12:42:42 wv2-f302 kernel: Node 0 Normal free:6593828kB min:44356kB low:55444kB high:66532kB active_anon:178841380kB inactive_anon:7783292kB active_file:540kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:195471360kB mlocked:0kB dirty:8kB writeback:0kB mapped:312kB shmem:48kB slab_reclaimable:11136kB slab_unreclaimable:1959664kB kernel_stack:5104kB pagetables:397332kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Nov 20 12:42:42 wv2-f302 kernel: lowmem_reserve[]: 0 0 0 0
Nov 20 12:42:42 wv2-f302 kernel: Node 1 Normal free:43013460kB min:45060kB low:56324kB high:67588kB active_anon:140940168kB inactive_anon:6859676kB active_file:468kB inactive_file:56kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:198574076kB mlocked:0kB dirty:0kB writeback:0kB mapped:540kB shmem:116kB slab_reclaimable:8868kB slab_unreclaimable:5079476kB kernel_stack:2856kB pagetables:355640kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Nov 20 12:42:42 wv2-f302 kernel: lowmem_reserve[]: 0 0 0 0
Nov 20 12:42:42 wv2-f302 kernel: Node 0 DMA: 2*4kB 1*8kB 2*16kB 2*32kB 2*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15728kB
Nov 20 12:42:42 wv2-f302 kernel: Node 0 DMA32: 10*4kB 11*8kB 12*16kB 13*32kB 12*64kB 5*128kB 7*256kB 10*512kB 9*1024kB 6*2048kB 98*4096kB = 431968kB
Nov 20 12:42:42 wv2-f302 kernel: Node 0 Normal: 1648026*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6592104kB
Nov 20 12:42:42 wv2-f302 kernel: Node 1 Normal: 8390977*4kB 1181188*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 43013412kB
Nov 20 12:42:42 wv2-f302 kernel: 49429 total pagecache pages
Nov 20 12:42:42 wv2-f302 kernel: 48929 pages in swap cache
Nov 20 12:42:42 wv2-f302 kernel: Swap cache stats: add 2688331, delete 2639402, find 16219898/16530111
Nov 20 12:42:42 wv2-f302 kernel: Free swap  = 3264kB
Nov 20 12:42:42 wv2-f302 kernel: Total swap = 512248kB
Nov 20 12:42:44 wv2-f302 kernel: 100663294 pages RAM
Nov 20 12:42:44 wv2-f302 kernel: 1446311 pages reserved
Nov 20 12:42:44 wv2-f302 kernel: 10374115 pages shared
Nov 20 12:42:44 wv2-f302 kernel: 84534113 pages non-shared
Oct 27 14:24:43 wv2-f302 kernel: [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
Oct 27 14:24:43 wv2-f302 kernel: [ 3878]     0  3878 32042399 31569413  10       0             0 qemu_wl52
Oct 27 14:24:43 wv2-f302 kernel: [ 4321]     0  4321 32092081 31599762  20       0             0 qemu_wl51
Oct 27 14:24:43 wv2-f302 kernel: [ 4394]     0  4394 32106979 31575717  15       0             0 qemu_wl50
...
Oct 27 14:24:43 wv2-f302 kernel: Out of memory: Kill process 3878 (qemu_wl52) score 318 or sacrifice child
Oct 27 14:24:43 wv2-f302 kernel: Killed process 3878, UID 0, (qemu_wl52) total-vm:128169596kB, anon-rss:126277476kB, file-rss:176kB
Complete dump: http://evilcigi.eu/msg/msg.txt
Then I start the killed guest, and from that moment everything is OK for a few days, with the same memory usage as it had before the problem:
- RAM total 378.48 GB
- RAM used 336.15 GB
- RAM free 42.33 GB
- SWAP total 500.24 MB
- SWAP used 344.55 MB
- SWAP free 155.69 MB
Is it possible that the server somehow counts memory badly? Or is there something I'm missing?
One thing that comes to my mind: the host puts all free memory into buffers and cache and then suffers from a memory shortage (and invokes the OOM killer)? But that, I think, shouldn't happen, right? Also, it doesn't explain the behavior before the killing.
Thank you in advance.
So today the problem occurred again; here is the content of /proc/meminfo:
MemTotal: 396867932 kB
MemFree: 9720268 kB
Buffers: 53354000 kB
Cached: 22196 kB
SwapCached: 343964 kB
Active: 331872796 kB
Inactive: 41283992 kB
Active(anon): 305458432 kB
Inactive(anon): 14322324 kB
Active(file): 26414364 kB
Inactive(file): 26961668 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 512248 kB
SwapFree: 0 kB
Dirty: 48 kB
Writeback: 0 kB
AnonPages: 319438656 kB
Mapped: 8536 kB
Shmem: 164 kB
Slab: 9052784 kB
SReclaimable: 2014752 kB
SUnreclaim: 7038032 kB
KernelStack: 8064 kB
PageTables: 650892 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 198946212 kB
Committed_AS: 383832752 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 1824832 kB
VmallocChunk: 34157271228 kB
HardwareCorrupted: 0 kB
AnonHugePages: 31502336 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 7852 kB
DirectMap2M: 3102720 kB
DirectMap1G: 399507456 kB
It seems that all the "free" memory is spent in buffers.
After hints from @Matthew Ife about memory fragmentation, I compacted the memory and also dropped the caches (to free up the 60 GB held in buffers) on the host, with these commands:
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory
Here is what memory fragmentation looks like now:
# cat /proc/buddyinfo
Node 0, zone DMA 2 1 2 2 2 1 0 0 1 1 3
Node 0, zone DMA32 12 12 13 16 10 5 7 10 9 6 98
Node 0, zone Normal 2398537 469407 144288 97224 58276 24155 8153 3141 1299 451 75
Node 1, zone Normal 9182926 2727543 648104 81843 7915 1267 244 67 3 1 0
Update 2014/11/25 - the server is overloaded again:
# cat /proc/buddyinfo
Node 0, zone DMA 2 1 2 2 2 1 0 0 1 1 3
Node 0, zone DMA32 12 12 13 16 10 5 7 10 9 6 98
Node 0, zone Normal 4374385 85408 0 0 0 0 0 0 0 0 0
Node 1, zone Normal 1830850 261703 460 14 0 0 0 0 0 0 0
# cat /proc/meminfo
MemTotal: 396867932 kB
MemFree: 28038892 kB
Buffers: 49126656 kB
Cached: 19088 kB
SwapCached: 303624 kB
Active: 305426204 kB
Inactive: 49729776 kB
Active(anon): 292040988 kB
Inactive(anon): 13969376 kB
Active(file): 13385216 kB
Inactive(file): 35760400 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 512248 kB
SwapFree: 20 kB
Dirty: 28 kB
Writeback: 0 kB
AnonPages: 305706632 kB
Mapped: 9324 kB
Shmem: 124 kB
Slab: 8616228 kB
SReclaimable: 1580736 kB
SUnreclaim: 7035492 kB
KernelStack: 8200 kB
PageTables: 702268 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 198946212 kB
Committed_AS: 384014048 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 1824832 kB
VmallocChunk: 34157271228 kB
HardwareCorrupted: 0 kB
AnonHugePages: 31670272 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 7852 kB
DirectMap2M: 3102720 kB
DirectMap1G: 399507456 kB
and in syslog there are some page allocation failures:
Nov 25 09:14:07 wv2-f302 kernel: qemu_wl50: page allocation failure. order:4, mode:0x20
Nov 25 09:14:07 wv2-f302 kernel: Pid: 4444, comm: qemu_wl50 Not tainted 2.6.32-431.17.1.el6.x86_64 #1
Nov 25 09:14:07 wv2-f302 kernel: Call Trace:
Nov 25 09:14:07 wv2-f302 kernel: <IRQ> [<ffffffff8112f64a>] ? __alloc_pages_nodemask+0x74a/0x8d0
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8116e082>] ? kmem_getpages+0x62/0x170
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8116ec9a>] ? fallback_alloc+0x1ba/0x270
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8116ea19>] ? ____cache_alloc_node+0x99/0x160
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8116fbe0>] ? kmem_cache_alloc_node_trace+0x90/0x200
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8116fdfd>] ? __kmalloc_node+0x4d/0x60
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8144ff5a>] ? __alloc_skb+0x7a/0x180
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff81451070>] ? skb_copy+0x40/0xb0
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffffa013a55c>] ? tg3_start_xmit+0xa8c/0xd80 [tg3]
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff814603e4>] ? dev_hard_start_xmit+0x224/0x480
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8147be6a>] ? sch_direct_xmit+0x15a/0x1c0
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff814608e8>] ? dev_queue_xmit+0x228/0x320
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffffa02c8898>] ? br_dev_queue_push_xmit+0x88/0xc0 [bridge]
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffffa02c8928>] ? br_forward_finish+0x58/0x60 [bridge]
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffffa02c8ae8>] ? __br_deliver+0x78/0x110 [bridge]
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffffa02c8bb5>] ? br_deliver+0x35/0x40 [bridge]
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffffa02c78f4>] ? br_dev_xmit+0x114/0x140 [bridge]
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff814603e4>] ? dev_hard_start_xmit+0x224/0x480
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8146087d>] ? dev_queue_xmit+0x1bd/0x320
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff81466785>] ? neigh_resolve_output+0x105/0x2d0
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8149a2f8>] ? ip_finish_output+0x148/0x310
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8149a578>] ? ip_output+0xb8/0xc0
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8149983f>] ? __ip_local_out+0x9f/0xb0
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff81499875>] ? ip_local_out+0x25/0x30
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff81499d50>] ? ip_queue_xmit+0x190/0x420
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff814af06e>] ? tcp_transmit_skb+0x40e/0x7b0
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff814b15b0>] ? tcp_write_xmit+0x230/0xa90
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff814b2130>] ? __tcp_push_pending_frames+0x30/0xe0
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff814a9893>] ? tcp_data_snd_check+0x33/0x100
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff814ad491>] ? tcp_rcv_established+0x381/0x7f0
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff814b5873>] ? tcp_v4_do_rcv+0x2e3/0x490
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffffa02b1557>] ? ipv4_confirm+0x87/0x1d0 [nf_conntrack_ipv4]
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffffa0124441>] ? nf_nat_fn+0x91/0x260 [iptable_nat]
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff814b717a>] ? tcp_v4_rcv+0x51a/0x900
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff81494300>] ? ip_local_deliver_finish+0x0/0x2d0
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff814943dd>] ? ip_local_deliver_finish+0xdd/0x2d0
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff81494668>] ? ip_local_deliver+0x98/0xa0
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff81493b2d>] ? ip_rcv_finish+0x12d/0x440
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff814940b5>] ? ip_rcv+0x275/0x350
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff81489509>] ? nf_iterate+0x69/0xb0
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8145b5db>] ? __netif_receive_skb+0x4ab/0x750
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8145f1f0>] ? netif_receive_skb+0x0/0x60
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8145f248>] ? netif_receive_skb+0x58/0x60
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffffa02c9af8>] ? br_handle_frame_finish+0x1e8/0x2a0 [bridge]
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffffa02c9d5a>] ? br_handle_frame+0x1aa/0x250 [bridge]
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8145b659>] ? __netif_receive_skb+0x529/0x750
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8145f248>] ? netif_receive_skb+0x58/0x60
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8145f350>] ? napi_skb_finish+0x50/0x70
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff81460ab9>] ? napi_gro_receive+0x39/0x50
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffffa0136b54>] ? tg3_poll_work+0xc24/0x1020 [tg3]
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffffa0136f9c>] ? tg3_poll_msix+0x4c/0x150 [tg3]
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff81460bd3>] ? net_rx_action+0x103/0x2f0
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff810a6da9>] ? ktime_get+0x69/0xf0
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8107a551>] ? __do_softirq+0xc1/0x1e0
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff810e6b20>] ? handle_IRQ_event+0x60/0x170
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8100c30c>] ? call_softirq+0x1c/0x30
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8100fa75>] ? do_softirq+0x65/0xa0
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8107a405>] ? irq_exit+0x85/0x90
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff815312c5>] ? do_IRQ+0x75/0xf0
Nov 25 09:14:07 wv2-f302 kernel: <EOI> [<ffffffffa018e271>] ? kvm_arch_vcpu_ioctl_run+0x4c1/0x10b0 [kvm]
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffffa018e25f>] ? kvm_arch_vcpu_ioctl_run+0x4af/0x10b0 [kvm]
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff810aee2e>] ? futex_wake+0x10e/0x120
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffffa0175b04>] ? kvm_vcpu_ioctl+0x434/0x580 [kvm]
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8100b9ce>] ? common_interrupt+0xe/0x13
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8119d802>] ? vfs_ioctl+0x22/0xa0
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8119dcca>] ? do_vfs_ioctl+0x3aa/0x580
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff810b186b>] ? sys_futex+0x7b/0x170
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8119df21>] ? sys_ioctl+0x81/0xa0
Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Edit: The problem was solved by reducing the memory allocated for guests. Now there are 3 guests with 80 GB RAM each, leaving about 150 GB RAM to the host system:
# free -h
total used free shared buff/cache available
Mem: 377G 243G 29G 1,9G 104G 132G
-
kasperd over 9 years
How much data do you have stored in tmpfs file systems?
-
dave over 9 years
tmpfs is limited to 5 GB and is almost unused:
tmpfs 5,0G 16K 5,0G 1% /dev/shm
-
kasperd over 9 years
Just to be sure, did you check that there are no other tmpfs mounted on the system?
-
Matthew Ife over 9 years
There's some info missing from the memory dump; can you provide the next 10 lines up from
active_anon:79945387 inactive_anon:3660742 isolated_anon:0
-
dave over 9 years
complete info: evilcigi.eu/msg/msg.txt
-
fsoppelsa almost 8 years
@dave Were you able to find a long-term solution for this issue? If so, please update the post; it would be very helpful.
-
Deer Hunter over 9 years
Please include the relevant information in your post. Answers are supposed to be self-sufficient.
-
dave over 9 years
That is not the case, there are no snapshots.
-
dave over 9 years
I already turned ksmd (ksmtuned) off yesterday. I'll try to compact the memory, but how will it affect performance? (It is a live server.)
-
Matthew Ife over 9 years
It may cause %sys to go up for a few seconds. Expect the host to be unresponsive on a particular CPU for about 20 seconds or so. This won't cause downtime; note that the system does this automatically, without your consent, on certain memory allocations anyway.
-
dave over 9 years
OK, just to be sure. Now it's compacted. For now it's OK, but that's primarily because I lowered the memory usage of the guests (virsh setmem after clearing their memory caches).
-
dave over 9 years
I have updated the question with the contents of /proc/buddyinfo after compacting the memory. I can't really tell whether it's OK or not. Also, do you have any clue why so much memory was spent on buffers? Can it be related to fragmentation?
-
Matthew Ife over 9 years
I think KSM fragmented you. The buddyinfo looks healthy. You want each column to have at least some number, getting lower and lower as you go further to the right. In your case you want to make sure there are at least a few thousand in the first 4 numeric columns.
-
Henrik Carlqvist over 9 years
Thanks, Deer Hunter, for your comment. I have now added some more info, even though my answer will not help dave as he is not using -snapshot. Hopefully my answer will now be more useful for people finding it by googling here.
-
dave over 9 years
Well, it seems that fragmentation is not the cause of the problem. Always after a while, buffers on the host eat up all free memory and the server starts to act as if it is out of memory.
-
dave over 9 years
I've updated the information above.