Server overloaded, acts like out of memory, but that's not true

Solution 1

The problem was solved by reducing the memory allocated for guests. Now there are 3 guests with 80 GB RAM each, leaving about 150 GB RAM to the host system:

# free -h
              total        used        free      shared  buff/cache   available
Mem:           377G        243G         29G        1,9G        104G        132G

It feels like a big waste of memory, but things are stable now.
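
For reference, a sketch of one way to make that change with virsh (only a sketch: wl50 is a placeholder domain name, and 83886080 KiB is 80 GB, i.e. 80 * 1024 * 1024):

# "wl50" is a placeholder domain name - repeat for each guest
virsh setmaxmem wl50 83886080 --config   # lower the maximum allocation (applies on next boot)
virsh setmem    wl50 83886080 --config   # lower the current allocation in the persistent config
virsh setmem    wl50 83886080 --live     # balloon the running guest down right away

Alternatively, edit the <memory> and <currentMemory> elements shown in the question down to the same value.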

Solution 2

There's a lot of free memory, but these zones are completely fragmented:

Node 0 Normal: 1648026*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6592104kB
Node 1 Normal: 8390977*4kB 1181188*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 43013412kB

Very few higher-order pages are left; in one zone there are none at all.

I can't guarantee anything, but you may want to try turning off ksmd and re-compacting memory. Compaction is only invoked automatically on higher-order page allocations and never calls the OOM killer, so I assume the system tried to allocate memory at order 2 or 3 and got stuck.

To compact memory, run echo 1 > /proc/sys/vm/compact_memory

There's only so much to go on in this question, but I suspect ksmd is causing the fragmentation by scanning for pages duplicated across the VMs and shuffling them all around.
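
If you want to try that, here is a minimal sketch, assuming the stock KSM sysfs interface and the RHEL 6 ksmtuned service (paths and service names may differ on your setup):

# stop the tuning daemon first, otherwise it can simply restart KSM
service ksmtuned stop
# 2 = stop the KSM scanner and unmerge all currently shared pages
echo 2 > /sys/kernel/mm/ksm/run
# then trigger a manual compaction of all zones
echo 1 > /proc/sys/vm/compact_memory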

Solution 3

@Matthew's answer should be marked as the solution to this problem. The /proc/buddyinfo output clearly shows fragmentation (due to ksmd or other behaviour), and memory compaction is a valid fix.

We just hit the same problem on our server:

# cat /proc/buddyinfo
Node 0, zone      DMA      1      0      1      0      0      1      0      0      0      1      3
Node 0, zone    DMA32   4941  14025  10661   1462   1715    154      1      0      0      0      0
Node 0, zone   Normal 420283 217678   3852      3      1      0      1      1      1      0      0
Node 1, zone   Normal 1178429 294431  21420    340      7      2      1      2      0      0      0

This clearly shows fragmentation: most of the free memory is split into lots of small blocks (large numbers in the left-hand columns, zeros on the right).
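
To put a number on it, here is a small awk sketch (it only assumes the standard /proc/buddyinfo layout, where the fields after the zone name count free blocks of order 0 to 10, i.e. 4 kB to 4 MB); it sums the free memory per Normal zone and how much of it sits in blocks of 64 kB or larger:

# summarize /proc/buddyinfo for the Normal zones (sketch)
awk '/Normal/ {
    free = 0; big = 0; kb_per_block = 4        # order 0 = one 4 kB page
    for (i = 5; i <= NF; i++) {                # fields 5..NF = orders 0..10
        kb = $i * kb_per_block
        free += kb
        if (i >= 9) big += kb                  # field 9 = order 4 = 64 kB blocks
        kb_per_block *= 2
    }
    printf "%s %s zone %s: %d kB free, %d kB in blocks >= 64 kB\n", $1, $2, $4, free, big
}' /proc/buddyinfo

The second number is roughly what an order-4 allocation (64 kB, like the failing network allocation in the question) has to work with.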

Now compaction solves this:

# echo 1 >/proc/sys/vm/compact_memory
# cat /proc/buddyinfo
Node 0, zone      DMA      1      0      1      0      0      1      0      0      0      1      3
Node 0, zone    DMA32    485   1746   8588   3311   2076    505     98     19      3      0      0
Node 0, zone   Normal  83764  22474   8597   3130   1971   1421   1090    808    556    358     95
Node 1, zone   Normal  51928  36053  36093  29024  21498  13148   5719   1405    151      8      0

Comments

  • dave
    dave almost 2 years

    I have a CentOS 6.5 server running QEMU-KVM virtualization:

    hardware:

    • 40 CPUs
    • 400 GB RAM

    software:

    • Kernel: 2.6.32-431.17.1.el6.x86_64
    • Qemu: 0.12.1.2
    • Libvirt: 0.10.2

    There are 3 guests, with identical hw configuration:

    • 16 CPUs
    • 120 GB RAM

    <memory unit='KiB'>125829120</memory>
    <currentMemory unit='KiB'>125829120</currentMemory>
    <vcpu placement='static'>16</vcpu>

    Guests are running Apache and MySQL.

    Besides the virtual machines, the host runs just some backup and maintenance scripts, nothing else.

    Always after a few days of running, problems start to show up. The load on the guests randomly spikes up to about 150, with 10-15% steal CPU time. On the host the load is around 38-40, with about 30-40% user CPU time and 40-50% system CPU time.

    The most CPU-consuming processes on the host at that moment are the QEMU processes of the virtual guests, followed right after by kswapd0 and kswapd1 at 100% CPU usage.

    Memory usage at that moment:

    • RAM total 378.48 GB
    • RAM used 330.82 GB
    • RAM free 47.66 GB
    • SWAP total 500.24 MB
    • SWAP used 497.13 MB
    • SWAP free 3192 kB

    plus 10-20 GB RAM in buffers.

    So, from the point of view of memory usage, there shouldn't be any problem. But the heavy work of the kswapd processes indicates a memory shortage, and the full swap points in the same direction (when I turn swap off and on, it fills up again within moments). And once in a while, the OOM killer kills some process:

    Nov 20 12:42:42 wv2-f302 kernel: active_anon:79945387 inactive_anon:3660742 isolated_anon:0
    Nov 20 12:42:42 wv2-f302 kernel: active_file:252 inactive_file:0 isolated_file:0
    Nov 20 12:42:42 wv2-f302 kernel: unevictable:0 dirty:2 writeback:0 unstable:0
    Nov 20 12:42:42 wv2-f302 kernel: free:12513746 slab_reclaimable:5001 slab_unreclaimable:1759785
    Nov 20 12:42:42 wv2-f302 kernel: mapped:213 shmem:41 pagetables:188243 bounce:0
    Nov 20 12:42:42 wv2-f302 kernel: Node 0 DMA free:15728kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15332kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
    Nov 20 12:42:42 wv2-f302 kernel: lowmem_reserve[]: 0 2965 193855 193855
    Nov 20 12:42:42 wv2-f302 kernel: Node 0 DMA32 free:431968kB min:688kB low:860kB high:1032kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3037072kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
    Nov 20 12:42:42 wv2-f302 kernel: lowmem_reserve[]: 0 0 190890 190890
    Nov 20 12:42:42 wv2-f302 kernel: Node 0 Normal free:6593828kB min:44356kB low:55444kB high:66532kB active_anon:178841380kB inactive_anon:7783292kB active_file:540kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:195471360kB mlocked:0kB dirty:8kB writeback:0kB mapped:312kB shmem:48kB slab_reclaimable:11136kB slab_unreclaimable:1959664kB kernel_stack:5104kB pagetables:397332kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
    Nov 20 12:42:42 wv2-f302 kernel: lowmem_reserve[]: 0 0 0 0
    Nov 20 12:42:42 wv2-f302 kernel: Node 1 Normal free:43013460kB min:45060kB low:56324kB high:67588kB active_anon:140940168kB inactive_anon:6859676kB active_file:468kB inactive_file:56kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:198574076kB mlocked:0kB dirty:0kB writeback:0kB mapped:540kB shmem:116kB slab_reclaimable:8868kB slab_unreclaimable:5079476kB kernel_stack:2856kB pagetables:355640kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
    Nov 20 12:42:42 wv2-f302 kernel: lowmem_reserve[]: 0 0 0 0
    Nov 20 12:42:42 wv2-f302 kernel: Node 0 DMA: 2*4kB 1*8kB 2*16kB 2*32kB 2*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15728kB
    Nov 20 12:42:42 wv2-f302 kernel: Node 0 DMA32: 10*4kB 11*8kB 12*16kB 13*32kB 12*64kB 5*128kB 7*256kB 10*512kB 9*1024kB 6*2048kB 98*4096kB = 431968kB
    Nov 20 12:42:42 wv2-f302 kernel: Node 0 Normal: 1648026*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6592104kB
    Nov 20 12:42:42 wv2-f302 kernel: Node 1 Normal: 8390977*4kB 1181188*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 43013412kB
    Nov 20 12:42:42 wv2-f302 kernel: 49429 total pagecache pages
    Nov 20 12:42:42 wv2-f302 kernel: 48929 pages in swap cache
    Nov 20 12:42:42 wv2-f302 kernel: Swap cache stats: add 2688331, delete 2639402, find 16219898/16530111
    Nov 20 12:42:42 wv2-f302 kernel: Free swap  = 3264kB
    Nov 20 12:42:42 wv2-f302 kernel: Total swap = 512248kB
    Nov 20 12:42:44 wv2-f302 kernel: 100663294 pages RAM
    Nov 20 12:42:44 wv2-f302 kernel: 1446311 pages reserved
    Nov 20 12:42:44 wv2-f302 kernel: 10374115 pages shared
    Nov 20 12:42:44 wv2-f302 kernel: 84534113 pages non-shared
    
    Oct 27 14:24:43 wv2-f302 kernel: [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
    Oct 27 14:24:43 wv2-f302 kernel: [ 3878]     0  3878 32042399 31569413  10       0             0 qemu_wl52
    Oct 27 14:24:43 wv2-f302 kernel: [ 4321]     0  4321 32092081 31599762  20       0             0 qemu_wl51
    Oct 27 14:24:43 wv2-f302 kernel: [ 4394]     0  4394 32106979 31575717  15       0             0 qemu_wl50
    ...
    Oct 27 14:24:43 wv2-f302 kernel: Out of memory: Kill process 3878 (qemu_wl52) score 318 or sacrifice child
    Oct 27 14:24:43 wv2-f302 kernel: Killed process 3878, UID 0, (qemu_wl52) total-vm:128169596kB, anon-rss:126277476kB, file-rss:176kB
    

    Complete dump: http://evilcigi.eu/msg/msg.txt

    Then I start the killed guest, and from that moment everything is OK for a few days, with the same memory usage as before the problem:

    • RAM total 378.48 GB
    • RAM used 336.15 GB
    • RAM free 42.33 GB
    • SWAP total 500.24 MB
    • SWAP used 344.55 MB
    • SWAP free 155.69 MB

    Is it possible that the server somehow miscounts memory? Or is there something I'm missing?

    One thing that comes to my mind: does the host put all free memory into buffers and cache and then suffer from a memory shortage (and invoke the OOM killer)? But that, I think, shouldn't happen, right? Also, it doesn't explain the behavior before the killing.

    Thank you in advance.


    So today the problem occurred again; here is the content of /proc/meminfo:

    MemTotal:       396867932 kB
    MemFree:         9720268 kB
    Buffers:        53354000 kB
    Cached:            22196 kB
    SwapCached:       343964 kB
    Active:         331872796 kB
    Inactive:       41283992 kB
    Active(anon):   305458432 kB
    Inactive(anon): 14322324 kB
    Active(file):   26414364 kB
    Inactive(file): 26961668 kB
    Unevictable:           0 kB
    Mlocked:               0 kB
    SwapTotal:        512248 kB
    SwapFree:              0 kB
    Dirty:                48 kB
    Writeback:             0 kB
    AnonPages:      319438656 kB
    Mapped:             8536 kB
    Shmem:               164 kB
    Slab:            9052784 kB
    SReclaimable:    2014752 kB
    SUnreclaim:      7038032 kB
    KernelStack:        8064 kB
    PageTables:       650892 kB
    NFS_Unstable:          0 kB
    Bounce:                0 kB
    WritebackTmp:          0 kB
    CommitLimit:    198946212 kB
    Committed_AS:   383832752 kB
    VmallocTotal:   34359738367 kB
    VmallocUsed:     1824832 kB
    VmallocChunk:   34157271228 kB
    HardwareCorrupted:     0 kB
    AnonHugePages:  31502336 kB
    HugePages_Total:       0
    HugePages_Free:        0
    HugePages_Rsvd:        0
    HugePages_Surp:        0
    Hugepagesize:       2048 kB
    DirectMap4k:        7852 kB
    DirectMap2M:     3102720 kB
    DirectMap1G:    399507456 kB
    

    It seems that all the "free" memory is sitting in buffers.


    After hints from @Matthew Ife about memory fragmentation, I compacted the memory and also dropped the caches (to free up 60 GB held in buffers) on the host, with these commands:

    echo 3 > /proc/sys/vm/drop_caches
    echo 1 >/proc/sys/vm/compact_memory
    

    Here is what memory fragmentation looks like now:

    # cat /proc/buddyinfo
    Node 0, zone      DMA      2      1      2      2      2      1      0      0      1      1      3 
    Node 0, zone    DMA32     12     12     13     16     10      5      7     10      9      6     98 
    Node 0, zone   Normal 2398537 469407 144288  97224  58276  24155   8153   3141   1299    451     75 
    Node 1, zone   Normal 9182926 2727543 648104  81843   7915   1267    244     67      3      1      0
    

    Update 2014/11/25 - the server is overloaded again:

    # cat /proc/buddyinfo
    Node 0, zone      DMA      2      1      2      2      2      1      0      0      1      1      3 
    Node 0, zone    DMA32     12     12     13     16     10      5      7     10      9      6     98 
    Node 0, zone   Normal 4374385  85408      0      0      0      0      0      0      0      0      0 
    Node 1, zone   Normal 1830850 261703    460     14      0      0      0      0      0      0      0 
    
    # cat /proc/meminfo 
    MemTotal:       396867932 kB
    MemFree:        28038892 kB
    Buffers:        49126656 kB
    Cached:            19088 kB
    SwapCached:       303624 kB
    Active:         305426204 kB
    Inactive:       49729776 kB
    Active(anon):   292040988 kB
    Inactive(anon): 13969376 kB
    Active(file):   13385216 kB
    Inactive(file): 35760400 kB
    Unevictable:           0 kB
    Mlocked:               0 kB
    SwapTotal:        512248 kB
    SwapFree:             20 kB
    Dirty:                28 kB
    Writeback:             0 kB
    AnonPages:      305706632 kB
    Mapped:             9324 kB
    Shmem:               124 kB
    Slab:            8616228 kB
    SReclaimable:    1580736 kB
    SUnreclaim:      7035492 kB
    KernelStack:        8200 kB
    PageTables:       702268 kB
    NFS_Unstable:          0 kB
    Bounce:                0 kB
    WritebackTmp:          0 kB
    CommitLimit:    198946212 kB
    Committed_AS:   384014048 kB
    VmallocTotal:   34359738367 kB
    VmallocUsed:     1824832 kB
    VmallocChunk:   34157271228 kB
    HardwareCorrupted:     0 kB
    AnonHugePages:  31670272 kB
    HugePages_Total:       0
    HugePages_Free:        0
    HugePages_Rsvd:        0
    HugePages_Surp:        0
    Hugepagesize:       2048 kB
    DirectMap4k:        7852 kB
    DirectMap2M:     3102720 kB
    DirectMap1G:    399507456 kB
    

    and in syslog there are some page allocation failures:

    Nov 25 09:14:07 wv2-f302 kernel: qemu_wl50: page allocation failure. order:4, mode:0x20
    Nov 25 09:14:07 wv2-f302 kernel: Pid: 4444, comm: qemu_wl50 Not tainted 2.6.32-431.17.1.el6.x86_64 #1
    Nov 25 09:14:07 wv2-f302 kernel: Call Trace:
    Nov 25 09:14:07 wv2-f302 kernel: <IRQ>  [<ffffffff8112f64a>] ? __alloc_pages_nodemask+0x74a/0x8d0
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8116e082>] ? kmem_getpages+0x62/0x170
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8116ec9a>] ? fallback_alloc+0x1ba/0x270
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8116ea19>] ? ____cache_alloc_node+0x99/0x160
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8116fbe0>] ? kmem_cache_alloc_node_trace+0x90/0x200
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8116fdfd>] ? __kmalloc_node+0x4d/0x60
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8144ff5a>] ? __alloc_skb+0x7a/0x180
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff81451070>] ? skb_copy+0x40/0xb0
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffffa013a55c>] ? tg3_start_xmit+0xa8c/0xd80 [tg3]
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff814603e4>] ? dev_hard_start_xmit+0x224/0x480
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8147be6a>] ? sch_direct_xmit+0x15a/0x1c0
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff814608e8>] ? dev_queue_xmit+0x228/0x320
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffffa02c8898>] ? br_dev_queue_push_xmit+0x88/0xc0 [bridge]
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffffa02c8928>] ? br_forward_finish+0x58/0x60 [bridge]
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffffa02c8ae8>] ? __br_deliver+0x78/0x110 [bridge]
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffffa02c8bb5>] ? br_deliver+0x35/0x40 [bridge]
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffffa02c78f4>] ? br_dev_xmit+0x114/0x140 [bridge]
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff814603e4>] ? dev_hard_start_xmit+0x224/0x480
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8146087d>] ? dev_queue_xmit+0x1bd/0x320
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff81466785>] ? neigh_resolve_output+0x105/0x2d0
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8149a2f8>] ? ip_finish_output+0x148/0x310
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8149a578>] ? ip_output+0xb8/0xc0
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8149983f>] ? __ip_local_out+0x9f/0xb0
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff81499875>] ? ip_local_out+0x25/0x30
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff81499d50>] ? ip_queue_xmit+0x190/0x420
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff814af06e>] ? tcp_transmit_skb+0x40e/0x7b0
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff814b15b0>] ? tcp_write_xmit+0x230/0xa90
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff814b2130>] ? __tcp_push_pending_frames+0x30/0xe0
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff814a9893>] ? tcp_data_snd_check+0x33/0x100
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff814ad491>] ? tcp_rcv_established+0x381/0x7f0
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff814b5873>] ? tcp_v4_do_rcv+0x2e3/0x490
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffffa02b1557>] ? ipv4_confirm+0x87/0x1d0 [nf_conntrack_ipv4]
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffffa0124441>] ? nf_nat_fn+0x91/0x260 [iptable_nat]
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff814b717a>] ? tcp_v4_rcv+0x51a/0x900
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff81494300>] ? ip_local_deliver_finish+0x0/0x2d0
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff814943dd>] ? ip_local_deliver_finish+0xdd/0x2d0
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff81494668>] ? ip_local_deliver+0x98/0xa0
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff81493b2d>] ? ip_rcv_finish+0x12d/0x440
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff814940b5>] ? ip_rcv+0x275/0x350
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff81489509>] ? nf_iterate+0x69/0xb0
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8145b5db>] ? __netif_receive_skb+0x4ab/0x750
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8145f1f0>] ? netif_receive_skb+0x0/0x60
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8145f248>] ? netif_receive_skb+0x58/0x60
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffffa02c9af8>] ? br_handle_frame_finish+0x1e8/0x2a0 [bridge]
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffffa02c9d5a>] ? br_handle_frame+0x1aa/0x250 [bridge]
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8145b659>] ? __netif_receive_skb+0x529/0x750
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8145f248>] ? netif_receive_skb+0x58/0x60
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8145f350>] ? napi_skb_finish+0x50/0x70
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff81460ab9>] ? napi_gro_receive+0x39/0x50
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffffa0136b54>] ? tg3_poll_work+0xc24/0x1020 [tg3]
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffffa0136f9c>] ? tg3_poll_msix+0x4c/0x150 [tg3]
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff81460bd3>] ? net_rx_action+0x103/0x2f0
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff810a6da9>] ? ktime_get+0x69/0xf0
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8107a551>] ? __do_softirq+0xc1/0x1e0
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff810e6b20>] ? handle_IRQ_event+0x60/0x170
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8100c30c>] ? call_softirq+0x1c/0x30
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8100fa75>] ? do_softirq+0x65/0xa0
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8107a405>] ? irq_exit+0x85/0x90
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff815312c5>] ? do_IRQ+0x75/0xf0
    Nov 25 09:14:07 wv2-f302 kernel: <EOI>  [<ffffffffa018e271>] ? kvm_arch_vcpu_ioctl_run+0x4c1/0x10b0 [kvm]
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffffa018e25f>] ? kvm_arch_vcpu_ioctl_run+0x4af/0x10b0 [kvm]
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff810aee2e>] ? futex_wake+0x10e/0x120
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffffa0175b04>] ? kvm_vcpu_ioctl+0x434/0x580 [kvm]
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8100b9ce>] ? common_interrupt+0xe/0x13
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8119d802>] ? vfs_ioctl+0x22/0xa0
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8119dcca>] ? do_vfs_ioctl+0x3aa/0x580
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff810b186b>] ? sys_futex+0x7b/0x170
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8119df21>] ? sys_ioctl+0x81/0xa0
    Nov 25 09:14:07 wv2-f302 kernel: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
    

    edit: The problem was solved by reducing the memory allocated for guests. Now there are 3 guests with 80 GB RAM each, leaving about 150 GB RAM to the host system:

    # free -h
                  total        used        free      shared  buff/cache   available
    Mem:           377G        243G         29G        1,9G        104G        132G
    
    • kasperd
      kasperd over 9 years
      How much data do you have stored in tmpfs file systems?
    • dave
      dave over 9 years
      tmpfs is limited to 5 GB and is almost unused: tmpfs 5,0G 16K 5,0G 1% /dev/shm
    • kasperd
      kasperd over 9 years
      Just to be sure, did you check that there are no other tmpfs mounted on the system?
    • Matthew Ife
      Matthew Ife over 9 years
      There's some info missing from the memory dump; can you provide the next 10 lines up from active_anon:79945387 inactive_anon:3660742 isolated_anon:0?
    • dave
      dave over 9 years
      complete info: evilcigi.eu/msg/msg.txt
    • fsoppelsa
      fsoppelsa almost 8 years
      @dave Were you able to find a long-term solution for this issue? If so, please update the post, it would be very helpful.
  • Deer Hunter
    Deer Hunter over 9 years
    Please include the relevant information in your post. Answers are supposed to be self-sufficient.
  • dave
    dave over 9 years
    That is not the case, there are no snapshots.
  • dave
    dave over 9 years
    I already turned off ksmd (ksmtuned) yesterday. I'll try to compact the memory, but how will it affect performance? (It is a live server.)
  • Matthew Ife
    Matthew Ife over 9 years
    It may cause %sys to go up for a few seconds. Expect the host to be unresponsive on a particular CPU for about 20 seconds or so. This won't cause downtime; note that the system does this automatically, without your consent, on certain memory allocations anyway.
  • dave
    dave over 9 years
    OK, just to be sure. Now it's compacted. For now it's OK, but that's primarily because I lowered the memory usage of the guests (virsh setmem after clearing their memory caches).
  • dave
    dave over 9 years
    I have updated the question with the contents of /proc/buddyinfo after compacting the memory. I can't really tell if it's OK or not. Also, do you have any clue why so much memory was spent in buffers? Can it be related to fragmentation?
  • Matthew Ife
    Matthew Ife over 9 years
    I think KSM fragmented you. The buddyinfo looks healthy. You want each column to have at least some number, getting lower and lower as you go further to the right. In your case you want to make sure there are at least a few thousand in the first 4 numeric columns.
  • Henrik Carlqvist
    Henrik Carlqvist over 9 years
    Thanks, Deer Hunter, for your comment. I have now added some more info, even though my answer will not help dave as he is not using -snapshot. Still, hopefully my answer will now be more useful for people finding it by googling.
  • dave
    dave over 9 years
    Well, it seems that fragmentation is not the cause of the problem. Always after a while, the buffers on the host eat all free memory and the server starts to act like it is out of memory.
  • dave
    dave over 9 years
    I've updated the information above.