CentOS 6: strange page allocation failure messages

networking centos memory kernel syslog

14,715

Solution 1

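As a workaround for https://bugzilla.redhat.com/show_bug.cgi?id=713546, set the following kernel parameters (for example in /etc/sysctl.conf, then reload with sysctl -p):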

vm.min_free_kbytes = 512000
vm.zone_reclaim_mode = 1

The same settings were also suggested as a potential workaround in this CentOS mailing list thread: http://lists.centos.org/pipermail/centos/2012-October/129844.html.
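To confirm that the two settings are actually in effect on the running kernel, the values can be read back from /proc/sys. The following is only a minimal sketch: the helper names `read_sysctl` and `check` and the `proc_root` parameter are made up for illustration, and the wanted values are simply the ones from the workaround above.

```python
from pathlib import Path

# Values taken from the workaround above.
WANTED = {"vm.min_free_kbytes": 512000, "vm.zone_reclaim_mode": 1}


def read_sysctl(name, proc_root="/proc/sys"):
    """Return the running integer value of a sysctl key, or None if unreadable."""
    # "vm.min_free_kbytes" maps to /proc/sys/vm/min_free_kbytes
    path = Path(proc_root, *name.split("."))
    try:
        return int(path.read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def check(wanted=WANTED, proc_root="/proc/sys"):
    """Map each key to (running value, wanted value, whether they match)."""
    result = {}
    for name, value in wanted.items():
        current = read_sysctl(name, proc_root)
        result[name] = (current, value, current == value)
    return result


if __name__ == "__main__":
    for name, (current, value, ok) in check().items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: running={current} wanted={value} {status}")
```

A MISMATCH line after running sysctl -p would suggest the setting never reached the kernel, which is worth ruling out before concluding that the workaround does not help.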

Solution 2

Please upgrade to the CentOS equivalent of kernel-2.6.32-358.el6. The bug has been fixed in that release.

Essentially, this is about memory allocation in interrupt context. If you want, you can check gfp.h in include/linux. Mode 0x20 means that the allocation cannot wait: it is made in interrupt context, and the wait bit for the allocation is not set. Therefore, if the memory cannot be allocated immediately, the allocation fails. The fix is quite substantial.
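The order and mode fields from the log line can be decoded by hand. Below is a small sketch that does this, assuming the low flag bits as I read them in include/linux/gfp.h of 2.6.32-era kernels (where GFP_ATOMIC is defined as __GFP_HIGH) and a 4 KiB page size; check the header of your exact kernel before relying on the table. The function names are made up for illustration.

```python
# Low GFP flag bits as found in include/linux/gfp.h of 2.6.32-era kernels.
# This table is an assumption; verify it against your kernel's header.
GFP_BITS = {
    0x01: "__GFP_DMA",
    0x02: "__GFP_HIGHMEM",
    0x04: "__GFP_DMA32",
    0x08: "__GFP_MOVABLE",
    0x10: "__GFP_WAIT",
    0x20: "__GFP_HIGH",
    0x40: "__GFP_IO",
    0x80: "__GFP_FS",
}
KNOWN_MASK = 0xFF  # bits covered by the table above


def decode_mode(mode):
    """Return the names of the known flags set in a gfp mode value."""
    names = [name for bit, name in sorted(GFP_BITS.items()) if mode & bit]
    unknown = mode & ~KNOWN_MASK
    if unknown:
        names.append(f"unknown:{unknown:#x}")
    return names


def can_wait(mode):
    """True if the wait bit is set, i.e. the allocation is allowed to sleep."""
    return bool(mode & 0x10)


def alloc_bytes(order, page_size=4096):
    """Size of an order-N allocation: 2**N physically contiguous pages."""
    return page_size << order


if __name__ == "__main__":
    # Values from the log line: "order:1, mode:0x20"
    print(decode_mode(0x20), can_wait(0x20), alloc_bytes(1))
```

Under these assumptions, mode 0x20 has only __GFP_HIGH set and the wait bit clear, and order:1 is a request for two contiguous pages (8192 bytes with 4 KiB pages), which matches the explanation above.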


steveh80

Updated on September 18, 2022

Comments

steveh80 over 1 year

I set up a new server with CentOS 6.4 final as the successor to an old MySQL server, and I'm facing some problems with it. From time to time, MySQL connections are dropped. Furthermore, the transfer of the large backup tar files to an FTP storage sometimes fails. Neither problem is reproducible.

While analyzing this, I found some strange messages in /var/log/messages that I cannot interpret.

Mar 30 13:09:24 s16838172 kernel: swapper: page allocation failure. order:1, mode:0x20
Mar 30 13:09:24 s16838172 kernel: Pid: 0, comm: swapper Not tainted 2.6.32-358.0.1.el6.x86_64 #1
Mar 30 13:09:24 s16838172 kernel: Call Trace:
Mar 30 13:09:24 s16838172 kernel: <IRQ>  [<ffffffff8112c207>] ? __alloc_pages_nodemask+0x757/0x8d0
Mar 30 13:09:24 s16838172 kernel: [<ffffffff81166ab2>] ? kmem_getpages+0x62/0x170
Mar 30 13:09:24 s16838172 kernel: [<ffffffff811676ca>] ? fallback_alloc+0x1ba/0x270
Mar 30 13:09:24 s16838172 kernel: [<ffffffff8116711f>] ? cache_grow+0x2cf/0x320
Mar 30 13:09:24 s16838172 kernel: [<ffffffff81167449>] ? ____cache_alloc_node+0x99/0x160
Mar 30 13:09:24 s16838172 kernel: [<ffffffff811683cb>] ? kmem_cache_alloc+0x11b/0x190
Mar 30 13:09:24 s16838172 kernel: [<ffffffff81439c18>] ? sk_prot_alloc+0x48/0x1c0
Mar 30 13:09:24 s16838172 kernel: [<ffffffff8143acf2>] ? sk_clone+0x22/0x2e0
Mar 30 13:09:24 s16838172 kernel: [<ffffffff81489bc6>] ? inet_csk_clone+0x16/0xd0
Mar 30 13:09:24 s16838172 kernel: [<ffffffff814a2ad3>] ? tcp_create_openreq_child+0x23/0x450
Mar 30 13:09:24 s16838172 kernel: [<ffffffff814a02cd>] ? tcp_v4_syn_recv_sock+0x4d/0x310
Mar 30 13:09:24 s16838172 kernel: [<ffffffff814a2876>] ? tcp_check_req+0x226/0x460
Mar 30 13:09:24 s16838172 kernel: [<ffffffff8149fd6b>] ? tcp_v4_do_rcv+0x35b/0x430
Mar 30 13:09:24 s16838172 kernel: [<ffffffffa03b4557>] ? ipv4_confirm+0x87/0x1d0 [nf_conntrack_ipv4]
Mar 30 13:09:24 s16838172 kernel: [<ffffffff814a157e>] ? tcp_v4_rcv+0x4fe/0x8d0
Mar 30 13:09:24 s16838172 kernel: [<ffffffff8147f290>] ? ip_local_deliver_finish+0x0/0x2d0
Mar 30 13:09:24 s16838172 kernel: [<ffffffff8147f36d>] ? ip_local_deliver_finish+0xdd/0x2d0
Mar 30 13:09:24 s16838172 kernel: [<ffffffff8147f5f8>] ? ip_local_deliver+0x98/0xa0
Mar 30 13:09:24 s16838172 kernel: [<ffffffff8147eabd>] ? ip_rcv_finish+0x12d/0x440
Mar 30 13:09:24 s16838172 kernel: [<ffffffff8147f045>] ? ip_rcv+0x275/0x350
Mar 30 13:09:24 s16838172 kernel: [<ffffffff8144827b>] ? __netif_receive_skb+0x4ab/0x750
Mar 30 13:09:24 s16838172 kernel: [<ffffffff8144a658>] ? netif_receive_skb+0x58/0x60
Mar 30 13:09:24 s16838172 kernel: [<ffffffff8144a760>] ? napi_skb_finish+0x50/0x70
Mar 30 13:09:24 s16838172 kernel: [<ffffffff8144cd09>] ? napi_gro_receive+0x39/0x50
Mar 30 13:09:24 s16838172 kernel: [<ffffffffa00f933b>] ? e1000_receive_skb+0x5b/0x90 [e1000e]
Mar 30 13:09:24 s16838172 kernel: [<ffffffffa00fc601>] ? e1000_clean_rx_irq+0x241/0x4c0 [e1000e]
Mar 30 13:09:24 s16838172 kernel: [<ffffffffa0103bbd>] ? e1000e_poll+0xbd/0x380 [e1000e]
Mar 30 13:09:24 s16838172 kernel: [<ffffffffa00f9eca>] ? e1000_put_txbuf+0x6a/0xa0 [e1000e]
Mar 30 13:09:24 s16838172 kernel: [<ffffffff8144ce23>] ? net_rx_action+0x103/0x2f0
Mar 30 13:09:24 s16838172 kernel: [<ffffffff8109b153>] ? hrtimer_get_next_event+0xc3/0x100
Mar 30 13:09:24 s16838172 kernel: [<ffffffff81076fb1>] ? __do_softirq+0xc1/0x1e0
Mar 30 13:09:24 s16838172 kernel: [<ffffffff810e1720>] ? handle_IRQ_event+0x60/0x170
Mar 30 13:09:24 s16838172 kernel: [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
Mar 30 13:09:24 s16838172 kernel: [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
Mar 30 13:09:24 s16838172 kernel: [<ffffffff81076d95>] ? irq_exit+0x85/0x90
Mar 30 13:09:24 s16838172 kernel: [<ffffffff81516d75>] ? do_IRQ+0x75/0xf0
Mar 30 13:09:24 s16838172 kernel: [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
Mar 30 13:09:24 s16838172 kernel: <EOI>  [<ffffffff812d388e>] ? intel_idle+0xde/0x170
Mar 30 13:09:24 s16838172 kernel: [<ffffffff812d3871>] ? intel_idle+0xc1/0x170
Mar 30 13:09:24 s16838172 kernel: [<ffffffff81414fd7>] ? cpuidle_idle_call+0xa7/0x140
Mar 30 13:09:24 s16838172 kernel: [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
Mar 30 13:09:24 s16838172 kernel: [<ffffffff814f300a>] ? rest_init+0x7a/0x80
Mar 30 13:09:24 s16838172 kernel: [<ffffffff81c27f7b>] ? start_kernel+0x424/0x430
Mar 30 13:09:24 s16838172 kernel: [<ffffffff81c2733a>] ? x86_64_start_reservations+0x125/0x129
Mar 30 13:09:24 s16838172 kernel: [<ffffffff81c27438>] ? x86_64_start_kernel+0xfa/0x109

Blocks of messages like this appear about 2-10 times within 5 minutes; after that, they are gone for a few hours.
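To see how often these blocks occur and whether they always carry the same order and mode, the header lines can be counted. The following is a rough sketch that assumes the syslog line format shown in the excerpt above; the regular expression and the name `summarize` are illustrative only.

```python
import re
from collections import Counter

# Matches header lines such as:
# "Mar 30 13:09:24 s16838172 kernel: swapper: page allocation failure. order:1, mode:0x20"
FAILURE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+kernel:\s+"
    r"(?P<comm>.+?): page allocation failure\. "
    r"order:(?P<order>\d+), mode:(?P<mode>0x[0-9a-fA-F]+)"
)


def summarize(lines):
    """Count failures per (process name, order, mode); other lines are ignored."""
    counts = Counter()
    for line in lines:
        m = FAILURE_RE.match(line)
        if m:
            key = (m["comm"], int(m["order"]), int(m["mode"], 16))
            counts[key] += 1
    return counts
```

It could be fed an open file, for example `summarize(open("/var/log/messages"))`, which usually requires root to read the log.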

Can somebody help me with that? I hope it's not a hardware problem.

Update: This seems to be reproducible by transferring big files over the network (backups to the FTP storage). The FTP upload aborts after a few GB, and the messages above appear in /var/log/messages.

Thanks!

steveh80 about 11 years

"You are not authorized to access bug #713546." :-( Can you share more information about what is being discussed there? I also read that zone_reclaim_mode=1 can cause performance issues on database servers?
steveh80 about 11 years

Ok, thanks for this information. Do you know if this kernel upgrade will be available via the standard CentOS repos? Yum tells me there is nothing to update...
steveh80 about 11 years

I see, I am already on 2.6.32-358.0.1.el6.x86_64. The bug seems not to be fixed in this version...
steveh80 about 11 years

I added these settings to /etc/sysctl.conf and reloaded them via sysctl -p. That didn't solve the problem.
steveh80 about 11 years

Ok: That's a dedicated server running CentOS 6.4, and everything is updated to the latest versions (from the official CentOS repos). Intel Xeon E3-1220, 12 GB DDR3 ECC RAM, software RAID, 1 TB. The only thing I can assume is that this error comes up under heavy network traffic (transferring big backup files over the network via FTP). What else do you need?
slm about 11 years

What hardware are we dealing with here? Custom box or a Dell server, or what? You're going to have to go through the box piece by piece and see if there are any open issues with the various components I'm afraid.
steveh80 about 11 years

I don't know. It's a dedicated root server from 1und1.de with a preinstalled and preconfigured minimal CentOS system. That should be pretty standard and nothing special.
slm about 11 years

It probably wouldn't hurt to enlist 1und1.de's help here. At this point without more info about the make-up of the hardware it's a guessing game for any of us here to try and help. There are a number of patches that have addressed specific issues with Linux kernels and heavy network traffic, but they are dependent on specific hardware like this one or this one.
Soham Chakraborty about 11 years

Oh, hold on a day. Let me search a bit more.