CentOS 6: strange page allocation failure messages

Solution 1

Workaround for https://bugzilla.redhat.com/show_bug.cgi?id=713546 (set these in /etc/sysctl.conf and reload with sysctl -p):

vm.min_free_kbytes = 512000
vm.zone_reclaim_mode = 1

The same settings were also suggested as a potential workaround in this CentOS mailing-list thread: http://lists.centos.org/pipermail/centos/2012-October/129844.html
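To confirm that the two values are actually in effect on a running system, the current settings can be read back from /proc/sys/vm. The following is a minimal sketch, not part of the original workaround: the function name `check_vm_settings` and the `base` parameter are made up for illustration, and it assumes the usual /proc/sys layout (zone_reclaim_mode may be absent on kernels built without NUMA support).

```python
from pathlib import Path

# Values taken from the workaround above.
WANTED = {"min_free_kbytes": "512000", "zone_reclaim_mode": "1"}


def check_vm_settings(wanted=WANTED, base="/proc/sys/vm"):
    """Return {name: (current, wanted)} for every setting that differs."""
    mismatches = {}
    for name, expected in wanted.items():
        try:
            current = (Path(base) / name).read_text().strip()
        except OSError:
            current = None  # file missing or unreadable
        if current != expected:
            mismatches[name] = (current, expected)
    return mismatches


if __name__ == "__main__":
    for name, (current, expected) in check_vm_settings().items():
        print(f"vm.{name}: current={current!r}, wanted={expected!r}")
```

An empty result means both values match; anything printed shows a setting that was not applied.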

Solution 2

Please upgrade to the CentOS equivalent of kernel-2.6.32-358.el6. The bug has been fixed in that release.

Essentially, this is about memory allocation in interrupt context. If you want, you can check gfp.h in include/linux. Mode 0x20 means the allocation cannot wait: it is made in interrupt context, so the wait bit for the allocation is not set. Therefore, if the memory cannot be allocated immediately, the allocation fails. The fix is quite substantial.
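The reading of mode 0x20 can be checked by decoding the bits by hand. The sketch below uses the low flag values from include/linux/gfp.h in mainline 2.6.32. It is an assumption that the CentOS 6 kernel keeps the same values, so compare against your own source tree. The helper names are made up for illustration.

```python
# Low GFP flag bits as defined in include/linux/gfp.h of mainline 2.6.32.
# Assumption: the RHEL/CentOS 6 kernel uses the same values.
GFP_FLAGS = {
    0x01: "__GFP_DMA",
    0x02: "__GFP_HIGHMEM",
    0x04: "__GFP_DMA32",
    0x08: "__GFP_MOVABLE",
    0x10: "__GFP_WAIT",
    0x20: "__GFP_HIGH",
    0x40: "__GFP_IO",
    0x80: "__GFP_FS",
}
GFP_WAIT_BIT = 0x10
PAGE_SIZE = 4096  # bytes; the usual page size on x86_64


def decode_gfp_mode(mode):
    """List the names of the known flag bits set in mode."""
    return [name for bit, name in sorted(GFP_FLAGS.items()) if mode & bit]


def may_sleep(mode):
    """True if the wait bit is set, i.e. the allocator may block and reclaim."""
    return bool(mode & GFP_WAIT_BIT)


def allocation_bytes(order):
    """Size of an order-N request: 2**N physically contiguous pages."""
    return PAGE_SIZE << order


if __name__ == "__main__":
    mode, order = 0x20, 1  # values from the log line in the question
    print(decode_gfp_mode(mode), may_sleep(mode), allocation_bytes(order))
```

For the logged values this reports only __GFP_HIGH with the wait bit clear, and an order-1 request of 8192 bytes (two contiguous 4 KiB pages).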

Author: steveh80

Updated on September 18, 2022

Comments

  • steveh80
    steveh80 over 1 year

    I set up a new server with CentOS 6.4 final as the successor to an old MySQL server, and I'm facing some problems with it. From time to time MySQL connections are dropped. Furthermore, the transfer of large backup tar files to an FTP storage sometimes fails. Neither problem is reproducible.

    While investigating, I found some strange messages in /var/log/messages that I cannot interpret.

    Mar 30 13:09:24 s16838172 kernel: swapper: page allocation failure. order:1, mode:0x20
    Mar 30 13:09:24 s16838172 kernel: Pid: 0, comm: swapper Not tainted 2.6.32-358.0.1.el6.x86_64 #1
    Mar 30 13:09:24 s16838172 kernel: Call Trace:
    Mar 30 13:09:24 s16838172 kernel: <IRQ>  [<ffffffff8112c207>] ? __alloc_pages_nodemask+0x757/0x8d0
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff81166ab2>] ? kmem_getpages+0x62/0x170
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff811676ca>] ? fallback_alloc+0x1ba/0x270
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff8116711f>] ? cache_grow+0x2cf/0x320
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff81167449>] ? ____cache_alloc_node+0x99/0x160
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff811683cb>] ? kmem_cache_alloc+0x11b/0x190
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff81439c18>] ? sk_prot_alloc+0x48/0x1c0
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff8143acf2>] ? sk_clone+0x22/0x2e0
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff81489bc6>] ? inet_csk_clone+0x16/0xd0
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff814a2ad3>] ? tcp_create_openreq_child+0x23/0x450
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff814a02cd>] ? tcp_v4_syn_recv_sock+0x4d/0x310
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff814a2876>] ? tcp_check_req+0x226/0x460
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff8149fd6b>] ? tcp_v4_do_rcv+0x35b/0x430
    Mar 30 13:09:24 s16838172 kernel: [<ffffffffa03b4557>] ? ipv4_confirm+0x87/0x1d0 [nf_conntrack_ipv4]
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff814a157e>] ? tcp_v4_rcv+0x4fe/0x8d0
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff8147f290>] ? ip_local_deliver_finish+0x0/0x2d0
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff8147f36d>] ? ip_local_deliver_finish+0xdd/0x2d0
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff8147f5f8>] ? ip_local_deliver+0x98/0xa0
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff8147eabd>] ? ip_rcv_finish+0x12d/0x440
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff8147f045>] ? ip_rcv+0x275/0x350
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff8144827b>] ? __netif_receive_skb+0x4ab/0x750
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff8144a658>] ? netif_receive_skb+0x58/0x60
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff8144a760>] ? napi_skb_finish+0x50/0x70
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff8144cd09>] ? napi_gro_receive+0x39/0x50
    Mar 30 13:09:24 s16838172 kernel: [<ffffffffa00f933b>] ? e1000_receive_skb+0x5b/0x90 [e1000e]
    Mar 30 13:09:24 s16838172 kernel: [<ffffffffa00fc601>] ? e1000_clean_rx_irq+0x241/0x4c0 [e1000e]
    Mar 30 13:09:24 s16838172 kernel: [<ffffffffa0103bbd>] ? e1000e_poll+0xbd/0x380 [e1000e]
    Mar 30 13:09:24 s16838172 kernel: [<ffffffffa00f9eca>] ? e1000_put_txbuf+0x6a/0xa0 [e1000e]
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff8144ce23>] ? net_rx_action+0x103/0x2f0
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff8109b153>] ? hrtimer_get_next_event+0xc3/0x100
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff81076fb1>] ? __do_softirq+0xc1/0x1e0
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff810e1720>] ? handle_IRQ_event+0x60/0x170
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff81076d95>] ? irq_exit+0x85/0x90
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff81516d75>] ? do_IRQ+0x75/0xf0
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
    Mar 30 13:09:24 s16838172 kernel: <EOI>  [<ffffffff812d388e>] ? intel_idle+0xde/0x170
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff812d3871>] ? intel_idle+0xc1/0x170
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff81414fd7>] ? cpuidle_idle_call+0xa7/0x140
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff814f300a>] ? rest_init+0x7a/0x80
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff81c27f7b>] ? start_kernel+0x424/0x430
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff81c2733a>] ? x86_64_start_reservations+0x125/0x129
    Mar 30 13:09:24 s16838172 kernel: [<ffffffff81c27438>] ? x86_64_start_kernel+0xfa/0x109
    

    Blocks of messages like this appear about 2-10 times within 5 minutes; after that they are gone for a few hours.

    Can somebody help me with that? I hope it's not a hardware problem.

    Update: This seems to be reproducible by transferring big files over the network (backups to FTP storage). The FTP upload aborts after a few GB and the messages above appear in /var/log/messages.
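To see how often these blocks occur and whether they line up with the backup transfers, the header line of each block can be pulled out of the log. This is a rough sketch, assuming the syslog line format shown in the excerpt above. The function names are made up for illustration.

```python
import re
from collections import Counter

# Matches only the first line of each block, e.g.
# "Mar 30 13:09:24 host kernel: swapper: page allocation failure. order:1, mode:0x20"
HEADER = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"(?P<comm>.+?): page allocation failure\. "
    r"order:(?P<order>\d+), mode:(?P<mode>0x[0-9a-fA-F]+)"
)


def find_failures(lines):
    """Return one dict per 'page allocation failure' header line."""
    events = []
    for line in lines:
        m = HEADER.match(line.strip())
        if m:
            events.append({
                "stamp": m.group("stamp"),
                "comm": m.group("comm"),
                "order": int(m.group("order")),
                "mode": int(m.group("mode"), 16),
            })
    return events


def failures_per_minute(events):
    """Count events per minute by dropping the seconds from the timestamp."""
    return Counter(e["stamp"][:-3] for e in events)


if __name__ == "__main__":
    with open("/var/log/messages", errors="replace") as log:
        for minute, count in sorted(failures_per_minute(find_failures(log)).items()):
            print(minute, count)
```

Reading /var/log/messages usually requires root; the per-minute counts can then be compared with the times at which the FTP uploads ran.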

    Thanks!

  • steveh80
    steveh80 about 11 years
    "You are not authorized to access bug #713546." :-( Can you share more about what is being discussed there? I have also read that zone_reclaim_mode=1 causes performance problems on database servers?
  • steveh80
    steveh80 about 11 years
    OK, thanks for this information. Do you know if this kernel upgrade will be available via the standard CentOS repos? Yum tells me there is nothing to update...
  • steveh80
    steveh80 about 11 years
    I see, I am already on 2.6.32-358.0.1.el6.x86_64. The bug seems not to be fixed in this version...
  • steveh80
    steveh80 about 11 years
    I applied these settings to /etc/sysctl.conf and reloaded them via sysctl -p. That didn't solve the problem.
  • steveh80
    steveh80 about 11 years
    OK: That's a dedicated server running CentOS 6.4, and everything is updated to the latest versions (from the official CentOS repos). Intel Xeon E3-1220, 12 GB DDR3 ECC RAM, software RAID, 1 TB. The only thing I can assume is that this error comes up under heavy network traffic (transferring big backup files over the network via FTP). What else do you need?
  • slm
    slm about 11 years
    What hardware are we dealing with here? Custom box or a Dell server, or what? You're going to have to go through the box piece by piece and see if there are any open issues with the various components I'm afraid.
  • steveh80
    steveh80 about 11 years
    I don't know. It's a dedicated root server from 1und1.de with a preinstalled and preconfigured CentOS minimal system. That should be pretty standard and nothing special.
  • slm
    slm about 11 years
    It probably wouldn't hurt to enlist 1und1.de's help here. At this point without more info about the make-up of the hardware it's a guessing game for any of us here to try and help. There are a number of patches that have addressed specific issues with Linux kernels and heavy network traffic, but they are dependent on specific hardware like this one or this one.
  • Soham Chakraborty
    Soham Chakraborty about 11 years
    Oh, hold on a day. Let me search a bit more.