df hangs and I have no idea why

9,014

Thanks to @StéphaneChazelas I used strace to see where df was hanging which lead to this answer: https://serverfault.com/a/918160/256146

Restarting that and then restarting dovecot restored system operation.

Share:
9,014

Related videos on Youtube

TheDavidFactor
Author by

TheDavidFactor

Updated on September 18, 2022

Comments

  • TheDavidFactor
    TheDavidFactor almost 2 years

    [backstory] I've got a CentOS 7.5.1804 mail server (postfix/dovecot) that seems to be working, however when I added a new email client, it didn't create the mail folder for the user [/backstory]

    I ran df -h to see if I was out of space and it hung. I then ran df which also hung. I found this question: How to kill a process which can't be killed without rebooting? and started poking around.

    I've only got two physical drives on the server. I can run df on each of those mounts (/ and /mnt/external) and it runs correctly. I ran mount and it shows a ton of entries that I'm not familiar with. I was pretty sure they don't all show up in df, so I compared the output of df with mount on another CentOS box I've got (same version) and df shows a handful of mount points (/, /dev, /dev/shm, /run, /sys/fs/cgroup, /boot, /run/user/1000) so I tried running df against each of these individually on the suspect server and they all completed correctly.

    I'm able to access /proc/$PID/* on the hung df instances (one of the answers in the referenced question suggested that) and I can cat /proc/$PID/syscall which isn't changing, but I don't know how to interpret it.

    StackExchange, you're my only hope! (Ok, maybe not, but I'm hoping someone can point me in the right direction)

    dmesg output:

    [4400391.164795] systemd-journald[16672]: File /run/log/journal/b2bcc90256f446df876bea47007f387b/system.journal corrupted or uncleanly shut down, renaming and replacing.
    [4400392.726853] SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
    this repeats 556 times
    [4856400.839491] SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
    [4856640.666072] INFO: task df:22269 blocked for more than 120 seconds.
    [4856640.666180] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [4856640.666310] df              D ffff88007fc13680     0 22269  21526 0x00000084
    [4856640.666316]  ffff88004fa57a30 0000000000000082 ffff880079fd4440 ffff88004fa57fd8
    [4856640.666320]  ffff88004fa57fd8 ffff88004fa57fd8 ffff880079fd4440 ffff88003641cf98
    [4856640.666324]  ffff88003641cfa0 7fffffffffffffff ffff880079fd4440 ffff880035e44e00
    [4856640.666328] Call Trace:
    [4856640.666338]  [<ffffffff816096a9>] schedule+0x29/0x70
    [4856640.666342]  [<ffffffff816075f9>] schedule_timeout+0x209/0x2d0
    [4856640.666347]  [<ffffffff810980e6>] ? finish_wait+0x56/0x70
    [4856640.666351]  [<ffffffff81607a42>] ? mutex_lock+0x12/0x2f
    [4856640.666355]  [<ffffffff8124a8fa>] ? autofs4_wait+0x37a/0x870
    [4856640.666359]  [<ffffffff81609ba6>] wait_for_completion+0x116/0x170
    [4856640.666363]  [<ffffffff810a9500>] ? wake_up_state+0x20/0x20
    [4856640.666366]  [<ffffffff8124ba5b>] autofs4_expire_wait+0x6b/0x110
    [4856640.666371]  [<ffffffff81270fcd>] ? selinux_inode_setsecurity+0x6d/0x140
    [4856640.666374]  [<ffffffff81248b52>] do_expire_wait+0x172/0x190
    [4856640.666378]  [<ffffffff81248d7b>] autofs4_d_manage+0x9b/0x160
    [4856640.666383]  [<ffffffff811d0b85>] follow_managed+0xb5/0x300
    [4856640.666386]  [<ffffffff811d149b>] lookup_fast+0x19b/0x2e0
    [4856640.666390]  [<ffffffff811d4495>] path_lookupat+0x165/0x7a0
    [4856640.666394]  [<ffffffff811ab5a5>] ? kmem_cache_alloc+0x35/0x1d0
    [4856640.666397]  [<ffffffff811d6a0f>] ? getname_flags+0x4f/0x1a0
    [4856640.666400]  [<ffffffff811d4afb>] filename_lookup+0x2b/0xc0
    [4856640.666404]  [<ffffffff811d7b37>] user_path_at_empty+0x67/0xc0
    [4856640.666408]  [<ffffffff810f0ad2>] ? from_kgid_munged+0x12/0x20
    [4856640.666412]  [<ffffffff811cba1f>] ? cp_new_stat+0x14f/0x180
    [4856640.666415]  [<ffffffff811d7ba1>] user_path_at+0x11/0x20
    [4856640.666419]  [<ffffffff811cb513>] vfs_fstatat+0x63/0xc0
    [4856640.666422]  [<ffffffff811cba7e>] SYSC_newstat+0x2e/0x60
    [4856640.666426]  [<ffffffff810fac96>] ? __audit_syscall_exit+0x1e6/0x280
    [4856640.666429]  [<ffffffff811cbd5e>] SyS_newstat+0xe/0x10
    [4856640.666434]  [<ffffffff81614209>] system_call_fastpath+0x16/0x1b
    

    strace df ends with:

    stat("/sys/fs/selinux", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
    stat("/proc/sys/fs/binfmt_misc",
    
    • Stéphane Chazelas
      Stéphane Chazelas almost 6 years
      What do you see if you do strace df -h?
    • Stéphane Chazelas
      Stéphane Chazelas almost 6 years
      Any detail from /proc/self/mountinfo about those tons of entries from the output of mount?
    • Stéphane Chazelas
      Stéphane Chazelas almost 6 years
      Anything of interest in the output of dmesg?
    • TheDavidFactor
      TheDavidFactor almost 6 years
      I evidently don't have strace installed, I'm working on that. Looking at the "ton" of entries, the ones that df doesn't list (on my working CentOS box) are the ones under /dev and /sys/fs/cgroup (df lists those directories, but mount lists a bunch of subdirectories)
    • TheDavidFactor
      TheDavidFactor almost 6 years
      @StéphaneChazelas I added info to the question
    • Bob
      Bob almost 6 years
      What is the question? Where namely df doesn't work if it works on / and /mnt/external?
    • TheDavidFactor
      TheDavidFactor almost 6 years
      Found the answer using the output of strace here: serverfault.com/questions/911811/centos-7-df-started-to-hang THANK YOU @StéphaneChazelas for the pointers