GlusterFS failing to mount at boot with Ubuntu 14.04

Solution 1

I managed to make this work through a combination of answers in this thread and this one: GlusterFS is failing to mount on boot

As per @Dan Pisarski, edit /etc/init/mounting-glusterfs.conf so that its exec line reads:

exec start wait-for-state WAIT_FOR=networking WAITER=mounting-glusterfs-$MOUNTPOINT
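
For context, the complete job file after the edit might look roughly like this. This is only a sketch based on the fields shown in Solution 3 below; the only change from the stock file is the exec line:

# /etc/init/mounting-glusterfs.conf
description "Block the mounting event for glusterfs filesystems until the network interfaces are running"

# one instance of this task per glusterfs mount point
instance $MOUNTPOINT

# runs when mountall emits the mounting event for a glusterfs filesystem
start on mounting TYPE=glusterfs
task

# wait for the "networking" job instead of the static-network-up event
exec start wait-for-state WAIT_FOR=networking WAITER=mounting-glusterfs-$MOUNTPOINT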

As per @dialt0ne, change the GlusterFS entry in /etc/fstab to read:

[serverip]:[vol]  [mountpoint]  glusterfs  defaults,nobootwait,_netdev,backupvolfile-server=[backupserverip],direct-io-mode=disable  0       0
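
With hypothetical values filled in (primary server 10.0.0.10, backup server 10.0.0.11, volume gv0, mount point /mnt/gv0), that line would look like:

10.0.0.10:/gv0  /mnt/gv0  glusterfs  defaults,nobootwait,_netdev,backupvolfile-server=10.0.0.11,direct-io-mode=disable  0  0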

Works For Me(tm) on Ubuntu 14.04.2 LTS

Solution 2

I have run into the same problem on AWS with Ubuntu 12.04. Here are some things that worked for me:

  • add more fetch-attempts in your fstab

This lets the client keep retrying the volfile server while the network is still unavailable.

  • add a backup volfile server in your fstab

This allows the filesystem to be mounted from another gluster peer if the primary is down for some reason.

  • add nobootwait in your fstab

This allows the instance to continue booting even if this filesystem fails to mount.

A sample entry from my current fstab is:

10.20.30.40:/fs1 /example glusterfs defaults,nobootwait,_netdev,backupvolfile-server=10.20.30.41,fetch-attempts=10 0 2

I have not tested this on 14.04, but it works ok for my 12.04 instances.
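
To sanity-check the entry without a full reboot, a rough test is to re-read fstab and confirm the mount (this only exercises the options themselves, not the boot-time ordering):

sudo mount -a                 # re-read /etc/fstab and mount anything missing
mount | grep glusterfs        # confirm the gluster volume shows up

Note that fetch-attempts and the backup volfile server only come into play when the primary volfile server is unreachable, so a clean boot is still the definitive test.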

Solution 3

It's a bug

This really is a bug: static-network-up is not a job, it's an event signal.

Moreover, waiting on the networking job, as suggested in other answers, is not the most correct solution.

So I created this bug report and submitted a patch for this problem.

As a workaround, you can apply my proposed solution (at the end of this answer) and use the _netdev option in your fstab.

A fuller explanation is given below, but you can skip it if you want.

Explanation

This is a bug in mounting-glusterfs.conf. It can add an unnecessary 30 seconds to the boot of an Ubuntu server, or even hang the boot process.

Because of this bug, the mountall process thinks that the mount failed (you'll see "Mount failed" errors in /var/log/boot.log). So, when not using the nobootwait/nofail flags in /etc/fstab, the bug can hang the mount process (and the boot process too). When using the nobootwait/nofail flags, the bug increases the boot time by about 30 seconds.

The bug is caused by the following errors:

  • There is no need to wait for the network to come up. Ubuntu itself provides the _netdev mount flag, which retries the mount each time an interface comes up;
  • However, it's necessary to wait for the GlusterFS Server daemon (for mounts using localhost);
    • This was implemented in an old commit in the GlusterFS upstream project. However, this commit was overwritten;
  • It's wrong to use the wait-for-state upstart task to wait for a signal. It's used to wait for a job. static-network-up is an event signal, and not a job;
    • This is why the "Unknown job: static-network-up" is logged;
  • When waiting for a job to be started, it's wrong not to pass the WAIT_STATE=running env var, because running is not the default in wait-for-state (see the sketch just below).
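
To make the last two points concrete, here is the shape of the broken invocation versus a corrected one; the job and waiter names in the second line are placeholders:

# wrong: static-network-up is an event, not a job, so wait-for-state logs "Unknown job"
start wait-for-state WAIT_FOR=static-network-up WAITER=mounting-glusterfs-$MOUNTPOINT

# right: wait for an actual job, and name the target state explicitly
start wait-for-state WAIT_FOR=<some-job> WAIT_STATE=running WAITER=<unique-waiter-name>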

Solution

/etc/init/mounting-glusterfs.conf:

author "Louis Zuckerman <[email protected]>"
description "Block the mounting event for glusterfs filesystems until the glusterfs-server is running"

instance $MOUNTPOINT

start on mounting TYPE=glusterfs
task
script
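  # only block the mount if glusterfs-server actually exists as an upstart job on this system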
  if status glusterfs-server; then
    start wait-for-state WAIT_FOR=glusterfs-server WAIT_STATE=running \
        WAITER=mounting-glusterfs-$MOUNTPOINT
  fi
end script

PS: Also use the _netdev option in your fstab.
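
After a reboot, a quick way to check whether everything lined up; this is just a rough checklist using the log locations already quoted in this question:

status glusterfs-server                          # upstart state of the daemon, if it is an upstart job
mount | grep glusterfs                           # is the volume actually mounted?
grep -i "mount failed" /var/log/boot.log         # any mountall failures during boot?
ls /var/log/upstart/ | grep mounting-glusterfs   # per-mountpoint job logs to inspect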

Solution 4

Thanks for the detailed explanation; I understand a lot more than before. The latest solution is almost working. The problems (actually one, since the first implies the second):

  • local shares (127.0.0.1:/share) are still not mounted
  • mounted TYPE=glusterfs is never emitted, so services that depend on the mounted TYPE=glusterfs event never start

/etc/fstab:

127.0.0.1:/control-share /mnt/glu-control-share glusterfs defaults,_netdev 0 0

/etc/init/mounting-glusterfs.conf: copied from above

/etc/init/salt-master.conf:

description "Salt Master"

start on (mounted TYPE=glusterfs
          and runlevel [2345])
stop on runlevel [!2345]
limit nofile 100000 100000
...

The local share must be mounted by hand (or by some other automation), and salt-master must be started by hand after every reboot.

Noticed later: the wait-for-state script above in mounting-glusterfs... blocks the whole boot procedure; it seems the glusterfs-server job never reaches the running state.
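
One way to investigate is to check whether glusterfs-server is even an Upstart job on the machine; if the package only ships a SysV init script, the wait-for-state check above can never see it reach running. This is a hypothetical diagnostic sketch, not a confirmed diagnosis:

initctl list | grep -i gluster             # is there an upstart job named glusterfs-server at all?
ls /etc/init/glusterfs-server.conf         # upstart job definition, if one exists
ls /etc/init.d/glusterfs-server            # SysV init script, if that is what the package ships
cat /var/log/upstart/wait-for-state-*.log  # what the waiter actually reported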

Solution 5

I ran into this as well, and want to preface this answer by saying that I am not an expert in this area, so it's possible there is a better solution!

But the issue seems to be that static-network-up is an event, not the name of an upstart job, while the wait-for-state script expects a job name to be passed as the WAIT_FOR value. Hence the "Unknown job" error you discovered above.

To resolve the issue I edited /etc/init/mounting-glusterfs.conf, changing:

exec start wait-for-state WAIT_FOR=static-network-up WAITER=mounting-glusterfs-$MOUNTPOINT

into:

exec start wait-for-state WAIT_FOR=networking WAITER=mounting-glusterfs-$MOUNTPOINT

networking is the name of an actual job (/etc/init/networking.conf), and I believe it is the job that typically emits static-network-up.

This change worked for me on Ubuntu 14.04.
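
If you want to confirm that networking really is a job on your system before relying on it, something like the following should do; the exact output will vary, but it should not say "Unknown job":

status networking              # upstart knows this job, unlike the static-network-up event
ls /etc/init/networking.conf   # the job definition referenced above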

Comments

  • Pablo
    Pablo almost 2 years

    Previously I asked about mounting GlusterFS at boot on an Ubuntu 12.04 server, and the answer was that this was buggy in 12.04 and worked in 14.04. Curious, I gave it a try on a virtual machine running on my laptop, and in 14.04 it did work. Since this was critical for me, I decided to upgrade my running servers to 14.04, only to discover that GlusterFS is not mounting localhost volumes automatically there either.

    This is a Linode server and fstab looks like this:

    # <file system> <mount point>          <type>    <options>                 <dump>  <pass>
    proc        /proc                        proc    defaults                       0       0
    /dev/xvda   /                            ext4    noatime,errors=remount-ro      0       1
    /dev/xvdb   none                         swap    sw                             0       0
    /dev/xvdc   /var/lib/glusterfs/brick01   ext4    defaults                       1       2
    koraga.int.example.com:/public_uploads /var/www/shared/public/uploads glusterfs defaults,_netdev 0 0
    

    The boot process looks like this (around the networking/mounting part, where the only failures occur):

     * Stopping Mount network filesystems                                    [ OK ]
     * Starting set sysctls from /etc/sysctl.conf                            [ OK ]
     * Stopping set sysctls from /etc/sysctl.conf                            [ OK ]
     * Starting configure virtual network devices                            [ OK ]
     * Starting Bridge socket events into upstart                            [ OK ]
     * Starting Waiting for state                                            [fail]
     * Stopping Waiting for state                                            [ OK ]
     * Starting Block the mounting event for glusterfs filesystems until the [fail]k interfaces are running
     * Starting Waiting for state                                            [fail]
     * Starting Block the mounting event for glusterfs filesystems until the [fail]k interfaces are running
     * Stopping Waiting for state                                            [ OK ]
     * Starting Signal sysvinit that remote filesystems are mounted          [ OK ]
     * Starting GNU Screen Cleanup                                           [ OK ]
    

    I believe the log file /var/log/glusterfs/var-www-shared-public-uploads.log contains the main clue to the problem, as it's the only one that is really different between this server, where mounting is not working, and my local virtual server, where it is:

    [2014-07-10 05:51:49.762162] I [glusterfsd.c:1959:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.5.1 (/usr/sbin/glusterfs --volfile-server=koraga.int.example.com --volfile-id=/public_uploads /var/www/shared/public/uploads)
    [2014-07-10 05:51:49.774248] I [socket.c:3561:socket_init] 0-glusterfs: SSL support is NOT enabled
    [2014-07-10 05:51:49.774278] I [socket.c:3576:socket_init] 0-glusterfs: using system polling thread
    [2014-07-10 05:51:49.775573] E [socket.c:2161:socket_connect_finish] 0-glusterfs: connection to 192.168.134.227:24007 failed (Connection refused)
    [2014-07-10 05:51:49.775634] E [glusterfsd-mgmt.c:1601:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to connect with remote-host: koraga.int.example.com (No data available)
    [2014-07-10 05:51:49.775649] I [glusterfsd-mgmt.c:1607:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
    [2014-07-10 05:51:49.776284] W [glusterfsd.c:1095:cleanup_and_exit] (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x23) [0x7f6718bf3f83] (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x90) [0x7f6718bf7da0] (-->/usr/sbin/glusterfs(+0xcf13) [0x7f67192bbf13]))) 0-: received signum (1), shutting down
    [2014-07-10 05:51:49.776314] I [fuse-bridge.c:5475:fini] 0-fuse: Unmounting '/var/www/shared/public/uploads'.
    

    The status of the volume is:

    Volume Name: public_uploads
    Type: Distribute
    Volume ID: 52aa6d85-f4ea-4c39-a2b3-d20d34ab5916
    Status: Started
    Number of Bricks: 1
    Transport-type: tcp
    Bricks:
    Brick1: koraga.int.example.com:/var/lib/glusterfs/brick01/public_uploads
    Options Reconfigured:
    auth.allow: 127.0.0.1,192.168.134.227
    client.ssl: off
    server.ssl: off
    nfs.disable: on
    

    If I run mount -a after booting up, the volume is mounted correctly:

    koraga.int.example.com:/public_uploads on /var/www/shared/public/uploads type fuse.glusterfs (rw,default_permissions,allow_other,max_read=131072)
    

    A couple of related log files show this:

    /var/log/upstart/mounting-glusterfs-_var_www_shared_public_uploads.log:

    start: Job failed to start
    

    /var/log/upstart/wait-for-state-mounting-glusterfs-_var_www_shared_public_uploadsstatic-network-up.log:

    status: Unknown job: static-network-up
    start: Unknown job: static-network-up
    

    but my testing server shows exactly the same, so I don't think this is relevant.

    Any ideas what's wrong now?

    Update: I tried changing WAIT_FOR from static-network-up to networking and it still didn't work, but all the [fail] messages at boot disappear. These are the contents of the log files under these conditions:

    /var/log/upstart/mounting-glusterfs-_var_www_shared_public_uploads.log contains:

    wait-for-state stop/waiting
    

    /var/log/upstart/wait-for-state-mounting-glusterfs-_var_www_shared_public_uploadsstatic-network-up.log contains:

    start: Job is already running: networking
    

    /var/log/glusterfs/var-www-shared-public-uploads.log contains:

    [2014-07-11 17:19:38.000207] I [glusterfsd.c:1959:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.5.1 (/usr/sbin/glusterfs --volfile-server=koraga.int.example.com --volfile-id=/public_uploads /var/www/shared/public/uploads)
    [2014-07-11 17:19:38.029421] I [socket.c:3561:socket_init] 0-glusterfs: SSL support is NOT enabled
    [2014-07-11 17:19:38.029450] I [socket.c:3576:socket_init] 0-glusterfs: using system polling thread
    [2014-07-11 17:19:38.030288] E [socket.c:2161:socket_connect_finish] 0-glusterfs: connection to 192.168.134.227:24007 failed (Connection refused)
    [2014-07-11 17:19:38.030331] E [glusterfsd-mgmt.c:1601:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to connect with remote-host: koraga.int.example.com (No data available)
    [2014-07-11 17:19:38.030345] I [glusterfsd-mgmt.c:1607:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
    [2014-07-11 17:19:38.030984] W [glusterfsd.c:1095:cleanup_and_exit] (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x23) [0x7fd9495b7f83] (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x90) [0x7fd9495bbda0] (-->/usr/sbin/glusterfs(+0xcf13) [0x7fd949c7ff13]))) 0-: received signum (1), shutting down
    [2014-07-11 17:19:38.031013] I [fuse-bridge.c:5475:fini] 0-fuse: Unmounting '/var/www/shared/public/uploads'.
    

    Update 2: I also tried this in the upstart file:

    start on (started glusterfs-server and mounting TYPE=glusterfs)
    

    but the computer failed to boot (don't know why yet).

    • Pablo
      Pablo almost 10 years
      @totti, I cannot pull the network cable on a VPS. The IP used for mounting is statically assigned and the hostname is mapped to that IP in /etc/hosts.
    • totti
      totti almost 10 years
      Then somehow disable the internet connection (don't block or drop packets) and try to mount. What happens then?
  • Pablo
    Pablo almost 10 years
    This is not working for me. There are no [fail] messages during the boot process, but no volumes are mounted. Can you please show me your fstab so I can see whether I'm doing something wrong?
  • Raja Ehtesham
    Raja Ehtesham almost 8 years
    I tried your solution on Ubuntu 14.04 with glusterfs server/client 3.5, but I'm getting this in the upstart log: 'status: Unknown job: glusterfs-server'
  • Raja Ehtesham
    Raja Ehtesham almost 8 years
    Now I'm getting this in the upstart log: 'glusterfs-server stop/waiting' and 'wait-for-state stop/waiting'. However, when I run time start mountall, everything mounts perfectly.