mount.ocfs2: Transport endpoint is not connected while mounting...?

Oh yeah! Problem solved.

Pay attention to the UUID:

# mounted.ocfs2 -d
Device                FS     Stack  UUID                              Label
/dev/sdb              ocfs2  o2cb   12963EAF4E16484DB81ECB0251177C26  ocfs2_drbd1
/dev/drbd1            ocfs2  o2cb   12963EAF4E16484DB81ECB0251177C26  ocfs2_drbd1

but:

# ls -l /sys/kernel/config/cluster/cpc/heartbeat/
drwxr-xr-x 2 root root    0 Dec 24 22:53 72EF09EA3D0D4F51BDC00B47432B1EB2

This could happen because I "accidentally" force re-formatted the OCFS2 volume. The problem I'm facing is similar to this one on the Ocfs2-user mailing list.

This is also the reason for the error below:

ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat

because ocfs2_hb_ctl cannot find a device with UUID 72EF09EA3D0D4F51BDC00B47432B1EB2 in /proc/partitions.
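
To confirm the mismatch, you can print the OCFS2 UUID of every block device listed in /proc/partitions and compare against the region name in configfs. A minimal sketch (it silently skips devices that are not OCFS2):

# for dev in $(awk 'NR > 2 {print "/dev/"$4}' /proc/partitions); do
>     uuid=$(tunefs.ocfs2 -Q "%U\n" "$dev" 2>/dev/null) && echo "$dev $uuid"
> done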

One idea came to mind: can I change the UUID of an OCFS2 volume?

Looking through the tunefs.ocfs2 man page:

Usage: tunefs.ocfs2 [options] <device> [new-size]
       tunefs.ocfs2 -h|--help
       tunefs.ocfs2 -V|--version
[options] can be any mix of:
        -U|--uuid-reset[=new-uuid]

so I ran the following command to set the volume's UUID back to the one the stale heartbeat region expects:

# tunefs.ocfs2 --uuid-reset=72EF09EA3D0D4F51BDC00B47432B1EB2 /dev/drbd1
WARNING!!! OCFS2 uses the UUID to uniquely identify a file system. 
Having two OCFS2 file systems with the same UUID could, in the least, 
cause erratic behavior, and if unlucky, cause file system damage. 
Please choose the UUID with care.
Update the UUID ?yes

Verify:

# tunefs.ocfs2 -Q "%U\n" /dev/drbd1 
72EF09EA3D0D4F51BDC00B47432B1EB2
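
A quick way to double-check that the on-disk UUID now matches the stale heartbeat region (a sketch, filtering the region directory out of the configfs listing):

# hb=$(ls -l /sys/kernel/config/cluster/cpc/heartbeat/ | awk '/^d/ {print $NF}')
# disk=$(tunefs.ocfs2 -Q "%U\n" /dev/drbd1)
# [ "$hb" = "$disk" ] && echo "UUIDs match"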

I tried to kill the heartbeat region again to see what would happen:

# ocfs2_hb_ctl -K -u 72EF09EA3D0D4F51BDC00B47432B1EB2
# ocfs2_hb_ctl -I -u 72EF09EA3D0D4F51BDC00B47432B1EB2
72EF09EA3D0D4F51BDC00B47432B1EB2: 6 refs
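
This release-until-zero step can be scripted; a minimal sketch using the same ocfs2_hb_ctl flags as above:

# UUID=72EF09EA3D0D4F51BDC00B47432B1EB2
# while ocfs2_hb_ctl -I -u "$UUID" | grep -qv ': 0 refs'; do
>     ocfs2_hb_ctl -K -u "$UUID"
> done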

Kept killing until I saw 0 refs, then took the cluster offline:

# /etc/init.d/o2cb offline cpc
Stopping O2CB cluster cpc: OK

and then stopped it:

# /etc/init.d/o2cb stop
Stopping O2CB cluster cpc: OK
Unloading module "ocfs2": OK
Unmounting ocfs2_dlmfs filesystem: OK
Unloading module "ocfs2_dlmfs": OK
Unmounting configfs filesystem: OK
Unloading module "configfs": OK

Restarted to see whether the new node information was picked up:

# /etc/init.d/o2cb start
Loading filesystem "configfs": OK
Mounting configfs filesystem at /sys/kernel/config: OK
Loading filesystem "ocfs2_dlmfs": OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK
Starting O2CB cluster cpc: OK

# ls -l /sys/kernel/config/cluster/cpc/node/
total 0
drwxr-xr-x 2 root root 0 Dec 26 19:02 SVR022-293.localdomain
drwxr-xr-x 2 root root 0 Dec 26 19:02 SVR233NTC-3145.localdomain

OK, on the peer node (192.168.2.93), I tried to start OCFS2:

# /etc/init.d/ocfs2 start
Starting Oracle Cluster File System (OCFS2)                [  OK  ]
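
As a final check, the volume should now mount cleanly on this node too (same device and mount point as in the question below):

# mount -t ocfs2 /dev/drbd1 /data/webroot/
# mounted.ocfs2 -f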

Thanks to Sunil Mushran; this thread helped me solve the problem.

The lessons are:

  1. The IP address, port, etc. can only be changed when the cluster is offline. See the FAQ and the procedure sketch below.
  2. Never force a re-format of an OCFS2 volume.
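
A sketch of that safe procedure, using the same init scripts as above (run on every node; make sure all OCFS2 volumes are unmounted first):

# /etc/init.d/ocfs2 stop
# /etc/init.d/o2cb stop
(edit /etc/ocfs2/cluster.conf identically on all nodes)
# /etc/init.d/o2cb start
# /etc/init.d/ocfs2 start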

Comments

  • Greg Petersen almost 2 years

    I have replaced a dead node that was running in dual-primary mode with OCFS2. All the steps worked:

    /proc/drbd

    version: 8.3.13 (api:88/proto:86-96)
    GIT-hash: 83ca112086600faacab2f157bc5a9324f7bd7f77 build by [email protected], 2012-05-07 11:56:36
    
     1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
        ns:81 nr:407832 dw:106657970 dr:266340 al:179 bm:6551 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
    

    until I tried to mount the volume:

    mount -t ocfs2 /dev/drbd1 /data/webroot/
    mount.ocfs2: Transport endpoint is not connected while mounting /dev/drbd1 on /data/webroot/. Check 'dmesg' for more information on this error.
    

    /var/log/kern.log

    kernel: (o2net,11427,1):o2net_connect_expired:1664 ERROR: no connection established with node 0 after 30.0 seconds, giving up and returning errors.
    kernel: (mount.ocfs2,12037,1):dlm_request_join:1036 ERROR: status = -107
    kernel: (mount.ocfs2,12037,1):dlm_try_to_join_domain:1210 ERROR: status = -107
    kernel: (mount.ocfs2,12037,1):dlm_join_domain:1488 ERROR: status = -107
    kernel: (mount.ocfs2,12037,1):dlm_register_domain:1754 ERROR: status = -107
    kernel: (mount.ocfs2,12037,1):ocfs2_dlm_init:2808 ERROR: status = -107
    kernel: (mount.ocfs2,12037,1):ocfs2_mount_volume:1447 ERROR: status = -107
    kernel: ocfs2: Unmounting device (147,1) on (node 1)
    

    and below is the kernel log on node 0 (192.168.3.145); status -107 above is -ENOTCONN, matching the "Transport endpoint is not connected" error:

    kernel: : (swapper,0,7):o2net_listen_data_ready:1894 bytes: 0
    kernel: : (o2net,4024,3):o2net_accept_one:1800 attempt to connect from unknown node at 192.168.2.93:43868
    kernel: : (o2net,4024,3):o2net_connect_expired:1664 ERROR: no connection established with node 1 after 30.0 seconds, giving up and returning errors.
    kernel: : (o2net,4024,3):o2net_set_nn_state:478 node 1 sc: 0000000000000000 -> 0000000000000000, valid 0 -> 0, err 0 -> -107
    
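    The "unknown node" message means o2net did not find 192.168.2.93 among the nodes registered in configfs on node 0. A quick way to dump what it has registered (a sketch, assuming the standard o2cb configfs layout with an ipv4_address attribute per node):

    # grep . /sys/kernel/config/cluster/cpc/node/*/ipv4_address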

    I'm sure /etc/ocfs2/cluster.conf is identical on both nodes:

    /etc/ocfs2/cluster.conf

    node:
        ip_port = 7777
        ip_address = 192.168.3.145
        number = 0
        name = SVR233NTC-3145.localdomain
        cluster = cpc
    
    node:
        ip_port = 7777
        ip_address = 192.168.2.93
        number = 1
        name = SVR022-293.localdomain
        cluster = cpc
    
    cluster:
        node_count = 2
        name = cpc
    

    and they can connect to each other:

    # nc -z 192.168.3.145 7777
    Connection to 192.168.3.145 7777 port [tcp/cbt] succeeded!
    

    but the O2CB heartbeat is not active on the new node (192.168.2.93):

    /etc/init.d/o2cb status

    Driver for "configfs": Loaded
    Filesystem "configfs": Mounted
    Driver for "ocfs2_dlmfs": Loaded
    Filesystem "ocfs2_dlmfs": Mounted
    Checking O2CB cluster cpc: Online
    Heartbeat dead threshold = 31
      Network idle timeout: 30000
      Network keepalive delay: 2000
      Network reconnect delay: 2000
    Checking O2CB heartbeat: Not active
    

    Here are the results of running tcpdump on node 0 while starting ocfs2 on node 1:

      1   0.000000 192.168.2.93 -> 192.168.3.145 TCP 70 55274 > cbt [SYN] Seq=0 Win=5840 Len=0 MSS=1460 TSval=690432180 TSecr=0
      2   0.000008 192.168.3.145 -> 192.168.2.93 TCP 70 cbt > 55274 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 TSval=707657223 TSecr=690432180
      3   0.000223 192.168.2.93 -> 192.168.3.145 TCP 66 55274 > cbt [ACK] Seq=1 Ack=1 Win=5840 Len=0 TSval=690432181 TSecr=707657223
      4   0.000286 192.168.2.93 -> 192.168.3.145 TCP 98 55274 > cbt [PSH, ACK] Seq=1 Ack=1 Win=5840 Len=32 TSval=690432181 TSecr=707657223
      5   0.000292 192.168.3.145 -> 192.168.2.93 TCP 66 cbt > 55274 [ACK] Seq=1 Ack=33 Win=5792 Len=0 TSval=707657223 TSecr=690432181
      6   0.000324 192.168.3.145 -> 192.168.2.93 TCP 66 cbt > 55274 [RST, ACK] Seq=1 Ack=33 Win=5792 Len=0 TSval=707657223 TSecr=690432181
    

    Node 0 sends an RST at the end of every 6-packet exchange, and the same pattern repeats over and over.
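
    A capture like the one above can be taken with something along these lines (a sketch; the interface name is an assumption):

    # tcpdump -i eth0 -nn tcp port 7777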

    What else can I do to debug this case?

    PS:

    OCFS2 versions on node 0:

    • ocfs2-tools-1.4.4-1.el5
    • ocfs2-2.6.18-274.12.1.el5-1.4.7-1.el5

    OCFS2 versions on node 1:

    • ocfs2-tools-1.4.4-1.el5
    • ocfs2-2.6.18-308.el5-1.4.7-1.el5

    UPDATE 1 - Sun Dec 23 18:15:07 ICT 2012

    Are both nodes on the same lan segment? No routers etc.?

    No, they are two VMware servers on different subnets.

    Oh, while I remember - hostnames/DNS all setup and working correctly?

    Sure, I added both the hostname and IP address of each node to /etc/hosts:

    192.168.2.93    SVR022-293.localdomain
    192.168.3.145   SVR233NTC-3145.localdomain
    

    and they can connect to each other via hostname:

    # nc -z SVR022-293.localdomain 7777
    Connection to SVR022-293.localdomain 7777 port [tcp/cbt] succeeded!
    
    # nc -z SVR233NTC-3145.localdomain 7777
    Connection to SVR233NTC-3145.localdomain 7777 port [tcp/cbt] succeeded!
    

    UPDATE 2 - Mon Dec 24 18:32:15 ICT 2012

    Found the clue: my co-worker manually edited the /etc/ocfs2/cluster.conf file while the cluster was running, so the dead node's information was still kept in /sys/kernel/config/cluster/:

    # ls -l /sys/kernel/config/cluster/cpc/node/
    total 0
    drwxr-xr-x 2 root root 0 Dec 24 18:21 SVR150-4107.localdomain
    drwxr-xr-x 2 root root 0 Dec 24 18:21 SVR233NTC-3145.localdomain
    

    (SVR150-4107.localdomain in this case)

    I tried to stop the cluster to remove the dead node, but got the following error:

    # /etc/init.d/o2cb stop
    Stopping O2CB cluster cpc: Failed
    Unable to stop cluster as heartbeat region still active
    

    I'm sure the ocfs2 service was already stopped:

    # mounted.ocfs2 -f
    Device                FS     Nodes
    /dev/sdb              ocfs2  Not mounted
    /dev/drbd1            ocfs2  Not mounted
    

    There are no references anymore:

    # ocfs2_hb_ctl -I -u 12963EAF4E16484DB81ECB0251177C26
    12963EAF4E16484DB81ECB0251177C26: 0 refs
    

    I also unloaded the ocfs2 kernel module to be sure:

    # ps -ef | grep [o]cfs2
    root     12513    43  0 18:25 ?        00:00:00 [ocfs2_wq]
    
    # modprobe -r ocfs2
    # ps -ef | grep [o]cfs2
    # lsof | grep ocfs2
    

    but nothing changed:

    # /etc/init.d/o2cb offline
    Stopping O2CB cluster cpc: Failed
    Unable to stop cluster as heartbeat region still active
    

    So the final question is: how can I delete the dead node information without rebooting?


    UPDATE 3 - Mon Dec 24 22:41:51 ICT 2012

    Here are all of the active heartbeat regions:

    # ls -l /sys/kernel/config/cluster/cpc/heartbeat/ | grep '^d'
    drwxr-xr-x 2 root root    0 Dec 24 22:18 72EF09EA3D0D4F51BDC00B47432B1EB2
    

    Reference counts for this heartbeat region:

    # ocfs2_hb_ctl -I -u 72EF09EA3D0D4F51BDC00B47432B1EB2
    72EF09EA3D0D4F51BDC00B47432B1EB2: 7 refs
    

    Trying to kill it:

    # ocfs2_hb_ctl -K -u 72EF09EA3D0D4F51BDC00B47432B1EB2
    ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat
    

    Any ideas?

    • growse over 11 years
      When this happened to me, it was because I'd neglected to open the right firewall ports. Bugged me for ages that the node communication would just fail, and then I realised that I'd created a private network for the nodes to talk over, but hadn't updated the firewall to allow traffic. From the looks of the tcpdump traffic (which I missed before the edit), it looks like some data is getting through, so I guess it can't be that. Are both nodes on the same lan segment? No routers etc.?
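
      A quick way to rule the firewall out (a sketch):

      # iptables -nL | grep 7777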
    • growse over 11 years
      Oh, while I remember - hostnames/DNS all setup and working correctly?
    • rhasti over 11 years
      According to the OCFS2 user guide you have to set two kernel parameters on all nodes in the cluster: echo 1 > /proc/sys/kernel/panic_on_oops and echo 30 > /proc/sys/kernel/panic. Did you?
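
      To make those settings persist across reboots (a sketch, using the standard sysctl.conf mechanism):

      # echo 'kernel.panic_on_oops = 1' >> /etc/sysctl.conf
      # echo 'kernel.panic = 30' >> /etc/sysctl.conf
      # sysctl -p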
    • Greg Petersen over 11 years
      @rhasti: panic_on_oops was enabled, but panic is set to zero by default. I have set a 30-second timeout for reboot on panic. What can I do now?
  • Nils over 11 years
    Why did you do the force-reformat?