Failover pacemaker cluster with two network interfaces?


Solution 1

So, I resolved my issue with ocf:pacemaker:ping, thanks to @Dok.

# crm configure show
node srv1
node srv2
primitive P_INTRANET ocf:pacemaker:ping \
  params host_list="10.10.10.11 10.10.10.12" multiplier="100" name="ping_intranet" \
  op monitor interval="5s" timeout="5s"
primitive cluster-ip ocf:heartbeat:IPaddr2 \
  params ip="10.10.10.100" cidr_netmask="24" \
  op monitor interval="5s"
primitive ha-nginx lsb:nginx \
  op monitor interval="5s"
clone CL_INTRANET P_INTRANET \
  meta globally-unique="false"
location L_CLUSTER_IP_PING_INTRANET cluster-ip \
  rule $id="L_CLUSTER_IP_PING_INTRANET-rule" ping_intranet: defined ping_intranet
location L_HA_NGINX_PING_INTRANET ha-nginx \
  rule $id="L_HA_NGINX_PING_INTRANET-rule" ping_intranet: defined ping_intranet
location L_INTRANET_01 CL_INTRANET 100: srv1
location L_INTRANET_02 CL_INTRANET 100: srv2
colocation nginx-and-cluster-ip 1000: ha-nginx cluster-ip
property $id="cib-bootstrap-options" \
  dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
  cluster-infrastructure="openais" \
  expected-quorum-votes="2" \
  no-quorum-policy="ignore" \
  stonith-enabled="false"
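
A note on how this works: the CL_INTRANET clone runs the ping probe on every node and publishes the result as the ping_intranet node attribute (the number of reachable hosts times the multiplier), and the two location rules add that value to the score of cluster-ip and ha-nginx on any node where the attribute is defined, so the node with better connectivity is preferred. With flaky rather than fully broken connectivity the attribute can flap and drag the resources back and forth between nodes; one possible mitigation (a sketch only, not part of the original fix, and the 30s value is an arbitrary example) is the ping agent's dampen parameter, which delays attribute updates so short bursts of packet loss are ignored:

primitive P_INTRANET ocf:pacemaker:ping \
  params host_list="10.10.10.11 10.10.10.12" multiplier="100" \
    name="ping_intranet" dampen="30s" \
  op monitor interval="5s" timeout="5s"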

Solution 2

The ping monitor is not sufficient for this scenario, for multiple reasons. What do you ping? On Linux, a local IP still answers pings even when its interface is down (which is bad; it should not), and if you ping the default gateway instead, there are plenty of things outside the cluster that can cause gateway ping loss, almost all of which have nothing to do with the health of the cluster's local network connection.

Pacemaker needs a way to monitor the network interface's UP/DOWN status; that is the best indication of local network problems. I haven't found any way to do this yet, and it's a horrible flaw in Pacemaker clusters as far as I can tell.

Solution 3

The ocf:pacemaker:pingd resource was designed precisely to fail a node over upon loss of connectivity. You can find a very brief example of this on the ClusterLabs wiki here: http://clusterlabs.org/wiki/Example_configurations#Set_up_pingd

Somewhat unrelated, but I have seen issues in the past with using ifconfig down to test loss of connectivity. I would strongly encourage you to use iptables to drop traffic instead when testing loss of connectivity.
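
As an illustration of that testing approach (the interface name matches this question's setup; adapt as needed), something like the following drops all traffic on eth1 without changing the interface's link state, and can be cleanly undone afterwards:

# simulate connectivity loss on eth1 without taking the link down
iptables -A INPUT  -i eth1 -j DROP
iptables -A OUTPUT -o eth1 -j DROP

# restore connectivity when the test is done
iptables -D INPUT  -i eth1 -j DROP
iptables -D OUTPUT -o eth1 -j DROP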


Comments

  • mr.The
    mr.The almost 2 years

    So, I have two test servers in one VLAN.

    srv1
      eth1 10.10.10.11
      eth2 10.20.10.11
    
    srv2
      eth1 10.10.10.12
      eth2 10.20.10.12
    
    Cluster VIP - 10.10.10.100
    

    Corosync config with two interfaces:

      rrp_mode: passive
    
      interface {
        ringnumber: 0
        bindnetaddr: 10.10.10.0
        mcastaddr: 226.94.1.1
        mcastport: 5405
      }
    
      interface {
        ringnumber: 1
        bindnetaddr: 10.20.10.0
        mcastaddr: 226.94.1.1
        mcastport: 5407
      }
    

    Pacemaker config:

    # crm configure show
    node srv1
    node srv2
    primitive cluster-ip ocf:heartbeat:IPaddr2 \
        params ip="10.10.10.100" cidr_netmask="24" \
        op monitor interval="5s"
    primitive ha-nginx lsb:nginx \
        op monitor interval="5s"
    location prefer-srv-2 ha-nginx 50: srv2
    colocation nginx-and-cluster-ip +inf: ha-nginx cluster-ip
    property $id="cib-bootstrap-options" \
        dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        no-quorum-policy="ignore" \
        stonith-enabled="false"
    

    Status:

    # crm status
    ============
    Last updated: Thu Jan 29 13:40:16 2015
    Last change: Thu Jan 29 12:47:25 2015 via crmd on srv1
    Stack: openais
    Current DC: srv2 - partition with quorum
    Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
    2 Nodes configured, 2 expected votes
    2 Resources configured.
    ============
    
    Online: [ srv1 srv2 ]
    
     cluster-ip (ocf::heartbeat:IPaddr2):   Started srv2
     ha-nginx   (lsb:nginx):    Started srv2
    

    Rings:

    # corosync-cfgtool -s
    Printing ring status.
    Local node ID 185207306
    RING ID 0
        id  = 10.10.10.11
        status  = ring 0 active with no faults
    RING ID 1
        id  = 10.20.10.11
        status  = ring 1 active with no faults
    

    And if I do srv2# ifconfig eth1 down, Pacemaker still works over eth2, and that's OK. But nginx is not available on 10.10.10.100 (because eth1 is down), and Pacemaker says that everything is fine.

    But I want nginx to move to srv1 after eth1 dies on srv2.

    So, what can I do to achieve that?

    • c4f4t0r
      c4f4t0r over 9 years
      Your colocation "colocation nginx-and-cluster-ip +inf: ha-nginx cluster-ip" is strange; a positive score needs to be inf: without a plus in front.
    • mr.The
      mr.The over 9 years
      @c4f4t0r, it's OK; I tested it even without the plus, and it works the same: ha-nginx and cluster-ip on one node.
    • mr.The
      mr.The over 9 years
      @c4f4t0r It will be 2 or 3 nodes, so thanks for the advice.
  • mr.The
    mr.The over 9 years
    I tried it, and it works well, but only with a 100% connectivity loss. With only ~50% packet loss it behaves awfully: Pacemaker changes the active node about 5 times a minute. How can I fix it?
  • Erik Amundson
    Erik Amundson almost 4 years
    Ah HA! I found the 'ocf:heartbeat:ethmonitor' resource in CentOS 8 Pacemaker. THIS looks promising. I will test... (a rough configuration sketch follows below)
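
For reference, a minimal sketch (not from this thread) of how ocf:heartbeat:ethmonitor might be wired into a crm configuration like the one above: the agent is cloned so it runs on every node and maintains a node attribute (ethmonitor-eth1 by default for interface eth1, as far as I recall) set to 1 or 0 depending on the link state, and a location rule then bans ha-nginx from any node where the attribute is 0. Interface name, resource names, and timings here are assumptions, not something tested in this thread.

primitive P_ETHMON ocf:heartbeat:ethmonitor \
  params interface="eth1" \
  op monitor interval="10s" timeout="60s"
clone CL_ETHMON P_ETHMON
location L_NGINX_NEEDS_ETH1 ha-nginx \
  rule -inf: ethmonitor-eth1 eq 0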