MariaDB Galera Cluster setup problems


Solution 1

Here is how I fixed my similar issue.

CentOS 7 w/ MariaDB Galera 10.1.

On node2 I saw this:

2016-12-27 15:40:38 140703512762624 [Warning] WSREP: no nodes coming from prim view, prim not possible

After doing some reading, I tried running this on node1.

service mysql start --wsrep-new-cluster

But this failed, and in the logs, I found this...

2016-12-27 15:44:08 140438853814528 [ERROR] WSREP: It may not be safe to bootstrap the cluster from this node. It was not the last one to leave the cluster and may not contain all the updates. To force cluster bootstrap with this node, edit the grastate.dat file manually and set safe_to_bootstrap to 1 .

So I edited the file /var/lib/mysql/grastate.dat, changing safe_to_bootstrap to 1.
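For reference, here is roughly what the file looked like after my edit. This is just a sketch; the uuid is a placeholder and yours will differ, but the safe_to_bootstrap line is the one that matters:

# GALERA saved state
version: 2.1
uuid:    85448d73-ebe8-11e3-9c20-fbc1995fee11
seqno:   -1
safe_to_bootstrap: 1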

I was then able to start the Primary node using:

service mysql start --wsrep-new-cluster

Then on the others, I just used

service mysql start

Note: This was in a demo pre-production environment. I promptly broke it after getting everything to work by rebooting all servers at the same time :P, but I knew there were no writes and that the DBs were in sync. If this happens in production, you can use the following to figure out which node to run "new-cluster" on, which is akin to saying "make me primary":

mysqld_safe --wsrep-recover
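Run that on each stopped node and compare the recovered positions; the node with the highest seqno is the safest one to bootstrap. You are looking for a line like this in the output or the error log (the uuid and seqno below are just example values, and the log path should be adjusted to your log_error setting):

grep 'Recovered position' /var/log/mariadb.log
# example: WSREP: Recovered position: 85448d73-ebe8-11e3-9c20-fbc1995fee11:1352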

If this is a production issue, I highly recommend reading this article and making a backup w/ CloneZilla before throwing commands at the broken clients!

https://www.percona.com/blog/2014/09/01/galera-replication-how-to-recover-a-pxc-cluster/

Solution 2

The cluster must be started with this command on the primary node:

galera_new_cluster

After starting the first node, you can start the other nodes in the cluster successfully.
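To verify, you can check the cluster size from any node; wsrep_cluster_size is a standard Galera status variable and should equal your node count once everyone has joined:

mysql -u root -p -e "SHOW STATUS LIKE 'wsrep_cluster_size'"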

Solution 3

I believe you need to list all the IPs in the wsrep_cluster_address parameter.

wsrep_cluster_address=gcomm://192.168.211.132,192.168.211.133

This should be done on both hosts. BTW, you likely want three nodes, not two, so as to avoid split-brain scenarios.
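For example, if you added a hypothetical third node at 192.168.211.134, every node would carry the full address list (the same line on all three hosts):

wsrep_cluster_address=gcomm://192.168.211.132,192.168.211.133,192.168.211.134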


Comments

  • Admin almost 2 years

    I am trying to get a MariaDB cluster up and running, but it is not working out for me. Right now I am using MariaDB Galera 5.5.36 on a 64-bit Red Hat ES6 machine. I installed MariaDB through this repo:

    [mariadb]
    name = MariaDB
    baseurl = http://yum.mariadb.org/5.5-galera/rhel6-amd64/
    gpgkey=https://yum.mariadb.org/RPM-GPG-KEY-MariaDB
    gpgcheck=1
    

    In server.cnf I have the following on server 1:

    [mariadb]
    log_error=/var/log/mariadb.log
    query_cache_size=0
    query_cache_type=0
    binlog_format=ROW
    default_storage_engine=innodb
    innodb_autoinc_lock_mode=2
    wsrep_provider=/usr/lib64/galera/libgalera_smm.so
    wsrep_cluster_address=gcomm://192.168.211.133
    wsrep_cluster_name='cluster'
    wsrep_node_address='192.168.211.132'
    wsrep_node_name='cluster1'
    wsrep_sst_method=rsync
    

    and on server 2 I have:

    [mariadb]
    log_error=/var/log/mariadb.log
    query_cache_size=0
    query_cache_type=0
    binlog_format=ROW
    default_storage_engine=innodb
    innodb_autoinc_lock_mode=2
    wsrep_provider=/usr/lib64/galera/libgalera_smm.so
    wsrep_cluster_address=gcomm://192.168.211.132
    wsrep_cluster_name='cluster'
    wsrep_node_address='192.168.211.133'
    wsrep_node_name='cluster2'
    wsrep_sst_method=rsync
    

    When I start server 1 with the following command: sudo service mysql start --wsrep-new-cluster, it starts up just fine. If I open up mysql and check the status of wsrep, it says everything is up and running, which is good. But when I try to do sudo service mysql start on the second server, I get the following in the error logs:

    140609 14:47:55 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
    140609 14:47:56 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.i5qfm2' --pid-file='/var/lib/mysql/localhost.localdomain-recover.pid'
    140609 14:47:57 mysqld_safe WSREP: Recovered position 85448d73-ebe8-11e3-9c20-fbc1995fee11:0
    140609 14:47:57 [Note] WSREP: wsrep_start_position var submitted: '85448d73-ebe8-11e3-9c20-fbc1995fee11:0'
    140609 14:47:57 [Note] WSREP: Read nil XID from storage engines, skipping position init
    140609 14:47:57 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
    140609 14:47:57 [Note] WSREP: wsrep_load(): Galera 25.3.2(r170) by Codership Oy <[email protected]> loaded successfully.
    140609 14:47:57 [Note] WSREP: CRC-32C: using hardware acceleration.
    140609 14:47:57 [Note] WSREP: Found saved state: 85448d73-ebe8-11e3-9c20-fbc1995fee11:-1
    140609 14:47:57 [Note] WSREP: Passing config to GCS: base_host = 192.168.211.133; base_port = 4567; cert.log_conflicts = no; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; repl.causal_read_timeout = PT30S; repl.commit_order = 3; repl.key_format = FLAT8; repl.proto_max = 5
    140609 14:47:57 [Note] WSREP: Assign initial position for certification: 0, protocol version: -1
    140609 14:47:57 [Note] WSREP: wsrep_sst_grab()
    140609 14:47:57 [Note] WSREP: Start replication
    140609 14:47:57 [Note] WSREP: Setting initial position to 85448d73-ebe8-11e3-9c20-fbc1995fee11:0
    140609 14:47:57 [Note] WSREP: protonet asio version 0
    140609 14:47:57 [Note] WSREP: Using CRC-32C (optimized) for message checksums.
    140609 14:47:57 [Note] WSREP: backend: asio
    140609 14:47:57 [Note] WSREP: GMCast version 0
    140609 14:47:57 [Note] WSREP: (0c085f34-efe5-11e3-9f6b-8bfd1706e2a4, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
    140609 14:47:57 [Note] WSREP: (0c085f34-efe5-11e3-9f6b-8bfd1706e2a4, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
    140609 14:47:57 [Note] WSREP: EVS version 0
    140609 14:47:57 [Note] WSREP: PC version 0
    140609 14:47:57 [Note] WSREP: gcomm: connecting to group 'cluster', peer '192.168.211.132:,192.168.211.134:'
    140609 14:48:00 [Warning] WSREP: no nodes coming from prim view, prim not possible
    140609 14:48:00 [Note] WSREP: view(view_id(NON_PRIM,0c085f34-efe5-11e3-9f6b-8bfd1706e2a4,1) memb {
            0c085f34-efe5-11e3-9f6b-8bfd1706e2a4,0
    } joined {
    } left {
    } partitioned {
    })
    140609 14:48:01 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.50775S), skipping check
    140609 14:48:31 [Note] WSREP: view((empty))
    140609 14:48:31 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
             at gcomm/src/pc.cpp:connect():141
    140609 14:48:31 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():196: Failed to open backend connection: -110 (Connection timed out)
    140609 14:48:31 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1291: Failed to open channel 'cluster' at 'gcomm://192.168.211.132,192.168.211.134': -110 (Connection timed out)
    140609 14:48:31 [ERROR] WSREP: gcs connect failed: Connection timed out
    140609 14:48:31 [ERROR] WSREP: wsrep::connect() failed: 7
    140609 14:48:31 [ERROR] Aborting
    
    140609 14:48:31 [Note] WSREP: Service disconnected.
    140609 14:48:32 [Note] WSREP: Some threads may fail to exit.
    140609 14:48:32 [Note] /usr/sbin/mysqld: Shutdown complete
    
    140609 14:48:32 mysqld_safe mysqld from pid file /var/lib/mysql/localhost.localdomain.pid ended
    

    I am at a loss as to why the second server cannot detect that a cluster is up and running. The machines can communicate with each other just fine: I can SSH from one to the other, and they can ping each other. I have tried deleting the Galera cache, downgrading my version of MariaDB Galera, disabling SELinux, running the mysql service as a different user, verifying that the correct ports are open, running them on 2 VMs on separate computers with different IP addresses, etc. Does anyone have any idea what is going on here? I have been searching for 3 days trying to fix this, but no solution seems to work for me.
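    For the port check, I used something along these lines (by default Galera uses TCP 4567 for group communication, 4444 for SST, and 4568 for IST; this sketch assumes nc is installed and uses the IPs from this question):

    nc -zv 192.168.211.133 4567   # group replication traffic
    nc -zv 192.168.211.133 4444   # SST
    nc -zv 192.168.211.133 4568   # IST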