MariaDB Galera Cluster falls out of sync easily
First, huge transactions and LOAD DATA INFILE
on galera cluster is still a known limitation, if you have to it is advised to split these transactions by 5k-10k trx's or lesser YMMV.
Try to increase wsrep-max-ws-size
.
Set innodb_flush_log_at_trx_commit=0
on all nodes.
Related videos on Youtube
Cera
Updated on September 18, 2022Comments
-
Cera over 1 year
We're trying to debug an issue with a MariaDB cluster.
We're running Maria 10.0.19 on
c4.large
instances in Amazon EC2; the OS is Ubuntu 14.04 (Trusty).There are three machines clustered together, replicating fine (we can run
create database foo;
on one machine and see it on the other, etc).But: when we try to restore a database from a dump while all three machines are running and synced, there's an error:$ du -sh *.sql 2.7G sqldump.sql $ cat sqldump.sql | sudo mysql ERROR 1047 (08S01) at line 4361: WSREP has not yet prepared node for application use
It seems like this error has something to do with how long the import takes. If we run
service mysql stop
on two of the three nodes in the cluster and run the SQL command against the remaining node, it works fine. We can then start each of the machines in the cluster one by one to replicate the data through an SST, so it seems like this is an issue with the Galera configuration.This doesn't only happen when running a large MySQL import: it happens during routine usage with small transactions. The large import is our most reliable way to replicate this issue, though.
The system memory usage during the import is not particularly high, nor is CPU usage. The network traffic is far below what the machine's link is capable of, and in our test there is no other traffic aside from our SSH connection.
Can someone please help understand what might be causing this issue?
Here is some more detail about the machines in the cluster and the MariaDB configuration:
Ubuntu:
$ lsb_release -a No LSB modules are available. Distributor ID: Ubuntu Description: Ubuntu 14.04.2 LTS Release: 14.04 Codename: trusty
Kernel:
$ uname -a Linux servername.domain 3.13.0-53-generic #89-Ubuntu SMP Wed May 20 10:34:39 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
MySQL config (IP addresses, domain, etc. in
wsrep_cluster_address
intentionally obfuscated):$ find /etc/mysql/ -name "*.cnf" -exec cat {} \; | egrep -v "^#" | grep v "^$" [mysqld] server-id = 965424531 bind-address = * max_connections = 500 max_connect_errors = 1000000 innodb_buffer_pool_size = 2635M log_bin = /var/lib/mysql/mysql/mysql-bin expire_logs_days = 7 sync_binlog = 1 binlog_format = MIXED log-slave-updates = 1 slow_query_log = 1 slow_query_log_file = /var/log/mysql/mysql-slow.log [mysqld] innodb_use_native_aio = 0 innodb_flush_method = O_DSYNC [client] [mysqld] [mysqld_safe] syslog [mariadb] [client] port = 3306 socket = /var/run/mysqld/mysqld.sock [mysqld_safe] socket = /var/run/mysqld/mysqld.sock nice = 0 [mysqld] user = mysql pid-file = /var/run/mysqld/mysqld.pid socket = /var/run/mysqld/mysqld.sock port = 3306 basedir = /usr datadir = /var/lib/mysql tmpdir = /tmp lc-messages-dir = /usr/share/mysql skip-external-locking bind-address = 127.0.0.1 key_buffer = 16M max_allowed_packet = 16M thread_stack = 192K thread_cache_size = 8 myisam-recover = BACKUP query_cache_limit = 1M query_cache_size = 16M log_error = /var/log/mysql/error.log expire_logs_days = 10 max_binlog_size = 100M [mysqldump] quick quote-names max_allowed_packet = 16M [isamchk] key_buffer = 16M !includedir /etc/mysql/conf.d/ [mysqld] wsrep_provider=/usr/lib/galera/libgalera_smm.so wsrep_debug=ON wsrep_cluster_name="clustername" wsrep_cluster_address="gcomm://10.0.X.X,10.0.X.X" wsrep_sst_method=xtrabackup-v2 wsrep_sst_auth=sstuser:sstpassword wsrep_node_address="10.0.1.10" wsrep_node_name="servername.domain" binlog_format=ROW wsrep_on=ON default_storage_engine=InnoDB innodb_autoinc_lock_mode=2 innodb_doublewrite=1 query_cache_size=0 innodb_log_file_size = 256M
Cluster status:
$ sudo mysql Welcome to the MariaDB monitor. Commands end with ; or \g. Your MariaDB connection id is 288 Server version: 10.0.19-MariaDB-1~trusty-wsrep-log mariadb.org binary distribution, wsrep_25.10.r4144 Copyright (c) 2000, 2015, Oracle, MariaDB Corporation Ab and others. Type 'help;' or '\h' for help. Type '\c' to clear the current input statement. MariaDB [(none)]> show status like '%wsrep%'; +------------------------------+---------------------------------------------------+ | Variable_name | Value | +------------------------------+---------------------------------------------------+ | wsrep_local_state_uuid | e856afdc-18af-11e5-a3a6-efccde439ba4 | | wsrep_protocol_version | 7 | | wsrep_last_committed | 45764 | | wsrep_replicated | 2031 | | wsrep_replicated_bytes | 1527494811 | | wsrep_repl_keys | 9973524 | | wsrep_repl_keys_bytes | 79839767 | | wsrep_repl_data_bytes | 1447525060 | | wsrep_repl_other_bytes | 0 | | wsrep_received | 1478 | | wsrep_received_bytes | 13040 | | wsrep_local_commits | 1750 | | wsrep_local_cert_failures | 0 | | wsrep_local_replays | 0 | | wsrep_local_send_queue | 0 | | wsrep_local_send_queue_max | 2 | | wsrep_local_send_queue_min | 0 | | wsrep_local_send_queue_avg | 0.001140 | | wsrep_local_recv_queue | 0 | | wsrep_local_recv_queue_max | 7 | | wsrep_local_recv_queue_min | 0 | | wsrep_local_recv_queue_avg | 0.043302 | | wsrep_local_cached_downto | 45564 | | wsrep_flow_control_paused_ns | 3956186469 | | wsrep_flow_control_paused | 0.005006 | | wsrep_flow_control_sent | 0 | | wsrep_flow_control_recv | 41 | | wsrep_cert_deps_distance | 4.487445 | | wsrep_apply_oooe | 0.000000 | | wsrep_apply_oool | 0.000000 | | wsrep_apply_window | 1.000000 | | wsrep_commit_oooe | 0.000000 | | wsrep_commit_oool | 0.000000 | | wsrep_commit_window | 1.000000 | | wsrep_local_state | 4 | | wsrep_local_state_comment | Synced | | wsrep_cert_index_size | 11438 | | wsrep_causal_reads | 0 | | wsrep_cert_interval | 0.000000 | | wsrep_incoming_addresses | ,, | | wsrep_evs_delayed | | | wsrep_evs_evict_list | | | wsrep_evs_repl_latency | 0.00059098/0.000958534/0.00469729/0.000375612/732 | | wsrep_evs_state | OPERATIONAL | | wsrep_gcomm_uuid | 8bcfefe4-25f7-11e5-be32-062acc002ed5 | | wsrep_cluster_conf_id | 88 | | wsrep_cluster_size | 3 | | wsrep_cluster_state_uuid | e856afdc-18af-11e5-a3a6-efccde439ba4 | | wsrep_cluster_status | Primary | | wsrep_connected | ON | | wsrep_local_bf_aborts | 0 | | wsrep_local_index | 2 | | wsrep_provider_name | Galera | | wsrep_provider_vendor | Codership Oy <[email protected]> | | wsrep_provider_version | 3.9(rXXXX) | | wsrep_ready | ON | | wsrep_thread_count | 2 | +------------------------------+---------------------------------------------------+ 57 rows in set (0.00 sec)
-
Michael Hampton almost 9 yearsThis is usually a lost quorum, network partition, etc., event. Do you have a flaky network between the cluster nodes?
-
Cera almost 9 yearsThe machines are in AWS EC2 and the networking is generally extremely reliable. For example, we have xtradb multi-master setups that don't have this issue, and many other services with much higher traffic and no issues like this. The traffic on the network interfaces while running this query, as I said, is far below the capacity of the links.
-
Jaime Soriano almost 9 yearsHave you tried innodb_flush_method = O_DIRECT?
-
Ondra Sniper Flidr over 8 yearsWhat are your mysql logs telling about it? Galera is logging pretty well, co you should look in those files. And I can see you have configured mysql to use coflict autoincrement sequences. You should define auto_increment_increment and auto_increment_offset config values to be sure no two nodes will generate same autoincrement value.
-