What can I really do with percona toolkit's pt-table-sync?
Solution 1
Answer to Question 1
MySQL Replication suffers from two major problems
MySQL Replication is Asynchronous. This may introduce replication delay. This manifests itself with communication problems between a Master and the Slave via the Slave I/O Thread. This may logically and numerically be seen in
Seconds_Behind_Master
.-
Data Drift
. This is a intermittent condition where a Master and Slave are simply out-of-sync because of factors outside the realm of MySQL Replication. For example, please note one way to better synchronize replication: use the optionsync-binlog
. When you setsync-binlog
to 1, mysqld will perform a flush of the current binary log for every entry you record in the binary log. That can ridiculously slow down a Master. By default,sync-binlog
is 0.- Here is a question: With
sync-binlog=0
, who is responsible for flushing the binary log to disk? - Answer (please sit down for this one): THE OPERATING SYSTEM !!!
- With that answer, it puts the Slave as a terrible disadvantage because its I/O Thread is at the mercy of the Master's Operating System. When the Master's OS gets around to flushing the binary log changes to disk and the Slave's I/O Thread can detect the next incoming SQL statement, then the statement is shipped over the I/O Thread to the Slave.
- Percona has a nice PDF on dealing with Data Drift
- Here is a question: With
Answer to Question 2
The direct answer here is no because pt-table-sync
was designed to detect the I/O thread of a Slave by means of the --sync-to-master
option.
Answer to Question 3
The direct answer here is no because MySQL Replication demands to know
- what is the current binary log on the Master? (this is
Master_Log_File
fromSHOW SLAVE STATUS\G
) - what is the latest position the Slave has read from Master's current binary log? (this is
Read_Master_Log_Pos
fromSHOW SLAVE STATUS\G
)
If you simply want your binary logs to get out of the way, you can do one of two things
- OPTION 1 : On the Master, set
expire-logs-days
to 3 to keep the last 3 days worth of binary logs- Add
expire-logs-days=3
in /etc/my.cnf - No restart required: Just run
SET GLOBAL expire_logs_days = 3;
- Add
- OPTION 2 : Run
SHOW SLAVE STATUS\G
on the Slave. Take the value ofRelay_Master_Log_File
. and use it to clear binary logs on the Master to up that log file.- Suppose you run
SHOW SLAVE STATUS\G
on the Slave - You get this
Relay_Master_Log_File: mysql-bin.000035
- Run this on the Master :
PURGE BINARY LOGS TO 'mysql-bin.000035';
- Suppose you run
SUGGESTION
If you want to have more faith in pt-table-sync, try using the --print
option and redirecting to a text file instead of the --execute
option. This will generate the SQL that would normally execute on the Master. You could just run the SQL directly on that Slave thereafter. Think of it as a dress rehearsal for --execute
.
Solution 2
but I thought the whole point of replication was to take care of the synchronization of data for you
Yes, MySQL replication does try to synchronize a replicated database. However, MySQL replication is tricky and the replication can fail for various reasons. Replication errors in my experience are rare, but they do happen during unexpected server crashes, when users hit "Control-C" in the middle of a big insert on the master, etc. MySQL.com does not provide good tools to deal with many of these problems. Luckily, a few engineers such as Baron Schwartz (Original author of the Percona Toolkit (formerly known as Maatkit) have developed tools to make MySQL administration easier.
For example, I currently have a table with 50 million rows. A handful of rows are out of sync due to a server crash a few weeks ago. I need to discover which rows are out of sync, but that would be painful to do manually. I use pt-table-checksum to check for replication errors on the replica, and pt-table-sync to discover which rows are missing on the replica.
If you are considering MySQL replication, I highly recommend that you investigate and use the Percona Toolkit. If we had started off with the Percona Toolkit, the administration of our MySQL databases would have been much simpler.
I read the documentation and got confused:
The documentation for Percona Toolkit is written like a technical manual. It unfortunately does not do a good job at describing how to use the tools, how do they help you, etc. http://www.mysqlperformanceblog.com has some of this information, but it's largely focused on the Percona fork of MySQL (This is how they make a living), which requires the reader to do some translation.
Solution 3
Answer to Question 1
pt-table-sync
(togerther with pt-table-checksum
) can be used to correct replication errors like data corruption, someone directly modifying data on the slave, server crashes, schema changes in wrong order, etc.
However pt-table-sync
can be also used without replication for keeping tables synced in near-realtime if the data don't change too much.
Correct Answer to Question 2
Of course you can use it also in non-replicated environment, the manual also mentiones it. I use it from cron to keep 3 mysql servers "in sync" every 5 minutes. They have the same copy of data which changes only sometimes (on first server), so replication for this purpose would be overkill.
You can specify individual databases or individual tables to be synced. You can have several destination servers. pt-table-sync
uses several efficient algorithms to detect changes in database tables and copy only the changes (it categorizes the changes into 4 groups: Deletes, Replaces, Inserts, Updates).
Related videos on Youtube
Max
Analytics consultant available for hire. More info: https://maxcorbeau.com
Updated on September 18, 2022Comments
-
Max over 1 year
I have been looking for a tool to synchronize tables from 2 different databases and found
pt-table-sync
. I read the documentation and got confused: they mostly use examples referring to a replicated environment, but I thought the whole point of replication was to take care of the synchronization of data for you, hence my questions:QUESTIONS
-
What is the point of using
pt-table-sync
if the replication process is supposed to take care of the synchronization of data for you? -
Can I use
pt-table-sync
in a non-replicated environment (between 2+ hosts that have nothing to do with each other, is it the role of thept-table-sync --execute host1 host2 host3
example given)? -
If I must use
pt-table-sync
in a replicated environment, can I do without thebin-logs
on themaster
(there is an example talking about resolving differences discovered bypt-table-checksum
so wondering ifbin-logs
are absolutely necessary)?
-
-
Max over 11 yearsThanks Rolando for your answer. Here is my feedback:
A1
:"MySQL Replication is Asynchronous"
: isn't that affectingps-sync-table
too? (Q1 was about the benefits ofps-sync-table
over replication)."Data Drift"
: OKA2
: OKA3
:"If you simply what binary logs to get out of the way"
: what do you mean there? -
RolandoMySQLDBA over 11 yearsFor Answer #1, pt-table-sync tries coalesce the slave status to read a stable log file and position from the master such the seconds_behind_master is virtually zero when computing differences. For Answer #3, once slave read the Master's latest binary log, there is no need for any binary logs prior to it. You can have mysqld rotate out aged binary logs. The setting
expire-logs-days
simply sets the age at which binary logs are automatically erased. If you use the binary logs on the Master in the event of point-in-time recovery situations, you could keep as many days as needed. -
Max over 11 yearsVery re-assuring answer as you seem to be going in my way and put things in a very understandable fashion (
Rolando
is too technical for me :(). I'll definitely check out thepercona toolkit
and the blog as they seem to be great resources. -
Marki555 almost 9 yearsFor Question #2, the answer is YES, not no (see details in my own answer).