What can I really do with percona toolkit's pt-table-sync?

8,474

Solution 1

Answer to Question 1

MySQL Replication suffers from two major problems

  • MySQL Replication is Asynchronous. This may introduce replication delay. This manifests itself with communication problems between a Master and the Slave via the Slave I/O Thread. This may logically and numerically be seen in Seconds_Behind_Master.

  • Data Drift. This is a intermittent condition where a Master and Slave are simply out-of-sync because of factors outside the realm of MySQL Replication. For example, please note one way to better synchronize replication: use the option sync-binlog. When you set sync-binlog to 1, mysqld will perform a flush of the current binary log for every entry you record in the binary log. That can ridiculously slow down a Master. By default, sync-binlog is 0.

    • Here is a question: With sync-binlog=0, who is responsible for flushing the binary log to disk?
    • Answer (please sit down for this one): THE OPERATING SYSTEM !!!
    • With that answer, it puts the Slave as a terrible disadvantage because its I/O Thread is at the mercy of the Master's Operating System. When the Master's OS gets around to flushing the binary log changes to disk and the Slave's I/O Thread can detect the next incoming SQL statement, then the statement is shipped over the I/O Thread to the Slave.
    • Percona has a nice PDF on dealing with Data Drift

Answer to Question 2

The direct answer here is no because pt-table-sync was designed to detect the I/O thread of a Slave by means of the --sync-to-master option.

Answer to Question 3

The direct answer here is no because MySQL Replication demands to know

  • what is the current binary log on the Master? (this is Master_Log_File from SHOW SLAVE STATUS\G)
  • what is the latest position the Slave has read from Master's current binary log? (this is Read_Master_Log_Pos from SHOW SLAVE STATUS\G)

If you simply want your binary logs to get out of the way, you can do one of two things

  • OPTION 1 : On the Master, set expire-logs-days to 3 to keep the last 3 days worth of binary logs
    • Add expire-logs-days=3 in /etc/my.cnf
    • No restart required: Just run SET GLOBAL expire_logs_days = 3;
  • OPTION 2 : Run SHOW SLAVE STATUS\G on the Slave. Take the value of Relay_Master_Log_File. and use it to clear binary logs on the Master to up that log file.
    • Suppose you run SHOW SLAVE STATUS\G on the Slave
    • You get this Relay_Master_Log_File: mysql-bin.000035
    • Run this on the Master : PURGE BINARY LOGS TO 'mysql-bin.000035';

SUGGESTION

If you want to have more faith in pt-table-sync, try using the --print option and redirecting to a text file instead of the --execute option. This will generate the SQL that would normally execute on the Master. You could just run the SQL directly on that Slave thereafter. Think of it as a dress rehearsal for --execute.

Solution 2

but I thought the whole point of replication was to take care of the synchronization of data for you

Yes, MySQL replication does try to synchronize a replicated database. However, MySQL replication is tricky and the replication can fail for various reasons. Replication errors in my experience are rare, but they do happen during unexpected server crashes, when users hit "Control-C" in the middle of a big insert on the master, etc. MySQL.com does not provide good tools to deal with many of these problems. Luckily, a few engineers such as Baron Schwartz (Original author of the Percona Toolkit (formerly known as Maatkit) have developed tools to make MySQL administration easier.

For example, I currently have a table with 50 million rows. A handful of rows are out of sync due to a server crash a few weeks ago. I need to discover which rows are out of sync, but that would be painful to do manually. I use pt-table-checksum to check for replication errors on the replica, and pt-table-sync to discover which rows are missing on the replica.

If you are considering MySQL replication, I highly recommend that you investigate and use the Percona Toolkit. If we had started off with the Percona Toolkit, the administration of our MySQL databases would have been much simpler.

I read the documentation and got confused:

The documentation for Percona Toolkit is written like a technical manual. It unfortunately does not do a good job at describing how to use the tools, how do they help you, etc. http://www.mysqlperformanceblog.com has some of this information, but it's largely focused on the Percona fork of MySQL (This is how they make a living), which requires the reader to do some translation.

Solution 3

Answer to Question 1

pt-table-sync (togerther with pt-table-checksum) can be used to correct replication errors like data corruption, someone directly modifying data on the slave, server crashes, schema changes in wrong order, etc.

However pt-table-sync can be also used without replication for keeping tables synced in near-realtime if the data don't change too much.

Correct Answer to Question 2

Of course you can use it also in non-replicated environment, the manual also mentiones it. I use it from cron to keep 3 mysql servers "in sync" every 5 minutes. They have the same copy of data which changes only sometimes (on first server), so replication for this purpose would be overkill.

You can specify individual databases or individual tables to be synced. You can have several destination servers. pt-table-sync uses several efficient algorithms to detect changes in database tables and copy only the changes (it categorizes the changes into 4 groups: Deletes, Replaces, Inserts, Updates).

Share:
8,474

Related videos on Youtube

Max
Author by

Max

Analytics consultant available for hire. More info: https://maxcorbeau.com

Updated on September 18, 2022

Comments

  • Max
    Max over 1 year

    I have been looking for a tool to synchronize tables from 2 different databases and found pt-table-sync. I read the documentation and got confused: they mostly use examples referring to a replicated environment, but I thought the whole point of replication was to take care of the synchronization of data for you, hence my questions:

    QUESTIONS

    1. What is the point of using pt-table-sync if the replication process is supposed to take care of the synchronization of data for you?

    2. Can I use pt-table-sync in a non-replicated environment (between 2+ hosts that have nothing to do with each other, is it the role of the pt-table-sync --execute host1 host2 host3 example given)?

    3. If I must use pt-table-sync in a replicated environment, can I do without the bin-logs on the master (there is an example talking about resolving differences discovered by pt-table-checksum so wondering if bin-logs are absolutely necessary)?

  • Max
    Max over 11 years
    Thanks Rolando for your answer. Here is my feedback: A1: "MySQL Replication is Asynchronous": isn't that affecting ps-sync-table too? (Q1 was about the benefits of ps-sync-table over replication). "Data Drift": OK A2: OK A3: "If you simply what binary logs to get out of the way": what do you mean there?
  • RolandoMySQLDBA
    RolandoMySQLDBA over 11 years
    For Answer #1, pt-table-sync tries coalesce the slave status to read a stable log file and position from the master such the seconds_behind_master is virtually zero when computing differences. For Answer #3, once slave read the Master's latest binary log, there is no need for any binary logs prior to it. You can have mysqld rotate out aged binary logs. The setting expire-logs-days simply sets the age at which binary logs are automatically erased. If you use the binary logs on the Master in the event of point-in-time recovery situations, you could keep as many days as needed.
  • Max
    Max over 11 years
    Very re-assuring answer as you seem to be going in my way and put things in a very understandable fashion (Rolando is too technical for me :(). I'll definitely check out the percona toolkit and the blog as they seem to be great resources.
  • Marki555
    Marki555 almost 9 years
    For Question #2, the answer is YES, not no (see details in my own answer).