Endless recovering state of secondary

mongodb replication

16,127

Solution 1

The problem (most likely)

The last operation on the primary is from "2015-05-15T02:10:56Z", whereas the last operation of the going to be secondary is from "2015-05-14T11:23:51Z", which is a difference of roughly 15 hours. That window may well exceed your replication oplog window (the difference between the time of the first and the last operation entry in your oplog). Put simply, there are too many operations on the primary for the secondary to catch up.

A bit more elaborated (though simplified): during an initial sync, the data the secondary syncs from is the data of a given point in time. When the data of that point in time is synced over, the secondary connects to the oplog and applies the changes that were made between said point in time and now according to the oplog entries. This works well as long as the oplog holds all operations between the mentioned point in time. But the oplog has a limited size (it is a so called capped collection). So if there are more operations happening on the primary than the oplog can hold during the initial sync, the oldest operations "fade out". The secondary recognises that not all operations are available necessary to "construct" the same data as the primary and refuses to complete the sync, staying in RECOVERY mode.

The solution(s)

The problem is a known one and not a bug, but a result of the inner workings of MongoDB and several fail-safe assumptions made by the development team. Hence, there are several ways to deal with the situation. Sadly, since you only have two data bearing nodes, all involve downtime.

Option 1: Increase the oplog size

This is my preferred method, since it deals with the problem once and (kind of) for all. It's a bit more complicated than other solutions, though. From a high level perspective, these are the steps you take.

Shut down the primary
Create a backup of the oplog using direct access to the data files
Restart the mongod in standalone mode
Copy the current oplog to a temporary collection
Delete the current oplog
Recreate the oplog with the desired size
Copy back the oplog entries from the temporary collection to the shiny new oplog
Restart mongod as part of the replica set

Do not forget to increase the oplog of the secondary before doing the initial sync, since it may become primary at some time in the future!

For details, please read "Change the size of the oplog" in the tutorials regarding replica set maintenance.

Option 2: Shut down the app during sync

If option 1 is not viable, the only real other solution is to shut down the application causing load on the replica set, restart the sync and wait for it too complete. Depending on the amount of the data to be transferred, calculate with several hours.

A personal note

The oplog window problem is a well known one. While replica sets and sharded clusters are easy to set up with MongoDB, quite some knowledge and a bit of experience is needed to maintain them properly. Do not run something as important as a database with a complex setup without knowing the basics - in case Something Bad (tm) happens, it might well lead to a situation FUBAR.

Solution 2

Another option (assuming primary has healthy data) is to simply delete the data in the secondary's mongo data folder and restart. This will cause it to sync back up to the primary as if you just added it to the replica set.

16,127

Author by

tottishi05

Updated on September 16, 2022

Comments

tottishi05 over 1 year

I build a replication set with one primary, one secondary and one arbiter on MongoDB 3.0.2. The primary and arbiter are on the same host and the secondary is on another host.

With the growing of write overload, the secondary can't follow the primary and step into the state of recovering. The primary can connect to the secondary as I can log to the secondary server by Mongo shell on the host of primary.

I stop all the operations and watch the secondary's state with the command rs.status() and type the command rs.syncFrom("primary's ip:port") on secondary.

Then the result of the rs.status() command shows that the optimeDate of secondary is far behind that of the primary and one message appears intermittently as below:

"set" : "shard01", "date" : ISODate("2015-05-15T02:10:55.382Z"), "myState" : 3, "members" : [ { "_id" : 0, "name" : "xxx.xxx.xxx.xxx:xxx", "health" : 1, "state" : 1, "stateStr" : "PRIMARY", "uptime" : 135364, "optime" : Timestamp(1431655856, 6), "optimeDate" : ISODate("2015-05-15T02:10:56Z"), "lastHeartbeat" : ISODate("2015-05-15T02:10:54.306Z"), "lastHeartbeatRecv" : ISODate("2015-05-15T02:10:53.634Z"), "pingMs" : 0, "electionTime" : Timestamp(1431520398, 2), "electionDate" : ISODate("2015-05-13T12:33:18Z"), "configVersion" : 3 }, { "_id" : 1, "name" : "xxx.xxx.xxx.xxx:xxx", "health" : 1, "state" : 7, "stateStr" : "ARBITER", "uptime" : 135364, "lastHeartbeat" : ISODate("2015-05-15T02:10:53.919Z"), "lastHeartbeatRecv" : ISODate("2015-05-15T02:10:54.076Z"), "pingMs" : 0, "configVersion" : 3 }, { "_id" : 2, "name" : "xxx.xxx.xxx.xxx:xxx", "health" : 1, "state" : 3, "stateStr" : "RECOVERING", "uptime" : 135510, "optime" : Timestamp(1431602631, 134), "optimeDate" : ISODate("2015-05-14T11:23:51Z"), "infoMessage" : "could not find member to sync from", "configVersion" : 3, "self" : true } ], "ok" : 1

"infoMessage" : "could not find member to sync from"

The primary and arbiter are both OK. I want to know the reason of this message and how to change the secondary's state from "recovering" to "secondary".