[SERVER-4927] Slaves stop oplog sync if another slave used fsyncLock Created: 10/Feb/12  Updated: 06/Dec/22  Resolved: 22/Feb/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.0.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Steffen Assignee: Backlog - Replication Team
Resolution: Done Votes: 0
Labels: sync
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux 2.6.32-38-server, Ubuntu 10.04, MongoDB 2.0.1, replica set with 4 nodes, NUMA, 2x Xeon E5620, 24 GB RAM


Issue Links:
Depends
depends on SERVER-5208 Replica set periodic reevaluation of ... Closed
Assigned Teams:
Replication
Operating System: ALL
Participants:

 Description   

This is our setup:
2 shards, each a replica set with 4 nodes. One node is dedicated to backups (priority: 0, hidden: true).
When we start a backup we send the fsyncLock command to the backup node and then start an rsync of the filesystem.
After we have finished the backup we send the fsyncUnlock command to the backup node.
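
For reference, a minimal sketch of this lock/copy/unlock sequence in the mongo shell (the rsync itself runs from the OS shell; hosts and paths are site-specific):

// on the backup node, in the mongo shell
db.fsyncLock()    // shell helper for db.runCommand({ fsync: 1, lock: true })
// ... while the lock is held, rsync the dbpath from the OS shell ...
db.fsyncUnlock()  // release the lock once the filesystem copy is done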

After a master switch in the replica set (due to an upgrade or a failure), we encounter the problem that some or all slaves stop oplog syncing when the backup node starts its backup. It happens at exactly the moment we issue the fsyncLock command: the replication lag is the same for the backup node and for the slaves that also stop syncing. When the backup is finished, the other slaves start syncing again as well.
db.currentOp() doesn't show the fsyncLock on the slaves, only on the backup node.
To get rid of this problem we have to restart the non-backup slave. After this restart the slave runs fine and never stops syncing together with the backup node again.

This is the second time we've encountered this problem. Since this is our production environment we don't want to force a master switch if not needed.

It seems that the cause of this problem is the master switch in the replica set.

Regards,
Steffen



 Comments   
Comment by Gregory McKeon (Inactive) [ 22/Feb/18 ]

We believe this has gone away - if this is still an issue, please feel free to file a new ticket.

Comment by Eric Milkie [ 08/Mar/13 ]

This should now behave somewhat better, in that the secondaries might pause for a bit but after 30 seconds should switch sync sources away from the locked node and catch up.
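
If a secondary does stay stuck on a locked sync source, it can also be redirected by hand; a sketch, assuming MongoDB 2.2+ where the rs.syncFrom() helper exists (the host name below is illustrative):

// on the stuck secondary, in the mongo shell
rs.syncFrom("node3.example.net:27017")  // illustrative host; point at an unlocked member
rs.status()                             // the members[].syncingTo fields show each secondary's sync source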

Comment by Kristina Chodorow (Inactive) [ 07/Mar/12 ]

The stuck secondaries were probably syncing from the fsync+locked secondary. The secondaries should recalculate who to sync from periodically.

Comment by Steffen [ 07/Mar/12 ]

So far the problem has not happened again. We are in the process of migrating all the hosts and also upgrading to 2.0.3.
The upgrades and the migration will involve a master switch. We will monitor whether the problem reoccurs then.

Comment by Steffen [ 07/Mar/12 ]

No, we don't use authentication.

Comment by Eliot Horowitz (Inactive) [ 07/Mar/12 ]

Are you running with authentication?

Comment by Steffen [ 10/Feb/12 ]

On the backup node we see the fsyncLock operation. I don't have the db.currentOp() output from the last time.
Log from backup node:
Fri Feb 10 05:59:35 [initandlisten] connection accepted from 172.20.4.219:21561 #403654
Fri Feb 10 05:59:35 [conn403654] CMD fsync: sync:1 lock:1
Fri Feb 10 05:59:35 [conn403654] removeJournalFiles
Fri Feb 10 05:59:35 [fsyncjob] db is now locked for snapshotting, no writes allowed. db.fsyncUnlock() to unlock
Fri Feb 10 05:59:35 [fsyncjob] For more info see http://www.mongodb.org/display/DOCS/fsync+Command
Fri Feb 10 05:59:35 [conn403654] command admin.$cmd command: { fsync: 1.0, lock: true } ntoreturn:1 reslen:168 246ms

We don't see this log entry in the slave logs.

This time we had one slave which was 1 second behind the primary. The second slave and the backup node had the same lag, which increased over time.
We get these values using the nagios check for MongoDB (http://tag1consulting.com/blog/mongodb-nagios-check).
We also have the MMS agent running; that should also show the replication lag?

If my guess is correct, the next time this happens will be after a master switch.
It is also possible that the other slave, which wasn't stale, hangs the next time the backup runs, because I haven't restarted it yet.
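
For reference, the lag can also be read directly from rs.status() in the mongo shell, independent of the plugin; a rough sketch using the per-member optimeDate fields:

var s = rs.status();
var p = s.members.filter(function (m) { return m.stateStr === "PRIMARY"; })[0];
s.members.forEach(function (m) {
    if (m.stateStr === "SECONDARY") {
        // Date subtraction yields milliseconds; divide for seconds of lag
        print(m.name + " is " + (p.optimeDate - m.optimeDate) / 1000 + "s behind the primary");
    }
});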

Comment by Scott Hernandez (Inactive) [ 10/Feb/12 ]

Is the primary farther away from the secondaries than the backup replica? Can you run db.currentOp() on the backup replica the next time this happens? Also, running db.getReplicationInfo() on each of the other replicas would be helpful.
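
For reference, the requested diagnostics are runnable in the mongo shell as follows (db.printSlaveReplicationInfo() is added here as a convenient per-member lag summary):

// on the backup replica while it appears stuck
db.currentOp()                  // lists in-progress operations; the fsync lock is reported here
// on each of the other replicas
db.getReplicationInfo()         // oplog size and the time range it covers
db.printSlaveReplicationInfo()  // each member's lag relative to the primary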
