[SERVER-4927] Slaves stops replog sync if another slaves used fsyncLock Created: 10/Feb/12 Updated: 06/Dec/22 Resolved: 22/Feb/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.0.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Steffen | Assignee: | Backlog - Replication Team |
| Resolution: | Done | Votes: | 0 |
| Labels: | sync | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Linux 2.6.32-38-server, Ubuntu 10.04, MongoDB 2.0.1, replica set with 4 nodes, NUMA, 2x Xeon E5620, 24 GB RAM |
||
| Issue Links: |
|
||||||||
| Assigned Teams: |
Replication
|
||||||||
| Operating System: | ALL | ||||||||
| Participants: | |||||||||
| Description |
|
This is our setup. If we have a master switch (due to an upgrade or a failure) in the replica set, some or all slaves stop syncing the oplog when the backup node starts its backup. This happens at exactly the moment we run the fsyncLock command: the replication lag of the slaves that stop syncing is the same as that of the backup node. When the backup is finished, the other slaves start syncing again. This is the second time we've encountered this problem. Since this is our production environment, we don't want to force a master switch unless it is needed. The master switch in the replica set seems to be the cause of the problem. Regards, |
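For illustration only, the backup step described above could be wrapped so that the lock is always released, which limits how long any dependent slave can stall. This is a minimal sketch, not the reporter's actual backup script. It assumes a pymongo-style client whose `admin.command(...)` accepts the `fsync` command with `lock=True`, and it assumes a server new enough to accept `fsyncUnlock` as a command (releases as old as the one in this ticket unlocked through a different mechanism). The name `fsync_locked` is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def fsync_locked(client):
    """Hold an fsync lock on the node behind `client` while the block runs.

    `client` is assumed to be a pymongo-style client connected directly
    to the backup node (not to the replica set as a whole).
    """
    # Flush pending writes to disk and block further writes on this node.
    client.admin.command("fsync", lock=True)
    try:
        yield client
    finally:
        # Always release the lock, even if the backup step raised.
        # Assumption: the server accepts fsyncUnlock as a command.
        client.admin.command("fsyncUnlock")
```

The backup itself (for example a filesystem snapshot) would run inside `with fsync_locked(client): ...`. Keeping the lock in a `finally` block means a failed backup does not leave the node locked indefinitely.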
| Comments |
| Comment by Gregory McKeon (Inactive) [ 22/Feb/18 ] |
|
We believe this has gone away - if this is still an issue, please feel free to file a new ticket. |
| Comment by Eric Milkie [ 08/Mar/13 ] |
|
This should now behave somewhat better, in that the secondaries might pause for a bit but after 30 seconds should switch sync sources away from the locked node and catch up. |
| Comment by Kristina Chodorow (Inactive) [ 07/Mar/12 ] |
|
The stuck secondaries were probably syncing from the fsync+locked secondary. The secondaries should recalculate who to sync from periodically. |
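One way to check this hypothesis is to look at which member each secondary reports as its sync source. The sketch below works on a replica set status document that has already been fetched and decoded into a Python dict. It assumes each member entry may carry a sync source under `syncSourceHost` or the older `syncingTo` field; whether either field is present depends on the server version, and very old releases may expose neither. The function name and the host names in the usage note are hypothetical.

```python
def members_syncing_from(status, locked_host):
    """Return the names of members whose reported sync source is `locked_host`.

    `status` is a dict shaped like a replica set status document,
    with a "members" list of per-member dicts.
    """
    stuck = []
    for member in status.get("members", []):
        # Assumption: the sync source appears under one of these two keys.
        source = member.get("syncSourceHost") or member.get("syncingTo")
        if source == locked_host:
            stuck.append(member["name"])
    return stuck
```

Calling `members_syncing_from(status, "backup:27017")` while the backup node is locked would list the secondaries that are expected to stall until they pick a different sync source.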
| Comment by Steffen [ 07/Mar/12 ] |
|
So far the problem has not happened again. We are in the process of migrating all the hosts and also upgrading to 2.0.3. |
| Comment by Steffen [ 07/Mar/12 ] |
|
No, we don't use authentication. |
| Comment by Eliot Horowitz (Inactive) [ 07/Mar/12 ] |
|
Are you running with authentication? |
| Comment by Steffen [ 10/Feb/12 ] |
|
On the backup node we see the fsyncLock process. I don't have the db.currentOp() output from the last time. ntoreturn:1 reslen:168 246ms We don't see this log entry in the slave logs. This time we had one slave that was 1 second behind the primary. The second slave and the backup node had the same lag, which increased over time. If my guess is correct, the next time this happens will be after a master switch. |
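The lag pattern described here (one slave close to the primary, the others falling behind together) can be read off a replica set status document by comparing each member's last applied operation time with the primary's. The following is a rough sketch under stated assumptions: the status has been decoded into a dict, each member has `name`, `stateStr`, and an `optimeDate` that is a `datetime`, and the function name is hypothetical.

```python
from datetime import datetime


def replication_lag_seconds(status):
    """Map each secondary's name to its lag behind the primary, in seconds.

    Returns an empty dict if the status shows no primary.
    """
    members = status.get("members", [])
    primary = next((m for m in members if m.get("stateStr") == "PRIMARY"), None)
    if primary is None:
        return {}
    lags = {}
    for m in members:
        if m.get("stateStr") == "SECONDARY":
            # Lag is the gap between the last op applied on the primary
            # and the last op applied on this secondary.
            delta = primary["optimeDate"] - m["optimeDate"]
            lags[m["name"]] = delta.total_seconds()
    return lags
```

Two members showing the same, steadily growing value would match the report that the second slave and the backup node stopped at the same point.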
| Comment by Scott Hernandez (Inactive) [ 10/Feb/12 ] |
|
Is the primary farther away from the secondaries than the backup replica? Can you run db.currentOp() on the backup replica the next time this happens? Also, running db.getReplicationInfo() on each of the other replicas would be helpful. |
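As a rough aid for reading the requested db.currentOp() output, the sketch below summarizes a current-operation document that has already been captured as a Python dict. It assumes the document has an `inprog` list whose entries may carry `opid`, `op`, `ns`, `active`, and `secs_running`, and that a top-level `fsyncLock` flag is present only while the node is locked. The function name and the 5-second threshold are hypothetical choices.

```python
def summarize_current_op(current_op, min_secs=5):
    """Return (locked, slow_ops) for a currentOp-style document.

    locked   -- True if the document reports an fsync lock.
    slow_ops -- list of (opid, op, ns, secs_running) tuples for active
                operations running at least `min_secs` seconds.
    """
    # Assumption: `fsyncLock` only appears when the node is locked.
    locked = bool(current_op.get("fsyncLock", False))
    slow_ops = []
    for op in current_op.get("inprog", []):
        secs = op.get("secs_running", 0)
        if op.get("active") and secs >= min_secs:
            slow_ops.append((op.get("opid"), op.get("op"), op.get("ns"), secs))
    return locked, slow_ops
```

Running this against the output from the backup replica and from a stalled slave would show whether the slave is simply waiting on a locked sync source or is blocked by a long-running operation of its own.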