[SERVER-36775] Replication sync issue Created: 20/Aug/18  Updated: 15/Sep/18  Resolved: 20/Aug/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.2.20
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Matthew S Davis Assignee: Nick Brewer
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-29837 TopologyCoordinator::shouldChangeSync... Backlog
Participants:

 Description   

We are experiencing a strange error in replication. We are using "chainingAllowed" : true. Sometimes replication stops at random and a replica member is unable to find a valid sync source. Instead, the member keeps retrying the same sync source over and over until it can no longer catch up at all because the oplog is too stale. Here is the log of a failing replica member:

2018-08-20T15:19:38.371-0600 I REPL [ReplicationExecutor] re-evaluating sync source because our current sync source's most recent OpTime is (term: -1, timestamp: Aug 19 15:29:07:1bf) which is more than 30s behind member redacted-host-name-01.local:27017 whose most recent OpTime is (term: -1, timestamp: Aug 20 15:12:28:c5)
2018-08-20T15:19:38.371-0600 I REPL [ReplicationExecutor] syncing from: redacted-host-name-03.local:27017
2018-08-20T15:19:38.381-0600 I REPL [rsBackgroundSync] Chose same sync source candidate as last time, redacted-host-name-03.local:27017. Sleeping for 1 second to avoid immediately choosing a new sync source for the same reason as last time.
2018-08-20T15:19:39.381-0600 I REPL [SyncSourceFeedback] setting syncSourceFeedback to redacted-host-name-03.local:27017
2018-08-20T15:19:39.386-0600 I REPL [ReplicationExecutor] re-evaluating sync source because our current sync source's most recent OpTime is (term: -1, timestamp: Aug 19 15:29:07:1bf) which is more than 30s behind member redacted-host-name-01.local:27017 whose most recent OpTime is (term: -1, timestamp: Aug 20 15:12:28:c5)
2018-08-20T15:19:39.386-0600 I REPL [ReplicationExecutor] syncing from: redacted-host-name-03.local:27017
2018-08-20T15:19:39.394-0600 I REPL [rsBackgroundSync] Chose same sync source candidate as last time, redacted-host-name-03.local:27017. Sleeping for 1 second to avoid immediately choosing a new sync source for the same reason as last time.
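
For anyone trying to spot this loop in their own logs, here is a minimal sketch that counts how often the "Chose same sync source candidate as last time" message repeats per host. It only assumes the message wording shown in the excerpt above. The function names and the `threshold` parameter are hypothetical and not part of any MongoDB tooling.

```python
import re
from collections import Counter

# Matches the rsBackgroundSync message from the log excerpt and captures host:port.
SAME_SOURCE = re.compile(
    r"Chose same sync source candidate as last time, (?P<host>\S+:\d+)\."
)


def count_same_source_retries(log_lines):
    """Return a Counter mapping sync source host:port to repeated-pick count."""
    counts = Counter()
    for line in log_lines:
        match = SAME_SOURCE.search(line)
        if match:
            counts[match.group("host")] += 1
    return counts


def stuck_sources(log_lines, threshold=2):
    """List hosts re-chosen at least `threshold` times (threshold is an assumption)."""
    counts = count_same_source_retries(log_lines)
    return sorted(host for host, n in counts.items() if n >= threshold)
```

Feeding it the lines of a mongod log file (for example via `open(path)`) gives the hosts a member keeps re-selecting. A host that shows up here for many consecutive seconds matches the symptom described in this ticket.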

Restarting the mongod service or changing the replica set config seems to force the replica member out of this loop and allows it to sync again from a non-stale member.
The expected behavior would be for the replica member to try a different sync source instead of the same one over and over again.
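
To make the expected behavior concrete, here is a toy model of the selection we would expect. It is not MongoDB's actual implementation. It skips any candidate whose last OpTime is more than 30 seconds behind the freshest known member (the 30s figure comes from the log message above). The `Member` class, the `choose_sync_source` function, and the "pick the freshest candidate" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_LAG_SECS = 30  # taken from the "more than 30s behind" log message


@dataclass
class Member:
    host: str
    optime_secs: int  # last applied OpTime, as seconds since the epoch


def choose_sync_source(
    members: List[Member],
    self_optime_secs: int,
    max_lag_secs: int = MAX_LAG_SECS,
) -> Optional[str]:
    """Pick a source that is ahead of us and not too far behind the freshest member."""
    if not members:
        return None
    newest = max(m.optime_secs for m in members)
    candidates = [
        m
        for m in members
        if m.optime_secs > self_optime_secs          # must have data we lack
        and newest - m.optime_secs <= max_lag_secs   # must not itself be stale
    ]
    if not candidates:
        return None
    # Simplification: prefer the freshest candidate.
    return max(candidates, key=lambda m: m.optime_secs).host
```

With the situation from the log (host 03 roughly a day behind host 01), this model would return host 01 rather than re-selecting host 03.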



 Comments   
Comment by Matthew S Davis [ 21/Aug/18 ]

In our case chaining is allowed, but the same sync source is still selected. Eventually the member is too far behind.

Comment by Nick Brewer [ 20/Aug/18 ]

matthew.s.davis62.ctr@mail.mil I believe this is a duplicate - work to improve OpTime behavior with chainingAllowed is currently being tracked here: SERVER-29837

You can vote for that issue, and follow along with it for updates.

-Nick
