[SERVER-57857] Impossible to change the sync source for the mongod Created: 21/Jun/21  Updated: 27/Oct/23  Resolved: 20/Jul/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Konstantin Krasnov Assignee: Xuerui Fa
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Sprint: Repl 2021-07-26
Participants:

 Description   

Hi,

I want to change sync source for the mongod on core-storage106-dc2.example.com:

rs.syncFrom('core-storage104-dc2.example.com:27018');

 

And an error is given:

{"t":{"$date":"2021-06-21T10:13:44.762+03:00"},"s":"I", "c":"REPL", "id":21829, "ctx":"ReplCoordExtern-1","msg":"Choosing new sync source because the user has requested a sync source","attr":{"syncSource":"core-storage104-dc2.example.com:27018"}}
{"t":{"$date":"2021-06-21T10:13:44.762+03:00"},"s":"I", "c":"REPL", "id":21150, "ctx":"ReplCoordExtern-1","msg":"Canceling oplog query due to OplogQueryMetadata. We have to choose a new sync source","attr":{"syncSource":"core-storage101.example.com:27017","lastAppliedOpTime":{"ts":{"$timestamp":{"t":1624259624,"i":1}},"t":18},"syncSourceIndex":-1}}
{"t":{"$date":"2021-06-21T10:13:44.763+03:00"},"s":"W", "c":"REPL", "id":21122, "ctx":"BackgroundSync","msg":"Fetcher stopped querying remote oplog with error","attr":{"error":"InvalidSyncSource: sync source core-storage101.example.com:27017 (config version: 19; last applied optime: { ts: Timestamp(1624259624, 1), t: 18 }; sync source index: -1; primary index: 0) is no longer valid"}}
{"t":{"$date":"2021-06-21T10:13:44.765+03:00"},"s":"I", "c":"REPL", "id":21080, "ctx":"BackgroundSync","msg":"Clearing sync source to choose a new one","attr":{"syncSource":"core-storage101.example.com:27017"}}
{"t":{"$date":"2021-06-21T10:13:44.765+03:00"},"s":"I", "c":"REPL", "id":21782, "ctx":"BackgroundSync","msg":"Choosing sync source candidate by request","attr":{"syncSource":"core-storage104-dc2.example.com:27018"}}
{"t":{"$date":"2021-06-21T10:13:44.765+03:00"},"s":"I", "c":"REPL", "id":21760, "ctx":"SyncSourceFeedback","msg":"SyncSourceFeedback error sending update","attr":{"syncTarget":"core-storage101.example.com:27017","error":{"code":119,"codeName":"InvalidSyncSource","errmsg":"Sync source was cleared. Was core-storage101.example.com:27017"}}}
{"t":{"$date":"2021-06-21T10:13:44.766+03:00"},"s":"I", "c":"CONNPOOL", "id":22576, "ctx":"ReplCoordExternNetwork","msg":"Connecting","attr":{"hostAndPort":"core-storage104-dc2.example.com:27018"}}
{"t":{"$date":"2021-06-21T10:13:44.772+03:00"},"s":"I", "c":"REPL", "id":21088, "ctx":"BackgroundSync","msg":"Changed sync source","attr":{"oldSyncSource":"core-storage101.example.com:27017","newSyncSource":"core-storage104-dc2.example.com:27018"}}
{"t":{"$date":"2021-06-21T10:13:44.778+03:00"},"s":"W", "c":"REPL", "id":21122, "ctx":"BackgroundSync","msg":"Fetcher stopped querying remote oplog with error","attr":{"error":"InvalidSyncSource: Sync source must be ahead of me. My last fetched oplog optime: { ts: Timestamp(1624259624, 1), t: 18 }, latest oplog optime of sync source: { ts: Timestamp(1624259624, 1), t: 18 }"}}
{"t":{"$date":"2021-06-21T10:13:44.778+03:00"},"s":"I", "c":"REPL", "id":21080, "ctx":"BackgroundSync","msg":"Clearing sync source to choose a new one","attr":{"syncSource":"core-storage104-dc2.example.com:27018"}}
{"t":{"$date":"2021-06-21T10:13:44.778+03:00"},"s":"I", "c":"REPL", "id":21798, "ctx":"BackgroundSync","msg":"Could not find member to sync from"}
{"t":{"$date":"2021-06-21T10:13:45.779+03:00"},"s":"I", "c":"NETWORK", "id":22944, "ctx":"conn259","msg":"Connection ended","attr":{"remote":"127.0.0.1:44031","connectionId":259,"connectionCount":6}}
{"t":{"$date":"2021-06-21T10:13:49.775+03:00"},"s":"I", "c":"REPL", "id":21760, "ctx":"SyncSourceFeedback","msg":"SyncSourceFeedback error sending update","attr":{"syncTarget":"core-storage104-dc2.example.com:27018","error":{"code":119,"codeName":"InvalidSyncSource","errmsg":"Sync source was cleared. Was core-storage104-dc2.example.com:27018"}}}
{"t":{"$date":"2021-06-21T10:13:55.785+03:00"},"s":"I", "c":"REPL", "id":21799, "ctx":"BackgroundSync","msg":"Sync source candidate chosen","attr":{"syncSource":"core-storage102.example.com:27017"}}
{"t":{"$date":"2021-06-21T10:13:55.785+03:00"},"s":"I", "c":"CONNPOOL", "id":22576, "ctx":"ReplCoordExternNetwork","msg":"Connecting","attr":{"hostAndPort":"core-storage102.example.com:27017"}}
{"t":{"$date":"2021-06-21T10:13:55.940+03:00"},"s":"I", "c":"REPL", "id":21088, "ctx":"BackgroundSync","msg":"Changed sync source","attr":{"oldSyncSource":"empty","newSyncSource":"core-storage102.example.com:27017"}}

Why is it getting an error?

"InvalidSyncSource: Sync source must be ahead of me. My last fetched oplog optime: { ts: Timestamp(1624259624, 1), t: 18 }, latest oplog optime of sync source: { ts: Timestamp(1624259624, 1), t: 18 }"

How can I change sync source if the replicas have the same oplog optime?



 Comments   
Comment by Edwin Zhou [ 20/Jul/21 ]

Hi kkrasnov@gmail.com,

After some discussion with our internal engineering team, we've identified that during steady state replication, it's probable that all secondaries are caught up. In that case, changing sync sources will hit this error: InvalidSyncSource: Sync source must be ahead of me. to avoid forming a topology with a sync source cycle.

We don't have any recommendations for manually changing a sync source to sync to a desired node during steady state replication. Given these circumstances, we believe this is working as designed.

Best,
Edwin

Comment by Edwin Zhou [ 12/Jul/21 ]

Hi kkrasnov@gmail.com,

Thanks for your patience while I investigate this issue. Selecting a syncSource is a temporary measure to override the default sync target for the current node. We should also expect the syncSource from a node to change from time to time depending on ping time and the state of other members' replication. You can find more information on how a sync source is selected in our docs.

It seems that manually selecting a syncSource is unpredictable and may benefit from some clarification on if and how we can select a desired sync source from a node. I will pass this along to the replication team for further investigation.

Best,
Edwin

Comment by Konstantin Krasnov [ 05/Jul/21 ]

Hi Edwin,

 

Please find attached files.

 

Best regards,

Konstantin

Comment by Edwin Zhou [ 30/Jun/21 ]

Hi kkrasnov@gmail.com,

Thanks for your report and for your patience while we investigate this issue. It seems you're hitting the condition where the desired sync source has an optime equal to the node's optime.
https://github.com/mongodb/mongo/blob/5bbadc66ed462aed3cc4f5635c5003da6171c25d/src/mongo/db/repl/oplog_fetcher.cpp#L876-L882

In order to investigate why the two nodes have the same oplog optime, can you please

  1. Attempt to change the sync source again so the diagnostic data that follows captures the rs.syncFrom() attempt?
  2. Archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) and upload them to this support uploader location?
  3. Provide the output of rs.status() around the time you attempt to change the sync source?
  4. Provide the output of rs.config()?

Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Best,
Edwin

Comment by Konstantin Krasnov [ 21/Jun/21 ]

core-storage104-dc2.example.com and core-storage106-dc2.example.com is hidden replicas.

Generated at Thu Feb 08 05:42:58 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.