[SERVER-36045] unexpected "re-evaluating sync source" Created: 10/Jul/18  Updated: 06/Dec/22  Resolved: 16/Jul/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.2.10
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: zhenyipeng Assignee: Backlog - Triage Team
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Server Triage
Operating System: ALL
Steps To Reproduce:
  1. Create a MongoDB replicaset
  2. Do some write requests to the replicaset
  3. Wait a long time ( bigger than 30 seconds )
  4. Do some write requests again, and the secondary nodes may re-evaluate sync source 
Participants:

 Description   

I got this warning in the log file:

"Fri Jun 29 22:00:02.585 I REPL [ReplicationExecutor] re-evaluating sync source because our current sync source's most recent OpTime is (term: 2, timestamp: Jun 29 21:50:07:1) which is more than 30s behind member xxx.xxx.xxx.xxx:7005 whose most recent OpTime is (term: 2, timestamp: Jun 29 22:00:01:1)"

The write majority request is hung up and timeout when the two secondaries are both doing "re-evaluating sync source"

The current sync source is primary node, so I think its "OpTime" should not behind the secondary nodes.

 



 Comments   
Comment by Ramon Fernandez Marina [ 16/Jul/18 ]

Thanks for your report zhenyipeng, sync source reevaluation is expected behavior and not indicative of a bug in and on itself.

Please note that the SERVER project is for reporting bugs or feature suggestions for the MongoDB server. For MongoDB-related support discussion please post on the mongodb-user group or Stack Overflow with the mongodb tag, where your question will reach a larger audience.]]

For questions about developing MongoDB please post on the mongodb-dev group. See also our Technical Support page for additional support resources.

Regards,
Ramón.

Comment by zhenyipeng [ 10/Jul/18 ]

I read the code of TopologyCoordinatorImpl::shouldChangeSyncSource function in src/mongo/db/repl/topology_coordinator_impl.cpp

The currentSourceOpTime is the max value between  "syncSourceLastOpTime" and "_hbdata[currentSourceIndex].getAppliedOpTime()"

But the "syncSourceLastOpTime" is always (term: -1, timestamp: Jan 1 08:00:00:0), so it is not effective.

So the the OpTime comparison between current sync source node and the other node depends on OpTime in heartbeat time.  But the heartbeat time is not a real-time value as it updates every 2 seconds, so I think here is the problem.

I  replace the "syncSourceLastOpTime" by "_lastOpTimeFetched" to solve this problem temporarily.

Can I do something to make "syncSourceLastOpTime" effective?

Generated at Thu Feb 08 04:41:51 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.