[SERVER-50111] Secondary stuck with old sync source for reporting up to 30 seconds Created: 04/Aug/20  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Lingzhi Deng Assignee: Backlog - Replication Team
Resolution: Unresolved Votes: 1
Labels: former-quick-wins
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
Related
Assigned Teams:
Replication
Participants:
Case:

 Description   

EDIT: Cancelling the reporter on changing sync source is a more robust solution.

syncSourceFeedbackNetworkTimeoutSecs is currently hardcoded to 30s. During a network partition, the sync source feedback reporter may wait up to 30s for the replSetUpdatePosition remote command against the old sync source to time out, even though the node has already selected a new sync source. This can cause majority commit point lag after failovers. One idea is to set syncSourceFeedbackNetworkTimeoutSecs to the feedback reporter's interval (optionally plus a buffer). Another idea is to hardcode syncSourceFeedbackNetworkTimeoutSecs to a smaller value, because we don't generally expect replSetUpdatePosition to block and the current 30s seems too long for a socket timeout.
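
The first idea above (timeout equal to the reporter interval, optionally plus a buffer) could be sketched as follows. This is an illustration only, not server code: the function names, the 5s interval, and the 1s buffer are hypothetical values chosen for the example; only the 30s figure comes from this ticket.

```python
# Illustrative sketch only; none of these names exist in the MongoDB code base.

CURRENT_TIMEOUT_SECS = 30.0  # the hardcoded value described in this ticket


def feedback_timeout_secs(reporter_interval_secs, buffer_secs=0.0):
    """Derive the network timeout from the reporter interval plus a buffer."""
    if reporter_interval_secs <= 0:
        raise ValueError("reporter interval must be positive")
    if buffer_secs < 0:
        raise ValueError("buffer must not be negative")
    return reporter_interval_secs + buffer_secs


def stall_reduction_secs(reporter_interval_secs, buffer_secs=0.0):
    """How much shorter the worst-case stall on a dead sync source becomes.

    Without cancellation, one blocked request can hold the reporter for the
    full timeout, so the worst-case stall equals the timeout itself.
    """
    proposed = feedback_timeout_secs(reporter_interval_secs, buffer_secs)
    return CURRENT_TIMEOUT_SECS - proposed
```

With the assumed 5s interval and 1s buffer, the derived timeout is 6s, so a request stuck on the old sync source would be abandoned 24s sooner than with the current hardcoded value.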



 Comments   
Comment by Lingzhi Deng [ 05/Aug/20 ]

Cool, I think you are right. I added your comment to the description as well.

Comment by Siyuan Zhou [ 05/Aug/20 ]

I think cancelling the reporter on changing sync source is a more robust solution. We need to minimize unnecessary waiting time in the longer term to improve planned maintenance, so cancelling the request immediately makes more sense.
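
A minimal sketch of the cancellation approach, written as a toy asyncio model rather than actual server code. The Reporter class, its method names, and the simulated never-answering request are all hypothetical; the point is only that cancelling the in-flight request on a sync source change returns control immediately instead of after the 30s timeout.

```python
import asyncio


class Reporter:
    """Toy stand-in for the feedback reporter; not MongoDB's actual class."""

    def __init__(self, timeout_secs):
        self.timeout_secs = timeout_secs
        self._inflight = None  # the in-flight request task, if any

    async def _send_update_position(self, source):
        # Simulate a partitioned sync source: the request never answers.
        await asyncio.sleep(3600)

    async def report(self, source):
        """Send one update; return 'ok', 'timeout', or 'cancelled'."""
        task = asyncio.ensure_future(self._send_update_position(source))
        self._inflight = task
        try:
            done, _ = await asyncio.wait({task}, timeout=self.timeout_secs)
            if not done:
                task.cancel()  # give up after the network timeout
                return "timeout"
            return "cancelled" if task.cancelled() else "ok"
        finally:
            self._inflight = None

    def on_sync_source_changed(self):
        """Cancel any request still pending against the old sync source."""
        if self._inflight is not None and not self._inflight.done():
            self._inflight.cancel()


async def demo():
    reporter = Reporter(timeout_secs=30.0)
    loop = asyncio.get_running_loop()
    start = loop.time()
    pending = asyncio.ensure_future(reporter.report("old-source"))
    await asyncio.sleep(0.05)  # a new sync source is selected shortly after
    reporter.on_sync_source_changed()
    outcome = await pending
    return outcome, loop.time() - start
```

Running `asyncio.run(demo())` returns the outcome "cancelled" together with an elapsed time far below the 30s timeout, because the pending request is dropped as soon as the sync source changes rather than being left to time out.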
