[SERVER-50111] Secondary stuck with old sync source for reporting up to 30 seconds Created: 04/Aug/20 Updated: 06/Dec/22 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Lingzhi Deng | Assignee: | Backlog - Replication Team |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | former-quick-wins | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Assigned Teams: |
Replication
|
||||||||
| Case: | (copied to CRM) | ||||||||
| Description |
|
EDIT: Cancelling the reporter when the sync source changes is a more robust solution. syncSourceFeedbackNetworkTimeoutSecs is currently hardcoded to 30s. During a network partition, the sync source feedback reporter may wait up to 30s for the replSetUpdatePosition remote command against the old sync source to time out, even though the node has already selected a new sync source. This can cause the majority commit point to lag after failovers. One idea is to set syncSourceFeedbackNetworkTimeoutSecs to the feedback reporter's interval (optionally plus a buffer). Another idea is to hardcode syncSourceFeedbackNetworkTimeoutSecs to a smaller value, because we don't generally expect replSetUpdatePosition to block and the current 30s seems too long for a socket timeout. |
| Comments |
| Comment by Lingzhi Deng [ 05/Aug/20 ] |
|
Cool, I think you are right. I added your comment to the description as well. |
| Comment by Siyuan Zhou [ 05/Aug/20 ] |
|
I think cancelling the reporter when the sync source changes is a more robust solution. In the longer term we need to minimize unnecessary waiting time to improve planned maintenance, so cancelling the request immediately makes more sense. |
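
To illustrate the difference between waiting for the socket timeout and cancelling on a sync source change, here is a minimal Python asyncio sketch. It is not MongoDB code: `FeedbackReporter`, `on_sync_source_changed`, and `measure_stall` are hypothetical names, the partitioned node is modeled as a request that never answers, and the 30s timeout is scaled down to 0.5s so the demo finishes quickly.

```python
import asyncio


class FeedbackReporter:
    """Toy stand-in for a sync source feedback reporter (hypothetical)."""

    def __init__(self, network_timeout_secs: float):
        self.network_timeout_secs = network_timeout_secs
        self._in_flight = None  # task for the current position update, if any

    async def _send_update_position(self, sync_source: str, reachable: bool) -> str:
        # A partitioned sync source never answers; model that as a very long sleep.
        await asyncio.sleep(0.01 if reachable else 3600)
        return f"ok from {sync_source}"

    async def report(self, sync_source: str, reachable: bool) -> str:
        self._in_flight = asyncio.ensure_future(
            self._send_update_position(sync_source, reachable)
        )
        try:
            # Without cancellation, this blocks until the network timeout fires.
            return await asyncio.wait_for(self._in_flight, self.network_timeout_secs)
        except asyncio.TimeoutError:
            return "timed out"
        except asyncio.CancelledError:
            # The in-flight request was cancelled by on_sync_source_changed().
            return "cancelled"
        finally:
            self._in_flight = None

    def on_sync_source_changed(self) -> None:
        # Proposed behavior: drop the request to the old sync source immediately.
        if self._in_flight is not None:
            self._in_flight.cancel()


async def measure_stall(cancel_on_change: bool) -> tuple:
    """Return (outcome, seconds the reporter stayed stuck on the old source)."""
    reporter = FeedbackReporter(network_timeout_secs=0.5)  # scaled-down 30s
    loop = asyncio.get_running_loop()
    start = loop.time()
    report = asyncio.ensure_future(reporter.report("old-source", reachable=False))
    await asyncio.sleep(0.05)  # the node selects a new sync source shortly after
    if cancel_on_change:
        reporter.on_sync_source_changed()
    outcome = await report
    return outcome, loop.time() - start


if __name__ == "__main__":
    for cancel in (False, True):
        outcome, stalled = asyncio.run(measure_stall(cancel))
        print(f"cancel_on_change={cancel}: {outcome} after {stalled:.2f}s")
```

In this model the stall without cancellation is bounded by the network timeout, while with cancellation it is bounded by how quickly the sync source change is noticed, which is the property the comments above argue for.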