[SERVER-63752] Make sure the shard merge never leaves an orphaned open backup cursor on donor. Created: 16/Feb/22  Updated: 03/Mar/23  Resolved: 03/Mar/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Suganthi Mani Assignee: [DO NOT USE] Backlog - Server Serverless (Inactive)
Resolution: Duplicate Votes: 0
Labels: shard-merge-milestone-3
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-57991 Architecture Guide updates for PM-2353 Closed
Duplicate
duplicates SERVER-74585 Ensure shard Merge recipient aborts c... Closed
Assigned Teams:
Serverless
Participants:

 Description   

Currently, we ignore errors returned by the kill backup cursor command. As a result, shard merge can fail to kill the backup cursor opened on the donor primary, for example because of a recipient primary shutdown or a transient network error. This is especially bad when a shard merge abort leaves an orphaned active backup cursor on the donor. We currently allow only one open backup cursor on a node at any time, so an orphaned cursor would cause an outage for a shard merge retry or for the backup service.



 Comments   
Comment by Suganthi Mani [ 17/Feb/22 ]

Additional Notes:
Since by default we kill a cursor that has been idle for more than 10 minutes, the outage is not that bad, but we need to check the following with cloud:
1) Does cloud use the default value (10 minutes) for the server parameter cursorTimeoutMillis?
2) If so, is a backup service outage of that length tolerable?

If 10 minutes is not tolerable, the fix I suggest is that the recipient retries the killCursors command on retryable errors, persists the backupCursorId in the state doc, and has the new recipient primary close the cursor. (Note that the recipient primary does not open the backup cursor under a session, so it is fine for the new recipient primary to take over responsibility for closing the backup cursor after a failover.)
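
The suggested fix can be sketched roughly as follows. This is only an illustration in Python, not the server implementation: StateDoc, RetryableError, record_backup_cursor, close_backup_cursor, and the kill_cursor callback are all hypothetical names standing in for the recipient state doc, a transient error, and the killCursors call to the donor. A real implementation would also back off between attempts.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class RetryableError(Exception):
    """Hypothetical stand-in for a transient failure such as a network error."""


@dataclass
class StateDoc:
    """Hypothetical recipient state doc; assumed to be persisted and replicated."""
    backup_cursor_id: Optional[int] = None


def record_backup_cursor(state: StateDoc, cursor_id: int) -> None:
    # Persist the id when the cursor is opened on the donor, so that a
    # new recipient primary can still find it after a failover.
    state.backup_cursor_id = cursor_id


def close_backup_cursor(
    state: StateDoc,
    kill_cursor: Callable[[int], None],
    max_attempts: int = 5,
) -> bool:
    """Kill the recorded donor backup cursor, retrying on retryable errors.

    Returns True once no cursor is recorded in the state doc, and False if
    every attempt failed with a retryable error. Other errors propagate
    to the caller instead of being ignored.
    """
    if state.backup_cursor_id is None:
        return True  # nothing was opened, or it was already closed
    for _ in range(max_attempts):
        try:
            kill_cursor(state.backup_cursor_id)
        except RetryableError:
            continue  # transient failure: try again
        # Clear the id only after the kill succeeded.
        state.backup_cursor_id = None
        return True
    return False
```

Because the cursor id lives in the state doc rather than in the memory of the old primary, a new recipient primary can call close_backup_cursor with the persisted state after stepping up and finish the cleanup.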

Generated at Thu Feb 08 05:58:35 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.