[SERVER-63752] Make sure the shard merge never leaves an orphaned open backup cursor on donor. Created: 16/Feb/22 Updated: 03/Mar/23 Resolved: 03/Mar/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Suganthi Mani | Assignee: | [DO NOT USE] Backlog - Server Serverless (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | shard-merge-milestone-3 | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: | |
| Assigned Teams: | Serverless |
| Participants: | |
| Description |
|
Currently, we ignore errors returned by the kill backup cursor command. As a result, shard merge can fail to kill the backup cursor opened on the donor primary, either because the recipient primary shut down or because of a transient network error. This is especially bad when an orphaned active backup cursor is left on the donor after a shard merge abort. Currently, we allow only one open backup cursor on a node at any point in time, so an orphaned cursor would cause a shard merge retry or the backup service to fail, resulting in an outage. |
| Comments |
| Comment by Suganthi Mani [ 17/Feb/22 ] |
|
Additional Notes: If the 10 minutes is not tolerable, the fix I am suggesting is that the recipient should retry the killCursors command on retryable errors, persist the backupCursorId in the state doc, and have the new recipient primary close the cursor. (Note that the recipient primary opens the backup cursor outside of a session, so it is fine for the new recipient primary to take responsibility for closing the backup cursor after a failover.) |
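
The suggestion above could be sketched roughly as follows. This is only an illustration in Python, not server code: `RetryableError`, `StateDoc`, `kill_backup_cursor`, and the `kill_cursors` callback are all hypothetical names standing in for the retryable-error classification, the recipient state doc, and the killCursors command sent to the donor.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class RetryableError(Exception):
    """Hypothetical stand-in for a transient failure (e.g. a network error)."""


@dataclass
class StateDoc:
    """Hypothetical stand-in for the recipient's persisted state doc."""
    backup_cursor_id: Optional[int] = None


def kill_backup_cursor(state: StateDoc,
                       kill_cursors: Callable[[int], None],
                       max_attempts: int = 3) -> bool:
    """Try to close the persisted backup cursor; return True once it is closed."""
    if state.backup_cursor_id is None:
        return True  # no cursor recorded, nothing to clean up
    for _ in range(max_attempts):
        try:
            kill_cursors(state.backup_cursor_id)
        except RetryableError:
            continue  # retry instead of silently ignoring the error
        state.backup_cursor_id = None  # record that the cursor is closed
        return True
    # The cursor id stays in the state doc, so a new recipient primary
    # can call this function again after failover.
    return False
```

Because the cursor id is cleared only after a successful kill, whichever node is recipient primary can rerun the same cleanup from the persisted state.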