[SERVER-69551] Shard Merge recipient should retry opening the backup cursor if backupCursorCheckpointTimestamp is < startMigrationDonorTimestamp Created: 09/Sep/22  Updated: 29/Oct/23  Resolved: 01/Nov/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 6.2.0-rc0

Type: Task Priority: Major - P3
Reporter: Suganthi Mani Assignee: Mathis Bessa
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Problem/Incident
Backwards Compatibility: Fully Compatible
Sprint: Server Serverless 2022-09-19, Server Serverless 2022-10-03, Server Serverless 2022-10-17, Server Serverless 2022-10-31, Server Serverless 2022-11-14
Participants:
Linked BF Score: 155

 Description   

Currently, we make a best effort to ensure that the backup cursor the recipient opens satisfies backupCursorCheckpointTimestamp >= startMigrationDonorTimestamp (see SERVER-69364 and SERVER-65084). However, there is another scenario in which this condition may transiently not hold. In such transient cases, we shouldn't fail the entire migration; instead, we should close the backup cursor and retry opening a new one until backupCursorCheckpointTimestamp >= startMigrationDonorTimestamp is satisfied. The retry step must be interruptible by stepdown, shutdown, and forgetMigration.

Transient failure scenario:
1) Donor primary, N1, sends the recipientSyncData command. (Before sending it, we ensure that any new backup cursor opened on that node will have checkpoint Ts >= startMigrationDonorTimestamp.)
2) The recipient starts the migration process. But before it establishes a connection to the donor primary, N1 fails over and a new donor primary, N2, is elected.
3) N2's POS waits for the last opTime to be majority committed before starting the tenant migration donor instances on that node.
4) The recipient tries to establish a connection to the donor primary (N2), because the read preference for shard merge is "primaryOnly".
5) So when the recipient opens a backup cursor on the donor primary (N2), there is a brief window, until the donor instance completes this step, in which checkpoint Ts >= startMigrationDonorTimestamp may not be satisfied. After SERVER-69299, we throw an error in that case.
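
The proposed behavior (close the stale cursor, retry, and stay interruptible) can be sketched roughly as below. This is an illustrative Python model only, not the server's C++ implementation: BackupCursor, open_cursor_with_retry, MigrationInterrupted, the open_backup_cursor callable, and the retry delay are all hypothetical names and values chosen for the sketch, and a single threading.Event stands in for stepdown, shutdown, and forgetMigration.

```python
import threading
from dataclasses import dataclass
from typing import Callable


class MigrationInterrupted(Exception):
    """Hypothetical error: the retry was cancelled (stepdown, shutdown, forgetMigration)."""


@dataclass
class BackupCursor:
    # Hypothetical stand-in for a backup cursor opened on the donor primary.
    checkpoint_timestamp: int
    closed: bool = False

    def close(self) -> None:
        self.closed = True


def open_cursor_with_retry(
    open_backup_cursor: Callable[[], BackupCursor],
    start_migration_donor_timestamp: int,
    interrupted: threading.Event,
    retry_delay_secs: float = 0.1,  # assumed value, not taken from the ticket
) -> BackupCursor:
    """Reopen the backup cursor until its checkpoint timestamp is recent enough."""
    while True:
        # Check for interruption before each attempt.
        if interrupted.is_set():
            raise MigrationInterrupted("retry cancelled")

        cursor = open_backup_cursor()
        if cursor.checkpoint_timestamp >= start_migration_donor_timestamp:
            return cursor

        # Checkpoint is older than the migration start: close and try again
        # instead of failing the whole migration.
        cursor.close()

        # Event.wait doubles as an interruptible sleep; it returns True if the
        # event was set while waiting.
        if interrupted.wait(retry_delay_secs):
            raise MigrationInterrupted("retry cancelled")
```

Using an event wait rather than a plain sleep means a cancellation is noticed during the back-off as well as between attempts.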



 Comments   
Comment by Githook User [ 01/Nov/22 ]

Author: mathisbessamdb <mathis.bessa@mongodb.com>

Message: SERVER-69551 Shard Merge recipient should retry opening the backup cursor if backupCursorCheckpointTimestamp is < startMigrationDonorTimestamp
Branch: master
https://github.com/mongodb/mongo/commit/6f8af2868adc224f762555b8132992cd421ee6fb

Comment by Githook User [ 13/Oct/22 ]

Author: mathisbessamdb <mathis.bessa@mongodb.com>

Message: Revert "SERVER-69551 Shard Merge recipient should retry opening the backup cursor if backupCursorCheckpointTimestamp is < startMigrationDonorTimestamp"

This reverts commit 14c95154f87f06dc36471591b1006ccf9eadb45c.
Branch: master
https://github.com/mongodb/mongo/commit/3ac37f174e6ab1df054ec5425d17f7a723526142

Comment by Githook User [ 10/Oct/22 ]

Author: mathisbessamdb <mathis.bessa@mongodb.com>

Message: SERVER-69551 Shard Merge recipient should retry opening the backup cursor if backupCursorCheckpointTimestamp is < startMigrationDonorTimestamp
Branch: master
https://github.com/mongodb/mongo/commit/14c95154f87f06dc36471591b1006ccf9eadb45c

Generated at Thu Feb 08 06:13:46 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.