  1. Core Server
  2. SERVER-69551

Shard Merge recipient should retry opening the backup cursor if backupCursorCheckpointTimestamp is < startMigrationDonorTimestamp

    • Type: Task
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 6.2.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Sprint: Server Serverless 2022-09-19, Server Serverless 2022-10-03, Server Serverless 2022-10-17, Server Serverless 2022-10-31, Server Serverless 2022-11-14
    • 155

      Currently, we make a best effort to ensure that the opened backup cursor satisfies backupCursorCheckpointTimestamp >= startMigrationDonorTimestamp (see SERVER-69364 and SERVER-65084). However, there is another scenario in which this condition may transiently not hold. In such transient cases we shouldn't fail the entire migration; instead, we should close the backup cursor and retry opening a new one until backupCursorCheckpointTimestamp >= startMigrationDonorTimestamp is satisfied. The retry step must be interruptible by stepdown, shutdown, and forgetMigration.
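
      The retry behaviour described above could look roughly like the following. This is a minimal Python sketch, not the server's C++ implementation: BackupCursor, open_backup_cursor, retry_open_backup_cursor, MigrationInterrupted, and the use of a threading.Event to stand in for stepdown, shutdown, and forgetMigration are all hypothetical names chosen for illustration, and timestamps are modelled as plain integers.

```python
import threading
from dataclasses import dataclass
from typing import Callable


class MigrationInterrupted(Exception):
    """Hypothetical error for stepdown, shutdown, or forgetMigration."""


@dataclass
class BackupCursor:
    # Hypothetical stand-in for a backup cursor opened on the donor.
    checkpoint_timestamp: int
    closed: bool = False

    def close(self) -> None:
        self.closed = True


def retry_open_backup_cursor(
    open_backup_cursor: Callable[[], BackupCursor],
    start_migration_donor_timestamp: int,
    interrupt: threading.Event,
    retry_delay_secs: float = 0.1,
) -> BackupCursor:
    """Open backup cursors until one has a checkpoint timestamp that is
    >= start_migration_donor_timestamp, or until `interrupt` is set."""
    while True:
        if interrupt.is_set():
            raise MigrationInterrupted("interrupted before opening backup cursor")

        cursor = open_backup_cursor()
        if cursor.checkpoint_timestamp >= start_migration_donor_timestamp:
            return cursor

        # Checkpoint is too old: close this cursor instead of failing the
        # whole migration, then try again after a short delay.
        cursor.close()

        # Event.wait returns True if the event was set while waiting, so the
        # delay itself can be cut short by an interruption.
        if interrupt.wait(retry_delay_secs):
            raise MigrationInterrupted("interrupted while waiting to retry")
```

      Waiting on the event rather than sleeping is a deliberate choice in this sketch: it lets an interruption end the delay between attempts immediately instead of only being noticed at the next loop iteration.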

      Transient failure scenario:
      1) The donor primary, N1, sends the recipientSyncData command. (Before sending it, we ensure that any new backup cursor opened on that node will have a checkpoint timestamp >= startMigrationDonorTimestamp.)
      2) The recipient starts the migration process. But before it establishes a connection to the donor primary, N1 fails over and a new donor primary, N2, is elected.
      3) N2's PrimaryOnlyService (POS) waits for the last opTime to be majority committed before starting the tenant migration donor instances on that node.
      4) The recipient tries to establish a connection to the donor primary (N2), because the read preference for shard merge is "primaryOnly".
      5) As a result, when the recipient opens a backup cursor on the donor primary (N2), the checkpoint timestamp may not satisfy >= startMigrationDonorTimestamp for a brief period, until the donor instance on N2 completes the step noted in 1). After SERVER-69299, we throw an error in that case.
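
      The window in step 5 can be illustrated with a toy model. This is only a sketch under stated assumptions: DonorNode, take_checkpoint, cursor_is_usable, and the integer timestamps are hypothetical, and modelling "the donor instance completes this step" as taking a newer checkpoint is an assumption made for illustration, not a description of the server's actual mechanism.

```python
from dataclasses import dataclass


@dataclass
class DonorNode:
    # Hypothetical model of a donor replica set member.
    name: str
    checkpoint_timestamp: int  # timestamp of the node's latest checkpoint

    def take_checkpoint(self, at_timestamp: int) -> None:
        # Assumption: checkpoint timestamps only move forward.
        self.checkpoint_timestamp = max(self.checkpoint_timestamp, at_timestamp)

    def open_backup_cursor(self) -> int:
        # In this model a backup cursor simply reports the latest checkpoint.
        return self.checkpoint_timestamp


def cursor_is_usable(backup_cursor_checkpoint_timestamp: int,
                     start_migration_donor_timestamp: int) -> bool:
    # The condition from the ticket.
    return backup_cursor_checkpoint_timestamp >= start_migration_donor_timestamp


START_MIGRATION_DONOR_TIMESTAMP = 100

# Step 1: N1 already guarantees the condition before sending recipientSyncData.
n1 = DonorNode("N1", checkpoint_timestamp=100)

# Steps 2-3: N1 fails over; N2 is primary but its donor instance has not yet
# re-established the guarantee.
n2 = DonorNode("N2", checkpoint_timestamp=90)

# Steps 4-5: the recipient opens a backup cursor on N2 inside the window.
usable_during_window = cursor_is_usable(
    n2.open_backup_cursor(), START_MIGRATION_DONOR_TIMESTAMP)

# Later, the donor instance on N2 completes its step (modelled as a checkpoint).
n2.take_checkpoint(at_timestamp=START_MIGRATION_DONOR_TIMESTAMP)
usable_after_window = cursor_is_usable(
    n2.open_backup_cursor(), START_MIGRATION_DONOR_TIMESTAMP)
```

      In this model the first check fails and the second succeeds, which is why a retry is expected to recover from the transient case where an immediate error would fail the migration.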

            Assignee:
            mathis.bessa@mongodb.com Mathis Bessa (Inactive)
            Reporter:
            suganthi.mani@mongodb.com Suganthi Mani
            Votes:
            0
            Watchers:
            3
