Core Server / SERVER-53070

Allow back to back tenant migrations for retries

    • Fully Compatible
    • Repl 2020-12-14
    • 15

      I think we agree that:
       
      If a donorStartMigration encounters a conflicting migration that is not yet marked as garbage collectable, the donorStartMigration should fail.
       
      The question is what to do for:
       
      If the donorStartMigration encounters a conflicting migration that is marked as garbage collectable.
       
      I think the options are:

      • The donorStartMigration should fail (current).
      • The donorStartMigration should immediately garbage collect the old migration and start the new one.
        • If the new migration has a different migrationId but is for the same tenant:
          • A delayed donorStartMigration from the first migration will get ConflictingOperationInProgress, which should be harmless, since Cloud shouldn't care about the response anymore.
          • A delayed donorForgetMigration from the first migration will get NoSuchTenantMigration, which should also be harmless, since Cloud shouldn't care about the response anymore.
        • If the new migration has the same migrationId but is for a different tenant:
          • This is not a legal thing for Cloud to do, so we can say the behavior is undefined.
      • Allow donorForgetMigration to take a "garbageCollectImmediately: true" flag that Cloud should use if they want to retry a migration quickly.
        • This is only best-effort: donorForgetMigration could garbage collect the state, a delayed retry of the first donorStartMigration could then restart the first migration, and when Cloud tries to start the second migration it would still fail because there is a conflicting active migration.

      I think the second option is most practical, since it's the least amount of work for Cloud and has harmless side-effects.
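
The second option can be sketched as a small in-memory model. This is only an illustration in Python, not the server implementation: the `Donor` class, its method names, and the choice to treat a repeated donorStartMigration with the same migrationId as an idempotent retry are all assumptions made for the example. Only the two error names and the garbage collectable flag come from the discussion above.

```python
from dataclasses import dataclass
from typing import Dict


class ConflictingOperationInProgress(Exception):
    pass


class NoSuchTenantMigration(Exception):
    pass


@dataclass
class Migration:
    migration_id: str
    tenant_id: str
    garbage_collectable: bool = False


class Donor:
    """Hypothetical toy donor that tracks at most one migration per tenant."""

    def __init__(self) -> None:
        self.by_tenant: Dict[str, Migration] = {}

    def donor_start_migration(self, migration_id: str, tenant_id: str) -> Migration:
        existing = self.by_tenant.get(tenant_id)
        if existing is not None:
            if existing.migration_id == migration_id:
                # Assumption: a retry of the same migration is idempotent.
                return existing
            if not existing.garbage_collectable:
                # Conflicting migration that is not yet garbage collectable: fail.
                raise ConflictingOperationInProgress(tenant_id)
            # Option 2: garbage collect the old migration immediately.
            del self.by_tenant[tenant_id]
        migration = Migration(migration_id, tenant_id)
        self.by_tenant[tenant_id] = migration
        return migration

    def donor_forget_migration(self, migration_id: str) -> None:
        for migration in self.by_tenant.values():
            if migration.migration_id == migration_id:
                migration.garbage_collectable = True
                return
        raise NoSuchTenantMigration(migration_id)
```

In this model, once a second migration for the same tenant has replaced the first, a delayed donorStartMigration for the first gets ConflictingOperationInProgress and a delayed donorForgetMigration for the first gets NoSuchTenantMigration, matching the two "harmless" cases listed under the second option. The model has no notion of failover, so it does not capture any behavior that depends on the donor failing over between steps.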

      EDIT:
      In the end, we decided to remove the TenantMigrationAccessBlocker (mtab) entry when we mark an aborted migration document as garbage collectable. The problem with the second option is that if we immediately garbage collect the old state doc and insert the new one, the donor could fail over in between. In that case, we could lose the old state doc without having inserted the new one. If a delayed donorStartMigration from the first migration then comes in, we could mistakenly start a migration. So the property we are maintaining instead is "aborted garbage collectable documents do not have a TenantMigrationAccessBlocker entry". To maintain this property, we need to:
      1. Remove the mtab entry when we mark an aborted document as garbage collectable.
      2. Avoid creating the mtab entry for aborted garbage collectable documents when recovering mtabs on startup or after rollback.
      3. Remove the op observer onDelete code that deletes the mtab entry for an aborted state doc, since this entry will already have been deleted.
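
The three steps above can be sketched as a toy model. This is a hedged Python illustration, not the server code: `DonorModel`, `StateDoc`, the state strings, and the tenant-keyed `mtabs` dictionary are hypothetical names chosen for the example, and the handling of non-aborted documents in `on_delete` is an assumption, since the ticket only describes the aborted case.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class StateDoc:
    tenant_id: str
    state: str  # e.g. "started", "committed", "aborted" (illustrative values)
    garbage_collectable: bool = False


class DonorModel:
    """Hypothetical model of state docs plus the in-memory access blocker registry."""

    def __init__(self) -> None:
        self.state_docs: Dict[str, StateDoc] = {}  # migration id -> state doc
        self.mtabs: Dict[str, str] = {}  # tenant id -> migration id

    def add(self, migration_id: str, doc: StateDoc) -> None:
        self.state_docs[migration_id] = doc
        self.mtabs[doc.tenant_id] = migration_id

    def mark_garbage_collectable(self, migration_id: str) -> None:
        doc = self.state_docs[migration_id]
        doc.garbage_collectable = True
        if doc.state == "aborted":
            # Step 1: drop the entry as soon as an aborted doc becomes collectable.
            self.mtabs.pop(doc.tenant_id, None)

    def recover_mtabs(self) -> None:
        # Models rebuilding the registry on startup or after rollback.
        self.mtabs.clear()
        for migration_id, doc in self.state_docs.items():
            if doc.state == "aborted" and doc.garbage_collectable:
                continue  # Step 2: never recreate the entry for these docs.
            self.mtabs[doc.tenant_id] = migration_id

    def on_delete(self, migration_id: str) -> None:
        doc = self.state_docs.pop(migration_id)
        if doc.state == "aborted":
            return  # Step 3: nothing to remove; step 1 already did it.
        # Assumption: non-aborted docs still clean up their own entry here.
        if self.mtabs.get(doc.tenant_id) == migration_id:
            del self.mtabs[doc.tenant_id]
```

One consequence visible in the model: because deleting an aborted state doc no longer touches the registry, a later deletion of the old document cannot remove the entry that a newer migration for the same tenant has since created.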

            Assignee:
            vishnu.kaushik@mongodb.com Vishnu Kaushik
            Reporter:
            judah.schvimer@mongodb.com Judah Schvimer
