Core Server / SERVER-58398

Tenant migration hung indefinitely

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker - P1
    • Fix Version/s: 5.0.1, 5.1.0-rc0
    • Affects Version/s: 5.0.0-rc7
    • Component/s: Replication
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Backport Requested: v5.0
    • Steps To Reproduce:

      Hard to say here exactly. The test included two MTMs with ~50 tenants each. One tenant was generating load significant enough to trigger auto-scaling, with the other 50 tenants on the donor generating minimal load. 5 migrations were issued for the least active tenants on the MTM, all with the same donor and recipient. 4 completed; 1 ended up in a "hung" state from the perspective of MMS.
    • Sprint: Repl 2021-07-26

      During serverless load testing, 5 tenant migrations were issued as a result of an auto-scaling round. 4 of the 5 completed successfully (although they took ~7 hours to complete for a few MiB of data, with minimal activity for those specific tenants). One migration (tenant ID 60e4cf90ec86b15c50ab87b4, migration ID 62894dd9-aa8f-46b6-aeb8-aa30c6dc7359) seemed to hang indefinitely (~13 hours) and ended up in FAILED_MIGRATION_CLEANUP_IN_PROGRESS.
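
      In case it helps with triage, here is a minimal sketch (assuming direct access to the donor and recipient primaries, and assuming the 5.0 internal state collections config.tenantMigrationDonors and config.tenantMigrationRecipients) of how the stuck migration's state documents could be inspected:

      # Sketch only: dump the tenant migration state documents for the stuck
      # tenant on both sides. Connection strings are placeholders; the collection
      # names are assumed to be the 5.0 internal state collections.
      from pymongo import MongoClient

      TENANT_ID = "60e4cf90ec86b15c50ab87b4"

      donor = MongoClient("mongodb://donor-primary.example.com:27017")          # placeholder
      recipient = MongoClient("mongodb://recipient-primary.example.com:27017")  # placeholder

      for label, client, coll in [
          ("donor", donor, "tenantMigrationDonors"),
          ("recipient", recipient, "tenantMigrationRecipients"),
      ]:
          # Each state document carries the migration UUID, the current state
          # (e.g. committed/aborted), and an abort reason when one exists.
          for doc in client["config"][coll].find({"tenantId": TENANT_ID}):
              print(label, doc)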

      I will try to reproduce and gather artifacts that will paint a clearer picture. What artifacts exactly would be needed/desired?

      In the meantime, here are the mongod logs for the donor and recipient. Note, there was a rolling restart during the course of the migrations, so i've attached both donor primary logs covering the period:

      • donor-new-primary-stuck-on-aborting-index-builds-atlas-ysf0ds-shard-00-01.6oxx1.mmscloudteam.com_2021-07-07T02_30_00_2021-07-07T15_00_00_mongodb.log.gz
        • after the migration started, this node was selected as primary
      • donor-proxy-original-primary-stuck-on-aborting-index-builds-atlas-ysf0ds-shard-00-02.6oxx1.mmscloudteam.com_2021-07-07T02_30_00_2021-07-07T15_00_00_mongodb.log.gz
        • this is the proxy instance noted in the tenant migration document and the original primary
      • recipient-primary-stuck-on-aborting-index-builds-atlas-cqwy0o-shard-00-02.6oxx1.mmscloudteam.com_2021-07-07T02_30_00_2021-07-07T15_00_00_mongodb.log.gz
        • this was the primary for the duration of the test
      • recipient-proxy-stuck-on-aborting-index-builds-atlas-cqwy0o-shard-00-00.6oxx1.mmscloudteam.com_2021-07-07T02_30_00_2021-07-07T15_00_00_mongodb.log.gz
        • this is the proxy instance noted in the tenant migration document
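
      For convenience, a rough sketch (assuming the 4.4+ structured JSON log format of these mongod logs) for pulling out just the lines that mention the stuck tenant or its migration ID:

      # Sketch only: filter the attached (gzipped, JSON-per-line) mongod logs for
      # entries mentioning the stuck tenant or migration. The file name below is a
      # placeholder for any of the attachments listed above.
      import gzip
      import json

      TENANT_ID = "60e4cf90ec86b15c50ab87b4"
      MIGRATION_ID = "62894dd9-aa8f-46b6-aeb8-aa30c6dc7359"
      LOG = "donor-new-primary-stuck-on-aborting-index-builds-...-mongodb.log.gz"  # placeholder

      with gzip.open(LOG, "rt") as f:
          for line in f:
              if TENANT_ID not in line and MIGRATION_ID not in line:
                  continue
              try:
                  entry = json.loads(line)
              except json.JSONDecodeError:
                  continue  # skip any non-JSON lines
              # t = timestamp, c = component, id = log ID, msg = message
              print(entry.get("t", {}).get("$date"), entry.get("c"), entry.get("id"), entry.get("msg"))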

      Rough timeline courtesy of tomer.yakir:

      Donor:

      • 02:55 - server restarted
      • 02:58 - some migration-related data following stepUp
      • 02:59 - oplog fetcher started for the migration
      • 04:08 - server was slow
      • 09:57 - some migrations finished
      • 09:58 - received forgetMigration (see the sketch after the recipient timeline)

      Recipient:

      • 02:48 - migrations started
      • 02:59 - short read error
      • 09:48 - migrations committed
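
      For context on the 09:58 entry above: donorForgetMigration is the command the control plane sends so the donor can mark its state document for garbage collection. A hypothetical sketch of issuing it by hand, assuming the command shape is {donorForgetMigration: 1, migrationId: <UUID>} run against the admin database on the donor primary:

      # Hypothetical sketch only: donorForgetMigration is normally issued by MMS,
      # and the command shape below is an assumption based on the behavior
      # described in this ticket, not a documented public API.
      import uuid

      from bson.binary import Binary, UuidRepresentation
      from pymongo import MongoClient

      MIGRATION_ID = uuid.UUID("62894dd9-aa8f-46b6-aeb8-aa30c6dc7359")

      donor = MongoClient("mongodb://donor-primary.example.com:27017")  # placeholder
      result = donor.admin.command(
          "donorForgetMigration",
          1,
          migrationId=Binary.from_uuid(MIGRATION_ID, UuidRepresentation.STANDARD),
      )
      print(result)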

            Assignee: Jason Chan (jason.chan@mongodb.com)
            Reporter: Gregory Banks (greg.banks@mongodb.com)
            Votes: 0
            Watchers: 26

              Created:
              Updated:
              Resolved: