[SERVER-58398] Tenant migration hung indefinitely Created: 09/Jul/21  Updated: 29/Oct/23  Resolved: 13/Jul/21

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 5.0.0-rc7
Fix Version/s: 5.0.1, 5.1.0-rc0

Type: Bug Priority: Blocker - P1
Reporter: Gregory Banks Assignee: Jason Chan
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File donor-new-primary-stuck-on-aborting-index-builds-atlas-ysf0ds-shard-00-01.6oxx1.mmscloudteam.com_2021-07-07T02_30_00_2021-07-07T15_00_00_mongodb.log-1.gz     File donor-proxy-original-primary-stuck-on-aborting-index-builds-atlas-ysf0ds-shard-00-02.6oxx1.mmscloudteam.com_2021-07-07T02_30_00_2021-07-07T15_00_00_mongodb.log.gz     File recipient-primary-stuck-on-aborting-index-builds-atlas-cqwy0o-shard-00-02.6oxx1.mmscloudteam.com_2021-07-07T02_30_00_2021-07-07T15_00_00_mongodb.log-1.gz     File recipient-proxy-stuck-on-aborting-index-builds-atlas-cqwy0o-shard-00-00.6oxx1.mmscloudteam.com_2021-07-07T02_30_00_2021-07-07T15_00_00_mongodb.log.gz    
Issue Links:
Backports
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.0
Steps To Reproduce:

Hard to say exactly. The test included two MTMs with ~50 tenants each. One tenant was generating load significant enough to trigger auto-scaling, with the other 50 tenants on the donor generating minimal load. 5 migrations were issued for the least active tenants on the MTM, all with the same donor and recipient. 4 completed; 1 ended up in a "hung" state from the perspective of MMS.

Sprint: Repl 2021-07-26
Participants:

 Description   

During serverless load testing, 5 tenant migrations were issued as a result of an auto-scaling round. 4 of the 5 completed successfully (although they took ~7 hours for a few MiB of data, with minimal activity for those specific tenants). One migration (tenant ID 60e4cf90ec86b15c50ab87b4, migration ID 62894dd9-aa8f-46b6-aeb8-aa30c6dc7359) appeared to hang indefinitely (~13 hours) and ended up in FAILED_MIGRATION_CLEANUP_IN_PROGRESS.

I will try to reproduce and gather artifacts that will paint a clearer picture. What artifacts exactly would be needed/desired?

In the meantime, here are the mongod logs for the donor and recipient. Note that there was a rolling restart during the course of the migrations, so I've attached both donor primary logs covering the period:

  • donor-new-primary-stuck-on-aborting-index-builds-atlas-ysf0ds-shard-00-01.6oxx1.mmscloudteam.com_2021-07-07T02_30_00_2021-07-07T15_00_00_mongodb.log.gz
    • after the migration started, this node was selected as primary
  • donor-proxy-original-primary-stuck-on-aborting-index-builds-atlas-ysf0ds-shard-00-02.6oxx1.mmscloudteam.com_2021-07-07T02_30_00_2021-07-07T15_00_00_mongodb.log.gz
    • this is the proxy instance noted in the tenant migration document and the original primary
  • recipient-primary-stuck-on-aborting-index-builds-atlas-cqwy0o-shard-00-02.6oxx1.mmscloudteam.com_2021-07-07T02_30_00_2021-07-07T15_00_00_mongodb.log.gz
    • this was the primary for the duration of the test
  • recipient-proxy-stuck-on-aborting-index-builds-atlas-cqwy0o-shard-00-00.6oxx1.mmscloudteam.com_2021-07-07T02_30_00_2021-07-07T15_00_00_mongodb.log.gz
    • this is the proxy instance noted in the tenant migration document
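
In case it helps anyone looking at this, the migration state can also be checked directly on the donor and recipient. The sketch below is a rough example, not anything official: it assumes the internal config.tenantMigrationDonors / config.tenantMigrationRecipients state collections used by 5.0, and the hostnames are placeholders.

```python
# Minimal sketch for inspecting the migration state documents directly.
# Assumptions: internal config.tenantMigrationDonors / config.tenantMigrationRecipients
# collections as in 5.0; hostnames are placeholders.
import uuid

from bson.binary import Binary, UuidRepresentation
from pymongo import MongoClient

MIGRATION_ID = uuid.UUID("62894dd9-aa8f-46b6-aeb8-aa30c6dc7359")
TENANT_ID = "60e4cf90ec86b15c50ab87b4"

donor = MongoClient("mongodb://donor-primary.example.net:27017")          # placeholder
recipient = MongoClient("mongodb://recipient-primary.example.net:27017")  # placeholder

# Donor-side state document for this tenant (state, expireAt, abortReason, ...).
donor_doc = donor["config"]["tenantMigrationDonors"].find_one({"tenantId": TENANT_ID})
print("donor state doc:", donor_doc)

# Recipient-side state document, keyed by the migration UUID.
recipient_doc = recipient["config"]["tenantMigrationRecipients"].find_one(
    {"_id": Binary.from_uuid(MIGRATION_ID, UuidRepresentation.STANDARD)}
)
print("recipient state doc:", recipient_doc)
```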

Rough timeline courtesy of tomer.yakir:

Donor:

  • Server restarted at 02:55
  • 02:58 - some migration-related data following stepUp
  • 02:59 - oplog fetcher for the migration
  • 04:08 - server was slow
  • 09:57 - some migrations finished
  • 09:58 - got forgetMigration

Recipient:

  • 02:48 - migrations started
  • 02:59 - short read error
  • 09:48 - migrations got committed


 Comments   
Comment by Githook User [ 13/Jul/21 ]

Author:

{'name': 'Jason Chan', 'email': 'jason.chan@mongodb.com', 'username': 'jasonjhchan'}

Message: SERVER-58398 TenantMigrationDonor will not retry recipientSyncData on non-retriable interruption errors

(cherry picked from commit bbd0b90085c06de2882e48d68812ac822a4412f9)
Branch: v5.0
https://github.com/mongodb/mongo/commit/92662765968eff784a82adea2f57ee5d1125712d

Comment by Githook User [ 13/Jul/21 ]

Author:

{'name': 'Jason Chan', 'email': 'jason.chan@mongodb.com', 'username': 'jasonjhchan'}

Message: SERVER-58398 TenantMigrationDonor will not retry recipientSyncData on non-retriable interruption errors
Branch: master
https://github.com/mongodb/mongo/commit/bbd0b90085c06de2882e48d68812ac822a4412f9
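
For readers of this ticket, a rough illustration of the behavior the commit above changes: before the fix, the donor treated interruption errors from recipientSyncData as retriable and kept resending the command, which is how the migration here could hang indefinitely. The sketch below is in Python rather than the server's C++, the error sets are abbreviated, and the helper name is hypothetical.

```python
# Illustrative sketch only -- not the server's C++ implementation.
# The error sets below are abbreviated approximations of the server's
# RetriableError and Interruption categories.

RETRIABLE_ERRORS = {"HostUnreachable", "NetworkTimeout", "InterruptedDueToReplStateChange"}
INTERRUPTION_ERRORS = {"Interrupted", "InterruptedDueToReplStateChange"}


def should_retry_recipient_sync_data(error_code: str, fixed: bool = True) -> bool:
    """Decide whether the donor resends recipientSyncData after a failed attempt."""
    if error_code in RETRIABLE_ERRORS:
        return True
    if error_code in INTERRUPTION_ERRORS:
        # Before SERVER-58398 the donor also retried on non-retriable
        # interruption errors, so it could loop forever; after the fix the
        # error propagates and the migration aborts instead.
        return not fixed
    return False


# A plain "Interrupted" response no longer produces an endless retry loop.
assert should_retry_recipient_sync_data("Interrupted", fixed=False) is True   # old behavior
assert should_retry_recipient_sync_data("Interrupted", fixed=True) is False   # fixed behavior
```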

Comment by Esha Maharishi (Inactive) [ 12/Jul/21 ]

Thanks lingzhi.deng - good to know; that would be a useful tool for Cloud if a similar hang happens in the future.

Comment by Lingzhi Deng [ 12/Jul/21 ]

Yes. I think manually aborting the migration should work around this as long as the donorAbortMigration command would stop the donor from retrying sending the recipientSyncData command, which I think it should.
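
For completeness, a rough sketch of what that manual abort could look like from a driver, assuming the internal donorAbortMigration command shape in 5.0 (a migrationId UUID) and a placeholder hostname; the exact fields should be checked against the server version in use.

```python
# Sketch of manually aborting the stuck migration on the donor primary.
# Assumptions: internal donorAbortMigration command taking a migrationId UUID
# (as in 5.0); the hostname is a placeholder.
import uuid

from bson.binary import Binary, UuidRepresentation
from pymongo import MongoClient

MIGRATION_ID = uuid.UUID("62894dd9-aa8f-46b6-aeb8-aa30c6dc7359")

donor = MongoClient("mongodb://donor-primary.example.net:27017")  # placeholder

result = donor.admin.command(
    {
        "donorAbortMigration": 1,
        "migrationId": Binary.from_uuid(MIGRATION_ID, UuidRepresentation.STANDARD),
    }
)
print(result)
```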

Comment by Esha Maharishi (Inactive) [ 12/Jul/21 ]

I'm curious if manually aborting the migration would have worked despite this hang.
