- Type: Bug
- Resolution: Fixed
- Priority: Blocker - P1
- Affects Version/s: 5.0.0-rc7
- Component/s: Replication
- None
- Fully Compatible
- ALL
- v5.0
- Repl 2021-07-26
During serverless load testing, 5 tenant migrations were issued as a result of an auto-scaling round. 4 of the 5 completed successfully, although each took ~7 hours to move a few MiB of data with minimal activity for those specific tenants. One migration (tenant ID 60e4cf90ec86b15c50ab87b4, migration ID 62894dd9-aa8f-46b6-aeb8-aa30c6dc7359) seemed to hang indefinitely (~13 hours) and ended up in FAILED_MIGRATION_CLEANUP_IN_PROGRESS.
I will try to reproduce and gather artifacts that will paint a clearer picture. What artifacts exactly would be needed/desired?
In the meantime, here are the mongod logs for the donor and recipient. Note that there was a rolling restart during the course of the migrations, so I've attached both donor primary logs covering the period:
- donor-new-primary-stuck-on-aborting-index-builds-atlas-ysf0ds-shard-00-01.6oxx1.mmscloudteam.com_2021-07-07T02_30_00_2021-07-07T15_00_00_mongodb.log.gz
  - after the migration started, this node was selected as primary
- donor-proxy-original-primary-stuck-on-aborting-index-builds-atlas-ysf0ds-shard-00-02.6oxx1.mmscloudteam.com_2021-07-07T02_30_00_2021-07-07T15_00_00_mongodb.log.gz
  - this is the proxy instance noted in the tenant migration document and the original primary
- recipient-primary-stuck-on-aborting-index-builds-atlas-cqwy0o-shard-00-02.6oxx1.mmscloudteam.com_2021-07-07T02_30_00_2021-07-07T15_00_00_mongodb.log.gz
  - this was the primary for the duration of the test
- recipient-proxy-stuck-on-aborting-index-builds-atlas-cqwy0o-shard-00-00.6oxx1.mmscloudteam.com_2021-07-07T02_30_00_2021-07-07T15_00_00_mongodb.log.gz
  - this is the proxy instance noted in the tenant migration document
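For anyone triaging the attached logs, here is a minimal sketch of one way to pull out the entries that mention the stuck migration. It assumes the files are structured JSON logs with one object per line, carrying `t.$date`, `c`, and `msg` fields, and that the migration ID appears somewhere in the line text. The helper names `matching_entries` and `scan_log` are hypothetical, not part of any MongoDB tooling.

```python
import gzip
import json

# Migration ID taken from the report above.
MIGRATION_ID = "62894dd9-aa8f-46b6-aeb8-aa30c6dc7359"


def matching_entries(lines, needle):
    """Yield (timestamp, component, message) for JSON log lines containing needle.

    Assumption: each line is one JSON object with "t": {"$date": ...},
    "c" (component) and "msg" keys. Lines that do not parse are skipped.
    """
    for line in lines:
        if needle not in line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(entry, dict):
            continue
        t = entry.get("t")
        timestamp = t.get("$date", "") if isinstance(t, dict) else ""
        yield timestamp, entry.get("c", ""), entry.get("msg", "")


def scan_log(path, needle=MIGRATION_ID):
    """Stream a gzipped log file and yield the entries that mention needle."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        yield from matching_entries(fh, needle)
```

Calling `scan_log` on each of the four attachments and sorting the combined results by timestamp should give a per-migration view across donor and recipient, which may be easier to compare against the rough timeline than reading each file separately.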
Rough timeline courtesy of tomer.yakir:
Donor:
- 02:55 - server restarted
- 02:58 - some migration-related data following stepUp
- 02:59 - oplog fetcher for migration
- 04:08 - server was slow
- 09:57 - some migrations finished
- 09:58 - received forgetMigration
Recipient:
- 02:48 - migrations started
- 02:59 - short read error
- 09:48 - migrations committed