Core Server / SERVER-58398

Tenant migration hung indefinitely


Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker - P1
    • Resolution: Fixed
    • Affects Version/s: 5.0.0-rc7
    • Fix Version/s: 5.0.1, 5.1.0-rc0
    • Component/s: Replication
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Backport Completed: v5.0
      Hard to say here exactly. The test included two MTMs with ~50 tenants each. One tenant was generating load significant enough to trigger auto-scaling, while the other ~50 tenants on the donor generated minimal load. 5 migrations were issued for the least active tenants on the MTM, all with the same donor and recipient. 4 completed; 1 ended up in a "hung" state from the perspective of MMS.
    • Sprint: Repl 2021-07-26

    Description

      During serverless load testing, 5 tenant migrations were issued as the result of an auto-scaling round. 4 of the 5 completed successfully, although they took ~7 hours each for a few MiB of data and minimal activity on those specific tenants. One migration (tenant ID 60e4cf90ec86b15c50ab87b4, migration ID 62894dd9-aa8f-46b6-aeb8-aa30c6dc7359) seemed to hang indefinitely (~13 hours) and ended up in FAILED_MIGRATION_CLEANUP_IN_PROGRESS.
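One way to see where the donor thinks the stuck migration is would be to read its durable state document. This is a sketch only: the `config.tenantMigrationDonors` collection and the `state`/`abortReason` fields follow the 5.0 tenant-migration implementation, and the host name is a placeholder, not taken from this ticket.

```shell
# Sketch (assumes mongosh and a reachable donor primary; the host is a placeholder).
# The UUID is the stuck migration's id from the description above.
mongosh "mongodb://donor-primary.example.net:27017" --eval '
  db.getSiblingDB("config").tenantMigrationDonors.find(
    { _id: UUID("62894dd9-aa8f-46b6-aeb8-aa30c6dc7359") },
    { tenantId: 1, state: 1, expireAt: 1, abortReason: 1 }
  )'
```

A migration that never leaves its first state (e.g. "aborting index builds", as the attached log file names suggest) would show up here with no `expireAt` set.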

      I will try to reproduce and gather artifacts that will paint a clearer picture. What artifacts exactly would be needed/desired?

      In the meantime, here are the mongod logs for the donor and recipient. Note that there was a rolling restart during the course of the migrations, so I've attached both donor primary logs covering the period:

      • donor-new-primary-stuck-on-aborting-index-builds-atlas-ysf0ds-shard-00-01.6oxx1.mmscloudteam.com_2021-07-07T02_30_00_2021-07-07T15_00_00_mongodb.log.gz
        • after the migration started, this node was selected as primary
      • donor-proxy-original-primary-stuck-on-aborting-index-builds-atlas-ysf0ds-shard-00-02.6oxx1.mmscloudteam.com_2021-07-07T02_30_00_2021-07-07T15_00_00_mongodb.log.gz
        • this is the proxy instance noted in the tenant migration document and the original primary
      • recipient-primary-stuck-on-aborting-index-builds-atlas-cqwy0o-shard-00-02.6oxx1.mmscloudteam.com_2021-07-07T02_30_00_2021-07-07T15_00_00_mongodb.log.gz
        • this was the primary for the duration of the test
      • recipient-proxy-stuck-on-aborting-index-builds-atlas-cqwy0o-shard-00-00.6oxx1.mmscloudteam.com_2021-07-07T02_30_00_2021-07-07T15_00_00_mongodb.log.gz
        • this is the proxy instance noted in the tenant migration document
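For correlating the four attached logs, filtering each by the migration id is the simplest starting point. The snippet below is a sketch: it synthesizes one gzipped mongod-style log line so the pipeline can be shown end to end; against the real attachments you would point `gunzip -c` at the `.log.gz` files listed above instead.

```shell
# Sketch: extract every log line mentioning the stuck migration's id.
# A synthetic gzipped log stands in for the attached *.log.gz files here.
MIGRATION_ID='62894dd9-aa8f-46b6-aeb8-aa30c6dc7359'
printf '{"t":{"$date":"2021-07-07T02:59:00Z"},"msg":"Tenant migration","attr":{"migrationId":"%s"}}\n' \
  "$MIGRATION_ID" | gzip > sample_mongodb.log.gz
gunzip -c sample_mongodb.log.gz | grep -c "$MIGRATION_ID"   # prints 1
```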

      Rough timeline courtesy of tomer.yakir:

      Donor:

      • Server restarted at 02:55
      • 02:58 - some migration-related data following stepUp
      • 02:59 - oplog fetcher for the migration
      • 04:08 - server was slow
      • 09:57 - some migrations finished
      • 09:58 - got forgetMigration

      Recipient:

      • 02:48 - migrations started
      • 02:59 - short read error
      • 09:48 - migrations committed
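As a sanity check on the "~7 hours" figure in the description, the recipient-side timestamps above (migrations started 02:48, committed 09:48; the 2021-07-07 date is taken from the attached log file names) can be differenced directly. GNU `date` is assumed:

```shell
# Sketch: elapsed time from recipient "migrations started" to "migrations committed".
# The 2021-07-07 date comes from the attached log file names; times treated as UTC.
start=$(date -u -d '2021-07-07 02:48' +%s)
end=$(date -u -d '2021-07-07 09:48' +%s)
echo "$(( (end - start) / 3600 )) hours"   # prints: 7 hours
```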


          People

            jason.chan@mongodb.com Jason Chan
            greg.banks@mongodb.com Gregory Banks
            Votes: 0
            Watchers: 26
