[SERVER-58398] Tenant migration hung indefinitely Created: 09/Jul/21 Updated: 29/Oct/23 Resolved: 13/Jul/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 5.0.0-rc7 |
| Fix Version/s: | 5.0.1, 5.1.0-rc0 |
| Type: | Bug | Priority: | Blocker - P1 |
| Reporter: | Gregory Banks | Assignee: | Jason Chan |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
||||
| Issue Links: |
|
||||
| Backwards Compatibility: | Fully Compatible | ||||
| Operating System: | ALL | ||||
| Backport Requested: | v5.0 | ||||
| Steps To Reproduce: | Exact steps are hard to pin down. The test included two MTMs with ~50 tenants each. One tenant was generating load significant enough to trigger auto-scaling, while the other 50 tenants on the donor generated minimal load. 5 migrations were issued for the least active tenants on the MTM, all with the same donor and recipient. 4 completed; 1 ended up in a "hung" state from the perspective of MMS. |
||||
| Sprint: | Repl 2021-07-26 | ||||
| Participants: | |||||
| Description |
|
During serverless load testing, 5 tenant migrations were issued as a result of an auto-scaling round. 4 of the 5 completed successfully, although they took ~7 hours to complete for a few MiB of data with minimal activity for those specific tenants. One migration (tenant ID 60e4cf90ec86b15c50ab87b4 and migration ID 62894dd9-aa8f-46b6-aeb8-aa30c6dc7359) seemed to hang indefinitely (~13 hours) and ended up in FAILED_MIGRATION_CLEANUP_IN_PROGRESS. I will try to reproduce and gather artifacts that paint a clearer picture. Which artifacts exactly would be needed or desired? In the meantime, here are the mongod logs for the donor and recipient. Note that there was a rolling restart during the course of the migrations, so I've attached both donor primary logs covering the period:
Rough timeline courtesy of tomer.yakir:
Donor:
Recipient:
|
| Comments |
| Comment by Githook User [ 13/Jul/21 ] |
|
Author: {'name': 'Jason Chan', 'email': 'jason.chan@mongodb.com', 'username': 'jasonjhchan'} Message: (cherry picked from commit bbd0b90085c06de2882e48d68812ac822a4412f9) |
| Comment by Githook User [ 13/Jul/21 ] |
|
Author: {'name': 'Jason Chan', 'email': 'jason.chan@mongodb.com', 'username': 'jasonjhchan'} Message: |
| Comment by Esha Maharishi (Inactive) [ 12/Jul/21 ] |
|
Thanks lingzhi.deng, good to know. That would be a useful tool for Cloud if a similar hang happens in the future. |
| Comment by Lingzhi Deng [ 12/Jul/21 ] |
|
Yes. I think manually aborting the migration should work around this, as long as the donorAbortMigration command stops the donor from retrying the recipientSyncData command, which I think it should. |
| Comment by Esha Maharishi (Inactive) [ 12/Jul/21 ] |
|
I'm curious if manually aborting the migration would have worked despite this hang. |