[SERVER-56435] ContinuousTenantMigration doesn't handle donor aborting migration due to ShutdownInProgress or InterruptedAtShutdown Created: 28/Apr/21 Updated: 29/Oct/23 Resolved: 04/May/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 4.9.0-rc1, 5.0.0-rc0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Pavithra Vetriselvan | Assignee: | Jack Mulrow |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | pm-1791_non-cloud-blocking, pm-1791_other_required | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||
| Operating System: | ALL | ||||||||||||
| Backport Requested: | v4.9 | ||||||||||||
| Sprint: | Sharding 2021-05-03, Sharding 2021-05-17 | ||||||||||||
| Participants: | |||||||||||||
| Description |
|
While enabling the tenant migration terminate primary suites, I ran into the following scenario: If this is expected behavior on the donor, then the ContinuousTenantMigration hook should catch ShutdownInProgress and InterruptedAtShutdown donor abort errors. |
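A minimal sketch of the check the hook could perform, assuming the hook can read the donor's state document as a plain dictionary. The helper name `is_shutdown_abort` and the `state` field value `"aborted"` are assumptions for illustration, not the actual resmoke hook code; only the `abortReason` shape with `code`, `codeName`, and `errmsg` comes from this ticket.

```python
# Hypothetical helper, not the real ContinuousTenantMigration hook code.
# The code names are the two errors named in this ticket.
SHUTDOWN_ABORT_CODE_NAMES = frozenset({"ShutdownInProgress", "InterruptedAtShutdown"})


def is_shutdown_abort(donor_state_doc):
    """Return True if the donor aborted only because its node was shutting down.

    Assumes (not confirmed by the ticket) that an aborted migration is marked
    with state == "aborted" and carries an "abortReason" sub-document.
    """
    if donor_state_doc.get("state") != "aborted":
        return False
    abort_reason = donor_state_doc.get("abortReason") or {}
    return abort_reason.get("codeName") in SHUTDOWN_ABORT_CODE_NAMES
```

A hook using a check like this could treat such an abort as expected in the terminate suites and retry the migration instead of failing the test.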
| Comments |
| Comment by Githook User [ 05/May/21 ] | |||||||||||
|
Author: {'name': 'Jack Mulrow', 'email': 'jack.mulrow@mongodb.com', 'username': 'jsmulrow'} Message: (cherry picked from commit 6b780f53b473a8f23042095642b1888bf3a2b237) | |||||||||||
| Comment by Githook User [ 04/May/21 ] | |||||||||||
|
Author: {'name': 'Jack Mulrow', 'email': 'jack.mulrow@mongodb.com', 'username': 'jsmulrow'} Message: | |||||||||||
| Comment by Jack Mulrow [ 03/May/21 ] | |||||||||||
Yeah exactly. That's a fair point about not being able to guarantee the cancellation token will always be cancelled first though, so I'll fix this by having donor service instances check the error code instead, and I'll leave the shutdown interruption behavior as is since it gives us coverage in the terminate suites. | |||||||||||
| Comment by Matthew Saltz (Inactive) [ 28/Apr/21 ] | |||||||||||
|
If I understand correctly, the problem is: What you're suggesting jack.mulrow is to always make sure to step down the PrimaryOnlyService before shutdown, because that would ensure the tokens are canceled prior to the WFMS being shut down - is that correct? I wouldn't be opposed to doing what you're suggesting, though I worry that other subsystems used by the donor could conceivably still be shut down prior to the PrimaryOnlyService being interrupted, as long as there's no explicit dependency. In other words, there's no reason the WFMS or some other service might not change some day to be shut down on step-down and restarted on step-up (I don't think we'd do that, it's just an example), which would bring you back to having the same problem again, right? Basically, using the tokens as a way to reliably check for step-down isn't really possible unless we guarantee that we always step down/interrupt the PrimaryOnlyService before stepping down literally any other component. Is that right or am I making a problem that doesn't exist? In other words - I don't think it's a bad idea to always interrupt on step-down the way we do with the TransactionCoordinator, but I'm also not sure it's sufficient to handle all edge cases, and I think the donor and other services might still unfortunately have to deal with errors coming out of other subsystems like the WFMS manually. | |||||||||||
| Comment by Jack Mulrow [ 28/Apr/21 ] | |||||||||||
|
The donor should be robust to failovers, so this is a bug. It looks like the specific abortReason was "abortReason":{"code":91,"codeName":"ShutdownInProgress","errmsg":"rejecting wait for majority request due to server shutdown"} and based on the error message, the error came from here in the WaitForMajorityService, which the donor uses several times to asynchronously wait for majority write concern. What I think happened is the WaitForMajorityService was shut down before the PrimaryOnlyService was (here vs here), and in that interval the donor received the shutdown error and was able to persist an abort decision before being shut down itself. This normally won't happen because as part of shutdown, a node is stepped down on a best-effort basis, which should also interrupt the donor and prevent the abort, but from the logs, that stepdown failed:
The donor assumes that its cancellation tokens will be interrupted first on stepdown/shutdown, but in this scenario that assumption isn't valid. One way to fix this is to have the donor ignore certain error codes (w/ probably the same logic for every other PrimaryOnlyService), but I think a cleaner solution would be to do what we did for the TransactionCoordinatorService in matthew.saltz, what do you think? | |||||||||||
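The first option above (ignoring certain error codes) can be modelled roughly as follows. This is a language-neutral sketch written in Python, not the donor's actual C++ logic; the names `Outcome`, `decide_on_error`, and `token_cancelled` are hypothetical. It only illustrates the idea that a shutdown error from a dependency should be treated like an interruption even when the donor's own cancellation token has not been cancelled yet.

```python
from enum import Enum


class Outcome(Enum):
    PERSIST_ABORT = "persist_abort"  # record an abort decision for the migration
    PROPAGATE = "propagate"          # stop without deciding; a new primary resumes


# Error code names from this ticket that indicate shutdown, not a real failure.
SHUTDOWN_CODE_NAMES = frozenset({"ShutdownInProgress", "InterruptedAtShutdown"})


def decide_on_error(code_name, token_cancelled):
    """Decide what a donor instance does with an error (illustrative only).

    Checking token_cancelled alone is not enough: a dependency such as the
    WaitForMajorityService may be shut down, and report a shutdown error,
    before the donor's own token is cancelled.
    """
    if token_cancelled or code_name in SHUTDOWN_CODE_NAMES:
        return Outcome.PROPAGATE
    return Outcome.PERSIST_ABORT
```

With only the token check, the scenario in this ticket (`ShutdownInProgress` arriving while the token is still live) would fall through to persisting an abort; the extra code-name check is what prevents that.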
| Comment by Pavithra Vetriselvan [ 28/Apr/21 ] | |||||||||||
|
Evg Task |