[SERVER-56435] ContinuousTenantMigration doesn't handle donor aborting migration due to ShutdownInProgress or InterruptedAtShutdown Created: 28/Apr/21  Updated: 29/Oct/23  Resolved: 04/May/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 4.9.0-rc1, 5.0.0-rc0

Type: Bug Priority: Major - P3
Reporter: Pavithra Vetriselvan Assignee: Jack Mulrow
Resolution: Fixed Votes: 0
Labels: pm-1791_non-cloud-blocking, pm-1791_other_required
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by SERVER-52713 [testing] Add stepdown/kill/terminate... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.9
Sprint: Sharding 2021-05-03, Sharding 2021-05-17
Participants:

 Description   

While enabling the tenant migration terminate primary suites, I ran into the following scenario:
1. ContinuousTenantMigration starts tenant migration X on donor primary
2. ContinuousStepdown terminates donor primary
3. Donor primary shuts down and interrupts migration X with ShutdownInProgress
4. New primary steps up and rebuilds TenantMigrationDonorService
5. New primary says tenant migration has completed because it was aborted due to ShutdownInProgress
6. ContinuousTenantMigration raises an error since this isn't an accepted abort reason.

If this is expected behavior on the donor, then the ContinuousTenantMigration hook should catch ShutdownInProgress and InterruptedAtShutdown donor abort errors.



 Comments   
Comment by Githook User [ 05/May/21 ]

Author:

{'name': 'Jack Mulrow', 'email': 'jack.mulrow@mongodb.com', 'username': 'jsmulrow'}

Message: SERVER-56435 Tenant migration donor shouldn't abort on shutdown or not primary errors

(cherry picked from commit 6b780f53b473a8f23042095642b1888bf3a2b237)
Branch: v4.9
https://github.com/mongodb/mongo/commit/1261c92d8a9fa97f5154d40866e41b34723af9cf

Comment by Githook User [ 04/May/21 ]

Author:

{'name': 'Jack Mulrow', 'email': 'jack.mulrow@mongodb.com', 'username': 'jsmulrow'}

Message: SERVER-56435 Tenant migration donor shouldn't abort on shutdown or not primary errors
Branch: master
https://github.com/mongodb/mongo/commit/6b780f53b473a8f23042095642b1888bf3a2b237

Comment by Jack Mulrow [ 03/May/21 ]

What you're suggesting jack.mulrow is to always make sure to step-down PrimaryOnlyService before shutdown, because that would ensure the tokens are canceled prior to the WMFS being shut down - is that correct?

Yeah exactly. That's a fair point about not being able to guarantee the cancellation token will always be cancelled first though, so I'll fix this by having donor service instances check the error code instead, and I'll leave the shutdown interruption behavior as is since it gives us coverage in the terminate suites.

Comment by Matthew Saltz (Inactive) [ 28/Apr/21 ]

If I understand correctly, the problem is:
1. The WFMS is shut down, which returns a ShutdownInProgress error somewhere in the donor
2. The donor checks cancellation tokens to find out if it's been stepped/shutdown, but since those tokens haven't been canceled, the donor assumes the WFMS is throwing an abort-worthy error
3. The donor aborts with ShutdownInProgress error, when really it shouldn't have aborted

What you're suggesting jack.mulrow is to always make sure to step-down PrimaryOnlyService before shutdown, because that would ensure the tokens are canceled prior to the WMFS being shut down - is that correct?

I wouldn't be opposed to doing what you're suggesting, though I worry that other subsystems used by the donor could conceivably still be shutdown prior to the PrimaryOnlyService being interrupted, as long as there's no explicit dependency. In other words, there's no reason WFMS or some other service might not change some day to be shut down on step-down and restarted on step-up (I don't think we'd do that, it's just an example), which would bring you back to having the same problem again, right? Basically, using the tokens as a way to reliably check for step down isn't really possible unless we guarantee that we always step down/interrupt the PrimaryOnlyService before stepping down literally any other component. Is that right or am I making a problem that doesn't exist?

In other words - I don't think it's a bad idea to always interrupt on step-down the way we do with the TransactionCoordinator, but I'm also not sure it's sufficient to handle all edge cases, and I think the donor and other services might still unfortunately have to deal with errors coming out of other subsystems like the WFMS manually.

Comment by Jack Mulrow [ 28/Apr/21 ]

The donor should be robust to failovers, so this is a bug. It looks like the specific abortReason was "abortReason":{"code":91,"codeName":"ShutdownInProgress","errmsg":"rejecting wait for majority request due to server shutdown"} and based on the error message, the error came from here in the WaitForMajorityService, which the donor uses several times to asynchronously wait for majority write concern. What I think happened is the WaitForMajorityService was shutdown before the PrimaryOnlyService was (here vs here), and in that interval the donor received the shutdown error and was able to persist an abort decision before being shut down itself.

This normally won't happen because as part of shutdown, a node is best effort stepped down, which should also interrupted the donor and prevent the abort, but from the logs, that stepdown failed:

// Donor receives the terminate signal, which triggers best-effort stepdown
[j0:rs0:n0] | 2021-04-26T16:28:08.455+00:00 I CONTROL 23377 [SignalHandler] "Received signal","attr":\{"signal":15,"error":"Terminated"}
[j0:rs0:n0] | 2021-04-26T16:28:08.455+00:00 I REPL 4784900 [SignalHandler] "Stepping down the ReplicationCoordinator for shutdown","attr":\{"waitTimeMillis":100}
...
// The stepdown takes too long so it's aborted but normal shutdown continues on
[j0:rs0:n0] | 2021-04-26T16:28:08.556+00:00 W COMMAND 4719000 [SignalHandler] "Error stepping down during force shutdown","attr":\{"error":{"code":262,"codeName":"ExceededTimeLimit","errmsg":"No electable secondaries caught up as of 2021-04-26T16:28:08.556+00:00. Please use the replSetStepDown command with the argument {force: true} to force node to step down."}}
...
// In normal shutdown, the WaitForMajorityService is shut down first which triggers the abort decision before the PrimaryOnlyServices are shut down.
[j0:rs0:n0] | 2021-04-26T16:28:08.559+00:00 I TENANT_M 5107301 [TenantMigrationDonorService-2] "Tenant migration starting to wait for abort OpTime to be majority-committed","attr":\{"tenantId":"tenantMigrationTenantId","abortOpTime":{"ts":{"$timestamp":{"t":1619454488,"i":6}},"t":1}}
[j0:rs0:n0] | 2021-04-26T16:28:08.559+00:00 I REPL 4784909 [SignalHandler] "Shutting down the ReplicationCoordinator"
[j0:rs0:n0] | 2021-04-26T16:28:08.559+00:00 I REPL 5123006 [SignalHandler] "Shutting down PrimaryOnlyService","attr":\{"service":"TenantMigrationDonorService","numInstances":1,"numOperationContexts":1}

The donor assumes that its cancellation tokens will be interrupted first on stepdown/shutdown, but in this scenario that assumption isn't valid. One way to fix this is to have the donor ignore certain error codes (w/ probably the same logic for every other PrimaryOnlyService), but I think a cleaner solution would be to do what we did for the TransactionCoordinatorService in SERVER-45009 and always run the PrimaryOnlyService stepdown handlers even if the best-effort stepdown fails, since either way the services will be interrupted.

matthew.saltz, what do you think?

Comment by Pavithra Vetriselvan [ 28/Apr/21 ]

Evg Task
Relevant Logs (ShutdownInProgress)
Logs for InterruptedAtShutdown

Generated at Thu Feb 08 05:39:15 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.