[SERVER-51650] Primary-Only Service's _rebuildCV should be notified even if stepdown happens quickly after stepup Created: 15/Oct/20 Updated: 29/Oct/23 Resolved: 07/Jun/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 7.1.0-rc0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Esha Maharishi (Inactive) | Assignee: | Wenbin Zhu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | lowcontext, servicearch-wfbf-day | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Assigned Teams: | Service Arch |
| Backwards Compatibility: | Fully Compatible |
| Backport Requested: | v6.1 |
| Sprint: | Service Arch 2022-05-30, Service Arch 2022-06-13, Service Arch 2022-06-27, Service Arch 2022-07-11, Service Arch 2022-07-25, Service Arch 2023-02-20, Service Arch 2023-03-06, Service Arch 2023-05-01, Service Arch 2023-05-15, Service Arch 2023-05-29, Service Arch 2023-06-12 |
| Participants: | |
| Linked BF Score: | 131 |
| Story Points: | 4 |
| Description |
|
When stepup completes, each Primary-Only Service's _state is set to kRebuilding. Both PrimaryOnlyService::lookupInstance and PrimaryOnlyService::getOrCreateInstance wait until the _rebuildCV condition variable is notified and the _state is no longer kRebuilding. _rebuildCV is notified in PrimaryOnlyService::_rebuildInstances, which on stepup is scheduled to run asynchronously.

If stepdown occurs before _rebuildInstances starts, e.g. if stepdown occurs here, then _rebuildCV may never be notified. So, any threads blocking in lookupInstance or getOrCreateInstance that don't get interrupted by stepdown will block indefinitely.

Currently, there is an invariant in lookupInstance that the thread is guaranteed to be interrupted by stepdown. Otherwise, if the thread were holding the RSTL lock, it would prevent the stepdown from completing, leading to a deadlock. It would be better to notify _rebuildCV here to guarantee that threads cannot block indefinitely in lookupInstance or getOrCreateInstance.
Acceptance criteria: Reproduce issue in unit test |
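The stepup/stepdown race described above can be modeled outside the server. The sketch below is a hypothetical Python threading model, not the actual C++ server code: names like State, lookup_instance, rebuild_instances, and on_step_down are illustrative stand-ins for their PrimaryOnlyService counterparts. It shows why a waiter blocks forever if stepdown wins the race against _rebuildInstances, and how notifying the condition variable from the stepdown path unblocks it.

```python
import threading
from enum import Enum, auto

class State(Enum):
    REBUILDING = auto()  # set on stepup
    RUNNING = auto()     # set once rebuild completes
    PAUSED = auto()      # set on stepdown

class PrimaryOnlyServiceModel:
    def __init__(self):
        self._mutex = threading.Lock()
        self._rebuild_cv = threading.Condition(self._mutex)
        self._state = State.REBUILDING  # stepup just happened

    def lookup_instance(self, timeout=None):
        # Blocks until the service leaves REBUILDING; returns the final
        # state, or None if the wait timed out (i.e. would block forever).
        with self._rebuild_cv:
            ok = self._rebuild_cv.wait_for(
                lambda: self._state is not State.REBUILDING, timeout)
            return self._state if ok else None

    def rebuild_instances(self):
        # The async stepup work; it may never run if stepdown wins the race.
        with self._rebuild_cv:
            self._state = State.RUNNING
            self._rebuild_cv.notify_all()

    def on_step_down(self):
        # The proposed fix: stepdown also leaves REBUILDING and notifies
        # the CV, so waiters cannot block indefinitely.
        with self._rebuild_cv:
            self._state = State.PAUSED
            self._rebuild_cv.notify_all()

# Race: stepdown happens before rebuild_instances ever runs.
svc = PrimaryOnlyServiceModel()
waiter_result = []
t = threading.Thread(target=lambda: waiter_result.append(svc.lookup_instance()))
t.start()
svc.on_step_down()       # without this notify, t would block indefinitely
t.join(timeout=5)
print(waiter_result[0])  # State.PAUSED
```

Without the notify in on_step_down, the waiter thread in this model never wakes, which is the indefinite-block scenario the ticket describes.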
| Comments |
| Comment by Githook User [ 07/Jun/23 ] |
|
Author: Wenbin Zhu <wenbin.zhu@mongodb.com> (WenbinZhu)
Message: |
| Comment by George Wangensteen [ 24/Aug/22 ] |
|
The commit was reverted and therefore isn't present in the master or 6.1 branches. (The 6.1.0-rc0 in the fixVersion field cannot be cleared.) |
| Comment by Suganthi Mani [ 17/Aug/22 ] |
|
Since the fix got reverted, shouldn't we reopen this ticket and clear the fix version? It currently gives the wrong indication that a fix for this ticket is in 6.1.0-rc0. |
| Comment by Githook User [ 02/Aug/22 ] |
|
Author: mathisbessamdb <mathis.bessa@mongodb.com> (mathisbessamdb)
Message: Revert " |
| Comment by Mathis Bessa [ 01/Aug/22 ] |
|
We are reverting this due to a race condition that it introduced, related to the operation context. As mentioned above, more details can be found in |
| Comment by George Wangensteen [ 01/Aug/22 ] |
|
Reverting due to this making the bug documented in |
| Comment by Githook User [ 13/Jul/22 ] |
|
Author: George Wangensteen <george.wangensteen@mongodb.com> (gewa24)
Message: |
| Comment by Esha Maharishi (Inactive) [ 15/Jun/22 ] |
|
max.hirschhorn@mongodb.com makes sense, thanks. george.wangensteen@mongodb.com , ah I see, good point. So the continuations might run even if the error was due to stepdown. Maybe we could check if the state is still kRebuilding and only if so call _setState(kRebuildFailed)? |
| Comment by George Wangensteen [ 15/Jun/22 ] |
|
Ah, thanks for that explanation esha.maharishi@mongodb.com. I think the race might still be possible, though, because onStepDown (and therefore interruptInstances, and therefore the shutdown of scopedExecutor) will be called after the node has transitioned out of Primary, whereas the future returned by waitUntilMajority could be set with an error anytime after the WaitForMajorityService's call to waitForWriteConcern fails.

In particular, the POS is informed of the stepdown by the ReplicaSetAwareService infrastructure in ReplicationCoordinatorImpl::_performPostMemberStateUpdateAction, which is called in ReplicationCoordinatorImpl::stepdown after calls to awaitReplication are set with error here via the _replicationWaiterList. So I think it's possible for the waitUntilMajority future to be set with an error (and therefore for continuations scheduled on it to run) before the POS::onStepDown logic runs and shuts down the executor.

I definitely could be missing something from this analysis, though, as my knowledge of the ReplicationCoordinator is limited, so let me know if something sounds off. To ensure there would be no race between the onStepDown/onStepUp threads, I think we'd need to make sure that the POS::onStepDown logic runs before it is possible for the waitUntilMajority future to be set with an error due to stepdown. |
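The ordering concern in this comment can be reduced to a small model. The sketch below is illustrative only (ScopedExecutorModel and its schedule method are assumptions, not the real server API): it shows that a continuation attached to the failed waitUntilMajority future can still run, because the future is failed via the _replicationWaiterList before the POS stepdown logic shuts the executor down.

```python
class ScopedExecutorModel:
    """Toy stand-in for the POS scoped executor: continuations scheduled
    after shutdown are silently dropped, as with a shut-down executor."""
    def __init__(self):
        self.shut_down = False
        self.ran = []

    def schedule(self, name, fn):
        # A continuation only runs if the executor is still alive.
        if not self.shut_down:
            self.ran.append(name)
            fn()

executor = ScopedExecutorModel()
events = []

# Stepdown ordering per the comment above:
# 1. awaitReplication waiters are failed first, so the waitUntilMajority
#    future is set with an error and its continuation is scheduled -- the
#    executor has not been shut down yet, so the continuation runs.
executor.schedule("waitUntilMajority-error-continuation",
                  lambda: events.append("continuation ran"))

# 2. Only afterwards does _performPostMemberStateUpdateAction inform the
#    POS, which shuts down the executor.
executor.shut_down = True

# 3. Anything scheduled from here on is dropped.
executor.schedule("late-continuation", lambda: events.append("late ran"))

print(events)  # ['continuation ran']
```

The point of the model: because step 1 precedes step 2, the error continuation observes a live executor and runs, which is exactly the window the comment identifies.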
| Comment by Max Hirschhorn [ 14/Jun/22 ] |
There weren't other cases from what I had peeked at in the wait_for_majority_service.cpp implementation. |
| Comment by Esha Maharishi (Inactive) [ 13/Jun/22 ] |
|
max.hirschhorn@mongodb.com, are there cases you know of where waitUntilMajority would return an error besides stepdown/shutdown? george.wangensteen@mongodb.com, are you sure there would be a race between the stepdown/shutdown thread and the onStepUp thread? The onStepUp thread's continuations run on _newScopedExecutor, which is a reference to _scopedExecutor, and the stepdown/shutdown thread would shut down _scopedExecutor. So, the continuations wouldn't run on stepdown/shutdown. The continuations would only set the state to kRebuildFailed and notify _stateChangedCV if waitUntilMajority returned an error besides stepdown/shutdown. |
| Comment by George Wangensteen [ 13/Jun/22 ] |
|
I looked into this and chatted with max.hirschhorn@mongodb.com and janna.golden@mongodb.com about it a bit. I think this is no longer a bug, but the situation has become a bit convoluted to reason about. Essentially:

(1) If the continuations chained with .then() after the call to waitUntilMajority do not run, then either:
(1a) the associated cancellation source _source was cancelled;
(1b) the newScopedExecutor was shut down;
(1c) the WaitForMajorityService was shut down;
(1d) waitForWriteConcern returned an error status.

(1a) and (1b) can happen only if the primary-only service's onStepDown or shutdown have been called, and both of these functions currently change the state from kRebuilding and wake the CV. (1c) can happen only on process shutdown, and occurs after the replication coordinator steps/shuts down, which triggers the POS stepdown/shutdown logic, so the same argument from above applies. (1d) can occur only if there is a stepdown while waiting for replication (in which case the argument from (1a)/(1b) applies), or if there is an error in the specified write concern, which should not be the case as it derives from the WFMS.

So in all cases, _state should be set from kRebuilding and the CV should be notified. If the chained continuations do run, then either the node is still primary and rebuilding continues, or the term has changed and therefore the node has stepped down, again allowing us to apply the argument from (1a)/(1b). So I don't see a case where the _stateChangedCV is not notified, even if stepdown happens quickly after stepup.

It is difficult to add deterministic assertions to verify this, though, because there is a race between the stepdown/shutdown thread setting the state and waking the CV, and the onStepUp thread waking from waitUntilMajority failing and setting the state to kRebuildFailed. This race should be harmless from a correctness perspective but makes it difficult to make the state-change ordering deterministic.

Note that this test addresses the bug concern described in the ticket, because it ensures that the thread calling lookupInstances() is woken by the stepdown thread even when stepdown occurs quickly after stepup and the onStepUp async future chain does not succeed. I think we have a few options:

-> Add some comments to the code explaining the above and all legal _state transitions, but leave things as they are. File a ticket to try to simplify state management.
-> Make the end of the onStepUp future chain set _state to kRebuildFailed for the purposes of clarity when reading. As mentioned above, this introduces a race where the stepUp thread can set the state to kRebuildFailed after stepdown sets it to kPaused, but this race should be harmless.

max.hirschhorn@mongodb.com esha.maharishi@mongodb.com Do you have any thoughts on what might be helpful here, or if this analysis is missing something/wrong? Thanks! |
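The "harmless race" in the second option above can be eliminated with the guarded transition Esha suggests elsewhere in this thread: only move to kRebuildFailed if the state is still kRebuilding, so a late stepup-thread error cannot overwrite the kPaused state set by stepdown. Here is a minimal Python model of that compare-and-set-under-mutex pattern; the enum members and method names are illustrative, not the server's actual API.

```python
import threading
from enum import Enum, auto

class State(Enum):
    REBUILDING = auto()
    PAUSED = auto()
    REBUILD_FAILED = auto()

class StateHolder:
    def __init__(self):
        self._mutex = threading.Lock()
        self._cv = threading.Condition(self._mutex)
        self.state = State.REBUILDING

    def set_state(self, new_state):
        # Unconditional transition, e.g. stepdown setting PAUSED.
        with self._cv:
            self.state = new_state
            self._cv.notify_all()

    def fail_rebuild_if_still_rebuilding(self):
        # Guarded transition under the mutex: a no-op if stepdown
        # already moved us out of REBUILDING.
        with self._cv:
            if self.state is State.REBUILDING:
                self.state = State.REBUILD_FAILED
                self._cv.notify_all()

holder = StateHolder()
holder.set_state(State.PAUSED)             # stepdown wins the race
holder.fail_rebuild_if_still_rebuilding()  # late stepup error: no overwrite
print(holder.state)  # State.PAUSED
```

Because both transitions take the same mutex, the stepup thread's late failure can never clobber a PAUSED state, which makes the state-change ordering deterministic enough to assert on in a test.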
| Comment by Esha Maharishi (Inactive) [ 13/Apr/22 ] |
|
Thanks max.hirschhorn@mongodb.com for pointing out that while SERVER-61717 will remove one reason to call opCtx->setAlwaysInterruptAtStepDownOrUp (caller can be waiting on an instance that is removed from the map), this ticket represents another reason to call opCtx->setAlwaysInterruptAtStepDownOrUp (caller is waiting on _rebuildCV) that would still need to be addressed. |
| Comment by Esha Maharishi (Inactive) [ 13/Apr/22 ] |
|
Just a note that SERVER-61717 will remove the need to call opCtx->setAlwaysInterruptAtStepDownOrUp. It isn't scheduled, but is a high priority item on the Serverless backlog. |
| Comment by Max Hirschhorn [ 11/Apr/22 ] |
|
blake.oler@mongodb.com, I believe this issue has been worked around in practice by having callers do opCtx->setAlwaysInterruptAtStepDownOrUp() before calling PrimaryOnlyService::lookupInstance() or PrimaryOnlyService::getOrCreateInstance() because they cannot rely on _waitForStateNotRebuilding() to eventually be satisfied. My impression has been we're looking to remove the pattern of opCtx->setAlwaysInterruptAtStepDownOrUp() from our C++ codebase and fixing PrimaryOnlyService's state transition lifecycle is one required piece of that. |
| Comment by Blake Oler [ 11/Apr/22 ] |
|
max.hirschhorn@mongodb.com has this been seen in the wild or in Evergreen? Also curious about relative impact of the issue for prioritization. |
| Comment by Max Hirschhorn [ 07/Apr/22 ] |
|
Reopening this ticket because PrimaryOnlyService::_state is still not being set to PrimaryOnlyService::State::kRebuildFailed when WaitForMajorityService::waitUntilMajority() returns an error future. I believe a similar case applies if newScopedExecutor is shut down. Note that while |
| Comment by Lauren Lewis (Inactive) [ 24/Feb/22 ] |
|
We haven’t heard back from you for at least one calendar year, so this issue is being closed. If this is still an issue for you, please provide additional information and we will reopen the ticket. |