[SERVER-51650] Primary-Only Service's _rebuildCV should be notified even if stepdown happens quickly after stepup Created: 15/Oct/20  Updated: 29/Oct/23  Resolved: 07/Jun/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Esha Maharishi (Inactive) Assignee: Wenbin Zhu
Resolution: Fixed Votes: 0
Labels: lowcontext, servicearch-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Duplicate
Problem/Incident
Related
related to SERVER-62682 PrimaryOnlyService Does Not Call _reb... Closed
related to SERVER-65469 Don't invariant in PrimaryOnlyService... Closed
related to SERVER-68438 Fix PrimaryOnlyService race condition... Closed
related to SERVER-68476 Revert 11ed931625c685b453a5244553f1d9... Closed
related to SERVER-66351 Audit uses of OperationContext::setAl... Open
is related to SERVER-61717 Ensure a POS instance remains in the ... Open
is related to SERVER-50982 PrimaryOnlyService::lookupInstance sh... Closed
is related to SERVER-51518 Reconcile how PrimaryOnlyService::loo... Closed
Assigned Teams:
Service Arch
Backwards Compatibility: Fully Compatible
Backport Requested:
v6.1
Sprint: Service Arch 2022-05-30, Service Arch 2022-06-13, Service Arch 2022-06-27, Service Arch 2022-07-11, Service Arch 2022-07-25, Service Arch 2023-02-20, Service Arch 2023-03-06, Service Arch 2023-05-01, Service Arch 2023-05-15, Service Arch 2023-05-29, Service Arch 2023-06-12
Participants:
Linked BF Score: 131
Story Points: 4

 Description   

When stepup completes, each Primary-Only Service's _state is set to kRebuilding.

Both PrimaryOnlyService::lookupInstance and PrimaryOnlyService::getOrCreateInstance wait until the _rebuildCV condition variable is notified and the _state is no longer kRebuilding.

_rebuildCV is notified in PrimaryOnlyService::_rebuildInstances, which on stepup is scheduled to run asynchronously.

If stepdown occurs before _rebuildInstances starts, e.g. if stepdown occurs here, then _rebuildCV may never be notified. So, any threads blocking in lookupInstance or getOrCreateInstance that don't get interrupted by stepdown will block indefinitely.

Currently, there is an invariant in lookupInstance that the calling thread is guaranteed to be interrupted by stepdown; otherwise, a thread holding the RSTL lock would prevent the stepdown from completing, leading to a deadlock.

It would be better to notify _rebuildCV here to guarantee threads cannot block indefinitely in lookup or getOrCreateInstance.
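
To make the failure mode concrete, here is a stripped-down sketch using only standard library primitives; the class, member, and state names are simplified stand-ins, not the actual PrimaryOnlyService code. The waiter blocks until the state leaves kRebuilding, so the stepdown path must also change the state and notify the condition variable:

    #include <condition_variable>
    #include <mutex>

    // Simplified stand-in for PrimaryOnlyService; names and states are illustrative.
    class MiniPrimaryOnlyService {
    public:
        enum class State { kPaused, kRebuilding, kRunning, kRebuildFailed };

        // Roughly what lookupInstance/getOrCreateInstance do first: block until
        // the service is no longer rebuilding.
        void waitForStateNotRebuilding() {
            std::unique_lock<std::mutex> lk(_mutex);
            _rebuildCV.wait(lk, [&] { return _state != State::kRebuilding; });
        }

        void onStepUp() {
            setState(State::kRebuilding);
            // The real service then schedules _rebuildInstances() asynchronously;
            // that is the only place _rebuildCV gets notified on the stepup path.
        }

        // The fix suggested in the ticket: stepdown must also move the state off
        // kRebuilding and notify, so that waiters cannot block indefinitely when
        // the asynchronous rebuild never runs.
        void onStepDown() {
            setState(State::kPaused);
        }

    private:
        void setState(State newState) {
            std::lock_guard<std::mutex> lk(_mutex);
            _state = newState;
            _rebuildCV.notify_all();
        }

        std::mutex _mutex;
        std::condition_variable _rebuildCV;
        State _state = State::kPaused;
    };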

 

Acceptance criteria: 

Reproduce issue in unit test
Fix as suggested 



 Comments   
Comment by Githook User [ 07/Jun/23 ]

Author:

{'name': 'Wenbin Zhu', 'email': 'wenbin.zhu@mongodb.com', 'username': 'WenbinZhu'}

Message: SERVER-51650 Ensure PrimaryOnlyService state gets updated in case of rebuild failure.
Branch: master
https://github.com/mongodb/mongo/commit/f0e996cbf2cee788498ea04d2586cc1a851fbaf4

Comment by George Wangensteen [ 24/Aug/22 ]

The commit was reverted and therefore isn't present in the master or 6.1 branches. (The 6.1.0-rc0 in the fixVersion field cannot be cleared.)

Comment by Suganthi Mani [ 17/Aug/22 ]

Since the fix got reverted, don't we need to reopen this ticket and clear the fix version? It currently gives the wrong impression that a fix for this ticket is in 6.1.0-rc0.

Comment by Githook User [ 02/Aug/22 ]

Author:

{'name': 'mathisbessamdb', 'email': 'mathis.bessa@mongodb.com', 'username': 'mathisbessamdb'}

Message: Revert "SERVER-51650 Ensure failure to rebuild PrimaryOnlyService on step up results in state change"
Branch: master
https://github.com/mongodb/mongo/commit/ea9f1be226f4611f62e8070bfc1ce9e5ef461b47

Comment by Mathis Bessa [ 01/Aug/22 ]

We are reverting this due to a race condition related to the operation context that the change introduced. As mentioned above, more details can be found in SERVER-68438.

Comment by George Wangensteen [ 01/Aug/22 ]

Reverting because this change makes the bug documented in SERVER-68438 occur more frequently in the waterfall.

Comment by Githook User [ 13/Jul/22 ]

Author:

{'name': 'George Wangensteen', 'email': 'george.wangensteen@mongodb.com', 'username': 'gewa24'}

Message: SERVER-51650 Ensure failure to rebuild PrimaryOnlyService on step up results in state change
Branch: master
https://github.com/mongodb/mongo/commit/11ed931625c685b453a5244553f1d97c81b80850

Comment by Esha Maharishi (Inactive) [ 15/Jun/22 ]

max.hirschhorn@mongodb.com makes sense, thanks.

george.wangensteen@mongodb.com, ah I see, good point. So the continuations might run even if the error was due to stepdown. Maybe we could check whether the state is still kRebuilding and, only if so, call _setState(kRebuildFailed)?
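
A minimal sketch of that suggestion, with simplified stand-ins (globals and std:: primitives rather than the real members and future chain):

    #include <condition_variable>
    #include <mutex>

    // Simplified stand-ins for the relevant PrimaryOnlyService members.
    enum class State { kPaused, kShutdown, kRebuilding, kRebuildFailed };

    std::mutex gMutex;
    std::condition_variable gStateChangedCV;
    State gState = State::kRebuilding;

    // Error continuation of the rebuild chain: only record the failure if
    // stepdown/shutdown has not already moved the state off kRebuilding.
    void onRebuildError() {
        std::lock_guard<std::mutex> lk(gMutex);
        if (gState == State::kRebuilding) {
            gState = State::kRebuildFailed;
            gStateChangedCV.notify_all();
        }
    }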

Comment by George Wangensteen [ 15/Jun/22 ]

Ah, thanks for that explanation esha.maharishi@mongodb.com.

I think the race might still be possible, though, because onStepDown (and therefore interruptInstances, and therefore the shutdown of scopedExecutor) will be called after the node has transitioned out of Primary, whereas the future returned by waitUntilMajority could be set with an error any time after the WaitForMajorityService's call to waitForWriteConcern fails. In particular, the POS is informed of the stepdown by the ReplicaSetAwareService infrastructure in ReplicationCoordinatorImpl::_performPostMemberStateUpdateAction, which is called in ReplicationCoordinatorImpl::stepdown after calls to awaitReplication are set with an error here via the _replicationWaiterList. So I think it's possible for the waitUntilMajority future to be set with an error (and therefore for continuations scheduled on it to run) before the POS::onStepDown logic runs and shuts down the executor.

I definitely could be missing something from this analysis, though, as my knowledge of the ReplicationCoordinator is limited, so let me know if something sounds off. To ensure there would be no race between the onStepDown/onStepUp threads, I think we'd need to make sure that the POS::onStepDown logic runs before it is possible for the waitUntilMajority future to be set with an error due to stepdown.

Comment by Max Hirschhorn [ 14/Jun/22 ]

Max Hirschhorn, are there cases you know of where waitUntilMajority would return an error besides stepdown/shutdown?

There weren't other cases from what I had peeked at in the wait_for_majority_service.cpp implementation.

Comment by Esha Maharishi (Inactive) [ 13/Jun/22 ]

max.hirschhorn@mongodb.com, are there cases you know of where waitUntilMajority would return an error besides stepdown/shutdown?

george.wangensteen@mongodb.com, are you sure there would be a race between the stepdown/shutdown thread and the onStepUp thread? The onStepUp thread's continuations run on _newScopedExecutor, which is a reference to _scopedExecutor, and the stepdown/shutdown thread would shut down _scopedExecutor. So, the continuations wouldn't run on stepdown/shutdown. The continuations would only set the state to kRebuildFailed and notify _stateChangedCV if waitUntilMajority returned an error besides stepdown/shutdown.

Comment by George Wangensteen [ 13/Jun/22 ]

I looked into this and chatted with max.hirschhorn@mongodb.com and janna.golden@mongodb.com about it a bit. I think this is no longer a bug, but the situation has become a bit convoluted to reason about. Essentially:

(1) If the continuations chained with .then() after the call to waitUntilMajority do not run, then either:

    (1a) the associated cancellation source _source was cancelled

    (1b) the newScopedExecutor was shut down

    (1c) the WaitForMajorityService was shut down

    (1d) waitForWriteConcern returned an error status.

(1a) and (1b) can happen only if the PrimaryOnlyService's onStepDown or shutdown has been called, and both of those functions currently move the state off kRebuilding and wake the CV.

(1c) can happen only on process shutdown, and occurs after the replication coordinator steps down/shuts down, which triggers the POS stepdown/shutdown logic, so the same argument as above applies.

(1d) can occur only if there is a stepdown while waiting for replication (in which case the argument from (1a)/(1b) applies), or if there is an error in the specified write concern, which should not be the case as it derives from the WFMS.

So in all cases, _state should be moved off kRebuilding and the CV should be notified. If the chained continuations do run, then either the node is still primary and rebuilding continues, or the term has changed and therefore the node has stepped down, again allowing us to apply the argument from (1a)/(1b). So I don't see a case where _stateChangedCV is not notified, even if stepdown happens quickly after stepup.

It is difficult to add deterministic assertions to verify this, though, because there is a race between the stepdown/shutdown thread setting the state and waking the CV, and the onStepUp thread waking from the failed waitUntilMajority and setting the state to kRebuildFailed. This race should be harmless from a correctness perspective, but it makes it difficult to make the state-change ordering deterministic. Note that this test addresses the bug concern described in the ticket, because it ensures that the thread calling lookupInstances() is woken by the stepdown thread even when stepdown occurs quickly after stepup or the onStepUp async future chain does not succeed.

I think we have a few options:

-> Add some comments to the code explaining the above and all legal _state transitions, but leave things as they are. File a ticket to try and simplify state management.

-> Make the end of the onStepUp future chain set _state to kRebuildFailed for the purposes of clarity when reading. As mentioned above, this introduces a race where the stepUp thread can set the state to kRebuildFailed after stepdown sets it to kPaused, but this race should be harmless.

max.hirschhorn@mongodb.com esha.maharishi@mongodb.com Do you have any thoughts on what might be helpful here or if this analysis is missing something/wrong? Thanks!
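
For illustration, the scenario such a test has to cover can be sketched with plain std::thread primitives; the names and structure below are hypothetical, not the server's actual unit test:

    #include <cassert>
    #include <chrono>
    #include <condition_variable>
    #include <mutex>
    #include <thread>

    // Illustrative two-thread scenario: a waiter blocks while the state is
    // kRebuilding, the asynchronous rebuild never completes, and the stepdown
    // thread must be the one that wakes it.
    enum class State { kPaused, kRebuilding };

    int main() {
        std::mutex mutex;
        std::condition_variable rebuildCV;
        State state = State::kRebuilding;  // stepup just happened

        std::thread waiter([&] {
            // Stands in for a caller blocked in lookupInstance/getOrCreateInstance.
            std::unique_lock<std::mutex> lk(mutex);
            rebuildCV.wait(lk, [&] { return state != State::kRebuilding; });
        });

        std::thread stepdown([&] {
            // Stepdown arrives before _rebuildInstances ever runs; it must still
            // change the state and notify, or the waiter above hangs forever.
            std::this_thread::sleep_for(std::chrono::milliseconds(50));
            std::lock_guard<std::mutex> lk(mutex);
            state = State::kPaused;
            rebuildCV.notify_all();
        });

        waiter.join();
        stepdown.join();
        assert(state == State::kPaused);
        return 0;
    }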

Comment by Esha Maharishi (Inactive) [ 13/Apr/22 ]

Thanks max.hirschhorn@mongodb.com for pointing out that while SERVER-61717 will remove one reason to call opCtx->setAlwaysInterruptAtStepDownOrUp (caller can be waiting on an instance that is removed from the map), this ticket represents another reason to call opCtx->setAlwaysInterruptAtStepDownOrUp (caller is waiting on _rebuildCV) that would still need to be addressed.

Comment by Esha Maharishi (Inactive) [ 13/Apr/22 ]

Just a note that SERVER-61717 will remove the need to call opCtx->setAlwaysInterruptAtStepDownOrUp. It isn't scheduled, but is a high priority item on the Serverless backlog.

Comment by Max Hirschhorn [ 11/Apr/22 ]

blake.oler@mongodb.com, I believe this issue has been worked around in practice by having callers do opCtx->setAlwaysInterruptAtStepDownOrUp() before calling PrimaryOnlyService::lookupInstance() or PrimaryOnlyService::getOrCreateInstance(), because they cannot rely on _waitForStateNotRebuilding() to eventually be satisfied. My impression has been that we're looking to remove the pattern of opCtx->setAlwaysInterruptAtStepDownOrUp() from our C++ codebase, and fixing PrimaryOnlyService's state transition lifecycle is one required piece of that.

Comment by Blake Oler [ 11/Apr/22 ]

max.hirschhorn@mongodb.com, has this been seen in the wild or in Evergreen? Also curious about the relative impact of the issue for prioritization.

Comment by Max Hirschhorn [ 07/Apr/22 ]

Reopening this ticket because PrimaryOnlyService::_state is still not being set to PrimaryOnlyService::State::kRebuildFailed when WaitForMajorityService::waitUntilMajority() returns an error future. I believe a similar case applies if newScopedExecutor is shut down.

Note that while SERVER-62682 addressed cases where PrimaryOnlyService::_stateChangeCV wasn't being notified, it didn't improve the situation described in SERVER-51650. The future chain must guarantee PrimaryOnlyService::_state != PrimaryOnlyService::State::kRebuilding at the end (i.e. on success or failure) and ought to additionally use PrimaryOnlyService::_setState() to ensure it wakes up any waiters.
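
A minimal sketch of that guarantee, with simplified stand-ins (a plain function call and try/catch in place of the real future chain, executor, and _setState()):

    #include <condition_variable>
    #include <mutex>

    // Simplified stand-ins; the real chain is built on WaitForMajorityService
    // futures and a scoped executor, not a plain function call and try/catch.
    enum class State { kRebuilding, kRunning, kRebuildFailed };

    std::mutex gMutex;
    std::condition_variable gStateChangeCV;
    State gState = State::kRebuilding;

    // Stand-in for _setState(): every transition wakes any waiters.
    void setState(State newState) {
        std::lock_guard<std::mutex> lk(gMutex);
        gState = newState;
        gStateChangeCV.notify_all();
    }

    // Stand-in for the work done after waitUntilMajority() resolves.
    void rebuildInstances() { /* load state documents, recreate instances, ... */ }

    // Terminal step of the (simplified) rebuild chain: whether the work succeeds
    // or fails, the state must end up != kRebuilding so waiters are woken.
    void runRebuildChain() {
        try {
            rebuildInstances();
            setState(State::kRunning);
        } catch (...) {
            setState(State::kRebuildFailed);
        }
    }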

Comment by Lauren Lewis (Inactive) [ 24/Feb/22 ]

We haven’t heard back from you for at least one calendar year, so this issue is being closed. If this is still an issue for you, please provide additional information and we will reopen the ticket.
