[SERVER-53466] Race between PrimaryOnlyService::stepDown and _rebuildInstances Created: 21/Dec/20  Updated: 06/Dec/22  Resolved: 16/Mar/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: George Wangensteen Assignee: Backlog - Service Architecture
Resolution: Duplicate Votes: 0
Labels: servicearch-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-52849 PrimaryOnlyService _rebuildServices a... Closed
Assigned Teams:
Service Arch
Operating System: ALL
Participants:
Linked BF Score: 29

 Description   

The PrimaryOnlyService stores a list of the operation contexts running on its associated Client threads. When the host running the service steps down and PrimaryOnlyService::onStepDown is called, each operation context in that list is killed.

However, if a different thread is still managing the step-up process when onStepDown is called, that thread may be in the middle of running PrimaryOnlyService::_rebuildInstances. There, a new operation context associated with the POS is created and registered with the POS (i.e. inserted into its _opCtxs member) by the hooks in the PrimaryOnlyServiceClientObserver. If this operation context goes out of scope while the other thread is running onStepDown and trying to kill it, there is a race on the operation context's _baton member: the killing thread reads _baton, while the thread in which the opCtx has gone out of scope writes _baton in the chain of calls that starts with the opCtx's destructor.

To fix this, we could consider:
running the PrimaryOnlyServiceClientObserver's cleanup hooks, which remove the opCtx from the POS's list, before allowing the opCtx destructor to modify any of its state (i.e. swap the order of the call to opCtx->getBaton()->detach() and the line that invokes the hooks).
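
The ordering proposed above can be sketched with a toy model. This is only an illustration under assumptions, not the server's code: ToyService, ToyOpCtx, registerOpCtx, unregisterOpCtx and runDemo are hypothetical names, a std::shared_ptr<int> stands in for the baton, and a plain std::mutex stands in for whatever lock guards the real _opCtxs list. The point it shows is that if the destructor leaves the registry before it touches its own state, a concurrent onStepDown can never observe a context whose baton is being reset.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <thread>
#include <unordered_set>

// Hypothetical stand-ins; these are not the real MongoDB classes.
struct ToyOpCtx;

class ToyService {
public:
    void registerOpCtx(ToyOpCtx* opCtx) {
        std::lock_guard<std::mutex> lk(_mutex);
        _opCtxs.insert(opCtx);
    }
    void unregisterOpCtx(ToyOpCtx* opCtx) {
        std::lock_guard<std::mutex> lk(_mutex);
        _opCtxs.erase(opCtx);
    }
    // Kills every registered context; returns how many were killed.
    int onStepDown();

private:
    std::mutex _mutex;
    std::unordered_set<ToyOpCtx*> _opCtxs;
};

struct ToyOpCtx {
    explicit ToyOpCtx(ToyService& svc) : service(svc), baton(std::make_shared<int>(0)) {
        service.registerOpCtx(this);  // all members are initialized by now
    }
    ~ToyOpCtx() {
        service.unregisterOpCtx(this);  // first: run the "cleanup hook"
        baton.reset();                  // then: modify the context's own state
    }
    ToyService& service;
    std::shared_ptr<int> baton;  // assumed stand-in for _baton
    bool killed = false;
};

int ToyService::onStepDown() {
    std::lock_guard<std::mutex> lk(_mutex);
    int n = 0;
    for (ToyOpCtx* opCtx : _opCtxs) {
        // A registered context still owns its baton, because deregistration
        // happens before the reset in the destructor.
        assert(opCtx->baton);
        ++*opCtx->baton;  // the killer reads and uses the baton under the lock
        opCtx->killed = true;
        ++n;
    }
    return n;
}

// One thread repeatedly creates and destroys contexts (as _rebuildInstances
// might), while the calling thread repeatedly steps down. Returns the number
// of contexts still registered at the end.
int runDemo(int iterations) {
    ToyService service;
    std::thread rebuilder([&] {
        for (int i = 0; i < iterations; ++i) {
            ToyOpCtx opCtx(service);  // registered here, destroyed at scope exit
        }
    });
    for (int i = 0; i < iterations; ++i) {
        service.onStepDown();
    }
    rebuilder.join();
    return service.onStepDown();
}
```

Swapping the two statements in ~ToyOpCtx reproduces the shape of the reported problem: the baton would be reset while the context is still visible to onStepDown, which may be reading it at the same moment.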



 Comments   
Comment by George Wangensteen [ 16/Mar/21 ]

Given the acceptance criteria of the linked issue, fixing that ticket (SERVER-52849) will fix the race here, so I'm closing this issue as a duplicate of that one.

Comment by George Wangensteen [ 21/Dec/20 ]

cc matthew.saltz

Generated at Thu Feb 08 05:31:01 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.