[SERVER-68438] Fix PrimaryOnlyService race condition with the PrimaryOnlyServiceClientObserver Created: 29/Jul/22  Updated: 27/Oct/23  Resolved: 01/Aug/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Mathis Bessa Assignee: Esha Maharishi (Inactive)
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
is related to SERVER-51650 Primary-Only Service's _rebuildCV sho... Closed
Operating System: ALL
Steps To Reproduce:

Step-up on a secondary during a tenant migration.

Participants:
Linked BF Score: 131

 Description   

There is currently a race condition between the POS and the PrimaryOnlyServiceClientObserver.

When a new primary steps up, we transition from the kRebuilding to the kRunning state in the POS.

In this case, the instance starts running before we are able to transition from `kRebuilding` to `kRunning`. Because the OperationContext is created during the `run` of the PrimaryOnlyService, the PrimaryOnlyService will actually kill the OpCtx while it is still in that transition.

The operation context is killed because, during that transition, the PrimaryOnlyServiceClientObserver, which registers the OperationContext, checks the current state and finds that it is indeed `kRebuilding`. However, the second condition, which checks whether `allowOpCtxWhileRebuilding` is set to true, no longer holds, because the AllowOpCtxWhenServiceRebuildingBlock has gone out of scope and reset the `allowOpCtxWhileRebuilding` flag to false.
We end up in a state where we are rebuilding, no longer allow the OperationContext while rebuilding, and are not yet in the `kRunning` state.
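
The window described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the server code: `ToyService`, `AllowOpCtxBlock`, `opCtxWouldBeKilled`, and the two `replay...` functions are hypothetical names, and the interleaving is replayed sequentially in one thread so that the outcome is deterministic.

```cpp
// Hypothetical, simplified model of the race; NOT the actual PrimaryOnlyService code.
enum class State { kRebuilding, kRunning };

class ToyService {
public:
    // RAII guard modeled on the AllowOpCtxWhenServiceRebuildingBlock from the
    // description: sets the flag on construction, resets it to false on destruction.
    class AllowOpCtxBlock {
    public:
        explicit AllowOpCtxBlock(ToyService& svc) : _svc(svc) { _svc.setAllow(true); }
        ~AllowOpCtxBlock() { _svc.setAllow(false); }
        AllowOpCtxBlock(const AllowOpCtxBlock&) = delete;
        AllowOpCtxBlock& operator=(const AllowOpCtxBlock&) = delete;

    private:
        ToyService& _svc;
    };

    void setState(State s) { _state = s; }
    void setAllow(bool v) { _allowOpCtxWhileRebuilding = v; }

    // Models the check made by the client observer when an OperationContext is
    // created. Returns true if the new OpCtx would be killed.
    bool opCtxWouldBeKilled() const {
        if (_state == State::kRunning) return false;
        if (_state == State::kRebuilding && _allowOpCtxWhileRebuilding) return false;
        return true;
    }

private:
    State _state = State::kRebuilding;  // the service starts out rebuilding after step-up
    bool _allowOpCtxWhileRebuilding = false;
};

// Replays the problematic ordering: the guard is destroyed, the instance creates
// its OpCtx, and only afterwards does the service reach kRunning.
bool replayRacyOrdering() {
    ToyService svc;
    {
        ToyService::AllowOpCtxBlock allow(svc);
        // Rebuild work would happen here; an OpCtx created now would be allowed.
    }  // flag reset to false while the state is still kRebuilding
    bool killed = svc.opCtxWouldBeKilled();  // the instance's run() creates its OpCtx here
    svc.setState(State::kRunning);
    return killed;
}

// Replays the ordering in which the service reaches kRunning before the instance
// creates its OpCtx.
bool replaySafeOrdering() {
    ToyService svc;
    {
        ToyService::AllowOpCtxBlock allow(svc);
    }
    svc.setState(State::kRunning);
    return svc.opCtxWouldBeKilled();
}
```

In this model, `replayRacyOrdering()` returns `true` (the OpCtx would be killed) and `replaySafeOrdering()` returns `false`. The only difference between the two is whether the OpCtx is created before or after the transition to `kRunning`.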

Since the instance starts running before the POS state is able to transition from the `kRebuilding` state to the `kRunning` state, its OperationContext is killed.



 Comments   
Comment by Mathis Bessa [ 01/Aug/22 ]

After carefully reviewing the issue, we decided to close this ticket, since reverting SERVER-51650 will fix the bug that was introduced (there was no pre-existing issue; sorry for the confusion).

We are also going to work on SERVER-68473, which will add a code comment that better explains the logic.

The revert of SERVER-51650 will start shortly.

Comment by George Wangensteen [ 01/Aug/22 ]

If SERVER-51650 has made the BFs more frequent, I'm fine with reverting it until we fix the BF. Happy to do so, checking with Mathis.

Comment by Jason Chan [ 01/Aug/22 ]

Should we revert SERVER-51650 for now to prevent further BFs from being generated? george.wangensteen@mongodb.com

Comment by Suganthi Mani [ 29/Jul/22 ]

I don't see any reason for POS to not allow instances to create the opCtx while POS is in rebuilding state.

Comment by Suganthi Mani [ 29/Jul/22 ]

Note: This commit made the race condition more frequent. We need to fix this sooner as we have lots of BF failures.
