[SERVER-68438] Fix PrimaryOnlyService race condition with the PrimaryOnlyServiceClientObserver Created: 29/Jul/22 Updated: 27/Oct/23 Resolved: 01/Aug/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Mathis Bessa | Assignee: | Esha Maharishi (Inactive) |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||
| Operating System: | ALL | ||||||||||||
| Steps To Reproduce: | Step-up on a secondary during a tenant migration. |
||||||||||||
| Participants: | |||||||||||||
| Linked BF Score: | 131 | ||||||||||||
| Description |
|
There is currently a race condition between the PrimaryOnlyService (POS) and the PrimaryOnlyServiceClientObserver. When a new primary steps up, the POS transitions from the `kRebuilding` state to the `kRunning` state. An instance can start running before that transition completes. Because the instance creates its OperationContext during the `run` of the PrimaryOnlyService, the PrimaryOnlyService kills the OpCtx if it is created while the transition is still in progress. The OpCtx is killed because the PrimaryOnlyServiceClientObserver, which registers the OperationContext, checks the current state and finds that it is still `kRebuilding`. The second condition, which checks whether `allowOpCtxWhileRebuilding` is set to true, no longer holds due to the Since the instance starts running before the POS state is able to transition from the `kRebuilding` state to the `kRunning` state, |
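
The ordering problem described above can be sketched with a small, single-threaded model. This is not the server code: `ServiceModel`, `OpCtx`, `finishRebuild`, `createOpCtx` and the two scenario helpers are hypothetical names invented for illustration, and only the two states and the `allowOpCtxWhileRebuilding` flag mentioned in the description are modelled. The sketch assumes the observer kills an OpCtx exactly when the state is `kRebuilding` and the flag is false.

```cpp
// Simplified stand-in for the POS state machine; only the two states named in
// the description are modelled.
enum class State { kRebuilding, kRunning };

// Hypothetical stand-in for an OperationContext.
struct OpCtx {
    bool killed = false;
};

// Hypothetical model of the service plus the check its client observer performs.
class ServiceModel {
public:
    // Models the kRebuilding -> kRunning transition that follows step-up.
    void finishRebuild() { _state = State::kRunning; }

    // Models the observer's check when an OperationContext is registered:
    // the OpCtx is killed if the service is still rebuilding and the client
    // was not explicitly allowed to create one during the rebuild.
    OpCtx createOpCtx(bool allowOpCtxWhileRebuilding) const {
        OpCtx opCtx;
        if (_state == State::kRebuilding && !allowOpCtxWhileRebuilding) {
            opCtx.killed = true;
        }
        return opCtx;
    }

private:
    State _state = State::kRebuilding;  // state right after step-up
};

// The instance's run() wins the race: the OpCtx is created before the transition.
inline bool opCtxKilledWhenInstanceRunsFirst() {
    ServiceModel service;
    OpCtx opCtx = service.createOpCtx(/*allowOpCtxWhileRebuilding=*/false);
    service.finishRebuild();
    return opCtx.killed;
}

// The transition wins the race: the OpCtx is created once the state is kRunning.
inline bool opCtxKilledWhenTransitionRunsFirst() {
    ServiceModel service;
    service.finishRebuild();
    OpCtx opCtx = service.createOpCtx(/*allowOpCtxWhileRebuilding=*/false);
    return opCtx.killed;
}
```

In this model the first helper returns true and the second returns false, so whether the OpCtx survives depends only on which side reaches its step first. The real service is multi-threaded, so the interleaving there is decided by scheduling rather than by call order.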
| Comments |
| Comment by Mathis Bessa [ 01/Aug/22 ] |
|
After carefully reviewing the issue, we decided to close this ticket since reverting We are also going to work on The revert of |
| Comment by George Wangensteen [ 01/Aug/22 ] |
|
If |
| Comment by Jason Chan [ 01/Aug/22 ] |
|
Should we revert |
| Comment by Suganthi Mani [ 29/Jul/22 ] |
|
I don't see any reason for POS to not allow instances to create the opCtx while POS is in rebuilding state. |
| Comment by Suganthi Mani [ 29/Jul/22 ] |
|
Note: This commit made the race condition more frequent. We need to fix this sooner as we have lots of BF failures. |