[SERVER-66351] Audit uses of OperationContext::setAlwaysInterruptAtStepDownOrUp Created: 10/May/22 Updated: 20/Jul/23 |
|
| Status: | Open |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | George Wangensteen | Assignee: | Backlog - Service Architecture |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
| Assigned Teams: | Service Arch |
| Sprint: | Service Arch 2023-05-01, Service Arch 2023-05-15 |
| Participants: |
| Linked BF Score: | 165 |
| Description |
|
The function OperationContext::setAlwaysInterruptAtStepDownOrUp can be used to request that an operation context be interrupted/killed on a replication state change. However, the function is dangerous to use because it is not synchronized internally with the RSTL and does not require the caller to be synchronized with the RSTL. This means that, unless the caller holds the RSTL, a node's view of its replication state may change while the function is being called. The result is a subtle race: an operation can be registered to be killed only after the state change has already occurred, so it is never interrupted for that state change. Here's an example of such a race:
Because of this, we plan to attempt to deprecate this function, and at a minimum will rename it to emphasize that it is unsynchronized with the RSTL and should only be used in the very rare cases where callers are OK with registering for interrupt concurrently with a replication state change.
Generally speaking, operations that want to be interrupted during a replication state change should hold the RSTL to ensure the node has a consistent view of its replication state while they register for interruption. For more complicated or asynchronous operations, higher-level APIs are available to assist with this: ReplicaSetAwareService and PrimaryOnlyService both provide facilities that are safely synchronized with the RSTL for operations that need to be notified/interrupted on replication state change. We believe most existing uses of OperationContext::setAlwaysInterruptAtStepDownOrUp could be replaced by one of these 'safe' patterns. In this ticket, we would like to audit existing uses of the function, identify which of the safe patterns is suitable for each, and come up with a path forward to switch to it, or find out why the safe patterns won't work for a specific use case and document that reasoning. |
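
To make the difference between unsynchronized and RSTL-synchronized registration concrete, here is a minimal toy model. It is a sketch only and does not use the real server code: ToyOperation, ToyReplState, registerUnderLock, and the three scenario functions are hypothetical names invented for this illustration, and a plain std::mutex stands in for the RSTL. The bad interleaving is replayed sequentially on one thread so the outcome is deterministic.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for an operation; NOT MongoDB's OperationContext.
struct ToyOperation {
    bool killed = false;
};

// Hypothetical stand-in for a node's replication state.
// The mutex plays the role of the RSTL in this toy model.
class ToyReplState {
public:
    std::mutex rstl;

    bool isPrimary() const { return _primary; }

    // Adds the operation to the set that the next state change will kill.
    // Takes no lock itself, mirroring an unsynchronized registration call.
    void registerForInterrupt(std::shared_ptr<ToyOperation> op) {
        _registered.push_back(std::move(op));
    }

    // State change: flips the state and kills everything registered so far.
    void stepDown() {
        std::lock_guard<std::mutex> lk(rstl);
        _primary = false;
        for (auto& op : _registered) {
            op->killed = true;
        }
        _registered.clear();
    }

private:
    bool _primary = true;
    std::vector<std::shared_ptr<ToyOperation>> _registered;
};

// Safe pattern: check the state and register while holding the lock, so a
// step-down cannot slip in between the check and the registration.
bool registerUnderLock(ToyReplState& node, std::shared_ptr<ToyOperation> op) {
    std::lock_guard<std::mutex> lk(node.rstl);
    if (!node.isPrimary()) {
        return false;  // State already changed; caller must not proceed.
    }
    node.registerForInterrupt(std::move(op));
    return true;
}

// Racy interleaving, replayed sequentially: the operation observes "primary",
// the step-down completes, and only then does the operation register.
bool lateRegistrationMissesStepDown() {
    ToyReplState node;
    auto op = std::make_shared<ToyOperation>();
    bool sawPrimary = node.isPrimary();  // unsynchronized check
    node.stepDown();                     // kills registered ops (none yet)
    node.registerForInterrupt(op);       // too late for this state change
    return sawPrimary && !op->killed;    // op thinks it is on a primary, alive
}

// Registration under the lock before the step-down: the op is killed.
bool lockedRegistrationIsInterrupted() {
    ToyReplState node;
    auto op = std::make_shared<ToyOperation>();
    bool registered = registerUnderLock(node, op);
    node.stepDown();
    return registered && op->killed;
}

// Registration under the lock after the step-down: the op is rejected up front.
bool lockedRegistrationAfterStepDownIsRejected() {
    ToyReplState node;
    auto op = std::make_shared<ToyOperation>();
    node.stepDown();
    bool registered = registerUnderLock(node, op);
    return !registered && !op->killed;
}
```

In this model the locked path has only two outcomes: the operation registers before the state change and is killed by it, or it sees the new state and declines to proceed. The unsynchronized path admits a third outcome, in which the operation keeps running under a stale view of the node's state and is never interrupted for the state change it missed.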