Type: Task
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:
- sa-backlog

Assigned Teams:

Server Programmability
Sprint:
Service Arch 2023-05-01, Service Arch 2023-05-15
Linked BF Score:
165
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

The function OperationContext::setAlwaysInterruptAtStepDownOrUp can be used to request that an operation context is interrupted/killed on replication state change.

However, the function is dangerous to use because it neither synchronizes internally with the RSTL (replication state transition lock) nor requires the caller to hold it. Unless the caller holds the RSTL, the node's view of its replication state may change while the function is being called. This causes subtle races, because an operation may be registered to be killed only after the state change has already occurred. Here's an example of such a race:

1. Thread enters setAlwaysInterruptAtStepDownOrUp() on Op without holding the RSTL // Op not yet registered to be killed
2. Node steps down
3. setAlwaysInterruptAtStepDownOrUp() completes // Op registered to be killed now
4. Op continues running anyway, because it was registered only after the step-down had already killed the registered operations
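
The interleaving above can be replayed deterministically with a small toy model. This is only a sketch: ToyOp, ToyNode, and opSurvivesStepDown are hypothetical names invented for illustration, not MongoDB classes, and the sequence runs in a single thread so that the ordering is fixed rather than left to a scheduler.

```cpp
#include <cassert>
#include <vector>

// Toy model only: these types are NOT MongoDB's real classes.
struct ToyOp {
    bool interruptOnStateChange = false;  // set when the op registers for interruption
    bool killed = false;                  // set when a state change kills the op
};

struct ToyNode {
    bool isPrimary = true;
    std::vector<ToyOp*> ops;

    // Models a step-down: only ops that are registered at this moment are killed.
    void stepDown() {
        isPrimary = false;
        for (ToyOp* op : ops) {
            if (op->interruptOnStateChange) {
                op->killed = true;
            }
        }
    }
};

// Replays the registration and the step-down in a fixed order.
// Returns true if the op is still running after the step-down.
bool opSurvivesStepDown(bool registerBeforeStepDown) {
    ToyNode node;
    ToyOp op;
    node.ops.push_back(&op);

    if (registerBeforeStepDown) {
        op.interruptOnStateChange = true;  // registration wins the race
    }
    node.stepDown();
    if (!registerBeforeStepDown) {
        op.interruptOnStateChange = true;  // registration completes too late
    }
    return !op.killed;
}
```

In this model, opSurvivesStepDown(false) corresponds to the race in the ticket (the op keeps running), while opSurvivesStepDown(true) corresponds to the ordering in which the op is killed as intended.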

Because of this, we plan to attempt to deprecate this function. At a minimum, we will rename it to emphasize that it is unsynchronized with the RSTL and should only be used in the very rare cases where callers are OK with registering for interrupt concurrently with a replication state change.

Generally speaking, operations that want to be interrupted during replication state change should hold the RSTL to ensure the node has a consistent view of its replication state while they register for interruption. For more complicated or asynchronous operations, higher-level APIs are available to assist with this: ReplicaSetAwareService and PrimaryOnlyService both provide facilities that are safely synchronized with the RSTL for operations that need to be notified/interrupted on replication state change.
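
The idea of registering while holding the lock can be sketched with the same kind of toy model. Everything here is an assumption made for illustration: GuardedOp, GuardedNode, and opKilledInEitherOrder are hypothetical names, a plain std::mutex stands in for the RSTL (the real RSTL is a lock-manager resource, not a mutex), and killing a late-registering op immediately is one possible policy, not necessarily what the server does.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Toy model only: these types are NOT MongoDB's real classes.
struct GuardedOp {
    bool interruptOnStateChange = false;
    bool killed = false;
};

class GuardedNode {
public:
    // Checks the replication state and registers the op under one lock, so a
    // step-down cannot slip in between the check and the registration.
    // Returns false (and kills the op) if the node has already stepped down.
    bool registerForInterrupt(GuardedOp& op) {
        std::lock_guard<std::mutex> lk(_stateLock);
        if (!_isPrimary) {
            op.killed = true;  // assumed policy: a late registrant is killed at once
            return false;
        }
        op.interruptOnStateChange = true;
        _ops.push_back(&op);
        return true;
    }

    // Models a step-down: takes the same lock, then kills every registered op.
    void stepDown() {
        std::lock_guard<std::mutex> lk(_stateLock);
        _isPrimary = false;
        for (GuardedOp* op : _ops) {
            if (op->interruptOnStateChange) {
                op->killed = true;
            }
        }
    }

private:
    std::mutex _stateLock;  // stands in for the RSTL in this sketch
    bool _isPrimary = true;
    std::vector<GuardedOp*> _ops;
};

// Runs registration and step-down in either order and reports whether the op
// ended up killed.
bool opKilledInEitherOrder(bool stepDownFirst) {
    GuardedNode node;
    GuardedOp op;
    if (stepDownFirst) {
        node.stepDown();
        node.registerForInterrupt(op);
    } else {
        node.registerForInterrupt(op);
        node.stepDown();
    }
    return op.killed;
}
```

Because the state check and the registration happen under the same lock that the step-down takes, the op is killed whichever side acquires the lock first; there is no window in which it registers against a stale view of the state.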

We believe most existing uses of OperationContext::setAlwaysInterruptAtStepDownOrUp could be replaced by one of these 'safe' patterns. In this ticket, we would like to audit existing uses of the function, identify which pattern suits each one, and come up with a path forward to migrate it. Where no pattern works for a specific use case, we should find out why and document that reasoning.

is related to

SERVER-51650 Primary-Only Service's _rebuildCV should be notified even if stepdown happens quickly after stepup

Closed

SERVER-61717 Ensure a POS instance remains in the POS map until the instance's run() is complete

Open

related to

SERVER-50486 invokeWithSessionCheckedOut being called on prepared transactions on secondaries

Closed

SERVER-58246 Commands flagged as 'never allowed on secondaries' can proceed running after a node steps down from primary

Closed

SERVER-59108 Resolve race with transaction operation not killed after step down

Closed

Assignee: Unassigned
Reporter: George Wangensteen (Inactive)
Participants: George Wangensteen
Votes: 0
Watchers: 6

Created: May 10 2022 02:49:08 PM UTC
Updated: Oct 23 2024 03:47:26 PM UTC
