Core Server / SERVER-66351

Audit uses of OperationContext::setAlwaysInterruptAtStepDownOrUp

    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • Labels:
    • Service Arch
    • Service Arch 2023-05-01, Service Arch 2023-05-15
    • 165

      The function OperationContext::setAlwaysInterruptAtStepDownOrUp can be used to request that an operation context is interrupted/killed on replication state change.

      However, the function is dangerous to use because it is not synchronized internally with the RSTL (the replication state transition lock), and it does not require the caller to hold the RSTL either. Unless the caller holds the RSTL, the node's view of its replication state may change while the function is being called. This causes subtle races in which an operation is registered to be killed only after the state change has already occurred. Here is an example of such a race:


      1. Thread calls setAlwaysInterruptAtStepDownOrUp() on Op without holding the RSTL // Op not yet registered to be killed
      2. Node steps down
      3. setAlwaysInterruptAtStepDownOrUp() completes // Op is now registered to be killed
      4. Op continues running anyway, because it was registered only after the step down had already happened
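The interleaving above can be replayed deterministically in a toy model. This is only a sketch: `ToyOp`, `ToyNode`, and `replayRace` are hypothetical stand-ins invented for illustration, not the real `OperationContext` or replication coordinator, and the flag assignment merely stands in for what setAlwaysInterruptAtStepDownOrUp does.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for an operation; not the real OperationContext.
struct ToyOp {
    bool interruptOnStateChange = false;  // set once registration completes
    bool killed = false;
};

// Hypothetical stand-in for a replica set node.
struct ToyNode {
    bool primary = true;
    std::vector<ToyOp*> ops;

    // Models a step down: kills only the ops that are registered right now.
    void stepDown() {
        primary = false;
        for (ToyOp* op : ops) {
            if (op->interruptOnStateChange) {
                op->killed = true;
            }
        }
    }
};

// Replays steps 1-4 in order and reports whether the op was killed.
bool replayRace() {
    ToyNode node;
    ToyOp op;
    node.ops.push_back(&op);

    // Step 1: registration has started, but the flag is not set yet.
    // Step 2: the node steps down and finds nothing registered to kill.
    node.stepDown();
    assert(!node.primary);

    // Step 3: registration completes, after the state change.
    op.interruptOnStateChange = true;

    // Step 4: the op was never killed, so it keeps running.
    return op.killed;
}
```

In this model `replayRace()` returns `false`: the op ends up registered, yet nothing ever interrupts it for the step down it was meant to observe.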


      Because of this, we plan to attempt to deprecate this function. At a minimum, we will rename it to emphasize that it is unsynchronized with the RSTL and should only be used in the very rare cases where callers are OK with registering for interrupt concurrently with a replication state change.


      Generally speaking, operations that want to be interrupted during a replication state change should hold the RSTL while they register for interruption, so that the node has a consistent view of its replication state. For more complicated or asynchronous operations, higher-level APIs are available to assist with this: ReplicaSetAwareService and PrimaryOnlyService both provide facilities that are safely synchronized with the RSTL for operations that need to be notified or interrupted on replication state change.
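A minimal sketch of the "register while holding the lock" idea, under loud assumptions: a plain `std::mutex` stands in for the RSTL (the real RSTL is a multi-mode lock, not a mutex), and `ToyNode`, `ToyOp`, and `registerForInterrupt` are hypothetical names, not server APIs. The check of the primary flag under the lock is also an assumption about what a caller would do with its consistent view of the state.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Hypothetical stand-in for an operation; not the real OperationContext.
struct ToyOp {
    bool interruptOnStateChange = false;
    bool killed = false;
};

// Hypothetical node whose state changes and registrations share one lock.
class ToyNode {
public:
    // Registers op for interruption while holding the toy RSTL.
    // Returns false if the node has already stepped down, so the caller
    // can abort instead of running unregistered.
    bool registerForInterrupt(ToyOp& op) {
        std::lock_guard<std::mutex> lk(_rstl);
        if (!_primary) {
            return false;
        }
        op.interruptOnStateChange = true;
        _ops.push_back(&op);
        return true;
    }

    // Models a step down: it cannot interleave with a registration,
    // because both take the same lock.
    void stepDown() {
        std::lock_guard<std::mutex> lk(_rstl);
        _primary = false;
        for (ToyOp* op : _ops) {
            if (op->interruptOnStateChange) {
                op->killed = true;
            }
        }
    }

private:
    std::mutex _rstl;  // toy replacement for the RSTL
    bool _primary = true;
    std::vector<ToyOp*> _ops;
};
```

With the lock held, there are only two outcomes: either registration finishes before the step down and the op is killed by it, or the step down finishes first and registration reports the failure to the caller. The window in which an op is registered too late and keeps running no longer exists in this model.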

      We believe most existing uses of OperationContext::setAlwaysInterruptAtStepDownOrUp could be replaced by one of these 'safe' patterns. In this ticket, we would like to audit existing uses of the function, identify which pattern suits each one, and come up with a path forward to migrate it. Where neither pattern works for a specific use case, we should find out why and document that reasoning.

            backlog-server-servicearch [DO NOT USE] Backlog - Service Architecture
            george.wangensteen@mongodb.com George Wangensteen