[SERVER-66353] Add documentation of concurrency rules for OperationContext::setAlwaysInterruptAtStepDownOrUp Created: 10/May/22  Updated: 29/Oct/23  Resolved: 24/May/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 6.1.0-rc0

Type: Improvement Priority: Major - P3
Reporter: George Wangensteen Assignee: George Wangensteen
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-59719 shardsvr{Commit, Abort}ReshardCollect... Closed
Backwards Compatibility: Fully Compatible
Sprint: Service Arch 2022-05-16, Service Arch 2022-05-30
Participants:

 Description   

The function OperationContext::setAlwaysInterruptAtStepDownOrUp can be used to request that an operation context is interrupted/killed on replication state change.

However, the function is dangerous to use because it is not synchronized internally with the RSTL (replication state transition lock), and it does not require the caller to hold the RSTL either. Unless the caller holds the RSTL, the node's view of its replication state may change while the function is being called. This causes subtle races in which an operation is registered to be killed only after the state change has already occurred. Here's an example of such a race:

 

  1. Thread enters setAlwaysInterruptAtStepDownOrUp() on Op without holding the RSTL // Op not yet registered to be killed
  2. Node steps down // Op is not killed, because it is not registered yet
  3. setAlwaysInterruptAtStepDownOrUp() completes // Op registered to be killed now, but the step-down has already happened
  4. Op continues running anyway because of the race above
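
The interleaving above can be replayed in a small toy model. This is only a sketch under stated assumptions, not MongoDB code: Op, Node, unsafeInterleaving and registerUnderLock are hypothetical names, and the RSTL is modeled as a plain std::mutex, which the real lock is not. The steps are run sequentially on one thread so the outcome is deterministic.

```cpp
#include <mutex>
#include <vector>

// Hypothetical stand-in for an operation context (not the real OperationContext).
struct Op {
    bool interruptOnStateChange = false;  // set by the "setter" in the ticket
    bool killed = false;
};

// Hypothetical stand-in for a node. `rstl` models the RSTL as a plain mutex.
struct Node {
    std::mutex rstl;
    bool primary = true;
    std::vector<Op*> ops;

    // State change: takes the lock and kills every op that is ALREADY registered.
    void stepDown() {
        std::lock_guard<std::mutex> lk(rstl);
        primary = false;
        for (Op* op : ops) {
            if (op->interruptOnStateChange) {
                op->killed = true;
            }
        }
    }
};

// Replays steps 1-4 from the list above in order.
// Returns whether the op was killed by the step-down.
bool unsafeInterleaving() {
    Node node;
    Op op;
    node.ops.push_back(&op);
    // 1. Caller has entered the setter without the lock; flag not yet set.
    node.stepDown();                   // 2. Node steps down; op is not registered.
    op.interruptOnStateChange = true;  // 3. Setter completes, too late.
    return op.killed;                  // 4. Still false: the op keeps running.
}

// Registration while holding the modeled lock. Returns false if the state
// change has already happened, so the caller can fail the op itself.
bool registerUnderLock(Node& node, Op& op) {
    std::lock_guard<std::mutex> lk(node.rstl);
    if (!node.primary) {
        return false;
    }
    op.interruptOnStateChange = true;
    return true;
}
```

In this model, holding the lock leaves only two outcomes. Either registration completes before the step-down, and the step-down kills the op, or the caller observes that the node is no longer primary and can react. The unlocked version has a third outcome, in which the op is registered but never killed.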

 

We have seen many BFs with races of the above form. As a first step to improve the situation, in this ticket, let's add documentation that emphasizes that this function has no built-in RSTL synchronization and that clearly states the risk of a concurrent replication state change. Let's also consider adding _UNSAFE to the end of the function name, which would follow the pattern of functions on the replication coordinator that are not synchronized with the RSTL (see https://github.com/mongodb/mongo/blob/d5399825310e599b0cad119664c23e10d98ca5af/src/mongo/db/repl/replication_coordinator_impl.h#L152).

 



 Comments   
Comment by Githook User [ 23/May/22 ]

Author: George Wangensteen <george.wangensteen@mongodb.com> (gewa24)

Message: SERVER-66353 Add concurrency information to OperationContext::setAlwaysInterruptAtStepDownOrUp
Branch: master
https://github.com/mongodb/mongo/commit/c904c1853594468bbf60cc330d89a5b89b0e6365

Comment by Githook User [ 23/May/22 ]

Author: George Wangensteen <george.wangensteen@mongodb.com> (gewa24)

Message: SERVER-66353 Add concurrency information to OperationContext::setAlwaysInterruptAtStepDownOrUp
Branch: master
https://github.com/10gen/mongo-enterprise-modules/commit/9d41a167bf45646f5112d082acc26ed8d38e8236

Generated at Thu Feb 08 06:05:11 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.