[SERVER-67466] Internal transactions API may be memory unsafe when outer OperationContext is interrupted Created: 22/Jun/22  Updated: 29/Oct/23  Resolved: 14/Jul/22

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 6.0.0-rc10
Fix Version/s: 6.0.3, 6.1.0-rc0

Type: Bug
Priority: Major - P3
Reporter: Max Hirschhorn
Assignee: Sanika Phanse (Inactive)
Resolution: Fixed
Votes: 0
Labels: sharding-nyc-subteam3
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
related to SERVER-68237 Internal session pool may not reuse i... Closed
is related to SERVER-55813 ReshardingDataReplication may still e... Closed
is related to SERVER-67016 Transaction API transactions should b... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v6.0
Sprint: Sharding 2022-07-25
Participants:
Linked BF Score: 160

 Description   

SERVER-67016 made it so that when the outer OperationContext is interrupted (e.g. due to a stepdown), the cancellation token used by the transaction API is canceled. This is important to ensure the operations running within the internal transaction are themselves eventually interrupted. However, canceling the cancellation source isn't sufficient to ensure the tasks running on the transaction API's executor have actually finished running. This means it is still possible for the outer OperationContext to be interrupted and the cancellation source to be canceled, yet for SyncTransactionWithRetries::runNoThrow() to return and the server to destroy the original command request before the tasks running on the transaction API's executor have drained.

Take the _configsvrRefineCollectionShardKey command for example. The _configsvrRefineCollectionShardKey command calls ShardingCatalogManager::refineCollectionShardKey() using a ShardKeyPattern with an underlying BSONObj whose lifetime is bound to the _configsvrRefineCollectionShardKey command request. If the OperationContext of the refineCollectionShardKey operation is interrupted (e.g. via the killOp command), then a task running on the transaction API's executor may continue to refer to the underlying BSONObj's memory even after it has been released.

Instead, the transaction API should additionally wait for the tasks running on the transaction API's executor to have all settled after canceling the cancellation source. This way none of the captures of the lambda callback may still be in use after SyncTransactionWithRetries::runNoThrow() has returned up the stack.

auto txnFuture = _txn->run(std::move(callback));
auto txnResult = txnFuture.getNoThrow(opCtx);
// Cancel the source to guarantee the transaction will terminate if our opCtx was interrupted.
_source.cancel();
// Wait for the tasks on the transaction API's executor to settle before returning up the stack.
txnFuture.wait();
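
The ordering proposed above (stop waiting, cancel, then wait for the task to settle) can be illustrated outside the server with only the C++ standard library. This is a minimal sketch under stated assumptions, not the actual transaction API: `RunResult`, `runAndSettle`, the `steps` parameter, and the use of a timeout as a stand-in for an interrupted OperationContext are all hypothetical names and simplifications introduced for illustration. The background thread borrows `request` by reference, much like a callback capturing a BSONObj owned by the command request.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <future>
#include <string>
#include <thread>
#include <utility>

// Hypothetical result type for this sketch.
struct RunResult {
    bool interrupted;       // the caller stopped waiting before the task finished
    std::size_t bytesSeen;  // last value the task read from the borrowed request
};

// Hypothetical stand-in for runNoThrow(): runs a task that borrows `request`,
// and does not return until that task has fully settled.
RunResult runAndSettle(const std::string& request,
                       int steps,
                       std::chrono::milliseconds deadline) {
    assert(steps >= 1);
    std::atomic<bool> cancelled{false};  // stand-in for the cancellation source
    std::promise<std::size_t> done;
    std::future<std::size_t> txnFuture = done.get_future();

    // Stand-in for a task on the executor. It captures `request` by reference,
    // so it must not outlive this function's caller-owned argument.
    std::thread worker([&request, &cancelled, steps, p = std::move(done)]() mutable {
        std::size_t seen = 0;
        for (int i = 0; i < steps; ++i) {
            seen = request.size();  // touches borrowed memory
            if (cancelled.load()) break;
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
        p.set_value(seen);
    });

    // Stand-in for getNoThrow(opCtx): give up waiting once the deadline passes.
    const bool interrupted =
        txnFuture.wait_for(deadline) == std::future_status::timeout;

    cancelled.store(true);  // stand-in for _source.cancel()
    txnFuture.wait();       // cancellation alone is not enough: wait for the result
    worker.join();          // and for the task's captures to go out of use

    return RunResult{interrupted, txnFuture.get()};
}
```

Calling `runAndSettle(req, 1000, std::chrono::milliseconds(20))` with a task that would otherwise run for about a second returns with `interrupted` set, but only after the worker has observed the cancellation and exited, so `req` can be destroyed safely afterwards. Returning right after `cancelled.store(true)` instead (with the thread detached) would reproduce the hazard described in this ticket.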



 Comments   
Comment by Githook User [ 06/Oct/22 ]

Author:

{'name': 'Sanika Phanse', 'email': 'sanika.phanse@mongodb.com', 'username': 'sphanse99'}

Message: SERVER-67466 Internal transactions API may be memory unsafe when outer OperationContext is interrupted
Branch: v6.0
https://github.com/mongodb/mongo/commit/c3932575cb49845903e96e4c26c8e53f707f93f6

Comment by Githook User [ 14/Jul/22 ]

Author:

{'name': 'Sanika Phanse', 'email': 'sanika.phanse@mongodb.com', 'username': 'sphanse99'}

Message: SERVER-67466 Internal transactions API may be memory unsafe when outer OperationContext is interrupted
Branch: master
https://github.com/mongodb/mongo/commit/31efb07ad4029fb56f5a850a7f330330e155961f
