[SERVER-67334] Accessing the opCtx decoration 'tenantIdToDeleteDecoration' from the on-commit hook of TenantMigrationRecipientOpObserver::onDelete() is not safe after the ttl batch deletion feature. Created: 16/Jun/22  Updated: 29/Oct/23  Resolved: 22/Jun/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 6.1.0-rc0

Type: Task Priority: Major - P3
Reporter: Suganthi Mani Assignee: Christopher Caplinger
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
is related to SERVER-63040 Batch TTL deletions Closed
is related to SERVER-67322 Update the stale comment in TenantMig... Closed
Backwards Compatibility: Fully Compatible
Sprint: Server Serverless 2022-06-27
Participants:
Linked BF Score: 143

 Description   

PM-2227 made the ttl to perform batch deletes. As a result we might end up crashing the system while trying to access the uninitialized boost::optional 'tenantIdToDeleteDecoration' value (an opCtx decoration ) from the TenantMigrationRecipientOpObserver::onDelete()'s on-commit hook. Consider the below scenario.

Assume, we started 2 migrations for tenant T1 & T2 with donor replica set rs0 and recipient replica set rs1.
1) Migration T1 is committed successfully and R state doc was updated to get garbage collected and set the expiry time for that state doc as timestamp TS1 . This migration should have an R access blocker installed for T1.
2) Assume Migration T2 is still in-progress, the cloud decided to abort and R primary ended up receiving recipientForgetMigration cmd before recipientSyncData cmd. This would not create a recipient access blocker for the migration T2. And, the R state doc for this migration is also updated to get garbage collected and the expiry time for this state doc is also set as  timestamp TS1 .
3) Now, when TTL monitor scans for any expired documents in the recipient state doc collection, it would see 2 documents needed to be deleted. So, it would do the batch deletion by doing those 2 deletes in a single recovery unit using the same opCtx and assuming the order of the state doc deletion is
    i) Delete state doc for T1 - This would would set the `tenantIdToDeleteDecoration` on the opctx to be T1 and registers the on-commit hook to delete the T1's R access blocker as part of the tenant recipient op observer imp.
   ii) Delete state doc for T2 - This would would set the `tenantIdToDeleteDecoration` on the opctx to be boost::none and don't register the on-commit hook as we don't have the R tenant access blocker for T2.
4) When the recovery unit of TTL batch deletion commits, we would run the T1's on-commit hook and leading to accessing uninitialized boost::optional `tenantIdToDeleteDecoration` value, leading to invariant failure and crashing the system.



 Comments   
Comment by Githook User [ 23/Jun/22 ]

Author:

{'name': 'Christopher Caplinger', 'email': 'christopher.caplinger@mongodb.com', 'username': 'UnicodeSnowman'}

Message: SERVER-67334: Pass tenantId directly to onCommit hook
Branch: davish/SERVER-63099
https://github.com/mongodb/mongo/commit/09f7596e084c12ffe154f9e8466d1239b5efe389

Comment by Githook User [ 22/Jun/22 ]

Author:

{'name': 'Christopher Caplinger', 'email': 'christopher.caplinger@mongodb.com', 'username': 'UnicodeSnowman'}

Message: SERVER-67334: Pass tenantId directly to onCommit hook
Branch: master
https://github.com/mongodb/mongo/commit/09f7596e084c12ffe154f9e8466d1239b5efe389

Comment by Steven Vannelli [ 21/Jun/22 ]

Much appreciated!

Comment by Christopher Caplinger [ 21/Jun/22 ]

steven.vannelli@mongodb.com on it, will get a patch up for this shortly

Comment by Esha Maharishi (Inactive) [ 17/Jun/22 ]

Ah I see, that solves the issue since the tenant id would be captured into the onCommit hook separately for each delete. Cool, that sounds good to me!

Comment by Suganthi Mani [ 17/Jun/22 ]

first half - "pass the tenant id directly just to the onCommit hook".

Comment by Esha Maharishi (Inactive) [ 17/Jun/22 ]

suganthi.mani@mongodb.com do you mean we should pass the tenant id directly just to the onCommit hook, or also pass the tenant id directly to onDelete?

Comment by Suganthi Mani [ 17/Jun/22 ]

esha.maharishi@mongodb.com Hopefully SERVER-65236 should be able the remove TTL dependency. My opinion is that irrespective of SERVER-65236 fix, we shouldn't pass the opCtx to this onCommit hook to access the opCtx decoration, instead directly pass the decoration value (tenant id to be deleted) as we don't want to make assumption about the opCtx responsible to delete only one state document.

Comment by Esha Maharishi (Inactive) [ 16/Jun/22 ]

Hmm, this problem will go away once we remove the tenant migration donor's dependency on the TTL index. I am thinking about if it will completely go away with SERVER-65236.

Comment by Esha Maharishi (Inactive) [ 16/Jun/22 ]

Ah, those two deletes share the same opCtx...

Yes, I think we should prioritize fixing this before releasing 6.1, so that we don't introduce these crashes in production.

Comment by Suganthi Mani [ 16/Jun/22 ]

esha.maharishi@mongodb.com This crash bug is caused due to PM-2227 whose Fix version is 6.0 & 6.1, but the SERVER-63040 which really changed TTL behavior is 6.1 version. Do you think we would be backporting this fix to 6.1 version as well? Also, do you think we should prioritize this ticket as it seems to be a regression?

CC steven.vannelli@mongodb.com christopher.caplinger@mongodb.com

Generated at Thu Feb 08 06:07:52 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.