[SERVER-72622] Resuming tenant oplog applier due to recipient failover can miss writing no-op entries for donor oplog entries. Created: 09/Jan/23  Updated: 15/Jun/23  Resolved: 15/Jun/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Suganthi Mani Assignee: Christopher Caplinger
Resolution: Won't Do Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File SERVER-72622.patch    
Issue Links:
Backports
Problem/Incident
Related
related to SERVER-72623 Add a server comment in tenant oplog ... Closed
is related to SERVER-75990 Tenant Migrations are not resilient t... Closed
Operating System: ALL
Sprint: Server Serverless 2023-02-06, Server Serverless 2023-02-20, Server Serverless 2023-03-06, Server Serverless 2023-03-20, Server Serverless 2023-04-03, Server Serverless 2023-04-17, Server Serverless 2023-05-01
Participants:
Linked BF Score: 173

 Description   

Tenant oplog applier first apply writes and then write no-ops for each donor oplog entries in a given oplog batch and then these no-ops are written in parallel using the writerpool threads. And to calculate the resume point on recipient failover, we traverse backwards through the oplog collection and find the most recent no-op oplog entry from the current migration. Due to this code logic, resuming tenant oplog applier due to recipient failover can miss writing no-op entries for donor oplog entries. The consequence of this would be
1) we might miss updating the session entries in config.transactions table for multi-statement replica set transactions, leading to duplicate transaction commit
2) Missing oplog chain for retryable writes
3) Change streams might miss generating the change event.



 Comments   
Comment by Didier Nadeau [ 17/Apr/23 ]

Due to the complexity of the fix and the side-effects faced while working on it, we decided to abort Tenant Migration in case of recipient failovers SERVER-75990. Therefore we close this ticket as won't do (the commit were rolled back).

Comment by Githook User [ 10/Apr/23 ]

Author:

{'name': 'Christopher Caplinger', 'email': 'christopher.caplinger@mongodb.com', 'username': 'UnicodeSnowman'}

Message: Revert "SERVER-72622: Track TenantOplogApplier progress in replicated collection"

This reverts commit 636ad08bdb6fb5c9ad98ee4cdde8f52929c29830.
Branch: master
https://github.com/mongodb/mongo/commit/c20a1829195384e6f9737cdba13b850364366e1d

Comment by Githook User [ 08/Apr/23 ]

Author:

{'name': 'Christopher Caplinger', 'email': 'christopher.caplinger@mongodb.com', 'username': 'UnicodeSnowman'}

Message: SERVER-72622: Track TenantOplogApplier progress in replicated collection
Branch: master
https://github.com/mongodb/mongo/commit/636ad08bdb6fb5c9ad98ee4cdde8f52929c29830

Comment by Githook User [ 28/Mar/23 ]

Author:

{'name': 'Suganthi Mani', 'email': 'suganthi.mani@mongodb.com', 'username': 'smani87'}

Message: Revert "SERVER-72622: Track TenantOplogApplier progress in replicated collection"

This reverts commit 3c130a69eaddc7cb44895f57af4da6e39556dbb4.
Branch: master
https://github.com/mongodb/mongo/commit/726362404fbe9fc1dc75f97fabcb47e76efac8d7

Comment by Suganthi Mani [ 28/Mar/23 ]

Reverting the commit due to build failure. Further inspecting this commit changes, it can lead to potential deadlocks and perf regression. So, will revert the commit and rework on the changes to address those problems.

Writing down the problems in detail with the commit

1) The root cause of the BF is that with the new change, it is possible to reapply retryable writes and transaction donor oplog entries on recipient failovers with a transaction number older than the current active transaction number for a given session. We handled it for statementIds but we didn't handle it for transaction numbers already executed case. As a result, the test was throwing ErrorCodes::TransactionTooOld.

2) Potential places where we can get into 2-way deadlock between stepdown and TenantOplogApplier startup are, this [1] and this [2] line in the tenant oplog applier code (see SERVER-60872 for more info on lock ordering violation ). Additionally, [2] makes a storage call with the tenant oplog applier mutex lock held which is as anti-pattern. We should try to avoid doing storage call with mutex lock held. Mutex should be taken only for short critical section.

3) Now with the changes, we are checking out the session which reads oplog from disk to load retryable chain and this session gets invalidated at the end of each batch. This means, every time a batch should read from disk to load the retryable write oplog chain while checking out session for a given session. This would lead to potential perf regression in tenant migration.

Additionally, I noted a minor change that on FCV 6.3 and Mongodb binary, we scan oplog backwards to find the resume opTime even without recipient failovers. This is a change in behavior from the original where we try to find resume opTime only when the resume phase is `ResumePhase::kOplogCatchup` 

(Update: Uploaded a diff patch to address problem 1 & 3 with base hash commit as 00e9ca719a7ef980e756ba27ba21dec44d69cf0f)

Comment by Githook User [ 21/Mar/23 ]

Author:

{'name': 'Christopher Caplinger', 'email': 'christopher.caplinger@mongodb.com', 'username': 'UnicodeSnowman'}

Message: SERVER-72622: Track TenantOplogApplier progress in replicated collection
Branch: master
https://github.com/mongodb/mongo/commit/3c130a69eaddc7cb44895f57af4da6e39556dbb4

Generated at Thu Feb 08 06:22:21 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.