[SERVER-54626] Retryable writes may execute more than once in resharding if statements straddle the fetchTimestamp Created: 19/Feb/21  Updated: 29/Oct/23  Resolved: 03/Mar/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 4.9.0

Type: Bug Priority: Major - P3
Reporter: Max Hirschhorn Assignee: Yuhong Zhang
Resolution: Fixed Votes: 0
Labels: PM-234-M3, PM-234-T-config-txn-clone
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-54681 Resharding recipient shards which are... Closed
related to SERVER-55214 Resharding txn cloner can miss config... Closed
related to SERVER-55305 Retryable write may execute more than... Closed
is related to SERVER-52921 Integrate config.transactions cloner ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 2021-03-08
Participants:
Story Points: 1

 Description   
| Donor 1 | Donor 2 | Coordinator |
| Donor 1 chooses minFetchTimestamp ts=10 | | |
| Donor 1 performs retryable write stmtId 1 at ts=20 | | |
| Donor 1 performs retryable write stmtId 2 at ts=30 | | |
| | Donor 2 chooses minFetchTimestamp ts=25 | |
| | | Coordinator chooses fetchTimestamp ts=25 |

The sequence above leads to a state where the recipients won't have written an incomplete stmtId for the retryable write, because the timestamp for stmtId 2 (ts=30) is greater than the fetchTimestamp (ts=25). This allows the statements from the retryable write to execute a second time on the recipients after the resharding operation has finished.

{
    aggregate: "transactions",
    pipeline: [
        {$match: {_id: {$gt: <resume lsid>}}},
        {$sort: {_id: 1}},
        {$match: {"lastWriteOpTime.ts": {$lt: <fetchTimestamp>}}},
    ],
    readConcern: {level: "majority", afterClusterTime: <fetchTimestamp>},
    hint: "_id_",
    cursor: {},
}

The {"lastWriteOpTime.ts": {$lt: <fetchTimestamp>}} clause is what causes stmtId 2, and therefore the entire retryable write, to be skipped over by the recipients.
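
The skip can be reproduced with a small in-memory model of the timeline above. This is only a sketch: the `oplog` array, the `sessionRecordsAsOf` helper, and the `session-A` lsid are hypothetical names invented for illustration, and the model keeps just one record per session with its `lastWriteOpTime.ts`, which is a simplification of the real config.transactions document. The majority read with afterClusterTime is modeled as seeing the latest state on Donor 1.

```javascript
// Hypothetical in-memory model of the scenario; no MongoDB connection is used.
// Each entry is one retryable-write statement executed on Donor 1.
const oplog = [
    {lsid: "session-A", stmtId: 1, ts: 20},
    {lsid: "session-A", stmtId: 2, ts: 30},
];

// Simplified stand-in for config.transactions as of a given timestamp:
// one record per session, tracking the timestamp of its last write.
function sessionRecordsAsOf(entries, ts) {
    const records = new Map();
    for (const e of entries) {
        if (e.ts > ts) continue; // write not yet visible at this timestamp
        const rec = records.get(e.lsid) || {_id: e.lsid, lastWriteOpTime: {ts: 0}};
        rec.lastWriteOpTime = {ts: Math.max(rec.lastWriteOpTime.ts, e.ts)};
        records.set(e.lsid, rec);
    }
    return [...records.values()];
}

const fetchTimestamp = 25;

// Assumption: the afterClusterTime read observes the newest data, so the
// session record already reflects stmtId 2 (lastWriteOpTime.ts = 30).
const latest = sessionRecordsAsOf(oplog, Infinity);

// The second $match stage of the pipeline, applied to the modeled records.
const cloned = latest.filter((rec) => rec.lastWriteOpTime.ts < fetchTimestamp);

console.log(cloned.length); // 0: the session record is filtered out entirely
```

In this model the record for `session-A` is dropped even though stmtId 1 executed at ts=20, before the fetchTimestamp, which matches the behavior described in this ticket.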



 Comments   
Comment by Yuhong Zhang [ 03/Mar/21 ]

Option 3 was chosen to address the issue.

Comment by Githook User [ 03/Mar/21 ]

Author:

{'name': 'Yuhong Zhang', 'email': 'danielzhangyh@gmail.com', 'username': 'YuhongZhang98'}

Message: SERVER-54626 Retryable writes may execute more than once in resharding if statements straddle the fetchTimestamp
Branch: master
https://github.com/mongodb/mongo/commit/82cb2954a0b252f7bc193bf01b5ca105cd637b6e

Comment by Max Hirschhorn [ 20/Feb/21 ]

I could imagine addressing this issue one of a few different ways:

  1. Change the pipeline to use $graphLookup, similar to what is done for resharding's oplog fetching pipeline, to walk the prevOpTime chain to its root. This would add the complexity of needing another view to avoid exceeding the 100MB memory limit for aggregation pipelines, and of allowing that view to be queried across databases.
  2. Change SessionTxnRecord to store a startOpTime-like field for retryable writes which corresponds to the prevOpTime root's optime. The aggregation pipeline would then use a {"firstWriteOpTime.ts": {$lt: <fetchTimestamp>}} filter instead.
  3. Change recipients to read from the config.transactions collection using {atClusterTime: <fetchTimestamp>} and remove the {"lastWriteOpTime.ts": {$lt: <fetchTimestamp>}} clause.

I suspect option (3) is likely the most straightforward to implement.
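
As a rough sketch of what option (3) could look like, the command from the description might be built as follows. The `buildOption3Command` helper and the placeholder argument values are hypothetical, and the `level: "snapshot"` read concern is an assumption on my part (atClusterTime is accepted together with the snapshot level); this is not necessarily the exact shape that was committed.

```javascript
// Hypothetical helper: builds the aggregate command for option (3).
// resumeLsid and fetchTimestamp are passed through as opaque values; in a
// real deployment they would be a session id document and a BSON Timestamp.
function buildOption3Command(resumeLsid, fetchTimestamp) {
    return {
        aggregate: "transactions",
        pipeline: [
            {$match: {_id: {$gt: resumeLsid}}},
            {$sort: {_id: 1}},
            // The {"lastWriteOpTime.ts": {$lt: <fetchTimestamp>}} stage is removed.
        ],
        // Assumption: snapshot level, so the read happens exactly at fetchTimestamp.
        readConcern: {level: "snapshot", atClusterTime: fetchTimestamp},
        hint: "_id_",
        cursor: {},
    };
}

// Placeholder values for illustration only.
const cmd = buildOption3Command("resume-lsid-placeholder", 25);
console.log(JSON.stringify(cmd.readConcern));
```

Reading at the fetchTimestamp means the session record is seen as it existed at ts=25 in the example timeline (after stmtId 1, before stmtId 2), so the timestamp filter is no longer needed to exclude later writes.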
