[SERVER-60624] txn_commit_optimizations_for_read_only_shards.js pauses replication on coordinator and can leave transaction stuck in prepare Created: 12/Oct/21  Updated: 29/Oct/23  Resolved: 29/Dec/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 5.3.0, 5.0.6

Type: Bug Priority: Major - P3
Reporter: Luis Osta (Inactive) Assignee: Matt Boros
Resolution: Fixed Votes: 0
Labels: neweng, sharding-nyc-subteam1, sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
related to SERVER-48060 Make tests only set server parameter ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v5.2, v5.0
Sprint: Sharding 2021-12-13, Sharding 2021-12-27
Participants:
Linked BF Score: 23
Story Points: 2

 Description   

Context
The test txn_commit_optimizations_for_read_only_shards.js runs transactions with the coordinateCommitReturnImmediatelyAfterPersistingDecision server parameter enabled.

With this parameter enabled, the commitTransaction command returns as soon as the _decisionPromise is emplaced (either successfully or due to an error).

As a result, the next test case can start before the TransactionCoordinator has finished with the existing transaction, which is part of the coverage for this test.

The problem
For certain test cases, the beforeStatements function stops server replication, meaning that the secondary stops applying oplog entries.

This results in the following being possible:

  1. txnNumber 51 starts. If the secondary falls behind on the oplog, the opTime for decisionPersisted has not been reached yet, but execution continues and the _decisionPromise is emplaced.
  2. The next test case starts for txnNumber 52 and stops replication completely. This leaves the existing transaction (txnNumber 51) stuck waiting for majority write concern.
  3. The new test case gets stuck waiting for txnNumber 51 to exit the prepared state.
  4. Since the new test case can never finish (because txnNumber 52 is waiting for the previous one to exit the prepared state), replication is never restarted and txnNumber 51 can never finish.
  5. The test hangs forever.

Since the issue arises from the test stopping replication with the coordinateCommitReturnImmediatelyAfterPersistingDecision flag enabled, this is a test-only problem.
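The circular dependency in the steps above can be written out as a small wait-for graph. The sketch below is only an illustration in plain JavaScript: the node labels and the findCycle helper are made up for this example and are not part of the server or the test.

```javascript
// Wait-for graph for the hang described above. Node names are illustrative
// labels, not identifiers from the server or the test.
const waitsFor = {
    "txn51 leaves prepare": ["majority write concern satisfied"],
    "majority write concern satisfied": ["replication restarted"],
    "replication restarted": ["test case for txn52 finishes"],
    "test case for txn52 finishes": ["txn51 leaves prepare"],
};

// Depth-first search that returns the first cycle found, or null if none.
function findCycle(graph) {
    const state = new Map();  // node -> "visiting" | "done"
    const path = [];

    function visit(node) {
        if (state.get(node) === "done") return null;
        if (state.get(node) === "visiting") {
            // Reached a node already on the current path: that is a cycle.
            return path.slice(path.indexOf(node)).concat(node);
        }
        state.set(node, "visiting");
        path.push(node);
        for (const next of graph[node] || []) {
            const found = visit(next);
            if (found) return found;
        }
        path.pop();
        state.set(node, "done");
        return null;
    }

    for (const node of Object.keys(graph)) {
        const found = visit(node);
        if (found) return found;
    }
    return null;
}

const cycle = findCycle(waitsFor);
console.log(cycle ? cycle.join(" -> ") : "no cycle");
```

Removing any one edge, for example by not making the next test case wait on txnNumber 51, breaks the cycle.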

Proposed Solution
If we waited for the existing transaction to finish before moving on to the next test, either in the cleanUp option available in the failureMode or in the for loop itself, this problem would not occur: the new test case would not be able to stop replication before the transaction had finished.



 Comments   
Comment by Githook User [ 30/Dec/21 ]

Author:

{'name': 'Matt Boros', 'email': 'matt.boros@mongodb.com'}

Message: SERVER-60624 Use unique LSID for each test in txn_commit_optimizations_for_read_only_shards.js

(cherry picked from commit 6e8beaab454ba83cf6123625de45bc0b22fb1079)
Branch: v5.0
https://github.com/mongodb/mongo/commit/c0f12d1a4c98e811b21233f6c8ff7df948056f76

Comment by Githook User [ 30/Dec/21 ]

Author:

{'name': 'Matt Boros', 'email': 'matt.boros@mongodb.com'}

Message: SERVER-60624 Use unique LSID for each test in txn_commit_optimizations_for_read_only_shards.js

(cherry picked from commit 6e8beaab454ba83cf6123625de45bc0b22fb1079)
Branch: v5.2
https://github.com/mongodb/mongo/commit/6f015837d63a2b9f672d78ae3feabff707cea96c

Comment by Githook User [ 17/Dec/21 ]

Author:

{'name': 'Matt Boros', 'email': 'matt.boros@mongodb.com'}

Message: SERVER-60624 Use unique LSID for each test in txn_commit_optimizations_for_read_only_shards.js
Branch: master
https://github.com/mongodb/mongo/commit/6e8beaab454ba83cf6123625de45bc0b22fb1079

Comment by Max Hirschhorn [ 22/Nov/21 ]

Proposed Solution
If we waited for the existing transaction to finish before moving on to the next test, either in the cleanUp option available in the failureMode or in the for loop itself, this problem would not occur: the new test case would not be able to stop replication before the transaction had finished.

Another thought here would be to give each test case a unique logical session ID to run with. This way each test won't need to wait for the cross-shard transaction from the previous test case to finish executing.
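A rough sketch of why a unique session ID per test case would avoid the wait, written as a toy model in plain JavaScript. SessionCatalog, mustWait, and newLsid are hypothetical names for this illustration only. The string IDs stand in for real logical session IDs, and the model assumes a new transaction waits only on an earlier prepared transaction in the same session.

```javascript
// Toy model of per-session transaction state. All names here are
// hypothetical and do not mirror the server's implementation.
class SessionCatalog {
    constructor() {
        this.prepared = new Map();  // lsid -> txnNumber still in prepare
    }

    // Records that a transaction on this session is still in prepare.
    markPrepared(lsid, txnNumber) {
        this.prepared.set(lsid, txnNumber);
    }

    // In this model, a new transaction must wait only if an earlier
    // transaction on the same session has not yet left the prepared state.
    mustWait(lsid, txnNumber) {
        const earlier = this.prepared.get(lsid);
        return earlier !== undefined && earlier < txnNumber;
    }
}

// Stand-in for generating a fresh logical session ID per test case.
let nextId = 0;
function newLsid() {
    return `lsid-${nextId++}`;
}

const catalog = new SessionCatalog();

// Shared session: txnNumber 52 queues behind prepared txnNumber 51.
const sharedLsid = newLsid();
catalog.markPrepared(sharedLsid, 51);
const sharedBlocks = catalog.mustWait(sharedLsid, 52);

// Unique session per test case: no earlier transaction to wait for.
const freshLsid = newLsid();
const freshBlocks = catalog.mustWait(freshLsid, 0);

console.log({sharedBlocks, freshBlocks});
```

With separate sessions, the next test case can stop replication without first waiting on the previous transaction, so the circular wait described in the issue does not form.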

Comment by Luis Osta (Inactive) [ 12/Oct/21 ]

The server parameter in question was introduced as part of SERVER-48060
