[SERVER-78228] _decisionDurable flag is not set correctly when continuing two-phase commit after step up Created: 20/Jun/23  Updated: 12/Dec/23

Status: Open
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Ivan Fefer Assignee: Brett Nawrocki
Resolution: Unresolved Votes: 0
Labels: cs-subteam1, sharding-nyc-subteam1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Sharding NYC
Operating System: ALL
Participants:
Story Points: 5

 Description   

When persisting decision we check for _decisionDurable flag, because if we are continuing the commit after step up, there is no need to persist decision again.

However, we are not setting this flag in continueCommit: we should add _decisionDurable = true  if TransactionCoordinatorDocument has a decision.

This is not a correction issue, we just waste some time by writing the decision again.



 Comments   
Comment by Max Hirschhorn [ 22/Jun/23 ]

To add more detail about the subtlety - The logic which returns a null optime means the WaitForMajorityService will be trivially satisfied by the primary.

if (_decisionDurable)
    return Future<repl::OpTime>::makeReady(repl::OpTime());

This is problematic because the primary does not readily know whether the last write to the transaction coordinator state document has been majority-committed or not if the node was the primary in the term when the write was initially committed. Generally the solution to this problem involves waiting for the latest optime to become majority-committed by waiting on the optime from ReplClientInfo::setLastOpToSystemLastOpTime().

In a 5-node replica set taking the if-branch would lead to the following unsafe behavior:

  1. The primary in term 1 writes the participant list and waits for it to be majority-committed.
  2. The primary in term 1 writes the commit decision but steps down while waiting for it to be majority-committed. The write was successfully applied to 1 of 4 secondaries. The write has not been majority-committed (and in this example won't become majority-committed).
  3. The secondary which successfully applied the write from (2) steps up to be the primary in term 2.
  4. (Bug) The primary in term 2 sees the commit decision in its local state and communicates commitTransaction to each of the participant shards.
  5. The primary in term 2 steps down and one of the other 3 secondaries which never applied the write from (2) steps up to be the primary in term 3.
  6. The primary in term 3 writes the abort decision and waits for it to be majority-committed.
  7. The primary in term 3 communicates abortTransaction to each of the participant shards and now the outcome of the transaction has been torn because some participant shards may have committed due to (4) and others will abort.

Because the primary still attempts to update the transaction coordinator state document (it won't match a document if the decision has already been made) the primary ends up advancing the OperationContext to track the latest optime through the LastOpFixer::~LastOpFixer() destructor.

Comment by Ivan Fefer [ 20/Jun/23 ]

This can be more tricky than it seems, so maybe we should allow it to happen and remove misleading comment and always-false if check.

Generated at Thu Feb 08 06:37:47 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.