[SERVER-42171] Race in recover_prepared_transactions_startup_secondary_application.js Created: 11/Jul/19  Updated: 11/Jul/19  Resolved: 11/Jul/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 4.2.0-rc2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: A. Jesse Jiryu Davis Assignee: A. Jesse Jiryu Davis
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
duplicates SERVER-41718 recover_prepared_transactions_startup... Closed
Problem/Incident
is caused by SERVER-41008 Check lastCommittedOpTime instead of ... Closed
Operating System: ALL
Participants:
Linked BF Score: 0

 Description   

recover_prepared_transactions_startup_secondary_application.js prepares a transaction, then restarts a secondary and queries it, expecting it to be in state SECONDARY. Sometimes, however, the query fails with "node is recovering". This is due to a race in the test: it assumes that waiting for the prepare entry to be majority-committed implies that the restarted node has transitioned to state SECONDARY, but this is not always so.

* Create a 2-node set
* Prepare a transaction
* In either order:
    * The secondary restarts & transitions to RECOVERING
    * The prepare entry is majority-committed
* The test waits for the prepare entry to be majority-committed
* The test queries the restarted node

If the secondary restarts before the prepare is majority-committed, then, because a majority of a 2-node set requires both nodes, waiting for the majority commit is the same as waiting for the restarted node to become SECONDARY, so querying the node is fine.

If the secondary restarts after the prepare is majority-committed, then waiting for the majority commit is insufficient. The node could still be in RECOVERING and reject the query.
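
The two orderings can be illustrated with a toy model. This is only a sketch of the reasoning above, not mongo shell or server code: the function name simulateTest, its parameters, and the way recovery is modeled are all hypothetical, and the "restart first" branch simply encodes the description's claim that the majority commit then implies the node is SECONDARY.

```javascript
// Toy model of the race (hypothetical names; not mongo shell or server code).
// restartFirst: true if the secondary restarts before the prepare entry is
//               majority-committed, false if it restarts afterwards.
// alsoWaitForSecondary: true if the test additionally waits for the node to
//               reach SECONDARY before querying it.
function simulateTest(restartFirst, alsoWaitForSecondary) {
  let state = "SECONDARY";
  let majorityCommitted = false;

  const restart = () => { state = "RECOVERING"; };
  const finishRecovery = () => { state = "SECONDARY"; };

  if (restartFirst) {
    restart();
    // Assumption taken from the description: in this ordering the majority
    // commit only happens once the restarted node is SECONDARY again.
    finishRecovery();
    majorityCommitted = true;
  } else {
    majorityCommitted = true;
    // The node restarts after the commit and may still be recovering when
    // the test moves on.
    restart();
  }

  // The test waits for the prepare entry to be majority-committed.
  if (!majorityCommitted) {
    throw new Error("prepare entry never became majority-committed");
  }

  // Optional extra wait: modeled as blocking until recovery has finished.
  if (alsoWaitForSecondary) {
    finishRecovery();
  }

  // The test queries the restarted node.
  return state === "SECONDARY" ? "ok" : "node is recovering";
}

console.log(simulateTest(true, false));
console.log(simulateTest(false, false));
console.log(simulateTest(false, true));
```

In this model only the second call, where the restart follows the majority commit and nothing waits for SECONDARY, returns "node is recovering".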



 Comments   
Comment by A. Jesse Jiryu Davis [ 11/Jul/19 ]

The race was introduced here:

https://github.com/mongodb/mongo/commit/204352fb65123323bb50800741b1b322fe648f15

Before this change, the test called ReplSetTest.awaitReplication(), which I think would wait for the secondary to recover as well as waiting for replication to reach a timestamp. After the change, it calls awaitMajorityCommitted(). This fixes the test's logic for awaiting replication, but it removes the check for recovery. The next fix is to put back the recovery check.
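
A minimal sketch of what putting back the recovery check could look like, assuming the test keeps the awaitMajorityCommitted() call and adds an explicit wait for SECONDARY afterwards. The wrapper name awaitPrepareAndRecovery is hypothetical; rst is assumed to be the test's ReplSetTest object, prepareHelpers is assumed to be the jstests PrepareHelpers library, and the exact signatures are assumptions. The fix actually applied under SERVER-41718 may differ.

```javascript
// Hypothetical helper: wait for the prepare entry to be majority-committed,
// then wait for every restarted node to finish recovery.
// Assumptions: `rst` behaves like ReplSetTest (awaitSecondaryNodes) and
// `prepareHelpers` behaves like PrepareHelpers (awaitMajorityCommitted).
function awaitPrepareAndRecovery(rst, prepareHelpers, prepareTimestamp) {
  // Replication check introduced by the linked commit.
  prepareHelpers.awaitMajorityCommitted(rst, prepareTimestamp);

  // Recovery check: do not query the node until it has left RECOVERING.
  rst.awaitSecondaryNodes();
}
```

With both waits in place, the order of the restart and the majority commit no longer matters to the query that follows.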
