Resolution: Won't Do
Priority: Major - P3
Affects Version/s: None
Sprint: Execution Team 2021-06-14, Execution Team 2021-06-28, Execution Team 2021-07-26, Execution Team 2021-08-23, Execution Team 2021-09-06, Execution Team 2021-09-20, Execution Team 2021-10-04, Execution Team 2021-11-01, Execution Team 2021-11-15, Execution Team 2022-02-21, Execution Team 2022-04-04, Execution Team 2022-05-16, Execution Team 2022-05-30
SERVER-57476 demonstrates a case where:
- A transaction T1 becomes prepared, preparing some update A
- Another transaction T2 reserves an oplog slot. This slot has an earlier timestamp than the prepare oplog entry of T1.
- T1 cannot commit until its prepare oplog entry is replicated to a majority of nodes. The oplog hole introduced by T2 prevents that entry from majority replicating.
- T2 attempts to read the document that's currently prepared with A.
This introduces a stall in the system: T2 waits on the prepare conflict with T1, while T1 waits on the oplog hole held by T2.
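The interleaving above can be sketched as a toy model. This is only an illustration under simplifying assumptions: timestamps are plain integers, and `ToyOplog`, `no_holes_point`, and `is_stalled` are hypothetical names, not server code.

```python
from dataclasses import dataclass, field


@dataclass
class ToyOplog:
    """Toy oplog: a reserved-but-unwritten slot is a hole (hypothetical model)."""
    holes: set = field(default_factory=set)
    written: set = field(default_factory=set)

    def reserve(self, ts):
        # An operation reserves a slot before it has written the entry.
        self.holes.add(ts)

    def write(self, ts):
        # Writing the entry fills the slot, closing any hole at that timestamp.
        self.holes.discard(ts)
        self.written.add(ts)

    def no_holes_point(self):
        # Highest written timestamp below the earliest hole. In this model,
        # entries past a hole cannot become majority replicated.
        limit = min(self.holes) if self.holes else float("inf")
        visible = [ts for ts in self.written if ts < limit]
        return max(visible) if visible else 0


def is_stalled(oplog, prepare_ts, reader_slot_ts, reader_hits_prepare_conflict):
    # T1 can commit only once its prepare entry is majority replicated.
    t1_blocked = oplog.no_holes_point() < prepare_ts
    # T2 keeps its slot open (a hole) while it waits on the prepare conflict.
    t2_blocked = reader_hits_prepare_conflict and reader_slot_ts in oplog.holes
    # Stall: T1 waits on T2's earlier hole, and T2 waits on T1's commit.
    return t1_blocked and t2_blocked and reader_slot_ts < prepare_ts


oplog = ToyOplog()
oplog.reserve(5)   # T2 reserves a slot at timestamp 5
oplog.write(6)     # T1 writes its prepare entry at timestamp 6
stalled = is_stalled(oplog, prepare_ts=6, reader_slot_ts=5,
                     reader_hits_prepare_conflict=True)
```

In this model the stall disappears if T2's slot is later than T1's prepare entry, or if T2 never reads the prepared document.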
SERVER-57476 plans to address the problem by returning a retryable error to the user when a transaction with a commit timestamp actually hits a prepare conflict. This targets the global problem, but the fix has no effect unless the system actually produces the interleaving illustrated by T1 and T2.
This ticket is to craft a set of criteria, local to a single operation, for knowing when that operation may lead to the described stall. This is important because it's difficult for our system to thoroughly generate all combinations of operations that can produce this interleaving.
However, there are challenges. It's not sufficient to simply invariant that anything entering a prepareConflictRetry loop must not be holding any resources (i.e., have a commit/durable timestamp):
- Entering a prepareConflictRetry loop is safe on primaries when the operation has exclusive access to a collection.
- The transaction may be ignoring prepare conflicts.
- The system may be in a state (e.g., startup or rollback) where prepared transactions do not currently exist.
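One possible reading of these exceptions, written as a local predicate, is sketched below. This is not the server's implementation. `OpState` and all of its fields are hypothetical names standing in for whatever state an operation could inspect, and the list of exemptions may well be incomplete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpState:
    """Hypothetical per-operation state; field names are assumptions."""
    holds_timestamp: bool                  # has a commit/durable timestamp
    is_primary: bool
    has_exclusive_collection_access: bool
    ignores_prepare_conflicts: bool
    prepared_txns_may_exist: bool          # False during e.g. startup or rollback


def may_stall_on_prepare_conflict(op):
    """Return True if entering a prepare conflict retry loop might stall."""
    if not op.holds_timestamp:
        return False  # no held resource, so no oplog hole to block on
    if op.is_primary and op.has_exclusive_collection_access:
        return False  # exemption: exclusive collection access on a primary
    if op.ignores_prepare_conflicts:
        return False  # exemption: the operation never waits on a prepare
    if not op.prepared_txns_may_exist:
        return False  # exemption: no prepared transactions to conflict with
    return True


risky = OpState(holds_timestamp=True, is_primary=True,
                has_exclusive_collection_access=False,
                ignores_prepare_conflicts=False,
                prepared_txns_may_exist=True)
flagged = may_stall_on_prepare_conflict(risky)
```

A check of this shape only flags operations that could stall. It says nothing about whether a conflicting prepared transaction actually exists at that moment.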