[SERVER-57506] Eagerly fail when a timestamped transaction enters a prepareConflictRetry loop
Created: 07/Jun/21  Updated: 23/May/22  Resolved: 23/May/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement
Priority: Major - P3
Reporter: Daniel Gottlieb (Inactive)
Assignee: Gregory Noma
Resolution: Won't Do
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-57476 Operation may block on prepare confli... Closed
Sprint: Execution Team 2021-06-14, Execution Team 2021-06-28, Execution Team 2021-07-26, Execution Team 2021-08-23, Execution Team 2021-09-06, Execution Team 2021-09-20, Execution Team 2021-10-04, Execution Team 2021-11-01, Execution Team 2021-11-15, Execution Team 2022-02-21, Execution Team 2022-04-04, Execution Team 2022-05-16, Execution Team 2022-05-30
Participants:

 Description   

SERVER-57476 demonstrates a case where:

  • A transaction T1 becomes prepared, preparing some update A
  • Another transaction T2 reserves an oplog slot. This slot has an earlier timestamp than the prepare oplog entry of T1.
  • T1 cannot commit its transaction until its prepare oplog entry is replicated to a majority of nodes. The oplog hole introduced by T2 prevents that entry from majority replicating.
  • T2 attempts to read the document that's currently prepared with A.
    This introduces a stall in the system.
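
The interleaving above can be sketched as a toy model. This is not MongoDB server code: the ToyOplog class, the integer timestamps, and the rule that an entry can replicate only once every earlier slot is filled are all simplifying assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyOplog:
    """Toy oplog: timestamps are ints; a reserved but unwritten slot is a hole."""
    reserved: set = field(default_factory=set)  # slots reserved, entry not yet written
    written: set = field(default_factory=set)   # slots whose entry has been written

    def reserve(self, ts: int) -> None:
        self.reserved.add(ts)

    def write(self, ts: int) -> None:
        self.reserved.discard(ts)
        self.written.add(ts)

    def no_holes_point(self) -> int:
        """Largest written timestamp with no hole before it (0 if there is none)."""
        first_hole = min(self.reserved, default=None)
        visible = [ts for ts in self.written if first_hole is None or ts < first_hole]
        return max(visible, default=0)


def can_majority_replicate(oplog: ToyOplog, ts: int) -> bool:
    # Assumption: an entry can replicate only once every earlier slot is filled.
    return oplog.no_holes_point() >= ts


oplog = ToyOplog()
T1_PREPARE_TS, T2_SLOT_TS = 6, 5   # hypothetical timestamps

oplog.write(T1_PREPARE_TS)   # T1 prepares update A; its prepare entry is at ts 6
oplog.reserve(T2_SLOT_TS)    # T2 reserves an earlier slot, leaving a hole at ts 5

t1_waits_on_t2 = not can_majority_replicate(oplog, T1_PREPARE_TS)
t2_waits_on_t1 = True        # T2 reads the document prepared with A: prepare conflict
stalled = t1_waits_on_t2 and t2_waits_on_t1

# If T2 could fill its slot (it cannot while blocked), T1 would be free to commit.
oplog.write(T2_SLOT_TS)
unblocked = can_majority_replicate(oplog, T1_PREPARE_TS)
```

In this model the stall is a two-party wait cycle: T1 waits for T2's hole to close, and T2 waits for T1 to commit or abort.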

SERVER-57476 plans to address the problem by returning a retryable error to the user when a transaction with a commit timestamp actually hits a prepare conflict. This targets the global problem. That fix will have no effect if a system doesn't have the requisite interleaving illustrated by T1 and T2.

This ticket is to craft a set of criteria local to a single operation to know when it may lead to the described stall. This is important because it's difficult for our system to thoroughly generate all combinations of operations that can bring out this interleaving.

However, there are challenges. It's not sufficient to simply invariant that anything entering a prepareConflictRetry loop must not also be holding any resources (i.e., have a commit/durable timestamp):

  • Entering a prepareConflictRetry loop is safe on primaries when the operation has exclusive access to a collection.
  • The transaction may be ignoring prepare conflicts.
  • The system may be in a state (e.g., startup or rollback) where prepared transactions do not currently exist.
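
As a rough illustration of how the exceptions above would narrow a naive check, here is a minimal sketch of a per-operation predicate. The OpState fields and the function name are hypothetical and do not correspond to identifiers in the server. The sketch only encodes the three exceptions listed and should not be read as a complete or correct criterion.

```python
from dataclasses import dataclass


@dataclass
class OpState:
    """Hypothetical snapshot of one operation as it enters the retry loop."""
    has_commit_or_durable_timestamp: bool
    is_primary: bool
    has_exclusive_collection_access: bool
    ignores_prepare_conflicts: bool
    prepared_transactions_may_exist: bool


def may_stall_on_prepare_conflict(op: OpState) -> bool:
    """Return True if entering the retry loop looks unsafe under the listed rules."""
    if not op.has_commit_or_durable_timestamp:
        return False  # the operation holds no timestamp, so it creates no oplog hole
    if op.ignores_prepare_conflicts:
        return False  # the operation never waits on a prepared transaction
    if not op.prepared_transactions_may_exist:
        return False  # e.g. startup or rollback
    if op.is_primary and op.has_exclusive_collection_access:
        return False  # listed as safe on primaries
    return True


# Example: a timestamped operation on a primary without exclusive access.
risky = may_stall_on_prepare_conflict(OpState(
    has_commit_or_durable_timestamp=True,
    is_primary=True,
    has_exclusive_collection_access=False,
    ignores_prepare_conflicts=False,
    prepared_transactions_may_exist=True,
))
```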


 Comments   
Comment by Gregory Noma [ 23/May/22 ]

I'm going to close out this ticket. The idea would be to add some sort of debug assertion which could alert us to additional interleavings of operations which may manifest as something similar to SERVER-57476. However, it is unfortunately quite difficult to craft such an assertion which would still be useful without overreaching and triggering on cases which are known to (theoretically) be safe. Additionally, even if this assertion did catch a new case, it is not clear what we would do with this information aside from poking a hole in the assertion and allowing the fix in SERVER-57476 to do its job.

Generated at Thu Feb 08 05:42:01 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.