Investigate prioritizing replication status check over versioning protocol

    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Catalog and Routing

      While investigating ExceededTimeLimit retriability (SERVER-117235), we found that the behavior introduced by SERVER-84623 is a minor regression for multi-document transactions.

      Currently, when a transaction waits for a refresh inside a critical section and the wait times out, the ExceededTimeLimit error propagates to the driver, labeled as TransientTransactionError. Before SERVER-84623, the same timeout surfaced as a StaleConfig exception, which the MongoS strategy layer retried transparently without involving the driver.
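
To make the difference between the two outcomes concrete, here is a minimal toy model of the retry behavior described above. It is not MongoDB code: `CommandError`, `router_run`, and `make_shard` are hypothetical names invented for this sketch, and the retry limit is an arbitrary assumption. It only illustrates that a StaleConfig error can be absorbed by a router-side retry loop, while any other error (here ExceededTimeLimit with the TransientTransactionError label) reaches the caller.

```python
class CommandError(Exception):
    """Toy stand-in for an error response from a shard (hypothetical type)."""

    def __init__(self, code_name, labels=()):
        super().__init__(code_name)
        self.code_name = code_name
        self.labels = tuple(labels)


def router_run(shard_call, max_internal_retries=3):
    """Toy router loop: retry StaleConfig internally, surface anything else.

    Returns (result, number_of_attempts). The retry limit is an assumption
    made for this sketch, not a value taken from the server.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return shard_call(), attempts
        except CommandError as err:
            if err.code_name == "StaleConfig" and attempts <= max_internal_retries:
                continue  # retried by the router; the driver never sees it
            raise  # e.g. ExceededTimeLimit propagates to the driver


def make_shard(first_error):
    """Build a fake shard call that fails once with first_error, then succeeds."""
    state = {"calls": 0}

    def call():
        state["calls"] += 1
        if state["calls"] == 1:
            raise first_error
        return "ok"

    return call
```

In this model, `router_run(make_shard(CommandError("StaleConfig")))` succeeds on the second attempt without the caller noticing, whereas the same call with `CommandError("ExceededTimeLimit", ["TransientTransactionError"])` raises on the first attempt, leaving the retry to the driver.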

      Investigation Proposal

      We would like to investigate whether the original "shutdown masking" problem (SERVER-84623) can be solved by changing the order of operations in the command execution path, rather than by overriding errors.

      We propose investigating the following:

      • Can we check the replication status (verifying that the node is not shutting down) before we run the versioning protocol checks?
      • If we can detect that a node is shutting down before entering the versioning logic, we should no longer need to override refresh errors to catch other exceptions.
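
The proposed ordering can be sketched as the toy model below. Everything in it is an assumption made for illustration: `ToyNode`, `check_replication_status`, `check_shard_version`, and `run_in_transaction` are hypothetical names, the use of InterruptedAtShutdown as the shutdown error is a choice made for the sketch, and the premise that a shutting-down node would otherwise fail the refresh wait with ExceededTimeLimit is a simplification. Whether the real command path can be reordered this way is exactly what this ticket asks to investigate.

```python
class CommandError(Exception):
    """Toy stand-in for a server error (hypothetical type)."""

    def __init__(self, code_name):
        super().__init__(code_name)
        self.code_name = code_name


class ToyNode:
    """Hypothetical node with just enough state to show the ordering."""

    def __init__(self, shutting_down=False, refresh_times_out=False):
        self.shutting_down = shutting_down
        self.refresh_times_out = refresh_times_out

    def check_replication_status(self):
        # Step 1 (proposed): reject early if the node is shutting down.
        if self.shutting_down:
            raise CommandError("InterruptedAtShutdown")

    def check_shard_version(self):
        # Step 2: stands in for waiting on a refresh in a critical section.
        # Assumption: the wait also fails on a node that is shutting down.
        if self.shutting_down or self.refresh_times_out:
            raise CommandError("ExceededTimeLimit")


def run_in_transaction(node):
    """Proposed order: replication status first, then versioning."""
    node.check_replication_status()
    try:
        node.check_shard_version()
    except CommandError as err:
        if err.code_name == "ExceededTimeLimit":
            # Shutdown was already ruled out above, so converting the
            # timeout to StaleConfig no longer hides a shutdown error.
            raise CommandError("StaleConfig") from err
        raise
    return "ok"


def outcome(node):
    """Return "ok" or the code name of the error the caller would see."""
    try:
        return run_in_transaction(node)
    except CommandError as err:
        return err.code_name
```

In this sketch a healthy node whose refresh times out yields StaleConfig (retryable by the router), while a node that is shutting down yields the shutdown error directly because the versioning step is never reached.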

      If this hypothesis holds, we should implement this reordering and revert SERVER-84623. This would restore the behavior where ExceededTimeLimit in transactions is converted to StaleConfig, allowing MongoS to handle the retry internally.

            Assignee:
            Unassigned
            Reporter:
            Pol Pinol
